Wednesday, March 05, 2008

Optimizing JPEGs with jpegoptim

Over the weekend I deployed a new plugin I wrote for Perlbal that tracks bandwidth usage of certain features on Amie Street. The plugin is only running on 25% of our traffic for now, but after a couple of days it had collected plenty of data, so I started analyzing it this morning.

The first thing I noticed was that a single one of our images was accounting for 10x as much bandwidth as the next most transferred image on the site. I loaded it up to find that it was only a 128x128 album thumbnail, yet it was a 560KB file. I downloaded it and tried to figure out why it was so big, without much luck. Even using convert -scale 1x1 to shrink it down to a single pixel, the result still took over 500KB.
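
For what it's worth, the check was nothing fancier than something like this (the file names here are made up):

# Scale the suspect thumbnail down to a single pixel and compare file sizes.
# If a 1x1 JPEG is still hundreds of KB, the bloat isn't in the pixel data.
convert suspect-thumb.jpg -scale 1x1 one-pixel.jpg
ls -lh suspect-thumb.jpg one-pixel.jpg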

After a bit of googling I came upon jpegoptim, a utility that performs lossless compression on JPEGs by rebuilding their Huffman tables without changing the image contents. Simply by running jpegoptim image.jpg, I cut the image down to 28KB and re-uploaded it to production.
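
If you haven't used it before, the invocation is about as simple as it gets (the file name is just an example); it rewrites the file in place and reports the size change:

# Losslessly re-encode the JPEG in place.
jpegoptim album-thumb.jpg

# Sanity-check the new file size.
ls -lh album-thumb.jpg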

After scanning the top 1000 largest thumbnail images on the site, I found that the majority of them could be shrunk by a good 50% or more simply by running this tool on them. So, if you've got a site with a lot of JPEGs (particularly user-uploaded content that's automatically scaled with ImageMagick), give jpegoptim a try. The worst that happens is that it reports the file is already optimal; the best case is a 95% reduction in load time for users on slow connections.
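
If you want to try the same thing across a whole directory of thumbnails, a one-liner along these lines should do it (the path and size threshold here are made up for illustration):

# Find JPEGs over 100KB and run jpegoptim over each of them.
# Files that are already optimal are left untouched, so this is safe to re-run.
find /var/www/thumbnails -name '*.jpg' -size +100k -print0 | xargs -0 jpegoptim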

Sunday, March 02, 2008

Always EXPLAIN your queries

One of the first tips you'll hear when working on improving your database's performance is to always run EXPLAIN on your queries. It shows you the execution plan MySQL will use, so you can evaluate whether and how your indexes will be used.

Usually I only bother to take my own advice for complicated queries involving GROUP BY, subselects, etc. Tonight, though, I was working on optimizing some of our pages by moving data into summary tables, and a query took a lot longer than I expected:


mysql> select SQL_NO_CACHE s.title, ar.name from SongBuzz sb JOIN Song s ON sb.song_id = s.id JOIN Album al ON s.album_id = al.id JOIN Artist ar ON al.artist_id = ar.id ORDER BY sb.recs_1w desc limit 3;
+--------------------+------------------+
| title              | name             |
+--------------------+------------------+
| Deep Sea Green     | Mandyleigh Storm |
| Suffer For Fashion | of Montreal      |
| Just Enough        | Dan Tharp        |
+--------------------+------------------+
3 rows in set (1.17 sec)


All of the foreign key fields in the JOIN were indexed, as was sb.recs_1w. There's no reason this query should take 1.2 seconds when the entire dataset fits easily in RAM. So I ran EXPLAIN:


mysql> EXPLAIN select SQL_NO_CACHE s.title, ar.name from SongBuzz sb JOIN Song s ON sb.song_id = s.id JOIN Album al ON s.album_id = al.id JOIN Artist ar ON al.artist_id = ar.id ORDER BY sb.recs_1w desc limit 3;
+----+-------------+-------+--------+-------------------+-----------+---------+-------------------------+-----------------------+---------------------------------+
| id | select_type | table | type   | possible_keys     | key       | key_len | ref                     | rows                  | Extra                           |
+----+-------------+-------+--------+-------------------+-----------+---------+-------------------------+-----------------------+---------------------------------+
|  1 | SIMPLE      | ar    | ALL    | PRIMARY           | NULL      | NULL    | NULL                    | [redacted big number] | Using temporary; Using filesort |
|  1 | SIMPLE      | al    | ref    | PRIMARY,artist_id | artist_id | 4       | amiest_production.ar.id |                     1 | Using index                     |
|  1 | SIMPLE      | s     | ref    | PRIMARY,album_id  | album_id  | 4       | amiest_production.al.id |                     4 |                                 |
|  1 | SIMPLE      | sb    | eq_ref | PRIMARY           | PRIMARY   | 4       | amiest_production.s.id  |                     1 |                                 |
+----+-------------+-------+--------+-------------------+-----------+---------+-------------------------+-----------------------+---------------------------------+
4 rows in set (0.02 sec)


From this query plan, it was clear that MySQL was first scanning the entire Artist table, then joining to find all the albums, then joining to find all the songs, and finally joining against the SongBuzz summary table. After collecting all of those rows, it was sorting the whole thing in a temporary table. The more logical plan is to walk the SongBuzz table in recs_1w order (which requires no sorting at all, since that column is already indexed), grab the top 3 rows, and then join through to find the song and artist names. Using the STRAIGHT_JOIN modifier on the SELECT statement, we can force MySQL to join the tables in the order they're listed in the query:


mysql> explain select SQL_NO_CACHE STRAIGHT_JOIN s.title, ar.name from SongBuzz sb JOIN Song s ON sb.song_id = s.id JOIN Album al ON s.album_id = al.id JOIN Artist ar ON al.artist_id = ar.id ORDER BY sb.recs_1w desc limit 10;
+----+-------------+-------+--------+-------------------+---------+---------+--------------------------------+-----------------------+-------------+
| id | select_type | table | type   | possible_keys     | key     | key_len | ref                            | rows                  | Extra       |
+----+-------------+-------+--------+-------------------+---------+---------+--------------------------------+-----------------------+-------------+
|  1 | SIMPLE      | sb    | index  | PRIMARY           | recs_1w | 2       | NULL                           | [redacted big number] | Using index |
|  1 | SIMPLE      | s     | eq_ref | PRIMARY,album_id  | PRIMARY | 4       | amiest_production.sb.song_id   |                     1 |             |
|  1 | SIMPLE      | al    | eq_ref | PRIMARY,artist_id | PRIMARY | 4       | amiest_production.s.album_id   |                     1 |             |
|  1 | SIMPLE      | ar    | eq_ref | PRIMARY           | PRIMARY | 4       | amiest_production.al.artist_id |                     1 |             |
+----+-------------+-------+--------+-------------------+---------+---------+--------------------------------+-----------------------+-------------+


and the resulting query executes somewhere around 60x faster:


mysql> select SQL_NO_CACHE STRAIGHT_JOIN s.title, ar.name from SongBuzz sb JOIN Song s ON sb.song_id = s.id JOIN Album al ON s.album_id = al.id JOIN Artist ar ON al.artist_id = ar.id ORDER BY sb.recs_1w desc limit 3;
+--------------------+------------------+
| title              | name             |
+--------------------+------------------+
| Deep Sea Green     | Mandyleigh Storm |
| Suffer For Fashion | of Montreal      |
| Just Enough        | Dan Tharp        |
+--------------------+------------------+
3 rows in set (0.02 sec)