Wednesday Jun 09, 2010

Query Planners and Baseball

I'm not a big sports aficionado. However, I'm compelled by virtue of living in Boston to care about baseball, and I just finished reading Moneyball, which is all about baseball statistics. One thing about baseball I definitely appreciate is that there are lots of published statistics, because lots of statistics means nice big public data sets available for my relational database querying pleasure.

Let's use baseball statistics as an excuse to learn more about query planners (or use query planners as an excuse to learn more about baseball, if that's how you roll):

I tend to use Postgres in my day-to-day data processing, so that's what I'm using here. My data set is from baseball1.com, Sean Lahman's insanely comprehensive baseball statistics archive, specifically the batting and salaries tables from Version 5.7 (1871-2009).

Batting

For starters, how many people have hit a baseball, professionally? How many people have hit home runs? How many people have hit a lot of home runs (what does 'a lot' mean in this context)? I have no idea...

We can get this information from the batting table. This is also an opportunity to look at what the Postgres query planner thinks about how to conduct read queries on a single table. To give the query planner some choices, we'll create an index on the number of home runs.

baseball=# CREATE INDEX bhomers ON batting USING btree(homeruns);

Postgres is nice enough to give us a number of indexing options. In this case we're using a B-tree, which is like a binary tree's shorter, fatter cousin: each node can have many children, each covering a range of keys, and the tree is self-balancing, so it stays shallow and minimizes the number of expensive page lookups needed to get to a value.
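
Just as a sketch of those other options (the bhomers_hash name is made up for illustration and isn't used again; also, hash indexes of this era weren't WAL-logged, so the docs discouraged them), the same statement with a different access method looks like:

baseball=# CREATE INDEX bhomers_hash ON batting USING hash(homeruns);

A hash index only helps equality lookups, though; the B-tree keeps keys in sorted order, which is exactly what makes the range conditions below (homeruns > some value) cheap to answer.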

Let's get a count on everyone who's been at bat since 1871:

baseball=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT(playerid)) FROM batting;

                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2609.83..2609.84 rows=1 width=9) (actual time=852.611..852.612 rows=1 loops=1)
   ->  Seq Scan on batting  (cost=0.00..2378.06 rows=92706 width=9) (actual time=0.020..27.904 rows=92706 loops=1)

 Total runtime: 852.718 ms

(3 rows)

baseball=# SELECT COUNT(DISTINCT(playerid)) FROM batting;

 count 
-------
 17256

(1 row)

Cost is measured in units whose baseline is the cost of a sequential disk page fetch. The first cost number is the estimated start-up cost: work like pre-sorting that has to happen before any rows can be returned. For this simple query, the sequential scan has a start-up cost of 0.00. The second cost number is the estimated total cost, here 2378.06 for the sequential scan. That total is mostly disk page fetches plus a small CPU charge per tuple (seq_page_cost = 1.0 and cpu_tuple_cost = 0.01 by default): 2378.06 ≈ 1451 pages + 0.01 x 92706 tuples, which works out to about 64 tuples per 8 KB disk page on this machine. The "actual time" numbers are in milliseconds. For bonus fun, frob some of the planner cost constants and see how that affects queries.
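
As a sketch of that last suggestion (the value here is arbitrary, and the setting only lasts for the current session), you can tell the planner that random page fetches are no more expensive than sequential ones and see whether any of the plans later in this post come out differently:

baseball=# SET random_page_cost = 1.0;   -- default is 4.0
baseball=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 1;
baseball=# RESET random_page_cost;

Lowering random_page_cost makes index scans look relatively cheaper, so borderline queries can flip from sequential or bitmap heap scans to plain index scans.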

We can get the size of the table by itself and the size of the table along with any auxiliary data structures using something like:

baseball=# SELECT pg_size_pretty(pg_relation_size('batting')) AS table_size, pg_size_pretty(pg_total_relation_size('batting')) AS table_plus_aux_size;

 table_size | table_plus_aux_size 
------------+---------------------
 11 MB      | 15 MB

(1 row)

So on this sequential scan we're working through 11 MB of data in 27.904 milliseconds. That looks like a sequential read rate of (11 MB / 27.904 ms) x (1000 ms / 1 sec) ≈ 400 MB / sec, which would be amazing on this low-end machine, but all queries in this post were run several times, and the results reflect the benefit of a warm cache (and that's a story for another post!).
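
If you want to see the planner's own bookkeeping behind the page arithmetic above, it lives in pg_class (relpages and reltuples are the planner's estimates, refreshed by ANALYZE):

baseball=# SELECT relpages, reltuples FROM pg_class WHERE relname = 'batting';

relpages should come out right around the 1451 pages implied by the cost above (1451 x 8 KB ≈ 11 MB), and reltuples should be close to the 92706 row count.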

We have to go through every tuple in the batting table to get the count, so it makes sense that we're doing a full sequential scan: one disk seek plus one big sequential read is cheaper than all the seeks we'd rack up doing random-access reads through any of this table's auxiliary data structures.

Okay, cool, over 17,000 people have been at bat since 1871. But how many have hit homers -- more than one, so it's not a fluke? Let's increase the selectivity by adding a condition (the WHERE clause) on homeruns, the column on which we've created an index:

baseball=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 1;

                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2379.50..2379.51 rows=1 width=9) (actual time=218.128..218.128 rows=1 loops=1)
   ->  Bitmap Heap Scan on batting  (cost=516.91..2310.90 rows=27439 width=9) (actual time=8.241..19.220 rows=26953 loops=1)
         Recheck Cond: (homeruns > 1)
         ->  Bitmap Index Scan on bhomers  (cost=0.00..510.05 rows=27439 width=0) (actual time=7.735..7.735 rows=26953 loops=1)
               Index Cond: (homeruns > 1)

 Total runtime: 218.248 ms

(6 rows)

baseball=# SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 1;

 count 
-------
  5231

(1 row)

5231/17256 ≈ 30% of batters have hit more than one home run in a season. Nice!

The "actual time" rows value is 26953, so 26953 tuples, or 26953/92706 = 29% of the rows in the table, matched the homeruns condition. The query planner has been maintaining statistics on the data distributions in this table and guessed that 27439 would match the condition, which is pretty close. This is apparently selective enough to switch us over to using bitmaps. In the bitmap index scan, we traverse the bhomer index B-tree to create an in-memory mapping of homeruns to rowids that match the condition (homeruns > 1). We then create a bitmap for the rowids, setting a bit if that row appeared in the mapping. This lets us scan for the matching tuples in the table (aka the "heap", hence "bitmap heap scan") in on-disk order, minimizing disk seeks and the number of pages that have to be fetched into memory.

Let's bump up the selectivity even more and see what it takes to be a real home run juggernaut:

baseball=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 52;

                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=376.42..376.43 rows=1 width=9) (actual time=0.550..0.550 rows=1 loops=1)
   ->  Index Scan using bhomers on batting  (cost=0.00..376.11 rows=124 width=9) (actual time=0.055..0.431 rows=23 loops=1)
         Index Cond: (homeruns > 52)

 Total runtime: 0.640 ms

(4 rows)

baseball=# SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 52;

 count 
-------
    15

(1 row)

This query returned 15 players, meaning more than 52 home runs in one season puts you in roughly the top 0.1% of everyone who has ever been at bat. This high degree of selectivity has also switched us over to a plain index scan. This time we're fetching tuples in the order of the bhomers index, which can mean more disk seeks and page fetches to get the matching tuples, but the query is so selective that the extra jumping around costs less than the overhead of building a bitmap first, which is what a bitmap heap scan would require.
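
To check the planner's judgment here, you can penalize plain index scans for your session and compare the resulting bitmap plan's actual time (just a sketch; like the enable_* settings used later in this post, this doesn't forbid the plan, it just makes it look prohibitively expensive):

baseball=# SET enable_indexscan TO off;
baseball=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 52;
baseball=# SET enable_indexscan TO on;

You should get a bitmap index scan plus bitmap heap scan again, and can compare its timing against the 0.640 ms above.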

(For the baseball cave-dwellers like myself, Barry Bonds still has the most home runs in a single season, at 73)

Salaries

For reads on a single table, the query planner had to decide between sequential, bitmap heap, and index scans. What happens if we throw a second table into the mix with a join?

First, for fun, a couple more indices:

baseball=# CREATE INDEX brbis ON batting USING btree(rbis);
baseball=# CREATE INDEX ssal ON salaries USING btree(salary);

And here's a join across batting and salaries that will get us information on players making lots of money who haven't completely failed to generate some RBIs in a given season:

baseball=# explain analyze select b.playerid from batting b, salaries s where b.playerid = s.playerid and b.yearid = s.yearid and s.salary > 30000000 and b.rbis > 10;

                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=11.71..2984.29 rows=1 width=9) (actual time=51.087..51.129 rows=1 loops=1)
   Hash Cond: (((b.playerid)::text = (s.playerid)::text) AND (b.yearid = s.yearid))
   ->  Seq Scan on batting b  (cost=0.00..2609.82 rows=36275 width=13) (actual time=0.025..42.779 rows=33921 loops=1)
         Filter: (rbis > 10)
   ->  Hash  (cost=11.68..11.68 rows=2 width=13) (actual time=0.038..0.038 rows=1 loops=1)
         ->  Index Scan using ssal on salaries s  (cost=0.00..11.68 rows=2 width=13) (actual time=0.031..0.033 rows=1 loops=1)
               Index Cond: (salary > 30000000::double precision)

 Total runtime: 51.231 ms

(8 rows)

Postgres thinks the smartest thing to do is an index scan on ssal, because the salary condition is highly selective, and a sequential scan on batting, because the RBI condition is not. It then hashes the handful of matching salary rows, probes that hash with the filtered batting rows, and spits out the rows that satisfy the join condition, for a snappy runtime of 51 milliseconds.
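
You can sanity-check the selectivities the planner is banking on with a couple of straight counts (these should match the actual rows figures in the plan above, 1 and 33921):

baseball=# SELECT COUNT(*) FROM salaries WHERE salary > 30000000;
baseball=# SELECT COUNT(*) FROM batting WHERE rbis > 10;

Only one salary row survives its filter, while tens of thousands of batting rows survive theirs, which is why hashing the tiny salary side and scanning batting sequentially is such a good deal.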

What happens if we take away the query planner's ability to use a hash join?

baseball=# set enable_hashjoin to false;
SET
baseball=# explain analyze select b.playerid from batting b, salaries s where b.playerid = s.playerid and b.yearid = s.yearid and s.salary > 30000000 and b.rbis > 10;

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=2824.10..5005.53 rows=1 width=9) (actual time=71.004..71.075 rows=1 loops=1)
   Join Filter: (((b.playerid)::text = (s.playerid)::text) AND (b.yearid = s.yearid))
   ->  Index Scan using ssal on salaries s  (cost=0.00..11.68 rows=2 width=13) (actual time=0.039..0.048 rows=1 loops=1)
         Index Cond: (salary > 30000000::double precision)
   ->  Materialize  (cost=2824.10..3364.85 rows=36275 width=13) (actual time=0.026..63.490 rows=33921 loops=1)
         ->  Seq Scan on batting b  (cost=0.00..2609.82 rows=36275 width=13) (actual time=0.018..44.113 rows=33921 loops=1)
               Filter: (rbis > 10)

 Total runtime: 71.518 ms

(8 rows)

Without a hash join at its disposal, the query planner opts for a nested loops join. In a nested loops join, we're basically doing:

for every row in salaries that matched the salary filter:
    for every row in batting that matched the RBI filter:
        keep the pair if it satisfies the join filter

The Materialize node acts as a cache of the filtered batting rows from the inner loop, so the sequential scan doesn't have to be redone for every iteration of the outer loop. Even though it looks inefficient, this method turns out not to be much slower than the hash join.

Let's hamstring it further by disabling nested loops joins too:

baseball=# set enable_nestloop to false;
SET
baseball=# explain analyze select b.playerid from batting b, salaries s where b.playerid = s.playerid and b.yearid = s.yearid and s.salary > 30000000 and b.rbis > 10;

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=5991.77..6263.62 rows=1 width=9) (actual time=374.931..374.935 rows=1 loops=1)
   Merge Cond: (((b.playerid)::text = (s.playerid)::text) AND (b.yearid = s.yearid))
   ->  Sort  (cost=5980.06..6070.74 rows=36275 width=13) (actual time=343.304..365.986 rows=26055 loops=1)
         Sort Key: b.playerid, b.yearid
         Sort Method:  external merge  Disk: 1056kB
         ->  Seq Scan on batting b  (cost=0.00..2609.82 rows=36275 width=13) (actual time=0.025..76.286 rows=33921 loops=1)
               Filter: (rbis > 10)
   ->  Sort  (cost=11.69..11.69 rows=2 width=13) (actual time=0.083..0.084 rows=1 loops=1)
         Sort Key: s.playerid, s.yearid
         Sort Method:  quicksort  Memory: 25kB
         ->  Index Scan using ssal on salaries s  (cost=0.00..11.68 rows=2 width=13) (actual time=0.045..0.046 rows=1 loops=1)
               Index Cond: (salary > 30000000::double precision)

 Total runtime: 375.410 ms

(13 rows)

Ouch. A merge join, our worst option here, not only looks grossly over-complicated but is about 7 times slower than the original hash join. The tuples returned by the filter on salaries are few enough to fit in memory, so that side gets a wicked fast in-memory quicksort, but there are so many matching RBI tuples that we're forced into an external merge sort on disk, spending 343.3 milliseconds, over 90% of the query time, on disk seeks, page fetches, and writes. And there are even more ways to influence the query planner.
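
For instance, the on-disk sort above is a symptom of work_mem (1 MB by default on this vintage of Postgres) being too small to hold the filtered batting rows; the sort only needed about a megabyte of disk. Bumping it for the session (the 16MB value is arbitrary) and re-running the query should keep the plan but pull the sort into memory:

baseball=# SET work_mem = '16MB';
baseball=# explain analyze select b.playerid from batting b, salaries s where b.playerid = s.playerid and b.yearid = s.yearid and s.salary > 30000000 and b.rbis > 10;
baseball=# RESET work_mem;

With hash and nested loop joins still disabled, the plan should stay a merge join, but the "external merge Disk" sort method should turn into an in-memory quicksort.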

Bottom of the 9th

Well, I don't know about you, but I'm glad there's a query planner making all these choices for me. I'm ready to head over to Fenway, dump beer on Yankees fans, and pretend to be a sabermetrician. Anyone have a MySQL or Oracle instance handy? How do their query plans compare for these queries?

~jesstess

Wednesday May 12, 2010

Constant factors: constantly foiling you

There are many different kinds of bugs with many different causes. Sometimes, it's just some silly typo or type mismatch; other times, it's an algorithmic issue or a correctness error; and occasionally, it appears to be something much stranger. This is a story about that last kind of bug, where finding the location of the problem is relatively straightforward, but figuring out how to fix it, or even why it's happening, isn't so easy.

We first meet our hero, yours truly, as he debugs an efficiency problem in a particular program for what is now the third day in a row. This program is actually just a small part in an otherwise large and complex system that Ksplice uses for its internal operations, at the heart of which lies a large collection of data, stored in a carefully constructed and extensively optimized MySQL database.

But this had not always been the case. In the past, much of the data was stored in a large collection of JSON-encoded archives. Seamlessly moving all that data into MySQL was a project I had been involved in, and this was, in a sense, the final stage of the transition: we still had the JSON files, but having moved to a much shinier and spiffier database, we wanted to port all the programs that used the old data over to it as well.

Now, this particular program was designed to make five queries to the MySQL database, all identical except for the range they placed on a time column that all the rows contained. The queries each covered a different 48-hour period, distributed at various points throughout the last two months (relative to when the program gets run).

These queries seemed simple enough, but mysteriously, the queries for older time periods, which returned fewer rows, also took considerably longer to run!

Eventually, I was able to reduce this to a consistently reproducible example:

mysql> SELECT COUNT(*) FROM events
        WHERE time > '2010-04-25' AND time < '2010-04-27';

+----------+
| COUNT(*) |
+----------+
|  1196740 |
+----------+

1 row in set (12.13 sec)

mysql> SELECT COUNT(*) FROM events
        WHERE time > '2010-02-24' AND time < '2010-02-26';

+----------+
| COUNT(*) |
+----------+
|   740034 |
+----------+

1 row in set (61.92 sec)

As you can see, the more recent range has about 1.2 million rows and took just over 12 seconds to count. The older range has only about 60% as many rows but takes more than five times as long. This really doesn't make any sense: the queries are totally identical except for the ranges we are looking across, and those ranges are the same size. There really shouldn't be such a huge disparity in the running times.

Let's look at how MySQL is performing these queries:

mysql> EXPLAIN EXTENDED SELECT COUNT(*) FROM events
                                 WHERE time > '2010-04-25' AND time < '2010-04-27';

+----+-------------+--------+-------+---------------+-------------+---------+------+---------+----------+--------------------------+
| id | select_type | table  | type  | possible_keys | key         | key_len | ref  | rows    | filtered | Extra                    |
+----+-------------+--------+-------+---------------+-------------+---------+------+---------+----------+--------------------------+
|  1 | SIMPLE      | events | range | events_time   | events_time | 8       | NULL | 2234568 |   100.00 | Using where; Using index |
+----+-------------+--------+-------+---------------+-------------+---------+------+---------+----------+--------------------------+

1 row in set, 1 warning (0.04 sec)

mysql> EXPLAIN EXTENDED SELECT COUNT(*) FROM events
                                 WHERE time > '2010-02-24' AND time < '2010-02-26';

+----+-------------+--------+-------+---------------+-------------+---------+------+---------+----------+--------------------------+
| id | select_type | table  | type  | possible_keys | key         | key_len | ref  | rows    | filtered | Extra                    |
+----+-------------+--------+-------+---------------+-------------+---------+------+---------+----------+--------------------------+
|  1 | SIMPLE      | events | range | events_time   | events_time | 8       | NULL | 1202716 |   100.00 | Using where; Using index |
+----+-------------+--------+-------+---------------+-------------+---------+------+---------+----------+--------------------------+

1 row in set, 1 warning (0.04 sec)

Besides confirming that we are using an index on the time column, there isn't much here. I tried MySQL's built-in profiler by running SET PROFILING = 1 and then, after rerunning the second query, running

mysql> SHOW PROFILE FOR QUERY 1;

+--------------------+-----------+
| Status             | Duration  |
+--------------------+-----------+
| starting           |  0.000116 |
| Opening tables     |  0.000018 |
| System lock        |  0.000006 |
| Table lock         |  0.000013 |
| init               |  0.000032 |
| optimizing         |  0.000016 |
| statistics         |  0.047627 |
| preparing          |  0.000038 |
| executing          |  0.000007 |
| Sending data       | 62.852915 |
| end                |  0.000035 |
| query end          |  0.000007 |
| freeing items      |  0.000047 |
| logging slow query |  0.000005 |
| cleaning up        |  0.000005 |
+--------------------+-----------+

15 rows in set (0.03 sec)

Huh. Pretty much all of the time is spent "sending data". What the heck is going on?

[Diagram: a fragmented database. For the newer events, we can just get them all in one go, since they're already sorted by time.]

The answer turns out to be fragmentation. Remember when I said that we had migrated to this database from an older flat-file setup? Well, those files weren't read and inserted into the database in the chronological order of their contents, but rather in an order determined by the contents themselves. So the old data we inserted ended up fragmented across the database relative to the time column. Newer data, on the other hand, is inserted as it comes in, so it is already pretty much ordered by time.

MySQL stores its indices in B-trees, so we can locate any given row (or range of rows) in logarithmic time. Unfortunately, we can't just pluck an arbitrary row off the disk: MySQL reads data at the granularity of a page, so for a row living in a page that hasn't yet been read into memory, we first have to do a disk seek to load the right page.

Since the old rows are fragmented across the disk, we end up doing many more of those expensive disk seeks. So even though the older query returns fewer rows and the index locates them in logarithmic time, the constant factor ends up being huge because of all the seeking.
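
One way to mitigate this (a sketch, not necessarily what we did, and how well it works depends on the storage engine and the table's keys) is to rebuild the table so the rows land on disk in time order, or at least to defragment it:

mysql> ALTER TABLE events ORDER BY time;
mysql> OPTIMIZE TABLE events;

ALTER TABLE ... ORDER BY rewrites the rows in the requested order (engines with a clustered primary key, like InnoDB, ignore it and keep primary-key order), while OPTIMIZE TABLE rebuilds the table and its indices to reclaim space and reduce fragmentation.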

What are your favorite bitten-by-a-constant-factor stories?

~leonidg

About

Tired of rebooting to update systems? So are we -- which is why we invented Ksplice, technology that lets you update the Linux kernel without rebooting. It's currently available as part of Oracle Linux Premier Support, Fedora, and Ubuntu desktop. This blog is our place to ramble about technical topics that we (and hopefully you) think are interesting.
