Wednesday Jun 09, 2010

Query Planners and Baseball

I'm not a big sports aficionado. However, I'm compelled by virtue of living in Boston to care about baseball, and I just finished reading Moneyball, which is all about baseball statistics. One thing about baseball I definitely appreciate is that there are lots of published statistics, because lots of statistics means nice big public data sets available for my relational database querying pleasure.

Let's use baseball statistics as an excuse to learn more about query planners (or use query planners as an excuse to learn more about baseball, if that's how you roll):

I tend to use Postgres in my day-to-day data processing, so that's what I'm using here. My data set is from baseball1.com, Sean Lahman's insanely comprehensive baseball statistics archive, specifically the batting and salaries tables from Version 5.7 (1871-2009).

Batting

For starters, how many people have hit a baseball, professionally? How many people have hit home runs? How many people have hit a lot of home runs (what does 'a lot' mean in this context)? I have no idea...

We can get this information from the batting table. This is also an opportunity to look at what the Postgres query planner thinks about how to conduct read queries on a single table. To give the query planner some choices, we'll create an index on the number of home runs.

baseball=# CREATE INDEX bhomers ON batting USING btree(homeruns);

Postgres is nice enough to give us a number of indexing options. In this case we're using a B-tree, which is like a binary tree's shorter, fatter cousin: each node holds many keys and can have many children, and the tree keeps itself balanced, so it stays shallow and minimizes the number of expensive page fetches needed to get to a value.
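If you're curious, a couple of the alternatives look like this (a quick sketch; the index name below is made up for illustration and isn't used anywhere else in this post):

-- Hash indexes only speed up equality lookups (homeruns = 40), not ranges:
CREATE INDEX bhomers_hash ON batting USING hash (homeruns);
-- GiST and GIN indexes are aimed at richer types -- geometric data, arrays,
-- full-text search documents -- rather than a plain integer column.

The range conditions we'll use below are exactly what a B-tree handles well, which is why it's the right choice here.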

Let's get a count on everyone who's been at bat since 1871:

baseball=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT(playerid)) FROM batting;

                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2609.83..2609.84 rows=1 width=9) (actual time=852.611..852.612 rows=1 loops=1)
   ->  Seq Scan on batting  (cost=0.00..2378.06 rows=92706 width=9) (actual time=0.020..27.904 rows=92706 loops=1)

 Total runtime: 852.718 ms

(3 rows)

baseball=# SELECT COUNT(DISTINCT(playerid)) FROM batting;

 count 
-------
 17256

(1 row)

Cost is measured in units pegged to the estimated cost of a sequential disk page fetch, with smaller per-tuple CPU charges folded in on top. The first cost number is the estimated start-up cost. Actions like pre-sorting would be bucketed under start-up costs; for this simple query we have a start-up cost of 0.00 for the sequential scan on the table. The second cost number is the estimated total cost, in this case 2378.06 for the sequential scan: with the default cost constants, that's roughly 1450 sequential page fetches plus a small CPU charge for each of the 92706 tuples in the table, which works out to about 64 tuples per 8 KB disk page on this machine. The "actual time" numbers are in milliseconds. For bonus fun, frob some of the planner cost constants and see how that affects queries.
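For instance, something along these lines nudges the trade-off between disk and CPU costs (a sketch with arbitrary values, just to watch the estimates move):

-- Planner cost constants can be set per session; these values are made up:
SET random_page_cost = 2.0;   -- default 4.0: pretend random I/O is cheaper
SET cpu_tuple_cost = 0.05;    -- default 0.01: pretend per-tuple CPU work costs more
EXPLAIN SELECT COUNT(DISTINCT(playerid)) FROM batting;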

We can get the size of the table by itself and the size of the table along with any auxiliary data structures using something like:

baseball=# SELECT pg_size_pretty(pg_relation_size('batting')) As table_size, pg_size_pretty(pg_total_relation_size('batting')) AS table_plus_aux_size;

 table_size | table_plus_aux_size 
------------+---------------------
 11 MB      | 15 MB

(1 row)

So on this sequential scan we're working through 11 MB of data in 27.904 milliseconds. That looks like a sequential read rate of roughly (11 MB / 27.904 ms) x (1000 ms / 1 sec) ≈ 400 MB/sec, which would be amazing on this low-end machine -- but all queries in this post were run several times, and the results reflect the benefit of a warm cache (that's a story for another post!).

We have to go through every tuple in the batting table to get the count, so it makes sense that we're doing a full sequential scan. One disk seek and a big sequential read are going to be cheaper than the disk seeks incurred for random-access reads if we try to take advantage of any of the auxiliary data structures for this table.
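If you want to see the road not taken, you can discourage sequential scans for the current session (a sketch; EXPLAIN without ANALYZE is enough to see the plan change):

SET enable_seqscan TO false;
EXPLAIN SELECT COUNT(DISTINCT(playerid)) FROM batting;
-- The planner either switches to a pricier index-based plan or, if it has
-- no real alternative, keeps the sequential scan with a huge cost penalty.
SET enable_seqscan TO true;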

Okay, cool, over 17,000 people have been at bat since 1871. But how many have hit homers -- more than one, so it's not a fluke? Let's increase the selectivity by adding a condition (the WHERE clause) on homeruns, the column on which we've created an index:

baseball=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 1;

                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2379.50..2379.51 rows=1 width=9) (actual time=218.128..218.128 rows=1 loops=1)
   ->  Bitmap Heap Scan on batting  (cost=516.91..2310.90 rows=27439 width=9) (actual time=8.241..19.220 rows=26953 loops=1)
         Recheck Cond: (homeruns > 1)
         ->  Bitmap Index Scan on bhomers  (cost=0.00..510.05 rows=27439 width=0) (actual time=7.735..7.735 rows=26953 loops=1)
               Index Cond: (homeruns > 1)

 Total runtime: 218.248 ms

(6 rows)

baseball=# SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 1;

 count 
-------
  5231

(1 row)

5231/17256 ≈ 30% of batters have hit more than one home run in a season. Nice!

The "actual time" rows value is 26953, so 26953 tuples, or 26953/92706 = 29% of the rows in the table, matched the homeruns condition. The query planner has been maintaining statistics on the data distributions in this table and guessed that 27439 would match the condition, which is pretty close. This is apparently selective enough to switch us over to using bitmaps. In the bitmap index scan, we traverse the bhomer index B-tree to create an in-memory mapping of homeruns to rowids that match the condition (homeruns > 1). We then create a bitmap for the rowids, setting a bit if that row appeared in the mapping. This lets us scan for the matching tuples in the table (aka the "heap", hence "bitmap heap scan") in on-disk order, minimizing disk seeks and the number of pages that have to be fetched into memory.

Let's bump up the selectivity even more and see what it takes to be a real home run juggernaut:

baseball=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 52;

                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=376.42..376.43 rows=1 width=9) (actual time=0.550..0.550 rows=1 loops=1)
   ->  Index Scan using bhomers on batting  (cost=0.00..376.11 rows=124 width=9) (actual time=0.055..0.431 rows=23 loops=1)
         Index Cond: (homeruns > 52)

 Total runtime: 0.640 ms

(4 rows)

baseball=# SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 52;

 count 
-------
    15

(1 row)

This query returned 15 players, meaning more than 52 home runs in one season puts you in roughly the top 0.1% of everyone who's ever been at bat. This high degree of selectivity has also switched us over to a plain index scan. This time, we're fetching tuples in the order of the bhomers index. That can mean more disk seeks and page fetches to get the matching tuples, but the query is so selective that the extra jumping around is less costly than building the bitmap that a bitmap heap scan would require.
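If you're curious how the two strategies compare on this query, you can push the planner back toward a bitmap with the enable_* settings (a sketch; the costs you see will differ):

SET enable_indexscan TO false;
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 52;
-- The planner falls back to a bitmap index scan plus bitmap heap scan;
-- compare its estimated and actual costs to the plain index scan above.
SET enable_indexscan TO true;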

(For the baseball cave-dwellers like myself: Barry Bonds still holds the single-season home run record, at 73.)

Salaries

For reads on a single table, the query planner had to decide between sequential, bitmap heap, and index scans. What happens if we throw a second table into the mix with a join?

First, for fun, a couple more indices:

baseball=# CREATE INDEX brbis ON batting USING btree(rbis);
baseball=# CREATE INDEX ssal ON salaries USING btree(salary);

And here's a join across batting and salaries that will get us information on players making lots of money who haven't completely failed to generate some RBIs in a given season:

baseball=# explain analyze select b.playerid from batting b, salaries s where b.playerid = s.playerid and b.yearid = s.yearid and s.salary > 30000000 and b.rbis > 10;

                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=11.71..2984.29 rows=1 width=9) (actual time=51.087..51.129 rows=1 loops=1)
   Hash Cond: (((b.playerid)::text = (s.playerid)::text) AND (b.yearid = s.yearid))
   ->  Seq Scan on batting b  (cost=0.00..2609.82 rows=36275 width=13) (actual time=0.025..42.779 rows=33921 loops=1)
         Filter: (rbis > 10)
   ->  Hash  (cost=11.68..11.68 rows=2 width=13) (actual time=0.038..0.038 rows=1 loops=1)
         ->  Index Scan using ssal on salaries s  (cost=0.00..11.68 rows=2 width=13) (actual time=0.031..0.033 rows=1 loops=1)
               Index Cond: (salary > 30000000::double precision)

 Total runtime: 51.231 ms

(8 rows)

Postgres thinks the smartest thing to do is an index scan using ssal, because the salary condition is highly selective, and a sequential scan over batting, because the RBI condition isn't selective enough for the brbis index to pay off. It then builds a hash table from the handful of matching salary rows and probes it with the batting rows, spitting out the rows that satisfy the join condition for a snappy runtime of 51 milliseconds.

What happens if we take away the query planner's ability to use a hash join?

baseball=# set enable_hashjoin to false;
SET
baseball=# explain analyze select b.playerid from batting b, salaries s where b.playerid = s.playerid and b.yearid = s.yearid and s.salary > 30000000 and b.rbis > 10;

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=2824.10..5005.53 rows=1 width=9) (actual time=71.004..71.075 rows=1 loops=1)
   Join Filter: (((b.playerid)::text = (s.playerid)::text) AND (b.yearid = s.yearid))
   ->  Index Scan using ssal on salaries s  (cost=0.00..11.68 rows=2 width=13) (actual time=0.039..0.048 rows=1 loops=1)
         Index Cond: (salary > 30000000::double precision)
   ->  Materialize  (cost=2824.10..3364.85 rows=36275 width=13) (actual time=0.026..63.490 rows=33921 loops=1)
         ->  Seq Scan on batting b  (cost=0.00..2609.82 rows=36275 width=13) (actual time=0.018..44.113 rows=33921 loops=1)
               Filter: (rbis > 10)

 Total runtime: 71.518 ms

(8 rows)

Without a hash join at its disposal, the query planner opts for a nested loop join. In a nested loop join, we're basically doing:

for every row s in salaries with salary > 30000000:
    for every row b in batting with rbis > 10:
        emit b.playerid if s and b have the same playerid and yearid

The Materialize node is a cache of the filtered results of the inner scan (the batting rows with more than 10 RBIs), so that scan doesn't have to be reevaluated after the first iteration through the outer loop. Even though it looks inefficient, this method turns out not to be much slower than the hash join -- helped by the fact that the outer index scan produced only a single salary row, so the inner relation only gets walked once.

Let's hamstring it further by disabling nested loops joins too:

baseball=# set enable_nestloop to false;
SET
baseball=# explain analyze select b.playerid from batting b, salaries s where b.playerid = s.playerid and b.yearid = s.yearid and s.salary > 30000000 and b.rbis > 10;

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=5991.77..6263.62 rows=1 width=9) (actual time=374.931..374.935 rows=1 loops=1)
   Merge Cond: (((b.playerid)::text = (s.playerid)::text) AND (b.yearid = s.yearid))
   ->  Sort  (cost=5980.06..6070.74 rows=36275 width=13) (actual time=343.304..365.986 rows=26055 loops=1)
         Sort Key: b.playerid, b.yearid
         Sort Method:  external merge  Disk: 1056kB
         ->  Seq Scan on batting b  (cost=0.00..2609.82 rows=36275 width=13) (actual time=0.025..76.286 rows=33921 loops=1)
               Filter: (rbis > 10)
   ->  Sort  (cost=11.69..11.69 rows=2 width=13) (actual time=0.083..0.084 rows=1 loops=1)
         Sort Key: s.playerid, s.yearid
         Sort Method:  quicksort  Memory: 25kB
         ->  Index Scan using ssal on salaries s  (cost=0.00..11.68 rows=2 width=13) (actual time=0.045..0.046 rows=1 loops=1)
               Index Cond: (salary > 30000000::double precision)

 Total runtime: 375.410 ms

(13 rows)

Ouch. A merge join, our worst option here, not only looks grossly over-complicated but is about 7 times slower than the original hash join. The tuples returned from the filter on salaries are small enough to fit in memory, so we get away with an in-memory quicksort there that is wicked fast, but there are so many batting tuples matching the RBI filter that we're forced into an external merge sort on disk, incurring 343.3 milliseconds -- over 90% of the query time -- in disk seeks, page fetches, and writes. There are even more ways to influence the query planner.
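One obvious lever here is work_mem, which caps how much memory a single sort can use before spilling to disk (a sketch; 8MB is a guess that comfortably covers the 1056kB that spilled above):

SET work_mem TO '8MB';
EXPLAIN ANALYZE SELECT b.playerid FROM batting b, salaries s
WHERE b.playerid = s.playerid AND b.yearid = s.yearid
  AND s.salary > 30000000 AND b.rbis > 10;
-- With luck, the external merge on batting becomes an in-memory quicksort.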

Bottom of the 9th

Well, I don't know about you, but I'm glad there's a query planner making all these choices for me. I'm ready to head over to Fenway, dump beer on Yankees fans, and pretend to be a sabermetrician. Anyone have a MySQL or Oracle instance handy? How do their query plans compare for these queries?

~jesstess

