Trading off Efficiency for the Sake of Flexibility
By realneel on Feb 13, 2009
One of the most important features of MySQL is its support for pluggable storage engines. Since users use MySQL in different ways, one storage engine may not fit everyone's needs, and there are many engines for a user or application to choose from. Applications (and users) access MySQL through a uniform interface, irrespective of which storage engine is being used. As with any kind of software layering, optimization opportunities can be lost as we cross layer boundaries. In this blog I will discuss one such lost opportunity.
Internally, MySQL uses the PSAPI (public storage engine API) to communicate with the storage engines. The MySQL server uses a fixed row format, but storage engines are free to use whatever row format they choose (InnoDB, for example, uses a different one). The advantage of a different row format is that specialized storage engines can store rows optimally; a storage engine for DSS workloads could store data in a column-based format, for instance. The disadvantage is that data must be converted to and from the MySQL row format every time it crosses the storage engine API, and this copying does not come free.
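To make the cost concrete, here is a minimal sketch of what a per-row format conversion looks like. The struct and field layout are hypothetical (the real InnoDB record format has variable-length headers and a computed offsets array); the point is the one-memcpy-per-requested-field pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical engine-native row: field bytes packed back to back,
// with offsets[i]..offsets[i+1] delimiting field i.
struct EngineRow {
    const unsigned char *data;  // packed field bytes
    const size_t *offsets;      // n_fields + 1 entries
    size_t n_fields;
};

// Convert one engine row into the server's fixed row buffer:
// one memcpy per field the query actually needs.
void store_server_row(unsigned char *server_buf,
                      const EngineRow &row,
                      const bool *field_requested) {
    for (size_t i = 0; i < row.n_fields; ++i) {
        if (!field_requested[i]) continue;        // skip unused columns
        size_t len = row.offsets[i + 1] - row.offsets[i];
        std::memcpy(server_buf + row.offsets[i],  // one copy per field
                    row.data + row.offsets[i], len);
    }
}
```

With N rows scanned and F fields referenced per row, this pattern performs on the order of N*F memcpy calls, which is exactly the scaling we will see in the counts below.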
Let me illustrate this with an example. I have a table with 1000 records, and I execute a query (select id from sbtest1 where length(c)>0;) that does a full table scan, counting how many times each function is executed. Note that these are call counts; they say nothing about how much time each function actually took. Looking at the function counts, we see:
130 rec_copy_prefix_to_buf
161 mtr_memo_slot_release
1000 Item_field::val_str(String*)
1000 ha_innobase::unlock_row()
1000 ha_innobase::general_fetch(unsigned char*, unsigned, unsigned)
1000 Field_string::val_str(String*, String*)
1000 Item_func_gt::val_int()
1000 Arg_comparator::compare_int_signed()
1000 Item_func_length::val_int()
1000 lock_clust_rec_cons_read_sees
1000 my_lengthsp_8bit
1000 row_sel_store_mysql_rec
1001 rr_sequential(st_read_record*)
1001 evaluate_join_record(JOIN*, st_join_table*, int)
1001 ha_innobase::rnd_next(unsigned char*)
1001 Item_int::val_int()
1001 rec_get_offsets_func
1001 row_search_for_mysql
1004 handler::ha_statistic_increment(unsigned long system_status_var::*) const
1010 thd_ha_data
2152 memcpy
As expected, quite a few functions are called 1000 times. It is interesting that memcpy is called roughly twice as many times as the row count. Looking closer at what calls memcpy, we see:
8 net_write_buff(st_net*, const unsigned char*, unsigned long)
9 alloc_root
125 mtr_memo_slot_release
130 rec_copy_prefix_to_buf
871 row_search_for_mysql
1000 row_sel_store_mysql_rec
row_sel_store_mysql_rec converts a row from the InnoDB format to the MySQL format. It uses memcpy to copy every field that is required (i.e., part of the select list or the where condition). Integers are converted with a for loop; everything else (except blobs) is copied with memcpy, once per field. If I change my query to reference two columns in the where clause, I would expect to see twice the number of calls to memcpy. InnoDB also does an additional memcpy for most rows (the 871 calls from row_search_for_mysql above); I need to take a closer look at why it's done.
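The reason integers get a for loop rather than a straight memcpy is, as I understand it, an endianness mismatch: InnoDB stores integers most-significant-byte first, while the server's row format is little-endian, so the bytes must be moved one at a time in reverse order. A simplified sketch of that conversion (the real code also flips the sign bit, which is omitted here):

```cpp
// Copy an integer field whose source bytes are stored big-endian
// into a little-endian destination: one byte per loop iteration,
// reversed, instead of a single memcpy.
void copy_int_reversed(unsigned char *dst, const unsigned char *src,
                       unsigned len) {
    for (unsigned i = 0; i < len; ++i)
        dst[i] = src[len - 1 - i];  // reverse the byte order
}
```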
So why am I picking on memcpy? In a recent investigation of CPU consumers for a sysbench read-only test, memcpy was the top consumer of CPU. The cost of memcpy can get very high when a lot of data is being transferred between MySQL and the storage engine. You can always reduce this penalty by minimizing that data transfer: avoid full table scans, push query conditions down to the storage engine, and so on.
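To get a feel for how the copy cost scales with the amount of data crossing the API boundary, here is a trivial micro-benchmark. It is nothing MySQL-specific, just one memcpy per "row" of a chosen width; comparing, say, 100-byte rows against 1000-byte rows shows the per-row copy cost growing with row width.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Time copying `rows` rows of `row_len` bytes each, mimicking the
// per-row conversion copy at the storage engine boundary.
double copy_seconds(size_t rows, size_t row_len) {
    std::vector<unsigned char> src(row_len, 0xAB), dst(row_len);
    volatile unsigned char sink = 0;  // keep the copies from being optimized away
    auto t0 = std::chrono::steady_clock::now();
    for (size_t r = 0; r < rows; ++r) {
        std::memcpy(dst.data(), src.data(), row_len);  // one copy per row
        sink ^= dst[r % row_len];
    }
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double>(t1 - t0).count();
}
```

On my understanding this is the whole story behind the advice above: fewer rows crossing the boundary, or narrower rows (fewer referenced fields), means fewer and smaller copies.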
Remember, software layering does not come free, but sometimes it might be worth it!