MySQL for Developers

this is a write up of info I gathered from   Jay PipesHigh Performance MySQL , Colin Charles and others (see References below) for a Sun Tech Days MySQL presentation.

MySQL powers many high-volume Web sites,  including industry leaders such as Yahoo!, Alcatel-Lucent, Google, Nokia, YouTube, and Zappos.com.

MySQL for Developers

If you are a developer using MySQL, you should learn enough to take advantage of its strengths, because having an understanding of the database can help you develop better-performing applications. This can be especially important for Hibernate or JPA developers, since ORM frameworks tend to obscure the database schema and SQL  for the developer, which can lead to poorly-performing index and schema strategies and inefficient SQL.

MySQL Storage Engine Architecture


A key difference between MySQL and other database platforms is the pluggable storage engine architecture of MySQL, which allows you to select a specialized storage engine for a particular application need such as data warehousing, transaction processing, high availability...
A storage engine is  responsible for storing and retrieving all the data stored .  The storage engines have different functionality, capabilities and performance characteristics, in many applications choosing the right storage engine can greatly improve performance. Here is a brief summary of some of the more common engines:
  • MyISAM - offers high-speed query and insert capability, is nontransactional, provides table-level locking, full-text indexing, compression, GIS functions, and supports indexes. MyISAM is a good choice for read-only or read-mostly applications that don't require transactions and issue primarily either SELECT or INSERT queries, such as Web Catalogs or listing of jobs, auctions, real estate, etc., or data warehousing applications. You can use multiple storage engines in a single application, sometimes specific components of an application (such as logging) fall into this category. MyISAM is not a good general purpose storage engine for high concurrent UPDATEs or DELETEs since these obtain exclusive write locks, however new rows can be inserted with shared read locks.
  • InnoDB - supports ACID transactions, multi-versioning, row-level locking, foreign key constraints, crash recovery,  and  good query performance depending on indexes. InnoDB uses row-level locking with multiversion concurrency control (MVCC). MVCC can allow fewer row locks by keeping data snapshots. Depending on the isolation level, InnoDB does not require any locking for a SELECT. This makes high concurrency possible, with some trade-offs: InnoDB requires more disk space compared to MyISAM, and for the best performance, lots of memory is required for the InnoDB buffer pool. InnoDB is a good choice for any order processing application, any application where transactions are required.  
  • Memory - stores all data in RAM for extremely fast access. Useful when you need fast access to data that doesn't change or doesn't need to persist after a restart.  Good for "lookup" or "mapping" tables, for caching the results of periodically aggregated data, for intermediate results when analyzing data.
    MEMORY tables do not support variable-length columns, this means that VARCHAR columns get implicitly converted to CHAR columns, and it is impossible to create a MEMORY table with TEXT or BLOB columns.
  • Merge - allows to logically group together a series of identical MyISAM tables and reference them as one object. Good for very large DBs like data warehousing.
  • Archive - provides for storing and retrieving large amounts of seldom-referenced historical, archived, or security audit information.
More specialized engines:
  • Federated - Allows a remote server's tables to be used as if they were local, by linking together separate MySQL servers as one logical database. Can be good for distributed or data mart environments. 
  • Cluster/NDB -Highly-available clustered storage engine.  Should be used where very high speed, availability and redundancy are absolute requirements. Other tables (even in the same database) should use one or more of the other storage engines.
  • CSV-references comma-separated files as database tables.Useful for large bulk imports or exports
  • Blackhole- the /dev/null storage engine for temporarily disabling application input to the database .  Useful for benchmarking and some replication scenarios.
You can use multiple storage engines in a single application, a storage engine for the same table on a slave can be different than that of the master. This can be very useful for taking advantage of an engine's abilities.

Schema, the basic foundation of performance

Database normalization minimizes duplication of information, this makes updates simpler and faster because the same information doesn't have to be updated in multiple tables.  In a denormalized database, information is duplicated, or stored in multiple places. With a normalized database:
  • updates are usually faster.
  • there's less data to change.
  • tables are usually smaller, use less memory, which can give better performance.
  • better performance for distinct or group by queries
The disadvantages of a normalized schema are queries typically involve more tables and require more joins which can reduce performance.  Also normalizing may place columns in different tables that would benefit from belonging to the same index, which can also reduce query performance.  More normalized schemas are better for applications involving many transactions, less normalized are better for reporting types of application.  You should normalize your schema first, then de-normalize later.  Applications often need to mix the approaches, for example use a partially normalized schema, and duplicate, or cache, selected columns from one table in another table.

Data Types

In general, try to use the smallest data type that you can.  Small and simple data types usually give better performance because it  means fewer disk accesses (less I/O), more data in memory, and less CPU to process operations.

Numeric Data Types

MySQL has 9 numeric data types.

Whole Numbers:
TINYINT, SMALLINT, MEDIUMINT, INT, BIGINT:   require 8, 16, 24, 32, and 64 bits of storage space, respectively. They can store values from –2(n-1) to 2(n-1)–1, where N is the number of bits of storage space they use. 

Use UNSIGNED when you don't need negative numbers, this doubles the bits of storage space.  BIGINT is not needed for AUTO_INCREMENT,  INT UNSIGNED stores 4.3 billion values!

INT(1) does not mean 1 digit!  The number in parentheses is the ZEROFILL argument, and specifies the number of characters some tools reserve for display purposes. For storage and computational purposes, INT(1) is identical to INT(20).

Integer data types work best for primary key data types. 

Real Numbers
:
FLOAT, DOUBLE: supports approximate calculations with standard floating-point math.
DECIMAL: use DECIMAL when you need exact results, always use for monetary/currency fields.

Other:
BIT: to store  0,1 values. BIT(1) stores 1 bit, BIT(2) stores 2... max is 64.

Character Data Types

The CHAR and VARCHAR types are declared with a length that indicates the maximum number of characters to store.
VARCHAR(n)
stores variable-length character strings. VARCHAR uses only as much space as it needs, which helps performance because it saves disk space. Use when the maximum column length is larger than the average length and when updates to the field are rare, so fragmentation is not a problem.  VARCHAR gives more efficient disk usage, however MySQL internal memory buffers are fixed width, so  internal memory usage isn't more efficient, therefore try to keep VARCHAR Max Length short.
CHAR(n) is fixed-length: MySQL allocates enough space for the specified number of characters. Useful to store very short strings, when all the values are nearly the same length, and  for data that's changed frequently.

Use NOT NULL

Always define columns as NOT NULL unless there is a very good reason not to do so:
  • can save up to a byte per column per row of data
  • nullable columns make indexes, index statistics, and value comparisons more complicated.

Indexes

Indexes are data structures that help retrieve row data with specific column values faster. Indexes can especially improve performance for larger data bases.  Most MySQL storage engines support  B-tree indexes. a B-tree is (remember data strucutres 101 ?) a tree data structure that sorts data values, tree nodes define the upper and lower bounds of the values in the child nodes.  Leaf nodes have pointers to the data instead of pointers  to child nodes. B-trees are kept balanced by requiring that all leaf nodes are at the same depth.  MyISAM leaf nodes refer to the indexed row by the position of the row,  InnoDB leaf nodes refers to the index by its primary key values.

InnoDB's clustered indexes store the row data in the leaf nodes, it's called clustered because rows with close primary key values are stored close to each other.  This can make retrieving indexed data fast, since the data is in the index.  But this can be slower for updates , secondary indexes, and for full table scans.

Covering Indexes are indexes that contain all the data values needed for a query, these queries can improve performance because the row does not have to be read.

Know Every SQL Statement

You need to understand the SQL queries your application makes and  evaluate their performance. For this you can harness the MySQL slow query log and use EXPLAIN.  Consider adding an index for queries that are taking too long, or cause a lot of load. Know how your query is executed by MySQL, consider disk I/O in particular.  Basically you want to make your queries access less data, Indexes are a good way to reduce data access.

Know Every SQL Statement for Hibernate/JPA developers

If you are using Hibernate, JPA, or another ORM framework you want to look at the generated SQL-statements. If you are using Hibernate, enable the Hibernate SQL log.  If you are using Glassfish with Toplink as the JPA provider, you can log the SQL that is being sent to the database by adding this property in the definition of the persistence unit in the persistence.xml file, as follows: <property name="toplink.logging.level" value="FINE">. Then run a use case of your application and examine the SQL statements which are executed by JPA or your ORM framework. You want to make sure you are only retrieving the data your application needs, that is you want to optimize the number of SQL statements executed (see lazy loading below), and you want to examine the execution plan for queries.

Optimize data access

Basically you want to make your queries access less data:
  • is your application retrieving more data than it needs, are queries accessing too many rows or columns?
  • is MySQL analyzing more rows than it needs?

Understanding EXPLAIN

EXPLAIN provides the execution plan chosen by the MySQL optimiser for a specific SELECT statement.  It is important to run EXPLAIN on all SELECT statements that your code is executing against the database.  This step ensures that missing indexes are picked up early in the development process and gives developers insight into how the MySQL optimizer has chosen to execute the query.

To use it just precede a SELECT statement with the keyword EXPLAIN and  MySQL will display information from the optimizer about the query execution plan, which explains how it would process the SELECT,  including information about how tables are joined and in which order. The EXPLAIN EXTENDED SELECT variant, followed by SHOW WARNINGS gives some additional information, including the final rewritten query as used by the optimizer.

With the help of EXPLAIN, you can see where you should add indexes to tables to get a faster SELECT that uses indexes to find rows. You can also use EXPLAIN to check whether the optimizer joins the tables in an optimal order.

EXPLAIN
returns a row of information for each "table" used in the SELECT statement, which shows each part and the order of the execution plan.  The "table" can mean a real schema table, a derived or temporary table, a subquery, a union result. Here is an explanation of some of the output for each row:
  • table:  the real table name or an alias.
  • select_type: shows whether the row is a simple or complex SELECT. If complex it can be: PRIMARY (Outermost SELECT), UNION, SUBQUERY, DERIVED, or UNCACHEABLE.
  • type: shows the "access strategy" type to find rows. The different access types ordered from the best  to worst:
    • system, or const: very fast because  the table has at most one matching row (For example a primary key used in the WHERE)
    • eq_ref: index lookup returning one value
    • ref: index access
    • fulltext:  FULLTEXT index.
    • ref_or_null:  like ref, but with an extra search for rows that contain NULL values.
    • range:   index scan to select the rows in given range
    • index:   index tree is scanned.
    • ALL: A full table scan
  • key: the index that MySQL decided to use, from the possible_keys, to look up rows.
  • rows: the number of rows MySQL estimates it must examine to execute the query.
  • ref: columns or constants which are compared to key to select rows .
  • Extra: additional information about how MySQL resolves the query. Watch out for Extra values of Using filesort and Using temporaryUsing index means information is retrieved from the table using only information in the index tree without having to do an additional seek to read the actual row. This strategy can be used when the query uses only columns that are part of a single index (Covering Index).

When do you get a full table scan?

  • No WHERE condition
  • No index on any field in WHERE condition
  • When your range returns a large number of rows, i.e. too many records in WHERE condition
  • When optimizer sees that the condition will return > ~20% of the records in a table, it will prefer a scan versus many seeks
  • Pre-MySQL 5, using OR in a WHERE clause
    • now fixed with an index merge, so the optimiser can use more than one index to satisfy a join condition
How do you know if a scan is used?
In the EXPLAIN output, the “type” for the table/set will be “ALL” or “index”.  “ALL” means a full table data record scan is performed.  “index” means a full index record scan.  Avoid this by ensuring indexes are on columns that are used in the WHERE, ON, and GROUP BY clauses.

Scans and seeks

A seek jumps into a random place (on disk or in memory) to fetch data. A scan will jump to the start of the data, and sequentially read (from either disk or memory) until the end of the data. With large amounts of data sequentially scanning through contiguous data on disk or in memory is faster than performing many random seek operations.  But scans can also be a sign of poor indexing, if an index contains  many rows with the same value it is not very useful.  MySQL keeps stats about the  uniqueness of values in an index in order to the estimated rows returned (rows in the explain output). If the rows is greater than a certain  % of rows in the table, then MySQL will do a scan.

Indexed columns and functions don't mix

Indexes can quickly find the rows that match a WHERE clause, however this works only if the index is NOT used in a function or expression in the WHERE clause. Below is an example where an index is used :

mysql> EXPLAIN SELECT \* FROM film WHERE title LIKE 'Tr%'\\G
\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\* 1. row \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
           id: 1
  select_type: SIMPLE
        table: film
type: range
possible_keys: idx_title
key: idx_title
      key_len: 767
          ref: NULL
         rows: 15
        Extra: Using where


In the example above, a  fast range "access strategy" is chosen by the optimizer, and the index scan on title is used to winnow the query results down.
Below is an example where an index can not be used :

mysql> EXPLAIN SELECT \* FROM film WHERE LEFT(title,2) = 'Tr' \\G
\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\* 1. row \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
           id: 1
  select_type: SIMPLE
        table: film
type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 951
        Extra: Using where


A slow full table scan (the ALL"access strategy") is used because a function (LEFT) is operating on the title column. Operating on an indexed column with a function (in this case the LEFT() function) means the optimizer cannot use the index to satisfy the query.  Typically, you can rewrite queries in order to not operate on an indexed column with a function.

Solving multiple problems in one query

Here is an example of improving a query:

SELECT \* FROM Orders WHERE TO_DAYS(CURRENT_DATE()) – TO_DAYS(order_created) <= 7;

First, we are operating on an indexed column (order_created) with a function TO_DAYS – let's fix that:

SELECT \* FROM Orders WHERE order_created >= CURRENT_DATE() - INTERVAL 7 DAY;

Although we rewrote the WHERE expression to remove the function on the index, we still have a non-deterministic function CURRENT_DATE() in the statement, which eliminates this query from being placed in the query cache. Any time a non-deterministic function is used in a SELECT statement, the query cache ignores the query.  In read-intensive applications, this can be a significant performance problem. – let's fix that:

SELECT \* FROM Orders WHERE order_created >= '2008-01-11' - INTERVAL 7 DAY;

We replaced the function with a constant (probably using our application programming language).  However, we are specifying SELECT \* instead of the actual fields we need from the table.  What if there is a TEXT field in Orders called order_memo that we don't need to see?  Well, having it included in the result means a larger result set which may not fit into the query cache and may force a disk-based temporary table. – let's fix that:

SELECT order_id, customer_id, order_total, order_created
FROM Orders WHERE order_created >= '2008-01-11' - INTERVAL 7 DAY;


A Few Things to consider for  JPA/Hibernate  devlopers:

Lazy loading and JPA

With JPA many-to-one and many-to-many relationships lazy load by default , meaning they will be loaded when the entity in the relationship is accessed. Lazy loading is usually good, but if you need to access all of the "many" objects in a relationship, it will cause n+1 selects where n is the number of  "many" objects.  You can change the relationship to be loaded eagerly as follows :
public class Employee{

    @OneToMany(mappedBy = "employee", fetch = FetchType.EAGER)
    private Collection<Address> addresses;

 .....
}
However you should be careful with eager loading which could cause SELECT statements that fetch too much data. It can cause a Cartesian product  if you eagerly load entities with several related collections. 

If you want to temporarily override the LAZY fetch type, you could use Fetch Join.  For example this query would eagerly load the employee addresses: 
@NamedQueries({ @NamedQuery(name="getItEarly", 
                 query="SELECT e FROM Employee e JOIN FETCH e.addresses")})

public class Employee{
.....
}
 

Optimistic locking and JPA

The Java Persistence API 1.0 persistence providers support optimistic locking, where your application will get an exception whenever a transaction tries to commit an object that was updated in the database since this transaction began. To enable this for a entity, you need to add a version attribute:

@Entity
public class Item {
    ...
    @Version
    private int version;
    ...
}


In Conclusion

  • Understand the storage engines
  • Keep data types small and compact
  • Understand your query execution plans with the EXPLAIN output
  • Understand the scan vs. seek choice the optimizer must make
  • Don't mix functions and indexed columns

References

High Performance MySQL book
MySQL Pluggable Storage Engine Architecture
MySQL Storage Engine Architecture, Part 2: An In-Depth Look
Optimizing Queries with EXPLAIN
Java Persistence with Hibernate book
Jay Pipes blog
Colin Charles blog
mysql performance blog
Ronald Bradford blog
Taking JPA for a Test Drive
Pro EJB 3: Java Persistence API
Pro MySQL, Chapter 6: Benchmarking and Profiling
Comments:

Post a Comment:
Comments are closed for this entry.
About

caroljmcdonald

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today