Friday Aug 28, 2009

JPA Performance, Don't Ignore the Database


Database Schema

Good Database schema design is important for performance. One of the most basic optimizations is to design your tables to take as little space on the disk as possible , this makes disk reads faster and uses less memory for query processing.

Data Types

You should use the smallest data types possible, especially for indexed fields. The smaller your data types, the more indexes (and data) can fit into a block of memory, the faster your queries will be.


Database Normalization eliminates redundant data, which usually makes updates faster since there is less data to change. However a Normalized schema causes joins for queries, which makes queries slower, denormalization speeds retrieval. More normalized schemas are better for applications involving many transactions, less normalized are better for reporting types of applications.  You should normalize your schema first, then de-normalize later.  Applications often need to mix the approaches, for example use a partially normalized schema, and duplicate, or cache, selected columns from one table in another table. With JPA O/R mapping you can use the @Embedded annotation for denormalized columns to specify a persistent field whose @Embeddable type can be stored as an intrinsic part of the owning entity and share the identity of the entity.

Database Normalization and Mapping Inheritance Hiearchies

The Class Inheritance hierarchy shown below will be used as an example of JPA O/R mapping.

In the Single table per class mapping shown below, all classes in the hierarchy are mapped to a single table in the database. This table has a discriminator column (mapped by @DiscriminatorColumn), which identifies the subclass.  Advantages: This is fast for querying, no joins are required. Disadvantages:  wastage of space since all inherited fields are in every row, a deep inheritance hierarchy will result in wide tables with many, some empty columns.

In the Joined Subclass mapping shown below, the root of the class hierarchy is represented by a single table, and each subclass has a separate table that only contains those fields specific to that subclass. This is normalized (eliminates redundant data) which is better for storage and updates. However queries cause joins which makes queries slower especially for deep hierachies, polymorphic queries and relationships.

In the Table per Class mapping (in JPA 2.0, optional in JPA 1.0),  every concrete class is mapped to a table in the database and all the inherited state is repeated in that table. This is not normlalized, inherited data is repeated which wastes space.  Queries for Entities of the same type are fast, however  polymorphic queries cause unions which are slower.

Know what SQL is executed

You need to understand the SQL queries your application makes and evaluate their performance. Its a good idea to enable SQL logging, then go through a use case scenario to check the executed SQL.  Logging is not part of the JPA specification, With EclipseLink you can enable logging of SQL by setting the following property in the persistence.xml file:

    <property name="eclipselink.logging.level" value="FINE"/>

With Hibernate you set the following property in the persistence.xml file:

    <property name="hibernate.show_sql" value="true" />

Basically you want to make your queries access less data, is your application retrieving more data than it needs, are queries accessing too many rows or columns? Is the database query analyzing more rows than it needs? Watch out for the following:
  • queries which execute too often to retrieve needed data
  • retrieving more data than needed
  • queries which are too slow
    • you can use EXPLAIN to see where you should add indexes

With MySQL you can use the slow query log to see which queries are executing slowly, or you can use the MySQL query analyzer to see slow queries, query execution counts, and results of EXPLAIN statements.

Understanding EXPLAIN

For slow queries, you can precede a SELECT statement with the keyword EXPLAIN  to get information about the query execution plan, which explains how it would process the SELECT,  including information about how tables are joined and in which order. This helps find missing indexes early in the development process.

You should index columns that are frequently used in Query WHERE, GROUP BY clauses, and columns frequently used in joins, but be aware that indexes can slow down inserts and updates.

Lazy Loading and JPA

With JPA many-to-one and many-to-many relationships lazy load by default, meaning they will be loaded when the entity in the relationship is accessed. Lazy loading is usually good, but if you need to access all of the "many" objects in a relationship, it will cause n+1 selects where n is the number of "many" objects.

You can change the relationship to be loaded eagerly as follows :

However you should be careful with eager loading which could cause SELECT statements that fetch too much data. It can cause a Cartesian product if you eagerly load entities with several related collections.

If you want to override the LAZY fetch type for specific use cases, you can use Fetch Join. For example this query would eagerly load the employee addresses:

In General you should lazily load relationships, test your use case scenarios, check the SQL log, and use @NameQueries with JOIN FETCH to eagerly load when needed.


the main goal of partitioning is to reduce the amount of data read for particular SQL operations so that the overall response time is reduced

Vertical Partitioning  splits tables with many columns into multiple tables with fewer columns, so that only certain columns are included in a particular dataset, with each partition including all rows.

Horizontal Partitioning segments table rows so that distinct groups of physical row-based datasets are formed. All columns defined to a table are found in each set of partitions. An example of horizontal partitioning might be a table that contains historical data being partitioned by date.

Vertical Partitioning

In the example of vertical partitioning below a table that contains a number of very wide text or BLOB columns that aren't referenced often is split into two tables with the most referenced columns in one table and the seldom-referenced text or BLOB columns in another.

By removing the large data columns from the table, you get a faster query response time for the more frequently accessed Customer data. Wide tables can slow down queries, so you should always ensure that all columns defined to a table are actually needed.

The example below shows the JPA mapping for the tables above. The Customer data table with the more frequently accessed and smaller data types  is mapped to the Customer Entity, the CustomerInfo table with the less frequently accessed and larger data types is mapped to the CustomerInfo Entity with a lazily loaded one to one relationship to the Customer.

Horizontal Partitioning

The major forms of horizontal partitioning are by Range, Hash, Hash Key, List, and Composite.

Horizontal partitioning can make queries faster because the query optimizer knows what partitions contain the data that will satisfy a particular query and will access only those necessary partitions during query execution. Horizontal Partitioning works best for large database Applications that contain a lot of query activity that targets specific ranges of database tables.

Hibernate Shards

Partitioning data horizontally into "Shards" is used by google, linkedin, and others to give extreme scalability for very large amounts of data. eBay "shards" data horizontally along its primary access path.

Hibernate Shards is a framework that is designed to encapsulate support for horizontal partitioning into the Hibernate Core.


JPA Level 2 caching avoids database access for already loaded entities, this make reading reading frequently accessed unmodified entities faster, however it can give bad scalability for frequent or concurrently updated entities.

You should configure L2 caching for entities that are:
  • read often
  • modified infrequently
  • Not critical if stale
You should also configure L2 (vendor specific) caching for maxElements, time to expire, refresh...

References and More Information:

JPA Best Practices presentation
MySQL for Developers Article
MySQL for developers presentation
MySQL for developers screencast
Keeping a Relational Perspective for Optimizing Java Persistence
Java Persistence with Hibernate
Pro EJB 3: Java Persistence API
Java Persistence API 2.0: What's New ?
High Performance MySQL book
Pro MySQL, Chapter 6: Benchmarking and Profiling
EJB 3 in Action
sharding the hibernate way
JPA Caching
Best Practices for Large-Scale Web Sites: Lessons from eBay

Friday Aug 21, 2009

JPA Caching

JPA Level 1 caching

JPA has 2 levels of caching. The first level of caching is the persistence context.

The JPA Entity Manager maintains a set of Managed Entities in the Persistence Context.

The Entity Manager guarantees that within a single Persistence Context, for any particular database row, there will be only one object instance. However the same entity could be managed in another User's transaction, so you should use either optimistic or pessimistic locking  as explained in JPA 2.0 Concurrency and locking

The code below shows that a find on a managed entity with the same id and class as another in the same persistence context , will return the same instance.

@Stateless public ShoppingCartBean implements ShoppingCart {

 @PersistenceContext EntityManager entityManager;

 public OrderLine createOrderLine(Product product,Order order) {
        OrderLine orderLine = new OrderLine(order, product);
        entityManager.persist(orderLine);   //Managed
        OrderLine orderLine2 =entityManager.find(OrderLine,
     (orderLine == orderLine2  // TRUE
        return (orderLine);


The diagram below shows the life cycle of an Entity in relation to the Persistent Context.

The code below illustrates the life cycle of an Entity. A reference to a container managed EntityManager is injected using the persistence context annotation. A new order entity is created and the entity has the state of new. Persist is called, making this a managed entity. because it is a stateless session bean it is by default using container managed transactions , when this transaction commits , the order is made persistent in the database. When the orderline entity is returned at the end of the transaction it is a detached entity.

The Persistence Context can be either Transaction Scoped-- the Persistence Context 'lives' for the length of the transaction, or Extended-- the Persistence Context spans multiple transactions. With a Transaction scoped Persistence Context, Entities are "Detached" at the end of a transaction.

As shown below, to persist the changes on a detached entity, you call the EntityManager's merge() operation, which returns an updated managed entity, the entity updates will be persisted to the database at the end of the transaction.

An Extended Persistence Context spans multiple transactions, and the set of Entities in the Persistence Context stay Managed. This can be useful in a work flow scenario where a "conversation" with a user spans multiple requests.

The code below shows an example of a Stateful Session EJB with an Extended Persistence Context in a use case scenario to add line Items to an Order. After the Order is persisted in the createOrder method, it remains managed until the EJB remove method is called. In the addLineItem method , the Order Entity can be updated because it is managed, and the updates will be persisted at the end of the transaction.

The example below contrasts updating the Order using a transaction scoped Persistence Context verses an extended Persistence context. With the transaction scoped persistence context, an Entity Manager find must be done to look up the Order, this returns a Managed Entity which can be updated. With the Extended Persistence Context the find is not necessary. The performance advantage of not doing a database read to look up the Entity, must be weighed against the disadvantages of memory consumption for caching, and the risk of cached entities being updated by another transaction.  Depending on the application and the risk of contention among concurrent transactions this may or may not give better performance / scalability.

JPA second level (L2) caching

JPA second level (L2) caching shares entity state across various persistence contexts.

JPA 1.0 did not specify support of a second level cache, however, most of the persistence providers provided support for second level cache(s). JPA 2.0 specifies support for basic cache operations with the new Cache API, which is accessible from the EntityManagerFactory, shown below:

If L2 caching is enabled, entities not found in persistence context, will be loaded from L2 cache, if found.

The advantages of L2 caching are:
  • avoids database access for already loaded entities
  • faster for reading frequently accessed  unmodified entities
The disadvantages of L2 caching are:
  • memory consumption for large amount of objects
  • Stale data for updated objects
  • Concurrency for write (optimistic lock exception, or pessimistic lock)
  • Bad scalability for frequent or concurrently updated entities

You should configure L2 caching for entities that are:
  • read often
  • modified infrequently
  • Not critical if stale
You should protect any data that can be concurrently modified with a locking strategy:
  • Must handle optimistic lock failures on flush/commit
  • configure expiration, refresh policy to minimize lock failures
The Query cache is useful for queries that are run frequently with the same parameters, for not modified tables.

The EclipseLink JPA persistence provider caching Architecture

The  EclipseLink caching Architecture is shown below.

Support for second level cache in EclipseLink is turned on by default, entities read are L2 cached. You can disable the L2 cache. EclipseLink caches entities in L2, Hibernate caches entity id and state in L2. You can configure caching by Entity type or Persistence Unit with the following configuration parameters:
  • Cache isolation, type, size, expiration, coordination, invalidation,refreshing
  • Coordination (cluster-messaging)
  • Messaging: JMS, RMI, RMI-IIOP, …
The example below shows configuring the L2 cache for an entity using the @Cache annotation

The Hibernate JPA persistence provider caching Architecture

The Hibernate JPA persistence provider caching architecture is different than EclipseLink: it is not configured by default, it does not cache enities just id and state, and you can plug in different L2 caches. The diagram below shows the different L2 cache types that you can plug into Hibernate.

The configuration of the cache depends on the type of caching plugged in. The example below shows configuring the hibernate L2 cache for an entity using the @Cache annotation

For More Information:

Introducing EclipseLink
EclipseLink JPA User Guide
Hibernate Second Level Cache
Speed Up Your Hibernate Applications with Second-Level Caching
Hibernate caching
Java Persistence API 2.0: What's New ?
Beginning Java™ EE 6 Platform with GlassFish™ 3
Pro EJB 3: Java Persistence API (JPA 1.0)

Tuesday Sep 16, 2008

MySQL for Developers

this is a write up of info I gathered from   Jay PipesHigh Performance MySQL , Colin Charles and others (see References below) for a Sun Tech Days MySQL presentation.

MySQL powers many high-volume Web sites,  including industry leaders such as Yahoo!, Alcatel-Lucent, Google, Nokia, YouTube, and

MySQL for Developers

If you are a developer using MySQL, you should learn enough to take advantage of its strengths, because having an understanding of the database can help you develop better-performing applications. This can be especially important for Hibernate or JPA developers, since ORM frameworks tend to obscure the database schema and SQL  for the developer, which can lead to poorly-performing index and schema strategies and inefficient SQL.

MySQL Storage Engine Architecture

A key difference between MySQL and other database platforms is the pluggable storage engine architecture of MySQL, which allows you to select a specialized storage engine for a particular application need such as data warehousing, transaction processing, high availability...
A storage engine is  responsible for storing and retrieving all the data stored .  The storage engines have different functionality, capabilities and performance characteristics, in many applications choosing the right storage engine can greatly improve performance. Here is a brief summary of some of the more common engines:
  • MyISAM - offers high-speed query and insert capability, is nontransactional, provides table-level locking, full-text indexing, compression, GIS functions, and supports indexes. MyISAM is a good choice for read-only or read-mostly applications that don't require transactions and issue primarily either SELECT or INSERT queries, such as Web Catalogs or listing of jobs, auctions, real estate, etc., or data warehousing applications. You can use multiple storage engines in a single application, sometimes specific components of an application (such as logging) fall into this category. MyISAM is not a good general purpose storage engine for high concurrent UPDATEs or DELETEs since these obtain exclusive write locks, however new rows can be inserted with shared read locks.
  • InnoDB - supports ACID transactions, multi-versioning, row-level locking, foreign key constraints, crash recovery,  and  good query performance depending on indexes. InnoDB uses row-level locking with multiversion concurrency control (MVCC). MVCC can allow fewer row locks by keeping data snapshots. Depending on the isolation level, InnoDB does not require any locking for a SELECT. This makes high concurrency possible, with some trade-offs: InnoDB requires more disk space compared to MyISAM, and for the best performance, lots of memory is required for the InnoDB buffer pool. InnoDB is a good choice for any order processing application, any application where transactions are required.  
  • Memory - stores all data in RAM for extremely fast access. Useful when you need fast access to data that doesn't change or doesn't need to persist after a restart.  Good for "lookup" or "mapping" tables, for caching the results of periodically aggregated data, for intermediate results when analyzing data.
    MEMORY tables do not support variable-length columns, this means that VARCHAR columns get implicitly converted to CHAR columns, and it is impossible to create a MEMORY table with TEXT or BLOB columns.
  • Merge - allows to logically group together a series of identical MyISAM tables and reference them as one object. Good for very large DBs like data warehousing.
  • Archive - provides for storing and retrieving large amounts of seldom-referenced historical, archived, or security audit information.
More specialized engines:
  • Federated - Allows a remote server's tables to be used as if they were local, by linking together separate MySQL servers as one logical database. Can be good for distributed or data mart environments. 
  • Cluster/NDB -Highly-available clustered storage engine.  Should be used where very high speed, availability and redundancy are absolute requirements. Other tables (even in the same database) should use one or more of the other storage engines.
  • CSV-references comma-separated files as database tables.Useful for large bulk imports or exports
  • Blackhole- the /dev/null storage engine for temporarily disabling application input to the database .  Useful for benchmarking and some replication scenarios.
You can use multiple storage engines in a single application, a storage engine for the same table on a slave can be different than that of the master. This can be very useful for taking advantage of an engine's abilities.

Schema, the basic foundation of performance

Database normalization minimizes duplication of information, this makes updates simpler and faster because the same information doesn't have to be updated in multiple tables.  In a denormalized database, information is duplicated, or stored in multiple places. With a normalized database:
  • updates are usually faster.
  • there's less data to change.
  • tables are usually smaller, use less memory, which can give better performance.
  • better performance for distinct or group by queries
The disadvantages of a normalized schema are queries typically involve more tables and require more joins which can reduce performance.  Also normalizing may place columns in different tables that would benefit from belonging to the same index, which can also reduce query performance.  More normalized schemas are better for applications involving many transactions, less normalized are better for reporting types of application.  You should normalize your schema first, then de-normalize later.  Applications often need to mix the approaches, for example use a partially normalized schema, and duplicate, or cache, selected columns from one table in another table.

Data Types

In general, try to use the smallest data type that you can.  Small and simple data types usually give better performance because it  means fewer disk accesses (less I/O), more data in memory, and less CPU to process operations.

Numeric Data Types

MySQL has 9 numeric data types.

Whole Numbers:
TINYINT, SMALLINT, MEDIUMINT, INT, BIGINT:   require 8, 16, 24, 32, and 64 bits of storage space, respectively. They can store values from –2(n-1) to 2(n-1)–1, where N is the number of bits of storage space they use. 

Use UNSIGNED when you don't need negative numbers, this doubles the bits of storage space.  BIGINT is not needed for AUTO_INCREMENT,  INT UNSIGNED stores 4.3 billion values!

INT(1) does not mean 1 digit!  The number in parentheses is the ZEROFILL argument, and specifies the number of characters some tools reserve for display purposes. For storage and computational purposes, INT(1) is identical to INT(20).

Integer data types work best for primary key data types. 

Real Numbers
FLOAT, DOUBLE: supports approximate calculations with standard floating-point math.
DECIMAL: use DECIMAL when you need exact results, always use for monetary/currency fields.

BIT: to store  0,1 values. BIT(1) stores 1 bit, BIT(2) stores 2... max is 64.

Character Data Types

The CHAR and VARCHAR types are declared with a length that indicates the maximum number of characters to store.
stores variable-length character strings. VARCHAR uses only as much space as it needs, which helps performance because it saves disk space. Use when the maximum column length is larger than the average length and when updates to the field are rare, so fragmentation is not a problem.  VARCHAR gives more efficient disk usage, however MySQL internal memory buffers are fixed width, so  internal memory usage isn't more efficient, therefore try to keep VARCHAR Max Length short.
CHAR(n) is fixed-length: MySQL allocates enough space for the specified number of characters. Useful to store very short strings, when all the values are nearly the same length, and  for data that's changed frequently.


Always define columns as NOT NULL unless there is a very good reason not to do so:
  • can save up to a byte per column per row of data
  • nullable columns make indexes, index statistics, and value comparisons more complicated.


Indexes are data structures that help retrieve row data with specific column values faster. Indexes can especially improve performance for larger data bases.  Most MySQL storage engines support  B-tree indexes. a B-tree is (remember data strucutres 101 ?) a tree data structure that sorts data values, tree nodes define the upper and lower bounds of the values in the child nodes.  Leaf nodes have pointers to the data instead of pointers  to child nodes. B-trees are kept balanced by requiring that all leaf nodes are at the same depth.  MyISAM leaf nodes refer to the indexed row by the position of the row,  InnoDB leaf nodes refers to the index by its primary key values.

InnoDB's clustered indexes store the row data in the leaf nodes, it's called clustered because rows with close primary key values are stored close to each other.  This can make retrieving indexed data fast, since the data is in the index.  But this can be slower for updates , secondary indexes, and for full table scans.

Covering Indexes are indexes that contain all the data values needed for a query, these queries can improve performance because the row does not have to be read.

Know Every SQL Statement

You need to understand the SQL queries your application makes and  evaluate their performance. For this you can harness the MySQL slow query log and use EXPLAIN.  Consider adding an index for queries that are taking too long, or cause a lot of load. Know how your query is executed by MySQL, consider disk I/O in particular.  Basically you want to make your queries access less data, Indexes are a good way to reduce data access.

Know Every SQL Statement for Hibernate/JPA developers

If you are using Hibernate, JPA, or another ORM framework you want to look at the generated SQL-statements. If you are using Hibernate, enable the Hibernate SQL log.  If you are using Glassfish with Toplink as the JPA provider, you can log the SQL that is being sent to the database by adding this property in the definition of the persistence unit in the persistence.xml file, as follows: <property name="toplink.logging.level" value="FINE">. Then run a use case of your application and examine the SQL statements which are executed by JPA or your ORM framework. You want to make sure you are only retrieving the data your application needs, that is you want to optimize the number of SQL statements executed (see lazy loading below), and you want to examine the execution plan for queries.

Optimize data access

Basically you want to make your queries access less data:
  • is your application retrieving more data than it needs, are queries accessing too many rows or columns?
  • is MySQL analyzing more rows than it needs?

Understanding EXPLAIN

EXPLAIN provides the execution plan chosen by the MySQL optimiser for a specific SELECT statement.  It is important to run EXPLAIN on all SELECT statements that your code is executing against the database.  This step ensures that missing indexes are picked up early in the development process and gives developers insight into how the MySQL optimizer has chosen to execute the query.

To use it just precede a SELECT statement with the keyword EXPLAIN and  MySQL will display information from the optimizer about the query execution plan, which explains how it would process the SELECT,  including information about how tables are joined and in which order. The EXPLAIN EXTENDED SELECT variant, followed by SHOW WARNINGS gives some additional information, including the final rewritten query as used by the optimizer.

With the help of EXPLAIN, you can see where you should add indexes to tables to get a faster SELECT that uses indexes to find rows. You can also use EXPLAIN to check whether the optimizer joins the tables in an optimal order.

returns a row of information for each "table" used in the SELECT statement, which shows each part and the order of the execution plan.  The "table" can mean a real schema table, a derived or temporary table, a subquery, a union result. Here is an explanation of some of the output for each row:
  • table:  the real table name or an alias.
  • select_type: shows whether the row is a simple or complex SELECT. If complex it can be: PRIMARY (Outermost SELECT), UNION, SUBQUERY, DERIVED, or UNCACHEABLE.
  • type: shows the "access strategy" type to find rows. The different access types ordered from the best  to worst:
    • system, or const: very fast because  the table has at most one matching row (For example a primary key used in the WHERE)
    • eq_ref: index lookup returning one value
    • ref: index access
    • fulltext:  FULLTEXT index.
    • ref_or_null:  like ref, but with an extra search for rows that contain NULL values.
    • range:   index scan to select the rows in given range
    • index:   index tree is scanned.
    • ALL: A full table scan
  • key: the index that MySQL decided to use, from the possible_keys, to look up rows.
  • rows: the number of rows MySQL estimates it must examine to execute the query.
  • ref: columns or constants which are compared to key to select rows .
  • Extra: additional information about how MySQL resolves the query. Watch out for Extra values of Using filesort and Using temporaryUsing index means information is retrieved from the table using only information in the index tree without having to do an additional seek to read the actual row. This strategy can be used when the query uses only columns that are part of a single index (Covering Index).

When do you get a full table scan?

  • No WHERE condition
  • No index on any field in WHERE condition
  • When your range returns a large number of rows, i.e. too many records in WHERE condition
  • When optimizer sees that the condition will return > ~20% of the records in a table, it will prefer a scan versus many seeks
  • Pre-MySQL 5, using OR in a WHERE clause
    • now fixed with an index merge, so the optimiser can use more than one index to satisfy a join condition
How do you know if a scan is used?
In the EXPLAIN output, the “type” for the table/set will be “ALL” or “index”.  “ALL” means a full table data record scan is performed.  “index” means a full index record scan.  Avoid this by ensuring indexes are on columns that are used in the WHERE, ON, and GROUP BY clauses.

Scans and seeks

A seek jumps into a random place (on disk or in memory) to fetch data. A scan will jump to the start of the data, and sequentially read (from either disk or memory) until the end of the data. With large amounts of data sequentially scanning through contiguous data on disk or in memory is faster than performing many random seek operations.  But scans can also be a sign of poor indexing, if an index contains  many rows with the same value it is not very useful.  MySQL keeps stats about the  uniqueness of values in an index in order to the estimated rows returned (rows in the explain output). If the rows is greater than a certain  % of rows in the table, then MySQL will do a scan.

Indexed columns and functions don't mix

Indexes can quickly find the rows that match a WHERE clause, however this works only if the index is NOT used in a function or expression in the WHERE clause. Below is an example where an index is used :

mysql> EXPLAIN SELECT \* FROM film WHERE title LIKE 'Tr%'\\G
\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\* 1. row \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
           id: 1
  select_type: SIMPLE
        table: film
type: range
possible_keys: idx_title
key: idx_title
      key_len: 767
          ref: NULL
         rows: 15
        Extra: Using where

In the example above, a  fast range "access strategy" is chosen by the optimizer, and the index scan on title is used to winnow the query results down.
Below is an example where an index can not be used :

mysql> EXPLAIN SELECT \* FROM film WHERE LEFT(title,2) = 'Tr' \\G
\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\* 1. row \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
           id: 1
  select_type: SIMPLE
        table: film
type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 951
        Extra: Using where

A slow full table scan (the ALL"access strategy") is used because a function (LEFT) is operating on the title column. Operating on an indexed column with a function (in this case the LEFT() function) means the optimizer cannot use the index to satisfy the query.  Typically, you can rewrite queries in order to not operate on an indexed column with a function.

Solving multiple problems in one query

Here is an example of improving a query:

SELECT \* FROM Orders WHERE TO_DAYS(CURRENT_DATE()) – TO_DAYS(order_created) <= 7;

First, we are operating on an indexed column (order_created) with a function TO_DAYS – let's fix that:

SELECT \* FROM Orders WHERE order_created >= CURRENT_DATE() - INTERVAL 7 DAY;

Although we rewrote the WHERE expression to remove the function on the index, we still have a non-deterministic function CURRENT_DATE() in the statement, which eliminates this query from being placed in the query cache. Any time a non-deterministic function is used in a SELECT statement, the query cache ignores the query.  In read-intensive applications, this can be a significant performance problem. – let's fix that:

SELECT \* FROM Orders WHERE order_created >= '2008-01-11' - INTERVAL 7 DAY;

We replaced the function with a constant (probably using our application programming language).  However, we are specifying SELECT \* instead of the actual fields we need from the table.  What if there is a TEXT field in Orders called order_memo that we don't need to see?  Well, having it included in the result means a larger result set which may not fit into the query cache and may force a disk-based temporary table. – let's fix that:

SELECT order_id, customer_id, order_total, order_created
FROM Orders WHERE order_created >= '2008-01-11' - INTERVAL 7 DAY;

A Few Things to consider for  JPA/Hibernate  devlopers:

Lazy loading and JPA

With JPA many-to-one and many-to-many relationships lazy load by default , meaning they will be loaded when the entity in the relationship is accessed. Lazy loading is usually good, but if you need to access all of the "many" objects in a relationship, it will cause n+1 selects where n is the number of  "many" objects.  You can change the relationship to be loaded eagerly as follows :
public class Employee{

    @OneToMany(mappedBy = "employee", fetch = FetchType.EAGER)
    private Collection<Address> addresses;

However you should be careful with eager loading which could cause SELECT statements that fetch too much data. It can cause a Cartesian product  if you eagerly load entities with several related collections. 

If you want to temporarily override the LAZY fetch type, you could use Fetch Join.  For example this query would eagerly load the employee addresses: 
@NamedQueries({ @NamedQuery(name="getItEarly", 
                 query="SELECT e FROM Employee e JOIN FETCH e.addresses")})

public class Employee{

Optimistic locking and JPA

The Java Persistence API 1.0 persistence providers support optimistic locking, where your application will get an exception whenever a transaction tries to commit an object that was updated in the database since this transaction began. To enable this for a entity, you need to add a version attribute:

public class Item {
    private int version;

In Conclusion

  • Understand the storage engines
  • Keep data types small and compact
  • Understand your query execution plans with the EXPLAIN output
  • Understand the scan vs. seek choice the optimizer must make
  • Don't mix functions and indexed columns


High Performance MySQL book
MySQL Pluggable Storage Engine Architecture
MySQL Storage Engine Architecture, Part 2: An In-Depth Look
Optimizing Queries with EXPLAIN
Java Persistence with Hibernate book
Jay Pipes blog
Colin Charles blog
mysql performance blog
Ronald Bradford blog
Taking JPA for a Test Drive
Pro EJB 3: Java Persistence API
Pro MySQL, Chapter 6: Benchmarking and Profiling



« July 2016