Thursday May 30, 2013

How big is a database?

I got a question a while ago that I thought was quite simple, but turned into an interesting discussion: How much hard disk space does a database take up?

As it happens, there's a simple answer and a much, much more involved (yet ill-defined) answer, and which one you choose depends on what you think goes into a database and how very, very pedantic you are.

The Simple Answer

You can query the size of table and index data using Information Schema or SHOW TABLE STATUS. The columns DATA_LENGTH and INDEX_LENGTH contain the number of bytes for the table and index data respectively. You can construct simple queries based on that data, or do a quick search on the web for more interesting queries (For example, Peter Zaitsev of Percona wrote a post on Researching your MySQL Table Sizes). Adding up the totals for all tables in a database gives you a figure that shows how big the data rows and index contents for the database are.

This really is where you should stop.

Let's Get Slightly Pedantic: Taking the File System's Perspective

In MySQL, databases are little more than file folders, so they don't consume space themselves. The space is consumed by the data within, specifically tables and indexes: Views, stored routines, triggers are little more than command text and so take a trivial amount of space, so for now we can ignore them. 

In MySQL 5.6, table and index data are stored in individual files in the appropriate directory. So, in my /var/lib/mysql/world_innodb folder, I've got City.ibd, Country.ibd, and CountryLanguage.ibd. To check the size of the database, I can simply read the size of that folder, which gives me the approximate size of the table and index data, along with any extra space allocated. The extra space comes from the fact that the file size is typically bigger than the amount of data stored within it, because InnoDB increases the file size by 64MB (8MB in 5.5) every time it fills up. You can change this value by setting the innodb_autoextend_increment option.

In MySQL 5.1 and 5.5, the default InnoDB configuration stores all InnoDB information in a single tablespace with a filename of (typically) /var/lib/mysql/ibdata1. By default, this file contains all table data and index information for all InnoDB tables in all databases, which means you cannot simply calculate the space taken on disk by the data of any single database. Although all InnoDB data defaults to a single tablespace file, you can change this by setting innodb_file_per_table, which is enabled by default in 5.6. By doing so, you store table and index data for each database in a file named for the table, in the appropriate database directory.

Let's Get Really Pedantic: Going Beyond the File System 

The simple answer at the top really is quite simple: It only considers data, so if there is no data, there are no rows taking up space. This means that regardless of how many functions, procedures, triggers, views, tables, permissions, or anything else you've configured in your (otherwise empty) database, MySQL considers that it is empty, and therefore takes up 0 space. Of course, if you're quite pedantic about it, you'll know better.

What Else is in a Database?

MySQL does not allocate space per database, but per table and any indexes associated with that table. If every table in a database contain no data, then MySQL considers that the database contains no data. Queries on DATA_LENGTH and INDEX_LENGTH show information about data (as opposed to metadata), so for the purpose of that query, a database with no data takes up 0 space. This might be a quite simplistic way of looking at it, so let's consider the difference between this viewpoint—a database with no data takes up no space—and another—a database with objects, even empty ones, has to exist, and that must take up some space.

As well as table data and indexes, the other information MySQL maintains about a database is its metadata. Some of this metadata is contained within:

  • The file system: as a database directory as db.opt, .frm, and .trg files
  • The mysql database: stored routines, events, privileges, and other database metadata; these are stored in MyISAM tables, each of which contains information for all databases on the server
As the metadata is distributed in multiple locations, MySQL does not have a single location that you could point to and say "This is the database," so for the purposes of explanation, let's drill into my local copy of the sakila database.

Metadata on the File System

As an administrator, I can see that the structure of my sakila database takes up space in the form of the following file system artefacts:

  •     The database directory inode takes up 4KB on my machine.
  •     The db.opt file takes up 65 bytes (for my sakila database)
  •     Triggers take just over 1KB each (TRG and TRN files)
  •     Table metadata (.frm files) are about 8KB each
  •     View metadata (.frm files) depend on the complexity of the view, but let's say 1KB each

On my system, these add up to 336KB for my slightly-extended copy of the sakila database.

Metadata in the mysql Database

There is also row data in mysql relevant to the sakila database:

  •     The mysql.proc table contains 6 rows for sakila; with an average row length of 1116 bytes—from SHOW TABLE STATUS LIKE 'sakila';—that gives approximately 6KB.
  •     The mysql.db table on my system has two rows on my system (I've set up a couple of users), at ~440 bytes each
  •     The mysql.tables_priv table has one row at 851 bytes (unrepresentative, but included for detail)
  •     I have no mysql.event, mysql.columns_priv, or mysql.procs_priv rows relevant to sakila at this time (again, unrepresentative, but included for detail).

Have We Considered Everything Yet?

This isn't even the whole story; I haven't included mysql.user records for users who only use this database, or log file entries relevant to the sakila database. You can take this as a sign that what constitutes a database is actually quite a fuzzy thing.

So, taking all of that together, you might consider that the metadata for my sakila database takes up ~344KB, but it should be apparent that it is quite a difficult task to define what is and isn't sakila metadata, and there is certainly no built-in way to arrive at this figure in a comprehensive and consistent way that would satisfy the most pedantic administrator.

Best to just consider the data, right?

Friday May 10, 2013

Circular Replication in MySQL

Replication is a hot topic in MySQL 5.6, and for good reason: There are many excellent features that make it a strong well-supported feature, from the new Global Transaction Identifiers (GTIDs), to simplified replication configuration and automated failover using MySQL Utilities (now available in alpha as a separate download).

Four servers configured in a circular replication topology
Circular Replication

The simplest topology consists of a master server that accepts changes, and slaves that replicate those changes from the master. A common requirement is for a network to have multiple servers that accept changes and replicate to each other. This is possible by means of circular replication, where each master is also the slave of another master, in a circular fashion. However, this configuration is prone to certain problems.

Asynchronous Replication

Firstly, you have to know a bit about how replication works. MySQL replication is asynchronous, which means each server executes operations without waiting for another server to replicate them. It does this by logging every event and subsequently transmitting these events to connected slaves. This works perfectly—without conflicts—when you have a single master that accepts changes from client applications. That master can have any number of slaves that can execute queries for applications, but don't change the contents of the database.

Circular Replication 

Circular replication enforces replication from only one other master, so MySQL avoids some sorts of conflicts such as time-order conflicts.

A time-order conflict occurs when masters can replicate directly from two or more sources, and conflicting event pairs replicate to masters in a different order. This problem cannot occur in MySQL replication.

Circular replication introduces the possibility that two or more servers can replicate concurrent changes to each other. It is therefore is subject to conflicts, because it is possible that two servers update the same row at the same (or nearly the same) time and that each subsequently replicates its changes to the other server. If the replicated changes apply to the same row, then that row ends up with a different value on each server.

MySQL does not perform conflict resolution in such situations, so you have to make sure your application caters for that possibility.


Conflicts in circular replication occur when your application allows updating the same data (rows and related rows) on different servers. If you do not take care to avoid conflicts, you risk a situation where two servers accept conflicting changes at nearly the same time, and replicate them to the other side.

For example, imagine a product that costs $520:

  • The "promotions" team updates it on the sales server by subtracting $50
  • At the same time, the "brands" team on the management server increases its price by 20%

Operation  Sales server  Management server
(0) Initial price  $520  $520
(1) Promotions update (subtract $50)  $470
(2) Brands update (increase by 20%)  $624
(3) Promotions update replicates  $574
(4) Brands update replicates  $564
(5) Final price $564  $574

After each of the changes replicates to the other server—after being performed on its local server—the price on the sales server ends up being $564, and that on the management server (for the same product) is $574. There are no errors generated, because each server executes the statements in the order it receives them (either from a connected client application or a replicating master), which, due to the nature of asynchronous replication, is not always the same order on each server.

Further Examples 

The preceding example describes a simple situation that could occur when you use statement-based replication. If you use row-based replication, you are not immune either, because there are several ways in which operations can conflict. Here are some further examples of conflicts that can occur whether you use statement- or row-based replication:

  • One server deletes several rows based on a WHERE clause that should include a row that has just changed on the other server - the UPDATE propagates in one direction and the DELETE in the other, resulting in a row that exists on one server but not the other.
  • One server updates a set of records at the same time as the other server updates an overlapping set
  • Two servers generate an identical report at exactly 18:00 each night, but one server executes an update statement immediately before the report, that replicates to the other server just after it generates its report.
In short, a conflict is likely to occur at some point if you allow changes to common data on two or more servers.

Avoiding Conflicts

The simplest way to avoid conflicts in a circular replication topology is to ensure that each master server can only update rows that no other master server changes. For example:

  • One master server updates rows within the sales table but not the products table, and the other updates products but not sales.
  • The stock table can be updated by one master during the day, when the warehouse is packing and dispatching; another updates overnight when deliveries arrive.
  • Masters can update each order based on the order's status. One master updates orders that have not yet shipped, another updates only shipped orders. Care must be taken that the order status changes in a controlled way, to avoid conflicts. For example, only the master with write access at that time (based on order status) can change its status; other masters use the order status to decide if they have write access at that time.

If you need to improve the performance of your replicated network, you can then scale out each master to multiple slaves to allow for greater bandwidth when running read-only queries. In practice, this means that you can load-balance queries across the slaves, and dedicate the master to handling writes for its tables.

Of course, to avoid conflicts all together, you should avoid circular replication and ensure that only one master accepts writes. But where would be the fun in that?


Jeremy Smyth writes MySQL training courses, and likes exploring interesting questions that come up from novices and experts alike.



« May 2013 »