As a follow-up to The evolution of DB architectures in the quest for scalability, this article covers the different techniques that can be used for massive read and write scaling of OLTP databases. The traditional methods for write scaling all have serious downsides. Two alternative techniques are discussed. Ultimately, how you choose to scale your database is as much a business decision as a technical one. Just make sure you know all of the pros and cons.
OLTP systems need to be able to scale both reads and writes. Generally, scaling the reads is a lot easier than scaling the writes.
Scaling Reads for OLTP databases
The following techniques are applicable for most database management systems:
- Make sure that the appropriate indexes are being used:
- Doing less work is always faster. For highly concurrent OLTP workloads, a hash lookup for exact matches, or reading a few B+-tree nodes for range queries, will almost always beat a sequential scan over potentially billions of rows.
- If your database management system is relational and uses a cost-based optimizer, then make sure that your statistics are up to date.
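The effect of an appropriate index on the access path can be seen directly with a query plan. Here is a minimal sketch using SQLite as a stand-in for any relational DBMS; the table and column names are made up for illustration.

```python
# Sketch: how an index changes a query's access path, using SQLite
# as a stand-in for any relational DBMS. Table and column names are
# hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 1.5) for i in range(10_000)])

# Without an index on customer_id, this predicate forces a full scan.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchall()
print(plan_before)  # the access path mentions 'SCAN'

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
conn.execute("ANALYZE")  # refresh the optimizer statistics

plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchall()
print(plan_after)  # the access path now uses idx_orders_customer
```

The `ANALYZE` step illustrates the statistics point above: a cost-based optimizer can only pick the cheaper access path if its statistics reflect the current data.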
- Use faster disks:
- Most databases use a buffer pool to move data blocks to and from disk to enable the working set to reside in RAM. No matter how large the buffer pool is, you will still eventually need to do 'disk' reads or writes, so persistent devices with low latency and high bandwidth are desirable.
- SANs or arrays of 15K RPM SCSI disks used to be the solution, but now NVMe SSDs are the state of the art for low latency and high bandwidth storage devices.
- NVMe SSDs are constantly improving, but the Intel Optane SSD DC P4800X Series is a good example of a fast SSD.
- You can even stripe multiple SSDs for better throughput.
- Increase the buffer pool size:
- The bigger the buffer pool, the better the cache hit ratio. A database read that finds the data blocks in the buffer pool will tend to be much faster than requiring one or more disk reads to get those data blocks from disk.
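The relationship between buffer pool size and cache hit ratio can be sketched with a simple LRU cache over simulated block reads. The sizes and the skewed access pattern below are illustrative assumptions, not tied to any particular DBMS.

```python
# Sketch of why a larger buffer pool raises the cache hit ratio:
# an LRU cache over simulated data block reads. Sizes and access
# pattern are illustrative assumptions.
from collections import OrderedDict
import random

def hit_ratio(pool_size_blocks, accesses):
    pool = OrderedDict()                 # block_id -> block, in LRU order
    hits = 0
    for block_id in accesses:
        if block_id in pool:
            hits += 1
            pool.move_to_end(block_id)   # mark as most recently used
        else:
            pool[block_id] = object()    # 'disk read' into the pool
            if len(pool) > pool_size_blocks:
                pool.popitem(last=False) # evict the least recently used
    return hits / len(accesses)

random.seed(1)
# Skewed access pattern: a small hot working set, as in most OLTP workloads.
accesses = [random.randint(0, 99) if random.random() < 0.8
            else random.randint(100, 9999)
            for _ in range(50_000)]

small = hit_ratio(64, accesses)
large = hit_ratio(4096, accesses)
print(f"small pool: {small:.2%}, large pool: {large:.2%}")
```

Once the pool is large enough to hold the hot working set, most reads are satisfied from RAM and never touch the disk.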
- Make sure that you have enough CPU cores:
- If you do not have enough CPU cores, your database management system will not be able to handle the concurrency required for read scaling. Modern CPUs from companies like Intel, IBM and Oracle have many CPU cores per socket; for example, Intel (28 cores), IBM (24 cores) and Oracle (32 cores).
- The number of cores per socket is only half of the CPU story. The other half is how many CPU sockets your database server machine has. Multi-socket servers are available from companies like Oracle (8 sockets), Dell (4 sockets), IBM (4 sockets) and HPE (32 sockets).
- Make sure that your DB server is balanced:
- The weakest link will be the scaling bottleneck.
- Having lots of CPU cores but not enough RAM will limit read scaling.
- Having lots of RAM but not enough CPU cores will limit read scaling.
- Having lots of CPU but slow disks will limit read scaling.
- Having lots of CPU cores but a slow NIC will limit read scaling.
- The above are all examples of scaling up, i.e. using a bigger, balanced box. If even the biggest balanced box is still the bottleneck for read scaling, then you will need to scale out.
- Scale out by using read only replicas:
- Some database management systems support multiple read only replicas. For instance, MySQL, MariaDB, PostgreSQL and Oracle TimesTen all support configurations where there is one read/write DB and many read only replicas.
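In a one-primary, many-replica configuration, the application (or a proxy) decides which connection serves each statement. The sketch below shows one simple routing policy; the connection objects are placeholder strings, not any specific driver's API.

```python
# Sketch of application-side routing with one read/write primary and
# several read-only replicas. Connection objects are stand-in strings;
# class and method names are hypothetical.
import itertools

class ReplicatedPool:
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # round-robin the reads

    def connection_for(self, sql):
        # Reads can be spread over the replicas; writes (and anything
        # ambiguous) must go to the primary.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary

pool = ReplicatedPool("primary", ["replica-1", "replica-2", "replica-3"])
print(pool.connection_for("SELECT * FROM orders"))    # replica-1
print(pool.connection_for("INSERT INTO orders ..."))  # primary
print(pool.connection_for("SELECT 1"))                # replica-2
```

Note that replication lag is a real concern with this pattern: a read routed to a replica may not yet see a write that was just committed on the primary.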
- Use read caches:
- Another approach is to use a set of low latency read only caches for your database management system. These caches need to be kept sufficiently fresh to be useful. Read only caches are typically either refreshed based on time or changes.
- Some examples of low latency read only caches are Redis, memcached and Oracle Application Tier Database Cache.
- With Oracle Application Tier Database Cache, the DBA defines (on a table by table basis) the columns and rows of interest from an Oracle database and whether those rows should be a read only or a read/write cache. The DBA also defines how often those rows should be refreshed into the TimesTen tables.
- Use Database Sharding:
- Another approach to enable database read scaling is database sharding. The shards are usually existing database management systems [eg MySQL or PostgreSQL] and 'glue code' is needed to route requests to the relevant shards.
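The 'glue code' mentioned above is, at its core, a function from a key to a shard. Here is a minimal hash-based sketch; the shard handles are placeholder strings standing in for real MySQL or PostgreSQL connections.

```python
# Sketch of shard-routing 'glue code': stable hash-based routing over a
# fixed shard list. Shard handles are placeholders for real connections.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(key: str) -> str:
    # A stable hash, so the same key always lands on the same shard.
    digest = hashlib.sha256(key.encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

print(shard_for("customer:42"))
```

Real deployments are considerably harder than this sketch: rebalancing when shards are added or removed, and queries whose predicates do not include the shard key, both need extra machinery.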
- Combine the techniques:
- Theoretically there is nothing stopping you from having a balanced machine [CPU/RAM/disks/NICs] that is sharded, where each shard has read only replicas and there is a read cache in front of each shard. Each shard would also have the appropriate [local/global] indexes.
- While this approach is possible, you need to design and develop for it, and every component needs to 'do the right thing' during failure scenarios. With so many moving parts, monitoring and managing all of these components is non-trivial.
The following table shows the pros and cons of each read scaling technique:
Scaling Writes for OLTP databases
There are fewer useful techniques for write scaling, and they are harder to apply than those for read scaling:
- Adding indexes slows down writes and makes write scaling harder:
- OLTP systems usually have both reads and writes. You want the indexes for the reads, but they hurt the writes.
- Use faster disks:
- Faster disks do help with write scaling. NVMe SSDs are still the best technology for this.
- Usually you need fast sequential writes for write scaling.
- Usually you need fast random reads for read scaling.
- Large buffer pools do not help write scaling:
- You want to persist the committed transaction as soon as possible.
- Having more CPU cores does help write scaling.
- Having a balanced DB server does help write scaling.
- Database writes tend to be much more expensive than database reads:
- Durably committed transactions usually require multiple disk writes.
- Database reads may not require any disk reads as the data may be in the buffer cache.
- Rolling back a database operation usually requires multiple disk reads and writes.
- It is very easy for a database to bottleneck waiting for disk IOs given highly concurrent database writes.
- NVMe disks really help here, but they do not have infinite bandwidth.
- Using scale out read only replicas does not help write scaling:
- By definition, read only replicas cannot accept database writes.
- This means that writes can only occur on the active/master DB node.
- This means that the write scaling will be limited by the write bandwidth of a single database server.
- Read caches do not help write scaling.
- Database sharding does help write scaling:
- The database writes can be routed to the relevant shard.
- By having lots of database shards, the required write bandwidth can be spread over many database servers.
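That spreading effect is easy to see in a sketch: with a stable hash and enough distinct keys, each shard receives a roughly equal share of the writes, so aggregate write bandwidth grows with the shard count. The shard count and key format below are illustrative assumptions.

```python
# Sketch: hash routing spreads write traffic roughly evenly across
# shards, so aggregate write bandwidth scales with shard count.
# Shard count and key format are illustrative assumptions.
import hashlib
from collections import Counter

NUM_SHARDS = 8

def shard_for(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Simulate 80,000 writes to distinct keys and count per-shard load.
writes = Counter(shard_for(f"order:{i}") for i in range(80_000))
print(sorted(writes.values()))
# Each shard handles close to 80_000 / 8 = 10_000 of the writes.
```

The even split only holds when the shard key is well distributed; a hot key (one customer generating most of the traffic) would still overload a single shard.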
So in summary, to scale database writes, scale up as much as you can [afford] and then scale out.
Developing applications for a sharded database [eg manually sharded MySQL or PostgreSQL] is difficult. You need to have custom code to route reads and writes to the correct shard. Usually doing complex queries that span database shards is either difficult or not allowed. Unless you have expert DBAs and developers who know all of the ins and outs of designing, developing, operating and maintaining a sharded database system, do not consider this approach.
Is there a simpler approach for designing, developing, operating and maintaining a database system which scales both reads and writes?
I am aware of two approaches that make designing, developing, operating and maintaining a database system for massive read/write scaling simple:
- Using Oracle Exadata:
- Exadata uses a scale out, shared disk architecture based on Oracle Real Application Clusters and state of the art hardware.
- You choose the hardware size based on your needs [eg 1/8 rack up to multiple racks].
- Exadata uses NVMe storage, huge buffer pools, lots of CPU cores, InfiniBand networking and smart storage cells to minimize disk IO.
- Using a scale out RDBMS that automatically shards the database and is designed to make the scale out database transparent to the developers and applications:
These four databases [VoltDB, memSQL, NuoDB and Oracle TimesTen Scaleout] have many desirable properties:
- They are In-Memory RDBMS so reads and writes tend to be very fast
- They are RDBMS so you can use the full power of the SQL language
- You can read from any node to help read scaling
- You can write to any node to help write scaling
- Unlike sharded databases, you do not need expert developers or DBAs to design, develop, operate and maintain systems built for these RDBMS
Based on some benchmarks, we believe that Oracle TimesTen Scaleout is the World's Fastest OLTP Database.
There are many techniques to enable massive database read scaling. There are only a few techniques that enable massive database write scaling. You need expert developers and DBAs to design, develop, operate and maintain manually sharded databases. Oracle Exadata, VoltDB, memSQL, NuoDB and Oracle TimesTen Scaleout enable massive read and write scaling without requiring expert developers or DBAs.
If you need databases with massive read and write scaling, either hire expert developers & DBAs to design/develop/operate/maintain manually sharded databases and applications [forever], or choose from a set of RDBMS that make massive read/write scaling easy.
Disclaimer: these are my personal thoughts and do not represent Oracle's official viewpoint in any way, shape, or form.