By Ingo Strüwing on Sep 18, 2013
DisclaimerThe views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
IntroductionThere are database usage patterns, where tables grow big at times and many rows are deleted from them later. InnoDB does never shrink a table space. In these cases we can end up with big data files, which contain a lot of unused pages. It is a waste of disk- and I/O- resources to back them up.
Users have manifold requested that MySQL Enterprise Backup does not back up unused InnoDB data pages. Some want smaller backups, some want less I/O, some want shrinked table spaces.
MySQL Enterprise Backup 3.9.0 can help with smaller backups. The effect on I/O is not that remarkable. InnoDB data files must be expanded to their original size when they are restored. Backup cannot accomplish a shrinkage of InnoDB table spaces.
In the following I will try to explain, how things work, and why not all wishes can be satisfied. I will also try to show the complexity of the feature. This should clarify, why it comes so late. Because my explanations might be a bit technical, I'll try to summarize important facts in advance.
Administrative SummaryYou can use the command line option --skip-unused-pages with any backup operation, but it will be ignored and give a warning for the following cases:
- backup-and-apply-log. The skipped pages must be re-inserted at the beginning of the apply-log operation. So it would be a waste of resources to do this when backup and apply-log are combined in one operation.
- incremental-with-redo-log-only. This operation copies log pages only. No data pages are copied. So there is no data page to be skipped.
- incremental backup: This is simply not implemented in MEB 3.9.0.
At the beginning of the apply-log operation, the skipped pages must be re-inserted. This means that every backed-up InnoDB data file must be copied over with the skipped pages stuffed in. This means that the total apply-log operation will take at least as long as the backup operation took.
Be prepared that the backup directory grows up to the original database size. At the end of each file, the backup file will be removed, but right before that point, the file exists in the compact and expanded form. You need to have the space for one extra copy of your biggest table.
You can combine --compress and --skip-unused pages. However, in MEB 3.9.0, the uncompression and expansion operations are executed in separate steps. Since both operations copy over all data files, it will take about twice the time as the backup operation took.
In MEB 3.9.0, the single-step restore operation copy-back-and-apply-log cannot be used on backups that have been taken with --skip-unused-pages.
PrerequisiteThe InnoDB table spaces contain bitmap pages, which indicates pages as free or not free (in-use, that is). The bitmap pages occur every page_size pages in the tablespace. For uncompressed tablespaces, that would be on pages 0, 16384, 32768, 49152, 65536, and so on. For a compressed tablespace with a 2K page size, the bitmap would be on pages 0, 2048, 4096, and so on. The bitmap page always covers the following page_size pages.
NomenclatureIn the following, the set of pages, which are covered by a bitmap page, is called a "map-set".
In the following, the terms "free page" and "unused page" are used interchangeably.
In the following, the term "zero page" means an InnoDB page, which has all its bytes set to zero.
The "free limit" is an InnoDB internal term. It is a page number. It is less or equal to the table space size in pages. If the free limit is less than the table space size in pages, itself and all pages above it are free.
A LSN is a Log Sequence Number. It is the offset in bytes from the logical start of the redo log. The LSN marks the start of a log entry.
The general problemThe task sounds simple. Read the bitmaps. Identify unused pages. Omit them from backup. Replace them with empty pages on restore. All over.
It would be almost as easy as this if we would take backups from cleanly shut down databases. That is, if all data files would be in a consistent state and not be modified during the backup operation.
But if a backup is taken from a hot database, the data files are modified while the backup operation is reading them. Since we cannot read all pages at once, we likely read pages that are modified at different times. When we read a bitmap page and later the pages, which are covered by this bitmap, then they may not be consistent. The bitmap could declare some page as free, but while we read other pages in between, the page may no longer be free when we read it. And vice versa.
Another problem is the InnoDB data cache. Pages are written from the cache to disk at different times. The main constraint is, that a page is never written before all log entries, which describe its modifications, are on disk. Another constraint is that at the time, when an InnoDB checkpoint is noted in the redo log, all page modifications up to that log entry are on disk. Since we start copying the redo log from the latest checkpoint, each page on disk can have any state between that
checkpoint and the current state.
Sure, we have the redo log. We keep copying it in parallel to the data file copy. It should be able to replay all changes that are done on the data pages during the backup operation. But the replay algorithm requires each modified page to be in a certain state. That is, it expects that there are certain data in the page. Each redo log entry describes a transformation of the page contents from one state to another. If the page doesn't have the expected contents, the algorithm fails.
This means that we need to take care, what page contents to restore for pages that were marked free when the corresponding bitmap page was copied. The following shall show, which problems needed to be resolved.
Empty pagesAbove we said that we will replace skipped unused pages by empty pages. However, the term "empty page" does not mean a zero page. This won't work because of the following reasons.
The redo algorithm is an idempotent algorithm. The idem-potency is based on the page LSN. Every change to a data page is logged in a log entry of the redo log. That log entry's LSN is written into the page. When the redo log is applied to a database or backup, a log entry is applied to a page only if the page's LSN is lower than the log entry's LSN.
The redo algorithm does not use a pure physical logging. Most log entries do not set a certain number of bytes at a certain offset in a page, but transform a page from one state to another. In other words, the algorithm relies on the correct page contents.
We copied the redo log from the latest checkpoint onward. Usually this contains log entries from before the backup start. So it could happen that there were log entries, which modified a page, before it became unused and before the backup started.
If the unused pages had been recreated as zero pages, their LSN would be zero. Every log entry's LSN is greater than zero. So every log entry for that page would be tried to apply. But the page contents is not, what the log entry expected. The redo log algorithm would fail.
If the unused pages had been recreated as zero pages, but with their LSN set to infinite, no log entry would be applied to unused pages. The apply-log algorithm would finish without errors and warnings. But if there are log entries, which re-initialize and modify the page after the backup operation read the bitmap page, they would not be applied due to the high LSN. The result would be an inconsistent table space.
The correct solution is to use empty pages with the LSN of the bitmap page, which claims it to be an unused page. We know that the page was unused at that LSN. If a page gets freed or re-used, the bitmap is changed and that makes an redo log entry for the bitmap page. The bitmap page gets that LSN. When we copy the bitmap page, we get the bitmap and the corresponding LSN into the backup. Our algorithm assures that each page is consistent in itself. So we have assured knowledge, which pages were free at that LSN. The next higher LSN that affects an unused page must consequently be a log entry, which re-initializes the page. Any log entry with an LSN lower than the bitmap LSN is irrelevant to the pages that are marked free by that bitmap.
Besides the LSN, empty pages will also have the page number, which corresponds to their location in the table space, the table space id, and the checksums set. The remaining part of empty pages are all zero bytes.
Non-free zero pagesThe page number of a zero page is zero. Usually one would expect that pages with a page number of zero have never been used and are marked free (except of the page at offset zero - the table space header). But sometimes a page like this is not marked free.
One possible situation in which a page could have a zero page number and not be marked free could occur if a page did become used for the first time short before the backup started, and was not flushed to disk until backup read the page. When the page became used, the bitmap was updated and flushed and the page was updated in memory but not flushed yet. It is possible, yes even probable, that there are redo log entries in the log file, which manipulate the page. The corresponding log entries could have LSNs below the bitmap page's LSN if later more changes were done to the bitmap page. If we would skip the page, and thus effectively declare it as free, the expansion algorithm would insert a page with the bitmap page's LSN, which could be too high.
To be safe, we include non-free zero pages in a backup. They are rare and thus don't make a big difference.
Skip unused pages when reading from the original data filesIf we use the bitmaps to avoid reading of unused pages, we turn sequential reading into random reading. Depending on the distribution of unused pages among used pages, this could even drop the read performance. Only if unused pages occur in big contiguous chunks, skipping those could give a speed increase.
Since a bitmap page occurs every page_size pages, only page_size - 1 pages can be skipped in one go at best. Hence, the optimal performance enhancement cannot be reached. All bitmap pages are used pages and must be copied to the backup, even if they mark all "their" pages as free. After all, we could detect such case only after having read the bitmap page. Anyway, it will be rare cases that multiple bitmap pages in a table space mark all their pages free. In theory this could happen in a new table space, which was created way too big for the data. But then InnoDB maintains the free limit. Bitmap pages at and above the free limit are not initialized and don't need to be backed up.
At the beginning of each map-set, only the bitmap page needs to be read. Then unused pages can be identified in contiguous chunks. If a chunk is big enough, then that chunk can be skipped from reading.
It does not seem desirable to read in different chunk sizes. So the read algorithm is now designed so that the read size is the data buffer size. At the beginning of a map-set, a data buffer is always read. Skipping of read pages is done in multiples of the data buffer size. A buffer can only be skipped, if all its pages are unused. Since MEB does currently use a fixed buffer size of 16 MB, it contains at least 1024 pages, depending on the table's page size. The probability for 1024 contiguous pages to be free isn't that high. That's why we won't reduce the I/O load much. And so we haven't implemented this part yet.
Skip unused pages when writing to the backup filesOn the write side it is advantageous to suppress even single unused pages. It doesn't break sequential writes.
The algorithm is similar to incremental backup. From the read side we get a buffer, which could contain unused pages. For the write side we produce a buffer with only the used pages from the read buffer. For each page, we decide independently, if we copy it to the output buffer. Every bitmap page needs to be included.
Compressed files must not be empty. If the backup files are compressed before they are written, we must assure that they contain some contents. This is a requirement of our compression algorithm. Since we include all bitmap pages, this is not a problem for file-per-table table spaces, nor for the first file of the system table space. But it can happen if a follow-up data file of the system table space has all its pages free, and does not have a bitmap page below the free limit. To work around this problem, we do always include the first page of a file.
A map-set can cover pages from multiple buffers. We need to keep them in memory until all covered buffers are written.
RestoreOn restore, MEB has to recreate unused pages at the right places. The reason for this is explained below in the section "Backup cannot shrink table spaces". Since the contents of the inserted pages do not matter, except of page number, space id, LSN, and checksums, empty pages are written.
In the following, the algorithm to recreate the skipped pages is called "expansion".
Please note that the expansion must take place before an apply-log operation. The apply-log algorithm works on a data file where all pages are at their correct places. Apply-log can modify initially unused pages. Hence those must also be present at the right places in the data file, and have the right LSN.
In MEB 3.9.0, a sequential algorithm is used, which detects skipped pages by a mismatch of a page number and the current write position in the expanded file. Empty pages are inserted until the page number matches the current write position.
If the last page from the backup file is below the free limit from the table space header, empty pages are appended up to the free limit. If the resulting page is below the table space size from the table space header, zero pages are appended up to the table space size.
Backup cannot shrink table spacesWhat users really want, is a feature to remove unused pages from a table space, and make the data file(s) smaller.
MEB cannot help with it. Every InnoDB page has a page number, which corresponds to its position in a data file. The InnoDB tablespaces contain tree structures, where a page can reference one or more other pages. These references are done by page numbers. Hence it is vital that every page retains its position in a data file. If there are unused pages among used pages, the used pages cannot be simply shifted down in a file to take the place of an unused page.
If one wants to change a position of a page, one must assign it a new page number (the one that corresponds to its new place) and modify the page number in all places that reference the page. This means that a bunch of random-access page-read and page-write operations can be necessary for each shifted page.
If a table has a single-file table space (innodb-file-per-table=ON) then an OPTIMIZE TABLE statement would create a new tablespaace with freshly constructed trees and thus take the minimum amount of space. Unfortunately that operation can put a too big burden on the database. With MEB, one could think of a workaround. Do a backup of the table, restore it to a temporary place, run a server on it, OPTIMIZE TABLE, and transport the resulting tablespace to the original server.