MDEV-37949: Implement innodb_log_archive, innodb_log_recovery_start, innodb_log_recovery_target, … by dr-m · Pull Request #4405 · MariaDB/server

dr-m · 2025-10-28T13:33:04Z

The Jira issue number for this PR is: MDEV-37949

Description

MDEV-37949: Implement `innodb_log_archive`

The InnoDB write-ahead log file in the old innodb_log_archive=OFF format is named ib_logfile0, pre-allocated to innodb_log_file_size and written as a ring buffer. This is good for write performance and space management, but unsuitable for arbitrary point-in-time recovery or for facilitating efficient incremental backup.

innodb_log_archive=ON: A new format where InnoDB will create and preallocate files ib_%016x.log, instead of writing a circular file ib_logfile0. Each file will be pre-allocated to innodb_log_file_size (between 4M and 4G; we impose a stricter upper limit of 4 GiB for innodb_log_archive=ON). Once a log fills up, we will create and pre-allocate another log file, to which log records will be written. Upon the completion of the first log checkpoint in a recently created log file, the old log file will be marked read-only, signaling that there will be no further writes to that file, and that the file may safely be moved to long-term storage.

The file name includes the log sequence number (LSN) at file offset 12288 (log_t::START_OFFSET). Limiting the file size to 4 GiB allows us to identify each checkpoint by storing a 32-bit big-endian offset into the optional FILE_MODIFY and the mandatory FILE_CHECKPOINT records, between 12288 and the end of the file.
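As a rough illustration of the naming and slot encoding described above, the following sketch formats an archive file name from its first LSN and stores a checkpoint offset as four big-endian bytes. The helper names `archive_file_name` and `write_be32` are hypothetical and are not taken from the patch; the only values borrowed from the description are the `ib_%016x.log` pattern and the constant 12288.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// From the description: log_t::START_OFFSET is 12288.
constexpr uint32_t START_OFFSET = 12288;

// Hypothetical helper: the LSN found at file offset START_OFFSET,
// printed as 16 hexadecimal digits, gives the ib_%016x.log name.
std::string archive_file_name(uint64_t first_lsn)
{
  char name[32];
  std::snprintf(name, sizeof name, "ib_%016" PRIx64 ".log", first_lsn);
  return name;
}

// Hypothetical helper: store a checkpoint offset in a 4-byte header slot,
// most significant byte first (big-endian).
void write_be32(unsigned char *slot, uint32_t offset)
{
  // Valid checkpoint offsets start at the end of the header.
  assert(offset >= START_OFFSET);
  slot[0] = static_cast<unsigned char>(offset >> 24);
  slot[1] = static_cast<unsigned char>(offset >> 16);
  slot[2] = static_cast<unsigned char>(offset >> 8);
  slot[3] = static_cast<unsigned char>(offset);
}
```

Under these assumptions, a log whose first LSN is 12288 would be named ib_0000000000003000.log, because 12288 is 0x3000.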

The innodb_encrypt_log format is identified by storing the encryption information at the start of the log file. The first 32-bit value will be 1, which is an invalid checkpoint offset. Each innodb_log_archive=ON log must use the same encryption parameters. Changing innodb_encrypt_log or related parameters is only possible by setting innodb_log_archive=OFF and restarting the server, which will permanently lose the history of the archived log.

The maximum number of log checkpoints that the innodb_log_archive=ON file header can represent is limited to 12288/4=3072 when using innodb_encrypt_log=OFF. If we run out of slots in a log file, each subsequently completed checkpoint in that log file will overwrite the last slot in the checkpoint header, until we switch to the next log.
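The slot arithmetic above can be sketched as follows. This is only an illustration of the "overwrite the last slot" rule, not code from the patch; `checkpoint_slot` is a hypothetical name, and the `slots` parameter exists only because the text says the limit of 3072 applies to innodb_encrypt_log=OFF (the encrypted header has fewer slots, and their number is not stated here).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

constexpr std::size_t START_OFFSET = 12288;  // header size, from the description
constexpr std::size_t SLOT_SIZE = 4;         // one 32-bit offset per checkpoint
constexpr std::size_t MAX_SLOTS = START_OFFSET / SLOT_SIZE;
static_assert(MAX_SLOTS == 3072, "12288/4 checkpoint slots");

// Hypothetical helper: which header slot records the n-th checkpoint
// (0-based) completed in the current log file. Once the slots run out,
// every further checkpoint lands in the last slot.
std::size_t checkpoint_slot(std::size_t n, std::size_t slots = MAX_SLOTS)
{
  assert(slots > 0 && slots <= MAX_SLOTS);
  return std::min(n, slots - 1);
}
```

With the default of 3072 slots, checkpoints 0 through 3071 each get their own slot, and checkpoint 3072 and all later ones reuse slot 3071 until the next log file is created.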

innodb_log_recovery_start: The checkpoint LSN to start recovery from. This is useful when recovering from an archived log, for example when restoring an incremental backup (applying InnoDB log files that were copied since the previous restore).

innodb_log_recovery_target: The requested LSN to end recovery at. When this is set, all persistent InnoDB tables will be read-only, and no writes to the log are allowed. The intended purpose of this setting is to prepare an incremental backup, as well as to allow data retrieval as of a particular logical point in time.

Setting innodb_log_recovery_target>0 is much like setting innodb_read_only=ON, with the following exceptions: the data files may be written to by crash recovery, locking reads will conflict with any incomplete transactions as necessary, and all transaction isolation levels will work normally (not hard-wired to READ UNCOMMITTED).

The status variable innodb_lsn_archived will reflect the LSN from which a complete InnoDB log archive is available. Its initial value will be that of the new parameter innodb_log_archive_start. If that variable is 0 (the default), innodb_lsn_archived will be recovered from the available log files. If innodb_log_archive=OFF, innodb_lsn_archived will be adjusted to the latest checkpoint every time a log checkpoint is executed. If innodb_log_archive=ON, the value should not change.
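The update rule for innodb_lsn_archived can be modelled in a few lines. This is a toy model under the assumptions stated in the paragraph above, with hypothetical names (`ArchivedLsn`, `checkpoint_completed`); it does not reflect how log_sys stores or protects the value.

```cpp
#include <cassert>
#include <cstdint>

using lsn_t = uint64_t;

// Toy model of the innodb_lsn_archived status variable (hypothetical type).
struct ArchivedLsn
{
  bool archive_on;  // innodb_log_archive
  lsn_t value;      // innodb_lsn_archived

  // archive_start: innodb_log_archive_start (0 means "not specified").
  // recovered: the value derived from the log files found at startup.
  ArchivedLsn(bool on, lsn_t archive_start, lsn_t recovered)
    : archive_on(on), value(archive_start ? archive_start : recovered) {}

  // Called whenever a log checkpoint has been completed.
  void checkpoint_completed(lsn_t checkpoint_lsn)
  {
    // With the circular ib_logfile0, log older than the checkpoint may be
    // overwritten, so the archive can only start at the latest checkpoint.
    if (!archive_on)
      value = checkpoint_lsn;
    // With innodb_log_archive=ON the history is retained; no change.
  }
};
```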

Statements like

SET GLOBAL innodb_log_archive=!@@GLOBAL.innodb_log_archive;

will take effect as soon as possible, possibly after a log checkpoint has been completed. The log file will be renamed between ib_logfile0 and ib_%016x.log as appropriate.

When innodb_log_archive=ON, the setting SET GLOBAL innodb_log_file_size will affect subsequently created log files when the file that is being currently written is running out. If we are switching log files exactly at the same time, then a somewhat misleading error message "innodb_log_file_size change is already in progress" will be issued.

When innodb_read_only=ON, innodb_log_recovery_target will be set to the current LSN. This ensures that it suffices to check only one of these variables when blocking writes to persistent tables.

See the commit message for a more detailed description of the changed data structures and functions.

Release Notes

See the previous section.

How can this PR be tested?

mysql-test/mtr --parallel=auto --force --big-test --mysqld=--loose-innodb-log-archive --skip-test=mariabackup
mysql-test/mtr --parallel=auto --force --big-test --mysqld=--loose-innodb-log-archive --mysqld=--loose-innodb-log-recovery-start=12288 --skip-test=mariabackup
mysql-test/mtr --parallel=auto --force --big-test --mysqld=--loose-innodb-log-archive --mysqld=--loose-innodb-log-recovery-start=12288 --mysqld=--loose-innodb-log-file-mmap=OFF --skip-test=mariabackup
mysql-test/mtr --parallel=auto --force --big-test --mysqld=--loose-innodb-log-archive --mysqld=--loose-innodb-log-recovery-start=12288 --mysqld=--loose-innodb-log-file-mmap=ON --skip-test=mariabackup

The mariabackup suite must be skipped when setting innodb_log_archive=ON, because the mariadb-backup tool will only support the old innodb_log_archive=OFF format (copying from ib_logfile0).

Unfortunately, all --suite=encryption tests that use innodb_encrypt_log must be skipped when using innodb_log_archive. This is because the server would have to be reinitialized; we do not allow changing the format of an archived log on startup (such as adding or removing encryption). This combination is covered by the test innodb.log_file_size_online,encrypted.

no_checkpoint_prepare.inc: A new file, to prepare for subsequent inclusion of no_checkpoint_end.inc. We will invoke the server to parse the log and to determine the latest checkpoint.

A number of tests would fail when the parameter innodb_log_recovery_start=12288 is present, which forces recovery to start from the beginning of the history (the database creation). These tests have been adjusted with an explicit --innodb-log-recovery-start=0 to override that:

Some injected corruption may be "healed" by replaying the log from the beginning. Some tests expect an empty buffer pool after a restart, with no page I/O due to crash recovery.
Any test that sets innodb_read_only=ON would fail with an error message that the setting prevents crash recovery, unless innodb_log_recovery_start=0.
Any test that changes innodb_undo_tablespaces would fail in crash recovery, because crash recovery assumes that the undo tablespace ID that is available from the undo* files corresponds with the start of the log. This is an unfortunate design bug which we cannot fix easily.

Basing the PR against the correct MariaDB version

This is a new feature or a refactoring, and the PR is based against the main branch.
This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

This is a new feature, but for now based on the 11.4 branch so that any unrelated errors that may be found during testing can be fixed rather quickly. Merges to the main branch may be blocked for weeks at a time.

PR quality check

I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

CLAassistant · 2025-10-28T13:33:17Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

The InnoDB write-ahead log file in the old innodb_log_archive=OFF format is named ib_logfile0, pre-allocated to innodb_log_file_size and written as a ring buffer. This is good for write performance and space management, but unsuitable for arbitrary point-in-time recovery or for facilitating efficient incremental backup.

innodb_log_archive=ON: A new format where InnoDB will create and preallocate files ib_%016x.log, instead of writing a circular file ib_logfile0. Each file will be pre-allocated to innodb_log_file_size (between 4M and 4G; we impose a stricter upper limit of 4 GiB for innodb_log_archive=ON). Once a log fills up, we will create and pre-allocate another log file, to which log records will be written. Upon the completion of the first log checkpoint in a recently created log file, the old log file will be marked read-only, signaling that there will be no further writes to that file, and that the file may safely be moved to long-term storage.

The file name includes the log sequence number (LSN) at file offset 12288 (log_t::START_OFFSET). Limiting the file size to 4 GiB allows us to identify each checkpoint by storing a 32-bit big-endian offset into the optional FILE_MODIFY and the mandatory FILE_CHECKPOINT records, between 12288 and the end of the file.

The innodb_encrypt_log format is identified by storing the encryption information at the start of the log file. The first 32-bit value will be 1, which is an invalid checkpoint offset. Each innodb_log_archive=ON log must use the same encryption parameters. Changing innodb_encrypt_log or related parameters is only possible by setting innodb_log_archive=OFF and restarting the server, which will permanently lose the history of the archived log.

The maximum number of log checkpoints that the innodb_log_archive=ON file header can represent is limited to 12288/4=3072 when using innodb_encrypt_log=OFF.
If we run out of slots in a log file, each subsequently completed checkpoint in that log file will overwrite the last slot in the checkpoint header, until we switch to the next log.

innodb_log_recovery_start: The checkpoint LSN to start recovery from. This will be useful when recovering from an archived log. This is useful for restoring an incremental backup (applying InnoDB log files that were copied since the previous restore).

innodb_log_recovery_target: The requested LSN to end recovery at. When this is set, all persistent InnoDB tables will be read-only, and no writes to the log are allowed. The intended purpose of this setting is to prepare an incremental backup, as well as to allow data retrieval as of a particular logical point of time.

Setting innodb_log_recovery_target>0 is much like setting innodb_read_only=ON, with the exception that the data files may be written to by crash recovery, and locking reads will conflict with any incomplete transactions as necessary, and all transaction isolation levels will work normally (not hard-wired to READ UNCOMMITTED).

srv_read_only_mode: When this is set (innodb_read_only=ON), also recv_sys.rpo (innodb_log_recovery_target) will be set to the current LSN. This ensures that it will suffice to check only one of these variables when blocking writes to persistent tables.

The status variable innodb_lsn_archived will reflect the LSN since when a complete InnoDB log archive is available. Its initial value will be that of the new parameter innodb_log_archive_start. If that variable is 0 (the default), the innodb_lsn_archived will be recovered from the available log files. If innodb_log_archive=OFF, innodb_lsn_archived will be adjusted to the latest checkpoint every time a log checkpoint is executed. If innodb_log_archive=ON, the value should not change.

SET GLOBAL innodb_log_archive=!@@GLOBAL.innodb_log_archive will take effect as soon as possible, possibly after a log checkpoint has been completed.
The log file will be renamed between ib_logfile0 and ib_%016x.log as appropriate.

When innodb_log_archive=ON, the setting SET GLOBAL innodb_log_file_size will affect subsequently created log files when the file that is being currently written is running out. If we are switching log files exactly at the same time, then a somewhat misleading error message "innodb_log_file_size change is already in progress" will be issued.

no_checkpoint_prepare.inc: A new file, to prepare for subsequent inclusion of no_checkpoint_end.inc. We will invoke the server to parse the log and to determine the latest checkpoint.

All --suite=encryption tests that use innodb_encrypt_log will be skipped for innodb_log_archive=ON, because enabling or disabling encryption on the log is not possible without temporarily setting innodb_log_archive=OFF and restarting the server.

The idea is to add the following arguments to an invocation of mysql-test/mtr:

--mysqld=--loose-innodb-log-archive \
--mysqld=--loose-innodb-log-recovery-start=12288 \
--mysqld=--loose-innodb-log-file-mmap=OFF \
--skip-test=mariabackup

Alternatively, specify --mysqld=--loose-innodb-log-file-mmap=ON to cover both code paths.

The mariabackup test suite must be skipped when using the innodb_log_archive=ON format, because mariadb-backup will only support the old ib_logfile0 format (innodb_log_archive=OFF).

A number of tests would fail when the parameter innodb_log_recovery_start=12288 is present, which is forcing recovery to start from the beginning of the history (the database creation). The affected tests have been adjusted with explicit --innodb-log-recovery-start=0 to override that:

(0) Some injected corruption may be "healed" by replaying the log from the beginning. Some tests expect an empty buffer pool after a restart, with no page I/O due to crash recovery.
(1) Any test that sets innodb_read_only=ON would fail with an error message that the setting prevents crash recovery, unless innodb_log_recovery_start=0.
(2) Any test that changes innodb_undo_tablespaces would fail in crash recovery, because crash recovery assumes that the undo tablespace ID that is available from the undo* files corresponds with the start of the log. This is an unfortunate design bug which we cannot fix easily.

log_sys.first_lsn: The start of the current log file, to be consulted in log_t::write_checkpoint() when renaming files.
log_sys.archived_lsn: New field: The value of innodb_lsn_archived.
log_sys.end_lsn: New field: The log_sys.get_lsn() when the latest checkpoint was initiated. That is, the start LSN of a possibly empty sequence of FILE_MODIFY records followed by FILE_CHECKPOINT.
log_sys.resize_target: The value of innodb_log_file_size that will be used for creating the next archive log file once the current file (of log_sys.file_size) fills up.
log_sys.archive: New field: The value of innodb_log_archive.
log_sys.next_checkpoint_no: Widen to uint16_t. There may be up to 12288/4=3072 checkpoints in the header.
log_sys.log: If innodb_log_archive=ON, this file handle will be kept open also in the PMEM code path.
log_sys.resize_log: If innodb_log_archive=ON, we may have two log files open both during normal operation and when parsing the log. This will store the other handle (old or new file).
log_sys.resize_buf: In the memory-mapped code path, this will point to the file resize_log when innodb_log_archive=ON.
recv_sys.log_archive: All innodb_log_archive=ON files that will be considered in recovery.
recv_sys.was_archive: A flag indicating that an innodb_log_archive=ON file is in innodb_log_archive=OFF format.
recv_sys_t::find_checkpoint(): Find and remember the checkpoint position in the last file when innodb_log_recovery_start points to an older file. When innodb_log_file_mmap=OFF, restore log_sys.checkpoint_buf from the latest log file. If the last archive log file is actually in innodb_log_archive=OFF format despite being named ib_%016x.log, try to recover it in that format.
log_sys.is_pmem, log_t::is_mmap_writeable(): A new predicate.
log_t::archive_new_write(): Create and allocate a new log file, and write the outstanding data to both the current and the new file, or only to the new file, until write_checkpoint() completes the first checkpoint in the new file.
log_t::archived_mmap_switch_prepare(): Create and memory-map a new log file, and update file_size to resize_target. Remember the file handle of the current log in resize_log, so that write_checkpoint() will be able to make it read-only.
log_t::archived_mmap_switch_complete(): Switch to the buffer that was created in archived_mmap_switch_prepare().
log_t::write_checkpoint(): Allow an old checkpoint to be completed in the old log file even after a new one has been created. If we are writing the first checkpoint in a new log file, we will mark the old log file read-only. We will also update log_sys.first_lsn unless it was already updated in the ARCHIVED_MMAP code path. In that code path, there is the special case where log_sys.resize_buf == nullptr and log_sys.checkpoint_buf points to log_sys.resize_log (the old log file that is about to be made read-only). In this case, log_sys.first_lsn will already point to the start of the current log_sys.log, even though the switch has not been fully completed yet.
log_t::header_rewrite(my_bool): Rewrite the log file header before or after renaming the log file. The recovery of the last ib_%016x.log file must also tolerate the ib_logfile0 format.
log_t::set_archive(my_bool,THD): Implement SET GLOBAL innodb_log_archive. An error will be returned if non-archived SET GLOBAL innodb_log_file_size (log file resizing) is in progress. Wait for checkpoint if necessary. The current log file will be renamed to either ib_logfile0 or ib_%016x.log, as appropriate.
log_t::archive_rename(): Rename an archived log to ib_logfile0 on recovery in case there had been a crash during set_archive().
log_t::archive_set_size(): A new function, to ensure that log_sys.resize_target is set on startup.
log_checkpoint_low(): Do not prevent a checkpoint at the start of a file. We want the first innodb_log_archive=ON file to start with a checkpoint.
log_t::create(lsn_t): Initialize last_checkpoint_lsn. Initialize the log header as specified by log_sys.archive (innodb_log_archive).
log_write_buf(): Add the parameter max_length, the file wrap limit.
log_write_up_to(), mtr_t::commit_log_release<bool mmap=true>(): If we are switching log files, invoke buf_flush_ahead(lsn, true) to ensure that a log checkpoint will be completed in the new file.
mtr_t::finish_writer(): Specialize for innodb_log_archive=ON.
log_t::append_prepare<log_t::ARCHIVED_MMAP>(): Special case.
log_t::get_path(): Get the name of the current log file.
log_t::get_circular_path(size_t): Get the path name of a circular file. Replaces get_log_file_path().
log_t::get_archive_path(lsn_t): Return a name of an archived log file.
log_t::get_next_archive_path(): Return the name of the next archived log.
log_t::append_archive_name(): Append the archive log file name to a path string.
mtr_t::finish_writer(): Invoke log_close() only if innodb_log_archive=OFF. With innodb_log_archive=ON, we only force log checkpoints after creating a new archive file, to ensure that the first checkpoint will be written as soon as possible.
log_t::checkpoint_margin(): Replaces log_checkpoint_margin(). If a new archived log file has been created, wait for the first checkpoint in that file.
srv_log_rebuild_if_needed(): Never rebuild if innodb_log_archive=ON. The setting innodb_log_file_size will affect the creation of subsequent log files. The parameter innodb_encrypt_log cannot be changed while the log is in the innodb_log_archive=ON format.
log_t::attach(), log_mmap(): Add the parameter log_access, to distinguish memory-mapped or read-only access.
log_t::attach(): When disabling innodb_log_file_mmap, read checkpoint_buf from the last innodb_log_archive=ON file.
log_t::clear_mmap(): Clear the tail of the checkpoint buffer if is_mmap_writeable().
log_t::set_recovered(): Invoke clear_mmap(), and restore the log buffer to the correct position.
recv_sys_t::apply(): Let log_t::clear_mmap() enable log writes.
recv_sys_t::find_checkpoint(): If the circular ib_logfile0 is missing, determine the oldest archived log file with contiguous LSN. If innodb_log_archive=ON, refuse to start if ib_logfile0 exists. Open non-last archived log files in read-only mode.
recv_sys_t::find_checkpoint_archived(): Validate each checkpoint in the current file header, and by default aim to recover from the last valid one. Terminate the search if the last validated checkpoint spanned two files. If innodb_log_recovery_start has been specified, attempt to validate it even if there is no such information stored in the checkpoint header.
log_parse_file(): Do not invoke fil_name_process() during recv_sys_t::find_checkpoint_archived(), when we tolerate FILE_MODIFY records while looking for a FILE_CHECKPOINT record.
recv_scan_log(): Invoke log_t::archived_switch_recovery() upon reaching the end of the current archived log file.
log_t::archived_switch_recovery_prepare(): Make use of recv_sys.log_archive and open all but the last file read-only.
log_t::archived_switch_recovery(): Switch files in the pread() code path.
log_t::archived_mmap_switch_recovery_complete(): Switch files in the memory-mapped code path.
recv_warp: A pointer wrapper for memory-mapped parsing that spans two archive log files.
recv_sys_t::parse_mmap(): Use recv_warp for innodb_log_archive=ON.
recv_sys_t::parse(): Tweak some logic for innodb_log_archive=ON.
log_t::set_recovered_checkpoint(): Set the checkpoint on recovery. Updates also the end_lsn.
log_t::set_recovered_lsn(): Also update flush_lock and write_lock, to ensure that log_write_up_to() will be a no-op.
log_t::persist(): Even if the flushed_to_disk_lsn does not change, we may want to reset the write_lsn_offset.

storage/innobase/buf/buf0flu.cc

storage/innobase/log/log0recv.cc

storage/innobase/mtr/mtr0mtr.cc

extra/mariabackup/backup_copy.cc

mtr_t::commit_file(): Ensure that log archive rotation will complete.
log_checkpoint_low(): Prevent duplicated fil_names_clear().

storage/innobase/mtr/mtr0mtr.cc

log_t::archive_rename(): Check the correct return status

Make innodb_log_recovery_target>0 block any operations that are blocked by innodb_read_only_mode or innodb_force_recovery.

log_t::header_rewrite(): Zero out the first block before header_write(). Also, write a message about the change, so that there will be a chance to recover in case the server is being killed during this operation.

storage/innobase/buf/buf0flu.cc

mariadb-SaahilAlam · 2026-03-02T09:15:18Z

extra/mariabackup/xtrabackup.cc

 	if (dst_log_file == NULL) {
-		msg("Error: failed to open the target stream for '%s'.",
-		    LOG_FILE_NAME);
+		msg("Error: failed to open the target stream"


Noticing a silent crash, please check if it's related

origin/MDEV-37949 218b238d6f3918c028bb595dac06c03ca5b4ea5e

Error log shows
InnoDB: Starting crash recovery from checkpoint LSN=95653039
InnoDB: End of log at LSN=95698132

Then the server silently disappears - no assertion, no error message

Please take a look
RR trace is present on SDP:-
/data/results/1772438757/Silent_crash

It looks like there was a possible anomaly at rr record time, possibly affecting the pread64 system call. In my rr replay attempt I got the following:

2026-03-02 0:30:25 0 [Note] InnoDB: End of log at LSN=95698132
[FATAL src/ReplaySession.cc:755:check_pending_sig()] (task 125299 (rec:1049701) at time 3447)
 -> Assertion `false' failed to hold. Replaying `SIGNAL: SIGSEGV(det)': expecting tracee signal or trap, but instead at `pread64' (ticks: 7914513)

When I attach GDB to the crashed rr replay I can see that it occurred deep inside the following:

#21 0x00005e20ebf7f5a1 in buf_read_page (page_id=page_id@entry={m_id = 0x3}, err=err@entry=0x7ffc1851501c, chain=@0x5e20ef0cc060: {first = 0x79e096001520}, unzip=unzip@entry=0x1) at /data/Server/MDEV-37949A/storage/innobase/buf/buf0rea.cc:540
#22 0x00005e20ebf63909 in buf_page_get_gen (page_id={m_id = 0x3}, zip_size=zip_size@entry=0x0, rw_latch=rw_latch@entry=RW_NO_LATCH, guess=guess@entry=0x0, mode=mode@entry=0x9, mtr=mtr@entry=0x7ffc18515020, err=0x7ffc1851501c) at /data/Server/MDEV-37949A/storage/innobase/buf/buf0buf.cc:2781
#23 0x00005e20ebddf6c0 in recv_sys_t::recover (this=<optimized out>, page_id=page_id@entry={m_id = 0x3}, mtr=mtr@entry=0x7ffc18515020, err=err@entry=0x7ffc1851501c) at /data/Server/MDEV-37949A/storage/innobase/log/log0recv.cc:4518
#24 0x00005e20ec015dd7 in ibuf_upgrade_needed () at /data/Server/MDEV-37949A/storage/innobase/ibuf/ibuf0ibuf.cc:1029
#25 0x00005e20ebee9e1e in srv_start (create_new_db=<optimized out>) at /data/Server/MDEV-37949A/storage/innobase/srv/srv0start.cc:1594
#26 0x00005e20ebd4a2b6 in innodb_init (p=<optimized out>) at /data/Server/MDEV-37949A/storage/innobase/handler/ha_innodb.cc:4187

No buffer page access code path should be changed in this pull request, so in any case it should not be related to these changes.

This could be a bug in the underlying Linux kernel or in the way rr instruments system calls.

Shall we open a separate MDEV for this issue?

If you can reproduce and diagnose the problem, you could file a bug against https://github.com/rr-debugger/rr/. It could be a race condition between ptrace and signal handling, something similar to https://lkml.org/lkml/2021/10/31/311 perhaps. It definitely is not something that https://jira.mariadb.org is tracking.

Thirunarayanan · 2026-03-02T07:41:53Z

storage/innobase/include/log0log.h

  lsn_t next_checkpoint_lsn;
+  /** end_lsn of the first available checkpoint, or 0;
+  protected by latch.wr_lock() */
+  lsn_t archived_lsn;


SHOW GLOBAL STATUS accesses this variable without any mutex. Won't we have a torn read on 32-bit platforms?

You are right, we should borrow the trx_t::max_inactive_id_atomic trick from #3668. On -march=i686 it actually is possible to perform 64-bit loads and stores by using floating-point operations. Already the Intel Pentium had a 64-bit data bus.

Sorry, I realized that using Atomic_relaxed<lsn_t> would not help, because the read would be via the following:

{"lsn_archived", &log_sys.archived_lsn, SHOW_ULONGLONG},

I will check the generated code for the actual read. We could add export_vars indirection for 32-bit systems. srv_export_innodb_status() is already reading some fields while holding exclusive log_sys.latch.
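To make the export_vars indirection idea concrete, here is a minimal sketch in which the 64-bit value is copied into a separate export area while a lock is held, so that the status reader never loads the live field directly. `LogState`, `ExportVars` and `export_status` are stand-ins invented for this illustration, and `std::mutex` merely plays the role of log_sys.latch; the actual srv_export_innodb_status() code is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

using lsn_t = uint64_t;

// Stand-in for log_sys: archived_lsn is only modified under the latch.
struct LogState
{
  std::mutex latch;  // plays the role of log_sys.latch
  lsn_t archived_lsn = 0;
};

// Hypothetical export area that the SHOW GLOBAL STATUS array would point to.
struct ExportVars
{
  lsn_t lsn_archived = 0;
};

// Copy the value while holding the latch. A 64-bit load that is split into
// two 32-bit loads on a 32-bit target cannot observe a half-updated value,
// because writers hold the same latch.
void export_status(LogState &log, ExportVars &out)
{
  std::lock_guard<std::mutex> guard(log.latch);
  out.lsn_archived = log.archived_lsn;
}
```

The sketch assumes that the export area itself is filled in and read by the same thread, or under some other mutex, before the status row is produced.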

storage/innobase/include/log0recv.h

Thirunarayanan · 2026-03-02T08:07:49Z

storage/innobase/log/log0log.cc

+      }
+      pmem_persist(buf, 64);
+      memset_aligned<64>(buf + 64, 0, START_OFFSET - 64);
+      pmem_persist(buf, START_OFFSET);


(1) Writes the checkpoint slot at buf[0] with the offset value

(2) Persists buf[0..64)

(3) Zeroes all of buf[64..START_OFFSET)

(4) Persists buf[0..START_OFFSET)

Why does (4) persist what was already persisted in (2)? Can we avoid this?

We could, but I don’t think it is significant. SET GLOBAL innodb_log_archive should be a very rare operation, and PMEM is rarely available. Depending on the implementation of the underlying instruction, it could be that the extra flush is a no-op. For the clflush and clflushopt it isn’t, but modern AMD64 implementations should support clwb.

This is why I optimized for size here. A much bigger performance issue is that we are hogging log_sys.latch when durably changing innodb_log_archive, even more so in the non-PMEM code path. I don’t think we have any other practical choice.

Thirunarayanan · 2026-03-02T08:31:27Z

storage/innobase/log/log0log.cc

+      IF_WIN(log_resize_release(), latch.wr_unlock());
+      if (wait_lsn)
+        buf_flush_wait_flushed(wait_lsn);
+      continue;


If checkpoint_pending and wait_lsn == 0 then we could be releasing and acquiring log_t::latch. Is this acceptable?

In that scenario, we really have no other choice than to release and reacquire log_sys.latch in order to allow our "competitor" to get past the following in log_t::write_checkpoint():

checkpoint_pending= true;
latch.wr_unlock();
log_write_and_flush_prepare();

Thirunarayanan · 2026-03-02T08:35:59Z

storage/innobase/log/log0log.cc

+    else
+    {
+      status= RESIZE_NO_CHANGE;
+      /* When the current log becomes full and a new archivable log file


In innodb_log_file_size_update(), we do the following:

switch (log_sys.resize_start(*static_cast<const ulonglong*>(save), thd)) {
case log_t::RESIZE_NO_CHANGE:
  break;

Should this return a new status (a kind of "deferred"), so that the handler can inform the user that the size will apply to the next archive file?

SET GLOBAL was not actually designed to return any status after the initial validate step. The error reporting and the entire logic should some day be improved as part of the following:
MDEV-36828 SET GLOBAL cannot be atomic when inter-parameter constraints exist

We are reporting errors only when we really have to. As you can see in log_file_size_online.result it is pretty ugly:

SET GLOBAL innodb_log_file_size=4294971392;
ERROR HY000: Failed to create specific handler file

Thirunarayanan · 2026-03-02T08:59:30Z

storage/innobase/log/log0recv.cc

+        LARGE_INTEGER filesize;
+        filesize.LowPart= entry.nFileSizeLow;
+        filesize.HighPart= entry.nFileSizeHigh;
+        if ((filesize.LowPart & 4095) ||


why 4k? innodb_log_write_ahead_size can be 512kb.
Should the alignment check use write_size - 1 instead?

The allocation unit of innodb_log_file_size is 4096 bytes:

static MYSQL_SYSVAR_ULONGLONG(log_file_size, srv_log_file_size,
  PLUGIN_VAR_RQCMDARG,
  "Desired log file size in bytes",
  nullptr, innodb_log_file_size_update,
  96 << 20, log_t::FILE_SIZE_MIN,
  std::numeric_limits<ulonglong>::max(), 4096);

In the innodb_log_archive=ON format, the file size must be a multiple of 4096 bytes. In innodb_log_archive=OFF we allow arbitrary file sizes above 12288+16 bytes, because that is the minimum-size ib_logfile0 that mariadb-backup can produce.
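The size rules stated above could be checked along these lines. This is a sketch, not the check in log0recv.cc: `plausible_log_file_size` is a hypothetical name, "above 12288+16" is read as "at least 12288+16", and the 4 GiB ceiling for the archived format is taken from the PR description rather than from the quoted code.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t START_OFFSET = 12288;            // header size
constexpr uint64_t ALLOCATION_UNIT = 4096;          // block size of innodb_log_file_size
constexpr uint64_t ARCHIVE_MAX = uint64_t{4} << 30; // 4 GiB, per the description

// Hypothetical helper: is this a file size that recovery could accept?
bool plausible_log_file_size(uint64_t size, bool archive)
{
  if (!archive)
    // ib_logfile0: assumed minimum is the smallest file that
    // mariadb-backup can produce, 12288+16 bytes.
    return size >= START_OFFSET + 16;
  // ib_%016x.log: a multiple of 4096 bytes, and small enough for
  // 32-bit checkpoint offsets.
  return size > START_OFFSET && size <= ARCHIVE_MAX &&
         (size & (ALLOCATION_UNIT - 1)) == 0;
}
```

Under these assumptions, the default of 96 MiB passes for both formats, while 4294971392 bytes (4 GiB plus one allocation unit) is rejected for the archived format.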

Thirunarayanan · 2026-03-02T09:00:49Z

storage/innobase/log/log0recv.cc

+          prev->second.access= log_t::READ_ONLY;
+        else
+        {
+          log_archive.erase(log_archive.begin(), start= i);


If a gap is found, then all the files before the gap are erased. Shouldn't we add a message about it?

We should already refuse to start up InnoDB if innodb_log_recovery_start is before the first available log file. See the test innodb.log_archive:

--let $restart_parameters= --innodb-read-only --innodb-log-recovery-start=12288
--source include/start_mysqld.inc
SELECT variable_name, variable_value FROM information_schema.global_status
WHERE variable_name LIKE 'INNODB_LSN%';
let SEARCH_FILE = $MYSQLTEST_VARDIR/log/mysqld.1.err;
let SEARCH_PATTERN = InnoDB: No matching file found for innodb_log_recovery_start=12288;
--source include/search_pattern_in_file.inc

We must not allow any gaps between the first accepted file and the last found log file. We can only apply changes that reside within the available contiguous logs.

Note: It is possible that a DBA has moved some archived log to remote storage. This handling will allow us to recover if the archived log is restored.

Normally, when no innodb_log_recovery_start has been specified, we should start recovery from one of the last 2 log files.
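The "no gaps" rule discussed in this thread can be sketched as a scan over the archived files ordered by their first LSN. Everything here is an assumption made for illustration: `ArchiveMap` and `oldest_contiguous` are invented names, and the sketch assumes that a file of `size` bytes starting at `first_lsn` is followed by a file whose first LSN is `first_lsn + size - 12288`, since the first 12288 bytes of each file hold the header.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using lsn_t = uint64_t;
constexpr lsn_t START_OFFSET = 12288;

// Hypothetical directory listing: first LSN (from the file name)
// mapped to the file size in bytes.
using ArchiveMap = std::map<lsn_t, uint64_t>;

// Return the first LSN of the oldest file from which the archive is
// contiguous up to the newest file, or 0 if there are no files.
lsn_t oldest_contiguous(const ArchiveMap &files)
{
  if (files.empty())
    return 0;
  auto it = files.begin();
  lsn_t start = it->first;
  for (auto prev = it++; it != files.end(); prev = it++)
  {
    assert(prev->second > START_OFFSET);
    // Assumed layout: the payload of a file is its size minus the header.
    const lsn_t expected = prev->first + (prev->second - START_OFFSET);
    if (it->first != expected)
      start = it->first;  // gap: nothing older than this file is usable
  }
  return start;
}
```

If a DBA later restores the missing file, the same scan would again return the older starting point, which matches the note about moving archived logs to remote storage and back.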

Thirunarayanan · 2026-03-02T09:02:30Z

storage/innobase/log/log0recv.cc

+          case log_t::FORMAT_ENC_11:
+            if (recv_check_log_block(buf))
+            {
+              recv_sys.was_archive= true;


Is this the case where an archived file name contains the circular log format?

If the server is killed in the middle of log_t::set_archive(), we may end up with a file ib_*.log whose contents are actually in the innodb_log_archive=OFF format. This is being exercised in the test innodb.log_archive by some injection in Perl. In 218b238 a sql_print_information message was added so that users will be able to recover from a situation where a crash during SET GLOBAL innodb_log_archive=OFF corrupts the log header.

Thirunarayanan · 2026-03-02T09:04:58Z

storage/innobase/log/log0recv.cc

+  while (n_checkpoint < log_sys.START_OFFSET / 4)
+  {
+    const uint32_t d{mach_read_from_4(my_assume_aligned<4>(buf))};
+    if (d < log_sys.START_OFFSET || d >= log_sys.file_size)


This loop reads 4-byte checkpoint offsets sequentially. If a crash occurred while writing a checkpoint slot (a partial write), d could be a garbage value that happens to be >= START_OFFSET and < file_size. How would we handle this case?

Garbage values will be validated in the loop body by invoking recv_scan_log(false, parser) and by checking if recv_sys.file_checkpoint had been set. If not, we encountered something other than a valid FILE_CHECKPOINT record optionally preceded by FILE_MODIFY records. We silently ignore such invalid records. We will only be loud if no valid checkpoint is found and no valid innodb_log_recovery_start has been specified.

This is the reason why I believe that it is OK to write the checkpoint headers without any CRC-32C protection. Even if we crashed with a completely invalid header in a log file, we should be able to recover from the start of a preceding log file.
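The two-stage validation described above can be sketched roughly as follows: a cheap range check on each 32-bit big-endian slot, followed by an expensive check that the bytes at that offset really parse as a checkpoint. This is only an illustration. The callback parses_as_checkpoint stands in for the recv_scan_log() based check, valid_checkpoints() is a hypothetical name, and skipping out-of-range slots (rather than stopping at the first one) is an assumption, because the quoted hunk does not show the body of the if statement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

constexpr uint32_t START_OFFSET = 12288; // log_t::START_OFFSET per the PR description

// Read a 32-bit big-endian value, like mach_read_from_4() in InnoDB.
static uint32_t read_be32(const unsigned char *b)
{
  return uint32_t{b[0]} << 24 | uint32_t{b[1]} << 16 |
         uint32_t{b[2]} << 8 | uint32_t{b[3]};
}

// Collect every header slot that is in range AND passes the parser check.
static std::vector<uint32_t> valid_checkpoints(
  const unsigned char *header, uint32_t file_size,
  const std::function<bool(uint32_t)> &parses_as_checkpoint)
{
  assert(file_size > START_OFFSET);
  std::vector<uint32_t> found;
  for (std::size_t slot = 0; slot < START_OFFSET / 4; slot++)
  {
    const uint32_t d = read_be32(header + 4 * slot);
    if (d < START_OFFSET || d >= file_size)
      continue; // assumption: out-of-range slots are skipped
    if (parses_as_checkpoint(d))
      found.push_back(d); // in range and confirmed by the (stand-in) parser
    // Otherwise: in-range garbage, silently ignored as described above.
  }
  return found;
}
```

If the returned vector is empty, the caller would fall back to a preceding log file or to an explicitly specified innodb_log_recovery_start, in line with the reply above.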

mariadb-SaahilAlam · 2026-03-03T12:02:29Z

storage/innobase/buf/buf0flu.cc

  ut_ad(end_lsn <= current_lsn);
  ut_ad(end_lsn + SIZE_OF_FILE_CHECKPOINT <= current_lsn ||
        srv_shutdown_state > SRV_SHUTDOWN_INITIATED);
+  ut_ad(this->end_lsn <= end_lsn);


Another assertion failure was found:

origin/MDEV-37949 d9664ddec78e23d517887df6311204bd00044240 # 2026-03-02T13:49:16 [1000110] | mariadbd: /data/Server/MDEV-37949B/storage/innobase/buf/buf0flu.cc:1827: void log_t::write_checkpoint(lsn_t): Assertion `this->end_lsn <= end_lsn' failed.

Only a core dump is available on SDP:

/data/results/1772479413/TBR-2379-MDEV-37949B

log_sys.end_lsn is 0x18a bytes ahead of end_lsn, that is, a checkpoint has apparently been written out of order. Both of these LSNs are after the start of the file (log_sys.first_lsn). I see that the file is in innodb_log_archive=OFF format. In that format, the old value of end_lsn should not play much of a role at all.

I’d like to see an rr replay trace of this, if possible.

dr-m self-assigned this Oct 28, 2025

dr-m added the MariaDB Corporation label Oct 28, 2025

dr-m force-pushed the MDEV-37949 branch 4 times, most recently from 4a9a384 to 9d14e2c, November 12, 2025 14:35

dr-m force-pushed the MDEV-37949 branch from a035050 to bd2153e, December 4, 2025 11:10

dr-m force-pushed the MDEV-37949 branch 2 times, most recently from 1ab3880 to 254bfaa, January 8, 2026 15:13

dr-m force-pushed the MDEV-37949 branch from ce2cab3 to 50143aa, January 15, 2026 13:41

This was referenced Jan 19, 2026

MDEV-31956 SSD based InnoDB buffer pool extension #4510

Open

MDEV-38595: Simplify InnoDB doublewrite buffer creation #4554

Merged

dr-m mentioned this pull request Feb 4, 2026

MDEV-38748: Merge recv_recovery_read_checkpoint() to srv_start() #4614

Merged

dr-m force-pushed the MDEV-37949 branch 2 times, most recently from b9728c2 to 091c2b3, February 5, 2026 05:28

dr-m force-pushed the MDEV-37949 branch from bdc7185 to 4dc41fd, February 12, 2026 07:20

dr-m changed the title ~~MDEV-37949: Implement innodb_log_archive_file_size, innodb_log_archive_path, …~~ MDEV-37949: Implement innodb_log_archive, innodb_log_recovery_start, innodb_log_recovery_target, … Feb 12, 2026

dr-m force-pushed the MDEV-37949 branch 3 times, most recently from 1112487 to 998c03a, February 13, 2026 13:54

dr-m mentioned this pull request Feb 17, 2026

MDEV-38850 Dormant corruption in log_t::clear_mmap() #4662

Merged

dr-m force-pushed the MDEV-37949 branch 6 times, most recently from c29ff0c to 4d2e8ad, February 19, 2026 13:36

dr-m force-pushed the MDEV-37949 branch from 4d2e8ad to 99959b5, February 19, 2026 13:47

mariadb-SaahilAlam reviewed Feb 26, 2026

storage/innobase/buf/buf0flu.cc (resolved)

mariadb-SaahilAlam reviewed Feb 26, 2026

storage/innobase/log/log0recv.cc (resolved)

mariadb-SaahilAlam reviewed Feb 27, 2026

storage/innobase/mtr/mtr0mtr.cc (resolved)

fixup! 99959b5

da9ef71

mariadb-SaahilAlam reviewed Feb 27, 2026

extra/mariabackup/backup_copy.cc (resolved)

dr-m added 2 commits February 27, 2026 14:37

fixup! 99959b5

79107c4

fixup! 99959b5

ad6c7ac

mtr_t::commit_file(): Ensure that log archive rotation will complete. log_checkpoint_low(): Prevent duplicated fil_names_clear().

mariadb-SaahilAlam reviewed Feb 28, 2026

storage/innobase/mtr/mtr0mtr.cc (resolved)

dr-m added 3 commits March 2, 2026 08:24

fixup! 99959b5

b43e762

log_t::archive_rename(): Check the correct return status

fixup! 99959b5

daaf676

Make innodb_log_recovery_target>0 block any operations that are blocked by innodb_read_only_mode or innodb_force_recovery.

squash! 99959b5

218b238

log_t::header_rewrite(): Zero out the first block before header_write(). Also, write a message about the change, so that there will be a chance to recover in case the server is being killed during this operation.

mariadb-SaahilAlam reviewed Mar 2, 2026

storage/innobase/buf/buf0flu.cc (outdated, resolved)

fixup! daaf676

8698d1f

mariadb-SaahilAlam reviewed Mar 2, 2026

Thirunarayanan requested changes Mar 2, 2026

mariadb-SaahilAlam reviewed Mar 2, 2026

storage/innobase/buf/buf0flu.cc (outdated, resolved)

mariadb-SaahilAlam reviewed Mar 2, 2026

storage/innobase/mtr/mtr0mtr.cc (resolved)

mariadb-SaahilAlam reviewed Mar 2, 2026

mysql-test/main/analyze_stmt_prefetch_count.test (resolved)

fixup! daaf676

d9664dd

mariadb-SaahilAlam reviewed Mar 3, 2026

storage/innobase/buf/buf0flu.cc (outdated, resolved)

fixup! 99959b5

c23f2ba

mariadb-SaahilAlam reviewed Mar 3, 2026

storage/innobase/log/log0log.cc (resolved)

mariadb-SaahilAlam reviewed Mar 3, 2026

storage/innobase/log/log0log.cc (resolved)

mariadb-SaahilAlam reviewed Mar 3, 2026

mysql-test/main/analyze_engine_stats2.test (resolved)

mariadb-SaahilAlam reviewed Mar 3, 2026

dr-m added 5 commits March 3, 2026 15:30

fixup! 99959b5

0376193

log_parse_start(): Validate the mini-transaction before enforcing recv_sys.rpo.

fixup! 99959b5

ad359ec

fixup! 99959b5

8fa921f

fixup! 99959b5

2d8347a

recv_sys_t::find_checkpoint(): Close the PMEM file if was_archive. This is exercised by the test innodb.log_archive.

fixup! 99959b5

0d2613c

recv_sys.log_archive: Better comments describing the data structure

dr-m commented Oct 28, 2025 • edited

Description