diff --git a/src/ondisk/chunked_format.md b/src/ondisk/chunked_format.md new file mode 100644 index 0000000..9e20460 --- /dev/null +++ b/src/ondisk/chunked_format.md @@ -0,0 +1,135 @@ +(chunk_based_format)= +# Chunk-based File Format + +The chunk-based file format allows large files to be split into fixed-size chunks, +each backed by consecutive physical blocks. This enables efficient deduplication +and multi-device storage by addressing each chunk independently. + +## Superblock Fields for Chunked Files + +The full superblock layout is defined in {ref}`on_disk_superblock`. This page +lists only the superblock fields whose meaning matters for chunk-based files and +their optional multi-device addressing support. + +| Offset | Size | Type | Name | Description | +|--------|------|-------|--------------------|-------------| +| 0x50 | 4 | `u32` | `feature_incompat` | `EROFS_FEATURE_INCOMPAT_CHUNKED_FILE` enables chunk-based inodes. `EROFS_FEATURE_INCOMPAT_DEVICE_TABLE` enables the external device table described in {ref}`device_table` | +| 0x56 | 2 | `u16` | `extra_devices` | Number of external devices. Valid when `EROFS_FEATURE_INCOMPAT_DEVICE_TABLE` is set | +| 0x58 | 2 | `u16` | `devt_slotoff` | Start slot offset of the external device table in the metadata area. Valid when `EROFS_FEATURE_INCOMPAT_DEVICE_TABLE` is set | + +## Inode Fields for Chunked Files + +The full compact and extended inode layouts are defined in {ref}`on_disk_inodes`. +For chunk-based files, both inode variants share the same feature-specific +fields: + +| Offset | Size | Type | Name | Description | +|--------|------|-------|------------|-------------| +| 0x00 | 2 | `u16` | `i_format` | Bits 1-3 encode `EROFS_INODE_CHUNK_BASED` (4) | +| 0x10 | 4 | `u32` | `i_u` | Chunk info record described below | + +## Inode Data Layout for Chunked Files + +The data layout of an inode is encoded in bits 1–3 of `i_format`. 
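As an illustration, the data layout can be recovered from `i_format` with a shift and a mask. This is a minimal sketch, not code from any EROFS implementation: the helper names `data_layout` and `is_chunk_based` and the two shift/mask constants are assumptions made for this example; only the bit positions and the value 4 come from the text above.

```python
EROFS_INODE_CHUNK_BASED = 4  # data layout value named in the table above

# Hypothetical constant names for this sketch: bits 1-3 hold the layout.
DATALAYOUT_SHIFT = 1
DATALAYOUT_MASK = 0x7


def data_layout(i_format: int) -> int:
    """Return the data layout stored in bits 1-3 of a 16-bit i_format."""
    return (i_format >> DATALAYOUT_SHIFT) & DATALAYOUT_MASK


def is_chunk_based(i_format: int) -> bool:
    """True when the layout bits select EROFS_INODE_CHUNK_BASED."""
    return data_layout(i_format) == EROFS_INODE_CHUNK_BASED
```

Bit 0 and the bits above bit 3 do not influence the result, so the same helper works regardless of what those bits carry.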
+ +### `EROFS_INODE_CHUNK_BASED` (4) + +The entire inode data is split into fixed-size chunks, each occupying consecutive +physical blocks. Requires `EROFS_FEATURE_INCOMPAT_CHUNKED_FILE`. `i_u` encodes a +chunk info record and an array of per-chunk address entries follows the inode body. + +(chunk_based_structures)= +## Chunk-based Structures + +When the data layout is `EROFS_INODE_CHUNK_BASED`, the `i_u` field (4 bytes at +inode offset 0x10) is interpreted as a chunk info record: + +### Chunk Info Record + +| Bits | Width | Description | +|-------|-------|-------------| +| 0–4 | 5 | `chunkbits`: chunk size = 2 to the power of (`blkszbits + chunkbits`) | +| 5 | 1 | `EROFS_CHUNK_FORMAT_INDEXES`: entry format selector (see below) | +| 6 | 1 | 48-bit layout specific; ignored for normal chunk-based inodes | +| 7–31 | 25 | Reserved; must be 0 | + +An array of per-chunk address entries is stored immediately after the inode body +(and inline xattr region, if any). The number of entries is +`⌈i_size / chunk_size⌉`. + +### Chunk Entry Formats + +The `EROFS_CHUNK_FORMAT_INDEXES` bit in the chunk info record selects one of two +per-chunk entry formats: + +#### Block Map Entry (4 bytes) + +When `EROFS_CHUNK_FORMAT_INDEXES` is not set, each chunk is described by a +single 32-bit block address entry. + +Without a device table, this entry is a primary-device block address. +With `EROFS_FEATURE_INCOMPAT_DEVICE_TABLE`, it is interpreted in +the unified address space and can resolve to either the primary or an extra device. + +| Offset | Size | Type | Name | Description | +|--------|------|-------|------------|-------------| +| 0x00 | 4 | `u32` | `startblk` | Starting block address of this chunk | + +#### Chunk Index Entry (8 bytes) + +When `EROFS_CHUNK_FORMAT_INDEXES` is set, each chunk is described by an 8-byte +record that supports multi-device addressing. 
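The chunk info record and the entry-count rule above can be made concrete with a small sketch. It assumes `i_u` has already been read as an unsigned 32-bit integer and that `blkszbits` comes from the superblock; the function name `chunk_geometry` and the mask constant name are hypothetical and exist only for this example.

```python
EROFS_CHUNK_FORMAT_INDEXES = 1 << 5  # bit 5 of the chunk info record
CHUNKBITS_MASK = 0x1F                # bits 0-4 hold chunkbits (assumed name)


def chunk_geometry(i_u: int, i_size: int, blkszbits: int):
    """Return (chunk_size, entry_size, entry_count) for a chunk-based inode."""
    chunkbits = i_u & CHUNKBITS_MASK
    chunk_size = 1 << (blkszbits + chunkbits)
    # 8-byte chunk index entries when the bit is set,
    # 4-byte block map entries otherwise.
    entry_size = 8 if i_u & EROFS_CHUNK_FORMAT_INDEXES else 4
    # ceil(i_size / chunk_size) in integer arithmetic
    entry_count = (i_size + chunk_size - 1) // chunk_size
    return chunk_size, entry_size, entry_count
```

For example, with 4 KiB blocks (`blkszbits = 12`) and `chunkbits = 2`, a 40000-byte file needs three 16 KiB chunks, so the entry array occupies `3 * entry_size` bytes.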
+ +| Offset | Size | Type | Name | Description | +|--------|------|-------|-------------|-------------| +| 0x00 | 2 | `u16` | _dontcare_ | 48-bit layout specific; ignored for normal chunk-based inodes | +| 0x02 | 2 | `u16` | `device_id` | External device index; 0 = primary device | +| 0x04 | 4 | `u32` | `startblk` | 32-bit starting block address | + +The `device_id` indexes into the external device table to resolve the physical +device and unified address offset. See +{ref}`address-resolution-for-chunk-based-inodes`. + +(multi_device_support)= +## Multi-device Support + +(device_table)= +### Device Table + +This section applies when `EROFS_FEATURE_INCOMPAT_DEVICE_TABLE` is set. + +When `EROFS_FEATURE_INCOMPAT_DEVICE_TABLE` is set, the `extra_devices` superblock +field gives the number of additional block devices. They are described by an array +of 128-byte device slot records. The array begins at slot offset `devt_slotoff` +within the metadata area, where each slot is 128 bytes. + +Each record contains: + +| Offset | Size | Type | Name | Description | +|--------|------|--------|------------|-------------| +| 0x00 | 64 | `u8[]` | `tag` | Device identifier (e.g. SHA-256 digest); not null-terminated | +| 0x40 | 4 | `u32` | `blocks` | 32-bit total block count of this device | +| 0x44 | 4 | `u32` | `uniaddr` | 32-bit unified starting block address of this device | +| 0x48 | 6 | `u8[]` | _dontcare_ | 48-bit layout specific; ignored for normal chunk-based inodes | +| 0x4E | 50 | `u8[]` | _reserved_ | Reserved; must be 0 | + +(address-resolution-for-chunk-based-inodes)= +### Address Resolution for Chunk-based Inodes + +#### Chunk Index Entry (8 bytes) + +When the chunk index entry format is used, each entry carries an explicit +`device_id` that directly identifies the target device: + +- `device_id = 0`: `startblk` is a block address on the primary device. 
+- `device_id = N` (1 ≤ N ≤ `extra_devices`): `startblk` is a block address on + extra device N, whose slot is at `devt_slotoff + (N − 1)` in the metadata area. + +#### Block Map Entry (4 bytes) + +The simple block map entry format has no `device_id` field. Instead, `startblk` +is an absolute block address in the unified address space, and the reader uses +`uniaddr` from the device table to identify the target device: it finds the +slot `i` whose range `[uniaddr[i], uniaddr[i] + blocks[i])` contains +`startblk`, then derives the intra-device block address as +`startblk − uniaddr[i]`. diff --git a/src/ondisk/core_ondisk.md b/src/ondisk/core_ondisk.md index 79dde69..df454b1 100644 --- a/src/ondisk/core_ondisk.md +++ b/src/ondisk/core_ondisk.md @@ -77,7 +77,7 @@ installation of x86 boot sectors and other oddities. | 0x68 | 1 | `u8` | _reserved_ | Reserved; must be 0 | | 0x69 | 1 | `u8` | _dontcare_ | Xattr specific; ignored in core format | | 0x6A | 2 | `u16` | _reserved_ | Reserved; must be 0 | -| 0x6C | 12 | `u8[]` | _dontcare_ | 48 bit specific; ignored in core format | +| 0x6C | 12 | `u8[]` | _dontcare_ | 48-bit layout specific; ignored in core format | | 0x78 | 8 | `u64` | _reserved_ | Reserved; must be 0 | Note the difference between _reserved_ and _dontcare_ fields: diff --git a/src/ondisk/index.md b/src/ondisk/index.md index 17fe20a..9660769 100644 --- a/src/ondisk/index.md +++ b/src/ondisk/index.md @@ -22,7 +22,19 @@ The entire filesystem tree is built from just three core on-disk structures: - **Directory entries** — 12-byte records, sorted lexicographically by filename at the beginning of each directory block (each data block of a directory inode). +Optional features extend this foundation without breaking the core design: + +- **{doc}`Extended attributes (xattrs) <xattrs>`** support per-inode metadata. 
+ Several mechanisms, including a shared xattr pool, long prefix tables, and a + per-inode Bloom filter, keep storage overhead low even when xattrs are used + extensively. +- **{doc}`Chunk-based layout <chunked_format>`** splits large files into + fixed-size, independently-addressed chunks, enabling cross-file deduplication + and multi-device storage. + ```{toctree} :hidden: core_ondisk +xattrs +chunked_format ``` diff --git a/src/ondisk/xattrs.md b/src/ondisk/xattrs.md new file mode 100644 index 0000000..25e80d9 --- /dev/null +++ b/src/ondisk/xattrs.md @@ -0,0 +1,203 @@ +(xattrs)= +# Extended Attributes (Xattrs) + +EROFS supports Extended Attributes ([xattr(7)](https://man7.org/linux/man-pages/man7/xattr.7.html)), which are name:value pairs associated with inodes. + +## Superblock Fields for Xattr Support + +The full superblock layout is defined in {ref}`on_disk_superblock`. This page +lists only the superblock fields whose meaning matters for xattr-related +features. + +| Offset | Size | Type | Name | Description | +|--------|------|-------|--------------------------|-------------| +| 0x08 | 4 | `u32` | `feature_compat` | Xattr-related compatible feature flags; see {ref}`xattr_feature_flags` | +| 0x2C | 4 | `u32` | `xattr_blkaddr` | Start block address of the shared xattr area; see {ref}`shared_xattr_area` | +| 0x50 | 4 | `u32` | `feature_incompat` | Xattr-related incompatible feature flags; see {ref}`xattr_feature_flags` | +| 0x5B | 1 | `u8` | `xattr_prefix_count` | Number of long xattr name prefixes; valid when `EROFS_FEATURE_INCOMPAT_XATTR_PREFIXES` is set; see {ref}`long_xattr_prefixes` | +| 0x5C | 4 | `u32` | `xattr_prefix_start` | Location of the standalone long xattr prefix table when `EROFS_FEATURE_INCOMPAT_XATTR_PREFIXES` and `EROFS_FEATURE_COMPAT_PLAIN_XATTR_PFX` are both set; see {ref}`long_xattr_prefixes` | +| 0x60 | 8 | `u64` | `packed_nid` | Packed inode NID. 
Relevant when long xattr prefixes are embedded in the packed inode's data region | +| 0x68 | 1 | `u8` | `xattr_filter_reserved` | Must be 0 for the xattr Bloom filter to operate; see {ref}`xattr_filter` | +| 0x69 | 1 | `u8` | `ishare_xattr_prefix_id` | Long-prefix table index used by image-share xattrs; valid when `EROFS_FEATURE_COMPAT_ISHARE_XATTRS` is set; see {ref}`image_share_xattrs` | +| 0x80 | 8 | `u64` | `metabox_nid` | Metabox inode NID. Relevant when shared xattrs or long xattr prefixes are stored in the metabox inode's data region | + +(xattr_feature_flags)= +## Xattr Feature Flags + +The following superblock feature bits are directly relevant to xattr support. + +### `feature_compat` Bits + +| Bit mask | Name | Description | +|--------------|---------------------------------------------|-------------| +| `0x00000020` | `EROFS_FEATURE_COMPAT_XATTR_FILTER` | Enables the per-inode xattr Bloom filter described in {ref}`xattr_filter` | +| `0x00000040` | `EROFS_FEATURE_COMPAT_PLAIN_XATTR_PFX` | Stores the long xattr prefix table as a standalone region; valid when `EROFS_FEATURE_INCOMPAT_XATTR_PREFIXES` is set; see {ref}`long_xattr_prefixes` | +| `0x00000080` | `EROFS_FEATURE_COMPAT_SHARED_EA_IN_METABOX` | Stores the shared xattr area in the metabox inode's decoded data region instead of at `xattr_blkaddr` | +| `0x00000100` | `EROFS_FEATURE_COMPAT_ISHARE_XATTRS` | Enables image-share xattrs and makes `ishare_xattr_prefix_id` valid; see {ref}`image_share_xattrs` | + +### `feature_incompat` Bits + +| Bit mask | Name | Description | +|--------------|--------------------------------------|-------------| +| `0x00000020` | `EROFS_FEATURE_INCOMPAT_XATTR_PREFIXES` | Enables the long xattr prefix table described in {ref}`long_xattr_prefixes` | + +(xattr_inode_fields)= +## Inode Fields for Xattr Support + +The full compact and extended inode layouts are defined in {ref}`on_disk_inodes`. 
+For xattr support, both inode variants share the same feature-specific field: + +| Offset | Size | Type | Name | Description | +|--------|------|-------|------------------|-------------| +| 0x02 | 2 | `u16` | `i_xattr_icount` | When non-zero, the inline xattr region size is `(i_xattr_icount - 1) * 4 + 12` bytes | + +Two storage classes exist: + +- **Inline xattrs**: stored directly in the metadata block immediately following + the inode body (and before any inline data tail). They are private to the inode and + encoded as a sequence of xattr entry records within the inline xattr region + described by `i_xattr_icount`. +- **Shared xattrs**: stored once in the global shared xattr area. An inode references + a shared entry through a 4-byte index stored immediately after the fixed + 12-byte inline xattr header. + Multiple inodes that carry an identical xattr (same name and value) can reference + the same shared entry, avoiding redundant per-inode copies. + +To further reduce storage overhead for xattrs whose names share a common prefix +(for example `trusted.overlay.*` or `security.ima.*`), EROFS supports a long xattr +prefix table (`EROFS_FEATURE_INCOMPAT_XATTR_PREFIXES`). Each prefix entry records a +`base_index` that refers to one of the standard short namespace prefixes and an +additional infix string. An xattr entry whose `e_name_index` selects a long prefix +stores only the suffix that follows the full reconstructed prefix, rather than +repeating the prefix in every entry. + +(inline_xattr_region)= +## Inline Xattr Region Layout + +The inline xattr region immediately follows the inode body (at a 4-byte aligned +offset). Its total size is `(i_xattr_icount - 1) * 4 + 12` bytes, where 12 is the +fixed size of the inline xattr body header. The region is structured as: + +1. The {ref}`inline_xattr_body_header` (12 bytes, fixed). +2. `h_shared_count` × 4-byte shared xattr index values. +3. Zero or more {ref}`xattr_entry_record` (inline entries). 
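The three-part structure above implies simple arithmetic for locating the inline entries. The sketch below is illustrative only: the function and constant names are hypothetical, and treating `i_xattr_icount == 0` as "no inline xattr region" is an assumption drawn from the "when non-zero" wording of the field description.

```python
INLINE_XATTR_HEADER_SIZE = 12  # fixed inline xattr body header
SHARED_INDEX_SIZE = 4          # one 4-byte index per shared xattr


def inline_xattr_region_size(i_xattr_icount: int) -> int:
    """Size in bytes of the inline xattr region (0 if there is none)."""
    if i_xattr_icount == 0:
        return 0  # assumption: zero means the region is absent
    return (i_xattr_icount - 1) * 4 + INLINE_XATTR_HEADER_SIZE


def inline_entries_span(i_xattr_icount: int, h_shared_count: int):
    """Return (offset, length) of the inline entry records in the region."""
    total = inline_xattr_region_size(i_xattr_icount)
    if total == 0:
        return 0, 0
    # Entries start after the header and the shared index array.
    offset = INLINE_XATTR_HEADER_SIZE + h_shared_count * SHARED_INDEX_SIZE
    if offset > total:
        raise ValueError("shared index array exceeds the inline xattr region")
    return offset, total - offset
```

With `i_xattr_icount = 5` the region is 28 bytes; if `h_shared_count = 2`, the shared indexes occupy bytes 12 to 19 and the inline entries take the remaining 8 bytes starting at offset 20.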
+ +(inline_xattr_body_header)= +### Inline Xattr Body Header + +| Offset | Size | Type | Name | Description | +|--------|------|--------|------------------|-------------| +| 0x00 | 4 | `u32` | `h_name_filter` | Inverted Bloom filter over xattr names; valid when `EROFS_FEATURE_COMPAT_XATTR_FILTER` is set; see {ref}`Xattr Filter <xattr_filter>` | +| 0x04 | 1 | `u8` | `h_shared_count` | Number of shared xattr index entries | +| 0x05 | 7 | `u8[]` | _reserved_ | Reserved; must be 0 | + +(xattr_entry_record)= +### Xattr Entry Record + +Each inline or shared xattr entry has the following layout: + +| Offset | Size | Type | Name | Description | +|--------|------|-------|----------------|-------------| +| 0x00 | 1 | `u8` | `e_name_len` | Length of the name suffix in bytes | +| 0x01 | 1 | `u8` | `e_name_index` | Namespace index (maps to a prefix string); see below | +| 0x02 | 2 | `u16` | `e_value_size` | Length of the value in bytes | + +Immediately following the 4-byte xattr entry header: `e_name_len` bytes of name +suffix, then `e_value_size` bytes of value. The entire entry (header + name + value) +is padded to a 4-byte boundary. + +(e_name_index-namespace-mapping)= +#### `e_name_index` Namespace Mapping + +Rather than storing the full namespace prefix string in every entry, EROFS encodes +the xattr namespace prefix as a 1-byte index. Bit 7 of `e_name_index` is the +`EROFS_XATTR_LONG_PREFIX` flag. When set, the lower 7 bits index into the long +xattr name prefix table (see {ref}`long_xattr_prefixes`). +When clear, the full byte selects one of the built-in short namespace prefixes: + +| Value | Prefix | +|-------|--------| +| 1 | `user.` | +| 2 | `system.posix_acl_access` | +| 3 | `system.posix_acl_default` | +| 4 | `trusted.` | +| 6 | `security.` | + +All other `e_name_index` values (including `0` and `5`) are reserved and must not be used unless defined by a future format extension. 
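To show how the entry header, padding rule, and short-prefix table fit together, here is a minimal decoding sketch. It assumes multi-byte fields are little-endian, which this page does not state, and it handles only the built-in short prefixes. The names `parse_xattr_entry`, `short_prefix`, and `SHORT_PREFIXES` are hypothetical.

```python
import struct

EROFS_XATTR_LONG_PREFIX = 0x80  # bit 7 of e_name_index

SHORT_PREFIXES = {
    1: "user.",
    2: "system.posix_acl_access",
    3: "system.posix_acl_default",
    4: "trusted.",
    6: "security.",
}


def parse_xattr_entry(buf: bytes, off: int = 0):
    """Decode one entry; return (name_index, suffix, value, next_off)."""
    # Assumption: fields are little-endian (u8, u8, u16).
    name_len, name_index, value_size = struct.unpack_from("<BBH", buf, off)
    name_start = off + 4
    value_start = name_start + name_len
    end = value_start + value_size
    if end > len(buf):
        raise ValueError("truncated xattr entry")
    suffix = buf[name_start:value_start]
    value = buf[value_start:end]
    # Header + name + value is padded up to a multiple of 4 bytes.
    next_off = off + ((4 + name_len + value_size + 3) & ~3)
    return name_index, suffix, value, next_off


def short_prefix(name_index: int) -> str:
    """Map a short e_name_index to its prefix string."""
    if name_index & EROFS_XATTR_LONG_PREFIX:
        raise ValueError("long prefix: use the long xattr prefix table")
    if name_index not in SHORT_PREFIXES:
        raise ValueError(f"reserved e_name_index {name_index}")
    return SHORT_PREFIXES[name_index]
```

A 9-byte entry for `user.foo` with value `hi` therefore occupies 12 bytes on disk, and the next entry begins at the returned `next_off`.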
+ +(shared_xattr_area)= +## Shared Xattr Area + +Normally, the shared xattr area begins at block address `xattr_blkaddr`. Each shared +entry is an xattr entry record stored contiguously in this area. An inode references +a shared entry by its 32-bit index, stored in the inline xattr region immediately +after the 12-byte inline xattr header. The index is a byte offset within the +shared area divided by 4. + +When `EROFS_FEATURE_COMPAT_SHARED_EA_IN_METABOX` is set, the shared xattr pool is +stored in the metabox inode's decoded data region rather than at `xattr_blkaddr`. + +(long_xattr_prefixes)= +## Long Xattr Name Prefixes + +This section applies when `EROFS_FEATURE_INCOMPAT_XATTR_PREFIXES` is set. + +When this feature is set, a table of `xattr_prefix_count` prefix entries is +present; see {ref}`xattr_prefix_table_placement` for where that table is stored. +Each entry has the following fixed header. The full entry, including the +variable-length `infix` payload, is padded to a 4-byte boundary. + +| Offset | Size | Type | Name | Description | +|--------|------|-------|--------------|-------------| +| 0x00 | 2 | `u16` | `size` | Byte length of the following content: `base_index` plus the variable-length `infix` payload | +| 0x02 | 1 | `u8` | `base_index` | Built-in short namespace prefix index (see {ref}`e_name_index-namespace-mapping`) | + +The variable-length `infix` bytes begin at offset `0x03`. Their length is +`size - 1`, and they are not null-terminated. + +The full reconstructed prefix is the concatenation of the short prefix indicated by +`base_index` and the `infix` bytes. An xattr entry using a long prefix stores only +the name suffix after the reconstructed prefix; `e_name_len` counts only those suffix +bytes. + +For example, an xattr named `trusted.overlay.opaque` can be represented with +`base_index = 4` (`trusted.`) and `infix = "overlay."`, yielding the full prefix +`trusted.overlay.`; the stored name suffix is `opaque` with `e_name_len = 6`. 
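The `trusted.overlay.opaque` example can be reproduced with a short sketch that decodes one prefix table entry and rebuilds the full name. As before, little-endian encoding of `size` is an assumption not stated on this page, and `parse_long_prefix`, `full_xattr_name`, and `SHORT_PREFIXES` are hypothetical names used only here.

```python
import struct

SHORT_PREFIXES = {
    1: "user.",
    2: "system.posix_acl_access",
    3: "system.posix_acl_default",
    4: "trusted.",
    6: "security.",
}


def parse_long_prefix(buf: bytes, off: int = 0):
    """Decode one long prefix entry; return (full_prefix, next_off)."""
    # Assumption: `size` is a little-endian u16.
    (size,) = struct.unpack_from("<H", buf, off)
    if size < 1 or off + 2 + size > len(buf):
        raise ValueError("malformed long prefix entry")
    base_index = buf[off + 2]
    infix = buf[off + 3 : off + 2 + size]  # size - 1 bytes, no terminator
    full_prefix = SHORT_PREFIXES[base_index].encode() + infix
    # The 2-byte size field plus its content is padded to 4 bytes.
    next_off = off + ((2 + size + 3) & ~3)
    return full_prefix, next_off


def full_xattr_name(prefix_table, e_name_index: int, suffix: bytes) -> bytes:
    """Rebuild the name of an entry whose e_name_index has bit 7 set."""
    return prefix_table[e_name_index & 0x7F] + suffix
```

Here `prefix_table` is simply a list of the prefixes decoded in table order, so an entry with `e_name_index = 0x80` refers to the first long prefix.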
+ +(xattr_prefix_table_placement)= +### Prefix Table Placement + +The xattr prefix table may be: +- embedded in the metabox or packed inode's data region (when `EROFS_FEATURE_COMPAT_PLAIN_XATTR_PFX` is not set), or +- stored as a standalone region located by `xattr_prefix_start` (when `EROFS_FEATURE_COMPAT_PLAIN_XATTR_PFX` is set). + +(xattr_filter)= +## Xattr Filter + +This section applies when `EROFS_FEATURE_COMPAT_XATTR_FILTER` is set. + +When this feature is set, `h_name_filter` in the inline xattr body header holds a +32-bit inverted Bloom filter over the inode's xattr names. Each bit position +corresponds to one hash bucket: + +- A bit value of **1** guarantees that no xattr present on this inode hashes to that + bucket, so the queried name is **definitely absent**. +- A bit value of **0** means a matching xattr **may exist** and a full scan is + required. + +When `xattr_filter_reserved` in the superblock is non-zero, the Bloom filter is +disabled unconditionally for all inodes in the image. + +(image_share_xattrs)= +## Image-share Xattrs + +This section applies when `EROFS_FEATURE_COMPAT_ISHARE_XATTRS` is set. + +When this feature is set, the superblock field `ishare_xattr_prefix_id` is valid +and identifies an entry in the long xattr prefix table. Regular files may carry an +xattr whose name equals the prefix identified by `ishare_xattr_prefix_id` +(i.e. `e_name_index` selects that entry and `e_name_len` is 0) and whose value is +a SHA-256 content fingerprint in the form `sha256:`. + +This convention enables tools to identify files with identical content across +different EROFS images by comparing these fingerprints.