From 1279a9ba0711c16dab011b2759d1eea768c9ea97 Mon Sep 17 00:00:00 2001 From: Adam Green Date: Fri, 30 Jan 2026 00:39:12 +1300 Subject: [PATCH 1/3] feat --- README.md | 2 +- content/blog/what-is-scd.md | 90 +++++++++++++++++------------------- layouts/_default/baseof.html | 26 +++++++++++ 3 files changed, 69 insertions(+), 49 deletions(-) diff --git a/README.md b/README.md index 36edb4c..3595209 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ [![Netlify Status](https://api.netlify.com/api/v1/badges/cb1be350-b7c9-4a7f-9774-8daf510a9f65/deploy-status)](https://app.netlify.com/projects/adgefficiency/deploys) -[http://adgefficiency.com](http://adgefficiency.com) is my professional blog. +[The professional blog](http://adgefficiency.com) of [Adam Green](https://www.linkedin.com/in/adgefficiency/). ## Use diff --git a/content/blog/what-is-scd.md b/content/blog/what-is-scd.md index 5b59694..6f735f7 100644 --- a/content/blog/what-is-scd.md +++ b/content/blog/what-is-scd.md @@ -1,52 +1,46 @@ --- -draft: true -title: What are/is Slowly Changing Dimensions (SCD)? -description: TODO +title: What are Slowly Changing Dimensions (SCD)? +description: Slowly changing dimensions (SCD) are a group of techniques used to track changes to data. date_created: 2025-12-31 competencies: - Data Engineering --- -Slowly changing dimensions (SCD) are a group of techniques used to track changes to a row of data. +Slowly changing dimensions (SCD) are a group of techniques used to track changes to a row of data. 
There are seven types from Type 0 to Type 6, which trade off accuracy, data complexity and database performance: -There are six techniques from Type 1 to Type 6, which trade off accuracy, data complexity, database performance: - -- Type 0: Never update -- Type 1: Overwrite, with no history kept -- Type 2: Add new row with version & data tracking -- Type 3: Add column for previous value, limited history -- Type 4: Keep history in separate table -- Type 6: Hybrid of types 1, 2, and 3 - -Type 2 is the most common when you need audit trails or point-in-time analysis. +- **Type 0**: Never update +- **Type 1**: Overwrite, with no history kept +- **Type 2**: Add new row with version & date tracking +- **Type 3**: Add column for previous value, limited history +- **Type 4**: Keep history in separate table +- **Type 5**: Mini-dimension plus embedded current values +- **Type 6**: Hybrid of types 1, 2, and 3 ## Type 0 -Only use for immutable data. If your data changes - avoid. +Only use for immutable data. If your data changes (i.e. is not immutable) - avoid using this. ## Type 1 Actively harmful to data integrity, should be avoided except in very specific cases. -Only use for correcting data entry problems, or things that aren't worth keeping (like typos in a name). - -Avoid +Only use for correcting data entry problems, or things that aren't worth keeping (like typos in a name). Avoid. ## Type 2 -Type 2 is the most common when you need audit trails or point-in-time analysis. +Type 2 is the most common when you need audit trails or point-in-time analysis. Gold standard for analytics & reporting. A common type of SCD is SCD Type 2, where a row has (in addition to other columns): -- Start date -- End date -- Is current flag +- **Start date**: When this version of the row became active +- **End date**: When this version was superseded +- **Is current flag**: Whether this is the latest version Updating an existing row in a SCD type 2 table: -1. Insert a new row -2. 
Update the old row `is_current` flag to `False` -3. Update the old row `end_date` +1. **Insert**: Add a new row with the updated values +2. **Flag**: Set the old row `is_current` to `False` +3. **Close**: Set the old row `end_date` to the change date This will allow you to maintain a history of changes to your data over time. @@ -71,7 +65,7 @@ sk | customer_id | name | region | start_date | end_date | is_current When ingesting data like this, you need: -``` +```sql SELECT sk FROM dim_customer WHERE customer_id = 1001 AND transaction_date >= start_date @@ -84,39 +78,42 @@ When you load a fact with a transaction date in the past, after the dimension ha You can also skip surrogate keys in facts and do the date-range join at query time instead. Simpler ETL, but slower queries and easier to get wrong. -Risk dropping all current period facts if you don't handle the null `end_date` in a join of fact & dimensions. +Risks of skipping surrogate keys: -Risk facts joining to multiple dimensions if the historical periods overlap. - -Gold standard for analytics & reporting. +- **Null end dates**: Drop all current period facts if you don't handle the null `end_date` in a join of fact & dimensions +- **Overlapping periods**: Facts join to multiple dimensions if the historical periods overlap ## Type 3 -Limited, can't know history at arbitrary past dates - -Avoid +Add column for previous value. Limited history, can't know history at arbitrary past dates. Avoid. ## Type 4 -Good for historic, best when dimensions change a lot, and you often only want the current state (ie `is_current`) +Keep history in a separate table. Good for historical data, best when dimensions change a lot, and you often only want the current state (i.e. `is_current`). + +## Type 5 + +Hybrid of Types 1 and 4. Uses a mini-dimension table for frequently changing attributes, with the current mini-dimension key also embedded in the main dimension (Type 1 overwrite). 
Useful when a small set of attributes change often and you need both current and historical views. ## Type 6 -Good when you want to access the current value on the historical row, without needing a self join (like you would for Type2) +Combination of Types 1, 2, and 3. Good when you want to access the current value on the historical row, without needing a self join (like you would for Type 2). ## Deletes -SCDs typically focus on updates. What happens when a dimension record is deleted? Options: +SCDs typically focus on updates. -Soft delete (flag) -Keep row with "deleted" status -Actually delete (breaks referential integrity) +When a dimension record is deleted, options are: ---- +- **Soft delete**: Flag the row as deleted +- **Status marker**: Keep row with "deleted" status +- **Hard delete**: Actually delete (breaks referential integrity) -TODO - example in DuckDB SQL for the power MW of a hydro turbine - OR do I leave this for later??? +## Example in DuckDB SQL -``` +The example below shows a hydro turbine that is upgraded from 25 MW to 32 MW capacity, and how to track that with SCD Type 2: + +```sql -- Create dimension table with SCD Type 2 CREATE TABLE dim_turbine ( sk INTEGER PRIMARY KEY, @@ -177,7 +174,7 @@ SELECT * FROM dim_turbine WHERE is_current = true; SELECT * FROM dim_turbine WHERE turbine_id = 'HT-001' ORDER BY start_date; ``` -``` +```shell-session $ duckdb < scd.sql ┌─────────────────┬────────────┬──────────────────┬────────────────┬─────────────────────┐ │ generation_date │ name │ capacity_at_time │ generation_mwh │ capacity_factor_pct │ @@ -191,16 +188,13 @@ $ duckdb < scd.sql │ sk │ turbine_id │ name │ power_mw │ start_date │ end_date │ is_current │ │ int32 │ varchar │ varchar │ decimal(10,2) │ date │ date │ boolean │ ├───────┼────────────┼────────────┼───────────────┼────────────┼──────────┼────────────┤ -│ 2 │ HT-001 │ Karapiro 1 │ 32.00 │ 2024-07-01 │ │ true │ +│ 2 │ HT-001 │ Karapiro 1 │ 32.00 │ 2024-07-01 │ NULL │ true │ 
└───────┴────────────┴────────────┴───────────────┴────────────┴──────────┴────────────┘ ┌───────┬────────────┬────────────┬───────────────┬────────────┬────────────┬────────────┐ │ sk │ turbine_id │ name │ power_mw │ start_date │ end_date │ is_current │ │ int32 │ varchar │ varchar │ decimal(10,2) │ date │ date │ boolean │ ├───────┼────────────┼────────────┼───────────────┼────────────┼────────────┼────────────┤ │ 1 │ HT-001 │ Karapiro 1 │ 25.00 │ 2020-01-01 │ 2024-07-01 │ false │ -│ 2 │ HT-001 │ Karapiro 1 │ 32.00 │ 2024-07-01 │ │ true │ +│ 2 │ HT-001 │ Karapiro 1 │ 32.00 │ 2024-07-01 │ NULL │ true │ └───────┴────────────┴────────────┴───────────────┴────────────┴────────────┴────────────┘ ``` - -TODO -- change karapiro to benmore or something diff --git a/layouts/_default/baseof.html b/layouts/_default/baseof.html index 75c655c..0ad811e 100644 --- a/layouts/_default/baseof.html +++ b/layouts/_default/baseof.html @@ -195,6 +195,32 @@ >Who Am I + + + + + + + Work With Me + {{ partial "dark-mode-toggle.html" . }} Date: Sat, 31 Jan 2026 15:49:15 +1300 Subject: [PATCH 2/3] feat --- content/blog/what-is-scd.md | 38 ++++++++++++++++++++++++++++-------- layouts/_default/baseof.html | 32 ++++++++++++++++++++++++------ 2 files changed, 56 insertions(+), 14 deletions(-) diff --git a/content/blog/what-is-scd.md b/content/blog/what-is-scd.md index 6f735f7..fa3a1d3 100644 --- a/content/blog/what-is-scd.md +++ b/content/blog/what-is-scd.md @@ -1,12 +1,14 @@ --- title: What are Slowly Changing Dimensions (SCD)? description: Slowly changing dimensions (SCD) are a group of techniques used to track changes to data. -date_created: 2025-12-31 +date_created: 2026-01-30 competencies: - Data Engineering --- -Slowly changing dimensions (SCD) are a group of techniques used to track changes to a row of data. 
There are seven types from Type 0 to Type 6, which trade off accuracy, data complexity and database performance: +**Slowly changing dimensions (SCD) are a group of techniques used to track changes to a row of data**. + +There are seven SCD types from Type 0 to Type 6, which trade off accuracy, data complexity and database performance: - **Type 0**: Never update - **Type 1**: Overwrite, with no history kept @@ -18,19 +20,23 @@ Slowly changing dimensions (SCD) are a group of techniques used to track changes ## Type 0 -Only use for immutable data. If your data changes (i.e. is not immutable) - avoid using this. +**Type 0 is never updating your data**. + +You should only use this for immutable data. If your data changes, avoid using this. ## Type 1 -Actively harmful to data integrity, should be avoided except in very specific cases. +**Type 1 overwrites existing data with no history**. -Only use for correcting data entry problems, or things that aren't worth keeping (like typos in a name). Avoid. +This is actively harmful to data integrity, should be avoided except in very specific cases. + +You should only use this for correcting data entry problems, or things that aren't worth keeping (like typos in a name). ## Type 2 -Type 2 is the most common when you need audit trails or point-in-time analysis. Gold standard for analytics & reporting. +**Type 2 adds extra rows and columns to track history**. SCD Type 2 is the gold standard for analytics. It enables audit trails and point-in-time analysis. 
-A common type of SCD is SCD Type 2, where a row has (in addition to other columns): +In SCD Type 2 a row has (in addition to the data columns): - **Start date**: When this version of the row became active - **End date**: When this version was superseded @@ -63,7 +69,7 @@ sk | customer_id | name | region | start_date | end_date | is_current 2 | 1001 | Alice | Texas | 2024-06-01 | NULL | true ``` -When ingesting data like this, you need: +When reading data like this, you need to join on the date range to get the correct dimension record for the fact's date: ```sql SELECT sk FROM dim_customer @@ -198,3 +204,19 @@ $ duckdb < scd.sql │ 2 │ HT-001 │ Karapiro 1 │ 32.00 │ 2024-07-01 │ NULL │ true │ └───────┴────────────┴────────────┴───────────────┴────────────┴────────────┴────────────┘ ``` + +## Summary + +**Slowly changing dimensions are techniques for tracking how data changes over time, with each type trading off simplicity against historical accuracy**. + +- **Type 0**: Never update, only for truly immutable data +- **Type 1**: Overwrite with no history, only for correcting errors +- **Type 2**: Add rows with date tracking, the gold standard for analytics +- **Type 3**: Add column for previous value, too limited to be useful +- **Type 4**: Separate history table, good when dimensions change frequently +- **Type 5**: Mini-dimension with embedded current values, hybrid of Types 1 and 4 +- **Type 6**: Hybrid of Types 1, 2, and 3, gives current values on historical rows + +**For most analytics use cases, Type 2 is the right choice**. It enables audit trails, point-in-time analysis, and works well with surrogate keys and modern ETL tools like dbt snapshots. + +Thanks for reading! 
diff --git a/layouts/_default/baseof.html b/layouts/_default/baseof.html index 0ad811e..1e25d5a 100644 --- a/layouts/_default/baseof.html +++ b/layouts/_default/baseof.html @@ -39,20 +39,40 @@ css: { maxWidth: 'none', h1: { - marginTop: '1rem', + marginTop: '1.5rem', marginBottom: '1rem', }, h2: { - marginTop: '1em', - marginBottom: '0.5rem', + marginTop: '1.5rem', + marginBottom: '0.75rem', }, h3: { - marginTop: '1rem', + marginTop: '1.25rem', marginBottom: '0.75rem', }, h4: { - marginTop: '0.875rem', - marginBottom: '0.5rem', + marginTop: '1rem', + marginBottom: '0.75rem', + }, + p: { + marginTop: '0.75rem', + marginBottom: '0.75rem', + }, + figure: { + marginTop: '1rem', + marginBottom: '1rem', + }, + pre: { + marginTop: '1rem', + marginBottom: '1rem', + }, + ul: { + marginTop: '0.75rem', + marginBottom: '0.75rem', + }, + ol: { + marginTop: '0.75rem', + marginBottom: '0.75rem', }, 'code::before': { content: '""', // Removes the backtick From d5da13b5ff1f1e5f0ff0d8df8a57e51771cc6216 Mon Sep 17 00:00:00 2001 From: Adam Green Date: Fri, 13 Mar 2026 01:48:05 +1300 Subject: [PATCH 3/3] fea --- content/blog/what-is-scd.md | 85 +++++++++++++++++++++++++++++++++++++ 1 file changed, 85 insertions(+) diff --git a/content/blog/what-is-scd.md b/content/blog/what-is-scd.md index fa3a1d3..9852c76 100644 --- a/content/blog/what-is-scd.md +++ b/content/blog/what-is-scd.md @@ -42,6 +42,8 @@ In SCD Type 2 a row has (in addition to the data columns): - **End date**: When this version was superseded - **Is current flag**: Whether this is the latest version +The `is_current` flag is technically redundant with `end_date IS NULL`, but it's worth keeping. A boolean column is easier to index than scanning for NULLs, and `WHERE is_current = true` is more readable than `WHERE end_date IS NULL`. The trade-off is two sources of truth that can drift if your ETL has bugs. + Updating an existing row in a SCD type 2 table: 1. 
**Insert**: Add a new row with the updated values @@ -78,10 +80,14 @@ WHERE customer_id = 1001 AND (transaction_date < end_date OR end_date IS NULL) ``` +This uses a half-open interval `[start_date, end_date)` where `start_date` is inclusive and `end_date` is exclusive. This convention prevents off-by-one bugs and ensures a date always falls into exactly one version, with no gaps or overlaps between consecutive records. + Surrogates put the complexity of managing slowly changing dimensions into ETL (one place, tested once) rather than requiring users to correct filters in every downstream query. When you load a fact with a transaction date in the past, after the dimension has already changed, the surrogate key lookup needs to find the historical dimension record, not the current one. +The harder case is late-arriving dimension changes. You discover that a dimension changed at some point in the past, but you've already loaded facts against the old version. You need to split the existing version row at the backdated change date, creating a new historical record, and then re-key any facts that fall into the new period. This is one of the most operationally painful SCD 2 scenarios, and most teams handle it with a manual backfill process. + You can also skip surrogate keys in facts and do the date-range join at query time instead. Simpler ETL, but slower queries and easier to get wrong. 
Risks of skipping surrogate keys: @@ -115,6 +121,23 @@ When a dimension record is deleted, options are: - **Status marker**: Keep row with "deleted" status - **Hard delete**: Actually delete (breaks referential integrity) +dbt snapshots handle this with the `hard_deletes` config (v1.9+, replacing the older `invalidate_hard_deletes`): + +- **`ignore`**: Default, deleted source rows are not tracked and `dbt_valid_to` stays `NULL` +- **`invalidate`**: Sets `dbt_valid_to` on the snapshot row when the source row disappears, closing out the record +- **`new_record`**: Inserts a new snapshot row with a `dbt_is_deleted` column set to `True`, preserving continuous history + +```yaml +snapshots: + - name: snap_turbine + config: + hard_deletes: new_record + strategy: timestamp + updated_at: updated_at +``` + +`new_record` is the most complete option. If a source record is deleted and later restored, dbt tracks both events as separate rows, giving you a full audit trail of the deletion and restoration. + ## Example in DuckDB SQL The example below shows a hydro turbine that is upgraded from 25 MW to 32 MW capacity, and how to track that with SCD Type 2: @@ -205,6 +228,60 @@ $ duckdb < scd.sql └───────┴────────────┴────────────┴───────────────┴────────────┴────────────┴────────────┘ ``` +## Data Quality + +SCD Type 2 tables are prone to gaps and overlaps between version records. A gap means there's a period where no version is active for a business key, and facts in that window join to nothing. An overlap means a fact joins to multiple versions, causing fan-out. 
+ +Check for these with a self-join that compares consecutive versions: + +```sql +SELECT + a.turbine_id, + a.end_date AS prev_end, + b.start_date AS next_start, + CASE + WHEN a.end_date < b.start_date THEN 'gap' + WHEN a.end_date > b.start_date THEN 'overlap' + END AS issue +FROM dim_turbine a +JOIN dim_turbine b + ON a.turbine_id = b.turbine_id + AND a.end_date IS NOT NULL + AND b.start_date > a.start_date +WHERE a.end_date != b.start_date AND b.start_date = (SELECT MIN(c.start_date) FROM dim_turbine c WHERE c.turbine_id = a.turbine_id AND c.start_date > a.start_date) +ORDER BY a.turbine_id, a.start_date; +``` + +Run this as a scheduled data quality check. In dbt, this is a good candidate for a custom test. + +## Performance + +Type 2 tables only grow. Every change adds a row, and nothing is ever deleted. For high-churn dimensions this becomes a problem for query performance. + +Strategies to manage table growth: + +- **Partition by `is_current`**: Most queries only need the current state, so partitioning lets the engine skip all historical rows +- **Create a current-state view**: A `dim_turbine_current` view with `WHERE is_current = true` baked in hides the filter from downstream users and ensures consistency +- **Consider Type 4**: If the current-state query path dominates and you rarely need history, move history to a separate table + +Indexing matters for SCD 2 tables at scale: + +- **Surrogate key lookup**: Index on `(business_key, is_current)` for the common pattern of finding the current record +- **Date-range joins**: Composite index on `(business_key, start_date, end_date)` for point-in-time lookups against fact tables +- **Covering indexes**: Include frequently queried columns to avoid table lookups entirely + +## Surrogate Keys in Distributed Systems + +Auto-increment keys don't work when multiple workers write to the same dimension table concurrently, which is the default in Spark and Databricks pipelines. + +Alternatives: + +- **Deterministic hash**: Hash the business key and start date, e.g.
`md5(concat(turbine_id, start_date))`, giving you a reproducible surrogate key that any worker can compute independently +- **Monotonically increasing ID**: Spark's `monotonically_increasing_id()` is unique within a job but not across runs, so it's only safe if you're doing a full rebuild each time +- **Centralized sequence**: Use a database sequence or Delta Lake's identity columns if you need strict ordering, at the cost of a coordination bottleneck + +The deterministic hash approach is the most common in practice because it's idempotent. Re-running the same pipeline produces the same surrogate keys. + ## Summary **Slowly changing dimensions are techniques for tracking how data changes over time, with each type trading off simplicity against historical accuracy**. @@ -219,4 +296,12 @@ $ duckdb < scd.sql **For most analytics use cases, Type 2 is the right choice**. It enables audit trails, point-in-time analysis, and works well with surrogate keys and modern ETL tools like dbt snapshots. +- **Half-open intervals**: Use `[start_date, end_date)` to avoid off-by-one bugs in date-range joins +- **`is_current` flag**: Redundant with `end_date IS NULL` but worth keeping for indexing and readability +- **Late-arriving dimensions**: The hardest operational scenario, requiring row splits and fact re-keying +- **Data quality**: Check for gaps and overlaps between consecutive version records with a self-join +- **Performance**: Partition by `is_current`, create current-state views, and index on `(business_key, start_date, end_date)` +- **Distributed surrogate keys**: Use deterministic hashes in Spark/Databricks since auto-increment doesn't work across parallel workers +- **dbt `hard_deletes`**: Use `new_record` (v1.9+) to track deletions and restorations as separate snapshot rows + Thanks for reading!
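
The deterministic hash approach from the distributed surrogate keys section can be sketched in a few lines. This is a minimal illustration in Python rather than Spark or SQL, and it is not taken from the post. The `surrogate_key` function name and the `|` delimiter between the business key and the start date are assumptions. The delimiter is there so that two different key and date pairs cannot concatenate to the same string.

```python
import hashlib
from datetime import date


def surrogate_key(business_key: str, start_date: date) -> str:
    """Build a reproducible surrogate key for one version of a dimension row.

    Hypothetical helper: the name and the "|" delimiter are assumptions,
    not something defined in the post.
    """
    # Join the parts with a delimiter so distinct (key, date) pairs
    # cannot collapse into the same concatenated string.
    raw = f"{business_key}|{start_date.isoformat()}"
    # MD5 is used here only as a stable fingerprint, not for security.
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


# Two versions of the same turbine get different keys.
old_version = surrogate_key("HT-001", date(2020, 1, 1))
new_version = surrogate_key("HT-001", date(2024, 7, 1))

# Re-running the pipeline for the same version gives the same key.
rerun = surrogate_key("HT-001", date(2024, 7, 1))

print(old_version != new_version, new_version == rerun)
```

Because the key depends only on the business key and the version start date, any worker can compute it without coordination. The cost is that a backdated change to `start_date` also changes the key, so facts pointing at the old key would need re-keying.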