2 changes: 1 addition & 1 deletion README.md
@@ -2,7 +2,7 @@

[![Netlify Status](https://api.netlify.com/api/v1/badges/cb1be350-b7c9-4a7f-9774-8daf510a9f65/deploy-status)](https://app.netlify.com/projects/adgefficiency/deploys)

[http://adgefficiency.com](http://adgefficiency.com) is my professional blog.
[The professional blog](http://adgefficiency.com) of [Adam Green](https://www.linkedin.com/in/adgefficiency/).

## Use

197 changes: 149 additions & 48 deletions content/blog/what-is-scd.md
@@ -1,52 +1,54 @@
---
draft: true
title: What are/is Slowly Changing Dimensions (SCD)?
description: TODO
date_created: 2025-12-31
title: What are Slowly Changing Dimensions (SCD)?
description: Slowly changing dimensions (SCD) are a group of techniques used to track changes to data.
date_created: 2026-01-30
competencies:
- Data Engineering
---

Slowly changing dimensions (SCD) are a group of techniques used to track changes to a row of data.
**Slowly changing dimensions (SCD) are a group of techniques used to track changes to a row of data**.

There are six techniques from Type 1 to Type 6, which trade off accuracy, data complexity, database performance:
There are seven SCD types from Type 0 to Type 6, which trade off accuracy, data complexity and database performance:

- Type 0: Never update
- Type 1: Overwrite, with no history kept
- Type 2: Add new row with version & data tracking
- Type 3: Add column for previous value, limited history
- Type 4: Keep history in separate table
- Type 6: Hybrid of types 1, 2, and 3

Type 2 is the most common when you need audit trails or point-in-time analysis.
- **Type 0**: Never update
- **Type 1**: Overwrite, with no history kept
- **Type 2**: Add new row with version & date tracking
- **Type 3**: Add column for previous value, limited history
- **Type 4**: Keep history in separate table
- **Type 5**: Mini-dimension plus embedded current values
- **Type 6**: Hybrid of types 1, 2, and 3

## Type 0

Only use for immutable data. If your data changes - avoid.
**Type 0 is never updating your data**.

You should only use this for immutable data. If your data changes, avoid using this.

## Type 1

Actively harmful to data integrity, should be avoided except in very specific cases.
**Type 1 overwrites existing data with no history**.

Only use for correcting data entry problems, or things that aren't worth keeping (like typos in a name).
This is actively harmful to data integrity and should be avoided except in very specific cases.

Avoid
You should only use this for correcting data entry problems, or things that aren't worth keeping (like typos in a name).

## Type 2

Type 2 is the most common when you need audit trails or point-in-time analysis.
**Type 2 adds extra rows and columns to track history**. It is the gold standard for analytics, enabling audit trails and point-in-time analysis.

In SCD Type 2 a row has (in addition to the data columns):

A common type of SCD is SCD Type 2, where a row has (in addition to other columns):
- **Start date**: When this version of the row became active
- **End date**: When this version was superseded
- **Is current flag**: Whether this is the latest version

- Start date
- End date
- Is current flag
The `is_current` flag is technically redundant with `end_date IS NULL`, but it's worth keeping. A boolean column is easier to index than scanning for NULLs, and `WHERE is_current = true` is more readable than `WHERE end_date IS NULL`. The trade-off is two sources of truth that can drift if your ETL has bugs.

Updating an existing row in a SCD type 2 table:

1. Insert a new row
2. Update the old row `is_current` flag to `False`
3. Update the old row `end_date`
1. **Insert**: Add a new row with the updated values
2. **Flag**: Set the old row `is_current` to `False`
3. **Close**: Set the old row `end_date` to the change date

This will allow you to maintain a history of changes to your data over time.
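These steps can be sketched with Python's stdlib `sqlite3` (schema and values are illustrative, echoing the customer example; a real warehouse would express this in its own SQL dialect):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE dim_customer (
        sk INTEGER PRIMARY KEY,
        customer_id INTEGER,
        region TEXT,
        start_date TEXT,
        end_date TEXT,
        is_current INTEGER
    )
""")
con.execute(
    "INSERT INTO dim_customer VALUES (1, 1001, 'California', '2020-01-01', NULL, 1)"
)

def scd2_update(con, customer_id, new_region, change_date):
    # Steps 2 & 3: close out the old row (flag + end date)
    con.execute(
        """UPDATE dim_customer
           SET is_current = 0, end_date = ?
           WHERE customer_id = ? AND is_current = 1""",
        (change_date, customer_id),
    )
    # Step 1: insert the new version, open-ended
    con.execute(
        """INSERT INTO dim_customer (customer_id, region, start_date, end_date, is_current)
           VALUES (?, ?, ?, NULL, 1)""",
        (customer_id, new_region, change_date),
    )

scd2_update(con, 1001, "Texas", "2024-06-01")
rows = con.execute(
    "SELECT region, start_date, end_date, is_current FROM dim_customer ORDER BY start_date"
).fetchall()
print(rows)
```

The old version is closed at the change date while the new version carries a `NULL` end date, so a point-in-time lookup always resolves to exactly one row.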

@@ -69,54 +71,78 @@ sk | customer_id | name | region | start_date | end_date | is_current
2 | 1001 | Alice | Texas | 2024-06-01 | NULL | true
```

When ingesting data like this, you need:
When reading data like this, you need to join on the date range to get the correct dimension record for the fact's date:

```
```sql
SELECT sk FROM dim_customer
WHERE customer_id = 1001
AND transaction_date >= start_date
AND (transaction_date < end_date OR end_date IS NULL)
```

This uses a half-open interval `[start_date, end_date)` where `start_date` is inclusive and `end_date` is exclusive. This convention prevents off-by-one bugs and ensures a date always falls into exactly one version, with no gaps or overlaps between consecutive records.
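The exactly-one-version property is easy to check with a small sketch (dates and keys here are made up for illustration):

```python
from datetime import date

# Two consecutive versions sharing a boundary date
versions = [
    {"sk": 1, "start": date(2020, 1, 1), "end": date(2024, 6, 1)},
    {"sk": 2, "start": date(2024, 6, 1), "end": None},  # open-ended current version
]

def active_version(versions, d):
    """Return versions active on date d using half-open [start, end) intervals."""
    return [v for v in versions if v["start"] <= d and (v["end"] is None or d < v["end"])]

# The shared boundary date matches exactly one version: the new one
assert len(active_version(versions, date(2024, 6, 1))) == 1
assert active_version(versions, date(2024, 6, 1))[0]["sk"] == 2
```

With a closed interval (`<=` on both ends) the boundary date would match both versions, causing fan-out in fact joins.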

Surrogate keys put the complexity of managing slowly changing dimensions into ETL (one place, tested once) rather than requiring users to correct filters in every downstream query.

When you load a fact with a transaction date in the past, after the dimension has already changed, the surrogate key lookup needs to find the historical dimension record, not the current one.

You can also skip surrogate keys in facts and do the date-range join at query time instead. Simpler ETL, but slower queries and easier to get wrong.
The harder case is late-arriving dimension changes. You discover that a dimension changed at some point in the past, but you've already loaded facts against the old version. You need to split the existing version row at the backdated change date, creating a new historical record, and then re-key any facts that fall into the new period. This is one of the most operationally painful SCD 2 scenarios, and most teams handle it with a manual backfill process.
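A minimal in-memory sketch of that backfill, assuming simple dict records (a real pipeline would do this in SQL inside a transaction):

```python
from datetime import date

# One existing version and facts already keyed to it (illustrative data)
versions = [{"sk": 1, "start": date(2020, 1, 1), "end": None}]
facts = [
    {"fact_date": date(2023, 1, 15), "sk": 1},
    {"fact_date": date(2024, 8, 1), "sk": 1},
]

def backfill(versions, facts, split_date, new_sk):
    """Split the version active at split_date and re-key facts in the new period."""
    for v in versions:
        covers = v["start"] <= split_date and (v["end"] is None or split_date < v["end"])
        if covers:
            # Create the new historical record from the split point onward
            new_version = {"sk": new_sk, "start": split_date, "end": v["end"]}
            v["end"] = split_date  # close the old version at the backdated change
            versions.append(new_version)
            # Re-key facts that now fall into the new period
            for f in facts:
                if f["sk"] == v["sk"] and f["fact_date"] >= split_date:
                    f["sk"] = new_sk
            break
    return versions, facts

versions, facts = backfill(versions, facts, date(2024, 6, 1), new_sk=2)
print([f["sk"] for f in facts])  # the 2024 fact moves to the new version
```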

Risk dropping all current period facts if you don't handle the null `end_date` in a join of fact & dimensions.
You can also skip surrogate keys in facts and do the date-range join at query time instead. Simpler ETL, but slower queries and easier to get wrong.

Risk facts joining to multiple dimensions if the historical periods overlap.
Risks of skipping surrogate keys:

Gold standard for analytics & reporting.
- **Null end dates**: Drop all current period facts if you don't handle the null `end_date` in a join of fact & dimensions
- **Overlapping periods**: Facts join to multiple dimensions if the historical periods overlap

## Type 3

Limited, can't know history at arbitrary past dates

Avoid
**Type 3 adds a column for the previous value**. History is limited to one prior value, so you can't reconstruct state at arbitrary past dates. Avoid it.

## Type 4

Good for historic, best when dimensions change a lot, and you often only want the current state (ie `is_current`)
**Type 4 keeps history in a separate table**. It works best when dimensions change a lot and you often only want the current state (i.e. `is_current`).

## Type 5

**Type 5 is a hybrid of Types 1 and 4**. It uses a mini-dimension table for frequently changing attributes, with the current mini-dimension key also embedded in the main dimension (a Type 1 overwrite). Useful when a small set of attributes change often and you need both current and historical views.

## Type 6

Good when you want to access the current value on the historical row, without needing a self join (like you would for Type2)
**Type 6 combines Types 1, 2, and 3**. It is good when you want to access the current value on the historical row without needing a self-join (as you would for Type 2).

## Deletes

SCDs typically focus on updates. What happens when a dimension record is deleted? Options:
SCDs typically focus on updates.

Soft delete (flag)
Keep row with "deleted" status
Actually delete (breaks referential integrity)
When a dimension record is deleted, options are:

---
- **Soft delete**: Flag the row as deleted
- **Status marker**: Keep row with "deleted" status
- **Hard delete**: Actually delete (breaks referential integrity)
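A minimal soft-delete sketch using Python's `sqlite3` (schema is illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE dim_turbine (sk INTEGER PRIMARY KEY, turbine_id TEXT, is_deleted INTEGER DEFAULT 0)"
)
con.execute("INSERT INTO dim_turbine (turbine_id) VALUES ('HT-001')")

# Soft delete: flag the row instead of removing it, so facts keep a valid reference
con.execute("UPDATE dim_turbine SET is_deleted = 1 WHERE turbine_id = 'HT-001'")

deleted = con.execute(
    "SELECT is_deleted FROM dim_turbine WHERE turbine_id = 'HT-001'"
).fetchone()[0]
print(deleted)
```

Downstream queries then filter on `is_deleted = 0`, which is the same trade-off as `is_current`: one extra predicate everywhere in exchange for intact referential integrity.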

dbt snapshots handle this with the `hard_deletes` config (v1.9+, replacing the older `invalidate_hard_deletes`):

TODO - example in DuckDB SQL for the power MW of a hydro turbine - OR do I leave this for later???
- **`ignore`**: Default, deleted source rows are not tracked and `dbt_valid_to` stays `NULL`
- **`invalidate`**: Sets `dbt_valid_to` on the snapshot row when the source row disappears, closing out the record
- **`new_record`**: Inserts a new snapshot row with a `dbt_is_deleted` column set to `True`, preserving continuous history

```yaml
snapshots:
  - name: snap_turbine
    relation: source('ops', 'turbines') # illustrative source name
    config:
      unique_key: turbine_id # snapshots require a unique key
      hard_deletes: new_record
      strategy: timestamp
      updated_at: updated_at
```

`new_record` is the most complete option. If a source record is deleted and later restored, dbt tracks both events as separate rows, giving you a full audit trail of the deletion and restoration.

## Example in DuckDB SQL

The example below shows a hydro turbine that is upgraded from 25 MW to 32 MW capacity, and how to track that with SCD Type 2:

```sql
-- Create dimension table with SCD Type 2
CREATE TABLE dim_turbine (
sk INTEGER PRIMARY KEY,
@@ -177,7 +203,7 @@ SELECT * FROM dim_turbine WHERE is_current = true;
SELECT * FROM dim_turbine WHERE turbine_id = 'HT-001' ORDER BY start_date;
```

```
```shell-session
$ duckdb < scd.sql
┌─────────────────┬────────────┬──────────────────┬────────────────┬─────────────────────┐
│ generation_date │ name │ capacity_at_time │ generation_mwh │ capacity_factor_pct │
@@ -191,16 +217,91 @@ $ duckdb < scd.sql
│ sk │ turbine_id │ name │ power_mw │ start_date │ end_date │ is_current │
│ int32 │ varchar │ varchar │ decimal(10,2) │ date │ date │ boolean │
├───────┼────────────┼────────────┼───────────────┼────────────┼──────────┼────────────┤
│ 2 │ HT-001 │ Karapiro 1 │ 32.00 │ 2024-07-01 │ │ true │
│ 2 │ HT-001 │ Karapiro 1 │ 32.00 │ 2024-07-01 │ NULL │ true │
└───────┴────────────┴────────────┴───────────────┴────────────┴──────────┴────────────┘
┌───────┬────────────┬────────────┬───────────────┬────────────┬────────────┬────────────┐
│ sk │ turbine_id │ name │ power_mw │ start_date │ end_date │ is_current │
│ int32 │ varchar │ varchar │ decimal(10,2) │ date │ date │ boolean │
├───────┼────────────┼────────────┼───────────────┼────────────┼────────────┼────────────┤
│ 1 │ HT-001 │ Karapiro 1 │ 25.00 │ 2020-01-01 │ 2024-07-01 │ false │
│ 2 │ HT-001 │ Karapiro 1 │ 32.00 │ 2024-07-01 │ │ true │
│ 2 │ HT-001 │ Karapiro 1 │ 32.00 │ 2024-07-01 │ NULL │ true │
└───────┴────────────┴────────────┴───────────────┴────────────┴────────────┴────────────┘
```

TODO
- change karapiro to benmore or something
## Data Quality

SCD Type 2 tables are prone to gaps and overlaps between version records. A gap means there's a period where no version is active for a business key, and facts in that window join to nothing. An overlap means a fact joins to multiple versions, causing fan-out.

Check for these with a self-join that compares consecutive versions:

```sql
SELECT
a.turbine_id,
a.end_date AS prev_end,
b.start_date AS next_start,
CASE
WHEN a.end_date < b.start_date THEN 'gap'
WHEN a.end_date > b.start_date THEN 'overlap'
END AS issue
FROM dim_turbine a
JOIN dim_turbine b
ON a.turbine_id = b.turbine_id
AND a.end_date IS NOT NULL
AND b.start_date > a.start_date
WHERE a.end_date != b.start_date
ORDER BY a.turbine_id, a.start_date;
```

Run this as a scheduled data quality check. In dbt, this is a good candidate for a custom test.

## Performance

Type 2 tables only grow. Every change adds a row, and nothing is ever deleted. For high-churn dimensions this becomes a problem for query performance.

Strategies to manage table growth:

- **Partition by `is_current`**: Most queries only need the current state, so partitioning lets the engine skip all historical rows
- **Create a current-state view**: A `dim_turbine_current` view with `WHERE is_current = true` baked in hides the filter from downstream users and ensures consistency
- **Consider Type 4**: If the current-state query path dominates and you rarely need history, move history to a separate table

Indexing matters for SCD 2 tables at scale:

- **Surrogate key lookup**: Index on `(business_key, is_current)` for the common pattern of finding the current record
- **Date-range joins**: Composite index on `(business_key, start_date, end_date)` for point-in-time lookups against fact tables
- **Covering indexes**: Include frequently queried columns to avoid table lookups entirely
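A sketch of the first two patterns in SQLite (index names and schema are assumptions; each engine has its own indexing features, such as partial and covering indexes):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE dim_turbine (
        sk INTEGER PRIMARY KEY,
        turbine_id TEXT,
        start_date TEXT,
        end_date TEXT,
        is_current INTEGER
    )
""")
# Surrogate key lookup: business key plus current flag
con.execute("CREATE INDEX idx_current ON dim_turbine (turbine_id, is_current)")
# Date-range joins: business key plus validity window
con.execute("CREATE INDEX idx_range ON dim_turbine (turbine_id, start_date, end_date)")

# Inspect which index the planner chooses for a current-record lookup
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT sk FROM dim_turbine WHERE turbine_id = ? AND is_current = 1",
    ("HT-001",),
).fetchall()
print(plan)
```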

## Surrogate Keys in Distributed Systems

Auto-increment keys don't work when multiple workers write to the same dimension table concurrently, which is the default in Spark and Databricks pipelines.

Alternatives:

- **Deterministic hash**: Hash the business key and start date, e.g. `md5(concat(turbine_id, start_date))`, giving you a reproducible surrogate key that any worker can compute independently
- **Monotonically increasing ID**: Spark's `monotonically_increasing_id()` is unique within a job but not across runs, so it's only safe if you're doing a full rebuild each time
- **Centralized sequence**: Use a database sequence or Delta Lake's identity columns if you need strict ordering, at the cost of a coordination bottleneck

The deterministic hash approach is the most common in practice because it's idempotent. Re-running the same pipeline produces the same surrogate keys.
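A sketch of the deterministic-hash approach (the helper name and field separator are assumptions; the separator guards against collisions between concatenated fields):

```python
import hashlib

def surrogate_key(business_key: str, start_date: str) -> str:
    """Deterministic surrogate key any worker can compute independently."""
    return hashlib.md5(f"{business_key}|{start_date}".encode()).hexdigest()

# Idempotent: re-running the pipeline reproduces the same key
k1 = surrogate_key("HT-001", "2024-07-01")
k2 = surrogate_key("HT-001", "2024-07-01")
assert k1 == k2

# Each new version (new start date) gets its own key
assert surrogate_key("HT-001", "2020-01-01") != k1
print(k1)
```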

## Summary

**Slowly changing dimensions are techniques for tracking how data changes over time, with each type trading off simplicity against historical accuracy**.

- **Type 0**: Never update, only for truly immutable data
- **Type 1**: Overwrite with no history, only for correcting errors
- **Type 2**: Add rows with date tracking, the gold standard for analytics
- **Type 3**: Add column for previous value, too limited to be useful
- **Type 4**: Separate history table, good when dimensions change frequently
- **Type 5**: Mini-dimension with embedded current values, hybrid of Types 1 and 4
- **Type 6**: Hybrid of Types 1, 2, and 3, gives current values on historical rows

**For most analytics use cases, Type 2 is the right choice**. It enables audit trails, point-in-time analysis, and works well with surrogate keys and modern ETL tools like dbt snapshots.

- **Half-open intervals**: Use `[start_date, end_date)` to avoid off-by-one bugs in date-range joins
- **`is_current` flag**: Redundant with `end_date IS NULL` but worth keeping for indexing and readability
- **Late-arriving dimensions**: The hardest operational scenario, requiring row splits and fact re-keying
- **Data quality**: Check for gaps and overlaps between consecutive version records with a self-join
- **Performance**: Partition by `is_current`, create current-state views, and index on `(business_key, start_date, end_date)`
- **Distributed surrogate keys**: Use deterministic hashes in Spark/Databricks since auto-increment doesn't work across parallel workers
- **dbt `hard_deletes`**: Use `new_record` (v1.9+) to track deletions and restorations as separate snapshot rows

Thanks for reading!
58 changes: 52 additions & 6 deletions layouts/_default/baseof.html
@@ -39,20 +39,40 @@
css: {
maxWidth: 'none',
h1: {
marginTop: '1rem',
marginTop: '1.5rem',
marginBottom: '1rem',
},
h2: {
marginTop: '1em',
marginBottom: '0.5rem',
marginTop: '1.5rem',
marginBottom: '0.75rem',
},
h3: {
marginTop: '1rem',
marginTop: '1.25rem',
marginBottom: '0.75rem',
},
h4: {
marginTop: '0.875rem',
marginBottom: '0.5rem',
marginTop: '1rem',
marginBottom: '0.75rem',
},
p: {
marginTop: '0.75rem',
marginBottom: '0.75rem',
},
figure: {
marginTop: '1rem',
marginBottom: '1rem',
},
pre: {
marginTop: '1rem',
marginBottom: '1rem',
},
ul: {
marginTop: '0.75rem',
marginBottom: '0.75rem',
},
ol: {
marginTop: '0.75rem',
marginBottom: '0.75rem',
},
'code::before': {
content: '""', // Removes the backtick
@@ -195,6 +215,32 @@
>Who Am I</span
>
</a>
<a
href="/consulting"
class="group flex items-center text-white p-2 rounded-lg hover:bg-emerald-500 hover:text-emerald-50 hover:-translate-y-1 hover:shadow-lg transition-all duration-200"
aria-label="Consulting"
>
<svg
xmlns="http://www.w3.org/2000/svg"
width="24"
height="24"
viewBox="0 0 24 24"
fill="none"
stroke="currentColor"
stroke-width="2"
stroke-linecap="round"
stroke-linejoin="round"
class="flex-shrink-0"
>
<rect x="2" y="3" width="20" height="14" rx="2" ry="2"></rect>
<line x1="8" y1="21" x2="16" y2="21"></line>
<line x1="12" y1="17" x2="12" y2="21"></line>
</svg>
<span
class="max-w-0 overflow-hidden whitespace-nowrap group-hover:max-w-xs group-hover:ml-2 transition-all duration-300"
>Work With Me</span
>
</a>
{{ partial "dark-mode-toggle.html" . }}
<a
href="https://github.com/ADGEfficiency/adgefficiency.com"