Implemented a minimal but production-minded ingestion and normalization flow for a mock third‑party SaaS source.
Part 1: Custom Airbyte Python CDK source reading from local JSON fixtures (no external APIs or keys).
- Streams: `accounts`, `users`, `events`, `invoices`.
- Pagination: implemented via `next_cursor` fields in the fixtures.
- Incremental sync: based on `updated_at`, with proper Airbyte state.
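The pagination contract above can be sketched as a small reader that follows `next_cursor` links through local fixture pages. This is a minimal sketch; the actual file layout under `fixtures/` is an assumption here, not the connector's real naming scheme.

```python
import json
from pathlib import Path

def read_pages(fixtures_dir: str, stream: str):
    """Follow next_cursor links through local JSON fixture pages.

    Assumes each page is stored as <stream>_<cursor>.json with the first
    page at <stream>_page1.json -- the real fixture layout may differ.
    """
    cursor = "page1"  # hypothetical name of the first page
    while cursor:
        page = json.loads(Path(fixtures_dir, f"{stream}_{cursor}.json").read_text())
        yield from page.get("records", [])
        cursor = page.get("next_cursor")  # None on the last page ends the loop

# Usage: for rec in read_pages("fixtures", "accounts"): ...
```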
Part 2:
- Normalisation: Local pipeline producing analytics‑ready tables from the raw Airbyte output.
All components run locally and only depend on this repo and Python.
- Python: 3.11 (dependency issues occur with newer versions).
- Ensure the input JSON files are kept in the `fixtures/` folder.
Clone the repository to your machine:

```bash
git clone <your-repo-url> DataConnector_Airbyte
cd DataConnector_Airbyte
```

Create and activate a virtual environment (ensure Python 3.11):

```bash
python3 -m venv .venv
source .venv/bin/activate   # macOS / Linux
# .venv\Scripts\activate    # Windows PowerShell
```

Install dependencies:

```bash
pip install -r requirements.txt
```

The connector treats the `fixtures/` folder in this repo as the upstream API.
Create a minimal config file at `config.json` in the repo root:

```json
{
  "fixtures_base_path": "fixtures"
}
```

An empty `state.json` can be used to drive incremental syncs:

```json
{}
```

`main.py` exposes the standard Airbyte CLI interface.
Inspect the connector specification:

```bash
python main.py spec
```

This prints an Airbyte SPEC message describing the config schema (only `fixtures_base_path`).
Validate that the connector can read the local fixtures:

```bash
python main.py check --config config.json
```

You should see an Airbyte CONNECTION_STATUS message with `"status": "SUCCEEDED"`. If it fails, it will list missing fixture files or an invalid path.
List the available streams (accounts, users, events, invoices):

```bash
python main.py discover --config config.json
```

This prints a CATALOG message with stream schemas and supported sync modes (full refresh and incremental).
Run a read to fetch data from all streams:

```bash
python main.py read \
  --config config.json \
  --catalog configured_catalog.json \
  --state state.json \
  > raw_output.jsonl
```

Notes:
- Output is a stream of Airbyte messages, one JSON object per line (RECORD and STATE messages).
- Pagination is handled internally via the `next_cursor` fields inside the fixtures.
- Incremental sync:
  - The connector tracks the max `updated_at` per stream and emits it as STATE messages.
  - On subsequent runs, records with `updated_at` earlier than or equal to the stored cursor are skipped.
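The read output and incremental rule above can be consumed with a few lines of Python. A sketch only: RECORD lines follow the standard Airbyte protocol, but the exact shape of this connector's STATE payload is an assumption here.

```python
import json

# Two illustrative Airbyte protocol lines, as found in raw_output.jsonl.
SAMPLE = (
    '{"type": "RECORD", "record": {"stream": "accounts", '
    '"data": {"id": "a1", "updated_at": "2024-01-02T00:00:00Z"}, "emitted_at": 1704153600000}}\n'
    '{"type": "STATE", "state": {"data": {"accounts": {"updated_at": "2024-01-02T00:00:00Z"}}}}\n'
)

def iter_new_records(lines, state=None):
    """Yield (stream, data) for RECORDs newer than the stored per-stream cursor."""
    state = state or {}
    for line in lines:
        msg = json.loads(line)
        if msg.get("type") != "RECORD":
            continue  # skip STATE / LOG messages
        rec = msg["record"]
        cursor = state.get(rec["stream"], {}).get("updated_at", "")
        # ISO-8601 UTC timestamps compare correctly as plain strings
        if rec["data"].get("updated_at", "") > cursor:
            yield rec["stream"], rec["data"]
```

With an empty state every record is emitted; with a stored cursor equal to the record's `updated_at`, the record is skipped, matching the "earlier than or equal" rule above.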
The file `normalize.py` implements a small, local normalization step:
- Reads `raw_output.jsonl` (the Airbyte output).
- Extracts RECORD messages per stream.
- Builds:
  - `dim_accounts`: an account dimension with unified status and billing fields.
  - `fact_invoices`: an invoice fact table at the invoice grain.
- Exports these tables as CSV files into an `analytics/` folder.
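The extract-and-export steps above can be sketched as follows. This is a minimal illustration; `normalize.py` itself may structure the code differently.

```python
import csv
import json
from collections import defaultdict

def records_by_stream(lines):
    """Group Airbyte RECORD payloads from a JSONL stream by stream name."""
    grouped = defaultdict(list)
    for line in lines:
        msg = json.loads(line)
        if msg.get("type") == "RECORD":
            grouped[msg["record"]["stream"]].append(msg["record"]["data"])
    return grouped

def write_csv(rows, out_path):
    """Export one table as CSV, deriving the header from the first row."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```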
After you have a `raw_output.jsonl`:

```bash
python normalize.py
```

Outputs:

- `analytics/dim_accounts.csv`
- `analytics/fact_invoices.csv`
You can inspect them with DuckDB directly:
```bash
duckdb
```

```sql
SELECT * FROM 'analytics/dim_accounts.csv' LIMIT 5;
SELECT * FROM 'analytics/fact_invoices.csv' LIMIT 5;
```
The normalization step intentionally smooths schema differences across fixture pages:
- Accounts
  - Some pages use `account_status` + `billing.country`/`billing.city`.
  - Later pages use `status` + `address.billing_country`/`address.billing_city`.
  - `normalize.py` maps these into a single shape: `status`, `billing_country`, `billing_city`.
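That mapping can be sketched as a single function. The field names below (`id`, and the nested `billing`/`address` objects) are assumptions based on the description above, not the exact code in `normalize.py`.

```python
def unify_account(raw: dict) -> dict:
    """Map either fixture page shape onto the single dim_accounts shape."""
    billing = raw.get("billing", {})   # older page shape
    address = raw.get("address", {})   # later page shape
    return {
        "id": raw["id"],
        "status": raw.get("status") or raw.get("account_status"),
        "billing_country": billing.get("country") or address.get("billing_country"),
        "billing_city": billing.get("city") or address.get("billing_city"),
    }
```

Because each page uses exactly one of the two shapes, the `or`-fallbacks pick whichever field is present.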
- Invoices
  - All invoices are kept at invoice grain (`id`, `account_id`, `amount`, `currency`, `status`, timestamps).
  - Line item detail is intentionally omitted from the first version of `fact_invoices` to keep the model simple.
These two tables (dim_accounts, fact_invoices) are analytics‑ready: they have stable schemas, consistent keys, and correct grain.
You now have both the raw Airbyte output as well as two curated tables under analytics/.
Assumptions
- The provided JSON fixtures are an accurate representation of the external SaaS API, including its pagination (`next_cursor`) and incremental contract (`updated_at`).
- We only need a single global cursor per stream (the max `updated_at`), and late‑arriving updates are acceptable as long as they increase `updated_at`.
- The local filesystem is an acceptable stand‑in for HTTP calls for this exercise.
Time‑boxed trade‑offs
- Normalisation is kept intentionally small:
  - Only two analytics tables (`dim_accounts`, `fact_invoices`) are produced, although more (e.g. `fact_events`, `dim_users`, `fact_invoice_line_items`) could be added.
  - No complex type‑casting or currency handling beyond what the fixtures provide.
- Robust error handling and logging are minimal, but the structure is ready for production‑grade enhancements.
- The connector encodes a fixture‑specific `page_map` rather than a fully generic paginator.
Possible next steps

- Richer modelling
  - Add `fact_events`, `dim_users`, and `fact_invoice_line_items` with proper foreign keys.
  - Introduce slowly changing dimension (SCD) handling for account attribute changes.
- Connector hardening
  - Add configuration for per‑stream start timestamps.
  - Add retry / backoff and HTTP client configuration for a real API.
  - Emit more structured logs and metrics (per‑stream row counts, latency).
- Testing & CI
  - Unit tests for incremental behaviour and schema‑drift handling.
  - A CI pipeline to run connector acceptance tests and normalisation regression tests.
This repo demonstrates the core ingestion + normalisation pattern using only local files and Python, which can be lifted into a production stack with relatively small changes.