This project demonstrates a cloud-native Data Lakehouse on AWS, built around the Gold layer of the Medallion Architecture. It ingests financial data from a Trading API/Platform, converts raw CSV to Parquet via an AWS Glue ETL job, and exposes a curated Gold dataset to multiple consumer roles via three independent query engines, with an optional ML/AI path.
Raw financial data is streamed from a Trading API/Platform and stored as CSV in S3. An AWS Glue ETL job converts the raw CSV into Parquet format and loads it into the Gold S3 bucket. Access to both the raw bucket and Glue job is governed by S3 + Glue IAM policies scoped per user/role.
From the Gold dataset, three independent query engines are available:
- QE 1 — AWS Athena (Cloud): Data Engineers and Analysts query the Gold dataset via Apache Iceberg tables with partitioning using AWS Athena, enabling versioned, ACID-compliant serverless SQL analytics directly in the cloud.
- QE 2 — On-Premises Trino (Docker): Data Engineers and Analysts can run high-performance SQL queries locally via a Trino query engine deployed in Docker, connecting back to the Gold S3 dataset.
- QE 3 — On-Premises Spark (Docker): Data Engineers and Analysts can run large-scale distributed analytics locally via an Apache Spark query engine deployed in Docker, also connecting to the Gold S3 dataset.
Additionally, ML/AI Engineers have read access to the Gold dataset to optionally train and deploy ML models — this path is separate from the three query engines.
| Component | Description |
|---|---|
| Trading API / Platform | Source of real-time financial market data |
| S3 Raw Bucket | Landing zone storing ingested data as CSV |
| Component | Description |
|---|---|
| AWS Glue ETL Job | Converts raw CSV to Parquet format |
| S3 Gold Bucket | Stores the curated, query-optimised Gold dataset |
| S3 + Glue IAM Policies | Scoped access policies enforced per user/role |
| Component | Description |
|---|---|
| AWS Athena + Apache Iceberg Tables (QE 1) | Serverless cloud SQL querying on partitioned, versioned Iceberg tables in S3 |
| On-Prem Trino Query Engine (QE 2) | Local high-performance SQL analytics via Docker |
| On-Prem Spark Query Engine (QE 3) | Local distributed data processing and analytics via Docker |
| ML/AI Model Training (Optional) | Model training and deployment on the Gold dataset |
| Role | Access |
|---|---|
| Data Engineer + Admin | Full access — ingestion, raw S3 bucket, Glue ETL job |
| Data Engineer / Data Analyst | Gold S3 bucket, Iceberg tables (QE 1), Trino (QE 2), Spark (QE 3) |
| ML / AI Engineer | Read access to Gold S3 dataset |
- Amazon S3 — object storage for raw and Gold layers
- AWS Glue — serverless ETL (CSV → Parquet)
- AWS IAM — role-based access control via S3 + Glue policies
- Apache Iceberg — open table format with partitioning and versioning (QE 1)
- AWS Athena — serverless cloud SQL query engine for Iceberg tables (QE 1)
- Trino — on-premises SQL query engine via Docker (QE 2)
- Apache Spark — on-premises distributed query engine via Docker (QE 3)
- Docker — container runtime for on-premises query engines
- Trading API / Platform — financial market data source
For further assistance, please visit my YouTube channel, I have a fully detailed video explaining how to setup the components of this repository. YouTube Channel: https://www.youtube.com/@BDB5905
%20Mar%202026.png)