feat: enhance data validation with comprehensive profiling and Soda Core integration #30

Merged
rprabhat merged 5 commits into main from sugary-walnut
Apr 6, 2026

@rprabhat rprabhat commented Apr 6, 2026

Summary

This PR enhances the data validation pipeline with:

  • Comprehensive data profiling in validate.py (per-column statistics, null checks, calculation reconciliation)
  • JSON/HTML report generation for validation results
  • Soda Core integration for data quality checks
  • Updated CI/CD pipelines to include data quality validation
  • Test suite for calculation verification
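A minimal sketch of what per-column profiling, null checks, and calculation reconciliation could look like with pandas. This is not the code in validate.py; the function names, the sample columns (`land_value`, `building_value`, `total_value`), and the tolerance are assumptions made for illustration only.

```python
import json

import pandas as pd


def profile_dataframe(df: pd.DataFrame) -> dict:
    """Return per-column statistics as plain Python types (JSON-friendly)."""
    profile = {}
    for column in df.columns:
        series = df[column]
        stats = {
            "dtype": str(series.dtype),
            "null_count": int(series.isna().sum()),
            "distinct_count": int(series.nunique(dropna=True)),
        }
        # Range and mean only make sense for numeric columns.
        if pd.api.types.is_numeric_dtype(series):
            stats["min"] = float(series.min())
            stats["max"] = float(series.max())
            stats["mean"] = float(series.mean())
        profile[column] = stats
    return profile


def reconcile_total(df: pd.DataFrame, parts: list, total: str, tolerance: float = 0.01) -> int:
    """Count rows where the sum of `parts` differs from `total` by more than `tolerance`."""
    diff = (df[parts].sum(axis=1) - df[total]).abs()
    return int((diff > tolerance).sum())


# Hypothetical sample data; the real column names are not shown in this PR.
sample = pd.DataFrame(
    {
        "units": [2, 3, None],
        "land_value": [100.0, 200.0, 300.0],
        "building_value": [50.0, 25.0, 10.0],
        "total_value": [150.0, 225.0, 999.0],
    }
)
report = {
    "profile": profile_dataframe(sample),
    "reconciliation_mismatches": reconcile_total(
        sample, ["land_value", "building_value"], "total_value"
    ),
}
print(json.dumps(report, indent=2))
```

Returning plain dicts keeps the result easy to serialise with `json.dumps`, and the same structure could feed an HTML template for the report.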

Changes

  • Enhanced backend/etl/validate.py with full data profiling capabilities
  • Added soda-core-duckdb dependency to requirements.txt
  • Created Soda Core YAML check files for sales, growth, and summary data
  • Added dq_runner.py to orchestrate Soda scans
  • Updated .github/workflows/etl-pipeline.yml with DQ validation and R2 upload steps
  • Enhanced .github/workflows/ci.yml to include ETL and data quality tests
  • Added tests/test_data_quality.py for calculation verification
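For readers unfamiliar with Soda Core, here is a rough sketch of how a runner might wrap the `soda scan` CLI with `subprocess`. It is not the contents of dq_runner.py: the function names, the data source name `sales`, and the file paths are hypothetical, and treating any non-zero exit code as "not passed" is a simplifying assumption.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def build_scan_command(data_source: str, config_path: Path, checks_path: Path) -> list:
    """Build a `soda scan` invocation for a single checks file."""
    return ["soda", "scan", "-d", data_source, "-c", str(config_path), str(checks_path)]


def run_scan(data_source: str, config_path: Path, checks_path: Path, runner=subprocess.run) -> dict:
    """Run one scan and summarise the outcome as a JSON-serialisable dict.

    `runner` is injectable so the function can be exercised without Soda installed.
    """
    command = build_scan_command(data_source, config_path, checks_path)
    completed = runner(command, capture_output=True, text=True)
    return {
        "checks_file": str(checks_path),
        "exit_code": completed.returncode,
        # Assumption: only exit code 0 counts as a clean scan.
        "passed": completed.returncode == 0,
        "finished_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical names; shows the command that would be executed.
print(build_scan_command("sales", Path("configuration.yml"), Path("checks/sales.yml")))
```

Keeping command construction separate from execution makes the runner testable in CI jobs where the Soda CLI is not available.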

Testing

  • All validation tests pass locally
  • Data quality checks execute successfully
  • ETL pipeline runs with enhanced validation steps
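As an illustration of the kind of calculation verification mentioned above, the sketch below recomputes a growth percentage and compares it with hand-computed values. The helper `yoy_growth_pct`, the test class name, and the choice to treat a zero baseline as undefined are all assumptions; the actual tests in tests/test_data_quality.py are not shown in this PR.

```python
import pandas as pd


def yoy_growth_pct(current: pd.Series, previous: pd.Series) -> pd.Series:
    """Year-over-year growth in percent; NaN where the previous value is 0."""
    previous = previous.where(previous != 0)  # mask zero baselines as NaN
    return (current - previous) / previous * 100


class TestGrowthCalculation:
    def test_matches_hand_computed_values(self):
        df = pd.DataFrame({"previous": [100.0, 200.0], "current": [110.0, 150.0]})
        growth = yoy_growth_pct(df["current"], df["previous"])
        assert growth.round(6).tolist() == [10.0, -25.0]

    def test_zero_baseline_is_undefined(self):
        growth = yoy_growth_pct(pd.Series([5.0]), pd.Series([0.0]))
        assert growth.isna().all()
```

With pytest, saving this as a `test_*.py` file is enough for the class to be collected.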

…ore integration

- Enhanced validate.py with per-column statistics, data profiling, and calculation reconciliation
- Added JSON/HTML report generation for validation results
- Installed soda-core-duckdb dependency
- Created Soda Core YAML check files for sales, growth, and summary data
- Added dq_runner.py to execute Soda scans and output JSON results
- Updated ETL pipeline workflow with DQ validation steps and R2 artifact upload
- Enhanced CI workflow to include ETL and data quality test execution
- Added test suite for data quality calculations verification
import argparse
import json
import logging
import os

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'os' is not used.

Copilot Autofix

AI about 23 hours ago

To fix an unused import, the general approach is to delete the import line for the unused module, ensuring that no remaining code references it. This reduces unnecessary dependencies and noise without changing runtime behavior.

Here, the best fix is to remove the import os statement on line 15 in backend/etl/data_quality/dq_runner.py. No other code changes are required, since nothing in the provided snippet uses os. This change preserves all existing functionality, as the remaining imports (argparse, json, logging, subprocess, sys, datetime, Path, Dict, Any) still cover the used features.

Concretely, edit backend/etl/data_quality/dq_runner.py to delete the import os line in the import block near the top of the file; no new methods, imports, or definitions are needed.
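To see the idea behind an unused-import check, the following simplified sketch uses Python's `ast` module. It is not how CodeQL implements its query, and it ignores cases such as `__all__` re-exports and names used only inside string annotations; the function name `find_unused_imports` is made up for this example.

```python
import ast


def find_unused_imports(source: str) -> list:
    """Return names bound by import statements that are never read in `source`."""
    tree = ast.parse(source)
    imported = {}
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`.
                name = alias.asname or alias.name.split(".")[0]
                imported[name] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported[alias.asname or alias.name] = node.lineno
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            # Attribute access like `json.dumps` reads the name `json`.
            used.add(node.id)
    return sorted(name for name in imported if name not in used)


snippet = """
import argparse
import json
import os

parser = argparse.ArgumentParser()
print(json.dumps({"ok": True}))
"""
print(find_unused_imports(snippet))  # prints ['os']
```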

Suggested changeset 1
backend/etl/data_quality/dq_runner.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/backend/etl/data_quality/dq_runner.py b/backend/etl/data_quality/dq_runner.py
--- a/backend/etl/data_quality/dq_runner.py
+++ b/backend/etl/data_quality/dq_runner.py
@@ -12,7 +12,6 @@
 import argparse
 import json
 import logging
-import os
 import subprocess
 import sys
 from datetime import datetime, timezone
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
from typing import Any

import pandas as pd
import numpy as np

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'np' is not used.

Copilot Autofix

AI about 23 hours ago

To fix an unused import, the general solution is to remove the import statement for any module or symbol that is not referenced anywhere in the file. This reduces unnecessary dependencies and improves code clarity without changing runtime behavior.

In this case, the best fix is to delete the line import numpy as np from backend/etl/validate.py. This line is currently at line 26, immediately after import pandas as pd. No other changes are required, since the rest of the visible imports (argparse, json, logging, sys, datetime, Path, Any, pandas) may be in use elsewhere in the file (which we are not shown) and should not be altered. Removing this line will not affect existing functionality, as np is reported as unused.


Suggested changeset 1
backend/etl/validate.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/backend/etl/validate.py b/backend/etl/validate.py
--- a/backend/etl/validate.py
+++ b/backend/etl/validate.py
@@ -23,7 +23,6 @@
 from typing import Any
 
 import pandas as pd
-import numpy as np
 
 logging.basicConfig(
     level=logging.INFO,
EOF
)

# ── Enum/accepted values check ──
accepted_values_issues = {}

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable accepted_values_issues is not used.

Copilot Autofix

AI about 23 hours ago

In general, to fix an unused local variable you either (1) delete the variable assignment (ensuring you do not remove any right-hand-side expression with side effects), or (2) rename it to a conventional “unused” name if it is intentionally unused for documentation or interface reasons. Here, accepted_values_issues is assigned an empty dict literal with no side effects, and then never referenced, so the best fix is to remove this assignment line.

Concretely, in backend/etl/validate.py, in the function that contains the “Enum/accepted values check” section (around line 283), delete the line:

accepted_values_issues = {}

and leave the rest of the logic (the checks for "primary_purpose" and "nature_of_property") unchanged. No imports, new methods, or additional definitions are required, since we’re only removing an unused variable.
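If the empty dict was intended to collect findings rather than being a leftover, an alternative to deleting it would be to populate and return it. The sketch below shows one way that could look; the accepted value sets are hypothetical (the real enumerations are not given in this PR), and `find_unaccepted_values` is an illustrative name, not a function in validate.py.

```python
import pandas as pd

# Hypothetical accepted sets; the real enumerations are not given in the PR.
ACCEPTED_VALUES = {
    "primary_purpose": {"Residence", "Commercial"},
    "nature_of_property": {"Residence", "Vacant land"},
}


def find_unaccepted_values(df: pd.DataFrame, accepted: dict) -> dict:
    """Map each checked column to the sorted values outside its accepted set."""
    issues = {}
    for column, allowed in accepted.items():
        if column not in df.columns:
            continue  # mirror the `if "..." in df.columns` guard
        observed = set(df[column].dropna().unique())
        unexpected = sorted(observed - allowed)
        if unexpected:
            issues[column] = unexpected
    return issues


df = pd.DataFrame({"primary_purpose": ["Residence", "Farm", None]})
print(find_unaccepted_values(df, ACCEPTED_VALUES))  # prints {'primary_purpose': ['Farm']}
```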

Suggested changeset 1
backend/etl/validate.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/backend/etl/validate.py b/backend/etl/validate.py
--- a/backend/etl/validate.py
+++ b/backend/etl/validate.py
@@ -281,7 +281,6 @@
                 )
 
     # ── Enum/accepted values check ──
-    accepted_values_issues = {}
     if "primary_purpose" in df.columns:
         purposes = df["primary_purpose"].dropna().unique()
         logger.info(f"Primary purpose values: {list(purposes)}")
EOF

import pytest
import pandas as pd
import numpy as np

Check notice

Code scanning / CodeQL

Unused import Note test

Import of 'np' is not used.

Copilot Autofix

AI about 23 hours ago

In general, an unused import should be removed to clean up the code, reduce unnecessary dependencies, and improve readability. The best fix is to delete the numpy import line while leaving all other imports and code intact, since no current functionality relies on NumPy.

Concretely, in tests/test_data_quality.py, remove line 8: import numpy as np. No other changes are required because no references to np exist in the file. This will resolve the CodeQL "unused import" warning without affecting any behavior of the tests.

Suggested changeset 1
tests/test_data_quality.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tests/test_data_quality.py b/tests/test_data_quality.py
--- a/tests/test_data_quality.py
+++ b/tests/test_data_quality.py
@@ -5,7 +5,6 @@
 
 import pytest
 import pandas as pd
-import numpy as np
 from pathlib import Path
 import tempfile
 import os
EOF
import pytest
import pandas as pd
import numpy as np
from pathlib import Path

Check notice

Code scanning / CodeQL

Unused import Note test

Import of 'Path' is not used.

Copilot Autofix

AI about 23 hours ago

In general, an unused-import issue is fixed by either removing the import statement or by using the imported symbol where intended. Here, Path is not used anywhere in tests/test_data_quality.py, so the safest fix that does not change existing functionality is to delete the unused from pathlib import Path line.

Concretely, in tests/test_data_quality.py, remove line 9 containing from pathlib import Path. No other code changes are needed; there are no references to Path to update, and no replacement imports are required.

Suggested changeset 1
tests/test_data_quality.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tests/test_data_quality.py b/tests/test_data_quality.py
--- a/tests/test_data_quality.py
+++ b/tests/test_data_quality.py
@@ -6,7 +6,6 @@
 import pytest
 import pandas as pd
 import numpy as np
-from pathlib import Path
 import tempfile
 import os
 
EOF
import pandas as pd
import numpy as np
from pathlib import Path
import tempfile

Check notice

Code scanning / CodeQL

Unused import Note test

Import of 'tempfile' is not used.

Copilot Autofix

AI about 23 hours ago

To fix an unused import, remove the corresponding import statement, leaving all other code untouched. This eliminates an unnecessary dependency and slightly improves readability.

In this case, in tests/test_data_quality.py, delete the line import tempfile (currently line 10). No other changes are required because nothing in the shown code references tempfile. The remaining imports (pytest, pandas, numpy, Path, os) should stay as-is since we cannot confirm they are unused from the snippet.

Suggested changeset 1
tests/test_data_quality.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tests/test_data_quality.py b/tests/test_data_quality.py
--- a/tests/test_data_quality.py
+++ b/tests/test_data_quality.py
@@ -7,7 +7,6 @@
 import pandas as pd
 import numpy as np
 from pathlib import Path
-import tempfile
 import os
 
 
EOF
import numpy as np
from pathlib import Path
import tempfile
import os

Check notice

Code scanning / CodeQL

Unused import Note test

Import of 'os' is not used.

Copilot Autofix

AI about 23 hours ago

To fix an unused import, the general approach is to remove the import statement for any module that is not referenced anywhere in the file. This eliminates unnecessary dependencies and minor overhead in module loading, and makes the code clearer.

In this case, the single best fix is to delete the line import os at line 11 of tests/test_data_quality.py. No other code changes are required, since there are no references to os in the visible sections, and removing an unused import does not affect existing functionality. Specifically, in tests/test_data_quality.py, remove the entire line 11 that contains import os, keeping all other imports and code as-is.

Suggested changeset 1
tests/test_data_quality.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tests/test_data_quality.py b/tests/test_data_quality.py
--- a/tests/test_data_quality.py
+++ b/tests/test_data_quality.py
@@ -8,7 +8,6 @@
 import numpy as np
 from pathlib import Path
 import tempfile
-import os
 
 
 class TestDataQualityCalculations:
EOF
@rprabhat rprabhat merged commit d87fc00 into main Apr 6, 2026
8 checks passed
@rprabhat rprabhat deleted the sugary-walnut branch April 6, 2026 07:00

2 participants