feat: enhance data validation with comprehensive profiling and Soda Core integration #30
…ore integration
- Enhanced validate.py with per-column statistics, data profiling, and calculation reconciliation
- Added JSON/HTML report generation for validation results
- Installed soda-core-duckdb dependency
- Created Soda Core YAML check files for sales, growth, and summary data
- Added dq_runner.py to execute Soda scans and output JSON results
- Updated ETL pipeline workflow with DQ validation steps and R2 artifact upload
- Enhanced CI workflow to include ETL and data quality test execution
- Added test suite for data quality calculations verification
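As a rough illustration of the "execute Soda scans and output JSON results" item, the sketch below shells out to the Soda Core CLI (`soda scan -d <data source> -c <configuration> <checks file>`) and wraps the outcome in a JSON-serializable dict. This is not the contents of dq_runner.py, which is not shown here: the function names, the summary keys, and treating exit code 0 as "passed" are assumptions made for illustration.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Any


def build_scan_command(data_source: str, config: Path, checks: Path) -> list[str]:
    """Assemble a `soda scan` invocation for a single checks file."""
    return ["soda", "scan", "-d", data_source, "-c", str(config), str(checks)]


def run_scan(data_source: str, config: Path, checks: Path) -> dict[str, Any]:
    """Run one scan and return a JSON-serializable summary (hypothetical shape)."""
    command = build_scan_command(data_source, config, checks)
    # check=False: a failing scan should be reported in the summary, not raised.
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return {
        "checks_file": checks.name,
        "exit_code": completed.returncode,
        "passed": completed.returncode == 0,  # assumption: 0 means all checks passed
        "finished_at": datetime.now(timezone.utc).isoformat(),
        "log": completed.stdout,
    }
```

A caller could pass the returned dict to `json.dumps` and write it next to the other validation reports. The data source name and file paths would come from the project's own Soda configuration.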
import argparse
import json
import logging
import os
Check notice
Code scanning / CodeQL
Unused import Note
Copilot Autofix
AI about 23 hours ago
To fix an unused import, the general approach is to delete the import line for the unused module, ensuring that no remaining code references it. This reduces unnecessary dependencies and noise without changing runtime behavior.
Here, the best fix is to remove the import os statement on line 15 in backend/etl/data_quality/dq_runner.py. No other code changes are required, since nothing in the provided snippet uses os. This change preserves all existing functionality, as the remaining imports (argparse, json, logging, subprocess, sys, datetime, Path, Dict, Any) still cover the used features.
Concretely, edit backend/etl/data_quality/dq_runner.py to delete the import os line in the import block near the top of the file; no new methods, imports, or definitions are needed.
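The check described above can be approximated with the standard library. The following is a minimal sketch, not CodeQL's actual query: it collects every name bound by an import statement and reports those that never appear as a referenced name. The function name `find_unused_imports` and the sample source are hypothetical, and the sketch ignores cases such as `__all__` re-exports or names referenced only inside strings.

```python
import ast


def find_unused_imports(source: str) -> list[str]:
    """Return names bound by import statements that are never referenced."""
    tree = ast.parse(source)
    imported: set[str] = set()
    used: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds "a"; "import a as x" binds "x".
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
        elif isinstance(node, ast.Name):
            # Attribute access such as "os.path" still contains Name("os").
            used.add(node.id)
    return sorted(imported - used)


SAMPLE = """
import argparse
import json
import os
import sys

def main():
    parser = argparse.ArgumentParser()
    json.dump({}, sys.stdout)
"""

print(find_unused_imports(SAMPLE))  # ['os']
```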
@@ -12,7 +12,6 @@
 import argparse
 import json
 import logging
-import os
 import subprocess
 import sys
 from datetime import datetime, timezone
from typing import Any

import pandas as pd
import numpy as np
Check notice
Code scanning / CodeQL
Unused import Note
Copilot Autofix
AI about 23 hours ago
To fix an unused import, the general solution is to remove the import statement for any module or symbol that is not referenced anywhere in the file. This reduces unnecessary dependencies and improves code clarity without changing runtime behavior.
In this case, the best fix is to delete the line import numpy as np from backend/etl/validate.py. This line is currently at line 26, immediately after import pandas as pd. No other changes are required, since the rest of the visible imports (argparse, json, logging, sys, datetime, Path, Any, pandas) may be in use elsewhere in the file (which we are not shown) and should not be altered. Removing this line will not affect existing functionality, as np is reported as unused.
@@ -23,7 +23,6 @@
 from typing import Any

 import pandas as pd
-import numpy as np

 logging.basicConfig(
     level=logging.INFO,
)

# ── Enum/accepted values check ──
accepted_values_issues = {}
Check notice
Code scanning / CodeQL
Unused local variable Note
Copilot Autofix
AI about 23 hours ago
In general, to fix an unused local variable you either (1) delete the variable assignment (ensuring you do not remove any right-hand-side expression with side effects), or (2) rename it to a conventional “unused” name if it is intentionally unused for documentation or interface reasons. Here, accepted_values_issues is assigned an empty dict literal with no side effects, and then never referenced, so the best fix is to remove this assignment line.
Concretely, in backend/etl/validate.py, in the function that contains the “Enum/accepted values check” section (around line 283), delete the line:
accepted_values_issues = {}
and leave the rest of the logic (the checks for "primary_purpose" and "nature_of_property") unchanged. No imports, new methods, or additional definitions are required, since we’re only removing an unused variable.
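The side-effect caveat above can be made concrete. The sketch below is a conservative, hypothetical helper (not part of CodeQL or the PR) that parses a single assignment and reports whether its right-hand side consists only of literals. An unused assignment of that kind can be deleted outright, while one that contains a call or a name lookup needs a closer look first.

```python
import ast

# Node types that cannot trigger side effects when evaluated.
_LITERAL_NODES = (ast.Constant, ast.Dict, ast.List, ast.Tuple, ast.Set, ast.Load)


def rhs_is_plain_literal(statement: str) -> bool:
    """True if `statement` is an assignment whose value is built only from literals."""
    node = ast.parse(statement).body[0]
    if not isinstance(node, ast.Assign):
        return False
    # Any Call, Name, Attribute, etc. inside the value makes the check fail.
    return all(isinstance(child, _LITERAL_NODES) for child in ast.walk(node.value))


print(rhs_is_plain_literal("accepted_values_issues = {}"))  # True
print(rhs_is_plain_literal("rows = load_rows()"))           # False
```

The check is deliberately strict: it rejects some harmless expressions (for example `-1`, which parses as a unary operation) rather than risk approving one that does work when evaluated.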
@@ -281,7 +281,6 @@
 )

 # ── Enum/accepted values check ──
-accepted_values_issues = {}
 if "primary_purpose" in df.columns:
     purposes = df["primary_purpose"].dropna().unique()
     logger.info(f"Primary purpose values: {list(purposes)}")
import pytest
import pandas as pd
import numpy as np
Check notice
Code scanning / CodeQL
Unused import Note test
Copilot Autofix
AI about 23 hours ago
In general, an unused import should be removed to clean up the code, reduce unnecessary dependencies, and improve readability. The best fix is to delete the numpy import line while leaving all other imports and code intact, since no current functionality relies on NumPy.
Concretely, in tests/test_data_quality.py, remove line 8: import numpy as np. No other changes are required because no references to np exist in the file. This will resolve the CodeQL "unused import" warning without affecting any behavior of the tests.
@@ -5,7 +5,6 @@

 import pytest
 import pandas as pd
-import numpy as np
 from pathlib import Path
 import tempfile
 import os
import pytest
import pandas as pd
import numpy as np
from pathlib import Path
Check notice
Code scanning / CodeQL
Unused import Note test
Copilot Autofix
AI about 23 hours ago
In general, an unused-import issue is fixed by either removing the import statement or by using the imported symbol where intended. Here, Path is not used anywhere in tests/test_data_quality.py; the other imports in the file are pytest, pandas, numpy, tempfile, and os. The safest fix that does not change existing functionality is to delete the unused from pathlib import Path line.
Concretely, in tests/test_data_quality.py, remove line 9 containing from pathlib import Path. No other code changes are needed; there are no references to Path to update, and no replacement imports are required.
@@ -6,7 +6,6 @@
 import pytest
 import pandas as pd
 import numpy as np
-from pathlib import Path
 import tempfile
 import os
import pandas as pd
import numpy as np
from pathlib import Path
import tempfile
Check notice
Code scanning / CodeQL
Unused import Note test
Copilot Autofix
AI about 23 hours ago
To fix an unused import, remove the corresponding import statement, leaving all other code untouched. This eliminates an unnecessary dependency and slightly improves readability.
In this case, in tests/test_data_quality.py, delete the line import tempfile (currently line 10). No other changes are required because nothing in the shown code references tempfile. The remaining imports (pytest, pandas, numpy, Path, os) should stay as-is since we cannot confirm they are unused from the snippet.
@@ -7,7 +7,6 @@
 import pandas as pd
 import numpy as np
 from pathlib import Path
-import tempfile
 import os
import numpy as np
from pathlib import Path
import tempfile
import os
Check notice
Code scanning / CodeQL
Unused import Note test
Copilot Autofix
AI about 23 hours ago
To fix an unused import, the general approach is to remove the import statement for any module that is not referenced anywhere in the file. This eliminates unnecessary dependencies and minor overhead in module loading, and makes the code clearer.
In this case, the single best fix is to delete the line import os at line 11 of tests/test_data_quality.py. No other code changes are required, since there are no references to os in the visible sections, and removing an unused import does not affect existing functionality. Specifically, in tests/test_data_quality.py, remove the entire line 11 that contains import os, keeping all other imports and code as-is.
@@ -8,7 +8,6 @@
 import numpy as np
 from pathlib import Path
 import tempfile
-import os


 class TestDataQualityCalculations:
Summary
This PR enhances the data validation pipeline with:
Changes
Testing