Skip to content

Automate anonymized production DB snapshot for UAT#93

Open
techy4shri wants to merge 41 commits intoPSMRI:mainfrom
techy4shri:anon-uat
Open

Automate anonymized production DB snapshot for UAT#93
techy4shri wants to merge 41 commits intoPSMRI:mainfrom
techy4shri:anon-uat

Conversation

@techy4shri
Copy link
Copy Markdown

@techy4shri techy4shri commented Feb 15, 2026

📋 Description

#112
JIRA ID:
This master plan provides a complete roadmap for implementing an automated, secure, and repeatable process to refresh the UAT environment with anonymized production data.

Small workflow:
Prod DB (DB1, read only perm applied)-> anonimizer (lookup the datatype of the field to anonymize, type of anon to apply) -> stream read and apply in batches with pagination for perf -> output: anonymized sql dump/direct restore to DB2 (anonymized DB)

Done.

✅ Type of Change

  • 🐞 Bug fix (non-breaking change which resolves an issue)
  • New feature (non-breaking change which adds functionality)
  • [X ] ⚙️ Config change (configuration file or build script updates)
  • [x ] 📚 Documentation (updates to docs or readme)

ℹ️ Additional Information

Please describe how the changes were tested, and include any relevant screenshots, logs, or other information that provides additional context.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced database anonymization tool with CLI support for anonymization execution and schema comparison
    • Added comprehensive safety mechanisms including approval workflows and host validation
    • Implemented anonymization rules system with configurable strategies per table and column
    • Added audit reporting with execution metrics and PII-safe logging
  • Documentation

    • Added comprehensive user guide covering architecture, configuration, and usage
    • Included example configuration and rules files for local, Docker, and production environments
  • Configuration

    • Standardized database property naming convention with hyphenation
    • Updated README with mandatory database specifications and main class configuration
    • Expanded .gitignore for anonymization outputs and environment files
  • Tests

    • Added unit tests for configuration parsing, safety guards, and anonymization components
    • Included sample SQL test data

fix: readme and gitignore
chore: update application properties and environment configs
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Feb 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This pull request introduces a comprehensive database anonymization framework for AMRIT DB, including deterministic HMAC-based data masking, CLI tools for execution and schema diffing, configuration-driven rule management, safety guardrails for production databases, and complete documentation and testing infrastructure.

Changes

Cohort / File(s) Summary
CLI & Core Commands
src/main/java/com/db/piramalswasthya/anonymizer/AmritDbAnonymizer.java, src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java, src/main/java/com/db/piramalswasthya/anonymizer/cli/DiffSchemaCommand.java
New Picocli-based CLI entry point and two subcommands: RunCommand orchestrates multi-schema anonymization with safety checks and progress reporting; DiffSchemaCommand compares live database schemas against rules and suggests missing configurations.
Core Anonymization Engine
src/main/java/com/db/piramalswasthya/anonymizer/core/HmacAnonymizer.java, src/main/java/com/db/piramalswasthya/anonymizer/core/AnonymizationEngine.java, src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java, src/main/java/com/db/piramalswasthya/anonymizer/output/DirectRestoreWriter.java
Core anonymization components: HmacAnonymizer provides HMAC-SHA256 hashing and deterministic transformation utilities; AnonymizationEngine applies per-column rules; KeysetPaginator streams large tables efficiently; DirectRestoreWriter handles transactional batch inserts with schema management.
Configuration Model
src/main/java/com/db/piramalswasthya/anonymizer/config/AnonymizerConfig.java, src/main/java/com/db/piramalswasthya/anonymizer/config/AnonymizationRules.java
YAML-deserializable configuration classes with nested structures for source/target databases, safety settings, performance tuning, and per-database/table/column anonymization rules.
Safety & Reporting
src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java, src/main/java/com/db/piramalswasthya/anonymizer/report/RunReport.java
SafetyGuard enforces production database safeguards via allowlists, deny patterns, and explicit approval tokens; RunReport aggregates per-table metrics and PII-safe execution details.
Environment & Example Configs
config.yaml.example, rules.yaml.example, src/main/environment/anonymization_*.yaml*
Configuration templates for different deployment environments (local, production, Docker, CI) with documented safety, performance, and credential management settings.
Documentation & Registry
docs/anonymization/GUIDE.md, src/main/resources/anonymization-registry.json, src/main/resources/anonymization-registry.yml, src/main/resources/anonymization/pii-detection-rules.yml
Comprehensive user guide, data masking registries, and PII detection rules covering architecture, quickstart, configuration schema, command usage, troubleshooting, and best practices.
Spring Configuration Updates
src/main/java/com/db/piramalswasthya/config/DatasourceConfig.java, src/main/java/com/db/piramalswasthya/config/FlywayConfig.java, src/main/java/com/db/piramalswasthya/config/FlywayMigrator.java, src/main/environment/common_example.properties, src/main/environment/common_test.properties, src/main/resources/application.properties
Standardized datasource property keys from camelCase to hyphenated format (e.g., dbiemrdb-iemr); added Flyway configuration properties; cleaned up legacy property definitions.
Application Profile & Build
src/main/resources/application-analyzer.properties, src/main/resources/application-executor.properties, pom.xml, README.md
New Spring profiles for analyzer and executor modes; added Picocli and Jackson YAML dependencies; added exec-maven-plugin; updated README with mandatory database info and main class specification.
Test Suite
src/test/java/com/db/piramalswasthya/anonymizer/config/ConfigurationTest.java, src/test/java/com/db/piramalswasthya/anonymizer/core/HmacAnonymizerTest.java, src/test/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuardTest.java, src/test/resources/sample-dump.sql
Comprehensive unit tests for YAML configuration parsing, deterministic HMAC operations, and safety guard validation; sample test dataset.
Project Setup & Cleanup
.gitignore, src/main/java/com/db/piramalswasthya/data/FlywaySchemaVersion.java, src/main/java/com/db/piramalswasthya/controller/VersionController.java, src/test/java/com/db/piramalswasthya/AmritDbApplicationTests.java
Expanded .gitignore for anonymization artifacts; replaced Lombok @Data with explicit getters/setters in FlywaySchemaVersion; removed unused logging imports; cleanup of test imports.

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as AmritDbAnonymizer
    participant RunCmd as RunCommand
    participant SafetyGuard
    participant SourceDB as Source Database
    participant KeysetPaginator
    participant AnonymizationEngine
    participant TargetDB as Target Database
    participant DirectRestoreWriter
    participant RunReport

    User->>CLI: Execute with config.yaml & approval token
    CLI->>RunCmd: Delegate to RunCommand.call()
    
    RunCmd->>RunCmd: Load config & rules from YAML
    RunCmd->>RunCmd: Validate configuration
    
    RunCmd->>SafetyGuard: validateSafeToConnect(source_host, approval_token)
    SafetyGuard->>SafetyGuard: Check allowlist/deny patterns
    SafetyGuard-->>RunCmd: Success or SafetyViolationException
    
    RunCmd->>SafetyGuard: validateSafeToConnect(target_host, approval_token)
    SafetyGuard-->>RunCmd: Success or SafetyViolationException
    
    RunCmd->>DirectRestoreWriter: resetSchema(sourceDataSource)
    DirectRestoreWriter->>TargetDB: DROP/recreate schema
    DirectRestoreWriter->>SourceDB: Clone table structures
    DirectRestoreWriter-->>RunCmd: Schema ready
    
    loop For each source schema & table
        RunCmd->>KeysetPaginator: streamTable(table, primaryKey, columns, batchHandler)
        KeysetPaginator->>SourceDB: Query batch with keyset pagination
        SourceDB-->>KeysetPaginator: RowData batch
        
        KeysetPaginator->>AnonymizationEngine: anonymizeBatch(database, table, rows)
        AnonymizationEngine->>AnonymizationEngine: Apply per-column rules (HMAC_HASH/PRESERVE)
        AnonymizationEngine-->>KeysetPaginator: Anonymized rows + strategy counts
        
        KeysetPaginator->>DirectRestoreWriter: writeInsert(tableName, columns, rows)
        DirectRestoreWriter->>TargetDB: Batch INSERT anonymized data
        DirectRestoreWriter-->>KeysetPaginator: Insert confirmed
        
        KeysetPaginator-->>RunCmd: Next batch or completion
    end
    
    RunCmd->>RunReport: addTableReport(table, rowCount, duration, strategies)
    RunCmd->>RunReport: writeToFile(outputPath)
    RunReport-->>RunCmd: Report written
    
    RunCmd->>DirectRestoreWriter: close() (commit/rollback)
    DirectRestoreWriter-->>RunCmd: Transaction finalized
    
    RunCmd-->>User: Exit with status code & JSON report
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 With whiskers twitching, I've prepared the way,
Where secrets hide in hashes, safe each day,
From source to target, keyset paginate,
Anonymized with HMAC—data sanitized and straight!
Safety guards stand watch at every gate. 🛡️

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.46% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Automate anonymized production DB snapshot for UAT' directly and clearly summarizes the main objective of this PR, which is to implement automation for creating anonymized production database snapshots for UAT use.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can use Trivy to scan for security misconfigurations and secrets in Infrastructure as Code files.

Add a .trivyignore file to your project to customize which findings Trivy reports.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/main/java/com/db/piramalswasthya/config/FlywayMigrator.java (1)

37-37: 🛠️ Refactor suggestion | 🟠 Major

Remove System.out.println("SUCCESS") — use the logger instead.

Line 38 already logs the success message via SLF4J. The System.out.println bypasses structured logging and should be removed.

Proposed fix
-        System.out.println("SUCCESS");
         logger.info("Flyway migration completed successfully");
🤖 Fix all issues with AI agents
In `@config.yaml.example`:
- Around line 7-11: The schemas list in config.yaml.example uses "dbiemr" but
rules.yaml.example defines rules under "db_iemr", causing AnonymizationEngine's
direct lookup (rules.getDatabases().get("dbiemr")) to return null and skip the
schema; fix by making the schema name consistent—update config.yaml.example's
schemas entry from "dbiemr" to "db_iemr" (or alternatively rename the key in
rules.yaml.example to "dbiemr") so that the schema name in the schemas list
matches the key used by rules.getDatabases() and the AnonymizationEngine.

In `@rules.yaml.example`:
- Around line 38-39: Cross-schema referential integrity is broken because
db_iemr.t_benvisitdetail.BeneficiaryRegID is set to HMAC_HASH while the matching
foreign-key column db_identity.m_beneficiaryregidmapping.BenRegID is PRESERVE;
update the rules so both sides use the same strategy (either PRESERVE on both if
the ID is non-sensitive or HMAC_HASH on both if sensitive) and then search the
ruleset for any other cross-schema join pairs to apply the same consistent
strategy (look for BeneficiaryRegID and BenRegID in the rules and make the
strategies identical).

In `@src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java`:
- Around line 204-216: The DirectRestoreWriter opened in the try-with-resources
is never marked successful, so its close() will always rollback; after
processing all tables for the schema you must call writer.markSuccess() to
commit the inserts. Update the block that calls processSchemaTables(schema,
rules, engine, paginator, writer, report) (or inside processSchemaTables once
all writes/reports succeed) to invoke DirectRestoreWriter.markSuccess() when no
errors occurred, ensuring commit before the try-with-resources scope ends; keep
the writer.resetSchema(...) call and only markSuccess() after successful
completion of all table processing.

In
`@src/main/java/com/db/piramalswasthya/anonymizer/core/AnonymizationEngine.java`:
- Around line 79-101: The loop incorrectly casts KeysetPaginator.RowData to a
Map which causes ClassCastException; instead, access the underlying
Map<String,Object> (e.g. via RowData#getData() or add a getData()/getDataMap()
accessor on KeysetPaginator.RowData) and iterate a snapshot of its keyset like
new ArrayList<>(row.getData().keySet()) to avoid
ConcurrentModificationException, use Map<String,Object> for the value types (not
Integer), remove the bad cast, keep using row.get(column)/row.put(column,
anonymizedValue), and preserve strategyCounts.merge(rule.getStrategy(), 1,
Integer::sum) and handleUnknownColumn(database, table, column) as before.

In `@src/main/java/com/db/piramalswasthya/anonymizer/core/HmacAnonymizer.java`:
- Around line 81-93: The current fakeName in HmacAnonymizer uses
Math.abs(hash.hashCode()) which can remain negative for Integer.MIN_VALUE;
change seed to a non-negative long (e.g., compute int h = hash.hashCode(); long
seed = (h == Integer.MIN_VALUE) ? ((long)Integer.MAX_VALUE + 1L) :
Math.abs((long)h);) and then use that seed for the name selection (firstNames[
(int)(seed % firstNames.length) ] and lastNames[ (int)((seed / 10) %
lastNames.length) ]) so array indices are always non-negative; update the seed
variable type and usages in fakeName accordingly.

In `@src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java`:
- Around line 94-152: The SELECT built in streamTable doesn't guarantee the
primary key is selected, causing processBatch to fail when it calls
rs.getObject(primaryKey); update streamTable so the constructed columnList
always includes the quoted primary key (use quoteIdentifier(primaryKey)) if it's
not already present (avoid duplicating if present in columns), then build
selectSql from that augmented columnList; keep identifier quoting and the
existing validateNumericPrimaryKey call.

In `@src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java`:
- Around line 90-106: The loop over BUILT_IN_DENY_PATTERNS incorrectly applies
deny checks even for allowlisted hosts; update the logic in SafetyGuard (around
the loop that calls matchesPattern(host, pattern) / matchesPattern(database,
pattern)) to first detect allowlist membership (use the existing allowlist check
or method that determines if a host/database is allowlisted) and skip the
built-in deny pattern check for allowlisted entries, otherwise keep the existing
approval-flag bypass using config.isRequireExplicitApproval(), approvalFlag and
config.getApprovalFlag(); ensure you still throw SafetyViolationException with
the same message when a non-allowlisted host/database matches a deny pattern and
approval is not provided.

In `@src/main/resources/anonymization-registry.json`:
- Line 27: The JSON anonymization registry uses hyphenated keys (e.g.,
"db-1097-identity", "db-iemr") that don't match the underscored names used
elsewhere (e.g., db_1097_identity, db_iemr); update the keys in
src/main/resources/anonymization-registry.json to use the underscored convention
to match the YAML registry and environment configs (replace "db-1097-identity"
with "db_1097_identity" and "db-iemr" with "db_iemr" wherever they appear,
including the entries referenced around lines 27 and 35) so the engine can
correlate entries across files.

In `@src/main/resources/anonymization-registry.yml`:
- Around line 133-134: The registry uses the wrong database key "dbiemr" which
must match the rest of the config that uses "db_iemr"; replace the "dbiemr"
entry with "db_iemr" (i.e., update the key in the anonymization-registry.yml so
the schemaHash entry is under db_iemr) to ensure lookups succeed against the
existing environment configs and JSON registry.
- Around line 26-35: Several columns in the YAML use the key anonymizationRule
but the Java model ColumnRule only defines strategy, causing rule.getStrategy()
to return null and crash when applyStrategy() calls toUpperCase(); update every
column entry that uses anonymizationRule (e.g., BenRegID, FirstName, LastName,
PhoneNo, AadhaarNo, EmailID, Address, DateOfBirth, Gender, UserName, Password,
Diagnosis, DrugName, DoctorName, TestResult, LabTechnicianName) to use strategy
instead (match the existing BeneficiaryID entry) so deserialization populates
ColumnRule.strategy and applyStrategy() receives a non-null value.

In `@src/test/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuardTest.java`:
- Around line 37-50: The test fails because SafetyGuard.validateSafeToConnect
applies the built-in deny regex (e.g., ".*prod.*") before checking the
allowlist, causing allowlisted hosts like "replica.nonprod.internal" to be
blocked; change the logic in SafetyGuard.validateSafeToConnect so that allowed
hosts (from AnonymizerConfig.SafetyConfig.getAllowedHosts / AllowedHosts) are
checked first and bypass built-in deny checks (i.e., if host is explicitly
allowlisted and requireExplicitApproval is false, return/permit immediately),
ensuring allowlist takes precedence over built-in deny patterns.
🟠 Major comments (13)
src/main/environment/anonymization_production.yaml.example-21-25 (1)

21-25: ⚠️ Potential issue | 🟠 Major

Target schemas share identical names with source schemas — risk of accidental production overwrite.

The target schemas (db_iemr, db_identity, db_reporting, db_1097_identity) are identical to the source schemas on lines 9-12. If someone misconfigures the target host to point at a production replica or primary, the anonymization engine could overwrite live production data. The local config template correctly uses _uat-suffixed schema names for targets — apply the same convention here, or add a safety guard that rejects same-host + same-schema combinations.

src/main/resources/anonymization-registry.json-1-77 (1)

1-77: ⚠️ Potential issue | 🟠 Major

Strategy mismatches between JSON and YAML registries for the same columns.

Several columns have conflicting anonymization strategies between this file and the YAML registry:

  • BeneficiaryID: HASH_SHA256 (JSON) vs HMAC_HASH (YAML)
  • Diagnosis: PRESERVE (JSON) vs GENERALIZE (YAML)
  • TestResult: PRESERVE (JSON) vs GENERALIZE (YAML)
  • Name fields: RANDOM_UUID (JSON) vs RANDOM_FAKE_DATA (YAML)

Clarify which file is the source of truth (the YAML registry header claims to be) and ensure the JSON registry is either auto-generated from it or removed to avoid drift.

src/main/resources/anonymization/pii-detection-rules.yml-7-15 (1)

7-15: ⚠️ Potential issue | 🟠 Major

riskLevel: LINKING_KEY is not a standard risk level.

All other rules use CRITICAL, HIGH, MEDIUM, LOW, or NONE for riskLevel. Line 12 uses LINKING_KEY, which appears to be a copy of the piiCategory value. This will likely fail validation or be ignored by the engine.

Proposed fix
     piiCategory: LINKING_KEY
-    riskLevel: LINKING_KEY
+    riskLevel: CRITICAL
src/main/resources/anonymization/pii-detection-rules.yml-150-158 (1)

150-158: ⚠️ Potential issue | 🟠 Major

PAN regex .*[Pp][Aa][Nn].* is overly broad — will match unrelated columns.

This pattern matches any column containing "pan" (case-insensitive), which will produce false positives on columns like CompanyName, PanelID, Expansion, PancardNo, etc. Consider anchoring the pattern more tightly.

Proposed fix
-    columnNamePattern: ".*[Pp][Aa][Nn].*|^PANNo$|^PANCard$"
+    columnNamePattern: "^PAN$|^PANNo$|^PANCard$|^PAN[Nn]umber$|.*[Pp]an[Cc]ard.*"
src/main/resources/application.properties-28-48 (1)

28-48: ⚠️ Potential issue | 🟠 Major

Hardcoded database credentials (root / 1234) committed to version control.

application.properties is packaged into the WAR artifact. Use placeholders or environment-variable references (${DB_PASSWORD}) as done in the example/test configs, and override via environment-specific profiles or runtime properties.

Proposed fix
-spring.datasource.db-iemr.username=root
-spring.datasource.db-iemr.password=1234
+spring.datasource.db-iemr.username=${DB_IEMR_USERNAME}
+spring.datasource.db-iemr.password=${DB_IEMR_PASSWORD}

Apply the same pattern for db-identity, db-reporting, and db-1097-identity.

src/main/environment/common_test.properties-18-20 (1)

18-20: ⚠️ Potential issue | 🟠 Major

Property keys not migrated to hyphenated form—datasources will fail to bind.

DatasourceConfig.java expects hyphenated prefixes (spring.datasource.db-iemr, spring.datasource.db-identity, spring.datasource.db-reporting, spring.datasource.db-1097-identity), and common_example.properties was updated accordingly. However, common_test.properties still uses the old non-hyphenated keys (dbiemr, dbidentity, dbreporting). Spring's @ConfigurationProperties binding will fail to load these values, leaving all datasource beans uninitialized.

Update lines 19–22, 26–29, 32–35, and similar sections to use the hyphenated prefixes: spring.datasource.db-iemr.*, spring.datasource.db-identity.*, spring.datasource.db-reporting.*, etc.

src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java-78-80 (1)

78-80: ⚠️ Potential issue | 🟠 Major

Regex injection risk in allowlist host matching.

Line 80: allowed.replace("*", ".*") does not escape other regex metacharacters. Hostnames contain dots (.), which are regex wildcards. For example, an allowlist entry db.example.com becomes the regex db.example.com, which matches dbXexampleYcom. For a safety-critical guard, this is too permissive.

Use Pattern.quote on literal segments and only expand * globs explicitly:

🛡️ Proposed fix: escape regex metacharacters
             inAllowlist = config.getAllowedHosts().stream()
-                .anyMatch(allowed -> host.equalsIgnoreCase(allowed) || 
-                                     host.matches(allowed.replace("*", ".*")));
+                .anyMatch(allowed -> host.equalsIgnoreCase(allowed) || 
+                                     matchesPattern(host, 
+                                         "\\Q" + allowed.replace("*", "\\E.*\\Q") + "\\E"));
src/main/java/com/db/piramalswasthya/anonymizer/cli/DiffSchemaCommand.java-71-126 (1)

71-126: ⚠️ Potential issue | 🟠 Major

No SafetyGuard validation before connecting to the database.

The diff-schema command connects directly to the source database without invoking SafetyGuard.validateSafeToConnect(). This bypasses the safety layer designed to prevent accidental execution against production databases. The RunCommand (per the PR description) presumably enforces this, but diff-schema also reads from a live DB and should be equally guarded.

🛡️ Proposed fix: add safety check before connecting
+import com.db.piramalswasthya.anonymizer.safety.SafetyGuard;
+
 // Inside call(), after loading config:
+SafetyGuard safetyGuard = new SafetyGuard(config.getSafety());
+safetyGuard.validateSafeToConnect(
+    config.getSource().getHost(),
+    String.join(",", config.getSource().getSchemas()),
+    null  // or wire up an --approval-flag option
+);
src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java-364-370 (1)

364-370: ⚠️ Potential issue | 🟠 Major

SSL certificate verification is disabled — MITM risk for database connections carrying credentials and PII.

setVerifyServerCertificate(false) with setUseSSL(true) means TLS is enabled but the server certificate is not verified. An attacker in a network position between the tool and the database could intercept credentials and data. For a tool explicitly designed to handle production database access and PII, this weakens the security posture.

At minimum, make certificate verification configurable rather than hardcoded to false, and default to true.

src/main/java/com/db/piramalswasthya/anonymizer/config/AnonymizerConfig.java-44-53 (1)

44-53: ⚠️ Potential issue | 🟠 Major

@Data-generated toString() will include the database password.

If a DatabaseConfig instance is ever logged (directly or via the parent AnonymizerConfig), the password will appear in plaintext in logs. This is a compliance/privacy risk for a tool handling production database credentials.

Exclude password from toString() using Lombok's @ToString.Exclude:

Proposed fix
+import lombok.ToString;
 ...
 `@Data`
 public static class DatabaseConfig {
     private String host;
     private int port = 3306;
     private List<String> schemas;
     private String username;
+    `@ToString.Exclude`
     private String password;
     private boolean readOnly = true;
     private int connectionTimeout = 30000;
 }
src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java-257-280 (1)

257-280: ⚠️ Potential issue | 🟠 Major

Strategy counts are incorrectly computed — return value of anonymizeBatch is discarded.

Line 260 calls engine.anonymizeBatch(...) but discards its return value (the actual per-strategy counts). Instead, lines 273–276 iterate over the rule definitions and increment by 1 per column per batch — this counts batches × columns, not actual row-level applications.

Proposed fix
                     try {
                         // Anonymize batch
-                        engine.anonymizeBatch(schema, tableName, batch);
+                        Map<String, Integer> batchStrategyCounts = 
+                            engine.anonymizeBatch(schema, tableName, batch);
                         
                         // Convert RowData to Map
                         List<Map<String, Object>> rowMaps = batch.stream()
                             .map(row -> row.getData())
                             .toList();
                         
                         // Write to target
                         writer.writeInsert(tableName, allColumns, rowMaps);
                         
                         rowCount[0] += batch.size();
                         
                         // Track strategy counts
-                        for (String column : tableRules.getColumns().keySet()) {
-                            String strategy = tableRules.getColumns().get(column).getStrategy();
-                            strategyCounts.merge(strategy, 1, Integer::sum);
-                        }
+                        batchStrategyCounts.forEach((strategy, count) ->
+                            strategyCounts.merge(strategy, count, Integer::sum));
src/main/java/com/db/piramalswasthya/anonymizer/output/DirectRestoreWriter.java-146-154 (1)

146-154: ⚠️ Potential issue | 🟠 Major

setCatalog(schema) is called after SHOW TABLES has already executed.

In the try-with-resources, all three resources (connection, statement, result set) are initialized sequentially. stmt.executeQuery("SHOW TABLES") runs before the body's srcConn.setCatalog(schema). This currently works only because the sourceDataSource is already configured with the correct schema via setDatabaseName(), making the setCatalog call a no-op.

Move setCatalog before executing the query, or remove it if it's intentionally redundant.

Proposed fix
-        try (Connection srcConn = sourceDataSource.getConnection();
-             Statement stmt = srcConn.createStatement();
-             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
-            
-            srcConn.setCatalog(schema);
+        try (Connection srcConn = sourceDataSource.getConnection()) {
+            srcConn.setCatalog(schema);
+            try (Statement stmt = srcConn.createStatement();
+                 ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
             while (rs.next()) {
                 tables.add(rs.getString(1));
             }
+            }
         }
src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java-421-430 (1)

421-430: ⚠️ Potential issue | 🟠 Major

Add mapper.findAndRegisterModules() to serialize LocalDateTime fields

RunReport contains LocalDateTime fields (startTime, endTime). The plain new ObjectMapper() in writeReport() will not serialize these without registering the JSR310 module. While the project includes jackson-datatype-jsr310 via Spring Boot, plain ObjectMapper instances require explicit module registration. RunReport.writeToFile() already does this correctly with mapper.findAndRegisterModules().

Proposed fix
     private void writeReport(RunReport report, String loggingPath, String executionId) 
             throws IOException {
         Files.createDirectories(Paths.get(loggingPath));
         String reportPath = loggingPath + "/run-report-" + executionId + ".json";
         
         ObjectMapper mapper = new ObjectMapper();
+        mapper.findAndRegisterModules();
         mapper.writerWithDefaultPrettyPrinter().writeValue(new File(reportPath), report);
         
         log.info("Report written to: {}", reportPath);
     }
🟡 Minor comments (13)
README.md-39-39 (1)

39-39: ⚠️ Potential issue | 🟡 Minor

Database naming inconsistency across documentation.

The README lists dbiemr (without underscore) as the required database name, but:

  • anonymization_example.yaml line 8 uses db_iemr (with underscore)
  • application-executor.properties line 22 uses db_iemr (with underscore)
  • Lines 40-42 of this file use underscores for other databases (db_reporting, db_identity, etc.)

Clarify whether the database should be named dbiemr or db_iemr and ensure consistency across all configuration files and documentation.

src/main/resources/application-executor.properties-22-22 (1)

22-22: ⚠️ Potential issue | 🟡 Minor

Update anonymization.lookup.database to match actual database name.

Line 22 uses db_iemr, but the actual MySQL database name is dbiemr (as shown in JDBC URLs and Spring datasource configuration). Update this to dbiemr for consistency with the application configuration and anonymization-registry.yml.

src/main/environment/anonymization_example.yaml-45-45 (1)

45-45: ⚠️ Potential issue | 🟡 Minor

Document the expected location for rules.yaml.

The rulesFile: rules.yaml resolves relative to the current working directory (where the JVM is invoked), but this is not documented. Users running from different directories may encounter "file not found" errors. Add guidance to GUIDE.md clarifying:

  1. The expected location (e.g., repo root, same directory as config file)
  2. Whether relative paths are supported
  3. An example showing the correct directory structure

Consider documenting this in the Configuration section alongside the note that config files are stored in src/main/environment/.

src/main/environment/anonymization_production.yaml.example-41-41 (1)

41-41: ⚠️ Potential issue | 🟡 Minor

Approval flag comment says "Rotate monthly" but value suggests quarterly (Q1).

The approval flag PROD_TO_UAT_2025_Q1 implies quarterly rotation, but the inline comment says # Rotate monthly. Clarify the intended rotation cadence and update either the flag naming convention or the comment.

docs/anonymization/GUIDE.md-714-720 (1)

714-720: ⚠️ Potential issue | 🟡 Minor

Spot-check example contradicts rules: BenRegID is PRESERVE, not hashed.

The rules file marks BenRegID as PRESERVE, so it should retain its original integer value in UAT — not appear as a "long hex hash (64 chars)." Update this example to show the preserved original value for BenRegID and only hashed values for HMAC columns like FirstName and PhoneNo.

docs/anonymization/GUIDE.md-373-380 (1)

373-380: ⚠️ Potential issue | 🟡 Minor

Duplicate introductory text for schema reset strategies.

Lines 375-376 and 379 both introduce the same concept with slightly different wording. Remove the first one (lines 375-376) and keep the second.

config.yaml.example-45-48 (1)

45-48: ⚠️ Potential issue | 🟡 Minor

Environment variable name inconsistent with documentation.

This file uses ${ANONYMIZATION_SECRET} but GUIDE.md (lines 208, 333) references ${HMAC_SECRET_KEY}. Pick one name and use it consistently across all config examples and documentation.

docs/anonymization/GUIDE.md-477-496 (1)

477-496: ⚠️ Potential issue | 🟡 Minor

Safety Model section immediately jumps to troubleshooting.

The "## Safety Model" heading at line 477 is followed directly by "### Common Issues" (line 479). The actual safety model description (allowlists, denylists, approval tokens) is missing from this section. Consider adding an introductory subsection explaining how the safety guards work before diving into troubleshooting.

docs/anonymization/GUIDE.md-283-286 (1)

283-286: ⚠️ Potential issue | 🟡 Minor

Duplicate heading: "### Complete Configuration Reference" appears twice.

Lines 283 and 285 both contain the same heading. Remove one. Flagged by markdownlint (MD024).

src/main/java/com/db/piramalswasthya/anonymizer/cli/DiffSchemaCommand.java-312-314 (1)

312-314: ⚠️ Potential issue | 🟡 Minor

FileWriter uses platform-default charset.

new FileWriter(outputPath) uses the JVM's default charset, which may not be UTF-8 on all systems. YAML output should be consistently written in UTF-8.

🔧 Proposed fix
-        try (FileWriter writer = new FileWriter(outputPath)) {
-            writer.write(yaml.toString());
+        try (var writer = new java.io.OutputStreamWriter(
+                new java.io.FileOutputStream(outputPath), java.nio.charset.StandardCharsets.UTF_8)) {
+            writer.write(yaml.toString());
         }

Or with Files.writeString:

-        try (FileWriter writer = new FileWriter(outputPath)) {
-            writer.write(yaml.toString());
-        }
+        Files.writeString(Path.of(outputPath), yaml.toString(), StandardCharsets.UTF_8);
src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java-118-118 (1)

118-118: ⚠️ Potential issue | 🟡 Minor

lastKey = 0L will skip rows with primary key ≤ 0.

The initial lastKey is 0L, so the first query is WHERE pk > 0. Any rows with PK = 0 or negative values will never be returned. While MySQL auto-increment defaults start at 1, this assumption should be documented or the initial query should handle the first page differently (e.g., using WHERE pk >= 0 for the first batch, or Long.MIN_VALUE).

src/main/java/com/db/piramalswasthya/anonymizer/core/AnonymizationEngine.java-110-118 (1)

110-118: ⚠️ Potential issue | 🟡 Minor

Null-safety: strategy.toUpperCase() will NPE if ColumnRule.getStrategy() returns null.

If a rule is defined in YAML without a strategy field, this will throw NullPointerException.

Proposed guard
     private Object applyStrategy(String strategy, String value) {
+        if (strategy == null) {
+            log.warn("Null strategy encountered - preserving value");
+            return value;
+        }
         return switch (strategy.toUpperCase()) {
src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java-171-180 (1)

171-180: ⚠️ Potential issue | 🟡 Minor

Success-path report uses hardcoded "./logs" instead of config.getLoggingPath().

config is scoped to the try block, so line 175 can't reference it. Consider extracting loggingPath before the try block.

Proposed fix
     RunReport report = new RunReport();
     report.setExecutionId(executionId);
     report.setStartTime(startTime);
+    String loggingPath = "./logs"; // default, overridden after config loads
     
     try {
         // 1. Load configuration and rules
         log.info("Loading configuration...");
         AnonymizerConfig config = loadConfig(configFile);
+        loggingPath = config.getLoggingPath();
         ...
-        writeReport(report, "./logs", executionId);
+        writeReport(report, loggingPath, executionId);
         return 1;
     }
     ...
-    writeReport(report, "./logs", executionId);
+    writeReport(report, loggingPath, executionId);
🧹 Nitpick comments (16)
src/test/resources/sample-dump.sql (2)

49-50: SET statements placement doesn't reflect real mysqldump output.

In production dumps, SET FOREIGN_KEY_CHECKS=0 appears at the top (before CREATE/INSERT) and is restored to 1 at the bottom. Placing it after all inserts means the anonymization parser isn't tested against the realistic ordering it will encounter in production. Also, FOREIGN_KEY_CHECKS is never re-enabled.

Consider restructuring to mirror actual dump layout:

Suggested restructuring

Move to the top of the file (after the header comments):

+SET FOREIGN_KEY_CHECKS=0;
+SET SQL_MODE='NO_AUTO_VALUE_ON_ZERO';
+
 -- Create database
 CREATE DATABASE IF NOT EXISTS `test_db`;

And add restoration at the end:

+SET FOREIGN_KEY_CHECKS=1;

1-50: Consider expanding edge-case coverage for more robust parser testing.

The fixture covers the core patterns well (batch inserts, NULLs, special characters, multiple tables). A few common mysqldump patterns are missing that the anonymization tool will likely encounter in production:

  • DROP TABLE IF EXISTS before each CREATE TABLE
  • LOCK TABLES ... WRITE / UNLOCK TABLES blocks
  • Empty string values ('') vs NULL
  • Multi-byte / Unicode characters in string values (e.g., names with diacritics)
  • INSERT IGNORE or ON DUPLICATE KEY UPDATE variants

Not blocking, but adding these would increase confidence that the parser handles real-world dumps without surprises.

src/main/resources/application-executor.properties (1)

17-19: Consider adding executor constraints and validation.

The executor settings lack validation boundaries:

  • batch-size: No min/max constraints (consider adding range validation)
  • output-dir: Uses a relative path that may behave differently depending on the working directory

Consider adding validation or documenting expected working directory assumptions.

src/main/resources/application-analyzer.properties (1)

10-10: Consider syncing tool version with project version.

The hardcoded tool-version=1.0.0 may drift from the actual project version. Consider using Maven/Gradle property substitution to keep versions synchronized.

💡 Example approach using Maven filtering

In pom.xml, enable resource filtering:

<resources>
  <resource>
    <directory>src/main/resources</directory>
    <filtering>true</filtering>
  </resource>
</resources>

Then use:

-anonymization.analyzer.tool-version=1.0.0
+anonymization.analyzer.tool-version=@project.version@
src/main/resources/anonymization/pii-detection-rules.yml (1)

347-355: CreatedBy/ModifiedBy fields may contain usernames that are PII.

These audit fields often store actual user names or email addresses rather than opaque IDs. A RETAIN strategy with riskLevel: LOW may under-protect PII. Consider raising the risk level or at least adding a rationale note to verify the field content.

src/main/java/com/db/piramalswasthya/data/FlywaySchemaVersion.java (1)

95-101: getInstalledOn() returns a mutable Timestamp reference.

java.sql.Timestamp is mutable. Returning the internal reference allows callers to modify the object's state. For an internal DTO this is likely acceptable, but worth noting if the class is used more broadly.

Defensive copy approach
     public Timestamp getInstalledOn() {
-        return installedOn;
+        return installedOn != null ? new Timestamp(installedOn.getTime()) : null;
     }
 
     public void setInstalledOn(Timestamp installedOn) {
-        this.installedOn = installedOn;
+        this.installedOn = installedOn != null ? new Timestamp(installedOn.getTime()) : null;
     }
src/main/java/com/db/piramalswasthya/config/DatasourceConfig.java (1)

15-15: Update environment property files to use consistent hyphenated prefix format.

The @ConfigurationProperties annotations in DatasourceConfig.java now use hyphenated prefixes (db-iemr, db-identity, db-reporting, db-1097-identity), but common_test.properties, common_ci.properties, and common_docker.properties still use the old flat format (dbiemr, dbidentity, etc.). While Spring Boot's relaxed binding will transparently handle both formats, update these environment-specific property files to match the new hyphenated convention for consistency and to avoid confusion.

src/main/java/com/db/piramalswasthya/config/FlywayConfig.java (1)

13-13: Inconsistent bean naming: flywaydbiemr vs. flywayDbidentity, flywayDbreporting, flywayDb1097identity.

The other three bean methods use camelCase (flywayDbidentity, flywayDbreporting, flywayDb1097identity), but this one was changed to all-lowercase. Pick one convention and apply it consistently.

rules.yaml.example (1)

1-89: Consider documenting which columns participate in cross-schema joins.

With 4 schemas sharing foreign keys (e.g., BenRegID / BeneficiaryRegID), it's easy to accidentally break referential integrity. Adding a comment block at the top of the rules file listing the cross-schema join columns and their required strategy consistency would help prevent future drift.

docs/anonymization/GUIDE.md (1)

53-53: Multiple fenced code blocks missing language specifiers.

The architecture diagram (line 53) and several error output blocks (lines 227, 484, 501, 517, 536, 552, 581) lack language identifiers. Add text or an appropriate language hint to satisfy markdownlint (MD040) and improve rendering.

src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java (1)

134-139: Regex patterns are recompiled on every call.

Pattern.compile is invoked each time matchesPattern is called. For built-in deny patterns (which are constants), consider pre-compiling them as static Pattern fields. This is a minor optimization for a check that runs once, but it's a good practice for correctness (compile-once, match-many).

src/test/java/com/db/piramalswasthya/anonymizer/config/ConfigurationTest.java (1)

77-78: Temp file cleanup is not guaranteed on assertion failure.

If any assertion on lines 83–95 throws, Files.deleteIfExists(tempFile) on Line 96 is never reached, leaving temp files behind. Same issue in testLoadAnonymizationRules. Consider using JUnit's @TempDir annotation or a try-finally block.

♻️ Example using `@TempDir`
+import org.junit.jupiter.api.io.TempDir;
+import java.nio.file.Path;
+
 class ConfigurationTest {
+
+    `@TempDir`
+    Path tempDir;
     
     `@Test`
     void testLoadAnonymizerConfig() throws Exception {
         String yaml = """
             ...
             """;
         
-        Path tempFile = Files.createTempFile("config", ".yaml");
+        Path tempFile = tempDir.resolve("config.yaml");
         Files.writeString(tempFile, yaml);
         
         // ... assertions ...
-        
-        Files.deleteIfExists(tempFile);
     }

Also applies to: 96-96

src/main/java/com/db/piramalswasthya/anonymizer/cli/DiffSchemaCommand.java (1)

322-330: No connection timeout or SSL configuration on the DataSource.

createDataSource sets host/port/schema/credentials but does not configure connection timeout (which is defined in AnonymizerConfig.DatabaseConfig.connectionTimeout) or any SSL/TLS settings. For connecting to a read replica, especially across networks, timeouts and encrypted transport should be configured.

src/main/java/com/db/piramalswasthya/anonymizer/config/AnonymizationRules.java (1)

53-57: Consider making strategy a typed enum instead of a free-form String.

ColumnRule.strategy is a String, meaning typos like "HMAC_HAHS" in rules.yaml won't be caught at load time. An enum (e.g., HMAC_HASH, FAKE_NAME, MASK_PHONE, GENERALIZE_DATE, PRESERVE) would provide fail-fast validation during deserialization. You can use Jackson's @JsonCreator or READ_UNKNOWN_ENUM_VALUES_AS_NULL for forward compatibility.

src/main/java/com/db/piramalswasthya/anonymizer/report/RunReport.java (1)

45-121: LGTM overall — clean, PII-safe report design.

The structure is well-thought-out with proper null handling in getTotalRowsProcessed and separation of table-level metrics.

One minor note: addTableReport (line 85) stores the strategies map reference directly. If the caller mutates the map after the call, the report is affected. Consider a defensive copy.

Proposed defensive copy
-        tr.setStrategyCounts(strategies);
+        tr.setStrategyCounts(strategies != null ? new HashMap<>(strategies) : new HashMap<>());
src/main/java/com/db/piramalswasthya/anonymizer/output/DirectRestoreWriter.java (1)

96-107: resetSchema fallback to deleteAllData() may silently mask permission issues that indicate misconfiguration.

If DROP DATABASE fails due to permissions, falling back to DELETE FROM all tables may mask a serious misconfiguration (e.g., accidentally pointing at the wrong database). Consider logging at ERROR level rather than WARN, or at least including the full exception for diagnostics.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@pom.xml`:
- Around line 98-108: Update the Picocli dependency version from 4.7.5 to 4.7.7
by modifying the <artifactId>picocli</artifactId> dependency entry in the pom
(change the <version> value for picocli to 4.7.7); leave the <dependency> entry
for com.fasterxml.jackson.dataformat:jackson-dataformat-yaml without an explicit
<version> so it continues to inherit the Spring Boot BOM managed version.

In `@src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java`:
- Around line 424-433: The code in RunCommand sets
ds.setVerifyServerCertificate(false) which disables TLS server verification;
make this configurable instead of hardcoding false by adding a flag/property to
AnonymizerConfig.DatabaseConfig (e.g., verifyServerCertificate or
requireServerCert) and read dbConfig.getVerifyServerCertificate() (or similar)
in the try block to pass into ds.setVerifyServerCertificate(...); ensure
dbConfig has a sensible default (true for production) and update any config
parsing/loading code to support the new boolean so tests and environments can
opt-out if needed.
- Around line 184-194: The report logging path is inconsistent: replace
hardcoded LOGS_DIRECTORY in the three writeReport/log lines with the configured
path (use config.getLoggingPath() where config is guaranteed to exist, i.e., the
success path after report creation), and for the error paths (where config may
be null) pass a fallback expression such as (config != null ?
config.getLoggingPath() : LOGS_DIRECTORY); update the calls referencing
writeReport, and the log.info lines that print the "Report written to" path, so
all three locations use the config fallback consistently within RunCommand
(refer to writeReport, LOGS_DIRECTORY, and config.getLoggingPath()).
- Around line 317-322: The strategy counts are being incremented per batch in
updateStrategyCounts causing inflation; move the call out of processBatch and
invoke updateStrategyCounts once per table inside processTable (before or after
batching) using the existing tableRules and strategyCounts so it reflects the
column→strategy mapping, or if you intended to count rows per strategy change
updateStrategyCounts to accept a rowCount (e.g., batchSize or rowsProcessed) and
add that amount (strategyCounts.merge(strategy, rowCount, Integer::sum))
instead; adjust calls in processBatch/processTable accordingly to use either the
one-time per-table call or the per-row increment variant.
- Around line 97-183: The try block in RunCommand.java only catches SQLException
and IOException so exceptions from SafetyGuard.validateSafeToConnect,
validateConfiguration, HmacAnonymizer constructor (and other runtime/checked
exceptions) can bypass report finalization; add a broad fallback (catch
(Exception e)) or a finally section that always sets report end time, total
duration, a failure status/error message, and calls writeReport(report,
LOGS_DIRECTORY, executionId) so the report is written for all failures; ensure
the fallback logs the sanitized error (reuse sanitizeError(e) or e.getMessage())
and does not swallow the original exception details.

In
`@src/main/java/com/db/piramalswasthya/anonymizer/config/AnonymizerConfig.java`:
- Around line 56-57: SafetyConfig currently has the boolean field enabled
defaulting to false which causes SafetyGuard.validateSafeToConnect to be
bypassed when safety.enabled is omitted; change SafetyConfig so that enabled
defaults to true (e.g., initialize the field or provide a getter that returns
true when null) and ensure any YAML binding still maps to safety.enabled, so
SafetyGuard.validateSafeToConnect and calls to config.isEnabled() will treat
safety as enabled by default unless explicitly set to false.
- Around line 44-53: DatabaseConfig.readOnly is never applied to created
DataSources; update RunCommand.createDataSource and
DiffSchemaCommand.createDataSource to honor DatabaseConfig.readOnly by returning
a DataSource wrapper that sets Connection.setReadOnly(readOnly) on each obtained
Connection (or otherwise sets read-only on connections before use) rather than
just configuring MysqlDataSource; reference DatabaseConfig.readOnly,
RunCommand.createDataSource, DiffSchemaCommand.createDataSource and
MysqlDataSource when making the change so every call to getConnection() yields a
Connection with setReadOnly(databaseConfig.isReadOnly()) applied.
🧹 Nitpick comments (3)
src/main/java/com/db/piramalswasthya/anonymizer/config/AnonymizerConfig.java (1)

55-102: Redundant manual accessors — @Data already generates these.

SafetyConfig is annotated with @Data, which generates all getters/setters via Lombok. Lines 63–101 manually re-declare every accessor, duplicating what Lombok provides. The other nested classes (DatabaseConfig, PerformanceConfig) correctly rely solely on @Data.

Remove the manual methods to stay consistent with the rest of the file.

♻️ Proposed fix
 `@Data`
 public static class SafetyConfig {
     private boolean enabled;
     private List<String> allowedHosts;
     private List<String> deniedPatterns;
     private boolean requireExplicitApproval;
     private String approvalFlag;
-        
-        public boolean isEnabled() {
-            return enabled;
-        }
-        
-        public void setEnabled(boolean enabled) {
-            this.enabled = enabled;
-        }
-        
-        public List<String> getAllowedHosts() {
-            return allowedHosts;
-        }
-        
-        public void setAllowedHosts(List<String> allowedHosts) {
-            this.allowedHosts = allowedHosts;
-        }
-        
-        public List<String> getDeniedPatterns() {
-            return deniedPatterns;
-        }
-        
-        public void setDeniedPatterns(List<String> deniedPatterns) {
-            this.deniedPatterns = deniedPatterns;
-        }
-        
-        public boolean isRequireExplicitApproval() {
-            return requireExplicitApproval;
-        }
-        
-        public void setRequireExplicitApproval(boolean requireExplicitApproval) {
-            this.requireExplicitApproval = requireExplicitApproval;
-        }
-        
-        public String getApprovalFlag() {
-            return approvalFlag;
-        }
-        
-        public void setApprovalFlag(String approvalFlag) {
-            this.approvalFlag = approvalFlag;
-        }
 }
src/test/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuardTest.java (1)

91-115: Missing test: allowlisted host that matches a built-in deny pattern.

The current suite doesn't cover the important edge case where a host is in the allowlist and matches a built-in deny pattern (e.g., "replica.nonprod.internal" matching .*prod.*). Per the SafetyGuard logic, the built-in deny check fires even for allowlisted hosts and requires requireExplicitApproval + a correct approval flag to bypass. This is a critical path for legitimate non-production replicas whose hostnames contain "prod".

Consider adding a test like:

`@Test`
void testAllowlistedHostMatchingDenyPattern_RequiresApproval() {
    AnonymizerConfig.SafetyConfig config = new AnonymizerConfig.SafetyConfig();
    config.setEnabled(true);
    config.setAllowedHosts(List.of("replica.nonprod.internal"));
    config.setRequireExplicitApproval(true);
    config.setApprovalFlag("CONFIRM");

    SafetyGuard guard = new SafetyGuard(config);

    // Denied without approval flag despite being in allowlist
    assertThrows(SafetyViolationException.class, () ->
        guard.validateSafeToConnect("replica.nonprod.internal", "db_uat", null)
    );

    // Allowed with correct approval flag
    assertDoesNotThrow(() ->
        guard.validateSafeToConnect("replica.nonprod.internal", "db_uat", "CONFIRM")
    );
}
src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java (1)

464-479: Sanitize logic is reasonable but consider edge cases.

The regex '[^']{10,}' only redacts quoted strings longer than 10 characters, which means shorter PII values (e.g., short names, phone numbers in quotes) could leak into error messages/reports. Consider lowering the threshold or redacting all quoted strings.

fix: improve error handling in anonymizer
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java`:
- Around line 219-251: processSchema currently creates targetDataSource using
createDataSource which wraps with ReadOnlyAwareDataSource based on
DatabaseConfig.readOnly (default true), causing writes to fail; modify
RunCommand.processSchema to validate the target DatabaseConfig (from
config.getTarget()) before creating the target DataSource and fail fast if it is
readOnly (or explicitly set writable). Specifically, inspect
DatabaseConfig.isReadOnly() for the target, and if true either (a) throw an
informative IllegalStateException/log and abort (so
DirectRestoreWriter.resetSchema/write operations won't run) or (b) override the
flag to false when constructing the target DataSource; reference
createDataSource, ReadOnlyAwareDataSource, DatabaseConfig.isReadOnly(),
RunCommand.processSchema, and DirectRestoreWriter.resetSchema to locate where to
add this check.

In `@src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java`:
- Around line 196-222: The validation currently skips empty tables and opens its
own connection; change validateNumericPrimaryKey to accept a Connection (or
provide an overload) so callers like streamTable can pass their existing
Connection (reuse connection) and change the validation query to fetch metadata
even for empty tables (e.g., use "SELECT %s FROM %s WHERE 1=0" or "LIMIT 0") so
ResultSetMetaData is available even when rs.next() is false; then use the
ResultSetMetaData to check column type (Types.TINYINT/SMALLINT/INTEGER/BIGINT)
and throw the same SQLException if it’s not numeric, and update callers to pass
the Connection (or document intentional separate-connection behavior if you keep
the original signature).
- Around line 170-194: In processBatch(PreparedStatement stmt, List<String>
columns, String primaryKey) ensure the primary key value is stored in the
RowData returned: after populating row from the provided columns (inside the
rs.next() loop) check if row.containsKey(primaryKey) and if not, read the PK
value (rs.getObject(primaryKey)) and put it into row before adding to batch;
keep the existing numeric check that sets lastKey using the same PK value and
throw the same SQLException if the PK is not numeric.

In `@src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java`:
- Around line 170-176: hostMatchesWildcard currently builds a regex by doing
allowed.replace("*", ".*") which fails to escape other regex metacharacters;
update hostMatchesWildcard to escape the allowed string with Pattern.quote (or
equivalent) first, then restore wildcard semantics by replacing the quoted "\*"
(backslash-star) with ".*" so only '*' becomes a wildcard and other
metacharacters remain literal, and use the resulting regex with host.matches or
Pattern/Matcher to perform the comparison; ensure null checks remain.

In `@src/main/resources/anonymization-registry.yml`:
- Around line 1-17: The file src/main/resources/anonymization-registry.yml is
orphaned and mismatches the application's model: it references
AnonymizationRegistry (which doesn't exist) and uses fields incompatible with
AnonymizationRules and ColumnRule; either delete this YAML or adapt it and
wiring to match the existing model. To fix, either (A) remove
anonymization-registry.yml and any references to it, or (B) create a matching
AnonymizationRegistry Java class and update startup/config loader to load this
resource, and/or transform the YAML shape to the AnonymizationRules structure
(use rulesVersion, unknownColumnPolicy, databases -> ColumnRule entries with
only strategy and format) and update the codepaths that load AnonymizationRules
to accept this file; reference the classes AnonymizationRegistry,
AnonymizationRules, and ColumnRule to locate where to implement the change.
🧹 Nitpick comments (2)
src/main/java/com/db/piramalswasthya/anonymizer/core/HmacAnonymizer.java (1)

114-121: HmacAnonymizerException is package-private but wraps failures from the public hashId method.

Callers outside com.db.piramalswasthya.anonymizer.core cannot catch HmacAnonymizerException by type, forcing them to catch RuntimeException. Consider making it public (or a public static inner class of HmacAnonymizer) for more precise exception handling.

src/main/java/com/db/piramalswasthya/anonymizer/config/AnonymizerConfig.java (1)

56-103: Redundant explicit getters/setters — @Data already generates them.

SafetyConfig is annotated with @Data (which generates getters, setters, equals, hashCode, toString), yet all getters and setters are also manually written. The same applies to AnonymizerConfig (lines 112-126) with getSource/setSource/getTarget/setTarget. Remove the manual accessors to reduce noise, or remove @Data if you prefer explicit control.

techy4shri and others added 4 commits February 15, 2026 21:56
….java


this is a suggestion and must not be done without need

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
fix: code cleanup and exception handling improvements
@techy4shri
Copy link
Copy Markdown
Author

@drtechie everything done from my side, kindly check the cli tool from your side once to confirm it works properly and nothing has broken.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (4)
src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java (1)

383-394: ObjectMapper is instantiated twice with the same configuration.

loadConfig and loadRules each create a new ObjectMapper(new YAMLFactory()). This is a minor inefficiency — ObjectMapper is thread-safe and reusable.

♻️ Reuse a single ObjectMapper instance
+    private static final ObjectMapper YAML_MAPPER = new ObjectMapper(new YAMLFactory());
+
     private AnonymizerConfig loadConfig(String path) throws IOException {
-        ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
-        return mapper.readValue(new File(path), AnonymizerConfig.class);
+        return YAML_MAPPER.readValue(new File(path), AnonymizerConfig.class);
     }
     
     private AnonymizationRules loadRules(String path) throws IOException {
-        ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
-        return mapper.readValue(new File(path), AnonymizationRules.class);
+        return YAML_MAPPER.readValue(new File(path), AnonymizationRules.class);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java` around
lines 383 - 394, Both loadConfig and loadRules create a new ObjectMapper(new
YAMLFactory()) separately; make a single reusable ObjectMapper (e.g., a private
static final ObjectMapper YAML_MAPPER = new ObjectMapper(new YAMLFactory()) or
an instance field) and replace the new ObjectMapper calls in loadConfig and
loadRules with that shared YAML_MAPPER to avoid redundant instantiation while
preserving thread-safety.
src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java (1)

252-270: RowData.getData() returns the internal mutable map directly.

Callers (e.g., RunCommand.processBatch at line 329) receive the live LinkedHashMap. If any consumer mutates it after anonymization (adds/removes keys, changes values), it silently corrupts the row. Returning Collections.unmodifiableMap(data) would make this defensive without breaking read-only consumers.

♻️ Proposed defensive copy
         public java.util.Map<String, Object> getData() {
-            return data;
+            return java.util.Collections.unmodifiableMap(data);
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java`
around lines 252 - 270, RowData.getData() exposes the internal mutable
LinkedHashMap which lets callers (e.g., RunCommand.processBatch) mutate row
contents after anonymization; change getData() to return a defensive view such
as Collections.unmodifiableMap(data) (or a shallow copy like new
LinkedHashMap<>(data)) so consumers cannot modify the internal map; update the
method body of RowData.getData() to wrap/clone `data` before returning and
ensure callers expecting mutability are adjusted accordingly.
src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java (2)

134-146: User-supplied custom deny patterns are passed directly to Pattern.compile without validation.

A malformed regex in config.getDeniedPatterns() (e.g., [unclosed) will throw an uncaught PatternSyntaxException at runtime with a non-obvious stack trace. Consider validating patterns at construction time or catching the exception with a clear error message.

♻️ Validate custom patterns at construction time
     public SafetyGuard(AnonymizerConfig.SafetyConfig config) {
         this.config = config;
+        validateCustomPatterns();
+    }
+    
+    private void validateCustomPatterns() {
+        if (config.getDeniedPatterns() == null) return;
+        for (String pattern : config.getDeniedPatterns()) {
+            try {
+                Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
+            } catch (java.util.regex.PatternSyntaxException e) {
+                throw new IllegalArgumentException(
+                    "Invalid custom deny pattern '" + pattern + "': " + e.getMessage(), e);
+            }
+        }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java`
around lines 134 - 146, The current code passes user regexes from
config.getDeniedPatterns() directly to matching (via matchesPattern), which can
throw a PatternSyntaxException at runtime; fix by validating/compiling those
patterns at SafetyGuard construction time: in the SafetyGuard constructor
iterate config.getDeniedPatterns(), attempt Pattern.compile(...) for each,
collect the compiled java.util.regex.Pattern instances (or throw a clear
IllegalArgumentException/SafetyViolationException with the invalid pattern and
the PatternSyntaxException message), store them (e.g., compiledDeniedPatterns)
and then update checkCustomDenials to use these compiled patterns instead of raw
strings (or, if you prefer a smaller change, catch PatternSyntaxException inside
checkCustomDenials and rethrow a SafetyViolationException with a clear message
referencing the offending pattern and exception).

177-182: matchesPattern compiles a new Pattern on every invocation.

This is called in a tight loop over deny patterns and is not on a hot path, so the impact is negligible. However, if custom deny patterns grow or this is called frequently, consider caching compiled patterns at construction time.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java`
around lines 177 - 182, The matchesPattern method in SafetyGuard compiles a new
Pattern on every call which is inefficient; precompile and cache deny patterns
(e.g., in a Map<String, Pattern> or List<Pattern>) when the SafetyGuard is
constructed or when patterns are loaded, then change matchesPattern(String
value, String pattern) to look up the compiled Pattern by the pattern key (or
accept a Pattern overload) and call matcher(value).matches(); update any callers
to use the cached patterns loader (or provide an overload matchesPattern(String
value, Pattern compiledPattern)) so runtime compilation is eliminated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java`:
- Around line 288-303: The code builds allColumns from tableRules only, causing
any source columns not listed in tableRules to be omitted and written as NULL;
fix by retrieving the actual source column list (e.g., via the
paginator/engine/table metadata) and use that as the canonical allColumns in
BatchContext instead of List.copyOf(tableRules.getColumns().keySet()); add a
validation step (new method or inline in RunCommand) that compares sourceColumns
to tableRules.getColumns().keySet() and either (a) auto-add missing columns with
a PRESERVE strategy to the rules map or (b) fail/raise an explicit error
depending on UnknownColumnPolicy, and ensure updateStrategyCounts,
processTableBatches, and recordTableCompletion operate on the full source column
list so missing columns are fetched and inserted rather than silently dropped.
- Around line 442-464: createDataSource currently only sets connect timeout but
not a socket timeout, so long-running reads can hang; update the
createDataSource method to set the MySQL Connector/J socket timeout by calling
MysqlDataSource.setIntegerRuntimeProperty("socketTimeout", <milliseconds>)
(e.g., 300000 for 5 minutes) alongside the other advanced settings, keep it
inside the existing try/catch that logs failures, and return the
ReadOnlyAwareDataSource as before so ReadOnlyAwareDataSource, MysqlDataSource
and AnonymizerConfig.DatabaseConfig remain the same.

In `@src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java`:
- Around line 169-175: hostMatchesWildcard currently calls Pattern.quote on the
whole pattern which prevents '*' wildcards from working; modify
hostMatchesWildcard to split the allowed string on the literal '*' character,
quote each segment with Pattern.quote, then join the quoted segments with ".*"
to build the regex (optionally anchor start/end or use matches as before), and
return host.matches(builtRegex); keep the null checks and ensure empty segments
are handled so patterns like "*.uat.internal" become ".*\.uat\.internal".

In `@src/main/resources/anonymization-registry.yml`:
- Around line 90-94: The Password field in anonymization-registry.yml is marked
nullable: false but uses strategy: "SUPPRESS", which writes NULL and will
violate NOT NULL constraints for the Password column (m_user -> Password);
change the strategy to a non-null-producing anonymization method such as
"RANDOM_FAKE_DATA" or a fixed HMAC/hash strategy so anonymized values remain
non-null and satisfy the database constraint while neutralizing credentials, and
update the Password entry's strategy accordingly.
- Around line 51-55: The AadhaarNo anonymization entry currently uses an unkeyed
hash; update the anonymization-registry.yml entry for the AadhaarNo
canonicalName to use strategy: "HMAC_HASH" instead of "HASH_SHA256", and ensure
the HMAC key is configured in the system's secrets/config (the HMAC key
reference used by the anonymizer) so the anonymizer implementation that reads
the AadhaarNo strategy will apply a keyed HMAC; locate the AadhaarNo block in
anonymization-registry.yml and replace the strategy value and confirm the
runtime secret name used by the anonymizer is present.
- Around line 31-35: The registry assigns strategy "HMAC_HASH" to numeric
columns (e.g., BeneficiaryID, UserID, BeneficiaryRegID) but
HmacAnonymizer.hashId() returns a 64-char hex string and
DirectRestoreWriter.writeInsert() uses stmt.setObject(), causing SQL type errors
and non-matching joins; fix by either (A) changing the
anonymization-registry.yml entries for those fields (BeneficiaryID, UserID,
BeneficiaryRegID) to a numeric-compatible strategy (e.g., a documented
"HMAC_TO_LONG" / "HMAC_TRUNCATE_TO_BIGINT" or a deterministic hash-to-int
strategy) or (B) changing the DB column mapping to accept strings (VARCHAR(64))
so HMAC_HASH is valid, and additionally update DirectRestoreWriter.writeInsert()
to detect HmacAnonymizer.hashId() outputs and call stmt.setString(...) (not
setObject) for hex hashes to avoid type conversion errors; reference
HmacAnonymizer.hashId() and DirectRestoreWriter.writeInsert() when making these
code/registry/schema changes.
- Around line 61-65: The applyStrategy switch in AnonymizationEngine currently
lacks a handler for the GENERALIZE strategy causing values to be returned
unchanged; update AnonymizationEngine.applyStrategy to handle GENERALIZE by
dispatching to the appropriate generalization routines (e.g., call
HmacAnonymizer.generalizeDate for DATE fields and a new or existing
generalizeText/generalizeVarchar helper for TEXT/VARCHAR/STRING types) and
return the generalized value, ensuring the strategy enum/constant for GENERALIZE
is matched; alternatively, if you prefer an immediate safe fallback, update the
anonymization-registry.yml entries for Address, DateOfBirth, Diagnosis,
DrugName, and TestResult from GENERALIZE to SUPPRESS so AnonymizationEngine will
drop those values—make sure to update related unit tests for
AnonymizationEngine.applyStrategy and add coverage for GENERALIZE behaviour (or
SUPPRESS) to prevent regression.

---

Duplicate comments:
In `@src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java`:
- Around line 451-460: The hardcoded ds.setVerifyServerCertificate(false) in
RunCommand disables TLS identity verification; change this to read a new
configurable flag from DatabaseConfig (e.g., isVerifyServerCertificate or
verifyServerCertificate) instead of hardcoding false. Update the code that
constructs/loads DatabaseConfig to expose a boolean defaulting to true, replace
the direct call with
ds.setVerifyServerCertificate(dbConfig.isVerifyServerCertificate()), and ensure
any config parsing/CLI options and tests are updated to allow overriding the
flag for non-production use while keeping the secure default.

In `@src/main/resources/anonymization-registry.yml`:
- Around line 1-17: The anonymization-registry.yml is likely orphaned and its
fields (canonicalName, dataType, nullable, schemaHash) won't be deserialized
into the existing Java model; locate and either implement a matching model
(e.g., create AnonymizationRegistry and extend ColumnRule to include
canonicalName, dataType, nullable, schemaHash) or update the YAML to match the
current model (ColumnRule with only strategy and format); then ensure the file
is actually loaded by the application by wiring a loader/bean (e.g., in the
existing AnonymizationRules loader or bootstrap code) that reads
anonymization-registry.yml via Jackson/ObjectMapper or
ClassPathResource/getResourceAsStream and registers it for runtime use.

---

Nitpick comments:
In `@src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java`:
- Around line 383-394: Both loadConfig and loadRules create a new
ObjectMapper(new YAMLFactory()) separately; make a single reusable ObjectMapper
(e.g., a private static final ObjectMapper YAML_MAPPER = new ObjectMapper(new
YAMLFactory()) or an instance field) and replace the new ObjectMapper calls in
loadConfig and loadRules with that shared YAML_MAPPER to avoid redundant
instantiation while preserving thread-safety.

In `@src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java`:
- Around line 252-270: RowData.getData() exposes the internal mutable
LinkedHashMap which lets callers (e.g., RunCommand.processBatch) mutate row
contents after anonymization; change getData() to return a defensive view such
as Collections.unmodifiableMap(data) (or a shallow copy like new
LinkedHashMap<>(data)) so consumers cannot modify the internal map; update the
method body of RowData.getData() to wrap/clone `data` before returning and
ensure callers expecting mutability are adjusted accordingly.

In `@src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java`:
- Around line 134-146: The current code passes user regexes from
config.getDeniedPatterns() directly to matching (via matchesPattern), which can
throw a PatternSyntaxException at runtime; fix by validating/compiling those
patterns at SafetyGuard construction time: in the SafetyGuard constructor
iterate config.getDeniedPatterns(), attempt Pattern.compile(...) for each,
collect the compiled java.util.regex.Pattern instances (or throw a clear
IllegalArgumentException/SafetyViolationException with the invalid pattern and
the PatternSyntaxException message), store them (e.g., compiledDeniedPatterns)
and then update checkCustomDenials to use these compiled patterns instead of raw
strings (or, if you prefer a smaller change, catch PatternSyntaxException inside
checkCustomDenials and rethrow a SafetyViolationException with a clear message
referencing the offending pattern and exception).
- Around line 177-182: The matchesPattern method in SafetyGuard compiles a new
Pattern on every call which is inefficient; precompile and cache deny patterns
(e.g., in a Map<String, Pattern> or List<Pattern>) when the SafetyGuard is
constructed or when patterns are loaded, then change matchesPattern(String
value, String pattern) to look up the compiled Pattern by the pattern key (or
accept a Pattern overload) and call matcher(value).matches(); update any callers
to use the cached patterns loader (or provide an overload matchesPattern(String
value, Pattern compiledPattern)) so runtime compilation is eliminated.

@techy4shri
Copy link
Copy Markdown
Author

techy4shri commented Feb 21, 2026

@drtechie made a few changes, removed newly introduced and separate properties for anonymization where the predefined props can be applied. Also, added a util to tackle persistent case of issues with configuration parsing.
Do let me know if any modifications are required. Thanks!

@techy4shri
Copy link
Copy Markdown
Author

techy4shri commented Mar 3, 2026

currently debugging some springboot and functionality issues, please do let me know in case there are any suggestions from the team for implementation. @drtechie @vanitha1822 @paras-dba

@techy4shri
Copy link
Copy Markdown
Author

Update: anonymizer running, dry run succeeded.
image

@techy4shri techy4shri requested a review from drtechie March 8, 2026 19:07
techy4shri and others added 8 commits March 10, 2026 02:12
- we have reached an empassee
- RunCommand has been cleaned up. I plan to make it shine tomorrow.
- dry run successful, as predicted. documentation remains the same. for now
- need to fix reporting
- everything works locally, need full testing now
Added PARTIAL_MASK handling in AnonymizationEngine.applyStrategy:
Kept PermCoordinate generalization as a note.
- anonymizer working, except for a bug
- table detection seems to vary
- will continue to monitor and fix
@techy4shri
Copy link
Copy Markdown
Author

fixing a bug, will take a while.

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants