Automate anonymized production DB snapshot for UAT#93
Automate anonymized production DB snapshot for UAT#93techy4shri wants to merge 41 commits intoPSMRI:mainfrom
Conversation
fix: readme and gitignore chore: update application properties and environment configs
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis pull request introduces a comprehensive database anonymization framework for AMRIT DB, including deterministic HMAC-based data masking, CLI tools for execution and schema diffing, configuration-driven rule management, safety guardrails for production databases, and complete documentation and testing infrastructure. Changes
Sequence DiagramsequenceDiagram
participant User
participant CLI as AmritDbAnonymizer
participant RunCmd as RunCommand
participant SafetyGuard
participant SourceDB as Source Database
participant KeysetPaginator
participant AnonymizationEngine
participant TargetDB as Target Database
participant DirectRestoreWriter
participant RunReport
User->>CLI: Execute with config.yaml & approval token
CLI->>RunCmd: Delegate to RunCommand.call()
RunCmd->>RunCmd: Load config & rules from YAML
RunCmd->>RunCmd: Validate configuration
RunCmd->>SafetyGuard: validateSafeToConnect(source_host, approval_token)
SafetyGuard->>SafetyGuard: Check allowlist/deny patterns
SafetyGuard-->>RunCmd: Success or SafetyViolationException
RunCmd->>SafetyGuard: validateSafeToConnect(target_host, approval_token)
SafetyGuard-->>RunCmd: Success or SafetyViolationException
RunCmd->>DirectRestoreWriter: resetSchema(sourceDataSource)
DirectRestoreWriter->>TargetDB: DROP/recreate schema
DirectRestoreWriter->>SourceDB: Clone table structures
DirectRestoreWriter-->>RunCmd: Schema ready
loop For each source schema & table
RunCmd->>KeysetPaginator: streamTable(table, primaryKey, columns, batchHandler)
KeysetPaginator->>SourceDB: Query batch with keyset pagination
SourceDB-->>KeysetPaginator: RowData batch
KeysetPaginator->>AnonymizationEngine: anonymizeBatch(database, table, rows)
AnonymizationEngine->>AnonymizationEngine: Apply per-column rules (HMAC_HASH/PRESERVE)
AnonymizationEngine-->>KeysetPaginator: Anonymized rows + strategy counts
KeysetPaginator->>DirectRestoreWriter: writeInsert(tableName, columns, rows)
DirectRestoreWriter->>TargetDB: Batch INSERT anonymized data
DirectRestoreWriter-->>KeysetPaginator: Insert confirmed
KeysetPaginator-->>RunCmd: Next batch or completion
end
RunCmd->>RunReport: addTableReport(table, rowCount, duration, strategies)
RunCmd->>RunReport: writeToFile(outputPath)
RunReport-->>RunCmd: Report written
RunCmd->>DirectRestoreWriter: close() (commit/rollback)
DirectRestoreWriter-->>RunCmd: Transaction finalized
RunCmd-->>User: Exit with status code & JSON report
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip CodeRabbit can use Trivy to scan for security misconfigurations and secrets in Infrastructure as Code files.Add a .trivyignore file to your project to customize which findings Trivy reports. |
There was a problem hiding this comment.
Actionable comments posted: 11
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/main/java/com/db/piramalswasthya/config/FlywayMigrator.java (1)
37-37: 🛠️ Refactor suggestion | 🟠 MajorRemove
System.out.println("SUCCESS")— use the logger instead.Line 38 already logs the success message via SLF4J. The
System.out.printlnbypasses structured logging and should be removed.Proposed fix
- System.out.println("SUCCESS"); logger.info("Flyway migration completed successfully");
🤖 Fix all issues with AI agents
In `@config.yaml.example`:
- Around line 7-11: The schemas list in config.yaml.example uses "dbiemr" but
rules.yaml.example defines rules under "db_iemr", causing AnonymizationEngine's
direct lookup (rules.getDatabases().get("dbiemr")) to return null and skip the
schema; fix by making the schema name consistent—update config.yaml.example's
schemas entry from "dbiemr" to "db_iemr" (or alternatively rename the key in
rules.yaml.example to "dbiemr") so that the schema name in the schemas list
matches the key used by rules.getDatabases() and the AnonymizationEngine.
In `@rules.yaml.example`:
- Around line 38-39: Cross-schema referential integrity is broken because
db_iemr.t_benvisitdetail.BeneficiaryRegID is set to HMAC_HASH while the matching
foreign-key column db_identity.m_beneficiaryregidmapping.BenRegID is PRESERVE;
update the rules so both sides use the same strategy (either PRESERVE on both if
the ID is non-sensitive or HMAC_HASH on both if sensitive) and then search the
ruleset for any other cross-schema join pairs to apply the same consistent
strategy (look for BeneficiaryRegID and BenRegID in the rules and make the
strategies identical).
In `@src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java`:
- Around line 204-216: The DirectRestoreWriter opened in the try-with-resources
is never marked successful, so its close() will always rollback; after
processing all tables for the schema you must call writer.markSuccess() to
commit the inserts. Update the block that calls processSchemaTables(schema,
rules, engine, paginator, writer, report) (or inside processSchemaTables once
all writes/reports succeed) to invoke DirectRestoreWriter.markSuccess() when no
errors occurred, ensuring commit before the try-with-resources scope ends; keep
the writer.resetSchema(...) call and only markSuccess() after successful
completion of all table processing.
In
`@src/main/java/com/db/piramalswasthya/anonymizer/core/AnonymizationEngine.java`:
- Around line 79-101: The loop incorrectly casts KeysetPaginator.RowData to a
Map which causes ClassCastException; instead, access the underlying
Map<String,Object> (e.g. via RowData#getData() or add a getData()/getDataMap()
accessor on KeysetPaginator.RowData) and iterate a snapshot of its keyset like
new ArrayList<>(row.getData().keySet()) to avoid
ConcurrentModificationException, use Map<String,Object> for the value types (not
Integer), remove the bad cast, keep using row.get(column)/row.put(column,
anonymizedValue), and preserve strategyCounts.merge(rule.getStrategy(), 1,
Integer::sum) and handleUnknownColumn(database, table, column) as before.
In `@src/main/java/com/db/piramalswasthya/anonymizer/core/HmacAnonymizer.java`:
- Around line 81-93: The current fakeName in HmacAnonymizer uses
Math.abs(hash.hashCode()) which can remain negative for Integer.MIN_VALUE;
change seed to a non-negative long (e.g., compute int h = hash.hashCode(); long
seed = (h == Integer.MIN_VALUE) ? ((long)Integer.MAX_VALUE + 1L) :
Math.abs((long)h);) and then use that seed for the name selection (firstNames[
(int)(seed % firstNames.length) ] and lastNames[ (int)((seed / 10) %
lastNames.length) ]) so array indices are always non-negative; update the seed
variable type and usages in fakeName accordingly.
In `@src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java`:
- Around line 94-152: The SELECT built in streamTable doesn't guarantee the
primary key is selected, causing processBatch to fail when it calls
rs.getObject(primaryKey); update streamTable so the constructed columnList
always includes the quoted primary key (use quoteIdentifier(primaryKey)) if it's
not already present (avoid duplicating if present in columns), then build
selectSql from that augmented columnList; keep identifier quoting and the
existing validateNumericPrimaryKey call.
In `@src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java`:
- Around line 90-106: The loop over BUILT_IN_DENY_PATTERNS incorrectly applies
deny checks even for allowlisted hosts; update the logic in SafetyGuard (around
the loop that calls matchesPattern(host, pattern) / matchesPattern(database,
pattern)) to first detect allowlist membership (use the existing allowlist check
or method that determines if a host/database is allowlisted) and skip the
built-in deny pattern check for allowlisted entries, otherwise keep the existing
approval-flag bypass using config.isRequireExplicitApproval(), approvalFlag and
config.getApprovalFlag(); ensure you still throw SafetyViolationException with
the same message when a non-allowlisted host/database matches a deny pattern and
approval is not provided.
In `@src/main/resources/anonymization-registry.json`:
- Line 27: The JSON anonymization registry uses hyphenated keys (e.g.,
"db-1097-identity", "db-iemr") that don't match the underscored names used
elsewhere (e.g., db_1097_identity, db_iemr); update the keys in
src/main/resources/anonymization-registry.json to use the underscored convention
to match the YAML registry and environment configs (replace "db-1097-identity"
with "db_1097_identity" and "db-iemr" with "db_iemr" wherever they appear,
including the entries referenced around lines 27 and 35) so the engine can
correlate entries across files.
In `@src/main/resources/anonymization-registry.yml`:
- Around line 133-134: The registry uses the wrong database key "dbiemr" which
must match the rest of the config that uses "db_iemr"; replace the "dbiemr"
entry with "db_iemr" (i.e., update the key in the anonymization-registry.yml so
the schemaHash entry is under db_iemr) to ensure lookups succeed against the
existing environment configs and JSON registry.
- Around line 26-35: Several columns in the YAML use the key anonymizationRule
but the Java model ColumnRule only defines strategy, causing rule.getStrategy()
to return null and crash when applyStrategy() calls toUpperCase(); update every
column entry that uses anonymizationRule (e.g., BenRegID, FirstName, LastName,
PhoneNo, AadhaarNo, EmailID, Address, DateOfBirth, Gender, UserName, Password,
Diagnosis, DrugName, DoctorName, TestResult, LabTechnicianName) to use strategy
instead (match the existing BeneficiaryID entry) so deserialization populates
ColumnRule.strategy and applyStrategy() receives a non-null value.
In `@src/test/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuardTest.java`:
- Around line 37-50: The test fails because SafetyGuard.validateSafeToConnect
applies the built-in deny regex (e.g., ".*prod.*") before checking the
allowlist, causing allowlisted hosts like "replica.nonprod.internal" to be
blocked; change the logic in SafetyGuard.validateSafeToConnect so that allowed
hosts (from AnonymizerConfig.SafetyConfig.getAllowedHosts / AllowedHosts) are
checked first and bypass built-in deny checks (i.e., if host is explicitly
allowlisted and requireExplicitApproval is false, return/permit immediately),
ensuring allowlist takes precedence over built-in deny patterns.
🟠 Major comments (13)
src/main/environment/anonymization_production.yaml.example-21-25 (1)
21-25:⚠️ Potential issue | 🟠 MajorTarget schemas share identical names with source schemas — risk of accidental production overwrite.
The target schemas (
db_iemr,db_identity,db_reporting,db_1097_identity) are identical to the source schemas on lines 9-12. If someone misconfigures the targethostto point at a production replica or primary, the anonymization engine could overwrite live production data. The local config template correctly uses_uat-suffixed schema names for targets — apply the same convention here, or add a safety guard that rejects same-host + same-schema combinations.src/main/resources/anonymization-registry.json-1-77 (1)
1-77:⚠️ Potential issue | 🟠 MajorStrategy mismatches between JSON and YAML registries for the same columns.
Several columns have conflicting anonymization strategies between this file and the YAML registry:
BeneficiaryID:HASH_SHA256(JSON) vsHMAC_HASH(YAML)Diagnosis:PRESERVE(JSON) vsGENERALIZE(YAML)TestResult:PRESERVE(JSON) vsGENERALIZE(YAML)- Name fields:
RANDOM_UUID(JSON) vsRANDOM_FAKE_DATA(YAML)Clarify which file is the source of truth (the YAML registry header claims to be) and ensure the JSON registry is either auto-generated from it or removed to avoid drift.
src/main/resources/anonymization/pii-detection-rules.yml-7-15 (1)
7-15:⚠️ Potential issue | 🟠 Major
riskLevel: LINKING_KEYis not a standard risk level.All other rules use
CRITICAL,HIGH,MEDIUM,LOW, orNONEforriskLevel. Line 12 usesLINKING_KEY, which appears to be a copy of thepiiCategoryvalue. This will likely fail validation or be ignored by the engine.Proposed fix
piiCategory: LINKING_KEY - riskLevel: LINKING_KEY + riskLevel: CRITICALsrc/main/resources/anonymization/pii-detection-rules.yml-150-158 (1)
150-158:⚠️ Potential issue | 🟠 MajorPAN regex
.*[Pp][Aa][Nn].*is overly broad — will match unrelated columns.This pattern matches any column containing "pan" (case-insensitive), which will produce false positives on columns like
CompanyName,PanelID,Expansion,PancardNo, etc. Consider anchoring the pattern more tightly.Proposed fix
- columnNamePattern: ".*[Pp][Aa][Nn].*|^PANNo$|^PANCard$" + columnNamePattern: "^PAN$|^PANNo$|^PANCard$|^PAN[Nn]umber$|.*[Pp]an[Cc]ard.*"src/main/resources/application.properties-28-48 (1)
28-48:⚠️ Potential issue | 🟠 MajorHardcoded database credentials (
root/1234) committed to version control.
application.propertiesis packaged into the WAR artifact. Use placeholders or environment-variable references (${DB_PASSWORD}) as done in the example/test configs, and override via environment-specific profiles or runtime properties.Proposed fix
-spring.datasource.db-iemr.username=root -spring.datasource.db-iemr.password=1234 +spring.datasource.db-iemr.username=${DB_IEMR_USERNAME} +spring.datasource.db-iemr.password=${DB_IEMR_PASSWORD}Apply the same pattern for
db-identity,db-reporting, anddb-1097-identity.src/main/environment/common_test.properties-18-20 (1)
18-20:⚠️ Potential issue | 🟠 MajorProperty keys not migrated to hyphenated form—datasources will fail to bind.
DatasourceConfig.javaexpects hyphenated prefixes (spring.datasource.db-iemr,spring.datasource.db-identity,spring.datasource.db-reporting,spring.datasource.db-1097-identity), andcommon_example.propertieswas updated accordingly. However,common_test.propertiesstill uses the old non-hyphenated keys (dbiemr,dbidentity,dbreporting). Spring's@ConfigurationPropertiesbinding will fail to load these values, leaving all datasource beans uninitialized.Update lines 19–22, 26–29, 32–35, and similar sections to use the hyphenated prefixes:
spring.datasource.db-iemr.*,spring.datasource.db-identity.*,spring.datasource.db-reporting.*, etc.src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java-78-80 (1)
78-80:⚠️ Potential issue | 🟠 MajorRegex injection risk in allowlist host matching.
Line 80:
allowed.replace("*", ".*")does not escape other regex metacharacters. Hostnames contain dots (.), which are regex wildcards. For example, an allowlist entrydb.example.combecomes the regexdb.example.com, which matchesdbXexampleYcom. For a safety-critical guard, this is too permissive.Use
Pattern.quoteon literal segments and only expand*globs explicitly:🛡️ Proposed fix: escape regex metacharacters
inAllowlist = config.getAllowedHosts().stream() - .anyMatch(allowed -> host.equalsIgnoreCase(allowed) || - host.matches(allowed.replace("*", ".*"))); + .anyMatch(allowed -> host.equalsIgnoreCase(allowed) || + matchesPattern(host, + "\\Q" + allowed.replace("*", "\\E.*\\Q") + "\\E"));src/main/java/com/db/piramalswasthya/anonymizer/cli/DiffSchemaCommand.java-71-126 (1)
71-126:⚠️ Potential issue | 🟠 MajorNo
SafetyGuardvalidation before connecting to the database.The
diff-schemacommand connects directly to the source database without invokingSafetyGuard.validateSafeToConnect(). This bypasses the safety layer designed to prevent accidental execution against production databases. TheRunCommand(per the PR description) presumably enforces this, butdiff-schemaalso reads from a live DB and should be equally guarded.🛡️ Proposed fix: add safety check before connecting
+import com.db.piramalswasthya.anonymizer.safety.SafetyGuard; + // Inside call(), after loading config: +SafetyGuard safetyGuard = new SafetyGuard(config.getSafety()); +safetyGuard.validateSafeToConnect( + config.getSource().getHost(), + String.join(",", config.getSource().getSchemas()), + null // or wire up an --approval-flag option +);src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java-364-370 (1)
364-370:⚠️ Potential issue | 🟠 MajorSSL certificate verification is disabled — MITM risk for database connections carrying credentials and PII.
setVerifyServerCertificate(false)withsetUseSSL(true)means TLS is enabled but the server certificate is not verified. An attacker in a network position between the tool and the database could intercept credentials and data. For a tool explicitly designed to handle production database access and PII, this weakens the security posture.At minimum, make certificate verification configurable rather than hardcoded to
false, and default totrue.src/main/java/com/db/piramalswasthya/anonymizer/config/AnonymizerConfig.java-44-53 (1)
44-53:⚠️ Potential issue | 🟠 Major
@Data-generatedtoString()will include the database password.If a
DatabaseConfiginstance is ever logged (directly or via the parentAnonymizerConfig), the password will appear in plaintext in logs. This is a compliance/privacy risk for a tool handling production database credentials.Exclude
passwordfromtoString()using Lombok's@ToString.Exclude:Proposed fix
+import lombok.ToString; ... `@Data` public static class DatabaseConfig { private String host; private int port = 3306; private List<String> schemas; private String username; + `@ToString.Exclude` private String password; private boolean readOnly = true; private int connectionTimeout = 30000; }src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java-257-280 (1)
257-280:⚠️ Potential issue | 🟠 MajorStrategy counts are incorrectly computed — return value of
anonymizeBatchis discarded.Line 260 calls
engine.anonymizeBatch(...)but discards its return value (the actual per-strategy counts). Instead, lines 273–276 iterate over the rule definitions and increment by 1 per column per batch — this countsbatches × columns, not actual row-level applications.Proposed fix
try { // Anonymize batch - engine.anonymizeBatch(schema, tableName, batch); + Map<String, Integer> batchStrategyCounts = + engine.anonymizeBatch(schema, tableName, batch); // Convert RowData to Map List<Map<String, Object>> rowMaps = batch.stream() .map(row -> row.getData()) .toList(); // Write to target writer.writeInsert(tableName, allColumns, rowMaps); rowCount[0] += batch.size(); // Track strategy counts - for (String column : tableRules.getColumns().keySet()) { - String strategy = tableRules.getColumns().get(column).getStrategy(); - strategyCounts.merge(strategy, 1, Integer::sum); - } + batchStrategyCounts.forEach((strategy, count) -> + strategyCounts.merge(strategy, count, Integer::sum));src/main/java/com/db/piramalswasthya/anonymizer/output/DirectRestoreWriter.java-146-154 (1)
146-154:⚠️ Potential issue | 🟠 Major
setCatalog(schema)is called afterSHOW TABLEShas already executed.In the try-with-resources, all three resources (connection, statement, result set) are initialized sequentially.
stmt.executeQuery("SHOW TABLES")runs before the body'ssrcConn.setCatalog(schema). This currently works only because thesourceDataSourceis already configured with the correct schema viasetDatabaseName(), making thesetCatalogcall a no-op.Move
setCatalogbefore executing the query, or remove it if it's intentionally redundant.Proposed fix
- try (Connection srcConn = sourceDataSource.getConnection(); - Statement stmt = srcConn.createStatement(); - ResultSet rs = stmt.executeQuery("SHOW TABLES")) { - - srcConn.setCatalog(schema); + try (Connection srcConn = sourceDataSource.getConnection()) { + srcConn.setCatalog(schema); + try (Statement stmt = srcConn.createStatement(); + ResultSet rs = stmt.executeQuery("SHOW TABLES")) { while (rs.next()) { tables.add(rs.getString(1)); } + } }src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java-421-430 (1)
421-430:⚠️ Potential issue | 🟠 MajorAdd
mapper.findAndRegisterModules()to serializeLocalDateTimefields
RunReportcontainsLocalDateTimefields (startTime,endTime). The plainnew ObjectMapper()inwriteReport()will not serialize these without registering the JSR310 module. While the project includesjackson-datatype-jsr310via Spring Boot, plain ObjectMapper instances require explicit module registration.RunReport.writeToFile()already does this correctly withmapper.findAndRegisterModules().Proposed fix
private void writeReport(RunReport report, String loggingPath, String executionId) throws IOException { Files.createDirectories(Paths.get(loggingPath)); String reportPath = loggingPath + "/run-report-" + executionId + ".json"; ObjectMapper mapper = new ObjectMapper(); + mapper.findAndRegisterModules(); mapper.writerWithDefaultPrettyPrinter().writeValue(new File(reportPath), report); log.info("Report written to: {}", reportPath); }
🟡 Minor comments (13)
README.md-39-39 (1)
39-39:⚠️ Potential issue | 🟡 MinorDatabase naming inconsistency across documentation.
The README lists
dbiemr(without underscore) as the required database name, but:
anonymization_example.yamlline 8 usesdb_iemr(with underscore)application-executor.propertiesline 22 usesdb_iemr(with underscore)- Lines 40-42 of this file use underscores for other databases (
db_reporting,db_identity, etc.)Clarify whether the database should be named
dbiemrordb_iemrand ensure consistency across all configuration files and documentation.src/main/resources/application-executor.properties-22-22 (1)
22-22:⚠️ Potential issue | 🟡 MinorUpdate
anonymization.lookup.databaseto match actual database name.Line 22 uses
db_iemr, but the actual MySQL database name isdbiemr(as shown in JDBC URLs and Spring datasource configuration). Update this todbiemrfor consistency with the application configuration andanonymization-registry.yml.src/main/environment/anonymization_example.yaml-45-45 (1)
45-45:⚠️ Potential issue | 🟡 MinorDocument the expected location for
rules.yaml.The
rulesFile: rules.yamlresolves relative to the current working directory (where the JVM is invoked), but this is not documented. Users running from different directories may encounter "file not found" errors. Add guidance to GUIDE.md clarifying:
- The expected location (e.g., repo root, same directory as config file)
- Whether relative paths are supported
- An example showing the correct directory structure
Consider documenting this in the Configuration section alongside the note that config files are stored in
src/main/environment/.src/main/environment/anonymization_production.yaml.example-41-41 (1)
41-41:⚠️ Potential issue | 🟡 MinorApproval flag comment says "Rotate monthly" but value suggests quarterly (
Q1).The approval flag
PROD_TO_UAT_2025_Q1implies quarterly rotation, but the inline comment says# Rotate monthly. Clarify the intended rotation cadence and update either the flag naming convention or the comment.docs/anonymization/GUIDE.md-714-720 (1)
714-720:⚠️ Potential issue | 🟡 MinorSpot-check example contradicts rules:
BenRegIDisPRESERVE, not hashed.The rules file marks
BenRegIDasPRESERVE, so it should retain its original integer value in UAT — not appear as a "long hex hash (64 chars)." Update this example to show the preserved original value forBenRegIDand only hashed values for HMAC columns likeFirstNameandPhoneNo.docs/anonymization/GUIDE.md-373-380 (1)
373-380:⚠️ Potential issue | 🟡 MinorDuplicate introductory text for schema reset strategies.
Lines 375-376 and 379 both introduce the same concept with slightly different wording. Remove the first one (lines 375-376) and keep the second.
config.yaml.example-45-48 (1)
45-48:⚠️ Potential issue | 🟡 MinorEnvironment variable name inconsistent with documentation.
This file uses
${ANONYMIZATION_SECRET}butGUIDE.md(lines 208, 333) references${HMAC_SECRET_KEY}. Pick one name and use it consistently across all config examples and documentation.docs/anonymization/GUIDE.md-477-496 (1)
477-496:⚠️ Potential issue | 🟡 MinorSafety Model section immediately jumps to troubleshooting.
The "## Safety Model" heading at line 477 is followed directly by "### Common Issues" (line 479). The actual safety model description (allowlists, denylists, approval tokens) is missing from this section. Consider adding an introductory subsection explaining how the safety guards work before diving into troubleshooting.
docs/anonymization/GUIDE.md-283-286 (1)
283-286:⚠️ Potential issue | 🟡 MinorDuplicate heading: "### Complete Configuration Reference" appears twice.
Lines 283 and 285 both contain the same heading. Remove one. Flagged by markdownlint (MD024).
src/main/java/com/db/piramalswasthya/anonymizer/cli/DiffSchemaCommand.java-312-314 (1)
312-314:⚠️ Potential issue | 🟡 Minor
FileWriteruses platform-default charset.
new FileWriter(outputPath)uses the JVM's default charset, which may not be UTF-8 on all systems. YAML output should be consistently written in UTF-8.🔧 Proposed fix
- try (FileWriter writer = new FileWriter(outputPath)) { - writer.write(yaml.toString()); + try (var writer = new java.io.OutputStreamWriter( + new java.io.FileOutputStream(outputPath), java.nio.charset.StandardCharsets.UTF_8)) { + writer.write(yaml.toString()); }Or with
Files.writeString:- try (FileWriter writer = new FileWriter(outputPath)) { - writer.write(yaml.toString()); - } + Files.writeString(Path.of(outputPath), yaml.toString(), StandardCharsets.UTF_8);src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java-118-118 (1)
118-118:⚠️ Potential issue | 🟡 Minor
lastKey = 0Lwill skip rows with primary key ≤ 0.The initial
lastKeyis0L, so the first query isWHERE pk > 0. Any rows with PK = 0 or negative values will never be returned. While MySQL auto-increment defaults start at 1, this assumption should be documented or the initial query should handle the first page differently (e.g., usingWHERE pk >= 0for the first batch, orLong.MIN_VALUE).src/main/java/com/db/piramalswasthya/anonymizer/core/AnonymizationEngine.java-110-118 (1)
110-118:⚠️ Potential issue | 🟡 MinorNull-safety:
strategy.toUpperCase()will NPE ifColumnRule.getStrategy()returnsnull.If a rule is defined in YAML without a
strategyfield, this will throwNullPointerException.Proposed guard
private Object applyStrategy(String strategy, String value) { + if (strategy == null) { + log.warn("Null strategy encountered - preserving value"); + return value; + } return switch (strategy.toUpperCase()) {src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java-171-180 (1)
171-180:⚠️ Potential issue | 🟡 MinorSuccess-path report uses hardcoded
"./logs"instead ofconfig.getLoggingPath().
configis scoped to the try block, so line 175 can't reference it. Consider extractingloggingPathbefore the try block.Proposed fix
RunReport report = new RunReport(); report.setExecutionId(executionId); report.setStartTime(startTime); + String loggingPath = "./logs"; // default, overridden after config loads try { // 1. Load configuration and rules log.info("Loading configuration..."); AnonymizerConfig config = loadConfig(configFile); + loggingPath = config.getLoggingPath(); ... - writeReport(report, "./logs", executionId); + writeReport(report, loggingPath, executionId); return 1; } ... - writeReport(report, "./logs", executionId); + writeReport(report, loggingPath, executionId);
🧹 Nitpick comments (16)
src/test/resources/sample-dump.sql (2)
49-50:SETstatements placement doesn't reflect realmysqldumpoutput.In production dumps,
SET FOREIGN_KEY_CHECKS=0appears at the top (beforeCREATE/INSERT) and is restored to1at the bottom. Placing it after all inserts means the anonymization parser isn't tested against the realistic ordering it will encounter in production. Also,FOREIGN_KEY_CHECKSis never re-enabled.Consider restructuring to mirror actual dump layout:
Suggested restructuring
Move to the top of the file (after the header comments):
+SET FOREIGN_KEY_CHECKS=0; +SET SQL_MODE='NO_AUTO_VALUE_ON_ZERO'; + -- Create database CREATE DATABASE IF NOT EXISTS `test_db`;And add restoration at the end:
+SET FOREIGN_KEY_CHECKS=1;
1-50: Consider expanding edge-case coverage for more robust parser testing.The fixture covers the core patterns well (batch inserts, NULLs, special characters, multiple tables). A few common
mysqldumppatterns are missing that the anonymization tool will likely encounter in production:
DROP TABLE IF EXISTSbefore eachCREATE TABLELOCK TABLES ... WRITE/UNLOCK TABLESblocks- Empty string values (
'') vsNULL- Multi-byte / Unicode characters in string values (e.g., names with diacritics)
INSERT IGNOREorON DUPLICATE KEY UPDATEvariantsNot blocking, but adding these would increase confidence that the parser handles real-world dumps without surprises.
src/main/resources/application-executor.properties (1)
17-19: Consider adding executor constraints and validation.The executor settings lack validation boundaries:
batch-size: No min/max constraints (consider adding range validation)output-dir: Uses a relative path that may behave differently depending on the working directoryConsider adding validation or documenting expected working directory assumptions.
src/main/resources/application-analyzer.properties (1)
10-10: Consider syncing tool version with project version.The hardcoded
tool-version=1.0.0may drift from the actual project version. Consider using Maven/Gradle property substitution to keep versions synchronized.💡 Example approach using Maven filtering
In
pom.xml, enable resource filtering:<resources> <resource> <directory>src/main/resources</directory> <filtering>true</filtering> </resource> </resources>Then use:
-anonymization.analyzer.tool-version=1.0.0 +anonymization.analyzer.tool-version=@project.version@src/main/resources/anonymization/pii-detection-rules.yml (1)
347-355:CreatedBy/ModifiedByfields may contain usernames that are PII.These audit fields often store actual user names or email addresses rather than opaque IDs. A
RETAINstrategy withriskLevel: LOWmay under-protect PII. Consider raising the risk level or at least adding a rationale note to verify the field content.src/main/java/com/db/piramalswasthya/data/FlywaySchemaVersion.java (1)
95-101:getInstalledOn()returns a mutableTimestampreference.
java.sql.Timestampis mutable. Returning the internal reference allows callers to modify the object's state. For an internal DTO this is likely acceptable, but worth noting if the class is used more broadly.Defensive copy approach
public Timestamp getInstalledOn() { - return installedOn; + return installedOn != null ? new Timestamp(installedOn.getTime()) : null; } public void setInstalledOn(Timestamp installedOn) { - this.installedOn = installedOn; + this.installedOn = installedOn != null ? new Timestamp(installedOn.getTime()) : null; }src/main/java/com/db/piramalswasthya/config/DatasourceConfig.java (1)
15-15: Update environment property files to use consistent hyphenated prefix format.The
@ConfigurationPropertiesannotations inDatasourceConfig.javanow use hyphenated prefixes (db-iemr,db-identity,db-reporting,db-1097-identity), butcommon_test.properties,common_ci.properties, andcommon_docker.propertiesstill use the old flat format (dbiemr,dbidentity, etc.). While Spring Boot's relaxed binding will transparently handle both formats, update these environment-specific property files to match the new hyphenated convention for consistency and to avoid confusion.src/main/java/com/db/piramalswasthya/config/FlywayConfig.java (1)
13-13: Inconsistent bean naming:flywaydbiemrvs.flywayDbidentity,flywayDbreporting,flywayDb1097identity.The other three bean methods use camelCase (
flywayDbidentity,flywayDbreporting,flywayDb1097identity), but this one was changed to all-lowercase. Pick one convention and apply it consistently.rules.yaml.example (1)
1-89: Consider documenting which columns participate in cross-schema joins.With 4 schemas sharing foreign keys (e.g.,
BenRegID/BeneficiaryRegID), it's easy to accidentally break referential integrity. Adding a comment block at the top of the rules file listing the cross-schema join columns and their required strategy consistency would help prevent future drift.docs/anonymization/GUIDE.md (1)
53-53: Multiple fenced code blocks missing language specifiers.The architecture diagram (line 53) and several error output blocks (lines 227, 484, 501, 517, 536, 552, 581) lack language identifiers. Add
textor an appropriate language hint to satisfy markdownlint (MD040) and improve rendering.src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java (1)
134-139: Regex patterns are recompiled on every call.
Pattern.compileis invoked each timematchesPatternis called. For built-in deny patterns (which are constants), consider pre-compiling them as staticPatternfields. This is a minor optimization for a check that runs once, but it's a good practice for correctness (compile-once, match-many).src/test/java/com/db/piramalswasthya/anonymizer/config/ConfigurationTest.java (1)
77-78: Temp file cleanup is not guaranteed on assertion failure.If any assertion on lines 83–95 throws,
Files.deleteIfExists(tempFile)on Line 96 is never reached, leaving temp files behind. Same issue intestLoadAnonymizationRules. Consider using JUnit's@TempDirannotation or atry-finallyblock.♻️ Example using `@TempDir`
+import org.junit.jupiter.api.io.TempDir; +import java.nio.file.Path; + class ConfigurationTest { + + `@TempDir` + Path tempDir; `@Test` void testLoadAnonymizerConfig() throws Exception { String yaml = """ ... """; - Path tempFile = Files.createTempFile("config", ".yaml"); + Path tempFile = tempDir.resolve("config.yaml"); Files.writeString(tempFile, yaml); // ... assertions ... - - Files.deleteIfExists(tempFile); }Also applies to: 96-96
src/main/java/com/db/piramalswasthya/anonymizer/cli/DiffSchemaCommand.java (1)
322-330: No connection timeout or SSL configuration on the DataSource.
createDataSourcesets host/port/schema/credentials but does not configure connection timeout (which is defined inAnonymizerConfig.DatabaseConfig.connectionTimeout) or any SSL/TLS settings. For connecting to a read replica, especially across networks, timeouts and encrypted transport should be configured.src/main/java/com/db/piramalswasthya/anonymizer/config/AnonymizationRules.java (1)
53-57: Consider makingstrategya typed enum instead of a free-form String.
ColumnRule.strategyis aString, meaning typos like"HMAC_HAHS"inrules.yamlwon't be caught at load time. An enum (e.g.,HMAC_HASH, FAKE_NAME, MASK_PHONE, GENERALIZE_DATE, PRESERVE) would provide fail-fast validation during deserialization. You can use Jackson's@JsonCreatororREAD_UNKNOWN_ENUM_VALUES_AS_NULLfor forward compatibility.src/main/java/com/db/piramalswasthya/anonymizer/report/RunReport.java (1)
45-121: LGTM overall — clean, PII-safe report design.The structure is well-thought-out with proper null handling in
getTotalRowsProcessedand separation of table-level metrics.One minor note:
addTableReport(line 85) stores thestrategiesmap reference directly. If the caller mutates the map after the call, the report is affected. Consider a defensive copy.Proposed defensive copy
- tr.setStrategyCounts(strategies); + tr.setStrategyCounts(strategies != null ? new HashMap<>(strategies) : new HashMap<>());src/main/java/com/db/piramalswasthya/anonymizer/output/DirectRestoreWriter.java (1)
96-107:resetSchemafallback todeleteAllData()may silently mask permission issues that indicate misconfiguration.If
DROP DATABASEfails due to permissions, falling back toDELETE FROMall tables may mask a serious misconfiguration (e.g., accidentally pointing at the wrong database). Consider logging atERRORlevel rather thanWARN, or at least including the full exception for diagnostics.
src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java
Outdated
Show resolved
Hide resolved
src/main/java/com/db/piramalswasthya/anonymizer/core/AnonymizationEngine.java
Show resolved
Hide resolved
src/main/java/com/db/piramalswasthya/anonymizer/core/HmacAnonymizer.java
Outdated
Show resolved
Hide resolved
src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java
Outdated
Show resolved
Hide resolved
src/test/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuardTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@pom.xml`:
- Around line 98-108: Update the Picocli dependency version from 4.7.5 to 4.7.7
by modifying the <artifactId>picocli</artifactId> dependency entry in the pom
(change the <version> value for picocli to 4.7.7); leave the <dependency> entry
for com.fasterxml.jackson.dataformat:jackson-dataformat-yaml without an explicit
<version> so it continues to inherit the Spring Boot BOM managed version.
In `@src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java`:
- Around line 424-433: The code in RunCommand sets
ds.setVerifyServerCertificate(false) which disables TLS server verification;
make this configurable instead of hardcoding false by adding a flag/property to
AnonymizerConfig.DatabaseConfig (e.g., verifyServerCertificate or
requireServerCert) and read dbConfig.getVerifyServerCertificate() (or similar)
in the try block to pass into ds.setVerifyServerCertificate(...); ensure
dbConfig has a sensible default (true for production) and update any config
parsing/loading code to support the new boolean so tests and environments can
opt-out if needed.
- Around line 184-194: The report logging path is inconsistent: replace
hardcoded LOGS_DIRECTORY in the three writeReport/log lines with the configured
path (use config.getLoggingPath() where config is guaranteed to exist, i.e., the
success path after report creation), and for the error paths (where config may
be null) pass a fallback expression such as (config != null ?
config.getLoggingPath() : LOGS_DIRECTORY); update the calls referencing
writeReport, and the log.info lines that print the "Report written to" path, so
all three locations use the config fallback consistently within RunCommand
(refer to writeReport, LOGS_DIRECTORY, and config.getLoggingPath()).
- Around line 317-322: The strategy counts are being incremented per batch in
updateStrategyCounts causing inflation; move the call out of processBatch and
invoke updateStrategyCounts once per table inside processTable (before or after
batching) using the existing tableRules and strategyCounts so it reflects the
column→strategy mapping, or if you intended to count rows per strategy change
updateStrategyCounts to accept a rowCount (e.g., batchSize or rowsProcessed) and
add that amount (strategyCounts.merge(strategy, rowCount, Integer::sum))
instead; adjust calls in processBatch/processTable accordingly to use either the
one-time per-table call or the per-row increment variant.
- Around line 97-183: The try block in RunCommand.java only catches SQLException
and IOException so exceptions from SafetyGuard.validateSafeToConnect,
validateConfiguration, HmacAnonymizer constructor (and other runtime/checked
exceptions) can bypass report finalization; add a broad fallback (catch
(Exception e)) or a finally section that always sets report end time, total
duration, a failure status/error message, and calls writeReport(report,
LOGS_DIRECTORY, executionId) so the report is written for all failures; ensure
the fallback logs the sanitized error (reuse sanitizeError(e) or e.getMessage())
and does not swallow the original exception details.
In
`@src/main/java/com/db/piramalswasthya/anonymizer/config/AnonymizerConfig.java`:
- Around line 56-57: SafetyConfig currently has the boolean field enabled
defaulting to false which causes SafetyGuard.validateSafeToConnect to be
bypassed when safety.enabled is omitted; change SafetyConfig so that enabled
defaults to true (e.g., initialize the field or provide a getter that returns
true when null) and ensure any YAML binding still maps to safety.enabled, so
SafetyGuard.validateSafeToConnect and calls to config.isEnabled() will treat
safety as enabled by default unless explicitly set to false.
- Around line 44-53: DatabaseConfig.readOnly is never applied to created
DataSources; update RunCommand.createDataSource and
DiffSchemaCommand.createDataSource to honor DatabaseConfig.readOnly by returning
a DataSource wrapper that sets Connection.setReadOnly(readOnly) on each obtained
Connection (or otherwise sets read-only on connections before use) rather than
just configuring MysqlDataSource; reference DatabaseConfig.readOnly,
RunCommand.createDataSource, DiffSchemaCommand.createDataSource and
MysqlDataSource when making the change so every call to getConnection() yields a
Connection with setReadOnly(databaseConfig.isReadOnly()) applied.
🧹 Nitpick comments (3)
src/main/java/com/db/piramalswasthya/anonymizer/config/AnonymizerConfig.java (1)
55-102: Redundant manual accessors —@Dataalready generates these.
SafetyConfigis annotated with@Data, which generates all getters/setters via Lombok. Lines 63–101 manually re-declare every accessor, duplicating what Lombok provides. The other nested classes (DatabaseConfig,PerformanceConfig) correctly rely solely on@Data.Remove the manual methods to stay consistent with the rest of the file.
♻️ Proposed fix
`@Data` public static class SafetyConfig { private boolean enabled; private List<String> allowedHosts; private List<String> deniedPatterns; private boolean requireExplicitApproval; private String approvalFlag; - - public boolean isEnabled() { - return enabled; - } - - public void setEnabled(boolean enabled) { - this.enabled = enabled; - } - - public List<String> getAllowedHosts() { - return allowedHosts; - } - - public void setAllowedHosts(List<String> allowedHosts) { - this.allowedHosts = allowedHosts; - } - - public List<String> getDeniedPatterns() { - return deniedPatterns; - } - - public void setDeniedPatterns(List<String> deniedPatterns) { - this.deniedPatterns = deniedPatterns; - } - - public boolean isRequireExplicitApproval() { - return requireExplicitApproval; - } - - public void setRequireExplicitApproval(boolean requireExplicitApproval) { - this.requireExplicitApproval = requireExplicitApproval; - } - - public String getApprovalFlag() { - return approvalFlag; - } - - public void setApprovalFlag(String approvalFlag) { - this.approvalFlag = approvalFlag; - } }src/test/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuardTest.java (1)
91-115: Missing test: allowlisted host that matches a built-in deny pattern.The current suite doesn't cover the important edge case where a host is in the allowlist and matches a built-in deny pattern (e.g.,
"replica.nonprod.internal"matching.*prod.*). Per theSafetyGuardlogic, the built-in deny check fires even for allowlisted hosts and requiresrequireExplicitApproval+ a correct approval flag to bypass. This is a critical path for legitimate non-production replicas whose hostnames contain "prod".Consider adding a test like:
`@Test` void testAllowlistedHostMatchingDenyPattern_RequiresApproval() { AnonymizerConfig.SafetyConfig config = new AnonymizerConfig.SafetyConfig(); config.setEnabled(true); config.setAllowedHosts(List.of("replica.nonprod.internal")); config.setRequireExplicitApproval(true); config.setApprovalFlag("CONFIRM"); SafetyGuard guard = new SafetyGuard(config); // Denied without approval flag despite being in allowlist assertThrows(SafetyViolationException.class, () -> guard.validateSafeToConnect("replica.nonprod.internal", "db_uat", null) ); // Allowed with correct approval flag assertDoesNotThrow(() -> guard.validateSafeToConnect("replica.nonprod.internal", "db_uat", "CONFIRM") ); }src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java (1)
464-479: Sanitize logic is reasonable but consider edge cases.The regex
'[^']{10,}'only redacts quoted strings longer than 10 characters, which means shorter PII values (e.g., short names, phone numbers in quotes) could leak into error messages/reports. Consider lowering the threshold or redacting all quoted strings.
src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java
Outdated
Show resolved
Hide resolved
src/main/java/com/db/piramalswasthya/anonymizer/config/AnonymizerConfig.java
Show resolved
Hide resolved
src/main/java/com/db/piramalswasthya/anonymizer/config/AnonymizerConfig.java
Outdated
Show resolved
Hide resolved
fix: improve error handling in anonymizer
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java`:
- Around line 219-251: processSchema currently creates targetDataSource using
createDataSource which wraps with ReadOnlyAwareDataSource based on
DatabaseConfig.readOnly (default true), causing writes to fail; modify
RunCommand.processSchema to validate the target DatabaseConfig (from
config.getTarget()) before creating the target DataSource and fail fast if it is
readOnly (or explicitly set writable). Specifically, inspect
DatabaseConfig.isReadOnly() for the target, and if true either (a) throw an
informative IllegalStateException/log and abort (so
DirectRestoreWriter.resetSchema/write operations won't run) or (b) override the
flag to false when constructing the target DataSource; reference
createDataSource, ReadOnlyAwareDataSource, DatabaseConfig.isReadOnly(),
RunCommand.processSchema, and DirectRestoreWriter.resetSchema to locate where to
add this check.
In `@src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java`:
- Around line 196-222: The validation currently skips empty tables and opens its
own connection; change validateNumericPrimaryKey to accept a Connection (or
provide an overload) so callers like streamTable can pass their existing
Connection (reuse connection) and change the validation query to fetch metadata
even for empty tables (e.g., use "SELECT %s FROM %s WHERE 1=0" or "LIMIT 0") so
ResultSetMetaData is available even when rs.next() is false; then use the
ResultSetMetaData to check column type (Types.TINYINT/SMALLINT/INTEGER/BIGINT)
and throw the same SQLException if it’s not numeric, and update callers to pass
the Connection (or document intentional separate-connection behavior if you keep
the original signature).
- Around line 170-194: In processBatch(PreparedStatement stmt, List<String>
columns, String primaryKey) ensure the primary key value is stored in the
RowData returned: after populating row from the provided columns (inside the
rs.next() loop) check if row.containsKey(primaryKey) and if not, read the PK
value (rs.getObject(primaryKey)) and put it into row before adding to batch;
keep the existing numeric check that sets lastKey using the same PK value and
throw the same SQLException if the PK is not numeric.
In `@src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java`:
- Around line 170-176: hostMatchesWildcard currently builds a regex by doing
allowed.replace("*", ".*") which fails to escape other regex metacharacters;
update hostMatchesWildcard to escape the allowed string with Pattern.quote (or
equivalent) first, then restore wildcard semantics by replacing the quoted "\*"
(backslash-star) with ".*" so only '*' becomes a wildcard and other
metacharacters remain literal, and use the resulting regex with host.matches or
Pattern/Matcher to perform the comparison; ensure null checks remain.
In `@src/main/resources/anonymization-registry.yml`:
- Around line 1-17: The file src/main/resources/anonymization-registry.yml is
orphaned and mismatches the application's model: it references
AnonymizationRegistry (which doesn't exist) and uses fields incompatible with
AnonymizationRules and ColumnRule; either delete this YAML or adapt it and
wiring to match the existing model. To fix, either (A) remove
anonymization-registry.yml and any references to it, or (B) create a matching
AnonymizationRegistry Java class and update startup/config loader to load this
resource, and/or transform the YAML shape to the AnonymizationRules structure
(use rulesVersion, unknownColumnPolicy, databases -> ColumnRule entries with
only strategy and format) and update the codepaths that load AnonymizationRules
to accept this file; reference the classes AnonymizationRegistry,
AnonymizationRules, and ColumnRule to locate where to implement the change.
🧹 Nitpick comments (2)
src/main/java/com/db/piramalswasthya/anonymizer/core/HmacAnonymizer.java (1)
114-121:HmacAnonymizerExceptionis package-private but wraps failures from the publichashIdmethod.Callers outside
com.db.piramalswasthya.anonymizer.corecannot catchHmacAnonymizerExceptionby type, forcing them to catchRuntimeException. Consider making itpublic(or apublic staticinner class ofHmacAnonymizer) for more precise exception handling.src/main/java/com/db/piramalswasthya/anonymizer/config/AnonymizerConfig.java (1)
56-103: Redundant explicit getters/setters —@Dataalready generates them.
SafetyConfigis annotated with@Data(which generates getters, setters,equals,hashCode,toString), yet all getters and setters are also manually written. The same applies toAnonymizerConfig(lines 112-126) withgetSource/setSource/getTarget/setTarget. Remove the manual accessors to reduce noise, or remove@Dataif you prefer explicit control.
src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java
Outdated
Show resolved
Hide resolved
src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java
Show resolved
Hide resolved
src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java
Show resolved
Hide resolved
….java this is a suggestion and must not be done without need Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
fix: code cleanup and exception handling improvements
|
@drtechie everything done from my side, kindly check the cli tool from your side once to confirm it works properly and nothing has broken. |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (4)
src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java (1)
383-394: ObjectMapper is instantiated twice with the same configuration.
loadConfigandloadRuleseach create anew ObjectMapper(new YAMLFactory()). This is a minor inefficiency — ObjectMapper is thread-safe and reusable.♻️ Reuse a single ObjectMapper instance
+ private static final ObjectMapper YAML_MAPPER = new ObjectMapper(new YAMLFactory()); + private AnonymizerConfig loadConfig(String path) throws IOException { - ObjectMapper mapper = new ObjectMapper(new YAMLFactory()); - return mapper.readValue(new File(path), AnonymizerConfig.class); + return YAML_MAPPER.readValue(new File(path), AnonymizerConfig.class); } private AnonymizationRules loadRules(String path) throws IOException { - ObjectMapper mapper = new ObjectMapper(new YAMLFactory()); - return mapper.readValue(new File(path), AnonymizationRules.class); + return YAML_MAPPER.readValue(new File(path), AnonymizationRules.class); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java` around lines 383 - 394, Both loadConfig and loadRules create a new ObjectMapper(new YAMLFactory()) separately; make a single reusable ObjectMapper (e.g., a private static final ObjectMapper YAML_MAPPER = new ObjectMapper(new YAMLFactory()) or an instance field) and replace the new ObjectMapper calls in loadConfig and loadRules with that shared YAML_MAPPER to avoid redundant instantiation while preserving thread-safety.src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java (1)
252-270:RowData.getData()returns the internal mutable map directly.Callers (e.g.,
RunCommand.processBatchat line 329) receive the liveLinkedHashMap. If any consumer mutates it after anonymization (adds/removes keys, changes values), it silently corrupts the row. ReturningCollections.unmodifiableMap(data)would make this defensive without breaking read-only consumers.♻️ Proposed defensive copy
public java.util.Map<String, Object> getData() { - return data; + return java.util.Collections.unmodifiableMap(data); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java` around lines 252 - 270, RowData.getData() exposes the internal mutable LinkedHashMap which lets callers (e.g., RunCommand.processBatch) mutate row contents after anonymization; change getData() to return a defensive view such as Collections.unmodifiableMap(data) (or a shallow copy like new LinkedHashMap<>(data)) so consumers cannot modify the internal map; update the method body of RowData.getData() to wrap/clone `data` before returning and ensure callers expecting mutability are adjusted accordingly.src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java (2)
134-146: User-supplied custom deny patterns are passed directly toPattern.compilewithout validation.A malformed regex in
config.getDeniedPatterns()(e.g.,[unclosed) will throw an uncaughtPatternSyntaxExceptionat runtime with a non-obvious stack trace. Consider validating patterns at construction time or catching the exception with a clear error message.♻️ Validate custom patterns at construction time
public SafetyGuard(AnonymizerConfig.SafetyConfig config) { this.config = config; + validateCustomPatterns(); + } + + private void validateCustomPatterns() { + if (config.getDeniedPatterns() == null) return; + for (String pattern : config.getDeniedPatterns()) { + try { + Pattern.compile(pattern, Pattern.CASE_INSENSITIVE); + } catch (java.util.regex.PatternSyntaxException e) { + throw new IllegalArgumentException( + "Invalid custom deny pattern '" + pattern + "': " + e.getMessage(), e); + } + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java` around lines 134 - 146, The current code passes user regexes from config.getDeniedPatterns() directly to matching (via matchesPattern), which can throw a PatternSyntaxException at runtime; fix by validating/compiling those patterns at SafetyGuard construction time: in the SafetyGuard constructor iterate config.getDeniedPatterns(), attempt Pattern.compile(...) for each, collect the compiled java.util.regex.Pattern instances (or throw a clear IllegalArgumentException/SafetyViolationException with the invalid pattern and the PatternSyntaxException message), store them (e.g., compiledDeniedPatterns) and then update checkCustomDenials to use these compiled patterns instead of raw strings (or, if you prefer a smaller change, catch PatternSyntaxException inside checkCustomDenials and rethrow a SafetyViolationException with a clear message referencing the offending pattern and exception).
177-182:matchesPatterncompiles a newPatternon every invocation.This is called in a tight loop over deny patterns and is not on a hot path, so the impact is negligible. However, if custom deny patterns grow or this is called frequently, consider caching compiled patterns at construction time.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java` around lines 177 - 182, The matchesPattern method in SafetyGuard compiles a new Pattern on every call which is inefficient; precompile and cache deny patterns (e.g., in a Map<String, Pattern> or List<Pattern>) when the SafetyGuard is constructed or when patterns are loaded, then change matchesPattern(String value, String pattern) to look up the compiled Pattern by the pattern key (or accept a Pattern overload) and call matcher(value).matches(); update any callers to use the cached patterns loader (or provide an overload matchesPattern(String value, Pattern compiledPattern)) so runtime compilation is eliminated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java`:
- Around line 288-303: The code builds allColumns from tableRules only, causing
any source columns not listed in tableRules to be omitted and written as NULL;
fix by retrieving the actual source column list (e.g., via the
paginator/engine/table metadata) and use that as the canonical allColumns in
BatchContext instead of List.copyOf(tableRules.getColumns().keySet()); add a
validation step (new method or inline in RunCommand) that compares sourceColumns
to tableRules.getColumns().keySet() and either (a) auto-add missing columns with
a PRESERVE strategy to the rules map or (b) fail/raise an explicit error
depending on UnknownColumnPolicy, and ensure updateStrategyCounts,
processTableBatches, and recordTableCompletion operate on the full source column
list so missing columns are fetched and inserted rather than silently dropped.
- Around line 442-464: createDataSource currently only sets connect timeout but
not a socket timeout, so long-running reads can hang; update the
createDataSource method to set the MySQL Connector/J socket timeout by calling
MysqlDataSource.setIntegerRuntimeProperty("socketTimeout", <milliseconds>)
(e.g., 300000 for 5 minutes) alongside the other advanced settings, keep it
inside the existing try/catch that logs failures, and return the
ReadOnlyAwareDataSource as before so ReadOnlyAwareDataSource, MysqlDataSource
and AnonymizerConfig.DatabaseConfig remain the same.
In `@src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java`:
- Around line 169-175: hostMatchesWildcard currently calls Pattern.quote on the
whole pattern which prevents '*' wildcards from working; modify
hostMatchesWildcard to split the allowed string on the literal '*' character,
quote each segment with Pattern.quote, then join the quoted segments with ".*"
to build the regex (optionally anchor start/end or use matches as before), and
return host.matches(builtRegex); keep the null checks and ensure empty segments
are handled so patterns like "*.uat.internal" become ".*\.uat\.internal".
In `@src/main/resources/anonymization-registry.yml`:
- Around line 90-94: The Password field in anonymization-registry.yml is marked
nullable: false but uses strategy: "SUPPRESS", which writes NULL and will
violate NOT NULL constraints for the Password column (m_user -> Password);
change the strategy to a non-null-producing anonymization method such as
"RANDOM_FAKE_DATA" or a fixed HMAC/hash strategy so anonymized values remain
non-null and satisfy the database constraint while neutralizing credentials, and
update the Password entry's strategy accordingly.
- Around line 51-55: The AadhaarNo anonymization entry currently uses an unkeyed
hash; update the anonymization-registry.yml entry for the AadhaarNo
canonicalName to use strategy: "HMAC_HASH" instead of "HASH_SHA256", and ensure
the HMAC key is configured in the system's secrets/config (the HMAC key
reference used by the anonymizer) so the anonymizer implementation that reads
the AadhaarNo strategy will apply a keyed HMAC; locate the AadhaarNo block in
anonymization-registry.yml and replace the strategy value and confirm the
runtime secret name used by the anonymizer is present.
- Around line 31-35: The registry assigns strategy "HMAC_HASH" to numeric
columns (e.g., BeneficiaryID, UserID, BeneficiaryRegID) but
HmacAnonymizer.hashId() returns a 64-char hex string and
DirectRestoreWriter.writeInsert() uses stmt.setObject(), causing SQL type errors
and non-matching joins; fix by either (A) changing the
anonymization-registry.yml entries for those fields (BeneficiaryID, UserID,
BeneficiaryRegID) to a numeric-compatible strategy (e.g., a documented
"HMAC_TO_LONG" / "HMAC_TRUNCATE_TO_BIGINT" or a deterministic hash-to-int
strategy) or (B) changing the DB column mapping to accept strings (VARCHAR(64))
so HMAC_HASH is valid, and additionally update DirectRestoreWriter.writeInsert()
to detect HmacAnonymizer.hashId() outputs and call stmt.setString(...) (not
setObject) for hex hashes to avoid type conversion errors; reference
HmacAnonymizer.hashId() and DirectRestoreWriter.writeInsert() when making these
code/registry/schema changes.
- Around line 61-65: The applyStrategy switch in AnonymizationEngine currently
lacks a handler for the GENERALIZE strategy causing values to be returned
unchanged; update AnonymizationEngine.applyStrategy to handle GENERALIZE by
dispatching to the appropriate generalization routines (e.g., call
HmacAnonymizer.generalizeDate for DATE fields and a new or existing
generalizeText/generalizeVarchar helper for TEXT/VARCHAR/STRING types) and
return the generalized value, ensuring the strategy enum/constant for GENERALIZE
is matched; alternatively, if you prefer an immediate safe fallback, update the
anonymization-registry.yml entries for Address, DateOfBirth, Diagnosis,
DrugName, and TestResult from GENERALIZE to SUPPRESS so AnonymizationEngine will
drop those values—make sure to update related unit tests for
AnonymizationEngine.applyStrategy and add coverage for GENERALIZE behaviour (or
SUPPRESS) to prevent regression.
---
Duplicate comments:
In `@src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java`:
- Around line 451-460: The hardcoded ds.setVerifyServerCertificate(false) in
RunCommand disables TLS identity verification; change this to read a new
configurable flag from DatabaseConfig (e.g., isVerifyServerCertificate or
verifyServerCertificate) instead of hardcoding false. Update the code that
constructs/loads DatabaseConfig to expose a boolean defaulting to true, replace
the direct call with
ds.setVerifyServerCertificate(dbConfig.isVerifyServerCertificate()), and ensure
any config parsing/CLI options and tests are updated to allow overriding the
flag for non-production use while keeping the secure default.
In `@src/main/resources/anonymization-registry.yml`:
- Around line 1-17: The anonymization-registry.yml is likely orphaned and its
fields (canonicalName, dataType, nullable, schemaHash) won't be deserialized
into the existing Java model; locate and either implement a matching model
(e.g., create AnonymizationRegistry and extend ColumnRule to include
canonicalName, dataType, nullable, schemaHash) or update the YAML to match the
current model (ColumnRule with only strategy and format); then ensure the file
is actually loaded by the application by wiring a loader/bean (e.g., in the
existing AnonymizationRules loader or bootstrap code) that reads
anonymization-registry.yml via Jackson/ObjectMapper or
ClassPathResource/getResourceAsStream and registers it for runtime use.
---
Nitpick comments:
In `@src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java`:
- Around line 383-394: Both loadConfig and loadRules create a new
ObjectMapper(new YAMLFactory()) separately; make a single reusable ObjectMapper
(e.g., a private static final ObjectMapper YAML_MAPPER = new ObjectMapper(new
YAMLFactory()) or an instance field) and replace the new ObjectMapper calls in
loadConfig and loadRules with that shared YAML_MAPPER to avoid redundant
instantiation while preserving thread-safety.
In `@src/main/java/com/db/piramalswasthya/anonymizer/core/KeysetPaginator.java`:
- Around line 252-270: RowData.getData() exposes the internal mutable
LinkedHashMap which lets callers (e.g., RunCommand.processBatch) mutate row
contents after anonymization; change getData() to return a defensive view such
as Collections.unmodifiableMap(data) (or a shallow copy like new
LinkedHashMap<>(data)) so consumers cannot modify the internal map; update the
method body of RowData.getData() to wrap/clone `data` before returning and
ensure callers expecting mutability are adjusted accordingly.
In `@src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java`:
- Around line 134-146: The current code passes user regexes from
config.getDeniedPatterns() directly to matching (via matchesPattern), which can
throw a PatternSyntaxException at runtime; fix by validating/compiling those
patterns at SafetyGuard construction time: in the SafetyGuard constructor
iterate config.getDeniedPatterns(), attempt Pattern.compile(...) for each,
collect the compiled java.util.regex.Pattern instances (or throw a clear
IllegalArgumentException/SafetyViolationException with the invalid pattern and
the PatternSyntaxException message), store them (e.g., compiledDeniedPatterns)
and then update checkCustomDenials to use these compiled patterns instead of raw
strings (or, if you prefer a smaller change, catch PatternSyntaxException inside
checkCustomDenials and rethrow a SafetyViolationException with a clear message
referencing the offending pattern and exception).
- Around line 177-182: The matchesPattern method in SafetyGuard compiles a new
Pattern on every call which is inefficient; precompile and cache deny patterns
(e.g., in a Map<String, Pattern> or List<Pattern>) when the SafetyGuard is
constructed or when patterns are loaded, then change matchesPattern(String
value, String pattern) to look up the compiled Pattern by the pattern key (or
accept a Pattern overload) and call matcher(value).matches(); update any callers
to use the cached patterns loader (or provide an overload matchesPattern(String
value, Pattern compiledPattern)) so runtime compilation is eliminated.
src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java
Outdated
Show resolved
Hide resolved
src/main/java/com/db/piramalswasthya/anonymizer/safety/SafetyGuard.java
Outdated
Show resolved
Hide resolved
|
@drtechie made a few changes, removed newly introduced and separate properties for anonymization where the predefined props can be applied. Also, added a util to tackle persistent case of issues with configuration parsing. |
src/main/java/com/db/piramalswasthya/anonymizer/core/HmacAnonymizer.java
Outdated
Show resolved
Hide resolved
src/main/java/com/db/piramalswasthya/anonymizer/cli/RunCommand.java
Outdated
Show resolved
Hide resolved
|
currently debugging some springboot and functionality issues, please do let me know in case there are any suggestions from the team for implementation. @drtechie @vanitha1822 @paras-dba |
- removed all duplicacies from the pom - migrations disabled locally, along with flyway
- we have reached an empassee - RunCommand has been cleaned up. I plan to make it shine tomorrow. - dry run successful, as predicted. documentation remains the same. for now - need to fix reporting
- everything works locally, need full testing now
Added PARTIAL_MASK handling in AnonymizationEngine.applyStrategy: Kept PermCoordinate generalization as a note.
- anonymizer working, except for a bug - table detection seems to vary - will continue to monitor and fix
|
fixing a bug, will take a while. |
|




📋 Description
#112
JIRA ID:
This master plan provides a complete roadmap for implementing an automated, secure, and repeatable process to refresh the UAT environment with anonymized production data.
Small workflow:
Prod DB (DB1, read only perm applied)-> anonimizer (lookup the datatype of the field to anonymize, type of anon to apply) -> stream read and apply in batches with pagination for perf -> output: anonymized sql dump/direct restore to DB2 (anonymized DB)
Done.
✅ Type of Change
ℹ️ Additional Information
Please describe how the changes were tested, and include any relevant screenshots, logs, or other information that provides additional context.
Summary by CodeRabbit
Release Notes
New Features
Documentation
Configuration
Tests