Skip to content

fix(core): deterministic normalization for parallel batch analyzer output#60

Closed
fishinakleinbottle wants to merge 1 commit intoLum1104:mainfrom
fishinakleinbottle:fix/normalization-for-generated-graph
Closed

fix(core): deterministic normalization for parallel batch analyzer output#60
fishinakleinbottle wants to merge 1 commit intoLum1104:mainfrom
fishinakleinbottle:fix/normalization-for-generated-graph

Conversation

@fishinakleinbottle
Copy link
Copy Markdown
Collaborator

@fishinakleinbottle fishinakleinbottle commented Mar 31, 2026

Summary

  • Adds normalize-graph.ts with normalizeNodeId, normalizeComplexity, and normalizeBatchOutput utilities
  • Rewrites Phase 3 (ASSEMBLE) in SKILL.md with a 6-step normalization sequence that runs before deduplication and dangling-edge cleanup
  • Strengthens the file-analyzer prompt with explicit ID format warnings
  • 32 new normalization tests + 2 schema boundary tests

Reviewer focus

Start with packages/core/src/analyzer/normalize-graph.ts — that's the core logic. In particular:

  • stripToValidPrefix() (line 36) handles the recursive prefix-peeling — verify the loop correctly handles project-name, double-prefix, and bare-path cases
  • normalizeBatchOutput() (line 133) orchestrates everything — check the node-first-then-edges ordering and the ID map rewriting

The SKILL.md Phase 3 rewrite and prompt change are straightforward and follow the existing Phase 4/5 normalization patterns.

Test plan

  • pnpm --filter @understand-anything/core build passes
  • pnpm --filter @understand-anything/core test — all 191 tests pass
  • Run /understand --full on a multi-file project and verify no dropped nodes or edges in the dashboard

Note

Medium Risk
Touches graph assembly semantics (ID canonicalization and edge rewrites), which can change node/edge identity and deduplication behavior; covered by new unit tests but could still affect downstream consumers relying on prior IDs.

Overview
Introduces analyzer/normalize-graph.ts with normalizeNodeId, normalizeComplexity, and normalizeBatchOutput to canonicalize node IDs (strip project/double prefixes, add missing file:/func:/class: prefixes), coerce numeric/aliased complexity to simple|moderate|complex, and rewrite + dedupe edges while dropping dangling references and reporting stats.

Exports these utilities from packages/core/src/index.ts, adds comprehensive normalization tests plus a schema test asserting node IDs may be bare strings, and updates /understand docs/prompts to explicitly warn about invalid ID formats and define the new Phase 3 normalization sequence before dedupe/dangling-edge cleanup.

Written by Cursor Bugbot for commit 60c2ead. This will update automatically on new commits. Configure here.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

const oldSource = String(raw.source ?? "");
const oldTarget = String(raw.target ?? "");
const newSource = idMap.get(oldSource) ?? oldSource;
const newTarget = idMap.get(oldTarget) ?? oldTarget;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edge source/target IDs not normalized before lookup

High Severity

Edge source/target IDs are only resolved via idMap.get(oldSource) ?? oldSource, but idMap only contains original node IDs as keys. When an edge references a node using a different malformed format than the node's original ID (e.g., edge uses "proj:file:src/a.ts" but the node was "file:src/a.ts"), the lookup misses, and the raw malformed ID falls through — failing the validNodeIds check and being silently dropped as dangling. The edge source/target needs to be independently normalized (via stripToValidPrefix) before the idMap fallback. The test "deduplicates edges after ID rewriting" passes for the wrong reason — the second edge is dropped as dangling, not deduplicated — masking this bug.

Additional Locations (1)
Fix in Cursor Fix in Web

@fishinakleinbottle fishinakleinbottle force-pushed the fix/normalization-for-generated-graph branch from f579b43 to aa8a194 Compare March 31, 2026 05:55
@Lum1104
Copy link
Copy Markdown
Owner

Lum1104 commented Mar 31, 2026

@codex review this

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa8a194652

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +223 to +224
const newSource = idMap.get(oldSource) ?? oldSource;
const newTarget = idMap.get(oldTarget) ?? oldTarget;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Rewrite non-canonical edge IDs before dangling-edge pruning

normalizeBatchOutput only rewrites edge endpoints through exact idMap lookups, so endpoints that are malformed-but-normalizable (for example proj:file:src/a.ts when the node exists as file:src/a.ts) are left unchanged and then dropped as dangling. In parallel batch merges this can silently remove valid relationships whenever edge IDs differ in formatting from node IDs, which defeats the normalization pass's purpose.

Useful? React with 👍 / 👎.

Comment on lines +87 to +88
if (prefix) {
return `${prefix}:${path}`;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce node-type prefix when canonicalizing node IDs

When a node ID already has any valid prefix, normalizeNodeId returns it unchanged even if that prefix contradicts node.type (for example a function node with a file: ID). This leaves non-canonical IDs in place and can cause collisions during later node deduplication and edge rewriting, turning recoverable model output errors into dropped or miswired graph entities.

Useful? React with 👍 / 👎.

…r output

Parallel file-analyzer subagents can produce inconsistent node IDs
(project-name prefixed, double-prefixed, bare paths) and invalid
complexity values. Phase 3 ASSEMBLE now normalizes these deterministically
before merging, preventing cascading edge drops and dashboard load failures.

- Add normalize-graph.ts with normalizeNodeId, normalizeComplexity,
  and normalizeBatchOutput utilities
- Rewrite SKILL.md Phase 3 with 6-step normalization sequence
- Strengthen file-analyzer prompt with ID format warnings
- Add 32 normalization tests and 2 schema boundary tests
@fishinakleinbottle fishinakleinbottle force-pushed the fix/normalization-for-generated-graph branch from 9da0166 to 60c2ead Compare March 31, 2026 11:05
@fishinakleinbottle
Copy link
Copy Markdown
Collaborator Author

@Lum1104 there were some redundant changes in this PR - like VALID_NODE_TYPES, VALID_EDGE_TYPES which were already being checked for, and normalizing batch output which is also taken care of.

I'll close this and create a new PR with just new changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parallel file-analyzer subagents can produce inconsistent node IDs and invalid complexity values

2 participants