feat(codegen): add wassirman validation IR generation target #460

Draft
Seth Fitzsimmons (mojodna) wants to merge 36 commits into dev from wassirman

Conversation

@mojodna
Collaborator

Proof-of-concept: generating the same YAML validation IR that
schema-validator produces, but from the codegen extraction layer instead
of re-walking Pydantic internals.

Why this matters

The codegen extraction layer (TypeInfo, FieldSpec, ModelSpec, tree
expansion) already solves the hard problems — unwrapping Annotated,
NewType, Union, nested lists, collecting constraints with provenance.
The markdown renderer is one consumer, Arrow/Parquet is another.
This PR adds a third target that produces validation rules, demonstrating
that the extraction machinery generalizes across output formats.

The walker (wassirman/walker.py) is ~160 lines of rule emission logic
that operates on expanded FeatureSpec trees. Compare that to
schema-validator's extract.py, which re-derives the same information
from raw Pydantic model internals. Same output, different extraction path.

Golden tests confirm the output matches the reference validator for all 16
feature types.

Usage

# All feature types to stdout
overture-codegen generate --format wassirman

# Filter by theme
overture-codegen generate --format wassirman --theme places

# Per-dataset files
overture-codegen generate --format wassirman --output-dir ./validation-rules

Discussion points

  • The extraction layer captures more than what the current IR needs (descriptions,
    union discriminators, cross-field constraint provenance). Future targets get
    that for free.
  • The IR is target-agnostic — it describes what to validate, not how.
    The same rules feed PySpark and DuckDB backends.
  • What should the IR look like long-term? The current shape matches
    schema-validator's output, but the richer extraction data could support
    a more expressive format.

Commits

pytest-subtests was merged into pytest core as of pytest 9. Update
test imports from pytest_subtests.SubTests to
_pytest.subtests.Subtests.
- Add -q, --tb=short to `make test` for compact output
- Set verbosity_subtests=0 to suppress per-subtest
  progress characters (the u/,/- markers from pytest's
  built-in subtests support)

Bare triple-quoted strings after NewType assignments are
expression statements that Python never attaches to the
NewType object, leaving __doc__ as None. Convert each to
an explicit __doc__ assignment so codegen and introspection
tools can read them at runtime.

Same pattern DocumentedEnum uses for enum member docs.
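The before/after pattern, with hypothetical NewTypes for illustration:

```python
from typing import NewType

# A bare string literal after the assignment is a standalone expression
# statement; Python evaluates and discards it, so nothing attaches here:
Confidence = NewType("Confidence", float)
"""Confidence score in the range [0, 1]."""  # discarded at runtime

# Explicit assignment is what introspection tools can actually read:
Longitude = NewType("Longitude", float)
Longitude.__doc__ = "Longitude in decimal degrees (WGS84)."
```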

OvertureFeature validator error message had two continuation
lines missing the f-prefix, so {self.__class__.__name__} was
rendered literally. Also add missing space before "and".
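The underlying Python gotcha, in miniature:

```python
# Adjacent string literals concatenate, but the f-prefix applies per
# literal, not per expression — each continuation line needs its own:
name = "Building"
message = (
    f"{name} failed validation and "
    "{name} appears literally here"  # missing f-prefix: no interpolation
)
assert message == "Building failed validation and {name} appears literally here"
```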
Replace hardcoded discriminator_fields tuple ("type", "theme",
"subtype") in _process_union_member with the discriminator field
name extracted from the union's Annotated metadata.

introspect_union already extracted the discriminator field name
but didn't pass it through to member processing. Now it does,
so unions using any field name as discriminator work correctly.

For nested unions, parent discriminator values are extracted from
nested leaf models to preserve structural tuple classification.

Feature.field_discriminator now attaches _field_name to the
callable, and _extract_discriminator_name reads it. This handles
the Discriminator-wrapping-a-callable case that str(disc) got
wrong silently.
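The attach-and-read pattern, sketched with illustrative bodies (the real names come from the PR description, but these implementations are stand-ins):

```python
def field_discriminator(value: dict):
    """Pick the union member key from a raw value (illustrative body)."""
    return value.get("type")

# Attached so introspection can recover the field name from the callable:
field_discriminator._field_name = "type"

def extract_discriminator_name(disc):
    """Return the discriminator field name for either the str or callable case."""
    if isinstance(disc, str):
        return disc
    # str(disc) on a callable yields "<function ...>" — silently wrong;
    # read the attached attribute instead.
    return getattr(disc, "_field_name", None)
```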
Make _extract_literal_value return str directly instead of object,
eliminating implicit str() conversions at call sites. Add comment
explaining nested union re-indexing under the parent discriminator.

Remove redundant test covered by TestDiscriminatorDiscovery and
debugging print() calls from TestStructuralTuples.

The field holds the entry point value in "module:Class" format, not a
class name. The old name required callers to know this (codegen's cli.py
had a comment explaining it, and assigned to a local `entry_point`
variable to compensate).

Empty package with build config, namespace packages, and
py.typed marker. Declares click, jinja2, tomli, and
overture-schema-core/system as dependencies.

Type analyzer (analyze_type) handles all type unwrapping in a
single iterative function: NewType → Annotated → Union → list →
terminal classification. Constraints accumulate from Annotated
metadata with source tracking via ConstraintSource.

Data structures: TypeInfo (type representation), FieldSpec
(model field), ModelSpec (model), EnumSpec, NewTypeSpec,
PrimitiveSpec.

Type registry maps type names to per-target string
representations via TypeMapping. is_semantic_newtype()
distinguishes meaningful NewTypes from pass-through aliases.

Utilities: case_conversion (snake_case), docstring (cleaning
and custom-docstring detection).
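A condensed sketch of the unwrap loop; the real analyze_type also classifies unions and records constraint provenance via ConstraintSource:

```python
from typing import Annotated, NewType, get_args, get_origin

def analyze_type(tp):
    """Iteratively unwrap NewType -> Annotated -> list, accumulating
    constraints and list depth, then return the terminal type."""
    constraints: list = []
    list_depth = 0
    while True:
        if hasattr(tp, "__supertype__"):      # NewType: follow the supertype
            tp = tp.__supertype__
        elif get_origin(tp) is Annotated:     # Annotated: harvest metadata
            inner, *metadata = get_args(tp)
            constraints.extend(metadata)
            tp = inner
        elif get_origin(tp) is list:          # list[...]: count nesting
            list_depth += 1
            tp = get_args(tp)[0]
        else:
            return tp, list_depth, constraints
```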
Domain-specific extractors that consume analyze_type() and
produce specs:

- model_extraction: extract_model() for Pydantic models with
  MRO-aware field ordering, alias resolution, and recursive
  sub-model expansion via expand_model_tree()
- enum_extraction: extract_enum() for DocumentedEnum classes
- newtype_extraction: extract_newtype() for semantic NewTypes
- primitive_extraction: extract_primitives() for numeric types
  with range and precision introspection
- union_extraction: extract_union() with field merging across
  discriminated union variants

Shared test fixtures in codegen_test_support.py.

Generate prose from extracted constraint data:

- field_constraint_description: describe field-level
  constraints (ranges, patterns, unique items, hex colors)
  as human-readable notes with NewType source attribution
- model_constraint_description: describe model-level
  constraints (@require_any_of, @radio_group, @min_fields_set,
  @require_if, @forbid_if) as prose, with consolidation of
  same-field conditional constraints

Determine what artifacts to generate and where they go:

- module_layout: compute output directories for entry points,
  map Python module paths to filesystem output paths via
  compute_output_dir
- path_assignment: build_placement_registry maps types to
  output file paths. Feature models get {theme}/{slug}/,
  shared types get types/{subsystem}/, theme-local types
  nest under their feature or sit flat at theme level
- type_collection: discover supplementary types (enums,
  NewTypes, sub-models) by walking expanded feature trees
- link_computation: relative_link() computes cross-page
  links, LinkContext holds page path + registry for
  resolving links during rendering
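A minimal sketch of what relative_link computes, assuming POSIX-style page paths; the real implementation resolves targets through LinkContext and the placement registry:

```python
import posixpath

def relative_link(from_page: str, to_page: str) -> str:
    """Relative URL from one rendered page to another."""
    return posixpath.relpath(to_page, start=posixpath.dirname(from_page))
```

So a page at places/place/index.md linking to types/names/index.md gets a ../../ prefix, while a sibling page links by bare filename.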
Embed JSON example features in [tool.overture-schema.examples]
sections. Each example is a complete GeoJSON Feature matching
the theme's Pydantic model, used by the codegen example_loader
to render example tables in documentation.
Jinja2 templates and rendering logic for documentation pages:

- markdown_renderer: orchestrates page rendering for features,
  enums, NewTypes, primitives, and geometry. Recursively expands
  MODEL-kind fields inline with dot-notation.
- markdown_type_format: type string formatting with link-aware
  rendering via LinkContext
- example_loader: loads examples from theme pyproject.toml,
  validates against Pydantic models, flattens to dot-notation
- reverse_references: computes "Used By" cross-references
  between types and the features that reference them

Templates: feature, enum, newtype, primitives, geometry pages.
Golden-file snapshot tests verify rendered output stability.

Adds renderer-specific fixtures to conftest.py (cli_runner,
primitives_markdown, geometry_markdown).

Click-based CLI entry point (overture-codegen generate) that
wires discovery → extraction → output layout → rendering:

- Discovers models via discover_models() entry points
- Filters themes, extracts specs, builds placement registry
- Renders markdown pages with field tables, examples, cross-
  references, and sidebar metadata
- Supports --theme filtering and --output-dir targeting

Integration tests verify extraction against real Overture
models (Building, Division, Segment, etc.) to catch schema
drift. CLI tests verify end-to-end generation, output
structure, and link integrity.

Design doc covers the four-layer architecture, analyze_type(),
domain-specific extractors, and extension points for new output
targets.

Walkthrough traces Segment through the full pipeline
module-by-module in dependency order, with FeatureVersion as a
secondary example for constraint provenance in the type analyzer.

README describes the problem (Pydantic flattens domain vocabulary),
the "unwrap once, render many" approach, CLI usage, architecture
overview, and programmatic API.

TypeInfo.literal_value discarded multi-value Literals entirely
(Literal["a", "b"] got None). Renamed to literal_values as a
tuple of all args so consumers decide presentation.

single_literal_value() preserves its contract: returns the
value for single-arg Literals, None otherwise. Callers
(example_loader, union_extraction) are unchanged.

Multi-value Literals render as pipe-separated quoted values
in markdown tables: `"a"` \| `"b"`.
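The resulting API, reconstructed as a sketch:

```python
from typing import Literal, get_args

Side = Literal["left", "right"]

# literal_values keeps every arg; consumers decide how to present them:
literal_values = get_args(Side)  # ("left", "right")

def single_literal_value(values: tuple):
    """Return the value for single-arg Literals, None otherwise."""
    return values[0] if len(values) == 1 else None
```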
Replace TypeInfo.is_list: bool with list_depth: int so nested lists
like list[NewType("Hierarchy", list[HierarchyItem])] are handled
correctly. analyze_type increments list_depth for each list[...]
layer instead of setting a boolean. An is_list property preserves
the boolean API for depth-unaware consumers.

Markdown renderer: format_type and format_underlying_type wrap
list_depth times. _expandable_list_suffix returns "[]" per nesting
level for dot-notation expansion. Constraint annotation matching
strips all trailing "[]" suffixes instead of one.

Union extraction: _type_identity uses list_depth (int) instead of
is_list (bool) so fields with different nesting depths don't
incorrectly deduplicate.

Update design doc and walkthrough to reflect list_depth replacing
the is_list boolean throughout TypeInfo, _UnwrapState, type
formatting, and union deduplication.
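In miniature, with TypeInfo reduced to the relevant fields:

```python
from dataclasses import dataclass

@dataclass
class TypeInfo:
    name: str
    list_depth: int = 0

    @property
    def is_list(self) -> bool:
        """Boolean compatibility shim for depth-unaware consumers."""
        return self.list_depth > 0

def expandable_list_suffix(info: TypeInfo) -> str:
    """One '[]' per nesting level, for dot-notation expansion."""
    return "[]" * info.list_depth
```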
Replace bare class name keys with TypeIdentity objects across all
registries. Two types with the same __name__ from different modules
(e.g., Places Address vs Addresses Address) now get separate registry
entries and resolve to different output paths.

TypeIdentity is a frozen dataclass pairing a unique Python object
(class, NewType callable, or union annotation) with its display name.
Equality and hashing delegate to object identity so lookups are
collision-free regardless of display name.

Changes across the pipeline:
- ConstraintSource stores source_ref (NewType callable) and
  source_name instead of a bare name string
- type_collection, path_assignment, link_computation, and
  reverse_references all key on TypeIdentity
- primitive_extraction returns TypeIdentity instead of strings
- Renderers construct TypeIdentity for link resolution
- Each spec type exposes an identity property via
  _SourceTypeIdentityMixin (or directly for UnionSpec)
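A simplified sketch of the shape described:

```python
from dataclasses import dataclass

@dataclass(frozen=True, eq=False)
class TypeIdentity:
    """Pairs a unique Python object with its display name. Equality and
    hashing delegate to object identity, so two Address classes from
    different modules never collide in a registry."""
    obj: object          # class, NewType callable, or union annotation
    display_name: str

    def __eq__(self, other) -> bool:
        return isinstance(other, TypeIdentity) and other.obj is self.obj

    def __hash__(self) -> int:
        return id(self.obj)

    @property
    def module(self) -> str:
        return getattr(self.obj, "__module__", "")
```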
MinLen/MaxLen: render as prose ("Minimum length: 1") instead of
wrapping the entire phrase in backticks. Math notation (≥, <) stays
in backticks; English words don't belong there.

UniqueItemsConstraint: reword docstring from class-description
phrasing ("Ensures all items in a collection are unique") to
validation-requirement phrasing ("All items must be unique"),
matching model-level constraint tone.

String constraints: normalize PhoneNumberConstraint,
RegionCodeConstraint, and WikidataIdConstraint docstrings to the
"Allows only..." pattern used by all other StringConstraint
subclasses.

Pydantic types like HttpUrl and EmailStr appear in field annotations
but previously rendered as unlinked inline code. Each referenced
Pydantic type now gets its own page under pydantic/<module>/ with a
description, upstream Pydantic docs link, and Used By section.

Discovery is reference-driven: the type collection visitor detects
PRIMITIVE-kind types from pydantic modules in expanded feature trees.
PydanticTypeSpec joins the SupplementarySpec union and flows through
placement, reverse references, and rendering.

Linking is registry-driven for all PRIMITIVE-kind types. Any primitive
with a page in the placement registry gets linked, whether it's a
Pydantic type (individual page) or a registered numeric primitive
(aggregate page). This also links int32/float64 to the primitives
page, which they weren't before.

Shared is_pydantic_sourced() predicate gates collection and reverse
reference tracking to pydantic-origin types without restricting the
linking mechanism.

Remove bbox from default skip keys so it renders in
example output like any other field.

After resolving type name collisions across themes (101596f),
two referrers from different modules can share a display name.
The sort key (kind, name) produced ties, and Python's sorted()
preserved set iteration order for tied elements -- which depends
on id()-based hashing and varies across process invocations.

Add the source module as a tiebreaker: (kind, name, module).
Expose TypeIdentity.module property to encapsulate the
getattr(obj, "__module__") access pattern.
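The fix in miniature; Ref is a stand-in for the real referrer records:

```python
from collections import namedtuple

Ref = namedtuple("Ref", ["kind", "name", "module"])

# Two referrers sharing (kind, name) from different modules: the module
# breaks the tie, so output order no longer depends on set iteration order.
refs = {Ref("model", "Address", "overture.places"),
        Ref("model", "Address", "overture.addresses")}
ordered = sorted(refs, key=lambda r: (r.kind, r.name, r.module))
assert [r.module for r in ordered] == ["overture.addresses", "overture.places"]
```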
Constraint annotations in table description cells ran directly
into the preceding description text with only a single <br/>.
Double the break so constraints read as a separate paragraph.

list[PhoneNumber] rendered as "PhoneNumber (list)" — implying
PhoneNumber itself is a list type. The root cause: format_type
couldn't distinguish list layers outside a NewType from list layers
inside one.

Add newtype_outer_list_depth to TypeInfo, snapshotted from list_depth
when the type analyzer enters the first NewType. The renderer uses
this to choose list<X> syntax (list wraps the NewType) vs a (list)
qualifier (NewType wraps a list internally). Non-NewType identities
(enums, models) continue using list<X>.

_truncate() produced strings up to 103 chars (100 + "..."). Account
for the 3-char ellipsis so output stays within the 100-char limit.
str() on string list items renders as [a, b], indistinguishable from
bare identifiers. repr() renders as ['a', 'b'] so strings are
visually distinct from numbers.
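The difference at a glance:

```python
# str() flattens string items into something identifier-like;
# repr() quotes them, so strings stay visually distinct from numbers.
items = ["a", "b"]
assert "[%s]" % ", ".join(map(str, items)) == "[a, b]"
assert "[%s]" % ", ".join(map(repr, items)) == "['a', 'b']"
```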
extract_model() on union members produced ModelSpecs with model=None
on MODEL-kind fields. _collect_from_fields then hit the RuntimeError
guard when it encountered those unexpanded references. Call
expand_model_tree() on each member before walking its fields.

No current union members have sub-model fields, so this was latent.

flatten_example recursed into all dicts, splitting dict-typed fields
like `tags: dict[str, str]` into dot-notation rows. Now
collect_dict_paths walks the FieldSpec tree to identify dict-typed
field paths, and _flatten_value checks membership before recursing.

Indexed runtime paths (items[0].tags) are normalized to schema
notation (items[].tags) for matching. The pipeline computes
dict_paths from spec.fields and threads them through load_examples.

Also: clarify mutual exclusion in type visitor elif chains
(reverse_references, type_collection) and rename _TypeIdentity to
_TypeShape in union_extraction to avoid shadowing specs.TypeIdentity.
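A condensed sketch of the membership check; normalize_path and the dict_paths set are simplified stand-ins for what the pipeline derives from spec.fields:

```python
import re

def normalize_path(path: str) -> str:
    """Normalize indexed runtime paths to schema notation:
    items[0].tags -> items[].tags."""
    return re.sub(r"\[\d+\]", "[]", path)

def flatten_value(value, path="", dict_paths=frozenset()):
    """Flatten to dot-notation rows, but keep dict-typed fields whole."""
    if isinstance(value, dict) and normalize_path(path) not in dict_paths:
        rows = {}
        for key, sub in value.items():
            child = f"{path}.{key}" if path else key
            rows.update(flatten_value(sub, child, dict_paths))
        return rows
    if isinstance(value, list):
        rows = {}
        for i, sub in enumerate(value):
            rows.update(flatten_value(sub, f"{path}[{i}]", dict_paths))
        return rows
    return {path: value}
```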
Move modules into three sub-packages matching the architecture layers:

- extraction/ (14 modules): type analysis, specs, extractors, constraints
- layout/ (2 modules): module layout, type collection
- markdown/ (6 modules + templates): pipeline, renderer, type formatting,
  links, paths, reverse references

Three modules renamed to drop redundant prefixes:
  field_constraint_description → extraction/field_constraints
  model_constraint_description → extraction/model_constraints
  example_loader → extraction/examples

Templates flattened from templates/markdown/ to markdown/templates/.

New `--format wassirman` option for `overture-codegen generate` that
emits YAML validation IR from Pydantic schema models.

The pipeline walks expanded FeatureSpec trees and emits one rule per
field constraint: not_null, numeric bounds (gte/lte/between), length,
enum/literal membership, geometry type, pattern, and uniqueness.
Model-level constraints (require_any_of, radio_group, require_if,
forbid_if) produce multi-column or conditional rules.

list_columns tracks array nesting for element-level checks. Parent
optionality propagates as `when: not_null` guards. Structural fields
(theme, type, bbox, ext_*) are skipped.

With --output-dir, writes one YAML file per feature type. Without it,
emits a single envelope to stdout.

Golden snapshot tests cover all 16 discovered feature types, verified
against the reference validator output.

@mojodna force-pushed the codegen branch 4 times, most recently from 038c250 to 86d864a on March 11, 2026 at 19:06.
Base automatically changed from codegen to dev on March 11, 2026 at 19:39.