Architecture: move codegen backends toward SemanticSchema as the common backend contract #129

@hardbyte

Follow-up to #96.

Summary

#96 introduced the compiler-oriented foundations: stable symbol IDs, normalization, and SemanticSchema.

Today, backend consumption is still split:

  • Python consumes SemanticSchema for some things and raw Schema for others
  • Rust mostly renders from raw Schema after backend-local preprocessing
  • TypeScript mostly renders from raw Schema after backend-local preprocessing
  • OpenAPI walks raw Schema directly

That means we still have multiple backend contracts and multiple places where semantic fixes can fail to propagate consistently.

This issue proposes moving the architecture toward a model in which all backends consume SemanticSchema directly or, at minimum, a single backend-facing IR derived from it.
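As a rough illustration of what "one backend contract" could look like, here is a minimal sketch. Every name in it (the simplified `SemanticSchema` and `SemanticType` structs, the `Backend` trait, `render_all`) is a hypothetical stand-in invented for this example and is not the existing reflectapi API; the real SemanticSchema carries far more information.

```rust
/// Hypothetical, heavily simplified stand-in for the real SemanticSchema.
pub struct SemanticSchema {
    /// Types in the stable order already chosen by the shared compiler stages.
    pub types: Vec<SemanticType>,
}

/// Hypothetical semantic type: identity and name only.
pub struct SemanticType {
    pub id: u32,
    pub name: String,
}

/// The single contract: a backend only ever sees the semantic schema.
pub trait Backend {
    fn render(&self, schema: &SemanticSchema) -> String;
}

pub struct PythonBackend;
pub struct TypeScriptBackend;

impl Backend for PythonBackend {
    fn render(&self, schema: &SemanticSchema) -> String {
        // Rendering is backend-local; ordering comes from the schema.
        schema
            .types
            .iter()
            .map(|t| format!("class {}: ...", t.name))
            .collect::<Vec<_>>()
            .join("\n")
    }
}

impl Backend for TypeScriptBackend {
    fn render(&self, schema: &SemanticSchema) -> String {
        schema
            .types
            .iter()
            .map(|t| format!("export interface {} {{}}", t.name))
            .collect::<Vec<_>>()
            .join("\n")
    }
}

/// Run every backend against the same schema value.
pub fn render_all(schema: &SemanticSchema, backends: &[&dyn Backend]) -> Vec<String> {
    backends.iter().map(|b| b.render(schema)).collect()
}
```

Because both toy backends iterate `schema.types` in the order they are given, a fix to ordering in the shared stages would reach both without any backend change, which is the property this issue is after.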

Recent work on the Python branch established a clearer boundary between shared compiler concerns and backend-local concerns:

  • stable IDs are now assigned in the compiler path instead of constructors
  • Python type-to-language mappings (e.g. i32 → int, chrono::DateTime → datetime) were evaluated for inclusion in the schema layer but deliberately kept as a backend-local static table, because they are static codegen knowledge rather than per-type annotations (see the review of #128, "fix(python): DX improvements - compact enums, name truncation, bug fixes")

That second point is instructive for this issue: the boundary between "shared meaning" and "backend-specific rendering" needs to be drawn carefully. Type-to-language mappings are backend-local. Ordering, dependency analysis, and symbol identity are shared concerns.
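To make the "backend-local static table" idea concrete, here is a minimal sketch. The table name `PYTHON_TYPE_MAP` and the function `python_type_for` are hypothetical and not taken from the Python backend; only the two mappings themselves come from the example above.

```rust
/// Hypothetical static table owned by the Python backend.
/// It is codegen knowledge, so it lives next to the renderer
/// instead of being attached to each type in the schema.
const PYTHON_TYPE_MAP: &[(&str, &str)] = &[
    ("i32", "int"),
    ("chrono::DateTime", "datetime"),
];

/// Look up the Python spelling of a source type name.
/// Returns `None` for names the table does not cover
/// (for example user-defined types, which are rendered by name).
pub fn python_type_for(source_type: &str) -> Option<&'static str> {
    PYTHON_TYPE_MAP
        .iter()
        .find(|(src, _)| *src == source_type)
        .map(|(_, py)| *py)
}
```

Nothing in this lookup depends on a particular schema instance, which is the argument for keeping it out of the schema layer.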

Why

The main benefits are architectural and compiler-facing:

  • One canonical backend contract instead of a mix of raw-schema and semantic-schema paths
  • Less backend drift in ordering, naming, dependency handling, and type resolution
  • Fewer backend-local schema mutations and ad hoc preprocessing steps
  • Stronger guarantees that symbol identity, dependency analysis, and normalization semantics are applied consistently across languages
  • Better foundation for future transforms like monomorphization, richer validation, and backend-specific lowering passes

This is also the natural continuation of the direction described in #96: shared frontend/compiler stages with thinner, more predictable backends.

Current pain points

  • Python still needs to synchronize semantic ordering with raw schema lookups and some raw-schema mutation
  • Rust and TypeScript still rely on raw schema traversal after local consolidation
  • OpenAPI still bypasses the semantic layer entirely
  • Schema phase boundaries are implicit because important transforms are performed in-place and repeated in backend code
  • Compiler concerns like stable symbol identity are still stored on raw schema structs, even though they are primarily needed by normalization/codegen

Proposed direction

  1. Define the desired common backend contract explicitly.
    Either:
  • all backends consume SemanticSchema, or
  • all backends consume a single codegen IR lowered from SemanticSchema
  2. Make the raw schema vs compiler-schema boundary explicit.
    At minimum:
  • raw/interchange Schema remains wire-focused
  • stable IDs are assigned by the compiler path, not treated as constructor-level business data
  • semantic/codegen stages consume compiler-owned identity and dependency information
  3. Move backend-independent meaning into the shared frontend/compiler layers.
    Examples:
  • resolved references
  • stable ordering
  • dependency information
  • naming/consolidation decisions
  • symbol identity
  • normalized container / fallback semantics that every backend would otherwise rediscover separately
  4. Keep backend-specific rendering local to the backend.
    Examples:
  • Python type mappings, runtime-provided types, and imports
  • TypeScript type mappings and intersection-type strategies
  • Rust derives / ownership choices
  • OpenAPI-specific schema projection

    Language-specific type mappings are static codegen knowledge and belong in the backend, not the schema. Only information that cannot be inferred at codegen time (e.g. Rust additional_derives from source-code annotations) should travel on the schema.

  5. Reduce direct raw Schema traversal in backends over time.
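The raw-vs-compiler boundary in the steps above can be sketched roughly as follows. This is a toy model under stated assumptions: `RawType`, `SymbolId`, `CompiledType`, and `compile` are hypothetical names invented here, IDs are assigned by sorted name purely for illustration, and the real normalization pipeline is considerably richer.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical wire-focused type: names and references, no identity.
pub struct RawType {
    pub name: String,
    pub references: Vec<String>,
}

/// Hypothetical compiler-owned identity.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct SymbolId(pub u32);

/// Hypothetical compiler output: identity plus resolved dependencies.
#[derive(Debug)]
pub struct CompiledType {
    pub id: SymbolId,
    pub name: String,
    pub depends_on: Vec<SymbolId>,
}

/// Assign IDs in the compiler path, then emit types dependencies-first.
pub fn compile(raw: &[RawType]) -> Vec<CompiledType> {
    // Sorting names first makes ID assignment independent of input order.
    let ids: BTreeMap<&str, SymbolId> = raw
        .iter()
        .map(|t| t.name.as_str())
        .collect::<BTreeSet<_>>()
        .into_iter()
        .enumerate()
        .map(|(i, name)| (name, SymbolId(i as u32)))
        .collect();
    let by_name: BTreeMap<&str, &RawType> =
        raw.iter().map(|t| (t.name.as_str(), t)).collect();

    let mut out = Vec::new();
    let mut done = BTreeSet::new();
    for name in ids.keys() {
        visit(*name, &by_name, &ids, &mut done, &mut out);
    }
    out
}

/// Depth-first visit that pushes a type after all of its dependencies.
fn visit<'a>(
    name: &'a str,
    by_name: &BTreeMap<&'a str, &'a RawType>,
    ids: &BTreeMap<&'a str, SymbolId>,
    done: &mut BTreeSet<&'a str>,
    out: &mut Vec<CompiledType>,
) {
    // Marking before recursing also keeps reference cycles from looping.
    if !done.insert(name) {
        return;
    }
    // References to unknown names are skipped in this sketch.
    let Some(&raw) = by_name.get(name) else { return };
    for dep in &raw.references {
        visit(dep.as_str(), by_name, ids, done, out);
    }
    out.push(CompiledType {
        id: ids[name],
        name: name.to_string(),
        depends_on: raw
            .references
            .iter()
            .filter_map(|d| ids.get(d.as_str()).copied())
            .collect(),
    });
}
```

A backend that consumes the `Vec<CompiledType>` gets identity, ordering, and dependency information without touching `RawType` at all, so none of those decisions need to be repeated per language.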

Likely subproblems

  • Audit what each backend still needs from raw Schema that is not represented in SemanticSchema
  • Enrich SemanticSchema where it is missing required information
  • Decide how naming and consolidation should appear to backends
  • Decide whether some backend-facing concerns belong in a post-semantic lowering stage rather than in SemanticSchema itself
  • Decide whether the long-term design should keep id fields on raw schema structs or move identity fully into a compiler-owned layer
  • Migrate one backend at a time, ideally starting with the backend already furthest along

Non-goals

  • This issue is not necessarily about deleting raw Schema
  • This issue is not necessarily about making every backend identical internally
  • This issue is not necessarily about serializing every backend-specific config into reflectapi.json
  • This issue is not about pretending different languages can share one universal final mapping layer

Suggested first step

Do a backend-by-backend audit of:

  • which raw Schema fields are still read directly
  • which of those reads are truly backend-specific
  • which should instead be represented in SemanticSchema or a shared lowering stage

That audit should produce a staged migration plan rather than a single large rewrite.
