Skip to content

QA: Architectural review — findings and recommendations #84

@maxine-at-forecast

Description

@maxine-at-forecast

Codebase Snapshot

Metric Value
Production code 29,183 lines (src/atdata/)
Tests 2,021 collected (1,911 active), pytest + Codecov CI
Modules 40+ across 8 packages
Python 3.12+ (generics, `
TODOs/FIXMEs 0 in production code
Hardcoded secrets 0
Bare except: 0 (all use specific types or Exception)

This is a healthy, well-tested codebase. The findings below are about long-term maintainability, not correctness crises.


Architectural Findings

Facade classes are intentional — not God Objects

The three largest classes (Index 42 public methods, Dataset 32, Atmosphere 30) are user-facing facades. They present a unified API surface and splitting them into sub-objects would hurt DX:

# Good DX (current):
index.write_samples(samples, name="mnist")
index.get_schema_type("mnist")

# Worse DX (if split):
index.writer.write_samples(samples, name="mnist")
index.schemas.get_type("mnist")

The problem is not the public surface — it's the internal implementation of these facades. The methods themselves are too long and duplicate logic internally. The fix is internal delegation, not API decomposition.

Exception: Atmosphere — this one would benefit from partial namespacing (atmo.records.create, atmo.blobs.upload) because its API serves two audiences: casual users (list/search) and power users (XRPC, record CRUD). Namespacing helps discoverability without hurting the common path.


Actionable Items

Quick Wins

1. Extract Dataset._build_pipeline() to eliminate 4x duplication

Files: src/atdata/dataset.py:1127-1219

ordered() and shuffled() each have unbatched and batched variants — 4 near-identical pipeline constructions. A shared _build_pipeline(shuffle_shards=0, shuffle_samples=0, batch_size=None) method reduces this to 4 one-line delegations. Zero API change.

2. Introduce DatasetMeta dataclass for parameter explosion

Files: _index.py:695, _index.py:874, atmosphere/records.py

insert_dataset (15 params), write_samples (11 params), and DatasetPublisher.publish (10 params) share the same metadata fields (name, description, tags, license, schema_ref, metadata). A DatasetMeta dataclass bundles these. Accept both flat kwargs (simple case) and the config object (reusable/composable case).

3. Narrow except Exception in atmosphere AppView fallback

Files: All 6 files in src/atdata/atmosphere/

13 sites use the identical pattern:

if getattr(self.client, "has_appview", False) is True:
    try:
        return self._op_via_appview(...)
    except Exception:
        from .._logging import get_logger
        get_logger().warning("AppView <op> failed, falling back", exc_info=True)
return self._op_client_side(...)

Problems:

  • Catches KeyboardInterrupt, SystemExit, MemoryError — not just network errors
  • Lazy import of _logging in each except block (8x duplication)
  • Silent degradation: if AppView is misconfigured and always fails, users see a slightly slower client-side path with no visible indication except debug-level logs

Fix: Extract _with_appview_fallback(appview_fn, client_fn, operation) helper. Narrow to (ConnectionError, TimeoutError, httpx.HTTPStatusError). Centralizes logging, makes fallback rate observable, and eliminates 8 duplicate except blocks.

4. Add SQLite indexes for schema and label lookups

File: src/atdata/providers/_sqlite.py:32

Only idx_entries_name exists. Schema lookups by (name, version) and label queries do full table scans. Add composite indexes.

5. Switch user-facing deprecation warnings to FutureWarning

Files: 18 sites across the codebase

Python suppresses DeprecationWarning by default in user code. The 18 deprecation sites are invisible unless running with -Wd. FutureWarning is shown by default — use it for APIs being removed in the next major version.


Medium Effort

6. Delegate Index.insert_dataset / write_samples to Repository

Files: _index.py:695-872, _index.py:874-1070, repository.py

These two methods are 195 and 200 lines respectively. They inline repository routing, atmosphere detection, store selection, size guards, and blob upload orchestration. The Repository abstraction already exists but is bypassed.

Fix: Repository.insert_dataset() and Repository.write_samples() do the real work. Index methods become ~20-line dispatchers that resolve the target repo and delegate. Internal complexity drops without changing the public API.

7. Fix Redis label resolution (returns random, not latest)

File: src/atdata/providers/_redis.py:159-181

When no version is specified, scan_iter returns entries in hash-table order. The "latest" label is whichever entry happens to come last — effectively random. This is a correctness bug.

Fix: Fetch all candidates, sort by version (semver) or timestamp, return the max.

8. Fix Redis N+1 query in iter_entries

File: src/atdata/providers/_redis.py:74-79

Each entry is fetched with a separate get_entry_by_cid() call. N entries = N round-trips.

Fix: Use MGET or Redis pipeline for batch retrieval.

9. Consolidate _schema_codec field type mapping duplication

File: src/atdata/_schema_codec.py

_field_type_to_python() and _field_type_to_stub_str() duplicate the same type mapping — one returns Python types, the other returns string representations. A new field type must be added to both.

Fix: Single mapping table or FieldTypeMapper class that returns (python_type, stub_string).


Larger Effort

10. Namespace Atmosphere for discoverability

File: src/atdata/atmosphere/client.py

Split internal implementation into RecordOps, BlobOps, XrpcClient. Expose as atmo.records, atmo.blobs, atmo.xrpc. Keep common operations (list_datasets, search_datasets, auth properties) on the root. Preserve flat methods with deprecation warnings for one major version.

11. Split StubManager dual responsibility

File: src/atdata/_stub_manager.py

Handles both schema stubs and lens stubs with duplicated naming, path construction, and write logic (_module_filename / _lens_module_filename, _module_path / _lens_module_path). Extract a generic StubWriter base or split into SchemaStubManager / LensStubManager.


Cleanup / Hygiene

12. Schedule deprecated method removal

File: _index.py (7 deprecated methods), atmosphere/__init__.py, _sources.py, others

18 DeprecationWarning sites across the codebase. Document a removal version (e.g., # Removal: v1.0) in each. Consider batching the removal into a single PR for the next major version.

13. Thread-safety of Atmosphere._atproto_client_class lazy init

File: src/atdata/atmosphere/client.py

Module-level global with lazy initialization, no lock. Two concurrent imports could race. Fix with threading.Lock or move to module-level import.

14. _AtmosphereBackend._ensure_loaders() temporal coupling

File: src/atdata/repository.py

Every public method starts with _ensure_loaders() — a hidden precondition. If a new method omits it, runtime failure. Fix by initializing eagerly in __init__ or using __getattr__ auto-init.


Non-Issues (Noted for Future Reviewers)

These were flagged during the review but are intentional design choices, not problems:

  • Index has 42 public methods — it's a facade over 4 domains (datasets, schemas, labels, lenses). The flat API is the correct DX choice. Internal delegation (item 6 above) is the right fix, not splitting the public surface.
  • Dataset has 32 public methods — this follows the pandas/HuggingFace pattern. Chainable, fluent API. Do not split.
  • Global singletons (LensNetwork, _default_index, _type_cache, _logger) — all intentional. LensNetwork is a registry; the others have proper lifecycle management.
  • except Exception in CLI modules (15 sites in cli/) — CLI is the system boundary. Catching broadly and printing user-friendly errors is correct here.
  • Deprecated methods still present — backward compat is intentional, just needs a removal timeline.

Metadata

Metadata

Assignees

No one assigned

    Labels

    qualityCode quality and architectural reviewsrefactor

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions