Codebase Snapshot
| Metric |
Value |
| Production code |
29,183 lines (src/atdata/) |
| Tests |
2,021 collected (1,911 active), pytest + Codecov CI |
| Modules |
40+ across 8 packages |
| Python |
3.12+ (generics, ` |
| TODOs/FIXMEs |
0 in production code |
| Hardcoded secrets |
0 |
Bare except: |
0 (all use specific types or Exception) |
This is a healthy, well-tested codebase. The findings below are about long-term maintainability, not correctness crises.
Architectural Findings
Facade classes are intentional — not God Objects
The three largest classes (Index 42 public methods, Dataset 32, Atmosphere 30) are user-facing facades. They present a unified API surface and splitting them into sub-objects would hurt DX:
# Good DX (current):
index.write_samples(samples, name="mnist")
index.get_schema_type("mnist")
# Worse DX (if split):
index.writer.write_samples(samples, name="mnist")
index.schemas.get_type("mnist")
The problem is not the public surface — it's the internal implementation of these facades. The methods themselves are too long and duplicate logic internally. The fix is internal delegation, not API decomposition.
Exception: Atmosphere — this one would benefit from partial namespacing (atmo.records.create, atmo.blobs.upload) because its API serves two audiences: casual users (list/search) and power users (XRPC, record CRUD). Namespacing helps discoverability without hurting the common path.
Actionable Items
Quick Wins
1. Extract Dataset._build_pipeline() to eliminate 4x duplication
Files: src/atdata/dataset.py:1127-1219
ordered() and shuffled() each have unbatched and batched variants — 4 near-identical pipeline constructions. A shared _build_pipeline(shuffle_shards=0, shuffle_samples=0, batch_size=None) method reduces this to 4 one-line delegations. Zero API change.
2. Introduce DatasetMeta dataclass for parameter explosion
Files: _index.py:695, _index.py:874, atmosphere/records.py
insert_dataset (15 params), write_samples (11 params), and DatasetPublisher.publish (10 params) share the same metadata fields (name, description, tags, license, schema_ref, metadata). A DatasetMeta dataclass bundles these. Accept both flat kwargs (simple case) and the config object (reusable/composable case).
3. Narrow except Exception in atmosphere AppView fallback
Files: All 6 files in src/atdata/atmosphere/
13 sites use the identical pattern:
if getattr(self.client, "has_appview", False) is True:
try:
return self._op_via_appview(...)
except Exception:
from .._logging import get_logger
get_logger().warning("AppView <op> failed, falling back", exc_info=True)
return self._op_client_side(...)
Problems:
- Catches
KeyboardInterrupt, SystemExit, MemoryError — not just network errors
- Lazy import of
_logging in each except block (8x duplication)
- Silent degradation: if AppView is misconfigured and always fails, users see a slightly slower client-side path with no visible indication except debug-level logs
Fix: Extract _with_appview_fallback(appview_fn, client_fn, operation) helper. Narrow to (ConnectionError, TimeoutError, httpx.HTTPStatusError). Centralizes logging, makes fallback rate observable, and eliminates 8 duplicate except blocks.
4. Add SQLite indexes for schema and label lookups
File: src/atdata/providers/_sqlite.py:32
Only idx_entries_name exists. Schema lookups by (name, version) and label queries do full table scans. Add composite indexes.
5. Switch user-facing deprecation warnings to FutureWarning
Files: 18 sites across the codebase
Python suppresses DeprecationWarning by default in user code. The 18 deprecation sites are invisible unless running with -Wd. FutureWarning is shown by default — use it for APIs being removed in the next major version.
Medium Effort
6. Delegate Index.insert_dataset / write_samples to Repository
Files: _index.py:695-872, _index.py:874-1070, repository.py
These two methods are 195 and 200 lines respectively. They inline repository routing, atmosphere detection, store selection, size guards, and blob upload orchestration. The Repository abstraction already exists but is bypassed.
Fix: Repository.insert_dataset() and Repository.write_samples() do the real work. Index methods become ~20-line dispatchers that resolve the target repo and delegate. Internal complexity drops without changing the public API.
7. Fix Redis label resolution (returns random, not latest)
File: src/atdata/providers/_redis.py:159-181
When no version is specified, scan_iter returns entries in hash-table order. The "latest" label is whichever entry happens to come last — effectively random. This is a correctness bug.
Fix: Fetch all candidates, sort by version (semver) or timestamp, return the max.
8. Fix Redis N+1 query in iter_entries
File: src/atdata/providers/_redis.py:74-79
Each entry is fetched with a separate get_entry_by_cid() call. N entries = N round-trips.
Fix: Use MGET or Redis pipeline for batch retrieval.
9. Consolidate _schema_codec field type mapping duplication
File: src/atdata/_schema_codec.py
_field_type_to_python() and _field_type_to_stub_str() duplicate the same type mapping — one returns Python types, the other returns string representations. A new field type must be added to both.
Fix: Single mapping table or FieldTypeMapper class that returns (python_type, stub_string).
Larger Effort
10. Namespace Atmosphere for discoverability
File: src/atdata/atmosphere/client.py
Split internal implementation into RecordOps, BlobOps, XrpcClient. Expose as atmo.records, atmo.blobs, atmo.xrpc. Keep common operations (list_datasets, search_datasets, auth properties) on the root. Preserve flat methods with deprecation warnings for one major version.
11. Split StubManager dual responsibility
File: src/atdata/_stub_manager.py
Handles both schema stubs and lens stubs with duplicated naming, path construction, and write logic (_module_filename / _lens_module_filename, _module_path / _lens_module_path). Extract a generic StubWriter base or split into SchemaStubManager / LensStubManager.
Cleanup / Hygiene
12. Schedule deprecated method removal
File: _index.py (7 deprecated methods), atmosphere/__init__.py, _sources.py, others
18 DeprecationWarning sites across the codebase. Document a removal version (e.g., # Removal: v1.0) in each. Consider batching the removal into a single PR for the next major version.
13. Thread-safety of Atmosphere._atproto_client_class lazy init
File: src/atdata/atmosphere/client.py
Module-level global with lazy initialization, no lock. Two concurrent imports could race. Fix with threading.Lock or move to module-level import.
14. _AtmosphereBackend._ensure_loaders() temporal coupling
File: src/atdata/repository.py
Every public method starts with _ensure_loaders() — a hidden precondition. If a new method omits it, runtime failure. Fix by initializing eagerly in __init__ or using __getattr__ auto-init.
Non-Issues (Noted for Future Reviewers)
These were flagged during the review but are intentional design choices, not problems:
Index has 42 public methods — it's a facade over 4 domains (datasets, schemas, labels, lenses). The flat API is the correct DX choice. Internal delegation (item 6 above) is the right fix, not splitting the public surface.
Dataset has 32 public methods — this follows the pandas/HuggingFace pattern. Chainable, fluent API. Do not split.
- Global singletons (
LensNetwork, _default_index, _type_cache, _logger) — all intentional. LensNetwork is a registry; the others have proper lifecycle management.
except Exception in CLI modules (15 sites in cli/) — CLI is the system boundary. Catching broadly and printing user-friendly errors is correct here.
- Deprecated methods still present — backward compat is intentional, just needs a removal timeline.
Codebase Snapshot
except:Exception)This is a healthy, well-tested codebase. The findings below are about long-term maintainability, not correctness crises.
Architectural Findings
Facade classes are intentional — not God Objects
The three largest classes (
Index42 public methods,Dataset32,Atmosphere30) are user-facing facades. They present a unified API surface and splitting them into sub-objects would hurt DX:The problem is not the public surface — it's the internal implementation of these facades. The methods themselves are too long and duplicate logic internally. The fix is internal delegation, not API decomposition.
Exception:
Atmosphere— this one would benefit from partial namespacing (atmo.records.create,atmo.blobs.upload) because its API serves two audiences: casual users (list/search) and power users (XRPC, record CRUD). Namespacing helps discoverability without hurting the common path.Actionable Items
Quick Wins
1. Extract
Dataset._build_pipeline()to eliminate 4x duplicationFiles:
src/atdata/dataset.py:1127-1219ordered()andshuffled()each have unbatched and batched variants — 4 near-identical pipeline constructions. A shared_build_pipeline(shuffle_shards=0, shuffle_samples=0, batch_size=None)method reduces this to 4 one-line delegations. Zero API change.2. Introduce
DatasetMetadataclass for parameter explosionFiles:
_index.py:695,_index.py:874,atmosphere/records.pyinsert_dataset(15 params),write_samples(11 params), andDatasetPublisher.publish(10 params) share the same metadata fields (name, description, tags, license, schema_ref, metadata). ADatasetMetadataclass bundles these. Accept both flat kwargs (simple case) and the config object (reusable/composable case).3. Narrow
except Exceptionin atmosphere AppView fallbackFiles: All 6 files in
src/atdata/atmosphere/13 sites use the identical pattern:
Problems:
KeyboardInterrupt,SystemExit,MemoryError— not just network errors_loggingin each except block (8x duplication)Fix: Extract
_with_appview_fallback(appview_fn, client_fn, operation)helper. Narrow to(ConnectionError, TimeoutError, httpx.HTTPStatusError). Centralizes logging, makes fallback rate observable, and eliminates 8 duplicate except blocks.4. Add SQLite indexes for schema and label lookups
File:
src/atdata/providers/_sqlite.py:32Only
idx_entries_nameexists. Schema lookups by(name, version)and label queries do full table scans. Add composite indexes.5. Switch user-facing deprecation warnings to
FutureWarningFiles: 18 sites across the codebase
Python suppresses
DeprecationWarningby default in user code. The 18 deprecation sites are invisible unless running with-Wd.FutureWarningis shown by default — use it for APIs being removed in the next major version.Medium Effort
6. Delegate
Index.insert_dataset/write_samplestoRepositoryFiles:
_index.py:695-872,_index.py:874-1070,repository.pyThese two methods are 195 and 200 lines respectively. They inline repository routing, atmosphere detection, store selection, size guards, and blob upload orchestration. The
Repositoryabstraction already exists but is bypassed.Fix:
Repository.insert_dataset()andRepository.write_samples()do the real work. Index methods become ~20-line dispatchers that resolve the target repo and delegate. Internal complexity drops without changing the public API.7. Fix Redis label resolution (returns random, not latest)
File:
src/atdata/providers/_redis.py:159-181When no version is specified,
scan_iterreturns entries in hash-table order. The "latest" label is whichever entry happens to come last — effectively random. This is a correctness bug.Fix: Fetch all candidates, sort by version (semver) or timestamp, return the max.
8. Fix Redis N+1 query in
iter_entriesFile:
src/atdata/providers/_redis.py:74-79Each entry is fetched with a separate
get_entry_by_cid()call. N entries = N round-trips.Fix: Use
MGETor Redis pipeline for batch retrieval.9. Consolidate
_schema_codecfield type mapping duplicationFile:
src/atdata/_schema_codec.py_field_type_to_python()and_field_type_to_stub_str()duplicate the same type mapping — one returns Python types, the other returns string representations. A new field type must be added to both.Fix: Single mapping table or
FieldTypeMapperclass that returns(python_type, stub_string).Larger Effort
10. Namespace
Atmospherefor discoverabilityFile:
src/atdata/atmosphere/client.pySplit internal implementation into
RecordOps,BlobOps,XrpcClient. Expose asatmo.records,atmo.blobs,atmo.xrpc. Keep common operations (list_datasets,search_datasets, auth properties) on the root. Preserve flat methods with deprecation warnings for one major version.11. Split
StubManagerdual responsibilityFile:
src/atdata/_stub_manager.pyHandles both schema stubs and lens stubs with duplicated naming, path construction, and write logic (
_module_filename/_lens_module_filename,_module_path/_lens_module_path). Extract a genericStubWriterbase or split intoSchemaStubManager/LensStubManager.Cleanup / Hygiene
12. Schedule deprecated method removal
File:
_index.py(7 deprecated methods),atmosphere/__init__.py,_sources.py, others18
DeprecationWarningsites across the codebase. Document a removal version (e.g.,# Removal: v1.0) in each. Consider batching the removal into a single PR for the next major version.13. Thread-safety of
Atmosphere._atproto_client_classlazy initFile:
src/atdata/atmosphere/client.pyModule-level global with lazy initialization, no lock. Two concurrent imports could race. Fix with
threading.Lockor move to module-level import.14.
_AtmosphereBackend._ensure_loaders()temporal couplingFile:
src/atdata/repository.pyEvery public method starts with
_ensure_loaders()— a hidden precondition. If a new method omits it, runtime failure. Fix by initializing eagerly in__init__or using__getattr__auto-init.Non-Issues (Noted for Future Reviewers)
These were flagged during the review but are intentional design choices, not problems:
Indexhas 42 public methods — it's a facade over 4 domains (datasets, schemas, labels, lenses). The flat API is the correct DX choice. Internal delegation (item 6 above) is the right fix, not splitting the public surface.Datasethas 32 public methods — this follows the pandas/HuggingFace pattern. Chainable, fluent API. Do not split.LensNetwork,_default_index,_type_cache,_logger) — all intentional. LensNetwork is a registry; the others have proper lifecycle management.except Exceptionin CLI modules (15 sites incli/) — CLI is the system boundary. Catching broadly and printing user-friendly errors is correct here.