
feat: DataProfiler - foundational data-discovery layer for AStats agent. #2

Open
GauravSRC wants to merge 1 commit into m2b3:main from GauravSRC:feat/data-profiler

@GauravSRC
What this does

Implements the data auto-discovery and summarization layer described in the project.
The profiler outputs structured JSON that includes agent_hints (parametric/nonparametric/welch
routing), which the LangGraph agent (Phase 2) will consume for assumption-checked test selection.
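
As a rough illustration of the kind of routing agent_hints encodes, the sketch below maps normality and variance findings to one of the three routes. The function names, the dictionary keys, and the decision rule itself are assumptions made for illustration; they are not taken from astats/profiler/data_profiler.py.

```python
import json
from typing import Optional


def route_test(all_normal: bool, equal_variance: Optional[bool]) -> str:
    """Pick a route from assumption checks (hypothetical rule, not the PR's).

    equal_variance is None when no group column was given, so Levene's
    test was never run.
    """
    if not all_normal:
        return "nonparametric"
    if equal_variance is False:
        return "welch"
    return "parametric"


def build_agent_hints(all_normal: bool, equal_variance: Optional[bool]) -> dict:
    """Wrap the route in a JSON-serializable dict (key names are assumed)."""
    return {
        "agent_hints": {
            "all_normal": all_normal,
            "equal_variance": equal_variance,
            "routing": route_test(all_normal, equal_variance),
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_agent_hints(True, False), indent=2))
```

Treating a missing variance check as "parametric" is one possible default; a stricter profiler could fall back to "welch" whenever homogeneity is unknown.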

Modules added

  • astats/profiler/data_profiler.py - full profiler with normality, variance, outlier checks
  • examples/data/generate_sample.py - simulated dataset generator with known ground truth
  • tests/test_profiler.py - 8 pytest cases
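
For readers unfamiliar with the idea of a generator "with known ground truth", here is a minimal standard-library sketch. The column names, distributions, and the shape of the ground-truth record are hypothetical and do not describe what examples/data/generate_sample.py actually produces.

```python
import random


def generate_sample(n_per_group: int = 100, seed: int = 42):
    """Return (rows, ground_truth) for a two-group simulated dataset.

    Every name and parameter here is an illustrative assumption.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    rows = []
    # Group B gets a larger spread in "hetero_col" only.
    for group, hetero_sigma in (("A", 1.0), ("B", 3.0)):
        for _ in range(n_per_group):
            rows.append(
                {
                    "group": group,
                    "normal_col": rng.gauss(50.0, 5.0),
                    "hetero_col": rng.gauss(10.0, hetero_sigma),
                    "skewed_col": rng.expovariate(1.0),
                }
            )
    # What a profiler should ideally recover from the rows above.
    ground_truth = {
        "normal_col": {"normal_within_group": True, "equal_variance": True},
        "hetero_col": {"normal_within_group": True, "equal_variance": False},
        "skewed_col": {"normal_within_group": False, "equal_variance": True},
    }
    return rows, ground_truth
```

Because the generating distributions are fixed in code, an evaluation harness can compare the profiler's verdicts against ground_truth instead of relying on manual inspection.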

Next step

Phase 2: LangGraph agent that reads this profile and performs
assumption-checked statistical test selection and execution.

- Column-type detection (continuous, categorical, datetime)
- Normality: Shapiro-Wilk (n<=5000) or D'Agostino-Pearson
- Variance homogeneity: Levene's test (optional, via group_col)
- IQR-based outlier detection per numeric column
- Structured agent_hints JSON: parametric/nonparametric/welch routing
- Ground-truth simulated dataset generator for eval harness
- 8-test pytest suite covering shape, normality, outliers, hints, variance
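
The checks listed above can be sketched with NumPy and SciPy as follows. This is not the code from the PR: the function names, the 0.05 significance level, the 1.5 IQR multiplier, and the returned dictionary keys are all assumptions. Only the n<=5000 threshold for Shapiro-Wilk comes from the commit message.

```python
import numpy as np
from scipy import stats

SHAPIRO_MAX_N = 5000  # threshold stated in the commit message
ALPHA = 0.05          # assumed significance level


def _clean(values):
    """Convert to a float array and drop NaNs."""
    x = np.asarray(values, dtype=float)
    return x[~np.isnan(x)]


def check_normality(values, alpha=ALPHA):
    """Shapiro-Wilk for small samples, D'Agostino-Pearson otherwise."""
    x = _clean(values)
    if x.size <= SHAPIRO_MAX_N:
        name = "shapiro"
        stat, p = stats.shapiro(x)
    else:
        name = "dagostino_pearson"
        stat, p = stats.normaltest(x)
    return {"test": name, "p_value": float(p), "is_normal": bool(p > alpha)}


def check_variance_homogeneity(groups, alpha=ALPHA):
    """Levene's test across two or more groups of values."""
    stat, p = stats.levene(*[_clean(g) for g in groups])
    return {"test": "levene", "p_value": float(p), "equal_variance": bool(p > alpha)}


def iqr_outliers(values, k=1.5):
    """Count points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = _clean(values)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (x < lower) | (x > upper)
    return {"lower": float(lower), "upper": float(upper), "n_outliers": int(mask.sum())}
```

For example, iqr_outliers([1, 2, 3, 4, 5, 100]) yields fences of -1.5 and 8.5 and flags the single value 100.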

Foundation for Phase 2: LangGraph agent test-selection harness.