# Agentic AI Data Science Stack
A system where you upload any CSV and get senior data scientist–style analysis: profiling, EDA, statistical structure, anomalies, modeling recommendations, and an executive summary — in seconds.
Vibe: "I just hired a data scientist for 10 seconds."
## What it does

- Profiles the dataset — dtypes, missing %, skewness, kurtosis, cardinality, class imbalance, leakage indicators, and a data health score.
- Runs intelligent EDA — correlation matrix, mutual information, PCA variance, IQR outliers, distribution fitting and transform suggestions (e.g. “Feature ‘income’ is highly right-skewed. Log transform recommended.”).
- Detects statistical structure — high correlations, multicollinearity hints, feature clustering (PCA).
- Identifies anomalies — Isolation Forest, Z-score, DBSCAN; summarizes e.g. “X% of data points exhibit high-leverage anomaly patterns.”
- Suggests modeling strategies — classification (Logistic, RF, XGBoost) or regression (Ridge, RF, LightGBM) or clustering (KMeans); cross-validation, feature importance, SHAP, overfitting detection.
- Explains in natural language — executive summary, business implications, risks, next steps (template-based, or via LLM if `OPENAI_API_KEY` is set).
- Cognitive flags — data leakage risk, Simpson’s paradox possibility, multicollinearity, high cardinality, small sample bias, feature dominance, overfitting risk.
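Two of the checks above can be sketched with the same libraries the backend lists (pandas, SciPy, scikit-learn). The thresholds and messages here are illustrative, not the app's actual logic:

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1.0, size=500),  # heavily right-skewed
    "age": rng.normal(40, 10, size=500),
})

# EDA-style transform suggestion: flag strongly right-skewed columns
for col in df.columns:
    skew = float(stats.skew(df[col]))
    if skew > 1.0:  # illustrative threshold, not the app's default
        print(f"Feature '{col}' is highly right-skewed (skew={skew:.2f}). "
              "Log transform recommended.")

# Anomaly summary: share of rows Isolation Forest marks as outliers
labels = IsolationForest(random_state=0).fit_predict(df)  # -1 = anomaly
pct = 100 * (labels == -1).mean()
print(f"{pct:.1f}% of data points exhibit anomaly patterns.")
```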
## Quick start (Streamlit UI)

From the project root:

```
pip install -r requirements.txt
streamlit run frontend/app.py
```

- Upload a CSV in the sidebar.
- Optionally set a target column for supervised modeling.
- Click Run full analysis.
- Switch between Executive (plain English) and Technical (full stats).
- Use the insight cards (expandable) for math and recommendations.
## Backend API

Run the backend directly:

```
cd backend && uvicorn app.main:app --reload
```

`POST /api/analyze` with form data: `file` (CSV) and optional `target_column`. Returns the full JSON report: `profile`, `statistical`, `modeling`, `anomaly`, `cognitive_flags`, `executive_summary`.
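A minimal client sketch for that endpoint. The field names (`file`, `target_column`) and response keys come from the section above; the host and port assume uvicorn's defaults, and `requests` is a third-party dependency you would install separately:

```python
import io

# Top-level sections documented in the /api/analyze response
EXPECTED_KEYS = {
    "profile", "statistical", "modeling", "anomaly",
    "cognitive_flags", "executive_summary",
}

def build_payload(csv_bytes, target_column=None):
    """Assemble the multipart form data for POST /api/analyze."""
    files = {"file": ("data.csv", io.BytesIO(csv_bytes), "text/csv")}
    data = {"target_column": target_column} if target_column else {}
    return files, data

files, data = build_payload(b"x,y,label\n1,2,0\n3,4,1\n", target_column="label")

# With the backend running (uvicorn app.main:app --reload), send it with
# `requests` (pip install requests):
#   import requests
#   resp = requests.post("http://127.0.0.1:8000/api/analyze",
#                        files=files, data=data)
#   result = resp.json()
#   assert EXPECTED_KEYS <= set(result)
```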
## LLM configuration

For an LLM-generated executive summary instead of the template:

- Copy `.env.example` to `.env`.
- Set `OPENAI_API_KEY=your_key` (and optionally `OPENAI_API_BASE`, `OPENAI_MODEL`).
If the key is not set, the app still runs and uses a template-based summary.
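A minimal `.env` sketch. The variable names come from the section above; the values shown are placeholders, not defaults shipped with this project:

```
OPENAI_API_KEY=your_key
# Optional overrides:
# OPENAI_API_BASE=https://api.openai.com/v1
# OPENAI_MODEL=gpt-4o-mini
```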
## Project structure

```
├── backend/
│   └── app/
│       ├── main.py                   # FastAPI app
│       ├── api/routes.py             # POST /api/analyze
│       ├── agents/
│       │   ├── profiler.py           # Data profiler + health score
│       │   ├── statistical.py        # Correlation, PCA, MI, outliers, distributions
│       │   ├── modeling.py           # Classification / regression / clustering + SHAP
│       │   ├── anomaly.py            # Isolation Forest, Z-score, DBSCAN
│       │   ├── cognitive_flags.py    # Leakage, Simpson, multicollinearity, etc.
│       │   └── insight_generator.py  # Executive summary (template or LLM)
│       ├── core/config.py
│       └── schemas/
├── frontend/
│   └── app.py                        # Streamlit UI (Data Story + Technical mode, insight cards)
├── requirements.txt
├── .env.example
└── README.md
```
## Tech stack

- Backend: FastAPI, Pandas, NumPy, SciPy, scikit-learn, XGBoost, LightGBM, SHAP.
- Frontend: Streamlit, Plotly.
- Optional: OpenAI (or compatible) API for natural-language executive summary.
## Feature highlights

- Cognitive flags — leakage, Simpson’s paradox, multicollinearity, high cardinality, small sample bias, feature dominance, overfitting.
- Data Story Mode — toggle: Technical (full stats) vs Executive (plain English).
- Interactive insight cards — each flag expandable with recommendation and math explanation.
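As an illustration of one cognitive flag, here is a bare-bones multicollinearity check using a plain pandas correlation matrix. The 0.95 cutoff is a made-up threshold for the sketch, not the app's internal rule:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=300)
df = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.01, size=300),  # near-duplicate of "a"
    "c": rng.normal(size=300),                       # independent
})

# Flag every column pair whose absolute correlation exceeds the cutoff
corr = df.corr().abs()
flags = [(i, j, corr.loc[i, j])
         for i in corr.columns for j in corr.columns
         if i < j and corr.loc[i, j] > 0.95]  # illustrative cutoff

for i, j, r in flags:
    print(f"Cognitive flag: '{i}' and '{j}' are highly correlated "
          f"(r={r:.2f}), possible multicollinearity.")
```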
## Extending

Possible extensions: Bayesian inference, drift detection, fairness metrics, or LangGraph for multi-agent reasoning.