# Agentic AI Data Science Stack
A system where you upload any CSV and get senior data scientist–style analysis: profiling, EDA, statistical structure, anomalies, modeling recommendations, and an executive summary — in seconds.
Vibe: "I just hired a data scientist for 10 seconds."
## What it does

- Profiles the dataset — dtypes, missing %, skewness, kurtosis, cardinality, class imbalance, leakage indicators, and a data health score.
- Runs intelligent EDA — correlation matrix, mutual information, PCA variance, IQR outliers, distribution fitting and transform suggestions (e.g. “Feature ‘income’ is highly right-skewed. Log transform recommended.”).
- Detects statistical structure — high correlations, multicollinearity hints, feature clustering (PCA).
- Identifies anomalies — Isolation Forest, Z-score, DBSCAN; summarizes e.g. “X% of data points exhibit high-leverage anomaly patterns.”
- Suggests modeling strategies — classification (Logistic, RF, XGBoost) or regression (Ridge, RF, LightGBM) or clustering (KMeans); cross-validation, feature importance, SHAP, overfitting detection.
- Explains in natural language — executive summary, business implications, risks, next steps (template-based, or via LLM if `OPENAI_API_KEY` is set).
- Cognitive flags — data leakage risk, Simpson’s paradox possibility, multicollinearity, high cardinality, small sample bias, feature dominance, overfitting risk.
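Two of the checks above can be sketched with the same libraries the backend lists (pandas, SciPy, scikit-learn). The thresholds and messages here are illustrative, not the app's actual logic:

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1.0, size=500),  # heavily right-skewed
    "age": rng.normal(40, 10, size=500),
})

# EDA-style transform suggestion: flag strongly right-skewed columns
for col in df.columns:
    skew = float(stats.skew(df[col]))
    if skew > 1.0:  # illustrative threshold, not the app's default
        print(f"Feature '{col}' is highly right-skewed (skew={skew:.2f}). "
              "Log transform recommended.")

# Anomaly summary: share of rows Isolation Forest marks as outliers
labels = IsolationForest(random_state=0).fit_predict(df)  # -1 = anomaly
pct = 100 * (labels == -1).mean()
print(f"{pct:.1f}% of data points exhibit anomaly patterns.")
```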
## Quick start (Streamlit UI)

From the project root:

```
pip install -r requirements.txt
streamlit run frontend/app.py
```

- Upload a CSV in the sidebar.
- Optionally set a target column for supervised modeling.
- Click Run full analysis.
- Switch between Executive (plain English) and Technical (full stats).
- Use the insight cards (expandable) for math and recommendations.
## Backend API

Run the backend directly:

```
cd backend && uvicorn app.main:app --reload
```

`POST /api/analyze` with form data: `file` (CSV) and optional `target_column`. Returns the full JSON report: `profile`, `statistical`, `modeling`, `anomaly`, `cognitive_flags`, `executive_summary`.
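A minimal client sketch for that endpoint. The field names (`file`, `target_column`) and response keys come from the section above; the host and port assume uvicorn's defaults, and `requests` is a third-party dependency you would install separately:

```python
import io

# Top-level sections documented in the /api/analyze response
EXPECTED_KEYS = {
    "profile", "statistical", "modeling", "anomaly",
    "cognitive_flags", "executive_summary",
}

def build_payload(csv_bytes, target_column=None):
    """Assemble the multipart form data for POST /api/analyze."""
    files = {"file": ("data.csv", io.BytesIO(csv_bytes), "text/csv")}
    data = {"target_column": target_column} if target_column else {}
    return files, data

files, data = build_payload(b"x,y,label\n1,2,0\n3,4,1\n", target_column="label")

# With the backend running (uvicorn app.main:app --reload), send it with
# `requests` (pip install requests):
#   import requests
#   resp = requests.post("http://127.0.0.1:8000/api/analyze",
#                        files=files, data=data)
#   result = resp.json()
#   assert EXPECTED_KEYS <= set(result)
```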
## LLM configuration

For an LLM-generated executive summary instead of the template:

- Copy `.env.example` to `.env`.
- Set `OPENAI_API_KEY=your_key` (and optionally `OPENAI_API_BASE`, `OPENAI_MODEL`).
If the key is not set, the app still runs and uses a template-based summary.
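A minimal `.env` sketch. The variable names come from the section above; the values shown are placeholders, not defaults shipped with this project:

```
OPENAI_API_KEY=your_key
# Optional overrides:
# OPENAI_API_BASE=https://api.openai.com/v1
# OPENAI_MODEL=gpt-4o-mini
```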
## Project structure

```
├── backend/
│   └── app/
│       ├── main.py                   # FastAPI app
│       ├── api/routes.py             # POST /api/analyze
│       ├── agents/
│       │   ├── profiler.py           # Data profiler + health score
│       │   ├── statistical.py        # Correlation, PCA, MI, outliers, distributions
│       │   ├── modeling.py           # Classification / regression / clustering + SHAP
│       │   ├── anomaly.py            # Isolation Forest, Z-score, DBSCAN
│       │   ├── cognitive_flags.py    # Leakage, Simpson, multicollinearity, etc.
│       │   └── insight_generator.py  # Executive summary (template or LLM)
│       ├── core/config.py
│       └── schemas/
├── frontend/
│   └── app.py                        # Streamlit UI (Data Story + Technical mode, insight cards)
├── requirements.txt
├── .env.example
└── README.md
```
## Tech stack

- Backend: FastAPI, Pandas, NumPy, SciPy, scikit-learn, XGBoost, LightGBM, SHAP.
- Frontend: Streamlit, Plotly.
- Optional: OpenAI (or compatible) API for natural-language executive summary.
## Feature highlights

- Cognitive flags — leakage, Simpson’s paradox, multicollinearity, high cardinality, small sample bias, feature dominance, overfitting.
- Data Story Mode — toggle: Technical (full stats) vs Executive (plain English).
- Interactive insight cards — each flag expandable with recommendation and math explanation.
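As an illustration of one cognitive flag, here is a bare-bones multicollinearity check using a plain pandas correlation matrix. The 0.95 cutoff is a made-up threshold for the sketch, not the app's internal rule:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=300)
df = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.01, size=300),  # near-duplicate of "a"
    "c": rng.normal(size=300),                       # independent
})

# Flag every column pair whose absolute correlation exceeds the cutoff
corr = df.corr().abs()
flags = [(i, j, corr.loc[i, j])
         for i in corr.columns for j in corr.columns
         if i < j and corr.loc[i, j] > 0.95]  # illustrative cutoff

for i, j, r in flags:
    print(f"Cognitive flag: '{i}' and '{j}' are highly correlated "
          f"(r={r:.2f}), possible multicollinearity.")
```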
## Extending

Possible extensions: Bayesian inference, drift detection, fairness metrics, or LangGraph for multi-agent reasoning.