A comprehensive collection of fundamental Machine Learning algorithms implemented from scratch and using Scikit-Learn. This repository serves as a practical guide for understanding the mathematical foundations, intuition, and implementation details of various supervised learning models.
| Regression (Predict a Number) | Classification (Predict a Category) |
|---|---|
| • Linear Regression | • Logistic Regression |
| • Multiple Linear Regression | • Support Vector Machine (SVM) |
| • Polynomial Regression | • Naive Bayes |
| • Support Vector Regression (SVR) | • K-Nearest Neighbors (KNN) |
| • KNN Regression | • Decision Tree Classification |
| • Decision Tree Regression | |
| Type | Algorithm | Key Idea |
|---|---|---|
| Regression | Linear Regression | Best-fit straight line |
| Regression | Multiple Linear Regression | Multiple features → one target |
| Regression | Polynomial Regression | Curved best-fit line |
| Regression | SVR | Fit within error margin |
| Regression | KNN Regression | Average of nearest neighbors |
| Regression | Decision Tree Reg. | Data splitting based on MSE |
| Classification | Logistic Regression | Probability of class membership |
| Classification | SVM | Maximum margin hyperplane |
| Classification | Naive Bayes | Bayes' Theorem + independence |
| Classification | KNN Classification | Majority vote of neighbors |
| Classification | Decision Tree Class. | Data splitting based on Gini/Entropy |
- Goal: Predict a continuous numerical value (e.g., price, salary, temperature).
- Finds the best-fit straight line through the data to predict a continuous target from a single feature.
ŷ = β₀ + β₁x
Where:
- ŷ = Predicted value
- β₀ = Intercept (y-axis crossing)
- β₁ = Slope (rate of change)
- x = Input feature
y
│
│ •
│ •
│ •
│ •
│ •─────────────── ← Regression line (ŷ = β₀ + β₁x)
│ │
│ │ residual (error)
│ •
│ •
└────────────────────────── x
- Linear relationship between feature and target
- Single input feature
- Quick baseline model
- Cannot capture curved or complex patterns
- Sensitive to outliers
- Assumes constant variance of errors (homoscedasticity)
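A minimal sketch of fitting this line with scikit-learn; the toy data points are invented for illustration (roughly y = 2x + 1 plus noise):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented toy data: y ≈ 2x + 1 with a little noise
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

model = LinearRegression()
model.fit(X, y)

# model.intercept_ is β₀, model.coef_[0] is β₁
print(f"Intercept (β₀): {model.intercept_:.2f}")
print(f"Slope (β₁): {model.coef_[0]:.2f}")
print(f"Prediction for x=6: {model.predict([[6]])[0]:.2f}")
```

The fitted coefficients recover the line the data was generated from, which is exactly the ŷ = β₀ + β₁x form above.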
- Extends linear regression to multiple independent variables, predicting outcomes based on several features simultaneously.
ŷ = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ... + βₙxₙ
- Salary = β₀ + β₁(Experience) + β₂(Education) + β₃(Age)
- Multiple features influence the target
- Features are not highly correlated with each other
| Issue | What It Means | Solution |
|---|---|---|
| Multicollinearity | Features are correlated with each other | Remove or combine correlated features |
| Irrelevant features | Noise features hurt performance | Feature selection (backward elimination) |
| Different scales | Large-valued features dominate | Apply feature scaling |
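The salary example above can be sketched with scikit-learn; the numbers are invented purely to illustrate the multi-feature form:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: columns = [experience (years), education (years), age]
X = np.array([
    [1, 12, 22], [3, 16, 27], [5, 16, 30],
    [7, 18, 35], [10, 16, 40], [12, 20, 45],
])
y = np.array([35000, 50000, 62000, 75000, 85000, 100000])  # salary

model = LinearRegression().fit(X, y)
print("Intercept (β₀):", model.intercept_)
print("Coefficients (β₁..β₃):", model.coef_)  # one per feature
r2 = model.score(X, y)
print(f"R² on training data: {r2:.3f}")
```

Each coefficient reads as "salary change per unit of that feature, holding the others fixed", which is why correlated features (multicollinearity) make the individual coefficients unreliable.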
- Models non-linear (curved) relationships by transforming features into polynomial terms while still using a linear model framework.
ŷ = β₀ + β₁x + β₂x² + β₃x³ + ... + βₙxⁿ
- Degree 1: Straight line ──────
- Degree 2: Parabola ╱╲
- Degree 3: S-curve ╱╲╱
- Data shows a curved / non-linear trend
- Linear regression gives poor results (low R²)
- Overfitting with high-degree polynomials — the model memorizes noise
- Always validate with a test set or cross-validation
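One way to sketch this in scikit-learn is `PolynomialFeatures` feeding a plain `LinearRegression`; the quadratic toy data here is invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Invented curved data: y = x² with slight noise
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.default_rng(0).normal(0, 0.1, 20)

# degree=2 expands x into [1, x, x²]; the regression itself stays linear
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
r2 = model.score(X, y)
print(f"R² with degree-2 features: {r2:.3f}")
```

A degree-1 fit on the same data would score far worse, which is the "linear regression gives poor results" signal described above.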
- Uses Support Vector Machines to predict continuous values by fitting data within an ε-insensitive tube (error margin).
- Points ON or OUTSIDE the tube = Support Vectors (they define the model)
- Points INSIDE the tube = No penalty
y
│ / ----------------- Upper Boundary
│ / • • /
│ /───────────/ <--------- Regression Line
│ / • • /
│ / ----------------- Lower Boundary (Tube width = 2ε)
└────────────────────────── x
- Data has outliers (SVR is robust to them)
- Non-linear relationships (with RBF / polynomial kernels)
- Medium-sized datasets
| Parameter | Role |
|---|---|
| `C` | Regularization: trade-off between error and margin width |
| `ε` (epsilon) | Width of the insensitive tube |
| `kernel` | Transformation function (`linear`, `rbf`, `poly`) |
- SVR requires feature scaling — always standardize your features before training.
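A minimal sketch that bakes the required scaling into a pipeline; the sine-wave data and the `C`/`ε` values are illustrative choices, not tuned settings:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Invented non-linear data: y = sin(x) with noise
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

# StandardScaler runs first, so the SVR never sees unscaled features
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10, epsilon=0.1))
model.fit(X, y)
r2 = model.score(X, y)
print(f"R²: {r2:.3f}")
```

Wrapping the scaler in the pipeline also prevents a common bug: fitting the scaler on test data and leaking information into evaluation.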
- Predicts a value based on the average (or weighted average) of the `k` most similar neighboring data points.
- Prediction = Average of neighbor values = (y₁ + y₂ + y₃) / 3
y
│ • (Actual)
│ /
│ •───○─────• <-- Local average (Prediction)
│ / k=3
│ •
└────────────────────────── x
- Non-linear relationships
- Small to medium datasets
- When you need a simple, intuitive model
- Tip: Iterate through k = 1 to 20, plot accuracy, and select the optimal value.
- Requires feature scaling — distance-based algorithm
- Slow on large datasets (computes distances to every point)
- Sensitive to irrelevant features
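The "average of neighbor values" idea above can be sketched in a few lines; the 1-D data is invented for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Invented 1-D data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 5.9])

# k=3: the prediction is the mean of the 3 nearest targets
model = KNeighborsRegressor(n_neighbors=3)
model.fit(X, y)
pred = model.predict([[3.5]])[0]
print(f"Prediction at x=3.5: {pred:.2f}")
```

Passing `weights="distance"` to `KNeighborsRegressor` switches the plain average to the weighted average mentioned above.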
- Predicts a continuous value by splitting the dataset into smaller subsets (leaves) based on feature thresholds, forming a tree-like structure of decisions.
- Recursive Partitioning: The algorithm splits data where it reduces the Mean Squared Error (MSE) the most.
- Leaf Nodes: The final prediction is the average value of all training points that fall into that specific leaf.
y
│ _______ (Avg of Region 3)
│ |
│ _______| (Avg of Region 2)
│ |
│______| (Avg of Region 1)
└──────┬───────┬─────────── x
Split 1 Split 2
- Non-linear and complex datasets
- When you need a model that handles both numerical and categorical data without much preprocessing
- No feature scaling required
- High Risk of Overfitting: A tree can grow deep enough to memorize every data point.
- Instability: Small changes in data can lead to a completely different tree structure.
- Solution: Limit max_depth or use "Pruning."
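A small sketch of the region-averaging behavior, with `max_depth` used as the growth limit suggested above; the step-shaped data is invented for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Invented step-like data; the tree predicts a constant per region
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1])

# max_depth=2 caps the tree at 4 leaves, limiting overfitting
model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(X, y)
pred = model.predict([[4.5]])[0]
print(f"Prediction at x=4.5: {pred:.2f}")  # the average of that leaf's region
```

Every x in the same region gets the identical prediction, which is why a regression tree's output looks like the staircase in the diagram above.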
- Goal: Predict a discrete category/class (e.g., spam/not spam, disease/healthy, yes/no).
- A foundational binary classification algorithm that estimates the probability of class membership using the sigmoid function.
1
P(y = 1) = ─────────────
1 + e^(-(β₀ + β₁x))
Output: Probability between 0 and 1
Decision Rule: If P ≥ 0.5 → Class 1, else → Class 0
P(y=1)
1.0 │ ─────────
│ /
0.5 │ · · · · · · / · · · · · · ← Decision boundary
│ /
0.0 │──────────────
└──────────────────────── x
- Binary classification (two classes)
- Linearly separable data
- When you need probability estimates
- Fast, interpretable baseline model
- Assumes a linear decision boundary
- Struggles with complex, non-linear patterns
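The sigmoid output and the 0.5 decision rule can be sketched as follows; the 1-D data (class 0 at small x, class 1 at large x) is invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented 1-D data: class 0 at small x, class 1 at large x
X = np.array([[1], [2], [3], [4], [6], [7], [8], [9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(y=0), P(y=1)]; predict applies the P ≥ 0.5 rule
p = model.predict_proba([[5]])[0, 1]
print(f"P(y=1 | x=5) = {p:.2f}")
print("Predicted class at x=8:", model.predict([[8]])[0])
```

Near the decision boundary (x = 5 here) the probability sits around 0.5, exactly where the sigmoid in the diagram crosses the dotted line.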
- Finds the optimal hyperplane that maximizes the margin between different classes. Supports linear and non-linear (kernel-based) classification.
Class A: ○ Class B: ●
○ ○ ● ●
○ ○ ┃ ● ●
○ ○ ┃ ● ● ← Maximum margin hyperplane
○ ◁┃▷ ●
○ ○ ┃ ● ●
○ ○ ┃ ● ●
┃
◁────┃────▷
MARGIN (maximized)
◁▷ = Support Vectors
- For non-linearly separable data, kernels transform features into a higher-dimensional space where a linear boundary can be found.
| Kernel | Use Case | Boundary |
|---|---|---|
| `linear` | Linearly separable data | Straight line / plane |
| `rbf` (Gaussian) | Most non-linear problems | Flexible curved boundary |
| `poly` | Polynomial boundaries | Curved with degree control |
- High-dimensional data
- Clear margin of separation
- Binary or multi-class classification
- SVM requires feature scaling — always standardize before training.
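A minimal sketch with the scaling requirement handled by a pipeline; the Iris dataset and `C=1.0` are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# StandardScaler first, then the RBF-kernel SVM
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(f"Test accuracy: {acc:.3f}")
```

Swapping `kernel="rbf"` for `"linear"` or `"poly"` is all it takes to try the other rows of the kernel table above.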
- A probabilistic classifier based on Bayes’ Theorem with the assumption that features are conditionally independent given the class.
P(X | C) · P(C)
P(C | X) = ─────────────────────
P(X)
Where:
- P(C | X) → Posterior probability
- P(X | C) → Likelihood
- P(C) → Prior probability
- P(X) → Evidence
Is this email SPAM?
Features:
- Contains "free"
- Contains "winner"
- Length > 100
We compute:
P(Spam | features) vs P(Not Spam | features)
Choose the class with higher probability ✅
| Variant | Feature Type | Example |
|---|---|---|
| Gaussian NB | Continuous | Age, salary |
| Multinomial NB | Count data | Word frequency |
| Bernoulli NB | Binary (0/1) | Word presence |
- Text classification (spam, sentiment)
- Small datasets
- Real-time prediction (very fast)
- Independence assumption rarely true
- Complex models may outperform on large datasets
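The spam example above maps naturally to Multinomial NB over word counts; the four-message corpus here is invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus; Multinomial NB suits word-count features
texts = [
    "free winner prize claim now",
    "free money winner click",
    "meeting agenda for tomorrow",
    "lunch with the project team",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# CountVectorizer builds the word-count features P(X | C) is computed from
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

pred = model.predict(["free prize winner"])[0]
print("spam" if pred == 1 else "not spam")
```

`model.predict_proba` exposes the competing posteriors directly, mirroring the P(Spam | features) vs P(Not Spam | features) comparison above.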
- Classifies data points based on the majority vote of their `k` nearest neighbors.
○ ○ ●
○ ◎ ● ●
◎ ★ ◎ ○ = Class A
○ ● ● ● = Class B
○ ● ◎ = Neighbors
k = 3 → 2 Class A, 1 Class B
Prediction = Class A ✅
| k Value | Effect |
|---|---|
| Too small (k=1) | Overfitting |
| Too large (k=50) | Underfitting |
| Optimal | Best validation accuracy |
```python
from sklearn.neighbors import KNeighborsClassifier

# Assumes X_train, X_test, y_train, y_test from an earlier train-test split
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    print(f"k={k}: Accuracy = {accuracy:.4f}")
```
- Non-linear decision boundaries
- Small to medium datasets
- Multi-class classification
- Requires feature scaling
- Slow on large datasets
- Curse of dimensionality
- Recursively splits the dataset into smaller and smaller subsets while incrementally building the corresponding decision tree.
- Splitting Criteria: Uses Gini Impurity or Entropy (Information Gain) to determine the best feature to split on at each node.
- Leaf Nodes: The final prediction is the majority class of the samples in that leaf.
Is Age > 30?
├── Yes: Is Income > $50k?
│ ├── Yes: Class A (Buyer)
│ └── No: Class B (Non-Buyer)
└── No: Class B (Non-Buyer)
- Clear, rule-based decision making
- Non-linear relationships between features
- When interpretability is critical (you can see exactly why a choice was made)
- Can create very complex trees that do not generalize well (overfitting)
- Biased toward features with many levels/categories
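The Age/Income tree above can be sketched directly; the six customer rows are invented to mirror that example, and `max_depth=2` keeps the tree small:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented data mirroring the example: columns = [age, income in $k]
X = np.array([[25, 30], [45, 80], [35, 40], [50, 60], [23, 20], [40, 90]])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = Buyer, 0 = Non-Buyer

# criterion="gini" is the default; "entropy" is the alternative named above
model = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
model.fit(X, y)

# export_text prints the learned if/else rules, the interpretability win
print(export_text(model, feature_names=["age", "income"]))
pred = model.predict([[48, 70]])[0]
```

`export_text` is what makes the "you can see exactly why a choice was made" point concrete: the model is readable as plain threshold rules.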
┌─────────────────────────────────────────────────────────────┐
│ STANDARD WORKFLOW │
├──────────┬──────────────────────────────────────────────────┤
│ Step 1 │ Data Preprocessing │
│ │ • Handling missing values │
│ │ • Encoding categorical variables │
│ │ • Feature scaling │
├──────────┼──────────────────────────────────────────────────┤
│ Step 2 │ Train-Test Split │
│ │ • 80/20 split │
├──────────┼──────────────────────────────────────────────────┤
│ Step 3 │ Model Training │
│ │ • Fit model on training data │
├──────────┼──────────────────────────────────────────────────┤
│ Step 4 │ Hyperparameter Tuning │
│ │ • K selection (KNN) │
│ │ • Kernel selection (SVM) │
│ │ • Degree selection │
├──────────┼──────────────────────────────────────────────────┤
│ Step 5 │ Performance Evaluation │
│ │ • Regression → R², MSE, MAE │
│ │ • Classification → Accuracy, Confusion Matrix │
│ │ • Visualization │
└──────────┴──────────────────────────────────────────────────┘
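The workflow above can be sketched end to end in scikit-learn; the dataset and model here (breast-cancer data, logistic regression) are stand-in choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Steps 1-2: load data (no missing values here) and make the 80/20 split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 1 (scaling): fit the scaler on training data only, then transform both
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Step 3: train the model
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 5: evaluate with accuracy and a confusion matrix
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.3f}")
print(confusion_matrix(y_test, y_pred))
```

Step 4 (hyperparameter tuning) would slot in before the final fit, e.g. with `GridSearchCV` over `C` for logistic regression.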