Linear Regression on the Boston Housing Dataset

This repository demonstrates an end-to-end Linear Regression project built in Python using the Boston Housing dataset. The objective is to predict the median value of owner-occupied homes (MEDV) given various socio-economic, demographic, and housing-related features. The project includes dataset exploration, regression modeling, evaluation, and visualization to understand how these features influence housing prices.

Dataset

The Boston Housing dataset is a well-known dataset in machine learning, often used as a benchmark for regression tasks. It contains 506 observations and 14 variables (13 input features and 1 target variable).

Target variable:
- MEDV: Median value of owner-occupied homes (in $1000s).

Input features:

Feature	Description
CRIM	Per capita crime rate by town
ZN	Proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS	Proportion of non-retail business acres per town
CHAS	Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX	Nitric oxides concentration (parts per 10 million)
RM	Average number of rooms per dwelling
AGE	Proportion of owner-occupied units built before 1940
DIS	Weighted distances to five Boston employment centers
RAD	Index of accessibility to radial highways
TAX	Full-value property-tax rate per $10,000
PTRATIO	Pupil-teacher ratio by town
B	1000(Bk - 0.63)² where Bk is the proportion of Black residents
LSTAT	Percentage of the population classed as lower status

Workflow

Data Loading
- Loaded the dataset using pandas and inspected its structure with .head(), .info(), and .describe().
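
A minimal sketch of the loading step. The source of the data file is not stated here, so the helper name `load_housing` and the file name `boston.csv` are hypothetical; the demo reads two illustrative rows from an in-memory buffer so that it runs without any file on disk.

```python
import io

import pandas as pd

# The 13 input features followed by the target, as listed in the Dataset section.
COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]


def load_housing(source):
    """Read the housing table from a CSV path or file-like object.

    Assumes the CSV has a header row containing the column names above.
    """
    df = pd.read_csv(source)
    missing = set(COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return df


# Illustrative rows only; in practice pass a path such as "boston.csv"
# (hypothetical file name).
sample_csv = io.StringIO(
    ",".join(COLUMNS) + "\n"
    "0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0\n"
    "0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6\n"
)
df = load_housing(sample_csv)

print(df.head())      # first rows
df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics per column
```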
Exploratory Data Analysis (EDA)
- Checked for missing values.
- Visualized feature distributions and relationships with MEDV.
- Built a correlation heatmap to identify the most significant predictors.
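
The missing-value check and the correlation ranking behind the heatmap could look roughly like the sketch below. The data is a synthetic stand-in (not the real dataset), and the helper `correlation_with_target` is a hypothetical name introduced for illustration.

```python
import numpy as np
import pandas as pd


def correlation_with_target(df, target="MEDV"):
    """Return each feature's Pearson correlation with the target, strongest first."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic stand-in data: MEDV rises with RM, falls with LSTAT,
# and is unrelated to NOISE.
rng = np.random.default_rng(0)
rm = rng.normal(6.3, 0.7, 200)
lstat = rng.normal(12.0, 5.0, 200)
noise = rng.normal(0.0, 1.0, 200)
medv = 9.0 * rm - 0.6 * lstat + rng.normal(0.0, 2.0, 200)
df = pd.DataFrame({"RM": rm, "LSTAT": lstat, "NOISE": noise, "MEDV": medv})

# Missing values per column.
print(df.isnull().sum())

# Features ranked by absolute correlation with MEDV.
ranked = correlation_with_target(df)
print(ranked)

# seaborn.heatmap(df.corr(), annot=True) would draw the full matrix as a heatmap.
```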
Data Preprocessing
- Defined features (X) as the 13 independent variables and target (y) as MEDV.
- Split dataset into training (80%) and testing (20%) sets.
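
A sketch of the feature/target definition and the 80/20 split, using a random placeholder table with the dataset's shape (506 rows, 14 columns). The `random_state` value is an assumption; the seed actually used in the project is not stated.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

FEATURES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
            "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# Placeholder values with the same shape as the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(506, 14)), columns=FEATURES + ["MEDV"])

X = df[FEATURES]  # 13 independent variables
y = df["MEDV"]    # target

# test_size=0.2 holds out 20% of the rows; random_state (assumed) fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (404, 13) (102, 13)
```

With 506 rows, scikit-learn rounds the test share up, giving 102 test rows and 404 training rows.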
Model Training
- Applied LinearRegression from scikit-learn.
- Trained the model on the training set and extracted the intercept and coefficients.
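
As an illustration of the training step, the sketch below fits `LinearRegression` on synthetic data generated from known coefficients (an assumption made so the result can be checked) and reads back `intercept_` and `coef_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data: y = 5 + 2*x0 - 1*x1 + 0.5*x2 + small noise.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_train = 5.0 + X_train @ true_coef + rng.normal(0.0, 0.1, 400)

model = LinearRegression()
model.fit(X_train, y_train)

# One intercept, and one coefficient per feature.
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
```

Each coefficient is the change in the prediction for a one-unit increase in that feature with the others held fixed, which is how the fitted model can be read for feature influence.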
Evaluation
- Generated predictions on the test set.
- Measured model performance with:
  - Mean Absolute Error (MAE)
  - Mean Squared Error (MSE)
  - Root Mean Squared Error (RMSE)
  - R² Score
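
The four metrics can be computed with `sklearn.metrics` as sketched below. The helper `regression_report` is a hypothetical name, and the small arrays are toy values chosen so the results can be checked by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R² for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "R2": r2_score(y_true, y_pred),
    }


# Toy values, not model output.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

report = regression_report(y_true, y_pred)
print(report)  # MAE 0.5, MSE 0.375, RMSE ≈ 0.612, R2 ≈ 0.949
```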
Visualization
- Predicted vs. actual plot to show how closely predictions track the true values.
- Residual plots to analyze model errors.
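
Both diagnostic plots can be sketched with matplotlib as below. The function name `plot_diagnostics`, the synthetic predictions, and the headless "Agg" backend are assumptions made so the example runs anywhere; they are not taken from the project's script.

```python
import matplotlib

matplotlib.use("Agg")  # assumed headless backend; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np


def plot_diagnostics(y_true, y_pred):
    """Draw a predicted-vs-actual scatter and a residual plot side by side."""
    residuals = y_true - y_pred
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Points on the dashed diagonal are perfect predictions.
    ax1.scatter(y_true, y_pred, alpha=0.6)
    lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
    ax1.plot(lims, lims, "r--")
    ax1.set_xlabel("Actual MEDV")
    ax1.set_ylabel("Predicted MEDV")

    # Residuals should scatter around zero with no visible pattern.
    ax2.scatter(y_pred, residuals, alpha=0.6)
    ax2.axhline(0, color="r", linestyle="--")
    ax2.set_xlabel("Predicted MEDV")
    ax2.set_ylabel("Residual (actual - predicted)")

    fig.tight_layout()
    return fig


# Synthetic stand-in for test-set targets and predictions.
rng = np.random.default_rng(0)
y_true = rng.uniform(5.0, 50.0, 100)
y_pred = y_true + rng.normal(0.0, 4.6, 100)

fig = plot_diagnostics(y_true, y_pred)
# fig.savefig("diagnostics.png") would write the figure to disk (hypothetical file name).
```

A curved or funnel-shaped residual plot would suggest that a straight-line fit is missing structure in the data.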

Results

Mean Absolute Error (MAE): ~3.1
Mean Squared Error (MSE): ~21.5
Root Mean Squared Error (RMSE): ~4.6
R² Score: ~0.72

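The model explains about 72% of the variance in housing prices on the test set, and its predictions are off by roughly $3,100 on average (MAE of ~3.1, with MEDV measured in $1000s). While these results are reasonable for a plain linear model, more advanced techniques may improve them.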
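
As a quick consistency check on the reported figures, RMSE is the square root of MSE, and the two rounded values agree:

```latex
\mathrm{RMSE} = \sqrt{\mathrm{MSE}} \approx \sqrt{21.5} \approx 4.64
```

Because MEDV is in $1000s, this corresponds to a typical error of about $4,600, larger than the MAE because squaring weights big errors more heavily.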

Tools and Libraries

Python
pandas, numpy
matplotlib, seaborn
scikit-learn
