Predicting residential house prices using machine learning models in R.
An end-to-end house price prediction project built with R and tidymodels.
This repository includes a fully rendered HTML report that demonstrates the complete analysis, visualizations, and final model results in a clear and reproducible way.
The aim of this project was to apply everything I learnt about machine learning in R to a single, realistic problem. Instead of focusing on theory alone, I wanted to build a full, practical workflow that mirrors how predictive models are used in real life.
Using the Ames Housing dataset, I built and compared multiple models to predict residential house prices. The project covers the full machine learning pipeline, from exploratory data analysis to feature engineering, model training, tuning, evaluation, and interpretation.
I undertook this project to strengthen my applied skills in R, machine learning, and statistical modeling, building on university coursework and self-study.
- Dataset: Ames Housing Dataset
- Source: https://www.kaggle.com/datasets/prevek18/ames-housing-dataset
- Description: Detailed information on residential properties in Ames, Iowa.
House prices are in US dollars ($); lengths are in feet (ft) and areas in square feet.
The model's inputs and outputs can be converted to South African units (rand and metres) when required.
- Target Variable: Sale price
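
As a rough illustration of the unit adaptation mentioned above, the sketch below converts a metric floor area into the square feet the model expects, and a dollar prediction into rand. The function names are hypothetical, and the exchange rate is a placeholder value, not a real quote.

```r
# Length and area factors are exact by definition.
FT_TO_M     <- 0.3048       # 1 foot in metres
SQFT_TO_SQM <- FT_TO_M^2    # 1 square foot in square metres (0.09290304)

# Placeholder exchange rate (assumption): replace with a current USD/ZAR rate.
usd_to_zar_rate <- 18

# Convert an area in square metres to square feet (model input).
sqm_to_sqft <- function(area_sqm) area_sqm / SQFT_TO_SQM

# Convert a predicted price in US dollars to rand (model output).
usd_to_rand <- function(price_usd, rate = usd_to_zar_rate) price_usd * rate

sqm_to_sqft(150)      # a 150 square metre house, expressed in square feet
usd_to_rand(200000)   # a $200,000 prediction, expressed in rand
```
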
The dataset contains a rich mix of numeric and categorical variables describing house size, quality, location, and condition, which makes it well suited to real-world price prediction.
I built this project to:
- Apply machine learning concepts learnt in university coursework and self-study
- Practice using R for real-world predictive modeling
- Gain hands-on experience with tidymodels and model workflows
- Move beyond theory and demonstrate a complete end-to-end implementation
This project combines ideas from machine learning, statistics, and data science into one coherent analysis.
The analysis follows a clear and structured pipeline:
- Loaded and cleaned the raw dataset in R
- Performed exploratory data analysis to understand distributions and relationships
- Handled missing values and categorical levels carefully
- Removed extreme outliers to improve model stability
- Applied log transformation to the target variable
- Engineered features using a tidymodels recipe
- Split the data into training and test sets
- Trained models using cross-validation
- Tuned hyperparameters for tree-based models
- Compared models using RMSE and R squared
- Evaluated final performance on unseen test data
- Visualized predictions and residuals to assess model fit
- Demonstrated how the model can be used for real world predictions
All steps are fully reproducible and documented using RMarkdown.
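
The core of the pipeline above can be sketched in a few lines of tidymodels code. This is a minimal sketch, not the code from the report: it assumes the processed Ames data shipped with the `modeldata` package stands in for the Kaggle CSV, and the predictors, recipe steps, split proportion, and fold count are illustrative choices.

```r
library(tidymodels)

# Assumption: modeldata's processed Ames data stands in for the Kaggle CSV.
data(ames, package = "modeldata")

# Log-transform the target so errors are judged on a relative scale.
ames_log <- ames |> mutate(Sale_Price = log10(Sale_Price))

# Hold out 20% of the rows as an unseen test set, stratified on price.
set.seed(123)
ames_split <- initial_split(ames_log, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# Feature engineering as a recipe (illustrative predictors and steps).
ames_rec <- recipe(
  Sale_Price ~ Gr_Liv_Area + Year_Built + Neighborhood + Bldg_Type,
  data = ames_train
) |>
  step_other(Neighborhood, threshold = 0.01) |>  # pool rare neighbourhoods
  step_dummy(all_nominal_predictors()) |>        # one-hot style encoding
  step_zv(all_predictors())                      # drop constant columns

# 5-fold cross-validation on the training set only.
set.seed(123)
ames_folds <- vfold_cv(ames_train, v = 5, strata = Sale_Price)

# Bundle recipe and model so both are re-estimated inside every fold.
lm_wflow <- workflow() |>
  add_recipe(ames_rec) |>
  add_model(linear_reg() |> set_engine("lm"))

lm_cv <- fit_resamples(
  lm_wflow,
  resamples = ames_folds,
  metrics = metric_set(rmse, rsq)
)

# Cross-validated RMSE and R squared, on the log10 price scale.
collect_metrics(lm_cv)
```

Keeping the recipe inside the workflow means preprocessing is learned from each fold's analysis data only, which avoids leaking information from the held-out rows.
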
The full analysis, visualizations, and model results can be viewed here:
https://joshuakohlmeyer.github.io/House-Price-Prediction/
- Linear Regression
- Random Forest
- XGBoost
Each model was trained using cross-validation and evaluated consistently to ensure a fair comparison.
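
One way to set up such a comparison in tidymodels is sketched below. It is a hedged illustration rather than the report's code: the data source (`modeldata::ames`), the predictors, the fixed values (`trees`, `learn_rate`), the tuned parameters, and the grid size are all assumptions, and `make_wflow()` is a hypothetical helper. The `ranger` and `xgboost` packages must be installed.

```r
library(tidymodels)

# Assumption: modeldata's Ames data and a few illustrative predictors.
data(ames, package = "modeldata")
ames_log <- ames |> mutate(Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split <- initial_split(ames_log, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_folds <- vfold_cv(ames_train, v = 5, strata = Sale_Price)

ames_rec <- recipe(
  Sale_Price ~ Gr_Liv_Area + Year_Built + Neighborhood + Bldg_Type,
  data = ames_train
) |>
  step_other(Neighborhood, threshold = 0.01) |>
  step_dummy(all_nominal_predictors())

# One specification per model; tune() marks values chosen by resampling.
lm_spec <- linear_reg() |> set_engine("lm")

rf_spec <- rand_forest(trees = 500, min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("regression")

xgb_spec <- boost_tree(
  trees = 300, learn_rate = 0.05,
  tree_depth = tune(), min_n = tune()
) |>
  set_engine("xgboost") |>
  set_mode("regression")

# Hypothetical helper: same recipe for every model.
make_wflow <- function(spec) {
  workflow() |> add_recipe(ames_rec) |> add_model(spec)
}

# Same folds and same metrics for every model keeps the comparison fair.
reg_metrics <- metric_set(rmse, rsq)

set.seed(123)
lm_res <- fit_resamples(make_wflow(lm_spec),
                        resamples = ames_folds, metrics = reg_metrics)
rf_res <- tune_grid(make_wflow(rf_spec),
                    resamples = ames_folds, grid = 5, metrics = reg_metrics)
xgb_res <- tune_grid(make_wflow(xgb_spec),
                     resamples = ames_folds, grid = 5, metrics = reg_metrics)

# Cross-validated results: the linear baseline, then the best candidate
# of each tuned model by RMSE.
collect_metrics(lm_res)
show_best(rf_res, metric = "rmse", n = 1)
show_best(xgb_res, metric = "rmse", n = 1)
```
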
- The XGBoost model achieved the strongest performance
- Test set R squared ≈ 0.81
- The model explains about 81 percent of the variation in house prices
- Prediction plots show strong alignment between actual and predicted prices
- Residual diagnostics indicate a well-behaved model with no major systematic issues
Overall, the final model performs well on unseen test data and produces realistic price estimates.
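
For readers unfamiliar with the two metrics, the sketch below computes RMSE and R squared by hand in base R. The function names are hypothetical and the numbers are toy values made up for illustration, not results from the report. It uses the "variance explained" definition of R squared; in yardstick that corresponds to `rsq_trad()`, whereas `rsq()` is the squared correlation between truth and prediction, so the two can differ slightly.

```r
# Toy values, made up for illustration (not results from the report).
actual    <- c(200000, 150000, 320000, 275000, 180000)
predicted <- c(210000, 140000, 300000, 280000, 190000)

# Root mean squared error: the typical size of a prediction error,
# in the same units as the target.
calc_rmse <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# R squared: 1 minus the ratio of residual variation to total variation.
calc_rsq <- function(truth, estimate) {
  ss_res <- sum((truth - estimate)^2)
  ss_tot <- sum((truth - mean(truth))^2)
  1 - ss_res / ss_tot
}

calc_rmse(actual, predicted)
calc_rsq(actual, predicted)
```

An R squared of 0.81 means the residual variation is about 19 percent of the total variation in the target.
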
Below are selected outputs from the final HTML report that demonstrate model performance and interpretability.
Predicted vs actual sale prices for the XGBoost model.
Residual diagnostics showing no major systematic patterns.
This project demonstrates how a trained machine learning model can be used to:
- Predict the market value of a house based on its features
- Understand how changes in inputs affect price
- Support pricing decisions in a real estate context
- Translate statistical models into actionable insights
The same workflow can be applied to many real-world prediction problems beyond housing data.
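
A minimal sketch of this kind of use is shown below: fit a model, then price two hypothetical houses that differ only in living area. It assumes `modeldata::ames` in place of the Kaggle CSV, three numeric predictors chosen for brevity, and untuned XGBoost settings, so it illustrates the mechanics rather than the report's final model.

```r
library(tidymodels)

# Assumption: modeldata's Ames data, with a log10-transformed target.
data(ames, package = "modeldata")
ames_log <- ames |> mutate(Sale_Price = log10(Sale_Price))

# Untuned XGBoost model on three numeric predictors (illustrative only).
xgb_spec <- boost_tree(trees = 200) |>
  set_engine("xgboost") |>
  set_mode("regression")

set.seed(123)
xgb_fit <- workflow() |>
  add_formula(Sale_Price ~ Gr_Liv_Area + Year_Built + Lot_Area) |>
  add_model(xgb_spec) |>
  fit(data = ames_log)

# Two hypothetical houses that differ only in living area (square feet).
new_houses <- tibble(
  Gr_Liv_Area = c(1500L, 2000L),
  Year_Built  = c(2000L, 2000L),
  Lot_Area    = c(9000L, 9000L)
)

# Predictions come back on the log10 scale, so undo the transformation
# to read them as dollar amounts.
preds <- predict(xgb_fit, new_data = new_houses) |>
  mutate(price_usd = 10^.pred)

bind_cols(new_houses, preds)
```

Comparing the two rows gives a simple what-if reading of how the predicted price responds to one input while the others are held fixed.
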
- R
- tidyverse
- ggplot2
- tidymodels
- xgboost
- ranger
- RMarkdown

All results are fully reproducible: the project can be rerun end to end by knitting the RMarkdown file.
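
Knitting can also be done from the R console rather than the RStudio Knit button. The snippet below is a sketch; the file name is hypothetical and should be replaced with the `.Rmd` file in this repository.

```r
# Hypothetical file name (assumption): point this at the repository's .Rmd file.
rmd_file <- "house_price_prediction.Rmd"

if (file.exists(rmd_file)) {
  # Re-runs every chunk and writes the HTML report next to the source file.
  rmarkdown::render(rmd_file, output_format = "html_document")
} else {
  message("Could not find ", rmd_file, "; set rmd_file to the report's path.")
}
```
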
If you would like to connect or give feedback on this project (much appreciated):




