A Data Analysis course project that explores the main drivers behind hotel booking cancellations and produces actionable insights for the hospitality business.
Target variable: Is_Canceled (0 = not canceled, 1 = canceled)
Hotel booking cancellations cause revenue loss and make planning harder (room allocation, staffing, forecasting). This project answers:
Which factors are most associated with booking cancellations, and how can hotels use data-driven insights to reduce cancellation risk?
- Dataset name: Hotel Booking Demand
- Source: Kaggle (jessemostipak / hotel-booking-demand)
- Hotels: City Hotel & Resort Hotel (Portugal)
- Time period: 2015–2017
- Raw shape: ~119k rows × 32 columns (each row = one booking)
The dataset file used in this repo is included as hotel_bookings.csv.
- Handling missing values (median/mode/mean depending on the feature type)
- Removing low-quality / inconsistent records (e.g., invalid ADR values)
- Dropping highly-missing columns (e.g., Company)
- Standardizing column names and data types
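The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the notebook's actual code: the helper name `clean_bookings`, the 50% missing-value threshold, and the rule "negative ADR is invalid" are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for hotel_bookings.csv (values are made up).
raw = pd.DataFrame({
    "lead_time": [10, 200, 35, 7],
    "adr": [95.0, -5.0, 120.0, 80.0],
    "children": [0.0, 2.0, np.nan, 0.0],
    "country": ["PRT", "GBR", "PRT", None],
    "company": [np.nan, np.nan, np.nan, 40.0],
})


def clean_bookings(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Hypothetical helper: the threshold and ADR rule are assumptions."""
    out = df.copy()

    # Drop columns whose share of missing values exceeds the threshold.
    missing_share = out.isna().mean()
    out = out.drop(columns=missing_share[missing_share > max_missing].index)

    # Remove records with an invalid (here: negative) ADR.
    out = out[out["adr"] >= 0].copy()

    # Impute: median for numeric columns, mode for everything else.
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode().iloc[0])

    # Standardize column names to Title_Case (note: "adr" becomes "Adr").
    out.columns = [c.strip().title() for c in out.columns]
    return out.reset_index(drop=True)


cleaned = clean_bookings(raw)
print(cleaned)
```

On the real dataset the same function could be applied to `pd.read_csv("hotel_bookings.csv")`, with the threshold tuned to taste.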
- Distributions and relationships for key variables such as:
- Lead time, ADR (price), stay duration, guests
- Market segments, customer types, seasonality
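As a rough illustration of this kind of exploration, the sketch below compares cancellation rates across market segments and median lead time by cancellation status. The rows are invented for the example, and the column names `Lead_Time` and `Market_Segment` are assumed to match the standardized names used in the notebook.

```python
import pandas as pd

# Made-up bookings, only to show the shape of the analysis.
bookings = pd.DataFrame({
    "Is_Canceled": [1, 0, 1, 0, 0, 1, 0, 1],
    "Lead_Time": [200, 5, 150, 30, 12, 90, 3, 300],
    "Market_Segment": [
        "Groups", "Direct", "Groups", "Online TA",
        "Direct", "Online TA", "Direct", "Groups",
    ],
})

# Share of canceled bookings per market segment, highest first.
rate_by_segment = (
    bookings.groupby("Market_Segment")["Is_Canceled"]
    .mean()
    .sort_values(ascending=False)
)

# Median lead time for kept (0) vs. canceled (1) bookings.
lead_time_summary = bookings.groupby("Is_Canceled")["Lead_Time"].median()

print(rate_by_segment)
print(lead_time_summary)
```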
Examples of engineered features:
- Total_Nights = Stays_In_Weekend_Nights + Stays_In_Week_Nights
- Total_Guests = Adults + Children + Babies
- Season/long-stay indicators
- Revenue-related features
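A small sketch of how such features might be built with pandas. `Total_Nights` and `Total_Guests` follow the formulas listed above; the long-stay cutoff (more than 7 nights), the month-to-season mapping, and the names `Is_Long_Stay`, `Season`, and `Estimated_Revenue` are assumptions made for illustration and may differ from the notebook.

```python
import pandas as pd

# Made-up rows with the columns the formulas need.
bookings = pd.DataFrame({
    "Stays_In_Weekend_Nights": [0, 2, 4],
    "Stays_In_Week_Nights": [2, 5, 10],
    "Adults": [2, 2, 1],
    "Children": [0, 1, 0],
    "Babies": [0, 0, 1],
    "ADR": [100.0, 80.0, 60.0],
    "Arrival_Date_Month": ["July", "January", "October"],
})

bookings["Total_Nights"] = (
    bookings["Stays_In_Weekend_Nights"] + bookings["Stays_In_Week_Nights"]
)
bookings["Total_Guests"] = (
    bookings["Adults"] + bookings["Children"] + bookings["Babies"]
)

# Assumed cutoff: a stay longer than one week counts as a long stay.
bookings["Is_Long_Stay"] = (bookings["Total_Nights"] > 7).astype(int)

# Assumed mapping: meteorological seasons for the northern hemisphere.
SEASON_BY_MONTH = {
    "December": "Winter", "January": "Winter", "February": "Winter",
    "March": "Spring", "April": "Spring", "May": "Spring",
    "June": "Summer", "July": "Summer", "August": "Summer",
    "September": "Autumn", "October": "Autumn", "November": "Autumn",
}
bookings["Season"] = bookings["Arrival_Date_Month"].map(SEASON_BY_MONTH)

# Assumed revenue proxy: average daily rate times nights stayed.
bookings["Estimated_Revenue"] = bookings["ADR"] * bookings["Total_Nights"]

print(bookings[["Total_Nights", "Total_Guests", "Is_Long_Stay",
                "Season", "Estimated_Revenue"]])
```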
Multiple approaches were used to identify the most relevant predictors:
- Correlation analysis
- Lasso regression
- Recursive Feature Elimination (RFE) with Logistic Regression
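The three approaches can be combined along these lines. This is a sketch on synthetic data where only `Lead_Time` drives the target by construction; the feature names, the Lasso `alpha`, and the number of features kept by RFE are illustrative assumptions, not the settings used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic features; by construction only Lead_Time influences the target.
X = pd.DataFrame({
    "Lead_Time": rng.normal(size=n),
    "ADR": rng.normal(size=n),
    "Total_Nights": rng.normal(size=n),
    "Noise": rng.normal(size=n),
})
y = (X["Lead_Time"] + 0.3 * rng.normal(size=n) > 0).astype(int)

# 1) Correlation analysis: absolute correlation of each feature with y.
corr = X.corrwith(y).abs().sort_values(ascending=False)

# Scale features so Lasso and logistic-regression weights are comparable.
X_scaled = StandardScaler().fit_transform(X)

# 2) Lasso: features whose coefficient is not shrunk to zero are kept.
lasso = Lasso(alpha=0.05).fit(X_scaled, y)
lasso_kept = [c for c, w in zip(X.columns, lasso.coef_) if abs(w) > 1e-8]

# 3) RFE with logistic regression: repeatedly drop the weakest feature.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X_scaled, y)
rfe_kept = list(X.columns[rfe.support_])

print(corr)
print("Lasso kept:", lasso_kept)
print("RFE kept:", rfe_kept)
```

Features that all three methods agree on are usually the safest candidates to carry forward.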
To validate whether differences are statistically meaningful:
- Normality checks
- Mann–Whitney U test for numeric features
- Chi-square test for categorical features
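A minimal sketch of these tests using `scipy.stats`. The data is synthetic and deliberately built so that canceled bookings have longer lead times and more non-refundable deposits; `Deposit_Type` is used here only as an example categorical feature, and Shapiro-Wilk is one possible normality check (the notebook may use a different one).

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
n = 400
is_canceled = rng.integers(0, 2, size=n)

# Synthetic, right-skewed lead times; canceled bookings get a larger scale.
lead_time = rng.exponential(scale=np.where(is_canceled == 1, 120.0, 60.0))

# Synthetic categorical feature that depends on the cancellation flag.
p_non_refund = np.where(is_canceled == 1, 0.6, 0.1)
deposit = np.where(rng.random(n) < p_non_refund, "Non Refund", "No Deposit")

df = pd.DataFrame({
    "Is_Canceled": is_canceled,
    "Lead_Time": lead_time,
    "Deposit_Type": deposit,
})

# Normality check: a small p-value suggests the feature is not normal,
# which motivates a non-parametric test instead of a t-test.
shapiro_p = stats.shapiro(df["Lead_Time"]).pvalue

# Mann-Whitney U: compare lead time between canceled and kept bookings.
canceled = df.loc[df["Is_Canceled"] == 1, "Lead_Time"]
kept = df.loc[df["Is_Canceled"] == 0, "Lead_Time"]
u_stat, u_p = stats.mannwhitneyu(canceled, kept, alternative="two-sided")

# Chi-square test of independence on the contingency table.
table = pd.crosstab(df["Deposit_Type"], df["Is_Canceled"])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

print(f"Shapiro p={shapiro_p:.3g}, Mann-Whitney p={u_p:.3g}, "
      f"chi-square p={chi_p:.3g} (dof={dof})")
```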
- PCA to reduce dimensionality for customer-behavior features
- K-Means clustering to group customers with similar booking behavior
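The segmentation step might look roughly like this. The sketch uses two synthetic customer groups, and the four feature columns (lead time, nights, ADR, guests), the two principal components, and `n_clusters=2` are assumptions for the example rather than the notebook's actual configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Columns: lead time, total nights, ADR, total guests (all synthetic).
# Group A: short lead time, short stays. Group B: long lead time, long stays.
group_a = rng.normal(loc=[10, 2, 90, 2], scale=[3, 1, 10, 0.5], size=(100, 4))
group_b = rng.normal(loc=[150, 9, 60, 2], scale=[20, 2, 10, 0.5], size=(100, 4))
features = np.vstack([group_a, group_b])

# Scale first: PCA and K-Means are both sensitive to feature magnitudes.
scaled = StandardScaler().fit_transform(features)

# Reduce to two principal components.
pca = PCA(n_components=2, random_state=0)
components = pca.fit_transform(scaled)

# Cluster bookings in the reduced space.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(components)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cluster sizes:", np.bincount(labels))
```

On real data the number of clusters would typically be chosen with an elbow plot or silhouette scores rather than fixed in advance.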
.
├── Hotel_Analysis.ipynb # Main notebook (analysis + plots + modeling steps)
├── hotel_bookings.csv # Dataset used in the notebook
├── Documention_DA.pdf # Full project documentation/report
└── Presentation DA.pdf # Project presentation slides
- Create a virtual environment (recommended)
- Install dependencies:
pip install numpy pandas matplotlib seaborn scipy scikit-learn jupyter
- Launch Jupyter and open the notebook:
jupyter notebook
- Run Hotel_Analysis.ipynb from top to bottom.
Upload Hotel_Analysis.ipynb and hotel_bookings.csv to Colab and run the cells.
- Business-focused insights about cancellation drivers
- Statistical evidence of significant relationships
- Customer segments (clusters) based on booking behavior
- A full written report and a presentation deck (PDFs)
- This repository focuses on analysis + insights + segmentation.
If you want an end-to-end prediction model (train/test metrics), you can extend the notebook with a classification pipeline and evaluation metrics.
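One possible starting point for such an extension is sketched below. It is not part of the notebook: the data is synthetic, the feature columns are assumed names, and logistic regression with accuracy and ROC AUC is just one reasonable baseline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(1)
n = 600

# Synthetic bookings; cancellation probability rises with lead time.
df = pd.DataFrame({
    "Lead_Time": rng.exponential(80, size=n),
    "ADR": rng.normal(100, 25, size=n),
    "Market_Segment": rng.choice(["Online TA", "Direct", "Groups"], size=n),
})
logit = 0.02 * (df["Lead_Time"] - 80)
df["Is_Canceled"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

numeric = ["Lead_Time", "ADR"]
categorical = ["Market_Segment"]

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hold out 25% of the rows, keeping the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["Is_Canceled"],
    test_size=0.25, stratify=df["Is_Canceled"], random_state=0,
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Accuracy: {accuracy:.3f}, ROC AUC: {auc:.3f}")
```

Keeping preprocessing inside the pipeline means the scaler and encoder are fit on the training split only, which avoids leaking test information into the model.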
- Dataset provider: Hotel Booking Demand on Kaggle (as credited above)