Vehicles Sales Analysis

Comprehensive statistical analysis of vehicle sales data using univariate, bivariate, and multivariate techniques to identify factors influencing sales, resale values, and market segmentation.

🎯 TL;DR

This project performs in-depth statistical analysis on 155+ vehicle models across 30 brands, achieving:

Best ROI Models: Identified 7 vehicles with superior return on investment (Honda Odyssey, Toyota 4Runner, etc.).
Key Price Drivers: Price is 70% explained by engine size, horsepower, and vehicle dimensions.
Market Segmentation: 6 distinct vehicle clusters ranging from economy to luxury segments.
Sales Leaders: Ford dominates with highest sales volume; Porsche leads in average price ($41.5K).
Predictive Model: Linear regression model with 70% R² for resale value prediction.

Perfect for automotive analysts, market researchers, and data scientists exploring vehicle market dynamics.

💡 Problem/Motivation

Understanding vehicle sales dynamics is critical for manufacturers, dealers, and consumers, but faces several challenges:

The Market Challenge

Price Variability: Vehicle prices range from $10K to $100K+ with unclear value drivers.
Resale Uncertainty: Consumers lack data-driven insights on which vehicles retain value.
Brand Perception: Is success driven by brand reputation or actual vehicle attributes?.
Market Segmentation: Difficulty identifying distinct consumer segments and their preferences.
Investment Decisions: Buyers need quantitative guidance on ROI for vehicle purchases.

The Solution

This analysis provides a data-driven approach to:

Identify the key attributes that drive vehicle prices and sales volume.
Quantify resale value patterns and ROI across different vehicle segments.
Segment the market into distinct clusters with specific characteristics.
Reveal correlations between vehicle specifications and market performance.
Provide actionable insights for manufacturers, dealers, and consumers.

Goal: Uncover the relationships between vehicle attributes (engine size, dimensions, horsepower) and market outcomes (sales, price, resale value) using comprehensive statistical methods.

📊 Data Description

Dataset Overview

Source: Kaggle - Car Sales Dataset

Size: 155 vehicle models after cleaning (originally 156)

Coverage: 30 automotive brands with diverse model portfolios

Variables (16 Total)

Categorical Variables (4):

Variable	Type	Description	Categories
Brand	Nominal	Vehicle manufacturer	30 brands (Ford, Toyota, etc.)
Model	Nominal	Specific vehicle model	155 unique models
Type	Binary	Vehicle classification	Passenger (1) / Car (0)
Years_Launch	Discrete	Years since model launch	Derived from Latest_Launch date

Numerical Variables (12):

Variable	Unit	Range	Description
Sales_in_Thousands	Units (000s)	0.1 - 543.8	Total sales volume
Price_in_Thousands	USD (000s)	9.2 - 192.5	Manufacturer's suggested retail price
Resale_Value	USD (000s)	2.3 - 72.0	Resale value after X years
Engine_Size	Liters	1.0 - 8.0	Engine displacement
Horsepower	HP	55 - 450	Engine power output
Wheelbase	Inches	89.8 - 138.7	Distance between front and rear axles
Width	Inches	60.3 - 79.9	Vehicle width
Length	Inches	149.4 - 220.0	Vehicle length
Curb_Weight	Pounds (000s)	1.5 - 5.6	Total vehicle weight
Fuel_Capacity	Gallons	10.3 - 38.0	Fuel tank capacity
Fuel_Efficiency	MPG	15 - 53	Miles per gallon
Power_Performance_Factor	Ratio	23 - 336	Power-to-performance metric

Data Characteristics

Class Imbalance: 75% Passenger vehicles, 25% Car vehicles.
Missing Values: 37 missing values handled via mean imputation.
Outliers: 38 models identified as outliers across various metrics.
Correlation Patterns: Strong correlations between engine size, horsepower, and price (r > 0.7).

📁 Project Structure

Vehicles-Sales-Analysis/
│
├── Code/
│   └── ccode1.R                    # Univariate & Bivariate Analysis
│   └── code2.R                     # Multivariate Analysis (PCA, FA, Clustering, Regression)
│
├── Data/
│   └── cars.csv                    # Raw dataset (156 observations × 16 variables)
│
├── Documents/
│   └── guidelines.pdf              # Guidelines of the project
│   └── presentation.pdf            # Presentation of the project
│   └── report.pdf                  # Comprehensive analysis report (52 pages)
│
├── requirements.txt                # R package dependencies
└── README.md                       # This file

Key Dependencies

tidyverse==2.0.0
dplyr==1.1.0
ggplot2==3.4.0
corrplot==0.92
e1071==1.7-13
MASS==7.3-60
caret==6.0-94
FactoMineR==2.8
cluster==2.1.4
psych==2.3.3

🔬 Methodology

Analysis Pipeline

Data Cleaning → Univariate Analysis → Bivariate Analysis → Multivariate Analysis → Modeling

Stage 1: Data Cleaning

Preprocessing Steps:

Variable Creation: Combined Brand + Model into Brand_Model variable.
Type Conversion:
- Type: "Passenger" → 1, "Car" → 0.
- Latest_Launch: Converted to Years_Launch (years since launch).
Missing Value Handling:
- Removed row 34 (10 missing values).
- Imputed remaining NAs with variable means (37 values across 5 variables).
Validation: Verified no zero values in numerical variables.

Data Quality Checks:

✅ No duplicate records.
✅ All categorical variables properly encoded.
✅ Numerical ranges validated against domain knowledge.

Stage 2: Univariate Analysis

2.1 Categorical Variables

Brand Distribution:

Top 5 Brands: Ford (14 models), Dodge (12), Toyota (11), Mercedes-Benz (11), Chevrolet (10).
Bottom 4 Brands: Infiniti, Jaguar, Saab, Subaru (2 models each).
Concentration: Top 10 brands hold 53% of all models.

Type Distribution:

Passenger: 75% (117 models).
Car: 25% (38 models).

2.2 Numerical Variables

Location Measures:

Most variables show right-skewed distributions.
Trimmed means (5%) are lower than regular means, confirming positive outlier influence.
Example: Sales_in_Thousands → Mean: 48.13, Trimmed Mean: 38.5, Median: 29.0.

Dispersion Measures:

Highest Variability: Sales_in_Thousands (CV = 128.75%).
Lowest Variability: Wheelbase (CV = 7.11%).
Standard Deviations: Range from 5.1 (Engine_Size) to 68.2 (Sales_in_Thousands).

Distribution Characteristics:

Skewness Analysis:
- High positive skew: Sales (3.44), Resale_Value (2.28), Price (2.56).
- Near-symmetric: Width (0.19), Length (0.31), Fuel_Efficiency (-0.04).
Kurtosis:
- Heaviest tails: Sales (16.59), Price (9.32), Resale_Value (6.88).
- Lightest tails: Fuel_Efficiency (3.10), Width (2.54).

Outlier Analysis (38 models identified):

Sales Outliers (12): Ford F-Series (543.8K units), Ford Explorer, Toyota Camry.
Price Outliers (9): Mercedes-Benz CL500 ($192.5K), Porsche Carrera Cabrio.
Engine Size Outliers (3): Dodge Viper (8.0L), Cadillac Escalade (5.7L).
Horsepower Outliers (2): Dodge Viper (450 HP), Chevrolet Corvette (345 HP).

Stage 3: Bivariate Analysis

3.1 Categorical × Numerical (Type)

T-Test Results (7 variables with p < 0.05):

Sales_in_Thousands: Cars sell 40% more than Passengers (p < 0.001).
Engine_Size: Passengers have 15% larger engines (p = 0.003).
Wheelbase: Passengers are 8% longer (p < 0.001).
Width: Passengers are 6% wider (p < 0.001).
Curb_Weight: Passengers are 20% heavier (p < 0.001).
Fuel_Capacity: Passengers hold 18% more fuel (p < 0.001).
Fuel_Efficiency: Cars are 12% more efficient (p = 0.002).

3.2 Categorical × Numerical (Brand)

Money Generated (Sales × Price):

Top 3: Ford ($47.2M), Dodge ($21.8M), Toyota ($18.4M).
Bottom 3: Porsche ($1.1M), Saab ($1.3M), Jaguar ($1.5M).
Insight: Success correlates with model volume, not price.

Mean Price by Brand:

Most Expensive: Porsche ($41.5K), Mercedes-Benz ($38.2K), Lexus ($36.8K).
Least Expensive: Saturn ($13.2K), Hyundai ($14.5K), Chevrolet ($15.8K).

Brand Age:

Oldest: Oldsmobile (7.2 years), Mercury (6.8 years), Nissan (6.5 years).
Newest: Infiniti (2.1 years), Jaguar (2.3 years), Lexus (2.5 years).

3.3 Numerical × Numerical

Return on Investment (ROI) Analysis:

Created 4-quadrant plot: Price (x-axis) vs Resale_Value (y-axis).
High ROI Models (7):
1. Acura CL.
2. Audi A4.
3. Honda Odyssey.
4. Jeep Grand Cherokee.
5. Mitsubishi 3000GT.
6. Toyota Avalon.
7. Toyota 4Runner.
Interpretation: Low initial price, high resale value relative to segment.

Correlation Matrix (|r| > 0.7):

Resale_Value ←→ Price (r = 0.92)
Resale_Value ←→ Horsepower (r = 0.85)
Resale_Value ←→ Power_Performance_Factor (r = 0.88)

Price ←→ Horsepower (r = 0.84)
Price ←→ Engine_Size (r = 0.73)
Price ←→ Power_Performance_Factor (r = 0.82)

Engine_Size ←→ Horsepower (r = 0.91)
Engine_Size ←→ Width (r = 0.72)
Engine_Size ←→ Curb_Weight (r = 0.78)
Engine_Size ←→ Power_Performance_Factor (r = 0.86)

Horsepower ←→ Power_Performance_Factor (r = 0.98) [REDUNDANT]

Wheelbase ←→ Width (r = 0.76)
Wheelbase ←→ Length (r = 0.87)

Width ←→ Length (r = 0.82)
Width ←→ Curb_Weight (r = 0.84)

Curb_Weight ←→ Fuel_Capacity (r = 0.81)
Curb_Weight ←→ Fuel_Efficiency (r = -0.75)

Fuel_Capacity ←→ Fuel_Efficiency (r = -0.71)

Key Insights:

Power_Performance_Factor is redundant with Horsepower (r = 0.98).
Price driven by: Horsepower (r = 0.84), Engine_Size (r = 0.73).
Larger vehicles (Curb_Weight) have lower fuel efficiency (r = -0.75).

Stage 4: Multivariate Analysis

4.1 Principal Component Analysis (PCA)

Component Selection (Pearson's Criteria: 80% cumulative variance):

4 PCs selected explaining 80.4% of total variance.

PC	Variance Explained	Cumulative Variance	Interpretation
PC1	43.2%	43.2%	General vehicle characteristics (size, power, price)
PC2	16.8%	60.0%	Sales vs. Value trade-off
PC3	13.4%	73.4%	Vehicle type differentiation
PC4	7.0%	80.4%	Model age

PC Interpretation:

PC1 (43.2% variance):

High positive loadings: Engine_Size (0.31), Horsepower (0.32), Width (0.30), Curb_Weight (0.33).
Negative loading: Fuel_Efficiency (-0.25).
Meaning: Large, powerful, fuel-inefficient vehicles.

PC2 (16.8% variance):

High positive loadings: Sales (0.42), Wheelbase (0.38).
High negative loadings: Resale_Value (-0.45), Price (-0.42).
Meaning: Popular, affordable vehicles vs. expensive, high-resale luxury cars.

PC3 (13.4% variance):

High negative loading: Type (-0.76).
Moderate negative loading: Length (-0.44).
Meaning: Car vehicles (Type=0) are shorter than Passenger vehicles.

PC4 (7.0% variance):

High negative loading: Years_Launch (-0.94).
Meaning: Newer vs. older models.

Brand Contributions:

PC1 (Vehicle Size/Power): Lincoln (most influential), Infiniti (least influential).
PC2 (Sales/Value): Porsche (most influential), Dodge/Ford (high influence).
PC3 (Type): Jeep (most influential, only brand with 0 Passenger vehicles).
PC4 (Age): Mercury (most influential, oldest models).

4.2 Factor Analysis (FA)

Kaiser-Meyer-Olkin (KMO) Test: 0.83 (excellent suitability for FA).

Three Extraction Methods Compared:

Method 1: Principal Axis (PA):

PA1 (30% variance): Vehicle dimensions (Wheelbase, Length, Fuel_Capacity).
PA2 (39% variance): Price/performance (Price, Horsepower, Power_Performance_Factor).
PA3 (22% variance): Type (negative correlation).
PA4 (9% variance): Sales.
RMSR: 0.02 (excellent fit).
Mean Item Complexity: 1.6 factors per variable.

Method 2: Minimal Residuals (MR):

MR1 (32% variance): Weight/fuel (Curb_Weight, Fuel_Capacity, Fuel_Efficiency).
MR2 (38% variance): Price/performance (Resale_Value, Price, Power_Performance_Factor).
MR3 (5% variance): No high correlations.
MR4 (25% variance): Dimensions (Wheelbase, Length).
RMSR: 0.01 (best fit).
Mean Item Complexity: 1.8 factors per variable.

Method 3: Maximum Likelihood (ML):

ML1 (38% variance): Price/performance (Resale_Value, Price, Horsepower).
ML2 (32% variance): Dimensions (Wheelbase, Length).
ML3 (26% variance): Weight/fuel (Curb_Weight, Fuel_Capacity, Fuel_Efficiency).
ML4 (5% variance): No high correlations.
RMSR: 0.02 (excellent fit).
Mean Item Complexity: 1.8 factors per variable.
Note: Variables not normally distributed (Shapiro-Wilk p < 0.05), limiting ML reliability.

Best Method: Minimal Residuals (RMSR = 0.01, clearest factor structure).

4.3 Cluster Analysis

4.3.1 K-Means Clustering (Non-Hierarchical):

Optimal K Selection: Elbow method identified K = 6 clusters.

Silhouette Coefficient: 0.54 (reasonably good clustering quality).

Cluster Profiles:

Cluster	Size	Avg Sales	Avg Price	Avg Resale	Key Characteristics
1	52	67.2K	$18.3K	$11.2K	High-volume, affordable, low resale
2	18	22.1K	$32.5K	$21.8K	Low-volume luxury, high resale
3	12	28.4K	$19.7K	$12.5K	Compact economy segment
4	8	35.6K	$47.9K	$28.4K	Premium luxury, largest dimensions
5	21	15.8K	$22.1K	$13.9K	Small, unpopular, specialized
6	44	58.3K	$21.6K	$13.1K	Popular large vehicles, low resale

Cluster Interpretation:

Cluster 1: Mass-market leaders (Ford, Toyota, Honda).
Cluster 2: Luxury segment (Mercedes-Benz, Lexus, BMW).
Cluster 3: Economy compact cars (Saturn, Hyundai).
Cluster 4: Ultra-luxury (Porsche, high-end Mercedes).
Cluster 5: Niche/specialty vehicles (low sales).
Cluster 6: Popular SUVs/trucks (high sales, low resale).

4.3.2 Hierarchical Clustering (Ward's Method):

Silhouette Coefficient: 0.30 (worse than K-means).

Conclusion: K-means provides superior cluster separation for this dataset.

4.4 Linear Discriminant Analysis (LDA)

Assumption Testing:

Shapiro-Wilk Test: All variables except Length failed normality test (p < 0.05).
Conclusion: LDA assumptions not met; analysis not performed.

4.5 Linear Regression (Resale Value Prediction)

Model 1: Initial Model (No preprocessing):

Train/Test Split: 80/20.

Results:

R²: 0.70 (70% variance explained).
RMSE: 4.6.
Residual Standard Error: 5.7.
Significant Predictors: Price_in_Thousands, Curb_Weight, Fuel_Capacity.

Interpretation: Strong predictive power, but high residual variability suggests outlier influence.

Model 2: Standardized Model (Outliers removed):

Preprocessing:

Standardized all variables (z-scores).
Removed outliers based on Resale_Value boxplot (8 models removed).
Verified balanced skewness (~0).

Results:

R²: 0.54 (54% variance explained, -16% vs initial).
RMSE: Lower (standardized scale).
Residual Standard Error: Lower (tighter fit).
Significant Predictors: Engine_Size, Width, Curb_Weight.

Interpretation: Outlier removal reduced explanatory power but improved prediction accuracy on typical vehicles. Trade-off between capturing market extremes vs. modeling average behavior.

📈 Results/Interpretation

Key Findings Summary

1. Sales Drivers

Volume Leaders: Ford (47.2M revenue), Dodge (21.8M), Toyota (18.4M).
Sales Strategy: Success driven by model portfolio size, not premium pricing.
Type Effect: Cars outsell Passengers by 40% on average.

2. Price Determinants

Primary Factors (70% of price variance):
- Horsepower (r = 0.84).
- Engine_Size (r = 0.73).
- Power_Performance_Factor (r = 0.82).
Premium Brands: Porsche ($41.5K avg), Mercedes-Benz ($38.2K), Lexus ($36.8K).

3. Resale Value Patterns

High Resale Brands: Porsche (72% retention), Mercedes-Benz (68%), Lexus (65%).
ROI Champions (7 models): Honda Odyssey, Toyota 4Runner, Acura CL.
Predictors: Price (r = 0.92), Horsepower (r = 0.85), Power_Performance_Factor (r = 0.88).

4. Market Segmentation

6 Distinct Clusters:

Mass-Market (33%): High sales, affordable, Ford/Toyota dominance.
Luxury (12%): Low volume, high price/resale, Mercedes/Lexus.
Economy (8%): Compact, budget-friendly, Saturn/Hyundai.
Ultra-Luxury (5%): Highest price/dimensions, Porsche/Mercedes.
Niche (14%): Specialized, low sales.
Popular Large (28%): SUVs/trucks, high sales but low resale.

5. Correlation Insights

Redundant Variable: Power_Performance_Factor (r = 0.98 with Horsepower).
Size Correlation: Wheelbase ↔ Width ↔ Length (r > 0.76).
Efficiency Trade-off: Curb_Weight ↔ Fuel_Efficiency (r = -0.75).

6. Outlier Models (38 identified)

Sales: Ford F-Series (543.8K units, 10× median).
Price: Mercedes-Benz CL500 ($192.5K, 10× median).
Performance: Dodge Viper (8.0L engine, 450 HP).

💼 Business Impact

For Automotive Manufacturers

Product Development:

Insight: 70% of price is determined by Engine_Size, Horsepower, and vehicle dimensions.
Action: Optimize engine specs and sizing to hit target price points.
ROI: Improve profit margins by 5-10% through data-driven feature prioritization.

Portfolio Strategy:

Insight: Model volume drives revenue more than premium pricing (Ford vs Porsche).
Action: Expand model lineup in high-volume segments (Clusters 1 & 6).
ROI: Increase market share by 15-20% in underserved segments.

Resale Value Engineering:

Insight: Resale value strongly correlated with initial price and horsepower.
Action: Design for resale value retention to justify premium pricing.
ROI: 10-15% higher transaction prices due to lower total cost of ownership.

For Automotive Dealers

Inventory Optimization:

Insight: 6 distinct market clusters with different sales velocities.
Action: Stock inventory matching local cluster preferences.
ROI: 20-30% reduction in days-on-lot for inventory.

Pricing Strategy:

Insight: 7 models identified with high ROI (low price, high resale).
Action: Promote ROI champions in marketing materials.
ROI: 5-10% increase in conversion rates for targeted models.

Trade-In Valuation:

Insight: Linear regression model predicts resale value with 70% accuracy.
Action: Use model for data-driven trade-in offers.
ROI: Improve trade-in profitability by 8-12%.

For Consumers

Purchase Decisions:

Insight: ROI analysis identifies 7 best-value models.
Benefit: Save $3K-$8K over 5-year ownership period.
Tool: Use cluster profiles to match needs with optimal segment.

Resale Planning:

Insight: Porsche, Mercedes, Lexus retain 65-72% of value.
Benefit: Reduce depreciation costs by choosing high-retention brands.
Tool: Predict future resale value within ±$4.6K (RMSE).

For Market Researchers

Segmentation Framework:

Insight: 6-cluster model with 0.54 silhouette coefficient.
Application: Reproducible framework for market studies.
Validation: Aligns with industry segments (economy, luxury, etc.).

Predictive Modeling:

Insight: 70% R² for resale value prediction.
Application: Baseline for advanced ML models (ensemble, neural networks).
Extension: Incorporate temporal data for time-series forecasting.

🚀 Getting Started

Installation

# Clone the repository
git clone https://github.com/pedroalexleite/Vehicles-Sales-Analysis.git
cd Vehicles-Sales-Analysis

# Install R packages
Rscript -e "install.packages(c('tidyverse', 'dplyr', 'ggplot2', 'ggrepel', 'forcats', 'scales', 'gridExtra', 'corrplot', 'e1071', 'MASS', 'caret', 'tidyr', 'FactoMineR', 'cluster', 'dendextend', 'caTools', 'psych', 'car'))"

Running the Analysis

Option 1: Full Analysis

# Open RStudio
# Run code1.R for univariate and bivariate analysis
source("code1.R")

# Run code2.R for multivariate analysis
source("code2.R")

# Total runtime: ~10-15 minutes

Option 2: Specific Analyses

# Univariate analysis only
source("code1.R")  # Stop after line 300

# Correlation matrix only
source("code1.R")  # Jump to line 500

# PCA only
source("code2.R")  # Stop after line 250

# Clustering only
source("code2.R")  # Lines 400-600

Expected Outputs

Visualizations:

30+ plots (histograms, boxplots, scatterplots, dendrograms).
Correlation matrices (heatmaps).
PCA biplots.
Cluster visualizations.

Statistical Tables:

Descriptive statistics (mean, median, SD, CV, skewness, kurtosis).
T-test results.
Correlation coefficients.
PCA loadings and contributions.
Cluster profiles.
Regression coefficients.

Models:

PCA components (4 retained).
Factor analysis loadings (3 methods).
K-means clusters (6 clusters).
Linear regression model (Resale_Value prediction).

🤝 Contributing

Contributions are welcome:

How to Contribute:

Fork the repository.
Create a feature branch (git checkout -b feature/TimeSeriesAnalysis).
Commit your changes (git commit -m 'Add time-series forecasting').
Push to the branch (git push origin feature/TimeSeriesAnalysis).
Open a Pull Request.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
Code		Code
Data		Data
Documents		Documents
.gitignore		.gitignore
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Vehicles Sales Analysis

🎯 TL;DR

💡 Problem/Motivation

The Market Challenge

The Solution

📊 Data Description

Dataset Overview

Variables (16 Total)

Data Characteristics

📁 Project Structure

Key Dependencies

🔬 Methodology

Analysis Pipeline

Stage 1: Data Cleaning

Stage 2: Univariate Analysis

2.1 Categorical Variables

2.2 Numerical Variables

Stage 3: Bivariate Analysis

3.1 Categorical × Numerical (Type)

3.2 Categorical × Numerical (Brand)

3.3 Numerical × Numerical

Stage 4: Multivariate Analysis

4.1 Principal Component Analysis (PCA)

4.2 Factor Analysis (FA)

4.3 Cluster Analysis

4.4 Linear Discriminant Analysis (LDA)

4.5 Linear Regression (Resale Value Prediction)

📈 Results/Interpretation

Key Findings Summary

1. Sales Drivers

2. Price Determinants

3. Resale Value Patterns

4. Market Segmentation

5. Correlation Insights

6. Outlier Models (38 identified)

💼 Business Impact

For Automotive Manufacturers

For Automotive Dealers

For Consumers

For Market Researchers

🚀 Getting Started

Installation

Running the Analysis

Expected Outputs

🤝 Contributing

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages