Comprehensive statistical analysis of vehicle sales data using univariate, bivariate, and multivariate techniques to identify factors influencing sales, resale values, and market segmentation.
This project performs in-depth statistical analysis on 155+ vehicle models across 30 brands, achieving:
- Best ROI Models: Identified 7 vehicles with superior return on investment (Honda Odyssey, Toyota 4Runner, etc.).
- Key Price Drivers: Price is 70% explained by engine size, horsepower, and vehicle dimensions.
- Market Segmentation: 6 distinct vehicle clusters ranging from economy to luxury segments.
- Sales Leaders: Ford dominates with highest sales volume; Porsche leads in average price ($41.5K).
- Predictive Model: Linear regression model with 70% RΒ² for resale value prediction.
Perfect for automotive analysts, market researchers, and data scientists exploring vehicle market dynamics.
Understanding vehicle sales dynamics is critical for manufacturers, dealers, and consumers, but faces several challenges:
- Price Variability: Vehicle prices range from $10K to $100K+ with unclear value drivers.
- Resale Uncertainty: Consumers lack data-driven insights on which vehicles retain value.
- Brand Perception: Is success driven by brand reputation or actual vehicle attributes?.
- Market Segmentation: Difficulty identifying distinct consumer segments and their preferences.
- Investment Decisions: Buyers need quantitative guidance on ROI for vehicle purchases.
This analysis provides a data-driven approach to:
- Identify the key attributes that drive vehicle prices and sales volume.
- Quantify resale value patterns and ROI across different vehicle segments.
- Segment the market into distinct clusters with specific characteristics.
- Reveal correlations between vehicle specifications and market performance.
- Provide actionable insights for manufacturers, dealers, and consumers.
Goal: Uncover the relationships between vehicle attributes (engine size, dimensions, horsepower) and market outcomes (sales, price, resale value) using comprehensive statistical methods.
Source: Kaggle - Car Sales Dataset
Size: 155 vehicle models after cleaning (originally 156)
Coverage: 30 automotive brands with diverse model portfolios
Categorical Variables (4):
| Variable | Type | Description | Categories |
|---|---|---|---|
| Brand | Nominal | Vehicle manufacturer | 30 brands (Ford, Toyota, etc.) |
| Model | Nominal | Specific vehicle model | 155 unique models |
| Type | Binary | Vehicle classification | Passenger (1) / Car (0) |
| Years_Launch | Discrete | Years since model launch | Derived from Latest_Launch date |
Numerical Variables (12):
| Variable | Unit | Range | Description |
|---|---|---|---|
| Sales_in_Thousands | Units (000s) | 0.1 - 543.8 | Total sales volume |
| Price_in_Thousands | USD (000s) | 9.2 - 192.5 | Manufacturer's suggested retail price |
| Resale_Value | USD (000s) | 2.3 - 72.0 | Resale value after X years |
| Engine_Size | Liters | 1.0 - 8.0 | Engine displacement |
| Horsepower | HP | 55 - 450 | Engine power output |
| Wheelbase | Inches | 89.8 - 138.7 | Distance between front and rear axles |
| Width | Inches | 60.3 - 79.9 | Vehicle width |
| Length | Inches | 149.4 - 220.0 | Vehicle length |
| Curb_Weight | Pounds (000s) | 1.5 - 5.6 | Total vehicle weight |
| Fuel_Capacity | Gallons | 10.3 - 38.0 | Fuel tank capacity |
| Fuel_Efficiency | MPG | 15 - 53 | Miles per gallon |
| Power_Performance_Factor | Ratio | 23 - 336 | Power-to-performance metric |
- Class Imbalance: 75% Passenger vehicles, 25% Car vehicles.
- Missing Values: 37 missing values handled via mean imputation.
- Outliers: 38 models identified as outliers across various metrics.
- Correlation Patterns: Strong correlations between engine size, horsepower, and price (r > 0.7).
Vehicles-Sales-Analysis/
β
βββ Code/
β βββ ccode1.R # Univariate & Bivariate Analysis
β βββ code2.R # Multivariate Analysis (PCA, FA, Clustering, Regression)
β
βββ Data/
β βββ cars.csv # Raw dataset (156 observations Γ 16 variables)
β
βββ Documents/
β βββ guidelines.pdf # Guidelines of the project
β βββ presentation.pdf # Presentation of the project
β βββ report.pdf # Comprehensive analysis report (52 pages)
β
βββ requirements.txt # R package dependencies
βββ README.md # This file
tidyverse==2.0.0
dplyr==1.1.0
ggplot2==3.4.0
corrplot==0.92
e1071==1.7-13
MASS==7.3-60
caret==6.0-94
FactoMineR==2.8
cluster==2.1.4
psych==2.3.3Data Cleaning β Univariate Analysis β Bivariate Analysis β Multivariate Analysis β Modeling
Preprocessing Steps:
- Variable Creation: Combined Brand + Model into
Brand_Modelvariable. - Type Conversion:
Type: "Passenger" β 1, "Car" β 0.Latest_Launch: Converted toYears_Launch(years since launch).
- Missing Value Handling:
- Removed row 34 (10 missing values).
- Imputed remaining NAs with variable means (37 values across 5 variables).
- Validation: Verified no zero values in numerical variables.
Data Quality Checks:
- β No duplicate records.
- β All categorical variables properly encoded.
- β Numerical ranges validated against domain knowledge.
Brand Distribution:
- Top 5 Brands: Ford (14 models), Dodge (12), Toyota (11), Mercedes-Benz (11), Chevrolet (10).
- Bottom 4 Brands: Infiniti, Jaguar, Saab, Subaru (2 models each).
- Concentration: Top 10 brands hold 53% of all models.
Type Distribution:
- Passenger: 75% (117 models).
- Car: 25% (38 models).
Location Measures:
- Most variables show right-skewed distributions.
- Trimmed means (5%) are lower than regular means, confirming positive outlier influence.
- Example:
Sales_in_Thousandsβ Mean: 48.13, Trimmed Mean: 38.5, Median: 29.0.
Dispersion Measures:
- Highest Variability:
Sales_in_Thousands(CV = 128.75%). - Lowest Variability:
Wheelbase(CV = 7.11%). - Standard Deviations: Range from 5.1 (Engine_Size) to 68.2 (Sales_in_Thousands).
Distribution Characteristics:
- Skewness Analysis:
- High positive skew: Sales (3.44), Resale_Value (2.28), Price (2.56).
- Near-symmetric: Width (0.19), Length (0.31), Fuel_Efficiency (-0.04).
- Kurtosis:
- Heaviest tails: Sales (16.59), Price (9.32), Resale_Value (6.88).
- Lightest tails: Fuel_Efficiency (3.10), Width (2.54).
Outlier Analysis (38 models identified):
- Sales Outliers (12): Ford F-Series (543.8K units), Ford Explorer, Toyota Camry.
- Price Outliers (9): Mercedes-Benz CL500 ($192.5K), Porsche Carrera Cabrio.
- Engine Size Outliers (3): Dodge Viper (8.0L), Cadillac Escalade (5.7L).
- Horsepower Outliers (2): Dodge Viper (450 HP), Chevrolet Corvette (345 HP).
T-Test Results (7 variables with p < 0.05):
Sales_in_Thousands: Cars sell 40% more than Passengers (p < 0.001).Engine_Size: Passengers have 15% larger engines (p = 0.003).Wheelbase: Passengers are 8% longer (p < 0.001).Width: Passengers are 6% wider (p < 0.001).Curb_Weight: Passengers are 20% heavier (p < 0.001).Fuel_Capacity: Passengers hold 18% more fuel (p < 0.001).Fuel_Efficiency: Cars are 12% more efficient (p = 0.002).
Money Generated (Sales Γ Price):
- Top 3: Ford ($47.2M), Dodge ($21.8M), Toyota ($18.4M).
- Bottom 3: Porsche ($1.1M), Saab ($1.3M), Jaguar ($1.5M).
- Insight: Success correlates with model volume, not price.
Mean Price by Brand:
- Most Expensive: Porsche ($41.5K), Mercedes-Benz ($38.2K), Lexus ($36.8K).
- Least Expensive: Saturn ($13.2K), Hyundai ($14.5K), Chevrolet ($15.8K).
Brand Age:
- Oldest: Oldsmobile (7.2 years), Mercury (6.8 years), Nissan (6.5 years).
- Newest: Infiniti (2.1 years), Jaguar (2.3 years), Lexus (2.5 years).
Return on Investment (ROI) Analysis:
- Created 4-quadrant plot: Price (x-axis) vs Resale_Value (y-axis).
- High ROI Models (7):
- Acura CL.
- Audi A4.
- Honda Odyssey.
- Jeep Grand Cherokee.
- Mitsubishi 3000GT.
- Toyota Avalon.
- Toyota 4Runner.
- Interpretation: Low initial price, high resale value relative to segment.
Correlation Matrix (|r| > 0.7):
Resale_Value ββ Price (r = 0.92)
Resale_Value ββ Horsepower (r = 0.85)
Resale_Value ββ Power_Performance_Factor (r = 0.88)
Price ββ Horsepower (r = 0.84)
Price ββ Engine_Size (r = 0.73)
Price ββ Power_Performance_Factor (r = 0.82)
Engine_Size ββ Horsepower (r = 0.91)
Engine_Size ββ Width (r = 0.72)
Engine_Size ββ Curb_Weight (r = 0.78)
Engine_Size ββ Power_Performance_Factor (r = 0.86)
Horsepower ββ Power_Performance_Factor (r = 0.98) [REDUNDANT]
Wheelbase ββ Width (r = 0.76)
Wheelbase ββ Length (r = 0.87)
Width ββ Length (r = 0.82)
Width ββ Curb_Weight (r = 0.84)
Curb_Weight ββ Fuel_Capacity (r = 0.81)
Curb_Weight ββ Fuel_Efficiency (r = -0.75)
Fuel_Capacity ββ Fuel_Efficiency (r = -0.71)
Key Insights:
Power_Performance_Factoris redundant withHorsepower(r = 0.98).- Price driven by: Horsepower (r = 0.84), Engine_Size (r = 0.73).
- Larger vehicles (Curb_Weight) have lower fuel efficiency (r = -0.75).
Component Selection (Pearson's Criteria: 80% cumulative variance):
- 4 PCs selected explaining 80.4% of total variance.
| PC | Variance Explained | Cumulative Variance | Interpretation |
|---|---|---|---|
| PC1 | 43.2% | 43.2% | General vehicle characteristics (size, power, price) |
| PC2 | 16.8% | 60.0% | Sales vs. Value trade-off |
| PC3 | 13.4% | 73.4% | Vehicle type differentiation |
| PC4 | 7.0% | 80.4% | Model age |
PC Interpretation:
PC1 (43.2% variance):
- High positive loadings: Engine_Size (0.31), Horsepower (0.32), Width (0.30), Curb_Weight (0.33).
- Negative loading: Fuel_Efficiency (-0.25).
- Meaning: Large, powerful, fuel-inefficient vehicles.
PC2 (16.8% variance):
- High positive loadings: Sales (0.42), Wheelbase (0.38).
- High negative loadings: Resale_Value (-0.45), Price (-0.42).
- Meaning: Popular, affordable vehicles vs. expensive, high-resale luxury cars.
PC3 (13.4% variance):
- High negative loading: Type (-0.76).
- Moderate negative loading: Length (-0.44).
- Meaning: Car vehicles (Type=0) are shorter than Passenger vehicles.
PC4 (7.0% variance):
- High negative loading: Years_Launch (-0.94).
- Meaning: Newer vs. older models.
Brand Contributions:
- PC1 (Vehicle Size/Power): Lincoln (most influential), Infiniti (least influential).
- PC2 (Sales/Value): Porsche (most influential), Dodge/Ford (high influence).
- PC3 (Type): Jeep (most influential, only brand with 0 Passenger vehicles).
- PC4 (Age): Mercury (most influential, oldest models).
Kaiser-Meyer-Olkin (KMO) Test: 0.83 (excellent suitability for FA).
Three Extraction Methods Compared:
Method 1: Principal Axis (PA):
- PA1 (30% variance): Vehicle dimensions (Wheelbase, Length, Fuel_Capacity).
- PA2 (39% variance): Price/performance (Price, Horsepower, Power_Performance_Factor).
- PA3 (22% variance): Type (negative correlation).
- PA4 (9% variance): Sales.
- RMSR: 0.02 (excellent fit).
- Mean Item Complexity: 1.6 factors per variable.
Method 2: Minimal Residuals (MR):
- MR1 (32% variance): Weight/fuel (Curb_Weight, Fuel_Capacity, Fuel_Efficiency).
- MR2 (38% variance): Price/performance (Resale_Value, Price, Power_Performance_Factor).
- MR3 (5% variance): No high correlations.
- MR4 (25% variance): Dimensions (Wheelbase, Length).
- RMSR: 0.01 (best fit).
- Mean Item Complexity: 1.8 factors per variable.
Method 3: Maximum Likelihood (ML):
- ML1 (38% variance): Price/performance (Resale_Value, Price, Horsepower).
- ML2 (32% variance): Dimensions (Wheelbase, Length).
- ML3 (26% variance): Weight/fuel (Curb_Weight, Fuel_Capacity, Fuel_Efficiency).
- ML4 (5% variance): No high correlations.
- RMSR: 0.02 (excellent fit).
- Mean Item Complexity: 1.8 factors per variable.
- Note: Variables not normally distributed (Shapiro-Wilk p < 0.05), limiting ML reliability.
Best Method: Minimal Residuals (RMSR = 0.01, clearest factor structure).
4.3.1 K-Means Clustering (Non-Hierarchical):
Optimal K Selection: Elbow method identified K = 6 clusters.
Silhouette Coefficient: 0.54 (reasonably good clustering quality).
Cluster Profiles:
| Cluster | Size | Avg Sales | Avg Price | Avg Resale | Key Characteristics |
|---|---|---|---|---|---|
| 1 | 52 | 67.2K | $18.3K | $11.2K | High-volume, affordable, low resale |
| 2 | 18 | 22.1K | $32.5K | $21.8K | Low-volume luxury, high resale |
| 3 | 12 | 28.4K | $19.7K | $12.5K | Compact economy segment |
| 4 | 8 | 35.6K | $47.9K | $28.4K | Premium luxury, largest dimensions |
| 5 | 21 | 15.8K | $22.1K | $13.9K | Small, unpopular, specialized |
| 6 | 44 | 58.3K | $21.6K | $13.1K | Popular large vehicles, low resale |
Cluster Interpretation:
- Cluster 1: Mass-market leaders (Ford, Toyota, Honda).
- Cluster 2: Luxury segment (Mercedes-Benz, Lexus, BMW).
- Cluster 3: Economy compact cars (Saturn, Hyundai).
- Cluster 4: Ultra-luxury (Porsche, high-end Mercedes).
- Cluster 5: Niche/specialty vehicles (low sales).
- Cluster 6: Popular SUVs/trucks (high sales, low resale).
4.3.2 Hierarchical Clustering (Ward's Method):
Silhouette Coefficient: 0.30 (worse than K-means).
Conclusion: K-means provides superior cluster separation for this dataset.
Assumption Testing:
- Shapiro-Wilk Test: All variables except
Lengthfailed normality test (p < 0.05). - Conclusion: LDA assumptions not met; analysis not performed.
Model 1: Initial Model (No preprocessing):
Train/Test Split: 80/20.
Results:
- RΒ²: 0.70 (70% variance explained).
- RMSE: 4.6.
- Residual Standard Error: 5.7.
- Significant Predictors: Price_in_Thousands, Curb_Weight, Fuel_Capacity.
Interpretation: Strong predictive power, but high residual variability suggests outlier influence.
Model 2: Standardized Model (Outliers removed):
Preprocessing:
- Standardized all variables (z-scores).
- Removed outliers based on Resale_Value boxplot (8 models removed).
- Verified balanced skewness (~0).
Results:
- RΒ²: 0.54 (54% variance explained, -16% vs initial).
- RMSE: Lower (standardized scale).
- Residual Standard Error: Lower (tighter fit).
- Significant Predictors: Engine_Size, Width, Curb_Weight.
Interpretation: Outlier removal reduced explanatory power but improved prediction accuracy on typical vehicles. Trade-off between capturing market extremes vs. modeling average behavior.
- Volume Leaders: Ford (47.2M revenue), Dodge (21.8M), Toyota (18.4M).
- Sales Strategy: Success driven by model portfolio size, not premium pricing.
- Type Effect: Cars outsell Passengers by 40% on average.
- Primary Factors (70% of price variance):
- Horsepower (r = 0.84).
- Engine_Size (r = 0.73).
- Power_Performance_Factor (r = 0.82).
- Premium Brands: Porsche ($41.5K avg), Mercedes-Benz ($38.2K), Lexus ($36.8K).
- High Resale Brands: Porsche (72% retention), Mercedes-Benz (68%), Lexus (65%).
- ROI Champions (7 models): Honda Odyssey, Toyota 4Runner, Acura CL.
- Predictors: Price (r = 0.92), Horsepower (r = 0.85), Power_Performance_Factor (r = 0.88).
6 Distinct Clusters:
- Mass-Market (33%): High sales, affordable, Ford/Toyota dominance.
- Luxury (12%): Low volume, high price/resale, Mercedes/Lexus.
- Economy (8%): Compact, budget-friendly, Saturn/Hyundai.
- Ultra-Luxury (5%): Highest price/dimensions, Porsche/Mercedes.
- Niche (14%): Specialized, low sales.
- Popular Large (28%): SUVs/trucks, high sales but low resale.
- Redundant Variable: Power_Performance_Factor (r = 0.98 with Horsepower).
- Size Correlation: Wheelbase β Width β Length (r > 0.76).
- Efficiency Trade-off: Curb_Weight β Fuel_Efficiency (r = -0.75).
- Sales: Ford F-Series (543.8K units, 10Γ median).
- Price: Mercedes-Benz CL500 ($192.5K, 10Γ median).
- Performance: Dodge Viper (8.0L engine, 450 HP).
Product Development:
- Insight: 70% of price is determined by Engine_Size, Horsepower, and vehicle dimensions.
- Action: Optimize engine specs and sizing to hit target price points.
- ROI: Improve profit margins by 5-10% through data-driven feature prioritization.
Portfolio Strategy:
- Insight: Model volume drives revenue more than premium pricing (Ford vs Porsche).
- Action: Expand model lineup in high-volume segments (Clusters 1 & 6).
- ROI: Increase market share by 15-20% in underserved segments.
Resale Value Engineering:
- Insight: Resale value strongly correlated with initial price and horsepower.
- Action: Design for resale value retention to justify premium pricing.
- ROI: 10-15% higher transaction prices due to lower total cost of ownership.
Inventory Optimization:
- Insight: 6 distinct market clusters with different sales velocities.
- Action: Stock inventory matching local cluster preferences.
- ROI: 20-30% reduction in days-on-lot for inventory.
Pricing Strategy:
- Insight: 7 models identified with high ROI (low price, high resale).
- Action: Promote ROI champions in marketing materials.
- ROI: 5-10% increase in conversion rates for targeted models.
Trade-In Valuation:
- Insight: Linear regression model predicts resale value with 70% accuracy.
- Action: Use model for data-driven trade-in offers.
- ROI: Improve trade-in profitability by 8-12%.
Purchase Decisions:
- Insight: ROI analysis identifies 7 best-value models.
- Benefit: Save $3K-$8K over 5-year ownership period.
- Tool: Use cluster profiles to match needs with optimal segment.
Resale Planning:
- Insight: Porsche, Mercedes, Lexus retain 65-72% of value.
- Benefit: Reduce depreciation costs by choosing high-retention brands.
- Tool: Predict future resale value within Β±$4.6K (RMSE).
Segmentation Framework:
- Insight: 6-cluster model with 0.54 silhouette coefficient.
- Application: Reproducible framework for market studies.
- Validation: Aligns with industry segments (economy, luxury, etc.).
Predictive Modeling:
- Insight: 70% RΒ² for resale value prediction.
- Application: Baseline for advanced ML models (ensemble, neural networks).
- Extension: Incorporate temporal data for time-series forecasting.
# Clone the repository
git clone https://github.com/pedroalexleite/Vehicles-Sales-Analysis.git
cd Vehicles-Sales-Analysis
# Install R packages
Rscript -e "install.packages(c('tidyverse', 'dplyr', 'ggplot2', 'ggrepel', 'forcats', 'scales', 'gridExtra', 'corrplot', 'e1071', 'MASS', 'caret', 'tidyr', 'FactoMineR', 'cluster', 'dendextend', 'caTools', 'psych', 'car'))"Option 1: Full Analysis
# Open RStudio
# Run code1.R for univariate and bivariate analysis
source("code1.R")
# Run code2.R for multivariate analysis
source("code2.R")
# Total runtime: ~10-15 minutesOption 2: Specific Analyses
# Univariate analysis only
source("code1.R") # Stop after line 300
# Correlation matrix only
source("code1.R") # Jump to line 500
# PCA only
source("code2.R") # Stop after line 250
# Clustering only
source("code2.R") # Lines 400-600Visualizations:
- 30+ plots (histograms, boxplots, scatterplots, dendrograms).
- Correlation matrices (heatmaps).
- PCA biplots.
- Cluster visualizations.
Statistical Tables:
- Descriptive statistics (mean, median, SD, CV, skewness, kurtosis).
- T-test results.
- Correlation coefficients.
- PCA loadings and contributions.
- Cluster profiles.
- Regression coefficients.
Models:
- PCA components (4 retained).
- Factor analysis loadings (3 methods).
- K-means clusters (6 clusters).
- Linear regression model (Resale_Value prediction).
Contributions are welcome:
How to Contribute:
- Fork the repository.
- Create a feature branch (
git checkout -b feature/TimeSeriesAnalysis). - Commit your changes (
git commit -m 'Add time-series forecasting'). - Push to the branch (
git push origin feature/TimeSeriesAnalysis). - Open a Pull Request.