Spotify Global Trends is a professional data science application that analyzes over 70 years of music history. The project implements a full modular pipeline: from processing a messy raw dataset to performing advanced statistical anomaly detection, presented via an interactive web dashboard.
| Team Member | Responsibilities |
|---|---|
| Şevval | • Project Integration (main.py) • Pipeline Architecture • Documentation |
| Dila | • Data Cleaning Logic (src/cleaner.py) • Messy Data Handling |
| Tuana | • Statistical Analysis (src/analyzer.py) • Anomaly Detection (Z-Score) |
| Nilsu | • Visualization Engine (src/visualizer.py) • Seaborn & Plotly Design |
| İlayda | • Web Interface (app.py) • Streamlit Dashboard Design |
The project utilizes Object-Oriented Programming (OOP) and a modular structure:
- Modular Design: Cleaning, Analyzing, and Visualizing are separated into dedicated modules.
- OOP Implementation: Core logic is encapsulated within classes for better maintainability.
- Web Integration: Modern dashboard built with Streamlit and Plotly.
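As a rough illustration of how these modules might fit together (the class and method names below are hypothetical sketches; the actual APIs live in `src/cleaner.py` and `src/analyzer.py`):

```python
import pandas as pd

# Hypothetical wiring of the modular pipeline; illustrative only.
class DataCleaner:
    def clean(self, df: pd.DataFrame) -> pd.DataFrame:
        # Drop rows missing the primary identifiers.
        return df.dropna(subset=["track_name", "artist_name", "album_name"])

class Analyzer:
    def summarize(self, df: pd.DataFrame) -> pd.Series:
        # Basic descriptive statistics for track popularity.
        return df["popularity"].describe()

def run_pipeline(df: pd.DataFrame) -> pd.Series:
    cleaned = DataCleaner().clean(df)
    return Analyzer().summarize(cleaned)
```

Keeping each stage behind its own class is what lets `main.py` orchestrate the whole flow while `app.py` reuses the same logic for the dashboard.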
```
SPOTIFY_DATA_ANALYSIS/
├── src/                 # Core Logic (OOP Modules)
│   ├── cleaner.py       # Data Cleaning Class
│   ├── analyzer.py      # Statistical Analysis Class
│   └── visualizer.py    # Visualization Engine
├── notebooks/           # Exploratory Data Analysis (EDA) logs
├── data/                # Raw and Processed Datasets
├── outputs/figures/     # Exported High-Resolution Charts
├── app.py               # Streamlit Dashboard Entry Point
├── main.py              # Automation & Pipeline Orchestrator
├── requirements.txt     # Dependency List
└── README.md            # Documentation
```
To ensure the reliability of our analysis, we implemented a rigorous step-by-step cleaning pipeline. The raw dataset contained several inconsistencies, missing values, and formatting issues that were resolved as follows:
1. **Metadata Integrity Check:** We first identified records missing essential identifiers such as `track_name`, `artist_name`, and `album_name`. Since these serve as the primary keys for our music analysis, rows with null values in these columns were removed to prevent data corruption.
2. **Date & Time Normalization:** The `album_release_date` column was originally a string with varying precision. We converted it into standardized Python `datetime` objects and, to facilitate time-series trend analysis, engineered a new `release_year` feature by extracting the year.
3. **Robust Numerical Imputation:** For missing entries in numerical columns such as `artist_followers` and `artist_popularity`, we applied median imputation. We chose the median over the mean so that extreme outliers in the dataset would not bias our central-tendency measures, preserving the integrity of the statistical distribution.
4. **Unit Transformation:** Track durations were provided in milliseconds. We converted them into minutes (`duration_min`) by dividing by 60,000, making the data more interpretable for the end user during visualization.
5. **Type Consistency & Casting:** We cast all boolean features (e.g., `explicit` content status) and categorical strings into their appropriate data types. This step was crucial for the seamless operation of the dashboard's dynamic filtering system.
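The steps above can be sketched in pandas roughly as follows (column names are taken from the descriptions above; this is an illustrative outline, not the exact code in `src/cleaner.py`):

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # 1. Metadata integrity: drop rows missing primary identifiers.
    df = df.dropna(subset=["track_name", "artist_name", "album_name"])
    # 2. Date normalization: parse strings, then extract the year.
    df["album_release_date"] = pd.to_datetime(df["album_release_date"], errors="coerce")
    df["release_year"] = df["album_release_date"].dt.year
    # 3. Median imputation, robust to extreme outliers.
    for col in ["artist_followers", "artist_popularity"]:
        df[col] = df[col].fillna(df[col].median())
    # 4. Unit transformation: milliseconds -> minutes.
    df["duration_min"] = df["duration_ms"] / 60_000
    # 5. Type casting for boolean features.
    df["explicit"] = df["explicit"].astype(bool)
    return df
```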
To go beyond simple summaries, we implemented:
- Distribution & Quartile Analysis: We utilized box-and-whisker plots to visualize the interquartile range (IQR) of track popularity, allowing us to identify the "typical" success range of a song.
- Anomaly Detection (Z-Score): Using `scipy.stats`, we calculated Z-scores for each track's popularity. Records with |Z| > 3 were flagged as statistical anomalies, helping us identify viral outliers that deviate significantly from the global average.
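A minimal sketch of the Z-score flagging described above (the |Z| > 3 threshold matches the text; the function name and sample data are illustrative, not the exact code in `src/analyzer.py`):

```python
import pandas as pd
from scipy import stats

def flag_anomalies(popularity: pd.Series, threshold: float = 3.0) -> pd.Series:
    # Z = (x - mean) / std; values with |Z| above the threshold are anomalies.
    z = pd.Series(stats.zscore(popularity), index=popularity.index)
    return z.abs() > threshold
```

With a long tail of ordinary tracks, only the rare viral outliers cross the threshold, which is exactly the behavior the dashboard surfaces.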
- Seaborn: Used for static scientific reporting (KDE plots and histograms) showing the popularity distribution.
- Plotly: Powers the interactive charts in our web dashboard. This meets the modern web-based visualization requirement, allowing users to inspect individual data points via tooltips.
1. Clone the Repository

```
git clone https://github.com/Sevvalm/spotify_data_analysis.git
cd spotify_data_analysis
```

2. Create a Virtual Environment

- Windows:

```
python -m venv venv
venv\Scripts\activate
```

- macOS:

```
python3 -m venv venv
source venv/bin/activate
```
3. Install Dependencies

Once the virtual environment is active, install all required libraries:

- Windows: `pip install -r requirements.txt`
- macOS: `pip3 install -r requirements.txt`
To run the entire pipeline (data cleaning + analysis + dashboard) automatically:

- Windows: `python main.py`
- macOS: `python3 main.py`
The application will process the raw data and automatically launch the web interface in your default browser.