
Spotify Global Music Trends (1952-2025) - Data Science Dashboard

Spotify Global Trends is a professional data science application that analyzes over 70 years of music history. The project implements a full modular pipeline, from cleaning a messy raw dataset to advanced statistical anomaly detection, all presented through an interactive web dashboard.

👥 Team & Contributions

  • Şevval: Project Integration (main.py), Pipeline Architecture, Documentation
  • Dila: Data Cleaning Logic (src/cleaner.py), Messy Data Handling
  • Tuana: Statistical Analysis (src/analyzer.py), Anomaly Detection (Z-Score)
  • Nilsu: Visualization Engine (src/visualizer.py), Seaborn & Plotly Design
  • İlayda: Web Interface (app.py), Streamlit Dashboard Design

🏗 Architecture & Design

The project utilizes Object-Oriented Programming (OOP) and a modular structure:

  • Modular Design: Cleaning, Analyzing, and Visualizing are separated into dedicated modules.
  • OOP Implementation: Core logic is encapsulated within classes for better maintainability.
  • Web Integration: Modern dashboard built with Streamlit and Plotly.
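As an illustration of this modular OOP design, the layout might look like the following minimal sketch; the class and method names here are hypothetical placeholders, not the project's actual API (which lives in src/cleaner.py and src/analyzer.py):

```python
# Hypothetical sketch of the modular OOP design; class/method names are
# illustrative only, not the project's real API.
import pandas as pd

class DataCleaner:
    """Encapsulates cleaning logic, as src/cleaner.py does."""
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()

    def clean(self) -> pd.DataFrame:
        # Drop rows missing the essential identifier
        self.df = self.df.dropna(subset=["track_name"])
        return self.df

class Analyzer:
    """Encapsulates statistical analysis, as src/analyzer.py does."""
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def summary(self) -> pd.DataFrame:
        return self.df.describe()

# main.py-style orchestration: each module does one job
raw = pd.DataFrame({"track_name": ["Song A", None], "popularity": [50, 70]})
clean = DataCleaner(raw).clean()
stats = Analyzer(clean).summary()
```

Because each concern sits behind its own class, a module can be tested or swapped independently of the dashboard.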

Project Structure

SPOTIFY_DATA_ANALYSIS/
├── src/                 # Core Logic (OOP Modules)
│   ├── cleaner.py       # Data Cleaning Class
│   ├── analyzer.py      # Statistical Analysis Class
│   └── visualizer.py    # Visualization Engine
├── notebooks/           # Exploratory Data Analysis (EDA) logs
├── data/                # Raw and Processed Datasets
├── outputs/figures/     # Exported High-Resolution Charts
├── app.py               # Streamlit Dashboard Entry Point
├── main.py              # Automation & Pipeline Orchestrator
├── requirements.txt     # Dependency List
└── README.md            # Documentation

🛠 Data Cleaning & Preprocessing Workflow

To ensure the reliability of our analysis, we implemented a rigorous step-by-step cleaning pipeline. The raw dataset contained several inconsistencies, missing values, and formatting issues that were resolved as follows:

  1. Metadata Integrity Check: We first identified records with missing essential identifiers such as track_name, artist_name, and album_name. Since these fields serve as the primary keys for our music analysis, rows with null values in any of them were removed; such records cannot be reliably identified or grouped.

  2. Date & Time Normalization: The album_release_date column was originally in string format with varying precision. We converted this data into standardized Python datetime objects. To facilitate time-series trends, we engineered a new release_year feature by extracting the year from these objects.

  3. Robust Numerical Imputation: For missing entries in numerical columns like artist_followers and artist_popularity, we applied Median Imputation. This method was specifically chosen over mean imputation to ensure that extreme outliers in the dataset did not bias our central tendency measures, preserving the statistical distribution's integrity.

  4. Unit Transformation: Track durations were provided in milliseconds. We transformed this feature into minutes (duration_min) by dividing duration_ms by 60,000, making the data more interpretable for the end user during visualization.

  5. Type Consistency & Casting: We ensured all boolean features (e.g., explicit content status) and categorical strings were cast into their appropriate data types. This step was crucial for the seamless operation of our dynamic filtering system in the dashboard.
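The five steps above can be sketched in pandas roughly as follows. Column names are taken from this README, but the toy data is invented for illustration; the authoritative implementation is src/cleaner.py:

```python
# Illustrative pandas sketch of the cleaning pipeline; toy data only.
import pandas as pd

df = pd.DataFrame({
    "track_name": ["Song A", None, "Song C"],
    "artist_name": ["X", "Y", "Z"],
    "album_name": ["P", "Q", "R"],
    "album_release_date": ["1999-05-01", "2005-07-20", "2021-11-12"],
    "artist_followers": [1000.0, 500.0, None],
    "duration_ms": [210000, 185000, 240000],
    "explicit": ["True", "False", "True"],
})

# 1. Metadata integrity: drop rows missing essential identifiers
df = df.dropna(subset=["track_name", "artist_name", "album_name"])

# 2. Date normalization + engineered release_year feature
df["album_release_date"] = pd.to_datetime(df["album_release_date"])
df["release_year"] = df["album_release_date"].dt.year

# 3. Median imputation (robust to extreme outliers, unlike the mean)
df["artist_followers"] = df["artist_followers"].fillna(df["artist_followers"].median())

# 4. Milliseconds -> minutes for interpretability
df["duration_min"] = df["duration_ms"] / 60_000

# 5. Type consistency: cast boolean-like strings to real booleans
df["explicit"] = df["explicit"].map({"True": True, "False": False}).astype(bool)
```

Note that in the real dataset album_release_date arrives with varying precision, so the production cleaner handles mixed date formats rather than the uniform ones shown here.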


📈 Advanced Analytics Methodology

To go beyond simple summaries, we implemented:

  • Distribution & Quartile Analysis: We utilized box-and-whisker plots to visualize the interquartile range (IQR) of track popularity, allowing us to identify the "typical" success range of a song.
  • Anomaly Detection (Z-Score): Using scipy.stats, we calculated Z-scores for each track's popularity. Records with a |Z| > 3 were flagged as statistical anomalies, helping us identify viral outliers that deviate significantly from the global average.
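A minimal sketch of this flagging rule, using scipy.stats.zscore on a toy popularity array (the project applies it to the full dataset's popularity column):

```python
# Toy illustration of the |Z| > 3 anomaly rule; data is invented.
import numpy as np
from scipy import stats

popularity = np.array([48, 49, 50, 51, 52] * 4 + [100])  # one "viral" outlier
z = stats.zscore(popularity)           # (x - mean) / std
anomalies = popularity[np.abs(z) > 3]  # flag extreme deviations
```

With 21 tightly clustered values, the single score of 100 sits more than four standard deviations from the mean, so it is the only record flagged.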

🎨 Professional Visualizations

  • Seaborn: Used for static scientific reporting (KDE plots and Histograms) to show popularity distribution.
  • Plotly: Implemented within our Modern Web Dashboard to provide interactive discovery. This meets the modern web-based visualization requirement, allowing users to inspect individual data points via tooltips.

⚙️ Installation & Setup

1. Clone the Repository

git clone https://github.com/Sevvalm/spotify_data_analysis.git
cd spotify_data_analysis

2. Create a Virtual Environment

Windows

python -m venv venv
venv\Scripts\activate

macOS

python3 -m venv venv
source venv/bin/activate

3. Install Dependencies

Once the virtual environment is active, install all required libraries:

  • Windows:
    pip install -r requirements.txt
  • macOS:
    pip3 install -r requirements.txt

▶️ Running the Application

To run the entire pipeline (Data Cleaning + Analysis + Dashboard) automatically:

  • Windows:

    python main.py
  • macOS:

    python3 main.py

The application will process the raw data and automatically launch the web interface in your default browser.

