Step 01: Why Spotify Data?
Every data engineering project starts with a question, and music streaming data is packed with great ones. Why do some songs rack up billions of streams while similar tracks barely crack a million? Is there a measurable "formula" to danceability? Do listeners in different regions gravitate toward different tempos?
Spotify's audio feature data is one of the most accessible, well-structured, and genuinely fun datasets to work with. Every track comes annotated with machine-generated features: numerical scores for danceability, energy, valence (musical positivity), speechiness, and more. That means you get to practice real data engineering patterns on data you can actually listen to.
This project is designed for beginners who want real-world practice and data enthusiasts who want a satisfying end-to-end build. We'll take raw, messy CSV data, push it through a proper medallion architecture (Bronze > Silver > Gold), and produce visualizations that surface genuinely interesting musical insights.
Step 02: The Dataset
We're working with a Spotify streaming dataset that you can find on Kaggle (search for "Most Streamed Spotify Songs" or "Spotify Global Music Dataset"). A typical version contains thousands of tracks with both metadata and audio features. Here's what the key columns look like:
Metadata Fields
| Column | Type | Example | Description |
|---|---|---|---|
| track_name | string | Blinding Lights | Name of the song |
| artist_name | string | The Weeknd | Primary artist |
| artist_count | int | 1 | Number of contributing artists |
| released_year | int | 2020 | Year of release |
| released_month | int | 11 | Month of release (1–12) |
| streams | string | 3,703,895,074 | Total Spotify streams (the raw CSV stores these as comma-separated strings, so they arrive as text, not numbers) |
| in_spotify_playlists | int | 43,899 | Number of Spotify playlists featuring the track |
| in_apple_playlists | int | 672 | Number of Apple Music playlists |
Audio Feature Fields
These are the Spotify-generated scores (0–100 scale) that describe how a track sounds:

| Column | Range | Description |
|---|---|---|
| danceability_% | 0–100 | How suitable the track is for dancing |
| energy_% | 0–100 | Perceived intensity and activity |
| valence_% | 0–100 | Musical positivity (high = happy, low = sad) |
| speechiness_% | 0–100 | Presence of spoken words |
| acousticness_% | 0–100 | Confidence that the track is acoustic |
| liveness_% | 0–100 | Likelihood the track was recorded with a live audience |
Additionally, you'll typically find bpm (beats per minute), key, and mode (Major/Minor) columns, which are crucial for musical analysis.
Step 03: Architecture Overview
We'll organize the pipeline using the Medallion Architecture, a layered data design pattern widely used in modern data engineering (especially in Databricks and Delta Lake ecosystems). Each layer has a clear responsibility:
Bronze
Raw ingestion. Exactly what came from the source. No cleaning, no transforms. The "single source of truth" for raw data.
Silver
Cleaned & conformed. Fix types, handle nulls, normalize names, parse dates. Validated and query-ready.
Gold
Business-level aggregations. Pre-computed metrics, ranked summaries, and tables optimized for dashboards.
Why bother with layers? Because in real-world pipelines, raw data is messy but you never want to destroy the original. By keeping the Bronze layer untouched, you can always re-derive Silver and Gold if your logic changes. This is the same principle used in production data platforms at companies running Azure Data Factory, Databricks, Snowflake, and similar tools.
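The re-derivability idea can be sketched in a few lines of plain Python. This is a toy illustration with made-up rows, not the real pipeline: Bronze is append-only, and Silver and Gold are pure functions of the layer below, so changing your cleaning logic just means re-running the functions.

```python
# Toy illustration of the medallion principle: Bronze is immutable,
# Silver and Gold are pure functions of the layer below them.
bronze = [
    {"track_name": "Blinding Lights", "streams": "3,703,895,074"},
    {"track_name": None, "streams": "NaN"},  # messy row, kept as-is in Bronze
]

def to_silver(rows):
    """Clean: drop null track names, parse comma-separated stream counts."""
    out = []
    for r in rows:
        if r["track_name"] is None:
            continue
        digits = "".join(ch for ch in r["streams"] if ch.isdigit())
        if digits:
            out.append({"track_name": r["track_name"], "streams": int(digits)})
    return out

def to_gold(rows):
    """Aggregate: total streams across all clean tracks."""
    return {"total_streams": sum(r["streams"] for r in rows)}

silver = to_silver(bronze)
gold = to_gold(silver)
# If the cleaning rules change, just re-run to_silver/to_gold;
# bronze is never modified.
print(gold)  # {'total_streams': 3703895074}
```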
Step 04: Bronze Layer: Raw Ingestion
The Bronze layer is deceptively simple: read the raw file and store it as-is, adding only ingestion metadata. No filtering, no type-casting, no cleaning.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit, input_file_name

spark = SparkSession.builder.appName("SpotifyBronze").getOrCreate()

# Read raw CSV exactly as-is
raw_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "false")  # Keep everything as StringType
    .csv("/mnt/landing/spotify_raw/*.csv")
)

# Add ingestion metadata
bronze_df = (
    raw_df
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", input_file_name())
    .withColumn("_batch_id", lit("batch_2025_04"))
)

# Write to Bronze table (Delta format for time-travel)
(
    bronze_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("spotify_catalog.bronze.raw_tracks")
)

print(f"Bronze: {bronze_df.count()} rows ingested")
```
What Bronze looks like
| track_name | artist(s)_name | streams | danceability_% | _ingested_at |
|---|---|---|---|---|
| Blinding Lights | The Weeknd | 3,703,895,074 | 51 | 2025-04-23 09:14:22 |
| Shape of You | Ed Sheeran | 3,562,543,890 | 83 | 2025-04-23 09:14:22 |
| Someone You Loved | Lewis Capaldi | 2,887,241,567 | 50 | 2025-04-23 09:14:22 |
| Sunflower | Post Malone, Swae Lee | 2,808,096,209 | 76 | 2025-04-23 09:14:22 |
| Bad Song™ | ??Unknown?? | NaN | | 2025-04-23 09:14:22 |
Notice the last row: a garbled artist name, a NaN stream count, a missing danceability score. That's fine! Bronze stores everything. We'll deal with it in Silver.
Step 05: Silver Layer: Clean, Parse, Normalize
This is where the real engineering happens. The Silver layer is where you apply data quality rules, cast types, handle nulls, normalize inconsistent values, and produce a clean, reliable dataset that analysts can trust.
The cleaning checklist
Cast numeric columns
Convert streams, bpm, and all audio feature percentages from strings to proper integers/floats. Handle comma-separated numbers and non-numeric values.
Parse release dates
Combine released_year, released_month, and released_day into a proper DATE column. Handle invalid months (0, 13+) and missing days.
Normalize artist names
Trim whitespace, standardize casing, and split multi-artist fields (e.g., "Post Malone, Swae Lee") into an array for downstream joins.
Handle nulls & outliers
Drop rows where track_name is null, flag suspicious values (e.g., streams = 0 for a track with 40,000+ playlist appearances), and impute missing key values as "Unknown".
Add derived columns
Calculate decade, release_era ("Pre-streaming", "Early-streaming", "Modern"), and mood_category based on valence + energy thresholds.
```python
from pyspark.sql.functions import (
    col, regexp_replace, trim, split, when,
    make_date, coalesce, lit, current_timestamp
)
from pyspark.sql.types import LongType, IntegerType, DoubleType

# Read from Bronze
bronze_df = spark.table("spotify_catalog.bronze.raw_tracks")

# Step 1: Clean and cast the streams column
silver_df = (
    bronze_df
    # Remove commas and cast streams to Long
    .withColumn(
        "streams",
        regexp_replace(col("streams"), "[^0-9]", "").cast(LongType())
    )
    # Cast audio features to integers
    .withColumn("danceability_pct", col("`danceability_%`").cast(IntegerType()))
    .withColumn("energy_pct", col("`energy_%`").cast(IntegerType()))
    .withColumn("valence_pct", col("`valence_%`").cast(IntegerType()))
    .withColumn("speechiness_pct", col("`speechiness_%`").cast(IntegerType()))
    .withColumn("acousticness_pct", col("`acousticness_%`").cast(IntegerType()))
    .withColumn("liveness_pct", col("`liveness_%`").cast(IntegerType()))
    .withColumn("bpm", col("bpm").cast(IntegerType()))
)

# Step 2: Build a proper release date
silver_df = silver_df.withColumn(
    "release_date",
    make_date(
        col("released_year").cast(IntegerType()),
        coalesce(col("released_month").cast(IntegerType()), lit(1)),
        coalesce(col("released_day").cast(IntegerType()), lit(1))
    )
)

# Step 3: Normalize artist names -> array
silver_df = silver_df.withColumn(
    "artists_array",
    split(trim(col("`artist(s)_name`")), ",\\s*")
)

# Step 4: Classify mood based on valence + energy
silver_df = silver_df.withColumn(
    "mood",
    when((col("valence_pct") >= 60) & (col("energy_pct") >= 60), "Euphoric")
    .when((col("valence_pct") >= 60) & (col("energy_pct") < 60), "Chill Happy")
    .when((col("valence_pct") < 40) & (col("energy_pct") >= 60), "Intense")
    .when((col("valence_pct") < 40) & (col("energy_pct") < 40), "Melancholy")
    .otherwise("Neutral")
)

# Step 5: Drop invalid rows, add metadata
silver_df = (
    silver_df
    .filter(col("track_name").isNotNull())
    .filter(col("streams").isNotNull())
    .withColumn("_cleaned_at", current_timestamp())
)

# Write to Silver
(
    silver_df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("spotify_catalog.silver.cleaned_tracks")
)

print(f"Silver: {silver_df.count()} clean rows")
```
The mood classification logic
This is one of the most interesting derived columns. By combining valence (how positive a song sounds) with energy (how intense it feels), we can classify every track into a mood quadrant:
Mood Quadrant

| | High energy (≥ 60) | Low energy |
|---|---|---|
| High valence (≥ 60) | Euphoric | Chill Happy (energy < 60) |
| Low valence (< 40) | Intense | Melancholy (energy < 40) |

Tracks in the middle band (valence 40–59, or low valence with mid-range energy) fall back to "Neutral".
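For intuition, here is the same when/otherwise chain from the Silver code as a plain Python function, with the thresholds copied directly from it:

```python
def classify_mood(valence: int, energy: int) -> str:
    """Mirror of the Silver layer's mood logic: valence x energy quadrants."""
    if valence >= 60 and energy >= 60:
        return "Euphoric"
    if valence >= 60 and energy < 60:
        return "Chill Happy"
    if valence < 40 and energy >= 60:
        return "Intense"
    if valence < 40 and energy < 40:
        return "Melancholy"
    return "Neutral"

# A high-valence, high-energy track vs. a low/low one vs. a mid-range one
print(classify_mood(75, 80))  # Euphoric
print(classify_mood(20, 25))  # Melancholy
print(classify_mood(50, 50))  # Neutral
```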
Step 06: Gold Layer: Analytics-Ready Aggregations
The Gold layer is where we build the tables that directly power dashboards and reports. These are pre-aggregated, business-meaningful tables. Instead of running expensive group-by queries on millions of rows every time someone opens a dashboard, we compute the results once and store them.
Gold Table 1: Top Artists by Total Streams
```sql
CREATE OR REPLACE TABLE spotify_catalog.gold.top_artists AS
WITH exploded_artists AS (
  SELECT
    explode(artists_array) AS artist,
    streams,
    danceability_pct,
    energy_pct,
    valence_pct,
    release_date
  FROM spotify_catalog.silver.cleaned_tracks
)
SELECT
  artist,
  COUNT(*) AS track_count,
  SUM(streams) AS total_streams,
  ROUND(AVG(danceability_pct), 1) AS avg_danceability,
  ROUND(AVG(energy_pct), 1) AS avg_energy,
  ROUND(AVG(valence_pct), 1) AS avg_valence,
  MIN(release_date) AS earliest_release,
  MAX(release_date) AS latest_release
FROM exploded_artists
GROUP BY artist
HAVING COUNT(*) >= 3  -- repeating the aggregate is more portable than the alias
ORDER BY total_streams DESC;
```
Gold Table 2: Mood Distribution by Year
```sql
CREATE OR REPLACE TABLE spotify_catalog.gold.mood_by_year AS
SELECT
  YEAR(release_date) AS release_year,
  mood,
  COUNT(*) AS track_count,
  ROUND(
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (
      PARTITION BY YEAR(release_date)
    ),
    1
  ) AS pct_of_year,
  ROUND(AVG(streams), 0) AS avg_streams
FROM spotify_catalog.silver.cleaned_tracks
WHERE YEAR(release_date) >= 2010
GROUP BY release_year, mood
ORDER BY release_year, track_count DESC;
```
Gold Table 3: Audio Feature Correlations
```sql
CREATE OR REPLACE TABLE spotify_catalog.gold.feature_correlations AS
SELECT 'danceability_vs_energy' AS pair,
       ROUND(CORR(danceability_pct, energy_pct), 3) AS correlation
FROM spotify_catalog.silver.cleaned_tracks
UNION ALL
SELECT 'energy_vs_valence',
       ROUND(CORR(energy_pct, valence_pct), 3)
FROM spotify_catalog.silver.cleaned_tracks
UNION ALL
SELECT 'danceability_vs_valence',
       ROUND(CORR(danceability_pct, valence_pct), 3)
FROM spotify_catalog.silver.cleaned_tracks
UNION ALL
SELECT 'speechiness_vs_streams',
       ROUND(CORR(speechiness_pct, streams), 3)
FROM spotify_catalog.silver.cleaned_tracks
UNION ALL
SELECT 'acousticness_vs_energy',
       ROUND(CORR(acousticness_pct, energy_pct), 3)
FROM spotify_catalog.silver.cleaned_tracks;
```
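SQL's CORR computes the Pearson correlation coefficient. A minimal pure-Python version can be handy for sanity-checking the Gold table against a small sample:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly correlated toy data -> 1.0; perfectly anti-correlated -> -1.0
print(round(pearson([1, 2, 3], [2, 4, 6]), 3))   # 1.0
print(round(pearson([1, 2, 3], [6, 4, 2]), 3))   # -1.0
```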
Step 07: Visualizations: Charts That Tell a Story
Now for the payoff! Great data engineering isn't just about clean tables; it's about making the data speak. Below are the key charts we can build from our Gold layer, each revealing a different facet of the streaming landscape.
[Chart: Top artists by total streams — powered by Gold table top_artists]

[Chart: Audio feature correlation heatmap — powered by Gold table feature_correlations. Green = positive correlation · Pink = negative · Darker = stronger relationship]

[Chart: Mood distribution by release year — powered by Gold table mood_by_year]

[Chart: Audio feature distributions — powered by Silver table cleaned_tracks]
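A chart like the mood-by-year heatmap needs the long-format mood_by_year rows pivoted into a year × mood grid first. Here's a pure-Python sketch of that reshaping step; the column names match the Gold schema, but the sample values are invented for illustration:

```python
# Sample rows as they might come back from gold.mood_by_year:
# (release_year, mood, pct_of_year) -- values are made up for illustration.
rows = [
    (2019, "Euphoric", 31.0), (2019, "Melancholy", 18.5),
    (2020, "Euphoric", 28.2), (2020, "Melancholy", 22.1),
]

# Pivot long-format rows into {year: {mood: pct}} for a heatmap library
grid = {}
for year, mood, pct in rows:
    grid.setdefault(year, {})[mood] = pct

print(grid[2020]["Melancholy"])  # 22.1
```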
Step 08: Key Insights: What the Data Reveals
Here are the most interesting patterns that emerge once the pipeline is running:
1. Danceability is king (but not everything)
Tracks with danceability scores between 65 and 80 dominate the top-streamed list. But danceability alone isn't enough: the biggest hits combine high danceability with moderate energy (the 55–75 range). Too much energy, and the track tips from "fun to dance to" into "exhausting."
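This "sweet spot" claim is easy to check against your own Silver table. Here's the filter as plain Python over hypothetical rows; the column names follow the Silver schema, but the values are invented:

```python
# Hypothetical rows shaped like silver.cleaned_tracks
tracks = [
    {"track_name": "Hit A", "danceability_pct": 72, "energy_pct": 65},
    {"track_name": "Hit B", "danceability_pct": 90, "energy_pct": 95},  # too intense
    {"track_name": "Hit C", "danceability_pct": 68, "energy_pct": 60},
]

# The hypothesized hit zone: danceability 65-80, energy 55-75
sweet_spot = [
    t for t in tracks
    if 65 <= t["danceability_pct"] <= 80 and 55 <= t["energy_pct"] <= 75
]
print([t["track_name"] for t in sweet_spot])  # ['Hit A', 'Hit C']
```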
2. Sadder music streams more than you'd expect
Tracks classified as "Melancholy" or "Intense" account for roughly 35–40% of top-streamed songs. This challenges the assumption that happy = popular. Songs like Someone You Loved and Lovely prove that emotional depth can drive massive engagement.
3. The collaboration multiplier
Tracks with 2+ artists average 1.4x more streams than solo tracks. This isn't just about combined fanbases; collabs also tend to score higher on energy, suggesting they're engineered for playlists and virality.
4. Energy and loudness are nearly the same thing
The correlation between energy and loudness is 0.52 in the dataset, one of the strongest in the matrix. This tells us Spotify's "energy" metric is noticeably influenced by production volume, not just tempo or intensity of arrangement.
5. Key doesn't matter (much)
There's no statistically significant relationship between musical key and stream count. C# and G# are the most common keys in the dataset, but that's more about production conventions than listener preference.
Step 09: Where to Take It from Here
This project gives you a solid foundation. Here are natural extensions:
Add a real orchestrator
Wrap the Bronze > Silver > Gold pipeline in Azure Data Factory or Databricks Workflows. Schedule daily runs, add retry logic, and set up alerts for data quality failures (e.g., "stream count dropped 50% vs. yesterday > investigate").
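That "dropped 50% vs. yesterday" alert boils down to a one-function data quality check. A minimal sketch (the function name and threshold are assumptions, not part of any orchestrator's API):

```python
def quality_alert(today_streams: int, yesterday_streams: int,
                  threshold: float = 0.5) -> bool:
    """Flag a run when total streams drop by more than `threshold`
    day-over-day -- the kind of check an orchestrator would run post-load."""
    if yesterday_streams == 0:
        return True  # no baseline to compare against: always investigate
    drop = (yesterday_streams - today_streams) / yesterday_streams
    return drop > threshold

print(quality_alert(400, 1000))  # True  (60% drop -> investigate)
print(quality_alert(900, 1000))  # False (10% drop -> fine)
```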
Build a live dashboard
Connect the Gold tables to Power BI, Streamlit, or Grafana. Let stakeholders explore the mood heatmap, filter by artist, and drill into audio feature distributions interactively.
Add the Spotify API for live data
Instead of a static Kaggle CSV, connect to the Spotify Web API to pull real-time streaming data. This adds authentication (OAuth2), pagination, rate-limiting, and incremental ingestion patterns.
Predictive modeling
Use the Gold tables as feature inputs for a model that predicts whether a new track will cross 1 billion streams based on its audio features, artist history, and release timing. This bridges data engineering into data science territory.
Integrate cross-platform data
The Kaggle datasets often include Apple Music, TikTok, and YouTube metrics alongside Spotify data. Add those to your Silver layer and build Gold tables that compare cross-platform performance: which artists dominate TikTok but underperform on Spotify?