Project #1 Data Engineering

Spotify Analytics:
From Raw Streams to Visual Insights

A hands-on, step-by-step guide to building a complete analytics pipeline, ingesting messy streaming data, transforming it through a medallion architecture, and surfacing insights that make you rethink what makes a hit.

Beginner-Friendly · Python + SQL · ~25 min read · Medallion Architecture

Step 01: Why Spotify Data?

Every data engineering project starts with a question, and music streaming data is packed with great ones. Why do some songs rack up billions of streams while similar tracks barely crack a million? Is there a measurable "formula" to danceability? Do listeners in different regions gravitate toward different tempos?

Spotify's audio feature data is one of the most accessible, well-structured, and genuinely fun datasets to work with. Every track comes annotated with machine-generated features: numerical scores for danceability, energy, valence (musical positivity), speechiness, and more. That means you get to practice real data engineering patterns on data you can actually listen to.

This project is designed for beginners who want real-world practice and data enthusiasts who want a satisfying end-to-end build. We'll take raw, messy CSV data, push it through a proper medallion architecture (Bronze > Silver > Gold), and produce visualizations that surface genuinely interesting musical insights.

Why this dataset works for learning
The schema is wide enough to practice joins, aggregations, and window functions, but not so complex that you spend three days just understanding the ERD. Plus, every row is a song you can go play: instant feedback that keeps you motivated.

Step 02: The Dataset

We're working with a Spotify streaming dataset that you can find on Kaggle (search for "Most Streamed Spotify Songs" or "Spotify Global Music Dataset"). A typical version contains thousands of tracks with both metadata and audio features. Here's what the key columns look like:

Metadata Fields

Column                 Type      Example          Description
track_name             string    Blinding Lights  Name of the song
artist_name            string    The Weeknd       Primary artist
artist_count           int       1                Number of contributing artists
released_year          int       2020             Year of release
released_month         int       11               Month of release (1–12)
streams                string*   3,703,895,074    Total Spotify streams
in_spotify_playlists   int       43,899           Number of Spotify playlists featuring the track
in_apple_playlists     int       672              Number of Apple Music playlists
Watch out
The streams column often arrives as a string with commas or mixed encodings. This is intentional in many Kaggle versions: it's one of the cleaning challenges we'll handle in the Silver layer.
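As a taste of the cleanup ahead, here's a minimal pure-Python sketch of turning a comma-formatted streams string into an integer (the function name is mine, not from the dataset). The PySpark Silver layer later does the same thing with `regexp_replace`:

```python
def parse_streams(raw):
    """Parse a raw streams value like '3,703,895,074' into an int.

    Returns None for values with no digits at all (e.g. 'NaN'),
    keeping nulls explicit rather than silently coercing to 0.
    """
    if raw is None:
        return None
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    return int(digits) if digits else None
```

Stripping every non-digit (rather than only commas) also survives stray whitespace and mixed encodings, which is exactly the failure mode the callout above describes.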

Audio Feature Fields

These are the Spotify-generated scores (0–100 scale) that describe how a track sounds:

danceability_%, energy_%, valence_%, speechiness_%, acousticness_%, instrumentalness_%, liveness_%

Additionally, you'll typically find bpm (beats per minute), key, and mode (Major/Minor) columns, which are crucial for musical analysis.

Step 03: Architecture Overview

We'll organize the pipeline using the Medallion Architecture, a layered data design pattern widely used in modern data engineering (especially in Databricks and Delta Lake ecosystems). Each layer has a clear responsibility:

🥉

Bronze

Raw ingestion. Exactly what came from the source. No cleaning, no transforms. The "single source of truth" for raw data.

->
🥈

Silver

Cleaned & conformed. Fix types, handle nulls, normalize names, parse dates. Validated and query-ready.

->
🥇

Gold

Business-level aggregations. Pre-computed metrics, ranked summaries, and tables optimized for dashboards.

Why bother with layers? Because in real-world pipelines, raw data is messy but you never want to destroy the original. By keeping the Bronze layer untouched, you can always re-derive Silver and Gold if your logic changes. This is the same principle used in production data platforms at companies running Azure Data Factory, Databricks, Snowflake, and similar tools.

Step 04: Bronze Layer: Raw Ingestion

The Bronze layer is deceptively simple: read the raw file and store it as-is, adding only ingestion metadata. No filtering, no type-casting, no cleaning.

bronze_ingestion.py PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit, input_file_name

spark = SparkSession.builder.appName("SpotifyBronze").getOrCreate()

# Read raw CSV exactly as-is
raw_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "false")   # Keep everything as StringType
    .csv("/mnt/landing/spotify_raw/*.csv")
)

# Add ingestion metadata
bronze_df = (
    raw_df
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", input_file_name())
    .withColumn("_batch_id", lit("batch_2025_04"))
)

# Write to Bronze table (Delta format for time-travel)
(
    bronze_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("spotify_catalog.bronze.raw_tracks")
)

print(f"Bronze: {bronze_df.count()} rows ingested")
Key decisions explained
We use inferSchema=false on purpose. Letting Spark guess types at this stage can drop rows where a "number" column contains stray text. Better to keep everything as strings and handle type conversion explicitly in Silver. The three underscore-prefixed columns (_ingested_at, _source_file, _batch_id) are metadata: they don't modify the source data but let you trace every row back to when and where it was loaded.

What Bronze looks like

track_name          artist(s)_name          streams         danceability_%   _ingested_at
Blinding Lights     The Weeknd              3,703,895,074   51               2025-04-23 09:14:22
Shape of You        Ed Sheeran              3,562,543,890   83               2025-04-23 09:14:22
Someone You Loved   Lewis Capaldi           2,887,241,567   50               2025-04-23 09:14:22
Sunflower           Post Malone, Swae Lee   2,808,096,209   76               2025-04-23 09:14:22
Bad Song™??         Unknown                 ??              NaN              2025-04-23 09:14:22

Notice the last row: messy data! That's fine. Bronze stores everything; we'll deal with it in Silver.

Step 05: Silver Layer: Clean, Parse, Normalize

This is where the real engineering happens. The Silver layer is where you apply data quality rules, cast types, handle nulls, normalize inconsistent values, and produce a clean, reliable dataset that analysts can trust.

The cleaning checklist

Cast numeric columns

Convert streams, bpm, and all audio feature percentages from strings to proper integers/floats. Handle comma-separated numbers and non-numeric values.

Parse release dates

Combine released_year, released_month, and released_day into a proper DATE column. Handle invalid months (0, 13+) and missing days.

Normalize artist names

Trim whitespace, standardize casing, and split multi-artist fields (e.g., "Post Malone, Swae Lee") into an array for downstream joins.

Handle nulls & outliers

Drop rows where track_name is null, flag suspicious values (e.g., streams = 0 for a track with 40,000+ playlist appearances), and impute missing key values as "Unknown".

Add derived columns

Calculate decade, release_era ("Pre-streaming", "Early-streaming", "Modern"), and mood_category based on valence + energy thresholds.
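To make two of the trickier checklist items concrete, here's a plain-Python sketch of the mood quadrant and the multi-artist split. The thresholds mirror the PySpark logic in this step; the function names are mine:

```python
def classify_mood(valence, energy):
    """Mood quadrant from valence (positivity) and energy, 0-100 scale.

    Same thresholds as the Silver-layer `when` chain: >= 60 is "high",
    < 40 is "low", and anything in between falls through to Neutral.
    """
    if valence is None or energy is None:
        return "Neutral"
    if valence >= 60 and energy >= 60:
        return "Euphoric"
    if valence >= 60:
        return "Chill Happy"
    if valence < 40 and energy >= 60:
        return "Intense"
    if valence < 40 and energy < 40:
        return "Melancholy"
    return "Neutral"


def split_artists(raw):
    """Split a multi-artist field like 'Post Malone, Swae Lee' into a list."""
    return [a.strip() for a in raw.split(",")] if raw else []
```

Note that a track with low valence but mid-range energy (say 30/50) lands in Neutral, not Melancholy; deciding where those in-between cases go is a real data-modeling choice, not a technicality.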

silver_transform.py PySpark
from pyspark.sql.functions import (
    col, regexp_replace, trim, split, when,
    make_date, coalesce, lit, current_timestamp
)
from pyspark.sql.types import LongType, IntegerType, DoubleType

# Read from Bronze
bronze_df = spark.table("spotify_catalog.bronze.raw_tracks")

# Step 1: Clean and cast the streams column
silver_df = (
    bronze_df
    # Remove commas and cast streams to Long
    .withColumn(
        "streams",
        regexp_replace(col("streams"), "[^0-9]", "").cast(LongType())
    )
    # Cast audio features to integers
    .withColumn("danceability_pct",  col("`danceability_%`").cast(IntegerType()))
    .withColumn("energy_pct",        col("`energy_%`").cast(IntegerType()))
    .withColumn("valence_pct",       col("`valence_%`").cast(IntegerType()))
    .withColumn("speechiness_pct",   col("`speechiness_%`").cast(IntegerType()))
    .withColumn("acousticness_pct",  col("`acousticness_%`").cast(IntegerType()))
    .withColumn("instrumentalness_pct", col("`instrumentalness_%`").cast(IntegerType()))
    .withColumn("liveness_pct",      col("`liveness_%`").cast(IntegerType()))
    .withColumn("bpm",               col("bpm").cast(IntegerType()))
)

# Step 2: Build a proper release date
silver_df = silver_df.withColumn(
    "release_date",
    make_date(
        col("released_year").cast(IntegerType()),
        coalesce(col("released_month").cast(IntegerType()), lit(1)),
        coalesce(col("released_day").cast(IntegerType()), lit(1))
    )
)

# Step 3: Normalize artist names -> array
silver_df = silver_df.withColumn(
    "artists_array",
    split(trim(col("`artist(s)_name`")), ",\\s*")
)

# Step 4: Classify mood based on valence + energy
silver_df = silver_df.withColumn(
    "mood",
    when((col("valence_pct") >= 60) & (col("energy_pct") >= 60), "Euphoric")
    .when((col("valence_pct") >= 60) & (col("energy_pct") < 60),  "Chill Happy")
    .when((col("valence_pct") < 40)  & (col("energy_pct") >= 60), "Intense")
    .when((col("valence_pct") < 40)  & (col("energy_pct") < 40),  "Melancholy")
    .otherwise("Neutral")
)

# Step 5: Drop invalid rows, add metadata
silver_df = (
    silver_df
    .filter(col("track_name").isNotNull())
    .filter(col("streams").isNotNull())
    .withColumn("_cleaned_at", current_timestamp())
)

# Write to Silver
(
    silver_df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("spotify_catalog.silver.cleaned_tracks")
)

print(f"Silver: {silver_df.count()} clean rows")

The mood classification logic

This is one of the most interesting derived columns. By combining valence (how positive a song sounds) with energy (how intense it feels), we can classify every track into a mood quadrant:

Figure: Mood Quadrant (mapping valence x energy to mood)

Step 06: Gold Layer: Analytics-Ready Aggregations

The Gold layer is where we build the tables that directly power dashboards and reports. These are pre-aggregated, business-meaningful tables. Instead of running expensive group-by queries on millions of rows every time someone opens a dashboard, we compute the results once and store them.

Gold Table 1: Top Artists by Total Streams

gold_top_artists.sql SQL
CREATE OR REPLACE TABLE spotify_catalog.gold.top_artists AS

WITH exploded_artists AS (
    SELECT
        explode(artists_array) AS artist,
        streams,
        danceability_pct,
        energy_pct,
        valence_pct,
        release_date
    FROM spotify_catalog.silver.cleaned_tracks
)

SELECT
    artist,
    COUNT(*)                          AS track_count,
    SUM(streams)                      AS total_streams,
    ROUND(AVG(danceability_pct), 1)   AS avg_danceability,
    ROUND(AVG(energy_pct), 1)         AS avg_energy,
    ROUND(AVG(valence_pct), 1)        AS avg_valence,
    MIN(release_date)                 AS earliest_release,
    MAX(release_date)                 AS latest_release
FROM exploded_artists
GROUP BY artist
HAVING track_count >= 3
ORDER BY total_streams DESC;
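For intuition about what the explode + GROUP BY is doing, the same pattern fits in a few lines of plain Python (the sample rows in the test are hypothetical, not from the dataset):

```python
from collections import defaultdict


def top_artists(tracks):
    """Explode each track's artist list and aggregate per artist.

    Mirrors the CTE + GROUP BY in the Gold SQL: a track with two artists
    contributes its full stream count to both, which is also how the SQL
    explode() behaves -- streams are intentionally double-counted.
    """
    totals = defaultdict(lambda: {"track_count": 0, "total_streams": 0})
    for track in tracks:
        for artist in track["artists_array"]:
            totals[artist]["track_count"] += 1
            totals[artist]["total_streams"] += track["streams"]
    return sorted(
        ({"artist": artist, **metrics} for artist, metrics in totals.items()),
        key=lambda row: row["total_streams"],
        reverse=True,
    )
```

The double-counting on collaborations is worth noticing: summing total_streams across all artists will exceed the dataset's true stream total, which is fine for a ranking table but wrong for a market-share table.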

Gold Table 2: Mood Distribution by Year

gold_mood_by_year.sql SQL
CREATE OR REPLACE TABLE spotify_catalog.gold.mood_by_year AS

SELECT
    YEAR(release_date)      AS release_year,
    mood,
    COUNT(*)                 AS track_count,
    ROUND(
        COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (
            PARTITION BY YEAR(release_date)
        ), 1
    )                       AS pct_of_year,
    ROUND(AVG(streams), 0)   AS avg_streams
FROM spotify_catalog.silver.cleaned_tracks
WHERE YEAR(release_date) >= 2010
GROUP BY release_year, mood
ORDER BY release_year, track_count DESC;
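The windowed percentage (each mood's count divided by its year's total) is easy to mistrust until you compute it by hand. A small Python sketch of the same arithmetic, using made-up rows:

```python
from collections import Counter


def mood_pct_by_year(rows):
    """Each mood's share of its release year, as a percentage.

    Mirrors COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY year):
    the denominator is the year's total track count, not the grand total.
    """
    mood_counts = Counter((r["release_year"], r["mood"]) for r in rows)
    year_totals = Counter(r["release_year"] for r in rows)
    return {
        (year, mood): round(n * 100.0 / year_totals[year], 1)
        for (year, mood), n in mood_counts.items()
    }
```

The key detail is the partition: each percentage sums to 100 within a year, so the Gold table answers "what share of 2020's releases were Melancholy?", not "what share of all tracks?".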

Gold Table 3: Audio Feature Correlations

gold_feature_correlations.sql SQL
CREATE OR REPLACE TABLE spotify_catalog.gold.feature_correlations AS

SELECT
    'danceability_vs_energy'     AS pair,
    ROUND(CORR(danceability_pct, energy_pct), 3) AS correlation
FROM spotify_catalog.silver.cleaned_tracks
UNION ALL
SELECT 'energy_vs_valence',
    ROUND(CORR(energy_pct, valence_pct), 3)
FROM spotify_catalog.silver.cleaned_tracks
UNION ALL
SELECT 'danceability_vs_valence',
    ROUND(CORR(danceability_pct, valence_pct), 3)
FROM spotify_catalog.silver.cleaned_tracks
UNION ALL
SELECT 'speechiness_vs_streams',
    ROUND(CORR(speechiness_pct, streams), 3)
FROM spotify_catalog.silver.cleaned_tracks
UNION ALL
SELECT 'acousticness_vs_energy',
    ROUND(CORR(acousticness_pct, energy_pct), 3)
FROM spotify_catalog.silver.cleaned_tracks;
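SQL's CORR computes the Pearson correlation coefficient. If you ever want to sanity-check a value outside the warehouse, the formula is short enough to write directly:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, the statistic behind SQL's CORR.

    Covariance of the two series divided by the product of their
    standard deviations; result is always in [-1, 1].
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

Remember that correlation only captures linear relationships: a feature pair with a strong U-shaped relationship would score near zero here, which is one reason the visual heatmap in the next step is worth building alongside the table.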

Step 07: Visualizations: Charts That Tell a Story

Now for the payoff! Great data engineering isn't just about clean tables; it's about making the data speak. Below are the key charts we can build from our Gold layer, each revealing a different facet of the streaming landscape.

Gold Table -> top_artists

Top 10 Artists by Total Streams

Gold Table -> feature_correlations

Audio Feature Correlation Matrix

Green = positive correlation · Pink = negative · Darker = stronger relationship

Gold Table -> mood_by_year

Mood Distribution Across Decades

Silver Table -> cleaned_tracks

Key Metrics at a Glance
- Tracks Analyzed: 346
- Audio Features: 6
- Avg Danceability: 64.9
- Avg Loudness (dB): -5.45
- Avg Valence: 46.7
- Avg Energy: 74.0

Step 08: Key Insights: What the Data Reveals

Here are the most interesting patterns that emerge once the pipeline is running:

1. Danceability is king (but not everything)

Tracks with danceability scores between 65–80 dominate the top-streamed list. But danceability alone isn't enough; the biggest hits combine high danceability with moderate energy (55–75 range). Too much energy and the track tips from "fun to dance to" into "exhausting."

2. Sadder music streams more than you'd expect

Tracks classified as "Melancholy" or "Intense" account for roughly 35–40% of top-streamed songs. This challenges the assumption that happy = popular. Songs like Someone You Loved and Lovely prove that emotional depth can drive massive engagement.

3. The collaboration multiplier

Tracks with 2+ artists average 1.4x more streams than solo tracks. This isn't just about combined fanbases; collabs tend to score higher on energy, suggesting they're engineered for playlists and virality.

4. Energy and loudness move together

The correlation between energy and loudness is 0.52 in this dataset, one of the strongest relationships in the matrix. This tells us Spotify's "energy" metric is noticeably influenced by production volume, not just tempo or intensity of arrangement.

5. Key doesn't matter (much)

There's no statistically significant relationship between musical key and stream count. C# and G# are the most common keys in the dataset, but that's more about production conventions than listener preference.

Step 09: Where to Take It from Here

This project gives you a solid foundation. Here are natural extensions:

Add a real orchestrator

Wrap the Bronze > Silver > Gold pipeline in Azure Data Factory or Databricks Workflows. Schedule daily runs, add retry logic, and set up alerts for data quality failures (e.g., "stream count dropped 50% vs. yesterday -> investigate").
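The alert rule in parentheses is simple enough to sketch. A hypothetical helper (name and threshold are mine) that an orchestrator could call after each run:

```python
def stream_drop_alert(today_total, yesterday_total, threshold=0.5):
    """Return True when total streams drop by more than `threshold`
    (50% by default) versus the previous run -- the alert example
    from the text. A zero baseline means there is nothing to compare."""
    if yesterday_total == 0:
        return False
    drop = (yesterday_total - today_total) / yesterday_total
    return drop > threshold
```

In a real workflow this check would run as a post-load task, with the two totals read from the Gold layer's run history, and a True result routed to whatever alerting channel the team uses.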

Build a live dashboard

Connect the Gold tables to Power BI, Streamlit, or Grafana. Let stakeholders explore the mood heatmap, filter by artist, and drill into audio feature distributions interactively.

Add the Spotify API for live data

Instead of a static Kaggle CSV, connect to the Spotify Web API to pull real-time streaming data. This adds authentication (OAuth2), pagination, rate-limiting, and incremental ingestion patterns.
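The Spotify Web API returns paged responses (a paging object with an `items` list and a `next` URL), and the pagination pattern is worth practicing in isolation. A generic sketch with an injected fetch function; in real use it would wrap an authenticated `requests.get` call:

```python
def fetch_all_pages(fetch_page, first_url):
    """Follow `next` links until the API reports no more pages.

    `fetch_page` is any callable that takes a URL and returns a dict
    shaped like Spotify's paging object: {"items": [...], "next": url or None}.
    Keeping the HTTP call injectable makes the loop trivially testable.
    """
    items, url = [], first_url
    while url:
        page = fetch_page(url)
        items.extend(page["items"])
        url = page.get("next")
    return items
```

A production version would add rate-limit handling (HTTP 429 with a Retry-After header) and a page-count safety cap, but the loop structure stays the same.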

Predictive modeling

Use the Gold tables as feature inputs for a model that predicts whether a new track will cross 1 billion streams based on its audio features, artist history, and release timing. This bridges data engineering into data science territory.

Integrate cross-platform data

The Kaggle datasets often include Apple Music, TikTok, and YouTube metrics alongside Spotify data. Add those to your Silver layer and build Gold tables that compare cross-platform performance: which artists dominate TikTok but underperform on Spotify?

Final thought
The best way to learn data engineering is to build pipelines on data you care about. Music streaming data is perfect because it's relatable, well-structured, and visually rewarding. You've now seen every layer of a production-style pipeline, from raw ingestion to insight-ready aggregations. Pick a dataset, build the layers, and have fun along the way!