Step 01: Why Spotify Data?
Every data engineering project starts with a question, and music streaming data is packed with great ones. Why do some songs rack up billions of streams while similar tracks barely crack a million? Is there a measurable "formula" to danceability? Do listeners in different regions gravitate toward different tempos?
Spotify's audio feature data is one of the most accessible, well-structured, and genuinely fun datasets to work with. Every track comes annotated with machine-generated features: numerical scores for danceability, energy, valence (musical positivity), speechiness, and more. That means you get to practice real data engineering patterns on data you can actually listen to.
This project is designed for beginners who want real-world practice and data enthusiasts who want a satisfying end-to-end build. We'll take raw, messy CSV data, push it through a proper medallion architecture (Bronze > Silver > Gold), and produce visualizations that surface genuinely interesting musical insights.
Step 02: The Dataset
We're working with a Spotify streaming dataset that you can find on Kaggle (search for "Most Streamed Spotify Songs" or "Spotify Global Music Dataset"). A typical version contains thousands of tracks with both metadata and audio features. Here's what the key columns look like:
Metadata Fields
| Column | Type | Example | Description |
|---|---|---|---|
| track_name | string | Blinding Lights | Name of the song |
| artist_name | string | The Weeknd | Primary artist |
| artist_count | int | 1 | Number of contributing artists |
| released_year | int | 2020 | Year of release |
| released_month | int | 11 | Month of release (1–12) |
| streams | string | 3,703,895,074 | Total Spotify streams (the raw CSV stores these as comma-separated strings, so they arrive as text, not numbers) |
| in_spotify_playlists | int | 43,899 | Number of Spotify playlists featuring the track |
| in_apple_playlists | int | 672 | Number of Apple Music playlists |
Audio Feature Fields
These are the Spotify-generated scores (0–100 scale) that describe how a track sounds:

| Column | Range | Description |
|---|---|---|
| danceability_% | 0–100 | How suitable the track is for dancing |
| energy_% | 0–100 | Perceived intensity and activity |
| valence_% | 0–100 | Musical positivity (high = happy, low = sad) |
| speechiness_% | 0–100 | Presence of spoken words |
| acousticness_% | 0–100 | Confidence that the track is acoustic |
| liveness_% | 0–100 | Likelihood the track was recorded with a live audience |
Additionally, you'll typically find bpm (beats per minute), key, and mode (Major/Minor) columns, which are crucial for musical analysis.
Step 03: Architecture Overview
We'll organize the pipeline using the Medallion Architecture, a layered data design pattern widely used in modern data engineering (especially in Databricks and Delta Lake ecosystems). Each layer has a clear responsibility:
Bronze
Raw ingestion. Exactly what came from the source. No cleaning, no transforms. The "single source of truth" for raw data.
Silver
Cleaned & conformed. Fix types, handle nulls, normalize names, parse dates. Validated and query-ready.
Gold
Business-level aggregations. Pre-computed metrics, ranked summaries, and tables optimized for dashboards.
Why bother with layers? Because in real-world pipelines, raw data is messy but you never want to destroy the original. By keeping the Bronze layer untouched, you can always re-derive Silver and Gold if your logic changes. This is the same principle used in production data platforms at companies running Azure Data Factory, Databricks, Snowflake, and similar tools.
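The re-derivability idea can be sketched in a few lines of plain Python. This is a toy illustration with made-up rows, not the real pipeline: Bronze is append-only, and Silver and Gold are pure functions of the layer below, so changing your cleaning logic just means re-running the functions.

```python
# Toy illustration of the medallion principle: Bronze is immutable,
# Silver and Gold are pure functions of the layer below them.
bronze = [
    {"track_name": "Blinding Lights", "streams": "3,703,895,074"},
    {"track_name": None, "streams": "NaN"},  # messy row, kept as-is in Bronze
]

def to_silver(rows):
    """Clean: drop null track names, parse comma-separated stream counts."""
    out = []
    for r in rows:
        if r["track_name"] is None:
            continue
        digits = "".join(ch for ch in r["streams"] if ch.isdigit())
        if digits:
            out.append({"track_name": r["track_name"], "streams": int(digits)})
    return out

def to_gold(rows):
    """Aggregate: total streams across all clean tracks."""
    return {"total_streams": sum(r["streams"] for r in rows)}

silver = to_silver(bronze)
gold = to_gold(silver)
# If the cleaning rules change, just re-run to_silver/to_gold;
# bronze is never modified.
print(gold)  # {'total_streams': 3703895074}
```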
Step 04: Bronze Layer: Raw Ingestion
The Bronze layer is deceptively simple: read the raw file and store it as-is, adding only ingestion metadata. No filtering, no type-casting, no cleaning.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit, input_file_name

spark = SparkSession.builder.appName("SpotifyBronze").getOrCreate()

# Read raw CSV exactly as-is
raw_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "false")  # Keep everything as StringType
    .csv("/mnt/landing/spotify_raw/*.csv")
)

# Add ingestion metadata
bronze_df = (
    raw_df
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", input_file_name())
    .withColumn("_batch_id", lit("batch_2025_04"))
)

# Write to Bronze table (Delta format for time-travel)
(
    bronze_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("spotify_catalog.bronze.raw_tracks")
)

print(f"Bronze: {bronze_df.count()} rows ingested")
```
What Bronze looks like
| track_name | artist(s)_name | streams | danceability_% | _ingested_at |
|---|---|---|---|---|
| Blinding Lights | The Weeknd | 3,703,895,074 | 51 | 2025-04-23 09:14:22 |
| Shape of You | Ed Sheeran | 3,562,543,890 | 83 | 2025-04-23 09:14:22 |
| Someone You Loved | Lewis Capaldi | 2,887,241,567 | 50 | 2025-04-23 09:14:22 |
| Sunflower | Post Malone, Swae Lee | 2,808,096,209 | 76 | 2025-04-23 09:14:22 |
| Bad Song™ | ??Unknown?? | NaN | | 2025-04-23 09:14:22 |
Notice the last row: a garbled artist name, a NaN stream count, a missing danceability score. That's fine! Bronze stores everything. We'll deal with it in Silver.
Step 05: Silver Layer: Clean, Parse, Normalize
This is where the real engineering happens. The Silver layer is where you apply data quality rules, cast types, handle nulls, normalize inconsistent values, and produce a clean, reliable dataset that analysts can trust.
The cleaning checklist
Cast numeric columns
Convert streams, bpm, and all audio feature percentages from strings to proper integers/floats. Handle comma-separated numbers and non-numeric values.
Parse release dates
Combine released_year, released_month, and released_day into a proper DATE column. Handle invalid months (0, 13+) and missing days.
Normalize artist names
Trim whitespace, standardize casing, and split multi-artist fields (e.g., "Post Malone, Swae Lee") into an array for downstream joins.
Handle nulls & outliers
Drop rows where track_name is null, flag suspicious values (e.g., streams = 0 for a track with 40,000+ playlist appearances), and impute missing key values as "Unknown".
Add derived columns
Calculate decade, release_era ("Pre-streaming", "Early-streaming", "Modern"), and mood_category based on valence + energy thresholds.
```python
from pyspark.sql.functions import (
    col, regexp_replace, trim, split, when,
    make_date, coalesce, lit, current_timestamp
)
from pyspark.sql.types import LongType, IntegerType, DoubleType

# Read from Bronze
bronze_df = spark.table("spotify_catalog.bronze.raw_tracks")

# Step 1: Clean and cast the streams column
silver_df = (
    bronze_df
    # Remove commas and cast streams to Long
    .withColumn(
        "streams",
        regexp_replace(col("streams"), "[^0-9]", "").cast(LongType())
    )
    # Cast audio features to integers
    .withColumn("danceability_pct", col("`danceability_%`").cast(IntegerType()))
    .withColumn("energy_pct", col("`energy_%`").cast(IntegerType()))
    .withColumn("valence_pct", col("`valence_%`").cast(IntegerType()))
    .withColumn("speechiness_pct", col("`speechiness_%`").cast(IntegerType()))
    .withColumn("acousticness_pct", col("`acousticness_%`").cast(IntegerType()))
    .withColumn("liveness_pct", col("`liveness_%`").cast(IntegerType()))
    .withColumn("bpm", col("bpm").cast(IntegerType()))
)

# Step 2: Build a proper release date
silver_df = silver_df.withColumn(
    "release_date",
    make_date(
        col("released_year").cast(IntegerType()),
        coalesce(col("released_month").cast(IntegerType()), lit(1)),
        coalesce(col("released_day").cast(IntegerType()), lit(1))
    )
)

# Step 3: Normalize artist names -> array
silver_df = silver_df.withColumn(
    "artists_array",
    split(trim(col("`artist(s)_name`")), ",\\s*")
)

# Step 4: Classify mood based on valence + energy
silver_df = silver_df.withColumn(
    "mood",
    when((col("valence_pct") >= 60) & (col("energy_pct") >= 60), "Euphoric")
    .when((col("valence_pct") >= 60) & (col("energy_pct") < 60), "Chill Happy")
    .when((col("valence_pct") < 40) & (col("energy_pct") >= 60), "Intense")
    .when((col("valence_pct") < 40) & (col("energy_pct") < 40), "Melancholy")
    .otherwise("Neutral")
)

# Step 5: Drop invalid rows, add metadata
silver_df = (
    silver_df
    .filter(col("track_name").isNotNull())
    .filter(col("streams").isNotNull())
    .withColumn("_cleaned_at", current_timestamp())
)

# Write to Silver
(
    silver_df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("spotify_catalog.silver.cleaned_tracks")
)

print(f"Silver: {silver_df.count()} clean rows")
```
The mood classification logic
This is one of the most interesting derived columns. By combining valence (how positive a song sounds) with energy (how intense it feels), we can classify every track into a mood quadrant:
Mood Quadrant

| | High energy (≥ 60) | Low energy |
|---|---|---|
| High valence (≥ 60) | Euphoric | Chill Happy (energy < 60) |
| Low valence (< 40) | Intense | Melancholy (energy < 40) |

Tracks in the middle band (valence 40–59, or low valence with mid-range energy) fall back to "Neutral".
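For intuition, here is the same when/otherwise chain from the Silver code as a plain Python function, with the thresholds copied directly from it:

```python
def classify_mood(valence: int, energy: int) -> str:
    """Mirror of the Silver layer's mood logic: valence x energy quadrants."""
    if valence >= 60 and energy >= 60:
        return "Euphoric"
    if valence >= 60 and energy < 60:
        return "Chill Happy"
    if valence < 40 and energy >= 60:
        return "Intense"
    if valence < 40 and energy < 40:
        return "Melancholy"
    return "Neutral"

# A high-valence, high-energy track vs. a low/low one vs. a mid-range one
print(classify_mood(75, 80))  # Euphoric
print(classify_mood(20, 25))  # Melancholy
print(classify_mood(50, 50))  # Neutral
```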
Step 06: Gold Layer: Analytics-Ready Aggregations
The Gold layer is where we build the tables that directly power dashboards and reports. These are pre-aggregated, business-meaningful tables. Instead of running expensive group-by queries on millions of rows every time someone opens a dashboard, we compute the results once and store them.
Gold Table 1: Top Artists by Total Streams
```sql
CREATE OR REPLACE TABLE spotify_catalog.gold.top_artists AS
WITH exploded_artists AS (
  SELECT
    explode(artists_array) AS artist,
    streams,
    danceability_pct,
    energy_pct,
    valence_pct,
    release_date
  FROM spotify_catalog.silver.cleaned_tracks
)
SELECT
  artist,
  COUNT(*) AS track_count,
  SUM(streams) AS total_streams,
  ROUND(AVG(danceability_pct), 1) AS avg_danceability,
  ROUND(AVG(energy_pct), 1) AS avg_energy,
  ROUND(AVG(valence_pct), 1) AS avg_valence,
  MIN(release_date) AS earliest_release,
  MAX(release_date) AS latest_release
FROM exploded_artists
GROUP BY artist
HAVING COUNT(*) >= 3  -- repeating the aggregate is more portable than the alias
ORDER BY total_streams DESC;
```
Gold Table 2: Mood Distribution by Year
```sql
CREATE OR REPLACE TABLE spotify_catalog.gold.mood_by_year AS
SELECT
  YEAR(release_date) AS release_year,
  mood,
  COUNT(*) AS track_count,
  ROUND(
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (
      PARTITION BY YEAR(release_date)
    ),
    1
  ) AS pct_of_year,
  ROUND(AVG(streams), 0) AS avg_streams
FROM spotify_catalog.silver.cleaned_tracks
WHERE YEAR(release_date) >= 2010
GROUP BY release_year, mood
ORDER BY release_year, track_count DESC;
```
Gold Table 3: Audio Feature Correlations
```sql
CREATE OR REPLACE TABLE spotify_catalog.gold.feature_correlations AS
SELECT 'danceability_vs_energy' AS pair,
       ROUND(CORR(danceability_pct, energy_pct), 3) AS correlation
FROM spotify_catalog.silver.cleaned_tracks
UNION ALL
SELECT 'energy_vs_valence',
       ROUND(CORR(energy_pct, valence_pct), 3)
FROM spotify_catalog.silver.cleaned_tracks
UNION ALL
SELECT 'danceability_vs_valence',
       ROUND(CORR(danceability_pct, valence_pct), 3)
FROM spotify_catalog.silver.cleaned_tracks
UNION ALL
SELECT 'speechiness_vs_streams',
       ROUND(CORR(speechiness_pct, streams), 3)
FROM spotify_catalog.silver.cleaned_tracks
UNION ALL
SELECT 'acousticness_vs_energy',
       ROUND(CORR(acousticness_pct, energy_pct), 3)
FROM spotify_catalog.silver.cleaned_tracks;
```
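SQL's CORR computes the Pearson correlation coefficient. A minimal pure-Python version can be handy for sanity-checking the Gold table against a small sample:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly correlated toy data -> 1.0; perfectly anti-correlated -> -1.0
print(round(pearson([1, 2, 3], [2, 4, 6]), 3))   # 1.0
print(round(pearson([1, 2, 3], [6, 4, 2]), 3))   # -1.0
```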
Step 07: Visualizations: Charts That Tell a Story
Now for the payoff! Great data engineering isn't just about clean tables; it's about making the data speak. Below are the key charts we can build from our Gold layer, each revealing a different facet of the streaming landscape.
[Chart: Top artists by total streams — powered by Gold table top_artists]

[Chart: Audio feature correlation heatmap — powered by Gold table feature_correlations. Green = positive correlation · Pink = negative · Darker = stronger relationship]

[Chart: Mood distribution by release year — powered by Gold table mood_by_year]

[Chart: Audio feature distributions — powered by Silver table cleaned_tracks]
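A chart like the mood-by-year heatmap needs the long-format mood_by_year rows pivoted into a year × mood grid first. Here's a pure-Python sketch of that reshaping step; the column names match the Gold schema, but the sample values are invented for illustration:

```python
# Sample rows as they might come back from gold.mood_by_year:
# (release_year, mood, pct_of_year) -- values are made up for illustration.
rows = [
    (2019, "Euphoric", 31.0), (2019, "Melancholy", 18.5),
    (2020, "Euphoric", 28.2), (2020, "Melancholy", 22.1),
]

# Pivot long-format rows into {year: {mood: pct}} for a heatmap library
grid = {}
for year, mood, pct in rows:
    grid.setdefault(year, {})[mood] = pct

print(grid[2020]["Melancholy"])  # 22.1
```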
Step 08: Key Insights: What the Data Reveals
Here are the most interesting patterns that emerge once the pipeline is running:
1. Danceability is king (but not everything)
Tracks with danceability scores between 65 and 80 dominate the top-streamed list. But danceability alone isn't enough: the biggest hits combine high danceability with moderate energy (the 55–75 range). Too much energy, and the track tips from "fun to dance to" into "exhausting."
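This "sweet spot" claim is easy to check against your own Silver table. Here's the filter as plain Python over hypothetical rows; the column names follow the Silver schema, but the values are invented:

```python
# Hypothetical rows shaped like silver.cleaned_tracks
tracks = [
    {"track_name": "Hit A", "danceability_pct": 72, "energy_pct": 65},
    {"track_name": "Hit B", "danceability_pct": 90, "energy_pct": 95},  # too intense
    {"track_name": "Hit C", "danceability_pct": 68, "energy_pct": 60},
]

# The hypothesized hit zone: danceability 65-80, energy 55-75
sweet_spot = [
    t for t in tracks
    if 65 <= t["danceability_pct"] <= 80 and 55 <= t["energy_pct"] <= 75
]
print([t["track_name"] for t in sweet_spot])  # ['Hit A', 'Hit C']
```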
2. Sadder music streams more than you'd expect
Tracks classified as "Melancholy" or "Intense" account for roughly 35–40% of top-streamed songs. This challenges the assumption that happy = popular. Songs like Someone You Loved and Lovely prove that emotional depth can drive massive engagement.
3. The collaboration multiplier
Tracks with 2+ artists average 1.4x more streams than solo tracks. This isn't just about combined fanbases; collabs also tend to score higher on energy, suggesting they're engineered for playlists and virality.
4. Energy and loudness are nearly the same thing
The correlation between energy and loudness is 0.52 in the dataset, one of the strongest in the matrix. This tells us Spotify's "energy" metric is noticeably influenced by production volume, not just tempo or intensity of arrangement.
5. Key doesn't matter (much)
There's no statistically significant relationship between musical key and stream count. C# and G# are the most common keys in the dataset, but that's more about production conventions than listener preference.
Step 09: Where to Take It from Here
This project gives you a solid foundation. Here are natural extensions:
Add a real orchestrator
Wrap the Bronze > Silver > Gold pipeline in Azure Data Factory or Databricks Workflows. Schedule daily runs, add retry logic, and set up alerts for data quality failures (e.g., "stream count dropped 50% vs. yesterday > investigate").
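That "dropped 50% vs. yesterday" alert boils down to a one-function data quality check. A minimal sketch (the function name and threshold are assumptions, not part of any orchestrator's API):

```python
def quality_alert(today_streams: int, yesterday_streams: int,
                  threshold: float = 0.5) -> bool:
    """Flag a run when total streams drop by more than `threshold`
    day-over-day -- the kind of check an orchestrator would run post-load."""
    if yesterday_streams == 0:
        return True  # no baseline to compare against: always investigate
    drop = (yesterday_streams - today_streams) / yesterday_streams
    return drop > threshold

print(quality_alert(400, 1000))  # True  (60% drop -> investigate)
print(quality_alert(900, 1000))  # False (10% drop -> fine)
```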
Build a live dashboard
Connect the Gold tables to Power BI, Streamlit, or Grafana. Let stakeholders explore the mood heatmap, filter by artist, and drill into audio feature distributions interactively.
Add the Spotify API for live data
Instead of a static Kaggle CSV, connect to the Spotify Web API to pull real-time streaming data. This adds authentication (OAuth2), pagination, rate-limiting, and incremental ingestion patterns.
Predictive modeling
Use the Gold tables as feature inputs for a model that predicts whether a new track will cross 1 billion streams based on its audio features, artist history, and release timing. This bridges data engineering into data science territory.
Integrate cross-platform data
The Kaggle datasets often include Apple Music, TikTok, and YouTube metrics alongside Spotify data. Add those to your Silver layer and build Gold tables that compare cross-platform performance: which artists dominate TikTok but underperform on Spotify?