This post continues Behind the Tap, a series exploring the hidden mechanics of everyday tech — from Uber to Spotify to search engines. I’ll dive under the hood to demystify the systems shaping your digital world.

My first relationship with music started at six, rotating through the albums in the living room's Onkyo 6-disc player. Cat Stevens, Groove Armada, Sade. There was always one song I kept rewinding to, though I didn't know its name. Ten years on, fragments of the song returned to memory. I searched forums for years, 'old saxophone melody', 'vintage song about sand dunes', with no success. Then, one day at university, I was in my friend Pegler's dorm room when he played it.

That long search taught me how important it is to be able to find the music you love.


Before streaming and smart assistants, music discovery relied on memory, luck, or a friend with good music taste. That one catchy chorus could be lost to the ether.

Then came a music-lover’s miracle.

A few seconds of sound. A button press. And a name on your screen.

Shazam made music recognisable.

The Origin: 2580

Shazam launched in 2002, long before apps were a thing. Back then it worked like this:

You’d dial 2580# on your mobile (UK only).
Hold your phone up to the speaker.
…Wait in silence…
And receive an SMS telling you the name of the song.

It felt like magic. The founding team, Chris Barton, Philip Inghelbrecht, Avery Wang, and Dhiraj Mukherjee, spent years building that illusion.

To build its first database, Shazam hired 30 young workers to run 18-hour shifts, manually loading 100,000 CDs into computers and processing them with custom software. Because CDs don't contain metadata, they had to type each song's name in by hand, referring to the CD sleeve, eventually creating the company's first million audio fingerprints. The process took months of painstaking work.

In an era before smartphones or apps, when Nokias and BlackBerrys couldn't handle the processing or memory demands, Shazam had to stay alive long enough for the technology to catch up to their idea. This was a lesson in market timing.

This post is about what happens in the moment between the tap and the title: the signal processing, hashing, indexing, and pattern matching that lets Shazam hear what you can't quite name.


The Algorithm: Audio Fingerprinting

In 2003, Shazam co-founder Avery Wang published the blueprint for an algorithm that still powers the app today. The paper’s central idea: If humans can understand music by superimposing layers of sound, a machine could do it too.

Let’s walk through how Shazam breaks sound down to something a machine can recognise instantly.

1. Capturing the Audio Sample

It starts with a tap.

When you hit the Shazam button, the app records a 5–10 second snippet of the audio around you. This is long enough to identify most songs, though we’ve all waited minutes holding our phones in the air (or hiding in our pockets) for the ID.
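
In code, that capture step might look something like this. It's a rough desktop analogue using the sounddevice library, purely for illustration; the real app uses native mobile recording APIs, and the sample rate and duration here are assumptions:

# A minimal sketch of capturing a ~10-second mono snippet (illustrative only).
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 11025   # Hz; an assumed rate, since the useful musical peaks sit well below 5 kHz
DURATION = 10         # seconds, roughly the length of a Shazam snippet

def record_snippet(duration: int = DURATION, sample_rate: int = SAMPLE_RATE) -> np.ndarray:
    """Record a mono snippet from the default microphone as a 1-D float array."""
    snippet = sd.rec(int(duration * sample_rate), samplerate=sample_rate,
                     channels=1, dtype="float32")
    sd.wait()                   # block until the recording finishes
    return snippet.flatten()    # shape: (duration * sample_rate,)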

But Shazam doesn’t store that recording. Instead, it reduces it to something far smaller and smarter: a fingerprint.

2. Generating the Spectrogram

Before Shazam can recognise a song, it needs to understand what frequencies are in the sound and when they occur. To do this, it uses a mathematical tool called the Fast Fourier Transform (FFT).

The FFT breaks an audio signal into its component frequencies, revealing which notes or tones make up the sound at any moment.

Why it matters: Waveforms are fragile, sensitive to noise, pitch changes, and device compression. But frequency relationships over time remain stable. That’s the gold.

If you studied mathematics at university, you might remember the struggle of learning the Discrete Fourier Transform. The Fast Fourier Transform (FFT) is simply a more efficient way of computing it, letting us decompose a complex signal into its frequency components, like hearing all the notes in a chord.

Music isn’t static. Notes and harmonics change over time. So Shazam doesn’t just run FFT once, it runs it repeatedly over small, overlapping windows of the signal. This process is known as the Short-Time Fourier Transform (STFT) and forms the basis of the spectrogram.
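
Here's a rough sketch of that STFT step in Python using SciPy. The window and overlap sizes are illustrative guesses, not Shazam's actual parameters:

import numpy as np
from scipy.signal import spectrogram

def compute_spectrogram(snippet: np.ndarray, sample_rate: int = 11025):
    """Slide an FFT window over the signal to get a frequency-vs-time grid."""
    freqs, times, sxx = spectrogram(
        snippet,
        fs=sample_rate,
        window="hann",
        nperseg=4096,       # samples per FFT window
        noverlap=2048,      # 50% overlap between consecutive windows
    )
    # Work in decibels so quiet and loud components sit on a comparable scale.
    sxx_db = 10 * np.log10(sxx + 1e-10)
    return freqs, times, sxx_db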

Image by Author: Fast Fourier Transformation Visualised

The resulting spectrogram is a transformation of sound from the amplitude-time domain (waveform) into the frequency-time domain.

Think of this as turning a messy audio waveform into a musical heatmap.
Instead of showing how loud the sound is, a spectrogram shows what frequencies are present at what times.

Image by Author: A visualisation of the transition from a waveform to a spectrogram using FFT

It displays time on the horizontal axis, frequency on the vertical axis, and uses brightness to indicate the amplitude (or volume) of each frequency at each moment. This lets you see not just which frequencies are present, but how their intensity evolves, making it possible to spot patterns, transient events, or changes in the signal that are invisible in a standard time-domain waveform.

Spectrograms are widely used in fields such as audio analysis, speech processing, seismology, and music, providing a powerful tool for understanding the temporal and spectral characteristics of signals.

3. From Spectrogram to Constellation Map

Spectrograms are dense and contain too much data to compare across millions of songs. Shazam filters out low-intensity frequencies, leaving just the loudest peaks.

This creates a constellation map, a visual scatterplot of standout frequencies over time, similar to sheet music, although it reminds me of a mechanical music-box.

Image by Author: A visualisation of the transition into a Constellation Map
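
A simplified version of that peak-picking step might look like this; the neighbourhood size and loudness threshold are illustrative choices rather than Shazam's real tuning:

import numpy as np
from scipy.ndimage import maximum_filter

def constellation_map(sxx_db: np.ndarray, neighbourhood: int = 20, min_db: float = -40.0):
    """Return (freq_bin, time_bin) pairs for the strongest spectrogram peaks."""
    # A point survives if it is the maximum of its local neighbourhood...
    is_local_max = maximum_filter(sxx_db, size=neighbourhood) == sxx_db
    # ...and is loud enough to matter.
    is_loud = sxx_db > min_db
    freq_bins, time_bins = np.where(is_local_max & is_loud)
    return list(zip(freq_bins, time_bins))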

4. Creating the Audio Fingerprint

Now comes the magic, turning points into a signature.

Shazam takes each anchor point (a dominant peak) and pairs it with target peaks in a small time window ahead — forming a connection that encodes both frequency pair and timing difference.

Each of these becomes a hash tuple:

(anchor_frequency, target_frequency, time_delta)

Image by Author: Hash Generation Process
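
Continuing the sketch, pairing each anchor with a few peaks in the window ahead of it might look like this. The fan-out and window size are invented for illustration, and the anchor's own time is kept alongside each tuple because the matching step later needs it:

def generate_pairs(peaks, fan_out: int = 5, max_dt: int = 200):
    """Yield (anchor_freq, target_freq, time_delta, anchor_time) tuples."""
    peaks = sorted(peaks, key=lambda p: p[1])           # p = (freq_bin, time_bin)
    for i, (f1, t1) in enumerate(peaks):
        for f2, t2 in peaks[i + 1 : i + 1 + fan_out]:   # limited fan-out per anchor
            dt = t2 - t1
            if 0 < dt <= max_dt:                        # target must sit ahead, within the window
                yield f1, f2, dt, t1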

What is a Hash?

A hash is the output of a mathematical function, called a hash function, that transforms input data into a fixed-length string of numbers and/or characters. It's a way of turning complex data into a short, practically unique identifier.

Hashing is widely used in computer science and cryptography, especially for tasks like data lookup, verification, and indexing.

Image by Author: Refer to this source to understand hashing

For Shazam, a typical hash is 32 bits long, and it might be structured like this:

  • 10 bits for the anchor frequency
  • 10 bits for the target frequency
  • 12 bits for the time delta between them
Image by Author: A visualisation of the hashing example from above

This tiny fingerprint captures the relationship between two sound peaks and how far apart they are in time. It's distinctive enough to help identify the song yet small enough to transmit quickly, even on low-bandwidth connections.
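
Packing one of those triples into the 10/10/12-bit layout above is a few lines of bit shifting. A sketch, assuming the frequencies have already been quantised to fit in 10 bits:

def pack_hash(anchor_freq: int, target_freq: int, time_delta: int) -> int:
    """Pack a peak pair into one 32-bit integer: [10 bits f1][10 bits f2][12 bits dt]."""
    f1 = anchor_freq & 0x3FF    # keep 10 bits (0-1023)
    f2 = target_freq & 0x3FF    # keep 10 bits (0-1023)
    dt = time_delta & 0xFFF     # keep 12 bits (0-4095)
    return (f1 << 22) | (f2 << 12) | dt

# e.g. pack_hash(523, 784, 37) gives a single integer the server can index on.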

5. Matching Against the Database

Once Shazam creates a fingerprint from your snippet, it needs to quickly find a match in its database containing millions of songs.

Shazam has no idea where in the song your clip came from: intro, verse, chorus, bridge. It doesn't matter, because the system looks for relative timing between hash pairs, which makes it robust to time offsets in the input audio.

Image by Author: Visualisation of matching hashes to a database song

Shazam compares your recording’s hashes against its database and identifies the song with the highest number of matches, the fingerprint that best lines up with your sample, even if it’s not an exact match due to background noise.
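
In miniature, the matching step could look like this. Here database is assumed to be a plain Python dict mapping each 32-bit hash to a list of (song_id, time_in_song) entries, and sample_hashes holds (hash, time_in_sample) pairs from your snippet; the song whose hashes agree on a single time offset collects the most votes:

from collections import Counter

def best_match(sample_hashes, database):
    """Return (song_id, votes) for the song whose hashes line up best in time."""
    votes = Counter()
    for h, t_sample in sample_hashes:
        for song_id, t_song in database.get(h, []):
            # Hashes from the correct song agree on one offset:
            # (time in song) - (time in sample) is roughly constant.
            votes[(song_id, t_song - t_sample)] += 1
    if not votes:
        return None
    (song_id, _offset), count = votes.most_common(1)[0]
    return song_id, count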

How it Searches So Fast

To make this lightning-fast, Shazam uses a hashmap, a data structure that allows for near-instant lookup.

A hashmap can find a match in O(1) time, meaning the lookup time stays constant even when there are millions of entries.

In contrast, a sorted index (like B-tree on disk) takes O(log n) time, which grows slowly as the database grows.

This way of describing time and space complexity is known as Big O notation, theory I am not prepared or bothered to teach. Please refer to a Computer Scientist.
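
Still, the practical difference is easy to show in a couple of lines. A toy comparison with made-up hashes, nothing like Shazam's real index:

import bisect

index = {0x1A2B3C4D: ["song_42"], 0x0F00BAA5: ["song_7"]}   # hashmap: hash -> songs
sorted_keys = sorted(index)                                  # a sorted index, for comparison

query = 0x0F00BAA5
hit_via_hashmap = index.get(query)              # O(1) on average
pos = bisect.bisect_left(sorted_keys, query)    # O(log n) binary search
hit_via_sorted = index[sorted_keys[pos]] if pos < len(sorted_keys) and sorted_keys[pos] == query else None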

6. Scaling the System

To maintain this speed at global scale, Shazam does more than just use fast data structures, it optimises how and where the data lives:

  • Shards the database — dividing it by time range, hash prefix, or geography
  • Keeps hot shards in memory (RAM) for instant access
  • Offloads colder data to disk, which is slower but cheaper to store
  • Distributes the system by region (e.g., US East, Europe, Asia) so recognition is fast no matter where you are

This design supports 23,000+ recognitions per minute, even at global scale.
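
To make the idea concrete, here's a purely hypothetical sketch of prefix-based shard routing; the shard count and bit layout are invented for illustration:

NUM_SHARDS = 64

def shard_for(hash32: int) -> int:
    """Route a 32-bit fingerprint hash to a shard using its top bits as the prefix."""
    return (hash32 >> 26) % NUM_SHARDS    # top 6 bits -> shard 0..63

# A lookup then touches only one shard, e.g. shard_for(pack_hash(523, 784, 37)),
# and "hot" shards can be pinned in RAM while colder ones stay on disk.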


Impact & Future Applications

The obvious application is music discovery on your phone, but there is another major application of Shazam’s process.

Shazam facilitates market insights. Every time a user tags a song, Shazam collects anonymised, geo-temporal metadata (where, when, and how often a song is being ID'd).

Labels, artists, and promoters use this to:

  • Spot breakout tracks before they hit the charts.
  • Identify regional trends (a remix gaining traction in Tokyo before LA).
  • Guide marketing spend based on organic traction.

Unlike Spotify, which uses user listening behaviour to refine recommendations, Shazam provides real-time data on songs people actively identify, offering the music industry early insights into emerging trends and popular tracks.

What Spotify Hears Before You Do: The Data Science of Music Recommendation (medium.com)

In December 2017, Apple bought Shazam for a reported $400 million. Apple reportedly uses Shazam's data to augment Apple Music's recommendation engine, and record labels now monitor Shazam trends like they used to monitor radio spins.

Photo by Rachel Coyne on Unsplash

In the future, expect evolution in areas like:

  • Visual Shazam: Already piloted; point your camera at an object or artwork to identify it, useful for an Augmented Reality future.
  • Concert Mode: Identify songs live during gigs and sync to a real-time setlist.
  • Hyper-local trends: Surface what’s trending ‘on this street’ or ‘in this venue’, expanding community-shared music taste.
  • Generative AI integration: Pair audio snippets with lyric generation, remix suggestions, or visual accompaniment.

Outro: The Algorithm That Endures

In a world of ever-shifting tech stacks, it’s rare for an algorithm to stay relevant for over 20 years.

But Shazam’s fingerprinting method hasn’t just endured, it’s scaled, evolved, and become a blueprint for audio recognition systems across industries.

The magic isn’t just that Shazam can name a song. It’s how it does it, turning messy sound into elegant math, and doing it reliably, instantly, and globally.

So next time you're in a loud, trashy bar holding your phone up to the speaker playing Lola Young's 'Messy', just remember: behind that tap is a beautiful stack of signal processing, hashing, and search, designed so well it barely had to change.
