
How I scraped and visualized over 1,500 NPR Tiny Desk concerts

The kinds of music played behind NPR’s Tiny Desk have changed significantly since the concert series started in 2008. I decided to use data to show exactly how it’s evolved.

To accomplish this, I used a web-scraping package called Playwright, Last.fm’s artists API, Datawrapper, Adobe Illustrator, some CSS and JavaScript for animation, and, of course, my secret love — ChatGPT. The final project is at the top of the page.

Let’s walk through how I got there.

Scraping with Playwright

NPR provides a helpful archive of all their Tiny Desk concerts on their website here:

We want two pieces of information from this archive: the artists’ names and the date of the performance. Simple! In a Jupyter notebook, let’s dump all of the page’s HTML into BeautifulSoup and parse out the information we need. From that we get:

Nothing.

The problem is that the HTML BeautifulSoup is parsing doesn’t contain the artists’ names. That’s because the information loads onto the page dynamically, after the requests library has already pulled the HTML.
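To see why the parse comes back empty, here is a minimal sketch. It uses a made-up HTML snippet (not NPR’s real markup) and Python’s built-in html.parser as a stand-in for BeautifulSoup. The element id, the “title” class and the script comment are all assumptions for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup: what a server might send before any JavaScript runs.
# The list is empty; a script is expected to fill it in later, in the browser.
RAW_HTML = """
<html><body>
  <ul id="concert-list"></ul>
  <script>/* fetches concerts and adds list items here */</script>
</body></html>
"""

# What the same page might look like once the browser has run the script.
RENDERED_HTML = RAW_HTML.replace(
    '<ul id="concert-list"></ul>',
    '<ul id="concert-list"><li class="title">Sampha</li></ul>',
)


class TitleCollector(HTMLParser):
    """Collects the text of every <li class="title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


def extract_titles(html):
    parser = TitleCollector()
    parser.feed(html)
    return parser.titles


print(extract_titles(RAW_HTML))       # static HTML: nothing to find
print(extract_titles(RENDERED_HTML))  # rendered HTML: the title is there
```

The parser is doing its job in both cases. The difference is only in which version of the page it was handed.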

Here’s where Playwright comes in. Playwright acts more like a real person browsing a webpage, making it better at grabbing information that needs to load on the page. It accomplishes this by opening the specified URL in Chromium, the open-source browser that Google Chrome is built on. Playwright will wait until a specified event has occurred or an amount of time has passed before scraping anything on the page. For this purpose, I tell it to wait for the load state “domcontentloaded.”

Using this approach solves the problem of scraping dynamically loaded content, but there’s another issue to resolve. Each page only shows concerts from a specific month.

Thankfully, Playwright allows us to programmatically click through buttons on the page, so we can tell it to click the most recent month in 2024, scrape the desired information, then click the next month, and so on. Unfortunately, this process proved a bit finicky. The combination of multiple layers of required clicking plus the occasional unpredictable full-page popup led to a series of errors.

So let’s take another approach. Instead of clicking, we can use the inspector tool to find a simple ul element containing the URLs of each month’s page. From there, we can use Playwright to scrape and iterate over each of the URLs.

Combined with Pandas, this results in a simple dataframe with a column for the artists’ names and a column for the date of their performance. 
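The URL-based approach could be sketched roughly like this. It is not the code from the project: the list id “month-links” and the function names are hypothetical, and the link extraction uses Python’s built-in html.parser so it can be tried without a browser. The Playwright calls use the synchronous API. Inside a Jupyter notebook you may need Playwright’s async API instead, since the sync version refuses to run inside an already-running event loop.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


def fetch_rendered_html(url):
    """Open `url` in headless Chromium and return the HTML once the DOM has loaded."""
    # Imported here so the parsing helpers below work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded")
        html = page.content()
        browser.close()
    return html


class MonthLinkCollector(HTMLParser):
    """Collects the href of every <a> inside the <ul> with the given id."""

    def __init__(self, list_id):
        super().__init__()
        self.list_id = list_id
        self.hrefs = []
        self._depth = 0  # greater than 0 while inside the target <ul>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and (self._depth or attrs.get("id") == self.list_id):
            self._depth += 1
        elif tag == "a" and self._depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "ul" and self._depth:
            self._depth -= 1


def month_urls(html, base_url, list_id="month-links"):
    """Return absolute URLs for each month link (list_id is a hypothetical id)."""
    parser = MonthLinkCollector(list_id)
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def scrape_all_months(archive_url):
    """Fetch the archive page, then fetch each month's page in turn."""
    pages = {}
    for url in month_urls(fetch_rendered_html(archive_url), archive_url):
        pages[url] = fetch_rendered_html(url)
    return pages
```

Launching a fresh browser for every page is slow but keeps the sketch simple. Reusing one browser across all months would be the obvious next step. Pulling the artist names and dates out of each page depends on the archive’s actual markup, so that part is left out here.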

If you’re still confused by Playwright, Jonathan Soma, a data journalism professor at Columbia University, has a wonderfully approachable crash course on the software available here.

Assigning Genre with Last.fm API

From here, we can use Last.fm’s (free) API to add columns to the dataframe that include the genres of each artist. But first, we need to do some manual cleaning. Most of the concerts are titled simply with the name of the artist, but others provide some additional information, as in this instance of an Ed Sheeran Tiny Desk concert back in 2021:

You’re smart, so you know this title is referring to the artist Ed Sheeran. But if you provide that title to Last.fm’s API, it won’t return anything. Why? Because there’s no artist called “Ed Sheeran (Home) Concert.”

To clean the data, we can simply save the dataframe as a CSV file and then bring it into Google Sheets where we can manually go through and clean the titles. Once cleaned, we can load the CSV back into our notebook as a dataframe and then feed it to the Last.fm API.
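I did this cleaning by hand, but a first pass could be automated before the manual review. The sketch below is one way to do that, not what the project used. The suffix patterns are assumptions based only on the “(Home) Concert” example above, so a real dataset would still need a human to check the results.

```python
import re

# Hypothetical suffix patterns; extend these after looking at the real titles.
SUFFIX_PATTERNS = [
    r"\s*\(Home\)\s*Concert\s*$",  # e.g. "Ed Sheeran (Home) Concert"
    r"\s*:\s*Tiny Desk.*$",        # e.g. "Some Artist: Tiny Desk ..." (assumed format)
]


def clean_title(title):
    """Strip known trailing phrases so only the artist name remains."""
    cleaned = title.strip()
    for pattern in SUFFIX_PATTERNS:
        cleaned = re.sub(pattern, "", cleaned, flags=re.IGNORECASE)
    return cleaned.strip()


print(clean_title("Ed Sheeran (Home) Concert"))
print(clean_title("Sampha"))
```

Titles that match none of the patterns pass through unchanged, so those are the ones still worth checking by hand in the spreadsheet.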

Knowing exactly how to ask the API to return the genres for each artist usually requires referencing Last.fm’s documentation, but, in this instance, I just explain to ChatGPT what I want and it knows the documentation well enough to write out the proper code. Here are the results:
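For reference, a request like this might look something like the sketch below. It is not the code ChatGPT wrote for the project. It assumes the artist.getTopTags method (Last.fm calls genres “tags”), uses only the standard library, and expects you to supply your own API key. The helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://ws.audioscrobbler.com/2.0/"


def build_top_tags_url(artist, api_key):
    """Build the request URL for an artist's top tags, asking for JSON."""
    params = {
        "method": "artist.getTopTags",
        "artist": artist,
        "api_key": api_key,
        "format": "json",
    }
    return f"{API_ROOT}?{urlencode(params)}"


def parse_top_tags(payload):
    """Return tag names in the order the API listed them, or [] if there are none."""
    tags = payload.get("toptags", {}).get("tag", [])
    if isinstance(tags, dict):  # defensive: handle a single tag arriving as one object
        tags = [tags]
    return [tag["name"] for tag in tags]


def fetch_top_tags(artist, api_key):
    """Call the API for one artist and return the list of tag names."""
    with urlopen(build_top_tags_url(artist, api_key), timeout=10) as response:
        return parse_top_tags(json.load(response))
```

An unknown artist comes back as an error payload with no “toptags” key, which this sketch turns into an empty list. That is why the uncleaned “Ed Sheeran (Home) Concert” title would end up with no genres.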

Here we encounter our next problem. We want to assign just one genre to each artist, but the API returned an entire list of applicable genres for every single artist. There are various ways to approach this problem depending on what you’re trying to achieve with the project. Personally, I want to strike a balance between grouping artists under large genre umbrellas and not wholly ignoring differences in their style.

To achieve this, I include only the first three genres for each artist (this helps eliminate overly niche genres like “kickasstic”). From here, I can write code to count the number of times each genre appears across the dataframe, using those tallies to create a “most common genre” column. Here’s an example of how it works:

Let’s say we’re determining the most common genre for the artist Sampha. The first three genres listed for him are Electronic, Soul, and UK Garage. Across the dataframe, the genre Electronic appears 85 times, Soul appears 190 times, and UK Garage appears just two times. Since Soul appeared the most of the three genres, the code fills in the “most_common_genre” cell as Soul.
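Here is a minimal sketch of that logic in plain Python rather than pandas. The function name and the toy data are made up, and the tie-breaking rule (the earlier-listed genre wins) is my assumption, since the description above doesn’t say what happens on a tie.

```python
from collections import Counter


def most_common_genres(artist_genres, top_n=3):
    """Map each artist to whichever of their first `top_n` genres is most frequent overall."""
    # Keep only the first few genres per artist to drop overly niche tags.
    trimmed = {artist: genres[:top_n] for artist, genres in artist_genres.items()}

    # Count how often each genre appears across all of the trimmed lists.
    tally = Counter(genre for genres in trimmed.values() for genre in genres)

    result = {}
    for artist, genres in trimmed.items():
        # max() returns the first maximum, so the earlier-listed genre wins a tie.
        result[artist] = max(genres, key=lambda g: tally[g]) if genres else None
    return result


# Toy data: only Sampha's first three genres come from the example above.
artists = {
    "Sampha": ["Electronic", "Soul", "UK Garage", "Downtempo"],
    "Artist A": ["Soul", "R&B"],
    "Artist B": ["Soul", "Electronic"],
}
print(most_common_genres(artists)["Sampha"])
```

In this toy dataset Soul appears three times, Electronic twice and UK Garage once, so Sampha is assigned Soul, mirroring the worked example.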

It should be clear by now that even assigning a genre to an artist involves us making subjective decisions. Another person might think it’s better to simply assign an artist the first genre listed for them, which would designate Sampha as an electronic artist rather than a soul artist in our dataset. Both of these approaches would be valid. The important thing is to have a consistent rationale that explains why you decided on one approach over another. For this project, I am more concerned with bucketing artists together into more common genres to understand how prevalent different genres have been over the years at NPR.

Visualization (AKA the fun part)

Last, we need to do some additional shaping to get the data into the right form for our visualizations. For the small multiples chart, we can use a pivot table to group by the year and most common genre and then count the number of times each genre occurs:

For simplicity’s sake, we can limit this chart to the 10 most common genres. From here, we can create a small multiples line chart in Datawrapper and export the SVG to style it how we want in Illustrator.
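The shaping step can be sketched in plain Python as a stand-in for the pandas pivot table. The function name and sample rows are hypothetical. In pandas, something like pd.crosstab(df["year"], df["most_common_genre"]) would produce a similar table of counts, assuming a “year” column exists.

```python
from collections import Counter


def genre_counts_by_year(rows, top_n=10):
    """rows: (year, genre) pairs. Returns {genre: {year: count}} for the top_n genres overall."""
    rows = list(rows)
    overall = Counter(genre for _, genre in rows)
    keep = [genre for genre, _ in overall.most_common(top_n)]
    years = sorted({year for year, _ in rows})
    per_cell = Counter(rows)  # counts each (year, genre) pair
    # Years in which a genre never appears are filled with 0.
    return {genre: {year: per_cell[(year, genre)] for year in years} for genre in keep}


# Made-up rows for illustration only.
sample_rows = [
    (2008, "Folk"), (2008, "Folk"), (2008, "Rock"),
    (2009, "Soul"), (2009, "Folk"),
]
print(genre_counts_by_year(sample_rows, top_n=2))
```

Each genre’s row of yearly counts is then one line in the small multiples chart.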

Exporting SVG files from Datawrapper requires a paid enterprise account (a basic account is free). If that is not accessible to you, another good option is RawGraphs. Its output will likely take more time to style in Illustrator, but it’s totally free.

Next, for the animated bump chart, we can take essentially the same data that we used in the previous chart but replace the raw count with the relative ranking of each genre in a given year using the “=RANK” function in Google Sheets. The base of the bump chart is created in Datawrapper (there’s a helpful starting template here) and then we can export the SVG and bring it into Illustrator for styling.
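I did the ranking in Google Sheets, but the same calculation can be sketched in Python. The version below mimics the default behavior of =RANK as I understand it: the highest count gets rank 1, and tied values share a rank, with the next rank skipped. The function names and sample counts are made up.

```python
def rank_within_year(counts):
    """counts: {genre: count} for one year. Rank 1 is the highest count; ties share a rank."""
    return {
        genre: 1 + sum(other > count for other in counts.values())
        for genre, count in counts.items()
    }


def rank_by_year(table):
    """table: {year: {genre: count}} -> {year: {genre: rank}}"""
    return {year: rank_within_year(counts) for year, counts in table.items()}


# Made-up counts for a single year.
print(rank_within_year({"Folk": 5, "Rock": 3, "Soul": 5, "Jazz": 1}))
```

Here Folk and Soul tie for first, Rock is third and Jazz is fourth. How ties should be drawn in the bump chart is a separate design decision.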

To animate the lines, I decided to create CSS animations since that’s what I am more familiar with. Software like Adobe After Effects is another good option.

To apply CSS animations to the chart, we will need to export the SVG from Illustrator and place it in an HTML file. For speed, we can make a quick project on Glitch, which lets you create and host webpages all in the browser. We then copy the chart’s SVG and dump that into an HTML file. Once the chart is showing up on the page how we want, we can use ChatGPT to help write the CSS and JavaScript necessary to animate the lines, dots, and labels when and how we want.

To make the animating easier, we want to have consistent IDs for the different elements we need to control. This can be done by hand in the HTML, but the easier option is to rename the layers and groups in Illustrator. Then, when exporting the SVG, we’ll see an SVG Options box, with an Object IDs setting that should be automatically set to:

This means that if all of the genre labels on the left of the graph are in a group called “start-labels,” then the SVG group containing those labels will have an ID of “start-labels.” This is very helpful for easily selecting particular elements on the page. Here’s what that selection looks like in the JavaScript:

From there, we can simply reference “startLabels” in a for loop and iterate over the different labels as the animation progresses. Once the chart is animating how we want in the browser, we can just make a screen recording of the full animation and then bring it into Adobe Premiere to cut the video to the exact length and aspect ratio we want for posting on Instagram.

Many specifics of this process weren’t included here for brevity’s sake, but the full Glitch page where the bump chart was animated is available here.

A quick note

Much of the code used in this project was written by ChatGPT. The process was by no means mindless, and there were still a fair number of errors that I had to troubleshoot, but using ChatGPT made the undertaking much less intimidating. If this project inspired you to pursue your own idea, ChatGPT can be very helpful in filling in some of the details left blank by this relatively brief guide.

Elijah Nicholson-Messmer
