The Marshall Project’s Aaron Sankin on AI, large data sets and contextualizing data stories in the real world
We’re living in the era of mass sharing and large data sets. Journalists have used all this new data to do everything from conducting investigations to developing useful tools for the public. The relationship between journalism and data isn’t one-way: as much as journalists have used data to strengthen their storytelling, data has deeply shaped the field of journalism over the last few decades.
At the Computation + Journalism Symposium hosted by Northeastern University last fall, practicing journalists, independent data storytellers, computational social scientists, artists, digital humanities scholars, cartographers and others from around the globe came together to discuss the ways that technology can be used to tell compelling stories at the intersections of computation and journalism.
Aaron Sankin, Deputy Data Editor at the Marshall Project, was one of two keynote speakers at the conference alongside investigative journalist Julia Angwin, who founded The Markup in 2018. Sankin spoke at Northeastern on Oct. 26, 2024.
We sat down with Sankin to discuss his career in data journalism and how data can help us understand the structural forces shaping the U.S. criminal justice system and the world at large.
This interview has been edited for length and clarity.
What are you working on right now at the Marshall Project?
We have been discussing the use of AI a lot. Within the larger conversation about AI in journalism, we are figuring out how to use it ethically. AI is useful as a tool for structuring large sets of data, for example. A lot of the Marshall Project’s work starts with obtaining massive data sets, like public records, and AI allows for deeper analysis of that data.
At this point in AI’s life cycle, there is still a lot of human involvement in the work. We’re interested in setting up best practices around AI, not just for the Marshall Project, but for the entire industry.
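[Editor’s note: As a rough illustration of the kind of structuring work Sankin describes, the sketch below uses a large language model to pull structured fields out of free-text public records. The model name, the record fields and the extract_record helper are assumptions made for this example; this is not the Marshall Project’s actual pipeline, and any extracted row would still need to be checked against the source document.]

```python
import json
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract the following fields from this public record and return JSON with "
    "keys 'agency', 'incident_date', and 'charge'. Use null for missing fields.\n\n"
)

def extract_record(raw_text: str) -> dict:
    """Hypothetical helper: ask a model to turn one free-text record into a structured row."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice for the example
        messages=[{"role": "user", "content": PROMPT + raw_text}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# A reporter would spot-check every extracted row before using it in a story.
row = extract_record("On 2021-03-14 the Springfield PD filed a charge of trespassing ...")
print(row)
```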
What brought you to the field of journalism, and when did data become something you wanted to work with?
I didn’t start my journalism career in data journalism at all. I came out of the online viral news world doing a lot of politics and technology reporting. Then I began moving on to [reporting about] online communities and extremist online communities.
My experience with data started at The Markup, which had a bunch of data reporters and was looking for reporters who had expertise in technology. A lot of my role there was working with data reporters. The job felt like a service desk for [these] data reporters, where I helped them develop their data into pieces of journalism. A lot of data analysis work is doing fieldwork with the data you have to better understand it.
My current role at the Marshall Project is to assist the data team in turning their work into stories.
Are there any projects you have worked on involving data that you are particularly proud of? What was special about those projects?
I worked on a project called Blacklight at The Markup, a tool that let someone enter a URL; the tool would then visit that website with a blank profile, show the user how many trackers were on the site and reveal where the data from that website was going.
I was tasked with developing a story using this tool, and when looking at websites that serve the trans community, I saw how many cookies were being used on almost every platform.
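[Editor’s note: Blacklight itself drives a full headless browser, but the basic idea Sankin describes, visiting a site with no prior identity and counting the third parties it pulls in, can be sketched in a few lines. The function below is a simplified approximation written for this article, not The Markup’s implementation; it only inspects script tags and cookies in the initial HTML response.]

```python
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

def third_party_snapshot(url: str) -> dict:
    """Fetch a page with a fresh session (a 'blank profile') and report
    third-party script hosts plus cookies set by the response."""
    first_party = urlparse(url).hostname
    session = requests.Session()          # no stored cookies or history
    resp = session.get(url, timeout=10)

    soup = BeautifulSoup(resp.text, "html.parser")
    script_hosts = {
        urlparse(tag["src"]).hostname
        for tag in soup.find_all("script", src=True)
        if urlparse(tag["src"]).hostname
    }
    third_party_hosts = {h for h in script_hosts if not h.endswith(first_party)}

    return {
        "third_party_script_hosts": sorted(third_party_hosts),
        "cookies_set": len(session.cookies),
    }

if __name__ == "__main__":
    print(third_party_snapshot("https://example.com"))
```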
Data is essentially a method of making systematic, abstract observations about the world, and it allows you to use those observations to develop a deeper understanding of the world.
When you come across a data set for a story, let’s say police bureau arrest data, what are the first steps you take to better understand that data?
Don’t go into a data set blindly; try to go in with a hypothesis, a question, or something you are looking for. A story done at The Markup built on research at Princeton, where they were [data] scraping internet service providers (ISPs) and figuring out what speeds they were offering.
We were interested in franchise agreements, which ISPs sign with cities in order to use city infrastructure, to carry their cables, for example, in exchange for providing those cities with service. We used this data set to see whether the ISPs were giving cities the service they were supposed to; that was our first hypothesis. When we scraped the data, we saw that cities generally did get the service they were promised, but that different areas within the same city would get different speeds from the same provider. You can imagine which areas were getting the slower service. In reaction to this, the hypothesis changed.
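[Editor’s note: The hypothesis shift Sankin describes, from “are cities getting the promised speed?” to “which areas within a city get slower service?”, maps onto a simple grouped comparison. The sketch below assumes a CSV of scraped speed offers with city, neighborhood and download_mbps columns; the file name, column names and promised-speed figure are invented for the illustration.]

```python
import pandas as pd

# Assumed schema: one row per scraped address-level speed offer.
offers = pd.read_csv("isp_speed_offers.csv")  # hypothetical file

# First hypothesis: does each city, overall, get the speed it was promised?
promised_mbps = 100  # assumed contract figure for the example
by_city = offers.groupby("city")["download_mbps"].median()
print(by_city[by_city < promised_mbps])  # cities falling short overall

# Revised hypothesis: within a single city, do some neighborhoods get
# much slower offers from the same provider?
by_neighborhood = (
    offers.groupby(["city", "neighborhood"])["download_mbps"]
    .median()
    .reset_index()
)
spread = by_neighborhood.groupby("city")["download_mbps"].agg(["min", "max"])
print(spread[spread["max"] - spread["min"] > 50])  # cities with large internal gaps
```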
In my mind, there is a bit of a conflict between objectivity and data analysis within the journalism industry. As you said, data analysis requires coming into a data set with a hypothesis, while, in my opinion, the objectivity standard often encourages journalists not to come into stories with preconceived notions.
I don’t agree that they necessarily contradict. The process that goes into objectivity (if you are saying something bad about someone, you ask them about it, for example) is the same thought process you use when engaging with data analysis.
What makes data fundamentally different is that it is a very narrow view of a very specific thing, and if you don’t get more context from the real world, you aren’t truly comprehending the full story.