
How The Washington Post uncovered the sources that make AI chatbots sound so smart

How did AI chatbots get so smart? We set out to answer that question for readers of The Washington Post in “Inside the secret list of websites that make AI bots like ChatGPT sound so smart.” Our journey started with an attempt to understand the underlying websites that shape chatbot knowledge.

Accessing the data

To start, we pursued access to Google’s Colossal Clean Crawled Corpus, or C4, a dataset that encapsulates the vast web sources used to train artificial intelligence models. C4 was created by applying a set of filters to a single snapshot of Common Crawl.

Although Google does not offer it for download, researchers at the Allen Institute for AI re-created C4 and provided us with a list of the top 1,000 domains from it. Initially, we aimed to categorize these domains manually. However, we quickly learned that they accounted for a mere 8% of tokens (roughly, the words and word fragments models are trained on), far from a representative sample.

This revelation prompted a deeper investigation.

The Allen Institute researchers extracted all 15.7 million domains and shared them with us. Navigating such a vast dataset, however, presented its own set of challenges.

Cleaning and analyzing

Many websites have separate domains for their mobile versions (e.g., “en.m.wikipedia.org” and “en.wikipedia.org”). We decided to treat these as the same domain. We also combined subdomains aimed at specific languages, so “en.wikipedia.org” became “wikipedia.org.”
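
As a rough sketch of that normalization step (the list of subdomain labels below is illustrative, not the Post’s actual rules):

const MOBILE_OR_LANGUAGE = new Set(["m", "mobile", "en", "de", "es", "fr", "ja"]); // illustrative list

// Strip leading mobile and language subdomains so "en.m.wikipedia.org",
// "en.wikipedia.org" and "wikipedia.org" all collapse to the same domain.
const normalizeDomain = (domain: string): string => {
  const parts = domain.toLowerCase().split(".");
  while (parts.length > 2 && MOBILE_OR_LANGUAGE.has(parts[0])) {
    parts.shift();
  }
  return parts.join(".");
};

// normalizeDomain("en.m.wikipedia.org") === "wikipedia.org"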

We engaged with other researchers and used categorization data from the analytics firm Similarweb. R Markdown became our tool of choice for data cleaning and analysis, enabling us to produce interactive web pages for collaborative access.

Similarweb helped us place two-thirds of the domains — about 10 million — into categories and subcategories. (The rest could not be categorized, often because the sites were no longer accessible.) This still required a hands-on approach, as we combed through the data to refine and reorganize the categories.
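
As a simplified illustration of that matching step (the field names and fallback label are hypothetical, not the Post’s pipeline):

type CategorizedDomain = { domain: string; tokens: number; category: string };

// similarwebCategories maps a domain to its Similarweb category/subcategory, when one exists.
const categorizeDomains = (
  domains: { domain: string; tokens: number }[],
  similarwebCategories: Map<string, string>
): CategorizedDomain[] =>
  domains.map(d => ({
    ...d,
    // Roughly a third of domains had no match, often because the sites were no longer accessible.
    category: similarwebCategories.get(d.domain) ?? "Uncategorized",
  }));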

Our examination of these 15 million websites unveiled a diverse range of sources, from proprietary and personal sites to those bearing controversial content. Together, these domains shape the responses of chatbots.

To determine the significance of each website, we used its token count, which reflects the volume of words and word fragments drawn from that domain. A snapshot of the top websites by tokens appears in the published story, where readers can explore more top sites.
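
As a sketch of that ranking step (again with hypothetical field names rather than the Post’s actual code), sorting domains by token count and computing the share of all tokens they cover might look like this:

type DomainTokens = { domain: string; tokens: number };

const rankByTokens = (domains: DomainTokens[], topN = 1000) => {
  const totalTokens = domains.reduce((sum, d) => sum + d.tokens, 0);

  // Sort descending by token count and keep the biggest contributors.
  const top = [...domains].sort((a, b) => b.tokens - a.tokens).slice(0, topN);

  // Share of all tokens the top N domains account for
  // (the top 1,000 domains covered only about 8% of the corpus).
  const share = top.reduce((sum, d) => sum + d.tokens, 0) / totalTokens;

  return { top, share };
};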

Visualization

Given the hierarchical and part-to-whole nature of our data, we decided to visualize the information with a tree map.
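
As a rough sketch of how such a treemap layout can be computed with the d3-hierarchy package (the data, token numbers and dimensions below are made up for illustration and are not the Post’s actual chart code):

import { hierarchy, treemap } from "d3-hierarchy";

type CategoryNode = { name: string; tokens?: number; children?: CategoryNode[] };

// Illustrative data only; the real dataset has thousands of categories and millions of domains.
const data: CategoryNode = {
  name: "All domains",
  children: [
    {
      name: "News & Media",
      children: [
        { name: "exampletimes.com", tokens: 500 },
        { name: "dailyexample.org", tokens: 300 },
      ],
    },
    { name: "Technology", children: [{ name: "example.dev", tokens: 200 }] },
  ],
};

// Size each rectangle by its token count and sort so the biggest categories come first.
const root = hierarchy(data)
  .sum(d => d.tokens ?? 0)
  .sort((a, b) => (b.value ?? 0) - (a.value ?? 0));

// Compute x0/y0/x1/y1 for every node within a 960x600 drawing area.
treemap<CategoryNode>().size([960, 600]).paddingInner(1)(root);

// root.leaves() now holds one positioned rectangle per domain, ready to render.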

This visualization, however, relied heavily on detailed annotations, which made it important to reveal individual websites within the overarching categories. We spent a lot of time getting the annotations right, which is tricky for a dynamic chart that rearranges itself.

There are two primary methods to avoid label collisions within a given parent container:

  • Nudging colliding labels: We have a labelAvoiderNudge function that detects colliding labels and nudges them to avoid overlaps. For this walkthrough, I simplified the function to use only the nudge-to-the-right method and limit the number of attempts to a hard-coded value rather than providing it as an option.
import RBush from "rbush";

// Each label is stored as an axis-aligned bounding box, the format RBush expects.
type Body = { minX: number; minY: number; maxX: number; maxY: number };

// Searching a body's own box always returns at least itself,
// so more than one hit means an overlap with another label.
const getCollisions = (tree: RBush<Body>) => {
  return tree.all().filter(body => tree.search(body).length > 1);
};

export const simpleLabelAvoiderNudge = (labels: Body[]) => {
  const tree = new RBush<Body>();

  // Insert all label bodies into the R-tree.
  labels.forEach(label => tree.insert(label));

  // Check and handle collisions, giving up after a few passes.
  let attempts = 0;
  while (attempts < 3) {
    const collidingLabels = getCollisions(tree);
    if (!collidingLabels.length) break;

    collidingLabels.forEach(body => {
      // RBush has no update method, so remove the body, move it, then re-insert it.
      tree.remove(body);

      // For simplicity, just move the label slightly to the right.
      body.minX += 5;
      body.maxX += 5;

      tree.insert(body);
    });
    attempts++;
  }

  // Return the labels with their updated positions.
  return labels;
};
  • Positioning choices for labels: We have another function called labelAvoiderChoices that changes the position of colliding labels based on a set of provided choices: top right, top left, bottom right, bottom left, etc. For the sake of brevity, I’ll assume that we have two position choices — the current position and a new position slightly below — and that the new position is hard-coded to be a certain number of units below the original.
// Reuses the RBush import, Body type and getCollisions helper defined above.
export const simpleLabelAvoiderChoices = (labels: Body[]) => {
  const tree = new RBush<Body>();

  // Insert all label bodies into the R-tree.
  labels.forEach(label => tree.insert(label));

  // Check and handle collisions, giving up after a few passes.
  let attempts = 0;
  while (attempts < 3) {
    const collidingLabels = getCollisions(tree);
    if (!collidingLabels.length) break;

    collidingLabels.forEach(body => {
      // Remove, mutate, then re-insert, since RBush has no update method.
      tree.remove(body);

      // Move the label slightly below as our second position choice.
      body.minY += 10;
      body.maxY += 10;

      tree.insert(body);
    });
    attempts++;
  }

  // Return the labels with their updated positions.
  return labels;
};

Both of these functions rely on spatial indexing through the RBush library, a tool for managing and querying rectangular spatial data, which makes collision detection efficient.
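
For context, here is a hypothetical example of how these helpers might be wired up; the Label type and the conversion to bounding boxes are assumptions, not the Post’s actual chart code:

// Assumes this lives alongside the Body type and simpleLabelAvoiderNudge above.
type Label = { text: string; x: number; y: number; width: number; height: number };

const avoidOverlaps = (labelData: Label[]) => {
  // Wrap each label's position and size in the bounding-box format RBush expects.
  const bodies: Body[] = labelData.map(l => ({
    minX: l.x,
    minY: l.y,
    maxX: l.x + l.width,
    maxY: l.y + l.height,
  }));

  // Nudge colliding boxes; the function mutates the boxes in place.
  simpleLabelAvoiderNudge(bodies);

  // Copy the adjusted coordinates back onto the labels.
  labelData.forEach((label, i) => {
    label.x = bodies[i].minX;
    label.y = bodies[i].minY;
  });
};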


To gain further insights, the Allen Institute granted us access to their full-text search API. Our searches revealed telling statistics, like 72,000 mentions of the term “swastika” and a whopping 200 million hits for the copyright symbol “©.”

Our exploration wasn’t without its challenges. Delving into the “clean” dataset, we encountered a disturbing amount of offensive material. We spent days combing through offensive domain names, including racial slurs, obscenities and porn, to confirm their existence and mask specific words from readers in our searchable database.

An interactive searchable tool

When we discovered our personal websites within the dataset, we realized we would need a public search function.

This posed a challenge: how could we enable a search through 15 million domains without constructing an API?

Our solution entailed grouping domains by their first three letters and creating a CSV for each combination. Now, when a user types three characters, the pertinent CSV loads and filters the results. It’s worth noting that 10 million domains are scattered across 46,000 files, with names like “ksc.csv.”
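
A minimal sketch of that sharding step might look like the following; the record shape, file paths and CSV columns are assumptions for illustration, not the Post’s build script:

import { mkdirSync, writeFileSync } from "node:fs";

type DomainRow = { domain: string; category: string; tokens: number };

const shardByPrefix = (rows: DomainRow[]) => {
  const shards = new Map<string, DomainRow[]>();

  // Group each domain by its first three characters (producing files like "ksc.csv").
  for (const row of rows) {
    const prefix = row.domain.slice(0, 3).toLowerCase();
    if (!shards.has(prefix)) shards.set(prefix, []);
    shards.get(prefix)!.push(row);
  }

  // Write one small CSV per prefix; the browser only ever fetches the shard it needs.
  mkdirSync("shards", { recursive: true });
  for (const [prefix, group] of shards) {
    const body = group.map(r => `${r.domain},${r.category},${r.tokens}`).join("\n");
    writeFileSync(`shards/${prefix}.csv`, `domain,category,tokens\n${body}\n`);
  }
};

On the client side, once a reader has typed three characters, only the matching prefix file needs to be fetched and filtered in the browser.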

Concluding thoughts

Like all stories, this piece relies on regular reporting. It’s more than accumulating data points — it’s about critically examining them, identifying the gaps and seeking further insights. 

When we realized the initial 1,000 domains were insufficient, the solution wasn’t just a matter of getting more data. We actively communicated with the researchers to obtain the full dataset for all domains to get the complete picture of what gets fed into the models. 

Leveraging the categorization data from Similarweb helped streamline the process, but it wasn’t just a simple import and process task. There was a substantial manual inspection phase — checking the data, discussing the nuances and discrepancies and repeatedly consulting with researchers. This iterative process is a reminder that regular reporting isn’t always linear. It’s a back-and-forth process, demanding patience and a detail-oriented approach.

Szu Yu Chen and Kevin Schaul
