A high school student is thrown to a classroom floor by a school resource officer in South Carolina. A tsunami destroys homes after the 2011 Japanese earthquake. A horrifying stampede kills hundreds of pilgrims during the 2015 Hajj outside Mecca. These events were all captured by amateur filmmakers.
The footage, all captured on cell phones, made international news.
“As billions of people come to own smartphones, the amount of newsworthy video shot by non-journalists will explode,” wrote Marcus Moretti, project manager of data at Mic News, a news site that targets Millennials, in a recent blog post. “Today, anyone with a smartphone can be a stringer.”
But how will news organizations find that video, especially, as Moretti noted, in an age when more than 300 hours of video are uploaded to YouTube every minute?
Enter companies like Dextro, a technology startup that is helping Mic discover newsworthy videos using algorithms that identify scenes, speech and objects within videos. By analyzing a video’s image and sound in real time, Dextro tags and extracts what may be relevant to a news organization, such as footage of a political rally or of the stampede that killed more than 1,400 people in Saudi Arabia last month.
Dextro, along with other computer vision companies like MetaMind and Clarifai, aims to change the way we find and consume video by doing what humans cannot do. By employing a process known as machine learning, which teaches computer algorithms to recognize patterns, Dextro can, in minutes, analyze hours of video, flag salient content, and serve it to clients like Mic.
Credit: Dextro and Mic News.
“We essentially just watch videos with algorithms and make them searchable by indexing their content,” Dextro’s co-founder David Luan told Storybench. He makes it sound simple. “It’s video in, structured data out. And we’re able to analyze both audio and video, which is something no one’s ever done before.”
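Luan’s “video in, structured data out” can be pictured with a small sketch. It is not Dextro’s system: the segment list, the labels and the function names below are invented for illustration, and the sketch assumes some earlier analysis step has already produced time-stamped labels for each video. It shows only the indexing idea, in which labels become keys that point back to the moments where they occur.

```python
from collections import defaultdict

# Hypothetical output of a video-analysis step (all values invented):
# each entry is (video_id, start_second, end_second, label).
SEGMENTS = [
    ("vid_001", 0, 12, "crowd"),
    ("vid_001", 12, 30, "police van"),
    ("vid_002", 5, 20, "crowd"),
    ("vid_003", 0, 8, "podium"),
]


def build_index(segments):
    """Map each lowercase label to the (video_id, start, end) spans where it appears."""
    index = defaultdict(list)
    for video_id, start, end, label in segments:
        index[label.lower()].append((video_id, start, end))
    return index


def search(index, query):
    """Return the spans tagged with the query label, or an empty list if none."""
    return index.get(query.lower(), [])


index = build_index(SEGMENTS)
crowd_hits = search(index, "Crowd")
for video_id, start, end in crowd_hits:
    print(f"{video_id}: crowd visible from {start}s to {end}s")
```

Once such an index exists, an editor’s search for a label reduces to a lookup rather than a viewing session, which is the sense in which the video becomes searchable.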
Improving tagging across the Internet
Who benefits from this technology? Stock photography companies like Foap and video streaming platforms like Vimeo are two examples. Both are clients of Clarifai, another NYC-based computer vision company. Matt Zeiler, Clarifai’s founder, explains what attracted those partners.
“A lot of what stock media companies do is being done manually,” he told Storybench. Zeiler immediately recognized a problem and offered a solution with computer vision. “Stock photographers don’t want to be wasting their time tagging content. We want to make the upload process faster so they can get more onto the marketplace.”
Zeiler, who started Clarifai in 2013 after finishing his PhD in machine learning at New York University, is currently in talks with media companies to help them, among other things, improve their internal tagging structure. “There’s no current standard,” Zeiler says. “Tags should be more consistent across the marketplace.”
That’s because the better your labeling, the easier it is to find what’s in your archives. It’s not unlike a factory warehouse: the easier it is to find your inventory, the faster you can ship it out.
Like Clarifai, Dextro has partnered with companies eager to monetize their terabytes of often mislabeled and disorganized video content in order to create channels and curate collections using this automated categorization technology.
For newsrooms interested in finding the signal in all the noise of user-generated video, Dextro can surface videos with newsworthy themes, transcribe and flag conversations within videos, and offer a visual, navigable timeline of what’s being said and what’s being shown in a video. With the amount of multimedia it processes, it’s no surprise that Facebook is also among those getting into the searchable video game.
From chasing puppies to annotating video
Transcribing audio and searching through a transcript for keywords is pretty straightforward, but how exactly do you get a machine to identify a crowd in a video or recognize and flag a police van? It took a while to get to that point, Luan says. He started thinking about the challenge of detecting objects in video as an intern at Microsoft and iRobot. He found that their “smart” consumer toys and appliances, in some cases connected to the Internet, were far from intelligent. What if robots and security cameras could process their onboard video feeds, make complicated choices and adjust to their environment? Could engineers teach them to learn from what their “eyes” were picking up?
“We wanted to make robots, security cameras, and Internet of Things smart. But people didn’t know what to do with the data except download it,” he explains. “We wondered, what if we could analyze a robot’s video feed, identify a puppy in the video, for example, and make the robot chase the puppy?”
After Luan started developing this technology under the name Dextro Robotics, he announced it on Hacker News, a geeky news website hosted by Y Combinator. Immediately, people started offering scenarios where they imagined Luan’s technology could be applied. “Everyone wanted to use it,” he says. “The value add was in making video searchable.”
Can a computer watch video?
How can you duplicate in a computer what the human eye and brain do so well, namely acquire, process and understand visual data? Teaching a computer to do this is the central challenge in computer vision, a field that today leans heavily on machine learning, in which computers learn from data and make predictions based on it.
We can recognize a yellow taxi after looking at it for mere tenths of a second. But teaching a computer to recognize one means building a model that it can check new images against. This model must capture the elements that make up a yellow taxi, such as circles for wheels, yellow panels on all sides and windows. As the model is fed hundreds of videos of yellow taxis, its algorithm becomes better and better at “seeing” a yellow taxi and, crucially, at telling it apart from a yellow school bus.
At Dextro, Luan explains, “we have lots and lots of videos that somebody has labeled, like ‘this is a video of two pizzas on an oven,’ we take that information and feed it into our model and say ‘the correct output is two pizzas on an oven.’ Then we reconfigure its parameters until the model converges with that pizza output. If for any reason you’re not outputting ‘two pizzas on an oven,’ we make adjustments.”
The model works, in essence, like an enormous decision procedure that ideally accounts for every conceivable possibility of what is and what is not a yellow taxi. On the inside, says Luan, it looks like “a gigantic graph of nodes with weights and sums and billions of parameters.” Once the model is sufficiently trained, it can identify the key characteristics in a video without the help of a human.
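The loop Luan describes (show the model a labeled example, compare its output with the correct answer, adjust the parameters, repeat) can be sketched at toy scale. What follows is not Dextro’s model, which he says has billions of parameters. It is a minimal logistic regression with three parameters, and its two input features (vehicle length and whether there is a roof sign) and all of its numbers are invented for illustration.

```python
import math

# Invented training data: ([length in tens of meters, has roof sign], label),
# where label 1 means "yellow taxi" and 0 means "yellow school bus".
EXAMPLES = [
    ([0.45, 1.0], 1),
    ([0.50, 1.0], 1),
    ([0.48, 1.0], 1),
    ([1.10, 0.0], 0),
    ([1.20, 0.0], 0),
    ([1.05, 0.0], 0),
]


def predict(weights, bias, features):
    """Return the model's estimated probability that the vehicle is a taxi."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, learning_rate=0.5, epochs=2000):
    """Nudge the parameters after every example until outputs match the labels."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            # Positive error: the model guessed "taxi" too strongly; negative: too weakly.
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


weights, bias = train(EXAMPLES)
taxi_score = predict(weights, bias, [0.47, 1.0])  # unseen taxi-like vehicle
bus_score = predict(weights, bias, [1.15, 0.0])   # unseen bus-like vehicle
print(f"taxi-like: {taxi_score:.3f}  bus-like: {bus_score:.3f}")
```

The “adjustments” in Luan’s description correspond to the two update lines inside the loop: each wrong or hesitant answer shifts the weights slightly toward the labeled output, and after enough passes the model separates the two kinds of vehicle on examples it has not seen.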
Teaming up with journalists
It all started, as many things do, in a bar in New York. Luan happened to be down the street from the Mic Product team when his phone went off. Did he want to grab a drink? Moretti, an old friend from Yale, was around the corner at Anotheroom, a popular Tribeca lounge.
Over drinks, talk turned to Luan’s work with machine learning. Moretti was intrigued. As the night wore on, his team came to see the value of searchable video for their organization. Mic targets Millennials, and much of its content is multimedia. An interface that could surface user-submitted video that had never been tagged by users, but nonetheless contained newsworthy content, would be extremely helpful. Luan agreed.
“There’s so much citizen journalism out there just waiting to be discovered,” Luan notes. “Twitter, Instagram, Periscope, there are so many platforms and everything’s being uploaded but not being tagged. Well, if there’s all this stuff being posted, what if news organizations were able to filter down all Twitter videos to the gigantic kernels of relevance to current events?”
So Dextro and Mic teamed up to build a dashboard that aggregates videos shared on Twitter, analyzes them and then surfaces them by topic, with topics as diverse as Ted Turner, the Pittsburgh Steelers, Benghazi and Jeremy Corbyn. “We built an interface for Mic with an internal dashboard that allows you to filter metadata, discover dense concepts, and generate a thumbnail of the region where, say, they were talking about Pope Francis. It’s way denser than tags,” says Luan.
One can imagine more newsrooms around the world leveraging this kind of video discovery technology, especially as the sheer quantity of video uploads skyrockets.
As Luan puts it: “There’s way too much video for humans to watch all of it.”