Eight tools, datasets and resources from the Open Data Science Conference


This week, thousands of data scientists, software developers, machine learning experts and other data-curious people gathered in Boston for the ODSC East conference. Between hands-on training sessions and dynamic keynote speeches from people like MIT professor Regina Barzilay and Wired writer Clive Thompson, the recommendations were flying. Storybench attended and jotted a few down.

Satellite imagery dataset

In his talk on computer vision and satellite imagery, Microsoft’s Xiaoyong Zhu highly recommended the xView dataset, which he called the “largest publicly-available dataset for satellite imagery” and which offers high coverage of urban scenarios.

Getting tidy with classification and regression

RStudio’s Max Kuhn, who built R’s caret package, gave a helpful workshop on modeling in R, introducing some great packages like parsnip, kknn and recipes. His slides are up on GitHub.

Insurance against disruption

“Chances are you’ll be blindsided by new technology,” said MIT’s Michael Stonebraker in his hit keynote talk. “If you are sitting on your laurels, chances are your market is going to be eaten by someone else… You have to be able to reinvent yourself.” What to do? “You should all read Clayton Christensen’s book, ‘The Innovator’s Dilemma.’ Read it once a year.”

Other book recommendations from the conference: The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies by Erik Brynjolfsson and Andrew McAfee, and Superintelligence: Paths, Dangers, Strategies by Nick Bostrom.

Deep learning from scratch

Seth Weidman, a senior data scientist at Facebook, has posted the full documentation and presentation of his deep learning from scratch talk on GitHub. It walks you through the math, diagrams and code you need to understand deep learning. Cool.
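For a flavor of what “from scratch” can mean, here is a minimal sketch in plain Python. It is not taken from Weidman’s materials. It assumes the simplest possible case, a single sigmoid neuron trained by gradient descent on the logical OR function, and the names train_neuron, predict and OR_DATA are made up for this illustration.

```python
import math
import random


def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Forward pass: weighted sum of the inputs, then the sigmoid."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_neuron(data, epochs=2000, lr=0.5, seed=0):
    """Fit one sigmoid neuron with hand-derived gradients.

    Returns (weights, bias, losses), where losses holds the mean
    cross-entropy loss recorded at every epoch.
    """
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in data[0][0]]
    bias = 0.0
    losses = []
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        loss = 0.0
        for x, y in data:
            p = predict(weights, bias, x)
            loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            # Backward pass: with sigmoid plus cross-entropy,
            # the gradient of the loss w.r.t. the weighted sum is p - y.
            dz = p - y
            for i, xi in enumerate(x):
                grad_w[i] += dz * xi
            grad_b += dz
        # Gradient descent step on the averaged gradients.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        losses.append(loss / n)
    return weights, bias, losses


# Hypothetical toy dataset: the logical OR of two inputs.
OR_DATA = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]

weights, bias, losses = train_neuron(OR_DATA)
```

Rounding predict(weights, bias, x) for each input in OR_DATA should recover the OR labels, and the recorded loss should fall over the course of training. Real networks stack many such units and automate the gradient bookkeeping, but the forward pass, backward pass and update step are the same idea.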

Deep learning for Twitter sentiment analysis

MathWorks’ MATLAB product manager Heather Gorr open-sourced her deep learning model (using a long short-term memory network) that classifies tweets as positive or negative.

Two deep learning basics resources

I saw these two resources recommended over and over again: fast.ai and deeplearning.ai.

Generating unicorns in the Andes?

Researchers working on automatic text generation at OpenAI are training a “large-scale unsupervised language model which generates coherent paragraphs of text.”

In other words, they supply this text:

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

and get:

The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Perez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Perez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow…
