Airbnb open sources data-science-sharing platform

Most organizations have well-established procedures for vetting and sharing computer code. But what about data analysis?

Important findings are often held in "a mixed bag of presentations, emails, and Google Docs," two members of Airbnb's engineering and data science team blogged at Medium in February. When someone in the organization wants to locate and use that existing work, they often have to track down updated code and waste time checking and reproducing earlier results. And then they'll typically distribute their own findings "through a presentation, email, or Google Doc, perpetuating the cycle."

After considering various ideas on how to solve this problem, Airbnb created an internal Knowledge Repo, combining git version control and Markdown templates for reporting results. Airbnb recently open-sourced its Knowledge Repository Beta, seeking contributors to help move the project forward.

Git allows the same sort of peer review and version control that developers typically use to collaborate on code, while Markdown offers a mixture of text and code in a single, easily reproducible file. RStudio's tutorial on R Markdown gives more information on what this kind of document can do. Markdown-based reporting is also available for other languages, such as Python.

The Airbnb framework setup requires Python and supports "knowledge posts" in several formats.

"Posts are written in Jupyter notebooks, Rmarkdown files, or in plain Markdown, but all files (including query files and other scripts) are committed. Every file starts with a small amount of structured meta-data, including author(s), tags, and a TLDR," according to the Medium post, Scaling Knowledge at Airbnb. "A Python script validates the content and transforms the post into plain text with Markdown syntax. We use GitHub’s pull request system for the review process. Finally, there is a Flask web-app that renders the Repo’s contents as an internal blog, organized by time, topic, or contents."
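To make the validation step concrete, here is a minimal sketch of what checking a post's header might look like. It is not Airbnb's actual script: the `---` delimiters, the field names in `REQUIRED_FIELDS`, and the functions `parse_header` and `validate_post` are all assumptions made for illustration, and only the Python standard library is used.

```python
# Hypothetical sketch only: field names and header layout are assumptions,
# not the Knowledge Repo's real format or API.
REQUIRED_FIELDS = ("title", "authors", "tags", "tldr")


def parse_header(text):
    """Split a post into (metadata dict, body) using '---' delimiter lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("post must start with a '---' metadata header")
    try:
        # Index of the line that closes the header.
        end = next(
            i for i, line in enumerate(lines[1:], start=1)
            if line.strip() == "---"
        )
    except StopIteration:
        raise ValueError("metadata header is not closed with '---'")

    meta = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        meta[key.strip().lower()] = value.strip()

    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


def validate_post(text):
    """Return (metadata, body), raising ValueError if a required field is missing."""
    meta, body = parse_header(text)
    missing = [field for field in REQUIRED_FIELDS if not meta.get(field)]
    if missing:
        raise ValueError("missing metadata: " + ", ".join(missing))
    # Authors and tags are treated as comma-separated lists in this sketch.
    for field in ("authors", "tags"):
        meta[field] = [item.strip() for item in meta[field].split(",") if item.strip()]
    return meta, body


EXAMPLE_POST = """---
title: Example analysis
authors: alice, bob
tags: growth, experiment
tldr: A one-sentence summary of the finding.
---
# Results

Body text in plain Markdown.
"""

if __name__ == "__main__":
    metadata, body = validate_post(EXAMPLE_POST)
    print(metadata["authors"], metadata["tags"])
```

A check like this could run before a pull request is opened, so reviewers only see posts that already carry the author, tag, and TLDR fields the quote describes.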

"It provides various data stores (and utilities to manage them) for 'knowledge posts,' with a particular focus on notebooks (R Markdown and Jupyter / iPython Notebook) to better promote reproducible research," according to the GitHub repository. "The Knowledge Repository is a work in progress. There are lots of code cleanups and feature extensions TBD. Your assistance and involvement is more than encouraged."
