Doctoral Thesis: Collaborative, open, and automated data science

Event Speaker: 

Micah J. Smith

Event Location: 

Via Zoom; see details below

Event Date/Time: 

Wednesday, July 21, 2021 - 1:30pm

Abstract

Data science and machine learning have already revolutionized many industries and organizations, and are increasingly being used in an open-source setting to address important societal problems. However, many challenges remain in developing predictive machine learning models in practice, such as the complexity of the steps in the modern data science development process, the involvement of many different people with varying skills and roles, and the need for, yet difficulty of, collaborating across steps and people. In this thesis, I describe progress in two directions toward supporting the development of predictive models.

First, I propose to focus the effort of data scientists and support structured collaboration on the most challenging steps in a data science project. In the Ballet framework, we create a new approach to collaborative data science development, based on adapting and extending the open-source software development model for the collaborative development of feature engineering pipelines. Using Ballet as a probe, we conduct a detailed case study analysis of an open-source personal income prediction project in order to better understand data science collaborations.

Second, I propose to supplement human collaborators with advanced automated machine learning within end-to-end data science and machine learning pipelines. In the Machine Learning Bazaar, we create a flexible and powerful framework for developing machine learning and automated machine learning systems. In our approach, experts annotate and curate components from different machine learning libraries, which can be seamlessly composed into end-to-end pipelines using a unified interface. We build into these pipelines support for automated model selection and hyperparameter tuning. We use these components to create an open-source, general-purpose, automated machine learning system, and describe several other applications.

Thesis Supervisor: Kalyan Veeramachaneni

To attend this defense, please contact the doctoral candidate at micahs at mit dot edu