Doctoral Thesis: Nonparametric high-dimensional models: sparsity, efficiency, interpretability

Thursday, April 25
10:00 am - 11:30 am

4-265

By: Shibal Ibrahim

Supervisor: Dr. Rahul Mazumder

ABSTRACT:
This thesis explores ensemble methods in machine learning, a class of techniques that builds a predictive model by jointly training simpler base models. It examines three types of ensemble methods: additive models, tree ensembles, and mixtures of experts. Each is characterized by a specific structure: additive models can involve base learners defined on single covariates or on pairs of covariates, tree ensembles use decision trees as base learners, and mixtures of experts typically employ neural networks as experts. The focus of this thesis is on imposing various sparsity and structural constraints within these methods and developing optimization-based approaches that enhance training efficiency, inference, and/or interpretability.

In the first part, we consider additive models with interactions under component selection constraints and additional structural constraints, e.g., hierarchical interactions. We consider different optimization-based formulations and propose efficient algorithms to learn a good subset of components. We develop multiple toolkits that scale to a large number of samples and a large set of pairwise interactions.
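To make the component-selection setting concrete, the following is a minimal, illustrative sketch (not the thesis's toolkits or algorithms): greedy forward selection of additive components under a component budget and a strong-hierarchy constraint. Main effects are kept linear for brevity where a nonparametric model would use, e.g., spline components; the synthetic data and all names are assumptions made for illustration.

# Illustrative sketch only: greedy selection of additive components
# (main effects and pairwise interactions) under a budget k, with a
# strong-hierarchy constraint on interactions.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 500, 6, 4                      # samples, features, component budget
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + 0.5 * X[:, 2] + X[:, 1] * X[:, 2] \
    + 0.1 * rng.normal(size=n)

# Candidate components: one column per main effect or pairwise interaction.
components = [("main", j, X[:, [j]]) for j in range(p)]
components += [("pair", (i, j), (X[:, i] * X[:, j])[:, None])
               for i, j in itertools.combinations(range(p), 2)]

def eligible(kind, idx, selected):
    # Strong hierarchy: an interaction (i, j) is admissible only if both of
    # its main effects are already in the model.
    if kind == "main":
        return True
    i, j = idx
    return ("main", i) in selected and ("main", j) in selected

selected, residual = [], y.copy()
for _ in range(k):
    best_gain, best_c = -np.inf, None
    for c, (kind, idx, Z) in enumerate(components):
        if not eligible(kind, idx, selected):
            continue
        beta, *_ = np.linalg.lstsq(Z, residual, rcond=None)
        gain = np.sum((Z @ beta) ** 2)   # residual variance explained by this component
        if gain > best_gain:
            best_gain, best_c = gain, c
    kind, idx, Z = components.pop(best_c)
    beta, *_ = np.linalg.lstsq(Z, residual, rcond=None)
    residual = residual - Z @ beta
    selected.append((kind, idx))

print("selected components:", selected)

Under strong hierarchy, a pairwise interaction only becomes admissible after both of its main effects enter the model, which is the kind of structural constraint referenced above.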

In the second part, we consider tree ensemble learning. In this setting, we develop a flexible and efficient formulation of differentiable tree ensemble learning, and study flexible loss functions, multitask learning, and related extensions. We also consider end-to-end feature selection in tree ensembles, i.e., we perform feature selection while training the tree ensemble. This is in contrast to popular tree ensemble learning toolkits, which perform feature selection after training based on feature importances. Our toolkit provides substantial improvements in predictive performance for a desired feature budget.
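As a rough illustration of end-to-end feature selection in differentiable tree learning (a sketch under assumptions, not the thesis's formulation or toolkit), the snippet below trains a single soft decision tree by gradient descent with a learnable per-feature gate and an L1-style penalty on the gates; an ensemble would sum several such trees. The toy data, names, and hyperparameters are illustrative.

# Illustrative sketch only: a soft decision tree with learnable
# per-feature gates, so feature selection happens during training.
import torch
import torch.nn as nn

class SoftTreeWithGates(nn.Module):
    def __init__(self, in_dim, depth=3):
        super().__init__()
        self.depth = depth
        self.split = nn.Linear(in_dim, 2 ** depth - 1)        # routing score per internal node
        self.leaf = nn.Parameter(torch.zeros(2 ** depth))     # constant prediction per leaf
        self.gate_logits = nn.Parameter(torch.zeros(in_dim))  # soft feature-selection gates

    def forward(self, x):
        gate = torch.sigmoid(self.gate_logits)                # soft 0/1 mask over features
        s = torch.sigmoid(self.split(x * gate))               # right-branch probabilities
        leaf_prob, offset = torch.ones(x.shape[0], 1), 0
        for d in range(self.depth):                           # route level by level
            p_right = s[:, offset: offset + 2 ** d]
            offset += 2 ** d
            leaf_prob = torch.stack(
                [leaf_prob * (1 - p_right), leaf_prob * p_right], dim=2
            ).reshape(x.shape[0], -1)
        return leaf_prob @ self.leaf, gate                    # prediction and current gates

# Toy regression: only features 0 and 1 matter, so their gates should stay
# open while the remaining gates are pushed toward zero by the penalty.
torch.manual_seed(0)
X = torch.randn(1024, 10)
y = (X[:, 0] > 0).float() + 0.5 * (X[:, 1] > 0).float()

tree = SoftTreeWithGates(in_dim=10, depth=3)
opt = torch.optim.Adam(tree.parameters(), lr=0.05)
lam = 1e-2                                                    # sparsity penalty strength
for step in range(500):
    pred, gate = tree(X)
    loss = ((pred - y) ** 2).mean() + lam * gate.sum()        # fit + L1-style gate penalty
    opt.zero_grad(); loss.backward(); opt.step()

print("learned feature gates:", torch.sigmoid(tree.gate_logits).detach())

Because the gates sit inside the differentiable tree, selecting features and fitting the trees are optimized jointly, rather than pruning features after training based on importance scores.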

In the third part, we consider sparse gating in mixtures of experts. A sparse mixture of experts is a paradigm in which only a subset of the experts (typically neural networks) is activated for each input sample; it is used to scale both training and inference of large-scale vision and language models. We consider multiple approaches to improve sparse gating in mixture-of-experts models. Our new approaches show improvements in large-scale experiments on machine translation as well as in distillation of pre-trained models on natural language processing tasks.
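For context, the snippet below sketches standard top-k sparse gating over a set of small expert networks, i.e., the baseline routing mechanism that sparse mixture-of-experts layers rely on; it is not the thesis's gating method, and all module names and sizes are illustrative assumptions.

# Illustrative sketch only: top-k sparse gating, where each input is
# routed to k of the experts and outputs are combined with softmax weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, n_experts=8, k=2):
        super().__init__()
        self.k, self.out_dim = k, out_dim
        self.router = nn.Linear(in_dim, n_experts)             # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, out_dim))
            for _ in range(n_experts)
        )

    def forward(self, x):
        logits = self.router(x)                                # (batch, n_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)      # keep only k experts per input
        weights = F.softmax(topk_vals, dim=-1)                 # renormalize over selected experts
        out = torch.zeros(x.shape[0], self.out_dim, device=x.device)
        for e, expert in enumerate(self.experts):
            rows, slot = (topk_idx == e).nonzero(as_tuple=True)  # inputs routed to expert e
            if rows.numel() == 0:
                continue                                       # expert unused for this batch
            out[rows] += weights[rows, slot].unsqueeze(-1) * expert(x[rows])
        return out

moe = TopKMoE(in_dim=16, hidden=32, out_dim=16, n_experts=8, k=2)
print(moe(torch.randn(4, 16)).shape)                           # torch.Size([4, 16])

Because only k experts run per input, compute grows with k rather than with the total number of experts, which is what makes this style of gating attractive for scaling large models.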

Thesis Committee: Dr. Rahul Mazumder (Thesis Advisor)
Dr. Patrick Jaillet
Dr. Stephen Bates