Building models that are both interpretable and accurate is attracting growing interest in the machine learning community. In this thesis, we developed an interpretable machine learning algorithm called SBRL, and we built a stroke prediction model that is interpretable and statistically significantly more accurate than the stroke models in wide use, for patients with atrial fibrillation (AF) who have no prior history of stroke and who are not taking anticoagulants.
The first part of the thesis presents an interpretable machine learning algorithm that can be used as an alternative to decision tree algorithms. Our algorithm builds an optimized rules list model from data by maximizing the posterior probability of a natural hierarchical generative model. The resulting model takes the form of chained IF-THEN clauses, which makes it simple for a human to follow and to derive its predictions by hand. We developed two theoretical bounds for the algorithm: one on the length of the optimal rules list, and the other an upper bound on the posterior probability of the optimized rules list given its prefixes. The latter can be used to prune the search space. We thoroughly tested our algorithm against other interpretable machine learning algorithms, as well as non-interpretable algorithms, on multiple publicly available datasets from the UCI repository, comparing interpretability, computational speed, and accuracy. Our algorithm strikes a balance among these three criteria.
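To illustrate what "chained IF-THEN clauses" means in practice, the following is a minimal sketch of how a rules list produces a prediction. It is not the SBRL implementation and does not learn anything from data: the feature names (a, b, c), the rules, and the probabilities are hypothetical placeholders chosen only to show the evaluation order.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, bool]
# Each rule: (human-readable condition, predicate, predicted probability).
Rule = Tuple[str, Callable[[Features], bool], float]

# Hypothetical rules and probabilities, for illustration only.
RULES: List[Rule] = [
    ("IF a AND b", lambda f: f["a"] and f["b"], 0.9),
    ("ELSE IF c", lambda f: f["c"], 0.6),
]
DEFAULT: Tuple[str, float] = ("ELSE", 0.1)  # applies when no rule fires


def predict(features: Features) -> Tuple[str, float]:
    """Return the first matching clause and its probability."""
    for description, condition, probability in RULES:
        if condition(features):
            return description, probability
    return DEFAULT


if __name__ == "__main__":
    example = {"a": True, "b": False, "c": True}
    clause, probability = predict(example)
    print(f"{clause}: predicted probability {probability}")
```

Because the clauses are checked in order and the first match decides the prediction, a reader can trace any single prediction by hand, which is the sense in which such a model is interpretable.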
The second part of the thesis presents how we used the ATRIA-CVRN study cohort to build a stroke prediction model that is as simple as, but statistically significantly more accurate than, the stroke models in wide use, such as the CHA2DS2-VASc and ATRIA scores, for patients with AF who are not taking anticoagulants such as warfarin. We focused on the more challenging problem of primary prevention: predicting stroke risk for patients who have no history of stroke. We assessed the strength of the predictors and identified informative predictors that are not used in existing stroke models. We created a univariate stroke model using the most informative predictor, age, and achieved statistically significantly better performance than CHA2DS2-VASc and performance similar to that of ATRIA. We used linear and nonlinear machine learning models to test the limit of the information that can be extracted from the data. We built a linear model with optimized integer coefficients using RiskSLIM. Finally, we used our scalable Bayesian rules list algorithm to generate simple yet accurate descriptions of the patients who were predicted to be at high risk of stroke and who should be recommended anticoagulants.
Thesis Supervisor: Dr. Cynthia Rudin
Committee: Prof. Peter Szolovits, Dr. Una-May O'Reilly