Big Data, Bigger Ideas: the four coauthors of “Data Science in Context” share their perspectives on a growing field.
As the pioneers of a developing field, data scientists often have to deal with a frustratingly slippery question: what is data science, precisely, and what is it good for? The newly released Data Science in Context: Foundations, Challenges, Opportunities seeks to answer that question, giving a broad, conversational overview of the wide-ranging field driving change in sectors from healthcare to transportation to commerce to entertainment.
Alfred Spector, a visiting scholar in the Department of Electrical Engineering and Computer Science, and a team of distinguished co-authors contributed their perspectives to the text: Peter Norvig (Distinguished Education Fellow at Stanford’s Human-Centered Artificial Intelligence Institute and a research director at Google), Chris Wiggins (Associate Professor of Applied Mathematics at Columbia University and Chief Data Scientist at The New York Times), and Jeannette M. Wing (Executive Vice President for Research and Professor of Computer Science at Columbia, and the inaugural Avanessians Director of the university’s Data Science Institute). The co-authors spoke with MIT about their experiences writing the book.
Anyone who writes a textbook hopes to contribute something new to the discussion. When you and your co-authors began planning “Data Science in Context”, what did you feel was missing, and what did you hope to offer?
Alfred Spector: We portray present data science as a field that’s already had enormous benefits, that provides even more future opportunities, but one that requires equally enormous care in its use. Referencing the word “context” in the title, we explain that the proper use of data science must consider the specifics of the application, the laws and norms of the society in which the application is used, and even the time period of its deployment. And, importantly for an MIT audience, the practice of data science must go beyond just the data and the model to the careful consideration of an application’s objectives, its security, privacy, abuse, and resilience risks, and even the understandability it conveys to humans. Within this expansive notion of context, we finally explain that data scientists must also carefully consider ethical trade-offs and societal implications.
Tell me a little about honing and editing the work. How did you keep focus throughout the process?
Peter Norvig: We started when Alfred came to us co-authors, saying he had a complete one-hour talk that was well-received (including this 2017 Seminar at MIT), and he wanted to expand it into a book. As is often the case in such projects, everyone drastically underestimated the amount of remaining work.
Alfred Spector: Much like in open-source projects, I played both the coordinating author role and also the role of overall librarian of all the material, but we all made significant contributions. Chris Wiggins is very knowledgeable on the Belmont principles and applied ethics; he was the major contributor of those sections. Peter Norvig, as the coauthor of the bestselling AI textbook, was particularly involved in the sections on building models and causality. Jeannette Wing worked with me very closely on our seven-element Analysis Rubric and recognized that a checklist for data science practitioners would end up being one of our book’s most important contributions.
Jeannette Wing: Alfred is first author for a good reason! He did the bulk of the writing; Chris did a lot of the ethics parts; Peter did a lot of careful editing and rewriting to ensure we got the technical details right. The reason I had been brought in was I had already written many articles and given many talks about data science, and I had very clear and big ideas that Alfred wanted to include in the book. I was also insistent upon the section describing advanced technologies for protecting privacy; we should educate our audience that there are these advanced technologies for protecting privacy and solving problems! My other role was to bring the academic perspective to the table. Although I spent time at Microsoft, I’m basically an academic; there’s a difference between a technology that’s scaled up to industry use, versus something cutting-edge that’s still being developed in academia.
Alfred Spector: From a nuts-and-bolts perspective, we wrote the book during COVID, using one large shared Google doc with weekly videoconferences. Amazingly enough, Chris, Jeannette, and I didn’t meet in person at all, and Peter and I met only once – sitting outdoors on a wooden bench on the Stanford campus.
That is an unusual way to write a book! Do you recommend it?
Alfred Spector: It would be nice to have had more social interaction, but a shared document, at least with a coordinating author, worked pretty well for something up to this size. The benefit is that we always had a single, coherent textual base, not dissimilar to how a programming team works together.
When writing a comprehensive textbook, I’m sure there are areas of expertise or personal interest that you could have gone into in much greater detail, but didn’t for the sake of focus and length. If you could have added just one more chapter to the book, what would it have focused on?
Peter Norvig: I would have a whole chapter on large generative models: language models such as GPT and image models such as DALL-E. Generative models are changing rapidly in technology, application, and societal implications.
Jeannette Wing: If I’d written another chapter, I would have focused on the use of data science for science, which relies on large, often unique costly instruments, e.g., telescopes, sequencing machines, neutrino detectors, and particle colliders, that generate tons and tons of data, all of which, if analyzed with modern data science methods, can help advance the sciences, from astronomy to biology to physics. We touch on it only very lightly.
Let’s talk a bit about how data is referenced in everyday life. One of the most common daily buzzwords Americans hear is “data-driven”, but many might not know what that term is supposed to mean, and how to verify if a given recommendation or course of action really is supported by data. Can you unpack the term, “data-driven”, a little bit?
Alfred Spector: Data-driven broadly refers to techniques or algorithms powered by data — they either provide insight or reach conclusions, say, a recommendation or a prediction. The algorithms power models which are increasingly woven into the fabric of science, commerce, and life and they often provide excellent results. The list of their successes is really too long to even begin to list.
However, one concern is that the proliferation of data makes it easy for us as students, scientists, or just members of the public to jump to erroneous conclusions. As just one example, our own confirmation biases make us prone to believing some data elements or insights “prove” something we already believe to be true. Additionally, we often tend to see causal relationships where the data only shows correlation. It might seem paradoxical, but data science makes critical reading and analysis of data all the more important.
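The point about seeing causation where the data shows only correlation can be illustrated with a small simulation. This is a minimal sketch, not something from the book: the variables `z`, `x`, and `y`, the noise levels, and the sample size are all invented for illustration. A hidden common cause `z` drives both `x` and `y`, which never influence each other.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical hidden common cause (an assumption for this sketch).
z = [random.gauss(0, 1) for _ in range(2000)]
# x and y each depend on z plus independent noise; neither affects the other.
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

raw = pearson(x, y)
# Subtracting the common cause leaves only the independent noise terms.
adjusted = pearson(
    [xi - zi for xi, zi in zip(x, z)],
    [yi - zi for yi, zi in zip(y, z)],
)

print(f"correlation of x and y:           {raw:.2f}")
print(f"after removing the common cause:  {adjusted:.2f}")
```

In this setup `x` and `y` are strongly correlated, yet the correlation all but disappears once the shared driver is accounted for. A reader who saw only `x` and `y` could easily, and wrongly, conclude that one causes the other.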
I was struck by chapter 11’s message that “data science does not automatically lead to more understanding.” If there’s one thing you’d like the general public to know about data, and what our analysis of it can and can’t do, what would it be?
Peter Norvig: Data provides evidence, not answers. To make use of data, you need a rough model of what’s going on in the world. Given the rough model and the data, the processes of Data Science can fill in the details of the model in the way that is most consistent with the evidence. But without the model, the data is meaningless. We can use the resulting models to suggest actions, but first we have to tell it what our preferences are. Sometimes this is obvious: when playing chess, we prefer to win, not to lose. But sometimes balancing preferences is problematic and controversial: in diagnosing disease, how many false positives are we willing to accept to eliminate a false negative? Data Science and Statistics can’t tell us the answer; only our personal and societal values can.
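The trade-off between false positives and false negatives can be made concrete with a toy example. This is a rough sketch under stated assumptions: the patient scores, the labels, and the cost weights are all invented, and `count_errors` and `best_threshold` are hypothetical helper names. The data fixes how many errors each threshold produces; the costs, which encode our preferences, decide which threshold is "best".

```python
# Hypothetical test scores (higher = more likely diseased) with true labels.
# All numbers are invented for illustration.
patients = [
    (0.95, True), (0.80, True), (0.60, True), (0.35, True),
    (0.70, False), (0.50, False), (0.30, False), (0.20, False),
    (0.10, False), (0.05, False),
]


def count_errors(threshold):
    """Return (false_positives, false_negatives) when flagging score >= threshold."""
    fp = sum(1 for score, sick in patients if score >= threshold and not sick)
    fn = sum(1 for score, sick in patients if score < threshold and sick)
    return fp, fn


def best_threshold(cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest total weighted cost."""
    candidates = sorted({score for score, _ in patients})

    def total_cost(threshold):
        fp, fn = count_errors(threshold)
        return cost_fp * fp + cost_fn * fn

    return min(candidates, key=total_cost)


# Same data, different values: only the cost weights change.
cautious = best_threshold(cost_fp=1, cost_fn=10)   # missing a case is 10x worse
sparing = best_threshold(cost_fp=10, cost_fn=1)    # a false alarm is 10x worse
print(cautious, count_errors(cautious))
print(sparing, count_errors(sparing))
```

With a missed case weighted ten times heavier, the chosen threshold is 0.35 (two false positives, no false negatives). With the weights reversed it is 0.80 (no false positives, two false negatives). Nothing in the data says which weighting is right.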
Many of our readers might be current EECS students at MIT with an interest in marshaling data to help solve local or even global problems. Where do you see a particular need, in the next five to ten years, for good data scientists?
Chris Wiggins: I think the COVID epidemic opened many people’s eyes to how health care in the US is lagging. Not just relative to our expectations, but relative to other countries as well. “Our World in Data” has a great chart on this topic, in which the US is the clear outlier. Clearly, there are many factors here that need to be fixed, but I can’t help but wonder how data, both from individuals’ electronic health records as well as at the policy level, or even health apps and private software, might help. It’s an example of a “wicked problem”, as we discuss in our Concluding Thoughts, and is going to take collaboration across many sectors, but I think data could be extremely useful in education, informing and improving public policy and regulation, and improving individuals’ choices and health outcomes.
Let’s talk about COVID. While some short-range models for mortality were very accurate during the pandemic, you noted the failure of long-range models to predict any of 2020’s four major geotemporal COVID waves in the US.
Alfred Spector: COVID was particularly difficult to predict over the long term because of many factors: the virus was changing, human behavior was changing, and political entities changed their minds. Also, we didn’t have fine-grained mobility data (perhaps for good reasons), and we lacked sufficient scientific understanding of the virus, particularly in the first year.
Do you feel that COVID was a uniquely hard situation to model?
Alfred Spector: Good question. I think there are many other domains which are similarly difficult. Our book teases out many reasons why data-driven models may not be applicable. Perhaps, it’s too difficult to get or hold the necessary data. Perhaps, the past doesn’t predict the future. If data models are being used in life and death situations, we may not be able to make them sufficiently dependable; this is particularly true as we’ve seen all the motivations that bad actors have to find vulnerabilities. So, as we continue to apply data science, we need to think through all the requirements we have, and the capability of the field to meet them. They often align, but not always. And, as Peter and Chris observed, data science needs to move into ever more important areas such as human health, education, transportation safety, etc., and there will be many challenges.
Alfred noted that a core concern of the group of coauthors was making sure a balance was struck between sharing both cautions about the limits of data science, and a sense of hope and optimism about the incredible potential of this science to achieve positive results for the world. Can you share an application where you feel that the use of data has the potential to lead to large-scale societal good?
Peter Norvig: Today I saw a presentation on weather prediction; better predictions will give better crop yields for farmers and will keep us safer from storms and floods. This week there has been a lot of attention to large language models that can write computer programs; this could open up new possibilities for non-programmers to do tasks they were unable to do in the past. Last month there was yet another paper on automated drug discovery, which could mitigate diseases and extend lifespan and health. Next week I expect there will be more examples.
Jeannette Wing: One obvious answer is health care. We have so many unknowns in how to treat disease and conditions that all of us suffer from; the more data we have about a diverse population, the better. Right now, we base a lot of our treatments on clinical trials, which give us a gold standard of randomized controlled testing, but there are alternatives, such as the OHDSI dataset that Columbia is a coordinating center for. They have 600 million unique patient records from around the world. You’re not going to have a 600-million-person population for a clinical trial, so by using a dataset of this size, one can discover things that you would never otherwise discover. I think people would be more inclined to share their data for research purposes if they knew that scientists are using that data to discover new treatments, to do early detection, and to do early intervention; because it could actually help you, as a patient, down the line. That’s the “for good” part of data.
Let’s talk about how data is visualized, and the power of a good visualization. Alfred, you mentioned the popular early-2000s Baby Name Voyager website as one that changed your view on the importance of data visualization. Tell us a little about how that happened.
Alfred Spector: That website, recently reborn as the Name Grapher, had two characteristics that I thought were brilliant. First, it had a really natural interface, where you type the initial characters of a name and it shows a frequency graph of all the names beginning with those letters, and their popularity over time. Second, it’s so much better than a spreadsheet with 140 columns representing years and rows representing names, despite the fact it contains no extra information. It also provided instantaneous feedback with its display graph dynamically changing as you type. To me, this showed the power of a very simple transformation that is done correctly.
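The "very simple transformation" described here, going from a names-by-years table to per-name series filtered by a typed prefix, might look something like the sketch below. It is not the website's actual code: the `counts` table, its numbers, and the `series_for_prefix` helper are all hypothetical.

```python
# Hypothetical counts of babies per name per year; all numbers are invented.
counts = {
    "Alice": {1990: 120, 2000: 150, 2010: 210},
    "Alfred": {1990: 30, 2000: 20, 2010: 25},
    "Peter": {1990: 300, 2000: 220, 2010: 140},
}


def series_for_prefix(prefix):
    """Return {name: [(year, count), ...]} for names starting with prefix.

    Matching ignores case, and each series is sorted by year so it can be
    handed straight to a plotting routine.
    """
    wanted = prefix.lower()
    return {
        name: sorted(by_year.items())
        for name, by_year in counts.items()
        if name.lower().startswith(wanted)
    }


# Re-running the filter on every keystroke gives the "graph changes as you
# type" behavior: each extra letter narrows the set of series to draw.
for typed in ["a", "al", "alf"]:
    print(typed, sorted(series_for_prefix(typed)))
```

The filtered result contains exactly the information already in the table, which is the point being made: the value comes from the presentation, not from extra data.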
You strongly emphasize the importance of humility as data scientists work towards applications that provide insight or generate conclusions. Can you tell me a bit about how you try to impart humility to the next generation of data scientists; and what, to your mind, makes a good data scientist?
Alfred Spector: I optimistically emphasize the power of data science and the importance of gaining the computational, statistical, and machine learning skills to apply it. I also remind students that we are obligated to solve problems well. Chris paraphrases danah boyd, who says that a successful application of data science is not one that merely meets some technical goal, but one that actually improves lives. More specifically, I exhort practitioners to provide a real solution to problems, or else clearly identify what we are not solving so that people see the limitations of our work. We should be extremely clear so that we do not lead others to erroneous conclusions. I also remind people that all of us, including scientists and engineers, are human and subject to the same human foibles as everyone else, such as various biases.
Chris Wiggins: Like most data science teams, at The New York Times, we look for people who know machine learning broadly enough that they know the right tool for the right job, and deeply enough that they know how to adapt and extend standard approaches and open-source tools to our particular problems and data products. But we also emphasize communication: data science in industry is most valuable when data scientists are working directly with partners in product, the newsroom, or business, and we can only do that by speaking the language of our partners, listening closely to their needs, and ensuring that they understand what data science can do (and can’t do!) to help them. I constantly remind my teams and aspiring data scientists of the importance of good communication as the key part of being a good collaborator.