Abstract: It is common practice for data scientists to acquire disparate data sources and integrate them into one. Despite the tremendous amount of work on data cleaning and integration techniques, there has, surprisingly, been very little work on determining (1) how clean the data is, (2) how complete it is, and (3) what impact any unknown data (a.k.a. unknown unknowns) may have on query results.
In this talk, I will first present new techniques to estimate the completeness of the data and the impact of unknown data on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Second, I will present novel techniques to estimate the number of remaining errors in a data set. Finally, I will show how these techniques are integrated into QUDE, a new component of our interactive data exploration system, which aims to automatically assist users in identifying potential risk factors, such as the aforementioned missing data items.
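To make the overlap idea concrete, here is a minimal sketch that is not taken from the talk. It assumes a classical species-richness estimator (bias-corrected Chao1) as a stand-in, treating the number of sources that contain an item as that item's frequency. The function name `chao1_estimate` and the toy sources are hypothetical, and the techniques presented in the talk may differ substantially.

```python
from collections import Counter


def chao1_estimate(sources):
    """Estimate the total number of distinct items, seen or unseen.

    Assumption: bias-corrected Chao1, used here only for illustration.
    An item's frequency is the number of sources in which it appears.
    """
    counts = Counter()
    for source in sources:
        # Count each item at most once per source.
        for item in set(source):
            counts[item] += 1

    observed = len(counts)                            # distinct items seen
    f1 = sum(1 for c in counts.values() if c == 1)    # seen in exactly one source
    f2 = sum(1 for c in counts.values() if c == 2)    # seen in exactly two sources

    # Many singletons relative to doubletons suggest many unseen items.
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical toy data: three overlapping sources.
sources = [
    ["a", "b", "c", "d"],
    ["b", "c", "e"],
    ["c", "d", "f", "g"],
]

estimated_total = chao1_estimate(sources)
observed_total = len({item for source in sources for item in source})
print("observed:", observed_total)
print("estimated total:", estimated_total)
print("estimated missing:", estimated_total - observed_total)
```

The intuition under these assumptions: if most items appear in only one source, the sources overlap little and many items are probably still unseen; if nearly every item appears in several sources, the integrated data is likely close to complete.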
Bio: Tim Kraska is an Assistant Professor in the Computer Science department at Brown University. Currently, his research focuses on building systems for interactive data exploration and transactional systems for modern hardware, especially the next generation of networks. Before joining Brown, Tim spent 3 years as a postdoc in the AMPLab at UC Berkeley, where he worked on hybrid human-machine database systems and cloud-scale data management systems. Tim received his PhD from ETH Zurich under the supervision of Donald Kossmann. He was awarded an NSF CAREER Award (2015), an Air Force Young Investigator Award (2015), a Swiss National Science Foundation Prospective Researcher Fellowship (2010), a DAAD Scholarship (2006), a University of Sydney Master of Information Technology Scholarship for outstanding achievement (2005), the University of Sydney Siemens Prize (2005), two VLDB best demo awards (2015 and 2011), and an ICDE best paper award (2013). Most recently, he was selected as a 2017 Alfred P. Sloan Research Fellow in Computer Science.
Host: Sam Madden