[Academic Seminar] DataPrep: Make Data Scientists Not Complain about Data Preparation
Title: DataPrep: Make Data Scientists Not Complain about Data Preparation
Speaker: Prof. Jiannan Wang, Simon Fraser University
Time: 10:30 am - 11:30 am, October 21, 2020
Venue: Zoom, Meeting ID: 559 916 3678
Data scientists have been complaining about data preparation (data collection --> data understanding --> data cleaning --> data enrichment --> data integration --> feature engineering) for many years. Although some effort has been devoted to solving this problem, a survey released by Anaconda in 2020 shows that it persists: “Data preparation and cleansing takes valuable time away from real data science work and has a negative impact on overall job satisfaction.”
In this talk, I will explain what makes data preparation hard to solve, and present DataPrep, a fast and easy-to-use Python library that addresses these challenges. DataPrep aims to become the "scikit-learn" for data preparation. The DataPrep library currently contains two components: a data connector component to simplify web data collection and an exploratory data analysis (EDA) component to enable fast data understanding. I will describe their novel design in detail and demonstrate how they can save data scientists a significant amount of time. I will also talk about our design of other components, such as data enrichment and data cleaning. Finally, I will introduce a framework from Prof. Ion Stoica (UC Berkeley) on how to pick a research problem, and then use it to justify why data preparation is a great research problem to work on in the next decade.
Please refer to http://dataprep.ai for more details about the DataPrep project.
Professor Jiannan Wang is an Associate Professor in the School of Computing Science at Simon Fraser University. His current research interests are data preparation, ML model debugging/monitoring, and approximate query processing. Previously, he was a postdoc in the AMPLab at UC Berkeley. He obtained his Ph.D. from Tsinghua University. He has won an IEEE TCDE Rising Star Award (2018), an ACM SIGMOD Best Demonstration Award (2016), a Distinguished Dissertation Award from the China Computer Federation (2013), and a Google Ph.D. Fellowship (2011).