The One Million Spreadsheet Project
Scientists, professionals, and managers create hundreds of millions of spreadsheets, making spreadsheets the most widely used programming tool. However, spreadsheet tools have minimal support for performing analysis over datasets stored in several spreadsheet files. This lack of sophisticated analysis of distributed datasets, along with the limited programming experience of spreadsheet users, leads to errors in spreadsheets that have resulted in substantial financial losses to many government and private organizations.
In this project, we are developing an approach for scanning millions of spreadsheets to extract datasets automatically and to predict relationships among the extracted datasets. The relationships that our system predicts will help spreadsheet users detect errors in spreadsheets and avoid the substantial financial losses that result from these errors. They will also make it easy for users to publish their datasets to public databases or to migrate them to relational or NoSQL databases. Database systems have sophisticated query languages, such as Structured Query Language (SQL), that allow analysts to perform complex analysis tasks over data stored in several tables. These types of tasks are impossible to achieve with current spreadsheet tools.
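As a rough illustration of the kind of multi-table analysis that becomes straightforward once datasets are migrated to a relational database, the sketch below loads two small tables into an in-memory SQLite database and joins them with one SQL query. The table names, columns, and the function `total_spend_by_department` are hypothetical and chosen only for this example; they are not part of the project's system.

```python
import sqlite3


def total_spend_by_department(employees, expenses):
    """Join two tables that might have come from separate spreadsheet files.

    employees: iterable of (id, department) rows   (hypothetical schema)
    expenses:  iterable of (employee_id, amount) rows (hypothetical schema)
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, department TEXT)")
    conn.execute("CREATE TABLE expenses (employee_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)", employees)
    conn.executemany("INSERT INTO expenses VALUES (?, ?)", expenses)

    # One declarative query relates rows across both tables and aggregates them.
    rows = conn.execute(
        """
        SELECT e.department, SUM(x.amount)
        FROM employees AS e
        JOIN expenses AS x ON x.employee_id = e.id
        GROUP BY e.department
        ORDER BY e.department
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    employees = [(1, "Bio"), (2, "Chem"), (3, "Bio")]
    expenses = [(1, 10.0), (2, 5.0), (3, 2.5), (1, 1.5)]
    print(total_spend_by_department(employees, expenses))
```

The join condition `x.employee_id = e.id` is exactly the kind of relationship between datasets that has to be known, or predicted, before such a query can be written.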
People
- Abdussalam Alawini, University of Illinois at Urbana-Champaign
- Hang Song, University of Illinois at Urbana-Champaign
- Yitao Meng, University of Illinois at Urbana-Champaign
From imperative to declarative: A data-driven study of how students learn Structured Query Language (SQL)
Structured Query Language (SQL) is the de facto standard for querying databases, so a solid understanding of how SQL works is critical for the many professionals who make data-driven decisions. Despite the importance of databases, we know little about how people learn to write SQL, especially compared to other programming paradigms. Yet knowledge of SQL is vital to the use of data science for accelerating scientific inquiry, business decision-making, and effective healthcare.
SQL is a declarative query language, which is a completely different paradigm from imperative programming languages such as Python, Java, or C. Consequently, most research on misconceptions and programming difficulties conducted in these paradigms does not translate well, if at all, to how people learn SQL. In this project, we propose both to provide foundational research on how people learn SQL and to create infrastructure for facilitating future research and dissemination of that research. This project aims to:
- Analyze thousands of student SQL submissions using automated program analysis techniques to identify common student mistakes in writing SQL queries.
- Use qualitative interviews to more richly explore common student mistakes.
- Create an open-source web app through which instructors can quickly see, in real time, the common mistakes their students are making. The app is meant to facilitate just-in-time instruction and active learning and to contribute to a repository of data for future research on common student mistakes.
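To make the imperative/declarative contrast described above concrete, the following sketch computes the same per-student average twice: once with an explicit Python loop, and once with a single SQL query run through the standard `sqlite3` module. The `scores` table and both function names are hypothetical and serve only as an illustration; they are not taken from the project's materials.

```python
import sqlite3


def average_imperative(rows):
    """Imperative style: spell out *how* to accumulate totals and counts."""
    totals, counts = {}, {}
    for student, score in rows:
        totals[student] = totals.get(student, 0) + score
        counts[student] = counts.get(student, 0) + 1
    return sorted((s, totals[s] / counts[s]) for s in totals)


def average_declarative(rows):
    """Declarative style: state *what* result is wanted and let the engine plan it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (student TEXT, score REAL)")
    conn.executemany("INSERT INTO scores VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT student, AVG(score) FROM scores GROUP BY student ORDER BY student"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    rows = [("ann", 90), ("bob", 72), ("ann", 80), ("bob", 88), ("cy", 65)]
    print(average_imperative(rows))
    print(average_declarative(rows))
```

The SQL version contains no loop, no accumulator, and no explicit ordering of steps, which is one reason findings about imperative programming may not carry over directly.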
People
- Abdussalam Alawini, University of Illinois at Urbana-Champaign
- Geoffrey Herman, University of Illinois at Urbana-Champaign
- University of Illinois at Urbana-Champaign
- Eric Wang, University of Illinois at Urbana-Champaign
- Micah Meng, University of Illinois at Urbana-Champaign
Data Citation for Data Science Applications
To support modern data science applications, this work aims to develop a generic data citation framework that works across a variety of database models, e.g., graph-based and semi-structured data. We are working on integrating our data citation framework with existing data science environments. One problem to consider is that data science applications operate on various file formats such as comma-separated values (CSV), XML, JSON, and text, some of which do not have query language support (e.g., CSV). We want to develop techniques for tracking data ownership across these formats and using this information to generate citations for data science results automatically.
We also work on developing tools to provide data citation capabilities to biomedical databases, such as Hetionet [5], a graph of biomedical knowledge that encodes relationships uncovered by millions of studies conducted over the last half-century into a single resource.
People
- Abdussalam Alawini, University of Illinois at Urbana-Champaign
- Siqi Xiong, University of Illinois at Urbana-Champaign
- University of Illinois at Urbana-Champaign
Data Citation
Citation is an essential part of scientific publishing and, more generally, of scholarship. It is used to gauge the trust placed in published information and, for better or for worse, is an important factor in judging academic reputation. Now that so much scientific publishing involves data and takes place through a database rather than conventional journals, how is some part of a database to be cited? More generally, how should data stored in a repository that has a complex internal structure and that is subject to change be cited?
People
- Abdussalam Alawini, University of Illinois at Urbana-Champaign
- Susan Davidson, University of Pennsylvania
- Gianmaria Silvello, University of Padua
- Yinjun Wu, University of Pennsylvania
- Wei Hu, Google (Work was done when Wei was a student at Penn)
Publications
- Yinjun Wu, Abdussalam Alawini, Daniel Deutch, Tova Milo, Susan B. Davidson. (2019). ProvCite: provenance-based data citation. Proceedings of the VLDB Endowment.
- Yinjun Wu, Abdussalam Alawini, Susan Davidson, and Gianmaria Silvello. Data Citation: Giving Credit where Credit is Due. To appear in the proc. of SIGMOD 2018
- Abdussalam Alawini, Susan Davidson, Gianmaria Silvello, Val Tannen, Yinjun Wu. “Data Citation: A New Provenance Challenge”. Bulletin of the Technical Committee on Data Engineering (March 2018). Vol. 41 No. 1
- A. Alawini, L. Chen, S. B. Davidson, N. Portilho Da Silva and G. Silvello, “Automating Data Citation: The eagle-i Experience,” 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), Toronto, ON, 2017, pp. 1-10.
- Abdussalam Alawini, Susan B. Davidson, Wei Hu, and Yinjun Wu. 2017. Automating data citation in CiteDB. Proc. VLDB Endow. 10, 12 (August 2017), 1881-1884.
Approximating and Reasoning about Provenance
In many Big Data applications today, such as Next-Generation Sequencing, data processing pipelines are highly complex, span multiple institutions, and include many human and computational steps. The pipelines evolve over time and vary across institutions, so it is difficult to track and reason about the processing pipelines to ensure consistency and correctness of results. Provenance-enabled scientific workflow systems promise to aid here – yet such workflow systems are often avoided due to perceptions of inflexibility, lack of good provenance analytics tools, and emphasis on supporting the data consumer rather than producer.
People
- Abdussalam Alawini, (Most of the work done as part of my post-doc fellowship at UPenn)
- Susan Davidson, University of Pennsylvania
- Zachary Ives, University of Pennsylvania
- Leshang Chen, University of Pennsylvania
- Nan Zheng, University of Pennsylvania
Publications
- Nan Zheng, Abdussalam Alawini, Zachary Ives. (2019). Extending Fine-Grained Provenance to ETL Tasks. 2019 IEEE 35th International Conference on Data Engineering (ICDE).
- Jane Xu, Waley Zhang, Abdussalam Alawini, Val Tannen. “Provenance Analysis for Missing Answers and Integrity Repairs”. Bulletin of the Technical Committee on Data Engineering (March 2018). Vol. 41 No. 1
- A. Alawini, L. Chen, S. B. Davidson, N. Portilho Da Silva and G. Silvello, “Automating Data Citation: The eagle-i Experience,” 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), Toronto, ON, 2017, pp. 1-10.
Identifying Relationships in Collections of Scientific Datasets
Scientific datasets associated with a research project proliferate over time as a result of activities such as sharing datasets among collaborators, extending existing datasets with new measurements, and extracting subsets of data for analysis. As such datasets begin to accumulate, it becomes increasingly difficult for a scientist to keep track of their derivation history, which complicates data sharing, provenance tracking, and scientific reproducibility. Understanding what relationships exist between datasets can help scientists recall their original derivation connection. For instance, if dataset A is contained in dataset B, then the connection could be that A was extended to create B.
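The containment example can be sketched as a simple row-level test. This is only an illustrative check under an assumed definition (every row of A also appears in B, counting repeated rows); it is not the algorithm used in our systems, and the function name `is_row_contained` is hypothetical.

```python
from collections import Counter


def is_row_contained(a_rows, b_rows):
    """Return True if every row of A appears in B at least as many times.

    Rows are compared as tuples of cell values; column order is assumed
    to be the same in both datasets (an assumption of this sketch).
    """
    counts_a = Counter(tuple(row) for row in a_rows)
    counts_b = Counter(tuple(row) for row in b_rows)
    return all(counts_b[row] >= n for row, n in counts_a.items())


if __name__ == "__main__":
    dataset_a = [(1, "x"), (2, "y")]
    dataset_b = dataset_a + [(3, "z")]  # B extends A with a new measurement
    print(is_row_contained(dataset_a, dataset_b))
    print(is_row_contained(dataset_b, dataset_a))
```

A check like this compares one pair of datasets exhaustively, so its cost grows with the size of both datasets.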
We introduce a set of relevant relationships, propose the relationship-identification methodology for testing relationships between pairs of datasets, develop a set of algorithms for efficient discovery of these relationships, and organize these algorithms into a new system called ReConnect to assist scientists in relationship discovery.
While ReConnect helps with identifying relationships between two datasets, it is infeasible for scientists to use it for testing relationships between all possible pairs in a collection of datasets. We introduce an end-to-end prototype system, ReDiscover, that identifies, from a collection of datasets, the pairs that are most likely related. Our preliminary evaluation shows that ReDiscover predicted duplicate, row_containment, and template relationships with F1 of 80%, 57%, and 80%, respectively.
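For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it for a set of predicted dataset pairs against a set of truly related pairs. The function name and the example pairs are hypothetical and do not reproduce our evaluation data.

```python
def f1_score(predicted_pairs, true_pairs):
    """Compute F1 = 2 * P * R / (P + R) over sets of dataset pairs."""
    predicted, actual = set(predicted_pairs), set(true_pairs)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0  # covers empty inputs and the no-overlap case
    precision = true_positives / len(predicted)  # share of predictions that are correct
    recall = true_positives / len(actual)        # share of true pairs that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = {("d1", "d2"), ("d1", "d3"), ("d2", "d4"), ("d3", "d4")}
    actual = {("d1", "d2"), ("d1", "d3"), ("d2", "d4"), ("d4", "d5")}
    print(f1_score(predicted, actual))
```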
People
- Abdussalam Alawini, (Work done as part of my Ph.D. research at Portland State University)
- David Maier, Maseeh Professor of Emerging Technologies, Portland State University
- Kristin Tufte, Research Assistant Professor, Portland State University
- Bill Howe, Associate Professor, Information School, Adjunct Associate Professor, Computer Science & Engineering, Associate Director and Senior Data Science Fellow, UW eScience Institute, Program Director and Faculty Chair, UW Data Science Masters Degree, University of Washington
- Rashmi Nandikur, Software Developer Engineer, Amazon
Publications
- Abdussalam Alawini, Identifying Relationships between Scientific Datasets. 2016. Dissertations and Theses. Paper 2922.
- Abdussalam Alawini, David Maier, Kristin Tufte, Bill Howe, and Rashmi Nandikur. 2015. Towards automated prediction of relationships among scientific datasets. In Proceedings of the 27th International Conference on Scientific and Statistical Database Management (SSDBM ’15), Amarnath Gupta and Susan Rathbun (Eds.). ACM, New York, NY, USA, Article 35, 5 pages.
- Abdussalam Alawini, David Maier, Kristin Tufte, and Bill Howe. 2014. Helping scientists reconnect their datasets. In Proceedings of the 26th International Conference on Scientific and Statistical Database Management (SSDBM ’14). ACM, New York, NY, USA, Article 29, 12 pages.
- Poster, from SSDBM 2013.