July 28, 2016


The One Million Spreadsheet Project

Scientists, professionals, and managers create hundreds of millions of spreadsheets, making spreadsheets the most widely used programming tool. However, spreadsheet tools offer minimal support for performing analysis over datasets stored in several spreadsheet files. This lack of sophisticated analysis of distributed datasets, combined with the limited programming experience of spreadsheet users, leads to spreadsheet errors that have caused substantial financial losses for many government and private organizations.

In this project, we are developing an approach for scanning millions of spreadsheets to extract datasets automatically and to predict relationships among the extracted datasets. The relationships that our system predicts will help spreadsheet users detect errors in their spreadsheets and avoid the substantial financial losses that result from them. It will also make it easy for users to publish their datasets to public databases or to migrate them to relational or NoSQL databases. Database systems offer sophisticated query languages, such as Structured Query Language (SQL), that allow analysts to perform complex analysis tasks over data stored in several tables. These types of tasks are impossible to achieve with current spreadsheet tools.
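To illustrate the kind of cross-table analysis that SQL makes trivial but distributed spreadsheet files make hard, here is a minimal sketch using hypothetical tables (the table names and data are invented for illustration, not drawn from the project):

```python
import sqlite3

# Hypothetical example: two small tables that might otherwise live in
# separate spreadsheet files. In SQL, joining and aggregating across
# them is a single query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE experiments (id INTEGER PRIMARY KEY, lab TEXT);
    CREATE TABLE measurements (exp_id INTEGER, value REAL);
    INSERT INTO experiments VALUES (1, 'Lab A'), (2, 'Lab B');
    INSERT INTO measurements VALUES (1, 3.0), (1, 5.0), (2, 4.0);
""")

# Average measurement per lab, computed across both "files" at once.
rows = conn.execute("""
    SELECT e.lab, AVG(m.value)
    FROM experiments e JOIN MEASUREMENTS m ON e.id = m.exp_id
    GROUP BY e.lab
    ORDER BY e.lab
""").fetchall()
print(rows)  # [('Lab A', 4.0), ('Lab B', 4.0)]
```

Replicating even this two-line query over data split across spreadsheet files would require manual lookups or fragile cross-file formulas.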


  • Abdussalam Alawini, University of Illinois at Urbana-Champaign
  • Hang Song, University of Illinois at Urbana-Champaign
  • Yitao Meng, University of Illinois at Urbana-Champaign

From imperative to declarative: A data-driven study of how students learn Structured Query Language (SQL)

Structured Query Language (SQL) is the de facto standard for querying databases, so a solid understanding of how SQL works is critical for the many professionals who make data-driven decisions. Yet despite the importance of databases, we know little about how people learn to write SQL, especially compared to other programming paradigms. This gap matters because knowledge of SQL is vital to the use of data science for accelerating scientific inquiry, business decision-making, and effective healthcare.

SQL is a declarative query language, a completely different paradigm from imperative programming languages such as Python, Java, or C. Consequently, most research on misconceptions and programming difficulties conducted in those paradigms does not translate well, if at all, to how people learn SQL. In this project, we propose both to provide foundational research on how people learn SQL and to create infrastructure for facilitating future research and its dissemination. This project aims to:
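The paradigm gap can be seen in a toy comparison (the data and task here are hypothetical, purely for illustration): the imperative version spells out how to iterate, branch, and accumulate, while the SQL version states only what result is wanted and leaves the how to the engine.

```python
import sqlite3

grades = [("alice", 90), ("bob", 72), ("alice", 85)]

# Imperative: explicitly loop, branch, and accumulate state.
totals = {}
for name, score in grades:
    if score >= 80:
        totals[name] = totals.get(name, 0) + 1

# Declarative: describe the desired result; the engine chooses the plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO grades VALUES (?, ?)", grades)
rows = dict(conn.execute(
    "SELECT name, COUNT(*) FROM grades WHERE score >= 80 GROUP BY name"
))

print(totals, rows)  # both {'alice': 2}
```

Misconceptions that arise from imperative habits (e.g., expecting a loop variable or an execution order) have no obvious analogue in the declarative version, which is one reason prior findings may not carry over.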

  1. Analyze thousands of student SQL submissions using automated program analysis techniques to identify common student mistakes in writing SQL queries.
  2. Use qualitative interviews to more richly explore common student mistakes.
  3. Create an open-source web app that lets instructors see, in real time, the common mistakes their students are making, facilitating just-in-time instruction and active learning, and that contributes to a repository of data for future research on common student mistakes.
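The first aim can be sketched as follows. This is not the project's actual analysis pipeline; it is a minimal, hypothetical illustration of flagging common mistake patterns in student queries with simple pattern checks (a real analyzer would parse the SQL properly):

```python
import re
from collections import Counter

# Hypothetical student submissions for "names of students older than 20".
submissions = [
    "SELECT name FROM students WHERE age > 20",
    "SELECT name FROM students WHERE age > '20'",   # quoted numeric literal
    "SELECT * FROM students WHERE age > 20",        # over-selects columns
    "SELECT name FROM students HAVING age > 20",    # HAVING without GROUP BY
]

# A few illustrative checks, each an invented label with a regex.
CHECKS = {
    "quoted numeric literal": re.compile(r"'\d+'"),
    "SELECT * instead of named columns": re.compile(r"SELECT\s+\*", re.I),
    "HAVING without GROUP BY": re.compile(r"^(?!.*GROUP BY).*HAVING", re.I),
}

# Tally how often each mistake pattern appears across submissions.
mistakes = Counter()
for sql in submissions:
    for label, pattern in CHECKS.items():
        if pattern.search(sql):
            mistakes[label] += 1

print(mistakes.most_common())
```

Aggregated over thousands of submissions, counts like these are the raw material for the qualitative interviews (aim 2) and the instructor dashboard (aim 3).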


  • Abdussalam Alawini, University of Illinois at Urbana-Champaign
  • Geoffrey Herman, University of Illinois at Urbana-Champaign
  • Seth Poulsen, University of Illinois at Urbana-Champaign
  • Eric Wang, University of Illinois at Urbana-Champaign
  • Micah Meng, University of Illinois at Urbana-Champaign

Data Citation for Data Science Applications

To support modern data science applications, this work aims to develop a generic citation framework that works across a variety of database models, e.g., graph-based and semi-structured data. We are working on integrating our data citation framework with existing data science environments. One challenge is that data science applications operate on various file formats, such as comma-separated values (CSV), XML, JSON, and plain text, some of which lack query language support (e.g., CSV). We want to develop techniques for tracking data ownership across these formats and for using this information to generate citations for data science results automatically.
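As a toy illustration of the idea (not the framework itself, and with invented metadata), one can record which rows of a CSV file contributed to a computed result and fold that subset into an automatically generated citation:

```python
import csv
import io

# Hypothetical CSV dataset with per-file citation metadata.
raw = "gene,expr\nBRCA1,2.5\nTP53,1.0\nBRCA1,3.5\n"
metadata = {"title": "Example Expression Data", "authors": "Doe et al.",
            "year": 2016, "source": "example-repo"}

rows = list(csv.DictReader(io.StringIO(raw)))

# Compute a result while recording which rows (by index) contributed.
used = [i for i, r in enumerate(rows) if r["gene"] == "BRCA1"]
mean_expr = sum(float(rows[i]["expr"]) for i in used) / len(used)

# Generate a citation that pins down the exact subset that was used.
citation = (f"{metadata['authors']} ({metadata['year']}). "
            f"{metadata['title']}. {metadata['source']}, rows {used}.")
print(mean_expr, citation)
```

Because CSV has no query language, the row indices stand in for the query that a database-backed citation framework would record; the harder research question is doing this tracking systematically across formats.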

We are also developing tools to provide data citation capabilities to biomedical databases, such as Hetionet [5], a graph of biomedical knowledge that encodes relationships uncovered by millions of studies conducted over the last half-century into a single resource.


  • Abdussalam Alawini, University of Illinois at Urbana-Champaign
  • Siqi Xiong, University of Illinois at Urbana-Champaign
  • Siyu Niu, University of Illinois at Urbana-Champaign

Data Citation

Citation is an essential part of scientific publishing and, more generally, of scholarship. It is used to gauge the trust placed in published information and, for better or for worse, is an important factor in judging academic reputation. Now that so much scientific publishing involves data and takes place through a database rather than conventional journals, how is some part of a database to be cited? More generally, how should data stored in a repository that has a complex internal structure and that is subject to change be cited?



  • Yinjun Wu, Abdussalam Alawini, Daniel Deutch, Tova Milo, and Susan B. Davidson. "ProvCite: Provenance-Based Data Citation." Proceedings of the VLDB Endowment, 2019.
  • Yinjun Wu, Abdussalam Alawini, Susan B. Davidson, and Gianmaria Silvello. "Data Citation: Giving Credit Where Credit Is Due." To appear in the Proceedings of SIGMOD 2018.
  • Abdussalam Alawini, Susan B. Davidson, Gianmaria Silvello, Val Tannen, and Yinjun Wu. "Data Citation: A New Provenance Challenge." Bulletin of the Technical Committee on Data Engineering, Vol. 41, No. 1 (March 2018).
  • Abdussalam Alawini, L. Chen, Susan B. Davidson, N. Portilho Da Silva, and Gianmaria Silvello. "Automating Data Citation: The eagle-i Experience." 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), Toronto, ON, 2017, pp. 1-10.
  • Abdussalam Alawini, Susan B. Davidson, Wei Hu, and Yinjun Wu. "Automating Data Citation in CiteDB." Proceedings of the VLDB Endowment 10, 12 (August 2017), 1881-1884.

Approximating and Reasoning about Provenance

In many Big Data applications today, such as Next-Generation Sequencing, data processing pipelines are highly complex, span multiple institutions, and include many human and computational steps. The pipelines evolve over time and vary across institutions, so it is difficult to track and reason about them to ensure consistency and correctness of results. Provenance-enabled scientific workflow systems promise to help here, yet such systems are often avoided due to perceptions of inflexibility, a lack of good provenance analytics tools, and an emphasis on supporting the data consumer rather than the producer.



Identifying Relationships in Collections of Scientific Datasets

Scientific datasets associated with a research project proliferate over time as a result of activities such as sharing datasets among collaborators, extending existing datasets with new measurements, and extracting subsets of data for analysis. As such datasets accumulate, it becomes increasingly difficult for a scientist to keep track of their derivation history, which complicates data sharing, provenance tracking, and scientific reproducibility. Understanding what relationships exist between datasets can help scientists recall their original derivation connections. For instance, if dataset A is contained in dataset B, the connection could be that A was extended to create B.
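The containment example can be made concrete with a minimal sketch (this is an illustration of the relationship, not the project's algorithms, and the datasets are invented):

```python
def rows_contained(a, b):
    """True if every row of dataset A also appears in dataset B."""
    b_rows = {tuple(row) for row in b}        # hashable rows for fast lookup
    return all(tuple(row) in b_rows for row in a)

A = [("s1", 4.2), ("s2", 3.9)]
B = [("s1", 4.2), ("s2", 3.9), ("s3", 5.0)]  # B extends A with one new row

print(rows_contained(A, B))  # True: plausibly, A was extended to create B
print(rows_contained(B, A))  # False: B is not contained in A
```

Observing that containment holds in one direction but not the other is the kind of evidence from which a derivation history (here, "A was extended into B") can be conjectured.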

We introduce a set of relevant relationships, propose a relationship-identification methodology for testing relationships between pairs of datasets, develop a set of algorithms for efficient discovery of these relationships, and organize these algorithms into a new system, ReConnect, that assists scientists in relationship discovery.

While ReConnect helps identify relationships between two datasets, it is infeasible for scientists to use it to test relationships between all possible pairs in a collection of datasets. We therefore introduce an end-to-end prototype system, ReDiscover, that identifies, from a collection of datasets, the pairs that are most likely related. Our preliminary evaluation shows that ReDiscover predicted the duplicate, row_containment, and template relationships with F1 scores of 80%, 57%, and 80%, respectively.
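The pair-ranking idea behind such a system can be sketched with a deliberately cheap similarity signal. This is not ReDiscover's actual model; it is a hypothetical illustration using Jaccard overlap of column names to surface the most promising pairs for closer testing:

```python
from itertools import combinations

# Invented collection of datasets, represented only by their column names.
datasets = {
    "survey_2015": ["site", "species", "count"],
    "survey_2016": ["site", "species", "count", "weather"],
    "budget":      ["item", "cost"],
}

def jaccard(a, b):
    """Jaccard similarity of two column-name sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Score every pair, highest similarity first.
scores = sorted(
    ((jaccard(datasets[x], datasets[y]), x, y)
     for x, y in combinations(datasets, 2)),
    reverse=True,
)
print(scores[0])  # the two survey datasets score highest (0.75)
```

Only the top-scoring pairs would then be handed to a detailed relationship tester, avoiding the quadratic cost of testing every pair in full.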


  • Abdussalam Alawini, (Work done as part of my Ph.D. research at Portland State University)
  • David Maier, Maseeh Professor of Emerging Technologies, Portland State University
  • Kristin Tufte, Research Assistant Professor, Portland State University
  • Bill Howe, Associate Professor, Information School; Adjunct Associate Professor, Computer Science & Engineering; Associate Director and Senior Data Science Fellow, UW eScience Institute; Program Director and Faculty Chair, UW Data Science Masters Degree, University of Washington
  • Rashmi Nandikur, Software Development Engineer, Amazon