Research

Data Citation

Citation is an essential part of scientific publishing and, more generally, of scholarship. It is used to gauge the trust placed in published information and, for better or for worse, is an important factor in judging academic reputation. Now that so much scientific publishing involves data and takes place through a database rather than conventional journals, how is some part of a database to be cited? More generally, how should data stored in a repository that has complex internal structure and that is subject to change be cited?

People

Publications

  • Abdussalam Alawini, Leshang Chen, Susan Davidson and Gianmaria Silvello. Automating data citation: the eagle-i experience. To appear in JCDL 2017.
  • Abdussalam Alawini, Susan Davidson, Yinjun Wu and Wei Hu. Automating Data Citation in CiteDB. To appear in PVLDB 2017.

Approximating and Reasoning about Provenance

In many Big Data applications today, such as Next-Generation Sequencing, data processing pipelines are highly complex, span multiple institutions, and include many human and computational steps. The pipelines evolve over time and vary across institutions, so it is difficult to track and reason about the processing pipelines to ensure consistency and correctness of results. Provenance-enabled scientific workflow systems promise to aid here – yet such workflow systems are often avoided due to perceptions of inflexibility, lack of good provenance analytics tools, and emphasis on supporting the data consumer rather than producer.

People

ReDiscover: Predicting Relationships in Collections of Datasets

While ReConnect helped with identifying relationships between two datasets, it is infeasible for scientists to use it to test relationships between all possible pairs in a collection of datasets. We introduce an end-to-end prototype system, ReDiscover, that identifies, from a collection of datasets, the pairs that are most likely related. Our preliminary evaluation shows that ReDiscover predicted duplicate, row_containment, and template relationships with F1 scores of 80%, 57%, and 80%, respectively.
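
To give a sense of why exhaustive pairwise testing does not scale, the sketch below counts the unordered pairs in a collection of n datasets, which is n(n-1)/2. The function names and dataset names are hypothetical and are not part of ReDiscover.

```python
from itertools import combinations


def pair_count(n):
    """Number of unordered dataset pairs in a collection of n datasets."""
    return n * (n - 1) // 2


def candidate_pairs(names):
    """Enumerate every unordered pair of dataset names (hypothetical helper)."""
    return list(combinations(names, 2))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "datasets ->", pair_count(n), "pairs to test")
```

Because the number of pairs grows quadratically with the collection size, a step that first narrows the candidates to the pairs most likely to be related keeps the detailed relationship tests manageable.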

People

  • Abdussalam Alawini, Portland State University
  • David Maier, Maseeh Professor of Emerging Technologies, Portland State University
  • Kristin Tufte, Research Assistant Professor, Portland State University
  • Bill Howe, Associate Professor, Information School, Adjunct Associate Professor, Computer Science & Engineering, Associate Director and Senior Data Science Fellow, UW eScience Institute, Program Director and Faculty Chair, UW Data Science Masters Degree, University of Washington
  • Rashmi Nandikur, Software Developer Engineer, Amazon

Publications

ReConnect: Helping Scientists Reconnect Their Datasets

Scientific datasets associated with a research project proliferate over time as a result of activities such as sharing datasets among collaborators, extending existing datasets with new measurements, and extracting subsets of data for analysis. As such datasets begin to accumulate, it becomes increasingly difficult for a scientist to keep track of their derivation history, which complicates data sharing, provenance tracking, and scientific reproducibility. Understanding what relationships exist between datasets can help scientists recall their original derivation connection. For instance, if dataset A is contained in dataset B, then the connection could be that A was extended to create B.

We introduce a set of relevant relationships, propose the relationship-identification methodology for testing relationships between pairs of datasets, develop a set of algorithms for efficient discovery of these relationships, and organize these algorithms into a new system called ReConnect to assist scientists in relationship discovery.
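
As an illustration of what a pairwise relationship test might look like, the following is a minimal sketch of a row-containment check and a duplicate check. It assumes datasets are small in-memory lists of rows with matching column order. The function names are hypothetical, and this is not the ReConnect implementation, whose algorithms are designed for efficient discovery.

```python
from collections import Counter


def row_contained(a_rows, b_rows):
    """True if every row of A occurs in B at least as often (multiset containment)."""
    a_counts = Counter(map(tuple, a_rows))
    b_counts = Counter(map(tuple, b_rows))
    return all(b_counts[row] >= n for row, n in a_counts.items())


def is_duplicate(a_rows, b_rows):
    """True if A and B hold exactly the same rows, ignoring row order."""
    return Counter(map(tuple, a_rows)) == Counter(map(tuple, b_rows))


if __name__ == "__main__":
    a = [(1, "x"), (2, "y")]
    b = [(1, "x"), (2, "y"), (3, "z")]  # A extended with one new measurement
    print(row_contained(a, b), row_contained(b, a), is_duplicate(a, b))
```

In this example A is contained in B but not the reverse, which is consistent with the scenario above where A was extended to create B.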

People

  • Abdussalam Alawini, Portland State University
  • David Maier, Maseeh Professor of Emerging Technologies, Portland State University
  • Kristin Tufte, Research Assistant Professor, Portland State University
  • Bill Howe, Associate Director, eScience Institute and Affiliate Assistant Professor, Department of Computer Science & Engineering, University of Washington

Publications

  • Poster, from SSDBM 2013.
  • Abdussalam Alawini, David Maier, Kristin Tufte, and Bill Howe. 2014. Helping scientists reconnect their datasets. In Proceedings of the 26th International Conference on Scientific and Statistical Database Management (SSDBM ’14), ACM, New York, NY, USA, Article 29, 12 pages.