Our paper, ProvCite: Provenance-based Data Citation, was accepted for publication at the 45th International Conference on Very Large Data Bases 2019. Here’s the abstract.
A computational challenge associated with data citation is how to automatically generate citations to arbitrary queries against a structured dataset. Previous work has explored this problem in the context of conjunctive queries and views using a Rewriting-Based Model (RBM). However, an increasing number of scientific queries are aggregate, e.g., showing statistical summaries of the underlying data, for which the RBM cannot be easily extended. In this paper, we show how a Provenance-Based Model (PBM) can be leveraged to 1) generate citations to conjunctive as well as aggregate queries and views; 2) associate citations with individual result tuples to enable arbitrary subsets of the result set to be cited (fine-grained citations); and 3) be optimized to return citations in acceptable time. Our implementation of PBM in ProvCite shows that it not only handles a larger class of queries and views than RBM, but can outperform it when restricted to conjunctive views.
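To give a rough feel for the idea of fine-grained, provenance-based citation of an aggregate query, here is a toy sketch in Python. It is not ProvCite's algorithm or implementation: the relation, the row identifiers, the contributor names, and the helpers `sum_by_group` and `cite` are all made up for illustration, and the "citation" of a result tuple is simply assumed to be the union of the citations attached to the base rows in its provenance.

```python
from collections import defaultdict

# Hypothetical base relation: (row_id, group, value).
rows = [
    ("r1", "family_x", 3),
    ("r2", "family_x", 5),
    ("r3", "family_y", 2),
]

# Hypothetical citation annotations: who should be credited for each base row.
row_citations = {
    "r1": {"Alice"},
    "r2": {"Alice", "Bob"},
    "r3": {"Carol"},
}


def sum_by_group(rows):
    """Toy aggregate query (SUM(value) GROUP BY group) that also records,
    for each result tuple, the set of base rows that contributed to it."""
    totals = defaultdict(int)
    provenance = defaultdict(set)
    for row_id, group, value in rows:
        totals[group] += value
        provenance[group].add(row_id)
    return {group: (totals[group], provenance[group]) for group in totals}


def cite(result, groups):
    """Assumed citation rule: credit everyone attached to any base row in the
    provenance of the selected result tuples."""
    credited = set()
    for group in groups:
        _, contributing_rows = result[group]
        for row_id in contributing_rows:
            credited |= row_citations[row_id]
    return sorted(credited)


result = sum_by_group(rows)
one_tuple = cite(result, ["family_x"])
both_tuples = cite(result, ["family_x", "family_y"])
```

Because each result tuple keeps its own provenance, a citation can be produced for a single aggregate value (`one_tuple`) or for any subset of the result (`both_tuples`) without re-running the query.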