A New Paper at ICDE 2019.

Our paper, "Extending Fine-Grained Provenance to ETL Tasks," was accepted for publication at the 35th International Conference on Data Engineering (ICDE 2019). Here is the abstract.

Data provenance tools facilitate reproducibility by capturing the steps used to produce analyses. However, there are trade-offs among workflow provenance systems, which allow arbitrary code and workflows but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance but require source code modifications and incur high overhead; and database-style provenance models, which track fine-grained provenance through relational-style operators and support optimization based on algebraic equivalences, but capture a limited subset of data science workflows. No existing solution is well suited for tracing errors introduced during many common ETL tasks. Techniques are needed for identifying the sources of errors, finding why different code versions produce different results, and identifying which parameter values affect output. To address these requirements, we propose PROVision, a provenance-driven troubleshooting tool that, like provenance APIs, supports a wide array of ETL and matching computations and tuple-based provenance, but builds upon database-style provenance techniques to capture equivalences, support optimizations, and enable lazy evaluation. In this paper we formalize our extensions, implement them in the PROVision system, and validate their effectiveness and scalability for common ETL and data science tasks.
