Paper accepted at the 47th International Conference on Very Large Data Bases (VLDB)!

Our paper titled “Ananke: A Streaming Framework for Live Forward Provenance” (Dimitris Palyvos-Giannas, Bastian Havers, Marina Papatriantafilou, Vincenzo Gulisano) has been accepted at the 47th International Conference on Very Large Data Bases (VLDB)!

This work, conducted within the scope of the VR project Haren and the Vinnova project AutoSPADA (in collaboration with Volvo), introduces the first streaming framework providing fine-grained forward provenance. As we explain in the paper, such a tool is valuable for distributed and parallel analysis in edge-to-cloud infrastructures, since it eases the retrieval of the source data connected to analysis outcomes while also discriminating whether each piece of source data could still contribute to future analysis outcomes.
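To make the forward-provenance idea concrete, here is a minimal sketch in plain Python (not Ananke’s actual API; the tuple names, window length, and `is_expired` helper are all hypothetical) of a bipartite graph that links source tuples to query results and flags a source tuple once it can no longer contribute to future results:

```python
# Hypothetical sketch of live forward provenance: each source tuple keeps
# edges to the results derived from it, and is marked "expired" once the
# event-time watermark passes every window that could still consume it.

from dataclasses import dataclass, field

WINDOW_SIZE = 10  # hypothetical window length in event-time units

@dataclass
class SourceTuple:
    id: str
    timestamp: int
    results: list = field(default_factory=list)  # edges to derived results

def is_expired(src: SourceTuple, watermark: int) -> bool:
    # A source tuple can no longer contribute once the watermark has
    # advanced past every window covering its timestamp.
    return watermark >= src.timestamp + WINDOW_SIZE

s1 = SourceTuple("gps-1", timestamp=100)
s2 = SourceTuple("gps-2", timestamp=108)
s1.results.append("alert-7")  # backward-provenance edge: alert-7 <- gps-1

watermark = 112
print(is_expired(s1, watermark))  # True: no future window includes gps-1
print(is_expired(s2, watermark))  # False: gps-2 may still contribute
```

An analyst inspecting `alert-7` could thus see not only its source (`gps-1`) but also that this source is fully processed, which is the prioritization signal the paper describes.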

Ananke is available on GitHub and is implemented on top of the Apache Flink stream processing engine (a framework used by Alibaba and Amazon Kinesis Data Analytics, among others). All the experiments presented in the paper can be reproduced with the scripts made available in our repository.

The abstract follows:

Data streaming enables online monitoring of large and continuous event streams in Cyber-Physical Systems (CPSs). In such scenarios, fine-grained backward provenance tools can connect streaming query results to the source data producing them, allowing analysts to study the dependency/causality of CPS events. While CPS monitoring commonly produces many events, backward provenance does not help prioritize event inspection since it does not specify if an event’s provenance could still contribute to future results. To cover this gap, we introduce Ananke, a framework to extend any fine-grained backward provenance tool and deliver a live bi-partite graph of fine-grained forward provenance. With Ananke, analysts can prioritize the analysis of provenance data based on whether such data is still potentially being processed by the monitoring queries. We prove our solution is correct, discuss multiple implementations, including one leveraging streaming APIs for parallel analysis, and show Ananke results in small overheads, close to those of existing tools for fine-grained backward provenance.

Posted in Data Streaming, Research

YouTube tutorial: The Role of Event-Time Analysis Order in Data Streaming

Our tutorial, titled “The Role of Event-Time Analysis Order in Data Streaming”, will be presented next week at the 14th ACM International Conference on Distributed and Event-Based Systems (DEBS). We have recorded the tutorial, and you can find the videos at the following links:

Part 1:

Part 2:

You can find the slides, as well as the code examples, here. The slides are also available on SlideShare (here).


The data streaming paradigm was introduced around the year 2000 to overcome the limitations of the traditional store-then-process paradigm of relational databases (DBs). In contrast to DBs’ “first-the-data-then-the-query” approach, data streaming applications build on the “first-the-query-then-the-data” alternative. More concretely, data streaming applications do not rely on storage to first persist data and later query it; instead, they build on continuous single-pass analysis, in which incoming streams of data are processed on the fly and result in continuous streams of outputs.

In contrast with traditional batch processing, data streaming applications require the user to reason about an additional dimension of the data: event time. Numerous models have been proposed in the literature to reason about event time, each with different guarantees and trade-offs. Since it is not always clear which of these models is appropriate for a particular application, this tutorial studies the relevant concepts and compares the available options. The study is highly relevant for people working with data streaming applications, researchers and industrial practitioners alike.
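A toy example (plain Python, deliberately independent of any engine’s API; the window length and stream contents are made up) illustrates why event-time order matters: tuples may arrive out of order, yet tumbling windows must be assigned by each tuple’s event time, not by its arrival position:

```python
# Tuples arrive out of event-time order, but tumbling windows are
# assigned by event time, so late-arriving tuples still land in the
# window their timestamp belongs to.

from collections import defaultdict

WINDOW = 10  # tumbling window length in event-time units

# (event_time, value) pairs in *arrival* order; note 8 arrives after 12
stream = [(3, "a"), (12, "b"), (8, "c"), (15, "d")]

windows = defaultdict(list)
for ts, val in stream:
    windows[ts // WINDOW * WINDOW].append(val)  # assign by event time

print(dict(windows))  # {0: ['a', 'c'], 10: ['b', 'd']}
```

A model that instead grouped tuples by arrival order would put “c” in the second window, which is exactly the kind of discrepancy between event-time models that the tutorial examines.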

Posted in Data Streaming, Presentation, Research, Teaching

DRIVEN article accepted at Elsevier’s Future Generation Computer Systems

Our paper “DRIVEN: a framework for efficient Data Retrieval and clusterIng in VEhicular Networks” has been accepted for publication in Elsevier’s Future Generation Computer Systems journal. This work is an extension of the conference publication:

Havers, Bastian, et al. “DRIVEN: a framework for efficient Data Retrieval and clusterIng in VEhicular Networks.” 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 2019.

In this extended version, we build on and extend our framework, which leverages streaming-based Piecewise Linear Approximation (PLA) and clustering for edge-to-core analysis. We show that real-world raw data, such as GPS, LiDAR, and other vehicular signals, can be compressed (within each vehicle, in a streaming fashion) to 5-35% of its original size, significantly reducing communication costs and overheads, and clustered (at the cloud, in a streaming fashion) with an accuracy loss below 10%.
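The compression idea can be sketched with a simplified, greedy error-bounded PLA (an illustrative Python sketch under our own assumptions, not DRIVEN’s actual algorithm): a segment is represented by its two endpoints and grows as long as every intermediate sample stays within an error bound `eps` of the interpolating line, so only segment endpoints need to be transmitted:

```python
# Greedy error-bounded piecewise linear approximation: extend the
# current segment until some intermediate sample deviates from the
# endpoint-interpolating line by more than eps, then cut a new segment.

def pla_segments(values, eps):
    """Return (start_idx, end_idx) index pairs of segments covering `values`."""
    segments, start = [], 0
    for end in range(2, len(values)):
        x0, x1 = start, end
        y0, y1 = values[x0], values[x1]
        # max deviation of intermediate samples from the line (x0,y0)-(x1,y1)
        err = max(
            abs(values[i] - (y0 + (y1 - y0) * (i - x0) / (x1 - x0)))
            for i in range(x0 + 1, x1)
        )
        if err > eps:
            segments.append((start, end - 1))  # close the current segment
            start = end - 1                    # next segment shares the endpoint
    segments.append((start, len(values) - 1))
    return segments

signal = [0.0, 1.0, 2.0, 3.1, 4.0, 2.0, 0.1, 0.0]
print(pla_segments(signal, eps=0.2))  # [(0, 4), (4, 6), (6, 7)]
```

Here 8 samples collapse to 3 segments (4 shared endpoints), and the receiver can reconstruct every sample to within `eps` by linear interpolation, which is the error-bounded trade-off the paragraph above quantifies.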


The abstract follows:

The growing interest in data analysis applications for Cyber-Physical Systems stems from the large amounts of data such large distributed systems sense in a continuous fashion. A key research question in this context is how to jointly address the efficiency and effectiveness challenges of such data analysis applications.

DRIVEN proposes a way to jointly address these challenges for a data gathering and distance-based clustering tool in the context of vehicular networks. To cope with the limited communication bandwidth (compared to the sensed data volume) of vehicular networks and data transmission’s monetary costs, DRIVEN avoids gathering raw data from vehicles, but rather relies on a streaming-based and error-bounded approximation, through Piecewise Linear Approximation (PLA), to compress the volumes of gathered data. Moreover, a streaming-based approach is also used to cluster the collected data (once the latter is reconstructed from its PLA-approximated form). DRIVEN’s clustering algorithm leverages the inherent ordering of the spatial and temporal data being collected to perform clustering in an online fashion, while data is being retrieved. As we show, based on our prototype implementation using Apache Flink and a thorough evaluation with real-world data such as GPS, LiDAR and other vehicular signals, the accuracy loss for the clustering performed on the gathered approximated data can be small (below 10%), even when the raw data is compressed to 5-35% of its original size, and the transferring of historical data itself can be completed in up to one-tenth of the duration observed when gathering raw data.


Posted in Uncategorized