DRIVEN article accepted at Elsevier’s Future Generation Computer Systems.

Our paper DRIVEN: a framework for efficient Data Retrieval and clusterIng in VEhicular Networks has been accepted for publication at Elsevier’s Future Generation Computer Systems journal. This work is an extension of the conference publication:

Havers, Bastian, et al. “DRIVEN: a framework for efficient Data Retrieval and clusterIng in VEhicular Networks.” 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 2019.

In this extended version, we build on and extend our framework, which leverages streaming-based Piecewise Linear Approximation and clustering for edge-to-core analysis. We show that real-world raw data such as GPS, LiDAR and other vehicular signals can be compressed (within each vehicle, in a streaming fashion) to 5-35 % of its original size, significantly reducing communication costs and overheads, and clustered (at the cloud, in a streaming fashion) with an accuracy loss below 10%.
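To give a feeling for what error-bounded, streaming Piecewise Linear Approximation means, here is a minimal sketch in Python. It is a simple greedy variant written for illustration and is not necessarily the PLA method used in DRIVEN; the function names `pla_compress` and `pla_value`, the segment layout and the assumption of strictly increasing timestamps are all hypothetical choices.

```python
from typing import Iterable, List, Tuple

# A segment is (t_start, y_start, slope, t_end); this layout is an assumption.
Segment = Tuple[float, float, float, float]


def pla_compress(points: Iterable[Tuple[float, float]], eps: float) -> List[Segment]:
    """Greedy one-pass PLA: every segment starts at an observed point and
    keeps the interval of slopes [lo, hi] that stays within eps of all
    points seen so far. Timestamps must be strictly increasing."""
    segments: List[Segment] = []
    it = iter(points)
    try:
        t0, y0 = next(it)
    except StopIteration:
        return segments
    lo, hi = float("-inf"), float("inf")
    last_t = t0
    for t, y in it:
        dt = t - t0
        new_lo = max(lo, (y - eps - y0) / dt)
        new_hi = min(hi, (y + eps - y0) / dt)
        if new_lo > new_hi:
            # No single line fits any more: close the segment at the
            # previous point and start a new one at the current point.
            segments.append((t0, y0, (lo + hi) / 2, last_t))
            t0, y0 = t, y
            lo, hi = float("-inf"), float("inf")
        else:
            lo, hi = new_lo, new_hi
        last_t = t
    # Flush the last segment (slope 0 if it holds a single point).
    slope = 0.0 if lo == float("-inf") else (lo + hi) / 2
    segments.append((t0, y0, slope, last_t))
    return segments


def pla_value(segments: List[Segment], t: float) -> float:
    """Reconstruct the approximated value at timestamp t."""
    for t_start, y_start, slope, t_end in segments:
        if t_start <= t <= t_end:
            return y_start + slope * (t - t_start)
    raise ValueError(f"timestamp {t} is not covered by any segment")
```

Each segment replaces all the points it covers with four numbers, so the achieved compression depends on how well the signal can be followed by straight lines within the chosen error bound `eps`.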


The abstract follows:

The growing interest in data analysis applications for Cyber-Physical Systems stems from the large amounts of data such large distributed systems sense in a continuous fashion. A key research question in this context is how to jointly address the efficiency and effectiveness challenges of such data analysis applications.

DRIVEN proposes a way to jointly address these challenges for a data gathering and distance-based clustering tool in the context of vehicular networks. To cope with the limited communication bandwidth (compared to the sensed data volume) of vehicular networks and data transmission’s monetary costs, DRIVEN avoids gathering raw data from vehicles, but rather relies on a streaming-based and error-bounded approximation, through Piecewise Linear Approximation (PLA), to compress the volumes of gathered data. Moreover, a streaming-based approach is also used to cluster the collected data (once the latter is reconstructed from its PLA-approximated form). DRIVEN’s clustering algorithm leverages the inherent ordering of the spatial and temporal data being collected to perform clustering in an online fashion, while data is being retrieved. As we show, based on our prototype implementation using Apache Flink and thorough evaluation with real-world data such as GPS, LiDAR and other vehicular signals, the accuracy loss for the clustering performed on the gathered approximated data can be small (below 10 %), even when the raw data is compressed to 5-35 % of its original size, and the transferring of historical data itself can be completed in up to one-tenth of the duration observed when gathering raw data.


Posted in Uncategorized

Slides for the paper “Automatic Translation of Spatio-Temporal Logics to Streaming-Based Monitoring Applications for IoT-Equipped Autonomous Agents” available in SlideShare

The slides I used to present the paper “Automatic Translation of Spatio-Temporal Logics to Streaming-Based Monitoring Applications for IoT-Equipped Autonomous Agents” at the 2019 ACM/IFIP Middleware conference – 6th International Workshop on Middleware and Applications for the Internet of Things (M4IoT) are now available on SlideShare.

The abstract of the paper follows:

Environments in which IoT-equipped autonomous agents and humans tightly interact require safety rules that monitor the agents’ behaviors. In this context, expressive and human-comprehensible rules based on Spatio-Temporal Logics (STLs) are desirable because they are informative and easy to maintain. Unfortunately, STLs usually build on ad-hoc platforms implementing the logic semantics.
We tackle this limitation with a mechanism to transparently compile STL rules to monitoring applications composed of standard data streaming operators, thus opening up the use of high-throughput and low-latency Stream Processing Engines for monitoring rule compliance in realistic, data-rich IoT scenarios. Our contribution can favor a broader and faster adoption of STLs for IoT-equipped agent monitoring by separating the concerns of designing a rule from those of implementing its semantics. Together with our formal description of how to translate STLs to the streaming domain, we evaluate our prototype implementation based on Apache Flink, studying the effects of parameters such as time and space resolution on the monitoring performance.
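As a rough illustration of what a rule expressed with standard streaming operators can look like, the Python sketch below monitors a made-up safety rule: an agent must not stay closer than `d_min` to a human for longer than `max_duration` seconds. The rule, the `Position` tuple and both operator functions are assumptions made for this example; it is not the translation procedure described in the paper, and it uses plain Python generators rather than Apache Flink.

```python
import math
from typing import Dict, Iterable, Iterator, NamedTuple, Tuple


class Position(NamedTuple):
    ts: float      # timestamp in seconds
    agent: str     # agent identifier
    ax: float      # agent coordinates
    ay: float
    hx: float      # coordinates of the closest human
    hy: float


def distance_map(stream: Iterable[Position]) -> Iterator[Tuple[float, str, float]]:
    """Stateless Map operator: position tuple -> (ts, agent, distance)."""
    for p in stream:
        yield (p.ts, p.agent, math.hypot(p.ax - p.hx, p.ay - p.hy))


def too_close_for(
    stream: Iterable[Tuple[float, str, float]], d_min: float, max_duration: float
) -> Iterator[Tuple[float, str, float]]:
    """Stateful per-agent operator: emits (ts, agent, duration) whenever an
    agent has been closer than d_min for more than max_duration seconds."""
    since: Dict[str, float] = {}  # agent -> time at which it became too close
    for ts, agent, dist in stream:
        if dist >= d_min:
            since.pop(agent, None)        # rule satisfied again, reset the state
        else:
            start = since.setdefault(agent, ts)
            if ts - start > max_duration:
                yield (ts, agent, ts - start)


def monitor(stream: Iterable[Position], d_min: float, max_duration: float):
    """The monitoring application is simply the composition of the operators."""
    return too_close_for(distance_map(stream), d_min, max_duration)
```

The point of the sketch is the separation of concerns: the rule is stated through its parameters, while its semantics live in generic operators that a Stream Processing Engine could run in parallel.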

Posted in Uncategorized

2 papers and 1 poster accepted at the ACM International Conference on Distributed Event-Based Systems (DEBS) 2019!

We got two papers and one poster accepted at the ACM International Conference on Distributed Event-Based Systems (DEBS)!

Our two papers are:
STRETCH: Scalable and Elastic Deterministic Streaming Analysis with Virtual Shared-Nothing Parallelism (Hannaneh Najdataei, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas, Vincenzo Gulisano)
Haren: A Framework for Ad-Hoc Thread Scheduling Policies for Data Streaming Applications (Dimitris Palyvos-Giannas, Vincenzo Gulisano, Marina Papatriantafilou)

while the poster is:
Mimir – Streaming Operators Classification with Artificial Neural Networks (Victor Gustafsson, Hampus Nilsson, Karl Bäckström, Marina Papatriantafilou, Vincenzo Gulisano)

The first paper presents a generic framework for parallel and elastic streaming analysis that supports what we introduced as virtual shared-nothing parallelism. In a nutshell, virtual shared-nothing parallelism lets programmers write parallel stateful analysis using the shared-nothing parallelism model (which is convenient because, among other reasons, it does not require them to worry about concurrent accesses to the local data managed by each parallel thread). Under the hood, its “virtual” nature is due to the fact that the overall state managed by the threads is actually shared. As a result, it allows for ultra-fast elastic reconfigurations (we can move from 30 to 60 threads, for instance, in approximately 10 milliseconds!) and does not require any programming of state transfer protocols!
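The idea can be illustrated with a deliberately simplified Python sketch. It is not STRETCH's implementation: real threads and synchronization are left out, and the class `VirtualSharedNothingCounter` and its methods are hypothetical names. What it shows is that each worker only touches the keys it owns (the shared-nothing view), while the state itself lives in one shared structure, so changing the number of workers only changes the key-to-worker mapping.

```python
import zlib
from typing import Dict


class VirtualSharedNothingCounter:
    """Counts tuples per key. Workers see a shared-nothing model, but the
    state is a single shared dictionary that is never copied or moved."""

    def __init__(self, parallelism: int) -> None:
        self.parallelism = parallelism
        self.state: Dict[str, int] = {}

    def owner(self, key: str) -> int:
        # Deterministic hash, so the mapping is stable across runs.
        return zlib.crc32(key.encode("utf-8")) % self.parallelism

    def process(self, worker_id: int, key: str) -> bool:
        """Called by every worker for every tuple; only the owner updates
        the state. Returns True if this worker handled the tuple."""
        if self.owner(key) != worker_id:
            return False
        self.state[key] = self.state.get(key, 0) + 1
        return True

    def reconfigure(self, parallelism: int) -> None:
        """Elastic reconfiguration: only the mapping changes; no state
        transfer is needed because the state is already shared."""
        self.parallelism = parallelism


def run(counter: VirtualSharedNothingCounter, keys) -> None:
    """Simulates the workers sequentially (an assumption of this sketch)."""
    for key in keys:
        for worker_id in range(counter.parallelism):
            counter.process(worker_id, key)
```

For example, running `run(counter, ["a", "b", "a"])` with 2 workers, calling `counter.reconfigure(4)` and then processing more tuples keeps all the counts without copying anything between workers.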

The second paper also introduces a novel framework, which in this case allows for easy “plug and play”-like use of custom thread-scheduling policies for streaming applications. More concretely, our framework (named Haren) provides a middleware-like abstraction that decouples thread-scheduling tasks from the other components of a Stream Processing Engine (SPE) and allows users to define (1) how to map operators to threads and (2) how to sort the operators assigned to the same thread based on a user-defined priority. As we show in our paper, Haren can be used to define rich and complex policies in which distinct queries deployed to the same SPE instance have different priority levels, and queries of different priority levels are also scheduled with different performance goals (e.g., minimizing maximum latency vs. average latency). We implemented Haren on top of Liebre, the SPE developed at my research group (you can find the updated documentation here and the code here).
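The two user-defined pieces, (1) the operator-to-thread mapping and (2) the per-thread ordering by priority, can be sketched in a few lines of Python. This is only an illustration of the concept and not Haren's actual interface; `Operator`, `assign_round_robin`, `query_then_queue` and `schedule` are hypothetical names, and the priority levels are made up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Operator:
    name: str
    query: str        # query the operator belongs to
    queue_size: int   # number of tuples waiting in its input queue


# Hypothetical priority level of each query (higher runs first).
QUERY_LEVEL = {"safety": 1, "analytics": 0}


def assign_round_robin(operators: List[Operator], n_threads: int) -> Dict[int, List[Operator]]:
    """User-defined function (1): map operators to threads."""
    assignment: Dict[int, List[Operator]] = {t: [] for t in range(n_threads)}
    for i, op in enumerate(operators):
        assignment[i % n_threads].append(op)
    return assignment


def query_then_queue(op: Operator) -> Tuple[int, int]:
    """User-defined function (2): priority of an operator. Query level
    first, then the longest input queue."""
    return (QUERY_LEVEL[op.query], op.queue_size)


def schedule(
    operators: List[Operator],
    n_threads: int,
    assign: Callable[[List[Operator], int], Dict[int, List[Operator]]],
    priority: Callable[[Operator], Tuple[int, int]],
) -> Dict[int, List[str]]:
    """Returns, for each thread, the operator names in execution order."""
    assignment = assign(operators, n_threads)
    return {
        t: [op.name for op in sorted(ops, key=priority, reverse=True)]
        for t, ops in assignment.items()
    }
```

Swapping in a different `assign` or `priority` function changes the policy without touching the scheduling loop, which is the kind of decoupling the framework aims for.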

Finally, our accepted poster is the result of Victor’s and Hampus’ master’s thesis (which I supervised together with Karl and Marina). In this work, we study how Neural Networks (NNs) can be used to classify the operators of streaming applications based on features such as input rates, output rates and selectivity. The rationale is that, by being able to classify operators, a third-party observer does not need to depend on the specific SPE the user chooses in order to find out which operators are actually deployed in their application and then trigger or suggest performance-improving actions. As we show, NNs can in this case achieve a classification accuracy above 95%.
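As a small illustration of the kind of features mentioned above, the following Python sketch derives input rate, output rate and selectivity from tuple counts observed over a time window. The function name, the feature order and the use of raw counts are assumptions made for this example; the actual feature set and network used in Mimir are not reproduced here.

```python
from typing import List


def operator_features(tuples_in: int, tuples_out: int, window_seconds: float) -> List[float]:
    """Builds a feature vector [input rate, output rate, selectivity] for
    one operator, observed from the outside over one time window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    input_rate = tuples_in / window_seconds      # tuples per second
    output_rate = tuples_out / window_seconds    # tuples per second
    # Selectivity: output tuples produced per input tuple consumed.
    selectivity = tuples_out / tuples_in if tuples_in else 0.0
    return [input_rate, output_rate, selectivity]
```

For instance, an operator that consumes 1000 tuples and emits 250 in 10 seconds has a selectivity of 0.25, which is more typical of a filter or an aggregate than of a one-to-one map.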

You can find the abstracts of these three works in the following.

STRETCH: Scalable and Elastic Deterministic Streaming Analysis with Virtual Shared-Nothing Parallelism
Despite the established scientific knowledge on efficient parallel and elastic data stream processing, it is challenging to combine generality and high level of abstraction (targeting ease of use) with fine-grained processing aspects (targeting efficiency) in stream processing frameworks. Towards this goal, we propose STRETCH, a framework that aims at guaranteeing (i) high efficiency in throughput and latency of stateful analysis and (ii) fast elastic reconfigurations (without requiring state transfer) for intra-node streaming applications. To achieve these, we introduce virtual shared-nothing Parallelization and propose a scheme to implement it in STRETCH, enabling users to leverage parallelization techniques while also taking advantage of shared-memory synchronization, which has been proven to boost the scaling-up of streaming applications while supporting determinism. We provide a fully-implemented prototype and, together with a thorough evaluation, correctness proofs for its underlying claims supporting determinism and a model (also validated empirically) of virtual shared-nothing and pure shared-nothing scalability behavior. As we show, STRETCH can match the throughput and latency figures of the front of state-of-the-art solutions, while also achieving fast elastic reconfigurations (taking only a few milliseconds). 

Haren: A Framework for Ad-Hoc Thread Scheduling Policies for Data Streaming Applications 
In modern Stream Processing Engines (SPEs), numerous diverse applications, which can differ in aspects such as cost, criticality or latency sensitivity, can co-exist in the same computing node. When these differences need to be considered to control the performance of each application, custom scheduling of operators to threads is of key importance (e.g., when a smart vehicle needs to ensure that safety-critical applications always have access to computational power, while other applications are given lower, variable priorities).
Many solutions have been proposed regarding schedulers that allocate threads to operators to optimize specific metrics (e.g., latency) but there is still a lack of a tool that allows arbitrarily complex scheduling strategies to be seamlessly plugged on top of an SPE. We propose Haren to fill this gap. More specifically, we (1) formalize the thread scheduling problem in stream processing in a general way, allowing the definition of ad-hoc scheduling policies, (2) identify the bottlenecks and the opportunities of scheduling in stream processing, (3) distill a compact interface to connect Haren with SPEs, enabling rapid testing of various scheduling policies, (4) illustrate the usability of the framework by integrating it into an actual SPE and (5) provide a thorough evaluation. As we show, Haren makes it possible to adapt the use of computational resources over time to meet the goals of a variety of scheduling policies.

Mimir – Streaming Operators Classification with Artificial Neural Networks
Streaming applications are used for analysing large volumes of continuous data. Achieving efficiency and effectiveness in data streaming implies challenges that are all the more important when different parties (i) define applications’ semantics, (ii) choose the Stream Processing Engine (SPE) to use, and (iii) provide the processing infrastructure (e.g., cloud or fog), and when one party’s decisions (e.g., how to deploy applications or when to trigger adaptive reconfigurations) depend on information held by a distinct one (and possibly hard to retrieve). In this context, machine learning can bridge the involved parties (e.g., SPEs and cloud providers) by offering tools that learn from the behavior of streaming applications and help take decisions.
Such a tool, the focus of our ongoing work, can be used to learn which operators are run by a streaming application running in a certain SPE, without relying on the SPE itself to provide such information. More concretely, it can classify the type of operator at a desired level of granularity (from a coarse-grained characterization into stateless/stateful, to a fine-grained operator classification) based on general application-related metrics. As an example application, this tool could help a Cloud provider decide which infrastructure to assign to a certain streaming application (run by a certain SPE), based on the type (and thus cost) of its operators.


Posted in Data Streaming, Research