Accepted paper, ACM Transactions on Parallel Computing: Efficient data streaming multiway aggregation through concurrent algorithmic designs and new abstract data types

I am glad to share that our paper “Efficient data streaming multiway aggregation through concurrent algorithmic designs and new abstract data types” has been accepted for publication in the ACM Transactions on Parallel Computing (TOPC) journal!

Abstract:
Data streaming relies on continuous queries to process unbounded streams of data in a real-time fashion. It is commonly demanding in computation capacity, given that the relevant applications involve very large volumes of data. Data structures act as articulation points and maintain the state of data streaming operators, potentially supporting high parallelism and balancing the work among them. Prompted by this fact, in this work we study and analyze the parallelization needs of these articulation points, focusing on the problem of streaming multiway aggregation, where large data volumes are received from multiple input streams. The analysis of the parallelization needs, as well as of the use and limitations of existing aggregate designs and their data structures, leads us to identify needs for appropriate shared objects that can achieve low-latency and high-throughput multiway aggregation. We present the requirements of such objects as abstract data types and we provide efficient lock-free linearizable algorithmic implementations of them, along with new multiway aggregate algorithmic designs that leverage them, supporting both deterministic order-sensitive and order-insensitive aggregate functions. Furthermore, we point out future directions that open through these contributions. The paper includes an extensive experimental study, based on a variety of continuous aggregation queries on two large datasets extracted from SoundCloud, a music social network, and from a Smart Grid network. In all the experiments, the proposed data structures and the enhanced aggregate operators improved processing performance significantly, up to one order of magnitude, in terms of both throughput and latency, over the commonly used techniques based on queues.
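To make the setting concrete, here is a minimal sequential sketch of deterministic, order-sensitive multiway aggregation: tuples from several timestamp-sorted input streams are merged into a single global timestamp order and summed per tumbling window. The function and parameter names are illustrative, and the sequential merge shown here is exactly the shared state that the paper's lock-free objects make concurrent.

```python
import heapq
from collections import defaultdict

def multiway_window_sum(streams, window_size):
    """Merge timestamped (ts, value) tuples from multiple sorted input
    streams in timestamp order and sum values per tumbling window.

    Illustrative sequential sketch; names are hypothetical, not the
    paper's API."""
    windows = defaultdict(int)
    # heapq.merge yields tuples in global timestamp order, which is what
    # deterministic, order-sensitive aggregation requires.
    for ts, value in heapq.merge(*streams):
        windows[ts // window_size] += value
    return dict(windows)
```

For example, merging the streams `[(1, 10), (5, 20)]` and `[(2, 1), (6, 2)]` with `window_size=5` yields `{0: 11, 1: 22}`.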


The Liebre Stream Processing Engine

Since I started my Ph.D. back in 2008, I have spent a considerable amount of time digging into the internals of stream processing engines. Pioneering engines like Aurora and Borealis were not really easy to work with. During my Ph.D., I had to “fight” a lot with Borealis in order to design and develop StreamCloud, one of the first parallel and distributed stream processing engines.

Once the Ph.D. was over, I continued my research in data streaming and learned how to use newer systems such as Storm, Flink, and Spark. Things got much easier than they were before, and a simple “hello world” example is now a matter of minutes. Still, I always had (and still have) mixed feelings about these systems. On the positive side, they are very powerful and supported by large communities. On the less positive side, they force design and implementation decisions on you that you might find sub-optimal (remember, my perspective is that of a researcher in data streaming).

Of course, one could always make changes to the internal architecture of such systems, but the engineering effort is considerable and might need to be repeated each time a new streaming platform becomes popular or an existing one gets updated.

Because of this, I decided to put together a minimal stream processing engine (Liebre) and make it available on GitHub. You can find the documentation here and the actual code here.

The idea is to keep adding the operators I develop for research (you can already find standard operators in it) and to use it for research in data streaming, especially when working with IoT devices (the footprint of streaming engines like Storm and Flink is considerable, since they are designed for large data centers rather than cheap single-board devices). I decided to make the code available and put together some documentation in case someone is interested in trying it or wishes to provide feedback.
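For flavor, here is a toy pipeline of chained operators in the spirit of such a minimal engine: each operator holds a reference to its downstream and pushes tuples along. The class and method names are hypothetical illustrations, not the actual Liebre API.

```python
# Toy operator chaining; names (Map, Filter, Sink) are hypothetical,
# not the actual Liebre API.
class Map:
    """Applies a function to each tuple and forwards the result."""
    def __init__(self, fn, downstream):
        self.fn, self.downstream = fn, downstream
    def process(self, t):
        self.downstream.process(self.fn(t))

class Filter:
    """Forwards only the tuples that satisfy a predicate."""
    def __init__(self, pred, downstream):
        self.pred, self.downstream = pred, downstream
    def process(self, t):
        if self.pred(t):
            self.downstream.process(t)

class Sink:
    """Collects the tuples that reach the end of the pipeline."""
    def __init__(self):
        self.out = []
    def process(self, t):
        self.out.append(t)

sink = Sink()
pipeline = Map(lambda x: x * 2, Filter(lambda x: x > 2, sink))
for t in [1, 2, 3]:
    pipeline.process(t)
# sink.out == [4, 6]
```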


Abstract and slides for the “Performance Modeling of Stream Joins” paper (DEBS 17)

I just uploaded the slides I used to present the paper “Performance Modeling of Stream Joins” at the DEBS 17 conference. You can find them here.

Abstract
Streaming analysis is widely used in a variety of environments, from cloud computing infrastructures to the network’s edge. In these contexts, accurate modeling of streaming operators’ performance enables fine-grained prediction of applications’ behavior without the need for costly monitoring. This is of utmost importance for computationally expensive operators like stream joins, whose throughput and latency are very sensitive to rate-varying data streams, especially when deterministic processing is required.
In this paper, we present a modeling framework for estimating the throughput and latency of stream join processing. The model is presented in an incremental, step-wise manner, starting from a centralized non-deterministic stream join and extending to a deterministic parallel stream join. The model describes how the dynamics of throughput and latency are influenced by the number of physical input streams, as well as by the amount of parallelism in the actual processing and the requirement for determinism. We present an experimental validation of the model against an actual implementation. The proposed model can provide insights that are catalytic for understanding the behavior of stream joins across different system deployments, with special emphasis on the influence of determinism and parallelization.
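As background on the operator such a model has to capture, here is a minimal sequential sketch of a symmetric sliding-window equi-join that processes two timestamp-sorted streams in global timestamp order (the deterministic case). Names and structure are illustrative, not the paper's implementation.

```python
import heapq
from collections import deque

def windowed_stream_join(left, right, window):
    """Symmetric sliding-window equi-join over two timestamp-sorted
    streams of (ts, key) tuples: a pair matches if the keys are equal
    and the timestamps are at most `window` apart. Illustrative sketch,
    not the paper's implementation."""
    state = {0: deque(), 1: deque()}   # per-stream window contents
    matches = []
    # Process tuples from both streams in global timestamp order, which
    # yields a deterministic result regardless of arrival interleaving.
    tagged = heapq.merge(((0, t) for t in left), ((1, t) for t in right),
                         key=lambda x: x[1][0])
    for side, (ts, key) in tagged:
        opposite = state[1 - side]
        while opposite and opposite[0][0] < ts - window:  # evict expired tuples
            opposite.popleft()
        matches += [(ts, o_ts, key) for o_ts, o_key in opposite if o_key == key]
        state[side].append((ts, key))
    return matches
```

For example, `windowed_stream_join([(1, 'a'), (4, 'b')], [(2, 'a'), (10, 'b')], 3)` returns `[(2, 1, 'a')]`: only the two 'a' tuples fall within the window of each other.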
