Joining data streams is a fundamental stateful operator in stream processing. It involves evaluating join pairs of tuples from two streams that meet specific user-defined criteria. This operator is typically time-consuming and often represents the major bottleneck in several real-world continuous queries. This paper focuses on a specific class of join operator, named online interval join, where we seek join pairs of tuples that occur within a certain time frame of each other. Our contribution is to propose different parallel patterns for implementing this join operator efficiently in the presence of watermarked data streams and skewed key distributions. The proposed patterns comply with the shared-nothing parallelization paradigm, a popular paradigm adopted by most of the existing Stream Processing Engines. Among the proposed patterns, we introduce one based on hybrid parallelism, which is particularly effective in handling various scenarios in terms of key distribution, number of keys, batching, and parallelism as demonstrated in our experimental analysis.
PPOIJ: Shared-Nothing Parallel Patterns for Efficient Online Interval Joins over Data Streams
Mencagli G.;Griebler D.
2025-01-01
Abstract
Joining data streams is a fundamental stateful operator in stream processing. It involves evaluating join pairs of tuples from two streams that meet specific user-defined criteria. This operator is typically time-consuming and often represents the major bottleneck in several real-world continuous queries. This paper focuses on a specific class of join operator, named online interval join, where we seek join pairs of tuples that occur within a certain time frame of each other. Our contribution is to propose different parallel patterns for implementing this join operator efficiently in the presence of watermarked data streams and skewed key distributions. The proposed patterns comply with the shared-nothing parallelization paradigm, a popular paradigm adopted by most of the existing Stream Processing Engines. Among the proposed patterns, we introduce one based on hybrid parallelism, which is particularly effective in handling various scenarios in terms of key distribution, number of keys, batching, and parallelism as demonstrated in our experimental analysis.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.


