Stream and Batch Processing Frameworks
Why Do We Need Such Frameworks?
- To process more data in a shorter amount of time.
- To unify fault tolerance in distributed systems.
- To simplify task abstractions to meet changing business requirements.
- To handle both bounded datasets (batch processing) and unbounded datasets (stream processing).
Brief History of Batch and Stream Processing Development
- Hadoop and MapReduce. Google made batch processing in a distributed system as simple as `result = pairs.map(pair => morePairs).reduce(somePairs => lessPairs)` (a runnable sketch follows this list).
- Apache Storm and directed graph topologies. MapReduce does not represent iterative algorithms well, so Nathan Marz abstracted stream processing into a graph of spouts and bolts.
- Spark and in-memory computation. Reynold Xin pointed out that, in the 2014 Daytona GraySort benchmark, Spark processed the same data three times faster while using ten times fewer machines than Hadoop.
- Google Dataflow, based on MillWheel and FlumeJava. Google unified batch and stream processing behind a single windowed API.
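To make the map → reduce shape above concrete, here is a minimal word-count sketch using Spark's RDD API in Scala (chosen only for brevity; classic Hadoop MapReduce spells out `Mapper` and `Reducer` classes). The object name, input path, and `local[*]` master are placeholders, not part of the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapReduceWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("map-reduce-word-count").setMaster("local[*]"))
    sc.textFile("input.txt")            // placeholder input path
      .flatMap(_.split("\\s+"))         // split lines into words
      .map(word => (word, 1))           // "map": emit (key, value) pairs
      .reduceByKey(_ + _)               // "reduce": combine values that share a key
      .take(10)
      .foreach(println)
    sc.stop()
  }
}
```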
Wait... So Why Has Flink Become So Popular?
- Flink quickly adopted the programming model of ==Google Dataflow== and Apache Beam.
- Flink has an efficient implementation of the Chandy-Lamport distributed snapshot algorithm, which it uses for checkpointing.
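Below is a minimal sketch of how this surfaces to users, using Flink's Scala DataStream API to enable periodic exactly-once checkpoints on the execution environment. The 10-second interval and the toy pipeline are arbitrary illustrative choices.

```scala
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

object CheckpointedJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Take a consistent distributed snapshot every 10 seconds;
    // checkpoint barriers flow through the dataflow graph instead of pausing the whole job.
    env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE)

    env.fromElements(1, 2, 3, 4, 5)   // toy bounded source, just to have a pipeline
      .map(_ * 2)
      .print()

    env.execute("checkpointed-job")
  }
}
```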
How Do These Frameworks Work?
Architecture Choices
To meet the above requirements on clusters of commodity machines, there are several popular distributed-system architectures:
- Master-slave (centralized): Apache Storm + ZooKeeper, Apache Samza + YARN
- P2P (decentralized): Apache S4
Features
- DAG Topology for iterative processing - for example, GraphX in Spark, topologies in Apache Storm, DataStream API in Flink.
- Delivery Guarantees. How to ensure the reliability of data delivery between nodes? At least once / at most once / exactly once.
- Fault Tolerance. Implement fault tolerance using cold/warm/hot standby, checkpointing, or active-active.
- Windowed API for unbounded datasets - for example, stream windows in Apache Flink, window functions in Spark, and windowing in Apache Beam.
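To make the DAG-topology and windowed-API items above concrete, here is a minimal sketch using Flink's Scala DataStream API; the socket source, port, and 5-second window are illustrative choices only.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Each chained operator becomes a node in the job's dataflow DAG.
    env.socketTextStream("localhost", 9999)                        // placeholder source
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))   // 5-second tumbling window
      .sum(1)                                                      // per-word count within each window
      .print()
    env.execute("windowed-word-count")
  }
}
```

Running `nc -lk 9999` locally and typing a few words should produce per-window counts on the console.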
Comparison Table of Different Architectures
| Architecture | Storm | Storm Trident | Spark | Flink |
|---|---|---|---|---|
| Model | Native | Micro-batch | Micro-batch | Native |
| Guarantees | At least once | Exactly once | Exactly once | Exactly once |
| Fault tolerance | Record ACKs | Record ACKs | Checkpoint | Checkpoint |
| Overhead of fault tolerance | High | Medium | Medium | Low |
| Latency | Very low | High | High | Low |
| Throughput | Low | Medium | High | High |
References
- https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43864.pdf
- https://cs.stanford.edu/~matei/papers/2018/sigmod_structured_streaming.pdf
- https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
- https://stackoverflow.com/questions/28502787/google-dataflow-vs-apache-storm