Stream and Batch Processing Frameworks

February 16, 2019 · 2 min read

Why such frameworks?

Process high-throughput data with low latency.
Tolerate faults in distributed systems.
Provide a generic abstraction that can serve volatile business requirements.
Handle both bounded data sets (batch processing) and unbounded data sets (stream processing).

Hadoop and MapReduce. Google made batch processing in a distributed system as simple as MapReduce (MR): result = pairs.map((pair) => (morePairs)).reduce(somePairs => lessPairs).
Apache Storm and DAG Topology. MR doesn’t efficiently express iterative algorithms. Thus Nathan Marz abstracted stream processing into a graph of spouts and bolts.
Spark and In-Memory Computing. Reynold Xin reported that Spark sorted the same data 3X faster than Hadoop while using 10X fewer machines.
Google Dataflow, based on MillWheel and FlumeJava. Google supports both batch and streaming computation with the windowing API.
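To make the map and reduce steps from the MapReduce item concrete, here is a minimal single-process sketch of a word count. It assumes nothing about Hadoop's actual API: the function names (map_phase, shuffle, reduce_phase, map_reduce) are made up for illustration, and a real framework would run each phase in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit one (word, 1) pair per word in the input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group values by key, the step a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single pair."""
    return key, sum(values)


def map_reduce(documents):
    # pairs.map(...) then group by key, then reduce each group
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


word_counts = map_reduce(["to be or not to be", "to stream or to batch"])
```

The user only writes map_phase and reduce_phase; grouping by key, distribution, and retries are what the framework takes off their hands.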

To serve the requirements above with commodity machines, streaming frameworks use distributed systems in these architectures:

Master-slave (centralized): Apache Storm with ZooKeeper, Apache Samza with YARN.
P2P (decentralized): Apache S4.

DAG Topology for Iterative Processing, e.g. GraphX in Spark, topologies in Apache Storm, and the DataStream API in Flink.
Delivery Guarantees. How reliably is data delivered from node to node? At-least-once, at-most-once, or exactly-once.
Fault Tolerance. Cold, warm, or hot standby; checkpointing; or active-active.
Windowing API for unbounded data sets, e.g. Stream Windows in Apache Flink, Window Functions in Spark, and Windowing in Apache Beam.
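As a rough illustration of what a windowing API does with an unbounded stream, the sketch below assigns events to fixed, non-overlapping (tumbling) event-time windows and counts them per key. It is plain Python, not the Flink, Spark, or Beam API; the function name tumbling_window_counts and the sample events are assumptions for the example. It also runs over a finite list, so it sidesteps the question of when a window is complete, which real frameworks answer with watermarks and triggers.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per key in fixed, non-overlapping event-time windows.

    events: iterable of (timestamp_seconds, key) tuples, in any order.
    Returns a dict mapping (window_start, key) to a count.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Every timestamp belongs to exactly one window of length window_size.
        window_start = (timestamp // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical sample events: (event time in seconds, event type)
events = [(1, "click"), (4, "click"), (9, "view"),
          (11, "click"), (19, "view"), (21, "click")]
windows = tumbling_window_counts(events, window_size=10)
```

Because the window is derived from the event's own timestamp, out-of-order arrival does not change the result here; in a live stream it is exactly the late events that make windowing hard.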

Framework	Storm	Storm Trident	Spark	Flink
Model	native	micro-batch	micro-batch	native
Guarantees	at-least-once	exactly-once	exactly-once	exactly-once
Fault tolerance	record-ack	record-ack	checkpoint	checkpoint
Overhead of fault tolerance	high	medium	medium	low
Latency	very low	high	high	low
Throughput	low	medium	high	high
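The record-ack and at-least-once entries in the table can be sketched as follows: the sender replays a record until it is acknowledged, so the receiver may see duplicates, and an idempotent receiver that remembers record IDs turns those duplicates into an effectively-once result. This is a toy simulation under stated assumptions, not Storm's acker mechanism; FlakySink and send_at_least_once are hypothetical names, and the sink deliberately loses the first ack of every record.

```python
class FlakySink:
    """Hypothetical downstream node that loses the ack of each first delivery."""

    def __init__(self):
        self.received = []   # every delivery, duplicates included
        self.seen_ids = set()
        self.applied = []    # payloads applied after deduplication
        self._attempts = {}

    def deliver(self, record_id, payload):
        self.received.append(record_id)
        # Idempotent apply: a replayed record does not take effect twice.
        if record_id not in self.seen_ids:
            self.seen_ids.add(record_id)
            self.applied.append(payload)
        attempt = self._attempts.get(record_id, 0) + 1
        self._attempts[record_id] = attempt
        # Simulated failure: the ack reaches the sender only from attempt 2 on.
        return attempt > 1


def send_at_least_once(sink, records, max_retries=3):
    """Replay each record until it is acked or retries run out."""
    for record_id, payload in records:
        for _ in range(max_retries):
            if sink.deliver(record_id, payload):
                break  # acked, stop replaying this record


sink = FlakySink()
send_at_least_once(sink, [(1, "a"), (2, "b")])
```

After the run, sink.received holds each ID twice while sink.applied holds each payload once, which is the gap between at-least-once delivery and exactly-once effects. Tracking and replaying per record is also why the table rates the overhead of record-ack higher than checkpointing.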

