Stream and Batch Processing Frameworks
Why Do We Need Such Frameworks?
- To process more data in a shorter amount of time.
- To unify fault tolerance in distributed systems.
- To simplify task abstractions to meet changing business requirements.
- To handle both bounded datasets (batch processing) and unbounded datasets (stream processing).
Brief History of Batch and Stream Processing Development
- Hadoop and MapReduce. Google made batch processing in a distributed system as simple as `result = pairs.map(pair => morePairs).reduce(somePairs => lessPairs)`.
- Apache Storm and directed graph topologies. MapReduce does not represent iterative algorithms well, so Nathan Marz abstracted stream processing into a graph structure composed of spouts and bolts.
- Spark and in-memory computation. Reynold Xin pointed out that Spark used ten times fewer machines than Hadoop while being three times faster when processing the same data.
- Google Dataflow, based on MillWheel and FlumeJava. Google exposes a windowed API that supports both batch and stream processing with the same model.
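The MapReduce model above can be illustrated with a toy word count. This is a plain-Python sketch of the programming model, not the Hadoop API: the `map_reduce` helper and its sample data are hypothetical, and a real framework would distribute the map, shuffle, and reduce phases across machines.

```python
from functools import reduce
from itertools import groupby

def map_reduce(pairs, mapper, reducer):
    """Toy single-process sketch of the MapReduce model."""
    # Map phase: each input record yields zero or more (key, value) pairs.
    intermediate = [kv for pair in pairs for kv in mapper(pair)]
    # Shuffle phase: bring pairs with the same key together.
    intermediate.sort(key=lambda kv: kv[0])
    grouped = groupby(intermediate, key=lambda kv: kv[0])
    # Reduce phase: fold each group's values into one result per key.
    return {key: reduce(reducer, (v for _, v in group)) for key, group in grouped}

# Classic word count: documents in, word frequencies out.
docs = [(1, "the quick fox"), (2, "the lazy dog"), (3, "the fox")]
counts = map_reduce(
    docs,
    mapper=lambda pair: [(word, 1) for word in pair[1].split()],
    reducer=lambda a, b: a + b,
)
# counts == {"dog": 1, "fox": 2, "lazy": 1, "quick": 1, "the": 3}
```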
Wait... So Why Has Flink Become So Popular?
- Flink quickly adopted the programming model of ==Google Dataflow== and Apache Beam.
- Flink's efficient implementation of the Chandy-Lamport checkpointing algorithm.
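The core idea behind Chandy-Lamport-style checkpointing, which Flink adapts as barrier-based snapshotting, can be sketched in a few lines. This is a minimal simulation under simplifying assumptions (one operator, one input channel); the `BARRIER` marker and `CountingOperator` class are illustrative names, not Flink's API.

```python
# A checkpoint barrier flows through the stream alongside normal records.
# When an operator sees the barrier, it snapshots its state: the snapshot
# then reflects exactly the records that arrived before the barrier,
# forming a consistent cut without pausing the whole pipeline.

BARRIER = object()  # marker injected into the stream by the source

class CountingOperator:
    def __init__(self):
        self.count = 0       # running operator state
        self.snapshots = []  # completed checkpoints

    def process(self, event):
        if event is BARRIER:
            self.snapshots.append(self.count)  # snapshot pre-barrier state
        else:
            self.count += 1

op = CountingOperator()
for event in ["a", "b", BARRIER, "c", BARRIER]:
    op.process(event)
# op.snapshots == [2, 3]; after a failure, replay from the latest snapshot.
```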
How Do These Frameworks Work?
Architecture Choices
To meet the above demands with commodity machines, there are several popular distributed system architectures:
- Master-slave (centralized): Apache Storm + Zookeeper, Apache Samza + YARN
- P2P (decentralized): Apache S4
Features
- DAG Topology for iterative processing - for example, GraphX in Spark, topologies in Apache Storm, DataStream API in Flink.
- Delivery Guarantees. How to ensure the reliability of data delivery between nodes? At least once / at most once / exactly once.
- Fault Tolerance. Implement fault tolerance using cold/warm/hot standby, checkpointing, or active-active.
- Windowed API for unbounded datasets. For example, stream windows in Apache Flink, window functions in Spark, and windowing in Apache Beam.
Comparison Table of Different Architectures
| Architecture | Storm | Storm Trident | Spark | Flink |
|---|---|---|---|---|
| Model | Native | Micro-batch | Micro-batch | Native |
| Guarantees | At least once | Exactly once | Exactly once | Exactly once |
| Fault Tolerance | Record acks | Record acks | Checkpointing | Checkpointing |
| Fault Tolerance Overhead | High | Medium | Medium | Low |
| Latency | Very low | High | High | Low |
| Throughput | Low | Medium | High | High |