76 posts tagged with "system design"

Designing Smart Notification of Stock Price Changes

August 13, 2019 · 18 min read

Requirements

3 million users
5000 stocks + 250 global stocks
a user gets notified about the price change when
1. subscribing the stock
2. the stock has 5% or 10% changes
3. since a) the last week or b) the last day
extensibility. may support other kinds of notifications like breaking news, earnings call, etc.

Sketching out the Architecture

Contexts:

What is clearing? Clearing is the procedure by which financial trades settle – that is, the correct and timely transfer of funds to the seller and securities to the buyer. Often with clearing, a specialized organization acts as an intermediary known as a clearinghouse.
What is a stock exchange? A facility where stock brokers and traders can buy and sell securities.

Apple Push Notification service<br>(APNs)

Google Firebase Cloud Messaging

(FCM)

Google Firebase Cloud Messaging<br>(FCM)

Email Services

AWS SES /sendgrid/etc

Email Services<br>AWS SES /sendgrid/etc

notifier

External Vendors

Market Prices

[Not supported by viewer]

Robinhood App

API Gateway

Reverse Proxy

batch write

price

ticker

[Not supported by viewer]

Time-series DB

influx or prometheus

Time-series DB<br>influx or prometheus

Tick every 5 mins

[Not supported by viewer]

periorical read

price

watcher

price<br>watcher

User Settings

Notification Queue

throttler cache

cronjob

What are those components and how do they interact with each other?

Price ticker
- data fetching policies
  - option 1 preliminary: fetches data every 5 mins and flush into the time-series database in batches.
  - option 2 advanced: nowadays external systems usually push data directly so that we do not have to pull all the time.
- ~6000 points per request or per price change.
- data retention of 1 week, because this is just the speeding layer of the lambda architecture.
Price watcher
- read the data ranging from last week or last 24 hours for each stock.
- calculate if the fluctuation exceeds 5% or 10% in those two time spans. we get tuples like (stock, up 5%, 1 week).
  - corner case: should we normalize the price data? for example, some abnormal price like someone sold UBER mistakenly for $1 USD.
- ratelimit (because 5% or 10% delta may occur many times within one day), and then emit an event PRICE_CHANGE(STOCK_CODE, timeSpan, percentage) to the notification queue.
Periodical triggers are cron jobs, e.g. Airflow, Cadence.
notification queue
- may not necessarily be introduced in the first place when users and stocks are small.
- may accept generic messaging event, like PRICE_CHANGE, EARNINGS_CALL, BREAKING_NEWS, etc.
Notifier
- subscribe the notification queue to get the event
- and then fetch who to notify from the user settings service
- finally based on user settings, send out messages through APNs, FCM or AWS SES.

Requirements

order-matching system for buy and sell orders. Types of orders:
- Market Orders
- Limit Orders
- Stop-Loss Orders
- Fill-or-Kill Orders
- Duration of Orders
high availability and low latency for millions of users
- async design - use messaging queue extensively (btw. side-effect: engineers work on one service pub to a queue and does not even know where exactly is the downstream service and hence cannot do evil.)

Architecture

Reverse Proxy

API Gateway

Order Matching

User Store

settle

Orders

Stock Meta

auth

Cache

Balances & Bookkeeping

external pricing

clearing

house

clearing<br>house

Bank, ACH, Visa, etc

Payment

Audit & Report

Components and How do they interact with each other.

order matching system

shard by stock code
order's basic data model (other metadata are omitted): Order(id, stock, side, time, qty, price)
the core abstraction of the order book is the matching algorithm. there are a bunch of matching algorithms(ref to stackoverflow, ref to medium)
example 1: price-time FIFO - a kind of 2D vector cast or flatten into 1D vector
- x-axis is price
- y-axis is orders. Price/time priority queue, FIFO.
  - Buy-side: ascending in price, descending in time.
  - Sell-side: ascending in price, ascending in time.
- in other words
  - Buy-side: the higher the price and the earlier the order, the nearer we should put it to the center of the matching.
  - Sell-side: the lower the price and the earlier the order, the nearer we should put it to the center of the matching.

x-axis

line of prices

with y-axis cast into x-axis

Id   Side    Time   Qty   Price   Qty    Time   Side  
---+------+-------+-----+-------+-----+-------+------
#3                        20.30   200   09:05   SELL  
#1                        20.30   100   09:01   SELL  
#2                        20.25   100   09:03   SELL  
#5   BUY    09:08   200   20.20                       
#4   BUY    09:06   100   20.15                       
#6   BUY    09:09   200   20.15                       

Order book from Coinbase Pro

The Single Stock-Exchange Simulator

example 2: pro-rata

pure pro-rata

How to implement the price-time FIFO matching algorithm?

shard by stock, CP over AP: one stock one partition
stateful in-memory tree-map
- periodically iterate the treemap to match orders
data persistence with cassandra
in/out requests of the order matching services are made through messaging queues
failover
- the in-memory tree-maps are snapshotting into database
- in an error case, recover from the snapshot and de-duplicate with cache

How to transmit data of the order book to the client-side in realtime?

websocket

How to support different kinds of orders?

same SELL or BUY: qty @ price in the treemap with different creation setup and matching conditions
- Market Orders: place the order at the last market price.
- Limit Orders: place the order with at a specific price.
- Stop-Loss Orders: place the order with at a specific price, and match it in certain conditions.
- Fill-or-Kill Orders: place the order with at a specific price, but match it only once.
- Duration of Orders: place the order with at a specific price, but match it only in the given time span.

Orders Service

Preserves all active orders and order history.
Writes to order matching when receives a new order.
Receives matched orders and settle with external clearing house (async external gateway call + cronjob to sync DB)

References

What is Architecture?

Architecture is the shape of a software system. To illustrate with a building:

Paradigm is the bricks.
Design principles are the rooms.
Components are the structure.

They all serve a specific purpose, just like hospitals treat patients and schools educate students.

Why Do We Need Architecture?

Behavior vs. Structure

Every software system provides two distinct values to stakeholders: behavior and structure. Software developers must ensure that both values are high.

==Due to the nature of their work, software architects focus more on the structure of the system rather than its features and functions.==

Ultimate Goal — ==Reduce the human resource costs required for adding new features==

Architecture serves the entire lifecycle of software systems, making them easy to understand, develop, test, deploy, and operate. Its goal is to minimize the human resource costs for each business use case.

O'Reilly's "Software Architecture" provides a great introduction to these five fundamental architectures.

1. Layered Architecture

Layered architecture is widely adopted and well-known among developers. Therefore, it is the de facto standard at the application level. If you are unsure which architecture to use, layered architecture is a good choice.

Examples:

TCP/IP model: Application Layer > Transport Layer > Internet Layer > Network Interface Layer
Facebook TAO: Network Layer > Cache Layer (follower + leader) > Database Layer

Pros and Cons:

Pros
- Easy to use
- Clear responsibilities
- Testability
Cons
- Large and rigid
  - Adjusting, extending, or updating the architecture requires changes across all layers, which can be quite tricky.

2. Event-Driven Architecture

Any change in state triggers an event in the system. Communication between system components is accomplished through events.

A simplified architecture includes a mediator, event queue, and channels. The diagram below illustrates a simplified event-driven architecture:

Examples:

QT: Signals and Slots
Payment infrastructure: As bank gateways often have high latency, asynchronous techniques are used in banking architecture.

3. Microkernel Architecture (aka Plug-in Architecture)

The functionality of the software is distributed between a core and multiple plugins. The core contains only the most basic functionalities. Each plugin operates independently and implements shared interfaces to achieve different goals.

Examples:

Visual Studio Code and Eclipse
MINIX operating system

4. Microservices Architecture

Large systems are decomposed into numerous microservices, each a separately deployable unit that communicates via RPCs.

uber architecture

Examples:

Uber: See designing Uber
Smartly

5. Space-Based Architecture

The name "Space-Based Architecture" comes from "tuple space," which implies a "distributed shared space." In space-based architecture, there are no databases or synchronized database access, thus avoiding database bottleneck issues. All processing units share copies of application data in memory. These processing units can be flexibly started and stopped.

Example: See Wikipedia

Primarily adopted by Java-based architectures: for example, JavaSpaces.

Introduction to Architecture

May 11, 2019 · 3 min read

What is architecture?

Architecture is the shape of the software system. Thinking it as a big picture of physical buildings.

paradigms are bricks.
design principles are rooms.
components are buildings.

Together they serve a specific purpose like a hospital is for curing patients and a school is for educating students.

Why do we need architecture?

Behavior vs. Structure

Every software system provides two different values to the stakeholders: behavior and structure. Software developers are responsible for ensuring that both those values remain high.

==Software architects are, by virtue of their job description, more focused on the structure of the system than on its features and functions.==

Ultimate Goal - ==saving human resources costs per feature==

Architecture serves the full lifecycle of the software system to make it easy to understand, develop, test, deploy, and operate. The goal is to minimize the human resources costs per business use-case.

The O’Reilly book Software Architecture Patterns by Mark Richards is a simple but effective introduction to these five fundamental architectures.

1. Layered Architecture

The layered architecture is the most common in adoption, well-known among developers, and hence the de facto standard for applications. If you do not know what architecture to use, use it.

Examples

TCP / IP Model: Application layer > transport layer > internet layer > network access layer
Facebook TAO: web layer > cache layer (follower + leader) > database layer

Pros and Cons

Pros
- ease of use
- separation of responsibility
- testability
Cons
- monolithic
  - hard to adjust, extend or update. You have to make changes to all the layers.

2. Event-Driven Architecture

A state change will emit an event to the system. All the components communicate with each other through events.

A simple project can combine the mediator, event queue, and channel. Then we get a simplified architecture:

Examples

QT: Signals and Slots
Payment Infrastructure: Bank gateways usually have very high latencies, so they adopt async technologies in their architecture design.

3. Micro-kernel Architecture (aka Plug-in Architecture)

The software's responsibilities are divided into one "core" and multiple "plugins". The core contains the bare minimum functionality. Plugins are independent of each other and implement shared interfaces to achieve different goals.

Examples

Visual Studio Code, Eclipse
MINIX operating system

4. Microservices Architecture

A massive system is decoupled to multiple micro-services, each of which is a separately deployed unit, and they communicate with each other via RPCs.

uber architecture

Examples

Uber: See designing Uber
Smartly

5. Space-based Architecture

This pattern gets its name from "tuple space", which means “distributed shared memory". There is no database or synchronous database access, and thus no database bottleneck. All the processing units share the replicated application data in memory. These processing units can be started up and shut down elastically.

Examples: See Wikipedia

Mostly adopted among Java users: e.g., JavaSpaces

Toutiao Recommendation System: P2 Content Analysis

April 19, 2019 · 3 min read

In Toutiao Recommendation System: P1 Overview, we know that content analysis and data mining of user tags are the cornerstones of the recommendation system.

What is the content analysis?

content analysis = derive intermediate data from raw articles and user behaviors.

Take articles for example. To model user interests, we need to tag contents and articles. To associate a user with the interests of the “Internet” tag, we need to know whether a user reads an article with the “Internet” tag.

Why are we analyzing those raw data?

We do it for the reason of …

Tagging users (user profile)
- Tagging users who liked articles with “Internet” tag. Tagging users who liked articles with “xiaomi” tag.
Recommending contents to users by tags
- Pushing “meizu” contents to users with “meizu” tag. Pushing “dota” contents to users with “dota” tag.
Preparing contents by topics
- Put “Bundesliga” articles to “Bundesliga topic”. Put “diet” articles to “diet topic”.

Case Study: Analysis Result of an Article

Here is an example of “article features” page. There are article features like categorizations, keywords, topics, entities.

Analysis Result of an Article

Analysis Result of an Article: Details

What are the article features?

Semantic Tags: Human predefine those tags with explicit meanings.
Implicit Semantics, including topics and keywords. Topic features are describing the statistics of words. Certain rules generate keywords.
Similarity. Duplicate recommendation once to be the most severe feedbacks we get from our customers.
Time and location.
Quality. Abusing, porn, ads, or “chicken soup for the soul”?

Article features are important

It is not true that a recommendation system cannot work at all without article features. Amazon, Walmart, Netflix can recommend by collaborative filtering.
However, in news product, users consume contents of the same day. Bootstrapping without article features is hard. Collaborative filtering cannot help with bootstrapping.
- The finer of the granularity of the article feature, the better the ability to bootstrap.

More on Semantic Tags

We divide features of semantic tags into three levels:

Categorizations: used in the user profile, filtering contents in topics, recommend recall, recommend features
Concepts: used in filtering contents in topics, searching tags, recommend recall(like)
Entities: used in filtering contents in topics, searching tags, recommend recall(like)

Why dividing into different levels? We do this so that they can capture articles in different granularities.

Categorizations: full in coverage, low in accuracy.
Concepts: medium in coverage, medium in accuracy.
Entities: low in coverage, high in accuracy. It only covers hot people, organizations, products in each area.

Categorizations and concepts are sharing the same technical infrastructure.

Why do we need semantic tags?

Implicit semantics
- have been functioning well.
- cost much less than semantic tags.
But, topics and interests need a clear-defined tagging system.
Semantic tags also evaluate the capability in NPL technology of a company.

Document classification

Classification hierarchy

Root
Science, sports, finance, entertainment
Football, tennis, table tennis, track and field, swimming
International, domestic
Team A, team B

Classifiers:

SVM
SVM + CNN
SVM + CNN + RNN

Calculating relevance

Lexical analysis for articles
Filtering keywords
Disambiguation
Calculating relevance

Stream and Batch Processing Frameworks

February 27, 2019 · 2 min read

Why Do We Need Such Frameworks?

To process more data in a shorter amount of time.
To unify fault tolerance in distributed systems.
To simplify task abstractions to meet changing business requirements.
Suitable for bounded datasets (batch processing) and unbounded datasets (stream processing).

Brief History of Batch and Stream Processing Development

Hadoop and MapReduce. Google made batch processing as simple as MapReduce result = pairs.map((pair) => (morePairs)).reduce(somePairs => lessPairs) in a distributed system.
Apache Storm and directed graph topologies. MapReduce does not represent iterative algorithms well. Therefore, Nathan Marz abstracted stream processing into a graph structure composed of spouts and bolts.
Spark in-memory computation. Reynold Xin pointed out that Spark uses ten times fewer machines than Hadoop while being three times faster when processing the same data.
Google Dataflow based on Millwheel and FlumeJava. Google uses a windowed API to support both batch and stream processing simultaneously.

Wait... So Why Has Flink Become So Popular?

Flink quickly adopted the programming model of ==Google Dataflow== and Apache Beam.
Flink's efficient implementation of the Chandy-Lamport checkpointing algorithm.

These Frameworks

Architecture Choices

To meet the above demands with commercial machines, there are several popular distributed system architectures...

Master-slave (centralized): Apache Storm + Zookeeper, Apache Samza + YARN
P2P (decentralized): Apache S4

Features

DAG Topology for iterative processing - for example, GraphX in Spark, topologies in Apache Storm, DataStream API in Flink.
Delivery Guarantees. How to ensure the reliability of data delivery between nodes? At least once / at most once / exactly once.
Fault Tolerance. Implement fault tolerance using cold/warm/hot standby, checkpointing, or active-active.
Windowed API for unbounded datasets. For example, streaming windows in Apache. Window functions in Spark. Windowing in Apache Beam.

Comparison Table of Different Architectures

Architecture	Storm	Storm-trident	Spark	Flink
Model	Native	Micro-batch	Micro-batch	Native
Guarantees	At least once	Exactly once	Exactly once	Exactly once
Fault Tolerance	Record Ack	Record Ack	Checkpoint	Checkpoint
Maximum Fault Tolerance	High	Medium	Medium	Low
Latency	Very low	High	High	Low
Throughput	Low	Medium	High	High

Toutiao Recommendation System: P1 Overview

February 19, 2019 · 6 min read

What are we optimizing for? User Satisfaction

We are finding the best function below to maximize user satisfaction .

user satisfaction = function(content, user profile, context)

Content: features of articles, videos, UGC short videos, Q&As, etc.
User profile: interests, occupation, age, gender and behavior patterns, etc.
Context: Mobile users in contexts of workspace, commuting, traveling, etc.

How to evaluate the satisfaction?

Measurable Goals, e.g.
- click through rate
- Session duration
- upvotes
- comments
- reposts
Hard-to-measurable Goals:
- Frequency control of ads and special-typed contents (Q&A)
- Frequency control of vulgar content
- Reducing clickbait, low quality, disgusting content
- Enforcing / pining / highly-weighting important news
- Lowly-weighting contents from low-level accounts

How to optimize for those goals? Machine Learning Models

It is a typical supervised machine learning problem to find the best function above. To implement the system, we have these algorithms:

Collaborative Filtering
Logistic Regression
DNN
Factorization Machine
GBDT

A world-class recommendation system is supposed to have the flexibility to A/B-test and combine multiple algorithms above. It is now popular to combine LR and DNN. Facebook used both LR and GBDT years ago.

How do models observe and measure the reality? Feature engineering

Correlation, between content’s characteristic and user’s interest. Explicit correlations include keywords, categories, sources, genres. Implicit correlations can be extract from user’s vector or item’s vector from models like FM.
Environmental features such as geo location, time. It’s can be used as bias or building correlation on top of it.
Hot trend. There are global hot trend, categorical hot trend, topic hot trend and keyword hot trend. Hot trend is very useful to solve cold-start issue when we have little information about user.
Collaborative features, which helps avoid situation where recommended content get more and more concentrated. Collaborative filtering is not analysing each user’s history separately, but finding users’ similarity based on their behaviour by clicks, interests, topics, keywords or event implicit vectors. By finding similar users, it can expand the diversity of recommended content.

Large-scale Training in Realtime

Users like to see news feed updated in realtime according to what we track from their actions.
Use Apache storm to train data (clicks, impressions, faves, shares) in realtime.
Collect data to a threshold and then update to the recommendation model
Store model parameters , like tens of billions of raw features and billions of vector features, in high performance computing clusters.

They are implemented in the following steps:

Online services record features in realtime.
Write data into Kafka
Ingest data from Kafka to Storm
Populate full user profiles and prepare samples
Update model parameters according to the latest samples
Online modeling gains new knowledge

How to further reducing the latency? Recall Strategy

It is impossible to predict all the things with the model, considering the super-large scale of all the contents. Therefore, we need recall strategies to focus on a representative subset of the data. Performance is critical here and timeout is 50ms.

recall strategy

Among all the recall strategies, we take the InvertedIndex<Key, List<Article>> .

The Key can be topic, entity, source, etc.

Tags of Interests	Relevance	List of Documents
E-commerce	0.3	…
Fun	0.2	…
History	0.2	…
Military	0.1	…

Data Dependencies

Features depends on tags of user-side and content-side.
recall strategy depends on tags of user-side and content-side.
content analysis and data mining of user tags are cornerstone of the recommendation system.

What is the content analysis?

content analysis = derive intermediate data from raw articles and user behaviors.

Take articles for example. To model user interests, we need to tag contents and articles. To associate a user with the interests of the “Internet” tag, we need to know whether a user reads an article with the “Internet” tag.

Why are we analyzing those raw data?

We do it for the reason of …

Tagging users (user profile)
- Tagging users who liked articles with “Internet” tag. Tagging users who liked articles with “xiaomi” tag.
Recommending contents to users by tags
- Pushing “meizu” contents to users with “meizu” tag. Pushing “dota” contents to users with “dota” tag.
Preparing contents by topics
- Put “Bundesliga” articles to “Bundesliga topic”. Put “diet” articles to “diet topic”.

Case Study: Analysis Result of an Article

Here is an example of “article features” page. There are article features like categorizations, keywords, topics, entities.

Analysis Result of an Article

Analysis Result of an Article: Details

What are the article features?

Semantic Tags: Human predefine those tags with explicit meanings.
Implicit Semantics, including topics and keywords. Topic features are describing the statistics of words. Certain rules generate keywords.
Similarity. Duplicate recommendation once to be the most severe feedbacks we get from our customers.
Time and location.
Quality. Abusing, porn, ads, or “chicken soup for the soul”?

Article features are important

It is not true that a recommendation system cannot work at all without article features. Amazon, Walmart, Netflix can recommend by collaborative filtering.
However, in news product, users consume contents of the same day. Bootstrapping without article features is hard. Collaborative filtering cannot help with bootstrapping.
- The finer of the granularity of the article feature, the better the ability to bootstrap.

More on Semantic Tags

We divide features of semantic tags into three levels:

Categorizations: used in the user profile, filtering contents in topics, recommend recall, recommend features
Concepts: used in filtering contents in topics, searching tags, recommend recall(like)
Entities: used in filtering contents in topics, searching tags, recommend recall(like)

Why dividing into different levels? We do this so that they can capture articles in different granularities.

Categorizations: full in coverage, low in accuracy.
Concepts: medium in coverage, medium in accuracy.
Entities: low in coverage, high in accuracy. It only covers hot people, organizations, products in each area.

Categorizations and concepts are sharing the same technical infrastructure.

Why do we need semantic tags?

Implicit semantics
- have been functioning well.
- cost much less than semantic tags.
But, topics and interests need a clear-defined tagging system.
Semantic tags also evaluate the capability in NPL technology of a company.

Document classification

Classification hierarchy

Root
Science, sports, finance, entertainment
Football, tennis, table tennis, track and field, swimming
International, domestic
Team A, team B

Classifiers:

SVM
SVM + CNN
SVM + CNN + RNN

Calculating relevance

Lexical analysis for articles
Filtering keywords
Disambiguation
Calculating relevance

Toutiao Recommendation System: P1 Overview

February 19, 2019 · 3 min read

What are we optimizing for? User Satisfaction

We are finding the best function below to maximize user satisfaction .

user satisfaction = function(content, user profile, context)

Content: features of articles, videos, UGC short videos, Q&As, etc.
User profile: interests, occupation, age, gender and behavior patterns, etc.
Context: Mobile users in contexts of workspace, commuting, traveling, etc.

How to evaluate the satisfaction?

Measurable Goals, e.g.
- click through rate
- Session duration
- upvotes
- comments
- reposts
Hard-to-measurable Goals:
- Frequency control of ads and special-typed contents (Q&A)
- Frequency control of vulgar content
- Reducing clickbait, low quality, disgusting content
- Enforcing / pining / highly-weighting important news
- Lowly-weighting contents from low-level accounts

How to optimize for those goals? Machine Learning Models

It is a typical supervised machine learning problem to find the best function above. To implement the system, we have these algorithms:

Collaborative Filtering
Logistic Regression
DNN
Factorization Machine
GBDT

A world-class recommendation system is supposed to have the flexibility to A/B-test and combine multiple algorithms above. It is now popular to combine LR and DNN. Facebook used both LR and GBDT years ago.

How do models observe and measure the reality? Feature engineering

Correlation, between content’s characteristic and user’s interest. Explicit correlations include keywords, categories, sources, genres. Implicit correlations can be extract from user’s vector or item’s vector from models like FM.
Environmental features such as geo location, time. It’s can be used as bias or building correlation on top of it.
Hot trend. There are global hot trend, categorical hot trend, topic hot trend and keyword hot trend. Hot trend is very useful to solve cold-start issue when we have little information about user.
Collaborative features, which helps avoid situation where recommended content get more and more concentrated. Collaborative filtering is not analysing each user’s history separately, but finding users’ similarity based on their behaviour by clicks, interests, topics, keywords or event implicit vectors. By finding similar users, it can expand the diversity of recommended content.

Large-scale Training in Realtime

Users like to see news feed updated in realtime according to what we track from their actions.
Use Apache storm to train data (clicks, impressions, faves, shares) in realtime.
Collect data to a threshold and then update to the recommendation model
Store model parameters , like tens of billions of raw features and billions of vector features, in high performance computing clusters.

They are implemented in the following steps:

Online services record features in realtime.
Write data into Kafka
Ingest data from Kafka to Storm
Populate full user profiles and prepare samples
Update model parameters according to the latest samples
Online modeling gains new knowledge

How to further reducing the latency? Recall Strategy

It is impossible to predict all the things with the model, considering the super-large scale of all the contents. Therefore, we need recall strategies to focus on a representative subset of the data. Performance is critical here and timeout is 50ms.

recall strategy

Among all the recall strategies, we take the InvertedIndex<Key, List<Article>> .

The Key can be topic, entity, source, etc.

Tags of Interests	Relevance	List of Documents
E-commerce	0.3	…
Fun	0.2	…
History	0.2	…
Military	0.1	…

Data Dependencies

Features depends on tags of user-side and content-side.
recall strategy depends on tags of user-side and content-side.
content analysis and data mining of user tags are cornerstone of the recommendation system.

Stream and Batch Processing Frameworks

February 16, 2019 · 2 min read

Why such frameworks?

process high-throughput in low latency.
fault-tolerance in distributed systems.
generic abstraction to serve volatile business requirements.
for bounded data set (batch processing) and for unbounded data set (stream processing).

Brief history of batch/stream processing

Hadoop and MapReduce. Google made batch processing as simple as MR result = pairs.map((pair) => (morePairs)).reduce(somePairs => lessPairs) in a distributed system.
Apache Storm and DAG Topology. MR doesn’t efficiently express iterative algorithms. Thus Nathan Marz abstracted stream processing into a graph of spouts and bolts.
Spark in-memory Computing. Reynold Xin said Spark sorted the same data 3X faster using 10X fewer machines compared to Hadoop.
Google Dataflow based on Millwheel and FlumeJava. Google supports both batch and streaming computing with the windowing API.

Wait... But why is Flink gaining popularity?

its fast adoption of ==Google Dataflow==/Beam programming model.
its highly efficient implementation of Chandy-Lamport checkpointing.

How?

Architectural Choices

To serve requirements above with commodity machines, the steaming framework use distributed systems in these architectures...

master-slave (centralized): apache storm with zookeeper, apache samza with YARN.
P2P (decentralized): apache s4.

Features

DAG Topology for Iterative Processing. e.g. GraphX in Spark, topologies in Apache Storm, DataStream API in Flink.
Delivery Guarantees. How guaranteed to deliver data from nodes to nodes? at-least once / at-most once / exactly once.
Fault-tolerance. Using cold/warm/hot standby, checkpointing, or active-active.
Windowing API for unbounded data set. e.g. Stream Windows in Apache Flink. Spark Window Functions. Windowing in Apache Beam.

Comparison

Framework	Storm	Storm-trident	Spark	Flink
Model	native	micro-batch	micro-batch	native
Guarentees	at-least-once	exactly-once	exactly-once	exactly-once
Fault-tolerance	Record-Ack	record-ack	checkpoint	checkpoint
Overhead of fault-tolerance	high	medium	medium	low
latency	very-low	high	high	low
throughput	low	medium	high	high

Fraud Detection with Semi-supervised Learning

February 13, 2019 · 4 min read

Clarify Requirements

Calculate risk probability scores in realtime and make decisions along with a rule engine to prevent ATO (account takeovers) and Botnet attacks.

Train clustering fatures with online and offline pipelines

Source from website logs, auth logs, user actions, transactions, high-risk accounts in watch list
track event data in kakfa topics
Process events and prepare clustering features

Realtime scoring and rule-based decision

assess a risk score comprehensively for online services
Maintain flexibility with manually configuration in a rule engine
share, or use the insights in online services

ATOs ranking from easy to hard to detect

from single IP
from IPs on the same device
from IPs across the world
from 100k IPs
attacks on specific accounts
phishing and malware

Challenges

Manual feature selection
Feature evolution in adversarial environment
Scalability
No online DBSCAN

High-level Architecture

Core Components and Workflows

Semi-supervised learning = unlabeled data + small amount of labeled data

Why? better learning accuracy than unsupervised learning + less time and costs than supervised learning

Training: To prepare clustering features in database

Streaming Pipeline on Spark:
- Runs continuously in real-time.
- Performs feature normalization and categorical transformation on the fly.
  - Feature Normalization: Scale your numeric features (e.g., age, income) so that they are between 0 and 1.
  - Categorical Feature Transformation: Apply one-hot encoding or another transformation to convert categorical features into a numeric format suitable for the machine learning model.
- Uses Spark MLlib’s K-means to cluster streaming data into groups.
  - After running k-means and forming clusters, you might find that certain clusters have more instances of fraud.
  - Once you’ve labeled a cluster as fraudulent based on historical data or expert knowledge, you can use that cluster assignment during inference. Any new data point assigned to that fraudulent cluster can be flagged as suspicious.
Hourly Cronjob Pipeline:
- Runs periodically every hour (batch processing).
- Applies thresholding to identify anomalies based on results from the clustering model.
- Tunes parameters of the DBSCAN algorithm to improve clustering and anomaly detection.
- Uses DBSCAN from scikit-learn to find clusters and detect outliers in batch data.
  - DBSCAN, which can detect outliers, might identify clusters of regular transactions and separate them from noise, which could be unusual, potentially fraudulent transactions.
  - Transactions in the noisy or outlier regions (points that don’t belong to any dense cluster) can be flagged as suspicious.
  - After identifying a cluster as fraudulent, DBSCAN helps detect patterns of fraud even in irregularly shaped transaction distributions.

Serving

The serving layer is where the rubber meets the road - where we turn our machine learning models and business rules into actual fraud prevention decisions. Here's how it works:

Fraud Detection Scoring Service:
- Takes real-time features extracted from incoming requests
- Applies both clustering models (K-means from streaming and DBSCAN from batch)
- Combines scores with streaming counters (like login attempts per IP)
- Outputs a unified risk score between 0 and 1
Rule Engine:
- Acts as the "brain" of the system
- Combines ML scores with configurable business rules
- Examples of rules:
  - If risk score > 0.8 AND user is accessing from new IP → require 2FA
  - If risk score > 0.9 AND account is high-value → block transaction
- Rules are stored in a database and can be updated without code changes
- Provides an admin portal for security teams to adjust rules
Integration with Other Services:
- Exposes REST APIs for real-time scoring
- Publishes results to streaming counters for monitoring
- Feeds decisions back to the training pipeline to improve model accuracy
Observability:
- Tracks key metrics like false positive/negative rates
- Monitors model drift and feature distribution changes
- Provides dashboards for security analysts to investigate patterns
- Logs detailed information for post-incident analysis

Requirements​

Sketching out the Architecture​

What are those components and how do they interact with each other?​

Requirements​

Architecture​

Components and How do they interact with each other.​

order matching system​

Orders Service​

References​

What is Architecture?​

Why Do We Need Architecture?​

Behavior vs. Structure​

Ultimate Goal — ==Reduce the human resource costs required for adding new features==​

1. Layered Architecture​

2. Event-Driven Architecture​

3. Microkernel Architecture (aka Plug-in Architecture)​

4. Microservices Architecture​

5. Space-Based Architecture​

What is architecture?​

Why do we need architecture?​

Behavior vs. Structure​

Ultimate Goal - ==saving human resources costs per feature==​

1. Layered Architecture​

2. Event-Driven Architecture​

3. Micro-kernel Architecture (aka Plug-in Architecture)​

4. Microservices Architecture​

5. Space-based Architecture​

What is the content analysis?​

Why are we analyzing those raw data?​

Case Study: Analysis Result of an Article​

Article features are important​

More on Semantic Tags​

Document classification​

Calculating relevance​

Why Do We Need Such Frameworks?​

Brief History of Batch and Stream Processing Development​

Wait... So Why Has Flink Become So Popular?​

These Frameworks​

Architecture Choices​

Features​

Comparison Table of Different Architectures​

What are we optimizing for? User Satisfaction​

How to evaluate the satisfaction?​

How to optimize for those goals? Machine Learning Models​

How do models observe and measure the reality? Feature engineering​

Large-scale Training in Realtime​

How to further reducing the latency? Recall Strategy​

Data Dependencies​

What is the content analysis?​

Why are we analyzing those raw data?​

Case Study: Analysis Result of an Article​

Article features are important​

More on Semantic Tags​

Document classification​

Calculating relevance​

What are we optimizing for? User Satisfaction​

How to evaluate the satisfaction?​

How to optimize for those goals? Machine Learning Models​

How do models observe and measure the reality? Feature engineering​

Large-scale Training in Realtime​

How to further reducing the latency? Recall Strategy​

Data Dependencies​

Why such frameworks?​

Brief history of batch/stream processing​

Wait... But why is Flink gaining popularity?​

How?​

Architectural Choices​

Features​

Comparison​

Clarify Requirements​

High-level Architecture​

Core Components and Workflows​

Training: To prepare clustering features in database​

Serving​

About Tian Pan

Stay up to date

Requirements

Sketching out the Architecture

What are those components and how do they interact with each other?

Requirements

Architecture

Components and How do they interact with each other.

order matching system

Orders Service

References

What is Architecture?

Why Do We Need Architecture?

Behavior vs. Structure

Ultimate Goal — ==Reduce the human resource costs required for adding new features==

1. Layered Architecture

2. Event-Driven Architecture

3. Microkernel Architecture (aka Plug-in Architecture)

4. Microservices Architecture

5. Space-Based Architecture

What is architecture?

Why do we need architecture?

Behavior vs. Structure

Ultimate Goal - ==saving human resources costs per feature==

1. Layered Architecture

2. Event-Driven Architecture

3. Micro-kernel Architecture (aka Plug-in Architecture)

4. Microservices Architecture

5. Space-based Architecture

What is the content analysis?

Why are we analyzing those raw data?

Case Study: Analysis Result of an Article

Article features are important

More on Semantic Tags

Document classification

Calculating relevance

Why Do We Need Such Frameworks?

Brief History of Batch and Stream Processing Development

Wait... So Why Has Flink Become So Popular?

These Frameworks

Architecture Choices

Features

Comparison Table of Different Architectures

What are we optimizing for? User Satisfaction

How to evaluate the satisfaction?

How to optimize for those goals? Machine Learning Models

How do models observe and measure the reality? Feature engineering

Large-scale Training in Realtime

How to further reducing the latency? Recall Strategy

Data Dependencies

What is the content analysis?

Why are we analyzing those raw data?

Case Study: Analysis Result of an Article

Article features are important

More on Semantic Tags

Document classification

Calculating relevance

What are we optimizing for? User Satisfaction

How to evaluate the satisfaction?

How to optimize for those goals? Machine Learning Models

How do models observe and measure the reality? Feature engineering

Large-scale Training in Realtime

How to further reducing the latency? Recall Strategy

Data Dependencies

Why such frameworks?

Brief history of batch/stream processing

Wait... But why is Flink gaining popularity?

How?

Architectural Choices

Features

Comparison

Clarify Requirements

High-level Architecture

Core Components and Workflows

Training: To prepare clustering features in database

Serving