
76 posts tagged with "system design"


Designing Smart Notification of Stock Price Changes

· 15 min read

Requirements

  • 3 million users
  • 5000 stocks + 250 global stocks
  • a user gets notified about a price change when
    1. the user has subscribed to the stock
    2. and the stock price has changed by 5% or 10%
    3. relative to a) the last week or b) the last day
  • extensibility: it may support other kinds of notifications, like breaking news, earnings calls, etc.

Sketching out the Architecture

Contexts:

  • What is clearing? Clearing is the procedure by which financial trades settle – that is, the correct and timely transfer of funds to the seller and securities to the buyer. Often with clearing, a specialized organization acts as an intermediary known as a clearinghouse.
  • What is a stock exchange? A facility where stock brokers and traders can buy and sell securities.

(Architecture diagram.) The price ticker fetches market prices from external vendors and batch-writes them into a time-series DB (InfluxDB or Prometheus). A cronjob ticks every 5 minutes to trigger the price watcher, which periodically reads prices from the time-series DB and publishes events to the Notification Queue. The notifier consumes the queue, looks up User Settings, checks a throttler cache, and delivers messages via Apple Push Notification service (APNs), Google Firebase Cloud Messaging (FCM), or email services (AWS SES / SendGrid / etc.). The Robinhood App reaches the backend through a Reverse Proxy and an API Gateway.

What are those components and how do they interact with each other?

  • Price ticker
    • data fetching policies
      • option 1 (preliminary): fetch data every 5 minutes and flush it into the time-series database in batches.
      • option 2 (advanced): nowadays external vendors usually push data directly, so we do not have to poll all the time.
    • ~6000 points per request or per price change.
    • data retention of 1 week, because this is just the speed layer of the lambda architecture.
  • Price watcher
    • read the data of the last week or the last 24 hours for each stock.
    • calculate whether the fluctuation exceeds 5% or 10% over those two time spans; we get tuples like (stock, up 5%, 1 week).
      • corner case: should we normalize the price data? For example, an abnormal price such as someone mistakenly selling UBER for $1 USD.
    • rate-limit (because a 5% or 10% delta may occur many times within one day), and then emit an event PRICE_CHANGE(STOCK_CODE, timeSpan, percentage) to the notification queue (see the sketch after this list).
  • Periodic triggers are cron jobs, e.g. Airflow or Cadence.
  • notification queue
    • not necessarily needed in the first place, while the numbers of users and stocks are small.
    • may accept generic messaging events, like PRICE_CHANGE, EARNINGS_CALL, BREAKING_NEWS, etc.
  • Notifier
    • subscribes to the notification queue to get events
    • then fetches whom to notify from the user settings service
    • finally, based on user settings, sends out messages through APNs, FCM, or AWS SES.
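A minimal sketch of the price watcher's core loop, in Python. The query_prices and emit helpers are hypothetical stand-ins for the time-series DB read and the notification-queue publish; the thresholds and the rate-limit window are illustrative.

```python
import time
from collections import defaultdict

RATE_LIMIT_SECONDS = 6 * 3600     # hypothetical: at most one alert per (stock, span, threshold) per 6h
_last_sent = defaultdict(float)   # acts as the "throttler cache"

def fluctuation(prices):
    """Relative change between the first and last price of the span."""
    first, last = prices[0], prices[-1]
    return (last - first) / first

def watch(stock, query_prices, emit):
    """query_prices(stock, span) -> [price, ...]; emit(event) publishes to the notification queue."""
    for span in ("1d", "1w"):
        prices = query_prices(stock, span)
        if len(prices) < 2:
            continue
        change = fluctuation(prices)
        for threshold in (0.10, 0.05):            # check the larger threshold first
            if abs(change) >= threshold:
                key = (stock, span, threshold)
                now = time.time()
                if now - _last_sent[key] >= RATE_LIMIT_SECONDS:   # rate-limit repeated alerts
                    _last_sent[key] = now
                    emit({"type": "PRICE_CHANGE", "stock": stock,
                          "timeSpan": span, "percentage": round(change * 100, 2)})
                break                              # do not also fire the 5% alert for the same span
```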

Designing Stock Exchange

· 16 min read

Requirements

  • order-matching system for buy and sell orders. Types of orders:
    • Market Orders
    • Limit Orders
    • Stop-Loss Orders
    • Fill-or-Kill Orders
    • Duration of Orders
  • high availability and low latency for millions of users
    • async design: use message queues extensively (side effect: engineers working on one service publish to a queue and do not even know where exactly the downstream service is, and hence cannot do evil).

Architecture

(Architecture diagram.) Clients go through a Reverse Proxy and an API Gateway, which handles auth against the User Store with a Cache. The Orders service, backed by Stock Meta and external pricing, feeds the Order Matching service; matched orders settle through the clearing house, which connects to Payment (Bank, ACH, Visa, etc.), Balances & Bookkeeping, and Audit & Report.

Components and how they interact with each other

order matching system

  • shard by stock code
  • order's basic data model (other metadata are omitted): Order(id, stock, side, time, qty, price)
  • the core abstraction of the order book is the matching algorithm. There are a bunch of matching algorithms (ref to StackOverflow, ref to Medium).
  • example 1: price-time FIFO - a kind of 2D vector cast, or flattened, into a 1D vector
    • x-axis is price
    • y-axis is orders. Price/time priority queue, FIFO.
      • Buy-side: ascending in price, descending in time.
      • Sell-side: ascending in price, ascending in time.
    • in other words
      • Buy-side: the higher the price and the earlier the order, the nearer we should put it to the center of the matching.
      • Sell-side: the lower the price and the earlier the order, the nearer we should put it to the center of the matching.

(Diagram: the 2D price/time book flattened into a 1D line of prices along the x-axis, with each price level's time-ordered queue cast onto it.)

Id | Side | Time  | Qty | Price
---+------+-------+-----+------
#3 | SELL | 09:05 | 200 | 20.30
#1 | SELL | 09:01 | 100 | 20.30
#2 | SELL | 09:03 | 100 | 20.25
#5 | BUY  | 09:08 | 200 | 20.20
#4 | BUY  | 09:06 | 100 | 20.15
#6 | BUY  | 09:09 | 200 | 20.15

Order book from Coinbase Pro

The Single Stock-Exchange Simulator

  • example 2: pro-rata

pure pro-rata

How to implement the price-time FIFO matching algorithm?

  • shard by stock, CP over AP: one stock one partition
  • stateful in-memory tree-map
    • periodically iterate the tree-map to match orders (see the sketch after this list)
  • data persistence with Cassandra
  • in/out requests of the order matching services are made through messaging queues
  • failover
    • the in-memory tree-maps are snapshotted into the database
    • in case of error, recover from the snapshot and de-duplicate with the cache
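A minimal sketch of the in-memory matching loop for one shard, in Python. Heaps stand in for the tree-map described above; persistence, snapshotting, and queue I/O are left out, and all names are illustrative.

```python
import heapq
import itertools
from dataclasses import dataclass, field

_seq = itertools.count()  # global tie-breaker that preserves arrival (time) order

@dataclass
class Order:
    id: str
    stock: str
    side: str      # "BUY" or "SELL"
    qty: int
    price: float
    seq: int = field(default_factory=lambda: next(_seq))

class OrderBook:
    """Price-time FIFO book for a single stock (one shard)."""

    def __init__(self):
        self.buys = []   # max-heap by price, FIFO within a price level: (-price, seq, order)
        self.sells = []  # min-heap by price, FIFO within a price level: (price, seq, order)

    def submit(self, order, on_trade):
        """Match the incoming order against the opposite side, then rest any remainder."""
        book, against = (self.buys, self.sells) if order.side == "BUY" else (self.sells, self.buys)
        while order.qty > 0 and against:
            _, _, best = against[0]  # best opposite order: lowest ask for a BUY, highest bid for a SELL
            crosses = order.price >= best.price if order.side == "BUY" else order.price <= best.price
            if not crosses:
                break
            fill = min(order.qty, best.qty)
            on_trade(order, best, fill, best.price)  # the resting order sets the trade price
            order.qty -= fill
            best.qty -= fill
            if best.qty == 0:
                heapq.heappop(against)
        if order.qty > 0:
            key = -order.price if order.side == "BUY" else order.price
            heapq.heappush(book, (key, order.seq, order))

book = OrderBook()
trade = lambda taker, maker, qty, price: print(f"{taker.id} x {maker.id}: {qty} @ {price}")
book.submit(Order("#2", "AAPL", "SELL", 100, 20.25), trade)
book.submit(Order("#5", "AAPL", "BUY", 200, 20.30), trade)  # fills 100 @ 20.25, rests 100 @ 20.30
```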

How to transmit data of the order book to the client-side in realtime?

  • websocket

How to support different kinds of orders?

  • the same SELL or BUY qty @ price entry in the treemap, with different creation setups and matching conditions (see the sketch after this list)
    • Market Orders: place the order at the last market price.
    • Limit Orders: place the order at a specific price.
    • Stop-Loss Orders: place the order at a specific price, and match it only under certain conditions.
    • Fill-or-Kill Orders: place the order at a specific price, but attempt to match it only once.
    • Duration of Orders: place the order at a specific price, but match it only within the given time span.
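A small sketch, in Python, of how the different order kinds can be normalized into the same qty @ price book entry with extra matching conditions. The make_order helper and its field names are illustrative, not part of the design above.

```python
import time

def make_order(kind, side, qty, *, price=None, stop=None, expires_at=None, last_market_price=None):
    """Normalize an order of any kind into a (side, qty, price) entry plus matching conditions."""
    if kind == "MARKET":
        price = last_market_price                       # Market Order: use the last market price
    entry = {"side": side, "qty": qty, "price": price}  # Limit Order: this alone is enough
    if kind == "STOP_LOSS":
        # becomes matchable only once the market reaches the stop price
        entry["activate_when"] = lambda market: market <= stop if side == "SELL" else market >= stop
    if kind == "FILL_OR_KILL":
        entry["match_once"] = True                      # try to match once; cancel whatever is left
    if expires_at is not None:
        entry["still_valid"] = lambda now=None: (now or time.time()) < expires_at  # Duration of Orders
    return entry

fok = make_order("FILL_OR_KILL", "BUY", 100, price=20.30)
```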

Orders Service

  • Preserves all active orders and the order history.
  • Writes to order matching when it receives a new order.
  • Receives matched orders and settles with the external clearing house (async external gateway call + a cronjob to sync the DB; see the sketch below).
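A rough sketch, in Python, of that async settlement flow: record the settlement first, fire the external call, and let a cronjob reconcile later. The db and clearing_gateway interfaces are hypothetical.

```python
import uuid

def settle_matched_order(db, clearing_gateway, matched):
    """Record intent, then call the clearing house asynchronously; the response arrives later."""
    settlement_id = str(uuid.uuid4())
    db.insert("settlements", {"id": settlement_id, "order_id": matched["order_id"], "status": "PENDING"})
    clearing_gateway.submit(settlement_id, matched)

def reconcile_settlements(db, clearing_gateway):
    """Cronjob: sync the DB with the clearing house for settlements still marked PENDING."""
    for s in db.query("settlements", status="PENDING"):
        result = clearing_gateway.get_status(s["id"])
        if result in ("SETTLED", "FAILED"):
            db.update("settlements", s["id"], {"status": result})
```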

References

Introduction to Architecture

· 3 min read

What is Architecture?

Architecture is the shape of a software system. To illustrate with a building:

  • Paradigm is the bricks.
  • Design principles are the rooms.
  • Components are the structure.

They all serve a specific purpose, just like hospitals treat patients and schools educate students.

Why Do We Need Architecture?

Behavior vs. Structure

Every software system provides two distinct values to stakeholders: behavior and structure. Software developers must ensure that both values are high.

==Due to the nature of their work, software architects focus more on the structure of the system rather than its features and functions.==

Ultimate Goal — ==Reduce the human resource costs required for adding new features==

Architecture serves the entire lifecycle of software systems, making them easy to understand, develop, test, deploy, and operate. Its goal is to minimize the human resource costs for each business use case.

O'Reilly's "Software Architecture" provides a great introduction to these five fundamental architectures.

1. Layered Architecture

Layered architecture is widely adopted and well-known among developers. Therefore, it is the de facto standard at the application level. If you are unsure which architecture to use, layered architecture is a good choice.

Examples:

  • TCP/IP model: Application Layer > Transport Layer > Internet Layer > Network Interface Layer
  • Facebook TAO: Network Layer > Cache Layer (follower + leader) > Database Layer

Pros and Cons:

  • Pros
    • Easy to use
    • Clear responsibilities
    • Testability
  • Cons
    • Large and rigid
      • Adjusting, extending, or updating the architecture requires changes across all layers, which can be quite tricky.

2. Event-Driven Architecture

Any change in state triggers an event in the system. Communication between system components is accomplished through events.

A simplified architecture includes a mediator, event queue, and channels. The diagram below illustrates a simplified event-driven architecture:

Examples:

  • QT: Signals and Slots
  • Payment infrastructure: As bank gateways often have high latency, asynchronous techniques are used in banking architecture.
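A minimal sketch, in Python, of the mediator + event queue + channels described above; the component and event names are made up.

```python
from collections import defaultdict, deque

class EventBus:
    """Minimal mediator + event queue + channels: components communicate only through events."""
    def __init__(self):
        self.queue = deque()
        self.channels = defaultdict(list)   # event type -> subscribed handlers

    def subscribe(self, event_type, handler):
        self.channels[event_type].append(handler)

    def publish(self, event_type, payload):
        self.queue.append((event_type, payload))

    def drain(self):
        while self.queue:
            event_type, payload = self.queue.popleft()
            for handler in self.channels[event_type]:
                handler(payload)

# Example: a state change in payments emits an event; subscribers react asynchronously.
bus = EventBus()
bus.subscribe("payment.settled", lambda p: print("notify user", p["user_id"]))
bus.publish("payment.settled", {"user_id": 42})
bus.drain()
```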

3. Microkernel Architecture (aka Plug-in Architecture)

The functionality of the software is distributed between a core and multiple plugins. The core contains only the most basic functionalities. Each plugin operates independently and implements shared interfaces to achieve different goals.

Examples:

  • Visual Studio Code and Eclipse
  • MINIX operating system
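A minimal plug-in sketch, in Python: a tiny core that only registers plug-ins implementing a shared interface. The SpellCheck plug-in is a made-up example.

```python
from abc import ABC, abstractmethod

class Plugin(ABC):
    """Shared interface every plug-in implements; the core knows nothing else about them."""
    @abstractmethod
    def handle(self, document: str) -> str: ...

class SpellCheck(Plugin):
    def handle(self, document: str) -> str:
        return document.replace("teh", "the")

class Core:
    """The core offers only the bare minimum: registering plug-ins and running them in order."""
    def __init__(self):
        self.plugins: list[Plugin] = []

    def register(self, plugin: Plugin) -> None:
        self.plugins.append(plugin)

    def process(self, document: str) -> str:
        for plugin in self.plugins:
            document = plugin.handle(document)
        return document

core = Core()
core.register(SpellCheck())
print(core.process("teh architecture"))   # -> "the architecture"
```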

4. Microservices Architecture

Large systems are decomposed into numerous microservices, each a separately deployable unit that communicates via RPCs.

uber architecture

Examples:

5. Space-Based Architecture

The name "Space-Based Architecture" comes from "tuple space," which implies a "distributed shared space." In space-based architecture, there are no databases or synchronized database access, thus avoiding database bottleneck issues. All processing units share copies of application data in memory. These processing units can be flexibly started and stopped.

Example: See Wikipedia

  • Primarily adopted by Java-based architectures: for example, JavaSpaces.

Introduction to Architecture

· 3 min read

What is architecture?

Architecture is the shape of the software system. Think of it as the big picture of a physical building.

  • paradigms are bricks.
  • design principles are rooms.
  • components are buildings.

Together they serve a specific purpose, just as a hospital is for curing patients and a school is for educating students.

Why do we need architecture?

Behavior vs. Structure

Every software system provides two different values to the stakeholders: behavior and structure. Software developers are responsible for ensuring that both those values remain high.

==Software architects are, by virtue of their job description, more focused on the structure of the system than on its features and functions.==

Ultimate Goal - ==saving human resources costs per feature==

Architecture serves the full lifecycle of the software system to make it easy to understand, develop, test, deploy, and operate. The goal is to minimize the human resources costs per business use-case.

The O’Reilly book Software Architecture Patterns by Mark Richards is a simple but effective introduction to these five fundamental architectures.

1. Layered Architecture

The layered architecture is the most common in adoption, well-known among developers, and hence the de facto standard for applications. If you do not know what architecture to use, use it.

Examples

  • TCP / IP Model: Application layer > transport layer > internet layer > network access layer
  • Facebook TAO: web layer > cache layer (follower + leader) > database layer

Pros and Cons

  • Pros
    • ease of use
    • separation of responsibility
    • testability
  • Cons
    • monolithic
      • hard to adjust, extend or update. You have to make changes to all the layers.

2. Event-Driven Architecture

A state change will emit an event to the system. All the components communicate with each other through events.

A simple project can combine the mediator, event queue, and channel. Then we get a simplified architecture:

Examples

  • QT: Signals and Slots
  • Payment Infrastructure: Bank gateways usually have very high latencies, so they adopt async technologies in their architecture design.

3. Micro-kernel Architecture (aka Plug-in Architecture)

The software's responsibilities are divided into one "core" and multiple "plugins". The core contains the bare minimum functionality. Plugins are independent of each other and implement shared interfaces to achieve different goals.

Examples

  • Visual Studio Code, Eclipse
  • MINIX operating system

4. Microservices Architecture

A massive system is decoupled into multiple microservices, each of which is a separately deployable unit, and they communicate with each other via RPCs.

uber architecture

Examples

5. Space-based Architecture

This pattern gets its name from "tuple space", which means “distributed shared memory". There is no database or synchronous database access, and thus no database bottleneck. All the processing units share the replicated application data in memory. These processing units can be started up and shut down elastically.

Examples: See Wikipedia

  • Mostly adopted among Java users: e.g., JavaSpaces

Toutiao Recommendation System: P2 Content Analysis

· 3 min read

In Toutiao Recommendation System: P1 Overview, we know that content analysis and data mining of user tags are the cornerstones of the recommendation system.

What is the content analysis?

content analysis = derive intermediate data from raw articles and user behaviors.

Take articles for example. To model user interests, we need to tag contents and articles. To associate a user with the interests of the “Internet” tag, we need to know whether a user reads an article with the “Internet” tag.

Why are we analyzing those raw data?

We do it for the following reasons:

  1. Tagging users (user profile)
    • Tagging users who liked articles with “Internet” tag. Tagging users who liked articles with “xiaomi” tag.
  2. Recommending contents to users by tags
    • Pushing “meizu” contents to users with “meizu” tag. Pushing “dota” contents to users with “dota” tag.
  3. Preparing contents by topics
    • Put “Bundesliga” articles to “Bundesliga topic”. Put “diet” articles to “diet topic”.

Case Study: Analysis Result of an Article

Here is an example of an “article features” page. There are article features like categorizations, keywords, topics, and entities.

Analysis Result of an Article

Analysis Result of an Article: Details

What are the article features?

  1. Semantic Tags: humans predefine these tags with explicit meanings.

  2. Implicit Semantics, including topics and keywords. Topic features describe the statistical distribution of words; keywords are generated by certain rules.

  3. Similarity. Duplicate recommendations were once the most severe feedback we got from our customers.

  4. Time and location.

  5. Quality. Is the content abusive, porn, ads, or “chicken soup for the soul”?

Article features are important

  • It is not true that a recommendation system cannot work at all without article features. Amazon, Walmart, and Netflix can recommend by collaborative filtering.
  • However, in a news product, users consume contents of the same day. Bootstrapping without article features is hard, and collaborative filtering cannot help with bootstrapping.
    • The finer the granularity of the article features, the better the ability to bootstrap.

More on Semantic Tags

We divide features of semantic tags into three levels:

  1. Categorizations: used in the user profile, filtering contents in topics, recommend recall, recommend features
  2. Concepts: used in filtering contents in topics, searching tags, recommend recall (like)
  3. Entities: used in filtering contents in topics, searching tags, recommend recall (like)

Why divide them into different levels? So that tags can capture articles at different granularities.

  1. Categorizations: full in coverage, low in accuracy.
  2. Concepts: medium in coverage, medium in accuracy.
  3. Entities: low in coverage, high in accuracy. They only cover popular people, organizations, and products in each area.

Categorizations and concepts share the same technical infrastructure.

Why do we need semantic tags?

  • Implicit semantics
    • have been functioning well.
    • cost much less than semantic tags.
  • But topics and interests need a clearly-defined tagging system.
  • Semantic tags also evaluate a company's capability in NLP technology.

Document classification

Classification hierarchy

  1. Root
  2. Science, sports, finance, entertainment
  3. Football, tennis, table tennis, track and field, swimming
  4. International, domestic
  5. Team A, team B

Classifiers:

  • SVM
  • SVM + CNN
  • SVM + CNN + RNN
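A minimal sketch of the SVM baseline with scikit-learn; the categories and training documents below are made up for illustration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["messi scores twice in the derby", "new GPU accelerates model training",
        "central bank raises interest rates", "box office hits a new record"]
labels = ["sports", "science", "finance", "entertainment"]

# TF-IDF features + linear SVM: the simplest classifier in the SVM / SVM+CNN / SVM+CNN+RNN ladder.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(docs, labels)
print(classifier.predict(["the striker signed a new contract"]))  # likely "sports"
```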

Calculating relevance

  1. Lexical analysis for articles
  2. Filtering keywords
  3. Disambiguation
  4. Calculating relevance

Stream and Batch Processing Frameworks

· 3 min read

Why Do We Need Such Frameworks?

  • To process more data in a shorter amount of time.
  • To unify fault tolerance in distributed systems.
  • To simplify task abstractions to meet changing business requirements.
  • Suitable for bounded datasets (batch processing) and unbounded datasets (stream processing).

Brief History of Batch and Stream Processing Development

  1. Hadoop and MapReduce. Google made batch processing as simple as MapReduce result = pairs.map((pair) => (morePairs)).reduce(somePairs => lessPairs) in a distributed system.
  2. Apache Storm and directed graph topologies. MapReduce does not represent iterative algorithms well. Therefore, Nathan Marz abstracted stream processing into a graph structure composed of spouts and bolts.
  3. Spark in-memory computation. Reynold Xin pointed out that Spark uses ten times fewer machines than Hadoop while being three times faster when processing the same data.
  4. Google Dataflow based on Millwheel and FlumeJava. Google uses a windowed API to support both batch and stream processing simultaneously.
  5. Flink, which stands out because of:
    1. its quick adoption of the programming model of ==Google Dataflow== and Apache Beam.
    2. its efficient implementation of the Chandy-Lamport checkpointing algorithm.

These Frameworks

Architecture Choices

To meet the above demands with commodity machines, there are several popular distributed system architectures...

  • Master-slave (centralized): Apache Storm + Zookeeper, Apache Samza + YARN
  • P2P (decentralized): Apache S4

Features

  1. DAG Topology for iterative processing - for example, GraphX in Spark, topologies in Apache Storm, DataStream API in Flink.
  2. Delivery Guarantees. How to ensure the reliability of data delivery between nodes? At least once / at most once / exactly once.
  3. Fault Tolerance. Implement fault tolerance using cold/warm/hot standby, checkpointing, or active-active.
  4. Windowed API for unbounded datasets. For example, stream windows in Apache Flink, window functions in Spark, and windowing in Apache Beam.
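A minimal sketch, in Python, of what a windowed API does over an unbounded stream: group events into fixed (tumbling) windows and aggregate per window. The window size and the event stream are illustrative.

```python
from collections import defaultdict

def tumbling_windows(events, size_seconds=60):
    """events: iterable of (timestamp, value) from an unbounded stream.
    Group values into fixed, non-overlapping windows of size_seconds."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = ts - (ts % size_seconds)
        windows[window_start].append(value)
    return windows

size = 60
stream = [(0, 1), (30, 2), (61, 3), (95, 4), (130, 5)]
for start, values in sorted(tumbling_windows(stream, size).items()):
    print(f"[{start}, {start + size}) -> sum={sum(values)}")
```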

Comparison Table of Different Architectures

Architecture                | Storm         | Storm-trident | Spark        | Flink
----------------------------+---------------+---------------+--------------+--------------
Model                       | Native        | Micro-batch   | Micro-batch  | Native
Guarantees                  | At least once | Exactly once  | Exactly once | Exactly once
Fault Tolerance             | Record ACK    | Record ACK    | Checkpoint   | Checkpoint
Overhead of Fault Tolerance | High          | Medium        | Medium       | Low
Latency                     | Very low      | High          | High         | Low
Throughput                  | Low           | Medium        | High         | High

Toutiao Recommendation System: P1 Overview

· 6 min read

What are we optimizing for? User Satisfaction

We are finding the best function below to maximize user satisfaction.

user satisfaction = function(content, user profile, context)
  1. Content: features of articles, videos, UGC short videos, Q&As, etc.
  2. User profile: interests, occupation, age, gender and behavior patterns, etc.
  3. Context: Mobile users in contexts of workspace, commuting, traveling, etc.

How to evaluate the satisfaction?

  1. Measurable Goals, e.g.

    • click through rate
    • Session duration
    • upvotes
    • comments
    • reposts
  2. Hard-to-measure Goals:

    • Frequency control of ads and special-typed contents (Q&A)
    • Frequency control of vulgar content
    • Reducing clickbait, low quality, disgusting content
    • Enforcing / pinning / highly-weighting important news
    • Down-weighting contents from low-level accounts

How to optimize for those goals? Machine Learning Models

It is a typical supervised machine learning problem to find the best function above. To implement the system, we have these algorithms:

  1. Collaborative Filtering
  2. Logistic Regression
  3. DNN
  4. Factorization Machine
  5. GBDT

A world-class recommendation system is supposed to have the flexibility to A/B-test and combine multiple algorithms above. It is now popular to combine LR and DNN. Facebook used both LR and GBDT years ago.

How do models observe and measure the reality? Feature engineering

  1. Correlation, between the content's characteristics and the user's interests. Explicit correlations include keywords, categories, sources, genres. Implicit correlations can be extracted from the user's vector or the item's vector from models like FM.

  2. Environmental features such as geolocation and time. They can be used as bias features or for building correlations on top of them.

  3. Hot trends. There are global hot trends, categorical hot trends, topic hot trends, and keyword hot trends. Hot trends are very useful for solving the cold-start issue when we have little information about a user.

  4. Collaborative features, which help avoid the situation where recommended content gets more and more concentrated. Collaborative filtering does not analyze each user's history separately, but finds users' similarity based on their behavior: clicks, interests, topics, keywords, or even implicit vectors. By finding similar users, it can expand the diversity of recommended content.

Large-scale Training in Realtime

  • Users like to see news feed updated in realtime according to what we track from their actions.
  • Use Apache Storm to process training data (clicks, impressions, faves, shares) in realtime.
  • Collect data until a threshold is reached and then update the recommendation model.
  • Store model parameters, like tens of billions of raw features and billions of vector features, in high-performance computing clusters.

They are implemented in the following steps:

  1. Online services record features in realtime.
  2. Write data into Kafka
  3. Ingest data from Kafka to Storm
  4. Populate full user profiles and prepare samples
  5. Update model parameters according to the latest samples
  6. Online modeling gains new knowledge

How to further reduce the latency? Recall Strategy

It is impossible to predict everything with the model, considering the super-large scale of all the contents. Therefore, we need recall strategies to focus on a representative subset of the data. Performance is critical here, and the timeout is 50 ms.

recall strategy

Among all the recall strategies, we take the InvertedIndex<Key, List<Article>> approach.

The Key can be topic, entity, source, etc.

Tags of Interest | Relevance | List of Documents
-----------------+-----------+------------------
E-commerce       | 0.3       |
Fun              | 0.2       |
History          | 0.2       |
Military         | 0.1       |
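A minimal sketch, in Python, of recalling candidates from such an inverted index using the user's tags of interest; the article ids and the scoring rule are illustrative.

```python
from collections import defaultdict

# Hypothetical inverted index: tag -> list of article ids.
inverted_index = defaultdict(list, {
    "e-commerce": [101, 102], "fun": [103], "history": [104, 105], "military": [106],
})

def recall(user_tags):
    """user_tags: {tag: relevance}. Return candidate article ids scored by tag relevance."""
    candidates = defaultdict(float)
    for tag, relevance in user_tags.items():
        for article_id in inverted_index[tag]:
            candidates[article_id] = max(candidates[article_id], relevance)
    # keep only the best candidates for the ranking model downstream (the 50 ms budget)
    return sorted(candidates, key=candidates.get, reverse=True)

print(recall({"e-commerce": 0.3, "fun": 0.2, "history": 0.2, "military": 0.1}))
```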

Data Dependencies

  • Features depend on tags of the user side and the content side.
  • The recall strategy depends on tags of the user side and the content side.
  • Content analysis and data mining of user tags are the cornerstones of the recommendation system.

Stream and Batch Processing Frameworks

· 2 min read

Why such frameworks?

  • process high-throughput data with low latency.
  • fault-tolerance in distributed systems.
  • generic abstraction to serve volatile business requirements.
  • for bounded data set (batch processing) and for unbounded data set (stream processing).

Brief history of batch/stream processing

  1. Hadoop and MapReduce. Google made batch processing as simple as MR result = pairs.map((pair) => (morePairs)).reduce(somePairs => lessPairs) in a distributed system.
  2. Apache Storm and DAG Topology. MR doesn’t efficiently express iterative algorithms. Thus Nathan Marz abstracted stream processing into a graph of spouts and bolts.
  3. Spark in-memory Computing. Reynold Xin said Spark sorted the same data 3X faster using 10X fewer machines compared to Hadoop.
  4. Google Dataflow based on Millwheel and FlumeJava. Google supports both batch and streaming computing with the windowing API.
  5. Flink, which stands out because of:
    1. its fast adoption of the ==Google Dataflow==/Beam programming model.
    2. its highly efficient implementation of Chandy-Lamport checkpointing.

How?

Architectural Choices

To serve the requirements above with commodity machines, the streaming frameworks use distributed systems with these architectures...

  • master-slave (centralized): apache storm with zookeeper, apache samza with YARN.
  • P2P (decentralized): apache s4.

Features

  1. DAG Topology for Iterative Processing. e.g. GraphX in Spark, topologies in Apache Storm, DataStream API in Flink.
  2. Delivery Guarantees. How is data delivery between nodes guaranteed? at-least once / at-most once / exactly once.
  3. Fault-tolerance. Using cold/warm/hot standby, checkpointing, or active-active.
  4. Windowing API for unbounded data set. e.g. Stream Windows in Apache Flink. Spark Window Functions. Windowing in Apache Beam.

Comparison

Framework                   | Storm         | Storm-trident | Spark        | Flink
----------------------------+---------------+---------------+--------------+--------------
Model                       | native        | micro-batch   | micro-batch  | native
Guarantees                  | at-least-once | exactly-once  | exactly-once | exactly-once
Fault-tolerance             | record-ack    | record-ack    | checkpoint   | checkpoint
Overhead of fault-tolerance | high          | medium        | medium       | low
Latency                     | very low      | high          | high         | low
Throughput                  | low           | medium        | high         | high

Fraud Detection with Semi-supervised Learning

· 4 min read

Clarify Requirements

Calculate risk probability scores in realtime and make decisions along with a rule engine to prevent ATO (account takeovers) and Botnet attacks.

Train clustering features with online and offline pipelines

  1. Source from website logs, auth logs, user actions, transactions, and high-risk accounts on the watch list
  2. Track event data in Kafka topics
  3. Process events and prepare clustering features

Realtime scoring and rule-based decision

  1. Assess a comprehensive risk score for online services

  2. Maintain flexibility with manual configuration in a rule engine

  3. Share or use the insights in online services

ATOs ranked from easy to hard to detect

  1. from single IP
  2. from IPs on the same device
  3. from IPs across the world
  4. from 100k IPs
  5. attacks on specific accounts
  6. phishing and malware

Challenges

  • Manual feature selection
  • Feature evolution in adversarial environment
  • Scalability
  • No online DBSCAN

High-level Architecture

Core Components and Workflows

Semi-supervised learning = unlabeled data + small amount of labeled data

Why? better learning accuracy than unsupervised learning + less time and costs than supervised learning

Training: To prepare clustering features in database

  • Streaming Pipeline on Spark:
    • Runs continuously in real-time.
    • Performs feature normalization and categorical transformation on the fly.
      • Feature Normalization: Scale your numeric features (e.g., age, income) so that they are between 0 and 1.
      • Categorical Feature Transformation: Apply one-hot encoding or another transformation to convert categorical features into a numeric format suitable for the machine learning model.
    • Uses Spark MLlib’s K-means to cluster streaming data into groups.
      • After running k-means and forming clusters, you might find that certain clusters have more instances of fraud.
      • Once you’ve labeled a cluster as fraudulent based on historical data or expert knowledge, you can use that cluster assignment during inference. Any new data point assigned to that fraudulent cluster can be flagged as suspicious.
  • Hourly Cronjob Pipeline:
    • Runs periodically every hour (batch processing).
    • Applies thresholding to identify anomalies based on results from the clustering model.
    • Tunes parameters of the DBSCAN algorithm to improve clustering and anomaly detection.
    • Uses DBSCAN from scikit-learn to find clusters and detect outliers in batch data.
      • DBSCAN, which can detect outliers, might identify clusters of regular transactions and separate them from noise, which could be unusual, potentially fraudulent transactions.
      • Transactions in the noisy or outlier regions (points that don’t belong to any dense cluster) can be flagged as suspicious.
      • After identifying a cluster as fraudulent, DBSCAN helps detect patterns of fraud even in irregularly shaped transaction distributions.
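A minimal sketch of that batch step with scikit-learn: normalize features, run DBSCAN, and flag noise points (label -1) as suspicious. The synthetic data and the eps / min_samples parameters are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
normal = rng.normal(loc=[50, 1], scale=[5, 0.2], size=(200, 2))   # e.g. (amount, logins/hour)
attacks = rng.normal(loc=[500, 30], scale=[50, 5], size=(5, 2))
events = np.vstack([normal, attacks])

features = MinMaxScaler().fit_transform(events)          # feature normalization to [0, 1]
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(features)

suspicious = np.where(labels == -1)[0]                    # noise / outlier points
print(f"flagged {len(suspicious)} of {len(events)} events as suspicious")
```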

Serving

The serving layer is where the rubber meets the road - where we turn our machine learning models and business rules into actual fraud prevention decisions. Here's how it works:

  • Fraud Detection Scoring Service:
    • Takes real-time features extracted from incoming requests
    • Applies both clustering models (K-means from streaming and DBSCAN from batch)
    • Combines scores with streaming counters (like login attempts per IP)
    • Outputs a unified risk score between 0 and 1
  • Rule Engine:
    • Acts as the "brain" of the system
    • Combines ML scores with configurable business rules
    • Examples of rules:
      • If risk score > 0.8 AND user is accessing from new IP → require 2FA
      • If risk score > 0.9 AND account is high-value → block transaction
    • Rules are stored in a database and can be updated without code changes (see the sketch after this list)
    • Provides an admin portal for security teams to adjust rules
  • Integration with Other Services:
    • Exposes REST APIs for real-time scoring
    • Publishes results to streaming counters for monitoring
    • Feeds decisions back to the training pipeline to improve model accuracy
  • Observability:
    • Tracks key metrics like false positive/negative rates
    • Monitors model drift and feature distribution changes
    • Provides dashboards for security analysts to investigate patterns
    • Logs detailed information for post-incident analysis
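A small sketch, in Python, of how the rule engine can combine the ML risk score with configurable rules. The thresholds, context fields, and actions mirror the examples above but are otherwise illustrative.

```python
# Rules would normally come from the database behind the admin portal; hard-coded here for brevity.
RULES = [
    {"name": "require_2fa",
     "when": lambda ctx: ctx["risk_score"] > 0.8 and ctx["new_ip"],
     "action": "REQUIRE_2FA"},
    {"name": "block_high_value",
     "when": lambda ctx: ctx["risk_score"] > 0.9 and ctx["high_value_account"],
     "action": "BLOCK"},
]

def decide(ctx):
    """Return the first matching action, defaulting to ALLOW."""
    for rule in RULES:
        if rule["when"](ctx):
            return rule["action"]
    return "ALLOW"

print(decide({"risk_score": 0.85, "new_ip": True, "high_value_account": False}))  # REQUIRE_2FA
```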