
32 posts tagged with "system-design"


Introduction to Architecture

· 3 min read

What is Architecture?

Architecture is the shape of a software system. To illustrate with a building:

  • Paradigms are the bricks.
  • Design principles are the rooms.
  • Components are the buildings.

They all serve a specific purpose, just like hospitals treat patients and schools educate students.

Why Do We Need Architecture?

Behavior vs. Structure

Every software system provides two distinct values to stakeholders: behavior and structure. Software developers must ensure that both values are high.

==Due to the nature of their work, software architects focus more on the structure of the system rather than its features and functions.==

Ultimate Goal — ==Reduce the human resource costs required for adding new features==

Architecture serves the entire lifecycle of software systems, making them easy to understand, develop, test, deploy, and operate. Its goal is to minimize the human resource costs for each business use case.

O'Reilly's "Software Architecture" provides a great introduction to these five fundamental architectures.

1. Layered Architecture

Layered architecture is widely adopted and well-known among developers. Therefore, it is the de facto standard at the application level. If you are unsure which architecture to use, layered architecture is a good choice.

Examples:

  • TCP/IP model: Application Layer > Transport Layer > Internet Layer > Network Interface Layer
  • Facebook TAO: Network Layer > Cache Layer (follower + leader) > Database Layer

Pros and Cons:

  • Pros
    • Easy to use
    • Clear responsibilities
    • Testability
  • Cons
    • Large and rigid
      • Adjusting, extending, or updating the architecture requires changes across all layers, which can be quite tricky.

2. Event-Driven Architecture

Any change in state triggers an event in the system. Communication between system components is accomplished through events.

A simplified event-driven architecture includes a mediator, an event queue, and channels, as sketched below.
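Here is a minimal Python sketch of that trio: a mediator dispatching events from a queue to subscribed channel handlers (all names, such as `Mediator` and `order_placed`, are hypothetical, not from any specific framework):

```python
from collections import defaultdict
from queue import Queue

class Mediator:
    """Routes each queued event to the channel handlers subscribed to it."""

    def __init__(self):
        self.queue = Queue()               # the event queue
        self.channels = defaultdict(list)  # event type -> subscribed handlers

    def subscribe(self, event_type, handler):
        self.channels[event_type].append(handler)

    def publish(self, event_type, payload):
        self.queue.put((event_type, payload))  # any state change emits an event

    def run_once(self):
        event_type, payload = self.queue.get()
        for handler in self.channels[event_type]:
            handler(payload)                   # components react, decoupled

mediator = Mediator()
mediator.subscribe("order_placed", lambda order: print("charge", order))
mediator.subscribe("order_placed", lambda order: print("ship", order))
mediator.publish("order_placed", {"id": 42})
mediator.run_once()  # both handlers react to the same state change
```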

Examples:

  • QT: Signals and Slots
  • Payment infrastructure: As bank gateways often have high latency, asynchronous techniques are used in banking architecture.

3. Microkernel Architecture (aka Plug-in Architecture)

The functionality of the software is distributed between a core and multiple plugins. The core contains only the most basic functionalities. Each plugin operates independently and implements shared interfaces to achieve different goals.
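Here is a minimal Python sketch of the pattern: a core that only registers plugins and pipes data through a shared `Plugin` interface (all names are hypothetical; real systems discover plugins dynamically):

```python
from abc import ABC, abstractmethod

class Plugin(ABC):
    """The shared interface every plugin implements."""
    @abstractmethod
    def handle(self, text: str) -> str: ...

class UpperCasePlugin(Plugin):
    def handle(self, text: str) -> str:
        return text.upper()

class ReversePlugin(Plugin):
    def handle(self, text: str) -> str:
        return text[::-1]

class Core:
    """The core only registers plugins and pipes data through them."""
    def __init__(self):
        self.plugins: list[Plugin] = []

    def register(self, plugin: Plugin):
        self.plugins.append(plugin)

    def process(self, text: str) -> str:
        for plugin in self.plugins:
            text = plugin.handle(text)
        return text

core = Core()
core.register(UpperCasePlugin())
core.register(ReversePlugin())
print(core.process("microkernel"))  # LENREKORCIM
```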

Examples:

  • Visual Studio Code and Eclipse
  • MINIX operating system

4. Microservices Architecture

Large systems are decomposed into numerous microservices, each a separately deployable unit that communicates via RPCs.

Example: Uber's backend, shown below.

uber architecture

5. Space-Based Architecture

The name "Space-Based Architecture" comes from "tuple space," which implies a "distributed shared space." In space-based architecture, there are no databases or synchronized database access, thus avoiding database bottleneck issues. All processing units share copies of application data in memory. These processing units can be flexibly started and stopped.

Example: See Wikipedia

  • Primarily adopted by Java-based architectures: for example, JavaSpaces.

Toutiao Recommendation System: P2 Content Analysis

· 3 min read

In Toutiao Recommendation System: P1 Overview, we saw that content analysis and the data mining of user tags are the cornerstones of the recommendation system.

What is content analysis?

content analysis = derive intermediate data from raw articles and user behaviors.

Take articles as an example. To model user interests, we need to tag content. For instance, to associate a user with the interests of the “Internet” tag, we need to know whether the user reads articles tagged “Internet”.

Why do we analyze the raw data?

We do it for three reasons:

  1. Tagging users (user profile)
    • Tag users who liked articles with the “Internet” tag; tag users who liked articles with the “xiaomi” tag.
  2. Recommending content to users by tags
    • Push “meizu” content to users with the “meizu” tag; push “dota” content to users with the “dota” tag.
  3. Grouping content by topics
    • Put “Bundesliga” articles into the “Bundesliga” topic; put “diet” articles into the “diet” topic.

Case Study: Analysis Result of an Article

Here is an example of an “article features” page, showing features such as categorizations, keywords, topics, and entities.

Analysis Result of an Article

Analysis Result of an Article: Details

What are the article features?

  1. Semantic tags: humans predefine these tags with explicit meanings.

  2. Implicit semantics, including topics and keywords. Topic features describe the statistics of words; keywords are generated by certain rules.

  3. Similarity. Duplicate recommendations were once the most severe feedback we got from our users.

  4. Time and location.

  5. Quality. Is the content abusive, porn, ads, or “chicken soup for the soul”?

Article features are important

  • It is not true that a recommendation system cannot work at all without article features: Amazon, Walmart, and Netflix can recommend by collaborative filtering.
  • However, in a news product, users consume same-day content, so bootstrapping without article features is hard. Collaborative filtering cannot help with bootstrapping.
    • The finer the granularity of the article features, the better the ability to bootstrap.

More on Semantic Tags

We divide features of semantic tags into three levels:

  1. Categorizations: used in the user profile, filtering content in topics, recommendation recall, and recommendation features
  2. Concepts: used in filtering content in topics, searching tags, and recommendation recall (like)
  3. Entities: used in filtering content in topics, searching tags, and recommendation recall (like)

Why divide them into different levels? So that tags can capture articles at different granularities.

  1. Categorizations: full in coverage, low in accuracy.
  2. Concepts: medium in coverage, medium in accuracy.
  3. Entities: low in coverage, high in accuracy. They only cover hot people, organizations, and products in each area.

Categorizations and concepts share the same technical infrastructure.

Why do we need semantic tags?

  • Implicit semantics
    • have been functioning well.
    • cost much less than semantic tags.
  • But topics and interests need a clearly defined tagging system.
  • Semantic tags are also a benchmark of a company's NLP capability.

Document classification

Classification hierarchy

  1. Root
  2. Science, sports, finance, entertainment
  3. Football, tennis, table tennis, track and field, swimming
  4. International, domestic
  5. Team A, team B

Classifiers:

  • SVM
  • SVM + CNN
  • SVM + CNN + RNN

Calculating relevance

  1. Lexical analysis for articles
  2. Filtering keywords
  3. Disambiguation
  4. Calculating relevance (a sketch follows below)
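As a toy illustration of these four steps in Python (the tokenizer, stop-word list, and sense table are made-up stand-ins for real lexical analysis and disambiguation models):

```python
STOP_WORDS = {"the", "a", "of"}
SENSES = {"apple": "apple_inc"}   # toy disambiguation table

def lexical_analysis(article: str) -> list[str]:
    return article.lower().split()                      # 1. tokenize

def filter_keywords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOP_WORDS]   # 2. filter keywords

def disambiguate(tokens: list[str]) -> list[str]:
    return [SENSES.get(t, t) for t in tokens]           # 3. resolve word senses

def relevance(tokens: list[str], tag: str) -> float:
    return tokens.count(tag) / max(len(tokens), 1)      # 4. score tag vs. article

tokens = disambiguate(filter_keywords(lexical_analysis("The apple keynote")))
print(relevance(tokens, "apple_inc"))  # 0.5
```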

Stream and Batch Processing Frameworks

· 2 min read

Why Do We Need Such Frameworks?

  • To process more data in a shorter amount of time.
  • To unify fault tolerance in distributed systems.
  • To simplify task abstractions to meet changing business requirements.
  • Suitable for bounded datasets (batch processing) and unbounded datasets (stream processing).

Brief History of Batch and Stream Processing Development

  1. Hadoop and MapReduce. Google made batch processing in a distributed system as simple as `result = pairs.map((pair) => (morePairs)).reduce(somePairs => lessPairs)` (see the word-count sketch after this list).
  2. Apache Storm and directed graph topologies. MapReduce does not represent iterative algorithms well. Therefore, Nathan Marz abstracted stream processing into a graph structure composed of spouts and bolts.
  3. Spark in-memory computation. Reynold Xin pointed out that Spark uses ten times fewer machines than Hadoop while being three times faster when processing the same data.
  4. Google Dataflow based on Millwheel and FlumeJava. Google uses a windowed API to support both batch and stream processing simultaneously.
  5. Flink. Flink quickly adopted the programming model of ==Google Dataflow== and Apache Beam, and implemented the Chandy-Lamport checkpointing algorithm efficiently.
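As a toy illustration of step 1's shape, here is a single-process word count in Python that follows the same map, shuffle, and reduce phases (real MapReduce shards each phase across machines):

```python
from itertools import groupby
from operator import itemgetter

docs = ["a b a", "b c"]

# map: each document emits (word, 1) pairs
pairs = [(word, 1) for doc in docs for word in doc.split()]

# shuffle: group the pairs by key
pairs.sort(key=itemgetter(0))

# reduce: fewer pairs, one (word, count) per key
counts = {k: sum(v for _, v in g) for k, g in groupby(pairs, key=itemgetter(0))}
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```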

How Do These Frameworks Work?

Architecture Choices

To meet the above demands on commodity machines, there are several popular distributed-system architectures:

  • Master-slave (centralized): Apache Storm + Zookeeper, Apache Samza + YARN
  • P2P (decentralized): Apache S4

Features

  1. DAG Topology for iterative processing - for example, GraphX in Spark, topologies in Apache Storm, DataStream API in Flink.
  2. Delivery Guarantees. How to ensure the reliability of data delivery between nodes? At least once / at most once / exactly once.
  3. Fault Tolerance. Implement fault tolerance using cold/warm/hot standby, checkpointing, or active-active.
  4. Windowed API for unbounded datasets, e.g., streaming windows in Apache Flink, window functions in Spark, and windowing in Apache Beam (see the sketch below).
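To make the windowing idea concrete, here is a hand-rolled tumbling-window count in Python (a toy stand-in for those APIs; it assumes events arrive in timestamp order and ignores lateness and watermarks):

```python
from collections import Counter

WINDOW_SECONDS = 60

def window_counts(events):
    """events: iterable of (epoch_seconds, key); yields (window_start, counts)."""
    current_window, counts = None, Counter()
    for ts, key in events:
        window = ts - ts % WINDOW_SECONDS          # tumbling-window start
        if current_window is not None and window != current_window:
            yield current_window, dict(counts)     # close the previous window
            counts = Counter()
        current_window = window
        counts[key] += 1
    if current_window is not None:
        yield current_window, dict(counts)         # flush the last window

events = [(0, "click"), (30, "click"), (61, "view")]
print(list(window_counts(events)))
# [(0, {'click': 2}), (60, {'view': 1})]
```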

Comparison Table of Different Architectures

| Architecture | Storm | Storm-Trident | Spark | Flink |
| --- | --- | --- | --- | --- |
| Model | Native | Micro-batch | Micro-batch | Native |
| Guarantees | At least once | Exactly once | Exactly once | Exactly once |
| Fault tolerance | Record ack | Record ack | Checkpoint | Checkpoint |
| Fault-tolerance overhead | High | Medium | Medium | Low |
| Latency | Very low | High | High | Low |
| Throughput | Low | Medium | High | High |

Toutiao Recommendation System: P1 Overview

· 3 min read

What are we optimizing for? User Satisfaction

We are finding the best function below to maximize user satisfaction.

user satisfaction = function(content, user profile, context)
  1. Content: features of articles, videos, UGC short videos, Q&As, etc.
  2. User profile: interests, occupation, age, gender and behavior patterns, etc.
  3. Context: Mobile users in contexts of workspace, commuting, traveling, etc.

How to evaluate the satisfaction?

  1. Measurable goals, e.g.:

    • click-through rate
    • session duration
    • upvotes
    • comments
    • reposts
  2. Hard-to-measure goals:

    • frequency control of ads and special content types (Q&A)
    • frequency control of vulgar content
    • reducing clickbait and low-quality or disgusting content
    • enforcing / pinning / up-weighting important news
    • down-weighting content from low-quality accounts

How to optimize for those goals? Machine Learning Models

It is a typical supervised machine learning problem to find the best function above. To implement the system, we have these algorithms:

  1. Collaborative Filtering
  2. Logistic Regression
  3. DNN
  4. Factorization Machine
  5. GBDT

A world-class recommendation system is supposed to have the flexibility to A/B-test and combine multiple algorithms above. It is now popular to combine LR and DNN. Facebook used both LR and GBDT years ago.

How do models observe and measure the reality? Feature engineering

  1. Correlation between a content’s characteristics and a user’s interests. Explicit correlations include keywords, categories, sources, and genres. Implicit correlations can be extracted from user or item vectors produced by models like FM.

  2. Environmental features, such as geo location and time. These can be used as bias features or for building correlations on top of them.

  3. Hot trends. There are global hot trends, categorical hot trends, topic hot trends, and keyword hot trends. Hot trends are very useful for solving the cold-start issue when we have little information about a user.

  4. Collaborative features, which help avoid the situation where recommended content becomes more and more concentrated. Collaborative filtering does not analyze each user’s history separately; it finds similarity among users based on their behavior: clicks, interests, topics, keywords, or even implicit vectors. By finding similar users, it can expand the diversity of recommended content.

Large-scale Training in Realtime

  • Users like to see their news feed updated in realtime according to the actions we track.
  • Use Apache Storm to process data (clicks, impressions, favorites, shares) in realtime.
  • Collect data until a threshold is reached, and then update the recommendation model.
  • Store model parameters, like tens of billions of raw features and billions of vector features, in high-performance computing clusters.

They are implemented in the following steps:

  1. Online services record features in realtime.
  2. Write data into Kafka
  3. Ingest data from Kafka to Storm
  4. Populate full user profiles and prepare samples
  5. Update model parameters according to the latest samples
  6. The online model gains new knowledge.

How to further reduce the latency? Recall Strategy

It is impossible to predict everything with the model, considering the super-large scale of the content. Therefore, we need recall strategies to focus on a representative subset of the data. Performance is critical here, and the timeout is 50 ms.

recall strategy

Among all the recall strategies, we take the `InvertedIndex<Key, List<Article>>` (a sketch follows the table below).

The Key can be a topic, an entity, a source, etc.

| Tags of Interest | Relevance | List of Documents |
| --- | --- | --- |
| E-commerce | 0.3 | |
| Fun | 0.2 | |
| History | 0.2 | |
| Military | 0.1 | |
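Here is a toy Python sketch of that recall: union the posting lists of a user's top tags, weighted by relevance (the index contents and article names are made up):

```python
inverted_index = {
    "e-commerce": ["article_1", "article_2"],
    "fun":        ["article_3"],
    "history":    ["article_2", "article_4"],
}

user_tags = {"e-commerce": 0.3, "fun": 0.2, "history": 0.2}

def recall(user_tags, index, limit=10):
    scores = {}
    for tag, weight in user_tags.items():       # union the posting lists,
        for article in index.get(tag, []):      # weighted by tag relevance
            scores[article] = scores.get(article, 0.0) + weight
    return sorted(scores, key=scores.get, reverse=True)[:limit]

print(recall(user_tags, inverted_index))
# ['article_2', 'article_1', 'article_3', 'article_4']
```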

Data Dependencies

  • Features depend on user-side and content-side tags.
  • The recall strategy depends on user-side and content-side tags.
  • Content analysis and the data mining of user tags are the cornerstones of the recommendation system.

Designing Uber Ride-Hailing Service

· 2 min read

Disclaimer: All content below is sourced from public resources or purely original. No confidential information regarding Uber is included here.

Requirements

  • Provide services for the global transportation market
  • Large-scale real-time scheduling
  • Backend design

Architecture

uber architecture

Why Microservices?

==Conway's Law==: The structure of a software system corresponds to the organizational structure of the company that builds it.

| | Monolithic ==Service== | Microservices |
| --- | --- | --- |
| When team size and codebase are small, productivity | ✅ High | ❌ Low |
| ==When team size and codebase are large, productivity== | ❌ Low | ✅ High (Conway's Law) |
| ==Quality requirements for engineering== | ❌ High (inadequately skilled developers can easily disrupt the entire system) | ✅ Low (runtime is isolated) |
| Dependency version upgrades | ✅ Fast (centralized management) | ❌ Slow |
| Multi-tenant support / production-staging state isolation | ✅ Easy | ❌ Difficult (each service must either 1) establish a staging environment connected to other services in staging or 2) support multi-tenancy across request contexts and data storage) |
| Debuggability, assuming the same modules, parameters, and logs | ❌ Low | ✅ High (if distributed tracing is available) |
| Latency | ✅ Low (local) | ❌ High (remote) |
| DevOps costs | ✅ Low (high cost of build tools) | ❌ High (capacity planning is difficult) |

Combining monolithic ==codebase== and microservices can leverage the strengths of both.

Scheduling Service

  • Consistent hashing, addressed by geohash (see the sketch after this list)
  • Data is transient in memory, so no need for duplication. (CAP: AP over CP)
  • Use single-threaded or locked sharding to prevent double scheduling
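Below is a minimal Python sketch of that idea: geohash cells are mapped onto a consistent-hash ring of scheduler shards (the node names, virtual-node count, and geohash are hypothetical; Uber's real dispatch layer is far richer):

```python
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring with virtual nodes for smoother balancing."""
    def __init__(self, nodes, vnodes=64):
        self._keys = sorted(h(f"{n}#{i}") for n in nodes for i in range(vnodes))
        self._map = {h(f"{n}#{i}"): n for n in nodes for i in range(vnodes)}

    def node_for(self, key: str) -> str:
        i = bisect.bisect(self._keys, h(key)) % len(self._keys)  # wrap around
        return self._map[self._keys[i]]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
geohash = "wx4g0"                  # a cell near a rider's location
print(ring.node_for(geohash))      # the shard that schedules this cell
```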

Payment Service

==The key is to have asynchronous design==, as ACID transaction payment systems across multiple systems often have very long latencies.

User Profile Service and Trip Record Service

  • Use caching to reduce latency
  • As 1) support for more countries and regions increases 2) user roles (drivers, riders, restaurant owners, diners, etc.) gradually expand, providing user profile services for these users also faces significant challenges.

Notification Push Service

  • Apple Push Notification Service (unreliable)
  • Google Cloud Messaging (GCM) (can detect successful delivery) or
  • SMS services are generally more reliable

A Closer Look at iOS Architecture Patterns

· 3 min read

Why Should We Care About Architecture?

The answer is: to reduce the human resources spent on each feature.

Mobile developers evaluate the quality of an architecture on three levels:

  1. Whether the responsibilities of different features are evenly distributed
  2. Whether it is easy to test
  3. Whether it is easy to use and maintain

| | Responsibility Distribution | Testability | Usability |
| --- | --- | --- | --- |
| Tight Coupling MVC | | | |
| Cocoa MVC | ❌ V and C are coupled | | ✅⭐ |
| MVP | ✅ Independent view lifecycle | | Average: more code |
| MVVM | Average: the View depends on UIKit | | Average |
| VIPER | ✅⭐️ | ✅⭐️ | |

Tight Coupling MVC

Traditional MVC

For example, in a multi-page web application, when you click a link to navigate to another page, the entire page reloads. The problem with this architecture is that the View is tightly coupled with the Controller and Model.

Cocoa MVC

Cocoa MVC is the architecture recommended by Apple for iOS developers. Theoretically, this architecture allows the Controller to decouple the Model from the View.

Cocoa MVC

However, in practice, Cocoa MVC encourages the use of massive view controllers, ultimately leading to the view controller handling all operations.

Realistic Cocoa MVC

Although testing such tightly coupled massive view controllers is quite difficult, Cocoa MVC performs the best in terms of development speed among the existing options.

MVP

In MVP, the Presenter has no relationship with the lifecycle of the view controller, allowing the view to be easily replaced. We can think of UIViewController as the View.

Variant of MVC

There is another type of MVP: MVP with data binding. As shown in the figure, the View is tightly coupled with the Model and Controller.

MVP

MVVM

MVVM is similar to MVP, but MVVM binds the View to the View Model.

MVVM

VIPER

Unlike the three-layer structure of MV(X), VIPER has a five-layer structure (View, Interactor, Presenter, Entity, and Routing). This structure provides good responsibility distribution but poorer maintainability.

VIPER

Compared to MV(X), VIPER has the following differences:

  1. The logic processing of the Model is transferred to the Interactor, so Entities have no logic and are purely data storage structures.
  2. ==UI-related business logic is handled in the Presenter, while data modification functions are handled in the Interactor==.
  3. VIPER introduces a routing module, Router, to implement inter-module navigation.

How to Stream Video to Mobile Devices Using HTTP? HTTP Live Streaming (HLS)

· One min read

Why Is Such a Protocol Needed?

Mobile video playback services using HTTP Live Streaming encounter the following issues:

  1. ==Limited memory and storage on mobile devices==.
  2. Due to unstable network connections and varying bandwidth, there is a need to ==dynamically adjust video quality during transmission==.

Solutions

  1. Server Side: In a typical setup, encoding hardware receives audio and video input, encodes it as H.264 video and AAC audio, and streams the output as an MPEG-2 Transport Stream.
    1. A software multiplexer then splits the raw output stream into a series of short media files (with lengths possibly around 10 seconds in .ts format).
    2. The multiplexer also maintains an index file (.m3u8 format) that contains a list of all media files.
    3. The generated media files and index file are published on a web server.
  2. Client Side: The client reads the index file, sequentially requests the necessary media files from the server, and smoothly plays the content of each short media file (a client-side sketch follows).
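A minimal client-side sketch in Python under these assumptions (the index URL is hypothetical, and `play` is a stub for the media decoder; real players also handle variant playlists and bitrate switching):

```python
import urllib.request

INDEX_URL = "https://example.com/stream/index.m3u8"  # hypothetical URL

def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def play(segment: bytes) -> None:
    """Stub: hand the ~10s .ts chunk to the media decoder."""

index = fetch(INDEX_URL).decode()
base = INDEX_URL.rsplit("/", 1)[0]

for line in index.splitlines():
    if line and not line.startswith("#"):   # non-comment lines name media files
        play(fetch(f"{base}/{line}"))       # e.g. fileSequence0.ts, in order
```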

Architecture

HLS Architecture

Designing Facebook's Photo Storage System

· 2 min read

Why Does Facebook Handle Its Own Photo Storage?

  • Petabyte-scale volume of blob data
  • Traditional NFS-based designs (where each image is stored as a file) face metadata bottlenecks: the sheer volume of metadata severely limits the metadata cache hit rate.
    • Here are the details:

For photo applications, most metadata, such as file permissions, is unused and wastes storage space. The larger overhead, however, is that a file's metadata must be read from disk into memory to locate the file itself. While this is negligible at small scale, when multiplied by billions of photos and several petabytes of data, accessing metadata becomes the throughput bottleneck.

Solution

By aggregating hundreds of thousands of images into a single Haystack storage file, the metadata burden is eliminated.

Structure

Facebook Photo Storage Architecture

Data Layout

Index file (for quick memory loading) + Haystack storage file containing many images.

Index file layout

index file layout 1

index file layout 2

Storage file

haystack store file

CRUD Operations

  • Create: Write to the storage file, then ==asynchronously== write to the index file, since indexing is not a critical step.
  • Delete: Soft-delete by setting the deleted bit in the flag field; hard-delete via periodic compaction.
  • Update: Append-only; if a duplicate key is encountered, the application reads the needle at the maximum offset.
  • Read: A read takes an offset, key, alternate key, cookie, and data size (see the sketch after this list).
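Here is a toy Python sketch of that layout: an append-only store file plus an in-memory index (the real Haystack needle also carries magic numbers, cookies, flags, and checksums):

```python
import os

index = {}  # key -> (offset, size); rebuilt quickly from the index file

def create(store, key: bytes, data: bytes):
    offset = store.seek(0, os.SEEK_END)    # append-only write
    store.write(data)
    index[key] = (offset, len(data))       # index update can be asynchronous

def read(store, key: bytes) -> bytes:
    offset, size = index[key]              # one memory lookup, one disk read
    store.seek(offset)
    return store.read(size)

def update(store, key: bytes, data: bytes):
    create(store, key, data)               # duplicate key: max offset wins

with open("haystack_store", "w+b") as store:
    create(store, b"photo1", b"...jpeg bytes...")
    print(read(store, b"photo1"))
```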

Use Cases

Upload

Photo Storage Upload

Download

Photo Storage Download

Designing a URL Shortener System

· 5 min read

Design a system that can convert URLs provided by users into short URLs, allowing users to access their original URLs (hereinafter referred to as long URLs) using these short URLs. Describe how this system operates, including but not limited to the following questions: How are short URLs allocated? How is the mapping between short URLs and long URLs stored? How is the redirection service implemented? How is access data stored?

Assumptions: The initial problem description does not include these assumptions. An excellent candidate will ask about system scale when given a specific design.

  • There are approximately tens of thousands of long URL domains.
  • The traffic for new long URLs is about 10,000,000 per day (100 per second).
  • The traffic for the redirection service using short URLs to access long URLs is about 10 billion per day (100,000 per second).
  • Remind the candidate that these are average figures - during peak times, these numbers can be much higher (one type of peak time is time-related, such as when users return home from work, and another type is event-related, such as during the Spring Festival Gala).
  • Recent data (e.g., today's data) should be collected in advance and should be available within five minutes when users want to view it.
  • Historical data should be calculated daily.

Assumptions

1 billion new URLs per day, 100 billion short URL accesses. The shorter the short URL, the better. Data presentation (real-time/daily/monthly/yearly).

URL Encoding

http://blog.codinghorror.com/url-shortening-hashes-in-practice/

Method 1: MD5 (128 bits, 32 hexadecimal digits; collisions: by the birthday paradox, expect one after about 2^(n/2) = 2^64 hashes). Shorter? Truncate to 64 bits (16 hexadecimal digits, collisions at 2^32), or use base 64.

  • Advantages: Hashing is relatively simple and easy to scale horizontally.
  • Disadvantages: Too long, how to handle expired URLs?

Method 2: Distributed ID generator (base 62: a–z, A–Z, 0–9, i.e., 62 characters; 62^7 ≈ 3.5 trillion IDs), with partitioning: each node holds a range of IDs (see the base-62 sketch after this list).

  • Advantages: Easier to eliminate expired URLs, shorter URLs.
  • Disadvantages: Coordination between different partitions (e.g., ZooKeeper).
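A minimal base-62 encoder/decoder in Python, assuming IDs come from such a generator (7 characters cover 62^7, roughly 3.5 trillion short URLs):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(n: int) -> str:
    """Turn a non-negative integer ID into a base-62 short code."""
    s = []
    while True:
        n, r = divmod(n, 62)
        s.append(ALPHABET[r])
        if n == 0:
            return "".join(reversed(s))

def decode(s: str) -> int:
    """Turn a short code back into the integer ID."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

short = encode(123456789)
assert decode(short) == 123456789
print(short)  # 8m0Kx
```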

Key-Value (KV) Storage

MySQL (10k requests per second: too slow, and we don't need a relational database) vs. key-value stores (100k requests per second: Redis, Memcached).

An excellent candidate will ask about the expected lifespan of short URLs and design a system that can automatically clean up expired short URLs.

Follow-Up

Question: How to generate short URLs?

  • A poor candidate might suggest using a single ID generator (single point of failure) or require coordination between ID generators for each ID generation. For example, using an auto-increment primary key in a database.
  • An acceptable candidate might suggest using md5 or some UUID generators that can generate IDs independently on some nodes. These methods can generate non-colliding IDs in a distributed system, allowing for the production of a large number of short URLs.
  • An excellent candidate will design a method using several ID generators, where each generator first reserves a block of the ID sequence from a central coordinator (e.g., ZooKeeper) and then allocates IDs from its block independently, requesting a new block when necessary (a sketch of this scheme follows).
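A minimal Python sketch of that block-reservation scheme (the in-process `Coordinator` is a stand-in for something like ZooKeeper):

```python
import itertools
import threading

class Coordinator:
    """Hands out disjoint ID blocks; one lock, contacted rarely."""
    def __init__(self, block_size=1000):
        self._next = itertools.count(0)
        self._lock = threading.Lock()
        self.block_size = block_size

    def lease_block(self):
        with self._lock:
            start = next(self._next) * self.block_size
        return start, start + self.block_size

class IdGenerator:
    def __init__(self, coordinator):
        self.coordinator = coordinator
        self._cur, self._end = coordinator.lease_block()

    def next_id(self) -> int:
        if self._cur == self._end:                        # block exhausted:
            self._cur, self._end = self.coordinator.lease_block()
        self._cur += 1
        return self._cur - 1

coord = Coordinator()
gen_a, gen_b = IdGenerator(coord), IdGenerator(coord)
print(gen_a.next_id(), gen_b.next_id())  # 0 1000 -- no coordination per ID
```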

Question: How to store the mapping between long URLs and short URLs?

  • A poor candidate might suggest using a single, non-distributed, non-relational database. It is merely a simple key-value database.
  • An excellent candidate will suggest using a simple distributed storage system, such as MongoDB/HBase/Voldemort, etc.
  • A more excellent candidate will ask about the expected usage cycle of short URLs and then design a system that can clean up expired short URLs.

Question: How to implement the redirection service?

  • A poor candidate will design the system from scratch to solve problems that have already been solved.
  • An excellent candidate will suggest using an existing HTTP server with a plugin to translate the short URL ID, look up this ID in the database, update access data, return a 303 status, and redirect to the long URL. Existing HTTP servers include Apache/Jetty/Netty/Tomcat, etc.

Question: How to store access data?

  • A poor candidate will suggest writing to the database on every access.
  • An excellent candidate will suggest having several different components handle this task: generating access stream data, collecting and organizing it, and writing it to a permanent database after a certain period.

Question: How to separate the different components of storing access data proposed by the excellent candidate?

  • An excellent candidate will suggest using a low-latency information system to temporarily store access data and then hand the data over to the collection and organization component.
  • The candidate may ask how often access data needs to be updated. If updated daily, a reasonable method would be to store it in HDFS and use MapReduce to compute the data. If near-real-time data is required, the collection and organization component must compute the necessary data.

Question: How to block access to restricted websites?

  • An excellent candidate will suggest maintaining a blacklist of domains in the key-value database.
  • A good candidate might propose some advanced technologies that can be used when the system scales significantly, such as bloom filters.

Improving System Availability through Failover

· 2 min read

Failover: Failover is a backup operational mode used to enhance system stability and availability. When the primary component fails or is scheduled for downtime, the functions of system components (such as processors, servers, networks, or databases) are transferred to secondary system components.

Cold Backup: Cold backup refers to copying critical files to another location, tracking failures with metrics/alerts. The system provisions a new standby node in the event of a failure; however, cold backup is only suitable for stateless services. For backing up Oracle databases, cold backup is the fastest and safest method.

Hot Backup: This involves maintaining two active systems that share the same task roles, meaning the system operates normally while providing backup. The data between the two systems is nearly mirrored in real-time and contains the same information.

Warm Backup: This keeps two active systems, where the secondary system does not consume traffic unless a failure occurs.

Checkpoint (similar to Redis snapshots): The system uses write-ahead logging (WAL) to record requests before processing them. The standby node recovers from the checkpoint plus the log during failover (see the sketch after this list).

  • Disadvantages
    • A large amount of log recovery can be time-consuming
    • Data loss since the last checkpoint
  • Use Cases: Storm, MillWheel, Samza
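A minimal Python sketch of the WAL-plus-checkpoint idea over a toy key-value state (real systems batch writes, fsync, and compact the log):

```python
import json

def handle(request, state, log):
    log.write(json.dumps(request) + "\n")       # 1. record the request (WAL)
    log.flush()                                  # 2. make it durable
    state[request["key"]] = request["value"]     # 3. then process it

def recover(checkpoint: dict, log_lines):
    """Failover: start from the checkpointed state, replay the log suffix."""
    state = dict(checkpoint)
    for line in log_lines:
        entry = json.loads(line)
        state[entry["key"]] = entry["value"]     # slow if the log is long
    return state                                 # unflushed requests are lost

with open("wal.log", "a+", encoding="utf-8") as log:
    state = {}
    handle({"key": "a", "value": 1}, state, log)
```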

Dual-host (or all-host) mode: This keeps two active systems behind a load balancer. The hosts operate in parallel, and data replication is bidirectional.

Lambda Architecture

· 2 min read

Why Use Lambda Architecture?

To address the three issues brought by big data:

  1. Accuracy (good)
  2. Latency (fast)
  3. Throughput (high)

For example: The problems of scaling web browsing data records in a traditional way:

  1. First, use a traditional relational database.
  2. Then, add a "publish/subscribe" model queue.
  3. Next, scale through horizontal partitioning or sharding.
  4. Fault tolerance issues begin to arise.
  5. Data corruption phenomena start to appear.

The key issue is that in the AKF Scaling Cube, ==having only the X-axis for horizontal partitioning of one dimension is not enough; we also need to introduce the Y-axis for functional decomposition. The lambda architecture can guide us on how to scale a data system==.

What is Lambda Architecture?

If we define a data system in the following form:

query = function(all data)

Then a lambda architecture is:

Lambda Architecture

batch view = function(all data at the batching job's execution time)
realtime view = function(realtime view, new data)

query = function(batch view, realtime view)

==Lambda architecture = Read/Write separation (Batch Processing Layer + Service Layer) + Real-time Processing Layer==

Lambda Architecture for big data systems
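As a toy illustration, here is the query-merge step in Python for a page-view count, with dicts standing in for the batch (serving) and realtime (speed) views:

```python
batch_view = {"/home": 10_000, "/about": 250}  # recomputed from all data
realtime_view = {"/home": 42}                  # only data since the last batch

def query(url: str) -> int:
    # merge both views: batch is complete but stale, realtime fresh but partial
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

print(query("/home"))  # 10042
```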

Skip List

· One min read

A skip list is essentially a linked list that allows for binary search. It achieves this by adding extra nodes that enable you to "skip" parts of the linked list. Given a random number generator to create these extra nodes, a skip list has O(log n) complexity for search, insert, and delete operations.
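Here is a minimal Python skip list with search and insert (delete is analogous); levels are chosen by coin flips, which yields the expected O(log n) behavior described above:

```python
import random

MAX_LEVEL = 16

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # forward[i] skips ahead on level i

class SkipList:
    def __init__(self):
        self.head = Node(None, MAX_LEVEL)
        self.level = 1

    def _random_level(self):
        level = 1
        while level < MAX_LEVEL and random.random() < 0.5:  # coin flips
            level += 1
        return level

    def search(self, key):
        node = self.head
        for i in reversed(range(self.level)):   # descend level by level
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]          # skip ahead on this level
        node = node.forward[0]
        return node is not None and node.key == key

    def insert(self, key):
        update, node = [self.head] * MAX_LEVEL, self.head
        for i in reversed(range(self.level)):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node                    # rightmost node per level
        level = self._random_level()
        self.level = max(self.level, level)
        new = Node(key, level)
        for i in range(level):                  # splice in at each level
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

s = SkipList()
for k in [3, 1, 7, 5]:
    s.insert(k)
print(s.search(5), s.search(4))  # True False
```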

Use Cases

  • LevelDB MemTable
  • Redis Sorted Set
  • Lucene Inverted Index