Skip to main content

49 posts tagged with "system-design"

View all tags

Designing Facebook's Photo Storage System

· 2 min read

Why Does Facebook Handle Its Own Photo Storage?

  • Petabyte-scale volume of blob data
  • Traditional NFS-based designs (where each image is stored as a file) face metadata bottlenecks: massive metadata severely limits metadata hit rates.
    • Here are the details:

For photo applications, most metadata, such as image permissions, is useless, wasting storage space. However, the larger overhead is that the metadata of the file must be read from disk into memory to locate the file itself. While this is negligible for small-scale storage, when multiplied by billions of photos and several petabytes of data, accessing metadata becomes a throughput bottleneck.

Solution

By aggregating hundreds of thousands of images into a single Haystack storage file, the metadata burden is eliminated.

Structure

Facebook Photo Storage Architecture

Data Layout

Index file (for quick memory loading) + Haystack storage file containing many images.

Index file layout

index file layout 1

index file layout 2

Storage file

haystack store file

CRUD Operations

  • Create: Write to the storage file, then ==asynchronously== write to the index file, as indexing is not a critical step.
  • Delete: Perform soft deletes by marking the deleted bits in a flag field. Execute hard deletes through compacting operations.
  • Update: During updates, only append (append-only); if a duplicate key is encountered, the application can choose to update and read the key with the maximum offset.
  • Read: Read operations (offset, key, backup key, cookie, and data size)

Use Cases

Upload

Photo Storage Upload

Download

Photo Storage Download

Designing a URL Shortener System

· 5 min read

Design a system that can convert URLs provided by users into short URLs, allowing users to access their original URLs (hereinafter referred to as long URLs) using these short URLs. Describe how this system operates, including but not limited to the following questions: How are short URLs allocated? How is the mapping between short URLs and long URLs stored? How is the redirection service implemented? How is access data stored?

Assumptions: The initial problem description does not include these assumptions. An excellent candidate will ask about system scale when given a specific design.

  • There are approximately tens of thousands of long URL domains.
  • The traffic for new long URLs is about 10,000,000 per day (100 per second).
  • The traffic for the redirection service using short URLs to access long URLs is about 10 billion per day (100,000 per second).
  • Remind the candidate that these are average figures - during peak times, these numbers can be much higher (one type of peak time is time-related, such as when users return home from work, and another type is event-related, such as during the Spring Festival Gala).
  • Recent data (e.g., today's data) should be collected in advance and should be available within five minutes when users want to view it.
  • Historical data should be calculated daily.

Assumptions

1 billion new URLs per day, 100 billion short URL accesses. The shorter the short URL, the better. Data presentation (real-time/daily/monthly/yearly).

URL Encoding

http://blog.codinghorror.com/url-shortening-hashes-in-practice/

Method 1: md5 (128 bits, 16 hexadecimal digits, collisions, birthday paradox, 2^(n/2) = 2^64) Shorter? (64 bits, 8 hexadecimal digits, collisions 2^32), base 64.

  • Advantages: Hashing is relatively simple and easy to scale horizontally.
  • Disadvantages: Too long, how to handle expired URLs?

Method 2: Distributed ID generator. (Base 62: az, AZ, 0~9, 62 characters, 62^7), partitioning: each node contains some IDs.

  • Advantages: Easier to eliminate expired URLs, shorter URLs.
  • Disadvantages: Coordination between different partitions (e.g., ZooKeeper).

Key-Value (KV) Storage

MySQL (10k requests per second, slow, no need for a relational database), key-value (100k requests per second, Redis, Memcached).

An excellent candidate will ask about the expected lifespan of short URLs and design a system that can automatically clean up expired short URLs.

Follow-Up

Question: How to generate short URLs?

  • A poor candidate might suggest using a single ID generator (single point of failure) or require coordination between ID generators for each ID generation. For example, using an auto-increment primary key in a database.
  • An acceptable candidate might suggest using md5 or some UUID generators that can generate IDs independently on some nodes. These methods can generate non-colliding IDs in a distributed system, allowing for the production of a large number of short URLs.
  • An excellent candidate will design a method using several ID generators, where each generator first reserves a block of ID sequences from a central coordinator (e.g., ZooKeeper), and these ID generators can allocate IDs from their ID sequences independently, cleaning up their ID sequences when necessary.

Question: How to store the mapping between long URLs and short URLs?

  • A poor candidate might suggest using a single, non-distributed, non-relational database. It is merely a simple key-value database.
  • An excellent candidate will suggest using a simple distributed storage system, such as MongoDB/HBase/Voldemort, etc.
  • A more excellent candidate will ask about the expected usage cycle of short URLs and then design a system that can clean up expired short URLs.

Question: How to implement the redirection service?

  • A poor candidate will design the system from scratch to solve problems that have already been solved.
  • An excellent candidate will suggest using an existing HTTP server with a plugin to translate the short URL ID, look up this ID in the database, update access data, return a 303 status, and redirect to the long URL. Existing HTTP servers include Apache/Jetty/Netty/Tomcat, etc.

Question: How to store access data?

  • A poor candidate will suggest writing to the database on every access.
  • An excellent candidate will suggest having several different components handle this task: generating access stream data, collecting and organizing it, and writing it to a permanent database after a certain period.

Question: How to separate the different components of storing access data proposed by the excellent candidate?

  • An excellent candidate will suggest using a low-latency information system to temporarily store access data and then hand the data over to the collection and organization component.
  • The candidate may ask how often access data needs to be updated. If updated daily, a reasonable method would be to store it in HDFS and use MapReduce to compute the data. If near-real-time data is required, the collection and organization component must compute the necessary data.

Question: How to block access to restricted websites?

  • An excellent candidate will suggest maintaining a blacklist of domains in the key-value database.
  • A good candidate might propose some advanced technologies that can be used when the system scales significantly, such as bloom filters.

Improving System Availability through Failover

· 2 min read

Failover: Failover is a backup operational mode used to enhance system stability and availability. When the primary component fails or is scheduled for downtime, the functions of system components (such as processors, servers, networks, or databases) are transferred to secondary system components.

Cold Backup: Cold backup refers to copying critical files to another location, using features or metrics/alerts to track failures. The system provides a new standby node in the event of a failure; however, cold backup is only suitable for stateless services. For backing up Oracle databases, cold backup is the fastest and safest method.

Hot Backup: This involves maintaining two active systems that share the same task roles, meaning the system operates normally while providing backup. The data between the two systems is nearly mirrored in real-time and contains the same information.

Warm Backup: This keeps two active systems, where the secondary system does not consume traffic unless a failure occurs.

Checkpoint (or similar to Redis snapshots): The system uses write-ahead logging (WAL) to record requests before processing tasks. The standby node recovers from the log during failover.

  • Disadvantages
    • A large amount of log recovery can be time-consuming
    • Data loss since the last checkpoint
  • User Cases: Storm, WhillWheel, Samza

Dual-host (or all-host) mode: This keeps two active systems behind a load balancer. The hosts operate in parallel, and data replication is bidirectional.

Lambda Architecture

· 2 min read

Why Use Lambda Architecture?

To address the three issues brought by big data:

  1. Accuracy (good)
  2. Latency (fast)
  3. Throughput (high)

For example: The problems of scaling web browsing data records in a traditional way:

  1. First, use a traditional relational database.
  2. Then, add a "publish/subscribe" model queue.
  3. Next, scale through horizontal partitioning or sharding.
  4. Fault tolerance issues begin to arise.
  5. Data corruption phenomena start to appear.

The key issue is that in the AKF Scaling Cube, ==having only the X-axis for horizontal partitioning of one dimension is not enough; we also need to introduce the Y-axis for functional decomposition. The lambda architecture can guide us on how to scale a data system==.

What is Lambda Architecture?

If we define a data system in the following form:

Query=function(all data)

Then a lambda architecture is:

Lambda Architecture

batch view = function(all data at the batching job's execution time)
realtime view = function(realtime view, new data)

query = function(batch view, realtime view)

==Lambda architecture = Read/Write separation (Batch Processing Layer + Service Layer) + Real-time Processing Layer==

Lambda Architecture for big data systems

Skip List

· One min read

A skip list is essentially a linked list that allows for binary search. It achieves this by adding extra nodes that enable you to "skip" parts of the linked list. Given a random number generator to create these extra nodes, a skip list has O(log n) complexity for search, insert, and delete operations.

Use Cases

  • LevelDB MemTable
  • Redis Sorted Set
  • Lucene Inverted Index

Bloom Filter

· One min read

A Bloom filter is a data structure that is used to determine whether an element is a member of a set with a much higher space and time efficiency than other general algorithms.

The results obtained using a Bloom filter may yield false positive matches, but cannot yield false negative matches. In other words, the query returns results that are "either possibly present or definitely not present." Elements can be added to the set, but cannot be removed (although this can be addressed with an additional "counting" Bloom filter); the more elements added to the set, the greater the likelihood of false positives.

Use Cases

  • Cassandra uses Bloom filters to determine if an SSTable contains data for a specific row.
  • HBase Bloom filters are an effective mechanism for testing whether a StoreFile contains a specific row or row-column cell.
  • With Bloom filters, a website's anti-cheat system can effectively deny access to banned users.
  • Google's Chrome browser once used Bloom filters to identify malicious links.

Past Work Experience Interview

· 3 min read

Target Audience

Individuals with some or little experience, or those who have not held any leadership or design positions in their careers (whether formal or informal).

Problem Description

Describe a project experience from your past that you found particularly interesting or memorable. Follow-up questions include:

  • Why did you find it interesting?
  • What was the most challenging part of the project, and how did you address those challenges?
  • What did you learn from this project? What would you have liked to know before starting the project?
  • Did you consider other design or implementation methods? Why did you choose the approach you took? If you were to choose again for the same project, what would you do differently?

Interviewer Tips

Since the goal here is to assess a person's technical communication skills and level of interest, and they may have participated in boot camps, you should be prepared to ask them more questions (whether for more details or about other aspects of the project). If they are recent graduates who have just completed their thesis, the thesis is often a good starting point. While this question is similar in many ways to resume questions in phone interviews, the content is about four times that of a phone interview and should be proportionately more detailed in asking what they did. Therefore, while the scoring criteria are similar, they should be evaluated with higher expectations and more data.

Scoring

Excellent candidates can:

  • Discuss project experiences thoroughly, with interactions in the interview being a dialogue rather than a directive.
  • Have a good understanding of the entire project, not just their area of focus, and be able to clearly articulate the design and intent of the project.
  • Show passion for any project and clearly describe the project elements that sparked that passion.
  • Clearly explain what alternatives were considered and why they chose the implementation strategy they took.
  • Reflect on their experiences and learn from them.

Good candidates may:

  • Encounter some questions in the interview but can resolve them with the interviewer's help.
  • Lack a broader understanding of the project but still have a strong grasp of the parts they interacted with directly and specific areas.
  • Appear passionate but may struggle to accurately explain where that passion comes from.
  • Discuss alternatives they considered but may not have thought deeply about them.
  • Reflect on their past experiences and draw lessons from them.

Poor candidates exhibit:

  • Struggle during the interview, making the interviewer feel like the candidate is interrogating them rather than having a conversation.
  • Lack detailed knowledge of the project even in their field of work. They may not understand why their product was designed that way or how it interacts with other systems.
  • When asked about the most interesting projects they have worked on, they should show interest in the product, but in fact, they may appear disinterested.
  • Be unfamiliar with potential alternative implementation methods.
  • Seem not to have reflected on or learned from their past project experiences. A key indicator of this situation is that answers to "What did you learn?" and "What would you do differently?" are very brief or nearly identical.

What is Apache Kafka?

· 4 min read

Apache Kafka is a distributed streaming platform.

Why use Apache Kafka?

Its abstraction is a ==queue==, and its features include:

  • A distributed publish-subscribe (pub-sub) messaging system that simplifies N ^ 2 relationships into N. Publishers and subscribers can operate at their own rates.
  • Ultra-fast zero-copy technology.
  • Support for fault-tolerant data persistence.

It can be applied to:

  • Logging by topic.
  • Messaging systems.
  • Off-site backups.
  • Stream processing.

Why is Kafka so fast?

Kafka uses zero-copy technology, where the CPU does not perform the task of copying data across storage area replicas.

Without zero-copy technology:

With zero-copy technology:

Architecture

From the outside, producers write to the Kafka cluster, while users read from the Kafka cluster.

Data is stored by topic and divided into partitions of replicable replicas.

Kafka Cluster Overview

  1. Producers publish messages to specific topics.
    • First, they are written to an in-memory buffer and then updated to disk.
    • To achieve fast writes, an append-only sequential write is used.
    • Messages can be read only after being written.
  2. Consumers fetch messages from specific topics.
    • They use an "offset pointer" (offset is the SEQ ID) to track/control their unique reading progress.
  3. A topic includes partitions and load balancing, where each partition is an ordered, immutable sequence of records.
    • Partitions determine the parallelism of users (groups). At any given time, a user can read from only one partition.

How to serialize data? Avro

What is its network protocol? TCP

What is the storage layout within a partition? O(1) disk reads.

How does fault tolerance work?

==In-Sync Replicas (ISR) protocol==. It allows (numReplicas - 1) nodes to fail. Each partition has one leader and one or more followers.

Total replicas = In-sync replicas + Out-of-sync replicas

  1. ISR is a set of live replicas that are in sync with the leader (note that the leader is always in the ISR).
  2. When publishing new messages, the leader waits to commit the message until it has been received by all replicas in the ISR.
  3. ==If a follower fails to stay in sync, it will exit the ISR, and then the leader will continue to commit new messages with fewer replicas in the ISR. Note that at this point, the system is operating in a low-replica state.== If a leader fails, another ISR will be elected as a new leader.
  4. Out-of-sync replicas continuously pull messages from the leader. Once they catch up to the leader, they will be added back to the ISR.

Is Kafka an AP or CP system in the CAP theorem?

Jun Rao believes it is CA because "our goal is to support replication within a Kafka cluster in a single data center, where network partitions are rare, so our design focuses on maintaining high availability and strong consistency of replicas."

However, it actually depends on the configuration.

  1. If using the initial configuration (min.insync.replicas=1, default.replication.factor=1), you will have an AP system (at most once).
  2. If you want to achieve CP, you can set min.insync.replicas=2, topic replication factor to 3, and then generate acks=all messages to guarantee CP settings (at least once). However, if there are not enough replicas (replica count < 2) for a specific topic/partition, writing will not succeed.

How Does Facebook Store a Large-Scale Social Graph? TAO

· 2 min read

What Are the Challenges?

Before TAO, using the cache-aside pattern

Before TAO

The social graph is stored in MySQL and cached in Memcached.

Three existing problems:

  1. The efficiency of updating the edge list of the social graph in Memcached is too low. Instead of adding an edge to the end of the list, the entire list needs to be updated.
  2. The logic for managing the cache on the client side is very complex.
  3. It is difficult to maintain ==consistency in database reads after writes==.

To solve these problems, we have three goals:

  • Efficient graph storage even with large-scale data.
  • Optimize read operations (read-write ratio is 500:1).
    • Reduce the duration of read operations.
    • Improve the availability of read operations (eventual consistency).
  • Complete write operations in a timely manner (write first, read later).

Data Model

  • Objects with unique IDs (e.g., users, addresses, comments).
  • Associations between two IDs (e.g., tagged, liked, posted).
  • Both of the above data models have key-value data and time-related data.

Solution: TAO

  1. Accelerate read operations and efficiently handle large-scale reads.

    • Cache specifically for graphs.
    • Add a layer of cache between the stateless server layer and the database layer (see Business Splitting).
    • Split data centers (see Data Partitioning).
  2. Complete write operations in a timely manner.

    • Write-through cache.
    • Use follower/leader caching to solve the ==thundering herd problem==.
    • Asynchronous replication.
  3. Improve the availability of read operations.

    • If a read fails, read from other available sources.

Architecture of TAO

  • MySQL Database → Durability.
  • Leader Cache → Coordinates write operations for each object.
  • Follower Cache → Used for reading rather than writing. Shift all write operations to the leader cache.

Architecture of Facebook TAO

Fault tolerance for read operations.

Fault Tolerance for Read Operations in Facebook TAO

How to Design Robust and Predictable APIs with Idempotency?

· 2 min read

Why are APIs unreliable?

  1. Networks can fail.
  2. Servers can fail.

How can we solve this problem? Three principles:

  1. The client uses "retry" to ensure state consistency.

  2. The retry requests must include an ==idempotent unique ID==.

    1. In RESTful API design, the semantics of PUT and DELETE are inherently idempotent.
    2. However, POST in online payment scenarios may lead to the ==“duplicate payment” issue==, so we use an "idempotent unique ID" to identify whether a request has been sent multiple times.
      1. If the error occurs before reaching the server, after retrying, the server sees it for the first time and processes it normally.
      2. If the error occurs on the server, based on this "unique ID," an ACID-compliant database ensures that this transaction occurs only once.
      3. If the error occurs after the server returns a result, after retrying, the server only needs to return the cached successful result.
  3. Retries must be responsible, such as following the ==exponential backoff algorithm==, because we do not want a large number of clients to retry simultaneously.

For example, Stripe's client calculates the wait time for retries like this:

def self.sleep_time(retry_count)
# Apply exponential backoff with initial_network_retry_delay on the
# number of attempts so far as inputs. Do not allow the number to exceed
# max_network_retry_delay.
sleep_seconds = [Stripe.initial_network_retry_delay * (2 ** (retry_count - 1)), Stripe.max_network_retry_delay].min

# Apply some jitter by randomizing the value in the range of (sleep_seconds
# / 2) to (sleep_seconds).
sleep_seconds = sleep_seconds * (0.5 * (1 + rand()))

# But never sleep less than the base sleep seconds.
sleep_seconds = [Stripe.initial_network_retry_delay, sleep_seconds].max

sleep_seconds
end

How to Build a Scalable Web Service?

· One min read

==One Word: Split==

==The AKF Scale Cube== tells us the three dimensions of "splitting":

AKF Scale Cube

  1. ==Horizontal Scaling== Place many stateless servers behind a load balancer or reverse proxy, so that each request can be handled by any of those servers, eliminating single points of failure.
  2. ==Business Splitting== Typical microservices divided by function, such as auth service, user profile service, photo service, etc.
  3. ==Data Partitioning== Separate the entire technology stack and data storage specifically for a large group of users, for example, Uber has data centers in China and the United States, with different Pods for different cities or regions within each data center.

What are the use cases for key-value caching?

· 3 min read

The essence of KV Cache is to reduce data access latency. For example, it transforms the O(logN) read/write and complex queries on a database that is expensive and slow into O(1) read/writes on a medium that is fast but also costly. There are many strategies for cache design, with common ones being read-through/write-through (or write-back) and cache aside.

The typical read/write ratio for internet services ranges from 100:1 to 1000:1, and we often optimize for reads.

In distributed systems, these patterns represent trade-offs between consistency, availability, and partition tolerance, and the specific choice should be based on your business needs.

General Strategies

  • Read
    • Read-through: A cache layer is added between clients and databases, so clients do not access the database directly but instead access it indirectly through the cache. If the cache is empty, it updates from the database and returns the data; if not, it returns the data directly.
  • Write
    • Write-through: Clients first write data to the cache, which then updates the database. The operation is considered complete only when the database is updated.
    • Write-behind/Write-back: Clients first write data to the cache and receive a response immediately. The cache is then asynchronously updated to the database. Generally, write-back is the fastest.
    • Write-around: Clients write directly to the database, bypassing the cache.

Cache Aside Pattern

Use the Cache Aside pattern when the cache does not support read-through and write-through/write-behind.

Reading data? If the cache hits, read from the cache; if it misses, read from the database and store in the cache. Modifying data? First modify the database, then delete the cache entry.

Why not update the cache after writing to the database? The main concern is that two concurrent database write operations could lead to two concurrent cache updates, resulting in dirty data.

Does using Cache Aside eliminate concurrency issues? There is still a low probability of dirty data occurring, especially when reading from the database and updating the cache while simultaneously updating the database and deleting the cache entry.

Where to Place the Cache?

  • Client side,
  • Distinct layer,
  • Server side.

What to Do If the Cache Size Is Insufficient? Cache Eviction Strategies

  • LRU - Least Recently Used: Keeps track of time and retains the most recently used items, evicting those that have not been used recently.
  • LFU - Least Frequently Used: Tracks usage frequency, retaining the most frequently used items and evicting the least frequently used ones.
  • ARC: Performs better than LRU by maintaining both recently used (RU) and frequently used (FU) items, while also recording the history of recently evicted items.

Which Cache Solution Is the Best?

Facebook TAO