Skip to main content

75 posts tagged with "system design"

View All Tags

Public API Choices

· One min read

In summary, to choose a tool for the public API, API gateway, or BFF (Backend For Frontend) gateway, I prefer GraphQL for its features like tailing results, batching nested queries, performance tracing, and explicit caching.

JSON RPCGraphQLRESTgRPC
UsecasesEtherumGithub V2, Airbnb, Facebook BFF / API GatewaySwaggerHigh performance, Google, internal endpoints
Single Endpoint
Type System✅ as weak as JSON
No uint64

No uint64
✅ w/ Swagger
No uint64

has uint64
Tailored Results
Batch nested queries
VersioningSchema ExtensionYes, w/ v1/v2 route sField Numbers in protobuf
Error HandlingStructuredStructuredHTTP Status CodeStructured
Cross-platform
Playground UIGraphQL BinSwagger
Performance tracing?Apollo plugin??
cachingNo or HTTP cache controlApollo pluginHTTP cache controlNative support not yet. but still yes w/ HTTP cache control
ProblemLack of community support and toolchainBarrister IDL42.51 kb client-side bundle sizeUnstructured with multiple endpoints. awful portability.Grpc-web dev in progress140kb JS bundle. Compatibility issues: not all places support HTTP2 and grpc dependencies

Past Work Experience Interview

· 3 min read

Target Audience

Individuals with some or little experience, or those who have not held any leadership or design positions in their careers (whether formal or informal).

Problem Description

Describe a project experience from your past that you found particularly interesting or memorable. Follow-up questions include:

  • Why did you find it interesting?
  • What was the most challenging part of the project, and how did you address those challenges?
  • What did you learn from this project? What would you have liked to know before starting the project?
  • Did you consider other design or implementation methods? Why did you choose the approach you took? If you were to choose again for the same project, what would you do differently?

Interviewer Tips

Since the goal here is to assess a person's technical communication skills and level of interest, and they may have participated in boot camps, you should be prepared to ask them more questions (whether for more details or about other aspects of the project). If they are recent graduates who have just completed their thesis, the thesis is often a good starting point. While this question is similar in many ways to resume questions in phone interviews, the content is about four times that of a phone interview and should be proportionately more detailed in asking what they did. Therefore, while the scoring criteria are similar, they should be evaluated with higher expectations and more data.

Scoring

Excellent candidates can:

  • Discuss project experiences thoroughly, with interactions in the interview being a dialogue rather than a directive.
  • Have a good understanding of the entire project, not just their area of focus, and be able to clearly articulate the design and intent of the project.
  • Show passion for any project and clearly describe the project elements that sparked that passion.
  • Clearly explain what alternatives were considered and why they chose the implementation strategy they took.
  • Reflect on their experiences and learn from them.

Good candidates may:

  • Encounter some questions in the interview but can resolve them with the interviewer's help.
  • Lack a broader understanding of the project but still have a strong grasp of the parts they interacted with directly and specific areas.
  • Appear passionate but may struggle to accurately explain where that passion comes from.
  • Discuss alternatives they considered but may not have thought deeply about them.
  • Reflect on their past experiences and draw lessons from them.

Poor candidates exhibit:

  • Struggle during the interview, making the interviewer feel like the candidate is interrogating them rather than having a conversation.
  • Lack detailed knowledge of the project even in their field of work. They may not understand why their product was designed that way or how it interacts with other systems.
  • When asked about the most interesting projects they have worked on, they should show interest in the product, but in fact, they may appear disinterested.
  • Be unfamiliar with potential alternative implementation methods.
  • Seem not to have reflected on or learned from their past project experiences. A key indicator of this situation is that answers to "What did you learn?" and "What would you do differently?" are very brief or nearly identical.

What can we communicate in soft skills interview?

· 2 min read

What is an interview?

An interview is a process for workers to find future co-workers, during which they are finding signals to answer the following three key questions:

  1. capability question - can you do the job?
  2. willingness question - will you do the job?
  3. culture-fit question - will you fit in?

Why is soft skill so important?

None of the critical questions above can be answered without good communication. Your job will be taken away by people who can talk better than you.

Generic answers (stories) to prepare

  1. hard-won triumph. How do you deal with failure? Briefly talk about the harshest moment, and then focus on how you battled back, and then salute the people who helped you. It demonstrates that you have the grit, team-building skills, and relevance to the job.
  2. influence. Can you guide people to yes? leaders = visionaries who inspire self-sacrifice. A leader does not exist without the ability to persuade.
  3. technical skills. Do you have a story proving your great technical skills?
  4. culture-fit. The FBI used to ask prospective agents what books they read until an underground network of tipsters figured out the ideal answer: “Tom Clancy spy novels.”
  5. fascination. What's fascinating about you? What makes you different from other people?

What is Apache Kafka?

· 3 min read

Apache Kafka is a distributed streaming platform.

Why use Apache Kafka?

Its abstraction is a ==queue== and it features

  • a distributed pub-sub messaging system that resolves N^2 relationships to N. Publishers and subscribers can operate at their own rates.
  • super fast with zero-copy technology
  • support fault-tolerant data persistence

It can be applied to

  • logging by topics
  • messaging system
  • geo-replication
  • stream processing

Why is Kafka so fast?

Kafka is using zero copy in which that CPU does not perform the task of copying data from one memory area to another.

Without zero copy:

With zero copy:

Architecture

Looking from outside, producers write to brokers, and consumers read from brokers.

Data is stored in topics and split into partitions which are replicated.

Kafka Cluster Overview

  1. Producer publishes messages to a specific topic.
    • Write to in-memory buffer first and flush to disk.
    • append-only sequence write for fast write.
    • Available to read after write to disks.
  2. Consumer pulls messages from a specific topic.
    • use an "offset pointer" (offset as seqId) to track/control its only read progress.
  3. A topic consists of partitions, load balance, partition (= ordered + immutable seq of msg that is continually appended to)
    • Partitions determine max consumer (group) parallelism. One consumer can read from only one partition at the same time.

How to serialize data? Avro

What is its network protocol? TCP

What is a partition's storage layout? O(1) disk read

How to tolerate fault?

==In-sync replica (ISR) protocol==. It tolerates (numReplicas - 1) dead brokers. Every partition has one leader and one or more followers.

Total Replicas = ISRs + out-of-sync replicas

  1. ISR is the set of replicas that are alive and have fully caught up with the leader (note that the leader is always in ISR).
  2. When a new message is published, the leader waits until it reaches all replicas in the ISR before committing the message.
  3. ==If a follower replica fails, it will be dropped out of the ISR and the leader then continues to commit new messages with fewer replicas in the ISR. Notice that now, the system is running in an under replicated mode.== If a leader fails, an ISR is picked to be a new leader.
  4. Out-of-sync replica keeps pulling message from the leader. Once catches up with the leader, it will be added back to the ISR.

Is Kafka an AP or CP system in CAP theorem?

Jun Rao says it is CA, because "Our goal was to support replication in a Kafka cluster within a single datacenter, where network partitioning is rare, so our design focuses on maintaining highly available and strongly consistent replicas."

However, it actually depends on the configuration.

  1. Out of the box with default config (min.insync.replicas=1, default.replication.factor=1) you are getting AP system (at-most-once).

  2. If you want to achieve CP, you may set min.insync.replicas=2 and topic replication factor of 3 - then producing a message with acks=all will guarantee CP setup (at-least-once), but (as expected) will block in cases when not enough replicas (<2) are available for particular topic/partition pair.

What is Apache Kafka?

· 3 min read

Apache Kafka is a distributed streaming platform.

Why use Apache Kafka?

Its abstraction is a ==queue==, and its features include:

  • A distributed publish-subscribe (pub-sub) messaging system that simplifies N ^ 2 relationships into N. Publishers and subscribers can operate at their own rates.
  • Ultra-fast zero-copy technology.
  • Support for fault-tolerant data persistence.

It can be applied to:

  • Logging by topic.
  • Messaging systems.
  • Off-site backups.
  • Stream processing.

Why is Kafka so fast?

Kafka uses zero-copy technology, where the CPU does not perform the task of copying data across storage area replicas.

Without zero-copy technology:

With zero-copy technology:

Architecture

From the outside, producers write to the Kafka cluster, while users read from the Kafka cluster.

Data is stored by topic and divided into partitions of replicable replicas.

Kafka Cluster Overview

  1. Producers publish messages to specific topics.
    • First, they are written to an in-memory buffer and then updated to disk.
    • To achieve fast writes, an append-only sequential write is used.
    • Messages can be read only after being written.
  2. Consumers fetch messages from specific topics.
    • They use an "offset pointer" (offset is the SEQ ID) to track/control their unique reading progress.
  3. A topic includes partitions and load balancing, where each partition is an ordered, immutable sequence of records.
    • Partitions determine the parallelism of users (groups). At any given time, a user can read from only one partition.

How to serialize data? Avro

What is its network protocol? TCP

What is the storage layout within a partition? O(1) disk reads.

How does fault tolerance work?

==In-Sync Replicas (ISR) protocol==. It allows (numReplicas - 1) nodes to fail. Each partition has one leader and one or more followers.

Total replicas = In-sync replicas + Out-of-sync replicas

  1. ISR is a set of live replicas that are in sync with the leader (note that the leader is always in the ISR).
  2. When publishing new messages, the leader waits to commit the message until it has been received by all replicas in the ISR.
  3. ==If a follower fails to stay in sync, it will exit the ISR, and then the leader will continue to commit new messages with fewer replicas in the ISR. Note that at this point, the system is operating in a low-replica state.== If a leader fails, another ISR will be elected as a new leader.
  4. Out-of-sync replicas continuously pull messages from the leader. Once they catch up to the leader, they will be added back to the ISR.

Is Kafka an AP or CP system in the CAP theorem?

Jun Rao believes it is CA because "our goal is to support replication within a Kafka cluster in a single data center, where network partitions are rare, so our design focuses on maintaining high availability and strong consistency of replicas."

However, it actually depends on the configuration.

  1. If using the initial configuration (min.insync.replicas=1, default.replication.factor=1), you will have an AP system (at most once).
  2. If you want to achieve CP, you can set min.insync.replicas=2, topic replication factor to 3, and then generate acks=all messages to guarantee CP settings (at least once). However, if there are not enough replicas (replica count < 2) for a specific topic/partition, writing will not succeed.

How Facebook Scale its Social Graph Store? TAO

· 2 min read

What are the challenges?

Before TAO, use cache-aside pattern

Before TAO

Social graph data is stored in MySQL and cached in Memcached

3 problems:

  1. list update operation in Memcached is inefficient. cannot append but update the whole list.
  2. clients have to manage cache
  3. Hard to offer ==read-after-write consistency==

To solve those problems, we have 3 goals:

  • online data graph service that is efficiency at scale
  • optimize for read (its read-to-write ratio is 500:1)
    • low read latency
    • high read availability (eventual consistency)
  • timeliness of writes (read-after-write)

Data Model

  • Objects (e.g. user, location, comment) with unique IDs
  • Associations (e.g. tagged, like, author) between two IDs
  • Both have key-value data as well as a time field

Solutions: TAO

  1. Efficiency at scale and reduce read latency

  2. Write timeliness

    • write-through cache
    • follower/leader cache to solve thundering herd problem
    • async replication
  3. Read availability

    • Read Failover to alternate data sources

TAO's Architecture

  • MySQL databases → durability
  • Leader cache → coordinates writes to each object
  • Follower caches → serve reads but not writes. forward all writes to leader.

Facebook TAO Architecture

Read failover

Facebook TAO Read Failover

How Does Facebook Store a Large-Scale Social Graph? TAO

· 2 min read

What Are the Challenges?

Before TAO, using the cache-aside pattern

Before TAO

The social graph is stored in MySQL and cached in Memcached.

Three existing problems:

  1. The efficiency of updating the edge list of the social graph in Memcached is too low. Instead of adding an edge to the end of the list, the entire list needs to be updated.
  2. The logic for managing the cache on the client side is very complex.
  3. It is difficult to maintain ==consistency in database reads after writes==.

To solve these problems, we have three goals:

  • Efficient graph storage even with large-scale data.
  • Optimize read operations (read-write ratio is 500:1).
    • Reduce the duration of read operations.
    • Improve the availability of read operations (eventual consistency).
  • Complete write operations in a timely manner (write first, read later).

Data Model

  • Objects with unique IDs (e.g., users, addresses, comments).
  • Associations between two IDs (e.g., tagged, liked, posted).
  • Both of the above data models have key-value data and time-related data.

Solution: TAO

  1. Accelerate read operations and efficiently handle large-scale reads.

    • Cache specifically for graphs.
    • Add a layer of cache between the stateless server layer and the database layer (see Business Splitting).
    • Split data centers (see Data Partitioning).
  2. Complete write operations in a timely manner.

    • Write-through cache.
    • Use follower/leader caching to solve the ==thundering herd problem==.
    • Asynchronous replication.
  3. Improve the availability of read operations.

    • If a read fails, read from other available sources.

Architecture of TAO

  • MySQL Database → Durability.
  • Leader Cache → Coordinates write operations for each object.
  • Follower Cache → Used for reading rather than writing. Shift all write operations to the leader cache.

Architecture of Facebook TAO

Fault tolerance for read operations.

Fault Tolerance for Read Operations in Facebook TAO

How Netflix Serves Viewing Data?

· 2 min read

Motivation

How to keep users' viewing data in scale (billions of events per day)?

Here, viewing data means...

  1. viewing history. What titles have I watched?
  2. viewing progress. Where did I leave off in a given title?
  3. on-going viewers. What else is being watched on my account right now?

Architecture

Netflix Viewing Data Architecture

The viewing service has two tiers:

  1. stateful tier = active views stored in memory

    • Why? to support the highest volume read/write
    • How to scale out?
      • partitioned into N stateful nodes by account_id mod N
        • One problem is that load is not evenly distributed and hence the system is subject to hot spots
      • CP over AP in CAP theorem, and there is no replica of active states.
        • One failed node will impact 1/nth of the members. So they use stale data to degrade gracefully.
  2. stateless tier = data persistence = Cassandra + Memcached

    • Use Cassandra for very high volume, low latency writes.
      • Data is evenly distributed. No hot spots because of consistent hashing with virtual nodes to partition the data.
    • Use Memcached for very high volume, low latency reads.
      • How to update the cache?
        • after writing to Cassandra, write the updated data back to Memcached
        • eventually consistent to handling multiple writers with a short cache entry TTL and a periodic cache refresh.
      • in the future, prefer Redis' appending operation to a time-ordered list over "read-modify-writes" in Memcached.

How to design robust and predictable APIs with idempotency?

· 2 min read

How could APIs be un-robust and un-predictable?

  1. Networks are unreliable.
  2. Servers are more reliable but may still fail.

How to solve the problem? 3 Principles:

  1. Client retries to ensure consistency.

  2. Retry with idempotency and idempotency keys to allow clients to pass a unique value.

    1. In RESTful APIs, the PUT and DELETE verbs are idempotent.
    2. However, POST may cause ==“double-charge” problem in payment==. So we use a ==idempotency key== to identify the request.
      1. If the failure happens before the server, then there is a retry, and the server will see it for the first time, and process it normally.
      2. If the failure happens in the server, then ACID database will guarantee the transaction by the idempotency key.
      3. If the failure happens after the server’s reply, then client retries, and the server simply replies with a cached result of the successful operation.
  3. Retry with ==exponential backoff and random jitter==. Be considerate of the ==thundering herd problem== that servers that may be stuck in a degraded state and a burst of retries may further hurt the system.

For example, Stripe’s client retry calculates the delay like this...

def self.sleep_time(retry_count)
# Apply exponential backoff with initial_network_retry_delay on the
# number of attempts so far as inputs. Do not allow the number to exceed
# max_network_retry_delay.
sleep_seconds = [Stripe.initial_network_retry_delay * (2 ** (retry_count - 1)), Stripe.max_network_retry_delay].min

# Apply some jitter by randomizing the value in the range of (sleep_seconds
# / 2) to (sleep_seconds).
sleep_seconds = sleep_seconds * (0.5 * (1 + rand()))

# But never sleep less than the base sleep seconds.
sleep_seconds = [Stripe.initial_network_retry_delay, sleep_seconds].max

sleep_seconds
end

How to Design Robust and Predictable APIs with Idempotency?

· 2 min read

Why are APIs unreliable?

  1. Networks can fail.
  2. Servers can fail.

How can we solve this problem? Three principles:

  1. The client uses "retry" to ensure state consistency.

  2. The retry requests must include an ==idempotent unique ID==.

    1. In RESTful API design, the semantics of PUT and DELETE are inherently idempotent.
    2. However, POST in online payment scenarios may lead to the ==“duplicate payment” issue==, so we use an "idempotent unique ID" to identify whether a request has been sent multiple times.
      1. If the error occurs before reaching the server, after retrying, the server sees it for the first time and processes it normally.
      2. If the error occurs on the server, based on this "unique ID," an ACID-compliant database ensures that this transaction occurs only once.
      3. If the error occurs after the server returns a result, after retrying, the server only needs to return the cached successful result.
  3. Retries must be responsible, such as following the ==exponential backoff algorithm==, because we do not want a large number of clients to retry simultaneously.

For example, Stripe's client calculates the wait time for retries like this:

def self.sleep_time(retry_count)
# Apply exponential backoff with initial_network_retry_delay on the
# number of attempts so far as inputs. Do not allow the number to exceed
# max_network_retry_delay.
sleep_seconds = [Stripe.initial_network_retry_delay * (2 ** (retry_count - 1)), Stripe.max_network_retry_delay].min

# Apply some jitter by randomizing the value in the range of (sleep_seconds
# / 2) to (sleep_seconds).
sleep_seconds = sleep_seconds * (0.5 * (1 + rand()))

# But never sleep less than the base sleep seconds.
sleep_seconds = [Stripe.initial_network_retry_delay, sleep_seconds].max

sleep_seconds
end