
32 posts tagged with "system-design"


Model Routing Is a System Design Problem, Not a Config Option

· 11 min read
Tian Pan
Software Engineer

Most teams choose their LLM the way they choose a database engine: once, during architecture review, and never again. You pick GPT-4o or Claude 3.5 Sonnet, bake it into your config, and ship. The choice feels irreversible because changing it requires a redeployment, coordination across services, and regression testing against whatever your evals look like this week.

That framing is a mistake. Your traffic is not homogeneous. A "summarize this document" request and a "debug this cryptic stack trace" request hitting the same endpoint at the same time have radically different capability requirements — but with static model selection, they're indistinguishable from your infrastructure's perspective. You're either over-provisioning one or under-serving the other, and you're doing it on every single request.

Model routing treats LLM selection as a runtime dispatch decision. Every incoming query gets evaluated on signals that predict the right model for that specific request, and the call is dispatched accordingly. The routing layer doesn't exist in your config file — it runs in your request path.
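
To make the idea concrete, here is a minimal sketch of a routing layer, with the signals, thresholds, and model tiers invented for illustration rather than taken from any particular vendor:

// Minimal sketch of runtime model routing. Signals, thresholds, and model
// names are illustrative assumptions, not a production policy.
type ModelTier = "small" | "large";

interface RouteSignals {
  promptTokens: number;          // length of the incoming request
  needsCodeReasoning: boolean;   // e.g. a detected stack trace or code block
}

function chooseModel(s: RouteSignals): ModelTier {
  // Long prompts or code-heavy requests go to the stronger (pricier) model.
  if (s.needsCodeReasoning || s.promptTokens > 2000) return "large";
  return "small";
}

async function dispatch(prompt: string): Promise<string> {
  const signals: RouteSignals = {
    promptTokens: Math.ceil(prompt.length / 4),          // crude token estimate
    needsCodeReasoning: /Traceback|Exception|at\s+\w+\(/.test(prompt),
  };
  const tier = chooseModel(signals);
  // callModel is a placeholder for whatever client the service already uses.
  return callModel(tier, prompt);
}

declare function callModel(tier: ModelTier, prompt: string): Promise<string>;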

The Selective Abstention Problem: Why AI Systems That Always Answer Are Broken

· 10 min read
Tian Pan
Software Engineer

Here is a pattern that appears in almost every production AI deployment: the team ships a feature that handles 90% of queries well. Then they start getting complaints. A user asked something outside the training distribution; the model confidently produced a wrong answer. A RAG pipeline retrieved a stale document; the model answered as though it were current. A legal query hit an edge case the prompt didn't cover; the model speculated its way through it. The fix, in each case, wasn't a better model. It was teaching the system to say "I don't know."

Abstention — the principled decision to not answer — is one of the hardest and most undervalued capabilities in AI system design. Virtually all product effort goes toward making answers better. Almost none goes toward making the system reliably know when to withhold one. That asymmetry is a design debt that compounds in production.
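
A minimal sketch of what an abstention gate can look like in front of a RAG answer path; the threshold and helper functions are illustrative assumptions, not a prescription:

// Minimal abstention gate: answer only when retrieval evidence clears a bar.
// The 0.6 score cutoff, the 90-day freshness window, and the helpers are hypothetical.
interface Evidence { score: number; ageDays: number }

function shouldAbstain(evidence: Evidence[]): boolean {
  if (evidence.length === 0) return true;                 // nothing retrieved
  const best = Math.max(...evidence.map(e => e.score));
  const freshEnough = evidence.some(e => e.ageDays < 90);
  return best < 0.6 || !freshEnough;                      // weak or stale support
}

function answer(query: string, evidence: Evidence[]): string {
  if (shouldAbstain(evidence)) {
    return "I don't have enough reliable information to answer that.";
  }
  return generateAnswer(query, evidence);                 // normal answer path
}

declare function generateAnswer(query: string, evidence: Evidence[]): string;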

The CAP Theorem for AI Agents: Why Your Agent Fails Completely When It Should Degrade Gracefully

· 9 min read
Tian Pan
Software Engineer

Your AI agent works perfectly until it doesn't. One tool goes down — maybe the search API is rate-limited, maybe the database is slow, maybe the code execution sandbox times out — and the entire agent collapses. Not a partial answer, not a degraded response. A complete failure. A blank screen or a hallucinated mess.

This is not a bug. It is a design choice, and almost nobody made it deliberately. The agent architectures we are building today implicitly choose "fail completely" because nobody designed the partial-availability path. If you have built distributed systems before, this pattern should feel painfully familiar. It is the CAP theorem, showing up in a new disguise.
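
A sketch of the alternative, graceful degradation: wrap each tool call so a failure produces a degraded placeholder the agent can still work with instead of aborting the whole run. The timeout value and the commented tool names are illustrative assumptions:

// Run a tool with a timeout; on failure, return a degraded placeholder the
// agent can still reason over instead of crashing the whole task.
async function withFallback<T>(
  call: () => Promise<T>,
  fallback: T,
  timeoutMs = 3000,
): Promise<{ value: T; degraded: boolean }> {
  try {
    const value = await Promise.race([
      call(),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error("timeout")), timeoutMs)),
    ]);
    return { value, degraded: false };
  } catch {
    return { value: fallback, degraded: true };
  }
}

// Usage: search results become "unavailable" rather than fatal.
// const search = await withFallback(() => searchApi(query), []);
// if (search.degraded) note("Answer compiled without live search results.");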

The Caching Hierarchy for Agentic Workloads: Five Layers Most Teams Stop at Two

· 11 min read
Tian Pan
Software Engineer

Most teams deploying AI agents implement prompt caching, maybe add a semantic cache, and call it done. They're leaving 40-60% of their potential savings on the table. The reason isn't laziness — it's that agentic workloads create caching problems that don't exist in simple request-response LLM calls, and the solutions require thinking in layers that traditional web caching never needed.

A single agent task might involve a 4,000-token system prompt, three tool calls that each return different-shaped data, a multi-step plan that's structurally identical to yesterday's plan, and session context that needs to persist across a conversation but never across users. Each of these represents a different caching opportunity with different TTL requirements, different invalidation triggers, and different failure modes when the cache goes stale.
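
As a rough sketch of how the first layers compose (the key derivation, TTL, and the 0.92 similarity cutoff are assumptions), a lookup walks from the cheapest layer to the most speculative before paying for inference:

// Layered cache lookup: exact match first, then semantic, then give up and
// call the model. TTL and similarity cutoff are illustrative.
async function cachedCompletion(prompt: string): Promise<string> {
  const key = await sha256(prompt);

  const exact = await kvGet(key);                  // layer 1: exact prompt reuse
  if (exact) return exact;

  const similar = await vectorLookup(prompt);      // layer 2: paraphrase reuse
  if (similar && similar.score > 0.92) return similar.text;

  const fresh = await callModel(prompt);           // miss: pay for inference
  await kvSet(key, fresh, { ttlSeconds: 3600 });
  return fresh;
}

// Storage and model clients are placeholders for whatever the stack already uses.
declare function sha256(s: string): Promise<string>;
declare function kvGet(k: string): Promise<string | null>;
declare function kvSet(k: string, v: string, opts: { ttlSeconds: number }): Promise<void>;
declare function vectorLookup(p: string): Promise<{ text: string; score: number } | null>;
declare function callModel(p: string): Promise<string>;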

Coalesce Before You Call: The LLM Request Batching Pattern That Cuts Costs Without Slowing Users Down

· 11 min read
Tian Pan
Software Engineer

Most teams discover request coalescing the same way: through a surprisingly large invoice. They ship an LLM-backed feature, usage grows, and then the billing dashboard shows they're paying for fifty thousand requests a day when closer examination reveals that roughly thirty thousand of them were asking the same thing in slightly different words. Each paraphrase of "summarize this document" hit the model separately. Each near-duplicate triggered a full inference cycle. The cost scaled with traffic volume, not with the semantic diversity of what users actually wanted.

Request coalescing is the pattern that fixes this. It is not one technique but a layered architecture: in-flight deduplication to prevent concurrent duplicates, exact caching for repeated identical prompts, and semantic batching to catch the paraphrased variations in between. The order matters, the thresholds matter, and understanding where the pattern breaks down — particularly around streaming — is what separates a working implementation from one that saves money on a staging server but causes subtle bugs in production.
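
A minimal sketch of the first layer, in-flight deduplication: concurrent identical prompts share one pending call instead of each triggering inference. The key derivation and model client are placeholders:

// In-flight request coalescing: identical prompts arriving while a call is
// already pending reuse that call's promise instead of issuing a new one.
const inFlight = new Map<string, Promise<string>>();

async function coalescedCall(prompt: string): Promise<string> {
  const key = prompt.trim().toLowerCase();   // exact-match key; semantic
                                             // batching would sit behind this
  const pending = inFlight.get(key);
  if (pending) return pending;               // piggyback on the live request

  const call = callModel(prompt).finally(() => inFlight.delete(key));
  inFlight.set(key, call);
  return call;
}

declare function callModel(prompt: string): Promise<string>;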

The $100M Telemetry Bug: What OpenAI's Outage Teaches Us About System Design

· 3 min read

On December 11, 2024, OpenAI experienced a catastrophic outage that took down ChatGPT, their API, and Sora for over four hours. While outages happen to every company, this one is particularly fascinating because it reveals a critical lesson about modern system design: sometimes the tools we add to prevent failures become the source of failures themselves.

The Billion-Dollar Irony

Here's the ironic part: the outage wasn't caused by a hack, a failed deployment, or even a bug in their AI models. It was caused by a tool meant to improve reliability. OpenAI was adding better monitoring to prevent outages and, in the process, created one of its biggest outages ever.

It's like hiring a security guard who accidentally locks everyone out of the building.

The Cascade of Failures

The incident unfolded like this:

  1. OpenAI deployed a new telemetry service to better monitor their systems
  2. This service overwhelmed their Kubernetes control plane with API requests
  3. When the control plane failed, DNS resolution broke
  4. Without DNS, services couldn't find each other
  5. Engineers couldn't fix the problem because they needed the control plane to remove the problematic service

But the most interesting part isn't the failure itself – it's how multiple safety systems failed simultaneously:

  1. Testing didn't catch the issue because it only appeared at scale
  2. DNS caching masked the problem long enough for it to spread everywhere
  3. The very systems needed to fix the problem were the ones that broke

Three Critical Lessons

1. Scale Changes Everything

The telemetry service worked perfectly in testing. The problem only emerged when deployed to clusters with thousands of nodes. This highlights a fundamental challenge in modern system design: some problems only emerge at scale.

2. Safety Systems Can Become Risk Factors

OpenAI's DNS caching, meant to improve reliability, actually made the problem worse by masking the issue until it was too late. Their Kubernetes control plane, designed to manage cluster health, became a single point of failure.

3. Recovery Plans Need Recovery Plans

The most damning part? Engineers couldn't fix the problem because they needed working systems to fix the broken systems. It's like needing a ladder to reach the ladder you need.

The Future of System Design

OpenAI's response plan reveals where system design is headed:

  1. Decoupling Critical Systems: They're separating their data plane from their control plane, reducing interdependencies
  2. Improved Testing: They're adding fault injection testing to simulate failures at scale
  3. Break-Glass Procedures: They're building emergency access systems that work even when everything else fails

What This Means for Your Company

Even if you're not operating at OpenAI's scale, the lessons apply:

  1. Test at scale, not just functionality
  2. Build emergency access systems before you need them
  3. Question your safety systems – they might be hiding risks

The future of reliable systems isn't about preventing all failures – it's about ensuring we can recover from them quickly and gracefully.

Remember: The most dangerous problems aren't the ones we can see coming. They're the ones that emerge from the very systems we build to keep us safe.

Quick Intro to Optimism Architecture

· 4 min read

What is Optimism?

Optimism is an EVM equivalent, optimistic rollup protocol designed to scale Ethereum.

  • Scaling Ethereum means increasing the number of useful transactions the Ethereum network can process.
  • Optimistic rollup is a layer 2 scalability technique which increases the computation & storage capacity of Ethereum without sacrificing security or decentralization.
  • EVM Equivalence is complete compliance with the state transition function described in the Ethereum yellow paper, the formal definition of the protocol.

Optimistic rollup works by bundling multiple transactions into a single transaction, which is then verified by a smart contract on the Ethereum network. This process is called "rolling up" because the individual transactions are combined into a larger transaction that is submitted to the Ethereum network. The term "optimistic" refers to the fact that the system assumes that transactions are valid unless proven otherwise, which allows for faster and more efficient processing of transactions.

Overall Architecture

Optimism Architecture

op-node + op-geth

The rollup node can run either in validator or sequencer mode:

  1. validator (aka verifier): Similar to running an Ethereum node, it simulates L2 transactions locally, without rate limiting. It also lets the validator verify the work of the sequencer, by re-deriving output roots and comparing them against those submitted by the sequencer. In case of a mismatch, the validator can perform a fault proof.
  2. sequencer: The sequencer is a privileged actor that receives L2 transactions from L2 users, builds L2 blocks from them, and submits those blocks to a data availability provider (via the batcher). It also submits output roots to L1. For now there is only one sequencer in the entire stack, which is the main reason people criticize the OP Stack as not being decentralized.

op-batcher

The batch submitter, also referred to as the batcher, is the entity submitting the L2 sequencer data to L1, to make it available for verifiers.

op-proposer

The proposer generates and submits L2 output checkpoints to the L2 output oracle contract on Ethereum. Once the finalization period has passed, this data enables withdrawals.

Both batcher and proposer submit states to L1. Why are they separated?

The batcher collects transaction data and submits it to L1 in batches, while the proposer submits commitments (output roots) to the L2 state, which finalize the view of L2 account states. They are decoupled so that they can work in parallel for efficiency.

contracts-bedrock

Various contracts for L2 to interact with the L1:

  • OptimismPortal: A feed of L2 transactions which originated as smart contract calls in the L1 state.
  • Batch inbox: An L1 address to which the Batch Submitter submits transaction batches.
  • L2 output oracle: A smart contract that stores L2 output roots for use with withdrawals and fault proofs.

Optimism components

How to deposit?

How to withdraw?

Feedback to Optimism's Documentation

Understanding the OP stack can be challenging due to a number of factors. One such factor is the numerous components that are referred to multiple times with slightly different names in code and documentation. For example, the terms "op-batcher" and "batch-submitter" / "verifiers" and "validators" may be used interchangeably, leading to confusion and difficulty in understanding the exact function of each component.

Another challenge in understanding the OP stack is the evolving architecture, which may result in some design elements becoming deprecated over time. Unfortunately, the documentation may not always be updated to reflect these changes. This can lead to further confusion and difficulty in understanding the system, as users may be working with outdated or inaccurate information.

To overcome these challenges, it is important to carefully review all available documentation, to keep terminology consistent across places, and to stay up to date with any changes to the OP stack. This may require additional research and collaboration with other users or developers, but it is essential in order to fully understand and effectively use this complex system.

How to Design the Architecture of a Blockchain Server?

· 7 min read

Requirement Analysis

  • A distributed blockchain accounting and smart contract system
  • Nodes have minimal trust in each other but need to be incentivized to cooperate
    • Transactions are irreversible
    • Do not rely on trusted third parties
    • Protect privacy, disclose minimal information
    • Do not rely on centralized authority to prove that money cannot be spent twice
  • Assuming performance is not an issue, we will not consider how to optimize performance

Architecture Design

Specific Modules and Their Interactions

Base Layer (P2P Network, Cryptographic Algorithms, Storage)

P2P Network

There are two ways to implement distributed systems:

  • Centralized lead/follower distributed systems, such as Hadoop and Zookeeper, which have a simpler structure but high requirements for the lead
  • Decentralized peer-to-peer (P2P) network distributed systems, such as those organized by Chord, CAN, Pastry, and Tapestry algorithms, which have a more complex structure but are more egalitarian

Given the premise that nodes have minimal trust in each other, we choose the P2P form. How do we specifically organize the P2P network? A typical decentralized node and network maintain connections as follows:

  1. Based on the IP protocol, a node comes online at a certain hostname/port, announces its address to an initial list of known nodes, and tries to flood its information across the network through these initial hops.
  2. The initial nodes that receive the broadcast save the new neighbor and help with the flooding; non-adjacent nodes behind NAT use NAT traversal to add neighbors.
  3. Nodes run anti-entropy: they periodically exchange heartbeat messages carrying their latest known information (similar to vector clocks), so peers continuously bring each other up to date.

We can use existing libraries, such as libp2p, to implement the network module. For the choice of network protocols, see Crack the System Design Interview: Communication.
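
As a toy sketch of the anti-entropy step above (message shapes and versioning are simplified assumptions), each node periodically snapshots its state for a peer and merges whatever newer entries come back:

// Toy anti-entropy: nodes exchange version-stamped entries and keep the newer
// one per key, so state converges without a central coordinator.
type Versioned = { version: number; value: string };

class GossipNode {
  constructor(public id: string, public state = new Map<string, Versioned>()) {}

  heartbeat(): Map<string, Versioned> {
    return new Map(this.state);             // snapshot to send to a random peer
  }

  merge(remote: Map<string, Versioned>): void {
    for (const [key, entry] of remote) {
      const local = this.state.get(key);
      if (!local || entry.version > local.version) this.state.set(key, entry);
    }
  }
}

// Usage: a.merge(b.heartbeat()); b.merge(a.heartbeat());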

Cryptographic Algorithms

In a distributed system with minimal trust, how can a transfer be proven to be initiated by oneself without leaking secret information? Asymmetric encryption: a pair of public and private keys corresponds to "ownership." Bitcoin chooses the secp256k1 parameters of the ECDSA elliptic curve cryptographic algorithm, and for compatibility, other chains also generally choose the same algorithm.

Why not directly use the public key as the address for the transfer? Privacy concerns; the transaction process should disclose as little information as possible. Using the hash of the public key as the "address" can prevent the recipient from leaking the public key. Furthermore, people should avoid reusing the same address.
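
A minimal sketch of that derivation using Node's built-in crypto; the hash chain is simplified (Bitcoin, for instance, uses SHA-256 followed by RIPEMD-160 plus a checksum encoding):

import { createHash, generateKeyPairSync } from "node:crypto";

// Derive an "address" as a hash of the public key, so the key itself is not
// revealed until funds at that address are spent. Simplified: plain SHA-256
// stands in for Bitcoin's SHA-256 + RIPEMD-160 + checksum encoding.
const { publicKey, privateKey } = generateKeyPairSync("ec", {
  namedCurve: "secp256k1",
});

const publicKeyBytes = publicKey.export({ type: "spki", format: "der" });
const address = createHash("sha256").update(publicKeyBytes).digest("hex").slice(0, 40);

console.log("address:", address);   // share this, not the public key itself
// privateKey stays secret and is used only to sign transactions.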

Regarding account ledgers, there are two implementation methods: UTXO vs. Account/Balance

  • UTXO (unspent transaction output), such as Bitcoin, resembles double-entry bookkeeping with credits and debits. Each transaction has inputs and outputs, and every input is linked to a previous output except for the coinbase. Although there is no concept of an account, summing all unspent outputs for an address gives that address's balance (a toy sketch follows after this list).
    • Advantages
      • Precision: The structure similar to double-entry bookkeeping allows for very accurate recording of all asset flows.
      • Privacy protection and some resistance to quantum attacks, provided users frequently change addresses: a public key is only revealed when its output is spent.
      • Stateless: Leaves room for improving concurrency.
      • Avoids replay attacks: Because replaying will not find the corresponding UTXO for the input.
    • Disadvantages
      • Records all transactions, complex, consumes storage space.
      • Traversing UTXOs takes time.
  • Account/Balance, such as Ethereum, has three main maps: account map, transaction map, transaction receipts map. Specifically, to reduce space and prevent tampering, it uses a Merkle Patricia Trie (MPT).
    • Advantages
      • Space-efficient: unlike the UTXO model, where a single transaction references multiple UTXOs, only account balances need to be stored.
      • Simplicity: Complexity is offloaded to the script.
    • Disadvantages
      • Requires using nonce to solve replay issues since there is no dependency between transactions.
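
To make the UTXO side concrete, a toy sketch with scripts and signatures omitted: an address's balance is the sum of outputs paid to it that no later input has spent.

// Toy UTXO model: a balance is derived, not stored, by scanning outputs that
// have not been referenced by any input. Signatures and scripts are omitted.
interface TxOut { address: string; amount: number }
interface TxIn  { txId: string; outIndex: number }        // points at a prior output
interface Tx    { id: string; inputs: TxIn[]; outputs: TxOut[] }

function balanceOf(address: string, chain: Tx[]): number {
  const spent = new Set(
    chain.flatMap(tx => tx.inputs.map(i => `${i.txId}:${i.outIndex}`)),
  );
  let total = 0;
  for (const tx of chain) {
    tx.outputs.forEach((out, index) => {
      if (out.address === address && !spent.has(`${tx.id}:${index}`)) {
        total += out.amount;                               // still unspent
      }
    });
  }
  return total;
}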

It is worth mentioning that the "block + chain" data structure is essentially an append-only Merkle tree, also known as a hash tree.

Storage

Since UTXO or MPT structures serve as indexes, and to simplify operations for each node in a distributed environment, data persistence typically favors in-process databases that can run directly with the node's program, such as LevelDB or RocksDB.

Because these indexes are not universal, you cannot query them like an SQL database, which raises the barrier for data analysis. Optimizations require a dedicated indexing service, such as Etherscan.

Protocol Layer

Now that we have a functional base layer, we need a more general protocol layer for logical operations above this layer. Depending on the blockchain's usage requirements, specific logical processing modules can be plugged in and out like a microkernel architecture.

For instance, the most common accounting: upon receiving some transactions at the latest block height, organize them to establish the data structure as mentioned in the previous layer.

Writing a native module for each piece of business logic and updating every node's code is not very realistic. Can we decouple this layer using virtualization? The answer is a virtual machine capable of executing smart contract code. In a non-trusting environment, we cannot allow clients to execute code for free, so the most distinctive feature of this virtual machine may be metering and billing (gas).
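
A toy sketch of the billing idea, with opcodes and gas prices invented for illustration: the interpreter charges gas per instruction and halts when the prepaid budget runs out, which is what prevents free-riding on other people's machines.

// Toy metered interpreter: every instruction costs gas; execution stops when
// the caller's prepaid budget is exhausted. Opcodes and prices are made up.
type Op = { code: "PUSH"; value: number } | { code: "ADD" } | { code: "MUL" };
const GAS_COST: Record<Op["code"], number> = { PUSH: 1, ADD: 3, MUL: 5 };

function run(program: Op[], gasLimit: number): { stack: number[]; gasUsed: number } {
  const stack: number[] = [];
  let gasUsed = 0;
  for (const op of program) {
    gasUsed += GAS_COST[op.code];
    if (gasUsed > gasLimit) throw new Error("out of gas");
    if (op.code === "PUSH") stack.push(op.value);
    if (op.code === "ADD") stack.push(stack.pop()! + stack.pop()!);
    if (op.code === "MUL") stack.push(stack.pop()! * stack.pop()!);
  }
  return { stack, gasUsed };
}

// run([{code:"PUSH",value:2},{code:"PUSH",value:3},{code:"MUL"}], 10)
// => { stack: [6], gasUsed: 7 }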

The difference between contract-based tokens, such as ERC20, and native tokens leads to complications when dealing with different tokens, resulting in the emergence of Wrapped Ether tokens.

Consensus Layer

After the protocol layer computes the execution results, how do we reach consensus with other nodes? There are several common mechanisms to incentivize cooperation:

  • Proof of Work (POW): Mining tokens through brute-force hashing (finding a hash below a difficulty target), which is energy-intensive and not environmentally friendly.
  • Proof of Stake (POS): Mining tokens using staked tokens.
  • Delegated Proof-of-Stake (DPOS): Electing representatives to mine tokens using staked tokens.

Based on the incentive mechanism, the longest chain among nodes is followed; if two groups dislike each other, a fork occurs.

Additionally, there are consensus protocols that help everyone reach agreement (i.e., everyone either does something together or does nothing together):

  • 2PC: Everyone relies on a coordinator: the coordinator asks everyone: should we proceed? If anyone replies no, the coordinator tells everyone "no"; otherwise, everyone proceeds. This dependency can lead to issues if the coordinator fails in the middle of the second phase, leaving some nodes unsure of what to do with the block, requiring manual intervention to restart the coordinator.
  • 3PC: To solve the above problem, an additional phase is added to ensure everyone knows whether to proceed before doing so; if an error occurs, a new coordinator is selected.
  • Paxos: The above 2PC and 3PC both rely on a coordinator; how can we eliminate this coordinator? By using "the majority (at least f+1 in 2f + 1)" to replace it, as long as the majority agrees in two steps, consensus can be achieved.
  • PBFT (deterministic 3-step protocol): The fault tolerance of the above methods is still not high enough, leading to the development of PBFT. This algorithm ensures that a supermajority (2/3) of nodes either all agree or all disagree, implemented through three rounds of voting, with at least 2/3 of nodes agreeing in each round before the block is committed in the final round.

In practical applications, relational databases mostly use 2PC or 3PC; variants of Paxos include implementations in Zookeeper, Google Chubby distributed locks, and Spanner; in blockchain, Bitcoin and Ethereum use POW, while the new Ethereum uses POS, and IoTeX and EOS use DPOS.
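
As a toy sketch of the 2PC flow described above (the participant interface is invented for illustration): the coordinator commits only if every participant votes yes in the prepare phase, which is exactly the dependency that makes a coordinator failure between the phases so awkward.

// Toy two-phase commit: phase 1 collects votes, phase 2 broadcasts the
// decision. If the coordinator dies between the phases, participants hang.
interface Participant {
  prepare(txId: string): Promise<boolean>;   // vote yes or no
  commit(txId: string): Promise<void>;
  abort(txId: string): Promise<void>;
}

async function twoPhaseCommit(txId: string, participants: Participant[]): Promise<boolean> {
  const votes = await Promise.all(participants.map(p => p.prepare(txId)));
  if (votes.every(v => v)) {
    await Promise.all(participants.map(p => p.commit(txId)));
    return true;
  }
  await Promise.all(participants.map(p => p.abort(txId)));
  return false;
}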

API Layer

See Public API choices

Designing Human-Centric Internationalization (i18n) Engineering Solutions

· 9 min read

Requirement Analysis

If you ask what the biggest difference is between Silicon Valley companies and those in China, the answer is likely, as Wu Jun said, that Silicon Valley companies primarily target the global market. As Chen Zhiwu put it, the ability to create wealth can be measured along three dimensions: depth, meaning productivity, the ability to provide better products or services in the same amount of time; length, meaning the ability to leverage finance to exchange value across time and space; and breadth, meaning market size, the ability to create markets or new industries that transcend geographical boundaries. Internationalization, the localization of products and services in terms of language and culture, is a strategic key for multinational companies competing in the global market.

Internationalization, abbreviated as i18n (with 18 letters between the 'i' and the 'n'), aims to solve the following issues in the development of websites and mobile apps:

  1. Language
  2. Time and Time Zones
  3. Numbers and Currency

Framework Design

Language

Logic and Details

The essence of language is as a medium for delivering messages to the audience; different languages serve as different media, each targeting different audiences. For example, if we want to display the message to the user: "Hello, Xiaoli!", the process involves checking the language table, determining the user's language, and the current required interpolation, such as the name, to display the corresponding message:

Message Code | Locale | Translation
home.hello   | en     | Hello, ${username}!
home.hello   | zh-CN  | 你好, ${username}!
home.hello   | iw     | שלום, ${username}!

Different languages may have slight variations in details, such as the singular and plural forms of an item, or the distinction between male and female in third-person references.

These are issues that simple table lookups cannot handle; they require more complex logic. In code, you can use conditional statements to handle these exceptions. Additionally, some internationalization frameworks invent domain-specific languages (DSLs) specifically for these situations, for example Mozilla's Project Fluent.
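
As a rough sketch of the lookup-plus-interpolation step with a crude plural branch (the catalog and helpers are invented; real frameworks such as Fluent or ICU MessageFormat cover far more rules):

// Minimal message lookup with interpolation and a crude plural branch.
const catalog: Record<string, Record<string, string>> = {
  en:      { "home.hello": "Hello, ${username}!", "cart.items.one": "1 item", "cart.items.other": "${count} items" },
  "zh-CN": { "home.hello": "你好, ${username}!",  "cart.items.other": "${count} 件商品" },
};

function t(locale: string, code: string, vars: Record<string, string | number> = {}): string {
  const template = catalog[locale]?.[code] ?? catalog.en[code] ?? code;   // fall back to English
  return template.replace(/\$\{(\w+)\}/g, (_, name) => String(vars[name] ?? ""));
}

function tPlural(locale: string, base: string, count: number): string {
  const form = new Intl.PluralRules(locale).select(count);                // "one", "other", ...
  const code = catalog[locale]?.[`${base}.${form}`] ? `${base}.${form}` : `${base}.other`;
  return t(locale, code, { count });
}

// t("zh-CN", "home.hello", { username: "Xiaoli" })  => "你好, Xiaoli!"
// tPlural("en", "cart.items", 1)                    => "1 item"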

Another issue that beginners often overlook is the direction of writing. Common languages like Chinese and English are written from left to right, while some languages, such as Hebrew and Arabic, are written from right to left.

The difference in writing direction affects not only the text itself but also the input method. A Chinese person would find it very strange to input text from right to left; conversely, a Jewish colleague of mine finds it easy to mix English and Hebrew input.

Layout is another consideration. The entire UI layout and visual elements, such as the direction of arrows, may change based on the language's direction. Your HTML needs to set the appropriate dir attribute.

How to Determine the User's Locale?

You may wonder how we know the user's current language settings. In the case of a browser, when a user requests a webpage, there is a header called Accept-Language that indicates the accepted languages. These settings come from the user's system language and browser settings. In mobile apps, there is usually an API to retrieve the locale variable or constant. Another method is to determine the user's location based on their IP or GPS information and then display the corresponding language. For multinational companies, users often indicate their language preferences and geographical regions during registration.
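
A small sketch of locale negotiation from the Accept-Language header; the supported-locale list is an assumption, and production code would normally lean on an existing negotiator library:

// Pick the best supported locale from an Accept-Language header such as
// "he-IL,he;q=0.9,en-US;q=0.8,en;q=0.7". Simplified q-value parsing.
const SUPPORTED = ["en", "zh-CN", "he"];

function negotiateLocale(acceptLanguage: string, fallback = "en"): string {
  const ranked = acceptLanguage
    .split(",")
    .map(part => {
      const [tag, q] = part.trim().split(";q=");
      return { tag, q: q ? parseFloat(q) : 1 };
    })
    .sort((a, b) => b.q - a.q);

  for (const { tag } of ranked) {
    const exact = SUPPORTED.find(s => s.toLowerCase() === tag.toLowerCase());
    if (exact) return exact;
    const base = SUPPORTED.find(s => s.split("-")[0] === tag.split("-")[0].toLowerCase());
    if (base) return base;                       // e.g. "he-IL" falls back to "he"
  }
  return fallback;
}

// negotiateLocale("he-IL,he;q=0.9,en-US;q=0.8") => "he"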

If a user wants to change the language, websites have various approaches, while mobile apps tend to have more fixed APIs. Here are some methods for websites:

  1. Set a locale cookie
  2. Use different subdomains
  3. Use a dedicated domain. Pinterest has an article discussing how they utilize localized domains. Research shows that using local domain suffixes leads to higher click-through rates.
  4. Use different paths
  5. Use query parameters. While this method is feasible, it is not SEO-friendly.

Beginners often forget to mark the lang attribute in HTML when creating websites.

Translation Management Systems

Once you have carefully implemented the display of text languages, you will find that establishing and managing a translation library is also a cumbersome process.

Typically, developers do not have expertise in multiple languages. At this point, external translators or pre-existing translation libraries need to be introduced. The challenge here is that translators are often not technical personnel. Allowing them to directly modify code or communicate directly with developers can significantly increase translation costs. Therefore, in Silicon Valley companies, translation management systems (TMS) designed for translators are often managed by a dedicated team or involve purchasing existing solutions, such as the closed-source paid service lokalise.co or the open-source Mozilla Pontoon. A TMS can uniformly manage translation libraries, projects, reviews, and task assignments.

This way, the development process becomes: first, designers identify areas that need attention based on different languages and cultural habits during the design phase. For example, a button that is short in English may be very long in Russian, so care must be taken to avoid overflow. Then, the development team implements specific code logic based on the design requirements and provides message codes, contextual background, and examples written in a language familiar to developers in the translation management system. Subsequently, the translation team fills in translations for various languages in the management system. Finally, the development team pulls the translation library back into the codebase and releases it into the product.

Contextual background is an easily overlooked and challenging aspect. Where in the UI is the message that needs translation? What is its purpose? If the message is too short, further explanation may be needed. With this background knowledge, translators can provide more accurate translations in other languages. If translators cannot fully understand the intended message, they need a feedback channel to reach out to product designers and developers for clarification.

Given the multitude of languages and texts, it is rare for a single translator to handle everything; it typically requires a team of individuals with language expertise from various countries to contribute to the translation library. The entire process is time-consuming and labor-intensive, which is why translation teams are often established, such as outsourcing to Smartling.

Now that we have the code logic and translation library, the next question is: how do we integrate the content of the translation library into the product?

There are many different implementation methods; the most straightforward is a static approach where, each time an update occurs, a diff is submitted and merged into the code. This way, relevant translation materials are already included in the code during the build process.

Another approach is dynamic integration. On one hand, you can "pull" content from a remote translation library, which may lead to performance issues during high website traffic. However, the advantage is that translations are always up-to-date. On the other hand, for optimization, a "push" method can be employed, where any new changes in the translation library trigger a webhook to push the content to the server.

In my view, maintaining translations is more cumbersome than adding them. I have seen large projects become chaotic because old translations were not promptly removed after updates, leading to an unwieldy translation library. A good tool that ensures data consistency would greatly assist in maintaining clean code.

Alibaba's Kiwi internationalization solution has implemented a linter and VS Code plugin to help you check and extract translations from the code.

Time and Time Zones

Having discussed language, the next topic is time and time zones. As a global company, much of the data comes from around the world and is displayed to users globally. For example, how do international flights ensure that start and end times are consistent globally and displayed appropriately across different time zones? This is crucial. The same situation applies to all time-related events, such as booking hotels, reserving restaurants, and scheduling meetings.

First, there are several typical representations of time:

  1. Natural language, such as 07:23:01 AM, Monday 28 October 2019 CST
  2. Unix timestamp (Int type), such as 1572218668
  3. Datetime. Note that MySQL's TIMESTAMP type converts values to UTC for storage (based on the session time zone) and converts them back when reading, while DATETIME stores the value as-is; the server's time zone is generally set to UTC anyway. In either case the stored value carries no explicit time zone information, defaulting to UTC.
  4. ISO Date, such as 2019-10-27T23:24:28+00:00, which includes time zone information.

I have no strong preference for these formats; if you have relevant experience, feel free to discuss it.

When displaying time, two types of conversions may occur: one is converting the stored server time zone to the local time zone for display; the other is converting machine code to natural language. A popular approach for the latter is to use powerful libraries for handling time and dates, such as moment.js and dayjs.
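
A small sketch of both conversions using the built-in Intl APIs rather than any particular library: a stored UTC instant rendered in a viewer's time zone, and a machine value turned into natural language.

// Store instants as UTC (Unix timestamp or ISO string); convert to the
// user's time zone only at display time.
const storedUtc = "2019-10-27T23:24:28Z";          // what the database holds
const instant = new Date(storedUtc);

const display = new Intl.DateTimeFormat("en-US", {
  dateStyle: "full",
  timeStyle: "short",
  timeZone: "Asia/Shanghai",                        // the viewer's zone
}).format(instant);
// => "Monday, October 28, 2019 at 7:24 AM"

const relative = new Intl.RelativeTimeFormat("en", { numeric: "auto" })
  .format(-3, "day");                               // machine value -> words
// => "3 days ago"

console.log(display, "|", relative);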

Numbers and Currency

The display of numbers varies significantly across different countries and regions. The meaning of commas and periods in numbers differs from one country to another.

(1000.1).toLocaleString("en");
// => "1,000.1"

(1000.1).toLocaleString("de");
// => "1.000,1"

(1000.1).toLocaleString("ru");
// => "1 000,1"
Arabic numerals are not universally applicable; for instance, with an Arabic locale, Java's String.format renders the digits 1, 2, 3 as the Eastern Arabic numerals ١، ٢، ٣.

Regarding pricing, should the same goods be displayed in local currency values in different countries? What is the currency symbol? How precise should the currency be? These questions must be addressed in advance.
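
The built-in Intl.NumberFormat answers much of the display side; a small sketch with arbitrarily chosen locales and currencies:

// Locale-aware currency formatting: symbol, grouping, and precision all
// follow the locale/currency pair instead of being hand-rolled.
const price = 1999.5;

console.log(new Intl.NumberFormat("en-US", { style: "currency", currency: "USD" }).format(price));
// => "$1,999.50"
console.log(new Intl.NumberFormat("de-DE", { style: "currency", currency: "EUR" }).format(price));
// => "1.999,50 €"
console.log(new Intl.NumberFormat("ja-JP", { style: "currency", currency: "JPY" }).format(price));
// => "￥2,000"  (yen has no minor unit, so the value is rounded)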

Conclusion

The internationalization tools mentioned in this article include translation management systems such as the open-source Mozilla Pontoon, the closed-source paid services lokalise.co and POEditor.com, and so on. For code consistency, Alibaba's Kiwi internationalization solution is recommended. For UI display, consider using moment.js or dayjs.

Like all software system development, there is no silver bullet for internationalization; great works are crafted through foundational skills honed over time.

Designing a Load Balancer

· 4 min read

Requirements Analysis

Internet services often need to handle traffic from around the world, but a single server can only serve a limited number of requests at the same time. Therefore, we typically have a server cluster to collectively manage this traffic. The question arises: how can we evenly distribute this traffic across different servers?

From the user to the server, there are many nodes and load balancers at different levels. Specifically, our design requirements are:

  • Design a Layer 7 load balancer located internally in the data center.
  • Utilize real-time load information from the backend.
  • Handle tens of millions of requests per second and a throughput of 10 TB per second.

Note: If Service A depends on Service B, we refer to A as the downstream service and B as the upstream service.

Challenges

Why is load balancing difficult? The answer lies in the challenge of collecting accurate load distribution data.

Count-based Distribution ≠ Load-based Distribution

The simplest approach is to distribute traffic randomly or in a round-robin manner based on the number of requests. However, the actual load is not calculated based on the number of requests; for example, some requests are heavy and CPU-intensive, while others are lightweight.

To measure load more accurately, the load balancer must maintain some local state—such as the current number of requests, connection counts, and request processing delays. Based on this state, we can employ appropriate load balancing algorithms—least connections, least latency, or random N choose one.

Least Connections: Requests are directed to the server with the fewest current connections.

Least Latency: Requests are directed to the server with the lowest average response time and fewest connections. Servers can also be weighted.

Random N Choose One (N is typically 2, so we can also refer to it as the power of two choices): Randomly select two servers and choose the better of the two, which helps avoid the worst-case scenario.
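
A minimal sketch of the power-of-two-choices pick, using the current connection count as the load signal (real balancers use richer state):

// Power of two choices: sample two random backends and send the request to
// the less loaded one. Avoids herding onto a single "best" server.
interface Backend { host: string; activeConnections: number }

function pickBackend(backends: Backend[]): Backend {
  const a = backends[Math.floor(Math.random() * backends.length)];
  const b = backends[Math.floor(Math.random() * backends.length)];
  return a.activeConnections <= b.activeConnections ? a : b;
}

// Usage: increment activeConnections on dispatch, decrement on completion,
// so the comparison reflects in-flight work rather than historical counts.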

Distributed Environment

In a distributed environment, local load balancers struggle to understand the complete state of upstream and downstream services, including:

  • Load of upstream services
  • Upstream services can be very large, making it difficult to select an appropriate subset for the load balancer
  • Load of downstream services
  • The specific processing time for different types of requests is hard to predict

Solutions

There are three approaches to accurately collect load information and respond accordingly:

  • A centralized balancer that dynamically manages based on the situation
  • Distributed balancers that share state among them
  • Servers return load information along with requests, or the balancer actively queries the servers

Dropbox chose the third approach when implementing Bandaid, as it adapted well to the existing random N choose one algorithm.

However, unlike the original random N choose one algorithm, this approach does not rely on local state but instead uses real-time results returned by the servers.

Server Utilization: Backend servers set a maximum load, track current connections, and calculate utilization, ranging from 0.0 to 1.0.

Two issues need to be considered:

  1. Error handling: if a failing server fails fast, its quick error responses may attract even more traffic, leading to more errors.
  2. Data decay: if a server reports very high load, no requests are sent to it, so its report never refreshes. A decay function shaped like a reverse S-curve ages out stale load data so the server becomes eligible again.
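
A rough sketch of how reported utilization might be scored and aged; the logistic reverse S-curve constants and the decay window are assumptions, not Dropbox's actual parameters.

// Score a backend from its self-reported utilization (0.0 to 1.0), then decay
// trust in that report as it ages so a once-overloaded server is retried.
function weight(utilization: number, reportAgeSeconds: number): number {
  // Reverse S-curve: near 1 when idle, dropping steeply as utilization -> 1.
  const freshScore = 1 / (1 + Math.exp(12 * (utilization - 0.75)));
  // Linearly decay toward "average" (0.5) over 30 seconds of staleness.
  const staleness = Math.min(reportAgeSeconds / 30, 1);
  return freshScore * (1 - staleness) + 0.5 * staleness;
}

// weight(0.95, 0)  ≈ 0.08  -> almost never picked while the report is fresh
// weight(0.95, 30) = 0.5   -> eligible again once the report has gone stale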

Result: Requests Received by Servers are More Balanced

Lyft's Marketing Automation Platform Symphony

· 3 min read

Customer Acquisition Efficiency Issue: How can advertising campaigns achieve higher returns with less money and fewer people?

Specifically, Lyft's advertising campaigns need to address the following characteristics:

  1. Manage location-based campaigns
  2. Data-driven growth: growth must be scalable, measurable, and predictable
  3. Support Lyft's unique growth model, as shown below:

lyft growth model

The main challenge is the difficulty of scaling management across various aspects of regional marketing, including ad bidding, budgeting, creative assets, incentives, audience selection, testing, and more. The following image depicts a day in the life of a marketer:

A Day in the Life of a Marketer

We can see that "execution" takes up most of the time, while less time is spent on the more important tasks of "analysis and decision-making." Scaling means reducing complex operations and allowing marketers to focus on analysis and decision-making.

Solution: Automation

To reduce costs and improve the efficiency of experimentation, it is necessary to:

  1. Predict whether new users are interested in the product
  2. Optimize across multiple channels and effectively evaluate and allocate budgets
  3. Conveniently manage thousands of campaigns

Data is enhanced through Lyft's Amundsen system using reinforcement learning.

The automation components include:

  1. Updating bid keywords
  2. Disabling underperforming creative assets
  3. Adjusting referral values based on market changes
  4. Identifying high-value user segments
  5. Sharing strategies across multiple campaigns

Architecture

Lyft Symphony Architecture

Technology stack: Apache Hive, Presto, ML platform, Airflow, 3rd-party APIs, UI.

Specific Component Modules

LTV Prediction Module

The lifetime value (LTV) of users is an important metric for evaluating channels, and the budget is determined by both LTV and the price we are willing to pay for customer acquisition in that region.

Our understanding of new users is limited, but as interactions increase, the historical data provided will more accurately predict outcomes.

Initial feature values:

Feature Values

As historical interaction records accumulate, the predictions become more accurate:

Predicting LTV Based on Historical Records

Budget Allocation Module

Once LTV is established, the next step is to set the budget based on pricing. A curve of the form LTV = a * (spend)^b is fitted, along with similar parameter curves in the surrounding range. Achieving a global optimum requires some randomness.
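
As a sketch of the fit, a log-log least-squares estimate of a and b; the example data points are fabricated for illustration:

// Fit LTV = a * spend^b by linear regression in log space:
// ln(LTV) = ln(a) + b * ln(spend).
function fitPowerCurve(points: { spend: number; ltv: number }[]): { a: number; b: number } {
  const xs = points.map(p => Math.log(p.spend));
  const ys = points.map(p => Math.log(p.ltv));
  const n = points.length;
  const meanX = xs.reduce((s, x) => s + x, 0) / n;
  const meanY = ys.reduce((s, y) => s + y, 0) / n;
  const b =
    xs.reduce((s, x, i) => s + (x - meanX) * (ys[i] - meanY), 0) /
    xs.reduce((s, x) => s + (x - meanX) ** 2, 0);
  const a = Math.exp(meanY - b * meanX);
  return { a, b };
}

// Diminishing returns (b < 1) show up directly in the fitted exponent, e.g.
// fitPowerCurve([{spend: 1000, ltv: 300}, {spend: 2000, ltv: 480}, {spend: 4000, ltv: 760}])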

Budget Calculation

Delivery Module

This module is divided into two parts: the parameter tuner and the executor. The tuner sets specific parameters based on pricing for each channel, while the executor applies these parameters to the respective channels.

There are many popular delivery strategies that are common across various channels:

Delivery Strategies

Conclusion

It is essential to recognize the importance of human experience within the system; otherwise, it results in garbage in, garbage out. When people are liberated from tedious delivery tasks and can focus on understanding users, channels, and the messages they need to convey to their audience, they can achieve better campaign results—spending less time to achieve higher ROI.

How to write solid code?

· One min read


  1. Empathy / perspective-taking is the most important.

    1. Realize that code is written first for humans to read and only then for machines to execute.
    2. Software is so "soft" that there are many ways to achieve one thing; it is all about making the proper trade-offs to fulfill the requirements.
    3. Invent and simplify: Apple Pay RFID vs. WeChat scanning a QR code.
  2. Choose a sustainable architecture to reduce the human cost per feature.

    1. Adopt patterns and best practices.
    2. Avoid anti-patterns:

      • missing error handling
      • callback hell = spaghetti code + unpredictable error handling
      • overly long inheritance chains
      • circular dependencies
      • over-complicated code
        • nested ternary operations
        • commented-out unused code
      • missing i18n, especially RTL issues
      • violations of "don't repeat yourself"
        • simple copy-and-paste
        • unreasonable comments
  3. Refactor effectively.

    • Use semantic versioning.
    • Never introduce breaking changes in non-major versions.
      • Use a two-legged change: ship the new path, migrate callers, then remove the old one.