
76 posts tagged with "system design"


The $100M Telemetry Bug: What OpenAI's Outage Teaches Us About System Design

· 3 min read

On December 11, 2024, OpenAI experienced a catastrophic outage that took down ChatGPT, their API, and Sora for over four hours. While outages happen to every company, this one is particularly fascinating because it reveals a critical lesson about modern system design: sometimes the tools we add to prevent failures become the source of failures themselves.

The Billion-Dollar Irony

Here's the ironic part: the outage wasn't caused by a hack, a failed deployment, or even a bug in their AI models. Instead, it was caused by a tool meant to improve reliability. OpenAI was adding better monitoring to prevent outages when they accidentally created one of their biggest outages ever.

It's like hiring a security guard who accidentally locks everyone out of the building.

The Cascade of Failures

The incident unfolded like this:

  1. OpenAI deployed a new telemetry service to better monitor their systems
  2. This service overwhelmed their Kubernetes control plane with API requests
  3. When the control plane failed, DNS resolution broke
  4. Without DNS, services couldn't find each other
  5. Engineers couldn't fix the problem because they needed the control plane to remove the problematic service

But the most interesting part isn't the failure itself – it's how multiple safety systems failed simultaneously:

  1. Testing didn't catch the issue because it only appeared at scale
  2. DNS caching masked the problem long enough for it to spread everywhere
  3. The very systems needed to fix the problem were the ones that broke

Three Critical Lessons

1. Scale Changes Everything

The telemetry service worked perfectly in testing. The problem only emerged when deployed to clusters with thousands of nodes. This highlights a fundamental challenge in modern system design: some problems only emerge at scale.

2. Safety Systems Can Become Risk Factors

OpenAI's DNS caching, meant to improve reliability, actually made the problem worse by masking the issue until it was too late. Their Kubernetes control plane, designed to manage cluster health, became a single point of failure.
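To see how a cache can hide a failure, here is a minimal, hypothetical TypeScript sketch (not OpenAI's actual setup, and all names and numbers are made up): a resolver keeps serving cached records after its upstream source goes down, so everything looks healthy until the TTL expires, at which point the failure surfaces everywhere at once.

// Minimal illustration of how a DNS-style cache can mask an upstream failure.
type CacheEntry = { address: string; expiresAt: number };

class CachingResolver {
  private cache = new Map<string, CacheEntry>();

  constructor(
    private upstream: (name: string) => string | null, // null = upstream is down
    private ttlMs: number,
  ) {}

  resolve(name: string, now: number): string {
    const hit = this.cache.get(name);
    if (hit && hit.expiresAt > now) return hit.address; // failure stays hidden here
    const fresh = this.upstream(name);
    if (fresh === null) throw new Error(`cannot resolve ${name}: upstream down, cache expired`);
    this.cache.set(name, { address: fresh, expiresAt: now + this.ttlMs });
    return fresh;
  }
}

// The upstream "control plane" dies right after the cache is warmed,
// but lookups keep succeeding until the TTL runs out.
let upstreamAlive = true;
const resolver = new CachingResolver(() => (upstreamAlive ? "10.0.0.1" : null), 60_000);
resolver.resolve("api.internal", 0); // warm the cache
upstreamAlive = false;
console.log(resolver.resolve("api.internal", 30_000)); // still "10.0.0.1" -- masked
// resolver.resolve("api.internal", 90_000); // would throw: the outage becomes visible everywhere at once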

3. Recovery Plans Need Recovery Plans

The most damning part? Engineers couldn't fix the problem because they needed working systems to fix the broken systems. It's like needing a ladder to reach the ladder you need.

The Future of System Design

OpenAI's response plan reveals where system design is headed:

  1. Decoupling Critical Systems: They're separating their data plane from their control plane, reducing interdependencies
  2. Improved Testing: They're adding fault injection testing to simulate failures at scale
  3. Break-Glass Procedures: They're building emergency access systems that work even when everything else fails

What This Means for Your Company

Even if you're not operating at OpenAI's scale, the lessons apply:

  1. Test at scale, not just functionality
  2. Build emergency access systems before you need them
  3. Question your safety systems – they might be hiding risks

The future of reliable systems isn't about preventing all failures – it's about ensuring we can recover from them quickly and gracefully.

Remember: The most dangerous problems aren't the ones we can see coming. They're the ones that emerge from the very systems we build to keep us safe.

Quick Intro to Optimism Architecture

· 4 min read

What is Optimism?

Optimism is an EVM equivalent, optimistic rollup protocol designed to scale Ethereum.

  • Scaling Ethereum means increasing the number of useful transactions the Ethereum network can process.
  • Optimistic rollup is a layer 2 scalability technique which increases the computation & storage capacity of Ethereum without sacrificing security or decentralization.
  • EVM Equivalence is complete compliance with the state transition function described in the Ethereum yellow paper, the formal definition of the protocol.

Optimistic rollup works by bundling multiple transactions into a single transaction, which is then verified by a smart contract on the Ethereum network. This process is called "rolling up" because the individual transactions are combined into a larger transaction that is submitted to the Ethereum network. The term "optimistic" refers to the fact that the system assumes that transactions are valid unless proven otherwise, which allows for faster and more efficient processing of transactions.
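As a rough mental model of "rolling up" (heavily simplified, not the actual OP Stack batch encoding, which compresses and frames data before posting it to L1), a sequencer-like component collects many L2 transactions into one batch and publishes a single commitment:

import { createHash } from "crypto";

// Heavily simplified: many L2 transactions become one batch plus one commitment.
interface L2Tx {
  from: string;
  to: string;
  value: bigint;
  data: string;
}

function hash(payload: string): string {
  return createHash("sha256").update(payload).digest("hex");
}

function buildBatch(txs: L2Tx[]): { batchData: string; commitment: string } {
  // Serialize every transaction into one blob of data...
  const batchData = JSON.stringify(txs, (_k, v) => (typeof v === "bigint" ? v.toString() : v));
  // ...and commit to it with a single hash that an L1 contract (or a verifier
  // re-deriving state) can later check against.
  return { batchData, commitment: hash(batchData) };
}

const { commitment } = buildBatch([
  { from: "0xalice", to: "0xbob", value: 1n, data: "0x" },
  { from: "0xbob", to: "0xcarol", value: 2n, data: "0x" },
]);
console.log("batch commitment posted to L1:", commitment);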

Overall Architecture

Optimism Architecture

op-node + op-geth

The rollup node can run either in validator or sequencer mode:

  1. validator (aka verifier): Similar to running an Ethereum node, it simulates L2 transactions locally, without rate limiting. It also lets the validator verify the work of the sequencer, by re-deriving output roots and comparing them against those submitted by the sequencer. In case of a mismatch, the validator can perform a fault proof.
  2. sequencer: The sequencer is a privileged actor, which receives L2 transactions from L2 users and creates L2 blocks from them, which it then submits to a data availability provider (via a batcher). It also submits output roots to L1. There is only one sequencer in the entire stack for now, which is why people criticize the OP Stack as not decentralized.

op-batcher

The batch submitter, also referred to as the batcher, is the entity submitting the L2 sequencer data to L1, to make it available for verifiers.

op-proposer

The proposer generates and submits L2 output checkpoints to the L2 output oracle contract on Ethereum. After the finalization period has passed, this data enables withdrawals.

Both batcher and proposer submit states to L1. Why are they separated?

The batcher collects transaction data and submits it to L1 in batches, while the proposer submits commitments (output roots) to the L2 state, which finalize the view of L2 account states. They are decoupled so that they can work in parallel for efficiency.

contracts-bedrock

Various contracts for L2 to interact with the L1:

  • OptimismPortal: A feed of L2 transactions which originated as smart contract calls in the L1 state.
  • Batch inbox: An L1 address to which the Batch Submitter submits transaction batches.
  • L2 output oracle: A smart contract that stores L2 output roots for use with withdrawals and fault proofs.

Optimism components

How to deposit?

How to withdraw?

Feedback to Optimism's Documentation

Understanding the OP stack can be challenging due to a number of factors. One such factor is the numerous components that are referred to multiple times with slightly different names in code and documentation. For example, the terms "op-batcher" and "batch-submitter" / "verifiers" and "validators" may be used interchangeably, leading to confusion and difficulty in understanding the exact function of each component.

Another challenge in understanding the OP stack is the evolving architecture, which may result in some design elements becoming deprecated over time. Unfortunately, the documentation may not always be updated to reflect these changes. This can lead to further confusion and difficulty in understanding the system, as users may be working with outdated or inaccurate information.

To overcome these challenges, it is important to carefully review all available documentation, keep terminology consistent across places, and stay up-to-date with any changes to the OP stack. This may require additional research and collaboration with other users or developers, but it is essential to fully understand and effectively use this complex system.

Enterprise Authorization Services 2022

· 6 min read

Authorization determines whether an individual or system can access a particular resource, and this process is a typical scenario to automate with software. We will review Google's Zanzibar, Zanzibar-inspired solutions, and other AuthZ services on the market.

Zanzibar: Google's Consistent, Global Authorization System

  • Google's = battle-tested with Google products, 20 million permission checks per second, p95 < 10ms, 99.999% availability
  • Consistent = ensure that authorization checks are based on ACL data no older than a client-specified change
  • Global = geographically distributed data centers, distributing load across thousands of servers around the world.
  • Authorization = general-purpose authorization

In Zanzibar's context, we can express the AuthZ question in this way:

isAuthorized(user, relation, object) = does the user have relation to object?

It's called relationship-based access control (ReBAC). Clients could build ABAC and RBAC on top of ReBAC. Unfortunately, Zanzibar is neither open-sourced nor purchasable as an out-of-the-box service.
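To make the ReBAC model concrete, here is a minimal in-memory sketch of relation tuples and a check function in TypeScript; real systems like Zanzibar add userset rewrites, the Leopard index, and consistency tokens on top of this idea, and the tuple syntax below is only illustrative.

// A relation tuple reads "object#relation@user", e.g. "doc:readme#viewer@alice".
// This toy store answers isAuthorized(user, relation, object) by direct lookup
// plus group expansion -- a tiny slice of what Zanzibar does.
type Tuple = { object: string; relation: string; user: string };

const tuples: Tuple[] = [
  { object: "doc:readme", relation: "viewer", user: "group:eng#member" },
  { object: "group:eng", relation: "member", user: "alice" },
];

function isAuthorized(user: string, relation: string, object: string): boolean {
  for (const t of tuples) {
    if (t.object !== object || t.relation !== relation) continue;
    if (t.user === user) return true; // direct grant
    // Userset reference such as "group:eng#member": recurse into the group.
    const m = t.user.match(/^(.+)#(.+)$/);
    if (m && isAuthorized(user, m[2], m[1])) return true;
  }
  return false;
}

console.log(isAuthorized("alice", "viewer", "doc:readme")); // true, via group:eng#member
console.log(isAuthorized("mallory", "viewer", "doc:readme")); // false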

Zanzibar Architecture

Zanzibar Architecture

Why is Zanzibar scalable?

  • Use Spanner as the database
  • Leopard indexing system
    • flatten group-to-group paths like a reachability problem in a graph
    • store index tuples as ordered lists of integers in a structure, such as a skip list, to achieve efficient unions and intersections among sets.
    • async dataflow: client > aclserver > changelog > Leopard indexing system
  • How to maintain external consistency? Zookie protocol - Clients check permissions with a timestamp-based token.

Auth0 Fine-Grained Authorization (FGA)

Auth0 FGA is an open-source implementation of Google Zanzibar. Check the interactive tutorial at https://zanzibar.academy/.

For enterprise developers in the context of microservices, how to use the managed solution of FGA?

How to use FGA?

  1. Go to the FGA dashboard to define the authorization model in DSL and relation tuples, and finally, add authorization assertions like automated tests (this is great!).
  2. Developers go back to their services and call the FGA wrapper's check endpoint

Unfortunately, I don't see changelog audits or version control to roll back in case developers break things in the FGA dashboard, probably because FGA is still a work in progress.

OSO

With Oso, you can:

  • Model: Set up common permissions patterns like role-based access control (RBAC) and relationships using Oso's built-in primitives. Extend them however you need with Oso's declarative policy language, Polar (DSL).
  • Filter: Go beyond yes/no authorization questions. Implement authorization over collections too - e.g., "Show me only the records that Juno can see."
  • Test: Write unit tests over your authorization logic now that you have a single interface for it. Use the Oso debugger or REPL to track down unexpected behavior.

Ory Keto

Keto is an open-source (Go) implementation of Zanzibar. It ships gRPC and REST APIs, NewSQL support, and an easy and granular permission language (DSL). It supports ACL, RBAC, and other access models.

Authzed SpiceDB

SpiceDB is an open-source database system for managing security-critical application permissions inspired by Google's Zanzibar paper.

Aserto Topaz

Topaz is an open-source authorization service providing fine-grained, real-time, policy-based access control for applications and APIs.

It uses the Open Policy Agent (OPA) as its decision engine, and provides a built-in directory that is inspired by the Google Zanzibar data model.

Authorization policies can leverage user attributes, group membership, application resources, and relationships between them. All data used for authorization is modeled and stored locally in an embedded database, so authorization decisions can be evaluated quickly and efficiently.

Cloudentity

It seems to be an integrated CIAM solution, and there is no standalone feature for enterprise authorization. The documentation is confusing.

Open Policy Agent

The Open Policy Agent (OPA) is an open-source, general-purpose policy engine that unifies policy enforcement across the stack. OPA provides a high-level declarative language that lets you specify policy as code and simple APIs to offload policy decision-making from your software. You can use OPA to enforce policies in microservices, Kubernetes, CI/CD pipelines, API gateways, and more.

OPA was originally created by Styra and is a graduated project of the Cloud Native Computing Foundation (CNCF).

Permit.IO

Permit.IO is a low-code AuthZ platform based on OPA and OPAL.

Scaled Access

Scaled Access is a European company that was acquired by onewelcome. It offers rich context-aware access control, real-time policy enforcement, fine-grained authorization, and relationship-based access control. There are APIs in the documentation but no SDKs.

Casbin

Casbin is an authorization library that supports access control models like ACL, RBAC, and ABAC in Golang. There are SDKs in many programming languages. However, its configuration is fairly static in CSV files, and it is geared more toward internal corporate use than customer-facing authorization.

SGNL

This service looks pretty scrappy: a beautiful website without any content for developers. No docs, no videos, no self-service demo. I suspect its positioning is for non-tech enterprises. Not recommended.

Summary

Here is a preliminary ranking after my initial check. Ideally, I want a LaunchDarkly-like AuthZ platform - easy to integrate and operate, fully equipped with audit logs, version control, and a developer-facing web portal.


Name | Github Stars | Models | DevEx | Perf | Score (out of 5)
--- | --- | --- | --- | --- | ---
Oso | 2.8k | ReBAC | DSL, API, SDK, web portal | ? | 3
SpiceDB | 3k | ReBAC | DSL, API, SDK, web portal | ? | 3
permit.io | 840 | ReBAC | DSL, API, SDK, low-code web portal | ? | 3
Aserto Topaz | 534 | ReBAC | DSL, API, SDK, web portal | ? | 3
FGA | 657 | ReBAC | DSL, API, SDK, web portal | ? | 3
Keto | 3.8k | ReBAC | DSL, API, SDK | ? | 2
Casbin | 13.4k | ABAC, RBAC | Library, static file for policies | ? | 1

Authentication and Authorization in Microservices

· 7 min read

Requirements

  • design an auth solution that starts simple but could scale with the business
  • consider both security and user experiences
  • talk about the future trends in this area

Big Picture: AuthN, AuthZ, and Identity Management

First things first, let's get back to basics:

  • Authentication: figure out who you are
  • Authorization: figure out what you can do

In the beginning... Let there be a simple service...

  • Layered Architecture
  • Client stores a cookie or token as the proof of login status. (valet key pattern)

  • Server persists a corresponding session
  • Token is usually in the format of JWT, signed by keys fetched from somewhere secure (environment variables, AWS KMS, HashiCorp Vault, etc.); see the sketch after this list.

  • Popular web frameworks often provide out-of-the-box auth solutions.
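As an illustration of the token handling above, here is a minimal HS256 JWT sign/verify sketch in TypeScript using only Node's crypto. This is for intuition only; in production you would use a maintained library and pull keys from a secret store rather than hand-rolling this.

import { createHmac, timingSafeEqual } from "crypto";

// Minimal HS256 JWT: header.payload.signature, each part base64url-encoded.
const b64url = (buf: Buffer) => buf.toString("base64url");

function sign(payload: object, secret: string): string {
  const header = b64url(Buffer.from(JSON.stringify({ alg: "HS256", typ: "JWT" })));
  const body = b64url(Buffer.from(JSON.stringify(payload)));
  const sig = b64url(createHmac("sha256", secret).update(`${header}.${body}`).digest());
  return `${header}.${body}.${sig}`;
}

function verify(token: string, secret: string): object | null {
  const [header, body, sig] = token.split(".");
  if (!header || !body || !sig) return null;
  const expected = b64url(createHmac("sha256", secret).update(`${header}.${body}`).digest());
  const ok = sig.length === expected.length && timingSafeEqual(Buffer.from(sig), Buffer.from(expected));
  if (!ok) return null; // signature mismatch: reject
  const claims = JSON.parse(Buffer.from(body, "base64url").toString());
  if (claims.exp && claims.exp < Date.now() / 1000) return null; // expired
  return claims;
}

const token = sign({ sub: "user-42", exp: Math.floor(Date.now() / 1000) + 900 }, "demo-secret");
console.log(verify(token, "demo-secret")); // { sub: 'user-42', exp: ... }
console.log(verify(token, "wrong-secret")); // null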

Then, as the business grows, we scale the system with AKF scale cube:

  • X-axis: Horizontal clone
  • Y-axis: Functional decomposition
  • Z-axis: Sharding

Plus Conway's law: organizations design systems that mirror their communication structure. We usually evolve the architecture to microservices (see why microservices? for more)

  • Btw, "microservices vs. monolith" and "multi-repo vs. mono-repo" are different things.
  • For the enterprise, there are employee auth and customer auth. We focus more on the customer auth.

In the microservice world, let's take a functional slice of the authn and authz services, and there is an Identity and Access Management (IAM) team working on it.

  • Identity-aware proxy is a reverse proxy that allows public endpoints through or checks credentials for protected endpoints. If a credential is required but not presented, it redirects the user to an identity provider. e.g. k8s ingress controller, nginx, envoy, Pomerium, ory.sh/oathkeeper, etc.
  • Identity provider and manager is one or a few services that manage the user identity through certain workflows like sign in, forgot password, etc. e.g. ory.sh/kratos, keycloak
  • OAuth2 and OpenID Connect provider enables 3rd-party developers to integrate with your service.
  • Authorization service controls who can do what.

Authentication

1. Identity Provider

  • The simplest solution is to submit the user's proof of identity and issue service credentials.
    • bcrypt, scrypt for password hash
  • However, modern apps often deal with complex workflows like conditional sign up, multi-step login, forgot password, etc. Those workflows are essentially state transition graphs in the state machine.

Workflow: User Settings and Profile Updates

Ory.sh/Kratos as an Example Architecture

2. Third-party OAuth2

OAuth2 lets the user or client go through one of four major workflows (not sure which one to use? see this):

  1. Authorization Code Grant for web
  2. Implicit Grant for mobile
  3. Resource Owner Password Credentials Grant for legacy app
  4. Client Credentials Grant for backend application flow

The client finally gets an access token and a refresh token:

  1. The access token is short-lived, so the attack window is short if it is compromised.
  2. The refresh token works only when combined with the client ID and secret.

Many entities are involved in these workflows: client, resource owner, authorization server, resource server, the network, and so on. More entities introduce more exposure to attack, so a comprehensive protocol has to consider all kinds of edge cases. For example, what if the network is not HTTPS or cannot be fully trusted?

OpenID Connect is an identity protocol based on OAuth2, and it defines a customizable RESTful API for products to implement Single Sign-On (SSO).

There are a lot of tricky details in those workflows and token handling processes. Don't reinvent the wheel.

3. Multi-factor authentication

Problem: Credential stuffing attack

Users tend to reuse the same username and password across multiple sites. When one of those sites suffers a data breach, hackers brute-force other sites with the leaked credentials.

  • Multi-factor authentication: SMS, Email, Phone Voice OTP, Authenticator TOTP
  • Rate limiter, fail to ban, and anomaly detection
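As a sketch of the second mitigation (thresholds and in-memory state are assumptions; a production system would keep this state in Redis and combine it with anomaly detection), a login endpoint can track recent failures per account and IP and temporarily block brute-force attempts:

// Toy fail-to-ban limiter: after N failed logins within a window, block further attempts.
const MAX_FAILURES = 5;
const WINDOW_MS = 15 * 60 * 1000;
const failures = new Map<string, number[]>(); // key -> timestamps of recent failures

function isBlocked(key: string, now = Date.now()): boolean {
  const recent = (failures.get(key) ?? []).filter((t) => now - t < WINDOW_MS);
  failures.set(key, recent);
  return recent.length >= MAX_FAILURES;
}

function recordFailure(key: string, now = Date.now()): void {
  failures.set(key, [...(failures.get(key) ?? []), now]);
}

// Usage inside a hypothetical login handler:
function login(username: string, ip: string, passwordOk: boolean): string {
  const key = `${username}:${ip}`;
  if (isBlocked(key)) return "429 Too Many Attempts";
  if (!passwordOk) {
    recordFailure(key);
    return "401 Invalid Credentials";
  }
  failures.delete(key); // successful login clears the counter
  return "200 OK";
}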

Challenge: Bad deliverability of Email or SMS

  • Do not share marketing email channels with transactional ones.
  • Voice OTP usually has better deliverability.

4. Passwordless

  1. Biometric: fingerprints, facial ID, voice ID
  2. QR code
  • SQRL standard
  • Another way to implement:

  3. Push notification

How can clients subscribe to the server's state? Short polling, long polling, WebSocket, or server-sent events.

5. Vendors on the market

Don't reinvent the wheel.

6. Optimization

Challenge 1: Web login is super slow, or the login form cannot be submitted at all.

  • JS bundle is too large for mobile web
    • Build a lite PWA version of your SPA (single-page web app) with whatever makes the bundle small, e.g., preact or inferno
    • Or do not use an SPA at all; a simple MPA (multi-page web app) works well with raw HTML form submission
  • Browser compatibility
    • Use BrowserStack or other tools to test on different browsers
  • Data centers are too far away
    • Put static resources to the edge / CDN and relay API requests through Google backbone network
    • Build a local DC 😄

See Web App Delivery Optimization for more info

Challenge 2: Account takeover

Challenge 3: Account creation takes too long

When the backend system gets too large, creating a user may fan out to many services and create many entries in different data sources. It feels bad to wait 15 seconds at the end of sign-up, right?

  1. collect info and sign up incrementally
  2. create the remaining downstream entries asynchronously

Authorization

isAuthorized(subject, action, resource)

1. Role-based Access Control (RBAC)

2. Policy-based Access Control (PBAC)

{
  "subjects": ["alice"],
  "resources": ["blog_posts:my-first-blog-post"],
  "actions": ["delete"],
  "effect": "allow"
}
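A minimal sketch of evaluating such policies, assuming a deny-by-default model with explicit deny winning (similar in spirit to AWS IAM or Ory's Ladon-style policies, but not any vendor's exact semantics):

// Deny-by-default evaluation of policy documents like the one shown above.
interface Policy {
  subjects: string[];
  resources: string[];
  actions: string[];
  effect: "allow" | "deny";
}

const policies: Policy[] = [
  {
    subjects: ["alice"],
    resources: ["blog_posts:my-first-blog-post"],
    actions: ["delete"],
    effect: "allow",
  },
];

function isAuthorized(subject: string, action: string, resource: string): boolean {
  const matching = policies.filter(
    (p) =>
      p.subjects.includes(subject) &&
      p.actions.includes(action) &&
      p.resources.includes(resource),
  );
  if (matching.some((p) => p.effect === "deny")) return false; // explicit deny wins
  return matching.some((p) => p.effect === "allow"); // otherwise an allow is required
}

console.log(isAuthorized("alice", "delete", "blog_posts:my-first-blog-post")); // true
console.log(isAuthorized("bob", "delete", "blog_posts:my-first-blog-post")); // false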

Challenge: single point of failure and cascading failures

  • preprocess and cache permissions
  • leverage request contexts
    • assumptions: requests inside of a datacenter are trusted vs. not trusted
  • fail open vs. fail closed

Privacy

1. PII, PHI, PCI

Western culture has a long tradition of respecting privacy, especially after the Nazis murdered millions of people. Here are some typical sensitive data types: Personally Identifiable Information (PII), Protected Health Information (PHI, regulated by HIPAA), and Payment Card Industry (PCI) information.

2. Differential Privacy

Redacting sensitive information alone may not be enough to prevent data from being re-identified when joined with other datasets.

Differential privacy helps analysts extract insights from databases containing personal information while still protecting individuals' privacy.

3. Decentralized Identity

To decouple id from a centralized identity provider and its associated sensitive data, we can use decentralized id (DID) instead.

  • It is essentially in the format of a URN: did:example:123456789abcdefghijk
  • It can be derived from asymmetric keys and its target business domain.
    • It does not involve your personal info, unlike the traditional way.
    • See DID methods for how it works with blockchains.
  • It preserves privacy by
    • using different DIDs for different purposes
    • selective disclosure / verifiable claims

Imagine that Alice has a state-issued DID and wants to buy some alcohol without disclosing her real name and precise age.

A DID solution:

  • Alice has an identity profile containing did:ebfeb1f712ebc6f1c276e12ec21, her name, avatar URL, birthday, and other sensitive data.
  • Create a claim that did:ebfeb1f712ebc6f1c276e12ec21 is over the age of 21.
  • A trusted third party signs the claim, making it a verifiable claim.
  • Use the verifiable claim as proof of age.

Summary

This article is an overview of authn and authz in microservices; you don't have to memorize everything to be an expert. Here are some takeaways:

  1. follow standard protocols and don't reinvent the wheel
  2. do not underestimate the power of security researchers and hackers
  3. it is hard to be perfect, and it does not have to be perfect; prioritize your development accordingly

Designing Online Judge or Leetcode

· 3 min read

Requirements

An online judge is primarily a place where you can execute code remotely for educational or recruitment purposes. In this design, we focus on designing an OJ for interview preparation like Leetcode, with the following requirements:

  • It should have core OJ functionalities like fetching a problem, submitting the solution, compiling if needed, and executing it.
  • It should be highly available with an async design, since running code may take time.
  • It should be horizontally scalable, i.e., easy to scale out.
  • It should be robust and secure to execute untrusted source code.

Architecture

The architecture below features queueing for async execution and sandboxing for secure execution. Each component is separately deployable and scalable.

designing online judge system

Components

Presentation Layer

The user agent is usually a web or mobile app like coderoma.com. It displays the problem description and provides the user with a code editor to write and submit code.

When the user submits the code, the client will get a token since it is an async call. Then the client polls the server for the submission status.

API

Please see Public API Choices for the protocols we can choose from. Let's design the interface itself here, using GraphQL as an example:

type Query {
  problems(id: String): [Problem]
  languageSetup(id: String!, languageId: LanguageId!): LanguageSetup
  submission(token: String!): Submission
}

type Mutation {
  createSubmission(
    problemId: String!
    code: String!
    languageId: LanguageId!
  ): CreatedSubmission!
}

enum LanguageId {
  JAVA
  JS
  ELIXIR
  # ...
}

type Problem {
  id: String!
  title: String!
  description: String!
  supportedLanguages: [Float!]!
}

type LanguageSetup {
  languageId: LanguageId!
  template: String!
  solutions: [String!]!
}

type Status {
  id: Float!
  description: String!
}

type Submission {
  compileOutput: String
  memory: Float
  message: String
  status: Status
  stderr: String
  stdout: String
  time: String
  token: String
}

type CreatedSubmission {
  token: String!
}

The API layer records the submission in the database, publishes it into the queue, and returns a token for the client's future reference.
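A hypothetical client interaction with this schema might look like the sketch below; the endpoint URL, the one-second polling interval, and the "Processing" status description are all assumptions for illustration.

// Submit code, then poll the returned token until the judge finishes.
const ENDPOINT = "https://api.example-oj.com/graphql"; // placeholder endpoint

async function gql<T>(query: string, variables: object): Promise<T> {
  const res = await fetch(ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, variables }),
  });
  return (await res.json()).data as T;
}

async function submitAndWait(problemId: string, code: string) {
  const { createSubmission } = await gql<{ createSubmission: { token: string } }>(
    `mutation($problemId: String!, $code: String!, $languageId: LanguageId!) {
       createSubmission(problemId: $problemId, code: $code, languageId: $languageId) { token }
     }`,
    { problemId, code, languageId: "JS" },
  );

  // Poll until the submission reaches a terminal status.
  for (;;) {
    const { submission } = await gql<{ submission: { status: { description: string }; stdout?: string } }>(
      `query($token: String!) { submission(token: $token) { status { description } stdout } }`,
      { token: createSubmission.token },
    );
    if (submission.status.description !== "Processing") return submission;
    await new Promise((r) => setTimeout(r, 1000));
  }
}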

Code Execution Engine

The code execution engine (CEE) polls the queue for code, uses a sandbox to compile and run it, and parses the metadata from the compilation and execution.

The sandbox could be LXC containers, Docker, virtual machines, etc. We can choose Docker for its ease of deployment.
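A rough sketch of what the CEE might do with Docker is below; the image name, resource limits, and paths are illustrative, and a hardened setup would also add seccomp profiles, user namespaces, and output-size limits.

import { execFile } from "child_process";
import { promisify } from "util";

const run = promisify(execFile);

// Run untrusted code in a throwaway container with CPU, memory, process, and
// network limits. Not a hardened production configuration.
async function executeInSandbox(hostWorkdir: string, cmd: string[]): Promise<string> {
  const args = [
    "run",
    "--rm",              // delete the container afterwards
    "--network=none",    // no network access for untrusted code
    "--memory=256m",     // cap memory
    "--cpus=0.5",        // cap CPU
    "--pids-limit=64",   // limit fork bombs
    "-v", `${hostWorkdir}:/sandbox:ro`,
    "-w", "/sandbox",
    "node:20-alpine",    // language runtime image (placeholder)
    ...cmd,
  ];
  const { stdout } = await run("docker", args, { timeout: 10_000 });
  return stdout;
}

// e.g. executeInSandbox("/tmp/submission-123", ["node", "solution.js"])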

Coderoma.com

I am currently learning Elixir and building an online judge, coderoma.com, for my daily practice. It now supports Elixir and JavaScript, and I am adding more languages (like Java) and more problems.

We may host future events to improve your coding skills. Join us at https://t.me/coderoma for the English community, or scan the following QR code with WeChat and reply 刷题 ("practice problems") for the Chinese community.


How to Design the Architecture of a Blockchain Server?

· 7 min read

Requirement Analysis

  • A distributed blockchain accounting and smart contract system
  • Nodes have minimal trust in each other but need to be incentivized to cooperate
    • Transactions are irreversible
    • Do not rely on trusted third parties
    • Protect privacy, disclose minimal information
    • Do not rely on centralized authority to prove that money cannot be spent twice
  • Assuming performance is not an issue, we will not consider how to optimize performance

Architecture Design

Specific Modules and Their Interactions

Base Layer (P2P Network, Cryptographic Algorithms, Storage)

P2P Network

There are two ways to implement distributed systems:

  • Centralized leader/follower distributed systems, such as Hadoop and Zookeeper, which have a simpler structure but place high demands on the leader
  • Decentralized peer-to-peer (P2P) network distributed systems, such as those organized by Chord, CAN, Pastry, and Tapestry algorithms, which have a more complex structure but are more egalitarian

Given the premise that nodes have minimal trust in each other, we choose the P2P form. How do we specifically organize the P2P network? A typical decentralized node and network maintain connections as follows:

  1. Based on the IP protocol, a node comes online at a certain hostname/port, announces its address using an initial node list, and tries to flood its information across the network through these initial hops.
  2. The initial nodes that receive the broadcast save this neighbor and help with flooding; non-adjacent nodes, upon receiving it, use NAT traversal and add the neighbor.
  3. Nodes perform anti-entropy by randomly sending heartbeat messages containing their latest information (similar to vector clocks), ensuring they continuously keep each other up to date.

We can use existing libraries, such as libp2p, to implement the network module. For the choice of network protocols, see Crack the System Design Interview: Communication.

Cryptographic Algorithms

In a distributed system with minimal trust, how can a transfer be proven to have been initiated by the owner without leaking secret information? Asymmetric encryption: a pair of public and private keys corresponds to "ownership." Bitcoin chose the secp256k1 parameters of the ECDSA elliptic curve algorithm, and for compatibility, other chains also generally choose the same algorithm.

Why not directly use the public key as the address for the transfer? Privacy: the transaction process should disclose as little information as possible. Using the hash of the public key as the "address" avoids revealing the public key itself. Furthermore, people should avoid reusing the same address.
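A simplified sketch of deriving an address as a hash of the public key is shown below; Bitcoin's real scheme additionally applies RIPEMD-160, a version byte, and Base58Check, so this only conveys the idea that the public key itself never needs to be revealed up front.

import { generateKeyPairSync, createHash } from "crypto";

// Simplified: address = first 20 bytes of SHA-256(public key).
const { publicKey } = generateKeyPairSync("ec", { namedCurve: "secp256k1" });
const pubDer = publicKey.export({ type: "spki", format: "der" });

const address = createHash("sha256").update(pubDer).digest("hex").slice(0, 40);
console.log("address:", address);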

Regarding account ledgers, there are two implementation methods: UTXO vs. Account/Balance

  • UTXO (unspent transaction output), as in Bitcoin, resembles double-entry bookkeeping with credits and debits. Each transaction has inputs and outputs, and every input is linked to a previous output except for the coinbase. Although there is no concept of an account, summing all unspent outputs corresponding to an address gives that address's balance.
    • Advantages
      • Precision: The structure similar to double-entry bookkeeping allows for very accurate recording of all asset flows.
      • Privacy protection and resistance to quantum attacks: If users frequently change addresses.
      • Stateless: Leaves room for improving concurrency.
      • Avoids replay attacks: Because replaying will not find the corresponding UTXO for the input.
    • Disadvantages
      • Records all transactions, complex, consumes storage space.
      • Traversing UTXOs takes time.
  • Account/Balance, such as Ethereum, has three main maps: account map, transaction map, transaction receipts map. Specifically, to reduce space and prevent tampering, it uses a Merkle Patricia Trie (MPT).
    • Advantages
      • Space-efficient: unlike the UTXO model, where one transaction links to multiple previous outputs, only account balances need to be stored.
      • Simplicity: Complexity is offloaded to the script.
    • Disadvantages
      • Requires using nonce to solve replay issues since there is no dependency between transactions.

It is worth mentioning that the "block + chain" data structure is essentially an append-only Merkle tree, also known as a hash tree.
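A minimal sketch of building a Merkle root over a block's transactions, duplicating the last leaf when the count is odd (as Bitcoin does); the hashing of leaves is simplified here.

import { createHash } from "crypto";

const sha256 = (data: Buffer) => createHash("sha256").update(data).digest();

// Pairwise-hash the transaction hashes until a single root remains.
// Changing any transaction changes the root, which is what the block header commits to.
function merkleRoot(txHashes: Buffer[]): Buffer {
  if (txHashes.length === 0) return sha256(Buffer.alloc(0));
  let level = txHashes;
  while (level.length > 1) {
    const next: Buffer[] = [];
    for (let i = 0; i < level.length; i += 2) {
      const left = level[i];
      const right = level[i + 1] ?? left; // duplicate last hash if odd count
      next.push(sha256(Buffer.concat([left, right])));
    }
    level = next;
  }
  return level[0];
}

const txs = ["tx-a", "tx-b", "tx-c"].map((t) => sha256(Buffer.from(t)));
console.log("merkle root:", merkleRoot(txs).toString("hex"));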

Storage

Since UTXO or MPT structures serve as indexes, and to simplify operations for each node in a distributed environment, data persistence typically favors in-process databases that can run directly with the node's program, such as LevelDB or RocksDB.

Because these indexes are not universal, you cannot query them like an SQL database, which raises the barrier for data analysis. Optimizations require a dedicated indexing service, such as Etherscan.

Protocol Layer

Now that we have a functional base layer, we need a more general protocol layer for logical operations above this layer. Depending on the blockchain's usage requirements, specific logical processing modules can be plugged in and out like a microkernel architecture.

For instance, the most common accounting: upon receiving some transactions at the latest block height, organize them to establish the data structure as mentioned in the previous layer.

Writing a native module for every piece of business logic and updating all nodes' code is not very realistic. Can we decouple this layer using virtualization? The answer is a virtual machine capable of executing smart contract code. In a non-trusting environment, we cannot allow clients to execute code for free, so the most distinctive feature of this virtual machine may be billing (gas metering).

The difference between contract-based tokens, such as ERC-20, and native tokens leads to complications when dealing with different tokens, resulting in the emergence of wrapped tokens such as Wrapped Ether.

Consensus Layer

After the protocol layer computes the execution results, how do we reach consensus with other nodes? There are several common mechanisms to incentivize cooperation:

  • Proof of Work (POW): Mining tokens through hash collisions, which is energy-intensive and not environmentally friendly.
  • Proof of Stake (POS): Mining tokens using staked tokens.
  • Delegated Proof-of-Stake (DPOS): Electing representatives to mine tokens using staked tokens.

Based on the incentive mechanism, nodes follow the longest chain; if two groups disagree, a fork occurs.

Additionally, there are consensus protocols that help everyone reach agreement (i.e., everyone either does something together or does nothing together):

  • 2PC: Everyone relies on a coordinator: the coordinator asks everyone: should we proceed? If anyone replies no, the coordinator tells everyone "no"; otherwise, everyone proceeds. This dependency can lead to issues if the coordinator fails in the middle of the second phase, leaving some nodes unsure of what to do with the block, requiring manual intervention to restart the coordinator.
  • 3PC: To solve the above problem, an additional phase is added to ensure everyone knows whether to proceed before doing so; if an error occurs, a new coordinator is selected.
  • Paxos: The above 2PC and 3PC both rely on a coordinator; how can we eliminate this coordinator? By using "the majority (at least f+1 in 2f + 1)" to replace it, as long as the majority agrees in two steps, consensus can be achieved.
  • PBFT (deterministic 3-step protocol): The fault tolerance of the above methods is still not high enough, leading to the development of PBFT. This algorithm ensures that the majority (2/3) of nodes either all agree or all disagree, implemented through three rounds of voting, with at least a majority (2/3) agreeing in each round before committing the block in the final round.

In practical applications, relational databases mostly use 2PC or 3PC; variants of Paxos include implementations in Zookeeper, Google Chubby distributed locks, and Spanner; in blockchain, Bitcoin and Ethereum use POW, while the new Ethereum uses POS, and IoTeX and EOS use DPOS.

API Layer

See Public API choices

Credit Card Processing System

· 2 min read

Credit Card Payment Processing

5 Parties

  1. Cardholder: an authorized user of the card. e.g., anyone who has a credit card.
  2. Issuer: an institution that issues credit cards to its users. e.g., Chase, BOA, discover, etc.
  3. Merchant: an entity that provides products or services, and accepts payments via credit cards. e.g., Uber, Doordash, etc.
  4. Acquirer: who provides card services to merchants. e.g., BOA Merchant Services, Chase Paymentech.
  5. Electronic payments networks: a clearing house for acquirers and issuers. e.g., VisaNet, MasterCard, Discover, American Express, etc.

Are Square and Stripe acquirers? No, they are payment aggregators.

2 Workflows

  1. Authorization: The cardholder provides the card or card info to the merchant, who processes it and sends (card number, amount, merchant ID) to the acquirer. The acquirer contacts the issuer via the electronic payments networks. The issuer checks against the credit line and fraud-prevention system, and then authorizes or declines the transaction (see the sketch after this list).

  2. Clearing / Settlement

    1. batching: At the end of the business day, the merchant sends the acquirer a batch of payment information for clearing.
    2. clearing and settlement: the acquirer coordinates the clearing with the issuer via the electronic payments networks, and the issuer settles the funds and records the transaction for the cardholder.
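To make the authorization data flow concrete, here is a hypothetical TypeScript sketch of the message hops; the field names and the collapsed merchant-acquirer-network-issuer routing are illustrative simplifications, not an actual ISO 8583 implementation.

// Simplified authorization hop: merchant -> acquirer -> network -> issuer.
interface AuthorizationRequest {
  cardNumber: string;   // in practice tokenized, never stored raw by the merchant
  amountCents: number;
  merchantId: string;
}

interface AuthorizationResponse {
  approved: boolean;
  authorizationCode?: string;
  declineReason?: string;
}

// The issuer's decision: credit line plus fraud checks (fraud checks elided here).
function issuerDecide(req: AuthorizationRequest, availableCreditCents: number): AuthorizationResponse {
  if (req.amountCents > availableCreditCents) {
    return { approved: false, declineReason: "insufficient_credit" };
  }
  return { approved: true, authorizationCode: "A1B2C3" };
}

// The acquirer relays the request through the payment network to the issuer.
function authorize(req: AuthorizationRequest): AuthorizationResponse {
  // network routing (VisaNet, etc.) elided
  return issuerDecide(req, 500_00);
}

console.log(authorize({ cardNumber: "4242424242424242", amountCents: 42_00, merchantId: "uber" }));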

Designing Human-Centric Internationalization (i18n) Engineering Solutions

· 9 min read

Requirement Analysis

If you ask what the biggest difference is between Silicon Valley companies and those in China, the answer is likely, as Wu Jun said, that Silicon Valley companies primarily target the global market. As Chen Zhiwu aptly stated, the ability to create wealth can be measured along three dimensions: depth, which refers to productivity (the ability to provide better products or services in the same amount of time); length, which refers to the ability to leverage finance to exchange value across time and space; and breadth, which refers to market size (the ability to create markets or new industries that transcend geographical boundaries). Internationalization, the localization of products and services in terms of language and culture, is a strategic key for multinational companies competing in the global market.

Internationalization, abbreviated as i18n (with 18 letters between the 'i' and the 'n'), aims to solve the following issues in the development of websites and mobile apps:

  1. Language
  2. Time and Time Zones
  3. Numbers and Currency

Framework Design

Language

Logic and Details

The essence of language is a medium for delivering messages to an audience; different languages are different media, each targeting a different audience. For example, if we want to display the message "Hello, Xiaoli!" to the user, the process involves checking the language table, determining the user's language and the currently required interpolation (such as the name), and then displaying the corresponding message:

Message Codes | Locales | Translations
--- | --- | ---
home.hello | en | Hello, ${username}!
home.hello | zh-CN | 你好, ${username}!
home.hello | IW | !${username}, שלום
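A minimal sketch of the lookup-and-interpolate step is below; the fallback to English and the in-memory message table are illustrative choices, not any particular framework's behavior.

// Look up a message by (code, locale), fall back to English, then interpolate.
const messages: Record<string, Record<string, string>> = {
  "home.hello": {
    en: "Hello, ${username}!",
    "zh-CN": "你好, ${username}!",
  },
};

function t(code: string, locale: string, vars: Record<string, string>): string {
  const template = messages[code]?.[locale] ?? messages[code]?.["en"] ?? code;
  return template.replace(/\$\{(\w+)\}/g, (_, name) => vars[name] ?? "");
}

console.log(t("home.hello", "zh-CN", { username: "小丽" })); // 你好, 小丽!
console.log(t("home.hello", "fr", { username: "Juno" }));   // falls back to English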

Different languages may have slight variations in details, such as the singular and plural forms of an item, or the distinction between male and female in third-person references.
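For plural forms in particular, the built-in Intl.PluralRules can replace hand-written if/else chains; the message table below is a sketch, not a full i18n framework.

// Pick the right plural form per locale. English has two categories here;
// Russian has more, which a flat table lookup alone cannot express.
const forms: Record<string, Record<string, string>> = {
  en: { one: "1 item", other: "${n} items" },
  ru: { one: "${n} товар", few: "${n} товара", many: "${n} товаров", other: "${n} товара" },
};

function itemsLabel(locale: string, n: number): string {
  const category = new Intl.PluralRules(locale).select(n); // "one" | "few" | "many" | "other" | ...
  const template = forms[locale]?.[category] ?? forms[locale]?.other ?? "${n}";
  return template.replace("${n}", String(n));
}

console.log(itemsLabel("en", 1)); // "1 item"
console.log(itemsLabel("ru", 3)); // "3 товара"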

These are issues that simple table lookups cannot handle, requiring more complex logic. In code, you can use conditional statements to handle these exceptions. Additionally, some internationalization frameworks invent domain-specific languages (DSLs) to specifically address these situations, such as Project Fluent.

Another issue that beginners often overlook is the direction of writing. Common languages like Chinese and English are written from left to right, while some languages, such as Hebrew and Arabic, are written from right to left.

The difference in writing direction affects not only the text itself but also the input method. A Chinese person would find it very strange to input text from right to left; conversely, a Jewish colleague of mine finds it easy to mix English and Hebrew input.

Layout is another consideration. The entire UI layout and visual elements, such as the direction of arrows, may change based on the language's direction. Your HTML needs to set the appropriate dir attribute.

How to Determine the User's Locale?

You may wonder how we know the user's current language settings. In the case of a browser, when a user requests a webpage, there is a header called Accept-Language that indicates the accepted languages. These settings come from the user's system language and browser settings. In mobile apps, there is usually an API to retrieve the locale variable or constant. Another method is to determine the user's location based on their IP or GPS information and then display the corresponding language. For multinational companies, users often indicate their language preferences and geographical regions during registration.

If a user wants to change the language, websites have various approaches, while mobile apps tend to have more fixed APIs. Here are some methods for websites:

  1. Set a locale cookie
  2. Use different subdomains
  3. Use a dedicated domain. Pinterest has an article discussing how they utilize localized domains. Research shows that using local domain suffixes leads to higher click-through rates.
  4. Use different paths
  5. Use query parameters. While this method is feasible, it is not SEO-friendly.

Beginners often forget to mark the lang attribute in HTML when creating websites.

Translation Management Systems

Once you have carefully implemented the display of text languages, you will find that establishing and managing a translation library is also a cumbersome process.

Typically, developers do not have expertise in multiple languages. At this point, external translators or pre-existing translation libraries need to be introduced. The challenge here is that translators are often not technical personnel. Allowing them to directly modify code or communicate directly with developers can significantly increase translation costs. Therefore, in Silicon Valley companies, translation management systems (TMS) designed for translators are often managed by a dedicated team or involve purchasing existing solutions, such as the closed-source paid service lokalise.co or the open-source Mozilla Pontoon. A TMS can uniformly manage translation libraries, projects, reviews, and task assignments.

This way, the development process becomes: first, designers identify areas that need attention based on different languages and cultural habits during the design phase. For example, a button that is short in English may be very long in Russian, so care must be taken to avoid overflow. Then, the development team implements specific code logic based on the design requirements and provides message codes, contextual background, and examples written in a language familiar to developers in the translation management system. Subsequently, the translation team fills in translations for various languages in the management system. Finally, the development team pulls the translation library back into the codebase and releases it into the product.

Contextual background is an easily overlooked and challenging aspect. Where in the UI is the message that needs translation? What is its purpose? If the message is too short, further explanation may be needed. With this background knowledge, translators can provide more accurate translations in other languages. If translators cannot fully understand the intended message, they need a feedback channel to reach out to product designers and developers for clarification.

Given the multitude of languages and texts, it is rare for a single translator to handle everything; it typically requires a team of individuals with language expertise from various countries to contribute to the translation library. The entire process is time-consuming and labor-intensive, which is why translation teams are often established, such as outsourcing to Smartling.

Now that we have the code logic and translation library, the next question is: how do we integrate the content of the translation library into the product?

There are many different implementation methods; the most straightforward is a static approach where, each time an update occurs, a diff is submitted and merged into the code. This way, relevant translation materials are already included in the code during the build process.

Another approach is dynamic integration. On one hand, you can "pull" content from a remote translation library, which may lead to performance issues during high website traffic. However, the advantage is that translations are always up-to-date. On the other hand, for optimization, a "push" method can be employed, where any new changes in the translation library trigger a webhook to push the content to the server.

In my view, maintaining translations is more cumbersome than adding them. I have seen large projects become chaotic because old translations were not promptly removed after updates, leading to an unwieldy translation library. A good tool that ensures data consistency would greatly assist in maintaining clean code.

Alibaba's Kiwi internationalization solution has implemented a linter and VS Code plugin to help you check and extract translations from the code.

Time and Time Zones

Having discussed language, the next topic is time and time zones. As a global company, much of the data comes from around the world and is displayed to users globally. For example, how do international flights ensure that start and end times are consistent globally and displayed appropriately across different time zones? This is crucial. The same situation applies to all time-related events, such as booking hotels, reserving restaurants, and scheduling meetings.

First, there are several typical representations of time:

  1. Natural language, such as 07:23:01, Monday 28, October 2019 CST AM/PM
  2. Unix timestamp (Int type), such as 1572218668
  3. Datetime. Note that when MySQL stores a TIMESTAMP, it converts it from the session time zone to UTC and converts it back when reading, while DATETIME is stored as-is without time zone information. The server's time zone is generally set to UTC, so in practice stored values default to UTC.
  4. ISO Date, such as 2019-10-27T23:24:28+00:00, which includes time zone information.

I have no strong preference for these formats; if you have relevant experience, feel free to discuss it.

When displaying time, two types of conversion may occur: one is converting the stored server time zone to the local time zone for display; the other is converting the machine-readable representation to natural language. A popular approach for the latter is to use powerful libraries for handling time and dates, such as moment.js and dayjs.
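Even without a library, the built-in Intl.DateTimeFormat can render one stored UTC instant for different locales and time zones; the timestamp and zones below are only examples.

// One instant stored as a Unix timestamp (UTC), rendered for different audiences.
const departure = new Date(1572218668 * 1000);

const fmt = (locale: string, timeZone: string) =>
  new Intl.DateTimeFormat(locale, { dateStyle: "full", timeStyle: "short", timeZone }).format(departure);

console.log(fmt("en-US", "America/Los_Angeles"));
console.log(fmt("zh-CN", "Asia/Shanghai"));
console.log(fmt("de-DE", "Europe/Berlin"));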

Numbers and Currency

The display of numbers varies significantly across different countries and regions. The meaning of commas and periods in numbers differs from one country to another.

(1000.1).toLocaleString("en")
// => "1,000.1"
(1000.1).toLocaleString("de")
// => "1.000,1"
(1000.1).toLocaleString("ru")
// => "1 000,1"

Arabic numerals are not universally applicable; for instance, in Java's String.format, the digits 1, 2, 3 are rendered as ١، ٢، ٣ under an actual Arabic locale.

Regarding pricing, should the same goods be displayed in local currency values in different countries? What is the currency symbol? How precise should the currency be? These questions must be addressed in advance.

Conclusion

The internationalization tools mentioned in this article include translation management systems such as the open-source Mozilla Pontoon and the closed-source paid services lokalise.co and POEditor.com. For code consistency, Alibaba's Kiwi internationalization solution is recommended. For UI display, consider using moment.js and dayjs.

Like all software system development, there is no silver bullet for internationalization; great works are crafted through foundational skills honed over time.

Designing a Load Balancer or Dropbox Bandaid

· 3 min read

Requirements

Internet-scale web services deal with high-volume traffic from the whole world. However, one server can only serve a limited number of requests at a time. Consequently, there is usually a server farm or a large cluster of servers to handle the traffic altogether. Here comes the question: how do we route requests so that each host receives and processes them evenly?

Since there are many hops and layers of load balancers from the user to the server, specifically speaking, this time our design requirements are:

  • Design a layer-7 load balancer located inside the data center.
  • Utilize real-time load information from the backend.
  • Handle tens of millions of requests per second and a throughput of 10 TB per second.

Note: If Service A depends on (or consumes) Service B, then A is downstream service of B, and B is upstream service of A.

Challenges

Why is it hard to balance loads? The answer is that it is hard to collect accurate load distribution stats and act accordingly.

Distributing-by-requests ≠ distributing-by-load

Random and round-robin distribute the traffic by request count. However, the actual load is not uniform per request: some requests are heavy in CPU or thread utilization, while others are lightweight.

To be more accurate about the load, load balancers have to maintain local states of the observed active request count, connection count, or request-processing latencies for each backend server. Based on these, we can use distribution algorithms like least-connections, least-time, and random N choices:

Least-connections: a request is passed to the server with the least number of active connections.

latency-based (least-time): a request is passed to the server with the least average response time and least number of active connections, taking into account weights of servers.

However, these two algorithms work well only with a single load balancer. If there are multiple ones, there might be a herd effect: all the load balancers notice that one service is momentarily faster, and then all of them send requests to that service.

Random N choices (where N=2 in most cases, a.k.a. the power of two choices): pick two at random and choose the better option of the two, avoiding the worse choice.
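A sketch of random N choices with N=2 is below; the "load" metric here is simply active connections, whereas Bandaid uses richer server-reported utilization, described later in this post.

interface Backend {
  host: string;
  activeConnections: number;
}

// Power of two choices: sample two backends uniformly at random,
// send the request to the less-loaded one.
function pickBackend(backends: Backend[]): Backend {
  const a = backends[Math.floor(Math.random() * backends.length)];
  const b = backends[Math.floor(Math.random() * backends.length)];
  return a.activeConnections <= b.activeConnections ? a : b;
}

const pool: Backend[] = [
  { host: "10.0.0.1", activeConnections: 12 },
  { host: "10.0.0.2", activeConnections: 3 },
  { host: "10.0.0.3", activeConnections: 7 },
];
console.log(pickBackend(pool).host);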

Distributed environments.

Local LB is unaware of global downstream and upstream states, including

  • upstream service loads
  • upstream service may be super large, and thus it is hard to pick the right subset to cover with the load balancer
  • downstream service loads
  • the processing time of various requests are hard to predict

Solutions

There are three options to collect the load stats accurately and then act accordingly:

  • centralized & dynamic controller
  • distributed but with shared states
  • piggybacking server-side information in response messages or active probing

The Dropbox Bandaid team chose the third option because it fits well into their existing random N choices approach.

However, instead of using local states, like the original random N choices do, they use real-time global information from the backend servers via the response headers.

Server utilization: Backend servers are configured with a max capacity and count the ongoing requests, and then a utilization percentage ranging from 0.0 to 1.0 is calculated.

There are two problems to consider:

  1. Handling HTTP errors: If a server fast-fails requests, it attracts more traffic and fails more.
  2. Stats decay: If a server's load is too high, no requests will be distributed there, and hence the server gets stuck. They use a decay function based on an inverted sigmoid curve to solve the problem (see the sketch below).
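As a sketch of the stats-decay idea (the exact curve shape and constants are my assumptions, not Dropbox's published parameters): the older a server's reported utilization is, the more it is discounted toward an average value, with the weight following an inverted sigmoid so that a server that once reported full load is eventually retried.

// Decay a stale utilization report toward a neutral value over time.
// weight(age) follows an inverted sigmoid: ~1 for fresh reports, ~0 for old ones.
const HALF_LIFE_MS = 10_000; // age at which the report only counts for half (assumed)
const STEEPNESS = 1_000;     // how sharply the weight falls off (assumed)

function decayedUtilization(reported: number, ageMs: number, neutral = 0.5): number {
  const weight = 1 / (1 + Math.exp((ageMs - HALF_LIFE_MS) / STEEPNESS)); // inverted sigmoid
  return weight * reported + (1 - weight) * neutral;
}

console.log(decayedUtilization(1.0, 0));      // ≈ 1.0  -> avoid this server
console.log(decayedUtilization(1.0, 10_000)); // = 0.75 -> partially forgiven
console.log(decayedUtilization(1.0, 30_000)); // ≈ 0.5  -> treated as average again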

Results: requests are more balanced

Designing a Load Balancer

· 3 min read

Requirements Analysis

Internet services often need to handle traffic from around the world, but a single server can only serve a limited number of requests at the same time. Therefore, we typically have a server cluster to collectively manage this traffic. The question arises: how can we evenly distribute this traffic across different servers?

From the user to the server, there are many nodes and load balancers at different levels. Specifically, our design requirements are:

  • Design a Layer 7 load balancer located internally in the data center.
  • Utilize real-time load information from the backend.
  • Handle tens of millions of requests per second and a throughput of 10 TB per second.

Note: If Service A depends on Service B, we refer to A as the downstream service and B as the upstream service.

Challenges

Why is load balancing difficult? The answer lies in the challenge of collecting accurate load distribution data.

Count-based Distribution ≠ Load-based Distribution

The simplest approach is to distribute traffic randomly or in a round-robin manner based on the number of requests. However, the actual load is not calculated based on the number of requests; for example, some requests are heavy and CPU-intensive, while others are lightweight.

To measure load more accurately, the load balancer must maintain some local state—such as the current number of requests, connection counts, and request processing delays. Based on this state, we can employ appropriate load balancing algorithms—least connections, least latency, or random N choose one.

Least Connections: Requests are directed to the server with the fewest current connections.

Least Latency: Requests are directed to the server with the lowest average response time and fewest connections. Servers can also be weighted.

Random N Choose One (N is typically 2, so we can also refer to it as the power of two choices): Randomly select two servers and choose the better of the two, which helps avoid the worst-case scenario.

Distributed Environment

In a distributed environment, local load balancers struggle to understand the complete state of upstream and downstream services, including:

  • Load of upstream services
  • Upstream services can be very large, making it difficult to select an appropriate subset for the load balancer
  • Load of downstream services
  • The specific processing time for different types of requests is hard to predict

Solutions

There are three approaches to accurately collect load information and respond accordingly:

  • A centralized balancer that dynamically manages based on the situation
  • Distributed balancers that share state among them
  • Servers return load information along with requests, or the balancer actively queries the servers

Dropbox chose the third approach when implementing Bandaid, as it adapted well to the existing random N choose one algorithm.

However, unlike the original random N choose one algorithm, this approach does not rely on local state but instead uses real-time results returned by the servers.

Server Utilization: Backend servers set a maximum load, track current connections, and calculate utilization, ranging from 0.0 to 1.0.

Two issues need to be considered:

  1. Error Handling: If the server fails fast, the quick failures may attract more traffic, leading to more errors.
  2. Data Decay: If a server's load is too high, no requests will be sent there. Therefore, using a decay function similar to a reverse S-curve ensures that old data is purged.

Result: Requests Received by Servers are More Balanced