Skip to main content

75 posts tagged with "system design"

View All Tags

Designing typeahead search or autocomplete

· 2 min read

Requirements

  • realtime / low-latency typeahead and autocomplete service for social networks, like Linkedin or Facebook
  • search social profiles with prefixes
  • newly added account appear instantly in the scope of the search
  • not for “query autocomplete” (like the Google search-box dropdown), but for displaying actual search results, including
    • generic typeahead: network-agnostic results from a global ranking scheme like popularity.
    • network typeahead: results from user’s 1st and 2nd-degree network connections, and People You May Know scores.

Linkedin Search

Architecture

Multi-layer architecture

  • browser cache
  • web tier
  • result aggregator
  • various typeahead backend

Cleo Architecture

Result Aggregator

The abstraction of this problem is to find documents by prefixes and terms in a very large number of elements. The solution leverages these four major data structures:

  1. InvertedIndex<prefixes or terms, documents>: given any prefix, find all the document ids that contain the prefix.
  2. for each document, prepare a BloomFilter<prefixes or terms>: with user typing more, we can quickly filter out documents that do not contain the latest prefixes or terms, by check with their bloom filters.
  3. ForwardIndex<documents, prefixes or terms>: previous bloom filter may return false positives, and now we query the actual documents to reject them.
  4. scorer(document):relevance: Each partition return all of its true hits and scores. And then we aggregate and rank.

Performance

  • generic typeahead: latency < = 1 ms within a cluster
  • network typeahead (very-large dataset over 1st and 2nd degree network): latency < = 15 ms
  • aggregator: latency < = 25 ms

Lyft's Marketing Automation Platform -- Symphony

· 3 min read

Acquisition Efficiency Problem:How to achieve a better ROI in advertising?

In details, Lyft's advertisements should meet requirements as below:

  1. being able to manage region-specific ad campaigns
  2. guided by data-driven growth: The growth must be scalable, measurable, and predictable
  3. supporting Lyft's unique growth model as shown below

lyft growth model

However, the biggest challenge is to manage all the processes of cross-region marketing at scale, which include choosing bids, budgets, creatives, incentives, and audiences, running A/B tests, and so on. You can see what occupies a day in the life of a digital marketer:

营销者的一天

We can find out that execution occupies most of the time while analysis, thought as more important, takes much less time. A scaling strategy will enable marketers to concentrate on analysis and decision-making process instead of operational activities.

Solution: Automation

To reduce costs and improve experimental efficiency, we need to

  1. predict the likelihood of a new user to be interested in our product
  2. evaluate effectively and allocate marketing budgets across channels
  3. manage thousands of ad campaigns handily

The marketing performance data flows into the reinforcement-learning system of Lyft: Amundsen

The problems that need to be automated include:

  1. updating bids across search keywords
  2. turning off poor-performing creatives
  3. changing referrals values by market
  4. identifying high-value user segments
  5. sharing strategies across campaigns

Architecture

Lyft Symphony Architecture

The tech stack includes - Apache Hive, Presto, ML platform, Airflow, 3rd-party APIs, UI.

Main components

Lifetime Value(LTV) forecaster

The lifetime value of a user is an important criterion to measure the efficiency of acquisition channels. The budget is determined together by LTV and the price we are willing to pay in that region.

Our knowledge of a new user is limited. The historical data can help us to predict more accurately as the user interacts with our services.

Initial eigenvalue:

特征值

The forecast improves as the historical data of interactivity accumulates:

根据历史记录判断 LTV

Budget allocator

After LTV is predicted, the next is to estimate budgets based on the price. A curve of the form LTV = a * (spend)^b is fit to the data. A degree of randomness will be injected into the cost-curve creation process in order to converge a global optimum.

预算计算

Bidders

Bidders are made up of two parts - the tuners and actors. The tuners decide exact channel-specific parameters based on the price. The actors communicate the actual bid to different channels.

Some popular bidding strategies, applied in different channels, are listed as below:

投放策略

Conclusion

We have to value human experiences in the automation process; otherwise, the quality of the models may be "garbage in, garbage out". Once saved from laboring tasks, marketers can focus more on understanding users, channels, and the messages they want to convey to audiences, and thus obtain better ad impacts. That's how Lyft can achieve a higher ROI with less time and efforts.

Designing Airbnb or a hotel booking system

· 3 min read

Requirements

  • for guests
    • search rooms by locations, dates, number of rooms, and number of guests
    • get room details (like picture, name, review, address, etc.) and prices
    • pay and book room from inventory by date and room id
      • checkout as a guest
      • user is logged in already
    • notification via Email and mobile push notification
  • for hotel or rental administrators (suppliers/hosts)
    • administrators (receptionist/manager/rental owner): manage room inventory and help the guest to check-in and check out
    • housekeeper: clean up rooms routinely

Architecture

Components

Inventory <> Bookings <> Users (guests and hosts)

Suppliers provide their room details in the inventory. And users can search, get, and reserve rooms accordingly. After reserving the room, the user's payment will change the status of the reserved_room as well. You could check the data model in this post.

How to find available rooms?

  • by location: geo-search with spatial indexing, e.g. geo-hash or quad-tree.
  • by room metadata: apply filters or search conditions when querying the database.
  • by date-in and date-out and availability. Two options:
    • option 1: for a given room_id, check all occupied_room today or later, transform the data structure to an array of occupation by days, and finally find available slots in the array. This process might be time-consuming, so we can build the availability index.
    • option 2: for a given room_id, always create an entry for an occupied day. Then it will be easier to query unavailable slots by dates.

For hotels, syncing data

If it is a hotel booking system, then it will probably publish to Booking Channels like GDS, Aggregators, and Wholesalers.

Hotel Booking Ecosystem

To sync data across those places. We can

  1. retry with idempotency to improve the success rate of the external calls and ensure no duplicate orders.
  2. provide webhook callback APIs to external vendors to update status in the internal system.

Payment & Bookkeeping

Data model: double-entry bookkeeping

To execute the payment, since we are calling the external payment gateway, like bank or Stripe, Braintree, etc. It is crucial to keep data in-sync across different places. We need to sync data across the transaction table and external banks and vendors.

Notifier for reminders / alerts

The notification system is essentially a delayer scheduler (priority queue + subscriber) plus API integrations.

For example, a daily cronjob will query the database for notifications to be sent out today and put them into the priority queue by date. The subscriber will get the earliest ones from the priority queue and send out if reaching the expected timestamp. Otherwise, put the task back to the queue and sleep to make the CPU idle for other work, which can be interrupted if there are new alerts added for today.

Designing Memcached or an in-memory KV store

· 2 min read

Requirements

  1. High-performance, distributed key-value store
  • Why distributed?
    • Answer: to hold a larger size of data
  1. For in-memory storage of small data objects
  2. Simple server (pushing complexity to the client) and hence reliable and easy to deploy

Architecture

Big Picture: Client-server

  • client
  • given a list of Memcached servers
  • chooses a server based on the key
  • server
  • store KVs into the internal hash table
  • LRU eviction

The Key-value server consists of a fixed-size hash table + single-threaded handler + coarse locking

hash table

How to handle collisions? Mostly three ways to resolve:

  1. Separate chaining: the collided bucket chains a list of entries with the same index, and you can always append the newly collided key-value pair to the list.
  2. open addressing: if there is a collision, go to the next index until finding an available bucket.
  3. dynamic resizing: resize the hash table and allocate more spaces; hence, collisions will happen less frequently.

How does the client determine which server to query?

See Data Partition and Routing

How to use cache?

See Key value cache

How to further optimize?

See How Facebook Scale its Social Graph Store? TAO

Lyft's Marketing Automation Platform Symphony

· 3 min read

Customer Acquisition Efficiency Issue: How can advertising campaigns achieve higher returns with less money and fewer people?

Specifically, Lyft's advertising campaigns need to address the following characteristics:

  1. Manage location-based campaigns
  2. Data-driven growth: growth must be scalable, measurable, and predictable
  3. Support Lyft's unique growth model, as shown below:

lyft growth model

The main challenge is the difficulty of scaling management across various aspects of regional marketing, including ad bidding, budgeting, creative assets, incentives, audience selection, testing, and more. The following image depicts a day in the life of a marketer:

A Day in the Life of a Marketer

We can see that "execution" takes up most of the time, while less time is spent on the more important tasks of "analysis and decision-making." Scaling means reducing complex operations and allowing marketers to focus on analysis and decision-making.

Solution: Automation

To reduce costs and improve the efficiency of experimentation, it is necessary to:

  1. Predict whether new users are interested in the product
  2. Optimize across multiple channels and effectively evaluate and allocate budgets
  3. Conveniently manage thousands of campaigns

Data is enhanced through Lyft's Amundsen system using reinforcement learning.

The automation components include:

  1. Updating bid keywords
  2. Disabling underperforming creative assets
  3. Adjusting referral values based on market changes
  4. Identifying high-value user segments
  5. Sharing strategies across multiple campaigns

Architecture

Lyft Symphony Architecture

Technology stack: Apache Hive, Presto, ML platform, Airflow, 3rd-party APIs, UI.

Specific Component Modules

LTV Prediction Module

The lifetime value (LTV) of users is an important metric for evaluating channels, and the budget is determined by both LTV and the price we are willing to pay for customer acquisition in that region.

Our understanding of new users is limited, but as interactions increase, the historical data provided will more accurately predict outcomes.

Initial feature values:

Feature Values

As historical interaction records accumulate, the predictions become more accurate:

Predicting LTV Based on Historical Records

Budget Allocation Module

Once LTV is established, the next step is to set the budget based on pricing. A curve of the form LTV = a * (spend)^b is fitted, along with similar parameter curves in the surrounding range. Achieving a global optimum requires some randomness.

Budget Calculation

Delivery Module

This module is divided into two parts: the parameter tuner and the executor. The tuner sets specific parameters based on pricing for each channel, while the executor applies these parameters to the respective channels.

There are many popular delivery strategies that are common across various channels:

Delivery Strategies

Conclusion

It is essential to recognize the importance of human experience within the system; otherwise, it results in garbage in, garbage out. When people are liberated from tedious delivery tasks and can focus on understanding users, channels, and the messages they need to convey to their audience, they can achieve better campaign results—spending less time to achieve higher ROI.

How to write solid code?

· One min read

he likes it

  1. empathy / perspective-taking is the most important.

    1. realize that code is written for human to read first and then for machines to execute.
    2. software is so "soft" and there are many ways to achieve one thing. It's all about making the proper trade-offs to fulfill the requirements.
    3. Invent and Simplify: Apple Pay RFID vs. Wechat Scan QR Code.
  2. choose a sustainable architecture to reduce human resources costs per feature.

  1. adopt patterns and best practices.

  2. avoid anti-patterns

    • missing error-handling
    • callback hell = spaghetti code + unpredictable error handling
    • over-long inheritance chain
    • circular dependency
    • over-complicated code
      • nested tertiary operation
      • comment out unused code
    • missing i18n, especially RTL issues
    • don't repeat yourself
      • simple copy-and-paste
      • unreasonable comments
  3. effective refactoring

    • semantic version
    • never introduce breaking change to non major versions
      • two legged change

Designing a metric system

· 13 min read

Requirements

Log v.s Metric: A log is an event that happened, and a metric is a measurement of the health of a system.

We are assuming that this system’s purpose is to serve metrics - namely, counters, conversion rate, timers, etc. for monitoring the system performance and health. If the conversion rate drops drastically, the system should alert the on-call.

  1. Monitoring business metrics like signup funnel’s conversion rate
  2. Supporting various queries, like on different platforms (IE/Chrome/Safari, iOS/Android/Desktop, etc.)
  3. data visualization
  4. Scalability and Availability

Architecture

Two ways to build the system:

  1. Push Model: Influx/Telegraf/Grafana
  2. Pull Model: Prometheus/Grafana

The pull model is more scalable because it decreases the number of requests going into the metrics databases - there is no hot path and concurrency issue.

Server Farm

Server Farm

write

write

telegraf

telegraf

InfluxDB

InfluxDB

REST API

REST API

Grafana

Grafana

InfluxDB Push Model

InfluxDB Push Model

Prometheus Pull Model

Prometheus Pull Model

Application

Application

Exporter

Exporter

client library

client library

3rd Party


Application

3rd Party<br>Application

pull

pull

Prometheus

Prometheus

Retrieval

Retrieval

Service Discovery

Service Discovery

Storage

Storage

PromQL

PromQL

Alertmanager

Alertmanager

Web UI / Grafana / API Clients

Web UI / Grafana / API Clients

PagerDuty

PagerDuty

Email

Email

Features and Components

Measuring Sign-up Funnel

Take a four-step sign up on the mobile app for example

INPUT_PHONE_NUMBER -> VERIFY_SMS_CODE -> INPUT_NAME -> INPUT_PASSWORD

Every step has IMPRESSION and POST_VERIFICATION phases. And emit metrics like this:

{
"sign_up_session_id": "uuid",
"step": "VERIFY_SMS_CODE",
"os": "iOS",
"phase": "POST_VERIFICATION",
"status": "SUCCESS",
// ... ts, contexts, ...
}

Consequently, we can query the overall conversion rate of VERIFY_SMS_CODE step on iOS like

(counts of step=VERIFY_SMS_CODE, os=iOS, status: SUCCESS, phase: POST_VERIFICATION) / (counts of step=VERIFY_SMS_CODE, os=iOS, phase: IMPRESSION)

Data Visualization

Graphana is mature enough for the data visualization work. If you do not want to expose the whole site, you can use Embed Panel with iframe.

Designing Square Cash or PayPal Money Transfer System

· 19 min read

Clarifying Requirements

Designing a service money transfer backend system like Square Cash (we will call this system Cash App below) or PayPal to

  1. Deposit from and payout to bank
  2. Transfer between accounts
  3. High scalability and availability
  4. i18n: language, timezone, currency exchange
  5. Deduplication for non-idempotent APIs and for at-least-once delivery.
  6. Consistency across multiple data sources.

Architecture

AWS CloudHSM

AWS CloudHSM

Presentation Layer

Presentation Layer

SDK/Docs

SDK/Docs

mobile-dashboard

mobile-dashboard

web-dashboard

web-dashboard

dashboard-client

dashboard-client

mobile-wallet

mobile-wallet

web-wallet

web-wallet

wallet-client

wallet-client

Merchant 


User

Merchant <br>User

End User

End User

web-chrome-extension

web-chrome-extension

Operators

Operators

payment

payment

task-queue

task-queue

financial-reporter

financial-reporter

payment-gateway

payment-gateway

banks / 


vendors

[Not supported by viewer]

side-effect maker

side-effect maker

help service portal

help service portal

User


Profiles


AuthDB


[Not supported by viewer]

api-gateway


monolithic


api-gateway<br>monolithic<br>

Payment


DB


Payment<br>DB<br>

Aurora

Aurora

risk control

risk control

risk control

risk control

Event
Queue

[Not supported by viewer]

Features and Components

Payment Service

The payment data model is essentially “double-entry bookkeeping”. Every entry to an account requires a corresponding and opposite entry to a different account. Sum of all debit and credit equals to zero.

Deposit and Payout

Transaction: new user Jane Doe deposits $100 from bank to Cash App. This one transaction involves those DB entries:

bookkeeping table (for history)

+ debit, USD, 100, CashAppAccountNumber, txId
- credit, USD, 100, RoutingNumber:AccountNumber, txId

transaction table

txId, timestamp, status(pending/confirmed), [bookkeeping entries], narration

Once the bank confirmed the transaction, update the pending status above and the following balance sheet in one transaction.

balance sheet

CashAppAccountNumber, USD, 100

Transfer between accounts within Cash App

Similar to the case above, but there is no pending state because we do not need the slow external system to change their state. All changes in bookkeeping table, transaction table, and balance sheet table happen in one transaction.

i18n

We solve the i18n problems in 3 dimensions.

  1. Language: All texts like copywriting, push notifications, emails are picked up according to the accept-language header.
  2. Timezones: All server timezones are in UTC. We transform timestamps to the local timezone in the client-side.
  3. Currency: All user transferring transactions must be in the same currency. If they want to move across currencies, they have to exchange the currency first, in a rate that is favorable to the Cash App.

For example, Jane Doe wants to exchange 1 USD with 6.8 CNY with 0.2

bookkeeping table

- credit, USD, 1, CashAppAccountNumber, txId
+ debit, CNY, 6.8, CashAppAccountNumber, txId, @7.55 CNY/USD
+ debit, USD, 0.1, ExpensesOfExchangeAccountNumber, txId

Transaction table, balance sheet, etc. are similar to the transaction discussed in Deposit and Payout. The major difference is that the bank or the vendor provides the exchange service.

How to sync across the transaction table and external banks and vendors?

  • retry with idempotency to improve the success rate of the external calls and ensure no duplicate orders.
  • two ways to check if the PENDING orders are filled or failed.
    1. poll: cronjobs (SWF, Airflow, Cadence, etc.) to poll the status for PENDING orders.
    2. callback: provide a callback API for the external vendors.
  • Graceful shutdown. The bank gateway calls may take tens of seconds to finish, and restarting the servers may resume unfinished transactions from the database. The process may create too many connections. To reduce connections, before the shutdown, stop accepting new requests and wait for the existing outgoing ones to wrap up.

Deduplication

Why is Deduplication a concern?

  1. not all endpoints are idempotent
  2. Event queue may be at-least-once.

not all endpoints are idempotent: what if the external system is not idempotent?

For the poll case above, if the external gateway does not support idempotent APIs, in order not to flood with duplicate entries, we must keep record of the order ID or the reference ID the external system gives us with 200, and query GET by the order ID instead of POST all the time.

For the callback case, we can ensure we implement with idempotent APIs, and we mutate pending to confirmed anyway.

Event queue may be at-least-once

  • For the even queue, we can use an exactly-once Kafka with the producer throughput declines only by 3%.
  • In the database layer, we can use idempotency key or deduplication key.
  • In the service layer, we can use Redis key-value store.

Availability and Scalability

Designing payment webhook

· 4 min read

1. Clarifying Requirements

  1. Webhook will call the merchant back once the payment succeeds.
    1. Merchant developer registers webhook information with us.
    2. Make a POST HTTP request to the webhooks reliably and securely.
  2. High availability, error-handling, and failure-resilience.
    1. Async design. Assuming that the servers of merchants are located across the world, and may have a very high latency like 15s.
    2. At-least-once delivery. Idempotent key.
    3. Order does not matter.
    4. Robust & predictable retry and short-circuit.
  3. Security, observability & scalability
    1. Anti-spoofing.
    2. Notify the merchant when their receivers are broken.
    3. easy to extend and scale.

2. Sketch out the high-level design

async design + retry + queuing + observability + security

3. Features and Components

Core Features

  1. Users go to dashboard frontend to register webhook information with us - like the URL to call, the scope of events they want to subscribe, and then get an API key from us.
  2. When there is a new event, publish it into the queue and then get consumed by callers. Callers get the registration and make the HTTP call to external services.

Webhook callers

  1. Subscribe to the event queue for payment success events published by a payment state machine or other services.

  2. Once callers accept an event, fetch webhook URI, secret, and settings from the user settings service. Prepare the request based on those settings. For security...

  • All webhooks from user settings must be in HTTPs

  • If the payload is huge, the prospect latency is high, and we wants to make sure the target reciever is alive, we can verify its existance with a ping carrying a challenge. e.g. Dropbox verifies webhook endpoints by sending a GET request with a “challenge” param (a random string) encoded in the URL, which your endpoint is required to echo back as a response.

  • All callback requests are with header x-webhook-signature. So that the receiver can authenticate the request.

    • For symetric signature, we can use HMAC/SHA256 signature. Its value is HMAC(webhook secret, raw request payload);. Telegram takes this.
    • For asymmetric signature, we can use RSA/SHA256 signature. Its value is RSA(webhook private key, raw request payload); Stripe takes this.
    • If it's sensitive information, we can also consider encryption for the payload instead of just signing.
  1. Make an HTTP POST request to the external merchant's endpoints with event payload and security headers.

API Definition

// POST https://example.com/webhook/
{
"id": 1,
"scheduled_for": "2017-01-31T20:50:02Z",
"event": {
"id": "24934862-d980-46cb-9402-43c81b0cdba6",
"resource": "event",
"type": "charge:created",
"api_version": "2018-03-22",
"created_at": "2017-01-31T20:49:02Z",
"data": {
"code": "66BEOV2A", // or order ID the user need to fulfill
"name": "The Sovereign Individual",
"description": "Mastering the Transition to the Information Age",
"hosted_url": "https://commerce.coinbase.com/charges/66BEOV2A",
"created_at": "2017-01-31T20:49:02Z",
"expires_at": "2017-01-31T21:49:02Z",
"metadata": {},
"pricing_type": "CNY",
"payments": [
// ...
],
"addresses": {
// ...
}
}
}
}

The merchant server should respond with a 200 HTTP status code to acknowledge receipt of a webhook.

Error-handling

If there is no acknowledgment of receipt, we will retry with idempotency key and exponential backoff for up to three days. The maximum retry interval is 1 hour. If it's reaching a certain limit, short-circuit / mark it as broken. Sending out an Email to the merchant.

Metrics

The Webhook callers service emits statuses into the time-series DB for metrics.

Using Statsd + Influx DB vs. Prometheus?

  • InfluxDB: Application pushes data to InfluxDB. It has a monolithic DB for metrics and indices.
  • Prometheus: Prometheus server pulls the metrics values from the running application periodically. It uses LevelDB for indices, but each metric is stored in its own file.

Or use the expensive DataDog or other APM services if you have a generous budget.

Designing Smart Notification of Stock Price Changes

· 15 min read

Requirements

  • 3 million users
  • 5000 stocks + 250 global stocks
  • a user gets notified about the price change when
    1. subscribing the stock
    2. the stock has 5% or 10% changes
    3. since a) the last week or b) the last day
  • extensibility. may support other kinds of notifications like breaking news, earnings call, etc.

Sketching out the Architecture

Contexts:

  • What is clearing? Clearing is the procedure by which financial trades settle – that is, the correct and timely transfer of funds to the seller and securities to the buyer. Often with clearing, a specialized organization acts as an intermediary known as a clearinghouse.
  • What is a stock exchange? A facility where stock brokers and traders can buy and sell securities.

Apple Push Notification service


(APNs)

Apple Push Notification service<br>(APNs)

Google Firebase Cloud Messaging


(FCM)

Google Firebase Cloud Messaging<br>(FCM)

Email Services


AWS SES /sendgrid/etc

Email Services<br>AWS SES /sendgrid/etc

notifier

notifier

External Vendors



Market Prices

[Not supported by viewer]

Robinhood App

Robinhood App

API Gateway

API Gateway

Reverse Proxy

Reverse Proxy

batch write

batch write

price


ticker

[Not supported by viewer]

Time-series DB


influx or prometheus

Time-series DB<br>influx or prometheus

Tick every 5 mins

[Not supported by viewer]

periorical read

periorical read

price


watcher

price<br>watcher

User Settings

User Settings

Notification Queue

Notification Queue

throttler cache

throttler cache

cronjob

cronjob

What are those components and how do they interact with each other?

  • Price ticker
    • data fetching policies
      • option 1 preliminary: fetches data every 5 mins and flush into the time-series database in batches.
      • option 2 advanced: nowadays external systems usually push data directly so that we do not have to pull all the time.
    • ~6000 points per request or per price change.
    • data retention of 1 week, because this is just the speeding layer of the lambda architecture.
  • Price watcher
    • read the data ranging from last week or last 24 hours for each stock.
    • calculate if the fluctuation exceeds 5% or 10% in those two time spans. we get tuples like (stock, up 5%, 1 week).
      • corner case: should we normalize the price data? for example, some abnormal price like someone sold UBER mistakenly for $1 USD.
    • ratelimit (because 5% or 10% delta may occur many times within one day), and then emit an event PRICE_CHANGE(STOCK_CODE, timeSpan, percentage) to the notification queue.
  • Periodical triggers are cron jobs, e.g. Airflow, Cadence.
  • notification queue
    • may not necessarily be introduced in the first place when users and stocks are small.
    • may accept generic messaging event, like PRICE_CHANGE, EARNINGS_CALL, BREAKING_NEWS, etc.
  • Notifier
    • subscribe the notification queue to get the event
    • and then fetch who to notify from the user settings service
    • finally based on user settings, send out messages through APNs, FCM or AWS SES.