Skip to main content

76 posts tagged with "system design"

View all tags

Lyft's Marketing Automation Platform -- Symphony

· 3 min read

Acquisition Efficiency Problem:How to achieve a better ROI in advertising?

In details, Lyft's advertisements should meet requirements as below:

  1. being able to manage region-specific ad campaigns
  2. guided by data-driven growth: The growth must be scalable, measurable, and predictable
  3. supporting Lyft's unique growth model as shown below

lyft growth model

However, the biggest challenge is to manage all the processes of cross-region marketing at scale, which include choosing bids, budgets, creatives, incentives, and audiences, running A/B tests, and so on. You can see what occupies a day in the life of a digital marketer:

营销者的一天

We can find out that execution occupies most of the time while analysis, thought as more important, takes much less time. A scaling strategy will enable marketers to concentrate on analysis and decision-making process instead of operational activities.

Solution: Automation

To reduce costs and improve experimental efficiency, we need to

  1. predict the likelihood of a new user to be interested in our product
  2. evaluate effectively and allocate marketing budgets across channels
  3. manage thousands of ad campaigns handily

The marketing performance data flows into the reinforcement-learning system of Lyft: Amundsen

The problems that need to be automated include:

  1. updating bids across search keywords
  2. turning off poor-performing creatives
  3. changing referrals values by market
  4. identifying high-value user segments
  5. sharing strategies across campaigns

Architecture

Lyft Symphony Architecture

The tech stack includes - Apache Hive, Presto, ML platform, Airflow, 3rd-party APIs, UI.

Main components

Lifetime Value(LTV) forecaster

The lifetime value of a user is an important criterion to measure the efficiency of acquisition channels. The budget is determined together by LTV and the price we are willing to pay in that region.

Our knowledge of a new user is limited. The historical data can help us to predict more accurately as the user interacts with our services.

Initial eigenvalue:

特征值

The forecast improves as the historical data of interactivity accumulates:

根据历史记录判断 LTV

Budget allocator

After LTV is predicted, the next is to estimate budgets based on the price. A curve of the form LTV = a * (spend)^b is fit to the data. A degree of randomness will be injected into the cost-curve creation process in order to converge a global optimum.

预算计算

Bidders

Bidders are made up of two parts - the tuners and actors. The tuners decide exact channel-specific parameters based on the price. The actors communicate the actual bid to different channels.

Some popular bidding strategies, applied in different channels, are listed as below:

投放策略

Conclusion

We have to value human experiences in the automation process; otherwise, the quality of the models may be "garbage in, garbage out". Once saved from laboring tasks, marketers can focus more on understanding users, channels, and the messages they want to convey to audiences, and thus obtain better ad impacts. That's how Lyft can achieve a higher ROI with less time and efforts.

Designing Airbnb or a hotel booking system

· 3 min read

Requirements

  • for guests
    • search rooms by locations, dates, number of rooms, and number of guests
    • get room details (like picture, name, review, address, etc.) and prices
    • pay and book room from inventory by date and room id
      • checkout as a guest
      • user is logged in already
    • notification via Email and mobile push notification
  • for hotel or rental administrators (suppliers/hosts)
    • administrators (receptionist/manager/rental owner): manage room inventory and help the guest to check-in and check out
    • housekeeper: clean up rooms routinely

Architecture

Components

Inventory <> Bookings <> Users (guests and hosts)

Suppliers provide their room details in the inventory. And users can search, get, and reserve rooms accordingly. After reserving the room, the user's payment will change the status of the reserved_room as well. You could check the data model in this post.

How to find available rooms?

  • by location: geo-search with spatial indexing, e.g. geo-hash or quad-tree.
  • by room metadata: apply filters or search conditions when querying the database.
  • by date-in and date-out and availability. Two options:
    • option 1: for a given room_id, check all occupied_room today or later, transform the data structure to an array of occupation by days, and finally find available slots in the array. This process might be time-consuming, so we can build the availability index.
    • option 2: for a given room_id, always create an entry for an occupied day. Then it will be easier to query unavailable slots by dates.

For hotels, syncing data

If it is a hotel booking system, then it will probably publish to Booking Channels like GDS, Aggregators, and Wholesalers.

Hotel Booking Ecosystem

To sync data across those places. We can

  1. retry with idempotency to improve the success rate of the external calls and ensure no duplicate orders.
  2. provide webhook callback APIs to external vendors to update status in the internal system.

Payment & Bookkeeping

Data model: double-entry bookkeeping

To execute the payment, since we are calling the external payment gateway, like bank or Stripe, Braintree, etc. It is crucial to keep data in-sync across different places. We need to sync data across the transaction table and external banks and vendors.

Notifier for reminders / alerts

The notification system is essentially a delayer scheduler (priority queue + subscriber) plus API integrations.

For example, a daily cronjob will query the database for notifications to be sent out today and put them into the priority queue by date. The subscriber will get the earliest ones from the priority queue and send out if reaching the expected timestamp. Otherwise, put the task back to the queue and sleep to make the CPU idle for other work, which can be interrupted if there are new alerts added for today.

Designing Memcached or an in-memory KV store

· 2 min read

Requirements

  1. High-performance, distributed key-value store
  • Why distributed?
    • Answer: to hold a larger size of data
  1. For in-memory storage of small data objects
  2. Simple server (pushing complexity to the client) and hence reliable and easy to deploy

Architecture

Big Picture: Client-server

  • client
  • given a list of Memcached servers
  • chooses a server based on the key
  • server
  • store KVs into the internal hash table
  • LRU eviction

The Key-value server consists of a fixed-size hash table + single-threaded handler + coarse locking

hash table

How to handle collisions? Mostly three ways to resolve:

  1. Separate chaining: the collided bucket chains a list of entries with the same index, and you can always append the newly collided key-value pair to the list.
  2. open addressing: if there is a collision, go to the next index until finding an available bucket.
  3. dynamic resizing: resize the hash table and allocate more spaces; hence, collisions will happen less frequently.

How does the client determine which server to query?

See Data Partition and Routing

How to use cache?

See Key value cache

How to further optimize?

See How Facebook Scale its Social Graph Store? TAO

Lyft's Marketing Automation Platform Symphony

· 3 min read

Customer Acquisition Efficiency Issue: How can advertising campaigns achieve higher returns with less money and fewer people?

Specifically, Lyft's advertising campaigns need to address the following characteristics:

  1. Manage location-based campaigns
  2. Data-driven growth: growth must be scalable, measurable, and predictable
  3. Support Lyft's unique growth model, as shown below:

lyft growth model

The main challenge is the difficulty of scaling management across various aspects of regional marketing, including ad bidding, budgeting, creative assets, incentives, audience selection, testing, and more. The following image depicts a day in the life of a marketer:

A Day in the Life of a Marketer

We can see that "execution" takes up most of the time, while less time is spent on the more important tasks of "analysis and decision-making." Scaling means reducing complex operations and allowing marketers to focus on analysis and decision-making.

Solution: Automation

To reduce costs and improve the efficiency of experimentation, it is necessary to:

  1. Predict whether new users are interested in the product
  2. Optimize across multiple channels and effectively evaluate and allocate budgets
  3. Conveniently manage thousands of campaigns

Data is enhanced through Lyft's Amundsen system using reinforcement learning.

The automation components include:

  1. Updating bid keywords
  2. Disabling underperforming creative assets
  3. Adjusting referral values based on market changes
  4. Identifying high-value user segments
  5. Sharing strategies across multiple campaigns

Architecture

Lyft Symphony Architecture

Technology stack: Apache Hive, Presto, ML platform, Airflow, 3rd-party APIs, UI.

Specific Component Modules

LTV Prediction Module

The lifetime value (LTV) of users is an important metric for evaluating channels, and the budget is determined by both LTV and the price we are willing to pay for customer acquisition in that region.

Our understanding of new users is limited, but as interactions increase, the historical data provided will more accurately predict outcomes.

Initial feature values:

Feature Values

As historical interaction records accumulate, the predictions become more accurate:

Predicting LTV Based on Historical Records

Budget Allocation Module

Once LTV is established, the next step is to set the budget based on pricing. A curve of the form LTV = a * (spend)^b is fitted, along with similar parameter curves in the surrounding range. Achieving a global optimum requires some randomness.

Budget Calculation

Delivery Module

This module is divided into two parts: the parameter tuner and the executor. The tuner sets specific parameters based on pricing for each channel, while the executor applies these parameters to the respective channels.

There are many popular delivery strategies that are common across various channels:

Delivery Strategies

Conclusion

It is essential to recognize the importance of human experience within the system; otherwise, it results in garbage in, garbage out. When people are liberated from tedious delivery tasks and can focus on understanding users, channels, and the messages they need to convey to their audience, they can achieve better campaign results—spending less time to achieve higher ROI.

How to write solid code?

· One min read

he likes it

  1. empathy / perspective-taking is the most important.

    1. realize that code is written for human to read first and then for machines to execute.
    2. software is so "soft" and there are many ways to achieve one thing. It's all about making the proper trade-offs to fulfill the requirements.
    3. Invent and Simplify: Apple Pay RFID vs. Wechat Scan QR Code.
  2. choose a sustainable architecture to reduce human resources costs per feature.

  1. adopt patterns and best practices.

  2. avoid anti-patterns

    • missing error-handling
    • callback hell = spaghetti code + unpredictable error handling
    • over-long inheritance chain
    • circular dependency
    • over-complicated code
      • nested tertiary operation
      • comment out unused code
    • missing i18n, especially RTL issues
    • don't repeat yourself
      • simple copy-and-paste
      • unreasonable comments
  3. effective refactoring

    • semantic version
    • never introduce breaking change to non major versions
      • two legged change

Designing a metric system

· 17 min read

Requirements

Log v.s Metric: A log is an event that happened, and a metric is a measurement of the health of a system.

We are assuming that this system’s purpose is to serve metrics - namely, counters, conversion rate, timers, etc. for monitoring the system performance and health. If the conversion rate drops drastically, the system should alert the on-call.

  1. Monitoring business metrics like signup funnel’s conversion rate
  2. Supporting various queries, like on different platforms (IE/Chrome/Safari, iOS/Android/Desktop, etc.)
  3. data visualization
  4. Scalability and Availability

Architecture

Two ways to build the system:

  1. Push Model: Influx/Telegraf/Grafana
  2. Pull Model: Prometheus/Grafana

The pull model is more scalable because it decreases the number of requests going into the metrics databases - there is no hot path and concurrency issue.

Server Farm

Server Farm

write

write

telegraf

telegraf

InfluxDB

InfluxDB

REST API

REST API

Grafana

Grafana

InfluxDB Push Model

InfluxDB Push Model

Prometheus Pull Model

Prometheus Pull Model

Application

Application

Exporter

Exporter

client library

client library

3rd Party


Application

3rd Party<br>Application

pull

pull

Prometheus

Prometheus

Retrieval

Retrieval

Service Discovery

Service Discovery

Storage

Storage

PromQL

PromQL

Alertmanager

Alertmanager

Web UI / Grafana / API Clients

Web UI / Grafana / API Clients

PagerDuty

PagerDuty

Email

Email

Features and Components

Measuring Sign-up Funnel

Take a four-step sign up on the mobile app for example

INPUT_PHONE_NUMBER -> VERIFY_SMS_CODE -> INPUT_NAME -> INPUT_PASSWORD

Every step has IMPRESSION and POST_VERIFICATION phases. And emit metrics like this:

{
"sign_up_session_id": "uuid",
"step": "VERIFY_SMS_CODE",
"os": "iOS",
"phase": "POST_VERIFICATION",
"status": "SUCCESS",
// ... ts, contexts, ...
}

Consequently, we can query the overall conversion rate of VERIFY_SMS_CODE step on iOS like

(counts of step=VERIFY_SMS_CODE, os=iOS, status: SUCCESS, phase: POST_VERIFICATION) / (counts of step=VERIFY_SMS_CODE, os=iOS, phase: IMPRESSION)

Data Visualization

Graphana is mature enough for the data visualization work. If you do not want to expose the whole site, you can use Embed Panel with iframe.

Designing Square Cash or PayPal Money Transfer System

· 24 min read

Clarifying Requirements

Designing a service money transfer backend system like Square Cash (we will call this system Cash App below) or PayPal to

  1. Deposit from and payout to bank
  2. Transfer between accounts
  3. High scalability and availability
  4. i18n: language, timezone, currency exchange
  5. Deduplication for non-idempotent APIs and for at-least-once delivery.
  6. Consistency across multiple data sources.

Architecture

AWS CloudHSM

AWS CloudHSM

Presentation Layer

Presentation Layer

SDK/Docs

SDK/Docs

mobile-dashboard

mobile-dashboard

web-dashboard

web-dashboard

dashboard-client

dashboard-client

mobile-wallet

mobile-wallet

web-wallet

web-wallet

wallet-client

wallet-client

Merchant 


User

Merchant <br>User

End User

End User

web-chrome-extension

web-chrome-extension

Operators

Operators

payment

payment

task-queue

task-queue

financial-reporter

financial-reporter

payment-gateway

payment-gateway

banks / 


vendors

[Not supported by viewer]

side-effect maker

side-effect maker

help service portal

help service portal

User


Profiles


AuthDB


[Not supported by viewer]

api-gateway


monolithic


api-gateway<br>monolithic<br>

Payment


DB


Payment<br>DB<br>

Aurora

Aurora

risk control

risk control

risk control

risk control

Event
Queue

[Not supported by viewer]

Features and Components

Payment Service

The payment data model is essentially “double-entry bookkeeping”. Every entry to an account requires a corresponding and opposite entry to a different account. Sum of all debit and credit equals to zero.

Deposit and Payout

Transaction: new user Jane Doe deposits $100 from bank to Cash App. This one transaction involves those DB entries:

bookkeeping table (for history)

+ debit, USD, 100, CashAppAccountNumber, txId
- credit, USD, 100, RoutingNumber:AccountNumber, txId

transaction table

txId, timestamp, status(pending/confirmed), [bookkeeping entries], narration

Once the bank confirmed the transaction, update the pending status above and the following balance sheet in one transaction.

balance sheet

CashAppAccountNumber, USD, 100

Transfer between accounts within Cash App

Similar to the case above, but there is no pending state because we do not need the slow external system to change their state. All changes in bookkeeping table, transaction table, and balance sheet table happen in one transaction.

i18n

We solve the i18n problems in 3 dimensions.

  1. Language: All texts like copywriting, push notifications, emails are picked up according to the accept-language header.
  2. Timezones: All server timezones are in UTC. We transform timestamps to the local timezone in the client-side.
  3. Currency: All user transferring transactions must be in the same currency. If they want to move across currencies, they have to exchange the currency first, in a rate that is favorable to the Cash App.

For example, Jane Doe wants to exchange 1 USD with 6.8 CNY with 0.2

bookkeeping table

- credit, USD, 1, CashAppAccountNumber, txId
+ debit, CNY, 6.8, CashAppAccountNumber, txId, @7.55 CNY/USD
+ debit, USD, 0.1, ExpensesOfExchangeAccountNumber, txId

Transaction table, balance sheet, etc. are similar to the transaction discussed in Deposit and Payout. The major difference is that the bank or the vendor provides the exchange service.

How to sync across the transaction table and external banks and vendors?

  • retry with idempotency to improve the success rate of the external calls and ensure no duplicate orders.
  • two ways to check if the PENDING orders are filled or failed.
    1. poll: cronjobs (SWF, Airflow, Cadence, etc.) to poll the status for PENDING orders.
    2. callback: provide a callback API for the external vendors.
  • Graceful shutdown. The bank gateway calls may take tens of seconds to finish, and restarting the servers may resume unfinished transactions from the database. The process may create too many connections. To reduce connections, before the shutdown, stop accepting new requests and wait for the existing outgoing ones to wrap up.

Deduplication

Why is Deduplication a concern?

  1. not all endpoints are idempotent
  2. Event queue may be at-least-once.

not all endpoints are idempotent: what if the external system is not idempotent?

For the poll case above, if the external gateway does not support idempotent APIs, in order not to flood with duplicate entries, we must keep record of the order ID or the reference ID the external system gives us with 200, and query GET by the order ID instead of POST all the time.

For the callback case, we can ensure we implement with idempotent APIs, and we mutate pending to confirmed anyway.

Event queue may be at-least-once

  • For the even queue, we can use an exactly-once Kafka with the producer throughput declines only by 3%.
  • In the database layer, we can use idempotency key or deduplication key.
  • In the service layer, we can use Redis key-value store.

Availability and Scalability

Designing payment webhook

· 4 min read

1. Clarifying Requirements

  1. Webhook will call the merchant back once the payment succeeds.
    1. Merchant developer registers webhook information with us.
    2. Make a POST HTTP request to the webhooks reliably and securely.
  2. High availability, error-handling, and failure-resilience.
    1. Async design. Assuming that the servers of merchants are located across the world, and may have a very high latency like 15s.
    2. At-least-once delivery. Idempotent key.
    3. Order does not matter.
    4. Robust & predictable retry and short-circuit.
  3. Security, observability & scalability
    1. Anti-spoofing.
    2. Notify the merchant when their receivers are broken.
    3. easy to extend and scale.

2. Sketch out the high-level design

async design + retry + queuing + observability + security

3. Features and Components

Core Features

  1. Users go to dashboard frontend to register webhook information with us - like the URL to call, the scope of events they want to subscribe, and then get an API key from us.
  2. When there is a new event, publish it into the queue and then get consumed by callers. Callers get the registration and make the HTTP call to external services.

Webhook callers

  1. Subscribe to the event queue for payment success events published by a payment state machine or other services.

  2. Once callers accept an event, fetch webhook URI, secret, and settings from the user settings service. Prepare the request based on those settings. For security...

  • All webhooks from user settings must be in HTTPS

  • If the payload is huge, the prospect latency is high, and we want to make sure the target receiver is alive, we can verify its existence with a ping carrying a challenge. e.g. Dropbox verifies webhook endpoints by sending a GET request with a “challenge” param (a random string) encoded in the URL, which your endpoint is required to echo back as a response.

  • All callback requests are with header x-webhook-signature. So that the receiver can authenticate the request.

    • For symmetric signature, we can use HMAC/SHA256 signature. Its value is HMAC(webhook secret, raw request payload);. Telegram takes this.
      • For asymmetric signature, we can use RSA/SHA256 signature. Its value is RSA(webhook private key, raw request payload); Stripe takes this.
      • If it's sensitive information, we can also consider encryption for the payload instead of just signing.
  1. Make an HTTP POST request to the external merchant's endpoints with event payload and security headers.

API Definition

// POST https://example.com/webhook/
{
"id": 1,
"scheduled_for": "2017-01-31T20:50:02Z",
"event": {
"id": "24934862-d980-46cb-9402-43c81b0cdba6",
"resource": "event",
"type": "charge:created",
"api_version": "2018-03-22",
"created_at": "2017-01-31T20:49:02Z",
"data": {
"code": "66BEOV2A", // or order ID the user need to fulfill
"name": "The Sovereign Individual",
"description": "Mastering the Transition to the Information Age",
"hosted_url": "https://commerce.coinbase.com/charges/66BEOV2A",
"created_at": "2017-01-31T20:49:02Z",
"expires_at": "2017-01-31T21:49:02Z",
"metadata": {},
"pricing_type": "CNY",
"payments": [
// ...
],
"addresses": {
// ...
}
}
}
}

The merchant server should respond with a 200 HTTP status code to acknowledge receipt of a webhook.

Error-handling

If there is no acknowledgment of receipt, we will retry with idempotency key and exponential backoff for up to three days. The maximum retry interval is 1 hour. If it's reaching a certain limit, short-circuit / mark it as broken. Sending out an Email to the merchant.

Metrics

The Webhook callers service emits statuses into the time-series DB for metrics.

Using Statsd + Influx DB vs. Prometheus?

  • InfluxDB: Application pushes data to InfluxDB. It has a monolithic DB for metrics and indices.
  • Prometheus: Prometheus server pulls the metrics values from the running application periodically. It uses LevelDB for indices, but each metric is stored in its own file.

Or use the expensive DataDog or other APM services if you have a generous budget.

Designing Smart Notification of Stock Price Changes

· 18 min read

Requirements

  • 3 million users
  • 5000 stocks + 250 global stocks
  • a user gets notified about the price change when
    1. subscribing the stock
    2. the stock has 5% or 10% changes
    3. since a) the last week or b) the last day
  • extensibility. may support other kinds of notifications like breaking news, earnings call, etc.

Sketching out the Architecture

Contexts:

  • What is clearing? Clearing is the procedure by which financial trades settle – that is, the correct and timely transfer of funds to the seller and securities to the buyer. Often with clearing, a specialized organization acts as an intermediary known as a clearinghouse.
  • What is a stock exchange? A facility where stock brokers and traders can buy and sell securities.

Apple Push Notification service


(APNs)

Apple Push Notification service<br>(APNs)

Google Firebase Cloud Messaging


(FCM)

Google Firebase Cloud Messaging<br>(FCM)

Email Services


AWS SES /sendgrid/etc

Email Services<br>AWS SES /sendgrid/etc

notifier

notifier

External Vendors



Market Prices

[Not supported by viewer]

Robinhood App

Robinhood App

API Gateway

API Gateway

Reverse Proxy

Reverse Proxy

batch write

batch write

price


ticker

[Not supported by viewer]

Time-series DB


influx or prometheus

Time-series DB<br>influx or prometheus

Tick every 5 mins

[Not supported by viewer]

periorical read

periorical read

price


watcher

price<br>watcher

User Settings

User Settings

Notification Queue

Notification Queue

throttler cache

throttler cache

cronjob

cronjob

What are those components and how do they interact with each other?

  • Price ticker
    • data fetching policies
      • option 1 preliminary: fetches data every 5 mins and flush into the time-series database in batches.
      • option 2 advanced: nowadays external systems usually push data directly so that we do not have to pull all the time.
    • ~6000 points per request or per price change.
    • data retention of 1 week, because this is just the speeding layer of the lambda architecture.
  • Price watcher
    • read the data ranging from last week or last 24 hours for each stock.
    • calculate if the fluctuation exceeds 5% or 10% in those two time spans. we get tuples like (stock, up 5%, 1 week).
      • corner case: should we normalize the price data? for example, some abnormal price like someone sold UBER mistakenly for $1 USD.
    • ratelimit (because 5% or 10% delta may occur many times within one day), and then emit an event PRICE_CHANGE(STOCK_CODE, timeSpan, percentage) to the notification queue.
  • Periodical triggers are cron jobs, e.g. Airflow, Cadence.
  • notification queue
    • may not necessarily be introduced in the first place when users and stocks are small.
    • may accept generic messaging event, like PRICE_CHANGE, EARNINGS_CALL, BREAKING_NEWS, etc.
  • Notifier
    • subscribe the notification queue to get the event
    • and then fetch who to notify from the user settings service
    • finally based on user settings, send out messages through APNs, FCM or AWS SES.

Designing Stock Exchange

· 20 min read

Requirements

  • order-matching system for buy and sell orders. Types of orders:
    • Market Orders
    • Limit Orders
    • Stop-Loss Orders
    • Fill-or-Kill Orders
    • Duration of Orders
  • high availability and low latency for millions of users
    • async design - use messaging queue extensively (btw. side-effect: engineers work on one service pub to a queue and does not even know where exactly is the downstream service and hence cannot do evil.)

Architecture

Reverse Proxy

Reverse Proxy

API Gateway

API Gateway

Order Matching

Order Matching

User Store

User Store

settle

settle

Orders

Orders

Stock Meta

Stock Meta

auth

auth

Cache

Cache

Balances & Bookkeeping

Balances & Bookkeeping

external pricing

external pricing

clearing


house

clearing<br>house

Bank, ACH, Visa, etc

Bank, ACH, Visa, etc

Payment

Payment

Audit & Report

Audit & Report

Components and How do they interact with each other.

order matching system

  • shard by stock code
  • order's basic data model (other metadata are omitted): Order(id, stock, side, time, qty, price)
  • the core abstraction of the order book is the matching algorithm. there are a bunch of matching algorithms(ref to stackoverflow, ref to medium)
  • example 1: price-time FIFO - a kind of 2D vector cast or flatten into 1D vector
    • x-axis is price
    • y-axis is orders. Price/time priority queue, FIFO.
      • Buy-side: ascending in price, descending in time.
      • Sell-side: ascending in price, ascending in time.
    • in other words
      • Buy-side: the higher the price and the earlier the order, the nearer we should put it to the center of the matching.
      • Sell-side: the lower the price and the earlier the order, the nearer we should put it to the center of the matching.

x-axis

line of prices

with y-axis cast into x-axis

Id   Side    Time   Qty   Price   Qty    Time   Side  
---+------+-------+-----+-------+-----+-------+------
#3 20.30 200 09:05 SELL
#1 20.30 100 09:01 SELL
#2 20.25 100 09:03 SELL
#5 BUY 09:08 200 20.20
#4 BUY 09:06 100 20.15
#6 BUY 09:09 200 20.15

Order book from Coinbase Pro

The Single Stock-Exchange Simulator

  • example 2: pro-rata

pure pro-rata

How to implement the price-time FIFO matching algorithm?

  • shard by stock, CP over AP: one stock one partition
  • stateful in-memory tree-map
    • periodically iterate the treemap to match orders
  • data persistence with cassandra
  • in/out requests of the order matching services are made through messaging queues
  • failover
    • the in-memory tree-maps are snapshotting into database
    • in an error case, recover from the snapshot and de-duplicate with cache

How to transmit data of the order book to the client-side in realtime?

  • websocket

How to support different kinds of orders?

  • same SELL or BUY: qty @ price in the treemap with different creation setup and matching conditions
    • Market Orders: place the order at the last market price.
    • Limit Orders: place the order with at a specific price.
    • Stop-Loss Orders: place the order with at a specific price, and match it in certain conditions.
    • Fill-or-Kill Orders: place the order with at a specific price, but match it only once.
    • Duration of Orders: place the order with at a specific price, but match it only in the given time span.

Orders Service

  • Preserves all active orders and order history.
  • Writes to order matching when receives a new order.
  • Receives matched orders and settle with external clearing house (async external gateway call + cronjob to sync DB)

References

Introduction to Architecture

· 3 min read

What is Architecture?

Architecture is the shape of a software system. To illustrate with a building:

  • Paradigm is the bricks.
  • Design principles are the rooms.
  • Components are the structure.

They all serve a specific purpose, just like hospitals treat patients and schools educate students.

Why Do We Need Architecture?

Behavior vs. Structure

Every software system provides two distinct values to stakeholders: behavior and structure. Software developers must ensure that both values are high.

==Due to the nature of their work, software architects focus more on the structure of the system rather than its features and functions.==

Ultimate Goal — ==Reduce the human resource costs required for adding new features==

Architecture serves the entire lifecycle of software systems, making them easy to understand, develop, test, deploy, and operate. Its goal is to minimize the human resource costs for each business use case.

O'Reilly's "Software Architecture" provides a great introduction to these five fundamental architectures.

1. Layered Architecture

Layered architecture is widely adopted and well-known among developers. Therefore, it is the de facto standard at the application level. If you are unsure which architecture to use, layered architecture is a good choice.

Examples:

  • TCP/IP model: Application Layer > Transport Layer > Internet Layer > Network Interface Layer
  • Facebook TAO: Network Layer > Cache Layer (follower + leader) > Database Layer

Pros and Cons:

  • Pros
    • Easy to use
    • Clear responsibilities
    • Testability
  • Cons
    • Large and rigid
      • Adjusting, extending, or updating the architecture requires changes across all layers, which can be quite tricky.

2. Event-Driven Architecture

Any change in state triggers an event in the system. Communication between system components is accomplished through events.

A simplified architecture includes a mediator, event queue, and channels. The diagram below illustrates a simplified event-driven architecture:

Examples:

  • QT: Signals and Slots
  • Payment infrastructure: As bank gateways often have high latency, asynchronous techniques are used in banking architecture.

3. Microkernel Architecture (aka Plug-in Architecture)

The functionality of the software is distributed between a core and multiple plugins. The core contains only the most basic functionalities. Each plugin operates independently and implements shared interfaces to achieve different goals.

Examples:

  • Visual Studio Code and Eclipse
  • MINIX operating system

4. Microservices Architecture

Large systems are decomposed into numerous microservices, each a separately deployable unit that communicates via RPCs.

uber architecture

Examples:

5. Space-Based Architecture

The name "Space-Based Architecture" comes from "tuple space," which implies a "distributed shared space." In space-based architecture, there are no databases or synchronized database access, thus avoiding database bottleneck issues. All processing units share copies of application data in memory. These processing units can be flexibly started and stopped.

Example: See Wikipedia

  • Primarily adopted by Java-based architectures: for example, JavaSpaces.

Introduction to Architecture

· 3 min read

What is architecture?

Architecture is the shape of the software system. Thinking it as a big picture of physical buildings.

  • paradigms are bricks.
  • design principles are rooms.
  • components are buildings.

Together they serve a specific purpose like a hospital is for curing patients and a school is for educating students.

Why do we need architecture?

Behavior vs. Structure

Every software system provides two different values to the stakeholders: behavior and structure. Software developers are responsible for ensuring that both those values remain high.

==Software architects are, by virtue of their job description, more focused on the structure of the system than on its features and functions.==

Ultimate Goal - ==saving human resources costs per feature==

Architecture serves the full lifecycle of the software system to make it easy to understand, develop, test, deploy, and operate. The goal is to minimize the human resources costs per business use-case.

The O’Reilly book Software Architecture Patterns by Mark Richards is a simple but effective introduction to these five fundamental architectures.

1. Layered Architecture

The layered architecture is the most common in adoption, well-known among developers, and hence the de facto standard for applications. If you do not know what architecture to use, use it.

Examples

  • TCP / IP Model: Application layer > transport layer > internet layer > network access layer
  • Facebook TAO: web layer > cache layer (follower + leader) > database layer

Pros and Cons

  • Pros
    • ease of use
    • separation of responsibility
    • testability
  • Cons
    • monolithic
      • hard to adjust, extend or update. You have to make changes to all the layers.

2. Event-Driven Architecture

A state change will emit an event to the system. All the components communicate with each other through events.

A simple project can combine the mediator, event queue, and channel. Then we get a simplified architecture:

Examples

  • QT: Signals and Slots
  • Payment Infrastructure: Bank gateways usually have very high latencies, so they adopt async technologies in their architecture design.

3. Micro-kernel Architecture (aka Plug-in Architecture)

The software's responsibilities are divided into one "core" and multiple "plugins". The core contains the bare minimum functionality. Plugins are independent of each other and implement shared interfaces to achieve different goals.

Examples

  • Visual Studio Code, Eclipse
  • MINIX operating system

4. Microservices Architecture

A massive system is decoupled to multiple micro-services, each of which is a separately deployed unit, and they communicate with each other via RPCs.

uber architecture

Examples

5. Space-based Architecture

This pattern gets its name from "tuple space", which means “distributed shared memory". There is no database or synchronous database access, and thus no database bottleneck. All the processing units share the replicated application data in memory. These processing units can be started up and shut down elastically.

Examples: See Wikipedia

  • Mostly adopted among Java users: e.g., JavaSpaces