76 posts tagged with "system design"

Lyft's Marketing Automation Platform -- Symphony

October 9, 2019 · 3 min read

Acquisition Efficiency Problem: How to achieve a better ROI in advertising?

In detail, Lyft's advertisements should meet the requirements below:

being able to manage region-specific ad campaigns
guided by data-driven growth: The growth must be scalable, measurable, and predictable
supporting Lyft's unique growth model as shown below

lyft growth model

However, the biggest challenge is to manage all the processes of cross-region marketing at scale, which include choosing bids, budgets, creatives, incentives, and audiences, running A/B tests, and so on. You can see what occupies a day in the life of a digital marketer:

A day in the life of a marketer

Execution occupies most of the time, while analysis, which is considered more important, takes much less. A scaling strategy will enable marketers to concentrate on analysis and decision-making instead of operational activities.

Solution: Automation

To reduce costs and improve experimental efficiency, we need to

predict the likelihood that a new user will be interested in our product
evaluate effectively and allocate marketing budgets across channels
manage thousands of ad campaigns handily

The marketing performance data flows into the reinforcement-learning system of Lyft: Amundsen

The problems that need to be automated include:

updating bids across search keywords
turning off poor-performing creatives
changing referral values by market
identifying high-value user segments
sharing strategies across campaigns

Architecture

Lyft Symphony Architecture

The tech stack includes - Apache Hive, Presto, ML platform, Airflow, 3rd-party APIs, UI.

Main components

Lifetime Value(LTV) forecaster

The lifetime value of a user is an important criterion to measure the efficiency of acquisition channels. The budget is determined together by LTV and the price we are willing to pay in that region.

Our knowledge of a new user is limited. The historical data can help us to predict more accurately as the user interacts with our services.

Initial features:

Feature values

The forecast improves as the historical data of interactivity accumulates:

Predicting LTV based on historical records

Budget allocator

After LTV is predicted, the next step is to estimate budgets based on the price. A curve of the form LTV = a * (spend)^b is fit to the data. A degree of randomness is injected into the cost-curve creation process in order to converge to a global optimum.

Budget calculation
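As a rough illustration of the curve fit described above, the sketch below fits LTV = a * (spend)^b by ordinary least squares in log-log space, using only the standard library. The function names, the sample numbers, and the way noise is injected (`perturbed_fit` multiplies each observation by a small random factor before refitting) are assumptions for illustration; the post does not say how Lyft implements either step.

```python
import math
import random


def fit_power_curve(spends, ltvs):
    """Fit LTV = a * spend**b via linear regression on log(LTV) = log(a) + b * log(spend)."""
    xs = [math.log(s) for s in spends]
    ys = [math.log(v) for v in ltvs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def perturbed_fit(spends, ltvs, noise=0.05, seed=None):
    """Refit after jittering each LTV by up to +/- noise (an assumed randomization scheme)."""
    rng = random.Random(seed)
    noisy = [v * (1 + rng.uniform(-noise, noise)) for v in ltvs]
    return fit_power_curve(spends, noisy)


# Made-up data that follows LTV = 50 * spend**0.5 exactly.
spends = [100.0, 200.0, 400.0, 800.0]
ltvs = [50.0 * s ** 0.5 for s in spends]
a, b = fit_power_curve(spends, ltvs)
```

An exponent b below 1 means diminishing returns: each extra dollar of spend buys less LTV than the previous one, which is what makes the budget split across channels an optimization problem.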

Bidders

Bidders are made up of two parts - the tuners and actors. The tuners decide exact channel-specific parameters based on the price. The actors communicate the actual bid to different channels.

Some popular bidding strategies, applied in different channels, are listed below:

Bidding strategies

Conclusion

We have to value human experience in the automation process; otherwise, the quality of the models may be "garbage in, garbage out". Once freed from laborious tasks, marketers can focus more on understanding users, channels, and the messages they want to convey to audiences, and thus obtain better ad impact. That's how Lyft can achieve a higher ROI with less time and effort.

Designing Airbnb or a hotel booking system

October 6, 2019 · 3 min read

Requirements

for guests
- search rooms by locations, dates, number of rooms, and number of guests
- get room details (like picture, name, review, address, etc.) and prices
- pay and book room from inventory by date and room id
  - checkout as a guest
  - user is logged in already
- notification via Email and mobile push notification
for hotel or rental administrators (suppliers/hosts)
- administrators (receptionist/manager/rental owner): manage room inventory and help the guest to check-in and check out
- housekeeper: clean up rooms routinely

Architecture

Components

Inventory <> Bookings <> Users (guests and hosts)

Suppliers provide their room details in the inventory. And users can search, get, and reserve rooms accordingly. After reserving the room, the user's payment will change the status of the reserved_room as well. You could check the data model in this post.

How to find available rooms?

by location: geo-search with spatial indexing, e.g. geo-hash or quad-tree.
by room metadata: apply filters or search conditions when querying the database.
by date-in and date-out and availability. Two options:
- option 1: for a given room_id, check all occupied_room today or later, transform the data structure to an array of occupation by days, and finally find available slots in the array. This process might be time-consuming, so we can build the availability index.
- option 2: for a given room_id, always create an entry for an occupied day. Then it will be easier to query unavailable slots by dates.
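Option 2 can be sketched as follows, with one entry per occupied day. The in-memory set `occupied_days` is a hypothetical stand-in for an occupied_room table keyed by (room_id, day), and the function names are made up for this example; a real system would also need the check and the insert to happen inside one database transaction.

```python
from datetime import date, timedelta

# Hypothetical stand-in for an occupied_room table with a unique (room_id, day) key.
occupied_days = set()


def stay_days(check_in, check_out):
    """Days covered by a stay: check-in day inclusive, check-out day exclusive."""
    nights = (check_out - check_in).days
    return [check_in + timedelta(days=i) for i in range(nights)]


def is_available(room_id, check_in, check_out):
    """A room is available if none of the requested days has an occupied entry."""
    return all((room_id, day) not in occupied_days
               for day in stay_days(check_in, check_out))


def reserve(room_id, check_in, check_out):
    """Create one occupied entry per day; refuse the booking if any day is taken."""
    if not is_available(room_id, check_in, check_out):
        return False
    for day in stay_days(check_in, check_out):
        occupied_days.add((room_id, day))
    return True
```

Because the check-out day is exclusive, a guest can check in on the same day another guest checks out.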

For hotels, syncing data

If it is a hotel booking system, then it will probably publish to Booking Channels like GDS, Aggregators, and Wholesalers.

Hotel Booking Ecosystem

To sync data across those places, we can

retry with idempotency to improve the success rate of the external calls and ensure no duplicate orders.
provide webhook callback APIs to external vendors to update status in the internal system.

Payment & Bookkeeping

Data model: double-entry bookkeeping

To execute the payment, we call an external payment gateway, like a bank, Stripe, or Braintree, so it is crucial to keep data in sync across different places. We need to sync data across the transaction table and external banks and vendors.

Notifier for reminders / alerts

The notification system is essentially a delayed scheduler (priority queue + subscriber) plus API integrations.

For example, a daily cronjob will query the database for notifications to be sent out today and put them into the priority queue by date. The subscriber takes the earliest ones from the priority queue and sends them out once the expected timestamp is reached. Otherwise, it puts the task back into the queue and sleeps, leaving the CPU idle for other work; the sleep can be interrupted if new alerts are added for today.
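The priority-queue half of that scheduler might look like the sketch below. The class name `DelayedScheduler` and its methods are hypothetical, timestamps are plain numbers (for example epoch seconds), and the actual sleeping and sending are left to the caller so the logic stays easy to test.

```python
import heapq
import itertools


class DelayedScheduler:
    """Min-heap of (send_at, sequence, message); the earliest notification is on top."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker so messages are never compared

    def schedule(self, send_at, message):
        heapq.heappush(self._heap, (send_at, next(self._counter), message))

    def pop_due(self, now):
        """Remove and return every message whose timestamp has been reached."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, message = heapq.heappop(self._heap)
            due.append(message)
        return due

    def seconds_until_next(self, now):
        """How long the subscriber may sleep; None when the queue is empty."""
        if not self._heap:
            return None
        return max(0.0, self._heap[0][0] - now)
```

A subscriber loop would call `pop_due`, send the results, and then wait for `seconds_until_next` on something interruptible, such as `threading.Event.wait(timeout)`, so that a newly scheduled alert can wake it early.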

Designing Memcached or an in-memory KV store

October 3, 2019 · 2 min read

Requirements

High-performance, distributed key-value store

Why distributed?
- Answer: to hold a larger size of data

For in-memory storage of small data objects
Simple server (pushing complexity to the client) and hence reliable and easy to deploy

Architecture

Big Picture: Client-server

client
given a list of Memcached servers
chooses a server based on the key
server
store KVs into the internal hash table
LRU eviction
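A minimal sketch of both halves, under stated assumptions: `choose_server` uses a plain hash-modulo rule (real clients often use consistent hashing instead, so that adding a server remaps fewer keys), and `LRUCache` uses `collections.OrderedDict` rather than the fixed-size hash table a real server would manage itself. Both names are made up for this example.

```python
import hashlib
from collections import OrderedDict


def choose_server(key, servers):
    """Client side: map a key to one server with a stable hash (simple modulo scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


class LRUCache:
    """Server side: key-value store that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry
```

`hashlib.md5` is used instead of the built-in `hash()` because string hashes in Python are randomized per process, and every client must agree on the mapping.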

The Key-value server consists of a fixed-size hash table + single-threaded handler + coarse locking

hash table

How to handle collisions? Mostly three ways to resolve:

Separate chaining: the collided bucket chains a list of entries with the same index, and you can always append the newly collided key-value pair to the list.
open addressing: if there is a collision, go to the next index until finding an available bucket.
dynamic resizing: resize the hash table and allocate more spaces; hence, collisions will happen less frequently.
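To make the first and third options concrete, here is a small sketch of separate chaining combined with dynamic resizing. The class name, the initial bucket count, and the load-factor threshold of 2 are arbitrary choices for illustration, not values taken from Memcached.

```python
class ChainedHashTable:
    """Hash table where each bucket holds a list of (key, value) pairs."""

    def __init__(self, num_buckets=8):
        self._buckets = [[] for _ in range(num_buckets)]
        self._size = 0

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite in place
                return
        bucket.append((key, value))  # collision: append to the chain
        self._size += 1
        if self._size > 2 * len(self._buckets):
            self._resize()

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default

    def _resize(self):
        """Double the bucket count and rehash, so chains stay short."""
        pairs = [pair for bucket in self._buckets for pair in bucket]
        self._buckets = [[] for _ in range(2 * len(self._buckets))]
        for key, value in pairs:
            self._bucket(key).append((key, value))

    def __len__(self):
        return self._size
```

Open addressing would replace the per-bucket lists with probing into neighboring slots of one flat array.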

How does the client determine which server to query?

See Data Partition and Routing

How to use cache?

See Key value cache

How to further optimize?

See How Facebook Scale its Social Graph Store? TAO

Lyft's Marketing Automation Platform Symphony

October 2, 2019 · 3 min read

Customer Acquisition Efficiency Issue: How can advertising campaigns achieve higher returns with less money and fewer people?

Specifically, Lyft's advertising campaigns need to address the following characteristics:

Manage location-based campaigns
Data-driven growth: growth must be scalable, measurable, and predictable
Support Lyft's unique growth model, as shown below:

lyft growth model

The main challenge is the difficulty of scaling management across various aspects of regional marketing, including ad bidding, budgeting, creative assets, incentives, audience selection, testing, and more. The following image depicts a day in the life of a marketer:

A Day in the Life of a Marketer

We can see that "execution" takes up most of the time, while less time is spent on the more important tasks of "analysis and decision-making." Scaling means reducing complex operations and allowing marketers to focus on analysis and decision-making.

Solution: Automation

To reduce costs and improve the efficiency of experimentation, it is necessary to:

Predict whether new users are interested in the product
Optimize across multiple channels and effectively evaluate and allocate budgets
Conveniently manage thousands of campaigns

Data is enhanced through Lyft's Amundsen system using reinforcement learning.

The automation components include:

Updating bid keywords
Disabling underperforming creative assets
Adjusting referral values based on market changes
Identifying high-value user segments
Sharing strategies across multiple campaigns

Architecture

Lyft Symphony Architecture

Technology stack: Apache Hive, Presto, ML platform, Airflow, 3rd-party APIs, UI.

Specific Component Modules

LTV Prediction Module

The lifetime value (LTV) of users is an important metric for evaluating channels, and the budget is determined by both LTV and the price we are willing to pay for customer acquisition in that region.

Our understanding of new users is limited, but as interactions increase, the historical data provided will more accurately predict outcomes.

Initial feature values:

Feature Values

As historical interaction records accumulate, the predictions become more accurate:

Predicting LTV Based on Historical Records

Budget Allocation Module

Once LTV is established, the next step is to set the budget based on pricing. A curve of the form LTV = a * (spend)^b is fitted, along with similar parameter curves in the surrounding range. Achieving a global optimum requires some randomness.

Budget Calculation

Delivery Module

This module is divided into two parts: the parameter tuner and the executor. The tuner sets specific parameters based on pricing for each channel, while the executor applies these parameters to the respective channels.

There are many popular delivery strategies that are common across various channels:

Delivery Strategies

Conclusion

It is essential to recognize the importance of human experience within the system; otherwise, it results in garbage in, garbage out. When people are liberated from tedious delivery tasks and can focus on understanding users, channels, and the messages they need to convey to their audience, they can achieve better campaign results—spending less time to achieve higher ROI.

How to write solid code?

September 25, 2019 · One min read

he likes it

empathy / perspective-taking is the most important.
1. realize that code is written for humans to read first and then for machines to execute.
2. software is so "soft" and there are many ways to achieve one thing. It's all about making the proper trade-offs to fulfill the requirements.
3. Invent and Simplify: Apple Pay RFID vs. Wechat Scan QR Code.
choose a sustainable architecture to reduce human resources costs per feature.

adopt patterns and best practices.
avoid anti-patterns
- missing error-handling
- callback hell = spaghetti code + unpredictable error handling
- over-long inheritance chain
- circular dependency
- over-complicated code
  - nested ternary operation
  - comment out unused code
- missing i18n, especially RTL issues
- repeating yourself (violating "don't repeat yourself")
  - simple copy-and-paste
  - unreasonable comments
effective refactoring
- semantic versioning
- never introduce breaking changes in non-major versions
  - two-legged change

Designing a metric system

August 26, 2019 · 17 min read

Requirements

Log vs. metric: A log is an event that happened, and a metric is a measurement of the health of a system.

We are assuming that this system’s purpose is to serve metrics - namely, counters, conversion rate, timers, etc. for monitoring the system performance and health. If the conversion rate drops drastically, the system should alert the on-call.

Monitoring business metrics like signup funnel’s conversion rate
Supporting various queries, like on different platforms (IE/Chrome/Safari, iOS/Android/Desktop, etc.)
data visualization
Scalability and Availability

Architecture

Two ways to build the system:

Push Model: Influx/Telegraf/Grafana
Pull Model: Prometheus/Grafana

The pull model is more scalable because it decreases the number of requests going into the metrics databases - there is no hot path and concurrency issue.

Architecture diagram, two panels. InfluxDB push model: server farm, Telegraf (write), InfluxDB, REST API, Grafana. Prometheus pull model: applications with client libraries, exporters for 3rd-party applications, Prometheus (pull; retrieval, service discovery, storage, PromQL), Alertmanager, PagerDuty, Web UI / Grafana / API clients.

Features and Components

Take a four-step sign up on the mobile app for example

INPUT_PHONE_NUMBER -> VERIFY_SMS_CODE -> INPUT_NAME -> INPUT_PASSWORD

Every step has IMPRESSION and POST_VERIFICATION phases and emits metrics like this:

{
  "sign_up_session_id": "uuid",
  "step": "VERIFY_SMS_CODE",
  "os": "iOS",
  "phase": "POST_VERIFICATION",
  "status": "SUCCESS",
  // ... ts, contexts, ...
}

Consequently, we can query the overall conversion rate of VERIFY_SMS_CODE step on iOS like

(counts of step=VERIFY_SMS_CODE, os=iOS, status: SUCCESS, phase: POST_VERIFICATION) / (counts of step=VERIFY_SMS_CODE, os=iOS, phase: IMPRESSION)
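The same ratio can be computed over a list of emitted events, as in the sketch below. The function name `conversion_rate` is hypothetical, and a real deployment would express this as a query against the metrics store rather than a loop over raw events.

```python
def conversion_rate(events, step, os_name):
    """Successful POST_VERIFICATION count divided by IMPRESSION count for one step and OS."""
    impressions = 0
    successes = 0
    for event in events:
        if event["step"] != step or event["os"] != os_name:
            continue
        if event["phase"] == "IMPRESSION":
            impressions += 1
        elif event["phase"] == "POST_VERIFICATION" and event.get("status") == "SUCCESS":
            successes += 1
    if impressions == 0:
        return None  # no traffic yet, so the rate is undefined
    return successes / impressions
```

Returning `None` instead of 0 for an empty window keeps an alert from firing just because nobody reached that step.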

Data Visualization

Grafana is mature enough for the data visualization work. If you do not want to expose the whole site, you can use Embed Panel with an iframe.

Designing Square Cash or PayPal Money Transfer System

August 23, 2019 · 24 min read

Clarifying Requirements

Design a money transfer backend system like Square Cash (we will call this system Cash App below) or PayPal to

Deposit from and payout to bank
Transfer between accounts
High scalability and availability
i18n: language, timezone, currency exchange
Deduplication for non-idempotent APIs and for at-least-once delivery.
Consistency across multiple data sources.

Architecture

Architecture diagram. Labels: presentation layer (SDK/Docs, mobile-dashboard, web-dashboard, dashboard-client, mobile-wallet, web-wallet, wallet-client, web-chrome-extension, help service portal); users (merchant user, end user, operators); api-gateway (monolithic); user profiles and AuthDB; payment with Payment DB (Aurora); risk control; event queue; task-queue; financial-reporter; side-effect maker; payment-gateway to banks / vendors; AWS CloudHSM.

Features and Components

Payment Service

The payment data model is essentially “double-entry bookkeeping”. Every entry to an account requires a corresponding and opposite entry to a different account. The sum of all debits and credits equals zero.

Deposit and Payout

Transaction: new user Jane Doe deposits $100 from bank to Cash App. This one transaction involves those DB entries:

bookkeeping table (for history)

+ debit, USD, 100, CashAppAccountNumber, txId
- credit, USD, 100, RoutingNumber:AccountNumber, txId

transaction table

txId, timestamp, status(pending/confirmed), [bookkeeping entries], narration

Once the bank confirms the transaction, update the pending status above and the following balance sheet in one transaction.

balance sheet

CashAppAccountNumber, USD, 100

Transfer between accounts within Cash App

Similar to the case above, but there is no pending state because we do not need the slow external system to change their state. All changes in bookkeeping table, transaction table, and balance sheet table happen in one transaction.
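A minimal in-memory sketch of that all-in-one-transaction update is shown below. The `Ledger` class, its method names, and the account names used with it are hypothetical; positive amounts stand for debits and negative amounts for credits, matching the +/- notation above, and validating before mutating stands in for a real database transaction.

```python
from collections import defaultdict
from decimal import Decimal


class Ledger:
    def __init__(self):
        self.bookkeeping = []                 # (tx_id, account, currency, signed amount)
        self.transactions = {}                # tx_id -> status
        self.balances = defaultdict(Decimal)  # (account, currency) -> balance

    def post(self, tx_id, currency, entries):
        """Apply balanced entries to all three tables, or raise without changing anything."""
        if tx_id in self.transactions:
            raise ValueError("duplicate transaction id")
        if sum(amount for _, amount in entries) != 0:
            raise ValueError("debits and credits must sum to zero")
        for account, amount in entries:
            self.bookkeeping.append((tx_id, account, currency, amount))
            self.balances[(account, currency)] += amount
        self.transactions[tx_id] = "confirmed"

    def transfer(self, tx_id, currency, sender, receiver, amount):
        """Internal transfer: no pending state because no external system is involved."""
        amount = Decimal(amount)
        if self.balances[(sender, currency)] < amount:
            raise ValueError("insufficient funds")
        self.post(tx_id, currency, [(receiver, amount), (sender, -amount)])
```

`Decimal` is used instead of floats so that amounts such as 0.10 are represented exactly.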

i18n

We solve the i18n problems in 3 dimensions.

Language: All texts like copywriting, push notifications, emails are picked up according to the accept-language header.
Timezones: All server timezones are in UTC. We transform timestamps to the local timezone on the client side.
Currency: All user transfer transactions must be in the same currency. If users want to move money across currencies, they have to exchange the currency first, at a rate that is favorable to the Cash App.

For example, Jane Doe wants to exchange 1 USD with 6.8 CNY with 0.2

bookkeeping table

- credit, USD, 1, CashAppAccountNumber, txId
+ debit, CNY, 6.8, CashAppAccountNumber, txId, @7.55 CNY/USD
+ debit, USD, 0.1, ExpensesOfExchangeAccountNumber, txId

Transaction table, balance sheet, etc. are similar to the transaction discussed in Deposit and Payout. The major difference is that the bank or the vendor provides the exchange service.

How to sync across the transaction table and external banks and vendors?

retry with idempotency to improve the success rate of the external calls and ensure no duplicate orders.
two ways to check if the PENDING orders are filled or failed.
1. poll: cronjobs (SWF, Airflow, Cadence, etc.) to poll the status for PENDING orders.
2. callback: provide a callback API for the external vendors.
Graceful shutdown. The bank gateway calls may take tens of seconds to finish, and restarting the servers may resume unfinished transactions from the database. The process may create too many connections. To reduce connections, before the shutdown, stop accepting new requests and wait for the existing outgoing ones to wrap up.

Deduplication

Why is Deduplication a concern?

not all endpoints are idempotent
Event queue may be at-least-once.

not all endpoints are idempotent: what if the external system is not idempotent?

For the poll case above, if the external gateway does not support idempotent APIs, in order not to flood with duplicate entries, we must keep record of the order ID or the reference ID the external system gives us with 200, and query GET by the order ID instead of POST all the time.

For the callback case, we can ensure we implement with idempotent APIs, and we mutate pending to confirmed anyway.

Event queue may be at-least-once

For the event queue, we can use Kafka with exactly-once semantics, where the producer throughput declines by only 3%.
In the database layer, we can use idempotency key or deduplication key.
In the service layer, we can use Redis key-value store.
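The service-layer idea can be sketched as a consumer that remembers which idempotency keys it has already handled. The class name and the `idempotency_key` field are assumptions for this example, and the plain dict stands in for Redis or a database table with a unique constraint; unlike those, it is neither shared across processes nor safe under concurrent access.

```python
class DedupingConsumer:
    """Runs the handler at most once per idempotency key, even if an event is redelivered."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # stand-in for Redis or a DB table with a unique key

    def handle(self, event):
        key = event["idempotency_key"]
        if key in self._results:
            # Redelivered event: return the stored result and skip the side effect.
            return self._results[key]
        result = self._handler(event)
        self._results[key] = result
        return result
```

With a shared store, the check and the write need to be one atomic step (for example, a conditional insert), otherwise two workers can both pass the check at the same time.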

Availability and Scalability

Overall failover strategies: Improving availability with failover: Cold Standby, Hot Standby, Warm Standby, Active-active.
Service layer scaling: AKF Scale Cube
Data layer scaling: CQRS Pattern
Needing a speed layer? Lambda Architecture

Designing payment webhook

August 19, 2019 · 4 min read

1. Clarifying Requirements

Webhook will call the merchant back once the payment succeeds.
1. Merchant developer registers webhook information with us.
2. Make a POST HTTP request to the webhooks reliably and securely.
High availability, error-handling, and failure-resilience.
1. Async design. Assuming that the servers of merchants are located across the world, and may have a very high latency like 15s.
2. At-least-once delivery. Idempotency key.
3. Order does not matter.
4. Robust & predictable retry and short-circuit.
Security, observability & scalability
1. Anti-spoofing.
2. Notify the merchant when their receivers are broken.
3. easy to extend and scale.

2. Sketch out the high-level design

async design + retry + queuing + observability + security

3. Features and Components

Core Features

Users go to the dashboard frontend to register webhook information with us, such as the URL to call and the scope of events they want to subscribe to, and then get an API key from us.
When there is a new event, publish it into the queue, where it is consumed by callers. Callers get the registration and make the HTTP call to external services.

Webhook callers

Subscribe to the event queue for payment success events published by a payment state machine or other services.
Once callers accept an event, fetch webhook URI, secret, and settings from the user settings service. Prepare the request based on those settings. For security...

All webhooks from user settings must be in HTTPS
If the payload is huge, the expected latency is high, and we want to make sure the target receiver is alive, we can verify its existence with a ping carrying a challenge. For example, Dropbox verifies webhook endpoints by sending a GET request with a “challenge” param (a random string) encoded in the URL, which your endpoint is required to echo back as a response.
All callback requests are with header x-webhook-signature. So that the receiver can authenticate the request.
- For a symmetric signature, we can use an HMAC/SHA256 signature. Its value is HMAC(webhook secret, raw request payload). Telegram takes this approach, and so does Stripe.
- For an asymmetric signature, we can use an RSA/SHA256 signature. Its value is RSA(webhook private key, raw request payload).
- If it's sensitive information, we can also consider encryption for the payload instead of just signing.
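The symmetric variant can be sketched with Python's standard `hmac` module. The function names are made up for this example, and hex encoding of the digest is an assumption; the sender would put the result in the x-webhook-signature header and the receiver would recompute it over the raw request body.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, raw_body: bytes) -> str:
    """Sender side: HMAC-SHA256 over the exact bytes of the request body, hex-encoded."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, raw_body: bytes, signature_header: str) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    expected = sign_payload(secret, raw_body)
    return hmac.compare_digest(expected, signature_header)
```

The receiver should verify against the raw bytes it received, before any JSON parsing, because re-serializing the payload can change whitespace or key order and break the signature.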

Make an HTTP POST request to the external merchant's endpoints with event payload and security headers.

API Definition

// POST https://example.com/webhook/
{
  "id": 1,
  "scheduled_for": "2017-01-31T20:50:02Z",
  "event": {
    "id": "24934862-d980-46cb-9402-43c81b0cdba6",
    "resource": "event",
    "type": "charge:created",
    "api_version": "2018-03-22",
    "created_at": "2017-01-31T20:49:02Z",
    "data": {
      "code": "66BEOV2A", // or order ID the user need to fulfill
      "name": "The Sovereign Individual",
      "description": "Mastering the Transition to the Information Age",
      "hosted_url": "https://commerce.coinbase.com/charges/66BEOV2A",
      "created_at": "2017-01-31T20:49:02Z",
      "expires_at": "2017-01-31T21:49:02Z",
      "metadata": {},
      "pricing_type": "CNY",
      "payments": [
        // ...
      ],
      "addresses": {
        // ...
      }
    }
  }
}

The merchant server should respond with a 200 HTTP status code to acknowledge receipt of a webhook.

Error-handling

If there is no acknowledgment of receipt, we will retry with an idempotency key and exponential backoff for up to three days. The maximum retry interval is 1 hour. If retries reach a certain limit, short-circuit (mark the receiver as broken) and send an email to the merchant.
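The retry policy above can be turned into a concrete schedule, as in this sketch. The 60-second base delay and the doubling factor are assumptions; only the 1-hour cap and the three-day window come from the text. Jitter, which is commonly added so that many failing webhooks do not retry in lockstep, is left out to keep the schedule deterministic.

```python
def retry_delays(base_seconds=60, max_interval=3600, max_total=3 * 24 * 3600):
    """Delays between attempts: doubling from base_seconds, capped per delay and in total."""
    delays = []
    elapsed = 0
    attempt = 0
    while True:
        delay = min(base_seconds * 2 ** attempt, max_interval)
        if elapsed + delay > max_total:
            break  # the next retry would fall outside the retry window
        delays.append(delay)
        elapsed += delay
        attempt += 1
    return delays


schedule = retry_delays()
```

Once the schedule is exhausted, the caller would mark the endpoint as broken and notify the merchant, as described above.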

Metrics

The Webhook callers service emits statuses into the time-series DB for metrics.

Using Statsd + Influx DB vs. Prometheus?

InfluxDB: Application pushes data to InfluxDB. It has a monolithic DB for metrics and indices.
Prometheus: Prometheus server pulls the metrics values from the running application periodically. It uses LevelDB for indices, but each metric is stored in its own file.

Or use the expensive DataDog or other APM services if you have a generous budget.

Designing Smart Notification of Stock Price Changes

August 13, 2019 · 18 min read

Requirements

3 million users
5000 stocks + 250 global stocks
a user gets notified about the price change when
1. subscribing to the stock
2. the stock has 5% or 10% changes
3. since a) the last week or b) the last day
extensibility. may support other kinds of notifications like breaking news, earnings call, etc.

Sketching out the Architecture

Contexts:

What is clearing? Clearing is the procedure by which financial trades settle – that is, the correct and timely transfer of funds to the seller and securities to the buyer. Often with clearing, a specialized organization acts as an intermediary known as a clearinghouse.
What is a stock exchange? A facility where stock brokers and traders can buy and sell securities.

Architecture diagram. Labels: Robinhood App; API gateway; reverse proxy; external vendors (market prices); price ticker (tick every 5 mins, batch write); time-series DB (Influx or Prometheus); price watcher (periodical read); cronjob; throttler cache; user settings; notification queue; notifier; Apple Push Notification service (APNs); Google Firebase Cloud Messaging (FCM); email services (AWS SES / SendGrid / etc.).

What are those components and how do they interact with each other?

Price ticker
- data fetching policies
  - option 1 preliminary: fetches data every 5 mins and flush into the time-series database in batches.
  - option 2 advanced: nowadays external systems usually push data directly so that we do not have to pull all the time.
- ~6000 points per request or per price change.
- data retention of 1 week, because this is just the speed layer of the lambda architecture.
Price watcher
- read the data ranging from last week or last 24 hours for each stock.
- calculate if the fluctuation exceeds 5% or 10% in those two time spans. we get tuples like (stock, up 5%, 1 week).
  - corner case: should we normalize the price data? for example, some abnormal price like someone sold UBER mistakenly for $1 USD.
- ratelimit (because 5% or 10% delta may occur many times within one day), and then emit an event PRICE_CHANGE(STOCK_CODE, timeSpan, percentage) to the notification queue.
Periodical triggers are cron jobs, e.g. Airflow, Cadence.
notification queue
- may not necessarily be introduced in the first place when users and stocks are small.
- may accept generic messaging event, like PRICE_CHANGE, EARNINGS_CALL, BREAKING_NEWS, etc.
Notifier
- subscribe to the notification queue to get the event
- and then fetch who to notify from the user settings service
- finally based on user settings, send out messages through APNs, FCM or AWS SES.
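The price watcher's threshold check and rate limit could look like the sketch below. The class name, the one-day cooldown, and the event dictionary shape are assumptions for illustration; the dict `_last_sent` plays the role of the throttler cache, and timestamps are plain seconds.

```python
def percent_change(old_price, new_price):
    return (new_price - old_price) / old_price * 100.0


class PriceWatcher:
    THRESHOLDS = (10, 5)  # check the larger threshold first

    def __init__(self, cooldown_seconds=24 * 3600):
        self._cooldown = cooldown_seconds
        self._last_sent = {}  # throttler cache: (stock, span, threshold, direction) -> time

    def check(self, stock, time_span, old_price, new_price, now):
        """Return a PRICE_CHANGE event, or None if below threshold or rate-limited."""
        change = percent_change(old_price, new_price)
        for threshold in self.THRESHOLDS:
            if abs(change) < threshold:
                continue
            direction = "up" if change > 0 else "down"
            key = (stock, time_span, threshold, direction)
            last = self._last_sent.get(key)
            if last is not None and now - last < self._cooldown:
                return None  # already notified recently for this bucket
            self._last_sent[key] = now
            return {
                "type": "PRICE_CHANGE",
                "stock": stock,
                "time_span": time_span,
                "percentage": threshold if direction == "up" else -threshold,
            }
        return None
```

The returned event is what would be pushed onto the notification queue; the notifier then looks up the subscribers for that stock.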

Designing Stock Exchange

August 12, 2019 · 20 min read

Requirements

order-matching system for buy and sell orders. Types of orders:
- Market Orders
- Limit Orders
- Stop-Loss Orders
- Fill-or-Kill Orders
- Duration of Orders
high availability and low latency for millions of users
- async design - use messaging queues extensively (btw. side-effect: engineers working on one service publish to a queue without even knowing exactly where the downstream service is, and hence cannot do evil.)

Architecture

Architecture diagram. Labels: reverse proxy; API gateway; auth; user store; orders; order matching; stock meta; cache; external pricing; settle; clearing house; balances & bookkeeping; payment; bank, ACH, Visa, etc.; audit & report.

Components and how they interact with each other

order matching system

shard by stock code
order's basic data model (other metadata are omitted): Order(id, stock, side, time, qty, price)
the core abstraction of the order book is the matching algorithm. there are a bunch of matching algorithms(ref to stackoverflow, ref to medium)
example 1: price-time FIFO - a kind of 2D vector cast or flattened into a 1D vector
- x-axis is price
- y-axis is orders. Price/time priority queue, FIFO.
  - Buy-side: ascending in price, descending in time.
  - Sell-side: ascending in price, ascending in time.
- in other words
  - Buy-side: the higher the price and the earlier the order, the nearer we should put it to the center of the matching.
  - Sell-side: the lower the price and the earlier the order, the nearer we should put it to the center of the matching.

Figure: the x-axis is the line of prices, with the y-axis cast into the x-axis.

Id   Side    Time   Qty   Price   Qty    Time   Side  
---+------+-------+-----+-------+-----+-------+------
#3                        20.30   200   09:05   SELL  
#1                        20.30   100   09:01   SELL  
#2                        20.25   100   09:03   SELL  
#5   BUY    09:08   200   20.20                       
#4   BUY    09:06   100   20.15                       
#6   BUY    09:09   200   20.15                       

Order book from Coinbase Pro

The Single Stock-Exchange Simulator

example 2: pro-rata

pure pro-rata

How to implement the price-time FIFO matching algorithm?

shard by stock, CP over AP: one stock one partition
stateful in-memory tree-map
- periodically iterate the treemap to match orders
data persistence with cassandra
in/out requests of the order matching services are made through messaging queues
failover
- the in-memory tree-maps are snapshotted into the database
- in an error case, recover from the snapshot and de-duplicate with cache
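A compact sketch of price-time FIFO matching for limit orders follows. It departs from the outline above in two labeled ways: it uses two heaps from the standard library instead of a tree-map (Python has no built-in sorted map), and it matches each order on arrival rather than by periodically iterating the book. The class name is hypothetical, prices are integers in cents to avoid float rounding, and trades execute at the resting order's price.

```python
import heapq
import itertools


class OrderBook:
    def __init__(self):
        self._bids = []  # (-price, seq, order_id, qty): highest price, then earliest, on top
        self._asks = []  # (price, seq, order_id, qty): lowest price, then earliest, on top
        self._seq = itertools.count()  # arrival order, used for time priority

    def submit(self, order_id, side, qty, price):
        """Match a limit order against the book; return trades as (buy_id, sell_id, qty, price)."""
        trades = []
        if side == "BUY":
            while qty > 0 and self._asks and self._asks[0][0] <= price:
                ask_price, seq, ask_id, ask_qty = heapq.heappop(self._asks)
                fill = min(qty, ask_qty)
                trades.append((order_id, ask_id, fill, ask_price))
                qty -= fill
                if ask_qty > fill:  # keep the remainder with its original time priority
                    heapq.heappush(self._asks, (ask_price, seq, ask_id, ask_qty - fill))
            if qty > 0:
                heapq.heappush(self._bids, (-price, next(self._seq), order_id, qty))
        elif side == "SELL":
            while qty > 0 and self._bids and -self._bids[0][0] >= price:
                neg_price, seq, bid_id, bid_qty = heapq.heappop(self._bids)
                fill = min(qty, bid_qty)
                trades.append((bid_id, order_id, fill, -neg_price))
                qty -= fill
                if bid_qty > fill:
                    heapq.heappush(self._bids, (neg_price, seq, bid_id, bid_qty - fill))
            if qty > 0:
                heapq.heappush(self._asks, (price, next(self._seq), order_id, qty))
        else:
            raise ValueError("side must be BUY or SELL")
        return trades
```

Since each stock lives in one partition, a single-threaded book like this needs no locking; snapshots of the two heaps are what the failover step would persist.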

How to transmit data of the order book to the client-side in realtime?

websocket

How to support different kinds of orders?

same SELL or BUY: qty @ price in the treemap with different creation setup and matching conditions
- Market Orders: place the order at the last market price.
- Limit Orders: place the order at a specific price.
- Stop-Loss Orders: place the order at a specific price, and match it in certain conditions.
- Fill-or-Kill Orders: place the order at a specific price, but match it only once.
- Duration of Orders: place the order at a specific price, but match it only in the given time span.

Orders Service

Preserves all active orders and order history.
Writes to order matching when it receives a new order.
Receives matched orders and settles with the external clearing house (async external gateway call + cronjob to sync DB)


Introduction to Architecture

June 11, 2019 · 3 min read

What is Architecture?

Architecture is the shape of a software system. To illustrate with a building:

Paradigm is the bricks.
Design principles are the rooms.
Components are the structure.

They all serve a specific purpose, just like hospitals treat patients and schools educate students.

Why Do We Need Architecture?

Behavior vs. Structure

Every software system provides two distinct values to stakeholders: behavior and structure. Software developers must ensure that both values are high.

==Due to the nature of their work, software architects focus more on the structure of the system rather than its features and functions.==

Ultimate Goal — ==Reduce the human resource costs required for adding new features==

Architecture serves the entire lifecycle of software systems, making them easy to understand, develop, test, deploy, and operate. Its goal is to minimize the human resource costs for each business use case.

O'Reilly's "Software Architecture Patterns" provides a great introduction to these five fundamental architectures.

1. Layered Architecture

Layered architecture is widely adopted and well-known among developers. Therefore, it is the de facto standard at the application level. If you are unsure which architecture to use, layered architecture is a good choice.

Examples:

TCP/IP model: Application Layer > Transport Layer > Internet Layer > Network Interface Layer
Facebook TAO: Network Layer > Cache Layer (follower + leader) > Database Layer

Pros and Cons:

Pros
- Easy to use
- Clear responsibilities
- Testability
Cons
- Large and rigid
  - Adjusting, extending, or updating the architecture requires changes across all layers, which can be quite tricky.

2. Event-Driven Architecture

Any change in state triggers an event in the system. Communication between system components is accomplished through events.

A simplified architecture includes a mediator, event queue, and channels. The diagram below illustrates a simplified event-driven architecture:

Examples:

QT: Signals and Slots
Payment infrastructure: As bank gateways often have high latency, asynchronous techniques are used in banking architecture.

3. Microkernel Architecture (aka Plug-in Architecture)

The functionality of the software is distributed between a core and multiple plugins. The core contains only the most basic functionalities. Each plugin operates independently and implements shared interfaces to achieve different goals.

Examples:

Visual Studio Code and Eclipse
MINIX operating system

4. Microservices Architecture

Large systems are decomposed into numerous microservices, each a separately deployable unit that communicates via RPCs.

uber architecture

Examples:

Uber: See designing Uber
Smartly

5. Space-Based Architecture

The name "Space-Based Architecture" comes from "tuple space," which implies a "distributed shared space." In space-based architecture, there are no databases or synchronized database access, thus avoiding database bottleneck issues. All processing units share copies of application data in memory. These processing units can be flexibly started and stopped.

Example: See Wikipedia

Primarily adopted by Java-based architectures: for example, JavaSpaces.

Introduction to Architecture

May 11, 2019 · 3 min read

What is architecture?

Architecture is the shape of the software system. Think of it as the big picture of a physical building.

Paradigms are bricks.
Design principles are rooms.
Components are buildings.

Together they serve a specific purpose, just as a hospital exists to cure patients and a school exists to educate students.

Why do we need architecture?

Behavior vs. Structure

Every software system provides two different values to the stakeholders: behavior and structure. Software developers are responsible for ensuring that both those values remain high.

==Software architects are, by virtue of their job description, more focused on the structure of the system than on its features and functions.==

Ultimate Goal - ==saving human resources costs per feature==

Architecture serves the full lifecycle of the software system to make it easy to understand, develop, test, deploy, and operate. The goal is to minimize the human resources costs per business use-case.

The O’Reilly book Software Architecture Patterns by Mark Richards is a simple but effective introduction to these five fundamental architectures.

1. Layered Architecture

The layered architecture is the most widely adopted pattern and is well known among developers, which makes it the de facto standard for applications. If you do not know which architecture to use, start with this one.

Examples

TCP / IP Model: Application layer > transport layer > internet layer > network access layer
Facebook TAO: web layer > cache layer (follower + leader) > database layer

Pros and Cons

Pros
- ease of use
- separation of responsibility
- testability
Cons
- monolithic
  - hard to adjust, extend or update. You have to make changes to all the layers.
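The trade-offs above can be seen in a minimal sketch. It assumes a hypothetical three-layer user lookup; the class names `UserRepository`, `UserService`, and `UserController` are illustrative and do not come from any system mentioned here. Each layer talks only to the layer directly below it.

```python
class UserRepository:
    """Data layer: owns storage; an in-memory dict stands in for a database."""

    def __init__(self):
        self._rows = {1: {"id": 1, "name": "alice"}}

    def find(self, user_id):
        return self._rows.get(user_id)


class UserService:
    """Business layer: applies rules and talks only to the repository."""

    def __init__(self, repo):
        self._repo = repo

    def display_name(self, user_id):
        row = self._repo.find(user_id)
        if row is None:
            raise KeyError(f"unknown user {user_id}")
        return row["name"].title()


class UserController:
    """Presentation layer: turns service results into responses."""

    def __init__(self, service):
        self._service = service

    def get(self, user_id):
        try:
            return {"status": 200, "body": self._service.display_name(user_id)}
        except KeyError:
            return {"status": 404, "body": "not found"}


controller = UserController(UserService(UserRepository()))
print(controller.get(1))
print(controller.get(2))
```

Because each layer receives the one below it through its constructor, a test can pass in a fake repository. The rigidity shows up as well: exposing one more user field would likely mean touching the repository, the service, and the controller.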

2. Event-Driven Architecture

A state change will emit an event to the system. All the components communicate with each other through events.

A simple project can combine the mediator, event queue, and channel. Then we get a simplified architecture:

Examples

Qt: Signals and Slots
Payment Infrastructure: Bank gateways usually have very high latencies, so they adopt async technologies in their architecture design.
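As a rough illustration of how a mediator, an event queue, and channels fit together, here is a single-process sketch. The `EventBus` class and the `order_paid` / `order_shipped` event names are hypothetical, and a production system would more likely use a message broker than an in-memory queue.

```python
from collections import defaultdict, deque


class EventBus:
    """Mediator that owns an event queue and routes events to channels."""

    def __init__(self):
        self._queue = deque()
        self._channels = defaultdict(list)  # event name -> list of handlers

    def subscribe(self, name, handler):
        self._channels[name].append(handler)

    def emit(self, name, payload):
        # Emitting only enqueues; nothing runs until the bus is drained.
        self._queue.append((name, payload))

    def run(self):
        # Drain the queue; handlers may emit follow-up events while we loop.
        while self._queue:
            name, payload = self._queue.popleft()
            for handler in self._channels[name]:
                handler(payload)


bus = EventBus()
log = []


def on_order_paid(order):
    log.append(f"paid:{order['id']}")
    bus.emit("order_shipped", order)  # a state change emits a new event


bus.subscribe("order_paid", on_order_paid)
bus.subscribe("order_shipped", lambda order: log.append(f"shipped:{order['id']}"))

bus.emit("order_paid", {"id": 7})
bus.run()
print(log)
```

The payment handler never calls the shipping handler directly. It only emits an event, so new subscribers can be added without changing existing ones.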

3. Micro-kernel Architecture (aka Plug-in Architecture)

The software's responsibilities are divided into one "core" and multiple "plugins". The core contains the bare minimum functionality. Plugins are independent of each other and implement shared interfaces to achieve different goals.

Examples

Visual Studio Code, Eclipse
MINIX operating system
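The core/plugin split can be sketched in a few lines. This is an assumption-laden toy, not how Visual Studio Code, Eclipse, or MINIX load extensions: `Core`, `Plugin`, `UpperCase`, and `Reverse` are hypothetical names, and the shared interface is a single `run` method.

```python
from typing import Protocol


class Plugin(Protocol):
    """Shared interface that every plugin implements."""

    name: str

    def run(self, text: str) -> str: ...


class Core:
    """Bare-minimum core: it only registers plugins and dispatches to them."""

    def __init__(self):
        self._plugins = {}

    def register(self, plugin):
        self._plugins[plugin.name] = plugin

    def execute(self, name, text):
        if name not in self._plugins:
            raise LookupError(f"no plugin named {name!r}")
        return self._plugins[name].run(text)


class UpperCase:
    name = "upper"

    def run(self, text):
        return text.upper()


class Reverse:
    name = "reverse"

    def run(self, text):
        return text[::-1]


core = Core()
core.register(UpperCase())
core.register(Reverse())
print(core.execute("upper", "hello"))
print(core.execute("reverse", "hello"))
```

Neither plugin knows about the other, and the core knows only the interface, so a plugin can be added or removed without touching the rest.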

4. Microservices Architecture

A massive system is decomposed into multiple microservices, each of which is a separately deployed unit, and they communicate with each other via RPCs.

uber architecture

Examples

Uber: See designing Uber
Smartly
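To make "separately deployed units that talk over RPC" concrete, the sketch below runs two toy services on separate local ports using Python's standard-library XML-RPC modules. The service names, the `quote` and `create_trip` functions, and the fare formula are all made up for illustration. Real deployments would run each service in its own process or container and would typically use a different RPC framework.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def start_service(functions):
    """Run one toy service on its own free local port; return (server, url)."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    for name, fn in functions.items():
        server.register_function(fn, name)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    return server, f"http://{host}:{port}"


# Pricing service: knows nothing about trips. The formula is illustrative.
def quote(distance_km):
    return round(2.0 + 1.5 * distance_km, 2)


pricing_server, pricing_url = start_service({"quote": quote})


# Trip service: calls the pricing service over RPC instead of importing it.
def create_trip(rider, distance_km):
    fare = ServerProxy(pricing_url).quote(distance_km)
    return {"rider": rider, "fare": fare}


trip_server, trip_url = start_service({"create_trip": create_trip})

trip = ServerProxy(trip_url).create_trip("alice", 4.0)
print(trip)

# Stop both services and release their sockets.
for srv in (pricing_server, trip_server):
    srv.shutdown()
    srv.server_close()
```

The trip service depends only on the pricing service's URL and method name, so either side could be redeployed or rewritten independently as long as that contract holds.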

5. Space-based Architecture

This pattern gets its name from "tuple space", which means "distributed shared memory". There is no database or synchronous database access, and thus no database bottleneck. All the processing units share the replicated application data in memory. These processing units can be started up and shut down elastically.

Examples: See Wikipedia

Mostly adopted among Java users: e.g., JavaSpaces
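The tuple-space idea can be sketched in a single process. This is not the JavaSpaces API: the `TupleSpace` class and its `write` / `read` / `take` methods are hypothetical, loosely modelled on Linda-style operations, and threads stand in for processing units. A real space-based system would replicate the space across nodes.

```python
import threading


class TupleSpace:
    """In-memory shared space with Linda-style write / read / take."""

    def __init__(self):
        self._tuples = []
        self._lock = threading.Lock()

    def write(self, item):
        with self._lock:
            self._tuples.append(item)

    @staticmethod
    def _match(pattern, item):
        # None in a pattern acts as a wildcard for that field.
        return len(pattern) == len(item) and all(
            p is None or p == v for p, v in zip(pattern, item)
        )

    def read(self, pattern):
        """Return a matching tuple without removing it, or None."""
        with self._lock:
            return next((t for t in self._tuples if self._match(pattern, t)), None)

    def take(self, pattern):
        """Remove and return a matching tuple, or None."""
        with self._lock:
            for i, t in enumerate(self._tuples):
                if self._match(pattern, t):
                    return self._tuples.pop(i)
            return None


space = TupleSpace()
for n in range(5):
    space.write(("task", n))


def worker():
    # Processing unit: claims tasks until none are left, then exits.
    while (task := space.take(("task", None))) is not None:
        space.write(("result", task[1], task[1] ** 2))


units = [threading.Thread(target=worker) for _ in range(3)]
for unit in units:
    unit.start()
for unit in units:
    unit.join()

results = sorted(space.take(("result", None, None)) for _ in range(5))
print(results)
```

The workers coordinate only through the shared space, so the number of processing units can change without any of them knowing about the others.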
