
Designing Human-Centric Internationalization (i18n) Engineering Solutions

· 9 min read

Requirement Analysis

If you ask what the biggest difference is between Silicon Valley companies and those in China, the answer is likely as Wu Jun said: Silicon Valley companies primarily target the global market. As Chen Zhiwu aptly stated, the ability to create wealth can be measured by three dimensions: depth, which refers to productivity—the ability to provide better products or services in the same amount of time; length, which refers to the ability to leverage finance to exchange value across time and space; and breadth, which refers to market size—the ability to create markets or new industries that transcend geographical boundaries. Internationalization, which is the localization of products and services in terms of language and culture, is indeed a strategic key for multinational companies competing in the global market.

Internationalization, abbreviated as i18n (with 18 letters between the 'i' and the 'n'), aims to solve the following issues in the development of websites and mobile apps:

  1. Language
  2. Time and Time Zones
  3. Numbers and Currency

Framework Design

Language

Logic and Details

The essence of language is as a medium for delivering messages to the audience; different languages serve as different media, each targeting a different audience. For example, to display the message "Hello, Xiaoli!" to the user, the process involves looking up the message in the language table by the user's locale, filling in the required interpolations, such as the name, and displaying the resulting message:

| Message Code | Locale | Translation         |
| ------------ | ------ | ------------------- |
| home.hello   | en     | Hello, ${username}! |
| home.hello   | zh-CN  | 你好, ${username}!  |
| home.hello   | iw     | שלום, ${username}!  |

Different languages may have slight variations in details, such as the singular and plural forms of an item, or the distinction between male and female in third-person references.

These are issues that simple table lookups cannot handle, requiring more complex logic. In code, you can use conditional statements to handle these exceptions. Additionally, some internationalization frameworks invent domain-specific languages (DSLs) to address these situations specifically. For example, Project Fluent:
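Here is a minimal sketch using Fluent's JavaScript runtime (@fluent/bundle); the message name and arguments are illustrative. Note how the plural rule lives in the FTL message itself rather than in application code:

import { FluentBundle, FluentResource } from "@fluent/bundle";

const resource = new FluentResource(`
shared-photos =
    { $userName } added { $photoCount ->
        [one] a new photo
       *[other] { $photoCount } new photos
    }.
`);

// useIsolating: false skips the invisible bidi isolation marks for readability
const bundle = new FluentBundle("en", { useIsolating: false });
bundle.addResource(resource);

const message = bundle.getMessage("shared-photos");
console.log(bundle.formatPattern(message.value, { userName: "Anne", photoCount: 3 }));
// => "Anne added 3 new photos."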

Another issue that beginners often overlook is the direction of writing. Common languages like Chinese and English are written from left to right, while some languages, such as Hebrew and Arabic, are written from right to left.

The difference in writing direction affects not only the text itself but also the input method. A Chinese person would find it very strange to input text from right to left; conversely, a Jewish colleague of mine finds it easy to mix English and Hebrew input.

Layout is another consideration. The entire UI layout and visual elements, such as the direction of arrows, may change based on the language's direction. Your HTML needs to set the appropriate dir attribute.
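For example, an Arabic page can declare both the language and the direction on the root element, and text and layout then flow right to left by default:

<html lang="ar" dir="rtl">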

How to Determine the User's Locale?

You may wonder how we know the user's current language settings. In the case of a browser, when a user requests a webpage, there is a header called Accept-Language that indicates the accepted languages. These settings come from the user's system language and browser settings. In mobile apps, there is usually an API to retrieve the locale variable or constant. Another method is to determine the user's location based on their IP or GPS information and then display the corresponding language. For multinational companies, users often indicate their language preferences and geographical regions during registration.
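For example, a browser configured with Simplified Chinese first and English as a fallback sends something like:

Accept-Language: zh-CN,zh;q=0.9,en;q=0.8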

If a user wants to change the language, websites have various approaches, while mobile apps tend to have more fixed APIs. Here are some methods for websites:

  1. Set a locale cookie
  2. Use different subdomains (e.g., de.example.com)
  3. Use a dedicated domain (e.g., example.de). Pinterest has an article discussing how they utilize localized domains; research shows that local domain suffixes lead to higher click-through rates.
  4. Use different paths (e.g., example.com/de/)
  5. Use query parameters (e.g., ?locale=de). While this method is feasible, it is not SEO-friendly.

Beginners often forget to mark the lang attribute in HTML when creating websites.

Translation Management Systems

Once you have carefully implemented the display of text languages, you will find that establishing and managing a translation library is also a cumbersome process.

Typically, developers do not have expertise in multiple languages. At this point, external translators or pre-existing translation libraries need to be introduced. The challenge here is that translators are often not technical personnel. Allowing them to directly modify code or communicate directly with developers can significantly increase translation costs. Therefore, in Silicon Valley companies, translation management systems (TMS) designed for translators are often managed by a dedicated team or involve purchasing existing solutions, such as the closed-source paid service lokalise.co or the open-source Mozilla Pontoon. A TMS can uniformly manage translation libraries, projects, reviews, and task assignments.

This way, the development process becomes: first, designers identify areas that need attention based on different languages and cultural habits during the design phase. For example, a button that is short in English may be very long in Russian, so care must be taken to avoid overflow. Then, the development team implements specific code logic based on the design requirements and provides message codes, contextual background, and examples written in a language familiar to developers in the translation management system. Subsequently, the translation team fills in translations for various languages in the management system. Finally, the development team pulls the translation library back into the codebase and releases it into the product.

Contextual background is an easily overlooked and challenging aspect. Where in the UI is the message that needs translation? What is its purpose? If the message is too short, further explanation may be needed. With this background knowledge, translators can provide more accurate translations in other languages. If translators cannot fully understand the intended message, they need a feedback channel to reach out to product designers and developers for clarification.

Given the multitude of languages and texts, it is rare for a single translator to handle everything; it typically requires a team of individuals with language expertise from various countries to contribute to the translation library. The entire process is time-consuming and labor-intensive, which is why translation teams are often established, such as outsourcing to Smartling.

Now that we have the code logic and translation library, the next question is: how do we integrate the content of the translation library into the product?

There are many different implementation methods; the most straightforward is a static approach where, each time an update occurs, a diff is submitted and merged into the code. This way, relevant translation materials are already included in the code during the build process.

Another approach is dynamic integration. On one hand, you can "pull" content from a remote translation library, which may lead to performance issues during high website traffic. However, the advantage is that translations are always up-to-date. On the other hand, for optimization, a "push" method can be employed, where any new changes in the translation library trigger a webhook to push the content to the server.

In my view, maintaining translations is more cumbersome than adding them. I have seen large projects become chaotic because old translations were not promptly removed after updates, leading to an unwieldy translation library. A good tool that ensures data consistency would greatly assist in maintaining clean code.

Alibaba's Kiwi internationalization solution has implemented a linter and VS Code plugin to help you check and extract translations from the code.

Time and Time Zones

Having discussed language, the next topic is time and time zones. As a global company, much of the data comes from around the world and is displayed to users globally. For example, how do international flights ensure that start and end times are consistent globally and displayed appropriately across different time zones? This is crucial. The same situation applies to all time-related events, such as booking hotels, reserving restaurants, and scheduling meetings.

First, there are several typical representations of time:

  1. Natural language, such as 07:23:01 AM, Monday, October 28, 2019 CST
  2. Unix timestamp (int), such as 1572218668
  3. Datetime. Note that MySQL converts TIMESTAMP values to UTC for storage (based on the session time zone) and back on retrieval, whereas DATETIME is stored as-is with no time zone information. Since servers generally run in UTC, such values are conventionally treated as UTC.
  4. ISO date, such as 2019-10-27T23:24:28+00:00, which includes time zone information.

I have no strong preference for these formats; if you have relevant experience, feel free to discuss it.

When displaying time, two types of conversion may occur: converting the stored server time zone to the user's local time zone, and converting machine-readable time into natural language. A popular approach for the latter is to use powerful date-and-time libraries, such as moment.js and dayjs.
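As a minimal sketch of both conversions, the built-in Intl API can render a stored UTC instant in a user's time zone and language without any library (exact output strings vary slightly across runtimes):

const instant = new Date(1572218668 * 1000); // the Unix timestamp from above

new Intl.DateTimeFormat("en-US", {
  dateStyle: "full", timeStyle: "long", timeZone: "America/Los_Angeles",
}).format(instant);
// => "Sunday, October 27, 2019 at 4:24:28 PM PDT"

new Intl.DateTimeFormat("zh-CN", {
  dateStyle: "full", timeStyle: "long", timeZone: "Asia/Shanghai",
}).format(instant);
// => "2019年10月28日星期一 CST 07:24:28"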

Numbers and Currency

The display of numbers varies significantly across different countries and regions. The meaning of commas and periods in numbers differs from one country to another.

(1000.1).toLocaleString("en")
// => "1,000.1"
(1000.1).toLocaleString("de")
// => "1.000,1"
(1000.1).toLocaleString("ru")
// => "1 000,1"

Arabic numerals are not universally applicable; for instance, with an Arabic locale, Java's String.format renders the digits 1, 2, 3 as ١, ٢, ٣ (Eastern Arabic numerals).
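The same behavior is visible in JavaScript with an Arabic locale that defaults to Eastern Arabic numerals:

(123).toLocaleString("ar-EG")
// => "١٢٣"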

Regarding pricing, should the same goods be displayed in local currency values in different countries? What is the currency symbol? How precise should the currency be? These questions must be addressed in advance.
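As a small illustration with the built-in Intl API: the symbol, separators, and precision all follow from the locale and currency (the yen, for instance, has no minor unit, so values are rounded to whole numbers):

const price = 1234.5;

new Intl.NumberFormat("de-DE", { style: "currency", currency: "EUR" }).format(price);
// => "1.234,50 €"

new Intl.NumberFormat("en-US", { style: "currency", currency: "USD" }).format(price);
// => "$1,234.50"

new Intl.NumberFormat("ja-JP", { style: "currency", currency: "JPY" }).format(price);
// => "￥1,235"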

Conclusion

The internationalization tools mentioned in this article include translation management systems such as the open-source Mozilla Pontoon and the closed-source paid services lokalise.co and POEditor.com. For code consistency, Alibaba's Kiwi internationalization solution is recommended. For UI display, consider using moment.js and dayjs.

Like all software system development, there is no silver bullet for internationalization; great works are crafted through foundational skills honed over time.

Designing a Load Balancer or Dropbox Bandaid

· 3 min read

Requirements

Internet-scale web services deal with high-volume traffic from the whole world. However, one server can only serve a limited number of requests at a time. Consequently, there is usually a server farm or a large cluster of servers to undertake the traffic altogether. Here comes the question: how to route requests so that each host receives and processes them evenly?

Since there are many hops and layers of load balancers from the user to the server, specifically speaking, this time our design requirements are:

  • an internal Layer 7 load balancer inside the data center
  • leveraging real-time load information from the backend
  • handling tens of millions of requests per second and a throughput of 10 TB per second

Note: If Service A depends on (or consumes) Service B, then A is the downstream service of B, and B is the upstream service of A.

Challenges

Why is it hard to balance loads? The answer is that it is hard to collect accurate load distribution stats and act accordingly.

Distributing-by-requests ≠ distributing-by-load

Random and round-robin distribute the traffic by requests. However, the actual load is not per request - some are heavy in CPU or thread utilization, while some are lightweight.

To measure load more accurately, load balancers have to maintain local states of the observed number of active requests, connection counts, or request latencies for each backend server. Based on these, we can use distribution algorithms like least-connections, least-time, and random N choices:

Least-connections: a request is passed to the server with the least number of active connections.

Latency-based (least-time): a request is passed to the server with the lowest average response time and fewest active connections, taking server weights into account.

However, these two algorithms work well only with a single load balancer. If there are multiple, there might be a herd effect: all the load balancers notice that one server is momentarily faster, and then all of them send requests to that server.

Random N choices (where N=2 in most cases, a.k.a. the power of two choices): pick two at random and choose the better option of the two, avoiding the worst choice.
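A minimal sketch of this strategy, assuming each server object tracks its own count of active connections:

function pickServer(servers) {
  // pick two candidates uniformly at random (they may coincide; that is fine)
  const a = servers[Math.floor(Math.random() * servers.length)];
  const b = servers[Math.floor(Math.random() * servers.length)];
  // choose the less loaded of the two
  return a.activeConnections <= b.activeConnections ? a : b;
}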

Distributed environments

Local LB is unaware of global downstream and upstream states, including

  • upstream service loads
  • upstream service may be super large, and thus it is hard to pick the right subset to cover with the load balancer
  • downstream service loads
  • the processing time of various requests is hard to predict

Solutions

There are three options for collecting load stats accurately and then acting accordingly:

  • centralized & dynamic controller
  • distributed but with shared states
  • piggybacking server-side information in response messages or active probing

The Dropbox Bandaid team chose the third option because it fits well into their existing random N choices approach.

However, instead of using local states, as the original random N choices does, they use real-time global information from the backend servers via response headers.

Server utilization: backend servers are configured with a max capacity, count their ongoing requests, and calculate a utilization percentage ranging from 0.0 to 1.0.

There are two problems to consider:

  1. Handling HTTP errors: If a server fails requests quickly, it attracts more traffic and fails even more.
  2. Stats decay: If a server's reported load is too high, no requests will be distributed there, its stats never refresh, and the server gets stuck. They use a decay function based on an inverted sigmoid curve so that stale readings fade over time.
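A hedged sketch of such a decay: weight a utilization reading by an inverted sigmoid of its age so that an overloaded server is eventually probed again (k and halfLife are illustrative tuning constants, not Dropbox's actual values):

function decayedUtilization(utilization, ageSeconds, halfLife = 30, k = 0.2) {
  // weight falls from ~1 toward 0 as the reading gets older
  const weight = 1 / (1 + Math.exp(k * (ageSeconds - halfLife)));
  return utilization * weight;
}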

Results: requests are more balanced

Designing a Load Balancer

· 3 min read

Requirements Analysis

Internet services often need to handle traffic from around the world, but a single server can only serve a limited number of requests at the same time. Therefore, we typically have a server cluster to collectively manage this traffic. The question arises: how can we evenly distribute this traffic across different servers?

From the user to the server, there are many nodes and load balancers at different levels. Specifically, our design requirements are:

  • Design a Layer 7 load balancer located internally in the data center.
  • Utilize real-time load information from the backend.
  • Handle tens of millions of requests per second and a throughput of 10 TB per second.

Note: If Service A depends on Service B, we refer to A as the downstream service and B as the upstream service.

Challenges

Why is load balancing difficult? The answer lies in the challenge of collecting accurate load distribution data.

Count-based Distribution ≠ Load-based Distribution

The simplest approach is to distribute traffic randomly or in a round-robin manner based on the number of requests. However, the actual load is not calculated based on the number of requests; for example, some requests are heavy and CPU-intensive, while others are lightweight.

To measure load more accurately, the load balancer must maintain some local state—such as the current number of requests, connection counts, and request processing delays. Based on this state, we can employ appropriate load balancing algorithms—least connections, least latency, or random N choose one.

Least Connections: Requests are directed to the server with the fewest current connections.

Least Latency: Requests are directed to the server with the lowest average response time and fewest connections. Servers can also be weighted.

Random N Choose One (N is typically 2, so we can also refer to it as the power of two choices): Randomly select two servers and choose the better of the two, which helps avoid the worst-case scenario.

Distributed Environment

In a distributed environment, local load balancers struggle to understand the complete state of upstream and downstream services, including:

  • Load of upstream services
  • Upstream services can be very large, making it difficult to select an appropriate subset for the load balancer
  • Load of downstream services
  • The specific processing time for different types of requests is hard to predict

Solutions

There are three approaches to accurately collect load information and respond accordingly:

  • A centralized balancer that dynamically manages based on the situation
  • Distributed balancers that share state among them
  • Servers return load information along with requests, or the balancer actively queries the servers

Dropbox chose the third approach when implementing Bandaid, as it adapted well to the existing random N choose one algorithm.

However, unlike the original random N choose one algorithm, this approach does not rely on local state but instead uses real-time results returned by the servers.

Server Utilization: Backend servers set a maximum load, track current connections, and calculate utilization, ranging from 0.0 to 1.0.

Two issues need to be considered:

  1. Error Handling: If a server fails fast, its quick failure responses may attract more traffic, leading to more errors.
  2. Data Decay: If a server's load is too high, no requests will be sent to it, and its stats go stale. A decay function shaped like an inverted S-curve ensures that old data is purged.

Result: Requests Received by Servers are More Balanced

Concurrency Models

· One min read

  • Single-threaded - Callbacks, Promises, Observables and async/await: vanilla JS
  • threading/multiprocessing, lock-based concurrency
    • protecting critical section vs. performance
  • Communicating Sequential Processes (CSP)
    • Golang or Clojure’s core.async.
    • process/thread passes data through channels.
  • Actor Model (AM): Elixir, Erlang, Scala
    • asynchronous by nature, and have location transparency that spans runtimes and machines - if you have a reference (Akka) or PID (Erlang) of an actor, you can message it via mailboxes.
    • powerful fault tolerance by organizing actors into a supervision hierarchy, and you can handle failures at its exact level of hierarchy.
  • Software Transactional Memory (STM): Clojure, Haskell
    • like MVCC or pure functions: commit / abort / retry
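As a tiny illustration of the first bullet, here is the same non-blocking read in vanilla JS, first callback-style and then with async/await (a sketch assuming a Node.js ESM module and a local config.json):

import { readFile } from "node:fs";
import { readFile as readFileAsync } from "node:fs/promises";

// callback style: the handler runs later, on the same single thread
readFile("config.json", "utf8", (err, data) => {
  if (err) throw err;
  console.log(data.length);
});

// async/await style: same event loop, sequential-looking code
const data = await readFileAsync("config.json", "utf8"); // top-level await (ESM)
console.log(data.length);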

The World Is Not Flat — The Economist Special Report: Global Supply Chains

· 5 min read

The Two Forces Influencing Supply Chains: Technology and Politics

There are two types of globalization: one is the great divergence brought about by the Industrial Revolution, and the other is the great convergence resulting from the Information Revolution, where developed countries provide capital and high technology while deindustrializing, and developing countries offer cheap labor and industrialization. The latter form of globalization has created a golden age for supply chains, but its impact on the world has been more sudden and difficult to control.

Currently, we find ourselves at the end of this latter globalization, with slowing growth. The Economist has coined the term slowbalisation to describe this phenomenon, exemplified by events like Brexit and the U.S.-China trade war.

On the technological front, advancements in AI, robotics, 3D printing, autonomous vehicles, and 5G are leading to unprecedented transformations in supply chains, making globalization more manageable. However, these changes are hindered by a technological cold war.

Changes in Industry

More than half of companies believe they should implement major changes in their supply chains, with one in ten businesses advocating for a complete overhaul. There are two primary considerations:

One is the risk associated with reducing supply chain costs: you are merely one node in the supply chain network, making it difficult to control adjacent nodes. After the earthquake and tsunami in Japan in 2011, a semiconductor giant spent over a year, with 100 executives, working out which specific companies made up its supply chain network.

Another consideration is that global trade has not only brought more products but also more services. Services account for one-third of trade value, and their growth is 60% faster than that of products, with telecommunications and information industries seeing growth rates 2 to 3 times faster.

Examining three specific industries:

  1. Apparel: Some production has shifted from China to Southeast Asia, Vietnam, Bangladesh, and Ethiopia. Despite rising costs, China's skilled labor remains highly competitive.

  2. Automotive: Regionalization and hub formation are evident, with Mexico serving North America, Eastern Europe and Morocco serving Western Europe, and Southeast Asia and China serving Asia. A major reason for regionalization is differing consumer habits; for example, pickup trucks are only popular in the U.S. Other factors include trade wars and the electrification of vehicles (which require significantly fewer components).

  3. Computing: Some companies are relocating from China to Vietnam, Cambodia, and Mexico, and even back to the U.S. Still, many companies find it challenging to move out of China; Shenzhen remains a powerhouse due to its automation capabilities and high added value.

Amazon and Alibaba Remain Benchmarks for the Next Transformation

Amazon uses data, algorithms, and robotics to achieve logistics that are one-third faster than most competitors.

Digitalization: How New Technologies Are Transforming Old Industries

  1. Predictability
    1. Companies traditionally use historical records to forecast sales and adjust manufacturing and inventory accordingly. AI can leverage social media data to more accurately fine-tune parameters at each node in the supply chain.
    2. Research shows that executives believe the most important technologies to adopt are cognitive analytics and AI, while blockchain and drones have seen a decline in priority.
    3. JDA's Blue Yonder deep learning algorithms have reduced stockouts by 30% and shortened excess inventory time by several days.
    4. ORSAY uses JDA's automated pricing system to reduce inventory.
    5. Intel saved $58 million using predictive models.
  2. Transparency
    1. For multinational corporations, "Where are the goods?" remains a significant unresolved issue. Why? Because their control over products is not direct; they do not manufacture, transport, store, or sell the products.
      1. For example, in 2017, a strike in Germany's freight industry left IBM's expensive mainframes exposed to the elements on the tarmac for a month, while the company mistakenly believed the goods were safely stored in an airport warehouse.
    2. The Internet of Things (IoT) can address this issue. Sensors can report not only location but also direction, temperature, humidity, and other parameters.
    3. Singapore is automating port transport with autonomous vehicles.
    4. IBM and Maersk are using blockchain to make transportation paperless and transparent.
  3. Speed
    1. In Silicon Valley, Tom Linton's Pulse command center resembles a Pentagon command center, allowing real-time visibility of 92 variables in the supply chain, sharing data and decisions with customers via mobile and computer, reducing inventory by 11 days and freeing up $580 million in cash flow.
    2. Li & Fung has reduced the time from product design to shelf from 40 weeks to half that.
    3. 3D printing unicorn Carbon is helping Ford and Adidas develop and produce products.
    4. Uber or Airbnb for warehousing.

Security Challenges

Supply chain management is transitioning from intuition and experience-driven practices to data-driven approaches, facing three major security challenges:

  1. Huawei. Replacing Huawei products will increase costs, and the adoption of incompatible technology standards for 5G will force different countries to take sides, while 5G technology is crucial for IoT.
  2. Cyberattacks. New IoT devices are rushed to market without adequate security considerations; however, "no one wants to drive a tank to work" because it's just too slow. A better response strategy is to be reactive and quick.
  3. Trade wars. Policies can change daily, but factories cannot be moved at will.

Designing typeahead search or autocomplete

· 2 min read

Requirements

  • realtime / low-latency typeahead and autocomplete service for social networks, like Linkedin or Facebook
  • search social profiles with prefixes
  • newly added accounts appear instantly in the scope of the search
  • not for “query autocomplete” (like the Google search-box dropdown), but for displaying actual search results, including
    • generic typeahead: network-agnostic results from a global ranking scheme like popularity.
    • network typeahead: results from user’s 1st and 2nd-degree network connections, and People You May Know scores.

Linkedin Search

Architecture

Multi-layer architecture

  • browser cache
  • web tier
  • result aggregator
  • various typeahead backend

Cleo Architecture

Result Aggregator

The abstraction of this problem is to find documents by prefixes and terms among a very large number of elements. The solution leverages these four major data structures (see the sketch after this list):

  1. InvertedIndex<prefixes or terms, documents>: given any prefix, find all the document ids that contain the prefix.
  2. For each document, prepare a BloomFilter<prefixes or terms>: as the user types more, we can quickly filter out documents that do not contain the latest prefixes or terms by checking against their bloom filters.
  3. ForwardIndex<documents, prefixes or terms>: the bloom filters may return false positives, so we query the actual documents to reject them.
  4. scorer(document): relevance: each partition returns all of its true hits with scores, and then we aggregate and rank them.
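A hedged sketch of this pipeline follows; the "bloom filters" here are plain Sets so the control flow stays visible, and all names are illustrative, not LinkedIn's actual implementation:

const invertedIndex = new Map(); // prefix or term -> Set of docIds
const bloomFilters = new Map();  // docId -> Set standing in for a bloom filter
const forwardIndex = new Map();  // docId -> Set of prefixes and terms

function addDocument(docId, terms) {
  const keys = new Set();
  for (const term of terms)
    for (let i = 1; i <= term.length; i++) keys.add(term.slice(0, i));
  forwardIndex.set(docId, keys);
  bloomFilters.set(docId, keys); // a real system stores a compact bloom filter
  for (const key of keys) {
    if (!invertedIndex.has(key)) invertedIndex.set(key, new Set());
    invertedIndex.get(key).add(docId);
  }
}

function search(prefixes, score) {
  const [first, ...rest] = prefixes;
  return [...(invertedIndex.get(first) ?? [])]
    .filter((id) => rest.every((p) => bloomFilters.get(id).has(p)))  // cheap filter
    .filter((id) => rest.every((p) => forwardIndex.get(id).has(p))) // reject false positives
    .map((id) => [id, score(id)])
    .sort((x, y) => y[1] - x[1]); // aggregate and rank by relevance
}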

Performance

  • generic typeahead: latency <= 1 ms within a cluster
  • network typeahead (very large dataset over 1st- and 2nd-degree networks): latency <= 15 ms
  • aggregator: latency <= 25 ms

Lyft's Marketing Automation Platform -- Symphony

· 3 min read

Acquisition Efficiency Problem: How to achieve a better ROI in advertising?

In detail, Lyft's advertisements should meet the requirements below:

  1. being able to manage region-specific ad campaigns
  2. guided by data-driven growth: The growth must be scalable, measurable, and predictable
  3. supporting Lyft's unique growth model as shown below

lyft growth model

However, the biggest challenge is to manage all the processes of cross-region marketing at scale, which include choosing bids, budgets, creatives, incentives, and audiences, running A/B tests, and so on. You can see what occupies a day in the life of a digital marketer:

A day in the life of a marketer

We can see that execution occupies most of the time, while analysis, though considered more important, takes much less time. A scaling strategy will enable marketers to concentrate on analysis and decision-making instead of operational activities.

Solution: Automation

To reduce costs and improve experimental efficiency, we need to

  1. predict the likelihood of a new user being interested in our product
  2. evaluate effectively and allocate marketing budgets across channels
  3. manage thousands of ad campaigns handily

The marketing performance data flows into Lyft's reinforcement-learning system, Amundsen.

The problems that need to be automated include:

  1. updating bids across search keywords
  2. turning off poor-performing creatives
  3. changing referrals values by market
  4. identifying high-value user segments
  5. sharing strategies across campaigns

Architecture

Lyft Symphony Architecture

The tech stack includes - Apache Hive, Presto, ML platform, Airflow, 3rd-party APIs, UI.

Main components

Lifetime Value (LTV) forecaster

The lifetime value of a user is an important criterion for measuring the efficiency of acquisition channels. The budget is determined jointly by LTV and the price we are willing to pay in that region.

Our knowledge of a new user is limited. As the user interacts with our services, the accumulated historical data helps us predict more accurately.

Initial feature values:

Feature values

The forecast improves as the historical data of interactivity accumulates:

Predicting LTV from historical records

Budget allocator

After LTV is predicted, the next step is to estimate budgets based on the price. A curve of the form LTV = a * (spend)^b is fit to the data. A degree of randomness is injected into the cost-curve creation process in order to converge to a global optimum.
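A hedged sketch of the curve fit: taking logs turns LTV = a * (spend)^b into a linear regression, log LTV = log a + b * log spend (the names and the method are illustrative, not Lyft's actual pipeline):

// points: [{ spend, ltv }, ...] observed per channel or region
function fitCostCurve(points) {
  const logs = points.map((p) => [Math.log(p.spend), Math.log(p.ltv)]);
  const n = logs.length;
  const mx = logs.reduce((s, [x]) => s + x, 0) / n;
  const my = logs.reduce((s, [, y]) => s + y, 0) / n;
  const b =
    logs.reduce((s, [x, y]) => s + (x - mx) * (y - my), 0) /
    logs.reduce((s, [x]) => s + (x - mx) ** 2, 0);
  const a = Math.exp(my - b * mx); // back out a from the intercept
  return { a, b };
}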

Budget calculation

Bidders

Bidders are made up of two parts - the tuners and actors. The tuners decide exact channel-specific parameters based on the price. The actors communicate the actual bid to different channels.

Some popular bidding strategies, applied in different channels, are listed below:

Bidding strategies

Conclusion

We have to value human experience in the automation process; otherwise, the quality of the models may be "garbage in, garbage out". Once freed from laborious tasks, marketers can focus more on understanding users, channels, and the messages they want to convey to audiences, and thus obtain better ad impact. That is how Lyft achieves a higher ROI with less time and effort.

Designing Airbnb or a hotel booking system

· 3 min read

Requirements

  • for guests
    • search rooms by locations, dates, number of rooms, and number of guests
    • get room details (like picture, name, review, address, etc.) and prices
    • pay and book room from inventory by date and room id
      • checkout as a guest
      • user is logged in already
    • notification via Email and mobile push notification
  • for hotel or rental administrators (suppliers/hosts)
    • administrators (receptionist/manager/rental owner): manage room inventory and help the guest to check-in and check out
    • housekeeper: clean up rooms routinely

Architecture

Components

Inventory <> Bookings <> Users (guests and hosts)

Suppliers provide their room details in the inventory, and users can search, get, and reserve rooms accordingly. After a room is reserved, the user's payment also changes the status of the reserved_room. You can check the data model in this post.

How to find available rooms?

  • by location: geo-search with spatial indexing, e.g. geo-hash or quad-tree.
  • by room metadata: apply filters or search conditions when querying the database.
  • by date-in and date-out and availability. Two options (see the sketch after this list):
    • option 1: for a given room_id, check all occupied_room records for today or later, transform the data into an array of occupancy by day, and finally find available slots in the array. This process might be time-consuming, so we can build an availability index.
    • option 2: for a given room_id, always create an entry for each occupied day. Then it is easier to query unavailable slots by date.
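A minimal sketch of option 2, assuming one occupancy entry per room per night (the names are illustrative, not the post's actual schema):

const occupiedNights = new Set(); // keys like "42|2019-10-27" (roomId|date)

const nightKey = (roomId, d) => `${roomId}|${d.toISOString().slice(0, 10)}`;

function isAvailable(roomId, checkIn, checkOut) {
  for (let d = new Date(checkIn); d < checkOut; d.setDate(d.getDate() + 1)) {
    if (occupiedNights.has(nightKey(roomId, d))) return false;
  }
  return true;
}

function book(roomId, checkIn, checkOut) {
  if (!isAvailable(roomId, checkIn, checkOut)) throw new Error("room occupied");
  for (let d = new Date(checkIn); d < checkOut; d.setDate(d.getDate() + 1)) {
    occupiedNights.add(nightKey(roomId, d)); // one entry per occupied night
  }
}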

For hotels, syncing data

If it is a hotel booking system, then it will probably publish to Booking Channels like GDS, Aggregators, and Wholesalers.

Hotel Booking Ecosystem

To sync data across those places, we can

  1. retry with idempotency to improve the success rate of the external calls and ensure no duplicate orders.
  2. provide webhook callback APIs to external vendors to update status in the internal system.

Payment & Bookkeeping

Data model: double-entry bookkeeping

To execute the payment, we call an external payment gateway, like a bank, Stripe, or Braintree. It is crucial to keep data in sync across different places: the internal transaction table and the external banks and vendors.

Notifier for reminders / alerts

The notification system is essentially a delayer scheduler (priority queue + subscriber) plus API integrations.

For example, a daily cronjob will query the database for notifications to be sent out today and put them into the priority queue by date. The subscriber takes the earliest one from the priority queue and sends it out once the expected timestamp is reached. Otherwise, it puts the task back into the queue and sleeps, leaving the CPU idle for other work; the sleep can be interrupted when new alerts are added for today.
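A hedged sketch of that scheduler: a priority queue ordered by send time, drained by a subscriber loop (a sorted array stands in for a real heap, and the wake-up logic is simplified):

const queue = []; // items: { sendAt: epoch millis, notify: () => void }

function schedule(item) {
  queue.push(item);
  queue.sort((a, b) => a.sendAt - b.sendAt); // a real impl uses a binary heap
}

async function subscriberLoop() {
  while (true) {
    const next = queue[0];
    const wait = next ? next.sendAt - Date.now() : 60_000;
    if (next && wait <= 0) {
      queue.shift();
      next.notify(); // reached the expected timestamp: send it out
      continue;
    }
    // sleep until the earliest task (or a minute), leaving the CPU idle;
    // newly scheduled items are picked up on the next iteration
    await new Promise((resolve) => setTimeout(resolve, Math.min(wait, 60_000)));
  }
}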

Designing Memcached or an in-memory KV store

· 2 min read

Requirements

  1. High-performance, distributed key-value store
    • Why distributed? To hold a larger amount of data
  2. For in-memory storage of small data objects
  3. Simple server (pushing complexity to the client) and hence reliable and easy to deploy

Architecture

Big Picture: Client-server

  • client
    • given a list of Memcached servers
    • chooses a server based on the key
  • server
    • stores KVs in the internal hash table
    • LRU eviction

The Key-value server consists of a fixed-size hash table + single-threaded handler + coarse locking

hash table

How to handle collisions? There are mostly three ways to resolve them (see the sketch after this list):

  1. Separate chaining: the collided bucket chains a list of entries with the same index, and you can always append the newly collided key-value pair to the list.
  2. Open addressing: if there is a collision, go to the next index until an available bucket is found.
  3. Dynamic resizing: resize the hash table and allocate more space; hence, collisions will happen less frequently.
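A minimal sketch of option 1, separate chaining, over a fixed-size table (the hash function is a toy; real servers use stronger hashes):

class ChainedHashTable {
  constructor(size = 1024) {
    this.buckets = Array.from({ length: size }, () => []);
  }
  index(key) {
    let h = 0;
    for (const c of key) h = (h * 31 + c.charCodeAt(0)) >>> 0; // toy string hash
    return h % this.buckets.length;
  }
  set(key, value) {
    const bucket = this.buckets[this.index(key)];
    const entry = bucket.find(([k]) => k === key);
    if (entry) entry[1] = value;
    else bucket.push([key, value]); // collided keys chain in the same bucket
  }
  get(key) {
    return this.buckets[this.index(key)].find(([k]) => k === key)?.[1];
  }
}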

How does the client determine which server to query?

See Data Partition and Routing

How to use cache?

See Key value cache

How to further optimize?

See How Facebook Scale its Social Graph Store? TAO

Lyft's Marketing Automation Platform Symphony

· 3 min read

Customer Acquisition Efficiency Issue: How can advertising campaigns achieve higher returns with less money and fewer people?

Specifically, Lyft's advertising campaigns need to address the following characteristics:

  1. Manage location-based campaigns
  2. Data-driven growth: growth must be scalable, measurable, and predictable
  3. Support Lyft's unique growth model, as shown below:

lyft growth model

The main challenge is the difficulty of scaling management across various aspects of regional marketing, including ad bidding, budgeting, creative assets, incentives, audience selection, testing, and more. The following image depicts a day in the life of a marketer:

A Day in the Life of a Marketer

We can see that "execution" takes up most of the time, while less time is spent on the more important tasks of "analysis and decision-making." Scaling means reducing complex operations and allowing marketers to focus on analysis and decision-making.

Solution: Automation

To reduce costs and improve the efficiency of experimentation, it is necessary to:

  1. Predict whether new users are interested in the product
  2. Optimize across multiple channels and effectively evaluate and allocate budgets
  3. Conveniently manage thousands of campaigns

Data is enhanced through Lyft's Amundsen system using reinforcement learning.

The automation components include:

  1. Updating bids across search keywords
  2. Disabling underperforming creative assets
  3. Adjusting referral values based on market changes
  4. Identifying high-value user segments
  5. Sharing strategies across multiple campaigns

Architecture

Lyft Symphony Architecture

Technology stack: Apache Hive, Presto, ML platform, Airflow, 3rd-party APIs, UI.

Specific Component Modules

LTV Prediction Module

The lifetime value (LTV) of users is an important metric for evaluating channels, and the budget is determined by both LTV and the price we are willing to pay for customer acquisition in that region.

Our understanding of new users is limited, but as interactions increase, the historical data provided will more accurately predict outcomes.

Initial feature values:

Feature Values

As historical interaction records accumulate, the predictions become more accurate:

Predicting LTV Based on Historical Records

Budget Allocation Module

Once LTV is established, the next step is to set the budget based on pricing. A curve of the form LTV = a * (spend)^b is fitted to the data, and a degree of randomness is injected into the curve-fitting process so that it can converge to a global optimum.

Budget Calculation

Delivery Module

This module is divided into two parts: the parameter tuner and the executor. The tuner sets specific parameters based on pricing for each channel, while the executor applies these parameters to the respective channels.

There are many popular delivery strategies that are common across various channels:

Delivery Strategies

Conclusion

It is essential to recognize the importance of human experience within the system; otherwise, it results in garbage in, garbage out. When people are liberated from tedious delivery tasks and can focus on understanding users, channels, and the messages they need to convey to their audience, they can achieve better campaign results—spending less time to achieve higher ROI.