Abstractly, the problem is to find documents by prefixes and terms among a very large number of elements. The solution leverages these four major data structures:
InvertedIndex<prefixes or terms, documents>: given any prefix, find all the document IDs that contain it.
BloomFilter<prefixes or terms>, one per document: as the user types more, we can quickly filter out documents that do not contain the latest prefixes or terms by checking their Bloom filters.
ForwardIndex<documents, prefixes or terms>: the Bloom filters may return false positives, so we query the actual documents to reject them.
scorer(document) -> relevance: each partition returns all of its true hits with scores, and then we aggregate and rank.
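The four-step pipeline above can be sketched end to end. This is a toy, single-partition version with a hypothetical corpus and a deliberately tiny Bloom filter; the scorer is a placeholder, since the real relevance function is not specified here.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may yield false positives, never false negatives."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

# Toy corpus: doc id -> terms (this is the forward index).
forward_index = {1: {"apple", "april"}, 2: {"apply"}, 3: {"banana"}}

# Build the inverted index (prefix -> doc ids) and one Bloom filter per doc.
inverted_index, blooms = {}, {}
for doc, terms in forward_index.items():
    blooms[doc] = BloomFilter()
    for term in terms:
        for i in range(1, len(term) + 1):
            prefix = term[:i]
            blooms[doc].add(prefix)
            inverted_index.setdefault(prefix, set()).add(doc)

def initial_candidates(prefix):
    """Step 1: the inverted index finds docs containing the prefix."""
    return inverted_index.get(prefix, set())

def refine(candidates, longer_prefix):
    """Steps 2 and 3: as the user types more, Bloom filters cheaply narrow
    the candidate set, then the forward index rejects false positives."""
    maybe = {d for d in candidates if blooms[d].might_contain(longer_prefix)}
    return {d for d in maybe
            if any(t.startswith(longer_prefix) for t in forward_index[d])}

def rank(hits):
    """Step 4: score each true hit, then aggregate and rank.
    Placeholder scorer: docs with fewer terms rank as more focused."""
    return sorted(hits, key=lambda d: len(forward_index[d]))

hits = refine(initial_candidates("ap"), "app")
print(rank(hits))   # [2, 1]
```

In a real deployment each partition runs this pipeline over its own shard and a coordinator merges the ranked results.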
In detail, Lyft's advertising should meet the requirements below:
However, the biggest challenge is managing all the processes of cross-region marketing at scale, including choosing bids, budgets, creatives, incentives, and audiences, and running A/B tests. Here is what occupies a day in the life of a digital marketer:
We can see that execution occupies most of the time, while analysis, though considered more important, takes much less. A scaling strategy will let marketers concentrate on analysis and decision-making instead of operational activities.
To reduce costs and improve experimental efficiency, we need to
The marketing performance data flows into Amundsen, Lyft's reinforcement-learning system:
The problems that need to be automated include:
The tech stack includes Apache Hive, Presto, an ML platform, Airflow, third-party APIs, and a UI.
The lifetime value (LTV) of a user is an important criterion for measuring the efficiency of acquisition channels. The budget is determined by the LTV together with the price we are willing to pay in that region.
Our knowledge of a new user is limited, but historical data helps us predict more accurately as the user interacts with our services. The forecast improves as the history of interactions accumulates:
After the LTV is predicted, the next step is to estimate budgets based on the price. A curve of the form
LTV = a * (spend)^b is fit to the data. A degree of randomness is injected into the cost-curve creation process in order to converge to a global optimum.
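One way to fit such a power curve is linear regression in log-log space, since log(LTV) = log(a) + b * log(spend). The article does not say how the randomness is injected, so this sketch simply perturbs the observations by a small jitter; the data is synthetic.

```python
import math
import random

def fit_cost_curve(spend, ltv, jitter=0.05, seed=7):
    """Fit LTV = a * spend^b by least squares in log-log space.
    A pinch of randomness (assumed form: jitter on the observations)
    varies the fit across runs, per the article's note on randomness."""
    rng = random.Random(seed)
    xs = [math.log(s) for s in spend]
    ys = [math.log(v * (1 + rng.uniform(-jitter, jitter))) for v in ltv]
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    b = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    a = math.exp(y_mean - b * x_mean)
    return a, b

# Synthetic data generated from LTV = 2 * spend^0.5.
spend = [10, 50, 100, 500, 1000]
ltv = [2 * s ** 0.5 for s in spend]
a, b = fit_cost_curve(spend, ltv)
print(round(a, 2), round(b, 2))   # roughly a = 2, b = 0.5
```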
Bidders are made up of two parts: tuners and actors. The tuners decide the exact channel-specific parameters based on the price. The actors communicate the actual bid to the different channels.
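The tuner/actor split can be sketched as two small classes. The channel name and the multiplier-based tuning rule are hypothetical stand-ins; only the division of responsibility comes from the text.

```python
class Tuner:
    """Decides channel-specific parameters (here, a bid) from the price.
    The multiplier rule is a made-up example of a tuned parameter."""
    def __init__(self, channel_multiplier):
        self.channel_multiplier = channel_multiplier

    def decide_bid(self, price):
        return round(price * self.channel_multiplier, 2)

class Actor:
    """Communicates the actual bid to a channel's API (stubbed here)."""
    def __init__(self, channel):
        self.channel = channel
        self.sent = []

    def place_bid(self, bid):
        self.sent.append(bid)   # a real actor would call the channel API
        return f"{self.channel}: bid {bid}"

tuner = Tuner(channel_multiplier=0.8)        # hypothetical tuned value
actor = Actor("google_ads")                  # hypothetical channel name
print(actor.place_bid(tuner.decide_bid(10.0)))   # google_ads: bid 8.0
```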
Some popular bidding strategies, applied in different channels, are listed below:
We have to value human experience in the automation process; otherwise, the quality of the models may suffer from "garbage in, garbage out". Once freed from laborious tasks, marketers can focus more on understanding users, channels, and the messages they want to convey to audiences, and thus achieve better ad impact. That is how Lyft can achieve a higher ROI with less time and effort.
Suppliers provide their room details in the inventory, and users can search, get, and reserve rooms accordingly. After the user reserves a room, the payment also changes the status of the reserved_room. You can check the data model in this post.
To check the availability of a room: given a room_id, check all occupied_room entries for today or later, transform the data into an array of occupancy by day, and finally find available slots in the array. This process can be time-consuming, so we can build an availability index.
In the availability index, for each room_id, always create an entry for each occupied day. Then it becomes easy to query unavailable slots by date.
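A minimal in-memory sketch of that availability index, assuming a room_id plus per-day occupancy entries (the real system would keep this in a database table keyed the same way):

```python
from datetime import date, timedelta

# Availability index: for each room_id, one entry per occupied day,
# so unavailable slots can be looked up directly by date.
occupied_index = {}   # room_id -> set of occupied dates

def mark_occupied(room_id, day):
    occupied_index.setdefault(room_id, set()).add(day)

def available_slots(room_id, start, nights):
    """Scan the next `nights` days and return the free ones."""
    taken = occupied_index.get(room_id, set())
    return [start + timedelta(days=i) for i in range(nights)
            if start + timedelta(days=i) not in taken]

mark_occupied(101, date(2024, 5, 2))
print(available_slots(101, date(2024, 5, 1), 3))
# [datetime.date(2024, 5, 1), datetime.date(2024, 5, 3)]
```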
If it is a hotel booking system, it will probably also publish to booking channels like GDS, aggregators, and wholesalers.
To sync data across those places, we can
Data model: double-entry bookkeeping
To execute the payment, we call an external payment gateway, such as a bank, Stripe, or Braintree, so it is crucial to keep data in sync across different places: the transaction table on our side and the external banks and vendors.
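The double-entry data model records every transfer twice, as a debit on one account and a matching credit on another, so the ledger always sums to zero and drift against external vendors is detectable. A minimal sketch with hypothetical account names:

```python
from decimal import Decimal

ledger = []   # entries: (transaction_id, account, signed amount)

def record_transfer(txn_id, from_account, to_account, amount):
    """Double-entry rule: one debit and one credit per transfer."""
    amount = Decimal(amount)
    ledger.append((txn_id, from_account, -amount))   # debit
    ledger.append((txn_id, to_account, amount))      # credit

def balance(account):
    return sum(a for _, acct, a in ledger if acct == account)

# Hypothetical flow: a user pays into escrow, escrow pays out via a vendor.
record_transfer("t1", "user:alice", "platform:escrow", "25.00")
record_transfer("t2", "platform:escrow", "vendor:stripe", "25.00")

# The invariant double entry buys us: the whole ledger sums to zero.
assert sum(a for _, _, a in ledger) == 0
print(balance("platform:escrow"))   # 0.00
```

Reconciliation against the external gateway then reduces to comparing per-account balances with the vendor's statements.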
The notification system is essentially a delayed scheduler (priority queue + subscriber) plus API integrations.
For example, a daily cronjob queries the database for notifications to be sent today and puts them into the priority queue ordered by time. The subscriber takes the earliest one from the priority queue and sends it if its expected timestamp has been reached; otherwise, it puts the task back into the queue and sleeps, leaving the CPU idle for other work. The sleep can be interrupted when new alerts are added for today.
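The cronjob-plus-subscriber loop can be sketched with a standard-library min-heap. Timestamps are simplified to integers, and the real interruptible sleep is replaced by an explicit `poll(now)` call to keep the sketch testable:

```python
import heapq

# Delayed scheduler: the cronjob enqueues today's notifications keyed by
# send time; the subscriber pops the earliest and sends it when due.
queue = []   # min-heap of (send_at, message)

def enqueue(send_at, message):
    heapq.heappush(queue, (send_at, message))

def poll(now):
    """Send everything due at `now`. A real subscriber would instead
    sleep until the head's timestamp, waking early on new alerts."""
    sent = []
    while queue and queue[0][0] <= now:
        _, message = heapq.heappop(queue)
        sent.append(message)          # a real system calls the channel API
    return sent

enqueue(900, "booking reminder")      # hypothetical notifications
enqueue(1700, "promo push")
print(poll(now=1000))                 # ['booking reminder']
```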
Big Picture: Client-server
The key-value server consists of a fixed-size hash table, a single-threaded handler, and coarse locking.
How do we handle collisions? There are mostly three ways to resolve them:
See Key value cache
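A minimal sketch of such a server's core, assuming chaining as the collision strategy (one of the usual options) and a single table-wide lock for the coarse locking described above:

```python
import threading

class KVStore:
    """Fixed-size hash table guarded by one coarse, table-wide lock."""
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]   # fixed bucket count
        self.lock = threading.Lock()

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def set(self, key, value):
        with self.lock:                 # coarse locking: one lock for all
            bucket = self._bucket(key)
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)   # overwrite existing key
                    return
            bucket.append((key, value))        # chain on collision

    def get(self, key):
        with self.lock:
            for k, v in self._bucket(key):
                if k == key:
                    return v
            return None

store = KVStore()
store.set("ride:1", "completed")
print(store.get("ride:1"))   # completed
```

A single lock keeps the handler as simple as a single-threaded design, at the cost of serializing all requests; finer-grained per-bucket locks would be the next step.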