Skip to main content

Dongxu Huang: Building a Database Startup in China

· 4 min read

Company

  • Established for four years, with the first two years spent writing code; in the last year and a half, one or two hundred companies have started using it.
  • Infra has a significant advantage in China due to the large market; companies adopt things quickly and aggressively, allowing good infra products to be utilized rapidly.
  • None of the co-founders have database experience.
  • The open-source model is the future; if it's closed-source, negotiations have to be done one by one.
  • The most important business decision: it's not about how advanced the algorithms are or how strong the team is; the key to the moat lies in 1) community 2) MySQL interface (leveraging the MySQL community is crucial, and SQL support is important, as even Kafka and Spark support SQL).

TiDB Database Principles

The more advanced the engineer, the more they prefer high performance, believing that faster is better. However, TiDB's primary goal is not to be the fastest but to achieve availability, reliability, stability, I/O, and infinite scalability. The cost of high performance is too high and should be optimized based on the user's hardware. However, this is a general-purpose database, so optimizing for various scenarios is not feasible. Keep it simple for users and complex for ourselves (contrary to AWS's approach).

They believe eventual consistency is a pseudo-concept; in reality, it means no consistency or weak consistency. For example, with Cassandra, once WRN is set up, how long does it take to resolve? It's uncertain. It's too complex; users should ideally not have to worry about these complicated settings.

Benchmark scores are not the only metric; high TPCC/TPCH scores do not provide much practical guidance. Databases are refined through use. For instance, the first customer, a gaming company, created an astonishing 30,000 tables, and the JSON metadata was very slow to connect.

The architecture is not P2P; different roles are clearly defined.

KV uses RocksDB, but the typical write amplification of LSM trees is 15 times; here, it has been optimized by extracting values to solve the write amplification issue.

Supports MySQL clients and also supports reading from SparkSQL.

The SQL layer does not use MySQL modules. Initially, there was an attempt, but 1) it was challenging to distribute 2) the code was too poor and difficult to modify. If modified drastically, it could take six months; however, the long-term maintenance cost would be higher. Redoing is not troublesome; they have already refactored three times. Using Go is more convenient for refactoring than using C/C++/Rust.

Initially, they only wanted to do F1 and SQL, collaborating with CockroachDB. Later, as they moved towards SQL, they had to focus on storage.

They hired two members from the Rust core team. Rust is challenging to recruit for; typically, they hire C++ developers and then transition them to Rust, as they find that many compiler conventions are resolved for them.

Do not underestimate the difficulty of creating industrial-grade solutions. Using gRPC, Raft, and RocksDB means that if there are new developments in the industry, users will directly benefit.

Chunk (region) splitting took two months, while merging took three years. Merging has undergone formal verification.

Why now?

  1. Hardware
  2. Hot/cold data -> warm data
  3. Log is the new database

Everything is pluggable. The top-level API remains unchanged, while the underlying components are plug-and-play and replaceable.

Distributed Transactions

  • 2PC is the only option
  • Challenges: reduce round-trips

Multi-tenancy achieved through Kubernetes

China Biz Trend

  1. Chinese speed; once it's said, it must be done.
  2. Higher expectations for new technologies to empower businesses. Companies in second and third-tier cities or non-BAT firms are using new technologies to compete with giants. They must be able to alleviate users' technical anxieties.
  3. The talent pool for foundational software is gradually strengthening. P's production capacity CAP has contributed significantly to technical content marketing.
  4. Some core scenarios (core banking systems) dare to use domestic technologies.
  5. PingCAP's path: open source (internet/community) < - > commercialization.

Stream and Batch Processing Frameworks

· 3 min read

Why Do We Need Such Frameworks?

  • To process more data in a shorter amount of time.
  • To unify fault tolerance in distributed systems.
  • To simplify task abstractions to meet changing business requirements.
  • Suitable for bounded datasets (batch processing) and unbounded datasets (stream processing).

Brief History of Batch and Stream Processing Development

  1. Hadoop and MapReduce. Google made batch processing as simple as MapReduce result = pairs.map((pair) => (morePairs)).reduce(somePairs => lessPairs) in a distributed system.
  2. Apache Storm and directed graph topologies. MapReduce does not represent iterative algorithms well. Therefore, Nathan Marz abstracted stream processing into a graph structure composed of spouts and bolts.
  3. Spark in-memory computation. Reynold Xin pointed out that Spark uses ten times fewer machines than Hadoop while being three times faster when processing the same data.
  4. Google Dataflow based on Millwheel and FlumeJava. Google uses a windowed API to support both batch and stream processing simultaneously.
  1. Flink quickly adopted the programming model of ==Google Dataflow== and Apache Beam.
  2. Flink's efficient implementation of the Chandy-Lamport checkpointing algorithm.

These Frameworks

Architecture Choices

To meet the above demands with commercial machines, there are several popular distributed system architectures...

  • Master-slave (centralized): Apache Storm + Zookeeper, Apache Samza + YARN
  • P2P (decentralized): Apache S4

Features

  1. DAG Topology for iterative processing - for example, GraphX in Spark, topologies in Apache Storm, DataStream API in Flink.
  2. Delivery Guarantees. How to ensure the reliability of data delivery between nodes? At least once / at most once / exactly once.
  3. Fault Tolerance. Implement fault tolerance using cold/warm/hot standby, checkpointing, or active-active.
  4. Windowed API for unbounded datasets. For example, streaming windows in Apache. Window functions in Spark. Windowing in Apache Beam.

Comparison Table of Different Architectures

ArchitectureStormStorm-tridentSparkFlink
ModelNativeMicro-batchMicro-batchNative
GuaranteesAt least onceExactly onceExactly onceExactly once
Fault ToleranceRecord AckRecord AckCheckpointCheckpoint
Maximum Fault ToleranceHighMediumMediumLow
LatencyVery lowHighHighLow
ThroughputLowMediumHighHigh

Toutiao Recommendation System: P1 Overview

· 6 min read

What are we optimizing for? User Satisfaction

We are finding the best function below to maximize user satisfaction .

user satisfaction = function(content, user profile, context)
  1. Content: features of articles, videos, UGC short videos, Q&As, etc.
  2. User profile: interests, occupation, age, gender and behavior patterns, etc.
  3. Context: Mobile users in contexts of workspace, commuting, traveling, etc.

How to evaluate the satisfaction?

  1. Measurable Goals, e.g.

    • click through rate
    • Session duration
    • upvotes
    • comments
    • reposts
  2. Hard-to-measurable Goals:

    • Frequency control of ads and special-typed contents (Q&A)
    • Frequency control of vulgar content
    • Reducing clickbait, low quality, disgusting content
    • Enforcing / pining / highly-weighting important news
    • Lowly-weighting contents from low-level accounts

How to optimize for those goals? Machine Learning Models

It is a typical supervised machine learning problem to find the best function above. To implement the system, we have these algorithms:

  1. Collaborative Filtering
  2. Logistic Regression
  3. DNN
  4. Factorization Machine
  5. GBDT

A world-class recommendation system is supposed to have the flexibility to A/B-test and combine multiple algorithms above. It is now popular to combine LR and DNN. Facebook used both LR and GBDT years ago.

How do models observe and measure the reality? Feature engineering

  1. Correlation, between content’s characteristic and user’s interest. Explicit correlations include keywords, categories, sources, genres. Implicit correlations can be extract from user’s vector or item’s vector from models like FM.

  2. Environmental features such as geo location, time. It’s can be used as bias or building correlation on top of it.

  3. Hot trend. There are global hot trend, categorical hot trend, topic hot trend and keyword hot trend. Hot trend is very useful to solve cold-start issue when we have little information about user.

  4. Collaborative features, which helps avoid situation where recommended content get more and more concentrated. Collaborative filtering is not analysing each user’s history separately, but finding users’ similarity based on their behaviour by clicks, interests, topics, keywords or event implicit vectors. By finding similar users, it can expand the diversity of recommended content.

Large-scale Training in Realtime

  • Users like to see news feed updated in realtime according to what we track from their actions.
  • Use Apache storm to train data (clicks, impressions, faves, shares) in realtime.
  • Collect data to a threshold and then update to the recommendation model
  • Store model parameters , like tens of billions of raw features and billions of vector features, in high performance computing clusters.

They are implemented in the following steps:

  1. Online services record features in realtime.
  2. Write data into Kafka
  3. Ingest data from Kafka to Storm
  4. Populate full user profiles and prepare samples
  5. Update model parameters according to the latest samples
  6. Online modeling gains new knowledge

How to further reducing the latency? Recall Strategy

It is impossible to predict all the things with the model, considering the super-large scale of all the contents. Therefore, we need recall strategies to focus on a representative subset of the data. Performance is critical here and timeout is 50ms.

recall strategy

Among all the recall strategies, we take the InvertedIndex<Key, List<Article>> .

The Key can be topic, entity, source, etc.

Tags of InterestsRelevanceList of Documents
E-commerce0.3
Fun0.2
History0.2
Military0.1

Data Dependencies

  • Features depends on tags of user-side and content-side.
  • recall strategy depends on tags of user-side and content-side.
  • content analysis and data mining of user tags are cornerstone of the recommendation system.

What is the content analysis?

content analysis = derive intermediate data from raw articles and user behaviors.

Take articles for example. To model user interests, we need to tag contents and articles. To associate a user with the interests of the “Internet” tag, we need to know whether a user reads an article with the “Internet” tag.

Why are we analyzing those raw data?

We do it for the reason of …

  1. Tagging users (user profile)
    • Tagging users who liked articles with “Internet” tag. Tagging users who liked articles with “xiaomi” tag.
  2. Recommending contents to users by tags
    • Pushing “meizu” contents to users with “meizu” tag. Pushing “dota” contents to users with “dota” tag.
  3. Preparing contents by topics
    • Put “Bundesliga” articles to “Bundesliga topic”. Put “diet” articles to “diet topic”.

Case Study: Analysis Result of an Article

Here is an example of “article features” page. There are article features like categorizations, keywords, topics, entities.

Analysis Result of an Article

Analysis Result of an Article: Details

What are the article features?

  1. Semantic Tags: Human predefine those tags with explicit meanings.

  2. Implicit Semantics, including topics and keywords. Topic features are describing the statistics of words. Certain rules generate keywords.

  3. Similarity. Duplicate recommendation once to be the most severe feedbacks we get from our customers.

  4. Time and location.

  5. Quality. Abusing, porn, ads, or “chicken soup for the soul”?

Article features are important

  • It is not true that a recommendation system cannot work at all without article features. Amazon, Walmart, Netflix can recommend by collaborative filtering.
  • However, in news product, users consume contents of the same day. Bootstrapping without article features is hard. Collaborative filtering cannot help with bootstrapping.
    • The finer of the granularity of the article feature, the better the ability to bootstrap.

More on Semantic Tags

We divide features of semantic tags into three levels:

  1. Categorizations: used in the user profile, filtering contents in topics, recommend recall, recommend features
  2. Concepts: used in filtering contents in topics, searching tags, recommend recall(like)
  3. Entities: used in filtering contents in topics, searching tags, recommend recall(like)

Why dividing into different levels? We do this so that they can capture articles in different granularities.

  1. Categorizations: full in coverage, low in accuracy.
  2. Concepts: medium in coverage, medium in accuracy.
  3. Entities: low in coverage, high in accuracy. It only covers hot people, organizations, products in each area.

Categorizations and concepts are sharing the same technical infrastructure.

Why do we need semantic tags?

  • Implicit semantics
    • have been functioning well.
    • cost much less than semantic tags.
  • But, topics and interests need a clear-defined tagging system.
  • Semantic tags also evaluate the capability in NPL technology of a company.

Document classification

Classification hierarchy

  1. Root
  2. Science, sports, finance, entertainment
  3. Football, tennis, table tennis, track and field, swimming
  4. International, domestic
  5. Team A, team B

Classifiers:

  • SVM
  • SVM + CNN
  • SVM + CNN + RNN

Calculating relevance

  1. Lexical analysis for articles
  2. Filtering keywords
  3. Disambiguation
  4. Calculating relevance

Toutiao Recommendation System: P1 Overview

· 4 min read

What are we optimizing for? User Satisfaction

We are finding the best function below to maximize user satisfaction .

user satisfaction = function(content, user profile, context)
  1. Content: features of articles, videos, UGC short videos, Q&As, etc.
  2. User profile: interests, occupation, age, gender and behavior patterns, etc.
  3. Context: Mobile users in contexts of workspace, commuting, traveling, etc.

How to evaluate the satisfaction?

  1. Measurable Goals, e.g.

    • click through rate
    • Session duration
    • upvotes
    • comments
    • reposts
  2. Hard-to-measurable Goals:

    • Frequency control of ads and special-typed contents (Q&A)
    • Frequency control of vulgar content
    • Reducing clickbait, low quality, disgusting content
    • Enforcing / pining / highly-weighting important news
    • Lowly-weighting contents from low-level accounts

How to optimize for those goals? Machine Learning Models

It is a typical supervised machine learning problem to find the best function above. To implement the system, we have these algorithms:

  1. Collaborative Filtering
  2. Logistic Regression
  3. DNN
  4. Factorization Machine
  5. GBDT

A world-class recommendation system is supposed to have the flexibility to A/B-test and combine multiple algorithms above. It is now popular to combine LR and DNN. Facebook used both LR and GBDT years ago.

How do models observe and measure the reality? Feature engineering

  1. Correlation, between content’s characteristic and user’s interest. Explicit correlations include keywords, categories, sources, genres. Implicit correlations can be extract from user’s vector or item’s vector from models like FM.

  2. Environmental features such as geo location, time. It’s can be used as bias or building correlation on top of it.

  3. Hot trend. There are global hot trend, categorical hot trend, topic hot trend and keyword hot trend. Hot trend is very useful to solve cold-start issue when we have little information about user.

  4. Collaborative features, which helps avoid situation where recommended content get more and more concentrated. Collaborative filtering is not analysing each user’s history separately, but finding users’ similarity based on their behaviour by clicks, interests, topics, keywords or event implicit vectors. By finding similar users, it can expand the diversity of recommended content.

Large-scale Training in Realtime

  • Users like to see news feed updated in realtime according to what we track from their actions.
  • Use Apache storm to train data (clicks, impressions, faves, shares) in realtime.
  • Collect data to a threshold and then update to the recommendation model
  • Store model parameters , like tens of billions of raw features and billions of vector features, in high performance computing clusters.

They are implemented in the following steps:

  1. Online services record features in realtime.
  2. Write data into Kafka
  3. Ingest data from Kafka to Storm
  4. Populate full user profiles and prepare samples
  5. Update model parameters according to the latest samples
  6. Online modeling gains new knowledge

How to further reducing the latency? Recall Strategy

It is impossible to predict all the things with the model, considering the super-large scale of all the contents. Therefore, we need recall strategies to focus on a representative subset of the data. Performance is critical here and timeout is 50ms.

recall strategy

Among all the recall strategies, we take the InvertedIndex<Key, List<Article>> .

The Key can be topic, entity, source, etc.

Tags of InterestsRelevanceList of Documents
E-commerce0.3
Fun0.2
History0.2
Military0.1

Data Dependencies

  • Features depends on tags of user-side and content-side.
  • recall strategy depends on tags of user-side and content-side.
  • content analysis and data mining of user tags are cornerstone of the recommendation system.

Stream and Batch Processing Frameworks

· 2 min read

Why such frameworks?

  • process high-throughput in low latency.
  • fault-tolerance in distributed systems.
  • generic abstraction to serve volatile business requirements.
  • for bounded data set (batch processing) and for unbounded data set (stream processing).

Brief history of batch/stream processing

  1. Hadoop and MapReduce. Google made batch processing as simple as MR result = pairs.map((pair) => (morePairs)).reduce(somePairs => lessPairs) in a distributed system.
  2. Apache Storm and DAG Topology. MR doesn’t efficiently express iterative algorithms. Thus Nathan Marz abstracted stream processing into a graph of spouts and bolts.
  3. Spark in-memory Computing. Reynold Xin said Spark sorted the same data 3X faster using 10X fewer machines compared to Hadoop.
  4. Google Dataflow based on Millwheel and FlumeJava. Google supports both batch and streaming computing with the windowing API.
  1. its fast adoption of ==Google Dataflow==/Beam programming model.
  2. its highly efficient implementation of Chandy-Lamport checkpointing.

How?

Architectural Choices

To serve requirements above with commodity machines, the steaming framework use distributed systems in these architectures...

  • master-slave (centralized): apache storm with zookeeper, apache samza with YARN.
  • P2P (decentralized): apache s4.

Features

  1. DAG Topology for Iterative Processing. e.g. GraphX in Spark, topologies in Apache Storm, DataStream API in Flink.
  2. Delivery Guarantees. How guaranteed to deliver data from nodes to nodes? at-least once / at-most once / exactly once.
  3. Fault-tolerance. Using cold/warm/hot standby, checkpointing, or active-active.
  4. Windowing API for unbounded data set. e.g. Stream Windows in Apache Flink. Spark Window Functions. Windowing in Apache Beam.

Comparison

FrameworkStormStorm-tridentSparkFlink
Modelnativemicro-batchmicro-batchnative
Guarenteesat-least-onceexactly-onceexactly-onceexactly-once
Fault-toleranceRecord-Ackrecord-ackcheckpointcheckpoint
Overhead of fault-tolerancehighmediummediumlow
latencyvery-lowhighhighlow
throughputlowmediumhighhigh

Fraud Detection with Semi-supervised Learning

· One min read

Motivation

Leveraging user and device data during user login to fight against

  1. ATO (account takeovers)
  2. Botnet attacks

ATOs ranking from easy to hard to detect

  1. from single IP
  2. from IPs on the same device
  3. from IPs across the world
  4. from 100k IPs
  5. attacks on specific accounts
  6. phishing and malware

Solutions

Semi-supervised learning = unlabeled data + small amount of labeled data

Why? better learning accuracy than unsupervised learning + less time and costs than supervised learning

  • K-means: not good
  • DBSCAN: better. Use labels to
    1. Tune hyperparameter
    2. Constrain

Challenges

  • Manual feature selection
  • Feature evolution in adversarial environment
  • Scalability
  • No online DBSCAN

Architecture

Anti-fraud Query

Anti-fraud Featuring

Production Setup

  • Batch: 7 days worth of data, run DBSCAN hourly
  • Streaming: 60 minutes moving window, run streaming k-means
  • Used feedback signal success ratios to mark clusters as good, bad or unknown
  • Bad clusters: Always throw
  • Good clusters: Small % of attempts
  • Unknown clusters: X% of attempts

Designing Uber Ride-Hailing Service

· 3 min read

Disclaimer: All content below is sourced from public resources or purely original. No confidential information regarding Uber is included here.

Requirements

  • Provide services for the global transportation market
  • Large-scale real-time scheduling
  • Backend design

Architecture

uber architecture

Why Microservices?

==Conway's Law== The structure of a software system corresponds to the organizational structure of the company.

Monolithic ==Service==Microservices
When team size and codebase are small, productivity✅ High❌ Low
==When team size and codebase are large, productivity==❌ Low✅ High (Conway's Law)
==Quality requirements for engineering==❌ High (Inadequately skilled developers can easily disrupt the entire system)✅ Low (Runtime is isolated)
Dependency version upgrades✅ Fast (Centralized management)❌ Slow
Multi-tenant support / Production-staging state isolation✅ Easy❌ Difficult (Each service must 1) either establish a staging environment connected to other services in staging 2) or support multi-tenancy across request contexts and data storage)
Debuggability, assuming the same modules, parameters, logs❌ Low✅ High (if distributed tracing is available)
Latency✅ Low (Local)❌ High (Remote)
DevOps costs✅ Low (High cost of build tools)❌ High (Capacity planning is difficult)

Combining monolithic ==codebase== and microservices can leverage the strengths of both.

Scheduling Service

  • Consistent hash addresses provided by geohash
  • Data is transient in memory, so no need for duplication. (CAP: AP over CP)
  • Use single-threaded or locked sharding to prevent double scheduling

Payment Service

==The key is to have asynchronous design==, as ACID transaction payment systems across multiple systems often have very long latencies.

User Profile Service and Trip Record Service

  • Use caching to reduce latency
  • As 1) support for more countries and regions increases 2) user roles (drivers, riders, restaurant owners, diners, etc.) gradually expand, providing user profile services for these users also faces significant challenges.

Notification Push Service

  • Apple Push Notification Service (unreliable)
  • Google Cloud Messaging (GCM) (can detect successful delivery) or
  • SMS services are generally more reliable

Task-Related Maturity

· One min read

Andy Grove emphasizes: ==The most important responsibility of a manager is to inspire their subordinates to perform at their best==.

Unfortunately, there is no single management style that fits everyone in all situations. The fundamental variable in finding the best management style is the task-related maturity (TRM) of the subordinates.

Subordinate's Work MaturityEffective Leadership Style
LowOrganized; Task-oriented; Detail-focused; Accurately points out the details of "when - what - how"
MediumPeople-oriented; Provides support; "Two-way communication" model
HighGoal-oriented "monitoring" model

A person's task-related maturity depends on the specific work project, and its improvement takes time. When task-related maturity reaches its highest level, the individual's knowledge and motivation will also reach a certain height, allowing their manager to successfully delegate work to them.

The key takeaway is: ==There is no good or bad management style; there is only effective and ineffective==.

Aaron Siedler: 'Change Aversion': Why Users Dislike Your New Products and Features (and How to Address It)

· One min read

What is "Change Aversion"?

In general, "change aversion" refers to the unrest and opposition that arise among users whenever you alter something they frequently use in your product. This unrest is almost always present in every version of products like Gmail, YouTube, and the iPhone.

How to Mitigate or Even Avoid "Change Aversion"?

  1. Inform users in advance and provide understanding afterward. Notify users of significant changes in the new version beforehand and explain the reasons for these changes, then offer guidance on how to transition afterward.
  2. Allow users to switch freely. Do not cut off users' ability to revert to the original version; do not leave them feeling isolated and helpless.
  3. Continuously follow up on user feedback.

Don't Let "Change Aversion" Be an Excuse for Poor Execution

Over time, whether changes to the product are good or bad will ultimately be revealed.

change aversion patterns

Task-Relevant Maturity

· One min read

Andy Grove emphasizes that ==a manager’s most important responsibility is to elicit top performance from his subordinates.==.

Unfortunately, one management style does not fit all the people in all the scenarios. A fundamental variable to find the best management style is task-relevant maturity (TRM) of the subordinates.

TRMEffective Management Style
lowstructured; task-oriented; detailed-oriented; instruct exactly "what/when/how mode"
mediumIndividual-oriented; support, "mutual-reasoning mode"
highgoal-oriented; monitoring mode

A person's TRM depends on the specific work items. It takes time to improve. When TRM reaches the highest level, the person's both knowledge-level and motivation are ready for her manager to delegate work.

The key here is to regard any management mode not as either good or bad but rather as effective or not effective.