Skip to main content

Stream and Batch Processing Frameworks

· 2 min read

Why such frameworks?

  • process high-throughput in low latency.
  • fault-tolerance in distributed systems.
  • generic abstraction to serve volatile business requirements.
  • for bounded data set (batch processing) and for unbounded data set (stream processing).

Brief history of batch/stream processing

  1. Hadoop and MapReduce. Google made batch processing as simple as MR result = pairs.map((pair) => (morePairs)).reduce(somePairs => lessPairs) in a distributed system.
  2. Apache Storm and DAG Topology. MR doesn’t efficiently express iterative algorithms. Thus Nathan Marz abstracted stream processing into a graph of spouts and bolts.
  3. Spark in-memory Computing. Reynold Xin said Spark sorted the same data 3X faster using 10X fewer machines compared to Hadoop.
  4. Google Dataflow based on Millwheel and FlumeJava. Google supports both batch and streaming computing with the windowing API.
  1. its fast adoption of ==Google Dataflow==/Beam programming model.
  2. its highly efficient implementation of Chandy-Lamport checkpointing.

How?

Architectural Choices

To serve requirements above with commodity machines, the steaming framework use distributed systems in these architectures...

  • master-slave (centralized): apache storm with zookeeper, apache samza with YARN.
  • P2P (decentralized): apache s4.

Features

  1. DAG Topology for Iterative Processing. e.g. GraphX in Spark, topologies in Apache Storm, DataStream API in Flink.
  2. Delivery Guarantees. How guaranteed to deliver data from nodes to nodes? at-least once / at-most once / exactly once.
  3. Fault-tolerance. Using cold/warm/hot standby, checkpointing, or active-active.
  4. Windowing API for unbounded data set. e.g. Stream Windows in Apache Flink. Spark Window Functions. Windowing in Apache Beam.

Comparison

FrameworkStormStorm-tridentSparkFlink
Modelnativemicro-batchmicro-batchnative
Guarenteesat-least-onceexactly-onceexactly-onceexactly-once
Fault-toleranceRecord-Ackrecord-ackcheckpointcheckpoint
Overhead of fault-tolerancehighmediummediumlow
latencyvery-lowhighhighlow
throughputlowmediumhighhigh

Fraud Detection with Semi-supervised Learning

· 4 min read

Clarify Requirements

Calculate risk probability scores in realtime and make decisions along with a rule engine to prevent ATO (account takeovers) and Botnet attacks.

Train clustering fatures with online and offline pipelines

  1. Source from website logs, auth logs, user actions, transactions, high-risk accounts in watch list
  2. track event data in kakfa topics
  3. Process events and prepare clustering features

Realtime scoring and rule-based decision

  1. assess a risk score comprehensively for online services

  2. Maintain flexibility with manually configuration in a rule engine

  3. share, or use the insights in online services

ATOs ranking from easy to hard to detect

  1. from single IP
  2. from IPs on the same device
  3. from IPs across the world
  4. from 100k IPs
  5. attacks on specific accounts
  6. phishing and malware

Challenges

  • Manual feature selection
  • Feature evolution in adversarial environment
  • Scalability
  • No online DBSCAN

High-level Architecture

Core Components and Workflows

Semi-supervised learning = unlabeled data + small amount of labeled data

Why? better learning accuracy than unsupervised learning + less time and costs than supervised learning

Training: To prepare clustering features in database

  • Streaming Pipeline on Spark:
    • Runs continuously in real-time.
    • Performs feature normalization and categorical transformation on the fly.
      • Feature Normalization: Scale your numeric features (e.g., age, income) so that they are between 0 and 1.
      • Categorical Feature Transformation: Apply one-hot encoding or another transformation to convert categorical features into a numeric format suitable for the machine learning model.
    • Uses Spark MLlib’s K-means to cluster streaming data into groups.
      • After running k-means and forming clusters, you might find that certain clusters have more instances of fraud.
      • Once you’ve labeled a cluster as fraudulent based on historical data or expert knowledge, you can use that cluster assignment during inference. Any new data point assigned to that fraudulent cluster can be flagged as suspicious.
  • Hourly Cronjob Pipeline:
    • Runs periodically every hour (batch processing).
    • Applies thresholding to identify anomalies based on results from the clustering model.
    • Tunes parameters of the DBSCAN algorithm to improve clustering and anomaly detection.
    • Uses DBSCAN from scikit-learn to find clusters and detect outliers in batch data.
      • DBSCAN, which can detect outliers, might identify clusters of regular transactions and separate them from noise, which could be unusual, potentially fraudulent transactions.
      • Transactions in the noisy or outlier regions (points that don’t belong to any dense cluster) can be flagged as suspicious.
      • After identifying a cluster as fraudulent, DBSCAN helps detect patterns of fraud even in irregularly shaped transaction distributions.

Serving

The serving layer is where the rubber meets the road - where we turn our machine learning models and business rules into actual fraud prevention decisions. Here's how it works:

  • Fraud Detection Scoring Service:
    • Takes real-time features extracted from incoming requests
    • Applies both clustering models (K-means from streaming and DBSCAN from batch)
    • Combines scores with streaming counters (like login attempts per IP)
    • Outputs a unified risk score between 0 and 1
  • Rule Engine:
    • Acts as the "brain" of the system
    • Combines ML scores with configurable business rules
    • Examples of rules:
      • If risk score > 0.8 AND user is accessing from new IP → require 2FA
      • If risk score > 0.9 AND account is high-value → block transaction
    • Rules are stored in a database and can be updated without code changes
    • Provides an admin portal for security teams to adjust rules
  • Integration with Other Services:
    • Exposes REST APIs for real-time scoring
    • Publishes results to streaming counters for monitoring
    • Feeds decisions back to the training pipeline to improve model accuracy
  • Observability:
    • Tracks key metrics like false positive/negative rates
    • Monitors model drift and feature distribution changes
    • Provides dashboards for security analysts to investigate patterns
    • Logs detailed information for post-incident analysis

Designing Uber Ride-Hailing Service

· 3 min read

Disclaimer: All content below is sourced from public resources or purely original. No confidential information regarding Uber is included here.

Requirements

  • Provide services for the global transportation market
  • Large-scale real-time scheduling
  • Backend design

Architecture

uber architecture

Why Microservices?

==Conway's Law== The structure of a software system corresponds to the organizational structure of the company.

Monolithic ==Service==Microservices
When team size and codebase are small, productivity✅ High❌ Low
==When team size and codebase are large, productivity==❌ Low✅ High (Conway's Law)
==Quality requirements for engineering==❌ High (Inadequately skilled developers can easily disrupt the entire system)✅ Low (Runtime is isolated)
Dependency version upgrades✅ Fast (Centralized management)❌ Slow
Multi-tenant support / Production-staging state isolation✅ Easy❌ Difficult (Each service must 1) either establish a staging environment connected to other services in staging 2) or support multi-tenancy across request contexts and data storage)
Debuggability, assuming the same modules, parameters, logs❌ Low✅ High (if distributed tracing is available)
Latency✅ Low (Local)❌ High (Remote)
DevOps costs✅ Low (High cost of build tools)❌ High (Capacity planning is difficult)

Combining monolithic ==codebase== and microservices can leverage the strengths of both.

Scheduling Service

  • Consistent hash addresses provided by geohash
  • Data is transient in memory, so no need for duplication. (CAP: AP over CP)
  • Use single-threaded or locked sharding to prevent double scheduling

Payment Service

==The key is to have asynchronous design==, as ACID transaction payment systems across multiple systems often have very long latencies.

User Profile Service and Trip Record Service

  • Use caching to reduce latency
  • As 1) support for more countries and regions increases 2) user roles (drivers, riders, restaurant owners, diners, etc.) gradually expand, providing user profile services for these users also faces significant challenges.

Notification Push Service

  • Apple Push Notification Service (unreliable)
  • Google Cloud Messaging (GCM) (can detect successful delivery) or
  • SMS services are generally more reliable

Task-Related Maturity

· One min read

Andy Grove emphasizes: ==The most important responsibility of a manager is to inspire their subordinates to perform at their best==.

Unfortunately, there is no single management style that fits everyone in all situations. The fundamental variable in finding the best management style is the task-related maturity (TRM) of the subordinates.

Subordinate's Work MaturityEffective Leadership Style
LowOrganized; Task-oriented; Detail-focused; Accurately points out the details of "when - what - how"
MediumPeople-oriented; Provides support; "Two-way communication" model
HighGoal-oriented "monitoring" model

A person's task-related maturity depends on the specific work project, and its improvement takes time. When task-related maturity reaches its highest level, the individual's knowledge and motivation will also reach a certain height, allowing their manager to successfully delegate work to them.

The key takeaway is: ==There is no good or bad management style; there is only effective and ineffective==.

Aaron Siedler: 'Change Aversion': Why Users Dislike Your New Products and Features (and How to Address It)

· One min read

What is "Change Aversion"?

In general, "change aversion" refers to the unrest and opposition that arise among users whenever you alter something they frequently use in your product. This unrest is almost always present in every version of products like Gmail, YouTube, and the iPhone.

How to Mitigate or Even Avoid "Change Aversion"?

  1. Inform users in advance and provide understanding afterward. Notify users of significant changes in the new version beforehand and explain the reasons for these changes, then offer guidance on how to transition afterward.
  2. Allow users to switch freely. Do not cut off users' ability to revert to the original version; do not leave them feeling isolated and helpless.
  3. Continuously follow up on user feedback.

Don't Let "Change Aversion" Be an Excuse for Poor Execution

Over time, whether changes to the product are good or bad will ultimately be revealed.

change aversion patterns

Task-Relevant Maturity

· One min read

Andy Grove emphasizes that ==a manager’s most important responsibility is to elicit top performance from his subordinates.==.

Unfortunately, one management style does not fit all the people in all the scenarios. A fundamental variable to find the best management style is task-relevant maturity (TRM) of the subordinates.

TRMEffective Management Style
lowstructured; task-oriented; detailed-oriented; instruct exactly "what/when/how mode"
mediumIndividual-oriented; support, "mutual-reasoning mode"
highgoal-oriented; monitoring mode

A person's TRM depends on the specific work items. It takes time to improve. When TRM reaches the highest level, the person's both knowledge-level and motivation are ready for her manager to delegate work.

The key here is to regard any management mode not as either good or bad but rather as effective or not effective.

Ads Ecosystem

· One min read

Ads Ecosystem

  • Brand / Advertiser: individuals or organizations who want to publish advertising messages to the customers.

  • Agency: they help the brand to interact with the rest of the ecosystem and manage the whole lifecycle of the advertising messages, including planning, creating, and distributing ad campaigns.

  • Trading Desk: It streamlines the media buying process.

  • Demand-side Platform (DSP): it automates online ad inventory and buying, helping agencies to manage accounts across different accounts and campaigns through one platform.

  • Data-management Platform (DMP)

    1. Ads-based Analytics: attrition, targeting, profiling, session replay, and more.
    2. Anti-fraud
    3. Market-based Analytics
  • Ad Exchange / Real-time Bidding (RTB): It matches ads suppliers with buyers.

  • Ad Network: It aggregates publisher inventory and sells it to advertisers.

  • Supply Side Platform (SSP): It monitors the entire ads inventory and suggest prices for ad space.

  • Publisher: Ad-space owners like website operators.

Canadians Complain About Online Payment Systems

· One min read

Since 1990, the internet's 402 Payment Required status code has been designed, yet nearly thirty years later, online payments still do not provide a seamless experience.

  1. I saw an article online and just wanted to pay to support it, but you made me create an account, set a password, fill in my credit card number, and then I get bombarded with endless spam.
  2. Payments should be as convenient as browsing—one-click payments without needing to fill in an account, just using crypto tokens. However, centralized payments are monopolized by large companies.
  3. Content creators are forced to rely on ads; only the big players can win, and they become shameless in their pursuit of money.
  4. The internet urgently needs a foundational framework that incentivizes creativity and protects privacy. While Bitcoin is not the answer, blockchain has potential.

Aaron Sedley: Change aversion: why users hate what you launched (and what to do about it)

· One min read

What is change aversion?

By and large, anytime you change what people regularly use in a product, they will always throw an uproar. This happens to almost every release of products like Gmail, YouTube, iPhone, etc.

How to avoid or mitigate change aversion?

  1. Let users understand, in advance and afterward. Warn them about the significant changes early and communicate why those places changed. Provide transition instructions afterward.
  2. Let users switch. Don’t shut the door and leave them alone in the helplessness.
  3. Let users give feedbacks and follow through.

Change Aversion isn’t an Excuse

The product changes may turn out to be good or bad ones.

change aversion patterns

Differences and Similarities in Eastern and Western Workplace Cultures

· 2 min read

Here are some immature observations. I have never worked in China, so I filled in the section for Chinese people based on what I heard and saw in school.

Differences

What Americans ValueWhat Chinese People Value
Persuasiveness
Public speaking is the foundation of a democratic society
Actions speak louder than words
A gentleman is cautious in speech but quick in action
Being emotional is shameful/unprofessional
Hypocrisy/professionalism
Being genuine
Losing face
Meetings are for making decisionsMeetings are for finalizing decisions
Preference for diamondsPreference for jade
Confidence
Even if it’s not possible, say it is
I am better than others
Humility
Even if it’s possible, say it’s not
Others are better than me
Independent thinking
Challenging authority
Team spirit
Obedience and execution

Similarities

Both Americans and Chinese value:

  • Business acumen
  • Vision and strategic decision-making ability
  • Unity is strength (The Chinese people I have worked with in the past were very united, although many have biases against this)
  • People are more important than tasks; relationships determine outcomes
  • Flexibility in responding to VUCA (Volatility, Uncertainty, Complexity, Ambiguity)