Fraud Detection with Semi-supervised Learning

February 13, 2019 · 4 min read

Clarify Requirements

Calculate risk probability scores in real time and combine them with a rule engine to make decisions that prevent account takeovers (ATOs) and botnet attacks.

Train clustering features with online and offline pipelines

Source data from website logs, auth logs, user actions, transactions, and high-risk accounts on a watch list
Track event data in Kafka topics
Process events and prepare clustering features
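As a rough illustration of the feature-preparation step, the sketch below folds a batch of auth events into per-IP counters. The event fields (`ip`, `account`, `success`), the feature names, and the sample data are assumptions made for this example, not the schema of the actual pipeline.

```python
from collections import defaultdict

def build_ip_features(events):
    """Aggregate raw auth events into one feature dict per source IP."""
    attempts = defaultdict(int)
    failures = defaultdict(int)
    accounts = defaultdict(set)
    for e in events:
        ip = e["ip"]
        attempts[ip] += 1
        if not e["success"]:
            failures[ip] += 1
        accounts[ip].add(e["account"])
    return {
        ip: {
            "attempts": attempts[ip],
            "failure_rate": failures[ip] / attempts[ip],
            "distinct_accounts": len(accounts[ip]),
        }
        for ip in attempts
    }

# Hypothetical sample events; in a real system these would be read from Kafka topics.
events = [
    {"ip": "10.0.0.1", "account": "alice", "success": True},
    {"ip": "10.0.0.2", "account": "bob", "success": False},
    {"ip": "10.0.0.2", "account": "carol", "success": False},
    {"ip": "10.0.0.2", "account": "dave", "success": False},
    {"ip": "10.0.0.2", "account": "dave", "success": True},
]
features = build_ip_features(events)
```

An IP that touches many distinct accounts with a high failure rate, like the second one here, is the kind of point a clustering step could later separate from normal traffic.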

Realtime scoring and rule-based decision

Assess a comprehensive risk score for online services
Maintain flexibility with manual configuration in a rule engine
Share or use the insights in online services

ATOs ranked from easiest to hardest to detect

from single IP
from IPs on the same device
from IPs across the world
from 100k IPs
attacks on specific accounts
phishing and malware

Challenges

Manual feature selection
Feature evolution in adversarial environment
Scalability
No online (streaming) DBSCAN, so it has to run in batch

High-level Architecture

Core Components and Workflows

Semi-supervised learning = unlabeled data + small amount of labeled data

Why? It gives better accuracy than unsupervised learning while taking less time and cost than supervised learning.

Training: To prepare clustering features in database

Streaming Pipeline on Spark:
- Runs continuously in real-time.
- Performs feature normalization and categorical transformation on the fly.
  - Feature Normalization: Scale your numeric features (e.g., age, income) so that they are between 0 and 1.
  - Categorical Feature Transformation: Apply one-hot encoding or another transformation to convert categorical features into a numeric format suitable for the machine learning model.
- Uses Spark MLlib’s K-means to cluster streaming data into groups.
  - After running k-means and forming clusters, you might find that certain clusters have more instances of fraud.
  - Once you’ve labeled a cluster as fraudulent based on historical data or expert knowledge, you can use that cluster assignment during inference. Any new data point assigned to that fraudulent cluster can be flagged as suspicious.
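The streaming steps above can be sketched in plain Python, leaving Spark itself out. The snippet assumes the centroids were already produced by K-means upstream and that a small labeled set of historical points is available. The helper names (`min_max_scale`, `one_hot`, `nearest_cluster`, `flag_fraud_clusters`), the sample data, and the 0.5 fraud-rate threshold are all hypothetical choices for illustration.

```python
import math
from collections import Counter

def min_max_scale(values):
    """Scale numeric values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over a fixed category list."""
    return [1.0 if value == c else 0.0 for c in categories]

def nearest_cluster(point, centroids):
    """Return the index of the centroid closest to the point (Euclidean)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def flag_fraud_clusters(labelled_points, centroids, min_fraud_rate=0.5):
    """Mark a cluster as fraudulent when enough of its labeled points are fraud."""
    total, fraud = Counter(), Counter()
    for point, is_fraud in labelled_points:
        c = nearest_cluster(point, centroids)
        total[c] += 1
        fraud[c] += int(is_fraud)
    return {c for c in total if fraud[c] / total[c] >= min_fraud_rate}

# Assumed centroids, e.g. exported from a K-means model trained upstream.
centroids = [[0.1, 0.1], [0.9, 0.8]]
# Small labeled set: (normalized feature vector, is_fraud).
labelled = [([0.12, 0.05], False), ([0.2, 0.1], False),
            ([0.85, 0.9], True), ([0.95, 0.7], True), ([0.8, 0.8], False)]
fraud_clusters = flag_fraud_clusters(labelled, centroids)

# Inference: a new point is suspicious if it lands in a flagged cluster.
new_point = [0.88, 0.75]
is_suspicious = nearest_cluster(new_point, centroids) in fraud_clusters
```

This is where the semi-supervised part shows up: the clusters come from unlabeled data, and only a handful of labeled points are needed to decide which clusters to treat as fraudulent.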
Hourly Cronjob Pipeline:
- Runs every hour (batch processing).
- Applies thresholding to identify anomalies based on results from the clustering model.
- Tunes parameters of the DBSCAN algorithm to improve clustering and anomaly detection.
- Uses DBSCAN from scikit-learn to find clusters and detect outliers in batch data.
  - DBSCAN groups regular transactions into dense clusters and labels the remaining points as noise; those noise points may be unusual, potentially fraudulent transactions.
  - Transactions in the noisy or outlier regions (points that don’t belong to any dense cluster) can be flagged as suspicious.
  - Because DBSCAN is density-based, it can find clusters of arbitrary shape, so a cluster identified as fraudulent can still be detected when its transactions form an irregularly shaped distribution.
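A minimal sketch of the batch step with scikit-learn's `DBSCAN` follows. The data is synthetic and already normalized, and the `eps` and `min_samples` values are assumptions picked to suit this toy data; they stand in for the parameters the hourly job would tune.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic, already-normalized features: two dense groups plus two stray points.
rng = np.random.default_rng(0)
normal_a = rng.normal(loc=[0.2, 0.2], scale=0.01, size=(50, 2))
normal_b = rng.normal(loc=[0.7, 0.6], scale=0.01, size=(50, 2))
strays = np.array([[0.95, 0.05], [0.05, 0.95]])
X = np.vstack([normal_a, normal_b, strays])

# eps and min_samples are the parameters the hourly job would tune.
labels = DBSCAN(eps=0.05, min_samples=5).fit_predict(X)

# DBSCAN assigns the label -1 to noise points; treat those as suspicious.
suspicious_idx = np.flatnonzero(labels == -1)
n_clusters = len(set(labels.tolist()) - {-1})
```

The stray points sit at indices 100 and 101 of `X`, far from either dense group, so they fall outside every cluster and end up in `suspicious_idx`.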

Serving

The serving layer is where the rubber meets the road - where we turn our machine learning models and business rules into actual fraud prevention decisions. Here's how it works:

Fraud Detection Scoring Service:
- Takes real-time features extracted from incoming requests
- Applies both clustering models (K-means from streaming and DBSCAN from batch)
- Combines scores with streaming counters (like login attempts per IP)
- Outputs a unified risk score between 0 and 1
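One simple way to fuse these signals into a single score is a weighted average of per-signal scores that each already lie in [0, 1]. The post does not say how the scores are actually combined, so the weights, the saturation point of 20 attempts, and the function names below are assumptions.

```python
def counter_score(login_attempts, saturation=20):
    """Map a raw per-IP login-attempt count to [0, 1], capping at `saturation`."""
    return min(login_attempts, saturation) / saturation

def unified_risk_score(kmeans_score, dbscan_score, login_attempts,
                       weights=(0.4, 0.4, 0.2)):
    """Weighted average of three signals; inputs and output lie in [0, 1]."""
    signals = (kmeans_score, dbscan_score, counter_score(login_attempts))
    score = sum(w * s for w, s in zip(weights, signals)) / sum(weights)
    # Clamp to guard against floating-point drift outside [0, 1].
    return max(0.0, min(1.0, score))

risk = unified_risk_score(kmeans_score=1.0, dbscan_score=0.5, login_attempts=10)
```

Keeping the weights as a parameter makes it easy to rebalance the signals as attackers adapt, without touching the models themselves.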
Rule Engine:
- Acts as the "brain" of the system
- Combines ML scores with configurable business rules
- Examples of rules:
  - If risk score > 0.8 AND user is accessing from new IP → require 2FA
  - If risk score > 0.9 AND account is high-value → block transaction
- Rules are stored in a database and can be updated without code changes
- Provides an admin portal for security teams to adjust rules
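The two example rules can be expressed as data rather than code, which is what makes them editable without a deploy. In this sketch the `Rule` fields, the context flag names, and the first-match-wins ordering are assumptions; in production the rows would be loaded from the rules database instead of a module-level list.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    min_score: float   # rule fires only when the risk score is strictly above this
    condition: str     # name of a boolean flag in the request context
    action: str        # decision returned when the rule fires

# Rules as data, ordered from most to least severe so the first match wins.
RULES = [
    Rule(min_score=0.9, condition="high_value_account", action="block_transaction"),
    Rule(min_score=0.8, condition="new_ip", action="require_2fa"),
]

def decide(risk_score, context, rules=RULES):
    """Return the action of the first matching rule, or 'allow' if none match."""
    for rule in rules:
        if risk_score > rule.min_score and context.get(rule.condition, False):
            return rule.action
    return "allow"

decision = decide(0.85, {"new_ip": True})
```

Here `decision` is `"require_2fa"`: the score clears 0.8 and the request comes from a new IP, but it does not reach the 0.9 threshold of the blocking rule.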
Integration with Other Services:
- Exposes REST APIs for real-time scoring
- Publishes results to streaming counters for monitoring
- Feeds decisions back to the training pipeline to improve model accuracy
Observability:
- Tracks key metrics like false positive/negative rates
- Monitors model drift and feature distribution changes
- Provides dashboards for security analysts to investigate patterns
- Logs detailed information for post-incident analysis
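Two of these metrics are easy to sketch. The first function computes false positive and false negative rates from past decisions and confirmed outcomes. The second computes the population stability index (PSI), one common way to quantify a shift in a binned feature distribution; the post does not name a drift metric, so PSI is an assumption here, as are the function names and sample values.

```python
import math

def error_rates(decisions, labels):
    """False positive and false negative rates of flagged decisions vs. confirmed fraud."""
    fp = sum(d and not y for d, y in zip(decisions, labels))
    fn = sum(y and not d for d, y in zip(decisions, labels))
    negatives = sum(not y for y in labels)
    positives = sum(bool(y) for y in labels)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population stability index between two binned distributions (bin fractions)."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total

# Hypothetical history: whether each request was flagged, and whether it was fraud.
decisions = [True, True, False, False, True]
labels = [True, False, False, True, True]
fpr, fnr = error_rates(decisions, labels)

# Training-time vs. current share of traffic in two feature bins.
drift = psi([0.5, 0.5], [0.8, 0.2])
```

A PSI of zero means the two distributions match; larger values indicate that the feature has moved away from what the model saw during training.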

References:

https://www.slideshare.net/Hadoop_Summit/semisupervised-learning-in-an-adversarial-environment
