At my last two jobs I faced a task that seemed insurmountable: identify and block fraudulent activity. Many commercial tools exist, but none are particularly suited to the job. At Ticketmaster I learned the lesson I would carry to Weedmaps: success requires customized workflows for identifying and remediating malicious, non-human traffic.
At both Ticketmaster and Weedmaps, we faced the same adversaries: bots, scrapers, fraudulent reviews, and other forms of malicious automation. The request was deceptively simple; the task was Sisyphean at best and Kafkaesque the rest of the time. How do you find a needle in a haystack? Easy—burn the hay and use a magnet. But this was more like finding one particular needle in a stack of needles, without knowing what made that needle unique.
During my time at Weedmaps I wasn’t able to fully realize the workflow I envisioned, but I did prototype a simple proof of concept that successfully identified fraudulent review traffic with surprising accuracy.
So I present to you, Dear Reader: “The Stack I Wish I Had.”
We start from a simple truth: we don’t know what malicious traffic looks like. We have no reliable data points to anchor on. But we do know what good traffic looks like. Good traffic ends in a conversion or sale, generates none of the errors that abuse leaves behind, and can be safely labeled as “known-good.” This is exactly the sort of problem machine learning was built for.
We can identify good traffic fairly simply. So how do we leverage this to identify bad traffic? By creating a model of known-good traffic, we can surface the outliers — the sessions that don’t quite fit the pattern of legitimate buyers. These outliers then become the seed data for supervised learning models that can classify and score malicious behavior.
The workflow I wish I had is built on a simple progression:
- Data Ingestion — get the raw traffic in.
- Feature Engineering — distill the patterns that separate real users from non-human traffic.
- Modeling — combine anomaly detection and supervised learning.
- Scoring & Reporting — translate model outputs into actionable risk scores.
- Enforcement — push those scores into the WAF to block or challenge bad actors.
Let’s start where everything starts: the data.
Data Ingestion
Every good workflow starts with the raw material: data. And in this case, that means traffic logs. Loads of them. CDN logs, WAF logs, load balancer logs, application events, even the humble server access log — all of it is signal. Some of it is noise. You don’t get to choose; you take it all in.
The stack I wish I had would make ingestion seamless. I’d want every request to flow into a central warehouse — in my case, Snowflake. Using Snowpipe Streaming or a Kafka/Kinesis pipeline, I’d stream events directly from the edge (Fastly, Cloudflare, or AWS WAF) into Snowflake’s Bronze layer. That’s the messy raw zone, where nothing is filtered, nothing is cleaned, and every quirky (and unreliable) User-Agent string is preserved. Think of it as the forensic record.
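As a rough sketch of that ingest path, assuming a Kafka topic that Snowpipe Streaming or a Snowflake connector drains on the other side (broker, topic, and field names here are placeholders), the producer side can be as simple as forwarding each edge event untouched:

```python
import json
from confluent_kafka import Producer

# Producer pointed at the cluster feeding the Snowflake Bronze layer
# (broker address and topic name are placeholders).
producer = Producer({"bootstrap.servers": "kafka.internal:9092"})

def ship_edge_event(raw_log: dict) -> None:
    """Forward one raw edge/WAF log record, untouched, to the Bronze topic."""
    producer.produce(
        "edge-logs-bronze",
        key=raw_log.get("request_id", "").encode(),
        value=json.dumps(raw_log).encode(),
    )

# Example: a single Fastly/Cloudflare-style log line already parsed to JSON.
ship_edge_event({
    "request_id": "abc123",
    "ts": "2024-01-01T00:00:00Z",
    "client_ip": "203.0.113.7",
    "user_agent": "Mozilla/5.0 ...",
    "path": "/api/v1/reviews",
    "status": 201,
})
producer.flush()
```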
From there, the Silver layer would normalize the chaos. Logs would be flattened into a consistent schema:
- session_id
- timestamp
- ip_address
- user_agent
- uri_path
- status_code
- bytes_sent
- referrer
This is where malformed JSON gets fixed, IPs get geolocated, and UA strings are parsed into device and browser families, then compared against passive fingerprints to identify UA manipulation.
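A minimal sketch of that normalization step, assuming the ua-parser-backed user_agents package and skipping geolocation for brevity (the raw field mappings are illustrative):

```python
from user_agents import parse as parse_ua  # UA parsing; geolocation omitted for brevity

def to_silver(raw: dict) -> dict:
    """Flatten one Bronze record into the Silver schema, tolerating missing keys."""
    ua = parse_ua(raw.get("user_agent", ""))
    return {
        "session_id": raw.get("session_id") or raw.get("request_id"),
        "timestamp": raw.get("ts"),
        "ip_address": raw.get("client_ip"),
        "user_agent": raw.get("user_agent", ""),
        "uri_path": raw.get("path", "/"),
        "status_code": int(raw.get("status", 0)),
        "bytes_sent": int(raw.get("bytes", 0)),
        "referrer": raw.get("referer") or raw.get("referrer"),
        # Parsed UA details ride along for later comparison against passive fingerprints.
        "ua_browser": ua.browser.family,
        "ua_device": ua.device.family,
        "ua_claims_bot": ua.is_bot,
    }
```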
Finally, in the Gold layer, traffic data becomes consumable features. Sessions are aggregated, user behavior is stitched together, and metadata from purchases is joined in. This is where we can clearly label known-good traffic: the sessions that actually resulted in a conversion.
Why bother with all this layering? Because fraud detection is messy. You need the Bronze layer when something slips through and you want to retrace the attack. You need the Silver layer to give your models consistency. And you need the Gold layer to train against something meaningful: traffic that really mattered to the business.
Feature Engineering
Raw logs are like ore: valuable, but only once refined, and refinement is a step most companies skip entirely. Feature engineering is the refinery — the process of distilling millions of requests into signals that can tell us something about intent.
At this stage, the goal isn’t to predict fraud directly. It’s to surface the patterns that separate “buyer” from “bot.”
Some examples of the features I’d want:
Velocity & Burstiness
- Requests per second, 95th and 99th percentile latencies.
- Dozens of product page hits in milliseconds? That’s not a human.
Account Validity
- Does the account use an email from a known temp email provider?
- Is the age of the email domain suspicious?
- Does the email host have an MX but no A record?
Path Consistency
- Did the session follow a funnel (browse → cart → checkout)?
- Does the activity hit the web front end or go directly to the API?
Header & Protocol Hygiene
- Ratio of HEAD requests to GET.
- Are cookies stable or regenerated every hit?
- Did the client execute the JavaScript beacon, or skip it entirely?
Device & Network Fingerprints
- ASN (autonomous system number): residential ISP vs. datacenter.
- Geo consistency: does the user teleport from California to Romania in 5 minutes?
- Do they search from Texas for a concert or product in California?
- User-Agent edit distance: how close is this string to a real browser UA? Does the passive fingerprint match the UA string?
Review/Interaction Signals
- Review submission rate per account age.
- Duplicate reviews across multiple accounts from the same IP.
Session Continuity
- Do actions cluster like a human’s (bursty, then pause)?
- Or do they march forward with machine-like precision?
The point isn’t to find a silver bullet feature. It’s to build a constellation of weak signals that, together, start to draw a clear outline of “this is what a real customer looks like.”
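To make a few of these concrete, here is a minimal pandas sketch over Silver-layer rows; the column names follow the schema above, and the path patterns are illustrative rather than anything I actually shipped:

```python
import pandas as pd

def session_features(df: pd.DataFrame) -> pd.DataFrame:
    """Distill per-session signals from normalized request logs.

    Expects Silver-layer columns: session_id, timestamp, uri_path, status_code.
    """
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    grouped = df.sort_values("timestamp").groupby("session_id")

    duration = (grouped["timestamp"].max() - grouped["timestamp"].min()).dt.total_seconds()
    return pd.DataFrame({
        # Velocity & burstiness: requests per second across the session.
        "req_per_sec": grouped.size() / duration.clip(lower=1.0),
        # Path consistency: did the session follow the funnel at all?
        "hit_cart": grouped["uri_path"].apply(lambda p: p.str.contains("/cart").any()),
        "hit_checkout": grouped["uri_path"].apply(lambda p: p.str.contains("/checkout").any()),
        # Error noise: abusive sessions tend to rack up 4xx/5xx responses.
        "error_ratio": grouped["status_code"].apply(lambda s: (s >= 400).mean()),
    })
```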
In my ideal stack, I’d use dbt to materialize these features in Snowflake, then expose them via a feature store (like Feast). That way, models get a clean and consistent diet, and analysts can reuse the same features without reinventing the wheel.
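If Feast were the feature store, registration could look roughly like this; the entity, source path, and field names are assumptions that mirror the sketch above, not a definitive setup:

```python
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Bool, Float32

# Entity keyed on the session identifier produced in the Silver layer.
session = Entity(name="session", join_keys=["session_id"])

# Offline source; in practice this would point at the dbt-materialized Gold
# tables in Snowflake rather than a local parquet file.
session_source = FileSource(
    path="data/gold_session_features.parquet",
    timestamp_field="event_timestamp",
)

session_traffic_features = FeatureView(
    name="session_traffic_features",
    entities=[session],
    ttl=timedelta(days=1),
    schema=[
        Field(name="req_per_sec", dtype=Float32),
        Field(name="error_ratio", dtype=Float32),
        Field(name="hit_checkout", dtype=Bool),
    ],
    source=session_source,
)
```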
With features in hand, we’re finally ready for the real heavy lifting: Modeling.
Modeling
Once the features are in place, the real fun begins: teaching machines to tell friend from foe. But fraud isn’t a binary “spam vs. ham” problem. It’s subtler, messier, and adversarial. That’s why I’d build it in two layers.
1. Anomaly Detection (the outlier lens)
Start with what we do know: good traffic. Sessions that end in a purchase. Train an anomaly detection model on these alone.
- Isolation Forests can quickly flag sessions that behave unlike any buyer you’ve ever seen.
- Autoencoders can compress “buyer behavior” into a smaller space, then highlight requests that don’t fit.
The output: an anomaly score — how far this session deviates from “buyer normal.”
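A minimal sketch of the isolation-forest half, trained only on converting sessions so that “normal” means “buyer-like” (feature matrix and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def fit_buyer_model(good_sessions: np.ndarray) -> IsolationForest:
    """Train on known-good (converting) sessions only."""
    model = IsolationForest(n_estimators=200, contamination="auto", random_state=42)
    model.fit(good_sessions)
    return model

def anomaly_scores(model: IsolationForest, sessions: np.ndarray) -> np.ndarray:
    """Higher = further from buyer-normal. score_samples() is higher for inliers,
    so negate and min-max scale into [0, 1] for the blended risk score later."""
    raw = -model.score_samples(sessions)
    return (raw - raw.min()) / (raw.max() - raw.min() + 1e-9)
```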
2. Supervised Learning (the fraud lens)
Outliers are interesting, but you also need a classifier trained on labeled bad traffic:
- Confirmed scrapers.
- Review farms you’ve already busted.
- Traffic that triggered honeypots.
Here, a gradient boosting model (XGBoost or LightGBM) works well. It handles imbalanced data, plays nicely with tabular features, and gives you explainability through feature importance.
Pro tip: consider purchasing fraudulent services and tagging them with a specific UA string — this is an underutilized and incredibly powerful indicator.
The output: a fraud probability — how likely the session belongs to a known class of malicious behavior.
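A sketch of the supervised half with XGBoost, assuming a labeled feature matrix where y=1 marks the confirmed scrapers, review farms, and honeypot hits described above:

```python
import numpy as np
from xgboost import XGBClassifier

def fit_fraud_classifier(X: np.ndarray, y: np.ndarray) -> XGBClassifier:
    """Gradient-boosted classifier over the tabular session features."""
    # scale_pos_weight counteracts heavy class imbalance (confirmed fraud is rare).
    imbalance = (y == 0).sum() / max((y == 1).sum(), 1)
    model = XGBClassifier(
        n_estimators=400,
        max_depth=6,
        learning_rate=0.05,
        scale_pos_weight=imbalance,
        eval_metric="logloss",
    )
    model.fit(X, y)
    return model

def fraud_probability(model: XGBClassifier, X: np.ndarray) -> np.ndarray:
    """How likely each session belongs to a known class of malicious behavior."""
    return model.predict_proba(X)[:, 1]
```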
3. Rules & Deterministic Checks (the sanity lens)
Machine learning is powerful, but some signals are too obvious to leave to chance. If a request comes from a known TOR exit node, or if an account posts 100 reviews in 60 seconds, you don’t need a model to tell you something’s off. Those rules should be encoded directly, weighted alongside the model outputs.
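These checks can live in something as plain as a table of predicates and weights; the specific rules and weights below are illustrative:

```python
# Each rule is a (name, predicate, weight) triple evaluated against session features.
RULES = [
    ("tor_exit_node", lambda s: s.get("is_tor_exit", False), 0.30),
    ("review_flood",  lambda s: s.get("reviews_last_60s", 0) >= 100, 0.40),
    ("temp_email",    lambda s: s.get("email_domain_is_disposable", False), 0.15),
]

def rule_score(session: dict) -> tuple[float, list[str]]:
    """Sum the weights of triggered rules and keep their names as reason codes."""
    total, reasons = 0.0, []
    for name, predicate, weight in RULES:
        if predicate(session):
            total += weight
            reasons.append(name)
    return total, reasons
```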
4. Blending It Together
Each lens has blind spots. Outlier detection finds “weird,” but weird doesn’t always mean bad. Supervised models are only as good as the labels you have. Rules catch the obvious but miss the clever.
That’s why the stack I wish I had would unify them in a weighted scoring system:
risk_score = 0.35 * anomaly_score + 0.45 * fraud_probability + Σ (rule_weight * rule_flag)
The result isn’t a binary yes/no. It’s a sliding scale of risk, tunable to your business’s tolerance for false positives. High scores can be blocked outright. Medium scores can be challenged. Low scores can be logged and watched.
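Wiring the three lenses together is only a few lines; the weights mirror the formula above and would be tuned against labeled holdout traffic rather than taken as gospel:

```python
def risk_score(anomaly_score: float, fraud_probability: float, rule_total: float,
               w_anomaly: float = 0.35, w_fraud: float = 0.45) -> float:
    """Blend the three lenses into a single bounded risk score in [0, 1]."""
    score = w_anomaly * anomaly_score + w_fraud * fraud_probability + rule_total
    return min(score, 1.0)

# Example: very buyer-unlike session, moderately fraud-like, one rule triggered.
print(risk_score(0.9, 0.6, 0.30))  # -> 0.885
```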
With this model, you don’t need to know what bad actors look like; you create a feedback loop that adapts as they evolve.
Scoring & Reporting
Models are only useful if their outputs can be understood, trusted, and acted upon. A raw risk score on its own doesn’t mean much. What matters is how you present that score, and what decisions it drives.
Weighted Scoring
As described earlier, the stack would unify anomaly scores, fraud probabilities, and rule triggers into a single risk score. This score isn’t binary; it’s a spectrum:
- 0.0 – 0.4: Normal. Observe only.
- 0.4 – 0.7: Suspicious. Flag or challenge.
- 0.7 – 1.0: High risk. Block or escalate.
These thresholds shouldn’t be static. They should be tuned per use case: checkout flows, product reviews, or high-value catalog scraping each have different risk appetites.
Transparency & Explainability
For security analysts and engineers, it’s not enough to see “risk score = 0.82.” They need to know why.
- Show the top features influencing the decision (e.g., “100 requests in 10s from a hosting provider ASN”).
- Store these reason codes alongside each score for future investigation.
- Aggregate reason codes to spot trends: “60% of high-risk traffic today came from one ASN in Singapore.”
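One way to produce those reason codes for the supervised model is SHAP; this sketch assumes the gradient-boosted classifier from the modeling step, with feature names coming from whatever the feature store exposes:

```python
import numpy as np
import shap

def reason_codes(model, x_row: np.ndarray, feature_names: list[str], top_k: int = 3) -> list[str]:
    """Return the top-k features pushing this session's score up or down."""
    explainer = shap.TreeExplainer(model)
    contributions = explainer.shap_values(x_row.reshape(1, -1))[0]
    top = np.argsort(-np.abs(contributions))[:top_k]
    return [
        f"{feature_names[i]}={x_row[i]:.2f} (shap {contributions[i]:+.2f})"
        for i in top
    ]
```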
Dashboards & Visualizations
Executives want big-picture impact. Analysts want drill-downs. Both should be served.
- Operational Dashboards: real-time charts of blocked/challenged/allowed traffic; heatmaps of suspicious IP ranges.
- Investigations UI: session-level drilldowns (timeline of requests, features, score breakdown, SHAP explanations).
- Business Impact Reports: weekly summaries such as “Blocked 1.2M requests this week, prevented ~20K fraudulent reviews, saved an estimated $X in bandwidth & chargebacks.”
We can accomplish this with a mix of Snowsight, Metabase, or Hex for dashboards, and a Streamlit app for analysts. The key is that everyone, from engineers to execs, can see both the forest and the trees.
Closing the Loop
Finally, the scores shouldn’t just sit in dashboards. They should flow back into operations:
- High scores → auto-generate WAF rules (with TTLs).
- Medium scores → feed review queues for human analysts.
- Low scores → stay in the data warehouse, enriching the next model training set.
This is the feedback cycle that keeps the system alive: models produce scores → scores drive actions → actions produce new data → new data retrains the models.
Enforcement (WAF Integration)
A risk score without an enforcement path is just a number on a dashboard. To make this stack truly useful, the scores have to shape traffic at the edge — where the bad actors live.
Graduated Responses
Not every suspicious session deserves the ban hammer. The stack should drive tiered actions:
- Observe (score < T1): Log the request, enrich with metadata, do nothing else.
- Challenge (T1 ≤ score < T2): Slow the client down or force proof-of-work — CAPTCHA, JavaScript puzzle, or rate limiting with backoff. Legitimate humans will pass; bots will choke.
- Block (score ≥ T2): Drop traffic outright with a 403 or 429. High-confidence malicious actors never touch the app.
This laddered approach protects conversions while still cutting off the worst offenders.
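The tiering itself is trivial to encode, with T1/T2 passed in per surface so checkout, reviews, and catalog pages can each carry their own thresholds:

```python
def enforcement_action(score: float, t1: float = 0.4, t2: float = 0.7) -> str:
    """Map a blended risk score to a graduated response tier."""
    if score >= t2:
        return "block"      # 403/429 at the edge
    if score >= t1:
        return "challenge"  # CAPTCHA, JS proof-of-work, or rate limit with backoff
    return "observe"        # log and enrich only

# A reviews endpoint can run tighter thresholds than the storefront, for example.
print(enforcement_action(0.55, t1=0.3, t2=0.6))  # -> "challenge"
```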
Automated Rule Pushes
The enforcement engine should push risk intelligence back to the WAF in near-real-time.
- Cloudflare/AWS WAF: update IP sets or firewall expressions via API.
- Fastly: use Edge Dictionaries + VCL to block or challenge risky fingerprints.
- Custom edge stack: pull fresh rules from Snowflake every few minutes and cache them locally.
Every rule should have a time-to-live (TTL) — say, 2–24 hours. If malicious behavior persists, the score refreshes and the rule renews. If not, it expires gracefully. This prevents “set it and forget it” lists that accidentally block legitimate users.
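For AWS WAF, the push might look roughly like this with boto3’s WAFv2 API; the IP-set name, scope, and ID are placeholders, and the TTL bookkeeping is assumed to live in the warehouse job that decides what goes into each refresh:

```python
import boto3

def push_blocklist(addresses: list[str], ip_set_name: str, ip_set_id: str,
                   scope: str = "REGIONAL") -> None:
    """Replace the WAF IP set with the latest high-risk addresses (CIDR strings)."""
    wafv2 = boto3.client("wafv2")
    current = wafv2.get_ip_set(Name=ip_set_name, Scope=scope, Id=ip_set_id)
    wafv2.update_ip_set(
        Name=ip_set_name,
        Scope=scope,
        Id=ip_set_id,
        Addresses=addresses,             # full replacement, not a diff
        LockToken=current["LockToken"],  # WAFv2 requires optimistic locking
    )

# Addresses whose risk score crossed the blocking threshold in the latest run.
push_blocklist(["203.0.113.7/32", "198.51.100.0/24"],
               ip_set_name="fraud-high-risk", ip_set_id="REPLACE_WITH_IP_SET_ID")
```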
Feedback Loop
When the WAF acts, that event needs to flow back into the data warehouse. Blocked requests become new labeled data. Challenged-but-passed sessions can be studied as edge cases. This closes the loop and sharpens the next generation of the model.
Guardrails & Overrides
Because false positives hurt business, enforcement must include safety valves:
- Allowlists for VIP customers, internal traffic, and monitoring services.
- Canary periods where rules apply only to a percentage of traffic before full rollout.
- Analyst review queues for borderline sessions — so a human can override before harm is done.
With these guardrails in place, the system isn’t just blocking blindly. It’s adapting, balancing protection with business impact.
Conclusion
At Ticketmaster and Weedmaps the mandate was always the same: “stop the bad traffic.” On paper it sounded simple; in practice it was an exercise in frustration. Each time we solved one problem, another variation would appear.
The stack I wish I had is about breaking that cycle. By anchoring on what we do know — the shape of real customer traffic — and layering ingestion, features, models, scoring, and enforcement, the problem becomes tractable. Not easy, but structured. Not endless, but iterative.
Every piece of this workflow exists today:
- Snowflake/dbt for clean data layers.
- Feature stores for consistency.
- XGBoost/autoencoders for modeling.
- Redis/k8s for serving.
- WAF APIs for action at the edge.
The feedback loop that turns raw logs into intelligence is the true key to rolling the boulder up the hill.
Bots, scrapers, and fraudsters will keep adapting. But with the right stack, so can we. Instead of endlessly chasing needles in stacks of needles, we can build systems that find a bad needle without knowing what it looks like — and learn from every pass.