Turning 3,652 anonymous website visitors into 62 sales-actionable leads.

A 60-year-old industrial manufacturer with a global customer base couldn't tell which web visitors were actual prospects vs. crawlers, scanners, and noise. We built a daily-refreshed sales-intent dashboard, matched against their existing customer list. Six weeks, in production.

Industry

Industrial Manufacturing

Engagement

6 weeks · fixed scope

Stack

Python, Flask, MySQL, ip-api, OSM Overpass, Nominatim

Status

In production

The before state

The client makes industrial valves for oil & gas, petrochemical, and pipeline operators. Their website got steady traffic, but the AWStats output was unusable for sales: 3,652 unique IPs over a 90-day window, dominated by Asian crawlers, security scanners, and AWS/Azure/GCP automation pinging the RFQ form.

When sales asked "who's actually looking at our products this week?", the answer was: here's a list of IP addresses, good luck. The team had a 329-customer Sage ERP export and a distributor list of 266 customer/location pairs with cities, but no way to connect either to the live web traffic.

What we shipped

1. Pipeline: log → enriched leads

A daily cron job parses the Plesk access logs (90-day rolling window across rotated archives), filters crawlers by user-agent, drops attack-probe paths, then enriches every surviving IP with:

Reverse DNS via parallel gethostbyaddr with a 2s timeout (so a single slow nameserver doesn't stall the run).
Geo + ASN lookup via ip-api batch endpoint — city, region, country, lat/lon, ISP, organization.
Automation-noise filter — a tightly maintained regex of cloud, hosting, scanner, CDN, and AI-scraper ASNs. Drops automation hard, keeps residential and corporate.
OpenStreetMap Overpass scan for industrial sites within 25mi of each visitor (oil/gas/chemical/refinery/petroleum/pipeline). Water-industry hits filtered out post-fetch since the client doesn't serve that segment.
Customer-proximity match — every visitor's coordinates are checked against the geocoded customer list. If a visitor in Chicago is 0.3mi from an existing customer's office, that's the killer signal.
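The customer-proximity check can be sketched with a plain haversine distance. This is a minimal illustration rather than the production code: the record shapes, the field names (`lat`, `lon`, `name`), and the sample rows are all hypothetical, and the 25-mile radius is taken from the figures quoted elsewhere on this page.

```python
import math

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles


def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))


def customer_matches(visitor, customers, radius_mi=25.0):
    """Return (customer name, distance in miles) pairs within radius_mi, nearest first."""
    hits = []
    for c in customers:
        d = haversine_mi(visitor["lat"], visitor["lon"], c["lat"], c["lon"])
        if d <= radius_mi:
            hits.append((c["name"], round(d, 1)))
    return sorted(hits, key=lambda h: h[1])


# Hypothetical records; real rows would come from the geo lookup and the
# geocoded customer list.
customers = [
    {"name": "Example Pipeline Co", "lat": 41.8781, "lon": -87.6298},   # Chicago
    {"name": "Example Refining Inc", "lat": 29.7604, "lon": -95.3698},  # Houston
]
visitor = {"ip": "203.0.113.7", "lat": 41.8827, "lon": -87.6233}

print(customer_matches(visitor, customers))
```

With a few hundred customer locations and a few thousand visitors, a brute-force loop like this is fast enough for a daily batch; a spatial index would only matter at much larger scale.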

2. Internal HQ dashboard

An M365-SSO-gated Flask page renders the leads as a sortable, filterable table. Each row shows: hits, category (corporate / residential), IP + reverse DNS, location, ASN organization, pages visited, last seen, customer matches with distance, and a status pill (new / reviewed / interesting / ignored). Sales reps can expand any row to add notes, set status, and write a company-name guess that gets written back to MySQL.
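As an illustration of the write-back step, the sketch below validates a rep's edit and builds a parameterized upsert. The table name `lead_review`, its columns, and the choice of `ip` as primary key are assumptions made for the example, not the production schema. The `%s` placeholders follow the style used by common MySQL drivers such as PyMySQL.

```python
ALLOWED_STATUSES = {"new", "reviewed", "interesting", "ignored"}

# Hypothetical table and columns; assumes `ip` is the primary key so that
# a second edit for the same visitor updates the existing row.
UPSERT_SQL = (
    "INSERT INTO lead_review (ip, status, notes, company_guess) "
    "VALUES (%s, %s, %s, %s) "
    "ON DUPLICATE KEY UPDATE status = VALUES(status), "
    "notes = VALUES(notes), company_guess = VALUES(company_guess)"
)


def build_review_update(ip, status, notes="", company_guess=""):
    """Validate a rep's edit and return (sql, params) for cursor.execute()."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return UPSERT_SQL, (ip, status, notes.strip(), company_guess.strip())


sql, params = build_review_update("203.0.113.7", "interesting", " follow up next week ")
print(params)
```

Keeping user input in the parameter tuple, never in the SQL string, is what protects the notes and company-guess fields from injection.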

Noise filter

3,652 → 175

Unique IPs in a 90-day window narrowed to North American non-bot visitors who actually browsed real pages.

Customer matches

62 leads

Visitors within 25mi of a known customer location. Closest hit: 0.1mi from an existing account in DC.

3. The funnel

Raw unique IPs (90d)

3,652

After crawler filter

2,278

North America only

~1,100

After cloud/hosting/scanner drop

175

Within 25mi of a customer

62

Hit a real intent page
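To make the narrowing concrete, the stage counts reported on this page can be turned into retention percentages. The sketch below uses only those figures (the North America count is approximate, and the 62 comes from the customer-matches figure above); the function name and tuple layout are just for illustration.

```python
# Stage counts as reported on this page; the North America figure is approximate.
funnel = [
    ("Raw unique IPs (90d)", 3652),
    ("After crawler filter", 2278),
    ("North America only", 1100),
    ("After cloud/hosting/scanner drop", 175),
    ("Within 25mi of a customer", 62),
]


def retention(stages):
    """Return (label, count, share of raw, share of previous stage) per stage."""
    raw = stages[0][1]
    prev = raw
    rows = []
    for label, count in stages:
        rows.append((label, count, count / raw, count / prev))
        prev = count
    return rows


rows = retention(funnel)
for label, count, of_raw, of_prev in rows:
    print(f"{label:<34} {count:>5}  {of_raw:6.1%} of raw  {of_prev:6.1%} of previous")
```

On these numbers, fewer than 5% of raw IPs survive the automation filters, and under 2% end up as customer-proximity leads.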

4. Architecture

Why the pipeline has so many filter layers

Raw web traffic on a B2B industrial site is dominated by automation: foreign crawlers, security scanners, cloud-hosted bots, and AI training scrapers. Without aggressive filtering, the highest-ranked "leads" are all bots. We architected for that from day one:

ASN classification, not user-agent alone. Bots can lie about user-agents, but they can't fake the AS number their packets came from, so the user-agent filter is only a first pass. The blocklist is a maintained regex of cloud, hosting, scanner, CDN, and AI-scraper ASNs, with residential ISPs (Comcast, Verizon, etc.) explicitly kept since that's where home offices and small businesses live.
Page-depth scoring is intentionally de-emphasized. A visitor who hits RFQ + contact + every product page is almost always a crawler. We treat per-visit page depth as a tie-breaker, not a primary signal — and surface customer-proximity matches as the actual ranking signal.
City-level proximity is a known floor, not a final answer. The client's first customer file had cities only, so we shipped with that and made the pipeline ready to swap in street-level addresses for v2 without code changes.
Resilient against Overpass timeouts. Some metros (Bay Area, Toronto, Vegas) saturate the public Overpass endpoint. Those queries fail, get logged, and are retried by the next cron run; the rounded-coordinate cache keeps repeat runs cheap, since only the locations that missed are queried again.
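The ASN classification described in the first point above might look roughly like this. It is a sketch under stated assumptions: the patterns are a tiny hypothetical sample rather than the maintained production list, and the input is assumed to be an ASN string of the shape `AS16509 Amazon.com, Inc.`, optionally with a separate organization field.

```python
import re

# Hypothetical noise patterns (cloud, hosting, CDN, scanners). A production
# list would be far longer and tuned against real traffic.
NOISE_ORG = re.compile(
    r"\b(?:amazon|google|microsoft|digitalocean|ovh|hetzner|linode|"
    r"cloudflare|censys|shodan)\b",
    re.IGNORECASE,
)

# Residential ISPs are kept on purpose: home offices and small businesses
# browse from these networks.
RESIDENTIAL_ORG = re.compile(
    r"\b(?:comcast|verizon|charter|cox|at&t)\b",
    re.IGNORECASE,
)


def classify(as_field, org=""):
    """Return 'residential', 'automation', or 'corporate' for one visitor."""
    text = f"{as_field} {org}"
    if RESIDENTIAL_ORG.search(text):
        return "residential"
    if NOISE_ORG.search(text):
        return "automation"
    # Anything not recognized as noise or a consumer ISP is treated as a
    # potential business network.
    return "corporate"


print(classify("AS16509 Amazon.com, Inc."))
print(classify("AS7922 Comcast Cable Communications, LLC"))
print(classify("AS64500 Example Valve Distributors"))  # documentation-range ASN
```

Checking the residential allowlist first means a consumer ISP is never dropped by an overly broad noise pattern; only the "automation" class is removed from the lead list.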

How the 6-week loop actually ran

Week 1. Walk the data: customer list shapes, log paths, existing AWStats output. Lock the scope: lead list with proximity matches, internal-only, M365-gated, daily refresh.

Weeks 2-3. Pipeline: log parser, ip-api enrichment, multi-layer automation filter, OSM proximity scan, JSON cache. First end-to-end run reaching the production database.

Week 4. HQ dashboard route, table UI, status/notes review state in MySQL, sidebar navigation. Customer-list ingest from the client's existing distributor xlsx (266 unique customer/location pairs, geocoded via Nominatim).

Week 5. UAT with sales. Tuned the noise list against their domain knowledge (they flagged a few hosting providers we hadn't seen before), added water-industry filtering (out of scope for valves), and switched the default sort to recency-first based on how reps actually used the page.

Week 6. Documentation, cron setup at 06:30 daily, deploy memos, runbooks. SSO next_url preservation so direct-link sharing works after login. Handoff.
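Preserving `next_url` through a login round trip usually comes with one caveat: the value arrives from the query string, so it should be checked before redirecting. The helper below is a hedged sketch of that check, not the deployed code; the function name and the `/hq/leads` path are hypothetical.

```python
from urllib.parse import urlsplit


def safe_next_url(next_url, default="/"):
    """Return next_url only if it is a same-site absolute path, else default."""
    if not next_url:
        return default
    parts = urlsplit(next_url)
    # Reject absolute URLs and scheme-relative ones such as //evil.example/x.
    if parts.scheme or parts.netloc:
        return default
    # Require a leading slash, and refuse backslashes, which some browsers
    # treat like forward slashes.
    if not next_url.startswith("/") or next_url.startswith("//") or "\\" in next_url:
        return default
    return next_url


print(safe_next_url("/hq/leads?status=new"))
print(safe_next_url("https://evil.example/phish"))
```

After a successful SSO callback, the app would redirect to `safe_next_url(saved_value)`, so a shared deep link lands on the right page while an attacker-supplied external URL falls back to the default.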

What's next

Phase 2 swaps the city-only customer list for full street addresses (much tighter proximity scoring, far fewer false positives), wires the dashboard into their CRM so leads sync automatically, and adds a weekly digest email to sales for accounts they haven't reviewed.

Have data your team can't get answers from?

If you've got messy operational data — web logs, customer lists, scanned PDFs, technical specs — the same playbook works.
