Turning 3,652 anonymous website visitors into 62 sales-actionable leads.

A 60-year-old industrial manufacturer with a global customer base couldn't tell which web visitors were actual prospects vs. crawlers, scanners, and noise. We built a daily-refreshed sales-intent dashboard, matched against their existing customer list. Six weeks, in production.

Industry: Industrial Manufacturing
Engagement: 6 weeks · fixed scope
Stack: Python, Flask, MySQL, ip-api, OSM Overpass, Nominatim
Status: In production

The before state

The client makes industrial valves for oil & gas, petrochemical, and pipeline operators. Their website got steady traffic but the AWStats output was unusable for sales: 3,652 unique IPs in a typical month, dominated by Asian crawlers, security scanners, and AWS/Azure/GCP automation pinging the RFQ form.

When sales asked "who's actually looking at our products this week?", the answer was: here's a list of IP addresses, good luck. The team had a 329-customer SAGE ERP export and a distributor list of 266 customer/location pairs with cities, but no way to connect either to the live web traffic.

What we shipped

1. Pipeline: log → enriched leads

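A daily cron parses the Plesk access logs (90-day rolling window across rotated archives), filters crawlers by user-agent, drops attack-probe paths, then enriches every surviving IP with the fields the dashboard shows: reverse DNS, location, ASN organization, and customer matches with distance.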
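The parse-and-filter step can be sketched roughly as follows. This is a minimal illustration, not the production parser: it assumes the logs are in the common "combined" access-log format, and the token lists `BOT_UA_TOKENS` and `PROBE_PATH_TOKENS` are hypothetical stand-ins for the real noise lists, which are not shown here.

```python
import re
from collections import Counter

# Assumed: "combined" access-log format (IP, timestamp, request, status,
# size, referer, user-agent).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# Hypothetical, illustrative lists only.
BOT_UA_TOKENS = ("bot", "crawler", "spider", "python-requests", "curl")
PROBE_PATH_TOKENS = ("/wp-login.php", "/xmlrpc.php", "/.env", "/.git")


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def is_noise(hit):
    """True for empty or crawler user-agents and for attack-probe paths."""
    ua = hit["ua"].lower()
    path = hit["path"].lower()
    return (
        ua in ("", "-")
        or any(token in ua for token in BOT_UA_TOKENS)
        or any(token in path for token in PROBE_PATH_TOKENS)
    )


def hits_per_ip(lines):
    """Count surviving hits per IP after the noise filter."""
    counts = Counter()
    for line in lines:
        hit = parse_line(line)
        if hit and not is_noise(hit):
            counts[hit["ip"]] += 1
    return counts
```

Feeding `hits_per_ip` the lines of each rotated archive in the 90-day window would give the per-IP hit counts that the enrichment step then works from.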

2. Internal HQ dashboard

An M365-SSO-gated Flask page renders the leads as a sortable, filterable table. Each row shows: hits, category (corporate / residential), IP + reverse DNS, location, ASN organization, pages visited, last seen, customer matches with distance, and a status pill (new / reviewed / interesting / ignored). Sales reps can expand any row to add notes, set status, and write a company-name guess that gets written back to MySQL.
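The review write-back could look something like the sketch below. It uses an in-memory SQLite database purely as a stand-in for MySQL so the example runs on its own; the table name `lead_review`, its columns, and keying reviews by IP are all assumptions. Only the four status values come from the dashboard described above.

```python
import sqlite3

# Status values taken from the dashboard's status pill.
ALLOWED_STATUSES = ("new", "reviewed", "interesting", "ignored")


def init_db(conn):
    """Create the (hypothetical) review-state table if it is missing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lead_review ("
        " ip TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'new',"
        " notes TEXT NOT NULL DEFAULT '',"
        " company_guess TEXT NOT NULL DEFAULT '')"
    )


def save_review(conn, ip, status, notes="", company_guess=""):
    """Validate the status, then insert or overwrite the row for this IP."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    conn.execute(
        "INSERT OR REPLACE INTO lead_review (ip, status, notes, company_guess)"
        " VALUES (?, ?, ?, ?)",
        (ip, status, notes, company_guess),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL
init_db(conn)
save_review(conn, "203.0.113.7", "interesting", "Viewed RFQ page twice")
```

Keeping review state in its own table, separate from the regenerated lead data, means a daily refresh never overwrites what a rep has typed.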

Noise filter: 3,652 → 175
Unique IPs in a 90-day window narrowed to North-American non-bot visitors who actually browsed real pages.

Customer matches: 62 leads
Visitors within 25 mi of a known customer location. Closest hit: 0.1 mi from an existing account in DC.
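One way to express the 25-mile customer match is a great-circle distance check between the visitor's geolocated position and each geocoded customer location. The sketch below assumes the haversine formula and a simple linear scan. The production pipeline may compute proximity differently, and the function names and customer tuples are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles


def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points (degrees)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MI * asin(sqrt(a))


def nearest_customer(visitor, customers, max_miles=25.0):
    """Return (name, miles) for the closest customer within max_miles, else None.

    visitor is a (lat, lon) pair; customers is an iterable of (name, lat, lon).
    """
    best = None
    for name, lat, lon in customers:
        miles = haversine_mi(visitor[0], visitor[1], lat, lon)
        if miles <= max_miles and (best is None or miles < best[1]):
            best = (name, miles)
    return best


# Hypothetical customer locations (city-level coordinates).
CUSTOMERS = [
    ("Customer A", 29.7604, -95.3698),
    ("Customer B", 38.9072, -77.0369),
]
match = nearest_customer((29.75, -95.36), CUSTOMERS)
```

With only city-level coordinates on both sides, a match like this says "same metro area", not "same building", so it works as a ranking signal, not as proof of identity.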

3. The funnel

Raw unique IPs (90d): 3,652
After crawler filter: 2,278
North America only: ~1,100
After cloud/hosting/scanner drop: 175
Within 25mi of a customer: 62
Hit a real intent page: 5
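A funnel like this can be produced by applying the filters in order and recording how many visitors survive each stage. The sketch below is illustrative only: the visitor records and their field names (`crawler`, `country`, `hosting`, `miles`, `intent`) are hypothetical, and treating North America as US, CA, and MX is an assumption.

```python
NORTH_AMERICA = {"US", "CA", "MX"}  # assumption: countries counted as North America


def run_funnel(visitors, stages):
    """Apply (label, predicate) stages in order; return [(label, survivors)]."""
    report = [("Raw unique IPs", len(visitors))]
    remaining = list(visitors)
    for label, keep in stages:
        remaining = [v for v in remaining if keep(v)]
        report.append((label, len(remaining)))
    return report


STAGES = [
    ("After crawler filter", lambda v: not v["crawler"]),
    ("North America only", lambda v: v["country"] in NORTH_AMERICA),
    ("After cloud/hosting/scanner drop", lambda v: not v["hosting"]),
    ("Within 25mi of a customer",
     lambda v: v["miles"] is not None and v["miles"] <= 25),
    ("Hit a real intent page", lambda v: v["intent"]),
]

# Made-up sample records, one per typical outcome.
VISITORS = [
    {"crawler": False, "country": "US", "hosting": False, "miles": 4.2, "intent": True},
    {"crawler": False, "country": "US", "hosting": False, "miles": 140.0, "intent": False},
    {"crawler": False, "country": "CA", "hosting": True, "miles": 12.0, "intent": False},
    {"crawler": True, "country": "US", "hosting": True, "miles": None, "intent": False},
    {"crawler": False, "country": "DE", "hosting": False, "miles": None, "intent": False},
]

for label, count in run_funnel(VISITORS, STAGES):
    print(f"{label}: {count}")
```

Reporting the count after every stage, not just the final number, makes it easy to see which filter is doing the most work when the noise lists are tuned.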

4. Architecture

[Architecture diagram: Plesk access logs, ip-api · OSM, and the customer xlsx feed a daily cron (parse / enrich / geocode) that writes a JSON cache; the /sales dashboard (Python · Flask · M365 SSO) reads the cache and keeps review state in MySQL.]

Why the pipeline has so many filter layers

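Raw web traffic on a B2B industrial site is dominated by automation: foreign crawlers, security scanners, cloud-hosted bots, and AI training scrapers. Without aggressive filtering, the highest-ranked "leads" are all bots. We architected for that from day one, with a separate layer for each class of noise: the user-agent crawler filter, the attack-probe path drop, the North America restriction, and the cloud/hosting/scanner drop.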
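As a rough sketch of how layered filtering might work on a single enriched record, the function below returns the first reason a visitor would be dropped, or `None` if it survives. The keys `countryCode`, `org`, `as`, `hosting`, and `proxy` follow ip-api's JSON response fields. The `is_crawler_ua` flag, the token list, and the order of the layers are assumptions for illustration.

```python
from typing import Optional

NORTH_AMERICA = {"US", "CA", "MX"}  # assumption
# Illustrative tokens for cloud/hosting ASNs (AWS, Azure, GCP and similar).
CLOUD_ORG_TOKENS = ("amazon", "microsoft", "google", "digitalocean")


def drop_reason(record: dict) -> Optional[str]:
    """Return why a record is filtered out, or None if it passes every layer."""
    if record.get("is_crawler_ua"):  # hypothetical flag set by the log parser
        return "crawler user-agent"
    if record.get("countryCode") not in NORTH_AMERICA:
        return "outside North America"
    if record.get("hosting") or record.get("proxy"):
        return "hosting or proxy network"
    org_text = f'{record.get("org", "")} {record.get("as", "")}'.lower()
    if any(token in org_text for token in CLOUD_ORG_TOKENS):
        return "cloud provider ASN"
    return None
```

Returning a reason and not just a boolean makes the filter auditable: when sales flags a missing visitor, the reason shows which layer removed it. Substring matching on organization names is blunt (a token such as `google` would also catch residential ISP ranges), so a real list would need tuning with the sales team.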

How a 6-week loop actually ran

Week 1. Walk the data: customer list shapes, log paths, existing AWStats output. Lock the scope: lead list with proximity matches, internal-only, M365-gated, daily refresh.

Weeks 2-3. Pipeline: log parser, ip-api enrichment, multi-layer automation filter, OSM proximity scan, JSON cache. First end-to-end run reaching the production database.

Week 4. HQ dashboard route, table UI, status/notes review state in MySQL, sidebar navigation. Customer-list ingest from the client's existing distributor xlsx (266 unique customer/location pairs, geocoded via Nominatim).
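Geocoding a city-level customer list with Nominatim comes down to one search request per customer/location pair. The sketch below only builds the request URL and reads coordinates out of an already-decoded response, so it runs without network access. The helper names are hypothetical, and the use of the public `nominatim.openstreetmap.org` instance is an assumption.

```python
from urllib.parse import urlencode

# Assumed endpoint: the public Nominatim instance's search API.
NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(city, state=None, country=None):
    """Build a free-form Nominatim search URL for a city-level location."""
    parts = [part for part in (city, state, country) if part]
    params = {"q": ", ".join(parts), "format": "jsonv2", "limit": 1}
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


def first_coordinates(results):
    """Extract (lat, lon) as floats from a decoded response list, or None.

    Nominatim returns a JSON array of places with "lat" and "lon" as strings.
    """
    if not results:
        return None
    top = results[0]
    return float(top["lat"]), float(top["lon"])


url = build_search_url("Houston", "TX", "USA")
```

The public Nominatim instance asks for an identifying User-Agent and at most one request per second, so a 266-row list is best geocoded once and cached, not on every cron run.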

Week 5. UAT with sales. Tuned the noise list against their domain knowledge (they flagged a few hosting providers we hadn't seen before), added water-industry filtering (out of scope for valves), and switched the default sort to recency-first based on how reps actually used the page.

Week 6. Documentation, cron setup at 06:30 daily, deploy memos, runbooks. SSO next_url preservation so direct-link sharing works after login. Handoff.
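Preserving the requested URL across an SSO login usually means carrying a `next_url` value through the redirect, and it is worth validating that value before redirecting back to it. The helper below is a minimal sketch of such a check using only the standard library. The function name, the `/sales` default, and the validation rules are assumptions; they are not taken from the deployed code.

```python
from urllib.parse import urlsplit


def safe_next_url(next_url, default="/sales"):
    """Return next_url only if it is a same-site absolute path, else default."""
    if not next_url:
        return default
    parts = urlsplit(next_url)
    # Reject anything that names another scheme or host.
    if parts.scheme or parts.netloc:
        return default
    # Require a leading slash, and reject protocol-relative or backslash tricks.
    if not next_url.startswith("/") or next_url.startswith("//"):
        return default
    if "\\" in next_url:
        return default
    return next_url
```

In this sketch a shared link such as `/sales?status=new` survives the login round-trip, while a value pointing at another host falls back to the dashboard root.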

What's next

Phase 2 swaps the city-only customer list for full street addresses (much tighter proximity scoring, far fewer false positives), wires the dashboard into their CRM so leads sync automatically, and adds a weekly digest email to sales for accounts they haven't reviewed.

Have data your team can't get answers from?

If you've got messy operational data — web logs, customer lists, scanned PDFs, technical specs — the same playbook works.
