Turning 3,652 anonymous website visitors into 62 sales-actionable leads.
A 60-year-old industrial manufacturer with a global customer base couldn't tell which web visitors were actual prospects vs. crawlers, scanners, and noise. We built a daily-refreshed sales-intent dashboard, matched against their existing customer list. Six weeks, in production.
The before state
The client makes industrial valves for oil & gas, petrochemical, and pipeline operators. Their website got steady traffic but the AWStats output was unusable for sales: 3,652 unique IPs in a typical month, dominated by Asian crawlers, security scanners, and AWS/Azure/GCP automation pinging the RFQ form.
When sales asked "who's actually looking at our products this week?", the answer was: here's a list of IP addresses, good luck. The team had a 329-customer SAGE ERP export and a 266-pair distributor list with cities — but no way to connect either to the live web traffic.
What we shipped
1. Pipeline: log → enriched leads
A daily cron parses the Plesk access logs (90-day rolling window across rotated archives), filters crawlers by user-agent, drops attack-probe paths, then enriches every surviving IP with:
- Reverse DNS via parallel gethostbyaddr with a 2s timeout (so a single slow nameserver doesn't stall the run).
- Geo + ASN lookup via the ip-api batch endpoint: city, region, country, lat/lon, ISP, organization.
- Automation-noise filter — a tightly maintained regex of cloud, hosting, scanner, CDN, and AI-scraper ASNs. Drops automation hard, keeps residential and corporate.
- OpenStreetMap Overpass scan for industrial sites within 25mi of each visitor (oil/gas/chemical/refinery/petroleum/pipeline). Water-industry hits filtered out post-fetch since the client doesn't serve that segment.
- Customer-proximity match — every visitor's coordinates are checked against the geocoded customer list. If a visitor in Chicago is 0.3mi from an existing customer's office, that's the killer signal.
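The customer-proximity check in the last step can be sketched as a great-circle distance test. This is a minimal illustration under stated assumptions, not the production code: the `Customer` record, the `haversine_miles` and `nearby_customers` helpers, and the caller-supplied radius are all hypothetical names and choices made for the example.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


@dataclass(frozen=True)
class Customer:
    """Hypothetical shape of one geocoded customer row."""
    name: str
    lat: float
    lon: float


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # Clamp to 1.0 so float rounding near antipodal points cannot break asin.
    return 2 * EARTH_RADIUS_MILES * asin(min(1.0, sqrt(a)))


def nearby_customers(lat, lon, customers, radius_miles):
    """Return (customer, distance_miles) pairs within the radius, nearest first."""
    hits = []
    for customer in customers:
        distance = haversine_miles(lat, lon, customer.lat, customer.lon)
        if distance <= radius_miles:
            hits.append((customer, distance))
    return sorted(hits, key=lambda pair: pair[1])
```

A linear scan like this is enough for a customer list of a few hundred rows. A spatial index would only start to matter at much larger sizes.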
2. Internal HQ dashboard
An M365-SSO-gated Flask page renders the leads as a sortable, filterable table. Each row shows: hits, category (corporate / residential), IP + reverse DNS, location, ASN organization, pages visited, last seen, customer matches with distance, and a status pill (new / reviewed / interesting / ignored). Sales reps can expand any row to add notes, set a status, and record a company-name guess; all of it is written back to MySQL.
3. The funnel
4. Architecture
Why the pipeline has so many filter layers
Raw web traffic on a B2B industrial site is dominated by automation: foreign crawlers, security scanners, cloud-hosted bots, and AI training scrapers. Without aggressive filtering, the highest-ranked "leads" are all bots. We architected for that from day one:
- ASN classification, not user-agent. Bots can lie about user-agents, but they can't fake the AS number their packets came from. The blocklist is a maintained regex of cloud, hosting, scanner, CDN, and AI-scraper ASNs — with residential ISPs (Comcast, Verizon, etc.) explicitly kept since that's where home offices and small businesses live.
- Page-depth scoring is intentionally de-emphasized. A visitor who hits RFQ + contact + every product page is almost always a crawler. We treat per-visit page depth as a tie-breaker, not a primary signal — and surface customer-proximity matches as the actual ranking signal.
- City-level proximity is a known floor, not a final answer. The client's first customer file had cities only, so we shipped with that and made the pipeline ready to swap in street-level addresses for v2 without code changes.
- Resilient against Overpass timeouts. Some metros (Bay Area, Toronto, Vegas) saturate the public OpenStreetMap endpoint. Those queries fail, get logged, and are retried by the next cron run — the rounded-coordinate cache makes consecutive misses cheap.
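The ASN classification in the first bullet above might look roughly like the sketch below. It assumes the AS description arrives as a single string such as "AS7922 Comcast Cable Communications, LLC". The two pattern lists are short, hypothetical stand-ins for the maintained regex, and treating every unmatched network as "corporate" is an assumption of this example, not a documented rule.

```python
import re

# Hypothetical, deliberately short lists; a maintained list would be far longer.
AUTOMATION_ORG = re.compile(
    r"amazon|microsoft|digitalocean|hetzner|cloudflare|akamai",
    re.IGNORECASE,
)
RESIDENTIAL_ORG = re.compile(r"comcast|verizon", re.IGNORECASE)


def classify(as_description):
    """Classify a visitor by the AS description of the source network."""
    # The residential allow-list wins, so home offices are never dropped.
    if RESIDENTIAL_ORG.search(as_description):
        return "residential"
    if AUTOMATION_ORG.search(as_description):
        return "automation"
    # Assumption: anything not recognized is treated as a corporate network.
    return "corporate"


def drop_automation(visitors):
    """Keep only visitors whose network is not classified as automation.

    Each visitor is assumed to be a dict with an "as" key holding the
    AS description string.
    """
    return [v for v in visitors if classify(v["as"]) != "automation"]
```

Checking the allow-list first is a design choice: a false positive on a residential ISP costs a real lead, while a missed bot only costs a row that a rep marks as ignored.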
How a 6-week loop actually ran
Week 1. Walk the data: customer list shapes, log paths, existing AWStats output. Lock the scope: lead list with proximity matches, internal-only, M365-gated, daily refresh.
Weeks 2-3. Pipeline: log parser, ip-api enrichment, multi-layer automation filter, OSM proximity scan, JSON cache. First end-to-end run reaching the production database.
Week 4. HQ dashboard route, table UI, status/notes review state in MySQL, sidebar navigation. Customer-list ingest from the client's existing distributor xlsx (266 unique customer/location pairs, geocoded via Nominatim).
Week 5. UAT with sales. Tuned the noise list against their domain knowledge (they flagged a few hosting providers we hadn't seen before), added water-industry filtering (out of scope for valves), and switched the default sort to recency-first based on how reps actually used the page.
Week 6. Documentation, cron setup at 06:30 daily, deploy memos, runbooks. SSO next_url preservation so direct-link sharing works after login. Handoff.
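Preserving a post-login destination, as in the Week 6 next_url work, usually goes hand in hand with checking that the value points back into the same site. The helper below is a hedged sketch of one common approach using only the standard library. The function name `safe_next_url` and the default landing path are assumptions for illustration, and nothing here is confirmed as the project's actual code.

```python
from urllib.parse import urlsplit


def safe_next_url(next_url, default="/"):
    """Return next_url only if it is a local absolute path, else the default."""
    if not next_url:
        return default
    # Backslashes and control characters can be reinterpreted by browsers,
    # so reject them outright before parsing.
    if "\\" in next_url or any(ord(ch) < 0x20 for ch in next_url):
        return default
    parts = urlsplit(next_url)
    # Reject absolute URLs (scheme set) and scheme-relative URLs (netloc set).
    if parts.scheme or parts.netloc:
        return default
    # Only accept paths rooted at this site.
    if not next_url.startswith("/") or next_url.startswith("//"):
        return default
    return next_url
```

In a login flow, the value captured before the SSO redirect would be passed through a check like this once authentication completes, so a shared deep link still lands on the intended page while an off-site target falls back to the default.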
What's next
Phase 2 swaps the city-only customer list for full street addresses (much tighter proximity scoring, far fewer false positives), wires the dashboard into their CRM so leads sync automatically, and adds a weekly digest email to sales for accounts they haven't reviewed.
Have data your team can't get answers from?
If you've got messy operational data — web logs, customer lists, scanned PDFs, technical specs — the same playbook works.