Skip to main content
WorkProjects

AI Email Scraper

Saves 20+ hours/month of manual data entry

stable
View raw

AI-powered case referral intake automation. Extracts structured referral data from unstructured Outlook emails every 5 minutes. Replaces a manual copy-paste intake workflow that took 15+ minutes per referral with an always-on pipeline that reads top-of-inbox mail, runs a domain-specific Gemini extraction, and appends clean rows to a 36-column Google Sheet matching the team's Excel template.

What it is

An internal service for an early-intervention referral intake workflow at a pediatric therapy staffing agency. Incoming referral emails are unstructured: free-text bodies, forwarded chains, service-coordinator signatures several replies deep, multi-child attachments, inconsistent case-ID formats. The service pulls those emails via Microsoft Graph, runs a constrained Gemini prompt over each one, validates and deduplicates the result, and writes rows to a shared sheet. Processed emails are marked read so the inbox acts as the queue.

By the numbers

MetricValue
Referrals processed / mo~150
Time saved / mo20+ hrs
Infra cost / mo$0
Tick interval5 min
LOC across src/3,260
Commits on master88
AI module size1,410 lines (src/lib/ai.ts)
Extraction prompt600 lines
Output schema36 columns
Emails fetched per ticktop 100 unread
Parallel batch size15 (≤20 runs fully parallel)
Retry attempts per email3, exponential backoff

Architecture

Microsoft Graph (Outlook)  ->  Gemini 2.5 Flash extraction  ->  Google Sheets append
         ^                              |                              |
         |                              v                              v
   node-cron (5 min)            retry / regex fallback         duplicate check
  1. node-cron fires every 5 minutes.
  2. Fetch top 100 unread emails via Microsoft Graph (Azure AD MSAL, application permissions).
  3. Keyword-filter for staffing / referral / services.
  4. Extract in parallel batches through Gemini 2.5 Flash (temperature 0.1, 32K max output tokens).
  5. Post-process: garbage filter, duplicate detection, field normalization.
  6. Append non-duplicate rows to the Google Sheet.
  7. Mark source emails as read.

Key features

  • 600-line Gemini extraction prompt with plate-math-style domain rules: NYC borough-from-ZIP mapping, case-ID prefix stripping, multi-child email handling (one row per child per service type), service-coordinator signature detection in forwarded chains, and an agency-name validation blacklist.
  • Dual-layer garbage filter — AI-side rejection of invalid agency names and non-person SC names, followed by a pre-write safety net that re-checks every row before it hits the sheet.
  • 2-tier duplicate detection — primary key on case_id + service_type, fallback on child_name + service_type + borough for referrals without a case ID. In-memory set backed by a read of existing sheet rows on boot.
  • 3-attempt retry with exponential backoff per email, plus a regex fallback that extracts the minimum viable row if Gemini fails all three attempts.
  • Batch parallel processing — ≤20 emails run fully in parallel; above that, batches of 15 to stay under rate limits.
  • Fire-and-forget mark-as-read so a sheet-write failure does not re-process the same email on the next tick.
  • EndpointsGET / and GET /health for health; POST /api/process for manual run triggers, guarded by a CRON_SECRET bearer token.
  • External heartbeatcron-job.org hits POST /api/process every 5 minutes as a backup for the internal node-cron schedule.

What makes it stand out

  • Replaces 15+ minutes per referral with a 5-minute always-on tick — ~150 referrals / month, 20+ hours saved.
  • Inbox-as-queue: read state is the durable queue, no external broker.
  • Regex fallback guarantees a minimum viable row even when Gemini fails three attempts in a row.
  • Runs on Render's $0 tier. Single always-on service, no database.

Stack

LayerTechnology
RuntimeNode.js 20, TypeScript, tsx
HTTPExpress.js
EmailMicrosoft Graph API via Azure AD MSAL (@azure/msal-node, @microsoft/microsoft-graph-client)
AIGoogle Gemini 2.5 Flash (@google/generative-ai)
StorageGoogle Sheets API (googleapis)
Schedulingnode-cron internal, cron-job.org external backup
HostingRender.com Web Service, free tier