AI Email Scraper
Saves 20+ hours/month of manual data entry
AI-powered case referral intake automation. Extracts structured referral data from unstructured Outlook emails every 5 minutes. Replaces a manual copy-paste intake workflow that took 15+ minutes per referral with an always-on pipeline that reads top-of-inbox mail, runs a domain-specific Gemini extraction, and appends clean rows to a 36-column Google Sheet matching the team's Excel template.
What it is
An internal service for an early-intervention referral intake workflow at a pediatric therapy staffing agency. Incoming referral emails are unstructured: free-text bodies, forwarded chains, service-coordinator signatures several replies deep, multi-child attachments, inconsistent case-ID formats. The service pulls those emails via Microsoft Graph, runs a constrained Gemini prompt over each one, validates and deduplicates the result, and writes rows to a shared sheet. Processed emails are marked read so the inbox acts as the queue.
By the numbers
| Metric | Value |
|---|---|
| Referrals processed / mo | ~150 |
| Time saved / mo | 20+ hrs |
| Infra cost / mo | $0 |
| Tick interval | 5 min |
LOC across src/ | 3,260 |
Commits on master | 88 |
| AI module size | 1,410 lines (src/lib/ai.ts) |
| Extraction prompt | 600 lines |
| Output schema | 36 columns |
| Emails fetched per tick | top 100 unread |
| Parallel batch size | 15 (≤20 runs fully parallel) |
| Retry attempts per email | 3, exponential backoff |
Architecture
Microsoft Graph (Outlook) -> Gemini 2.5 Flash extraction -> Google Sheets append
^ | |
| v v
node-cron (5 min) retry / regex fallback duplicate checknode-cronfires every 5 minutes.- Fetch top 100 unread emails via Microsoft Graph (Azure AD MSAL, application permissions).
- Keyword-filter for staffing / referral / services.
- Extract in parallel batches through Gemini 2.5 Flash (temperature 0.1, 32K max output tokens).
- Post-process: garbage filter, duplicate detection, field normalization.
- Append non-duplicate rows to the Google Sheet.
- Mark source emails as read.
Key features
- 600-line Gemini extraction prompt with plate-math-style domain rules: NYC borough-from-ZIP mapping, case-ID prefix stripping, multi-child email handling (one row per child per service type), service-coordinator signature detection in forwarded chains, and an agency-name validation blacklist.
- Dual-layer garbage filter — AI-side rejection of invalid agency names and non-person SC names, followed by a pre-write safety net that re-checks every row before it hits the sheet.
- 2-tier duplicate detection — primary key on
case_id + service_type, fallback onchild_name + service_type + boroughfor referrals without a case ID. In-memory set backed by a read of existing sheet rows on boot. - 3-attempt retry with exponential backoff per email, plus a regex fallback that extracts the minimum viable row if Gemini fails all three attempts.
- Batch parallel processing — ≤20 emails run fully in parallel; above that, batches of 15 to stay under rate limits.
- Fire-and-forget mark-as-read so a sheet-write failure does not re-process the same email on the next tick.
- Endpoints —
GET /andGET /healthfor health;POST /api/processfor manual run triggers, guarded by aCRON_SECRETbearer token. - External heartbeat —
cron-job.orghitsPOST /api/processevery 5 minutes as a backup for the internalnode-cronschedule.
What makes it stand out
- Replaces 15+ minutes per referral with a 5-minute always-on tick — ~150 referrals / month, 20+ hours saved.
- Inbox-as-queue: read state is the durable queue, no external broker.
- Regex fallback guarantees a minimum viable row even when Gemini fails three attempts in a row.
- Runs on Render's $0 tier. Single always-on service, no database.
Stack
| Layer | Technology |
|---|---|
| Runtime | Node.js 20, TypeScript, tsx |
| HTTP | Express.js |
Microsoft Graph API via Azure AD MSAL (@azure/msal-node, @microsoft/microsoft-graph-client) | |
| AI | Google Gemini 2.5 Flash (@google/generative-ai) |
| Storage | Google Sheets API (googleapis) |
| Scheduling | node-cron internal, cron-job.org external backup |
| Hosting | Render.com Web Service, free tier |