---
title: AI Email Scraper
description: Saves 20+ hours/month of manual data entry
section: craft
tags: [project, ai-and-automation]
genre: reference
stability: stable
lastUpdated: 2026-04-19
url: https://fardiniqbal.com/docs/craft/projects/ai-email-scraper
---


AI-powered case referral intake automation. Extracts structured referral
data from unstructured Outlook emails every 5 minutes. Replaces a manual
copy-paste intake workflow that took 15+ minutes per referral with an
always-on pipeline that reads top-of-inbox mail, runs a domain-specific
Gemini extraction, and appends clean rows to a 36-column Google Sheet
matching the team's Excel template.

## What it is [#what-it-is]

An internal service for an early-intervention referral intake workflow
at a pediatric therapy staffing agency. Incoming referral emails are
unstructured: free-text bodies, forwarded chains, service-coordinator
signatures several replies deep, multi-child attachments, inconsistent
case-ID formats. The service pulls those emails via Microsoft Graph,
runs a constrained Gemini prompt over each one, validates and
deduplicates the result, and writes rows to a shared sheet. Processed
emails are marked read so the inbox acts as the queue.

## By the numbers [#by-the-numbers]

| Metric                   | Value                         |
| ------------------------ | ----------------------------- |
| Referrals processed / mo | \~150                         |
| Time saved / mo          | 20+ hrs                       |
| Infra cost / mo          | $0                            |
| Tick interval            | 5 min                         |
| LOC across `src/`        | 3,260                         |
| Commits on `master`      | 88                            |
| AI module size           | 1,410 lines (`src/lib/ai.ts`) |
| Extraction prompt        | 600 lines                     |
| Output schema            | 36 columns                    |
| Emails fetched per tick  | top 100 unread                |
| Parallel batch size      | 15 (≤20 runs fully parallel)  |
| Retry attempts per email | 3, exponential backoff        |

## Architecture [#architecture]

```
Microsoft Graph (Outlook)  ->  Gemini 2.5 Flash extraction  ->  Google Sheets append
         ^                              |                              |
         |                              v                              v
   node-cron (5 min)            retry / regex fallback         duplicate check
```

1. `node-cron` fires every 5 minutes.
2. Fetch top 100 unread emails via Microsoft Graph (Azure AD MSAL,
   application permissions).
3. Keyword-filter for staffing / referral / services.
4. Extract in parallel batches through Gemini 2.5 Flash (temperature
   0.1, 32K max output tokens).
5. Post-process: garbage filter, duplicate detection, field
   normalization.
6. Append non-duplicate rows to the Google Sheet.
7. Mark source emails as read.

## Key features [#key-features]

* **600-line Gemini extraction prompt** with plate-math-style domain
  rules: NYC borough-from-ZIP mapping, case-ID prefix stripping,
  multi-child email handling (one row per child per service type),
  service-coordinator signature detection in forwarded chains, and an
  agency-name validation blacklist.
* **Dual-layer garbage filter** — AI-side rejection of invalid agency
  names and non-person SC names, followed by a pre-write safety net
  that re-checks every row before it hits the sheet.
* **2-tier duplicate detection** — primary key on
  `case_id + service_type`, fallback on
  `child_name + service_type + borough` for referrals without a case
  ID. In-memory set backed by a read of existing sheet rows on boot.
* **3-attempt retry with exponential backoff** per email, plus a regex
  fallback that extracts the minimum viable row if Gemini fails all
  three attempts.
* **Batch parallel processing** — ≤20 emails run fully in parallel;
  above that, batches of 15 to stay under rate limits.
* **Fire-and-forget mark-as-read** so a sheet-write failure does not
  re-process the same email on the next tick.
* **Endpoints** — `GET /` and `GET /health` for health; `POST /api/process`
  for manual run triggers, guarded by a `CRON_SECRET` bearer token.
* **External heartbeat** — `cron-job.org` hits `POST /api/process` every
  5 minutes as a backup for the internal `node-cron` schedule.

## What makes it stand out [#what-makes-it-stand-out]

* Replaces 15+ minutes per referral with a 5-minute always-on tick —
  \~150 referrals / month, 20+ hours saved.
* Inbox-as-queue: read state is the durable queue, no external broker.
* Regex fallback guarantees a minimum viable row even when Gemini
  fails three attempts in a row.
* Runs on Render's $0 tier. Single always-on service, no database.

## Stack [#stack]

| Layer      | Technology                                                                                      |
| ---------- | ----------------------------------------------------------------------------------------------- |
| Runtime    | Node.js 20, TypeScript, `tsx`                                                                   |
| HTTP       | Express.js                                                                                      |
| Email      | Microsoft Graph API via Azure AD MSAL (`@azure/msal-node`, `@microsoft/microsoft-graph-client`) |
| AI         | Google Gemini 2.5 Flash (`@google/generative-ai`)                                               |
| Storage    | Google Sheets API (`googleapis`)                                                                |
| Scheduling | `node-cron` internal, `cron-job.org` external backup                                            |
| Hosting    | Render.com Web Service, free tier                                                               |

## Links [#links]

* **Source:** [https://github.com/FardinIqbal/BigMindsTinyHandsEmailScraper](https://github.com/FardinIqbal/BigMindsTinyHandsEmailScraper)
