---
title: Box Office Revenue Predictor
description: ~100 engineered features, R² = 0.77
section: craft
tags: [project, machine-learning]
genre: reference
stability: stable
lastUpdated: 2026-04-19
url: https://fardiniqbal.com/docs/craft/projects/box-office-revenue-predictor
---


End-to-end ML pipeline that predicts box-office revenue from TMDB 5000
metadata. Three regression families are compared under 5-fold
cross-validation: Random Forest wins on MAE at $31.4M, and Linear
Regression wins on R² at 0.769.

## What it is [#what-it-is]

A graduate-level machine learning study modeling worldwide theatrical
revenue from movie metadata. The pipeline covers outlier filtering,
nested-JSON extraction, domain-driven feature construction (talent
prestige, franchise effects, temporal windows, studio history), and
head-to-head evaluation of linear, instance-based, and ensemble
regressors on the same design matrix. Built for Stony Brook graduate ML
coursework, April-May 2025, with Yetro Cheng and Tanjim Ahammad.

## By the numbers [#by-the-numbers]

| Metric              | Value                                   |
| ------------------- | --------------------------------------- |
| Source dataset      | TMDB 5000 Movies + Credits (Kaggle)     |
| Rows after cleaning | 4,504 (from 4,803)                      |
| Engineered features | \~100 numeric                           |
| Models compared     | 3 (OLS, k-NN, Random Forest)            |
| Best MAE            | $31.4M (Random Forest, top-20 features) |
| Best R²             | 0.769 (Linear Regression)               |
| Validation          | 80/20 hold-out + 5-fold CV              |
| Random seed         | 42 (fully deterministic)                |
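
The validation setup in the table (three model families, 5-fold cross-validation, seed 42) might look roughly like the sketch below. It runs on synthetic data from `make_regression` as a stand-in for the real design matrix. The `models` dictionary, the `evaluate` helper, reading "depth 25" as `max_depth=25`, and z-scoring k-NN inside a pipeline are illustrative assumptions, not the project's actual code. The 80/20 hold-out split is not shown.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42

# Synthetic stand-in for the engineered feature matrix (assumption).
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=SEED)

models = {
    "ols": LinearRegression(),
    # Scaling lives inside the pipeline so it is fit on training folds only.
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "rf": RandomForestRegressor(n_estimators=200, max_depth=25, random_state=SEED),
}

# One shared splitter, so every model sees the same five folds.
cv = KFold(n_splits=5, shuffle=True, random_state=SEED)
scoring = {
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
    "r2": "r2",
}


def evaluate(models, X, y):
    """Return mean MAE, RMSE, and R² across folds for each model."""
    results = {}
    for name, model in models.items():
        scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
        results[name] = {
            # scikit-learn reports error metrics negated; flip the sign back.
            "mae": -scores["test_mae"].mean(),
            "rmse": -scores["test_rmse"].mean(),
            "r2": scores["test_r2"].mean(),
        }
    return results


results = evaluate(models, X, y)
```

Sharing one seeded `KFold` object across models keeps the fold boundaries identical, which is what makes per-model metrics directly comparable.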

## Architecture [#architecture]

```
Raw TMDB CSVs (movies + credits, 4,803 rows)
      |
      v
Outlier filter (budget/revenue/votes/runtime/popularity)
      |
      v  4,504 rows
Feature engineering
  +-- Nested-JSON extraction (genres, cast, crew, collections)
  +-- Temporal bins (month, weekday, summer/holiday, 5-yr periods)
  +-- Talent prestige (director/actor historical means)
  +-- Franchise + studio historical revenue
  +-- 12 interaction terms + log1p + quadratics
      |
      v  ~100 numeric features
Model training (same design matrix)
  +-- OLS (closed-form)
  +-- k-NN (k=5, z-scored, Euclidean)
  +-- Random Forest (200 trees, max depth 25)
      |
      v
5-fold CV (MAE, RMSE, R²)
```
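
The nested-JSON extraction step in the diagram can be sketched with the standard library alone. The TMDB 5000 CSVs store columns such as `genres` and `crew` as JSON-array strings of objects. The helper names (`parse_list`, `names`, `director`, `genre_flags`) and the `genre_` column prefix are hypothetical, and this is a minimal sketch rather than the project's actual extraction code.

```python
import json


def parse_list(raw):
    """Parse a JSON-array string into a list of dicts; return [] on bad input."""
    try:
        items = json.loads(raw)
    except (TypeError, ValueError):
        # Covers missing cells (None/NaN) and malformed JSON strings.
        return []
    if not isinstance(items, list):
        return []
    return [item for item in items if isinstance(item, dict)]


def names(raw):
    """Return the "name" field of every object in the array."""
    return [item["name"] for item in parse_list(raw) if "name" in item]


def director(raw_crew):
    """Return the first crew member whose job is "Director", or None."""
    for member in parse_list(raw_crew):
        if member.get("job") == "Director":
            return member.get("name")
    return None


def genre_flags(raw_genres, vocabulary):
    """Map each genre in `vocabulary` to a 0/1 indicator column."""
    present = set(names(raw_genres))
    return {
        "genre_" + genre.lower().replace(" ", "_"): int(genre in present)
        for genre in vocabulary
    }


# Illustrative row in the shape of the raw CSV cells.
row = {
    "genres": '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]',
    "crew": '[{"job": "Editor", "name": "A. Editor"}, {"job": "Director", "name": "B. Director"}]',
}
flags = genre_flags(row["genres"], ["Action", "Drama", "Science Fiction"])
row_director = director(row["crew"])
```

In practice `vocabulary` would be the set of genres that clear the frequency threshold, and each helper would be applied column-wise across the merged movies and credits table.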

## Key features [#key-features]

* **Domain-driven feature engineering** — binary genre indicators
  (frequency >= 50), per-studio historical mean revenue, `is_franchise`
  from `belongs_to_collection`, director/actor prestige via historical
  means over >=5 high-impact releases.
* **Nested-JSON extraction** — parses TMDB's embedded JSON blobs for
  cast, crew, genres, production companies, and collection membership
  into flat numeric features.
* **Cleaning filters** — compound thresholds drop implausible
  extremes; a row is kept only when `budget <= 175M`,
  `revenue <= 700M`, `vote_count <= 8,000`,
  `3.5 <= vote_average <= 8.3`, `60 <= runtime <= 200`, and
  `popularity <= 150`.
* **Per-model preprocessing** — z-scoring applied only where it helps
  (k-NN); OLS and Random Forest consume raw numerics.
* **Deterministic reruns** — all splits and ensembles seeded at
  `random_state=42`; metrics reproduce bit-identically across runs.
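
The cleaning filters listed above amount to a conjunction of inclusive range checks, which could be written as a single boolean mask in pandas. The `BOUNDS` table and `apply_filters` helper are hypothetical names, the column names follow the TMDB 5000 movies CSV, and the sketch assumes no lower bound where the list gives none.

```python
import pandas as pd

# Inclusive (low, high) bounds taken from the filter list; None means unbounded.
BOUNDS = {
    "budget": (None, 175_000_000),
    "revenue": (None, 700_000_000),
    "vote_count": (None, 8_000),
    "vote_average": (3.5, 8.3),
    "runtime": (60, 200),
    "popularity": (None, 150),
}


def apply_filters(df, bounds=BOUNDS):
    """Keep only rows that satisfy every bound; rows with NaN in a bounded column are dropped."""
    keep = pd.Series(True, index=df.index)
    for column, (low, high) in bounds.items():
        if low is not None:
            keep &= df[column] >= low
        if high is not None:
            keep &= df[column] <= high
    return df[keep]


# Three made-up rows: one passes, one exceeds the budget cap, one is too short.
movies = pd.DataFrame(
    {
        "budget": [50_000_000, 200_000_000, 10_000_000],
        "revenue": [120_000_000, 650_000_000, 5_000_000],
        "vote_count": [1_200, 7_000, 40],
        "vote_average": [6.8, 7.9, 5.5],
        "runtime": [110, 150, 45],
        "popularity": [30.0, 120.0, 2.0],
    }
)
clean = apply_filters(movies)
```

Keeping the thresholds in one table makes it easy to report them alongside the resulting row count.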

## What makes it stand out [#what-makes-it-stand-out]

* **Head-to-head on identical design matrix.** OLS, k-NN, and Random
  Forest all fit the same \~100-feature matrix — the comparison
  isolates model family, not preprocessing.
* **Split verdict.** Random Forest wins MAE (31.4M vs 36.1M USD); Linear
  Regression wins R² (0.769 vs 0.752). The page reports both rather
  than picking a single winner.
* **Top-20 feature restriction improves Random Forest.** Pruning to
  the 20 highest-importance features lowers MAE relative to the full
  feature set, suggesting that the long tail of low-importance features
  adds noise.
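
One way the top-20 restriction could be implemented is to fit a forest on all columns, rank them by importance, and refit on the 20 highest. The sketch below assumes scikit-learn's impurity-based `feature_importances_` as the ranking; the source does not say which importance measure was used. The `top_k_features` helper is a hypothetical name, "depth 25" is read as `max_depth=25`, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

SEED = 42


def top_k_features(X, y, k=20):
    """Fit a forest on every column and return the indices of the k most important."""
    forest = RandomForestRegressor(n_estimators=200, max_depth=25, random_state=SEED)
    forest.fit(X, y)
    ranked = np.argsort(forest.feature_importances_)[::-1]  # descending importance
    return np.sort(ranked[:k])  # restore original column order


# Synthetic stand-in: 100 columns, only 10 of which carry signal (assumption).
X, y = make_regression(
    n_samples=400, n_features=100, n_informative=10, noise=5.0, random_state=SEED
)
selected = top_k_features(X, y, k=20)
X_top = X[:, selected]

# Refit on the pruned matrix only.
pruned_forest = RandomForestRegressor(n_estimators=200, max_depth=25, random_state=SEED)
pruned_forest.fit(X_top, y)
```

To keep cross-validated scores honest, the ranking would need to be computed inside each training fold rather than once on the full dataset as shown here.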

## Results [#results]

| Model                  | MAE (USD) | RMSE (USD) | R²        | Notes                      |
| ---------------------- | --------- | ---------- | --------- | -------------------------- |
| Linear Regression      | 36.1M     | 56.6M      | **0.769** | best variance explained    |
| k-NN (k=5)             | 41.3M     | 73.3M      | 0.613     | degrades in high-dim space |
| Random Forest (top-20) | **31.4M** | 58.7M      | 0.752     | best absolute error        |
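
The split between the MAE and R² columns follows from how the metrics weight errors: MAE is linear in each miss, while RMSE and R² square it. The sketch below defines the three metrics from scratch and applies them to made-up numbers (not project data) where one set of predictions wins MAE and the other wins RMSE and R².

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy revenues in millions of USD.
actual = [10, 20, 30, 200]
one_big_miss = [10, 20, 30, 120]    # exact on three films, off by 80 on the hit
uniform_miss = [40, 50, 60, 170]    # off by 30 everywhere

scores = {
    "one_big_miss": (mae(actual, one_big_miss), rmse(actual, one_big_miss), r2(actual, one_big_miss)),
    "uniform_miss": (mae(actual, uniform_miss), rmse(actual, uniform_miss), r2(actual, uniform_miss)),
}
# one_big_miss: MAE 20, RMSE 40, R² ≈ 0.739
# uniform_miss: MAE 30, RMSE 30, R² ≈ 0.853
```

A model that is accurate on most films but misses a few large ones can therefore lead on MAE while trailing on RMSE and R², the same pattern as the Random Forest and Linear Regression rows in the table.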

## Stack [#stack]

| Layer         | Technology                                                         |
| ------------- | ------------------------------------------------------------------ |
| Language      | Python 3.10+                                                       |
| Modeling      | scikit-learn 1.5 (OLS, KNeighborsRegressor, RandomForestRegressor) |
| Data          | pandas 2.2, NumPy 2.x                                              |
| Visualization | matplotlib 3.9                                                     |
| Workflow      | Jupyter notebook + headless reproduction script                    |

## Links [#links]

* **Source:** [https://github.com/FardinIqbal/movie-revenue-prediction](https://github.com/FardinIqbal/movie-revenue-prediction)
