Box Office Revenue Predictor

Status: stable. Updated 2026-04-19.

End-to-end ML pipeline predicting box-office revenue from TMDB 5000 metadata. Three regression families are compared under 5-fold cross-validation. Random Forest has the lowest MAE ($31.4M); Linear Regression has the highest R² (0.769).

What it is

A graduate-level machine learning study modeling worldwide theatrical revenue from movie metadata. The pipeline covers outlier filtering, nested-JSON extraction, domain-driven feature construction (talent prestige, franchise effects, temporal windows, studio history), and head-to-head evaluation of linear, instance-based, and ensemble regressors on the same design matrix. Stony Brook graduate ML coursework, April-May 2025, with Yetro Cheng and Tanjim Ahammad.

By the numbers

Metric	Value
Source dataset	TMDB 5000 Movies + Credits (Kaggle)
Rows after cleaning	4,504 (from 4,803)
Engineered features	~100 numeric
Models compared	3 (OLS, k-NN, Random Forest)
Best MAE	$31.4M (Random Forest, top-20 features)
Best R²	0.769 (Linear Regression)
Validation	80/20 hold-out + 5-fold K-Fold
Random seed	42 (fully deterministic)

Architecture

Raw TMDB CSVs (movies + credits, 4,803 rows)
      |
      v
Outlier filter (budget/revenue/votes/runtime/popularity)
      |
      v  4,504 rows
Feature engineering
  +-- Nested-JSON extraction (genres, cast, crew, collections)
  +-- Temporal bins (month, weekday, summer/holiday, 5-yr periods)
  +-- Talent prestige (director/actor historical means)
  +-- Franchise + studio historical revenue
  +-- 12 interaction terms + log1p + quadratics
      |
      v  ~100 numeric features
Model training (same design matrix)
  +-- OLS (closed-form)
  +-- k-NN (k=5, z-scored, Euclidean)
  +-- Random Forest (200 trees, depth 25)
      |
      v
5-fold K-Fold CV (MAE, RMSE, R²)
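
The training and evaluation stages of the diagram can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the data is synthetic (`make_regression` stands in for the engineered TMDB matrix), shuffled folds are an assumption, and wrapping the scaler in a pipeline so it is refit per fold is one reasonable way to z-score for k-NN only. The hyperparameters (k=5, Euclidean distance, 200 trees, depth 25, seed 42) are the ones listed above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42

# Synthetic stand-in for the design matrix (assumption: the real one has ~100 columns).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=SEED)

models = {
    # OLS and Random Forest consume raw numerics.
    "ols": LinearRegression(),
    # k-NN is distance-based, so it gets z-scored inputs; the scaler is fit inside each fold.
    "knn": make_pipeline(
        StandardScaler(),
        KNeighborsRegressor(n_neighbors=5, metric="euclidean"),
    ),
    "rf": RandomForestRegressor(n_estimators=200, max_depth=25, random_state=SEED),
}

# Assumption: shuffled folds with a fixed seed; the same splitter is reused for every model.
cv = KFold(n_splits=5, shuffle=True, random_state=SEED)
scoring = {
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
    "r2": "r2",
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {
        # scikit-learn negates error metrics so that higher is better; flip the sign back.
        "mae": -scores["test_mae"].mean(),
        "rmse": -scores["test_rmse"].mean(),
        "r2": scores["test_r2"].mean(),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Reusing one `KFold` object across models keeps the folds identical, so differences in the averaged metrics reflect the model family rather than the split.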

Key features

Domain-driven feature engineering: binary genre indicators (frequency >= 50), per-studio historical mean revenue, is_franchise from belongs_to_collection, and director/actor prestige via historical means over >=5 high-impact releases.
Nested-JSON extraction: parses TMDB's embedded JSON blobs for cast, crew, genres, production companies, and collection membership into flat numeric features.
Cleaning filters: compound thresholds drop implausible extremes. Rows are kept when budget <= $175M, revenue <= $700M, vote_count <= 8,000, 3.5 <= vote_average <= 8.3, 60 <= runtime <= 200 (minutes), and popularity <= 150.
Per-model preprocessing: z-scoring is applied only where it helps (k-NN); OLS and Random Forest consume raw numerics.
Deterministic reruns: all splits and ensembles are seeded at random_state=42, so metrics reproduce bit-for-bit across runs.
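
As an illustration of the cleaning and extraction steps listed above, here is a minimal pandas sketch. The thresholds are the ones stated in the list; everything else is an assumption made for the example: the helper names (`extract_names`, `add_genre_indicators`, `apply_cleaning_filters`) are hypothetical, the column names follow the Kaggle TMDB 5000 CSV, "frequency >= 50" is read as "the genre appears in at least 50 rows", and the toy rows are invented.

```python
import json

import pandas as pd


def extract_names(blob):
    """Return the 'name' values from a JSON-encoded list of dicts, or [] if unparseable."""
    if not isinstance(blob, str) or not blob.strip():
        return []
    try:
        items = json.loads(blob)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items if isinstance(item, dict) and "name" in item]


def apply_cleaning_filters(df):
    """Keep rows that satisfy every threshold from the cleaning-filter list."""
    mask = (
        (df["budget"] <= 175e6)
        & (df["revenue"] <= 700e6)
        & (df["vote_count"] <= 8000)
        & df["vote_average"].between(3.5, 8.3)  # inclusive on both ends
        & df["runtime"].between(60, 200)
        & (df["popularity"] <= 150)
    )
    return df.loc[mask].reset_index(drop=True)


def add_genre_indicators(df, min_count=50):
    """Add one 0/1 column per genre that occurs in at least `min_count` rows."""
    names = df["genres"].apply(extract_names)
    counts = names.explode().value_counts()  # NaN from empty lists is dropped
    out = df.copy()
    for genre in counts[counts >= min_count].index:
        out[f"genre_{genre}"] = names.apply(lambda xs, g=genre: int(g in xs))
    return out


# Invented toy rows; min_count is lowered to 2 so the tiny example produces a column.
toy = pd.DataFrame({
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        "[]",
        '[{"id": 28, "name": "Action"}]',
        None,
    ],
    "budget": [150e6, 200e6, 20e6, 5e6],
    "revenue": [600e6, 650e6, 40e6, 1e6],
    "vote_count": [5000, 7000, 300, 10],
    "vote_average": [7.1, 7.9, 6.0, 2.0],
    "runtime": [120, 140, 95, 45],
    "popularity": [80.0, 120.0, 10.0, 1.0],
})

cleaned = apply_cleaning_filters(toy)          # rows 1 and 3 violate a threshold
featured = add_genre_indicators(cleaned, min_count=2)
print(featured.filter(like="genre_"))
```

Filtering before feature construction matches the order in the architecture diagram, so genre frequencies are counted on the cleaned rows only.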

What makes it stand out

Head-to-head on identical design matrix. OLS, k-NN, and Random Forest all fit the same ~100-feature matrix — the comparison isolates model family, not preprocessing.
Split verdict. Random Forest wins MAE ($31.4M vs $36.1M); Linear Regression wins R² (0.769 vs 0.752). The page reports both rather than picking a single winner.
Top-20 feature restriction improves Random Forest. Pruning to the 20 highest-importance features lowers MAE relative to the full feature set, suggesting that the long tail of low-importance features contributes mostly noise.
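
One way the top-20 restriction could be implemented is sketched below. It is an assumption that "importance" means scikit-learn's impurity-based `feature_importances_`; the data is synthetic, and ranking features on the training split only is a design choice made here to keep the held-out rows out of the selection step.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

SEED = 42
TOP_K = 20

# Synthetic stand-in: 100 features, only some of which carry signal.
X, y = make_regression(
    n_samples=400, n_features=100, n_informative=15, noise=20.0, random_state=SEED
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)


def fit_forest(X_fit, y_fit):
    """Fit a forest with the hyperparameters listed in the architecture section."""
    model = RandomForestRegressor(n_estimators=200, max_depth=25, random_state=SEED)
    return model.fit(X_fit, y_fit)


# 1) Fit on all features and rank them by impurity-based importance.
full = fit_forest(X_train, y_train)
top_idx = np.argsort(full.feature_importances_)[::-1][:TOP_K]

# 2) Refit on the 20 highest-ranked columns only.
pruned = fit_forest(X_train[:, top_idx], y_train)

# 3) Compare hold-out MAE for the two variants.
mae_full = mean_absolute_error(y_test, full.predict(X_test))
mae_pruned = mean_absolute_error(y_test, pruned.predict(X_test[:, top_idx]))
print(f"full: {mae_full:.2f}  top-{TOP_K}: {mae_pruned:.2f}")
```

Whether pruning helps depends on the data; the sketch only shows the mechanics and does not guarantee a lower MAE for the pruned model.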

Results

Model	MAE (USD)	RMSE (USD)	R²	Notes
Linear Regression	36.1M	56.6M	0.769	best variance explained
k-NN (k=5)	41.3M	73.3M	0.613	degrades in high-dim space
Random Forest (top-20)	31.4M	58.7M	0.752	best absolute error
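
In the table, Random Forest has the lower MAE but the higher RMSE and lower R² compared with Linear Regression. MAE and R² can rank models differently because R² is built on squared errors and so penalizes a few large misses more heavily. The toy example below uses invented numbers, not project data, to show the effect: model A is exact on nine targets and badly wrong on one, while model B is moderately wrong everywhere.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Invented targets (think of them as revenue in $M).
y_true = np.arange(10, 101, 10, dtype=float)  # 10, 20, ..., 100

# Model A: perfect on nine rows, one miss of 50.
pred_a = y_true.copy()
pred_a[-1] -= 50.0

# Model B: off by 8 on every row, alternating sign.
signs = np.array([1, -1] * 5, dtype=float)
pred_b = y_true + 8.0 * signs

mae_a = mean_absolute_error(y_true, pred_a)  # 50 / 10 = 5.0
mae_b = mean_absolute_error(y_true, pred_b)  # 8.0
r2_a = r2_score(y_true, pred_a)              # squared error total 2500
r2_b = r2_score(y_true, pred_b)              # squared error total 640

print(f"A: MAE={mae_a:.1f}  R2={r2_a:.3f}")
print(f"B: MAE={mae_b:.1f}  R2={r2_b:.3f}")
```

Model A wins on MAE and model B wins on R², which is the same kind of split the results table reports.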

Stack

Layer	Technology
Language	Python 3.10+
Modeling	scikit-learn 1.5 (OLS, KNeighborsRegressor, RandomForestRegressor)
Data	pandas 2.2, NumPy 2.x
Visualization	matplotlib 3.9
Workflow	Jupyter notebook + headless reproduction script