Box Office Revenue Predictor

~100 engineered features, R² = 0.77

An end-to-end ML pipeline that predicts box-office revenue from TMDB 5000 metadata. Three regression families are compared under 5-fold cross-validation: Random Forest wins on MAE at $31.4M, while Linear Regression wins on R² at 0.769.

What it is

A graduate-level machine learning study modeling worldwide theatrical revenue from movie metadata. The pipeline covers outlier filtering, nested-JSON extraction, domain-driven feature construction (talent prestige, franchise effects, temporal windows, studio history), and head-to-head evaluation of linear, instance-based, and ensemble regressors on the same design matrix. Stony Brook graduate ML coursework, April-May 2025, with Yetro Cheng and Tanjim Ahammad.
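
As an illustration of the nested-JSON step, the sketch below flattens a JSON-encoded list column such as TMDB's genres into binary indicator columns. It is a minimal sketch, not the project's code: the helper names extract_names and add_indicator_columns, the genre_ prefix, and the three toy rows are assumptions made for this example. The min_count parameter stands in for the frequency cutoff applied to genre indicators (50 in the project, 2 in the toy call).

```python
import json

import pandas as pd


def extract_names(blob):
    """Return the "name" values from a JSON-encoded list of objects."""
    if not isinstance(blob, str) or not blob.strip():
        return []
    try:
        items = json.loads(blob)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items if isinstance(item, dict) and "name" in item]


def add_indicator_columns(df, column, prefix, min_count=50):
    """Add one 0/1 column per name that occurs at least min_count times."""
    names = df[column].apply(extract_names)
    # explode() yields one row per name; empty lists become NaN and are not counted.
    counts = names.explode().value_counts()
    kept = sorted(counts[counts >= min_count].index)
    out = df.copy()
    for name in kept:
        out[f"{prefix}_{name}"] = names.apply(lambda ns, n=name: int(n in ns))
    return out


# Hypothetical rows in the shape of the TMDB "genres" column.
movies = pd.DataFrame(
    {
        "genres": [
            '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
            '[{"id": 28, "name": "Action"}]',
            "[]",
        ]
    }
)

flat = add_indicator_columns(movies, "genres", "genre", min_count=2)
print(flat.filter(like="genre_"))
```

The same helper could in principle be pointed at cast, crew, or production-company columns, although those typically need an additional aggregation step before they become useful numeric features.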

By the numbers

Metric                Value
Source dataset        TMDB 5000 Movies + Credits (Kaggle)
Rows after cleaning   4,504 (from 4,803)
Engineered features   ~100 numeric
Models compared       3 (OLS, k-NN, Random Forest)
Best MAE              $31.4M (Random Forest, top-20 features)
Best R²               0.769 (Linear Regression)
Validation            80/20 hold-out + 5-fold K-Fold
Random seed           42 (fully deterministic)

Architecture

Raw TMDB CSVs (movies + credits, 4,803 rows)
      |
      v
Outlier filter (budget/revenue/votes/runtime/popularity)
      |
      v  4,504 rows
Feature engineering
  +-- Nested-JSON extraction (genres, cast, crew, collections)
  +-- Temporal bins (month, weekday, summer/holiday, 5-yr periods)
  +-- Talent prestige (director/actor historical means)
  +-- Franchise + studio historical revenue
  +-- 12 interaction terms + log1p + quadratics
      |
      v  ~100 numeric features
Model training (same design matrix)
  +-- OLS (closed-form)
  +-- k-NN (k=5, z-scored, Euclidean)
  +-- Random Forest (200 trees, depth 25)
      |
      v
5-fold K-Fold CV (MAE, RMSE, R²)
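
The final two stages of the diagram can be sketched roughly as follows. This is an assumption-laden illustration rather than the project's script: it runs on synthetic data from make_regression instead of the TMDB design matrix, and it assumes shuffled folds. The hyperparameters (k=5 with z-scoring, 200 trees, depth 25, seed 42) are the ones shown in the diagram.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42

# Synthetic stand-in for the engineered design matrix (assumption for this sketch).
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=SEED)

models = {
    "Linear Regression": LinearRegression(),
    # z-scoring lives inside the pipeline so it is refit on each training fold.
    "k-NN (k=5)": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "Random Forest": RandomForestRegressor(
        n_estimators=200, max_depth=25, random_state=SEED
    ),
}

cv = KFold(n_splits=5, shuffle=True, random_state=SEED)
scoring = {
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
    "r2": "r2",
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # scikit-learn negates error metrics so that higher is always better.
    results[name] = {
        "MAE": -scores["test_mae"].mean(),
        "RMSE": -scores["test_rmse"].mean(),
        "R2": scores["test_r2"].mean(),
    }

for name, metrics in results.items():
    print(f"{name}: MAE={metrics['MAE']:.2f} RMSE={metrics['RMSE']:.2f} R2={metrics['R2']:.3f}")
```

Wrapping the scaler and k-NN in one pipeline keeps the scaling statistics from leaking across folds, and reusing a single seeded KFold object gives every model the same splits.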

Key features

  • Domain-driven feature engineering: binary genre indicators (frequency >= 50), per-studio historical mean revenue, is_franchise from belongs_to_collection, and director/actor prestige computed as historical means over >= 5 high-impact releases.
  • Nested-JSON extraction: parses TMDB's embedded JSON blobs for cast, crew, genres, production companies, and collection membership into flat numeric features.
  • Cleaning filters: compound thresholds drop implausible extremes, keeping rows with budget <= $175M, revenue <= $700M, vote_count <= 8,000, 3.5 <= vote_average <= 8.3, 60 <= runtime <= 200, and popularity <= 150.
  • Per-model preprocessing: z-scoring is applied only where it helps (k-NN); OLS and Random Forest consume raw numerics.
  • Deterministic reruns: all splits and ensembles are seeded with random_state=42, so metrics reproduce identically across runs.
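
The cleaning thresholds listed above could be applied along these lines. This is a hedged sketch: the BOUNDS table and the apply_cleaning_filters helper are hypothetical names, the demo rows are invented, and the project may apply further conditions (for example lower bounds on budget or revenue) that are not listed here.

```python
import pandas as pd

# Inclusive (low, high) bounds taken from the thresholds above; None means unbounded.
BOUNDS = {
    "budget": (None, 175_000_000),
    "revenue": (None, 700_000_000),
    "vote_count": (None, 8_000),
    "vote_average": (3.5, 8.3),
    "runtime": (60, 200),
    "popularity": (None, 150),
}


def apply_cleaning_filters(df, bounds=BOUNDS):
    """Keep only rows that satisfy every bound; rows with NaN in a bounded column are dropped."""
    mask = pd.Series(True, index=df.index)
    for column, (low, high) in bounds.items():
        if low is not None:
            mask &= df[column] >= low
        if high is not None:
            mask &= df[column] <= high
    return df[mask].reset_index(drop=True)


# Invented rows: only the first one satisfies all six conditions.
demo = pd.DataFrame(
    {
        "budget": [50e6, 250e6, 20e6],
        "revenue": [120e6, 900e6, 45e6],
        "vote_count": [1200, 9000, 300],
        "vote_average": [6.8, 7.9, 2.9],
        "runtime": [110, 150, 95],
        "popularity": [35.0, 160.0, 8.0],
    }
)

cleaned = apply_cleaning_filters(demo)
print(f"kept {len(cleaned)} of {len(demo)} rows")
```

Keeping the bounds in one dictionary makes the filter easy to audit against the list above and to adjust without touching the filtering logic.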

What makes it stand out

  • Head-to-head on identical design matrix. OLS, k-NN, and Random Forest all fit the same ~100-feature matrix — the comparison isolates model family, not preprocessing.
  • Split verdict. Random Forest wins MAE ($31.4M vs $36.1M); Linear Regression wins R² (0.769 vs 0.752). The page reports both rather than picking a single winner.
  • Top-20 feature restriction improves Random Forest. Pruning to the 20 highest-importance features lowers MAE relative to the full feature set, suggesting residual noise in the long tail.
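
One plausible way to implement the top-20 restriction is to rank features by a fitted forest's impurity-based importances and refit on the leading 20. The sketch below assumes that approach and uses synthetic data; the fit_forest helper is a hypothetical name, and the project may have ranked features differently. Importances are computed on the training split only, so the held-out rows do not influence which features survive.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

SEED = 42
TOP_K = 20

# Synthetic stand-in: 100 features, of which only 15 carry signal (assumption).
X, y = make_regression(
    n_samples=400, n_features=100, n_informative=15, noise=20.0, random_state=SEED
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)


def fit_forest(features, target):
    """Fit a forest with the hyperparameters quoted on this page."""
    forest = RandomForestRegressor(n_estimators=200, max_depth=25, random_state=SEED)
    return forest.fit(features, target)


# Rank features with a forest trained on all columns, then keep the top K.
full = fit_forest(X_train, y_train)
top_idx = np.argsort(full.feature_importances_)[::-1][:TOP_K]

# Refit on the reduced matrix and compare hold-out MAE.
pruned = fit_forest(X_train[:, top_idx], y_train)
mae_full = mean_absolute_error(y_test, full.predict(X_test))
mae_pruned = mean_absolute_error(y_test, pruned.predict(X_test[:, top_idx]))

print(f"MAE with all features: {mae_full:.2f}")
print(f"MAE with top {TOP_K}:      {mae_pruned:.2f}")
```

Whether pruning helps depends on the data; on the real feature matrix the page reports that it does, but the synthetic numbers printed here say nothing about that result.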

Results

Model                    MAE (USD)   RMSE (USD)   R²      Notes
Linear Regression        36.1M       56.6M        0.769   best variance explained
k-NN (k=5)               41.3M       73.3M        0.613   degrades in high-dim space
Random Forest (top-20)   31.4M       58.7M        0.752   best absolute error

Stack

Layer           Technology
Language        Python 3.10+
Modeling        scikit-learn 1.5 (OLS, KNeighborsRegressor, RandomForestRegressor)
Data            pandas 2.2, NumPy 2.x
Visualization   matplotlib 3.9
Workflow        Jupyter notebook + headless reproduction script