Box Office Revenue Predictor
~100 engineered features, R² = 0.77
stable
End-to-end ML pipeline predicting box-office revenue from TMDB 5000 metadata. Three regression families are compared under 5-fold cross-validation: Random Forest achieves the lowest MAE ($31.4M), while Linear Regression achieves the highest R² (0.769).
What it is
A graduate-level machine learning study modeling worldwide theatrical revenue from movie metadata. The pipeline covers outlier filtering, nested-JSON extraction, domain-driven feature construction (talent prestige, franchise effects, temporal windows, studio history), and head-to-head evaluation of linear, instance-based, and ensemble regressors on the same design matrix. Stony Brook graduate ML coursework, April-May 2025, with Yetro Cheng and Tanjim Ahammad.
By the numbers
| Metric | Value |
|---|---|
| Source dataset | TMDB 5000 Movies + Credits (Kaggle) |
| Rows after cleaning | 4,504 (from 4,803) |
| Engineered features | ~100 numeric |
| Models compared | 3 (OLS, k-NN, Random Forest) |
| Best MAE | $31.4M (Random Forest, top-20 features) |
| Best R² | 0.769 (Linear Regression) |
| Validation | 80/20 hold-out + 5-fold cross-validation |
| Random seed | 42 (fully deterministic) |
Architecture
Raw TMDB CSVs (movies + credits, 4,803 rows)
|
v
Outlier filter (budget/revenue/votes/runtime/popularity)
|
v 4,504 rows
Feature engineering
+-- Nested-JSON extraction (genres, cast, crew, collections)
+-- Temporal bins (month, weekday, summer/holiday, 5-yr periods)
+-- Talent prestige (director/actor historical means)
+-- Franchise + studio historical revenue
+-- 12 interaction terms + log1p + quadratics
|
v ~100 numeric features
Model training (same design matrix)
+-- OLS (closed-form)
+-- k-NN (k=5, z-scored, Euclidean)
+-- Random Forest (200 trees, depth 25)
|
v
5-fold K-Fold CV (MAE, RMSE, R²)
Key features
- Domain-driven feature engineering: binary genre indicators (for genres with frequency >= 50), per-studio historical mean revenue, `is_franchise` from `belongs_to_collection`, and director/actor prestige via historical means over >=5 high-impact releases.
- Nested-JSON extraction: parses TMDB's embedded JSON blobs for cast, crew, genres, production companies, and collection membership into flat numeric features.
- Cleaning filters: compound thresholds drop extreme rows, keeping only those with `budget <= 175M`, `revenue <= 700M`, `vote_count <= 8,000`, `3.5 <= vote_average <= 8.3`, `60 <= runtime <= 200`, and `popularity <= 150`.
- Per-model preprocessing: z-scoring is applied only where it helps (k-NN); OLS and Random Forest consume raw numerics.
- Deterministic reruns: all splits and ensembles are seeded at `random_state=42`, so metrics reproduce bit-for-bit across runs.
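The cleaning thresholds and the nested-JSON flattening described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the project's actual code: the helper names (`passes_filters`, `names_from_json`, `flatten`), the `BOUNDS` layout, the example row, and the rule that any non-empty `belongs_to_collection` value means `is_franchise = 1` are all hypothetical.

```python
import json

# Keep-ranges copied from the thresholds listed above; None means "no bound".
# The dict layout and helper names are illustrative, not the project's own code.
BOUNDS = {
    "budget": (None, 175_000_000),
    "revenue": (None, 700_000_000),
    "vote_count": (None, 8_000),
    "vote_average": (3.5, 8.3),
    "runtime": (60, 200),
    "popularity": (None, 150),
}


def passes_filters(row):
    """Return True when every bounded column lies inside its keep-range."""
    for column, (low, high) in BOUNDS.items():
        value = row[column]
        if low is not None and value < low:
            return False
        if high is not None and value > high:
            return False
    return True


def names_from_json(blob):
    """Extract the "name" field from a JSON list of objects stored as a string."""
    if not blob:
        return []
    return [item["name"] for item in json.loads(blob)]


def flatten(row, kept_genres):
    """Build flat numeric features: one 0/1 flag per kept genre plus is_franchise."""
    genres = set(names_from_json(row["genres"]))
    features = {f"genre_{g}": int(g in genres) for g in kept_genres}
    # Assumption: a non-empty belongs_to_collection value marks a franchise entry.
    features["is_franchise"] = int(bool(row.get("belongs_to_collection")))
    return features


# A made-up row for illustration only; values are not taken from the dataset.
example = {
    "budget": 150_000_000,
    "revenue": 500_000_000,
    "vote_count": 5_000,
    "vote_average": 7.1,
    "runtime": 120,
    "popularity": 80.0,
    "genres": '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    "belongs_to_collection": '{"id": 1, "name": "Example Collection"}',
}

if passes_filters(example):
    print(flatten(example, kept_genres=["Action", "Drama"]))
```

In a pandas workflow the same checks would more likely be written as one boolean mask over the DataFrame; the row-wise form here just keeps the logic readable without extra dependencies.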
What makes it stand out
- Head-to-head on an identical design matrix. OLS, k-NN, and Random Forest are all fit to the same ~100-feature matrix, so the comparison isolates model family rather than preprocessing.
- Split verdict. Random Forest wins MAE ($31.4M vs $36.1M); Linear Regression wins R² (0.769 vs 0.752). The page reports both rather than picking a single winner.
- Top-20 feature restriction improves Random Forest. Pruning to the 20 highest-importance features lowers MAE relative to the full feature set, suggesting that the long tail of low-importance features mostly adds noise.
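One way to carry out a top-20 restriction like the one described above is to rank columns by a fitted forest's impurity-based importances and refit on the 20 highest. The sketch below uses synthetic data as a stand-in for the real design matrix, and the variable names are illustrative. How the project ranks features, and whether it selects them inside each cross-validation fold, is not stated here. Selecting on the full data, as this sketch does, can make downstream CV scores look optimistic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the ~100-feature matrix: only the first five
# columns carry signal, the remaining 95 are pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 100))
y = X[:, :5] @ np.array([5.0, 4.0, 3.0, 2.0, 1.0]) + rng.normal(scale=0.5, size=300)

# Fit on all features to obtain impurity-based importances.
forest = RandomForestRegressor(n_estimators=200, max_depth=25, random_state=42)
forest.fit(X, y)

# Column indices of the 20 most important features, highest first.
top20 = np.argsort(forest.feature_importances_)[::-1][:20]
X_top20 = X[:, top20]

# Refit the same configuration on the pruned matrix.
pruned = RandomForestRegressor(n_estimators=200, max_depth=25, random_state=42)
pruned.fit(X_top20, y)

print("kept columns:", sorted(top20.tolist()))
```

Wrapping the selection step in a scikit-learn `Pipeline` (for example with `SelectFromModel`) would keep it inside each fold, at the cost of a slightly different top-20 set per fold.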
Results
| Model | MAE (USD) | RMSE (USD) | R² | Notes |
|---|---|---|---|---|
| Linear Regression | 36.1M | 56.6M | 0.769 | best variance explained |
| k-NN (k=5) | 41.3M | 73.3M | 0.613 | degrades in high-dim space |
| Random Forest (top-20) | 31.4M | 58.7M | 0.752 | best absolute error |
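A comparison of this shape can be reproduced in outline with scikit-learn's `cross_validate`. The sketch below is a hedged illustration on synthetic data, not the project's script. The hyperparameters (k=5, 200 trees, depth 25, seed 42) come from this page, while the data, the `models` and `results` names, and the choice to place the scaler inside a pipeline so it is refit per fold are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the real design matrix.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 30))
y = 10.0 * X[:, :3].sum(axis=1) + rng.normal(size=300)

models = {
    "Linear Regression": LinearRegression(),
    # z-scoring only for k-NN; the default metric is Euclidean distance.
    "k-NN (k=5)": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "Random Forest": RandomForestRegressor(n_estimators=200, max_depth=25, random_state=42),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
    "r2": "r2",
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # scikit-learn negates error metrics so that higher is always better.
    results[name] = {
        "MAE": -scores["test_mae"].mean(),
        "RMSE": -scores["test_rmse"].mean(),
        "R2": scores["test_r2"].mean(),
    }
    print(f"{name:18s} MAE={results[name]['MAE']:.2f} "
          f"RMSE={results[name]['RMSE']:.2f} R2={results[name]['R2']:.3f}")
```

Reusing the same seeded `KFold` object for every model means each one is scored on identical folds, which is what makes a per-metric verdict like the table above comparable across rows.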
Stack
| Layer | Technology |
|---|---|
| Language | Python 3.10+ |
| Modeling | scikit-learn 1.5 (OLS, KNeighborsRegressor, RandomForestRegressor) |
| Data | pandas 2.2, NumPy 2.x |
| Visualization | matplotlib 3.9 |
| Workflow | Jupyter notebook + headless reproduction script |