Predicting Success in Film: IMDb Rating Forecasting
Predicted IMDb ratings with R² = 0.90 using a Random Forest model in PySpark, trained on metadata from over 206,000 U.S. films.
Engineered custom actor- and director-experience features using IMDb-weighted formulas built from rating and vote history. Analyzed trends across genre, decade, and runtime.
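As an illustration of what a vote-weighted experience feature could look like, here is a minimal sketch. It assumes the Bayesian-style weighted rating commonly associated with IMDb charts, WR = v/(v+m) * R + m/(v+m) * C, which may differ from the formula actually used in this project. The names `Credit`, `weighted_rating`, `experience_score`, and the defaults `min_votes=1000` and `global_mean=6.0` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Credit:
    """One prior film for an actor or director (hypothetical structure)."""
    rating: float  # average rating of the film
    votes: int     # number of votes behind that rating


def weighted_rating(rating: float, votes: int,
                    min_votes: int, global_mean: float) -> float:
    """Shrink a rating toward the global mean when it has few votes.

    Assumes min_votes > 0 so the denominator is never zero.
    """
    total = votes + min_votes
    return (votes / total) * rating + (min_votes / total) * global_mean


def experience_score(credits: Sequence[Credit],
                     min_votes: int = 1000,
                     global_mean: float = 6.0) -> float:
    """Average the weighted ratings over a person's earlier films.

    People with no prior credits fall back to the global mean.
    """
    if not credits:
        return global_mean
    scores = [weighted_rating(c.rating, c.votes, min_votes, global_mean)
              for c in credits]
    return sum(scores) / len(scores)
```

The shrinkage term keeps a single well-rated but sparsely voted film from inflating a newcomer's score.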
Tuned the model with TrainValidationSplit, running a grid search over tree depth and tree count. Evaluated performance on train/test splits and visualized feature importance and prediction accuracy with Seaborn and Matplotlib.
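The tuning setup described above might be wired together roughly as follows. This is a sketch under assumptions, not the project's actual configuration: the grid values, the column names `features` and `averageRating`, the 80/20 train ratio, and the helper names `grid_combinations` and `build_tuner` are all hypothetical.

```python
from itertools import product

# Hypothetical grid values; the grid actually searched is not stated.
PARAM_GRID = {"maxDepth": [5, 10, 15], "numTrees": [50, 100]}


def grid_combinations(grid):
    """List every parameter combination the grid search would try."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


def build_tuner(features_col="features", label_col="averageRating",
                train_ratio=0.8, seed=42):
    """Return an unfitted TrainValidationSplit over a random forest.

    Requires an active SparkSession. PySpark is imported here so the
    grid helper above can be used without a Spark install.
    """
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import RandomForestRegressor
    from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

    rf = RandomForestRegressor(featuresCol=features_col,
                               labelCol=label_col, seed=seed)
    grid = (ParamGridBuilder()
            .addGrid(rf.maxDepth, PARAM_GRID["maxDepth"])
            .addGrid(rf.numTrees, PARAM_GRID["numTrees"])
            .build())
    # R² is the selection metric here, matching the metric reported above.
    evaluator = RegressionEvaluator(labelCol=label_col,
                                    predictionCol="prediction",
                                    metricName="r2")
    return TrainValidationSplit(estimator=rf, estimatorParamMaps=grid,
                                evaluator=evaluator,
                                trainRatio=train_ratio, seed=seed)
```

With a prepared training DataFrame `train_df`, `model = build_tuner().fit(train_df)` fits every combination on the training portion and keeps the one with the best validation R²; `model.bestModel.featureImportances` then gives the importances to plot.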
Built an interactive prediction tool that estimates IMDb ratings for new films based on title, cast, genre, and runtime. Scaled processing of large datasets with AWS EMR, S3, and Docker.
Tools: PySpark, AWS EMR, S3, Docker, RandomForestRegressor, TrainValidationSplit, Pandas, Seaborn, Matplotlib