Viral Tweet Prediction

This notebook gives step-by-step instructions for reproducing the predictions from start to finish: data pre-processing, feature extraction, model training, and prediction generation.

In this notebook

  • Data processing: one-hot encoding for categorical features, cyclical encoding for the hour, and normalization

  • LASSO regression for feature selection

  • Memory footprint reduction of data

  • Hyper-parameter tuning with RandomizedSearchCV

  • Building a LightGBM classifier for prediction

  • Feature importance visualization
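The LASSO feature-selection step listed above could look roughly like the sketch below. It is not the notebook's code: the synthetic data, the `alpha=0.1` penalty, and the use of scikit-learn's `SelectFromModel` are all assumptions made for illustration. The idea is that the L1 penalty drives the coefficients of uninformative features to zero, so only features with a non-zero coefficient are kept.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
# Only columns 0 and 3 influence the target; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=500)

# LASSO is scale-sensitive, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# Keep every feature whose LASSO coefficient is (numerically) non-zero.
selector = SelectFromModel(Lasso(alpha=0.1), threshold=1e-5)
selector.fit(X_scaled, y)

selected = np.flatnonzero(selector.get_support())
X_selected = selector.transform(X_scaled)
```

A larger `alpha` removes more features; in practice the value would be chosen by cross-validation rather than fixed by hand.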

Environment details

OS: macOS Big Sur 11.4
Memory: 16 GB 2133 MHz LPDDR3
Disk Space: 1 TB Flash Storage
CPU/GPU: Intel HD Graphics 530 1536 MB

Which data files are being used?

  • train_tweets.csv

  • train_tweets_vectorized_media.csv

  • train_tweets_vectorized_text.csv

  • users.csv

  • user_vectorized_descriptions.csv

  • user_vectorized_profile_images.csv

  • test_tweets.csv

  • test_tweets_vectorized_media.csv

  • test_tweets_vectorized_text.csv
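The vectorized files listed above are wide numeric tables, which is where memory footprint reduction pays off. A minimal sketch of one common approach, downcasting numeric columns with pandas, is shown below. The helper name `reduce_memory` and the demo column names are hypothetical, and the notebook may use a different method.

```python
import numpy as np
import pandas as pd


def reduce_memory(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns downcast to smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # e.g. int64 -> int16 when the value range allows it
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32; this trades some precision for memory
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Small synthetic frame standing in for a loaded CSV (assumption).
demo = pd.DataFrame({
    "tweet_id": np.arange(1000, dtype="int64"),
    "img_feature_0": np.linspace(0.0, 1.0, 1000),
})

small = reduce_memory(demo)
before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

Downcasting floats to 32 bits is usually harmless for tree-based models, but it is worth checking that identifier columns are not converted to a float type, where large values could lose precision.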

How are these files processed?

  • Filling missing topic_ids with ['0']

  • One-hot encoding for categorical variables

  • Cyclical encoding for hour
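The three processing steps above can be sketched as follows. The column names (`tweet_hour`, `tweet_language_id`, `tweet_topic_ids`) and the tiny example frame are assumptions for illustration, not taken from the notebook. Cyclical encoding maps the hour onto a circle with sine and cosine, so that 23:00 and 00:00 end up close together instead of 23 units apart.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train_tweets.csv; column names are hypothetical.
tweets = pd.DataFrame({
    "tweet_hour": [0, 6, 12, 23],
    "tweet_language_id": [0, 1, 0, 2],
    "tweet_topic_ids": ["['36', '36']", None, "['117']", None],
})

# 1. Fill missing topic ids with the placeholder "['0']".
tweets["tweet_topic_ids"] = tweets["tweet_topic_ids"].fillna("['0']")

# 2. One-hot encode a categorical column (one indicator column per category).
tweets = pd.get_dummies(tweets, columns=["tweet_language_id"], prefix="lang")

# 3. Cyclical encoding: project the hour onto the unit circle.
angle = 2 * np.pi * tweets["tweet_hour"] / 24
tweets["hour_sin"] = np.sin(angle)
tweets["hour_cos"] = np.cos(angle)


def hour_distance(i: int, j: int) -> float:
    """Euclidean distance between two rows in (sin, cos) space."""
    a = tweets.loc[i, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    b = tweets.loc[j, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))
```

With this encoding, `hour_distance(0, 3)` (midnight versus 23:00) is much smaller than `hour_distance(0, 2)` (midnight versus noon), which is the property a raw 0 to 23 integer lacks.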

What is the algorithm used and what are its main hyper-parameters?

The model is a LightGBM classifier (LGBMClassifier) with the following hyper-parameters:

LGBMClassifier(colsample_bytree=0.7076074093370144, min_child_samples=105, min_child_weight=1e-05, num_leaves=26, reg_alpha=5, reg_lambda=5, subsample=0.7468773130235173)
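For reference, here is a hedged sketch of how the configuration above could be rebuilt and how a RandomizedSearchCV run over the same parameters might be set up. The names `BEST_PARAMS`, `SEARCH_SPACE`, `build_model`, and `build_search` are hypothetical, and the candidate lists, `n_iter`, `cv`, and `scoring` values are assumptions standing in for whatever the notebook actually sampled. The LightGBM import is deferred into the functions so the parameter dictionaries can be inspected without the library installed.

```python
# Hyper-parameters copied from the LGBMClassifier call above.
BEST_PARAMS = {
    "colsample_bytree": 0.7076074093370144,
    "min_child_samples": 105,
    "min_child_weight": 1e-05,
    "num_leaves": 26,
    "reg_alpha": 5,
    "reg_lambda": 5,
    "subsample": 0.7468773130235173,
}

# Assumed search space: coarse candidate lists, not the notebook's ranges.
SEARCH_SPACE = {
    "colsample_bytree": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "min_child_samples": [20, 50, 100, 200, 400],
    "min_child_weight": [1e-5, 1e-3, 1e-1, 1.0, 10.0],
    "num_leaves": [15, 26, 31, 63, 127],
    "reg_alpha": [0, 1, 2, 5, 10],
    "reg_lambda": [0, 1, 2, 5, 10],
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}


def build_model(random_state=0):
    """Rebuild the classifier with the reported hyper-parameters."""
    from lightgbm import LGBMClassifier  # imported lazily

    return LGBMClassifier(**BEST_PARAMS, random_state=random_state)


def build_search(n_iter=20, cv=3, random_state=0):
    """Set up a randomized search over SEARCH_SPACE (not yet fitted)."""
    from lightgbm import LGBMClassifier
    from sklearn.model_selection import RandomizedSearchCV

    return RandomizedSearchCV(
        estimator=LGBMClassifier(random_state=random_state),
        param_distributions=SEARCH_SPACE,
        n_iter=n_iter,          # number of sampled configurations
        cv=cv,                  # folds per configuration
        scoring="accuracy",     # assumed metric
        random_state=random_state,
    )
```

Typical usage would be `search = build_search(); search.fit(X_train, y_train)`, after which `search.best_params_` holds the winning configuration. Note that in LightGBM, `subsample` only takes effect when `subsample_freq` is set to a positive value.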