Movie Synopsis Feature Extraction with BART Text Summarization

In this notebook, we will use the BART model to extract features from movie synopses.

Note: this notebook should be executed from within the container below:

docker pull huggingface/transformers-pytorch-gpu
docker run --gpus=all --rm -it --net=host -v $PWD:/workspace --ipc=host huggingface/transformers-pytorch-gpu

Then from within the container:

cd /workspace
pip install jupyter jupyterlab
jupyter server extension disable nbclassic
jupyter-lab --allow-root --ip='' --NotebookApp.token='admin'

First, we install some extra packages.

!pip install imdbpy

# CUDA 11 and A100 support
!pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f
import IPython

{'status': 'ok', 'restart': True}

Download pretrained BART model

First, we download a pretrained BART model from the Hugging Face library.

from transformers import BartTokenizer, BartModel
import torch

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
model = BartModel.from_pretrained('facebook/bart-large').cuda()

Extracting embeddings for all movie synopses

We will use the hidden states of the last decoder layer, averaged over all tokens, as the text feature. This gives one vector of 1024 float values per synopsis.
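To make the pooling step concrete, here is a minimal sketch that uses a random array in place of the real model output. The helper name `mean_pool` and the dummy sequence length of 12 tokens are illustrative assumptions, not part of the notebook. The shape `(1, seq_len, 1024)` matches what `facebook/bart-large` returns for a single input.

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray) -> np.ndarray:
    """Average a (1, seq_len, hidden_size) array over its token axis."""
    # Drop the batch axis, then average across tokens.
    return hidden_states[0].mean(axis=0)

# Random stand-in for the last hidden state of one 12-token synopsis (assumption).
rng = np.random.default_rng(0)
dummy_hidden = rng.standard_normal((1, 12, 1024)).astype(np.float32)

feature = mean_pool(dummy_hidden)
print(feature.shape)  # (1024,)
```

Every synopsis, whatever its length, is thus reduced to a vector of the same size.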

import pickle

with open('movies_info.pkl', 'rb') as f:
    movies_infos = pickle.load(f)['movies_infos']

import torch
import numpy as np
from tqdm import tqdm

embeddings = {}
for movie, movie_info in tqdm(movies_infos.items()):
    # Prefer the full synopsis; fall back to the first plot summary if there is none.
    synopsis = movie_info.get('synopsis')
    if synopsis is None:
        plots = movie_info.get('plot')
        if plots:  # skips both None and an empty list
            synopsis = plots[0]
    if synopsis is not None:
        # Truncate long texts to at most 1024 tokens.
        inputs = tokenizer(synopsis, return_tensors="pt", truncation=True, max_length=1024).to('cuda')
        with torch.no_grad():
            outputs = model(**inputs)
        # Last decoder layer hidden states, shape (1, seq_len, 1024).
        embeddings[movie] = outputs.last_hidden_state.cpu().numpy()
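
100%|██████████| 62423/62423 [43:41<00:00, 23.81it/s]

# Average over the token axis to get one 1024-dimensional vector per movie.
average_embeddings = {}
for movie in embeddings:
    # squeeze(0) removes only the batch axis, so the token axis is never dropped by accident.
    average_embeddings[movie] = np.mean(embeddings[movie].squeeze(0), axis=0)
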
with open('movies_synopsis_embeddings-1024.pkl', 'wb') as f:
    pickle.dump({"embeddings": average_embeddings}, f, protocol=pickle.HIGHEST_PROTOCOL)
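As a rough sketch of how a later step might read this file back, the example below writes and reloads a small pickle with the same `{"embeddings": {...}}` layout. The movie IDs, the random vectors, and the `demo_embeddings.pkl` path are made up for illustration. Stacking the vectors into one matrix is just one possible way to consume them.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-in for average_embeddings: three made-up movie IDs.
rng = np.random.default_rng(0)
demo_embeddings = {
    movie_id: rng.standard_normal(1024).astype(np.float32)
    for movie_id in ("1", "2", "3")
}

# Write a demo file with the same layout as the notebook's output file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_embeddings.pkl")
with open(demo_path, "wb") as f:
    pickle.dump({"embeddings": demo_embeddings}, f, protocol=pickle.HIGHEST_PROTOCOL)

# Read it back and stack the vectors into a (num_movies, 1024) matrix.
with open(demo_path, "rb") as f:
    loaded = pickle.load(f)["embeddings"]

movie_ids = sorted(loaded)
feature_matrix = np.stack([loaded[m] for m in movie_ids])
print(feature_matrix.shape)  # (3, 1024)
```

Keeping a sorted list of IDs next to the matrix preserves the mapping from rows to movies.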