From the course: Large Language Models: Text Classification for NLP using BERT

Overview of IMDb dataset

- One of the nice things about the Hugging Face datasets library is that we can convert a dataset to a pandas DataFrame, so that we can visualize the data and make any other changes we might want to. So let's go ahead and do this using just the training split, and let's display the first 10 entries. Now let's go ahead and take a look at the first review. Now, when I've looked through this dataset, what I've sometimes found is that there are a couple of reviews where you can find the HTML line break tag. So let's use a simple regular expression that says that if we find any HTML tags, so anything inside these angle brackets, we want to remove them. Now, the other thing we want to make sure of is that we have a balanced dataset. So we need to make sure that we have approximately the same number of negative reviews as positive reviews. Now, using pandas' value_counts, we can check that this is the case. And you can see that we have almost…
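The steps described here can be sketched roughly as follows. This is only an illustration, not the notebook from the course: the four reviews are a made-up stand-in for the training split, the column names `text` and `label` are assumed, and `strip_html_tags` is a hypothetical helper name. With the real data, the DataFrame would typically come from something like calling `to_pandas()` on the training split of the Hugging Face dataset.

```python
import re

import pandas as pd

# Hypothetical stand-in for the IMDb training split. In practice the DataFrame
# would come from the Hugging Face dataset (for example via to_pandas() on the
# training split); the column names "text" and "label" are assumed here.
df = pd.DataFrame(
    {
        "text": [
            "A wonderful film.<br /><br />I would watch it again.",
            "Dull and far too long.",
            "Great cast<br />and a sharp script.",
            "I walked out halfway through.",
        ],
        "label": [1, 0, 1, 0],  # assumed encoding: 1 = positive, 0 = negative
    }
)

# Display the first 10 entries (here there are only four) and the first review.
print(df.head(10))
print(df.loc[0, "text"])

# Simple regular expression: anything between angle brackets counts as a tag.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html_tags(text: str) -> str:
    """Remove HTML tags such as <br /> and tidy up the leftover whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    # Replacing tags with a space and then collapsing whitespace is a choice
    # made for this sketch, so that words on either side do not run together.
    return " ".join(without_tags.split())


df["text"] = df["text"].apply(strip_html_tags)

# Check the class balance: count how many reviews carry each label.
label_counts = df["label"].value_counts()
print(label_counts)
```

A pattern this simple would also strip any ordinary text that happens to sit between angle brackets, so it is worth spot-checking a few cleaned reviews before relying on it.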
