Introduction to Sentiment Analysis using NLP #1
This blog introduces beginners to Natural Language Processing (NLP) and, over the upcoming posts, gradually works up to using Recurrent Neural Networks for sentiment analysis.
The main aim of NLP is to make computers understand natural languages. Computers are fairly good at handling structured data such as tables, databases, and spreadsheets. Natural language, however, is unstructured data, and that is where NLP techniques come in. NLP has many applications, including sentiment analysis, named entity recognition, machine translation, automated question answering, text mining, combating misinformation, and much more.
In this series, we use Recurrent Neural Networks to classify sentiments. For this, we use the IMDB movie reviews dataset, which is built into Keras. In this first part, I briefly explain how the dataset is formatted and begin reading what we have.
First, we import the dataset from Keras.
from keras.datasets import imdb
Then we set the size of the vocabulary, so that only the 10,000 most frequent words are kept, and load the training and test data.
num_words = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_words)
print('Loaded dataset with {} training samples, {} test samples'.format(len(X_train), len(X_test)))
Loaded dataset with 25000 training samples, 25000 test samples
Now we can inspect a specific review in the dataset, together with the label assigned to it.
print("Label: ")
print(y_train[10])
print("Review: ")
print(X_train[10])
The output of this is as follows:
Here, each review is stored as a sequence of integers, where each integer is a word ID that is pre-assigned to a word. The label 0 means a negative sentiment, while 1 means a positive sentiment.
To better understand this, we can use imdb.get_word_index() to map a review from word IDs back to words.
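To make the idea of word IDs concrete, here is a minimal sketch of how a text could be turned into such a sequence. It is not the code Keras uses to build the IMDB dataset; the tiny corpus and the helpers build_word_index and encode are hypothetical. The sketch only imitates the conventions that, to my understanding, imdb.load_data follows by default: more frequent words get smaller IDs, every ID is shifted by 3, ID 1 marks the start of a review, and any word outside the chosen vocabulary size is replaced by ID 2.

```python
from collections import Counter

# Assumed defaults of imdb.load_data: start marker, unknown-word ID, ID shift.
START_ID, OOV_ID, INDEX_FROM = 1, 2, 3

def build_word_index(texts):
    """Rank words by frequency: the most frequent word gets rank 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(text, word_index, num_words):
    """Turn a text into a list of integer word IDs."""
    ids = [START_ID]
    for word in text.lower().split():
        rank = word_index.get(word)
        word_id = rank + INDEX_FROM if rank is not None else OOV_ID
        # IDs at or above num_words fall outside the vocabulary.
        ids.append(word_id if word_id < num_words else OOV_ID)
    return ids

# Hypothetical three-review corpus, for illustration only.
corpus = ["the movie was great", "the movie was bad", "the plot was thin"]
word_index = build_word_index(corpus)
encoded = encode("the movie was great", word_index, num_words=6)
print(word_index)
print(encoded)  # [1, 4, 2, 5, 2]
```

With a vocabulary size of 6, only "the" and "was" keep their own IDs (4 and 5); "movie" and "great" are rarer, so both collapse to the unknown-word ID 2. The same thing happens to rare words in the real dataset when num_words is set to 10000.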
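word_to_id = imdb.get_word_index()
# By default, load_data() shifts every word ID by 3 (index_from=3) and
# reserves 0 for padding, 1 for the start of a review, and 2 for unknown words.
id_to_word = {i + 3: word for word, i in word_to_id.items()}
print('Label: ')
print(y_train[10])
print('Review in words: ')
print([id_to_word.get(i, ' ') for i in X_train[10]])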
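One detail is easy to miss when decoding: with its default arguments, imdb.load_data shifts every word ID by 3 and reserves the IDs 0, 1, and 2 for padding, the start of a review, and unknown words, whereas imdb.get_word_index() returns the unshifted ranks. The sketch below shows one way to account for that and to print a review as a readable string. The helper decode_review, the bracketed placeholder tokens, and the miniature word index are hypothetical and only stand in for the real data.

```python
# Assumed defaults of imdb.load_data: start marker, unknown-word ID, ID shift.
START_ID, OOV_ID, INDEX_FROM = 1, 2, 3

def decode_review(ids, word_to_id):
    """Map a list of shifted word IDs back to a readable string."""
    # Shift the ranks so they line up with the IDs stored in the reviews.
    id_to_word = {rank + INDEX_FROM: word for word, rank in word_to_id.items()}
    id_to_word[0] = "[PAD]"
    id_to_word[START_ID] = "[START]"
    id_to_word[OOV_ID] = "[OOV]"
    # Any ID that is still missing is treated as an unknown word.
    return " ".join(id_to_word.get(i, "[OOV]") for i in ids)

# Hypothetical miniature word index standing in for imdb.get_word_index().
toy_word_to_id = {"the": 1, "was": 2, "movie": 3, "great": 4}
decoded = decode_review([1, 4, 6, 5, 7, 2], toy_word_to_id)
print(decoded)  # [START] the movie was great [OOV]
```

On the real dataset, the same helper could be called as decode_review(X_train[10], imdb.get_word_index()).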
This gives us a better understanding of how words are mapped to integer IDs and how each review carries a pre-assigned label for its sentiment. With this in place, we can build and train a model that predicts the sentiment of a given input.