Intro to Sentiment Analysis
Most of us have seen the movie I, Robot (2004), in which a robot claims that it has feelings, something the detective refuses to believe.
PS: If you have not seen this movie, please go watch it.
The movie is, of course, fiction, but it hints at a future application of sentiment analysis.
The AI industry has built impressively capable robots; what they still lack is the ability to understand human feelings and sentiments, even though human emotions are often the key to solving a problem.
Sentiment analysis is our attempt to achieve that recognition of sentiment using machine learning.
What is Sentiment Analysis?
Sentiment Analysis, often called emotion AI, refers to the extraction of subjective information from source material. It is mainly used to help a business understand the social sentiment around its product, but its applications go far beyond that: analyzing social media only for basic polarity and count-based metrics merely scratches the surface of what the technique can do.
With recent developments in deep learning, the accuracy of sentiment recognition has improved dramatically, and for in-depth analysis these models can deliver much better accuracy and performance.
Types of Sentiment Analysis
Sentiment analysis models focus mainly on polarity (positive, neutral, negative), but they can also target emotions and feelings (happy, angry, sad, frustrated, excited, etc.).
Some of the popular models are mentioned below:
1. Emotion detection
This model aims to detect emotions like happiness, anger, sadness, etc. Most of these models use lexicons (a list of words with the emotions they carry).
One of the drawbacks of lexicons is that the same word can contribute to a positive or a negative meaning depending on the combination it appears in. For example, the word “kill” makes “I will kill you” a negative sentence, while “You are killing it” has a positive meaning.
2. Fine-grained Sentiment Analysis
If you want an output based purely on polarity, then this model is the one you are looking for. It gives the output as:
a. Positive
b. Neutral
c. Negative
3. Multilingual sentiment analysis
This is the most difficult type of all, but also the most realistic, since people these days often mix more than one language when they write. Many pre-existing models are available, but none of them is good enough to be used without inspecting the results.
Why is Sentiment Analysis important?
With the COVID-19 pandemic, almost everything has shifted online, and it has become much harder for a business to understand customer needs and sentiments. One of the most practical ways to listen to users and act on their needs is sentiment analysis.
A human cannot go through every like, dislike, and comment from every user, so an automated solution is needed, and that solution is a sentiment analysis model.
The main benefits of sentiment analysis include:
- Real-Time Analysis: Sentiment analysis can identify critical issues in real time. For example, if a video criticizing a culture or religion is not taken down quickly, it can cause havoc in the affected region. A sentiment analysis model can flag such a video as inappropriate and notify the concerned officials, or act on its own if given the right permissions.
- Sorting Data at Scale: Without an automated solution, imagine manually reading every social media comment about a business on Twitter, Facebook, Reddit, and so on. The data is not only huge, it grows every second, and sorting it by hand to draw conclusions would take years. Sentiment analysis saves that time and sorts the data at a much larger scale.
How does Sentiment Analysis work?
Sentiment Analysis uses various NLP (Natural Language Processing) methods.
Some types of algorithms used include:
- Automatic: uses machine learning techniques to learn the task from data.
- Rule-Based: this model performs based on manually set rules and instructions.
- Hybrid: as the name suggests it is a combination of both automatic and rule-based algorithms.
Automatic Algorithm:
The automatic approach relies on machine learning. Sentiment analysis is usually framed as a classification problem: the model predicts a sentiment category, i.e., Positive, Negative, or Neutral.
Rule-Based Algorithm:
Rule-Based Algorithm uses a set of user-crafted rules and instructions to help identify polarity or subjective sentiment.
Some of the pre-processing techniques used include:
- Lemmatization, tokenization, and part-of-speech tagging
- Lexicons (list of words)
The limitation of this approach is that it does not account for word order: different combinations of the same words in different sequences can carry different meanings, which a fixed set of rules cannot differentiate. Automatic, machine-learning-based algorithms are used to solve this problem.
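As a concrete illustration of the rule-based approach, NLTK ships a lexicon-and-rules sentiment tool called VADER. The snippet below is a minimal sketch of how it can be used; it is independent of the classifier we build later in this article.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the lexicon that the rule-based analyzer relies on
nltk.download('vader_lexicon')

analyzer = SentimentIntensityAnalyzer()
# polarity_scores returns negative, neutral, positive, and compound scores
print(analyzer.polarity_scores("The acting was brilliant but the plot was painfully slow"))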
Hybrid Algorithm:
This approach combines the rule-based and automatic algorithms into a single hybrid system. Its main advantage is that its results are usually more accurate than either approach on its own.
How a machine learning classifier works:
Training Process:
In this process, each text is paired with a tag that specifies its sentiment. The text is passed through a feature extractor that converts it into a feature vector, and the pairs of feature vectors and tags are fed into the machine learning algorithm to train a model.
Prediction Process:
In this process, new text is converted into a feature vector using the same feature extractor. The feature vector is fed into the trained model, which outputs a tag categorizing the text as Positive, Negative, or Neutral.
How are features extracted from the text?
The classical approach to feature extraction is a bag of words or bag of n-grams, i.e., representing a text by the words (or short word sequences) it contains together with their frequencies.
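As a minimal sketch of the idea (not the feature extractor we use later in this article), a bag-of-words representation can be built with Python's standard library:
from collections import Counter

def bag_of_words(text):
    # Split on whitespace and count how often each (lowercased) word appears
    return Counter(text.lower().split())

print(bag_of_words("good movie , very good acting"))
# e.g. Counter({'good': 2, 'movie': 1, ',': 1, 'very': 1, 'acting': 1})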
Classification Algorithms
Some of the most common models used these days are Naive Bayes, Support Vector Machines, and Neural Networks; a short code sketch follows the list below.
- Naive Bayes classifier: a classification algorithm based on Bayes’ Theorem. It is actually a family of algorithms that share a common assumption: every pair of features being classified is independent of the others.
- Support Vector Machines: a non-probabilistic model that represents text examples as points in a multidimensional space, where examples with different sentiments are mapped to distinct regions of that space.
- Neural Networks: a large family of algorithms that attempt to mimic the activity of the human brain and use artificial neural networks to predict sentiment.
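As a rough, hypothetical sketch of how these classifiers could be compared on raw text, here is a scikit-learn pipeline with a made-up toy dataset (scikit-learn is a separate library and is not used in the NLTK walkthrough later in this article):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, for illustration only
texts = ["I love this product", "Great service", "Terrible experience", "I hate waiting"]
labels = ["Positive", "Positive", "Negative", "Negative"]

for clf in (MultinomialNB(), LinearSVC()):
    # Bag-of-words features followed by the classifier
    model = make_pipeline(CountVectorizer(), clf)
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["What a great experience"]))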
Sentiment Analysis Challenges
Sentiment Analysis is one of the toughest tasks in NLP because even humans struggle to analyze sentiments correctly.
For example, in the TV series The Big Bang Theory, Sheldon can never quite read other people's sentiments.
Implementation of Sentiment Analysis in Python:
There are a few prerequisites for implementing sentiment analysis in Python:
- Python 3 should be installed on your machine if you are working in a local notebook.
- Familiarity with working with language data is recommended.
Step 1 — Installing NLTK
We will use the NLTK library in Python for natural language processing in this article.
First, install the library with pip:
pip install nltk
Now import NLTK in your Python notebook:
import nltk
Step 2 — Downloading sample data
We will download sample Twitter data from the NLTK library, for demo purposes only:
nltk.download('twitter_samples')
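To quickly check what was downloaded, you can list the files in the corpus; it contains a set of 5,000 positive tweets, a set of 5,000 negative tweets, and a larger file of uncategorized tweets:
from nltk.corpus import twitter_samples

print(twitter_samples.fileids())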
Step 3 — Tokenizing the Data
A token is a sequence of characters in text that serves as a unit. Depending on how you create them, tokens may consist of words, emoticons, hashtags, links, or even individual characters. A basic way of breaking language into tokens is to split the text on whitespace and punctuation, as in the sketch below.
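Here is a toy illustration of that basic approach, using only plain Python (we will switch to NLTK's smarter, pre-built tokenizer in a moment):
import string

text = "NLTK makes tokenizing easy, doesn't it?"
# Naive tokenization: split on whitespace, then strip surrounding punctuation
tokens = [word.strip(string.punctuation) for word in text.split()]
print(tokens)
# e.g. ['NLTK', 'makes', 'tokenizing', 'easy', "doesn't", 'it']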
Import the Twitter samples into the Python notebook:
from nltk.corpus import twitter_samples
Let’s create variables for positive tweets and negative tweets:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
For tokenizing, the “punkt” resource from NLTK is required:
nltk.download('punkt')
Once the resource has been downloaded, we are ready to use NLTK’s tokenizers.
For testing purposes, let’s tokenize the positive tweets and look at the first one:
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
print(tweet_tokens[0])
and the output will look like this:
['#FollowFriday',
'@France_Inte',
'@PKuchly57',
'@Milipol_Paris',
'for',
'being',
'top',
'engaged',
'members',
'in',
'my',
'community',
'this',
'week',
':)']
Step 4 — Normalizing the Data
We will normalize the data using lemmatization. First, download the additional resources needed for lemmatization and part-of-speech tagging:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
We will import the pos_tag function, which takes a list of tokens and returns each token paired with its part-of-speech tag.
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        # Map the part-of-speech tag to the form the WordNet lemmatizer expects
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence
print(lemmatize_sentence(tweet_tokens[0]))
This is the output:
['#FollowFriday',
'@France_Inte',
'@PKuchly57',
'@Milipol_Paris',
'for',
'be',
'top',
'engage',
'member',
'in',
'my',
'community',
'this',
'week',
':)']
The verbs in this tweet have been changed to their root forms (“being” becomes “be”, “engaged” becomes “engage”), and “members” becomes “member”. Lemmatization groups the different forms of a word together instead of treating each form as a separate word.
Step 5 — Removing Noise
We will use regular expressions to search for and remove unwanted data, such as hyperlinks and “@” mentions, from the dataset:
import re, string
def remove_noise(tweet_tokens, stop_words = ()):
    cleaned_tokens = []
    for token, tag in pos_tag(tweet_tokens):
        # Remove hyperlinks
        token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       '(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        # Remove "@" mentions
        token = re.sub("(@[A-Za-z0-9_]+)", "", token)
        # Lemmatize the token using its part-of-speech tag
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)
        # Keep the token only if it is not empty, punctuation, or a stop word
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens
We will also use the “stopwords” resource, again from the NLTK library:
nltk.download('stopwords')
For example, we will remove noise from a positive tweet:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(remove_noise(tweet_tokens[0], stop_words))
And the output will be:
['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']
As we can see, the “@” mentions and stop words have been removed, and the whole text has been converted to lowercase.
Tokenizing and cleaning the whole dataset:
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')
positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []
for tokens in positive_tweet_tokens:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

for tokens in negative_tweet_tokens:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
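As an optional sanity check, you can compare a raw tokenized tweet with its cleaned version:
print(positive_tweet_tokens[500])
print(positive_cleaned_tokens_list[500])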
Step 6 — Converting Tokens to a Dictionary
We will be using the Naive Bayes classifier in NLTK to perform the sentiment analysis. The model expects each tweet as a dictionary with the tokens as keys and True as the values, so we convert our cleaned token lists into that format:
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)
positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
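If you are curious what this looks like, you can peek at the first converted tweet by creating a fresh generator (so the two generators above remain unused until the next step):
sample_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
print(next(sample_tokens_for_model))
# e.g. {'#followfriday': True, 'top': True, 'engage': True, 'member': True, 'community': True, 'week': True, ':)': True}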
Step 7 — Splitting the dataset
Next, we will prepare the dataset for training: attach the “Positive” and “Negative” labels, combine both sets, shuffle them, and split them into 8,000 tweets for training and 2,000 for testing (the corpus contains 5,000 positive and 5,000 negative tweets in total).
import random
positive_dataset = [(tweet_dict, "Positive")
                    for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                    for tweet_dict in negative_tokens_for_model]
dataset = positive_dataset + negative_dataset
random.shuffle(dataset)
train_data = dataset[:8000]
test_data = dataset[8000:]
Step 8 — Building the Model
We will use “NaiveBayesClassifier” to build the model.
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_data)
print("Accuracy is:", classify.accuracy(classifier, test_data))
The Output is:
Accuracy is: 0.9956666666666667
After running this, we get a very high accuracy of about 99.6% on the test dataset (the exact number will vary slightly from run to run because of the random shuffle).
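NLTK’s Naive Bayes classifier can also report which tokens were the most informative for its decisions, which is a useful sanity check on what the model has actually learned; on this dataset you will typically see emoticons such as “:)” and “:(” near the top:
classifier.show_most_informative_features(10)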
Now we can check how the model performs on arbitrary tweets or text, as shown below:
from nltk.tokenize import word_tokenize
custom_tweet = "I really like this new app on my phone called Slack"
custom_tokens = remove_noise(word_tokenize(custom_tweet))
print(classifier.classify(dict([token, True] for token in custom_tokens)))
And the Output is:
Positive
We can test our model on other datasets as well, and if a particular dataset does not give good accuracy, we can also try different classifiers.
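For instance, you can try a sentence with a clearly negative tone and see whether the classifier agrees (the exact output depends on the trained model, but it should usually be “Negative”):
custom_tweet = "The service was terrible and I will never order from them again"
custom_tokens = remove_noise(word_tokenize(custom_tweet))
print(classifier.classify(dict([token, True] for token in custom_tokens)))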