NLP: Twitter sentiment analysis with TensorFlow
Implementation of BOW, TF-IDF, word2vec, GloVe, and our own embeddings for sentiment analysis. This approach can be replicated for any NLP task.
The objective of this post is to show some of the top NLP solutions, some specific to deep learning and some from classical machine learning methods. This is a compilation of some posts and papers I have written in the past few months. As an example, I will use the Analytics Vidhya Twitter sentiment analysis data set. Yes, another post on sentiment analysis. It’s important to be aware that, to get competition-level results, all the models proposed in this post should be trained at a bigger scale (GPU, more data, more epochs, etc.).
The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.¹
Formally, given a training sample of tweets and labels, where label ‘1’ denotes the tweet is racist/sexist and label ‘0’ denotes the tweet is not racist/sexist, the objective is to predict the labels on the test dataset.
“Reason shapes the future, but superstition infects the present.”
― Iain M. Banks
We need to clean the text data in the tweets to continue with the experiment. But first, here are some helpful functions.
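The exact helpers live in the repository linked at the end; as a minimal sketch (the function names, regexes, and short-token cutoff below are illustrative choices, not a fixed recipe), the cleaning could look like this:

```python
import re

from nltk.stem.porter import PorterStemmer


def remove_pattern(text, pattern):
    # Remove every match of a regex pattern (e.g. @user handles) from the text.
    for match in re.findall(pattern, text):
        text = text.replace(match, "")
    return text


def clean_tweet(text, stem=False):
    # Lowercase, strip handles, links and punctuation, drop very short tokens,
    # and optionally stem each remaining token.
    text = text.lower()
    text = remove_pattern(text, r"@[\w]*")            # Twitter handles
    text = re.sub(r"http\S+|www\.\S+", " ", text)     # links
    text = re.sub(r"[^a-z#\s]", " ", text)            # keep letters and hashtags
    tokens = [t for t in text.split() if len(t) > 2]
    if stem:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(t) for t in tokens]
    return " ".join(tokens)
```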
Now we can load and clean the text data. We will only apply the stemmer when we are using BOW and TF-IDF; for word embeddings it is better to use the full word. Also, we will add a new column that counts how many words are in each text sentence (tweet). This will allow us to understand the distribution of sentence lengths and build an embedding matrix of the desired size (more on this later).
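A sketch of this step (the file names and column names below assume the Analytics Vidhya download; adjust them to your local copy):

```python
import pandas as pd

# Assumed file and column names; rename to match your download.
train = pd.read_csv("train.csv")   # columns: id, label, tweet
test = pd.read_csv("test.csv")     # columns: id, tweet

for df in (train, test):
    # Full words kept for the embedding models; stemmed version for BOW/TF-IDF.
    df["clean_tweet"] = df["tweet"].apply(clean_tweet)
    df["stemmed_tweet"] = df["tweet"].apply(lambda t: clean_tweet(t, stem=True))
    # Word count per tweet, used later to pick the size of the embedding matrix.
    df["word_count"] = df["clean_tweet"].str.split().str.len()

print(train["word_count"].describe())
```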
So now that we have clean tweets, we are ready to convert the text to a numerical representation. Why? Because we need a way to feed this text as input to a neural network. We could assign a number to each word, but that would leave us with a matrix of all the words in the world by all the words in the world. That doesn’t seem right, so instead we can apply one of several transformations: BOW, TF-IDF, or word embeddings. I will explain each one:
BOW (bag-of-words model)
This approximation is a simplifying representation used in natural language processing. In this model, a text (such as a sentence or a document) is represented as a bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier. ² ³
TF-IDF (Term Frequency - Inverse Document Frequency)
It is a numerical statistic that is intended to reflect how important a word is to a document in a corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The TF-IDF value increases proportionally with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. TF-IDF is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use it.⁴ ⁵
Word Embeddings
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension. ⁶
Now, for classical machine learning we can use TF-IDF and BOW, either each one on its own or both joined together. Below is the code for testing some of the most used machine learning methods.
Preparing the BOW and TF-IDF
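A minimal way to prepare both representations with scikit-learn (the max_features cap, stop-word list, and validation split below are illustrative choices, not tuned values):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

# Illustrative hyperparameters.
bow_vectorizer = CountVectorizer(max_features=2500, stop_words="english")
tfidf_vectorizer = TfidfVectorizer(max_features=2500, stop_words="english")

X_bow = bow_vectorizer.fit_transform(train["stemmed_tweet"])
X_tfidf = tfidf_vectorizer.fit_transform(train["stemmed_tweet"])
y = train["label"].values

# Hold out a validation split so the two representations can be compared fairly.
Xb_train, Xb_valid, yb_train, yb_valid = train_test_split(
    X_bow, y, test_size=0.3, random_state=42, stratify=y)
Xt_train, Xt_valid, yt_train, yt_valid = train_test_split(
    X_tfidf, y, test_size=0.3, random_state=42, stratify=y)
```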
Now some classical methods. For this exercise we will use logistic regression and decision trees, but you can test any kind of classical machine learning model.
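A sketch of this comparison, scored with F1 since that is the competition metric (the model hyperparameters are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=10),
}

# Train and score on the BOW features; swap in Xt_train / Xt_valid for TF-IDF.
for name, model in models.items():
    model.fit(Xb_train, yb_train)
    preds = model.predict(Xb_valid)
    print(name, "F1:", f1_score(yb_valid, preds))
```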
So we have tested BOW and TF-IDF separately, but what happens if we join them? This is how. We could also use this combined representation as input for a neural network, but that part is trivial, so you can do it at home.
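One simple way to join them is to stack the two sparse matrices column-wise, for example:

```python
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stack the two sparse matrices so every tweet carries both feature sets.
X_joint = hstack([X_bow, X_tfidf]).tocsr()

Xj_train, Xj_valid, yj_train, yj_valid = train_test_split(
    X_joint, y, test_size=0.3, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000)
clf.fit(Xj_train, yj_train)
print("joint BOW + TF-IDF F1:", f1_score(yj_valid, clf.predict(Xj_valid)))
```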
So far, we have only cleaned the data and trained some classical models using the BOW and TF-IDF approaches. Let’s see how to implement our own embedding using TensorFlow and Keras.
Deep Learning Embeddings
Before we start to train, we need to prepare our data using the Keras tokenizer and build a text matrix of sentence size by total data length. In the preprocessing we did before, we printed the distribution of text lengths and obtained a median of 38 words per sentence (tweet) and a maximum of 120. This means the word matrix should have a size of 120 by the data length. It also restricts our model to a maximum of 120 words per sentence (tweet): if new data comes in longer than 120 words, only the first 120 will be kept, and if it is shorter, the sequence will be padded with zeros.
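A sketch of this preparation with the Keras tokenizer (the vocabulary cap of 20,000 is an illustrative choice):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 120        # longest sentence we accept, from the length distribution above
VOCAB_SIZE = 20000   # illustrative vocabulary cap, not tuned

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<unk>")
tokenizer.fit_on_texts(train["clean_tweet"])

# Integer-encode each tweet, then truncate or zero-pad every sequence to MAX_LEN.
X_seq = pad_sequences(tokenizer.texts_to_sequences(train["clean_tweet"]),
                      maxlen=MAX_LEN, padding="post", truncating="post")
y_seq = train["label"].values
```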
Next, we will create the model architecture and print the summary to see our layer connections. The model is really simple: a dropout after the embedding, then an LSTM, and finally the output layer.
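A minimal version of that architecture (the embedding size, dropout rate, LSTM units, and training schedule are illustrative, not tuned):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Dropout, LSTM, Dense

EMBEDDING_DIM = 100   # illustrative size for the embedding learned from scratch

model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM),
    Dropout(0.3),                    # dropout right after the embedding
    LSTM(64),
    Dense(1, activation="sigmoid"),  # binary output: hate speech or not
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()

model.fit(X_seq, y_seq, validation_split=0.2, epochs=5, batch_size=128)
```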
Word2Vec and GloVe
For the Word2Vec and GloVe approaches we need to load the pre-trained values of the embedding matrix. This method could also be used with Numberbatch. Remember that the size of the matrix depends on the pre-trained model weights you download. To build this matrix we will use all the words seen in train and test (ideally, all the words we could see in our case of study). We will build a matrix with these vectors so that each time an input word is processed it finds its appropriate vector; in the end we will have an input matrix of the maximum sentence length by the embedding size (e.g. 300 for word2vec). The code for loading the embeddings is presented below.
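A sketch of how that loading could look for a GloVe text file (the file name is only an example; for word2vec you could instead load the vectors with gensim’s KeyedVectors.load_word2vec_format and fill the matrix the same way):

```python
import numpy as np

EMBEDDING_DIM = 300   # must match the pre-trained vectors you download

# Parse a GloVe-style text file: each line is a word followed by its vector.
# The file name is just an example; use whichever pre-trained file you have.
embeddings_index = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.rstrip().split(" ")
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Rows follow the Keras tokenizer indices; words without a pre-trained vector
# (and index 0, reserved for padding) stay as zero vectors.
embedding_matrix = np.zeros((VOCAB_SIZE, EMBEDDING_DIM))
for word, idx in tokenizer.word_index.items():
    if idx < VOCAB_SIZE and word in embeddings_index:
        embedding_matrix[idx] = embeddings_index[word]
```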
For this method we will have an independent input layer before the embedding, but otherwise we can build it the same way as the own-embedding proposal. The proposed model architecture is the following:
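A sketch of that architecture with the Keras functional API, keeping the pre-trained vectors frozen (layer sizes are illustrative):

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Dropout, LSTM, Dense
from tensorflow.keras.initializers import Constant

# Functional API: an explicit input layer in front of the pre-trained embedding.
inputs = Input(shape=(MAX_LEN,))
x = Embedding(VOCAB_SIZE, EMBEDDING_DIM,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False)(inputs)   # keep the pre-trained vectors frozen
x = Dropout(0.3)(x)
x = LSTM(64)(x)
outputs = Dense(1, activation="sigmoid")(x)

glove_model = Model(inputs, outputs)
glove_model.compile(loss="binary_crossentropy", optimizer="adam",
                    metrics=["accuracy"])
glove_model.summary()
```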
Each one of these methods comes with its own pre-trained weights, and to keep the results comparable we won’t train these weights. The only case in which we will is when we build our own embedding from scratch using Keras. In my experience, the ideal process for training this kind of model is to first train the recurrent part with the embedding weights frozen (the same applies to feature extraction in images or other domains), and once that is done, train everything together, including the embedding. The reason is that in the initial steps of backpropagation the weights of the RNN are random (even if you use an initializer like Xavier, they are still random), so the error tends to be really big, and this would badly disturb the pre-trained weights. But if you unfreeze the embedding at the end, you can adjust its weights to your specific problem.
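A sketch of that two-phase schedule for the pre-trained-embedding model above (the epoch counts and the smaller fine-tuning learning rate are illustrative choices):

```python
from tensorflow.keras.optimizers import Adam

# Phase 1: train the recurrent part while the pre-trained embedding stays frozen.
glove_model.fit(X_seq, y_seq, validation_split=0.2, epochs=5, batch_size=128)

# Phase 2: unfreeze the embedding, recompile with a smaller learning rate,
# and fine-tune everything together so the vectors adapt to this task.
glove_model.layers[1].trainable = True   # layer 1 is the Embedding in the model above
glove_model.compile(loss="binary_crossentropy",
                    optimizer=Adam(learning_rate=1e-4),
                    metrics=["accuracy"])
glove_model.fit(X_seq, y_seq, validation_split=0.2, epochs=3, batch_size=128)
```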
This is the GitHub repository that has all the code and the Jupyter notebooks. It also has some experiment results.
In other posts, I will do an implementation of BERT and ELMo using TensorFlow Hub. I hope you enjoyed it.
“It isn’t what we say or think that defines us, but what we do.” ― Jane Austen, Sense and Sensibility
My name is Sebastian Correa; here is my web page if you wanna see more of my projects.
[1]: Analytics Vidhya, Twitter Sentiment Analysis
https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/
[2]: Wikipedia, Bag of words
https://en.wikipedia.org/wiki/Bag-of-words_model
[3]: McTear, Michael, et al. (2016). The Conversational Interface. Springer International Publishing. https://www.springer.com/gp/book/9783319329659
[4]: Wikipedia, TF-IDF
https://es.wikipedia.org/wiki/Tf-idf
[5]: Beel, J., Gipp, B., Langer, S. et al. Int J Digit Libr (2016) 17: 305. https://doi.org/10.1007/s00799-015-0156-0
[6]: Lebret, Rémi; Collobert, Ronan (2013). “Word Embeddings through Hellinger PCA”. Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2014. arXiv:1312.5542