The sarcasm detector
Learning sarcasm from tweets!
Why a sarcasm detector?
This all started when I was looking for a toy project to teach myself natural language processing (NLP), a field that takes ideas from machine learning and applies them to text data. Your email spam filter is an application of NLP: a learning algorithm learns to differentiate spam from regular email by looking at the text content of the messages. It had just come out in the news that the U.S. Secret Service was looking for a sarcasm detector to improve its intelligence coming from Twitter, and I was curious to see whether this was even possible. It was a perfect project for NLP, so I decided to give it a try.
It wasn't clear to me that this was possible, because sarcasm is a complicated concept. Let's go back to the spam filter example for a minute. In a spam filter, the features most relevant to the classification of emails are certain keywords, "Enlarge your ..." for instance. A good learning algorithm learns the vocabulary associated with spam emails, so when presented with an email containing words from that vocabulary, the classifier marks the email as spam. My initial intuition was that sarcasm detection is more complicated than spam detection, because I didn't think there was a vocabulary associated with sarcastic sentences; I thought sarcasm was hidden in the tone and the ambivalence of the sentence.
Merriam-Webster defines sarcasm as "the use of words that mean the opposite of what you really want to say especially in order to insult someone, to show irritation, or to be funny". So to detect sarcasm properly, a computer would have to figure out that you meant the opposite of what you just said. It is sometimes hard even for humans to detect sarcasm, and humans have a much better grasp of the English language than computers do, so this was not going to be an easy task.
Getting the data
To train an algorithm to detect sarcasm, we first need some data to train it on. Classification is a supervised learning exercise, which means we need sentences labeled as sarcastic and sentences labeled as non-sarcastic so that our classifier can learn the difference between the two. One option would be to go over an online corpus that might contain some sarcastic sentences, for example online reviews or comments, and label the sentences by hand. This becomes very tedious if we want a large data set. The other option is to rely on the people writing the sentences to tell us whether their sentences are sarcastic, and this is what we are going to do. The idea is to use the Twitter streaming API to stream tweets with the hashtag #sarcasm, which will be our sarcastic texts, and tweets without the hashtag #sarcasm, which will be our non-sarcastic texts.
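For concreteness, here is a minimal collection sketch using tweepy (listed in the tools section below). The credential placeholders and the save_tweet helper are hypothetical; this is not the original collection script.

```python
import tweepy

# Hypothetical placeholders: substitute your own Twitter API credentials.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

class TweetCollector(tweepy.StreamListener):
    def on_status(self, status):
        # Store the raw text; the presence of #sarcasm provides the label.
        save_tweet(status.text)  # hypothetical storage helper

# Positive (sarcastic) samples: tweets containing #sarcasm.
stream = tweepy.Stream(auth, TweetCollector())
stream.filter(track=["#sarcasm"])
```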
The obvious advantage of taking our data from Twitter is that we can have as many samples as we want. Every day people write new sarcastic tweets; we can simply stream them and store them in a database. I ended up collecting 20 000 clean sarcastic tweets and 100 000 clean non-sarcastic tweets over a period of three weeks in June-July 2014 (see the next section for what counts as a clean tweet). Since tweets are often about what is currently happening in the world, it is important to collect the positive (sarcastic) and negative (non-sarcastic) samples during the same time period in order to isolate the sarcasm variable.
However, there is a drawback to taking our data from Twitter: it's noisy. Some people use the #sarcasm hashtag to point out that their tweet was meant to be sarcastic, even though a human would not have been able to guess it without the hashtag (example: "What a great summer vacation I've been having so far :) #sarcasm"). One may argue that this is not really noise, since the tweet is still sarcastic, at least according to its author, and sarcasm is in the eye of the beholder. The converse also happens: someone may write a tweet that is clearly sarcastic but lacks the hashtag #sarcasm. There are also sarcastic tweets where the sarcasm is in a linked picture or article. Some tweets are responses to other tweets, in which case the sarcasm can only be understood in the context of the previous tweets. And sometimes the hashtag #sarcasm indicates that, while the tweet itself is not sarcastic, some of its hashtags are (example: "Time to do my homework #yay #sarcasm"). I will discuss in the next section how to remove most of that noise, but short of reading all the tweets and labeling them by hand, we cannot remove all of it.
Preprocessing the data
Before extracting features from our text data, it is important to clean it up. To remove sarcastic tweets in which the sarcasm is in an attached link or in a response to another tweet, we simply discard all tweets that contain http addresses and all tweets that start with the @ symbol. Ideally we would only collect tweets written in English. When we collect sarcastic tweets, the requirement that they contain the hashtag #sarcasm makes it very likely that they are in English; to maximize the number of English tweets when collecting non-sarcastic tweets, we require that the tweet's location is either San Francisco or New York. In addition, we remove tweets that contain non-ASCII characters. We then remove all the hashtags, all the friend tags and all mentions of the words sarcasm or sarcastic from the remaining tweets. If, after this pruning stage, a tweet is at least 3 words long, we add it to our dataset; this last requirement removes some noise from the sarcastic dataset, since I do not believe one can be sarcastic in only 2 words. Finally, we remove duplicates.
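As a rough sketch, these cleaning rules can be written as a single filter function (the regular expressions here are my reconstruction, not the original code):

```python
import re

def clean_tweet(text):
    """Return the cleaned tweet, or None if the tweet should be discarded."""
    if "http" in text or text.startswith("@"):
        return None  # linked content or reply: the sarcasm may live elsewhere
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return None  # contains non-ASCII characters
    text = re.sub(r"[#@]\w+", " ", text)                # strip hashtags and friend tags
    text = re.sub(r"sarcas\w*", " ", text, flags=re.I)  # remove 'sarcasm', 'sarcastic', ...
    text = " ".join(text.split())                       # normalize whitespace
    return text if len(text.split()) >= 3 else None     # keep only tweets of 3+ words
```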
Extracting the features
This is really the meat of the algorithm. The question here is: what are the variables in a tweet that make it sarcastic or non-sarcastic, and how do we extract them from the tweet? To this end, I engineered several features that might help the classification of tweets and tested them on a cross-validation set (I will discuss metrics for evaluating cross-validation in a later section). The most important features that came out of this analysis are the following:
N-grams: More precisely, unigrams and bigrams. These are just collections of one word and of two consecutive words. To extract them, each tweet was tokenized and uncapitalized, and then each n-gram was added to a binary feature dictionary.
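In scikit-learn terms, this corresponds roughly to a binary CountVectorizer (a sketch, not necessarily the original implementation; tweets is assumed to hold the cleaned tweet texts):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Binary presence/absence of unigrams and bigrams; lowercase=True
# takes care of the 'uncapitalized' step.
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True, lowercase=True)
X_ngrams = vectorizer.fit_transform(tweets)  # sparse matrix, one row per tweet
```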
Sentiments: My hypothesis here is that sarcastic tweets might be more negative than non-sarcastic tweets, or the other way around. Moreover, there is often a big contrast of sentiments within a sarcastic tweet: it often starts with a very positive sentiment and ends with a very negative one (example: "I love being cheated on #sarcasm"). Sentiment analysis of tweets is a subject of its own, so the idea here is to have something simple that can test my hypothesis. To this end, I first split each tweet into one, two and three parts, and then run a sentiment analysis on all parts of the three splittings. I used two distinct sentiment analyzers. The first is my own quick and dirty implementation, based on a dictionary that gives a positive and a negative sentiment score to each word of the English language; by looking up words in this dictionary, we can give a sentiment score to each part of a tweet. The other implementation uses the python library Textblob, which has a built-in sentiment score function.
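A sketch of the splitting-and-scoring step using Textblob (the helper below is illustrative; the scores for the whole tweet, its halves and its thirds become features):

```python
from textblob import TextBlob

def sentiment_features(text):
    """Polarity and subjectivity of the whole tweet, its halves and its thirds."""
    words = text.split()
    features = []
    for n in (1, 2, 3):
        # Split the word list into n contiguous chunks of roughly equal size.
        bounds = [round(i * len(words) / n) for i in range(n + 1)]
        for i in range(n):
            part = " ".join(words[bounds[i]:bounds[i + 1]])
            sentiment = TextBlob(part).sentiment
            features += [sentiment.polarity, sentiment.subjectivity]
    return features  # 12 numbers: 2 scores x (1 + 2 + 3) parts
```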
Topics: There are words that are often grouped together in the same tweets. We call these groups of words topics. If we first learn the topics, then the classifier only has to learn which topics are more associated with sarcasm, which makes the supervised learning easier and more accurate. To learn the topics, I used the python library gensim, which implements topic modeling using latent Dirichlet allocation (LDA). We first feed all the tweets to the topic modeler, which learns the topics. Then each tweet can be decomposed as a mixture of topics, which we use as features.
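With gensim this looks roughly as follows (the number of topics is an illustrative assumption, not the value actually used):

```python
from gensim import corpora, models

texts = [tweet.split() for tweet in tweets]           # tokenized tweets
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Learn the topics; num_topics=200 is an arbitrary illustrative choice.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=200)

def topic_features(tweet, num_topics=200):
    """Decompose a tweet into topic weights, used as dense features."""
    weights = [0.0] * num_topics
    for topic_id, weight in lda[dictionary.doc2bow(tweet.split())]:
        weights[topic_id] = weight
    return weights
```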
Choosing a classifier
There is a very wide range of machine learning algorithms to choose from, most of which are available in the python library Scikit-learn. However, most of the implementations of these algorithms do not accept sparse matrices as inputs, and since we have a large number of nominal features coming from our n-grams, it is imperative that we encode our features in a sparse matrix. Out of the algorithms that do support sparse matrices in Scikit-learn, I ended up trying naive Bayes, logistic regression and a support vector machine (SVM) with a linear kernel. I got the best results in cross-validation using the SVM with a Euclidean (L2) regularization coefficient of 0.1.
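Putting the pieces together, a sketch of the training step (hstack combines the sparse n-gram matrix with the dense sentiment and topic features; I assume the 0.1 coefficient maps onto LinearSVC's C parameter):

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.svm import LinearSVC

# dense_features: one row per tweet of sentiment + topic features (assumed built above).
X = hstack([X_ngrams, csr_matrix(dense_features)])

clf = LinearSVC(C=0.1)   # linear-kernel SVM with L2 regularization
clf.fit(X, labels)       # labels: 1 = sarcastic, 0 = non-sarcastic
```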
The metric I used to guide my cross-validation is the F-score. This is a good metric when we have many more samples from one category than from the others. In our case we have 5 times more non-sarcastic tweets than sarcastic tweets. If we used accuracy as our metric, that is, the number of correct predictions divided by the total number of tweets in our cross-validation set, then a simple classifier which always predicts that a tweet is non-sarcastic would get an 83% accuracy. This is obviously very misleading, so we need a better metric. We can do better by considering precision and recall for the sarcastic category. Precision is the number of sarcastic tweets correctly identified divided by the total number of tweets classified as sarcastic, while recall is the number of sarcastic tweets correctly identified divided by the total number of sarcastic tweets in the cross-validation set. Both precision and recall would be equal to 0% for the dumb classifier which always predicts tweets to be non-sarcastic, so these are already much better scores to quantify the quality of a sarcasm classifier. The F-score is simply the harmonic mean of precision and recall.
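In code, the evaluation on the sarcastic class can be done with scikit-learn (a sketch; y_true and y_pred are the true and predicted labels on the cross-validation set):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_true, y_pred, pos_label=1)
recall = recall_score(y_true, y_pred, pos_label=1)

# The F-score is the harmonic mean of precision and recall.
f = 2 * precision * recall / (precision + recall)
assert abs(f - f1_score(y_true, y_pred, pos_label=1)) < 1e-9
```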
Results and insights
To understand the relative importance of each set of features, we can train our algorithm using only the n-grams, only the sentiments or only the topics, and look at the corresponding F-scores. For the n-grams we get F = 0.56, for the sentiments F = 0.41 and for the topics F = 0.35. Note that combining all the features will not give an F-score equal to the sum of the individual F-scores, but we should expect an improvement over each individual score. Initially I thought that the main features would come from the sentiment analysis, but the n-grams are by far the most important features. This shows that there is a vocabulary associated with sarcasm after all, and it's not just in the tone!
To gain some insight into what the algorithm has learned, we can look at the coefficients of our features in the trained SVM to see which ones are the most important. For n-grams, the features that are the most important to classify sarcastic tweets are
, etc. These n-grams all make perfect sense, but since I streamed the tweets during the 2014 FIFA World Cup I also ended up with three relevant unigrams for sarcastic tweets that are somewhat unexpected:
. The most relevant n-grams for the classification of non-sarcastic tweets are
, etc. For the sentiment analysis, the classifier learned that sarcastic tweets are overall slightly more positive than non-sarcastic tweets, and that the first half of a sarcastic tweet is more often positive while the second half is more often negative. However, what seems to matter the most is the subjectivity, which is also measured by the Textblob library: sarcastic tweets are more about expressing feelings, whether positive or negative, than non-sarcastic tweets are. Finally, the topic modeler learned that topics which contain words such as
etc. are more associated with sarcastic tweets while topics which contain words such as
, etc. are more associated with non-sarcastic tweets.
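A sketch of how the most indicative n-grams can be read off the trained model (assuming the linear SVM and vectorizer from the earlier sketches, with the n-gram columns stacked first):

```python
import numpy as np

# Coefficients of the linear SVM; the first columns correspond to the n-grams.
coefs = clf.coef_.ravel()
names = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn

# Largest positive weights push a tweet toward the sarcastic class,
# largest negative weights toward the non-sarcastic class.
order = np.argsort(coefs[:len(names)])
print("sarcastic:", [names[i] for i in order[-10:][::-1]])
print("non-sarcastic:", [names[i] for i in order[:10]])
```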
After the cross-validation phase, I tested the sarcasm classifier on a separate test set of 24 000 tweets and obtained an F-score of 0.60 on the sarcastic category. This is a small improvement over previous studies of sarcasm detection on tweets, which obtained F-scores in the range 0.50-0.55; see the references at the end of this post. The previous work on sarcasm detection uses only n-gram features and/or rule-based classification algorithms. Our analysis improves on them by incorporating a sentiment decomposition analysis and a topic modeling analysis. Moreover, we used a modern machine learning algorithm, the SVM, which is not rule-based.
I think sarcasm detection is a really fascinating subject, and I'm not being sarcastic! It's not clear to me whether this sarcasm detector would be useful to the U.S. Secret Service, but I do feel confident that I answered my original question: it is possible to detect sarcasm using tools from NLP. One quick way to improve this detector would be to incorporate a spell check for the tweets. This should reduce the number of dimensions of the dictionary for the n-gram features and improve the sentiment analysis as well. I did some preliminary experiments with the spell check of the Textblob library but did not get any improvement; perhaps a better spell checker would do better.
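For reference, Textblob's spell check is a one-liner (the sample tweet is made up):

```python
from textblob import TextBlob

# correct() returns a new TextBlob with likely spelling fixes applied.
fixed = str(TextBlob("I love beign cheated on #sarcasm").correct())
```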
The tools used
Here is a laundry list of tools used in this project:
Python, with the libraries Numpy, Scipy, Scikit-learn, NLTK, gensim, Textblob and tweepy.
The code for this project can be found online.
References
1. Dmitry Davidov, Oren Tsur and Ari Rappoport, "Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon".
2. Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert and Ruihong Huang, "Sarcasm as Contrast between a Positive Sentiment and Negative Situation".
3. Roberto Gonzalez-Ibanez, Smaranda Muresan and Nina Wacholder, "Identifying Sarcasm in Twitter: A Closer Look".
4. Christine Liebrecht, Florian Kunneman and Antal Van den Bosch, "The perfect solution for detecting sarcasm in tweets #not".