Sentiment analysis


DUE: May 11, 2012 by 10am (submission on Blackboard)

For this homework, you will complete a sentiment analysis system for tweets by implementing features that will be used by supervised machine learners.

  • As usual, the description of the problems for this homework is somewhat long, and as usual, much of the text is there to give you documentation and help you know what to do.
  • You should read all of the problems before starting. Then read through each problem carefully, in its entirety, before answering questions and doing the implementation.
  • It will be best to solve the problems in order since each one builds on the previous problem.
  • My solutions to these problems don’t use a single “var” variable. I encourage you to try to do without them too, but you may use them if you really must.
  • If you run into what you think might be a bug with Scalabha, please let me know right away so I can fix it if it indeed is a bug.
  • Feel free to look at the sentiment analysis homework on which this is based. I unfortunately had to cut certain aspects of that homework when creating this one, like doing subjectivity analysis separately in a two-stage classification setup.
  • Tip: Check out Bo Pang and Lillian Lee's book: Opinion Mining and Sentiment Analysis (free online!)
  • If you have any questions or problems with any of the materials, don't hesitate to ask!
If you don’t absolutely need to have words on paper, don’t print this homework description out -- it has a lot of example output that makes it lengthy, and it will be easier to do some cutting and pasting, etc., right from the web page.


Getting Scalabha

You must obtain and install Scalabha, a software package that will allow you to access code written to support this assignment and to easily compile and run your own code. For this homework, you need version 0.2.5, which you can obtain here:

Note: if you know how to use Git (or are interested in using it), you can just clone the Scalabha repository.

You should unzip the contents and follow the installation instructions in the scalabha-0.2.5/ directory. Note that one of those instructions requires you to set an environment variable SCALABHA_DIR that points to the location of Scalabha on your machine. At times, this homework will reference file locations relative to that directory.

The data

You will work with classification datasets that are located in $SCALABHA_DIR/data/classify. Go to that directory and have a look at it.

Setting up the code

Like the previous assignments, you will develop your solutions in the context of the Scalabha build system. See tutorial 11 on SBT and Scalabha for a refresher on SBT if you need it.

There are five files in the homework bundle: four stub files (TwitterSentiment.scala, StudentCode.scala, ClassifierUtil.scala, TwitterUtil.scala) and an answers file, hw6_answers.txt. Please modify only StudentCode.scala and hw6_answers.txt when solving the problems; do not use different names for any of them. To prepare for working with the assignment, you should do the following steps:

$ unzip
$ scalabha build compile

The last command should end with a line starting with [success]. If you are having trouble with this, get in touch with me right away.

Your implemented solutions will be done in StudentCode.scala. Any portions of problems that begin with Written question or that request example output should go in hw6_answers.txt.

Tip: while you are working on your solutions, you should definitely take advantage of the ~compile command in SBT. It will compile your code automatically every time you save a file you are working on.

Note: all of the commands given below assume that you are running them from the $SCALABHA_DIR/data/classify directory.

Submitting your solutions

For submission, create a zip file with the name hw6_<lastname>_<firstname>.zip that contains your src/main/scala/ata/hw6 directory and its contents. Here are some example commands for doing this:

$ zip -r hw6_<lastname>_<firstname>.zip src/main/scala/ata/hw6/
  adding: src/main/scala/ata/hw6/ (stored 0%)
  adding: src/main/scala/ata/hw6/ClassifierUtil.scala (deflated 66%)
  adding: src/main/scala/ata/hw6/hw6_answers.txt (deflated 80%)
  adding: src/main/scala/ata/hw6/StudentCode.scala (deflated 56%)
  adding: src/main/scala/ata/hw6/TwitterSentiment.scala (deflated 69%)
  adding: src/main/scala/ata/hw6/TwitterUtil.scala (deflated 53%)

Make sure that your code compiles before you do this. If it does not, I’ll deduct 20 points right away and ask you to fix it and resubmit.

Tip: Remember that you can create the scaladoc API documentation by running the "doc" command in SBT, and that you also have access to the source code in $SCALABHA_DIR/src/main/scala, which you will be looking at at times to solve the problems in this homework.

Prelude: The twitter sentiment datasets

We now turn to the sentiment analysis task: predicting the sentiment of tweets. There are three datasets: the Debate08 (Obama-McCain) dataset, Health Care Reform (HCR) dataset, and the Stanford Twitter Sentiment dataset.

The Debate08 dataset comes from the following papers:

David A. Shamma; Lyndon Kennedy; Elizabeth F. Churchill. 2009. Tweet the Debates: Understanding Community Annotation of Uncollected Sources , ACM Multimedia, ACM.

Nicholas A. Diakopoulos; David A. Shamma. 2010. Characterizing Debate Performance via Aggregated Twitter Sentiment , CHI 2010, ACM.

This dataset can be found in data/classify/debate08. It has been split into train/dev/test XML files that you'll be using for the obvious purposes. See the script data/classify/debate08/orig/ if you want to see the details of how the raw annotations were processed to create the files you'll be using.

The HCR dataset comes from the following paper:

Michael Speriosu, Nikita Sudan, Sid Upadhyay, and Jason Baldridge. 2011. Twitter Polarity Classification with Label Propagation over Lexical Links and the Follower Graph. In Proceedings of the First Workshop on Unsupervised Methods in NLP. Edinburgh, Scotland.

As before, there is a train/dev/test split, with each split in its own XML file.

The Stanford Twitter Sentiment dataset comes from the following paper:

Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. Unpublished manuscript. Stanford University, 2009.

This dataset is much smaller, so it will be used only for evaluating models after they have been trained on the materials from the other datasets. Also, it is not in XML format, so you'll need to write a converter that reads the native format for it and outputs the XML format required by the TwitterSentiment.scala application.

The TwitterSentiment.scala application is your entry into training and evaluating different models in the context of twitter sentiment analysis. The script actually hides a number of the details of running various models for you, including making it so you don't have to run a command for training, another for applying, doing evaluation, etc. Many of these details are contained in the classes and objects defined in ClassifierUtil.scala -- feel free to have a look at that if you are interested.

You will modify the file StudentCode.scala for your submission, as instructed in the following problems.

You should go ahead and look at these files, and then look at the Tweet object in TwitterUtil.scala to understand how you can access information about a tweet in your code.

Problem 1 - Sentiment analysis: majority class baseline [10 pts]

One of the most important things to do when working on empirical natural language processing is to compare your results to reasonable baselines to ensure that the effort you are putting into some fancy model is better than a super simple approach. This is the "do the dumb thing first" rule, so let's do that.

If you have training data, then the easiest rule to follow is to find the majority class label and use that for labeling new instances. Using standard Unix tools, you can find this out for data/debate08/train.xml as follows:

$ grep 'label="' debate08/train.xml | cut -d ' ' -f4 | sort | uniq -c
    369 label="negative"
    143 label="neutral"
    283 label="positive"

So, negative is the majority label. However, you need to compute this in the Scala code based on a sequence of labels. Look at the apply method of the MajorityClassCalculator object in StudentCode.scala -- the stub implementation returns the label neutral with a confidence of .333. You can see the result of running TwitterSentiment.scala with this stub implementation as follows:

$ scalabha run ata.hw6.TwitterSentiment -t debate08/train.xml -e debate08/dev.xml -m majority
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
0    454    0    |    454    negative
0    141    0    |    141    neutral
0    200    0    |    200    positive
0    795    0
negative neutral positive

               17.74    Overall accuracy
P        R       F
0.00     0.00    0.00    negative
17.74  100.00   30.13    neutral
0.00     0.00    0.00    positive
5.91    33.33    10.04    Average

The output shows the evaluation for full polarity classification (positive, negative, neutral), with respect to both the raw confusion matrix and precision (P), recall (R), and F-score (F) for each category. Right now the detailed results aren't very interesting because we're not predicting more than a single label, in this case neutral, so we'll discuss what these mean more in the next problem. For now, note that the overall accuracy of assigning every tweet in the evaluation set (dev.xml) the label neutral is quite poor. Overall accuracy is computed simply as the number of tweets that received the correct label, divided by the total number of tweets.

Implementation. Fix the MajorityClassCalculator object so that it computes the majority class label from the sequence of labels that is given to it as an argument and returns that majority class label and its probability (the relative proportion of occurrences of that label). Then run TwitterSentiment.scala as shown above. Writing. What is the overall accuracy when you do this?
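As a hedged illustration of the core idea (the stub's actual signature in StudentCode.scala may differ), one idiomatic way to compute a majority label and its relative frequency from a sequence of labels in Scala is:

```scala
// Illustrative sketch: find the most frequent label and its proportion.
// This is NOT the assignment's exact code; adapt it to the stub's signature.
def majority(labels: Seq[String]): (String, Double) = {
  val (label, count) =
    labels.groupBy(identity)                       // label -> all its occurrences
          .map { case (l, ls) => (l, ls.size) }    // label -> count
          .maxBy(_._2)                             // pick the most frequent
  (label, count.toDouble / labels.size)            // (majority label, proportion)
}
```

Note that `groupBy` plus `maxBy` avoids any mutable state, in keeping with the no-`var` encouragement above.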

Of course, predicting everything to be one label or another is a pretty useless thing to do in terms of any later processing and analysis one might do based on sentiment analysis. That means it is very important not just to consider the overall accuracy, but also to pay attention to the performance for each label. The lexicon-based method we consider next can predict any label, and thus lets us start considering label-level performance and looking for models that not only have good overall accuracy, but also do a good job of finding instances of each label.

Problem 2 - Lexicon ratio baseline [15 pts]

Another reasonable baseline is to use a polarity lexicon to find the number of positive tokens and negative tokens and pick the label that has more tokens. The lexicon we'll use is Bing Liu's Opinion Lexicon, which you can find in the following files:
  • src/main/resources/lang/eng/lexicon/positive-words.txt
  • src/main/resources/lang/eng/lexicon/negative-words.txt
Look at the header of one of those files for a few comments on what is in them. Note that there are misspelled words in these lists, which is intentional.
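The core counting operation for a lexicon-ratio baseline can be sketched as follows (a standalone illustration, assuming the lexicon has already been loaded into a Set[String]; the assignment's actual methods live in LexiconRatioClassifier):

```scala
// Illustrative sketch: count how many tokens appear in a polarity lexicon.
// A Set[String] is itself a String => Boolean, so it can be passed to count.
def countInLexicon(tokens: Seq[String], lexicon: Set[String]): Int =
  tokens.count(lexicon)
```

With this, the number of positive tokens is just the count against the positive word set, and likewise for negative.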

For this problem, you'll modify the apply method of the object LexiconRatioClassifier in StudentCode.scala. It extends AbstractLexiconRatioClassifier in ClassifierUtil.scala, and provides useless (but compilable) return values for the two methods numPositiveTokens and numNegativeTokens. You'll want to take advantage of the positive and negative word sets in Lexicon in TwitterUtil.scala. The code in AbstractLexiconRatioClassifier handles the rest (have a look, though you don't need to change the code). The default setting predicts neutral for all tweets. Run it like this:

$ scalabha run ata.hw6.TwitterSentiment -e debate08/dev.xml -m lexicon

Note that no training material is needed because this is only based on the words that are present and the polarity lexicon entries. The results are the same as the pick-neutral-for-every-item baseline you fixed in the previous problem. Let's fix the lexicon baseline next.

Part (a). Implementation. Change the LexiconRatioClassifier object so that it computes the number of positive tokens and negative tokens, using the sets positive and negative (which have been read in from the Opinion Lexicon in the Lexicon object). See the instructions in the file. Note that this is all you should do: the code in AbstractLexiconRatioClassifier handles ties and allows the neutral label to be predicted. Also note that you could surely improve on that implementation with some more effort, but the point here is to get a reasonable baseline, not to make a great lexicon-based classifier. After you've done this, your output should look like the following:

$ scalabha run ata.hw6.TwitterSentiment -e debate08/dev.xml -m lexicon
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
107    279    68    |    454    negative
11     117    13    |    141    neutral
22     106    72    |    200    positive
140    502    153
negative neutral positive

                  37.23    Overall accuracy
P         R        F
76.43    23.57    36.03    negative
23.31    82.98    36.39    neutral
47.06    36.00    40.79    positive
48.93    47.52    37.74    Average

At this point, let's stop and look at the results in more detail. The overall accuracy is lower than what you should have gotten for the majority class baseline (which should be in the 50-60% range). However, the lexicon ratio method can predict any of the labels, which leads to more interesting patterns. Note the following:
  • P, R and F stand for Precision, Recall and F-score. For example, for the label neutral, the precision is the number of tweets correctly identified as neutral divided by all the tweets that the system classified as neutral. The recall is the number of items correctly identified as neutral divided by all the tweets that the gold standard annotations say are neutral. F-score is the harmonic mean of precision and recall: F = 2*P*R/(P+R).
  • For each evaluation type, an average of P/R/F is provided.
  • The values we'll care about the most in final evaluations are the F-score average and the overall accuracy. However, it is important to consider precision and recall individually for different kinds of tasks.
  • Even though the overall accuracy is a lot lower than the majority class baseline, the output is far more meaningful; this shows in the label-level results, and the P/R/F averages, which are much higher than for the majority class baseline.
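The F-score formula above can be sketched as a one-liner (a standalone illustration, not part of the assignment code), with the zero-denominator case handled explicitly:

```scala
// Harmonic mean of precision and recall; 0.0 when both are 0 to avoid
// division by zero (a label with no predicted and no gold items).
def fScore(p: Double, r: Double): Double =
  if (p + r == 0.0) 0.0 else 2 * p * r / (p + r)
```

For example, fScore(23.31, 82.98) is roughly 36.4, matching the neutral row in the table above up to rounding of P and R.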
You may have noticed that the tweet content is tokenized on white space. This is sub-optimal. For example, it means that for a tweet like:

RT @hippieprof: RT @loudhearted: RT @quaigee: I find that the ppl who decry ''socialism'' loudest have no idea what the word means. #hcr #p2

the tokens produced are:

(RT, @hippieprof:, RT, @loudhearted:, RT, @quaigee:, I, find, that, the, ppl, who, decry, ''socialism'', loudest, have, no, idea, what, the, word, means., #hcr, #p2)

Obviously there are some problems here that are going to lead to loss of generalization for features, including tokens like "@hippieprof:", which should be "@hippieprof" and ":", "means." -> "means" ".", etc. Included in the code base is opennlp.scalabha.lang.eng.Twokenize.scala (from Brendan O'Connor's TweetMotif source code) which provides tokenization tailored for twitter. The tokens produced by Twokenize for the above example are:

(RT, @hippieprof, :, RT, @loudhearted, :, RT, @quaigee, :, I, find, that, the, ppl, who, decry, '', socialism, '', loudest, have, no, idea, what, the, word, means, ., #hcr, #p2)

Much better!

Part (b). Implementation. Modify the Tokenizer object's apply method so that it uses tokens from Twokenize. This is very simple, but you must figure out how to do this for yourself. Note that AbstractLexiconRatioClassifier "gets" its tokens from Tokenizer, so as you soon as you change Tokenizer, it makes better tokens available to AbstractLexiconRatioClassifier and other methods that use Tokenizer.

Run the lexicon classifier again -- you should see overall accuracy for polarity go to 44.28. 

Part (c). Written answer. Looking at the tokens in the example above and at the polarity lexicon, there are some fairly obvious problems for the lexicon ratio classifier as it is defined. Briefly say what at least one of these is.

Overall, this is clearly a poor set of results, but that is okay -- it's just a baseline! Let's do better with models acquired from the training data.

Problem 3 - Supervised models for sentiment classification [30 pts]

Now that we have done a couple of simple sanity checks to establish what we should expect to do at least as well as (and hopefully much better than), we can turn to machine learning from labeled training examples. The code is already set up to do classification. If no options are provided, a maxent model is trained and used, with a sigma of 1.0 (see the classification homework for more details). Here's what you should see (unfortunately, a lot of output is produced, including a status line for every iteration of the maxent training algorithm, but that couldn't be helped for this assignment):

$ scalabha run ata.hw6.TwitterSentiment -t debate08/train.xml -e debate08/dev.xml
Indexing events using cutoff of 1

    Computing event counts...  done. 795 events
    Indexing...  done.
Sorting and merging events... done. Reduced 795 events to 787.
Done indexing.
Incorporating indexed data for training... 
    Number of Event Tokens: 787
        Number of Outcomes: 3
      Number of Predicates: 3130
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-873.3967694911638    0.17987421383647798
  2:  ... loglikelihood=-719.5051853187072    0.8465408805031447
  3:  ... loglikelihood=-621.9388163826765    0.8817610062893082
<...many more lines like the above...>
 98:  ... loglikelihood=-169.99680317277065    0.989937106918239
 99:  ... loglikelihood=-169.783314182488    0.989937106918239
100:  ... loglikelihood=-169.57346480090493    0.989937106918239
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
354    61    39    |    454    negative
101    34    6    |    141    neutral
114    20    66    |    200    positive
569    115    111
negative neutral positive

        57.11    Overall accuracy
P    R    F
62.21    77.97    69.21    negative
29.57    24.11    26.56    neutral
59.46    33.00    42.44    positive
50.41    45.03    46.07    Average

Already looking much better than the baselines! Try it with naive Bayes, using the option --method naive-bayes (-m nb). For Debate08 with just the unigram features, naive Bayes gets a bit better result than maxent, and for HCR it is a bit worse than maxent. (And these values will change with different smoothing values -- see part (a).)

For this problem, you'll improve the extraction of features and determine a good value for smoothing (for both maxent and naive Bayes). The option --smoothing (-s) to TwitterSentiment allows you to specify this for both model types (the value will be used to set sigma for maxent and lambda for naive Bayes).

Part (a) . Written answer. Find a better smoothing value than the default (1.0) for both maxent and naive Bayes for both the Debate08 and HCR datasets. Write down what your best values are for each model and paste the output for both in your ANSWERS file. You should find a good balance between the overall accuracy and the average F-score. Is either of the model types a clear winner after you've done this optimization?

Part (b). Implementation. Improve and extend the features used by the classifiers by modifying the ExtendedFeatureExtractor object. Possibilities:
  • lower casing all tokens
  • using stems (see the PorterStemmer)
  • excluding stop words from being features in unigrams
  • consider bigrams and trigrams (and possibly as raw and/or stems)
  • using the polarity lexicon (e.g. output a feature polarity=POSITIVE for every word that is in the positive lexicon
  • regular expressions that detect patterns like "loooove", "LOVE", "*love*", presence of emoticons, etc.
To make it so TwitterSentiment uses these features, add the --extended (-x) flag to your command line options, e.g.:

$ scalabha run ata.hw6.TwitterSentiment -t debate08/train.xml -e debate08/dev.xml -x

You'll want to consider different smoothing values than what you had before. 
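To make the bulleted possibilities concrete, here is a hypothetical sketch of a feature extractor combining lower-casing, polarity-lexicon features, and bigrams. The function name, signature, and feature-string format here are assumptions for illustration only; the real ExtendedFeatureExtractor's expected inputs and outputs may differ:

```scala
// Hypothetical feature extractor sketch; adapt names and formats to the
// actual ExtendedFeatureExtractor in StudentCode.scala.
def extendedFeatures(tokens: Seq[String],
                     positive: Set[String],
                     negative: Set[String]): Seq[String] = {
  val lowered  = tokens.map(_.toLowerCase)              // normalize case
  val unigrams = lowered.map("word=" + _)               // unigram features
  val polarity = lowered.collect {                      // polarity lexicon features
    case t if positive(t) => "polarity=POSITIVE"
    case t if negative(t) => "polarity=NEGATIVE"
  }
  val bigrams = lowered.sliding(2).filter(_.size == 2)  // adjacent word pairs
    .map(_.mkString("bigram=", "_", "")).toSeq
  unigrams ++ polarity ++ bigrams
}
```

Stemming, stop-word filtering, and emoticon regexes would slot in as further transformations of `lowered` in the same style.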

Part (c). Written answer. Describe the features that you used and why, giving examples from the training set. Include the output from your best model.

For comparison, here are my best results on data/debate08/dev.xml:

Confusion matrix.
Columns give predicted counts. Rows give gold counts.
437    7    10    |    454    negative
112    27    2    |    141    neutral
109    6    85    |    200    positive
658    40    97
negative neutral positive

        69.06    Overall accuracy
 P        R        F
66.41    96.26    78.60    negative
67.50    19.15    29.83    neutral
87.63    42.50    57.24    positive
73.85    52.63    55.22    Average

Hopefully some of you will beat this!

Problem 4 - The Stanford Twitter Sentiment set [20 pts]

This problem involves predicting the polarity of tweets in the Stanford Twitter Sentiment dataset.

Part (a) . Implementation. Convert the Stanford Twitter Sentiment corpus so that it is in the XML format of the other datasets. You should be able to run it as follows:

$ scalabha run ata.hw6.ConvertStanfordDataToXml stanford/orig/testdata.manual.2009.05.25 > stanford/dev.xml

To do this conversion, look at the original data file and at the paper referenced above. Everything you need to create the XML elements is in there.

Tip: Don't forget that you need <dataset>...</dataset> around the entire set of tweets.
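As a rough sketch of the XML-emission side of the converter (the element and attribute names here are assumptions for illustration; match them to what TwitterSentiment.scala actually parses, and escape reserved characters so the XML stays well-formed):

```scala
// Hypothetical sketch: wrap (label, text) pairs in a <dataset> document.
// Element/attribute names are assumed, not taken from the assignment code.
def toXml(tweets: Seq[(String, String)]): String = {
  def esc(s: String) =                         // escape XML-reserved characters
    s.replace("&", "&amp;").replace("<", "&lt;")
     .replace(">", "&gt;").replace("\"", "&quot;")
  val items = tweets.map { case (label, text) =>
    "  <item label=\"" + esc(label) + "\"><content>" + esc(text) + "</content></item>"
  }
  ("<dataset>" +: items :+ "</dataset>").mkString("\n")
}
```

Note that `&` must be escaped first, before the other entities, or the escaping would corrupt itself.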

Now, use it as an evaluation set. Do so for the lexicon ratio classifier and for maxent trained on both the Debate08 and HCR training sets. For the latter, you need to use the --auxiliary (-a) option, e.g.

$ scalabha run ata.hw6.TwitterSentiment -t debate08/train.xml -a hcr/train.xml -e stanford/dev.xml -x

For the lexicon classifier, it is just like you did for the other datasets, since no training is involved.

Part (b). Written answer. In a paragraph or two, describe what happened, and why you think it did.

Problem 5 - Reporting and analysis [25 pts]

We'll wrap up with a summary of your best results and a look at the output of your best model.

Part (a). Written answer. For debate08/dev.xml, hcr/dev.xml, and stanford/dev.xml, state your best score for each model and configuration, including which parameters were involved. Do it as a table with these columns:
  • Overall Accuracy
  • Neutral F-score
  • Positive F-score
  • Average F-score

Part (b). Written answer. Run all of the above model/configurations with these same parameters on debate08/test.xml and hcr/test.xml and produce a table of those results. Did the models maintain the same relative performance as they had on the development set? Are the differences in performance about the same, or have they increased?

The option --detailed (-d) outputs the correctly resolved tweets and the incorrectly resolved ones.

Part (c) Written answer. Obtain the detailed output for your best system for data/hcr/dev.xml. Look at at least 20 of the incorrect ones and discuss the following:
  1. Which tweets, if any, do you think have the wrong gold label?
  2. Which tweets, if any, are too complex or subtle for the simple positive/negative/neutral distinction? (For example, they are positive toward one thing and negative toward another, even in the same breath.)
  3. Which tweets, if any, have a clear sentiment value, but are linguistically too complex for the kind of classifiers you are using here?
  4. Which tweets, if any, do you think the system should have gotten? Are there any additional features you could easily add that could fix them (provided sufficient training data)?
For each of these, paste the relevant tweets you discuss into your hw6_answers.txt file.

Part (d) Written answer. Based on your experience creating features with the resources made available to you and having looked at the errors in detail, describe 2-3 additional strategies you think would help in this context, such as other forms of machine learning, additional linguistic processing, etc.


Feedback

(a) How long did this homework take you?

(b) Do you have any suggestions?

(c) Do you have any complaints?

Extra credit - Train on noisy emoticon-labeled tweets [up to 30 additional pts]

NOTE: You should also run on the Stanford set.

Look in data/emoticon -- you'll see:
  • happy.txt: 2000 tweets that have a smiley emoticon
  • sad.txt: 2000 tweets that have a frowny emoticon
  • neutral.txt: 2000 tweets that don't have smilies or frownies or certain subjective terms (it's noisy, so it is very much neutral-ISH)
Part 1. Write an object CreateEmoticonDataSplits to produce an emoticon/train.xml file from the above files in the format that TwitterSentiment expects. All the tweets in happy.txt should be labeled positive, those in sad.txt should be labeled negative, and those in neutral.txt should be labeled, well, neutral. So, this is clearly an attempt to get annotations for free -- and there are indications that it should work; e.g., see the technical report by Go et al. (2009), Twitter Sentiment Classification using Distant Supervision.

Part 2. Use emoticon/train.xml as a training source and evaluate on debate08/dev.xml and hcr/dev.xml. 

Writing. Discuss the results, with reference to both datasets. Does it at least beat a lexicon based classifier? How close does it come to the supervised models? Does the effectiveness of the noisy labels vary with respect to the model? Pay particular attention to the label-level precision and recall values. Are they more balanced or less balanced than the results from models trained on human annotated examples? If there are differences, can you think of reasons why?

Part 3. I haven't actually run the above experiment myself, but I'm willing to bet the results aren't as good as with the models trained on human annotated examples. So, perhaps there is a way to take advantage of both the human annotations and this larger set of noisily labeled examples. Actually, there are many ways of doing this -- here you'll do the very simple strategy of just concatenating the two training sets. You can do this without changing the files by using the --auxtrain (-a) option. As an example, you can evaluate on debate08/dev.xml using a model trained on both debate08/train.xml and hcr/train.xml as follows:

$ scalabha run ata.hw6.TwitterSentiment -t debate08/train.xml -a hcr/train.xml -e debate08/dev.xml

You'll probably find that you need to adjust the smoothing value to get better results. Try this strategy for both Debate08 and HCR, using emoticon/train.xml.

Part 4. Determine the best parameter values and run the evaluation (a) using just emoticon training, and (b) using both human-annotated and emoticon training, for both naive Bayes and maxent (optional: try two-way classification too), and fill in a table like the one in Problem 5, part (a) for both data/debate08/dev.xml and data/hcr/dev.xml. Did you get better results than the best results from Problem 5, part (a)? Fill in the table for data/debate08/train.xml and discuss what comes of that.

General comment. For this extra credit, things are much more free form -- go crazy and impress me a lot, or do the bare minimum and get a few points. It's up to you. I'm happy to discuss ideas! (For example, you could create your own emoticon set, with more examples, or do a better job at identifying actually neutral items -- you just can't do any hand labeling.)

Make sure to include CreateEmoticonDataSplits in your submission.

Extra credit - Make up something interesting to do

E.g. grab another sentiment dataset and work with it. (I'll post some links later, but email me if you want them and I haven't yet done so.)

Copyright 2012 Jason Baldridge

The text of this homework is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to this original homework.

Please email Jason at with suggestions, improvements, extensions and bug fixes.

Jason Baldridge,
Apr 28, 2012, 12:40 AM