Assignments‎ > ‎



DUE: April 16, 2012 by 10am (submission on Blackboard)

This assignment involves using naive Bayes and maximum entropy classifiers for a prepositional phrase attachment task. The main tasks are using existing software for classification and extracting features for them to use in learning and making predictions.

  • As usual, the description of the problems for this homework is somewhat long, and as usual, much of the text is there to give you documentation and help you know what to do.
  • You should read all of the problems before starting. Then read through each problem carefully, in its entirety, before answering questions and doing the implementation.
  • It will be best to solve the problems in order since each one builds on the previous problem.
  • My solutions to these problems don’t use a single “var” variable. I encourage you to try to do without them too, but you may use them if you really must.
  • If you run into what you think might be a bug with Scalabha, please let me know right away so I can fix it if it indeed is a bug.
  • Feel free to look at the naive Bayes homework on which this is based. It won't necessarily help you, but if you want to dig deeper, that homework actually involves implementing naive Bayes yourself rather than just using an existing implementation.
  • If you have any questions or problems with any of the materials, don't hesitate to ask!
If you don’t absolutely need to have words on paper, don’t print this homework description out -- it has a lot of example output that makes it lengthy, and it will be easier to do some cutting and pasting etc right from the web page.


Getting Scalabha

You must obtain and install Scalabha, a software package that will allow you to access code written to support this assignment and to easily compile and run your own code. For this homework, you need version 0.2.3., which you can obtain here:

Note: if you know how to use Git (or are interested in using it), you can just clone the Scalabha repository.

You should unzip the contents of and follow the instructions for installing it in the file scalabha-0.2.3/ Note that one of those instructions requires you to set an environment variable SCALABHA_DIR that points to the location of Scalabha on your machine. At times, this homework will reference file locations that are relative to that directory.

The data

You will work with clustering datasets that are located in $SCALABHA_DIR/data/classify. Go to that directory and have a look at it.

Setting up the code

Like the previous assignment, you will this time develop your solutions in the context of the Scalabha build system. See tutorial 11 on SBT and Scalabha for a refresher on SBT if you need it.

There are just three files in the homework bundle two stub files PrepAttach.scala and ConfidenceScorer.scala, and an answers file hw5_answers.txt. Please modify these files when solving the problems; do not use different names for either. To prepare for working with the assignment, you should do the following steps:

$ unzip

$ scalabha build compile

The last command should end with a line starting with [success]. If you are having trouble with this, get in touch with me right away.

Your implemented solutions will be done in PrepAttach.scala and ConfidenceScorer.scala. Any portions of problems that begin with Written question or that request example output should go in hw5_answers.txt.

Tip: while you are working on your solutions, you should definitely take advantage of the ~compile command in SBT. It will compile your code automatically every time you save a file you are working on.

Note that most of the commands suggested in the problems assume that you are running them in the $SCALABHA_DIR/data/classify directory.

Submitting your solutions

For submission, create a zip file with the name hw5_<lastname>_<firstname>.zip that contains your src/main/scala/ata/hw5 directory and its contents. Here are some example commands for doing this:

$ zip -r src/main/scala/ata/hw5/
  adding: src/main/scala/ata/hw5/ (stored 0%)
  adding: src/main/scala/ata/hw5/PrepAttach.scala (deflated 65%)
  adding: src/main/scala/ata/hw5/ConfidenceScorer.scala (deflated 57%)

Make sure that your code compiles before you do this. If it does not, I’ll deduct 20 points right away and ask you to fix it and resubmit.

Tip: Remember that you can create the scaladoc API documentation by running the "doc" command in SBT, and that you also have access to the source code in $SCALABHA_DIR/src/main/scala, which you will be looking at at times to solve the problems in this homework.

Problem 1 - A good day to play tennis? [5 pts]

Let’s start with a simple example problem from Tom Mitchell’s book Machine Learning (you don't need the book for this problem). The problem is to predict whether it is a good day to play tennis given various factors and some initial data that provides information about whether previous days were good or bad days for tennis. The factors include (in the format "Attribute: <List of Possible Values>"):
  • Outlook: Sunny, Rain, Overcast
  • Temperature: Hot, Mild, Cool
  • Humidity: High, Normal
  • Wind: String, Weak
Table 3.2 on page 59 of Mitchell’s book contains information for fourteen days; this data is encoded in the file $SCALABHA_DIR/data/classify/tennis/train. There is another related file called test in the same directory. As you might expect, you will learn model parameters using train and test the resulting model on the examples in test.

Each row in the data files corresponds to a single classification instance. For example, here is the training set.


Each instance consists of a list of attribute values, separated by commas, and the last element is the classification value. The value is "Yes" if it is a good day to play tennis based on the conditions, and "No" if it is not.

What we are interested in for this toy example is to determine whether the probability of playing tennis is higher than the probability of not playing tennis. We can represent the probability of playing tennis as:

Note that Label is the output prediction we are making (with yes as one of its two values), Outlook, Temperature, Humidity, and Wind are all attributes of interest, , and o, t, h, and w are variables for the possible values of each attributes. As an example, if we have a new instance that we wish to classify, like:


what we seek is (shortening notation to ignore the feature names since they are clear from context):

Where we obtained the second line from the first by using Bayes rule and assuming independence, as discussed in the lecture slides.

This simply means we need to compute the two values:

and pick the label that produced the higher value. The naive Bayes implementation in Scalabha will do this for you, provided you give it the data in the right format.

The data includes a test set for the tennis task as well, provided in full here:


Like the training set, this provides a list of instances, and for each instance, the values for each of the attributes and the classification outcome. We'll use Scalabha's naive Bayes classifier to make predictions on each instance and then compare its most probably label with the gold standard label provide in the test set.

The only thing you need to do for this problem is run the naive Bayes classifier such that it trains on the train file and is evaluated on the test file. Do the following commands:

$ cd $SCALABHA_DIR/data/classify
$ scalabha classify --train tennis/train --predict tennis/test --out predictions.txt
$ scalabha score --gold tennis/test --predict predictions.txt

You should see the following output:

Accuracy: 61.54

Tip: things are set up so that:
  • if scalabha classify is not given the --out option, then the predictions are output to STDOUT
  • if scalabha score is not given the --predict option, it will take input from STDIN
  • for both programs, all of the options have short versions using the first letter of the option
That means you can do classification and scoring using a single line with UNIX pipes:

$ scalabha classify -t tennis/train -p tennis/test | scalabha score -g tennis/test
Accuracy: 61.54

Here's a bit more detail on what is going on. The classify command outputs a probability distribution for each instance, for example:

$ scalabha classify -t tennis/train -p tennis/test
No 0.7098248137334299 Yes 0.2901751862665702
Yes 0.9653030731512793 No 0.03469692684872127
Yes 0.676016267177877 No 0.323983732822123
No 0.7334732282476866 Yes 0.2665267717523131
Yes 0.8066941597364939 No 0.19330584026350536
Yes 0.5767484281985087 No 0.4232515718014911
No 0.5571385693746576 Yes 0.44286143062534244
Yes 0.9085594742433201 No 0.09144052575668082
Yes 0.7607393593960057 No 0.23926064060399474
Yes 0.9226202882823726 No 0.07737971171762743
Yes 0.7876617086814863 No 0.21233829131851342
No 0.5118066815368538 Yes 0.48819331846314595
Yes 0.6138655053429548 No 0.38613449465704547

The most probable label is the leftmost on each line. Notice that the probabilities sum to one (as they should). So, if we take the most probable label as the prediction (which is the usual thing to do), this means that the model predicts "No" for the first instance in the test file:


This does match the actual label given in the file for that instance. (Other predictions are of course different from the test labels.)

Written answer: Run the naive Bayes classifier by training it on the "test" file and evaluating it on the "train" file. Report the accuracy.

Problem 2 - Prepositional Phrase Attachment and smoothing [15 pts]

Prepositional phrase attachment is the well-known task of resolving a common ambiguity in English syntax regarding whether a prepositional phrase attaches to the verb or the noun in sentences with the pattern Verb Noun_Phrase Prepositional_Phrase. An example is I saw the man with the telescope. If the prepositional phrase attaches to the verb, then the seeing was done with the telescope; if it attaches to the noun, it indicates that the man had the telescope in his possession. A clear difference can be seen with the following related examples:
  • Attach to the noun: He ate the spaghetti with meatballs.
  • Attach to the verb: He ate the spaghetti with chopsticks.
We can deal with this decision just like any simple labeling problem: each sentence receives a label V or N indicating the attachment decision, and there is no benefit to be gained from using previous attachment decisions.

For this problem, you will use a conveniently formatted data set for prepositional phrase attachment which has been made available by Adwait Ratnaparkhi. You can find it in the directory $SCALABHA_DIR/data/classify/ppa. Go to that directory and list the contents. There are three files which you will use for this problem: training, devset, and test. Look at the contents of training:

0 join board as director V
1 is chairman of N.V. N
2 named director of conglomerate N
3 caused percentage of deaths N
5 using crocidolite in filters V

Each row lists an abbreviated form of a prepositional phrase attachment. The first item is just a source sentence identifier that we can ignore. The four words correspond to the head verb, head noun, preposition, and head noun object of the preposition, in that order. The final element indicates whether the attachment was to the verb (V) or to the noun (N). For example, for the two spaghetti eating sentences given above, the abbreviated forms would be:

4242 ate spaghetti with meatballs N
4243 ate spaghetti with chopsticks V

For this exercise, you will build a classifier that learns a model from the data in training and use it to classify new instances. You will develop your model using the material in devset. You must not personally inspect the contents of test — you will run your classifier on test only once (for problem 5), when you are done developing.

The first thing you must do is produce features from the ppa data in the format used previously for the tennis problem. Each of the four columns in the ppa data implicitly indicates a separate attribute – let’s call them verb, noun, prep, and prep_obj, respectively. To begin, we'll just create features that are based directly on the attributes and their values. So, a line like,

0 join board as director V

becomes the following:


The main method of ata.hw5.PpaFeatures reads in a ppa file and produce the features in the above format. Using this program, produce files for training and development as follows:

$ cd $SCALABHA_DIR/data/classify
$ mkdir out

$ scalabha run ata.hw5.PpaFeatures ppa/training > out/ppa.basic.train
$ scalabha run ata.hw5.PpaFeatures ppa/devset > out/
$ scalabha run ata.hw5.PpaFeatures ppa/test > out/ppa.basic.test

You can run scalabha classify on the feature files that are produced and score the output with scalabha score:

$ scalabha classify -t out/ppa.basic.train -p out/ | scalabha score -g out/
Accuracy: 68.80

Ratnaparkhi et al (1994) obtain accuracies of around 80%, so we clearly should be able to do much better.  One obvious problem shows up if you look at the actual output:

$ scalabha classify -t out/ppa.basic.train -p out/ | more
V 0.8324686834013496 N 0.16753131659864912
V 1.0 N 0.0
V 1.0 N 0.0
V 1.0 N 0.0
V 1.0 N 0.0
V 0.5 N 0.5
V 1.0 N 0.0
V 0.9999400936526425 N 5.990634735608704E-5
V 0.5 N 0.5
V 1.0 N 0.0
V 0.9997272120557299 N 2.7278794426904216E-4
V 0.9997444685957534 N 2.5553140424721684E-4
V 0.9996717479741555 N 3.28252025843356E-4

There are many items that have zero probability for N, V, or both (those are the ones with uniform .5 probability for both labels). The problem is that we haven't done any smoothing, so there are many parameters that we assign zero to, and then the overall probability for the class becomes zero. For example, the tenth line in out/ is:


The output gives zero probability to N because the only training instance that has noun=performer is with the V label:


Thus, the value of P(Noun=performer | Label=V) is zero, making P(Label=V | Verb=juxtapose, Noun=performer, Prep=with, PrepObj=tracks) also zero, regardless of how much the rest of the information looks like a V attachment. We can fix this by using add-λ smoothing, which hallucinates counts for every event and ensures none of the probabilities are zero.

Using smoothing is easy: just use the -lambda option and provide a value greater than zero. For example, here is what we get with a lambda value of one.

$ scalabha classify -t out/ppa.basic.train -p out/ -l 1 | scalabha score -g out/ 
Accuracy: 80.81

Find a good λ for improving the accuracy on the development set. By exploring other values for λ, you may be able to improve the results.

Written answer. Report the best λ you found and what accuracy you obtained.

Tip. You can easily check many possible values for λ using the Unix command line. For example,

$ for i in {0..5}; do echo $i; scalabha classify -t out/ppa.basic.train -p out/ -l $i | scalabha score -g out/  ; done

You should see multiple accuracy values go by, each one associated with the λ value printed above it. (If this doesn't work, then you might not be using the bash shell.) You can also specify ranges such as for i in .5 1 1.5 2 2.5 for drilling down further.

don't do anything with ppa/test yet!

Problem 3 - Extending the feature set [30 pts]

The simple set of features used for the ppa data above can definitely be improved upon. For example, you could have features that:
  • are combinations of the head noun and verb, e.g. verb+noun=join+board
  • are a stemmed version of the verb, e.g. verb_stem=caus from cause, causes, caused, and causing
  • identify all numbers, e.g. noun_form=number for 42, 1.5, 100,000, etc.
  • identify capitalized forms, e.g. noun_form=Xx for Texas and noun_form=XX for USA.
  • use word clusters derived from a large corpus (see the discussion of bitstrings in Ratnaparkhi et al (1994))
Part (a). Implementation. Create extended ppa features by modifying ExtendedFeatureExtractor, following instructions in the code. Look at the suggestions above, at some examples of extended features already in ExtendedFeatureExtractor, and also look at the training file to see if you can think of other features. You should come up with at least five new (types of) features, but are encouraged to do much more. Be creative! Look at the directions in ExtendedFeatureExtractor so that you can define features that will be output when the --extended option is given to PpaFeatures.

Part (b). Written answer. Describe five of the features you added, and why you think they would be useful.

Tip. Some resources have been imported for you:
  • The Porter stemmer is available via the stemmer object -- to get the stem of a word, you can call, for example, stemmer("walked")
  • The bitstrings supplied with the data set capture word clusters. They can be imported by using the --bitstrings option, which gives you a dictionary from words to BitVector objects. There is an example in the code of how you can use them, and the code for BitVectors (which is very rudimentary, not a serious implementation at all, but fine for the needs of this problem) is included in PrepAttach.scala.
As you enable new features, you can try them out by generating new feature files and running the classifier:

$ scalabha run ata.hw5.PpaFeatures -e -b ppa/bitstrings ppa/training > out/ppa.extended.train
$ scalabha run ata.hw5.PpaFeatures -e -b ppa/bitstrings ppa/devset > out/

$ scalabha classify -t out/ppa.extended.train -p out/ -l 1 | scalabha score -g out/

Notice that the above commands use the shortened versions of the options.

As before with the basic features, find an optimal λ on the dev set with the extended features. (Note that this may change as you add more features.)  When you are satisfied that you cannot improve the performance any further, you can finally try out the test set! (Once you've done this, there is no going back to change the features any more.)

Part (c). Written answer. What is the performance you obtain when using the basic features and the extended features, each with their best λ?

Problem 4 - Use OpenNLP Maxent on the tennis dataset [10 pts]

This problem will get you using the OpenNLP Maxent classifier on the tennis dataset. OpenNLP Maxent has been imported as a dependency in Scalabha, and I have enabled OpenNLP's command line interface to be called from the bin/scalabha script with the commands maxent-train and maxent-apply.

The native format of the tennis dataset is ready to use with OpenNLP Maxent. All of the instructions assume you are running the commands from the $SCALABHA_DIR/data/classify directory.

Next try it out with maxent.

$ mkdir out
$ scalabha maxent-train tennis/train out/tennis_model.txt
Indexing events using cutoff of 1

    Computing event counts...  done. 14 events
    Indexing...  done.
Sorting and merging events... done. Reduced 14 events to 14.
Done indexing.
Incorporating indexed data for training... 
    Number of Event Tokens: 14
        Number of Outcomes: 2
      Number of Predicates: 10
Computing model parameters...
Performing 100 iterations.
  1:  .. loglikelihood=-9.704060527839234    0.35714285714285715
  2:  .. loglikelihood=-7.977369481882935    0.7142857142857143
  3:  .. loglikelihood=-7.274865786406583    0.8571428571428571
  4:  .. loglikelihood=-6.820070735855129    0.8571428571428571
  5:  .. loglikelihood=-6.494901774488233    0.8571428571428571
<more lines>
 75:  .. loglikelihood=-5.081164775456419    0.8571428571428571
 76:  .. loglikelihood=-5.081044303386861    0.8571428571428571
 77:  .. loglikelihood=-5.080931523277981    0.8571428571428571
 78:  .. loglikelihood=-5.080825909041938    0.8571428571428571
 79:  .. loglikelihood=-5.080726973756942    0.8571428571428571

This output shows the log likelihood (log prob) of the training dataset (data/classify/tennis/train) and the accuracy on the training set after every iteration of the parameter setting algorithm. Note that even though it states that it is doing 100 iterations, it stops early if the change in log likelihood is below a threshold.

You can see the resulting model by inspecting out/tennis_model.txt, which contains information about the features and the parameters values that have been learned from the training data. Next, we'll use this model to classify test items.

Tip: You can suppress the output by directing it to /dev/null, e.g.: scalabha maxent-train tennis/train out/tennis_model.txt > /dev/null

Classify the items in the test set as follows:

$ scalabha maxent-apply out/tennis_model.txt tennis/test
No 0.8436671652082209 Yes 0.15633283479177895
Yes 0.9430013482443135 No 0.05699865175568643
Yes 0.716243615645877 No 0.28375638435412287
No 0.8285289429029252 Yes 0.17147105709707472
Yes 0.7592631514228292 No 0.24073684857717087
No 0.5624507721921893 Yes 0.43754922780781064
Yes 0.5321312571177222 No 0.4678687428822778
Yes 0.9067302278999085 No 0.09326977210009141
Yes 0.5766008802968591 No 0.423399119703141
Yes 0.8365744412930616 No 0.16342555870693856
Yes 0.7788844854214183 No 0.2211155145785817
No 0.6254365317232463 Yes 0.3745634682767537
Yes 0.5215105433689039 No 0.4784894566310961

The output is the same format as you saw before with the naive Bayes implementation. That means you can funnel it right over to the scoring program:

$ scalabha maxent-apply out/tennis_model.txt tennis/test | scalabha score -g tennis/test
Accuracy: 76.92

Like naive Bayes, a maxent model can be smoothed. The underlying mechanism is quite different from add-lambda smoothing for naive Bayes. Instead of adding virtual counts, smoothing in maxent involves constraining parameter values; the result is that model parameters don't get as extreme as they would without smoothing. Another way of thinking about it is that we don't trust the training data fully, so we don't make the model parameters a perfect fit to the training data and instead ensure that they don't move too far from zero, obeying a Gaussian (normal) distribution.

The smoothing value you can provide to OpenNLP Maxent is the standard deviation of that Gaussian distribution, sigma. Larger values for sigma perform less smoothing in that they allow the parameters to be more extreme, while smaller values keep the parameters closer to zero. Here's how you specify it:

$ scalabha maxent-train -sigma 1.0 tennis/train out/tennis_model.txt

In the problems for this homework, a suggested set of values to try for sigma are [.01, .1, .5, 1.0, 5.0, 10, 50, 100]. You are of course welcome to explore other values, including others in this range.

Keep in mind that you need to retrain the model and then apply the model (whereas the naive Bayes model does this in one step. While this may seem like an extra step that you would prefer not to do, it is usually the case that one wants to train a model, save it, and then use many many times.

Do the following exercises, the main purpose of which is to ensure your setup is working properly.

Part (a). Written answer. What is the accuracy when you use a sigma value of .1?

Part (b). Written answer. What is the accuracy when you use naive Bayes with a lambda of 5.0, trained on tennis/test and tested on tennis/train?

Part (c). Written answer. What is the accuracy when you use maxent with a sigma of .01, trained on tennis/test and tested on tennis/train?

Problem 5 - Prepositional Phrase Attachment with maxent [20 pts]

The output files from PpaFeatures that you produced for problem 3 are ready to be used with OpenNLP Maxent.

Part (a). Written answer. Use maxent to obtain a score on the dev set with both features sets (a) simple features and (b) your extended features. You'll want to find a good sigma for each feature set. Note also that you can increase the number of iterations with the option -maxit, followed by a number ('-maxit 1000' is a good choice). Write down your best score with each feature set, and what your values for sigma and maxit were.

Part (b). Written answer. Based on the parameters for lambda, sigma, and maxit that you found to be most optimal, run naive Bayes and maxent on the PPA test set. Report the results again for all three feature sets, using a table like the following:

 Model/Features ParametersSimple
Your Extended

Did the relative performance of the different models from the dev set to the test set stay the same? Briefly describe any differences you notice.

Tip: What you did for this problem can be used for any kind of standard labeling task you might be interested in. You just need to produce feature files in the formats we did here, and you can then get results back and use them as you like, etc.

Problem 6 - Classifier confidence and performance [20 pts]

You may have noticed that there are some examples that have a higher probability for their most probable label than others. As it turns out, classifier confidence is a good indicator of when the prediction is likely to be correct: the higher the probability, the higher the confidence, and--typically--the more likely the assignment is correct.

To make this concrete, consider the output of the naive Bayes classifier on the tennis dataset:

$ scalabha classify -t tennis/train -p tennis/test -l 1
No 0.7098248137334299 Yes 0.2901751862665702
Yes 0.9653030731512793 No 0.03469692684872127
Yes 0.676016267177877 No 0.323983732822123
No 0.7334732282476866 Yes 0.2665267717523131
Yes 0.8066941597364939 No 0.19330584026350536
Yes 0.5767484281985087 No 0.4232515718014911
No 0.5571385693746576 Yes 0.44286143062534244
Yes 0.9085594742433201 No 0.09144052575668082
Yes 0.7607393593960057 No 0.23926064060399474
Yes 0.9226202882823726 No 0.07737971171762743
Yes 0.7876617086814863 No 0.21233829131851342
No 0.5118066815368538 Yes 0.48819331846314595
Yes 0.6138655053429548 No 0.38613449465704547

Notice that the second line is the most confident, with a probability of 96.53 for Yes, while the second to last line is the least confident, with a probability of 51.18 for No. What you are doing in this problem is sorting based on the second column probability (the probability of the best label), and then scoring the most and least confident examples.

Part (a). Implementation. Write code to calculate how accuracy changes based on confidence. Implement this in the ConfidenceScorer object in the file ConfidenceScorer.scala. You must output the accuracy for the top-third most confident examples and for the bottom-third least confident examples. (This means that you'll ignore the middle third.)

As an example, when it is done, you should be able to run it like this on the tennis data set like this:

$ scalabha classify -t tennis/train -p tennis/test -l 1 > out/
$ scalabha run ata.hw5.ConfidenceScorer tennis/test out/
High confidence accuracy: 100.00
Low confidence accuracy:  0.00

$ scalabha maxent-train -sigma 1 tennis/train out/tennis.maxent.txt > /dev/null
$ scalabha maxent-apply out/tennis.maxent.txt tennis/test > out/
$ scalabha run ata.hw5.ConfidenceScorer tennis/test out/
High confidence accuracy: 100.00
Low confidence accuracy:  50.00

Except that you'll do this for the PPA dataset, which will give a more interesting result.

Part (b). Written answer. Use your scoring implementation to calculate the high and low confidence accuracies for the PPA dataset for both naive Bayes and maxent on the PPA development data. Include the output in your answer.

Tip: Look at the code in opennlp.scalabha.classify.ClassifyScorer for some examples of reading in the files and getting the labels and probabilities out of them.

Tip: Group the probability, gold label, and predicted labels for each instance in a tuple and sort on the List of tuples. If the probability is first (and is of type Double, not of type String), you'll get the sorted order you need, with the least confident at the front of the list and the most confident at the end, and the right predicted label will stay linked to its corresponding gold label.

7. Feedback
(a) How long did this homework take you?

(b) Do you have any suggestions?

(c) Do you have any complaints?

8. Extra credit
Obtain another labeled dataset and do classification on it, using both naive Bayes and maxent. Most of your work will be converting the raw data format into that needed by the classifiers (including doing feature extraction). You can choose any suitable data set you like, but here are a couple of suggestions:
In your write-up, say a bit about the dataset and what you had to do to work with it, what features you used, and summarize your results. Provide the code you wrote to support your experiments as well.

Copyright 2012 Jason Baldridge

The text of this homework is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to and to this original homework.

Please email Jason at with suggestions, improvements, extensions and bug fixes.

Jason Baldridge,
Apr 4, 2012, 11:40 PM