Introduction

DUE: April 16, 2012 by 10am (submission on Blackboard)

This assignment involves using naive Bayes and maximum entropy classifiers for a prepositional phrase attachment task. The main tasks are using existing software for classification and extracting features for the classifiers to use in learning and making predictions.

Notes:
PRELIMINARIES

Getting Scalabha

You must obtain and install Scalabha, a software package that will allow you to access code written to support this assignment and to easily compile and run your own code. For this homework, you need version 0.2.3, which you can obtain here:

https://github.com/downloads/utcompling/Scalabha/scalabha-0.2.3.zip

Note: if you know how to use Git (or are interested in using it), you can just clone the Scalabha repository.

You should unzip the contents of scalabha-0.2.3.zip and follow the instructions for installing it in the file scalabha-0.2.3/README.md. Note that one of those instructions requires you to set an environment variable SCALABHA_DIR that points to the location of Scalabha on your machine. At times, this homework will reference file locations that are relative to that directory.

The data

You will work with classification datasets that are located in $SCALABHA_DIR/data/classify. Go to that directory and have a look at it.

Setting up the code

As with the previous assignment, you will develop your solutions in the context of the Scalabha build system. See tutorial 11 on SBT and Scalabha for a refresher on SBT if you need it.

There are just three files in the homework bundle hw5_stubs.zip: two stub files, PrepAttach.scala and ConfidenceScorer.scala, and an answers file, hw5_answers.txt. Please modify these files when solving the problems; do not use different names for any of them. To prepare for working with the assignment, do the following steps:

$ mv hw5_stubs.zip $SCALABHA_DIR
$ cd $SCALABHA_DIR
$ unzip hw5_stubs.zip
$ scalabha build compile

The last command should end with a line starting with [success]. If you are having trouble with this, get in touch with me right away.

Your implemented solutions will be done in PrepAttach.scala and ConfidenceScorer.scala. Any portions of problems that begin with "Written question" or that request example output should go in hw5_answers.txt.

Tip: while you are working on your solutions, you should definitely take advantage of the ~compile command in SBT. It will compile your code automatically every time you save a file you are working on.

Note that most of the commands suggested in the problems assume that you are running them in the $SCALABHA_DIR/data/classify directory.

Submitting your solutions

For submission, create a zip file with the name hw5_<lastname>_<firstname>.zip that contains your src/main/scala/ata/hw5 directory and its contents. Here are some example commands for doing this:

$ cd $SCALABHA_DIR
$ zip -r hw5_baldridge_jason.zip src/main/scala/ata/hw5/
  adding: src/main/scala/ata/hw5/ (stored 0%)
  adding: src/main/scala/ata/hw5/PrepAttach.scala (deflated 65%)
  adding: src/main/scala/ata/hw5/ConfidenceScorer.scala (deflated 57%)

Make sure that your code compiles before you do this. If it does not, I'll deduct 20 points right away and ask you to fix it and resubmit.

Tip: Remember that you can create the scaladoc API documentation by running the "doc" command in SBT, and that you also have access to the source code in $SCALABHA_DIR/src/main/scala, which you will at times need to look at to solve the problems in this homework.

Problem 1 - A good day to play tennis? [5 pts]

Let's start with a simple example problem from Tom Mitchell's book Machine Learning (you don't need the book for this problem).
The problem is to predict whether it is a good day to play tennis given
various factors and some initial data that provides information about
whether previous days were good or bad days for tennis. The factors
include (in the format "Attribute: <List of Possible Values>"):

Outlook: Sunny, Overcast, Rain
Temperature: Hot, Mild, Cool
Humidity: High, Normal
Wind: Weak, Strong
Each row in the data files corresponds to a single classification instance. For example, here is the training set.

data/classify/tennis/train
Outlook=Sunny,Temperature=Hot,Humidity=High,Wind=Weak,No
Outlook=Sunny,Temperature=Hot,Humidity=High,Wind=Strong,No
Outlook=Overcast,Temperature=Hot,Humidity=High,Wind=Weak,Yes
Outlook=Rain,Temperature=Mild,Humidity=High,Wind=Weak,Yes
Outlook=Rain,Temperature=Cool,Humidity=Normal,Wind=Weak,Yes
Outlook=Rain,Temperature=Cool,Humidity=Normal,Wind=Strong,No
Outlook=Overcast,Temperature=Cool,Humidity=Normal,Wind=Strong,Yes
Outlook=Sunny,Temperature=Mild,Humidity=High,Wind=Weak,No
Outlook=Sunny,Temperature=Cool,Humidity=Normal,Wind=Weak,Yes
Outlook=Rain,Temperature=Mild,Humidity=Normal,Wind=Weak,Yes
Outlook=Sunny,Temperature=Mild,Humidity=Normal,Wind=Strong,Yes
Outlook=Overcast,Temperature=Mild,Humidity=High,Wind=Strong,Yes
Outlook=Overcast,Temperature=Hot,Humidity=Normal,Wind=Weak,Yes
Outlook=Rain,Temperature=Mild,Humidity=High,Wind=Strong,No

Each instance consists of a list of attribute values, separated by commas, and the last element is the classification value. The value is "Yes" if it is a good day to play tennis based on the conditions, and "No" if it is not.

What we are interested in for this toy example is to determine whether the probability of playing tennis is higher than the probability of not playing tennis. We can represent the probability of playing tennis as:

P(Label=yes | Outlook=o, Temperature=t, Humidity=h, Wind=w)

Note that Label is the output prediction we are making (with yes as one of its two values); Outlook, Temperature, Humidity, and Wind are all attributes of interest; and o, t, h, and w are variables for the possible values of each attribute. As an example, if we have a new instance that we wish to classify, like:

Outlook=Sunny,Temperature=Cool,Humidity=High,Wind=Strong

what we seek is (shortening notation to ignore the feature names since they are clear from context):

P(yes | Sunny, Cool, High, Strong)
  ∝ P(yes) P(Sunny|yes) P(Cool|yes) P(High|yes) P(Strong|yes)

where we obtained the second line from the first by using Bayes rule and assuming independence, as discussed in the lecture slides. This simply means we need to compute the two values:

P(yes) P(Sunny|yes) P(Cool|yes) P(High|yes) P(Strong|yes)
P(no) P(Sunny|no) P(Cool|no) P(High|no) P(Strong|no)

and pick the label that produced the higher value. The naive Bayes implementation in Scalabha will do this for you, provided you give it the data in the right format.

The data includes a test set for the tennis task as well, provided in full here:

data/classify/tennis/test
Outlook=Sunny,Temperature=Cool,Humidity=High,Wind=Strong,No
Outlook=Overcast,Temperature=Cool,Humidity=Normal,Wind=Weak,Yes
Outlook=Sunny,Temperature=Hot,Humidity=Normal,Wind=Weak,Yes
Outlook=Rain,Temperature=Hot,Humidity=High,Wind=Strong,No
Outlook=Sunny,Temperature=Cool,Humidity=Normal,Wind=Weak,Yes
Outlook=Overcast,Temperature=Hot,Humidity=High,Wind=Strong,No
Outlook=Sunny,Temperature=Mild,Humidity=High,Wind=Weak,Yes
Outlook=Overcast,Temperature=Mild,Humidity=Normal,Wind=Strong,Yes
Outlook=Rain,Temperature=Cool,Humidity=Normal,Wind=Strong,No
Outlook=Overcast,Temperature=Cool,Humidity=Normal,Wind=Strong,Yes
Outlook=Rain,Temperature=Hot,Humidity=Normal,Wind=Weak,Yes
Outlook=Sunny,Temperature=Cool,Humidity=High,Wind=Weak,Yes
Outlook=Rain,Temperature=Hot,Humidity=Normal,Wind=Strong,No

Like the training set, this provides a list of instances, and for each instance, the values for each of the attributes and the classification outcome. We'll use Scalabha's naive Bayes classifier to make predictions on each instance and then compare its most probable label with the gold standard label provided in the test set.
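To make the decision rule concrete, here is a small, self-contained Scala sketch that applies unsmoothed naive Bayes to the example instance above, using the training counts from tennis/train. It is only an illustration of the math; Scalabha's actual implementation differs in its details (e.g. smoothing), so the exact probabilities it prints will not match the classifier output shown later.

object TennisNaiveBayesSketch extends App {

  // Training instances from tennis/train: attribute values in the order
  // (Outlook, Temperature, Humidity, Wind), plus the label.
  val train = List(
    (List("Sunny", "Hot", "High", "Weak"), "No"),
    (List("Sunny", "Hot", "High", "Strong"), "No"),
    (List("Overcast", "Hot", "High", "Weak"), "Yes"),
    (List("Rain", "Mild", "High", "Weak"), "Yes"),
    (List("Rain", "Cool", "Normal", "Weak"), "Yes"),
    (List("Rain", "Cool", "Normal", "Strong"), "No"),
    (List("Overcast", "Cool", "Normal", "Strong"), "Yes"),
    (List("Sunny", "Mild", "High", "Weak"), "No"),
    (List("Sunny", "Cool", "Normal", "Weak"), "Yes"),
    (List("Rain", "Mild", "Normal", "Weak"), "Yes"),
    (List("Sunny", "Mild", "Normal", "Strong"), "Yes"),
    (List("Overcast", "Mild", "High", "Strong"), "Yes"),
    (List("Overcast", "Hot", "Normal", "Weak"), "Yes"),
    (List("Rain", "Mild", "High", "Strong"), "No"))

  // P(label): relative frequency of the label in the training data.
  def pLabel(label: String) =
    train.count(_._2 == label).toDouble / train.size

  // P(attribute_i = value | label): relative frequency among instances with that label.
  def pValueGivenLabel(i: Int, value: String, label: String) = {
    val withLabel = train.filter(_._2 == label)
    withLabel.count(_._1(i) == value).toDouble / withLabel.size
  }

  // Score for one label: P(label) times the product over attributes of P(value | label).
  def score(instance: List[String], label: String) =
    pLabel(label) * instance.zipWithIndex.map {
      case (value, i) => pValueGivenLabel(i, value, label)
    }.product

  // The example instance: Outlook=Sunny,Temperature=Cool,Humidity=High,Wind=Strong
  val instance = List("Sunny", "Cool", "High", "Strong")
  val scores = List("Yes", "No").map(label => (label, score(instance, label)))
  val total = scores.map(_._2).sum

  // Normalize so the two values sum to one, then pick the higher one.
  scores.foreach { case (label, s) => println(label + " " + s / total) }
  println("Prediction: " + scores.maxBy(_._2)._1)
}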
The only thing you need to do for this problem is run the naive Bayes classifier so that it trains on the train file and is evaluated on the test file. Run the following commands:

$ cd $SCALABHA_DIR/data/classify
$ scalabha classify --train tennis/train --predict tennis/test --out predictions.txt
$ scalabha score --gold tennis/test --predict predictions.txt

You should see the following output:

Accuracy: 61.54

Tip: things are set up so that you can pipe the output of classify directly into score:
$ scalabha classify -t tennis/train -p tennis/test | scalabha score -g tennis/test
Accuracy: 61.54

Here's a bit more detail on what is going on. The classify command outputs a probability distribution for each instance, for example:

$ scalabha classify -t tennis/train -p tennis/test
No 0.7098248137334299 Yes 0.2901751862665702
Yes 0.9653030731512793 No 0.03469692684872127
Yes 0.676016267177877 No 0.323983732822123
No 0.7334732282476866 Yes 0.2665267717523131
Yes 0.8066941597364939 No 0.19330584026350536
Yes 0.5767484281985087 No 0.4232515718014911
No 0.5571385693746576 Yes 0.44286143062534244
Yes 0.9085594742433201 No 0.09144052575668082
Yes 0.7607393593960057 No 0.23926064060399474
Yes 0.9226202882823726 No 0.07737971171762743
Yes 0.7876617086814863 No 0.21233829131851342
No 0.5118066815368538 Yes 0.48819331846314595
Yes 0.6138655053429548 No 0.38613449465704547

The most probable label is the leftmost on each line. Notice that the probabilities sum to one (as they should). So, if we take the most probable label as the prediction (which is the usual thing to do), the model predicts "No" for the first instance in the test file:

Outlook=Sunny,Temperature=Cool,Humidity=High,Wind=Strong

This does match the actual label given in the file for that instance. (Other predictions are, of course, different from the test labels.)
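For reference, the score command is just reporting the percentage of instances whose most probable predicted label matches the gold label. Here is a minimal sketch of that comparison (the file names are the ones used above; this is not Scalabha's actual scoring code, only an illustration of the idea):

import scala.io.Source

object AccuracySketch extends App {
  // Gold label: the last comma-separated field of each instance in the gold file.
  val gold = Source.fromFile("tennis/test").getLines().map(_.split(",").last).toList
  // Predicted label: the first (most probable) label on each line of classifier output.
  val predicted = Source.fromFile("predictions.txt").getLines().map(_.split("\\s+").head).toList
  val numCorrect = gold.zip(predicted).count { case (g, p) => g == p }
  println(f"Accuracy: ${100.0 * numCorrect / gold.size}%.2f")
}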
Written answer: Run the naive Bayes classifier by training it on the "test" file and evaluating it on the "train" file. Report the accuracy.

Problem 2 - Prepositional Phrase Attachment and smoothing [15 pts]

Prepositional phrase attachment is the well-known task of resolving a common ambiguity in English syntax regarding whether a prepositional phrase attaches to the verb or to the noun in sentences with the pattern Verb Noun_Phrase Prepositional_Phrase. An example is "I saw the man with the telescope." If the prepositional phrase attaches to the verb, then the seeing was done with the telescope; if it attaches to the noun, it indicates that the man had the telescope in his possession. A clear difference can be seen with the following related examples:

I ate spaghetti with meatballs.
I ate spaghetti with chopsticks.
For this problem, you will use a conveniently formatted data set for prepositional phrase attachment which has been made available by Adwait Ratnaparkhi. You can find it in the directory $SCALABHA_DIR/data/classify/ppa. Go to that directory and list the contents. There are three files which you will use for this problem: training, devset, and test. Look at the contents of training:

data/classify/ppa/training
0 join board as director V
1 is chairman of N.V. N
2 named director of conglomerate N
3 caused percentage of deaths N
5 using crocidolite in filters V
...

Each row lists an abbreviated form of a prepositional phrase attachment. The first item is just a source sentence identifier that we can ignore. The four words correspond to the head verb, head noun, preposition, and head noun object of the preposition, in that order. The final element indicates whether the attachment was to the verb (V) or to the noun (N). For example, for the two spaghetti eating sentences given above, the abbreviated forms would be:

4242 ate spaghetti with meatballs N
4243 ate spaghetti with chopsticks V

For this exercise, you will build a classifier that learns a model from the data in training and use it to classify new instances. You will develop your model using the material in devset. You must not personally inspect the contents of test; you will run your classifier on test only once (for problem 5), when you are done developing.

The first thing you must do is produce features from the ppa data in the format used previously for the tennis problem. Each of the four columns in the ppa data implicitly indicates a separate attribute; let's call them verb, noun, prep, and prep_obj, respectively. To begin, we'll just create features that are based directly on the attributes and their values. So, a line like

0 join board as director V

becomes the following:

verb=join,noun=board,prep=as,prep_obj=director,V
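For reference, this basic conversion is just a matter of splitting each line on whitespace, dropping the identifier, and attaching attribute names. A sketch of that mapping (illustrative only; ata.hw5.PpaFeatures already produces this format for you):

// "0 join board as director V" -> "verb=join,noun=board,prep=as,prep_obj=director,V"
def basicFeatures(line: String): String = {
  val Array(_, verb, noun, prep, prepObj, label) = line.trim.split("\\s+")
  List("verb=" + verb, "noun=" + noun, "prep=" + prep, "prep_obj=" + prepObj, label).mkString(",")
}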
The main method of ata.hw5.PpaFeatures reads in a ppa file and produces the features in the above format. Using this program, produce feature files for training, development, and test as follows:

$ cd $SCALABHA_DIR/data/classify
$ mkdir out
$ scalabha run ata.hw5.PpaFeatures ppa/training > out/ppa.basic.train
$ scalabha run ata.hw5.PpaFeatures ppa/devset > out/ppa.basic.dev
$ scalabha run ata.hw5.PpaFeatures ppa/test > out/ppa.basic.test

You can run scalabha classify on the feature files that are produced and score the output with scalabha score:

$ scalabha classify -t out/ppa.basic.train -p out/ppa.basic.dev | scalabha score -g out/ppa.basic.dev
Accuracy: 68.80

Ratnaparkhi et al. (1994) obtain accuracies of around 80%, so we should clearly be able to do much better than this. One obvious problem shows up if you look at the actual output:

$ scalabha classify -t out/ppa.basic.train -p out/ppa.basic.dev | more
V 0.8324686834013496 N 0.16753131659864912
V 1.0 N 0.0
V 1.0 N 0.0
V 1.0 N 0.0
V 1.0 N 0.0
V 0.5 N 0.5
V 1.0 N 0.0
V 0.9999400936526425 N 5.990634735608704E-5
V 0.5 N 0.5
V 1.0 N 0.0
V 0.9997272120557299 N 2.7278794426904216E-4
V 0.9997444685957534 N 2.5553140424721684E-4
V 0.9996717479741555 N 3.28252025843356E-4

There are many items that have zero probability for N, V, or both (those are the ones with uniform .5 probability for both labels). The problem is that we haven't done any smoothing, so there are many parameters that get assigned zero, and then the overall probability for the class becomes zero. For example, the tenth line in out/ppa.basic.dev is:

verb=was,noun=performer,prep=among,prep_obj=groups,N

The output gives zero probability to N because the only training instance that has noun=performer occurs with the V label:

verb=juxtapose,noun=performer,prep=with,prep_obj=tracks,V

Thus, the value of P(Noun=performer | Label=N) is zero, making P(Label=N | Verb=was, Noun=performer, Prep=among, PrepObj=groups) also zero, regardless of how much the rest of the information looks like an N attachment.

We can fix this by using add-λ smoothing, which hallucinates counts for every event and ensures that none of the probabilities are zero. Using smoothing is easy: just use the -lambda option and provide a value greater than zero. For example, here is what we get with a lambda value of one:

$ scalabha classify -t out/ppa.basic.train -p out/ppa.basic.dev -l 1 | scalabha score -g out/ppa.basic.dev
Accuracy: 80.81

Find a good λ for improving the accuracy on the development set. By exploring other values for λ, you may be able to improve the results.

Written answer. Report the best λ you found and what accuracy you obtained.
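For reference, add-λ smoothing simply adds λ pseudo-counts to every attribute-value/label combination when estimating the conditional probabilities, so no estimate is exactly zero. A sketch of the smoothed estimate (the counts below are made up for illustration, and Scalabha's internal bookkeeping may differ in details):

object AddLambdaSketch extends App {
  // Smoothed estimate of P(attribute = value | label):
  //   (count(value, label) + lambda) / (count(label) + lambda * numDistinctValues)
  val lambda = 1.0
  val countValueWithLabel = 0.0  // e.g. noun=performer never seen with label N
  val countLabel = 100.0         // number of training instances with label N (made up)
  val numDistinctValues = 5000.0 // number of distinct noun values seen (made up)
  val smoothed = (countValueWithLabel + lambda) / (countLabel + lambda * numDistinctValues)
  // The result is small but greater than zero, so a single unseen attribute value
  // can no longer force the probability of the whole class to zero.
  println(smoothed)
}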
Tip. You can easily check many possible values for λ using the Unix command line. For example:

$ for i in {0..5}; do echo $i; scalabha classify -t out/ppa.basic.train -p out/ppa.basic.dev -l $i | scalabha score -g out/ppa.basic.dev; done

You should see multiple accuracy values go by, each one associated with the λ value printed above it. (If this doesn't work, you might not be using the bash shell.) You can also specify ranges such as for i in .5 1 1.5 2 2.5 for drilling down further.

Remember: don't do anything with ppa/test yet!
Problem 3 - Extending the feature set [30 pts]

The simple set of features used for the ppa data above can definitely be improved upon. For example, you could have features that:
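As one purely illustrative sketch (these are hypothetical examples, not necessarily the features the assignment has in mind), extended features might combine attribute values or describe simple properties of the word forms, in the same attribute=value format used above:

// Hypothetical extra features for one PPA instance; illustrative only.
def extendedFeatures(verb: String, noun: String, prep: String, prepObj: String): List[String] =
  List(
    "verb+prep=" + verb + "+" + prep,           // two attributes combined
    "noun+prep=" + noun + "+" + prep,
    "prep_obj_lower=" + prepObj.toLowerCase,    // normalized word form
    "prep_obj_suffix=" + prepObj.takeRight(3),  // crude morphological cue
    "noun_is_number=" + noun.forall(_.isDigit)  // simple word-shape information
  )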
Part (b). Written answer. Describe five of the features you added, and why you think they would be useful.

Tip. Some resources have been imported for you:
$ scalabha run ata.hw5.PpaFeatures -e -b ppa/bitstrings ppa/training > out/ppa.extended.train
$ scalabha run ata.hw5.PpaFeatures -e -b ppa/bitstrings ppa/devset > out/ppa.extended.dev
$ scalabha classify -t out/ppa.extended.train -p out/ppa.extended.dev -l 1 | scalabha score -g out/ppa.extended.dev

Notice that the above commands use the shortened versions of the options. As before with the basic features, find an optimal λ on the dev set with the extended features. (Note that this may change as you add more features.) When you are satisfied that you cannot improve the performance any further, you can finally try out the test set! (Once you've done this, there is no going back to change the features any more.)

Part (c). Written answer. What is the performance you obtain when using the basic features and the extended features, each with their best λ?

Problem 4 - Use OpenNLP Maxent on the tennis dataset [10 pts]
This problem will get you using the OpenNLP Maxent classifier on the tennis dataset. OpenNLP Maxent has been imported as a dependency in Scalabha, and I have enabled OpenNLP's command line interface to be called from the bin/scalabha script with the commands maxent-train and maxent-apply. The native format of the tennis dataset is ready to use with OpenNLP Maxent. All of the instructions assume you are running the commands from the $SCALABHA_DIR/data/classify directory.

Now, try it out with maxent:

$ mkdir out
$ scalabha maxent-train tennis/train out/tennis_model.txt
Indexing events using cutoff of 1
Computing event counts... done. 14 events
Indexing... done.
Sorting and merging events... done. Reduced 14 events to 14.
Done indexing.
Incorporating indexed data for training... done.
Number of Event Tokens: 14
Number of Outcomes: 2
Number of Predicates: 10
...done.
Computing model parameters...
Performing 100 iterations.
1: .. loglikelihood=-9.704060527839234 0.35714285714285715
2: .. loglikelihood=-7.977369481882935 0.7142857142857143
3: .. loglikelihood=-7.274865786406583 0.8571428571428571
4: .. loglikelihood=-6.820070735855129 0.8571428571428571
5: .. loglikelihood=-6.494901774488233 0.8571428571428571
<more lines>
75: .. loglikelihood=-5.081164775456419 0.8571428571428571
76: .. loglikelihood=-5.081044303386861 0.8571428571428571
77: .. loglikelihood=-5.080931523277981 0.8571428571428571
78: .. loglikelihood=-5.080825909041938 0.8571428571428571
79: .. loglikelihood=-5.080726973756942 0.8571428571428571

This output shows the log likelihood (log prob) of the training dataset (data/classify/tennis/train) and the accuracy on the training set after every iteration of the parameter setting algorithm. Note that even though it states that it is doing 100 iterations, it stops early if the change in log likelihood is below a threshold. You can see the resulting model by inspecting out/tennis_model.txt, which contains information about the features and the parameter values that have been learned from the training data. Next, we'll use this model to classify test items.

Tip: You can suppress the training output by directing it to /dev/null, e.g.:

scalabha maxent-train tennis/train out/tennis_model.txt > /dev/null

Classify the items in the test set as follows:

$ scalabha maxent-apply out/tennis_model.txt tennis/test
No 0.8436671652082209 Yes 0.15633283479177895
Yes 0.9430013482443135 No 0.05699865175568643
Yes 0.716243615645877 No 0.28375638435412287
No 0.8285289429029252 Yes 0.17147105709707472
Yes 0.7592631514228292 No 0.24073684857717087
No 0.5624507721921893 Yes 0.43754922780781064
Yes 0.5321312571177222 No 0.4678687428822778
Yes 0.9067302278999085 No 0.09326977210009141
Yes 0.5766008802968591 No 0.423399119703141
Yes 0.8365744412930616 No 0.16342555870693856
Yes 0.7788844854214183 No 0.2211155145785817
No 0.6254365317232463 Yes 0.3745634682767537
Yes 0.5215105433689039 No 0.4784894566310961

The output is in the same format as you saw before with the naive Bayes implementation. That means you can funnel it right over to the scoring program:

$ scalabha maxent-apply out/tennis_model.txt tennis/test | scalabha score -g tennis/test
Accuracy: 76.92

Like naive Bayes, a maxent model can be smoothed. The underlying mechanism is quite different from add-lambda smoothing for naive Bayes. Instead of adding virtual counts, smoothing in maxent involves constraining parameter values; the result is that model parameters don't get as extreme as they would without smoothing. Another way of thinking about it is that we don't trust the training data fully, so we don't make the model parameters a perfect fit to the training data and instead ensure that they don't move too far from zero, obeying a Gaussian (normal) distribution.
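In the standard formulation of this kind of Gaussian prior (shown here in its general form, which may differ in details from OpenNLP's exact implementation), training maximizes the training log likelihood minus a penalty on the feature weights w_i, where σ is the standard deviation of that Gaussian:

\max_{w} \; \sum_{j} \log p_{w}(y_j \mid x_j) \; - \; \sum_{i} \frac{w_i^2}{2\sigma^2}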
The smoothing value you can provide to OpenNLP Maxent is the standard deviation of that Gaussian distribution, sigma. Larger values for sigma perform less smoothing in that they allow the parameters to be more extreme, while smaller values keep the parameters closer to zero. Here's how you specify it:

$ scalabha maxent-train -sigma 1.0 tennis/train out/tennis_model.txt

In the problems for this homework, a suggested set of values to try for sigma is [.01, .1, .5, 1.0, 5.0, 10, 50, 100]. You are of course welcome to explore other values, including others in this range. Keep in mind that you need to retrain the model and then apply it (whereas the naive Bayes command does both in one step). While this may seem like an extra step that you would prefer not to do, it is usually the case that one wants to train a model, save it, and then use it many times.

Part (a). Written answer. What is the accuracy when you use a sigma value of .1?

Part (b). Written answer. What is the accuracy when you use naive Bayes with a lambda of 5.0, trained on tennis/test and tested on tennis/train?

Part (c). Written answer. What is the accuracy when you use maxent with a sigma of .01, trained on tennis/test and tested on tennis/train?
Problem 5 - Prepositional Phrase Attachment with maxent [20 pts]

The output files from PpaFeatures that you produced for problem 3 are ready to be used with OpenNLP Maxent.

Part (a). Written answer. Use maxent to obtain a score on the dev set with both feature sets: (a) the simple features and (b) your extended features. You'll want to find a good sigma for each feature set. Note also that you can increase the number of iterations with the option -maxit, followed by a number ('-maxit 1000' is a good choice). Write down your best score with each feature set, and what your values for sigma and maxit were.
Part (b). Written answer. Based on the values of lambda, sigma, and maxit that you found to be optimal, run naive Bayes and maxent on the PPA test set. Report the results again for all three feature sets, using a table like the following:
Did the relative performance of the different models stay the same from the dev set to the test set? Briefly describe any differences you notice.

Tip: What you did for this problem can be used for any kind of standard labeling task you might be interested in. You just need to produce feature files in the formats used here, and you can then get the results back and use them as you like.

Problem 6 - Classifier confidence and performance [20 pts]
You may have noticed that some examples have a higher probability for their most probable label than others. As it turns out, classifier confidence is a good indicator of when the prediction is likely to be correct: the higher the probability, the higher the confidence, and, typically, the more likely the assignment is correct.

To make this concrete, consider the output of the naive Bayes classifier on the tennis dataset:

$ scalabha classify -t tennis/train -p tennis/test -l 1
No 0.7098248137334299 Yes 0.2901751862665702
Yes 0.9653030731512793 No 0.03469692684872127
Yes 0.676016267177877 No 0.323983732822123
No 0.7334732282476866 Yes 0.2665267717523131
Yes 0.8066941597364939 No 0.19330584026350536
Yes 0.5767484281985087 No 0.4232515718014911
No 0.5571385693746576 Yes 0.44286143062534244
Yes 0.9085594742433201 No 0.09144052575668082
Yes 0.7607393593960057 No 0.23926064060399474
Yes 0.9226202882823726 No 0.07737971171762743
Yes 0.7876617086814863 No 0.21233829131851342
No 0.5118066815368538 Yes 0.48819331846314595
Yes 0.6138655053429548 No 0.38613449465704547

Notice that the second line is the most confident, with a probability of 0.9653 for Yes, while the second to last line is the least confident, with a probability of 0.5118 for No. What you are doing in this problem is sorting based on the second column probability (the probability of the best label), and then scoring the most and least confident examples.

Part (a). Implementation. Write code to calculate how accuracy changes based on confidence. Implement this in the ConfidenceScorer object in the file ConfidenceScorer.scala. You must output the accuracy for the top-third most confident examples and for the bottom-third least confident examples. (This means that you'll ignore the middle third.) As an example, when it is done, you should be able to run it on the tennis dataset like this:

$ scalabha classify -t tennis/train -p tennis/test -l 1 > out/nbayes.tennis.test.txt
$ scalabha run ata.hw5.ConfidenceScorer tennis/test out/nbayes.tennis.test.txt
High confidence accuracy: 100.00
Low confidence accuracy: 0.00

$ scalabha maxent-train -sigma 1 tennis/train out/tennis.maxent.txt > /dev/null
$ scalabha maxent-apply out/tennis.maxent.txt tennis/test > out/maxent.tennis.test.txt
$ scalabha run ata.hw5.ConfidenceScorer tennis/test out/maxent.tennis.test.txt
High confidence accuracy: 100.00
Low confidence accuracy: 50.00

Except that you'll do this for the PPA dataset, which will give a more interesting result.

Part (b). Written answer. Use your scoring implementation to calculate the high and low confidence accuracies for both naive Bayes and maxent on the PPA development data. Include the output in your answer.

Tip: Look at the code in opennlp.scalabha.classify.ClassifyScorer for some examples of reading in the files and getting the labels and probabilities out of them.

Tip: Group the probability, gold label, and predicted label for each instance in a tuple and sort the List of tuples. If the probability is first (and is of type Double, not of type String), you'll get the sorted order you need, with the least confident at the front of the list and the most confident at the end, and the right predicted label will stay linked to its corresponding gold label.
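Here is a minimal sketch of the sorting step that the tip describes, assuming the gold file and the classifier output file have one instance per line in the formats shown above. The file names are just examples (pass the real paths as arguments in your solution), and this is only a starting point, not a complete solution:

import scala.io.Source

object ConfidenceSortingSketch extends App {
  // Gold labels: the last comma-separated field of each gold instance.
  val goldLabels = Source.fromFile("tennis/test").getLines().map(_.split(",").last).toList
  // Classifier output: the best label and its probability are the first two fields.
  val predictions = Source.fromFile("out/nbayes.tennis.test.txt").getLines().map { line =>
    val fields = line.split("\\s+")
    (fields(1).toDouble, fields(0))
  }.toList

  // Tuples of (probability of best label, gold label, predicted label), sorted by
  // probability: least confident at the front, most confident at the end.
  val sorted = predictions.zip(goldLabels).map {
    case ((prob, predicted), gold) => (prob, gold, predicted)
  }.sorted

  val third = sorted.size / 3
  def accuracy(items: Seq[(Double, String, String)]) =
    100.0 * items.count { case (_, gold, predicted) => gold == predicted } / items.size

  println(f"High confidence accuracy: ${accuracy(sorted.takeRight(third))}%.2f")
  println(f"Low confidence accuracy: ${accuracy(sorted.take(third))}%.2f")
}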
7. Feedback

(a) How long did this homework take you?
(b) Do you have any suggestions?
(c) Do you have any complaints?

8. Extra credit

Obtain another labeled dataset and do classification on it, using both naive Bayes and maxent. Most of your work will be converting the raw data format into that needed by the classifiers (including doing feature extraction). You can choose any suitable data set you like, but here are a couple of suggestions:

Copyright 2012 Jason Baldridge. The text of this homework is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original homework. Please email Jason at jasonbaldridge@gmail.com with suggestions, improvements, extensions and bug fixes.