DUE: March 30, 2012 by 10am (submission on Blackboard)

This assignment involves using k-means clustering to analyze several datasets, including authorship analysis of the Federalist Papers. In addition to giving you experience working with clustering and extracting features from text, the assignment is also designed to give you practical experience with programming in a more realistic environment than you've seen so far, including using a build system, accessing classes defined in an API, and working with file input/output.

Notes:
- As usual, the description of the problems for this homework is somewhat long -- don't let this scare you. In fact, much of the text is there to give you documentation and help you know what to do.
- You should read all of the problems before starting, but it will be best to solve them in order since each one builds on the previous problem.
- You should get an early start on the homework. There are likely to be bumpy starts for some of you, and we need to smooth those out sooner rather than later.
- My solutions to these problems don't use a single "var" variable. I encourage you to try to do without them too, but you may use them if you really must.
- If you run into what you think might be a bug with Scalabha, please let me know right away so I can fix it if it indeed is a bug.
If you don't absolutely need to have words on paper, don't print this homework description out -- it has a lot of example output that makes it lengthy, and it will be easier to do some cutting and pasting etc. right from the web page.

PRELIMINARIES

Getting Scalabha

You must obtain and install Scalabha, a software package that will allow you to access code written to support this assignment and to easily compile and run your own code. For this homework, you need version 0.2.2, which you can obtain here:

https://github.com/downloads/utcompling/Scalabha/scalabha-0.2.2.zip

Note: if you know how to use Git (or are interested in using it), you can just clone the Scalabha repository.

You should unzip the contents of scalabha-0.2.2.zip and follow the instructions for installing it in the file scalabha-0.2.2/README.md. Note that one of those instructions requires you to set an environment variable SCALABHA_DIR that points to the location of Scalabha on your machine. At times, this homework will reference file locations that are relative to that directory.

The data

You will work with clustering datasets that are located in $SCALABHA_DIR/data/cluster. Go to that directory and have a look at it.

Setting up the code

Unlike
previous assignments, where you wrote single-file Scala scripts, you
will this time develop your solutions in the context of the Scalabha
build system (which uses SBT, the Simple Build Tool). Apart from brief exposure in tutorial 11 on SBT and Scalabha, this is a new concept for most of the students in the class, but don't worry: it will actually make things much easier for you. It will also help you acquire an important set of skills for real software development.

There are just three files in the homework bundle hw4_stubs.zip: two stub files, Clustering.scala and Features.scala, and an answers file, hw4_answers.txt.
Please modify these files when solving the problems; do not use different names for any of them. To prepare for working with the assignment,
you should do the following steps:

$ mv hw4_stubs.zip $SCALABHA_DIR
$ cd $SCALABHA_DIR
$ unzip hw4_stubs.zip
Archive:  hw4_stubs.zip
   creating: src/main/scala/ata/hw4/
  inflating: src/main/scala/ata/hw4/Features.scala
  inflating: src/main/scala/ata/hw4/Clustering.scala
$ scalabha build compile

The last command should end with a line starting with [success]. If you are having trouble with this, get in touch with me right away.

Your implemented solutions will be done in Clustering.scala and Features.scala. Any portions of problems that begin with Written question or that request example output should go in hw4_answers.txt.

Tip: while you are working on your solutions, you should definitely take advantage of the ~compile command in SBT. It will compile your code automatically every time you save a file you are working on.

Note that most of the commands suggested in the problems assume that you are running them in one of the sub-directories of the $SCALABHA_DIR/data/cluster directory.

Submitting your solutions

For submission, create a zip file with the name hw4_<lastname>_<firstname>.zip that contains your src/main/scala/ata/hw4 directory and its contents. Here are some example commands for doing this:

$ cd $SCALABHA_DIR
$ zip -r hw4_baldridge_jason.zip src/main/scala/ata/hw4/
  adding: src/main/scala/ata/hw4/ (stored 0%)
  adding: src/main/scala/ata/hw4/Features.scala (deflated 67%)
  adding: src/main/scala/ata/hw4/Clustering.scala (deflated 56%)

Make sure that your code compiles before you do this. If it does not, I'll
deduct 20 points right away and ask you to fix it and resubmit.

Tip: you can create the scaladoc API documentation by running the "doc" command in SBT:

$ scalabha build
[info] Loading global plugins from /home/jbaldrid/.sbt/plugins
[info] Loading project definition from /home/jbaldrid/devel/scalabha/project
[info] Set current project to Scalabha (in build file:/home/jbaldrid/devel/scalabha/)
> doc
[info] Updating {file:/home/jbaldrid/devel/scalabha/}default-29cc0c...
<many lines of output>
[info] API documentation generation successful.
[success] Total time: 16 s, completed Mar 6, 2012 9:01:07 PM

Having done this, you can now point your browser to $SCALABHA_DIR/target/api/index.html to see the Scalabha API. The main information for this homework is that for the opennlp.scalabha.cluster package, which you can see at this location:

$SCALABHA_DIR/target/api/index.html#opennlp.scalabha.cluster.package

You also have access to the source code in $SCALABHA_DIR/src/main/scala, which you will be looking at at times to solve the problems in this homework.

1. Basic clustering with generated data

Go to the directory $SCALABHA_DIR/data/cluster/generated. In this directory, you'll find several files, one of which is clusters_equal_variance.dat. Look at its contents, e.g.:

~/devel/scalabha/data/cluster/autogenerated$ more clusters_equal_variance.dat
1 1 -1.27 -1.6
2 1 -0.88 -0.55
3 1 -2.34 -0.85
4 1 0.2 -0.32
<more lines omitted>

This file contains the x-y coordinates for points that have been randomly generated from three centroids (sampling from independent Gaussian distributions for each dimension). Here's what they look like graphically:

[figure: scatter plot of the three generated clusters]

The three clusters are very clear, so k-means should be able to do a good job with them.

For
this problem, you will create a reader that consumes each line of input
in this file to produce the information needed to run k-means, then run
the algorithm and evaluate its output. What you do for this problem
will probably seem hard as you are doing it, but you’ll see that it is
quite straightforward when you are done -- but only once you’ve done it
for yourself with plenty of room to get confused and then resolve your
confusion. (IMPORTANT: don’t hesitate to contact me for help if you just
can’t get started.)

Part (a)

The first step is to create the reader. This will be the first foray into using an API in this course (it’s a simple one). Your starting points are the main and readData methods in the ata.hw4.RunKmeans object (found in src/main/scala/ata/hw4/Clustering.scala). Look at Clustering.scala and scan the documentation in it. You need to take the input file name given to the main method and call readData with that file name. Note that readData returns a triple of empty sequences for now. Call it from the main method and then compile Scalabha (using SBT). Now, go to $SCALABHA_DIR/data/cluster/generated and run the following command:

$ scalabha run ata.hw4.RunKmeans clusters_equal_variance.dat cosine ident 3

It should do nothing (and that’s fine). Now, implement readData so that it returns a triple of ids, cluster labels, and Points, e.g. for the file above:

(IndexedSeq(1, 2, 3, …), IndexedSeq(1, 1, 1, …), IndexedSeq(Point(-1.27,-1.6), Point(-0.88,-0.55), Point(-2.34,-0.85)))

Basically, the first element of the triple is the first column of the file, the second element is the second column, and the third element is the rest of the columns aggregated as Points. What’s a Point? Look at the class opennlp.scalabha.cluster.Point and figure out how to create them based on the values you are getting from the file. (Need help finding that class? Look under $SCALABHA_DIR/src/main/scala/opennlp.) Once you have done this, you should now capture the return values of readData. If you print out the points, you should see this:

[-1.27,-1.6]
[-0.88,-0.55]
[-2.34,-0.85]
[0.2,-0.32]
<many more points>

Once you have gotten to this point, remove the line that prints the points and move on to the next part of this question.
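If you get stuck, here is a minimal sketch of one way readData could be written. It assumes Point can be constructed from an IndexedSeq[Double] of coordinates and that the stub wants a triple of ids, labels, and Points as Strings and Points -- verify both against the stub and the Point class rather than taking this as given.

import opennlp.scalabha.cluster.Point

def readData(filename: String): (IndexedSeq[String], IndexedSeq[String], IndexedSeq[Point]) = {
  // Each line looks like: <id> <label> <value> <value> ...
  val rows = io.Source.fromFile(filename).getLines.toIndexedSeq.map { line =>
    val fields = line.split("\\s+")
    val coordinates = fields.drop(2).map(_.toDouble).toIndexedSeq
    (fields(0), fields(1), Point(coordinates))   // Point construction is an assumption -- check the class
  }
  (rows.map(_._1), rows.map(_._2), rows.map(_._3))
}

The ids and labels happen to be numbers in this particular file, but keeping them as Strings also works for the schools and countries data later in the homework.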
Part (b)

Next, you need to get the object for the distance function that corresponds to the distance function option from the command line. For example, “cosine” requires using the CosineDistance object in Points.scala. There is a helper object DistanceFunction (also in Points.scala) that allows you to get the right function given a description such as “cosine”, via its apply method (recall that this means you aren’t required to write out “apply” in order to use it). Also, you should create a variable k of type Int that has the value corresponding to the kOption String given for the number of clusters.

Given the points and the distance function, you can create a Kmeans object. Look at opennlp.scalabha.cluster.Kmeans to see how to call its constructor, and then create an object in your main method. With that object, you can now run k-means for a given k. Do so, capture the created centroids from the return value of Kmeans.run (remember that Kmeans is a class, and you are creating an instance of that class, e.g. val foo = Kmeans(...), so that means doing foo.run(...)). Write code that prints out the centroids, one per line. For the command line options used below, you should get the same set of three centroids.

$ scalabha run ata.hw4.RunKmeans clusters_equal_variance.dat cosine ident 3
[5.796363636363637,3.7084848484848485]
[-1.0533333333333335,-1.3621428571428567]
[-1.7792000000000001,6.305599999999999]
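If you are unsure how these pieces fit together, the relevant part of main might look roughly like the sketch below. The Kmeans constructor arguments, the order of the values returned by run, and the name distanceOption are assumptions (only kOption is named in the stub per the description above), so check the scaladoc rather than copying this blindly.

val distance = DistanceFunction(distanceOption)   // e.g. "cosine" or "euclidean" from the command line
val k = kOption.toInt
val kmeans = Kmeans(points, distance)             // constructor arguments assumed -- see the scaladoc
val (dispersion, centroids) = kmeans.run(k)       // assumed to return the dispersion and the centroids
centroids.foreach(println)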
Comment out or remove this centroid printing, and then move on to the next part.
Part (c)

Just given the centroids, we don’t know if they are any good, so let’s evaluate. For this, we need to know what the cluster memberships are for each point and then compute the confusion matrix between the predicted clusters and the true cluster labels. Use the Kmeans.computeClusterMemberships method for the former; with those memberships in hand, you have everything you need to create a ClusterConfusionMatrix (see the companion object to the ClusterConfusionMatrix class in opennlp/scalabha/cluster/Evaluation.scala).
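In other words, the remaining steps are roughly the following sketch, where labels comes from the readData triple and kmeans, centroids, and k come from Part (b). The argument list for ClusterConfusionMatrix's apply method is a guess -- check the companion object in Evaluation.scala for the real signature.

val memberships = kmeans.computeClusterMemberships(centroids)
val confusionMatrix = ClusterConfusionMatrix(labels, k, memberships)   // assumed signature
println(confusionMatrix)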
Now when you run with the same options, you should get the following output.

$ scalabha run ata.hw4.RunKmeans clusters_equal_variance.dat cosine ident 3
--------------------------------------------------------------------------------
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
--------------------------------------------------------------------------------
3    43   4    |  50   [1]
30   0    0    |  30   [2]
0    0    20   |  20   [3]
-------------------------
33   43   24
[0]  [1]  [2]

The rows show the true, gold label counts and the columns give the predicted cluster id counts. If the clustering worked, there should be one column for each row that contains all the counts, so this output indicates that something is amiss. The picture of the clusters above shows that the clusters are all very distinct, so we should have perfect clustering for this particular set of points. Make the obvious fix to the options (no change in code needed) and rerun -- you should now get it all perfectly.

Written answer: include the output of the confusion matrix from this fix.

You
now have most of the elements in place for computing k-means clusters
for a variety of datasets, provided they are given in the format
expected by readData.

Part (d)

Written answer:
Look at the code for computing k-means and describe in a paragraph or
two what you understand about how it works and how the different
functions relate to the pseudo-code from the slides.
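To help connect the code to the pseudo-code, here is a generic outline of the k-means loop over plain coordinate vectors. It is only for orientation: Scalabha's Kmeans class organizes the same assignment and update steps into its own methods, which may differ in detail from this sketch.

object KmeansOutline {
  type Vec = IndexedSeq[Double]

  // Componentwise mean of a non-empty set of vectors.
  def mean(vs: Seq[Vec]): Vec =
    vs.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / vs.length)

  def cluster(points: IndexedSeq[Vec], centroids: IndexedSeq[Vec], iters: Int,
              dist: (Vec, Vec) => Double): IndexedSeq[Vec] =
    if (iters == 0) centroids
    else {
      // Assignment step: group each point with the index of its nearest centroid.
      val assignments = points.groupBy(p => centroids.indices.minBy(i => dist(centroids(i), p)))
      // Update step: each centroid becomes the mean of the points assigned to it
      // (keeping the old centroid if no points were assigned to it).
      val updated = centroids.indices.map(i => assignments.get(i).map(mean).getOrElse(centroids(i))).toIndexedSeq
      cluster(points, updated, iters - 1, dist)
    }
}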
2. Scaling dimensions

Quite often, the dimensions we would like to use for clustering are scaled
differently. For example, if we are clustering individuals based on
their income and height, the income dimension will have values in the
thousands and be spread over a large range, whereas heights will have
values in the hundreds (e.g. if measured in centimeters) and be
constrained to a fairly small range (e.g. 50 cm to 200 cm). Distance
measures, especially Euclidean and Manhattan distances, will fall afoul of such disparate dimensions because the one with greater values and greater variance will tend to dominate the distance computation.

As an example, look at the data in:

$SCALABHA_DIR/data/cluster/generated/clusters_bigger_x_variance.dat

The points in this dataset were generated from the same centroids as in the previous problem, but the variance in the x dimension was increased from 1.0 to 5.0 (you can see the R code in generate_clusters.R to see how). Visually, the clusters look like this:

[figure: scatter plot of the clusters with increased x variance]

The x values have been stretched out, which leads to many points ending up closer to points from other clusters. So, k-means is pretty much guaranteed to do poorly if we use the exact values as given in the dataset.

A
standard and simple way to scale the dimensions so that they are
comparable is to use z-scores. All this means is we compute the mean and
standard deviation of all the values in a particular dimension, and
then the transformed value for each point is simply how many standard deviations it is from the mean for that dimension. For the dataset depicted above, this will have the effect of shrinking the x-values (imagine squishing the graph by putting your hands on the left and right sides and compressing it). For more details, check out the Wikipedia page on standard scores (z-scores).
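As a concrete illustration of the idea (this is just toy arithmetic for one dimension, not the ZscoreTransformer itself):

val xs = IndexedSeq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
val mean = xs.sum / xs.length
val stddev = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.length)
val zscores = xs.map(x => (x - mean) / stddev)
// Here the mean is 5.0 and the standard deviation is 2.0, so 9.0 becomes 2.0
// (two standard deviations above the mean) and 4.0 becomes -0.5.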
Part (a)

Written answer: The z-score transform is implemented as ZscoreTransformer in Points.scala. Go have a look at it, and explain briefly how the apply method of the ZscoreTransformer class (not object) computes the sequence of z-scores based on the original values. Note: the “-” and “/” operators are not the same as when you do 3-4 or 1.0/5.0, and you need to show you understand where they come from and how they are operating on multiple values.

Part (b)

Run the Kmeans clusterer on the dataset using Euclidean distance. You should get the following output.

$ scalabha run ata.hw4.RunKmeans clusters_bigger_x_variance.dat euclidean ident 3
--------------------------------------------------------------------------------
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
--------------------------------------------------------------------------------
6    0    44   |  50   [1]
23   3    4    |  30   [2]
6    14   0    |  20   [3]
-------------------------
35   17   48
[0]  [1]  [2]

As expected, the clusters have ended up poorly capturing the structure of this data.

Fix this by enabling the ZscoreTransformer in your RunKmeans main method. Previously, you totally ignored the transformer option (given in the variable transformerOption).
You should now use it to retrieve the appropriate transformer object
and apply it to the input points before running k-means. Tip: See the documentation in Points.scala
for guidance on how to do this. Be prepared to ask questions for help
on this -- it’s easy, but I’m intentionally not spoon-feeding the answer
to you so that you have to put the pieces together yourself, since that is what you'll need to do with other APIs that you encounter in the future. So, to repeat: it is fine to ask questions. Even
though it is “easy”, I expect a number of you to start off confused by the degrees of freedom available to you.

After
enabling z-score scaling, you’ll find that the clusters found by
k-means are closer to the true clusters (though there are still some
“errors” due to the fact that some points actually did “wander” into
another centroid’s realm).

3. Schools.

Go to $SCALABHA_DIR/data/cluster/schools and look at achieve.dat. Its contents are the following:

BALDWIN 2.7 3.2 4.5 4.8
BARNARD 3.9 3.8 5.9 6.2
BEECHER 4.8 4.1 6.8 5.5
BRENNAN 3.1 3.5 4.3 4.6
CLINTON 3.4 3.7 5.1 5.6
<more schools>

Each
line describes reading and math scores for the given school for
students in the fourth and sixth grades. For example, 4th graders at
Baldwin scored 2.7 for reading and 3.2 for math (compared to a national average of 4.0) and 6th graders scored 4.5 for reading and 4.8 for math (compared to a national average of 6.0).

In
this problem, you’ll do a simple transformation from this format into
the format used in the previous problems such that you can use RunKmeans without modification. You'll also add a bit of error analysis output.

Part (a)

In Features.scala, modify the main method of the SchoolsFeatures object so that it reads in achieve.dat and prints out data in the format used for the previous problems. It should separate each school into a 4th and 6th grade row, give the label “4” or “6” as appropriate, and then give the relevant reading and math scores for that school and grade level. Here's an example of how running it should look:

$ scalabha run ata.hw4.SchoolsFeatures achieve.dat
BALDWIN_4th 4 2.7 3.2
BALDWIN_6th 6 4.5 4.8
BARNARD_4th 4 3.9 3.8
BARNARD_6th 6 5.9 6.2
<more lines>

Tip: Be aware that some schools have names with spaces in them -- and the format used for input to RunKmeans needs to be splittable on spaces. That means you need to convert names like LINCOLN BASSETT into LINCOLN_BASSETT.
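One possible shape for that main method is sketched below, assuming the scores are always the last four whitespace-separated fields on a line and that anything before them is the (possibly multi-word) school name. The name achieveFilename stands in for whatever argument your main method receives.

io.Source.fromFile(achieveFilename).getLines.foreach { line =>
  val fields = line.split("\\s+")
  val (nameFields, scores) = fields.splitAt(fields.length - 4)
  val name = nameFields.mkString("_")            // e.g. LINCOLN BASSETT -> LINCOLN_BASSETT
  val Array(read4, math4, read6, math6) = scores
  println(name + "_4th 4 " + read4 + " " + math4)
  println(name + "_6th 6 " + read6 + " " + math6)
}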
Part (b)

Now, direct the output of SchoolsFeatures to the file schools.dat and run k-means on it for k=2 and with no transformation of the points. You should see good separation into the 4th and 6th grade clusters, with a few schools that fall outside of their “true” cluster (we’ll get back to those in a bit).

Written answer:
Output the centroids that were computed by k-means, and include them in
your answer. Based on these, were the schools in this sample performing
above or below average in 4th and/or 6th grades? (Recall that the
averages are 4.0 for 4th graders and 6.0 for 6th graders.)

Part (c)

The
confusion matrix gives us a sense of the overall quality of the
clusters, but it doesn’t show us which items fell into which confusion
cell. Modify RunKmeans.outputClusterReport
so that for each kind of assignment of a predicted cluster to a true
cluster, it prints which items had that signature. Call it from RunKmeans.main
so that it outputs this information before the confusion matrix is
created and printed. Leaving out some details, your output should look something like the following (where I had previously saved the output of SchoolsFeatures to the file schools.dat):

$ scalabha run ata.hw4.RunKmeans schools.dat euclidean ident 2
[Label: 6] <=> [Cluster: 0]
Ids: BRENNAN_6th CONTE_6th <MORE>
[Label: 4] <=> [Cluster: 1]
Ids: EDGEWOOD_4th <MORE>
[Label: 4] <=> [Cluster: 0]
Ids: BALDWIN_4th BARNARD_4th <MORE>
[Label: 6] <=> [Cluster: 1]
Ids: BALDWIN_6th BARNARD_6th <MORE>
<FOLLOWED BY CONFUSION MATRIX>

Written answer: Provide the output in your answers, and briefly discuss the outliers.
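A minimal sketch of the kind of grouping involved, assuming outputClusterReport receives parallel sequences of ids, gold labels, and predicted cluster indices (adjust to match the actual stub signature):

def outputClusterReport(ids: Seq[String], labels: Seq[String], clusters: Seq[Int]) {
  // Group the item indices by their (gold label, predicted cluster) signature.
  val grouped = ids.indices.groupBy(i => (labels(i), clusters(i)))
  for (((label, cluster), indices) <- grouped) {
    println("[Label: " + label + "] <=> [Cluster: " + cluster + "]")
    println("Ids: " + indices.map(ids).mkString(" "))
  }
}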
Part (d)

Do the same, but with k=4 and also outputting the centroids.

Written answer: What is the structure of the data that was found? (Include your output.)

4. Birth and death rates of countries.

Go to $SCALABHA_DIR/data/cluster/countries and look at birth.dat. Its contents are the following:

ALGERIA 36.4 14.6
CONGO 37.3 8.0
EGYPT 42.1 15.3
GHANA 55.8 25.6
<more countries>

Each line gives the name of a country, then its birth rate and death rate.

Part (a)

In Features.scala, modify the main method of the CountriesFeatures object so that it reads in birth.dat and
prints out data in the format used for the previous problems. This is a
trivial transformation, but the missing thing here is the labels, since in fact there are none. For each line, give it the “label” 1. That means that the interest in this problem will be to look at the contents of the clusters found and explore them for interesting patterns, rather than trying to match predefined clusters (which is in fact generally what we are doing with clustering, as opposed to classification).

Here's what running CountriesFeatures should look like:

$ scalabha run ata.hw4.CountriesFeatures birth.dat
ALGERIA 1 36.4 14.6
CONGO 1 37.3 8.0
EGYPT 1 42.1 15.3
GHANA 1 55.8 25.6
IVORY_COAST 1 56.1 33.1
<more countries>

Note that you'll need to deal with spaces in names, as with the schools data.

Part (b)

Written answer:
Do some open-ended exploration of clustering for the dataset, using
different distance metrics, considering scaling, and trying different values for k (consider values up to 10, at least). In one to three paragraphs,
describe any interesting things you find.
5. Federalist papers, basic setup and features.

Now
we finally get to authorship analysis of the Federalist papers. For
this, you will adapt your script (or my solution if you prefer) from HW3 to extract the articles and their contents, compute features based on their contents for clustering, and then use those features for clustering.

Part (a)

Modify FederalistFeatures in Features.scala so that it can parse the articles in federalist.txt. Do this by modifying the solution to HW3 such that it implements the extractArticles method, and then call that method from the main method.

To check your solution, output the id, author, and first 40 characters of each article. Your output should look something like the following:

$ scalabha run ata.hw4.FederalistFeatures federalist.txt
1: HAMILTON
AFTER an unequivocal experience of the i
2: JAY
WHEN the people of America reflect that
3: JAY
IT IS not a new observation that the peo
<more articles>

To
be clear, this problem requires nothing more than plumbing. That is,
you should be able to take your previous script and make it work in this
context in the way specified. This isn’t hard, but it will reinforce
how main methods and support methods work using code you already wrote.

Once you are done, make sure to turn off this printing, e.g. by commenting it out in the code.

Part (b)

Given the list of articles returned by extractArticles, remove the extra copy of article #70. This is simple, but it is important to get rid of that extra copy for the clustering experiments you will do, and it is useful to get used to slicing and dicing sequences.

Suggestion: Assign the value returned by extractArticles to a “val” variable called something like allArticles, and then filter that to get a “val” variable called articles that you will use for the remainder of this problem.

Tip: Use the take and slice methods of IndexedSeq for this.
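For instance, if the duplicate turns out to sit immediately after the first copy of article #70 in allArticles (an assumption you should verify against your own extraction), something like this would drop it:

val articles = allArticles.take(70) ++ allArticles.slice(71, allArticles.length)
// take(70) keeps indices 0-69; slice(71, length) skips index 70, the assumed duplicate.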
Part (c)

Implement the extractSimplePoints method so that, for each article, it computes the count of “the”, “people”, and “which” and creates a Point from these three values. Use this method to compute the points from the text of the articles obtained from Part (b), and then print them out. Your output should look something like this:

$ scalabha run ata.hw4.FederalistFeatures federalist.txt
[130.0,5.0,18.0]
[105.0,21.0,11.0]
[91.0,7.0,11.0]
[84.0,7.0,10.0]
<many more>
The first line indicates that article #1 has 130 tokens of “the”, 5 tokens of “people”, and 18 tokens of “which”, and so on for the remaining articles. (These are counts ignoring case, so “The” counts as “the”.)

Tip: To get the same output as I have given above (and to simplify your job significantly), use the SimpleTokenizer object in opennlp.scalabha.util.StringUtils. It takes a String and returns an IndexedSeq[String], with one token per element of the sequence.

Note: There is a featureType option in the FederalistFeatures main method -- you can safely ignore that until the next problem.
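Putting those pieces together, extractSimplePoints might look roughly like the following sketch, assuming each article's text is available as a String, that SimpleTokenizer is used via its apply method, and that Point wraps an IndexedSeq[Double] (verify all of these against the stub and the API):

import opennlp.scalabha.cluster.Point
import opennlp.scalabha.util.StringUtils.SimpleTokenizer

def extractSimplePoints(articleTexts: Seq[String]): IndexedSeq[Point] = {
  val targetWords = IndexedSeq("the", "people", "which")
  articleTexts.map { text =>
    // Count lowercased tokens, then pull out just the three target words.
    val counts = SimpleTokenizer(text).map(_.toLowerCase).groupBy(identity).mapValues(_.length)
    Point(targetWords.map(w => counts.getOrElse(w, 0).toDouble))
  }.toIndexedSeq
}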
Once you have the right values being computed for counts of “the”, “people”, and “which”, comment out the printing of the points, and instead output the format needed as input to RunKmeans. The output should look like the following:

$ scalabha run ata.hw4.FederalistFeatures federalist.txt
1 HAMILTON 130.0 5.0 18.0
2 JAY 105.0 21.0 11.0
3 JAY 91.0 7.0 11.0
4 JAY 84.0 7.0 10.0
<many more>

Part (d)

You can now run k-means on the points in the three dimensions extracted in Part (c). Redirect the output of FederalistFeatures to a file, e.g. simple_fedfeats.dat, and then provide it as input to RunKmeans. For k=4, euclidean distance, and no transformation (the identity transformer), you should get the following output.

$ scalabha run ata.hw4.RunKmeans simple_fedfeats.dat euclidean ident 4
--------------------------------------------------------------------------------
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
--------------------------------------------------------------------------------
30   3    9    9    |  51   [HAMILTON]
8    0    2    1    |  11   [HAMILTON_OR_MADISON]
1    0    4    0    |  5    [JAY]
4    1    0    10   |  15   [MADISON]
1    0    1    1    |  3    [HAMILTON_AND_MADISON]
---------------------------------
44   4    16   21
[0]  [1]  [2]  [3]

Written answer:
Now run it with the z-score transformation and include the confusion
matrix output in your answers. Say in 3-5 sentences whether you think the results are better and why. (There is no simple answer here, just do
your best to eyeball it. You could also look at the detailed output from outputClusterReport if you wish, though this isn’t required.)
6. Federalist papers, authorship attribution with better features

The
results from the three paltry dimensions used in problem 5 are
certainly unimpressive. Let’s improve that. For this problem, you will
modify the extractPoints method to do some more serious feature extraction.

Part (a)

First, let's take care of ensuring that we can choose the “simple” features of problem 5 or the “full” features you will implement now. Based on the value of the featureType variable, you should create points from either extractSimplePoints or extractPoints as appropriate.

Tip: Use the fact that conditionals in Scala are expressions that return values, so you can do it like this:

val points = if (...) extractSimplePoints(...) else extractPoints(...)

You should now be able to get the feature file that you got for problem 5 by adding “simple” to the command line:

$ scalabha run ata.hw4.FederalistFeatures federalist.txt simple

Without that, you'll get no output (since extractPoints isn't implemented yet and the default is to use “full” features).

Part (b)

Now implement the full features. As you develop them, you can output your features to a file and then use RunKmeans
to see what the clustering performance is. Based on that, you can adjust your features and try to improve things. (Note: this would be terrible
experimental methodology if you were trying to make a scientific point
about authorship of the Federalist papers, but this gives you a chance
to see how different features affect performance in this homework sandbox.)

This is very open-ended, but here are a few suggestions:

- Consider using relative frequencies rather than raw counts.
- Consider working with document frequencies in addition to word frequencies. This just means counting how many documents a word occurred in. Words like “the” tend to have high values, whereas others, like “Macedon”, have a low count. A common strategy is to compute the tf-idf (for term frequency - inverse document frequency) score for a word in a given document as the value for that word/feature for that document.
- Consider using bigrams (pairs of words) as well as unigrams (single words).
- Consider using other features like capitalization patterns, average sentence length, average word length, type/token ratios, and so on.
- You'll not want to include all words and their values (be they counts, relative frequencies, or something else) since too many dimensions will basically overwhelm the k-means algorithm (and it will also probably run very slowly). So, you need to either add them by hand (tedious, but possible) or compute the values for all words for all articles and then filter them down to some reduced set. For example, you can use some minimum number of counts, or some threshold on a value computed from counts and document frequencies. (Feel free to get more sophisticated than that if you like.) One way of combining a couple of these ideas is sketched just after this list.
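For example, here is a hedged sketch along the lines of the unigram relative-frequency idea, filtering the vocabulary by corpus frequency divided by document frequency. The threshold and all names here are illustrative, not part of the Scalabha API, and the Point construction and SimpleTokenizer usage are the same assumptions as in the earlier sketch.

def extractPoints(articleTexts: Seq[String]): IndexedSeq[Point] = {
  val tokenized = articleTexts.map(text => SimpleTokenizer(text).map(_.toLowerCase))
  val corpusCounts = tokenized.flatten.groupBy(identity).mapValues(_.length.toDouble)
  // Document frequency: the number of articles each word type occurs in.
  val docFreqs = tokenized.map(_.toSet).flatten.groupBy(identity).mapValues(_.length.toDouble)
  // Keep word types whose corpus frequency / document frequency clears a threshold.
  val vocab = corpusCounts.keys.filter(w => corpusCounts(w) / docFreqs(w) >= 6.0).toIndexedSeq
  tokenized.map { tokens =>
    val counts = tokens.groupBy(identity).mapValues(_.length.toDouble)
    val total = tokens.length.toDouble
    Point(vocab.map(w => counts.getOrElse(w, 0.0) / total))   // relative frequencies
  }.toIndexedSeq
}

Run the resulting feature file through RunKmeans and treat the threshold only as a starting point to experiment with.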
I
encourage you all to discuss ideas for features on the class message
board (without showing code, but sharing ideas and even equations for
computing them).

Once
you start getting a lot of dimensions, k-means can get overwhelmed and
do a bad job, especially if the features are highly correlated with one
another. This is very much expected for the features you are extracting
here, and you should combat this by using the PcaTransformer
(invoked with the “pca” option for the transformer). It scales the
data, and then applies principal components analysis to it so that you
can get the points in the transformed PCA space. It also automatically
reduces the dimensionality to just the principal components that account
for 95% of the variance in the original points. This might sound
tricky, but it is actually handled for you automatically if you just
request a PcaTransformer on your input points. Aren’t APIs nice?

Note:
You’ll very likely need to turn the values for each of the articles into a matrix
so that for each dimension, each article has a value (possibly zero).
You’ll probably have a Map from features (e.g. words) to values
(Doubles) for each article, so you are converting each of these Maps
into a Point, ensuring that the values for one point are consistent with
other points in terms of lining up with corresponding dimensions (e.g.,
if the count of “the” is the first dimension, it must be the first
element for the Points for every article). This is another “plumbing”
problem, but it is important to be able to work with such
transformations. If you have trouble doing this, ask for help!
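A small sketch of that alignment step, assuming you have one Map of feature names to values per article (names are illustrative, and the Point construction is again an assumption):

import opennlp.scalabha.cluster.Point

def mapsToPoints(featureMaps: Seq[Map[String, Double]]): IndexedSeq[Point] = {
  // Fix an ordering over all feature names seen in any article.
  val dimensions = featureMaps.flatMap(_.keys).distinct.toIndexedSeq
  // Every article gets a value for every dimension, defaulting to zero.
  featureMaps.map(features => Point(dimensions.map(dim => features.getOrElse(dim, 0.0)))).toIndexedSeq
}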
Part (c)

Written answer: Discuss your features, and how you arrived at them. How successful were you in recovering the clusters? How important is PCA for the features you used? Describe any major challenges you encountered in producing features that led to good performance.

Note that you can achieve nearly perfect results. My own full feature extraction leads to the following confusion matrix:

--------------------------------------------------------------------------------
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
--------------------------------------------------------------------------------
0    0    0    51   |  51   [HAMILTON]
11   0    0    0    |  11   [HAMILTON_OR_MADISON]
0    5    0    0    |  5    [JAY]
14   0    1    0    |  15   [MADISON]
0    0    3    0    |  3    [HAMILTON_AND_MADISON]
---------------------------------
25   5    4    51
[0]  [1]  [2]  [3]

So,
just one article goes amiss, and it’s sensible because it is a Madison
article that was grouped with the articles written by both Hamilton and
Madison (suggesting, perhaps, that Madison was the “main” author of
those joint works). My
solution is actually pretty simple: the values are relative frequencies
of unigrams, and they are selected by choosing all word types that have
a value greater than or equal to 6.0 when you divide their corpus frequency by their
document frequency. It will be interesting if someone can get similarly
almost perfect separation of the clusters using features like average
sentence and word lengths and such.
7. Feedback

(a) How long did this homework take you?
(b) Do you have any suggestions?
(c) Do you have any complaints?
8. Extra credit

Much of this assignment is designed to help newer programmers get used to using an API. For those who already have significant programming experience, those parts should go quite smoothly and shouldn't take much time. However, the homework provides plenty of launching points for going further. Here are some suggestions:

1. Other datasets. Collect examples from another authorship collection and run experiments. This could involve finding a standard dataset, or creating one for yourself.

2. Cluster cardinality. Determine k automatically. You can look at things like the Akaike Information Criterion (AIC). See section 16.4.1 of Manning, Raghavan and Schütze for a discussion. However, AIC has an unfortunate parameter q that takes you from tuning k to tuning q. A more interesting alternative is the gap statistic, which Edwin Chen discusses in his blog post Counting Clusters. You'll implement this as a wrapper around Kmeans, and you'll make use of the dispersion values returned by Kmeans.run. Run it on the datasets in this homework and see how many clusters are predicted as being necessary for each dataset. Of particular interest is whether the Federalist papers show four distinct authorship patterns (each author on his own plus the three by Madison and Hamilton together).

3. Translation from points in the transformed space. Enable translation from points in a transformed space back into the original space (e.g. for plotting the centroids computed in the transformed space back to the original space). This is easy for z-scores, but will be a bit trickier for PCA.

4. K-means implementation. Implement k-means by yourself. Compare and contrast with my implementation in Scalabha. Consider doing some speed comparisons.

5. Hierarchical agglomerative clustering. Implement HAC. See Manning, Raghavan and Schütze for details. Run it on some of the datasets, especially the Federalist papers, and see whether interesting structure arises.

6. Scaling up. Use Scrunch, Scoobi, Scalding, or Spark to turn the Kmeans implementation into a distributed one. Find a large dataset (perhaps by looking at materials from Mahout) and make sure you can scale up. I can give you access to the Hadoop cluster on TACC if you want to do this.

I'm
happy to provide feedback and suggestions for any of these, and you are
welcome to work on something else that you are interested in and that
is relevant.

For
any of these, please provide a short write-up of what you did, and
provide the code (and data) you used separately from the code for the
main assignment, with instructions for how to compile and run it.
Copyright 2012 Jason Baldridge
The text of this homework is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to www.jasonbaldridge.com and to this original homework.
Please email Jason at jasonbaldridge@gmail.com with suggestions, improvements, extensions and bug fixes.