

DUE: March 30, 2012 by 10am (submission on Blackboard)

This assignment involves using k-means clustering to analyze several datasets, including authorship analysis on the Federalist Papers. In addition to giving you experience working with clustering and extracting features from text, the assignment also is designed to give you practical experience with programming in a more realistic environment than you’ve seen so far, including using a build system, accessing classes defined in an API, and working with file input/output.

  • As usual, the description of the problems for this homework is somewhat long -- don’t let this scare you. In fact, much of the text is there to give you documentation and help you know what to do.
  • You should read all of the problems before starting, but it will be best to solve them in order since each one builds on the previous problem.
  • You should get an early start on the homework. There are likely to be bumpy starts for some of you, and we need to smooth those out sooner rather than later.
  • My solutions to these problems don’t use a single “var” variable. I encourage you to try to do without them too, but you may use them if you really must.
  • If you run into what you think might be a bug with Scalabha, please let me know right away so I can fix it if it indeed is a bug.
If you don’t absolutely need to have words on paper, don’t print this homework description out -- it has a lot of example output that makes it lengthy, and it will be easier to do some cutting and pasting etc right from the web page.


Getting Scalabha

You must obtain and install Scalabha, a software package that will allow you to access code written to support this assignment and to easily compile and run your own code. For this homework, you need version 0.2.2, which you can obtain here:

Note: if you know how to use Git (or are interested in using it), you can just clone the Scalabha repository.

You should unzip the archive and follow the instructions for installing it in scalabha-0.2.2/. Note that one of those instructions requires you to set an environment variable SCALABHA_DIR that points to the location of Scalabha on your machine. At times, this homework will reference file locations that are relative to that directory.

The data

You will work with clustering datasets that are located in $SCALABHA_DIR/data/cluster. Go to that directory and have a look at it.

Setting up the code

Unlike previous assignments, where you wrote single-file Scala scripts, you will this time develop your solutions in the context of the Scalabha build system (which uses SBT, the Simple Build Tool). Apart from brief exposure in tutorial 11 on SBT and Scalabha, this is a new concept for most of the students in the class, but don’t worry: it will actually make things much easier for you. It will also help you acquire an important set of skills for real software development.

There are just three files in the homework bundle: two stub files, Clustering.scala and Features.scala, and an answers file, hw4_answers.txt. Please modify these files when solving the problems; do not use different names for any of them. To prepare for working with the assignment, you should do the following steps:

$ unzip
  creating: src/main/scala/ata/hw4/
 inflating: src/main/scala/ata/hw4/Features.scala  
 inflating: src/main/scala/ata/hw4/Clustering.scala  
$ scalabha build compile

The last command should end with a line starting with [success]. If you are having trouble with this, get in touch with me right away.

Your implemented solutions will be done in Clustering.scala and Features.scala. Any portions of problems that begin with Written question or that request example output should go in hw4_answers.txt.

Tip: while you are working on your solutions, you should definitely take advantage of the ~compile command in SBT. It will compile your code automatically every time you save a file you are working on.

Note that most of the commands suggested in the problems assume that you are running them in one of the sub-directories of the $SCALABHA_DIR/data/cluster directory.

Submitting your solutions

For submission, create a zip file with the name hw4_<lastname>_<firstname>.zip that contains your src/main/scala/ata/hw4 directory and its contents. Here are some example commands for doing this:

$ zip -r src/main/scala/ata/hw4/
 adding: src/main/scala/ata/hw4/ (stored 0%)
 adding: src/main/scala/ata/hw4/Features.scala (deflated 67%)
 adding: src/main/scala/ata/hw4/Clustering.scala (deflated 56%)

Make sure that your code compiles before you do this. If it does not, I’ll deduct 20 points right away and ask you to fix it and resubmit.

Tip: you can create the scaladoc API documentation by running the "doc" command in SBT:

$ scalabha build
[info] Loading global plugins from /home/jbaldrid/.sbt/plugins
[info] Loading project definition from /home/jbaldrid/devel/scalabha/project
[info] Set current project to Scalabha (in build file:/home/jbaldrid/devel/scalabha/)
> doc
[info] Updating {file:/home/jbaldrid/devel/scalabha/}default-29cc0c...
<many lines of output>
[info] API documentation generation successful.
[success] Total time: 16 s, completed Mar 6, 2012 9:01:07 PM

Having done this, you can now point your browser to $SCALABHA_DIR/target/api/index.html to see the Scalabha API. The main information for this homework is that for the opennlp.scalabha.cluster package, which you can see at this location:


You also have access to the source code in $SCALABHA_DIR/src/main/scala, which you will be looking at from time to time to solve the problems in this homework.

1. Basic clustering with generated data
Go to the directory $SCALABHA_DIR/data/cluster/generated. In this directory, you’ll find several files, one of which is clusters_equal_variance.dat. Look at its contents, e.g.:

~/devel/scalabha/data/cluster/generated$ more clusters_equal_variance.dat
1 1 -1.27 -1.6
2 1 -0.88 -0.55
3 1 -2.34 -0.85
4 1 0.2 -0.32
<more lines omitted>

This file contains the x-y coordinates for points that have been randomly generated from three centroids (sampling from independent Gaussian distributions for each dimension). Here’s what they look like graphically:

The three clusters are very clear, so k-means should be able to do a good job with them.

For this problem, you will create a reader that consumes each line of input in this file to produce the information needed to run k-means, then run the algorithm and evaluate its output. What you do for this problem will probably seem hard while you are doing it, but you’ll see that it is quite straightforward once you are done -- though only after you’ve worked through it yourself, with plenty of room to get confused and then resolve your confusion. (IMPORTANT: don’t hesitate to contact me for help if you just can’t get started.)

Part (a)

The first step is to create the reader. This will be the first foray into using an API in this course (it’s a simple one).  Your starting points are the main and readData methods in the ata.hw4.RunKmeans object (found in src/main/scala/ata/hw4/Clustering.scala). Look at Clustering.scala and scan the documentation in it. You need to take the input file name given to the main method, and call readData with that file name.  Note that readData returns a triple of empty sequences for now. Call it from the main method and then compile Scalabha (using SBT).  Now, go to $SCALABHA_DIR/data/cluster/generated and run the following command:

$ scalabha run ata.hw4.RunKmeans clusters_equal_variance.dat cosine ident 3

It should do nothing (and that’s fine). Now, implement readData so that it returns a triple of ids, cluster labels, and Points, e.g. for the file above:

(IndexedSeq(1, 2, 3, …), IndexedSeq(1, 1, 1, …), IndexedSeq(Point(-1.27,-1.6),Point(-0.88,-0.55), Point(-2.34, -0.85)))

Basically, the first element of the triple is the first column of the file, the second element is the second column, and the third element is the rest of the columns aggregated as Points. What’s a Point? Look at the class opennlp.scalabha.cluster.Point and figure out how to create them based on the values you are getting from the file. (Need help finding that class? Look under $SCALABHA_DIR/src/main/scala/opennlp.)
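To make the shape of the triple concrete, here is a hedged sketch of the parsing logic. The Point stand-in and the line-based signature are assumptions for illustration: the real opennlp.scalabha.cluster.Point may have a different shape (check the class), and the real readData takes a file name and would first read the lines, e.g. with io.Source.

```scala
// A minimal stand-in for opennlp.scalabha.cluster.Point (assumed here to
// wrap a sequence of Doubles -- check the real class for its actual shape).
case class Point(coord: IndexedSeq[Double])

// Sketch of readData: parse whitespace-separated lines into
// (ids, labels, points). It takes a Seq of lines rather than a file name
// so that the sketch is self-contained.
def readData(lines: Seq[String]): (IndexedSeq[String], IndexedSeq[String], IndexedSeq[Point]) = {
  val fields = lines.toIndexedSeq.map(_.trim.split("\\s+"))
  val ids = fields.map(_(0))     // first column: point id
  val labels = fields.map(_(1))  // second column: true cluster label
  val points = fields.map(f => Point(f.drop(2).map(_.toDouble).toIndexedSeq))
  (ids, labels, points)
}
```

The key move is splitting each line once and then picking the id, the label, and everything after the second column as the point's coordinates.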

Once you have done this, you should now capture the return values of readData. If you print out the points, you should see this:

<many more points>

Once you have gotten to this point, remove the line that prints the points and move on to the next part of this question.

Part (b)

Next, you need to get the object for the distance function that corresponds to the distance function option from the command line. For example, “cosine” requires using the CosineDistance object in Points.scala. There is a helper object DistanceFunction (also in Points.scala) that allows you to get the right function given a description such as “cosine”, via its apply method (recall that this means you aren’t required to write out “apply” in order to use it).
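If the companion-object apply pattern is new to you, here is a self-contained illustration of how it works. The names (Distance, ManhattanDistance) are stand-ins for this sketch, not Scalabha's actual classes -- look at DistanceFunction in Points.scala for the real thing.

```scala
// A trait whose implementations are themselves callable via apply.
trait Distance {
  def apply(x: Seq[Double], y: Seq[Double]): Double
}

object ManhattanDistance extends Distance {
  def apply(x: Seq[Double], y: Seq[Double]): Double =
    x.zip(y).map { case (a, b) => math.abs(a - b) }.sum
}

object Distance {
  // Because this is an apply method, Distance("manhattan") works
  // without writing out "apply".
  def apply(description: String): Distance = description match {
    case "manhattan" => ManhattanDistance
    case other => sys.error("Unknown distance function: " + other)
  }
}
```

So Distance("manhattan")(pointA, pointB) first selects the distance object by name, then invokes it on two points -- the same usage shape you need with DistanceFunction.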

Also, you should create a variable k of type Int that has the value corresponding to the kOption String given for the number of clusters.

Given the points and the distance function, you can create a Kmeans object. Look at opennlp.scalabha.cluster.Kmeans to see how to call its constructor, and then create an object in your main method. With that object, you can now run k-means for a given k. Do so, and capture the created centroids from the return value (remember that Kmeans is a class and you are creating an instance of that class, e.g. val foo = Kmeans(...)). Write code that prints out the centroids, one per line. For the command line options used below, you should get the same set of three centroids.

$ scalabha run ata.hw4.RunKmeans clusters_equal_variance.dat cosine ident 3
[5.796363636363637,3.7084848484848485]
<two more centroids>

Comment out or remove this centroid printing, and then move on to the next part.

Part (c)

Just given the centroids, we don’t know if they are any good, so let’s evaluate. For this, we need to know what the cluster memberships are for each point and then compute the confusion matrix between the predicted clusters and the true cluster labels. Use the Kmeans.computeClusterMemberships method for the former; with those memberships in hand, you have everything you need to create a ClusterConfusionMatrix (see the companion object to the ClusterConfusionMatrix class in opennlp/scalabha/cluster/Evaluation.scala).

Now when you run with the same options, you should get the following output.

$ scalabha run ata.hw4.RunKmeans clusters_equal_variance.dat cosine ident 3
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
3    43    4    |    50    [1]
30    0    0    |    30    [2]
0    0    20    |    20    [3]
33    43    24
[0]    [1]    [2]

The rows show the true, gold label counts and the columns give the predicted cluster id counts. If the clustering worked, there should be one column for each row that contains all the counts, so this output indicates that something is amiss. The picture of the clusters above shows that the clusters are all very distinct, so we should have perfect clustering for this particular set of points. Make the obvious fix to the options (no change in code needed) and rerun -- you should now get it all perfectly.

Written answer: include the output of the confusion matrix from this fix.

You now have most of the elements in place for computing k-means clusters for a variety of datasets, provided they are given in the format expected by readData.

Part (d)

Written answer: Look at the code for computing k-means and describe in a paragraph or two what you understand about how it works and how the different functions relate to the pseudo-code from the slides.
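As a reference point for relating the code to the pseudo-code, here is a bare-bones k-means sketch. This is not Scalabha's implementation; it just shows the two steps every version shares: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points.

```scala
// Minimal k-means sketch over points represented as coordinate sequences.
def kmeansSketch(points: IndexedSeq[IndexedSeq[Double]],
                 initial: IndexedSeq[IndexedSeq[Double]],
                 iterations: Int): IndexedSeq[IndexedSeq[Double]] = {
  def dist(x: IndexedSeq[Double], y: IndexedSeq[Double]) =
    math.sqrt(x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum)
  def mean(ps: IndexedSeq[IndexedSeq[Double]]): IndexedSeq[Double] =
    ps.transpose.map(dim => dim.sum / ps.length)

  (1 to iterations).foldLeft(initial) { (centroids, _) =>
    // Assignment step: group points by their nearest centroid.
    val clusters = points.groupBy(p => centroids.minBy(c => dist(p, c)))
    // Update step: a centroid with no assigned points keeps its position.
    centroids.map(c => clusters.get(c).map(mean).getOrElse(c))
  }
}
```

Scalabha's version adds details this sketch omits, such as convergence checking and pluggable distance functions, but the assign/update loop is the core to look for.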

2. Scaling dimensions
Quite often, the dimensions we would like to use for clustering are scaled differently. For example, if we are clustering individuals based on their income and height, the income dimension will have values in the thousands and be spread over a large range, whereas heights will have values in the hundreds (e.g. if measured in centimeters) and be constrained to a fairly small range (e.g. 50 cm to 200 cm). Distance measures, especially Euclidean and Manhattan distances, will fall afoul of such disparate dimensions because the one with greater values and greater variance will tend to dominate the distance computation.

As an example, look at the data in:


The points in this dataset were generated from the same centroids as those in the previous problem, but the variance in the x dimension was increased from 1.0 to 5.0 (see the R code in generate_clusters.R for how). Visually, the clusters look like this:

The x values have been stretched out, which leads to many points ending up closer to points from other clusters. So, k-means is pretty much guaranteed to do poorly if we use the exact values as given in the dataset.

A standard and simple way to scale the dimensions so that they are comparable is to use z-scores. All this means is that we compute the mean and standard deviation of all the values in a particular dimension; the transformed value for each point is then simply how many standard deviations it lies from the mean in that dimension. For the dataset depicted above, this will have the effect of shrinking the x-values (imagine squishing the graph by putting your hands on the left and right sides and compressing it). For more details, check out the Wikipedia page on standard scores (z-scores).
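As a worked example of the idea for a single dimension (the assignment's ZscoreTransformer applies this per dimension; this is just an illustration, not its code):

```scala
// Z-score transform for one dimension: subtract the mean, divide by the
// standard deviation, so values become "standard deviations from the mean".
def zscores(values: Seq[Double]): Seq[Double] = {
  val mean = values.sum / values.length
  val stddev = math.sqrt(values.map(v => (v - mean) * (v - mean)).sum / values.length)
  values.map(v => (v - mean) / stddev)
}
```

For instance, the values 2, 4, 4, 4, 5, 5, 7, 9 have mean 5 and standard deviation 2, so 2 maps to -1.5 and 9 maps to 2.0.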

Part (a)

Written answer: The z-score transform is implemented as ZscoreTransformer in Points.scala. Go have a look at it, and explain briefly how the apply method of the ZscoreTransformer class (not object) computes the sequence of z-scores based on the original values.

Note: the “-” and “/” operators are not the same as when you do 3-4 or 1.0/5.0, and you need to show you understand where they come from and how they are operating on multiple values.

Part (b)

Run the Kmeans clusterer on the dataset using Euclidean distance. You should get the following output.

$ scalabha run ata.hw4.RunKmeans clusters_bigger_x_variance.dat euclidean ident 3
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
6    0    44    |    50    [1]
23    3    4    |    30    [2]
6    14    0    |    20    [3]
35    17    48
[0]    [1]    [2]

As expected, the clusters have ended up poorly capturing the structure of this data.

Fix this by enabling the ZscoreTransformer in your RunKmeans main method. Previously, you totally ignored the transformer option (given in the variable transformerOption). You should now use it to retrieve the appropriate transformer object and apply it to the input points before running k-means.

Tip: See the documentation in Points.scala for guidance on how to do this. Be prepared to ask questions for help on this -- it’s easy, but I’m intentionally not spoon-feeding the answer to you, so that you have to put the pieces together yourself; that is what you’ll need to do with other APIs in the future. So, to repeat: it is fine to ask questions. Even though it is “easy”, I expect a number of you to start off confused by the degrees of freedom available to you.

After enabling z-score scaling, you’ll find that the clusters found by k-means are closer to the true clusters (though there are still some “errors” due to the fact that some points actually did “wander” into another centroid’s realm).

3. Schools.
Go to $SCALABHA_DIR/data/cluster/schools and look at achieve.dat. Its contents are the following:

BALDWIN            2.7         3.2        4.5        4.8
BARNARD            3.9         3.8        5.9        6.2
BEECHER            4.8         4.1        6.8        5.5
BRENNAN            3.1         3.5        4.3        4.6
CLINTON            3.4         3.7        5.1        5.6
<more schools>

Each line describes reading and math scores for the given school for students in the fourth and sixth grades. For example, 4th graders at Baldwin scored 2.7 for reading and 3.2 for math (compared to national average of 4.0) and 6th graders scored 4.5 for reading and 4.8 for math (compared to 6.0 average).

In this problem, you’ll do a simple transformation from this format into the format used in the previous problems such that you can use RunKmeans without modification. You’ll also add a bit of error analysis output.

Part (a)

In Features.scala, modify the main method of the SchoolsFeatures object so that it reads in achieve.dat and prints out data in the format used for the previous problems. It should separate each school into a 4th and 6th grade row, give the label “4” or “6” as appropriate, and then give the relevant reading and math scores for that school and grade level. Here’s an example of how running it should look:

$ scalabha run ata.hw4.SchoolsFeatures achieve.dat
BALDWIN_4th 4 2.7 3.2
BALDWIN_6th 6 4.5 4.8
BARNARD_4th 4 3.9 3.8
BARNARD_6th 6 5.9 6.2
<more lines>

Tip: Be aware that some schools have names with spaces in them -- and the format used for input to RunKmeans needs to be splittable on spaces. That means you need to convert names like LINCOLN BASSETT into LINCOLN_BASSETT.
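One way to handle both the reshaping and the multi-word names is to split the scores off from the right-hand side of each line. This is a hypothetical sketch, not the required implementation; it assumes each line ends with exactly four scores.

```scala
// Transform one achieve.dat line into the two RunKmeans-format rows:
// the last four fields are the scores, everything before them is the name.
def schoolRows(line: String): Seq[String] = {
  val fields = line.trim.split("\\s+")
  val (nameParts, scores) = fields.splitAt(fields.length - 4)
  val name = nameParts.mkString("_")   // "LINCOLN BASSETT" -> "LINCOLN_BASSETT"
  Seq(name + "_4th 4 " + scores(0) + " " + scores(1),
      name + "_6th 6 " + scores(2) + " " + scores(3))
}
```

Splitting from the right means a two-word school name simply ends up contributing two name parts, which are then rejoined with underscores.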

Part (b)

Now, direct the output of SchoolsFeatures to the file schools.dat and run k-means on it for k=2 and with no transformation of the points. You should see good separation of the clusters into the 4th and 6th grade clusters, with a few that fall outside of their “true” cluster (we’ll get back to those in a bit).

Written answer: Output the centroids that were computed by k-means, and include them in your answer. Based on these, were the schools in this sample performing above or below average in 4th and/or 6th grades? (Recall that the averages are 4.0 for 4th graders and 6.0 for 6th graders.)

Part (c)

The confusion matrix gives us a sense of the overall quality of the clusters, but it doesn’t show us which items fell into which confusion cell. Modify RunKmeans.outputClusterReport so that for each kind of assignment of a predicted cluster to a true cluster, it prints which items had that signature. Call it from RunKmeans.main so that it outputs this information before the confusion matrix is created and printed. Leaving out some details, your output should look something like the following (where I had previously saved the output of SchoolsFeatures to the file schools.dat):

$ scalabha run ata.hw4.RunKmeans schools.dat euclidean ident 2
[Label: 6] <=> [Cluster: 0]
Ids: BRENNAN_6th    CONTE_6th    <MORE>

[Label: 4] <=> [Cluster: 1]
Ids: EDGEWOOD_4th   <MORE>

[Label: 4] <=> [Cluster: 0]
Ids: BALDWIN_4th    BARNARD_4th    <MORE>

[Label: 6] <=> [Cluster: 1]
Ids: BALDWIN_6th    BARNARD_6th   <MORE>


Written answer: Provide the output in your answers, and briefly discuss the outliers.

Part (d)

Do the same, but with k=4 and also outputting the centroids.

Written answer: What is the structure of the data that was found? (Include your output.)

4. Birth and death rates of countries.
Go to $SCALABHA_DIR/data/cluster/countries and look at birth.dat. Its contents are the following:

ALGERIA        36.4   14.6
CONGO          37.3    8.0               
EGYPT          42.1   15.3               
GHANA          55.8   25.6         
<more countries>

Each line gives the name of a country, then its birth rate and death rate.

Part (a)

In Features.scala, modify the main method of the CountriesFeatures object so that it reads in birth.dat and prints out data in the format used for the previous problems. This is a trivial transformation, but the one missing thing here is the labels, since in fact there are none. For each line, give it the “label” 1. That means the interest in this problem will be in looking at the contents of the clusters found and exploring them for interesting patterns, rather than trying to match predefined clusters (which is in fact generally what we do with clustering, as opposed to classification).

Here’s what running CountriesFeatures should look like.

$ scalabha run ata.hw4.CountriesFeatures birth.dat
ALGERIA 1 36.4 14.6
CONGO 1 37.3 8.0
EGYPT 1 42.1 15.3
GHANA 1 55.8 25.6
IVORY_COAST 1 56.1 33.1
<more countries>

Note that you’ll need to deal with spaces in names, as with the schools data.

Part (b)

Written answer: Do some open-ended exploration of clustering for the dataset, using different distance metrics, considering scaling, and different values for k (consider values up to 10, at least). In one to three paragraphs, describe any interesting things you find.

5. Federalist papers, basic setup and features.
Now we finally get to authorship analysis of the Federalist papers. For this, you will use your script (or my solution, if you prefer) from HW3 to extract the articles and their contents, compute features from their contents for clustering, and then use those features for clustering.

Part (a)

Modify FederalistFeatures in Features.scala so that it can parse the articles in federalist.txt. Do this by modifying the solution to HW3 such that it implements the extractArticles method and then calling that method from the main method.

To check your solution, output the id, author and first 40 characters of each article. Your output should look something like the following:

$ scalabha run ata.hw4.FederalistFeatures federalist.txt
1: HAMILTON
AFTER an unequivocal experience of the i

2: JAY
WHEN the people of America reflect that

3: JAY
IT IS not a new observation that the peo
<more articles>

To be clear, this problem requires nothing more than plumbing. That is, you should be able to take your previous script and make it work in this context in the way specified. This isn’t hard, but it will reinforce how main methods and support methods work using code you already wrote.

Once you are done, make sure to turn off this printing, e.g. by commenting it out in the code.

Part (b)

Given the list of articles returned by extractArticles, remove the extra copy of article #70. This is simple, but it is important to get rid of that extra copy for the clustering experiments you will do, and it is useful to get used to slicing and dicing sequences.

Suggestion: Assign the value returned by extractArticles to a “val” variable called something like allArticles, and then filter that to get a “val” variable called articles that you will use for the remainder of this problem.

Tip: Use the take and slice methods of IndexedSeq for this.
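Following that tip, dropping a single element from an IndexedSeq can be done by stitching together the parts before and after it. The index i here is hypothetical -- you need to work out where the duplicate #70 sits in your sequence of articles.

```scala
// Remove the element at index i by combining take and slice.
def removeAt[A](xs: IndexedSeq[A], i: Int): IndexedSeq[A] =
  xs.take(i) ++ xs.slice(i + 1, xs.length)
```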

Part (c)

Implement the extractSimplePoints method so that, for each article, it computes the count of “the”, “people”, and “which” and creates a Point from these three values.  Use this method to compute the points from the text of the articles obtained from Part (b), and then print them out. Your output should look something like this:

$ scalabha run ata.hw4.FederalistFeatures federalist.txt
<many more>

The first line indicates that article #1 has 130 tokens of “the”, 5 tokens of “people”, and 18 tokens of “which”, and so on for the remaining articles. (These are counts ignoring case, so “The” counts as “the”.)

Tip: To get the same output as I have given above (and to simplify your job significantly) use the SimpleTokenizer object in opennlp.scalabha.util.StringUtils. It takes a String and returns an IndexedSeq[String], with one token per element of the sequence.
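The counting itself amounts to lowercasing the tokens, grouping them, and looking up the three target words. In this sketch the tokenizer is a crude whitespace stand-in -- use SimpleTokenizer in your actual solution, since it handles punctuation properly.

```scala
// Crude whitespace tokenizer, standing in for SimpleTokenizer.
def crudeTokenize(text: String): IndexedSeq[String] =
  text.split("\\s+").toIndexedSeq

// Count each target word in the text, case-insensitively
// (so "The" counts as "the", as the expected output requires).
def wordCounts(text: String, targets: Seq[String]): Seq[Double] = {
  val counts = crudeTokenize(text).map(_.toLowerCase)
    .groupBy(identity).map { case (w, ws) => (w, ws.length) }
  targets.map(w => counts.getOrElse(w, 0).toDouble)
}
```

The resulting three Doubles are exactly what you would feed to the Point for that article.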

Note: There is a featureType option in the FederalistFeatures main method -- you can safely ignore that until the next problem.

Once you have the right values being computed for counts of “the”, “people”, and “which”, comment out the printing of the points, and instead output the format needed as input to RunKmeans. The output should look like the following:

$ scalabha run ata.hw4.FederalistFeatures federalist.txt
1 HAMILTON 130.0 5.0 18.0
2 JAY 105.0 21.0 11.0
3 JAY 91.0 7.0 11.0
4 JAY 84.0 7.0 10.0
<many more>

Part (d)

You can now run k-means on the points in the three dimensions extracted in Part (c). Redirect the output of FederalistFeatures to a file, e.g. simple_fedfeats.dat, and then provide it as input to RunKmeans. For k=4, euclidean distance and no transformation (the identity transformer), you should get the following output.

$ scalabha run ata.hw4.RunKmeans simple_fedfeats.dat euclidean ident 4
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
30    3    9    9    |    51    [HAMILTON]
8    0    2    1     |    11    [HAMILTON_OR_MADISON]
1    0    4    0     |    5     [JAY]
4    1    0    10    |    15    [MADISON]
1    0    1    1     |    3     [HAMILTON_AND_MADISON]
44    4    16    21
[0]    [1]    [2]    [3]

Written answer: Now run it with the z-score transformation and include the confusion matrix output in your answers.  Say in 3-5 sentences whether you think the results better and why. (There is no simple answer here, just do your best to eyeball it. You could also look at the detailed output from outputClusterReport if you wish, though this isn’t required.)

6. Federalist papers, authorship attribution with better features
The results from the three paltry dimensions used in problem 5 are certainly unimpressive. Let’s improve that. For this problem, you will modify the extractPoints method to do some more serious feature extraction.

Part (a)

First, let’s take care of ensuring that we can choose the “simple” features of problem 5 or the “full” features you will implement now. Based on the value of the featureType variable, you should create points from either extractSimplePoints or extractPoints as appropriate.

Tip: Use the fact that conditionals in Scala are expressions that return values, so you can do it like this:

val points = if (...) extractSimplePoints(...) else extractPoints(...)

You should now be able to get the feature file that you got for problem 5 by adding “simple” to the command line.

$ scalabha run ata.hw4.FederalistFeatures federalist.txt simple

Without that, you’ll get no output (since extractPoints isn’t implemented yet and the default is to use “full” features).

Part (b)

Now implement the full features. As you develop them, you can output your features to a file and then use RunKmeans to see what the clustering performance is. Based on that, you can adjust your features and try to improve things. (Note: this would be terrible experimental methodology if you were trying to make a scientific point about the authorship of the Federalist papers, but it gives you a chance to see how different features affect performance in this homework sandbox.)

This is very open-ended, but here are a few suggestions.
  • Consider using relative frequencies rather than raw counts.
  • Consider working with document frequencies in addition to word frequencies. This just means counting how many documents a word occurred in. Words like “the” tend to have high values, whereas others, like “Macedon” have a low count. A common strategy is to compute the tf-idf (for term frequency - inverse document frequency) score for a word in a given document as the value for that word/feature for that document.
  • Consider using bigrams (pairs of words) as well as unigrams (single words).
  • Consider using other features like capitalization patterns, average sentence length, average word length, type/token ratios, and so on.
  • You’ll not want to include all words and their values (be they counts, relative frequencies, or something else) since too many dimensions will basically overwhelm the k-means algorithm (and it will also probably run very slowly). So, you need to either add them by hand (tedious, but possible) or compute the values for all words for all articles and then filter them down to some reduced set. For example, you can use some minimum number of counts, or some threshold on a value computed from counts and document frequencies. (Feel free to get more sophisticated than that if you like.)
I encourage you all to discuss ideas for features on the class message board (without showing code, but sharing ideas and even equations for computing them).
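To make the tf-idf suggestion above concrete, here is one common variant, sketched under assumptions the assignment does not fix: tf is a word's relative frequency within a document, and idf is log(N divided by the number of documents containing the word). The exact formula is up to you.

```scala
// Compute tf-idf scores per document over pre-tokenized documents.
def tfidf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
  val numDocs = docs.length.toDouble
  // Document frequency: in how many documents does each word occur?
  val docFreq = docs.flatMap(_.distinct)
    .groupBy(identity).map { case (w, ds) => (w, ds.length.toDouble) }
  docs.map { doc =>
    val counts = doc.groupBy(identity).map { case (w, ts) => (w, ts.length.toDouble) }
    counts.map { case (w, c) => (w, (c / doc.length) * math.log(numDocs / docFreq(w))) }
  }
}
```

Note that a word occurring in every document (like “the”) gets idf log(1) = 0, which is precisely the behavior that makes tf-idf useful for filtering dimensions.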

Once you start getting a lot of dimensions, k-means can get overwhelmed and do a bad job, especially if the features are highly correlated with one another. This is very much expected for the features you are extracting here, and you should combat this by using the PcaTransformer (invoked with the “pca” option for the transformer). It scales the data, and then applies principal components analysis to it so that you can get the points in the transformed PCA space. It also automatically reduces the dimensionality to just the principal components that account for 95% of the variance in the original points. This might sound tricky, but it is actually handled for you automatically if you just request a PcaTransformer on your input points. Aren’t API’s nice?

Note: You’ll very likely need to turn the values for each of the articles into a matrix so that for each dimension, each article has a value (possibly zero). You’ll probably have a Map from features (e.g. words) to values (Doubles) for each article, so you are converting each of these Maps into a Point, ensuring that the values for one point are consistent with other points in terms of lining up with corresponding dimensions (e.g., if the count of “the” is the first dimension, it must be the first element for the Points for every article). This is another “plumbing” problem, but it is important to be able to work with such transformations. If you have trouble doing this, ask for help!
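One hedged sketch of that Map-to-matrix plumbing: fix a single ordering of all feature names across all articles, then look each one up per article with a default of 0.0 for dimensions an article lacks. The function name and shape here are illustrative, not prescribed.

```scala
// Align per-article feature Maps into vectors with consistent dimensions.
// Returns the dimension names alongside the vectors so you can tell which
// coordinate corresponds to which feature.
def toVectors(articles: Seq[Map[String, Double]]): (IndexedSeq[String], Seq[IndexedSeq[Double]]) = {
  val dims = articles.flatMap(_.keys).distinct.sorted.toIndexedSeq
  val vectors = articles.map(a => dims.map(d => a.getOrElse(d, 0.0)))
  (dims, vectors)
}
```

Each resulting vector is then ready to be wrapped as a Point, with dimension i meaning the same feature for every article.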

Part (c)

Written answer: Discuss your features, and how you arrived at them. How successful were you in recovering the clusters? How important is PCA for the features you used? Describe any major challenges you encountered in producing features that led to good performance.

Note that you can achieve nearly perfect results. My own full feature extraction leads to the following confusion matrix:

Confusion matrix.
Columns give predicted counts. Rows give gold counts.
0    0    0    51   |    51    [HAMILTON]
11   0    0    0    |    11    [HAMILTON_OR_MADISON]
0    5    0    0    |    5     [JAY]
14   0    1    0    |    15    [MADISON]
0    0    3    0    |    3     [HAMILTON_AND_MADISON]
25    5    4    51
[0]    [1]    [2]    [3]

So, just one article goes amiss, and it’s sensible because it is a Madison article that was grouped with the articles written by both Hamilton and Madison (suggesting, perhaps, that Madison was the “main” author of those joint works).

My solution is actually pretty simple: the values are relative frequencies of unigrams, and they are selected by choosing all word types that have a value greater than or equal to 6.0 when you divide their corpus frequency by their document frequency. It will be interesting if someone can get similarly almost perfect separation of the clusters using features like average sentence and word lengths and such.

7. Feedback
(a) How long did this homework take you?

(b) Do you have any suggestions?

(c) Do you have any complaints?

8. Extra credit
Much of this assignment is designed to help newer programmers get used to using an API. For those who already have significant programming experience, those parts should go quite smoothly and shouldn’t take much time. However, the homework provides plenty of launching points for going further. Here are some suggestions:

1. Other datasets. Collect examples from another authorship collection and run experiments. This could involve finding a standard dataset, or creating one for yourself.

2. Cluster cardinality. Determine k automatically. You can look at things like the Akaike Information Criterion (AIC); see section 16.4.1 of Manning, Raghavan and Schütze for a discussion. However, AIC has an unfortunate parameter q that takes you from tuning k to tuning q. A more interesting alternative is the gap statistic, which Edwin Chen discusses in his blog post Counting Clusters. You’ll implement this as a wrapper around Kmeans, making use of the dispersion values it returns. Run it on the datasets in this homework and see how many clusters are predicted as being necessary for each dataset. Of particular interest is whether the Federalist papers show four distinct authorship patterns (each author on his own, plus the three by Madison and Hamilton together).

3. Translation from points in the transformed space. Enable translation from points in a transformed space back into the original space (e.g. for plotting the centroids computed in the transformed space back to the original space). This is easy for z-score, but will be a bit trickier for PCA.

4. K-means implementation. Implement k-means by yourself. Compare and contrast with my implementation in Scalabha. Consider doing some speed comparisons.

5. Hierarchical agglomerative clustering. Implement HAC. See Manning, Raghavan and Schütze for details. Run it on some of the datasets, especially the Federalist papers, and see whether interesting structure arises.

6. Scaling up. Use Scrunch, Scoobi, Scalding, or Spark to turn the Kmeans implementation into a distributed one. Find a large dataset (perhaps by looking at materials from Mahout) and make sure you can scale up. I can give you access to the Hadoop cluster on TACC if you want to do this.

I’m happy to provide feedback and suggestions for any of these, and you are welcome to work on something else that you are interested in and that is relevant.

For any of  these, please provide a short write-up of what you did, and provide the code (and data) you used separately from the code for the main assignment, with instructions for how to compile and run it.

Copyright 2012 Jason Baldridge

The text of this homework is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to and to this original homework.

Please email Jason at with suggestions, improvements, extensions and bug fixes.

Jason Baldridge,
Mar 6, 2012, 6:57 PM