Assignments‎ > ‎

Scala programming, Part 2

DUE: February 15, 2012 by 10am (submission on Blackboard)

Your starting point is a set of stub programs that you can download from the course website in the file hw2-stubs.zip. These are incomplete implementations that handle a few things for you (such as code to test your code), and have comments embedded in them to help you solve the problems.

Your submission will be a zipped up file with your implementations for each problem in it. Submit it on Blackboard as <lastname>_<firstname>_hw2.zip. Please follow the same instructions as for HW1 with respect how to package up the solutions before submission.

Note: there may be some new concepts in this homework that have not been covered in the tutorials or in class. I have provided hints and guidelines for you to be able to use them, but you should not feel shy to ask for help. A big part of programming is trying to figure out some new things based on what you already know, so give it a try, but at the end of the day, I’m here to help smooth things out for you.

If any of the instructions don’t make sense to you, please get in touch with the instructor right away.


1. List transformations and computations.

For this problem, you need to process a string from the command line, and do some simple transformations using Lists. Modify the stub file lengthSort.scala.

(a) Split the string from the command line (which you get as args(0)) and sort it. The sorting order is first according to length of each token and then alphabetically by the token (so “to” is before “and”, which is before “the”). Print out the list in this sorted order as a single line, with each token converted to the format word:length, e.g. the:3. (See example output below.)

Tip: The split method on Strings returns an Array, and you may want to work with a List instead. Just use toList on the Array to convert it to a List, e.g. “a:b:c:d”.split(“:”).toList.

(b) Working from the sorted list you obtained in (a), now output the list using a similar format, but remix it so that you have:
  • the first group of elements are the tokens of length 3 or less, in ascending order
  • the next group of elements are the tokens of length greater than 3 and less than 11, in descending order
  • the last group of elements are the tokens of length 11 or more, in ascending order
See the example output for guidance on formatting, which is similar to (a).

Example output.

$ scala lengthSort.scala "Introduction to Computational Linguistics introduces the most important data structures and algorithmic techniques underlying computational linguistics"

Sorted: to:2 and:3 the:3 data:4 most:4 important:9 introduces:10 structures:10 techniques:10 underlying:10 Linguistics:11 algorithmic:11 linguistics:11 Introduction:12 Computational:13 computational:13

Remixed: to:2 and:3 the:3 underlying:10 techniques:10 structures:10 introduces:10 important:9 most:4 data:4 Linguistics:11 algorithmic:11 linguistics:11 Introduction:12 Computational:13 computational:13

Hint: You’ll want to use pairs (Tuple2) for a number of reasons. One of the biggest advantages is that they sort first on the first element and then on the second element.

Note: You must make sure to enclose the argument to lengthSort.scala in quotes so that the entire argument is available in the first element of the args array.


2. Text manipulation
Here is a text passage from H.G.Wells, The War of the Worlds (from Project Gutenberg, http://www.gutenberg.org/etext/36):

After the glimpse I had had of the Martians emerging from the cylinder in which they had come to the earth from their planet, a kind of fascination paralysed my actions. I remained standing knee-deep in the heather, staring at the mound that hid them. I was a battleground of fear and curiosity.


I did not dare to go back towards the pit, but I felt a passionate longing to peer into it. I began walking, therefore, in a big curve, seeking some point of vantage and continually looking at the sand heaps that hid these new-comers to our earth. Once a leash of thin black whips, like the arms of an octopus, flashed across the sunset and was immediately withdrawn, and afterwards a thin rod rose up, joint by joint, bearing at its apex a circular disk that spun with a wobbling motion. What could be going on there?


For this problem, you will perform series of manipulations and operations on this passage. Modify the stub program wotw.scala, looking at the comments in it for guidance. See the output at the end of this question for guidance on what your program’s output should be.

Note: even though this problem is a bit lengthy in its description, the Scala solutions themselves can be quite short.

(a) In the passage, how many word tokens are there that end with the letters “ing”? (Note that we are looking for words ending in “ing”, not just occurrences of the letter sequence “ing” anywhere in a word. You do not need to deal with punctuation. Just count only the words that actually end in “ing”, ignoring those that end in “ing,” or “ing.”.) Print it out as specified in the output below.

Hint: Use the String method split to split a string of text into individual words (and split on sequences of one or more whitespace characters), the string method endsWith to test whether a string ends with a particular substring, and the List method count to find how many words pass that test.

Note: The result of split is an Array, not a List. For present purposes, this will behave similarly to a List, so you don’t need to change anything. However, if you do want a list, call the method toList, e.g. “a:b:c:d”.split(“:”).toList.

(b) Make a list chopped that contains all the words from the above passage, but with the last two letters cut off. Print out a string that concatenates the chopped tokens using a space as the separator, as below. (This is a very crude first stab at making a stemmer, a program that cuts off suffixes and leaves only word stems. We will see improved implementations later.)

Hint: Do this using map on the result of splitting the string from part (a). Also, recall the slice method, which excises a sublist from a List, e.g.:

scala> val foo = List(3,4,0,32,3,9,13)
foo: List[Int] = List(3, 4, 0, 32, 3, 9, 13)

scala> foo.slice(2,5)
res21: List[Int] = List(0, 32, 3)

Strings are also a List-like data structure (they are sequences of characters) and the slice method works for them too.

Note: It’s fine to work with either List[String] or Array[String].

(c ) A common item of interest when we get to language models is the bigram word probability of one word given that another precedes it, such as the probability of seeing the word “the” next given that you’ve just seen the word “at” in a sentence like “I’ll see you at … “ This value is expressed as P(the|at). At its most basic level, this is a simple thing to compute. Given a corpus of texts, the probability is calculated as:

P(the|at) = count(at,the) / count(at)

In words: out of all the times I’ve seen the word “at”, how many times was it followed by the word “the”? Write the code to compute this from the WotW passage, and print out the result, as below.

Hint: There is a very useful function on lists called sliding, which gives you a sliding window of the values in the list. This is easy to see in an example:

scala> "a b c d".split(" ").sliding(2).toList
res0: List[Array[java.lang.String]] = List(Array(a, b), Array(b, c), Array(c, d))

The argument to sliding is an Int that specifies how many items will be in each window. With such a list of two-element Arrays in hand, you can now use the count function with an appropriately defined predication to compute the numerator of the probability.

Goal output: (with ### substituted for actual numerical values)

$ scala wotw.scala

Part (a)
Number of words ending in 'ing': ###

Part (b)
Aft t glimp  h h  t Martia emergi fr t cylind  whi th h co  t ear fr the plane  ki  fascinati paralys  action  remain standi knee-de  t heathe stari  t mou th h the  w  battlegrou  fe a curiosit  d n da   ba towar t pi b  fe  passiona longi  pe in i  beg walkin therefor   b curv seeki so poi  vanta a continual looki  t sa hea th h the new-come  o eart On  lea  th bla whip li t ar   octopu flash acro t suns a w immediate withdraw a afterwar  th r ro u joi  join beari  i ap  circul di th sp wi  wobbli motio Wh cou  goi  ther

Part (c)
P(the|at) = ###


3. Map data structures (a.k.a. associative arrays)

For this problem, you will write a program that takes a list of names and room numbers on the command line, adds them to entries from a Map relating people to their rooms, and then creates a reverse Map that relates rooms to the people who are in them.

Look at the stub program rooms.scala. A Map named defaultRoomNumbers has already been defined with a few entries. As in previous problems, there are some print statements like print ’’Part (a)’’ – make sure not to delete those.

Note: you must use immutable Maps for this problem.

(a) The arguments on the command line are to be expected as a list of (person number) pairs, like Bill 315 Angela 412 Susan 325. Get this information from the command line (making sure to check that an even number of arguments has been provided), and use it to add create a new Map that contains both these person to room pairings and those in the defaultRoomNumbers Map. Then print out who is in which room, alphabetically by name.

Hint: use the mod function % to find out whether there are an even number of arguments. (Try 5 % 2 and 4 % 2 in the Scala REPL and you’ll see what it does.)

(b) Some people might share the same room. Create a new map that maps rooms to the people in them, so it will have room numbers as keys, and lists of people as values.

Here’s an example of how your program should work on some example input:

$ scala rooms.scala Bill 315 Angela 412 Susan 325

Part (a)
Angela: Room 412
Bill: Room 315
Jane: Room 312
Sam: Room 312
Susan: Room 325
Ted: Room 325

Part (b)
Room 312: Sam,Jane
Room 315: Bill
Room 325: Susan,Ted
Room 412: Angela


4. Word-by-word translation
For this problem you’ll create a very simple machine translator that looks at each word in a source language sentence and maps it to a unique word in a target language. This will involve working with the file system and with Maps. Use the stub file translate.scala for this.

You are given a (very) small translation dictionary in the file por2eng.txt. You must read this in from the command line (as the first argument) to create a Map that maps Portuguese words to the English words, as specified in por2eng.txt. Then, using that Map, you must translate any remaining arguments on the command line (which you can expect to be Portuguese sentences) and translate them one by one to English.

As with the previous problem, you should use only immutable data structures.

Here is an example call to translate.scala and the output it should produce for the given input.

$ scala translate.scala por2eng.txt "a menina ouviu o cachorro ontem" "o cachorro viu o gato hoje"
Portuguese: a menina ouviu o cachorro ontem
English:    the girl heard the dog yesterday

Portuguese: o cachorro viu o gato hoje
English:    the dog saw the cat today
ċ
hw2-stubs.zip
(2k)
Jason Baldridge,
Jan 30, 2012, 7:30 PM
Comments