
Regular expressions


Download the Federalist Papers and the Entire Works of Mark Twain from Project Gutenberg.

Tip: if you have it installed on your computer, use wget to do this without clicking on web links.

$ wget http://www.gutenberg.org/cache/epub/3200/pg3200.txt
$ wget http://www.gutenberg.org/cache/epub/18/pg18.txt


1. Write a Scala program that takes the first command line argument as a file to process, converts it into a string, and then prints out the text.
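
Tip: one way to read a whole file into a string is scala.io.Source. The sketch below is a minimal starting point, not the only approach; the object name PrintFile, the helper readFile, and the UTF-8 encoding are assumptions made for illustration.

```scala
import scala.io.Source

object PrintFile {
  // Read the entire file at `path` into one String.
  // Assumption: the file is UTF-8 encoded.
  def readFile(path: String): String = {
    val source = Source.fromFile(path, "UTF-8")
    try source.mkString
    finally source.close() // always release the file handle
  }

  def main(args: Array[String]): Unit = {
    // The first command line argument is the file to process.
    val text = readFile(args(0))
    println(text)
  }
}
```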

2. Notice that there is material about Project Gutenberg at the beginning and end of each file, and that consistent strings indicate where the actual text of a book starts and ends. Write a regular expression 'JustTextRE' that matches these delimiters and captures the text between them (as a regex group). Apply this regular expression to the file contents and save the captured text to a variable 'text'. Print out the text and verify that it is correct for both files.

Tip: you'll need to preface your regular expression with "(?s)" so that the "any" character '.' matches newlines.
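
To see what "(?s)" changes, here is a small sketch on a toy string. The markers <<START>> and <<END>> and the names DotAllSketch, BetweenRE, NoDotAllRE, and extract are made up for illustration; they are not the Project Gutenberg delimiters, which you still need to find yourself.

```scala
import scala.util.matching.Regex

object DotAllSketch {
  // Hypothetical markers, chosen only for this toy example.
  // With (?s), '.' also matches newlines, so the group can span lines.
  val BetweenRE: Regex = """(?s).*<<START>>(.*)<<END>>.*""".r

  // Same pattern without (?s): '.' stops at line breaks.
  val NoDotAllRE: Regex = """.*<<START>>(.*)<<END>>.*""".r

  // A Regex used as a pattern must match the whole string;
  // the capture group is bound to `body`.
  def extract(re: Regex, s: String): Option[String] = s match {
    case re(body) => Some(body)
    case _        => None
  }

  def main(args: Array[String]): Unit = {
    val sample = "header\n<<START>>\nline one\nline two\n<<END>>\nfooter"
    println(extract(BetweenRE, sample))  // the two lines between the markers
    println(extract(NoDotAllRE, sample)) // no match on multi-line input
  }
}
```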

3. Write a regular expression 'AllCapsRE' that identifies any sequence of words that are in all uppercase. Match this regex to the text and print out each match.
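
Tip: to print each match of a regex, you can iterate over the result of findAllIn. The sketch below shows only the mechanics, using a deliberately different pattern (runs of digits); the names FindAllSketch, NumberRE, and numbers are illustrative, and writing 'AllCapsRE' is left to you.

```scala
object FindAllSketch {
  // Toy pattern: one or more digits. Not the pattern the exercise asks for.
  val NumberRE = """\d+""".r

  // findAllIn returns an iterator over every non-overlapping match.
  def numbers(s: String): List[String] = NumberRE.findAllIn(s).toList

  def main(args: Array[String]): Unit = {
    // Print one match per line, which is convenient for piping to Unix tools.
    numbers("Chapter 12, page 340, line 7").foreach(println)
  }
}
```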

4. Pipe the output of the previous exercise to the Unix utilities we looked at in class so that you obtain a list of all these sequences with their counts, sorted by count in descending order. See Unix for Poets for assistance.

5. Write a regular expression that identifies names with titles in the text, like "Mr. John Brown" and "Mrs. Smith". You should handle Dr., Mr., Mrs., and Ms. Apply it to the text and print out each match.

6. Take the output of the previous exercise for both the Federalist Papers and Twain, create the reverse-sorted list of names for each, and save each list to its own file (using '>'). Compare the top names in each by using the 'paste' command to see the sorted lists side by side.

7. Extend the code from exercise 5 so that you calculate how many of each kind of title there are, e.g. Dr., Mr., Mrs., and Ms. Use a Map for this.
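
Tip: counting with a Map can be sketched as a fold over the items to count. The example below assumes the titles have already been extracted into a list (for instance from a capture group in your regex); the names CountSketch and countTitles are illustrative only.

```scala
object CountSketch {
  // Build an immutable Map from each title to the number of times it occurs.
  def countTitles(titles: Seq[String]): Map[String, Int] =
    titles.foldLeft(Map.empty[String, Int]) { (counts, title) =>
      // getOrElse supplies 0 the first time a title is seen.
      counts + (title -> (counts.getOrElse(title, 0) + 1))
    }

  def main(args: Array[String]): Unit = {
    // Hand-written sample input, standing in for titles pulled out of the text.
    val sample = List("Mr.", "Mrs.", "Mr.", "Dr.", "Mr.")
    countTitles(sample).foreach { case (title, n) => println(s"$title\t$n") }
  }
}
```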

8. Create a regular expression and use it to convert the Federalist Papers into a list that has the articles as the elements. The articles should be represented as Maps that encode values such as 'author', 'title', 'venue', 'date' and 'text'. Note that there is slight variability in how the articles are formatted, so you'll need to handle some special cases.
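
Tip: the general shape of "one regex match per record, one Map per match" can be sketched on a toy format. Everything below (the "ENTRY n" / "By:" layout, the keys 'number', 'author', 'text', and the names RecordsSketch, RecordRE, parse) is invented for illustration and is much more regular than the Federalist Papers, so expect to adapt it and add special cases.

```scala
object RecordsSketch {
  // Toy record layout (an assumption, not the Federalist format):
  //   ENTRY <number>
  //   By: <author>
  //   <body, possibly several lines>
  // (.*?) is lazy, and the lookahead stops it at the next record or at end of input.
  val RecordRE = """(?s)ENTRY (\d+)\nBy: (\w+)\n(.*?)(?=ENTRY \d+|\z)""".r

  // Turn each match into a Map keyed by field name.
  def parse(s: String): List[Map[String, String]] =
    RecordRE.findAllMatchIn(s).map { m =>
      Map(
        "number" -> m.group(1),
        "author" -> m.group(2),
        "text"   -> m.group(3).trim
      )
    }.toList

  def main(args: Array[String]): Unit = {
    val sample = "ENTRY 1\nBy: Alice\nFirst body.\nENTRY 2\nBy: Bob\nSecond\nbody.\n"
    parse(sample).foreach(println)
  }
}
```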