Instructor: Jason Baldridge  Time: Wed 11-2pm   Location: Parlin 10

Advances in computational linguistics, machine learning, and computer hardware over the last two decades have produced a powerful set of tools and capabilities for the automatic processing of natural language texts. We now have access to a massive quantity of free-form natural language text on the Internet and in large text collections (including out-of-copyright books). Increasing quantities of text in a wide variety of languages are being produced every day through news, blogs, and social media. Consequently, the ability to process natural language in order to categorize and cluster texts, find and visualize patterns in them, or even just find them in the first place has become increasingly important. A wide variety of disciplines, from linguistics to psychology to archeology to literature and beyond, are coming to rely on natural language processing tools that enable them to ask new questions of the corpora that interest them. This is particularly evident in the ascendancy of digital humanities, where researchers often want to identify interesting patterns in corpora that are too large to inspect manually. There is also a great deal of commercial interest in systems that can process unstructured textual data to extract, categorize, and present the information it contains and, in some cases, use that information to predict things about the real world, such as the movement of stock prices.

This class will provide a practical introduction to many of the core algorithms in natural language processing and machine learning that are useful in a wide variety of text analysis applications, such as authorship attribution, sentiment analysis, information extraction, and geolocation. We will cover algorithms for clustering, classification, part-of-speech tagging, topic modeling, and named entity recognition, as well as methodologies for evaluating their success and methods for visualizing their outputs. The course will include an introduction to the programming language Scala, which will be used for homework assignments. Assignments will provide hands-on experience with these methods and with popular open-source toolkits such as Apache OpenNLP and Mallet. No prior programming experience is assumed.