Natural Language Processing (11-411)

Instructor: Noah Smith (Assistant Professor in the LTI)
Time: Spring 2008, Tuesdays and Thursdays 3-4:20pm
Place: Wean 5310
Web page: makeitunderstand.org
History: Spring 2008
Prerequisites: Fundamental Data Structures and Algorithms (15-211) and strong programming capabilities
Textbook: Speech and Language Processing (second edition, 2007, Prentice-Hall), by Daniel Jurafsky and James Martin

DAVID BOWMAN: Open the pod bay doors, HAL.
HAL 9000: I'm sorry, Dave, I'm afraid I can't do that.
Stanley Kubrick and Arthur C. Clarke, screenplay of 2001: A Space Odyssey

Course Description

This course is about a variety of ways to represent human languages (like English and Chinese) as computational systems, and how to exploit those representations to write programs that do neat stuff with text and speech data.

This field is called Natural Language Processing or Computational Linguistics, and it is extremely multidisciplinary. This course will therefore include some ideas central to Machine Learning and to Linguistics.

We'll cover computational treatments of words, sounds, sentences, meanings, and conversations. We'll see how probabilities and real-world text data can help. We'll see how different levels interact in state-of-the-art approaches to applications like translation and information extraction.

From a software engineering perspective, there will be an emphasis on rapid prototyping, a useful skill in many other areas of Computer Science. In particular, we will introduce some high-level languages (e.g., regular expressions and Dyna) and some scripting languages (e.g., Python and Perl) that can greatly simplify prototype implementation.
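As a rough illustration of the kind of rapid prototyping meant here, the following is a minimal sketch in Python that uses a regular expression to split text into words and count them. The token pattern and the names `TOKEN` and `tokenize` are assumptions made for this example, not course materials, and the pattern only handles plain English letters and apostrophes.

```python
import re
from collections import Counter

# Sample text: the quotation at the top of this page.
text = ("Open the pod bay doors, HAL. "
        "I'm sorry, Dave, I'm afraid I can't do that.")

# Assumed token pattern: a run of letters, optionally followed by
# an apostrophe and more letters (so "can't" stays one token).
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

def tokenize(s):
    """Return the lowercased word tokens found in s."""
    return [t.lower() for t in TOKEN.findall(s)]

tokens = tokenize(text)
counts = Counter(tokens)

# Print the three most frequent tokens with their counts.
print(counts.most_common(3))
```

A few lines like these are often enough to get a first look at a text collection before committing to a more careful design.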

Competitive Project

A major component will be the project: build a program that takes a web page P as input and outputs a set of questions about the content of P (questions that a human could answer if she read P), and that can also, if given a question Q about the content of P, answer Q intelligently. Projects will be pitted against each other in a competition at the end of the course.
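To make the answering half of the task concrete, here is a deliberately naive baseline sketch in Python: it returns the sentence of P that shares the most words with Q. This is not the expected solution or a method prescribed by the course. The function names are hypothetical, and the sketch assumes P has already been reduced to plain text.

```python
import re

def sentences(text):
    """Split plain text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(s):
    """Return the set of lowercased word tokens in s."""
    return set(re.findall(r"[a-z0-9']+", s.lower()))

def answer(page_text, question):
    """Return the sentence of page_text with the largest word overlap
    with the question (first such sentence on ties), or "" if the
    page has no sentences."""
    q = words(question)
    return max(sentences(page_text),
               key=lambda s: len(words(s) & q),
               default="")

# Hypothetical page text, for illustration only.
page = ("Wean Hall is on the CMU campus. "
        "The course meets on Tuesdays and Thursdays.")
print(answer(page, "When does the course meet?"))
```

A baseline like this ignores word order, synonyms, and morphology ("meet" does not match "meets"), which is exactly where the techniques covered in the course come in.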

Evaluation

Students will be evaluated by exams (midterm and final, totaling 30%), regular short quizzes (20%), weekly pencil-and-paper or small programming homework problems (10%), and the group project (40%).

FAQ

How is this course different from 11-682?

HLT was meant to give you a broad overview of language technologies, including translation, information retrieval, and speech recognition. This course is really focused on core techniques for natural language processing. We'll go a lot deeper into modern approaches to solving NLP problems, from disambiguating words to accurately finding parse trees, as well as using statistics to "build software from data" and the connections between NLP and linguistics. The course is both deeper and more applied: it will give you lots of hands-on experience, both through the project and through some lectures on useful tools.

Should I take this course?

Yes, if:

Related courses elsewhere (not exhaustive!)

University of California, Berkeley, Brown University, University of Colorado, Columbia University, Cornell University, Harvard University, University of Illinois at Urbana-Champaign, Johns Hopkins University, University of Maryland, New York University, University of Pennsylvania, Stanford University, University of Utah, University of Wisconsin-Madison