by
Deniz Yuret
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 15, 1998
© Massachusetts Institute of Technology 1998. All rights reserved.
Within the framework of lexical attraction, I developed an unsupervised language acquisition program that learns to identify linguistic relations in a given sentence. The only explicitly represented linguistic knowledge in the program is lexical attraction. There is no initial grammar or lexicon built in, and the only input is raw text. Learning and processing are interdigitated: the processor uses the regularities detected by the learner to impose structure on the input, and this structure enables the learner to detect higher-level regularities. Using this bootstrapping procedure, the program was trained on 100 million words of Associated Press material and achieved 60% precision and 50% recall in finding relations between content words. Using knowledge of lexical attraction, the program can identify the correct relations in syntactically ambiguous sentences such as "I saw the Statue of Liberty flying over New York."
Certified by Patrick H. Winston, Ford Professor of Artificial Intelligence and Computer Science, Thesis Supervisor.
Accepted by Arthur C. Smith, Chairman, Department Committee on Graduate Students.
Reader: Marvin Minsky, Toshiba Professor of Media Arts and Sciences and Professor of Electrical Engineering and Computer Science.
Reader: Boris Katz, Principal Research Scientist.
Acknowledgments
I am grateful to Carl de Marcken for his valuable insights, Alkan Kabakçioglu for sharing my interest in math, and Ayla Ogus for her endless patience. I am thankful to my mother and father for the importance they placed on my education. I am indebted to my advisors Patrick Winston for teaching me AI, Boris Katz for teaching me language, and Marvin Minsky for teaching me to avoid bad ideas.