Dept. Head, AT&T Labs Research
Awards
AT&T Fellow, 2001

Education
Received Ph.D., MS, and SB degrees in 1983, 1980, and 1978, all
from MIT in Computer Science. Undergraduate thesis, supervised by
Richard Greenblatt, on chess programming. Thanks to considerable help and
encouragement from Hans Berliner, it was published in IJCAI-79 as Co-ordinate
Squares: A Solution to Many Chess Pawn Endgames. Master's thesis, supervised by Peter Szolovits, argued that a finite state machine (FSM) was sufficient for processing natural language, and that one did not need the much more complex mechanisms (ATNs or Turing Machines) that were in vogue at the
time. The thesis was originally
intended to help Szolovits' expert systems group, MEDG (the Medical Decision Making Group), generate better explanations. Became convinced that the MEDG
explanations were difficult to understand because they were full of
center-embedding, a kind of recursion that requires unbounded memory to
process. This argument was based
on Chomsky's 1959 proof that center-embedding characterizes the difference between pushdown automata and finite state machines (FSMs). The thesis ultimately evolved into a
demonstration that language could be processed on an FSM by implementing a
parser based on Mitch Marcus' thesis and Kaplan-Bresnan's
Lexical-Functional Grammar, and showing that the stack did not need more than
about three cells, as long as one used certain compiler optimizations, such as tail-recursion elimination, to conserve memory. Doctoral thesis, supervised by Jon
Allen, argued that well-known parsing methods could be used to help take
advantage of allophonic variation in speech. The phoneme /t/, for example, is
typically realized with a heavily aspirated burst at the beginning of a
syllable as in Tom, but not at the end of a syllable as in atlas.
At the time, these sorts of variations were seen as a kind of noise that a
speech recognition machine would have to overcome. I preferred to view the speech signal as providing two types of clues: those that varied systematically with the suprasegmental context (syllable structure and stress), such as aspiration, and those that tended to be relatively invariant to contextual influences, such as place, manner and voicing. Both types of clues are useful. Variant clues can be used in a parser
to discover the underlying context, and therefore should not be considered
noise. An abbreviated version of
the thesis was published in Cognition in 1987; the complete thesis was
published by Kluwer the following year.

Work Experience
1983-1990
Joined AT&T Bell Laboratories Research in 1983 to work with Mitch Marcus and Mark Liberman
in Osamu Fujimura's Linguistic Research and Artificial Intelligence
department. Worked on a number of
problems in speech and language, especially part-of-speech tagging, pronunciation of surnames for speech synthesis, and text analysis. Published a statistical trigram part-of-speech tagger in 1988. The approach has since become
standard, but at the time it was believed that much more complex mechanisms
such as ATNs or Turing Machines were required.
(There was also a strong bias at the time against empirical/statistical
approaches to language.) My interest in text analysis grew out of
a collaboration with the lexicographer Patrick Hanks. I ran into a number of lexicographers
while working on letter-to-sound rules for speech synthesis. At the time, letter-to-sound rules
were being used to guess when the letter “t” should be pronounced
with the phoneme /t/ and when it shouldn't. Word accuracy rates were unacceptably
low. After acquiring the rights
to a dictionary, word accuracy quickly climbed from about 80% to 99.9% or
better. The dictionary-based
approach has since become standard, but at the time there was considerable
controversy over the “larger” memory requirement (0.25
megabytes). Ever since, I have been acquiring data just about as fast as I have been able to acquire disks. These days, we buy disks by the
terabyte. After this successful interaction with
lexicography, Patrick Hanks visited Bell Laboratories for a month in early
1989. He showed me how
lexicographers were beginning to use large corpora to find collocations,
words that are often found together like doctor/nurse, or save...from. We found ways to automate some of the
drudgery by using well-understood statistical measures like mutual
information.
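The basic calculation is easy to sketch. Here is a minimal illustration in Python (not the code we actually used; the window size, frequency threshold and normalization are arbitrary choices) of scoring word pairs by pointwise mutual information, i.e., how much more often they co-occur than chance would predict:

    import math
    from collections import Counter

    def pmi_collocations(tokens, window=5, min_count=5):
        """Rank word pairs by pointwise mutual information,
        PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ),
        using simple maximum-likelihood estimates from one corpus."""
        unigrams = Counter(tokens)
        pairs = Counter()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + window]:      # co-occurrence within a small window
                pairs[(w, v)] += 1
        n_words, n_pairs = len(tokens), sum(pairs.values())
        scores = {}
        for (w, v), c in pairs.items():
            if c < min_count:                         # rare pairs give unreliable estimates
                continue
            p_xy = c / n_pairs
            p_x, p_y = unigrams[w] / n_words, unigrams[v] / n_words
            scores[(w, v)] = math.log2(p_xy / (p_x * p_y))
        return sorted(scores.items(), key=lambda kv: -kv[1])

    # e.g., pmi_collocations(open("corpus.txt").read().lower().split())[:20]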
Began working with William Gale, a statistician at Bell Laboratories, on estimating the probability of rare
events in large corpora, comparing the Good-Turing method with a held-out
estimator.
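The Good-Turing idea fits in a few lines. The sketch below is only an illustration (a realistic implementation also smooths the frequency-of-frequency counts, which this toy version skips): a word seen r times is re-estimated as r* = (r+1) N[r+1] / N[r], where N[r] is the number of distinct words seen exactly r times.

    from collections import Counter

    def good_turing(counts):
        """Unsmoothed Good-Turing re-estimation from a word->count table.
        Returns adjusted probabilities and the mass reserved for unseen words."""
        nr = Counter(counts.values())                 # N[r]: frequencies of frequencies
        total = sum(counts.values())
        adjusted = {}
        for word, r in counts.items():
            if nr.get(r + 1):
                adjusted[word] = (r + 1) * nr[r + 1] / nr[r] / total
            else:                                     # N[r+1] is zero: fall back to r/total
                adjusted[word] = r / total
        p_unseen = nr.get(1, 0) / total               # mass given to events never seen
        return adjusted, p_unseen

    # e.g., good_turing(Counter("to be or not to be".split()))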
Over the next five years, ending with Gale's retirement, we published a series of papers
together on spelling correction, word-sense disambiguation, aligning parallel
corpora, and alternatives to the Poisson for modeling the distribution of
words in text.

1990-1991
Visited Ed Hovy at USC/ISI for an academic year, with support
from ARPA for work on machine translation and parallel corpora (texts such as
the Canadian Parliamentary debates that are published in two languages). Spent the time reading in many areas,
especially in machine translation (MT) and information retrieval (IR). Started attending meetings in both
areas. Became very active in
corpus lexicography, publishing a number of papers with Hanks, Gale, Yarowsky and others. Also used the time to help members of
the community get started in corpus-based research by serving on data
collection committees such as the ACL/DCI (the Association for Computational Linguistics' Data Collection Initiative), writing review articles (for Computational
Linguistics, Machine Translation, and CACM) and tutorials (Coling-90, 1992
European Summer School, ACL-95).
Taught a graduate seminar course at USC with Ed Hovy
in the spring term. Started
reviewing large numbers of papers for conferences, journals and funding
agencies, perhaps as many as a hundred papers in some years.

1991-1994
Returned to Bell Laboratories in fall
1991 and switched into a software engineering organization, headed by Peter
Weinberger. Continued my interest in large corpora by showing how a large multi-million-line program could be thought of as a large corpus, and how corpus-based techniques could be used in the
discovery process, the process of reading a large program for the first
time. Wrote a program with Jon Helfman called dotplot that helps find self-similarity (copied
regions) in text (large corpora, source code, object
code) using methods borrowed from biology for matching long sequences of DNA.
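The idea behind dotplot is easy to illustrate: put the same sequence on both axes and mark cell (i, j) whenever position i matches position j, so copied regions show up as long off-diagonal runs. The toy version below conveys the idea only; the real program needed hashing, sampling and careful rendering to cope with inputs of millions of lines.

    def dotplot(tokens):
        """Return the set of (i, j) cells, i != j, where token i equals token j.
        Copied regions appear as runs along diagonals."""
        index = {}
        for j, tok in enumerate(tokens):
            index.setdefault(tok, []).append(j)
        dots = set()
        for i, tok in enumerate(tokens):
            for j in index[tok]:
                if i != j:
                    dots.add((i, j))
        return dots

    # e.g., treat source lines as tokens to spot copied code across two files:
    # dots = dotplot(open("a.c").readlines() + open("b.c").readlines())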
Developed the Termworks platform with AT&T Language Line,
a commercial translation service.
The platform was designed to help translators with technical
terminology and translation reuse.
Translators don't need help with easy vocabulary and easy grammar (the
focus of most machine translation research) but they do need help with
technical terms like dialog box.
The solutions to these terminology puzzles can be surprising, even to
a native speaker of the language.
Microsoft uses finestra (window)
in their Italian manuals, but boîte (box)
in their French manuals. The tools
help translators conform to house standards by extracting examples of the
term and its translations from a previous release of the manual that has
already been published in both languages. A statistical alignment program is
used to figure out which parts of the source text correspond to which parts
of the translation.
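As a rough illustration of what such an alignment program does (a simplified sketch, not the program used with Termworks; the move penalties are arbitrary), the dynamic program below aligns two lists of sentences using only their lengths, in the spirit of length-based methods such as Gale & Church (1993), allowing 1-1, 1-0, 0-1, 2-1 and 1-2 matches:

    def align(src, tgt):
        """Align two lists of sentences by minimizing a length-based cost.
        A toy sketch: real aligners use a probabilistic cost, not |len - len|."""
        INF = float("inf")
        m, n = len(src), len(tgt)
        slen, tlen = [len(s) for s in src], [len(t) for t in tgt]
        cost = [[INF] * (n + 1) for _ in range(m + 1)]
        back = [[None] * (n + 1) for _ in range(m + 1)]
        cost[0][0] = 0.0
        moves = {(1, 1): 0, (1, 0): 50, (0, 1): 50, (2, 1): 10, (1, 2): 10}
        for i in range(m + 1):
            for j in range(n + 1):
                if cost[i][j] == INF:
                    continue
                for (di, dj), penalty in moves.items():
                    ii, jj = i + di, j + dj
                    if ii > m or jj > n:
                        continue
                    c = cost[i][j] + abs(sum(slen[i:ii]) - sum(tlen[j:jj])) + penalty
                    if c < cost[ii][jj]:
                        cost[ii][jj], back[ii][jj] = c, (di, dj)
        beads, i, j = [], m, n                        # trace back to recover aligned groups
        while i or j:
            di, dj = back[i][j]
            beads.append((src[i - di:i], tgt[j - dj:j]))
            i, j = i - di, j - dj
        return list(reversed(beads))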
The tool and its implications for machine translation research were presented at an
invited talk at EACL-93. Pierre
Isabelle and I edited a special issue of the journal Machine Translation
(1997 12:1/2) on tool-based alternatives to machine translation.

Worked on a digital library project
based on a collection of 500,000 pages of AT&T memos supplied by the
Murray Hill Library in 400 dots-per-inch bitmap format (20 gigabytes). Demonstrated how off-the-shelf optical
character recognition (OCR) and information retrieval (IR) programs could be
used to implement a “fax-a-query” information retrieval service where both
the input and output could be faxes or bitmaps or any format that could be
converted to a fax or bitmap. Argued at Coling-94 that fax had advantages
over proposed ASCII-based standards such as SGML
(availability, ease-of-use, longevity).
Began work on bitmap-level matching, avoiding the need for OCR. Found that bitmap-level matching could
also be used to improve up-sampling and down-sampling of documents, which is
important if the scanner, the fax transmission line and the printer use
different sampling rates.

1995-1996
Taught a seminar course on corpus-based
methods at Penn with Mitch Marcus.
Served on thesis committees for Pascale Fung and Vasileios Hatzivassiloglou.

1996-
Promoted to head a new department in research. AT&T has vast
quantities of data; some is textual, but most is not. The department uses visualization and data mining, methods that have become popular in corpus-based natural language processing, to find patterns of interest. Imagine that we have a large, steady
stream of usage records indicating which customers are buying which services
and when. This kind of data feed
is usually developed for billing purposes, but many companies are discovering
a broad range of other applications such as fraud detection, inventory
control, provisioning scarce resources, forecasting demand, etc. The problems are challenging from a
research point of view because of the massive scale. The telephone network, for example,
has about 100 million customers (nodes), who place about 100 million calls
(edges) per day. This data stream
is much larger than the very large corpora that I had just a few years
ago. We used to think that a
billion words of text was a lot of data, but we are currently receiving a
couple of billion records a week.

1998-1999
Developed with Glenn Fowler and Don
Caldwell a program called pzip
that is like gzip, but uses an automatic
training procedure to induce an optimal schema for compressing tabular data
such as database records and spreadsheets. Sometimes it is better to compress a table in row-major order and sometimes it is better to compress it in column-major order. Pzip uses a schema to enumerate the values
in the table and then passes the results to gzip. Another program, pin (partition
inducer), uses dynamic programming to find the optimal partition (or schema)
by applying gzip to all possible partitions
of a sample of the table. Pzip typically saves a factor of two in both space
and time over gzip, which has been very much
appreciated by several large data warehousing applications. A paper was published in SODA-2000.
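The effect is easy to demonstrate with the standard zlib library. The sketch below is only an illustration of the idea (the table is made up, and pzip itself does considerably more): serialize the same table in row-major and in column-major order and compare the compressed sizes; grouping similar values together is essentially what pin is searching for when it partitions the columns.

    import zlib

    def compressed_size(table, by_columns):
        """Serialize a table of strings row-major or column-major and report
        the zlib-compressed size; grouping similar values helps the compressor."""
        cells = zip(*table) if by_columns else table
        blob = "\n".join("|".join(row) for row in cells).encode("utf-8")
        return len(zlib.compress(blob, 9))

    # A made-up table of call-record-like rows; try both orderings and compare.
    table = [("1999-06-%02d" % (i % 30 + 1), "NJ", "residential", str(i % 7))
             for i in range(10000)]
    print("row-major:   ", compressed_size(table, by_columns=False))
    print("column-major:", compressed_size(table, by_columns=True))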
Won a “Circle of Excellence” award in 1998 from the AT&T CIO (Chief Information
Office) for this work.

Established excellent relationships with
AT&T marketing, especially the folks behind the new Lucky Dog flanker brand. My team maintains a web site that
tracks sales in “real time” (faster than previously
possible). This web site is
changing the culture in marketing.
Now that they know they can get quick feedback, they are much more
willing to experiment. Try a
little bit of this and a little bit of that, and do more of what works and
less of what doesn't. The web
site has moved SWIFT from a research prototype into a highly visible business
reality. SWIFT was a very nice
demo that visualized traffic on the AT&T network on a map projected on a powerwall, a huge wall-sized terminal. The point of the demo is that we ought
to be able to react to a change in the marketplace in less than a day. To make the idea a business reality,
we simplified the interface so that marketing managers could drive it, and
replaced the fancy powerwall hardware with a
standard web browser. In
addition, we worked closely with marketing managers to understand what
matters to them, a task that reminded me of my days in Peter Szolovits' expert systems group at MIT.

1999-
Built upon our success with Lucky Dog to establish relationships with other marketing groups, especially the launch of AT&T local service. Despite my new interests in data mining, visualization, management and business, I
remain active in computational linguistics, my first love. With so many responsibilities, it has
been more difficult to publish, but nevertheless, together with Yamamoto, we
showed how to compute term frequencies and document frequencies, two
statistics used in Information Retrieval, for all substrings in a large
corpus. It is well known how to
compute these statistics for short substrings, e.g., unigrams, bigrams and trigrams, but we showed how to compute many
of these kinds of statistics for all substrings including very long ones,
e.g., million-grams. One might have thought that this would be impractical since there are about n² substrings, but it turns out that the n² substrings can be grouped into n classes using a data structure called suffix arrays, and many of the statistics of interest can be computed over the n classes rather than the n² substrings.
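A small sketch of the underlying idea (an illustration only, not the code from the paper): once the corpus is indexed by a suffix array, every substring corresponds to a contiguous interval of suffixes, and its term frequency is simply the width of that interval, found by binary search rather than by enumerating substrings one at a time.

    def suffix_array(text):
        """Naive construction; fine for a sketch, not for a large corpus."""
        return sorted(range(len(text)), key=lambda i: text[i:])

    def tf(text, sa, sub):
        """Term frequency of `sub` = width of its suffix-array interval."""
        def lower_bound(target):
            lo, hi = 0, len(sa)
            while lo < hi:
                mid = (lo + hi) // 2
                if text[sa[mid]:sa[mid] + len(sub)] < target:
                    lo = mid + 1
                else:
                    hi = mid
            return lo
        # "\uffff" acts as a sentinel that sorts after any ordinary character.
        return lower_bound(sub + "\uffff") - lower_bound(sub)

    text = "to be or not to be"
    sa = suffix_array(text)
    print(tf(text, sa, "to be"))                      # prints 2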
Since we were able to compute these statistics over a larger set of words
and phrases than previously possible, we found an interesting new heuristic
for distinguishing good keywords from general vocabulary, which seems
promising for both Japanese and English.
Both mutual information and a new measure called residual IDF (inverse
document frequency) compare an observation to chance, but they use different
notions of chance. Mutual
information looks for deviations from compositionality. It grew out of work with
lexicographers (Hanks) and so it isn't surprising that it is better for
finding general vocabulary, the kind of words and phrases that one would
expect to find in a dictionary.
On the other hand, residual IDF looks for words and phrases that are
not randomly distributed across documents. It grew out of the information
retrieval literature and so it isn't surprising that it is better for finding
keywords, the words and phrases that distinguish one document from another.
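For concreteness, here is a small sketch of the two measures (an illustration with simple maximum-likelihood estimates; the counts in the example are made up): mutual information compares the frequency of a phrase with what its parts would predict, while residual IDF compares the observed IDF of a term with the IDF a Poisson model would predict if its occurrences were scattered randomly across documents.

    import math

    def mutual_information(pair_freq, x_freq, y_freq, n_words):
        """PMI of a two-word phrase: deviation from compositionality,
        log2( P(xy) / (P(x) * P(y)) )."""
        return math.log2((pair_freq / n_words) /
                         ((x_freq / n_words) * (y_freq / n_words)))

    def residual_idf(doc_freq, total_freq, num_docs):
        """Observed IDF minus the IDF predicted by a Poisson model in which
        the term's occurrences are scattered randomly across documents."""
        observed = -math.log2(doc_freq / num_docs)
        expected = -math.log2(1 - math.exp(-total_freq / num_docs))
        return observed - expected

    # e.g., mutual_information(pair_freq=30, x_freq=1000, y_freq=500, n_words=1_000_000)
    # A bursty keyword packs its occurrences into few documents, so its observed
    # IDF far exceeds the Poisson prediction; general vocabulary does not.
    print(residual_idf(doc_freq=10, total_freq=100, num_docs=1000))   # large (keyword-like)
    print(residual_idf(doc_freq=95, total_freq=100, num_docs=1000))   # near zero (scattered)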
Community Service
Served on the editorial boards of Computational Linguistics (CL), Machine Translation (MT), Quantitative
Linguistics (QL), International Journal of Corpus Linguistics
(IJCL). Served on program
committees for numerous conferences including SIGIR and ACL, both more than once. Helped to create SIGDAT, a special
interest group focusing on corpus data, and organized, chaired, or co-chaired
many of its annual workshops on very large corpora, which have evolved into
conferences with 100-200 participants per meeting, about equally distributed
over