Dept. Head, AT&T Labs Research
Awards
AT&T Fellow, 2001

Education
Received Ph.D., MS, and SB degrees in 1983, 1980, and 1978, all
from MIT in Computer Science. Undergraduate thesis, supervised by
Richard Greenblatt, on chess programming. Thanks to considerable help and
encouragement from Hans Berliner, it was published in IJCAI-79 as Co-ordinate
Squares: A Solution to Many Chess Pawn Endgames. Master's thesis, supervised by Peter Szolovits, argued that a finite state machine (FSM) was sufficient for processing natural language, and that one did not need the much more complex mechanisms (ATNs or Turing Machines) that were in vogue at the
time. The thesis was originally
intended to help Szolovits' expert systems group, MEDG (the Medical Decision Making Group), generate better explanations. Became convinced that the MEDG
explanations were difficult to understand because they were full of
center-embedding, a kind of recursion that requires unbounded memory to
process. This argument was based
on Chomsky's 1959 proof that center-embedding characterizes the difference between pushdown automata and finite state machines (FSMs). The thesis ultimately evolved into a
demonstration that language could be processed on an FSM by implementing a
parser based on Mitch Marcus' thesis and Kaplan-Bresnan's
Lexical-Functional Grammar, and showing that the stack did not need more than
about three cells, as long as one used certain compiler optimizations, such as tail-recursion elimination, to conserve memory. Doctoral thesis, supervised by Jon
Allen, argued that well-known parsing methods could be used to help take
advantage of allophonic variation in speech. The phoneme /t/, for example, is
typically realized with a heavily aspirated burst at the beginning of a
syllable as in Tom, but not at the end of a syllable as in atlas.
At the time, these sorts of variations were seen as a kind of noise that a
speech recognition machine would have to overcome. I preferred to view the speech signal as providing two types of clues: those that varied systematically with the suprasegmental context (syllable structure and stress), such as aspiration, and those that tended to be relatively invariant to contextual influences, such as place, manner and voicing. Both types of clues are useful. Variant clues can be used in a parser
to discover the underlying context, and therefore should not be considered
noise. An abbreviated version of
the thesis was published in Cognition in 1987; the complete thesis was
published by Kluwer the following year.

Work Experience
1983-1990
Joined AT&T Bell Laboratories Research in 1983 to work with Mitch Marcus and Mark Liberman
in Osamu Fujimura's Linguistic Research and Artificial Intelligence
department. Worked on a number of
problems in speech and language, especially part-of-speech tagging, pronunciation of surnames for speech synthesis, and text analysis. Published a statistical trigram part-of-speech tagger in 1988. The approach has since become
standard, but at the time it was believed that much more complex mechanisms
such as ATNs or Turing Machines were required.
(There was also a strong bias at the time against empirical/statistical
approaches to language.) My interest in text analysis grew out of
a collaboration with the lexicographer Patrick Hanks. I ran into a number of lexicographers
while working on letter-to-sound rules for speech synthesis. At the time, letter-to-sound rules
were being used to guess when the letter “t” should be pronounced
with the phoneme /t/ and when it shouldn't. Word accuracy rates were unacceptably
low. After acquiring the rights
to a dictionary, word accuracy quickly climbed from about 80% to 99.9% or
better. The dictionary-based
approach has since become standard, but at the time there was considerable
controversy over the “larger” memory requirement (0.25
megabytes). Ever since, I have been acquiring data just about as fast as I have been able to acquire disks. These days, we buy disks by the
terabyte. After this successful interaction with
lexicography, Patrick Hanks visited Bell Laboratories for a month in early
1989. He showed me how
lexicographers were beginning to use large corpora to find collocations,
words that are often found together like doctor/nurse, or save...from. We found ways to automate some of the
drudgery by using well-understood statistical measures like mutual
information.
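The basic calculation is easy to sketch. Here is a minimal illustration in Python (not the code we actually used; the window size, frequency threshold and normalization are arbitrary choices) of scoring word pairs by pointwise mutual information, i.e., how much more often they co-occur than chance would predict:

    import math
    from collections import Counter

    def pmi_collocations(tokens, window=5, min_count=5):
        """Rank word pairs by pointwise mutual information,
        PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ),
        using simple maximum-likelihood estimates from one corpus."""
        unigrams = Counter(tokens)
        pairs = Counter()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + window]:      # co-occurrence within a small window
                pairs[(w, v)] += 1
        n_words, n_pairs = len(tokens), sum(pairs.values())
        scores = {}
        for (w, v), c in pairs.items():
            if c < min_count:                         # rare pairs give unreliable estimates
                continue
            p_xy = c / n_pairs
            p_x, p_y = unigrams[w] / n_words, unigrams[v] / n_words
            scores[(w, v)] = math.log2(p_xy / (p_x * p_y))
        return sorted(scores.items(), key=lambda kv: -kv[1])

    # e.g., pmi_collocations(open("corpus.txt").read().lower().split())[:20]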
Began working with William Gale, a statistician at Bell Laboratories, on estimating the probability of rare
events in large corpora, comparing the Good-Turing method with a held-out
estimator.
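The Good-Turing idea fits in a few lines. The sketch below is only an illustration (a realistic implementation also smooths the frequency-of-frequency counts, which this toy version skips): a word seen r times is re-estimated as r* = (r+1) N[r+1] / N[r], where N[r] is the number of distinct words seen exactly r times.

    from collections import Counter

    def good_turing(counts):
        """Unsmoothed Good-Turing re-estimation from a word->count table.
        Returns adjusted probabilities and the mass reserved for unseen words."""
        nr = Counter(counts.values())                 # N[r]: frequencies of frequencies
        total = sum(counts.values())
        adjusted = {}
        for word, r in counts.items():
            if nr.get(r + 1):
                adjusted[word] = (r + 1) * nr[r + 1] / nr[r] / total
            else:                                     # N[r+1] is zero: fall back to r/total
                adjusted[word] = r / total
        p_unseen = nr.get(1, 0) / total               # mass given to events never seen
        return adjusted, p_unseen

    # e.g., good_turing(Counter("to be or not to be".split()))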
Over the next five years, ending with Gale's retirement, we published a series of papers
together on spelling correction, word-sense disambiguation, aligning parallel
corpora, and alternatives to the Poisson for modeling the distribution of
words in text.

1990-1991
Visited Ed Hovy at USC/ISI for an academic year, with support
from ARPA for work on machine translation and parallel corpora (texts such as
the Canadian Parliamentary debates that are published in two languages). Spent the time reading in many areas,
especially in machine translation (MT) and information retrieval (IR). Started attending meetings in both
areas. Became very active in
corpus lexicography, publishing a number of papers with Hanks, Gale, Yarowsky and others. Also used the time to help members of
the community get started in corpus-based research by serving on data
collection committees such as the ACL/DCI (the Association for Computational Linguistics' Data Collection Initiative), writing review articles (for Computational
Linguistics, Machine Translation, and CACM) and tutorials (Coling-90, 1992
European Summer School, ACL-95).
Taught a graduate seminar course at USC with Ed Hovy
in the spring term. Started
reviewing large numbers of papers for conferences, journals and funding
agencies, perhaps as many as a hundred papers in some years.

1991-1994
Returned to Bell Laboratories in fall
1991 and switched into a software engineering organization, headed by Peter
Weinberger. Continued my interest in large corpora by showing how a large multi-million-line program could be thought of as a large corpus, and how corpus-based techniques could be used in the
discovery process, the process of reading a large program for the first
time. Wrote a program with Jon Helfman called dotplot that helps find self-similarity (copied
regions) in text (large corpora, source code, object
code) using methods borrowed from biology for matching long sequences of DNA.
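The idea behind dotplot is easy to illustrate: put the same sequence on both axes and mark cell (i, j) whenever position i matches position j, so copied regions show up as long off-diagonal runs. The toy version below conveys the idea only; the real program needed hashing, sampling and careful rendering to cope with inputs of millions of lines.

    def dotplot(tokens):
        """Return the set of (i, j) cells, i != j, where token i equals token j.
        Copied regions appear as runs along diagonals."""
        index = {}
        for j, tok in enumerate(tokens):
            index.setdefault(tok, []).append(j)
        dots = set()
        for i, tok in enumerate(tokens):
            for j in index[tok]:
                if i != j:
                    dots.add((i, j))
        return dots

    # e.g., treat source lines as tokens to spot copied code across two files:
    # dots = dotplot(open("a.c").readlines() + open("b.c").readlines())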
Developed the Termworks platform with AT&T Language Line,
a commercial translation service.
The platform was designed to help translators with technical
terminology and translation reuse.
Translators don't need help with easy vocabulary and easy grammar (the
focus of most machine translation research) but they do need help with
technical terms like dialog box.
The solutions to these terminology puzzles can be surprising, even to
a native speaker of the language.
Microsoft uses finestra (window)
in their Italian manuals, but boîte (box)
in their French manuals. The tools
help translators conform to house standards by extracting examples of the
term and its translations from a previous release of the manual that has
already been published in both languages. A statistical alignment program is
used to figure out which parts of the source text correspond to which parts
of the translation.
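As a rough illustration of what such an alignment program does (a simplified sketch, not the program used with Termworks; the move penalties are arbitrary), the dynamic program below aligns two lists of sentences using only their lengths, in the spirit of length-based methods such as Gale & Church (1993), allowing 1-1, 1-0, 0-1, 2-1 and 1-2 matches:

    def align(src, tgt):
        """Align two lists of sentences by minimizing a length-based cost.
        A toy sketch: real aligners use a probabilistic cost, not |len - len|."""
        INF = float("inf")
        m, n = len(src), len(tgt)
        slen, tlen = [len(s) for s in src], [len(t) for t in tgt]
        cost = [[INF] * (n + 1) for _ in range(m + 1)]
        back = [[None] * (n + 1) for _ in range(m + 1)]
        cost[0][0] = 0.0
        moves = {(1, 1): 0, (1, 0): 50, (0, 1): 50, (2, 1): 10, (1, 2): 10}
        for i in range(m + 1):
            for j in range(n + 1):
                if cost[i][j] == INF:
                    continue
                for (di, dj), penalty in moves.items():
                    ii, jj = i + di, j + dj
                    if ii > m or jj > n:
                        continue
                    c = cost[i][j] + abs(sum(slen[i:ii]) - sum(tlen[j:jj])) + penalty
                    if c < cost[ii][jj]:
                        cost[ii][jj], back[ii][jj] = c, (di, dj)
        beads, i, j = [], m, n                        # trace back to recover aligned groups
        while i or j:
            di, dj = back[i][j]
            beads.append((src[i - di:i], tgt[j - dj:j]))
            i, j = i - di, j - dj
        return list(reversed(beads))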
The tool and its implications for machine translation research were presented at an
invited talk at EACL-93. Pierre
Isabelle and I edited a special issue of the journal Machine Translation
(1997 12:1/2) on tool-based alternatives to machine translation.

Worked on a digital library project
based on a collection of 500,000 pages of AT&T memos supplied by the
Murray Hill Library in 400 dots-per-inch bitmap format (20 gigabytes). Demonstrated how off-the-shelf optical
character recognition (OCR) and information retrieval (IR) programs could be
used to implement a “fax-a-query” information retrieval service where both
the input and output could be faxes or bitmaps or any format that could be
converted to a fax or bitmap. Argued at Coling-94 that fax had advantages
over proposed ASCII-based standards such as SGML
(availability, ease-of-use, longevity).
Began work on bitmap-level matching, avoiding the need for OCR. Found that bitmap-level matching could
also be used to improve up-sampling and down-sampling of documents, which is
important if the scanner, the fax transmission line and the printer use
different sampling rates.

1995-1996
Taught a seminar course on corpus-based
methods at Penn with Mitch Marcus.
Served on thesis committees for Pascale Fung and Vasileios Hatzivassiloglou.

1996-
Promoted to head a new department in research. AT&T has vast
quantities of data; some is textual, but most is not. The department uses visualization and data mining, methods that have become popular in corpus-based natural language processing, to find patterns of interest. Imagine that we have a large, steady
stream of usage records indicating which customers are buying which services
and when. This kind of data feed
is usually developed for billing purposes, but many companies are discovering
a broad range of other applications such as fraud detection, inventory
control, provisioning scarce resources, forecasting demand, etc. The problems are challenging from a
research point of view because of the massive scale. The telephone network, for example,
has about 100 million customers (nodes), who place about 100 million calls
(edges) per day. This data stream
is much larger than the very large corpora that I had just a few years
ago. We used to think that a
billion words of text was a lot of data, but we are currently receiving a
couple of billion records a week.

1998-1999
Developed with Glenn Fowler and Don
Caldwell a program called pzip
that is like gzip, but uses an automatic
training procedure to induce an optimal schema for compressing tabular data
such as database records and spreadsheets. Sometimes it is better to compress a table in row-major order and sometimes it is better to compress it in column-major order. Pzip uses a schema to enumerate the values
in the table and then passes the results to gzip. Another program, pin (partition
inducer), uses dynamic programming to find the optimal partition (or schema)
by applying gzip to all possible partitions
of a sample of the table. Pzip typically saves a factor of two in both space
and time over gzip, which has been very much
appreciated by several large data warehousing applications. A paper was published in SODA-2000.
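The effect is easy to demonstrate with the standard zlib library. The sketch below is only an illustration of the idea (the table is made up, and pzip itself does considerably more): serialize the same table in row-major and in column-major order and compare the compressed sizes; grouping similar values together is essentially what pin is searching for when it partitions the columns.

    import zlib

    def compressed_size(table, by_columns):
        """Serialize a table of strings row-major or column-major and report
        the zlib-compressed size; grouping similar values helps the compressor."""
        cells = zip(*table) if by_columns else table
        blob = "\n".join("|".join(row) for row in cells).encode("utf-8")
        return len(zlib.compress(blob, 9))

    # A made-up table of call-record-like rows; try both orderings and compare.
    table = [("1999-06-%02d" % (i % 30 + 1), "NJ", "residential", str(i % 7))
             for i in range(10000)]
    print("row-major:   ", compressed_size(table, by_columns=False))
    print("column-major:", compressed_size(table, by_columns=True))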
Won a “Circle of Excellence” award in 1998 from the AT&T CIO (Chief Information
Office) for this work.

Established excellent relationships with
AT&T marketing, especially the folks behind the new Lucky Dog flanker brand. My team maintains a web site that
tracks sales in “real time” (faster than previously
possible). This web site is
changing the culture in marketing.
Now that they know they can get quick feedback, they are much more
willing to experiment. Try a
little bit of this and a little bit of that, and do more of what works and
less of what doesn't. The web
site has moved SWIFT from a research prototype into a highly visible business
reality. SWIFT was a very nice
demo that visualized traffic on the AT&T network on a map projected on a powerwall, a huge wall-sized terminal. The point of the demo is that we ought
to be able to react to a change in the marketplace in less than a day. To make the idea a business reality,
we simplified the interface so that marketing managers could drive it, and
replaced the fancy powerwall hardware with a
standard web browser. In
addition, we worked closely with marketing managers to understand what
matters to them, a task that reminded me of my days in Peter Szolovits' expert systems group at MIT.

1999-
Built upon our success with Lucky Dog to establish relationships with other marketing groups, especially the launch of AT&T local service. Despite my new interests in data mining, visualization, management and business, I
remain active in computational linguistics, my first love. With so many responsibilities, it has
been more difficult to publish, but nevertheless, together with Yamamoto, we
showed how to compute term frequencies and document frequencies, two
statistics used in Information Retrieval, for all substrings in a large
corpus. It is well known how to
compute these statistics for short substrings, e.g., unigrams, bigrams and trigrams, but we showed how to compute many
of these kinds of statistics for all substrings including very long ones,
e.g., million-grams. One might have thought that this would be impractical since there are about n² substrings, but it turns out that the n² substrings can be grouped into n classes using a data structure called suffix arrays, and many of the statistics of interest can be computed over the n classes rather than the n² substrings.
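A small sketch of the underlying idea (an illustration only, not the code from the paper): once the corpus is indexed by a suffix array, every substring corresponds to a contiguous interval of suffixes, and its term frequency is simply the width of that interval, found by binary search rather than by enumerating substrings one at a time.

    def suffix_array(text):
        """Naive construction; fine for a sketch, not for a large corpus."""
        return sorted(range(len(text)), key=lambda i: text[i:])

    def tf(text, sa, sub):
        """Term frequency of `sub` = width of its suffix-array interval."""
        def lower_bound(target):
            lo, hi = 0, len(sa)
            while lo < hi:
                mid = (lo + hi) // 2
                if text[sa[mid]:sa[mid] + len(sub)] < target:
                    lo = mid + 1
                else:
                    hi = mid
            return lo
        # "\uffff" acts as a sentinel that sorts after any ordinary character.
        return lower_bound(sub + "\uffff") - lower_bound(sub)

    text = "to be or not to be"
    sa = suffix_array(text)
    print(tf(text, sa, "to be"))                      # prints 2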
Since we were able to compute these statistics over a larger set of words
and phrases than previously possible, we found an interesting new heuristic
for distinguishing good keywords from general vocabulary, which seems
promising for both Japanese and English.
Both mutual information and a new measure called residual IDF (inverse
document frequency) compare an observation to chance, but they use different
notions of chance. Mutual
information looks for deviations from compositionality. It grew out of work with
lexicographers (Hanks) and so it isn't surprising that it is better for
finding general vocabulary, the kind of words and phrases that one would
expect to find in a dictionary.
On the other hand, residual IDF looks for words and phrases that are
not randomly distributed across documents. It grew out of the information
retrieval literature and so it isn't surprising that it is better for finding
keywords, the words and phrases that distinguish one document from another.
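For concreteness, here is a small sketch of the two measures (an illustration with simple maximum-likelihood estimates; the counts in the example are made up): mutual information compares the frequency of a phrase with what its parts would predict, while residual IDF compares the observed IDF of a term with the IDF a Poisson model would predict if its occurrences were scattered randomly across documents.

    import math

    def mutual_information(pair_freq, x_freq, y_freq, n_words):
        """PMI of a two-word phrase: deviation from compositionality,
        log2( P(xy) / (P(x) * P(y)) )."""
        return math.log2((pair_freq / n_words) /
                         ((x_freq / n_words) * (y_freq / n_words)))

    def residual_idf(doc_freq, total_freq, num_docs):
        """Observed IDF minus the IDF predicted by a Poisson model in which
        the term's occurrences are scattered randomly across documents."""
        observed = -math.log2(doc_freq / num_docs)
        expected = -math.log2(1 - math.exp(-total_freq / num_docs))
        return observed - expected

    # e.g., mutual_information(pair_freq=30, x_freq=1000, y_freq=500, n_words=1_000_000)
    # A bursty keyword packs its occurrences into few documents, so its observed
    # IDF far exceeds the Poisson prediction; general vocabulary does not.
    print(residual_idf(doc_freq=10, total_freq=100, num_docs=1000))   # large (keyword-like)
    print(residual_idf(doc_freq=95, total_freq=100, num_docs=1000))   # near zero (scattered)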
Community Service
Served on the editorial boards of Computational Linguistics (CL), Machine Translation (MT), Quantitative
Linguistics (QL), International Journal of Corpus Linguistics
(IJCL). Served on program
committees for numerous conferences including SIGIR and ACL, both more than once. Helped to create SIGDAT, a special
interest group focusing on corpus data, and organized, chaired, or co-chaired
many of its annual workshops on very large corpora, which have evolved into
conferences with 100-200 participants per meeting, about equally distributed
over