Mining a Year of Speech

University of Oxford

Phonetics Laboratory

Linguistic Data Consortium
University of Pennsylvania

Mining a Year of Speech: a Digging into Data challenge


Technologies for storing and processing vast amounts of text are mature and well-defined. In contrast, technologies for browsing or mining content from large collections of non-textual material, especially audio and video, are less well-developed. Large-scale data mining on text has helped transform the relevant disciplines; the disciplines dealing with spoken language may well reap similar benefits from accessible, searchable, large corpora.

This project shall address the challenge of providing rich, intelligent data mining capabilities for a substantial collection of spoken audio data in American and British English. We shall apply and extend state-of-the-art techniques to offer sophisticated, rapid and flexible access to a richly annotated corpus of a year of speech (about 9000 hours, 100 million words, or 2 terabytes of speech), derived from the Linguistic Data Consortium, the British National Corpus, and other existing resources. This is at least ten times more data than has previously been used by researchers in fields such as phonetics, linguistics, or psychology, and more than 100 times the amount used in common practice. (More on the datasets here ...)
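As a rough sanity check on those figures, the arithmetic can be sketched in a few lines of Python. The calculation assumes a calendar year of 365 days and decimal terabytes; neither assumption is stated in the project description, and the constant and function names are illustrative only.

```python
# Back-of-envelope check of the corpus figures quoted above.
# Assumptions (not from the project description): a 365-day year and
# "2 terabytes" read as decimal terabytes (10**12 bytes each).
HOURS_PER_YEAR = 365 * 24        # 8760, which the text rounds to "about 9000"
WORDS = 100_000_000              # "100 million words"
TOTAL_BYTES = 2 * 10**12         # "2 terabytes"


def words_per_minute(words: int, hours: float) -> float:
    """Average speaking rate implied by a word count and a duration."""
    return words / (hours * 60)


def bytes_per_second(total_bytes: int, hours: float) -> float:
    """Average storage consumed per second of audio."""
    return total_bytes / (hours * 3600)


rate = words_per_minute(WORDS, HOURS_PER_YEAR)
throughput = bytes_per_second(TOTAL_BYTES, HOURS_PER_YEAR)

print(f"{HOURS_PER_YEAR} hours in a year")
print(f"about {rate:.0f} words per minute on average")
print(f"about {throughput / 1000:.1f} kB of audio per second")
```

Under these assumptions the three figures are mutually consistent: roughly 190 words per minute and a little over 63 kB per second. That data rate is in the range one would expect for uncompressed 16-bit mono audio sampled at around 32 kHz, although the description does not say what audio format the corpus uses.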

It is impractical for anyone to listen to a year of audio to search for certain words or phrases, or to manually analyze the resulting data. With our methods, such tasks will take just a few seconds. People conduct such searches for many different purposes, and it is neither possible nor desirable to predict what they might want to look for. Some possibilities are:

When did X say Y? For example, "find the video clip where George Bush said 'read my lips'."
Are there changes in dialects, or in their social status, that are tied to the new social media?
How do arguments work? For example, what strategies do people use to handle interruptions?

Though our experience and research interests happen to be focussed on such matters as intonation, pronunciation differences between dialects, and dialogue modeling, the text-to-speech alignment and search tools produced by the project will open up this "year of speech corpus" for use by a wide variety of researchers interested in, for example, linguistics, phonetics, speech communication, oral history, newsreels, or media studies. Audio-video usage on the Internet is large and growing at an extraordinarily high rate: witness the huge growth of Skype and YouTube (now the second most frequently used search engine in the world). In the multimedia space of Web 2.0, automatic and reliable annotation and searchable indexing of spoken materials would be a "killer app". It is easy to envisage a near-future world in which a search query would return the relevant video clips and data describing the event(s). The techniques we use here could be applied to any material where audio or video is accompanied by a script or transcript, including copyright-controlled broadcast media.

This project has three stages, in all of which we have considerable experience. Each stage extends proven technology; their combination and application on a large scale will open the door to new kinds of language-based research.

First, we transform transcripts in ordinary spelling into phonetic transcriptions; these are then automatically time-aligned to the digital audio recordings. This uses a version of the forced-alignment techniques developed as part of automatic speech recognition research, adapted to deal with disfluencies and transcripts that are sometimes incomplete or inaccurate.
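The two steps in this stage can be illustrated with a deliberately small Python sketch. It is not the project's aligner: the lexicon entries, the ARPAbet-like phone symbols, the per-frame scores and the function names `to_phones` and `forced_align` are all hypothetical. A real system would use trained acoustic models to score each frame and would need extra machinery for disfluencies and faulty transcripts; here the scores are made up so that the dynamic programming is easy to follow.

```python
import math
import re

# Toy pronunciation lexicon (hypothetical entries, ARPAbet-like symbols).
LEXICON = {
    "read": ["R", "IY", "D"],
    "my": ["M", "AY"],
    "lips": ["L", "IH", "P", "S"],
}


def to_phones(transcript, lexicon):
    """Map each orthographic word to phones; collect words the lexicon lacks."""
    words = re.findall(r"[a-z']+", transcript.lower())
    pairs, unknown = [], []
    for word in words:
        if word in lexicon:
            pairs.append((word, lexicon[word]))
        else:
            unknown.append(word)  # flagged for manual or rule-based handling
            pairs.append((word, ["<unk>"]))
    return pairs, unknown


def forced_align(phones, frame_scores):
    """Viterbi alignment of a known phone sequence to audio frames.

    phones: phone labels in spoken order.
    frame_scores: one dict per frame, mapping label -> log-likelihood.
    Returns (phone, first_frame, last_frame) triples, frames inclusive.
    """
    n_frames, n_phones = len(frame_scores), len(phones)
    if n_phones == 0 or n_frames < n_phones:
        raise ValueError("need at least one phone and one frame per phone")
    neg = -math.inf
    best = [[neg] * n_phones for _ in range(n_frames)]
    advanced = [[False] * n_phones for _ in range(n_frames)]
    best[0][0] = frame_scores[0][phones[0]]
    for t in range(1, n_frames):
        for i in range(n_phones):
            stay = best[t - 1][i]                        # remain in phone i
            move = best[t - 1][i - 1] if i > 0 else neg  # enter phone i
            if move > stay:
                prev, advanced[t][i] = move, True
            else:
                prev = stay
            if prev > neg:
                best[t][i] = prev + frame_scores[t][phones[i]]
    # Trace back from the last frame of the last phone to find boundaries.
    segments, i, end = [], n_phones - 1, n_frames - 1
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][i]:
            segments.append((phones[i], t, end))
            end, i = t - 1, i - 1
    segments.append((phones[0], 0, end))
    segments.reverse()
    return segments


pairs, unknown = to_phones("read", LEXICON)
phones = [p for _, pron in pairs for p in pron]
# Made-up scores: each frame strongly favours one phone.
favoured = ["R", "R", "IY", "IY", "IY", "D"]
scores = [{p: (0.0 if p == f else -5.0) for p in phones} for f in favoured]
print(forced_align(phones, scores))
# [('R', 0, 1), ('IY', 2, 4), ('D', 5, 5)]
```

Multiplying the frame indices by the frame duration (10 ms is a common choice, assumed here rather than taken from the project) turns each segment into start and end times that can be stored alongside the orthographic word.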

Second, we put the time-aligned orthographic and phonetic transcriptions into a database that will allow us (or future researchers) to add further layers of annotation – e.g. marking which sense of a word is intended – and metadata such as the date and circumstances of the recording. We will also add summaries of acoustic properties that are useful for indexing and data mining.
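One possible shape for such a database is sketched below using Python's built-in sqlite3 module. The table names, column names and sample rows are invented for illustration and are not the project's actual schema or data. The point of the sketch is that each word token carries its own start and end time, further annotation layers attach to tokens without altering them, and a phrase search returns time stamps rather than just documents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE recording (
    id            INTEGER PRIMARY KEY,
    recorded_on   TEXT,              -- metadata: date of recording
    circumstances TEXT               -- metadata: setting, speakers, etc.
);
CREATE TABLE token (
    id           INTEGER PRIMARY KEY,
    recording_id INTEGER NOT NULL REFERENCES recording(id),
    position     INTEGER NOT NULL,   -- word index within the recording
    word         TEXT NOT NULL,      -- orthographic form
    phones       TEXT NOT NULL,      -- phonetic transcription
    start_s      REAL NOT NULL,      -- aligned start time in seconds
    end_s        REAL NOT NULL       -- aligned end time in seconds
);
CREATE TABLE annotation (            -- extra layers added later
    token_id INTEGER NOT NULL REFERENCES token(id),
    layer    TEXT NOT NULL,          -- e.g. 'sense', 'mean_f0_hz'
    value    TEXT NOT NULL
);
CREATE INDEX token_word ON token(word);
""")

# Invented sample data, for illustration only.
conn.execute("INSERT INTO recording VALUES (1, '2009-12-01', 'studio test')")
conn.executemany(
    "INSERT INTO token (recording_id, position, word, phones, start_s, end_s) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 0, "read", "R IY D", 12.30, 12.52),
        (1, 1, "my", "M AY", 12.52, 12.70),
        (1, 2, "lips", "L IH P S", 12.70, 13.15),
    ],
)
conn.execute("INSERT INTO annotation VALUES (1, 'sense', 'read: interpret')")


def find_phrase(conn, phrase):
    """Return (recording_id, start_s, end_s) for each occurrence of phrase."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = []
    candidates = conn.execute(
        "SELECT recording_id, position, start_s FROM token WHERE word = ?",
        (words[0],),
    ).fetchall()
    for rec_id, pos, start_s in candidates:
        # Fetch the run of tokens that would have to match the whole phrase.
        rows = conn.execute(
            "SELECT word, end_s FROM token "
            "WHERE recording_id = ? AND position BETWEEN ? AND ? "
            "ORDER BY position",
            (rec_id, pos, pos + len(words) - 1),
        ).fetchall()
        if [w for w, _ in rows] == words:
            hits.append((rec_id, start_s, rows[-1][1]))
    return hits


print(find_phrase(conn, "read my lips"))
```

Keeping annotations in a separate table keyed by token means that a new layer, such as word senses or an acoustic summary, can be added by later researchers without changing the aligned transcription itself. A production system at this scale would presumably need a more carefully indexed design than this sketch.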

Third, we will develop a demonstration front-end for accessing the database. Using this, we will seek to understand usage scenarios, what data should be included, and the impact that a larger-scale search engine might have.

This project is a collaboration between the Linguistic Data Consortium, University of Pennsylvania, USA (Director: Professor Mark Y. Liberman) and the Phonetics Laboratory, University of Oxford, UK (Director: Professor John S. Coleman), together with colleagues at the British Library, whose National Sound Archive curates the British English materials used in this project. The US team also includes Jiahong Yuan and Christopher Cieri; the UK team also includes Greg Kochanski, Lou Burnard and Ladan Ravary (from Oxford) and Jonathan Robinson and Joanne Sweeney (from the British Library).

John Coleman, December 2009