Mining a Year of Speech: a Digging into Data challenge

News

30/6/11 The project officially finished (though our work on very large-scale corpora continues under other sources of funds). Our final white paper is available here. We released a sampler of spoken audio material from the British National Corpus here.

9-10/6/11 John Coleman and Mark Liberman presented "Mining Years and Years of Speech" at the Digging into Data Challenge final conference, NEH headquarters, Washington DC. Eric Hand, a reporter for Nature magazine, picked up part of our work in this article.

28/4/11 A short article about the Digging into Data programme, in which John Coleman gives a few comments, appeared in the Times Higher Education magazine.

28-31/1/11 We participated in "New Tools and Methods for Very-Large-Scale Phonetics Research", at the University of Pennsylvania. The workshop was organized by our collaborators Jiahong Yuan and Mark Liberman. John Coleman presented a paper, "Mining a Year of Speech", on behalf of our project, and Ladan Baghai-Ravary, Sergio Grau and Greg Kochanski had a poster, "Detecting gross alignment errors in the Spoken British National Corpus". Greg also gave an oral paper, "Should corpora be big, rich, or dense?" on work he's been doing with Chilin Shih and Ryan Shosted, under our related "Word Joins" project. On the final day, John Coleman and Lauren Hall-Lew had a poster paper in the STELARIS satellite workshop.

18/11/10 We received the final installment of digitized BNC recordings from the British Library, so the digitization is now complete (modulo checks to see that we haven't missed any tapes). A big Bravo! and Thank you! to Christine Adams, Adam Tovell and their colleagues in the British Library Sound Archive.

19/10/2010 J. Coleman gave an invited lecture, "Large-scale computational research in Arts and Humanities, using mostly unwritten (audio/visual) media", at the Universities UK-sponsored 'Future of Research' conference. The slides are here.

9/8/2010 NSF finally informed our UPenn partners that they will be given their grant for the Mining a Year of Speech project, somewhat later than the project's 1 January start date.

9/6/2010 The Mining a Year of Speech project was featured in The Chronicle of Higher Education (June 4, 2010): link

January 2010 The Mining a Year of Speech project was featured in the Oxford University Blueprint: link (see p. 3)

J. Coleman interview on BBC World Service Digital Planet: link

4/12/2009 News item about the project on the University's news page: link

Content

Overview

Datasets

A Spoken BNC Sampler

Further examples

Overview

Technologies for storing and processing vast amounts of text are mature and well-defined. In contrast, technologies for browsing or mining content from large collections of non-textual material, especially audio and video, are less well-developed. Large-scale data mining on text has helped transform the relevant disciplines; the disciplines dealing with spoken language may well reap similar benefits from accessible, searchable, large corpora.

This project shall address the challenge of providing rich, intelligent data mining capabilities for a substantial collection of spoken audio data in American and British English. We shall apply and extend state-of-the-art techniques to offer sophisticated, rapid and flexible access to a richly annotated corpus of a year of speech (about 9000 hours, 100 million words, or 2 terabytes of speech), derived from the Linguistic Data Consortium, the British National Corpus, and other existing resources. This is at least ten times more data than has previously been used by researchers in fields such as phonetics, linguistics, or psychology, and more than 100 times the amount used in common practice. (More on the datasets here ...)
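As a rough sanity check on the figures quoted above, the following sketch works out what "a year of speech" implies per minute and per second. It assumes that "2 terabytes" means 2 x 10^12 bytes and that the 9000 hours is a rounded-up calendar year; neither assumption is stated in the project description, and the variable names are ours.

```python
# Back-of-envelope arithmetic for "a year of speech".
# Assumptions (not from the project description): decimal terabytes,
# and word/byte totals spread evenly over the 9000 hours.

HOURS_PER_YEAR = 365 * 24          # a calendar year of continuous audio

corpus_hours = 9000                # figure quoted in the overview
corpus_words = 100_000_000         # figure quoted in the overview
corpus_bytes = 2 * 10**12          # "2 terabytes", assumed decimal

# Average speaking rate implied by the quoted totals.
words_per_minute = corpus_words / (corpus_hours * 60)

# Average storage cost of one second of audio implied by the totals.
bytes_per_second = corpus_bytes / (corpus_hours * 3600)

print(f"Hours in a calendar year: {HOURS_PER_YEAR}")
print(f"Implied speaking rate:    {words_per_minute:.0f} words per minute")
print(f"Implied audio data rate:  {bytes_per_second:.0f} bytes per second")
```

On these assumptions the totals are mutually consistent: the quoted 9000 hours is close to the 8760 hours in a calendar year, and the implied speaking rate is in the region of three words per second.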

It is impractical for anyone to listen to a year of audio to search for certain words or phrases, or to manually analyze the resulting data. With our methods, such tasks will take just a few seconds. The purposes for which people conduct such searches are very varied, and it is neither possible nor desirable to predict what people might want to look for. Some possibilities are:

1. When did X say Y? For example, "find the video clip where George Bush said 'read my lips'."
2. Are there changes in dialects, or in their social status, that are tied to the new social media?
3. How do arguments work? For example, what strategies do people use to handle interruptions?

Though our experience and research interests happen to be focussed on such matters as intonation, pronunciation differences between dialects, and dialogue modeling, the text-to-speech alignment and search tools produced by the project will open up this "year of speech corpus" for use by a wide variety of researchers interested in, for example, linguistics, phonetics, speech communication, oral history, newsreels, or media studies. Audio-video usage on the Internet is large and growing at an extraordinarily high rate: witness the huge growth of Skype and YouTube (now the second most frequently used search engine in the world). In the multimedia space of Web 2.0, automatic and reliable annotation and searchable indexing of spoken materials would be a "killer app". It is easy to envisage a near-future world in which a search query would return the relevant video clips and data describing the event(s). The techniques we use here could be applied to any material where audio or video is accompanied by a script or transcript, including copyright-controlled broadcast media.

This project has three stages, in all of which we have considerable experience. Each stage extends proven technology; their combination and application on a large scale will open the door to new kinds of language-based research.

First, we transform transcripts in ordinary spelling into phonetic transcriptions; these are then automatically time-aligned to the digital audio recordings. This uses a version of the forced-alignment techniques developed as part of automatic speech recognition research, adapted to deal with disfluencies and transcripts that are sometimes incomplete or inaccurate.
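To make the idea of this first stage concrete, here is a toy sketch of dictionary lookup followed by Viterbi-style forced alignment. It is an illustration only, not the project's aligner: the `LEXICON` entries, the function names and the shape of `frame_scores` are hypothetical, the per-frame scores would in practice come from an acoustic model, and nothing here handles disfluencies or faulty transcripts.

```python
import math

# Hypothetical toy pronouncing dictionary (not the project's lexicon).
LEXICON = {
    "read": ["r", "iy", "d"],
    "my": ["m", "ay"],
    "lips": ["l", "ih", "p", "s"],
}


def to_phones(words):
    """Turn words in ordinary spelling into one flat phone sequence."""
    phones = []
    for word in words:
        phones.extend(LEXICON[word.lower()])
    return phones


def force_align(phones, frame_scores):
    """Align a known phone sequence to audio frames.

    frame_scores[t][p] is an assumed log-likelihood of phone p at frame t.
    Returns (phone, start_frame, end_frame) triples, end exclusive, for the
    highest-scoring segmentation that keeps the phones in order.
    """
    n_frames, n_phones = len(frame_scores), len(phones)
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phone")
    neg_inf = -math.inf
    best = [[neg_inf] * n_phones for _ in range(n_frames)]
    # stayed[t][i] is True if the best path reached (t, i) from the same phone.
    stayed = [[True] * n_phones for _ in range(n_frames)]
    best[0][0] = frame_scores[0][phones[0]]
    for t in range(1, n_frames):
        for i in range(n_phones):
            stay = best[t - 1][i]
            advance = best[t - 1][i - 1] if i > 0 else neg_inf
            prev = max(stay, advance)
            if prev == neg_inf:
                continue  # this cell cannot be reached
            best[t][i] = prev + frame_scores[t][phones[i]]
            stayed[t][i] = stay >= advance
    # Trace back from the last phone at the last frame.
    labels = [0] * n_frames
    i = n_phones - 1
    for t in range(n_frames - 1, -1, -1):
        labels[t] = i
        if t > 0 and not stayed[t][i]:
            i -= 1
    # Collapse per-frame labels into segments.
    segments, start = [], 0
    for t in range(1, n_frames + 1):
        if t == n_frames or labels[t] != labels[t - 1]:
            segments.append((phones[labels[t - 1]], start, t))
            start = t
    return segments


# Invented scores: each frame strongly favours one phone.
truth = ["m", "m", "ay", "ay", "ay"]
demo_scores = [{p: (0.0 if p == q else -5.0) for p in ("m", "ay")} for q in truth]
print(force_align(to_phones(["my"]), demo_scores))
```

Frame indices convert to times by multiplying by the frame step (10 ms is a common choice in speech recognition, assumed here rather than taken from the project).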

Second, we put the time-aligned orthographic and phonetic transcriptions into a database that will allow us (or future researchers) to add further layers of annotation – e.g. marking which meaning of a word is meant – and metadata such as the date and circumstances of the recording. We will also add summaries of acoustic properties that are useful for indexing and data mining.
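One way to picture such a database is the minimal sketch below, which uses Python's built-in sqlite3 module. The schema, table and column names, and the sample rows are all hypothetical and invented for illustration; they are not the project's actual design or data. The sketch stores time-aligned words with their phones, attaches one extra annotation layer, and answers a "when was this phrase said?" query.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with an invented three-word recording."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE recording (
            id INTEGER PRIMARY KEY, speaker TEXT, recorded_on TEXT);
        CREATE TABLE token (
            id INTEGER PRIMARY KEY,
            recording_id INTEGER REFERENCES recording(id),
            position INTEGER,          -- word index within the recording
            word TEXT, phones TEXT,
            start_s REAL, end_s REAL); -- times from forced alignment
        CREATE TABLE annotation (
            token_id INTEGER REFERENCES token(id),
            layer TEXT, value TEXT);   -- further layers, e.g. word sense
        CREATE INDEX token_word ON token(word);
        """
    )
    # Invented metadata and timings, for illustration only.
    con.execute("INSERT INTO recording VALUES (1, 'speaker_A', '1990-01-01')")
    con.executemany(
        "INSERT INTO token (recording_id, position, word, phones, start_s, end_s)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, 0, "read", "r iy d", 0.50, 0.80),
            (1, 1, "my", "m ay", 0.80, 1.00),
            (1, 2, "lips", "l ih p s", 1.00, 1.40),
        ],
    )
    con.execute("INSERT INTO annotation VALUES (3, 'sense', 'body part')")
    return con


def find_phrase(con, phrase):
    """Return (recording_id, start_s, end_s) for each occurrence of phrase."""
    words = phrase.lower().split()
    hits = []
    candidates = con.execute(
        "SELECT recording_id, position, start_s FROM token"
        " WHERE word = ? ORDER BY recording_id, position",
        (words[0],),
    ).fetchall()
    for rec_id, pos, start_s in candidates:
        rows = con.execute(
            "SELECT word, end_s FROM token WHERE recording_id = ?"
            " AND position >= ? AND position < ? ORDER BY position",
            (rec_id, pos, pos + len(words)),
        ).fetchall()
        if [w for w, _ in rows] == words:
            hits.append((rec_id, start_s, rows[-1][1]))
    return hits


con = build_demo_db()
print(find_phrase(con, "read my lips"))
```

Keeping annotations in a separate table keyed by layer name means new layers can be added later without changing the token table, which is the property the paragraph above asks for.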

Third, we will develop a demonstration front-end for accessing the database. Using this, we will seek to understand usage scenarios, what data should be included, and the impact that a larger-scale search engine might have.

This project is a collaboration between the Linguistic Data Consortium, University of Pennsylvania, USA (Director: Professor Mark Y. Liberman) and the Phonetics Laboratory, University of Oxford, UK (Director: Professor John S. Coleman), together with colleagues at the British Library, whose National Sound Archive curates the British English materials used in this project. The US team also includes Jiahong Yuan and Christopher Cieri; the UK team also includes Greg Kochanski, Lou Burnard and Ladan Ravary (from Oxford) and Jonathan Robinson and Joanne Sweeney (from the British Library).

John Coleman, December 2009