Mining a Year of Speech

University of Oxford

Phonetics Laboratory

Linguistic Data Consortium
University of Pennsylvania

The Spoken BNC: samples of "language in the wild"


The spoken part of the British National Corpus consists of about 1,800 hours (about 10 million words) of unscripted speech. It has two parts of roughly equal size:

a demographic part, consisting of informal talk recorded by a socially stratified sample of respondents, selected by age group, social class and geographic region;
a context-governed part, recorded in more formal situations such as meetings, debates, lectures, seminars, religious services and radio programmes.
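To give a feel for the scale of these figures, the short Python sketch below converts the quoted totals into an average speaking rate and a duration in days. Both totals are the approximate values stated above, so the results are rough. The function names are illustrative only and do not belong to any BNC tool.

```python
# Approximate totals quoted in the text; neither is an exact count.
TOTAL_HOURS = 1_800
TOTAL_WORDS = 10_000_000


def words_per_minute(words: int, hours: float) -> float:
    """Average rate over all recorded time, pauses and silence included."""
    return words / (hours * 60)


def hours_to_days(hours: float) -> float:
    """Length of the audio if it were played back without a break."""
    return hours / 24


rate = words_per_minute(TOTAL_WORDS, TOTAL_HOURS)
days = hours_to_days(TOTAL_HOURS)

print(f"{rate:.1f} words per minute of recording")
print(f"{days:.0f} days of continuous audio")
```

Because the rate is averaged over the whole recording time, including stretches in which nobody is speaking, it should not be read as a measure of how fast individual speakers talk.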

For the demographic part, random location sampling procedures were used to recruit 124 people aged over 15 from across the United Kingdom, with approximately equal numbers of men and women, from each of five age groups and four social classes. Each recruit used a portable tape recorder to record their own speech and the speech of people they conversed with over a period of up to a week. Recordings of people under 16 were contributed to the BNC as part of the University of Bergen COLT (Corpus of London Teenage Language) project, using the same recording methodology.
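The stratification described above can be sketched as a grid of sampling cells. The Python example below is an illustration under stated assumptions rather than the BNC's actual procedure: it assumes that sex, age group and social class were fully crossed, it leaves geographic region out, and it uses placeholder labels because the text does not name the age bands or social classes.

```python
from itertools import product

# Placeholder labels: the text gives only the number of groups, not their names.
SEXES = ("female", "male")
AGE_GROUPS = tuple(f"age_group_{i}" for i in range(1, 6))    # five age groups
SOCIAL_CLASSES = tuple(f"class_{i}" for i in range(1, 5))    # four social classes
RECRUITS = 124  # figure quoted in the text

# Assumption: every combination of sex, age group and class forms one cell.
cells = list(product(SEXES, AGE_GROUPS, SOCIAL_CLASSES))
per_cell = RECRUITS / len(cells)

print(f"{len(cells)} cells, about {per_cell:.1f} recruits per cell")
```

Under these assumptions each cell holds only about three recruits, which is one reason each recruit recorded for up to a week and captured the people they spoke with as well as themselves.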

The demographic part is a vast treasure-house of "language in the wild", and is about as close as it is possible to get (without covert recording) to "real speech". Here are a few samples:

John Coleman, December 2009