Phonetics Laboratory
Faculty of Linguistics, Philology and Phonetics

A British National Corpus Spoken Audio Sampler

This site presents a selection of audio files from the spoken part of the British National Corpus, digitized from the analogue audio cassette tapes deposited at the British Library Sound Archive, together with associated transcription and annotation files created during the Mining a Year of Speech project.

(All of the audio files from the spoken part of the BNC, together with the corresponding Praat TextGrids and HTML transcription files, are now available from http://www.phon.ox.ac.uk/AudioBNC; however, the quality and accuracy of the alignments are not uniform across the whole corpus, so the selection in this sampler may be easier to use in some cases.)

The British Library Sound Archive, in collaboration with Oxford University Phonetics Laboratory, has recently (2009-10) digitized all of the extant tapes, with a view to a full on-line release in the near future. Under the terms of the original recording permissions agreement with the contributors "all tapes and conversation details will be completely anonymous, and will be used for scientific study and publication by writers of dictionaries and educational material and language researchers"; it has therefore been necessary for us to locate and mute all of the portions of the audio corresponding to the anonymization <gap> tags in the TEI-XML editions of the corpus. Since there are over 18,710 <gap> tags in the TEI-XML transcriptions, this task is not yet complete. When it is complete, we plan to provide access via search and browsing tools to stable URIs on the British Library's sound server. In the meantime, we offer this sample via the Phonetics Laboratory website, as a test-bed for researchers and developers. (NB. We have discovered that the extant sound recordings contain only about 7.5 million words, not the 10 million words originally transcribed. There is a substantial number of XML transcription files for which we may no longer have the original audiotapes. Or perhaps we do: we also have quite a few recordings that we haven't yet related to any transcription. So we're still working on resolving that.)
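The muting of anonymized portions described above can be illustrated with a minimal Python sketch. It is not the procedure actually used in the project: the function name mute_intervals, the representation of the audio as a plain list of samples, and the (start, end) times in seconds are all assumptions made for illustration.

```python
def mute_intervals(samples, sample_rate, intervals):
    """Return a copy of `samples` with each (start_s, end_s) interval zeroed.

    samples     -- sequence of audio sample values (hypothetical representation)
    sample_rate -- samples per second
    intervals   -- iterable of (start_s, end_s) times in seconds to silence
    """
    muted = list(samples)
    n = len(muted)
    for start_s, end_s in intervals:
        # Convert seconds to sample indices, clamped to the recording length.
        start = max(0, int(round(start_s * sample_rate)))
        end = min(n, int(round(end_s * sample_rate)))
        for i in range(start, end):
            muted[i] = 0
    return muted
```

In practice the intervals would come from the aligned positions of the <gap> tags, and the samples would be read from and written back to an audio file; both steps are left out here.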

In order to locate anonymization gaps, as well as to index the recordings with all transcribed vowels, consonants, and words, we aligned the text transcriptions to the audio using a forced aligner based on HTK, using a combination of our own acoustic models for British English and American English models from P2FA, the Penn Phonetics Lab Forced Aligner. The alignment procedure yields a best-fitting phonemic transcription of the audio, together with detailed timing information: the start and end time of every vowel, consonant, word, utterance and recording. This data is encoded as Praat TextGrid files, which we also provide in this release. A short paper on the Mining a Year of Speech project can be downloaded from here.
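For readers who want to use the timing information outside Praat, the following Python sketch reads the labelled intervals from a TextGrid. It assumes the file is in Praat's long text format and that labels contain no embedded double quotes; the function name read_intervals is hypothetical, and the tier names found in the released files are not assumed here.

```python
import re

# Tier header line, e.g.  name = "word"
_TIER_NAME = re.compile(r'name = "(?P<name>[^"]*)"')

# One interval in Praat's long text format:
#   intervals [3]:
#       xmin = 0.5
#       xmax = 1.5
#       text = "AY1"
_INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = (?P<xmin>[-+\d.eE]+)\s*'
    r'xmax = (?P<xmax>[-+\d.eE]+)\s*'
    r'text = "(?P<text>[^"]*)"'
)


def read_intervals(textgrid_text):
    """Return {tier_name: [(start_s, end_s, label), ...]} for each tier."""
    tiers = {}
    # Each tier begins with "item [n]:"; the text before the first is the header.
    for chunk in re.split(r'item \[\d+\]:', textgrid_text)[1:]:
        name_match = _TIER_NAME.search(chunk)
        if name_match is None:
            continue
        tiers[name_match.group("name")] = [
            (float(m.group("xmin")), float(m.group("xmax")), m.group("text"))
            for m in _INTERVAL.finditer(chunk)
        ]
    return tiers
```

Given the contents of a TextGrid as a string, read_intervals returns one list of (start, end, label) tuples per tier, from which, for example, the duration of each labelled segment can be computed as end minus start.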

The files in this sample were chosen on the grounds that (a) the accuracy of the transcription alignments is relatively good, and (b) they have either no anonymization gaps or only a few, which have been carefully checked to ensure that the corresponding portions of the audio signal have been accurately muted.

Previous releases of BNC spoken audio material

The BNC spoken audio recordings have been (and still are) available for study by language researchers visiting the British Library Sound Archive in person; however, until our recent digitization project, neither the online catalogue nor the TEI-XML editions of the transcriptions were sufficiently informative for researchers to be able to easily find tapes or portions of interest. By issuing our forced alignment index files, we aim to make the researchers' task substantially easier. A subset of the recordings in the BNC has previously been published on audio CDs as COLT: the Bergen Corpus of London Teenage Language. A smaller sample on audio cassette was distributed by Longman during the BNC collection project (Cassette Sleeve images).

Copyright and access terms

BNC spoken audio recordings were created or collected from other sources by Longman Dictionaries for the British National Corpus Consortium. Their usage is governed by the terms of the original recording permissions agreement with the contributors, which requires that they only be "used for scientific study and publication by writers of dictionaries and educational material and language researchers". Furthermore, by downloading any of the audio recordings, you agree to the terms in sections 2, 6, 7 and 9 of the BNC User Licence (available here), the audio recordings being understood to be among the "spoken texts" included in the "BNC Texts". The supporting annotation and transcription files are Copyright © 2011 The University of Oxford, and are made publicly available under a Creative Commons Attribution License (details here).

Though we do not charge a licence fee for access to or use of the audio recordings, users are required to register when they first access the sound files, via the following form. To avoid this step on future visits to this site, users are advised to bookmark the next page after registering. Additionally, registered users are welcome to link to or directly access the sound files and associated annotation and transcription files listed on the next page.

Registration Form

Please complete this form to access the BNC Spoken Audio Sampler

Email: