Phonetics Laboratory
Faculty of Linguistics, Philology and Phonetics

Audio BNC: the audio edition of the Spoken British National Corpus

John Coleman, Ladan Baghai-Ravary, John Pybus, and Sergio Grau (2012) Audio BNC: the audio edition of the Spoken British National Corpus. Phonetics Laboratory, University of Oxford. http://www.phon.ox.ac.uk/AudioBNC

About the corpus (to skip this description, jump to "Accessing the recordings" below)

This site presents most (but not yet all) of the audio recordings from the spoken part of the British National Corpus, digitized from the analogue audio cassette tapes deposited at the British Library Sound Archive, together with associated transcription and annotation files created in a sequence of projects, especially Mining a Year of Speech and Word joins in real-life speech. Oxford University is responsible for curating and publishing the corpus, and the British Library is responsible for archiving and curating the audio recordings from the BNC and ensuring public access.

The British Library Sound Archive, in collaboration with Oxford University Phonetics Laboratory, digitized all of the extant tapes in its possession in 2009-10. Under the terms of the original recording permissions agreement with the contributors, "all tapes and conversation details will be completely anonymous, and will be used for scientific study and publication by writers of dictionaries and educational material and language researchers"; it has therefore been necessary for us to locate and mute all of the portions of the audio corresponding to the anonymization <gap> tags in the TEI-XML editions of the Spoken BNC. Over 18,710 <gap> tags in the TEI-XML transcriptions have been individually checked to ensure that the anonymization has been carried out correctly. In due course, it is planned to provide long-term access via search and browsing tools to stable URIs. In the meantime, we offer this initial release, partly as a test-bed for researchers and developers, and partly to avoid further delay.

(NB. We have discovered that the extant sound recordings only contain about 7.5 million words, not the 10 million words originally transcribed. There is a substantial number of XML transcription files for which we may no longer have the original audiotapes. Or perhaps we do: we also have quite a few recordings that we haven't yet related to any transcription. So we're still working on resolving that. Also, the audio recordings from the Bergen Corpus of London Teenage Language, a part of the BNC, are not included here, but are available from the University of Bergen.)

In order to locate anonymization gaps, as well as to index the recordings with all transcribed vowels, consonants, and words, we aligned the text transcriptions to the audio using a forced aligner based on HTK, using a combination of our own acoustic models for British English and American English models from P2FA, the Penn Phonetics Lab Forced Aligner. The alignment procedure yields a best-fitting phonemic transcription of the audio, together with detailed timing information: the start and end time of every vowel, consonant, word, utterance and recording. This data is encoded as Praat TextGrid files, which we also provide in this release. A short paper on the Mining a Year of Speech project, under which we began this work, can be downloaded from here.

Previous releases of BNC spoken audio material

The BNC spoken audio recordings have been (and still are) available for study by language researchers visiting the British Library Sound Archive in person; however, until our recent digitization project, neither the online catalogue nor the TEI-XML editions of the transcriptions were sufficiently informative for researchers to be able to easily find tapes or portions of interest. By issuing our forced alignment index files, we aim to make the researchers' task substantially easier. A subset of the recordings in the BNC has previously been published in mp3 format on CD-ROMs as COLT: the Bergen Corpus of London Teenage Language. A smaller sample on audio cassette was distributed by Longman during the BNC collection project (Cassette Sleeve images).

Accessing the recordings

If you wish to access the recordings and associated files, please read the copyright terms below and register using the form at the bottom of this page. Registered users are welcome to link to or directly access the sound files and associated annotation and transcription files.

The audio files are 16-bit, 1-channel (monophonic) .wav files, with sampling rate 16,000 samples per second. Their rather long filenames encode a combination of the British Library's catalogue code, BNC tape number and the 3-character "BNC codes".
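As a quick sanity check after downloading, the stated format can be verified with Python's standard wave module. This is a minimal sketch, not part of the corpus tooling: the function name matches_stated_format is made up for illustration, and it assumes the file has already been saved locally.

```python
import wave

def matches_stated_format(source):
    """Return True if a .wav file is 16-bit, mono, 16,000 samples per second.

    `source` may be a local file path or an open binary file object.
    (Hypothetical helper for illustration; not part of the Audio BNC site.)
    """
    with wave.open(source, "rb") as wav:
        return (
            wav.getsampwidth() == 2        # 2 bytes per sample = 16-bit
            and wav.getnchannels() == 1    # monophonic
            and wav.getframerate() == 16000
        )
```

For example, matches_stated_format("021A-C0897X0225XX-ABZZP0.wav") would check a local copy saved under its original filename.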

Suppose you wish to find the .wav file containing the dialect word "gronnies", which occurs only once in the BNC. From the published BNC, you can find that it occurs in transcription file KBW.xml. (You can also download html versions of these transcription files from here.) Inspection of that transcription shows that the word "gronnies" is in <div> number 022505 (<div n="022505">), which is the 5th <div> in tape number 0225. The XML transcriptions do not record whether it is on the A side or the B side of the tape, but from this information it can be inferred that the required recording is either

http://bnc.phon.ox.ac.uk/data/021A-C0897X0225XX-AAZZP0.wav (A-side) or

http://bnc.phon.ox.ac.uk/data/021A-C0897X0225XX-ABZZP0.wav (B-side). The syntax of these URIs is as follows. There are three slightly different filename formats, for different ranges of tape numbers:

Every file URI starts with the audio server URL, http://bnc.phon.ox.ac.uk/data/, followed by the BL catalogue code, 021A-C0897X. The remainder depends on the tape number:

Tape number                             Fixed text   Side     Fixed text
0004 to 0087, and 0091-0905             XX-A         A or B   ZZP0.wav
For some tapes from 00882 to 00993      X-0          1 or 2   00P0.wav
For some tapes from 097700 to 125500    XX-0         1 or 2   00P0.wav
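The "gronnies" lookup above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are invented here, the split of the <div> number into a 4-digit tape number plus a 2-digit position is inferred from the single example 022505, and only the first filename format (4-digit tape numbers, sides A and B) is covered, since that is the one confirmed by the example URIs.

```python
BASE_URL = "http://bnc.phon.ox.ac.uk/data/"
BL_CODE = "021A-C0897X"

def split_div_number(div_n):
    """Split a 6-digit <div n="..."> value into (tape number, <div> position).

    Assumption: the first 4 digits are the tape number and the last 2 are
    the position of the <div> on that tape, as in the example 022505.
    """
    if len(div_n) != 6 or not div_n.isdigit():
        raise ValueError(f"expected a 6-digit <div> number, got {div_n!r}")
    return div_n[:4], int(div_n[4:])

def candidate_urls(tape):
    """Return the A-side and B-side URIs for a 4-digit tape number.

    Covers only the first filename format; the transcriptions do not say
    which side a <div> is on, so both candidates are returned.
    """
    if len(tape) != 4 or not tape.isdigit():
        raise ValueError(f"expected a 4-digit tape number, got {tape!r}")
    return [f"{BASE_URL}{BL_CODE}{tape}XX-A{side}ZZP0.wav" for side in "AB"]

tape, position = split_div_number("022505")
for url in candidate_urls(tape):
    print(url)
```

Returning both candidates, rather than guessing a side, mirrors the fact that the side has to be settled by listening to or inspecting the recordings themselves.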

You may also obtain some information about tape numbers and their contents from the British Library Sound and Moving Image Catalogue, http://cadensa.bl.uk. (Search for "British National Corpus" and look at items bearing the code C897.)

You can also (optionally) select a specific audio clip by adding to a complete file URI either a start time and end time, or a start time and duration (both in seconds). For example, the following are two ways of referring to the "gronnies" audio clip:

http://bnc.phon.ox.ac.uk/data/021A-C0897X0225XX-ABZZP0.wav?t=2443.4825,2443.8925

http://bnc.phon.ox.ac.uk/data/021A-C0897X0225XX-ABZZP0.wav?t=2443.4825&d=0.41
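The two query forms shown above can be generated with a small helper. This is a sketch only: clip_url is a name invented for this example, and it assumes nothing about the server beyond the t=start,end and t=start&d=duration patterns visible in the two URIs.

```python
def clip_url(file_url, start, end=None, duration=None):
    """Append a clip selection to a complete .wav file URI.

    Exactly one of `end` or `duration` must be given (times in seconds).
    Hypothetical helper; it only reproduces the two URI patterns above.
    """
    if (end is None) == (duration is None):
        raise ValueError("give exactly one of end or duration")
    if end is not None:
        return f"{file_url}?t={start},{end}"
    return f"{file_url}?t={start}&d={duration}"

wav = "http://bnc.phon.ox.ac.uk/data/021A-C0897X0225XX-ABZZP0.wav"
print(clip_url(wav, 2443.4825, end=2443.8925))
print(clip_url(wav, 2443.4825, duration=0.41))
```

The duration is passed in explicitly rather than computed as end minus start, because floating-point subtraction of these times would not print as a clean 0.41.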

HTML versions of the transcriptions, in ordinary spelling, are available from here. Full lists of all .wav, .html and Praat TextGrid annotation files are available from http://bnc.phon.ox.ac.uk/filelist-wav.txt, http://bnc.phon.ox.ac.uk/filelist-html.txt and http://bnc.phon.ox.ac.uk/filelist-textgrid.txt, respectively.

A table of the phone symbols used in the TextGrids is available from http://www.phon.ox.ac.uk/files/docs/BNC_transcription_alphabet.html.

Start and end times of specific phones, words and word-pairs will be provided via index files in the near future. The TextGrid files may be used together with the .wav audio files in the freely-available Praat speech processing package to view or to find selected words, vowels or consonants in each audio file. (For users who are unfamiliar with Praat, a short explanation of how to do this is given here.)
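For users who would rather script the search than use Praat interactively, the following Python sketch looks up a label in a TextGrid. It rests on several assumptions that should be checked against the actual files: that they are in Praat's long text format (with "intervals [n]:" entries), that they have been read into a string with the right encoding, and that matching labels case-insensitively is appropriate. The function name and the sample data are invented for illustration and are not taken from the corpus.

```python
import re

# One interval entry in Praat's long TextGrid text format.
# Inside a label, a double quote is written as two double quotes.
INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = (?P<xmin>[-+\d.eE]+)\s*'
    r'xmax = (?P<xmax>[-+\d.eE]+)\s*'
    r'text = "(?P<text>(?:[^"]|"")*)"'
)

def find_label(textgrid_text, label):
    """Return (start, end) times in seconds of intervals labelled `label`.

    Searches every interval tier in the file, ignoring case.
    """
    wanted = label.casefold()
    hits = []
    for match in INTERVAL.finditer(textgrid_text):
        text = match.group("text").replace('""', '"').strip()
        if text.casefold() == wanted:
            hits.append((float(match.group("xmin")), float(match.group("xmax"))))
    return hits

# Invented sample in long TextGrid format (not real corpus data).
SAMPLE = '''
        intervals [1]:
            xmin = 0
            xmax = 0.5
            text = "HELLO"
        intervals [2]:
            xmin = 0.5
            xmax = 1.25
            text = ""
'''
print(find_label(SAMPLE, "hello"))
```

The start and end times returned this way are in the same units as the t= parameter of the clip URIs, so they could be used to request just the matching stretch of audio.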

In future, we'd like to make search as easy as this demo (only works in some audio-aware browsers, e.g. Firefox 3.6 or later, Safari 5 or later, Opera 10.5 or later, Internet Explorer 9 beta), or browsing as easy as this demo. Please feel free to send us any feedback or comments about these demos or other tools you would find useful.

 

User commentary

 

Saul Albert wrote this blogpost.

 

Copyright and access terms

 

BNC spoken audio recordings were created or collected from other sources by Longman Dictionaries for the British National Corpus Consortium. Their usage is governed by the terms of the original recording permissions agreement with the contributors, which requires that they can only be "used for scientific study and publication by writers of dictionaries and educational material and language researchers". Furthermore, by downloading any of the audio recordings, you agree to the terms in sections 2, 6, 7 and 9 of the BNC User Licence (available here), the audio recordings being understood to be among the "spoken texts" included in the "BNC Texts". The supporting annotation and transcription files are Copyright © 2011 The University of Oxford, and are made publicly available under a Creative Commons Attribution License (details here); if you use these files, you must cite the Audio BNC corpus as follows:

John Coleman, Ladan Baghai-Ravary, John Pybus, and Sergio Grau (2012) Audio BNC: the audio edition of the Spoken British National Corpus. Phonetics Laboratory, University of Oxford. http://www.phon.ox.ac.uk/AudioBNC

Though we do not charge a licence fee for access to or use of the audio recordings, users are required to register when they first access the sound files, using the form below. To avoid this step on future visits to this site, users are advised to bookmark the page that follows registration. Additionally, registered users are welcome to link to or directly access the sound files and associated annotation and transcription files.

If you have registered for access to the BNC Audio Sampler on a previous occasion, please register your access to the full Audio BNC here as well. And please keep us informed about what you've been using it for, or if you discover anything interesting in it (or anything wrong! - there are certainly many errors).

Registration


Please complete this form to access the Audio BNC. These details will be kept securely and not shared with anyone else.

Email:
Name and/or Institution:

Please tell us a little about how you might use it:

Please note that we may contact you by email so that we can understand how the data is being used.