ICT Tools for Searching, Annotation and Analysis of Audiovisual Media


4 Appendix C. Researchers: practices, possibilities and expectations

The preceding technology survey described technology capabilities. This section considers the ways in which the needs of arts and humanities researchers might be satisfied by some of the technologies surveyed. The section begins with a brief snapshot of the ways in which audiovisual media are currently used within the arts and humanities. It then presents the results of our qualitative study of researcher needs, using scenarios to demonstrate how some of the technologies discussed earlier might meet those needs.

4.1 Snapshot of Current Humanities Uses of Audiovisual Media

Table 1 illustrates some of the ways that audiovisual media are currently being used within the arts and humanities. (The numbering relates to the examples that follow.) The following examples may provide useful illustration: [n.b. the examples were selected to show the range of possibilities and do not reflect any assessment of their academic merit]

1. Linguistics corpus: the IViE (Intonational Variation in English) project investigated cross-varietal and stylistic variation in English intonation using self-recorded data from nine urban dialects of British English from both male and female speakers (Grabe, 2003).

2. Oral history interviews: the Childhood in Russia 1890-1991: a Social and Cultural History Project (Kelly et al., 2004) has collected a variety of detailed, tape-recorded interviews with informants from a range of different generations.

3. Auditory archaeology: There is an experimental auditory archaeology (Witmore, 2005) project at the Catalhoyuk site. Volunteers engage in activities such as sweeping, polishing walls, making plaster, applying plaster and repairing a platform within an experimental house and do so using materials and techniques supported by Neolithic-era evidence. They wear binaural microphones to record the sounds reaching their ears, creating stereo digital recordings which are transferred to a PC setup for storage and analysis (Mills, 2004).

4. Linguistics corpus: the British National Corpus is an approximately 100 million word corpus of present-day British English from a variety of sources, containing both a spoken and written component (University of Oxford, 2005). It has acted as a found resource in a number of studies, such as a study investigating variation in vocabulary usage according to gender, age and social group (Rayson, 1997).

5. Films/television/radio: uses of such media as a found resource include a project comparing British and German film propaganda during the Second World War (Fox, 2003). Popular and/or high culture as found on commercial CDs and DVDs forms a large part of the primary materials for numerous researchers.

Table 1. Uses of audiovisual media within the arts and humanities
	Research resource	Work record	Research outputs and/or dissemination	Teaching and Other
Self-recorded or constructed	E.g., linguistics corpus (1), oral history interviews (2), auditory archaeology recordings (3)	E.g., archaeological excavation recordings (7), raw anthropological/documentary footage (8)	E.g., multimedia archives created for use by researchers (9), technical/scholarly/popular presentations of research results involving multimedia (10,11)	E.g., phonetics sound examples for class, tutorial exemplars of form
Found	E.g., linguistics corpus (4), films/television/radio for historical or cultural analysis (5), poetry readings (6)			E.g., clip examples for teaching, examples illustrating responses to media questions, contributing to external projects

6. Poetry performances: the study of poetry readings (and of silent readings of poems) is discussed in How to Read a Poetry Reading: Reading the Reading (Middleton, 2003). This paper is linked to the British Electronic Poetry Centre (BEPC, 2004) which provides information on poets, their publications and some audio files demonstrating their work (a potential found resource for future research).

7. Archaeological excavation recordings: the long-term Poggio Imperiale project was associated with the creation of an archaeological and monumental park investigating a hilltop to the west of the Italian town Poggibonsi. Digging was filmed with a video camera and the resulting recordings were assembled and edited using desktop computers (e.g. with QuickTime software) (Archaeological Computing 1996). Another excavation, at Catalhoyuk in Turkey, also has a project investigating video recording within archaeology (Cee, 1996).

8. Some audiovisual researchers are practitioners, creating new research in an electronic medium. Their raw, intermediate results represent a form of work record from which new products are distilled.

9. Documentary footage: the Designing Shakespeare audiovisual database (AHDS 2005) includes a collection of researcher-recorded video interviews with designers, in addition to a text database of production details and theatre review excerpts, a collection of production photographs and VRML theatre models.

10. Archaeology walkthroughs: the use of visualisation tools such as pre-computed video walkthroughs is discussed in the technology paper An Interactive Photo-Realistic Visualisation System for Archaeological Sites (Chalmers, 1996).

11. Scholarly publications including audiovisual components: the Sphakia Survey was an interdisciplinary archaeological project which attempted to reconstruct the sequence of human activity in a remote part of Crete (Greece) between 3000 BC and AD 1900. The project made a 50-minute video about the Survey (Nixon & Price, 2000). Although primarily for use in university classes, it was also used to report to general audiences through national television networks in Greece and elsewhere and through distribution of individual copies (Nixon & Price, 2004).

12. Communications for public consumption: academics often contribute to broadcast media productions such as The British Empire in Colour (BFI 2002), as seen in an interview with the production team (Luscombe, 2002).

4.2 User Needs Study

The goal of the user needs study was to determine ways in which new and emerging tools for time-based AV analysis, annotation and search might aid humanities researchers, either by facilitating their exploration of conventional research questions or by enabling them to ask new research questions.

4.2.1 Methodology

Interviews with academic researchers in the humanities were conducted in three phases from October 2005 to July 2006:

Phase 1 aimed to interview one person per humanities field, using the AHRC Research Subject Coverage for guidance (AHRC 2003). Phase 1 interviews were loosely structured using an interview questionnaire, supported by PowerPoint props demonstrating screenshots of the following tools uncovered in the early phases of the project:

BLINKX: a live system supporting browsing and free text search for AV on the web (Blinkx 2006)
ANSES: a demo interface for news summarisation including automatically extracted organisations, people, locations and dates (Pickering, 2006)
FERRET: a meeting browser tool (IDIAP, 2006)
MULTIMODAL ANNOTATION TOOL: a manual annotation tool for video including associated soundtrack (Adams 2002)

The initial interview questions explored a researcher's current usage of time-based AV and any difficulties they experience when working with AV. The interview then moved to a short demonstration of tooling possibilities using the PowerPoint props. The tools shown were selected to stimulate exploratory discussion about user needs such as access to AV on the Web or in archives (BLINKX, ANSES), non-linear access to AV (FERRET) and annotation of AV data (MULTIMODAL ANNOTATION TOOL). The tools demonstrated were chosen because of their online accessibility, rather than any criterion reflecting technical merit or humanities-specific design; this meant a handout containing the appropriate links could be distributed at the end of the interview, enabling an interested researcher to explore further if they wished. Discussion was not limited to these classes of tools and often led to quite unexpected tooling suggestions that reflected the needs of individual researchers.

Phase 2 aimed to interview modern historians whose web presence suggested audiovisual data might be a potential resource (even if not currently used). Phase 2 interviews were aimed at gaining a more detailed understanding of the work process of researchers in one specific field, extracting information about their use of resources by asking researchers to talk through a typical research project prior to the exploratory discussion of the PowerPoint props. This information clarified the ways in which some of the tools under consideration might fit into the research process.

Phase 3 interviews concentrated on researchers whose primary interest was in audiovisual material, such as films in popular culture, music within films, video games, or general musicology. The interviews were conducted using a set repertoire of questions on audiovisual media usage and research practice, garnered from experience with the first two phases. Technological tools were described orally, drawing on knowledge of the literature, as called for in the situation.

Additional interviews were conducted with the BUFVC (British Universities Film and Video Council), a creative video artist, a technologist collaborator of a music researcher and a Modern Languages IT Manager. These interviews provided useful background for interpreting the main interview results.

4.2.2 Institutions represented

Interviewees came from the RSAMD, Glasgow School of Art, Royal Holloway and Goldsmiths, University of London, and the Universities of Oxford, Reading, York, Sheffield, Manchester, Edinburgh, Glasgow, Kent and Lancaster.

4.2.3 Subjects represented

Phase 1 interviewees were drawn from a wide variety of humanities fields. The creative arts and art history, law and philosophy were not represented. Phase 2 interviewees were all drawn from modern history. Phase 3 concentrated on researchers whose work focussed on music- and video-as-artefact.

4.2.4 Limitations of study

The study is small-scale and qualitative, with no verification of results. It covers only a small sample of academics from a small number of institutions, so cannot be assumed to be representative of the whole UK humanities community. There is an additional bias towards lecturers and professors, rather than graduates and research fellows, which may be reflected in the current research approaches and ICT uses that are discussed in the scenarios.

The use of canned (pre-stored) screenshots rather than live demos as interview props allowed interviews to be kept within an average of one hour, which was felt to be the maximum that could reasonably be asked of researchers who already have heavy workloads and are being asked to participate voluntarily. Their comments therefore reflect their interpretation of the canned demos rather than any practical assessment of the tools. Specific misconceptions that arose are discussed in Section 4.4, Technical Expectations. The props used also demonstrate general-purpose tools rather than those designed for humanities research purposes or operating on humanities-relevant data. This required interviewees to extrapolate to their own situations: however, on balance, this approach appeared to be quite effective, perhaps more so than if the interviewers had been placed in the position of selling humanities-specific tools.

There were occasional communication difficulties during the interviews due to differences in language usage by humanities researchers and the engineering-educated interviewers. Any errors of interpretation are the responsibility of the interviewers.

4.3 Interview Results

The combined results of all the interviews have been used to generate the following quasi-fictional scenarios. These summarise uses of analysis, search and annotation tools that were suggested but also mention some of the associated challenges to deployment (technological or otherwise).

We use the following notation:

[1] quote extracted from interview transcription;

[2] minor rearrangement or modification of the transcribed words of an interviewee for clarity, brevity and/or to make anonymous by substituting for one or more [identifying phrases];

[3] paraphrase constructed from handwritten interview notes;

[4] quote extracted from email exchange;

Composite scenarios are constructed from multiple interviews and/or e-mail exchanges.

4.3.1 Obtaining research resources

4.3.1.1 Self Recorded

Composite Scenario: Researchers X, Y and Z all have sets of cassette recordings sitting on the shelves of their offices, from earlier work collecting oral history interviews, ethnographic interviews and data collected for a sociolinguistic study. They would like to convert them into digital form to avoid problems of further cassette degeneration and so that they are more easily accessible for their own reuse; X would also like to send his interviews to transcribers in digital form and ultimately to put his collection online (after removing any confidential sections or interviews that the participants do not wish to be made public) in order to encourage reuse by the wider research community. They would like help or instruction about how this should be done.

Scenario: X is a documentary filmmaker who, during a preparatory visit to the documentary site, brought a DV camera along on a visit to the tourist officer for the town he was profiling. "There's the very important interview that I probably will end up using ... I was just going to talk for an hour to the head of tourism [I thought,] 'Oh, I'll take my camera along,' [and I asked] 'Do you mind if I shoot?' 'Oh, go ahead,' and so then I shot it, and then I had it on video, which I wouldn't have had otherwise, and now it becomes a different sort of resource. That's pretty amazing that you can do that these days, with smallish equipment so it's not intimidating to people ... So technology, the way that it's developed, has worked much more closely with my own methodologies, interests, and the way that I like to film. It's made it a heck of a lot easier." [2]

4.3.1.2 Found Data

4.3.1.2.1 Online AV and Web AV Search Tools

Scenario: X is writing a textbook about a diaspora community that has spread around the world, including into the UK. He has a well-established, theme-organised index card system for recording information about useful sources he has located, whether in archives in the UK or abroad or on the Web. He is familiar with search engines for text: "One of the interesting things about Google is that you've no idea what you'll find." [1] He's recently come across some of the new search engines for online audio and video, and they will fit readily into his research process with a low learning curve since the interface is similar to those he has seen for text-based web search engines. Since they index contemporary news information and podcast information, they are immediately useful for his textbook project: a brief search turns up newscasts about the role played by community members in recent factory strikes and a series of podcasts from individuals discussing community issues. He wonders though about how audiovisual data might fit into the dissemination process: "If I found relevant video, I'd then have to transcribe what they say onto an old-fashioned medium: I don't think many academic publishers have the notion of linking books to multimedia, so you'd be working with transcripts." [2] However, he notes that "What one can find is only as good as what is put in" [1] on the Web. Also, the audiovisual search engines won't be useful to some of the other research questions he's exploring, since these relate to prominent historical figures captured primarily in text media and photographic images: "[time-based] audio and video are not a particularly useful resource for this because the other sources of evidence are very strong and deal with the things I'm really interested in." [2]

Scenario: X is a modern historian exploring the responses of individuals from other cultures to the events of 11 September 2001. He began his research by exploring text-based blogs but has now realised that search engines for audio and video blogs could provide him with additional sources.

Scenario: X researches the modern history and politics of an eastern country. His projects are both historically-based in the classic sense but also about the impact of history upon the contemporary [country being studied]. [2] The historical part of his work tends to involve things which are thought of as standard for historians, e.g. going to archives in various countries particularly in [the country] but not exclusively in [the country] to find materials which are stored there, unpublished materials as well as using things that are available in the collections and libraries around the world. I also go to places to do interviews with some of the people who are involved in contemporary politics and also use resource bases such as newspaper holdings and occasionally some contemporary cultural uses of history such as films, lectures, TV shows which are often on DVD. The majority is published or unpublished written sources. [2]

He sees the potential in video search engines (already available for the country he is interested in, in the major language) in looking for news broadcasts and contemporary recordings from his desktop. However, there are other types of AV material he would find useful. "If there was a sort of historical searchability or old newsreel footage or whatever put up on the Web, that would certainly be very useful. The contemporary stuff would be more than enough for my contemporary work but not for a historical project. One of the things I am looking at for my new book is the way in which propaganda was shaped in [the country] during wartime. The vast majority of the material I am using will end up being mostly if not all newspaper and print stuff because that's what survives and is relatively easy to get hold of in archives and so forth; if there was somewhere you could actually look at say wartime cinema propaganda broadcasts or radio broadcasts then that would be a really useful addition to that. At the moment I'm not really aware of any user-friendly way in which that could be done: there probably are stores and stores of this sort of thing in some archive in [the country] but the effort ... unless it was really your specific subject that you really wanted to push I think most people would think that life is short and you wouldn't make that extra effort whereas if I was able to call it up as a resource in the way you can do a digitised newspaper or something of that sort then obviously it would be more attractive." [2]

Scenario: X has a theology background. "Audio and video isn't used as a matter of course, but certain areas may use it. For example, historical theology has links with archaeology and may use simulation and video modelling techniques. It may also have a place in sociological approaches to theology, perhaps contemporary theology or pastoral theology." [1] Its use remains unusual in his opinion. However, it has potential application. "In contemporary theology, so much primary research material is generated through the popular media and popular culture. Or the studies of how liturgy and how theology happens, studies of how it happens in particular contexts. Recordings of liturgical events as ritual." [2] The advent of AV search tools combined with web-based collections that can be accessed from the researcher's desktop may prompt new research: "The average theologian will not be able to think of any collection that could be indexed; however, he may be able to think of lots of uses for someone else's collection. It's the combination of search tools plus collection that sparks ideas. New kinds of access to an existing collection in digital form could make it yet another primary resource useful for research e.g. to explore questions relating to new religious movements or sociology of religion, contemporary more social science aspects of theology. This technology could be a way of encouraging reuse (perhaps by licensing) of collections. The comments that this might be useful are very much based on the assumption that there would be online, desktop access to these collections though." [2] He concedes that the application to work in fields such as early Christianity is not quite obvious. [1]

Scenario: X is analysing the culture and history of a non-European region. He doesn't have easy access to the region's time-based media such as films or television from his base in the UK. "Archives in [the region] are not easily accessible to outsiders and, quite apart from secrecy, I'm not sure how much audiovisual data is well preserved." [3] He has accumulated his own collection of films by purchasing them when on holiday in the area (usually on video cassette) and stores these on shelves in his office. He typically digitises these films using his own hardware and, where necessary, subtitles them into English and/or extracts clips of interest for research or teaching using a tool such as iMovie. He also uses a variety of other resources, including periodicals, articles on the Web and so on. [3]

The research questions he addresses are varied. In some pieces of research he analyses known films as a whole e.g. understanding the narratives. The set of films he could consider would be expanded if and when the region's audiovisual outputs become available online, whether for purchase, streaming or some other access mechanism. In other pieces of research he addresses questions which may be answered using a variety of sources of evidence. For example, his current research project is considering issues associated with the commercialisation of religion and may draw upon sources including religious imagery, advertisements and televised discussions. He immediately sees the potential for audio and video search engines for helping the latter kind of project, enabling a more efficient search for relevant clips on the Web e.g. through queries suggestive of advertisements, religious programming and reality TV, televised discussions about consumption, [3] and he would be willing to be creative in coming up with queries in order to find useful data. However, this potential will only be realised when sources from his region of interest are put online and when search engines become available for data in the corresponding language.

Scenario: X is interested in discourse differences across a number of non-Western countries and is currently exploring issues relating to visual grammar and reception. His research begins with a process of dataset construction, which requires him to locate sources of moving image data and then filter that data in order to find instances of desired events in the soundtrack or leading imagery, such as clips showing a weapon or alluding to a weapon. These instances form the dataset for his research. At present, he primarily obtains data through off-air recordings (e.g. made by colleagues in the region) or from the few academic, area-specific websites online: more online sources of data from the area would benefit his research, particularly if easily locatable through search tools. The filtering process is currently very time-consuming, requiring a full viewing: search tools that could help him identify relevant events within videos would be very helpful in speeding up this filtering process. He notes that the search would not need to operate perfectly because although he needs to find a number of events it is not essential to discover all of them. The envisaged search tools could be web-based and allow him to search within AV on the web, combining the location and filtering stages of dataset construction; an alternative would be a package that could index his already downloaded data collection and speed the filtering stage. In either case, the tools would need to support search in his language of interest and perhaps an image-related search as well as a free text search. 
Since these envisaged tools do not currently exist in a packaged form, he sees manual tools as a potentially useful and available alternative for the filtering step: a tool such as the IBM annotation tool could support the process of marking up and categorising soundtrack segments or image regions and this could be combined with a viewer tool which supports the recall of items in the same category (e.g. the category of clips showing a weapon). Such viewers could be straightforwardly developed for certain formats, rights permitting. The researcher might also investigate annotation tools such as MediaMatrix (MATRIX 2005) and similar tools being developed by the social science community.

Scenario: X is interested in issues involving a contemporary composer and in the reception of 20th-century music. Although he doesn't currently make extensive use of spoken word audio or video, the new search engines for contemporary news and entertainment radio and television on the Web may offer access to relevant interview and performance review data from his desk.

4.3.1.2.2 AV Archive Browsing and Search

Scenario: X is interested in studying different performances of a play which are stored in an audiovisual archive. The archive is experimenting with a new display facility that displays random clips from the collection when a researcher works through the catalogue. The researcher happens to spot that some of these random clips show audience-related, rather than performance-related, information. This serendipitous discovery wouldn't have occurred with their traditional text catalogue and leads him to investigate audience changes in theatre performances over time.

Scenario: X is a modern historian investigating the social history of an English-speaking country outside the UK. He mainly uses traditional archives, but sometimes uses tools such as Google Image Search to locate images that bring events to life for students. He doesn't currently make much use of time-based media. Very occasionally he'll go to archives and read their transcripts of potentially relevant video and he has independently accumulated a few documentaries on prominent political figures in that country, but he finds it takes a lot of time to get a little way with video. "It takes time to find the video, particularly when the collection is not online and I have to travel abroad to go to the archive, and it takes time to find what's relevant ... It's much easier to scan text to find relevant sections than it is for video without transcripts." [3] He has analysed some propaganda videos in the past, though, and certainly sees uses for audiovisual data in the future if it became more accessible.

The ability to do a free text search within a single archive collection of spoken word data might encourage him to use collections which have not been transcribed, particularly if results are cued up around the query terms and linear scanning of full tapes is not required. Even better would be to search for archived or other audiovisual research data via the web from his UK office, particularly where systems return clips which are cued up to the relevant point, but he observes that many of the current web AV search systems do not index the kind of data that he needs for his research. Because he investigates mid-century social history from the bottom-up, he would be interested in recordings from that era involving the people e.g. speeches by the regional mayor or activists or the unedited raw footage collected by production companies might be useful. [3] If this kind of data became more accessible, it would not just provide an additional source of evidence in answering existing research questions; he envisages addressing new questions such as televisual representation or comparative studies of representations of things in television, text and pictures. The latter would be interesting because, in his experience, most research in his area today cites national newspapers rather than national television stations, even though more people watch the latter and it is arguably more influential.
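The idea of search results that are "cued up around the query terms" can be illustrated with a minimal sketch. It assumes that each recording already has a time-stamped transcript (for example from speech recognition); the Segment structure, the search function, the five-second lead-in and the sample data are all hypothetical and are not taken from any of the tools surveyed.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One time-stamped stretch of a (hypothetical) transcript."""
    recording: str   # identifier of the tape or file
    start_s: float   # offset from the start of the recording, in seconds
    text: str        # recognised or hand-typed words


def search(segments, query, lead_in_s=5.0):
    """Return (recording, cue_time_s, text) for segments containing every query word.

    The cue time is placed slightly before the matching segment so that
    playback begins with some context rather than mid-sentence.
    """
    words = query.lower().split()
    hits = []
    for seg in segments:
        tokens = seg.text.lower().split()
        if all(w in tokens for w in words):
            cue = max(0.0, seg.start_s - lead_in_s)
            hits.append((seg.recording, cue, seg.text))
    return hits


# Invented sample data, for illustration only.
ARCHIVE = [
    Segment("tape_01", 0.0, "welcome to the town hall meeting"),
    Segment("tape_01", 312.0, "the mayor spoke about the factory strike"),
    Segment("tape_02", 3.0, "the strike ended after three weeks"),
]

for recording, cue, text in search(ARCHIVE, "strike"):
    print(f"{recording} @ {cue:.0f}s: {text}")
```

A real system would need tolerant matching (recognition errors, punctuation, word forms) and support for the language of the collection; the sketch only shows why a time-aligned index removes the need for linear scanning of full tapes.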

Scenario: X is a film and television researcher. He typically makes use of a number of resources. These include written sources such as encyclopaedias and so on: These are still key to research in this area since they are widely available. [3] He also makes use of many other resources, such as his extensive department library of VHS recordings of films and television, digitised resources such as databases about TV programming and scheduling and numerous Web-based sources of information on film and TV (academic and otherwise).

For some projects, he starts with a constrained dataset such as the works of a particular scriptwriter, which he watches carefully in order to construct a thesis and then revisits in order to find evidence to support or rework that thesis. However, there are other situations where he has broader information requirements. For example, he is interested in exploring the influence of another country's television programmes upon UK television. The archives of UK broadcasters contain useful resources for this project, but these are not currently accessible by everyone. Fortunately, he's able to make use of his contacts to gain access to one such archive and can use their very detailed internal catalogue in order to find television programmes covering relevant topics. Finding data through archives, though, is an art and having an inside contact that can help is valuable academic currency. Such contacts are not available to everyone at all archives and so facilities for archive search are potentially powerful for improving access to some collections. [3] He also notes his work would be greatly facilitated by a UK copyright deposit requirement for audiovisual data, as exists for books: without this, research is not able to address the full breadth of data which is produced, only that which is accessible.

4.3.1.2.3 Commercially/professionally available AV

For a certain class of researcher, there is a great deal of stock placed in one's personal, commercially-obtained collection of CDs or DVDs. The stereotypical researcher in this class is concerned with popular culture and the audiovisual material in itself, but is not limited to it. The collection tends to be made up of personal purchases, enabling the resource to follow the researcher from institution to institution. There is a secondary interest in broadcast media, both as a source and as an inspiration.

Scenario: X is a researcher in a music department, focussing upon film music in contemporary art cinema. She investigates films from the mid-20th century to films of a couple of years ago. One film in particular from the 1950s has caused a certain degree of frustration, both in terms of the detailed scholarship required to track down sources, and because of the rewriting of history on the part of the film studio, which has released alternately uncensored and censored versions of the original film. Different versions are available at different times, and new releases can completely supplant older versions. The sonic and timing differences between PAL and NTSC releases (4% in length, or a semitone in pitch) are considered an occupational hazard in her field.
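The size of the PAL/NTSC discrepancy can be checked with a short calculation. The sketch below assumes the usual explanation, which is not stated in the interviews: a film shot at 24 frames per second is played at 25 frames per second on a PAL release, while an NTSC release stays close to the original speed. The function and constant names are illustrative only.

```python
import math

FILM_FPS = 24.0  # assumed frame rate of the original film
PAL_FPS = 25.0   # assumed playback rate of a sped-up PAL transfer


def speedup_effects(source_fps, playback_fps):
    """Return (percentage reduction in running time, pitch rise in semitones)."""
    ratio = playback_fps / source_fps
    shorter_pct = (1.0 - 1.0 / ratio) * 100.0
    # An equal-tempered semitone is a frequency ratio of 2 ** (1 / 12).
    semitones = 12.0 * math.log2(ratio)
    return shorter_pct, semitones


shorter_pct, semitones = speedup_effects(FILM_FPS, PAL_FPS)
print(f"running time shorter by {shorter_pct:.1f}%")
print(f"pitch higher by {semitones:.2f} semitones")
```

Under these assumptions the running time shrinks by 4% and the pitch rises by roughly 0.7 of a semitone, unless the transfer was pitch-corrected; this is consistent with the rough figures quoted in the scenario.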

Scenario: X is a composer, researcher, and theorist in a music department, who, drawing upon theories of play and the immediate erotic, often allows himself to be seduced by a new piece or a new hearing through broadcast media. "It's the chance meeting, it's the glance that catches your eye, that turns your head. So the radio's very powerful, and I owe [BBC] Radio 3 an enormous amount for setting extraordinary things before me. For instance, I heard [a composition by a British composer]; [he] never set German very often, and never set Goethe, with one exception ... there it was, [and my reaction was] 'This is exactly what I need to hear, right now.'" [2] Although the source is a commercial recording, there was never an explicit search for what ended up being an important piece for his research and writing.

Composite Scenario: X, Y and Z are all contemporary culture researchers who find some proportion of their material as video on DVD. The limited DRM system that governs DVD (the Content Scramble System, CSS), which theoretically prevents an unlicensed device from viewing copies (but not from making them), was defeated years ago but, due to legislation throughout the western world, disseminating information about such things is illegal. Although all of the researchers had legitimate reasons for wanting to view transcoded video from DVDs, they were stymied by the fact that the tools for doing so had been driven underground, out of the mainstream. Much was made of the fact that these technological barriers now exist where there were none before. "When I ask the AV services to edit an extract from a DVD for use in the classroom, they always ask, 'Do you have a videocassette?' because, even though they have the technical know-how, they find it a headache to deal with." [3] Future digital distribution formats promise not just headaches but real barriers to normal scholarly use.

4.3.2 Data preparation

Composite Scenario: X is a historian. "I don't use time-based media much at present; when I do, I use a notebook to generate a rough transcript such as '2 minutes: event Z happens' and then I can fast forward around if I need to revisit sections. If Web search tools for audiovisual data become available and the data I am interested in is online and exposed to these tools then I might use manual annotation tools on that data; to mark points and be able to jump back into those points could be helpful. But the search tools and relevant online data are key; the other tools just make things faster when working with AV." [3] Researcher Y is a film and television researcher who has similar needs. "It would be useful to have a facility which would let one mark and categorise short clips for easy revisiting or for exporting, e.g. for playing in lectures." [3] A manual annotation tool with an adjustable lexicon of annotation categories, combined with a viewing tool for revisiting annotated categories (as discussed in Section 3.3.2), could fulfil the bookmarking/revisiting function for some data formats; clip extraction is supported by existing tools for some formats, though the right to do so may need investigation.

Composite Scenario: X and Y use popular culture as their subjects, whether in audio or video. After an initial, impression-gathering viewing of the material, both draw up a rough timeline of the interesting and notable elements in the audiovisual material. There is not always a timecode-accurate notation of the start time, and it is rare that a shot-by-shot or event-by-event notation is made, but there is a fine level of granularity. "I will make notes on paper, with timing going down one column, and with other columns including scene, some narrative, important dialogue and music. I'll draw a five-line staff and transcribe important themes. Often I will be rushing between the pause button, my notes, the piano, and my flute when making my timing tables." [3]

Z obtains a lot of original material to be edited together later. In order to prepare his material, he relies on a more traditional methodology: he logs tape in his (digital) editing suite. He doesn't resent this often-laborious process (five to ten hours per hour of raw footage), as it gives him a chance to reflect upon the materials he has gathered. What he would welcome, however, is some way of automatically transcribing the speech from the video. As his subjects include non-native speakers and many interviews are conducted in the field (with field noise), his is a wish not likely to be realised in the near term.

Composite Scenario: Researchers who collect recordings for area studies, oral history and ethnographies often spend a significant amount of time or money generating transcriptions, as illustrated by the following two examples:

"We used audio recording ... the team included a stills photographer, we did not have funds for anything fancier and we felt comfortable and flexible with audio, because for example, we were often interviewing in public places where it would be difficult to use audiovisual recording such as cafes, pubs, hotels and noisy environments or in people's houses where people are five to a room ... we used a cassette recorder ... Since I am fluent in [the local language], virtually all interviews were conducted in [that language]; one or two subjects were more comfortable in another language and so one or two translators were sometimes used, making the interview a two or three-way conversation ... In addition to language issues, because not everyone is comfortable in [the researcher's language of choice], there are personal issues when interviewing traumatised people e.g. some women's voices descend to a whisper and become very difficult to hear. There is a very long period of time in terms of transcribing the tapes, because transcribers have to go backwards and forwards and backwards and forwards with people often speaking fast, or speaking in slang, and in mixed languages ... Transcription is a very lengthy arduous process and I have to go over the transcripts, because some of the material gets garbled because transcribers put down words which are clearly not correct because they don't recognise them or haven't been able to make them out but I have handwritten interview notes that I can check things against and it is an extremely lengthy process to accuracy ... It's a high-density activity when compared to going into an archive and taking notes on someone's letter, which you could either Xerox or translate if you are never going to see it again.
"It would be nice to have a package that I could feed in a tape and get a preliminary printout, but I feel this is very unlikely at present because of the [foreign] language issue ... Once I have the transcript on the computer, I can search for keywords and index everything according to key themes that I have drawn out and these themes form the basis of an article." [2]

"The resources I use are a mix of what historians would consider to be very traditional sources, archive sources such as looking at Colonial Office papers, newspapers of the day, pamphlets, diaries, letters etc. The other part is using what is still considered non-traditional resources, which is oral history material. I will be going out and interviewing people who were involved in the period or living through [the period of interest]. I tape record everything. I choose audio recording for flexibility but it is also much less intimidating. The tape is unobtrusive. A typical interview lasts two or three hours and in some cases I will go back again and ask more questions, anything from two or three hours to six to eight hours of recorded material. The data can have an emotional range ... you invariably stir up memories ... the whole spectrum of emotions. And people get tired and children don't concentrate. And some interviews are recorded outdoors, so you've got wind etc. The other problem is corrugated iron roofs. Because of the roofs I can't interview when it rains, there are also tree frogs at night ... The tree frogs give you a hum all the way through, the air conditioning can be the same, making nothing come out on the recording ... Even if you're indoors the windows are open so you've got dogs barking or roosters crowing. I had one where this ruddy dog just didn't stop barking! Afterwards, and this is where I'm always looking for technology to help, there are two technical problems. Firstly transcribing, so a voice recognition system that could cope with [the local, highly accented] English would be brilliant but at the moment there is not anything sophisticated enough to do that as far as I know.
"And the second technical problem, trying to work with a qualitative database containing a very large number of interviews ... but if you choose a sample right, the full set of interviews doesn't necessarily give you anything more than you'll get out of a smaller but more manageable subset of interviews. I use a professional transcriber; one of mine is very familiar now with [the local accent]. Then, in some sense, reading transcripts is no different from interpreting archive work." [2]

Transcription problems are also faced by linguists, who develop detailed time-marked word or phonetic transcriptions for found data using tools such as wavesurfer or simply a pencil and paper:

"I'm studying recordings of English dialects ... Transcription is difficult, word and phonetic with time alignment ... locating the word boundaries is not an issue, because I can look at the spectrogram in wavesurfer, it is the transcription that takes time ... When manual transcription is necessary, it is very expensive. Phonetic transcription speech takes 200-500 seconds of work to transcribe each second of speech. Manually transcribing a one-minute conversation, for example, can take one man-day. I would really like automatic transcription tools for word transcription and phonetic transcription at a fine phonetic level." [3]
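
The effort figures quoted above can be turned into a rough budget. The following is a minimal sketch that simply multiplies out the interviewee's range of 200-500 seconds of work per second of speech; the function name is hypothetical, and the figures are the interviewee's estimates rather than measured values.

```python
def transcription_hours(speech_seconds: float, work_per_second: float) -> float:
    """Hours of manual work needed, given seconds of work per second of speech."""
    return speech_seconds * work_per_second / 3600

# One minute of conversation at the quoted lower and upper effort rates.
low = transcription_hours(60, 200)
high = transcription_hours(60, 500)
print(f"{low:.1f} to {high:.1f} hours")
```

For one minute of speech this gives roughly 3.3 to 8.3 hours, so the upper end of the range is consistent with the "one man-day" estimate in the quotation.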

Film studies researchers working with foreign-language data sometimes generate same-language or English-language subtitles manually:

"Subtitling is very slow, done manually, so I would like software to transliterate the soundtrack into text ... I'd be interested even if it gave a rough transcription. However, I think there'd be issues with [the foreign language's] dialects ... There are probably 10 different dialects and the most commonly used words differ most widely across dialects with specialised words being more trans-dialectal. And 80 to 90% of the data is colloquial at a guess." [3]

Technology notes: Automatic transcription tools for specific situations do exist at commercial and research stages. These include systems for subtitling American and European television news, the English and Czech transcription tools developed by MALACH and phonetic transcription tools for conversation and American English, amongst many others. Such systems may provide first-pass transcriptions that can be cleaned up in less time than a full manual transcription would require, but such assessments need to be made case by case. The sheer variety of languages and dialects, subject-specific vocabulary, recording environments, speech styles and emotions exhibited by humanities-relevant data (and hinted at by the descriptions above) means that a general-purpose, humanities-relevant speech transcription server (e.g. a grid service) is rather unlikely to be feasible in the 2006-2010 timeframe. A potentially interesting, though lengthy, engineering research project might nevertheless explore development of a transcription server designed for more constrained scenarios (e.g. a server for generating crude transcriptions of interviews in UK Southern English with participants wearing close-talking microphones). Such a server might share some similarities with the MIT lecture transcription server currently under development by one of the MIT iCampus projects (MIT, 2006c).

Composite Scenario: X is a linguist interested in conversational analysis, Y studies performances and Z is a historian who might study speeches by a prominent political figure. Each of these can imagine research questions relating to the non-lexical content of a spoken word recording. They might benefit from certain types of automatic annotation, such as marking laughter, stress, pauses (and their durations) or emotional content. Such problems have been investigated by the engineering community but their solutions are apparently not readily available in a packaged form.
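
As an illustration of one of the simpler annotations mentioned above, pauses and their durations can be derived from a time-aligned transcript. This is a minimal sketch, assuming that word start and end times are already available (for example from a manual or automatic alignment); the data format, the function name and the 0.5 second threshold are all assumptions made for the example, and detecting laughter, stress or emotion would need far more than this.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (word, start time in s, end time in s)

def find_pauses(words: List[Word], min_pause: float = 0.5):
    """Return (word before, word after, pause length) for each long gap."""
    pauses = []
    for (w1, _, end1), (w2, start2, _) in zip(words, words[1:]):
        gap = start2 - end1
        if gap >= min_pause:
            pauses.append((w1, w2, round(gap, 3)))
    return pauses

# Hypothetical aligned words; the timings are invented for illustration.
aligned = [("so", 0.0, 0.2), ("then", 0.25, 0.6),
           ("nothing", 1.5, 1.9), ("happened", 1.95, 2.4)]
print(find_pauses(aligned))
```

With the invented timings above, the only gap at or over the threshold is the 0.9 second pause between "then" and "nothing".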

4.3.3 Analysis and interpretation

Scenario: X analyses films and television. He's learned that engineers have developed research tools which can automatically detect shot boundaries and can classify each boundary into categories such as cuts or fades. "With this technology I could explore questions such as the use of long takes or statistics about cuts and shot types and so on ... I could extract statistics such as the number of cuts in the first and last 10 minutes of a film or historical changes in cutting rates in a TV or film type. It's too time-consuming to manually annotate these things for research on an extensive dataset ... a tool giving this kind of quantitative analysis would be very useful." [3] Such information may strengthen the empirical foundations of the kinds of research questions currently asked in the field, by providing quantitative evidence, but the most important part of the work will continue to be the interpretive analysis that explains why the statistics calculated should be so. [3] He sees the envisaged tool as something facilitating existing research, but not engendering new types of research.
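
Once shot boundaries are available, the statistics this researcher describes are simple to compute. The sketch below assumes a hypothetical input format, a plain list of cut times in seconds as might be exported by a shot boundary detector, and it does not perform any detection itself; the function names and example values are invented for illustration.

```python
def cuts_in_window(cut_times, start, end):
    """Count the cuts whose time falls in the half-open interval [start, end)."""
    return sum(1 for t in cut_times if start <= t < end)

def cutting_stats(cut_times, film_length, window=600.0):
    """Summarise cutting in the first and last `window` seconds of a film."""
    n_shots = len(cut_times) + 1  # n cuts divide the film into n + 1 shots
    return {
        "cuts_first_window": cuts_in_window(cut_times, 0.0, window),
        "cuts_last_window": cuts_in_window(cut_times, film_length - window, film_length),
        "average_shot_length": film_length / n_shots,
    }

# Invented cut times for a 90-minute (5400 s) film.
cuts = [30, 95, 400, 650, 2000, 4900, 5200, 5350]
stats = cutting_stats(cuts, film_length=5400)
print(stats)
```

Because automatic detectors make errors, figures produced this way are best treated as estimates to be spot-checked against a manual count on a sample of the material.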

Composite Scenario: X works in theatre studies. He aims to take productions as a whole and to evaluate them in a much wider context, considering performance reception, sociocultural history, translation studies or details of the actual performances. He uses a variety of evidence about productions, including books, scripts, posters, theatre programmes, newspaper reviews etc. These resources come from theatre company archives and more general audiovisual archives such as the British Library Sound Archive. He also makes use of theatre-related broadband resources, but these are mostly for teaching rather than research. Video or audio recordings of performances are also a possible type of evidence, but access to performance recordings is not always straightforward. He is fortunate in having access to some audio recordings of performances, and he's able to travel to archives which hold videotapes for some (though not all) productions in order to view them. More generally, however: "Colleagues in performing arts departments are crying out for a resource to allow visual records of performances to be widely accessible digitally for teaching and research purposes, but that is yet another kettle of fish with all the copyright implications [...]" [4] He does note some interesting developments on the Web, such as the recent announcement of UK online pay-per-view theatre: "This would be a major step forward." [3]

The researcher sees possible uses for (manual or automatic) annotation tools for marking up interesting events in performances, but believes he is unlikely to gain access to material to annotate, for non-technological reasons. He says that rights and access issues place many limits on what he can achieve. He believes that at present some researchers are benefiting from studying resources that are difficult for most people to access, because their work cannot easily be developed or critiqued, and he feels this does not benefit the field. He would like to perform comparative studies across different productions of the same play, across the different nights of a single production or even across rehearsals for that production, and believes such analysis would significantly strengthen research in his area. Access to multiple productions would allow more detailed and quantitatively supported investigation of casting decisions. For example, images from different productions can be used to compare the visual aspects of casting decisions at present, but easier access to performance videos would allow comparisons of the different accents cast, such as when a particular character is played by a Scot or non-Scot. [3]

The researcher has also learned that techniques from speech recognition can be used to time-align spoken words with text. These techniques might facilitate comparative studies if the right to work with the appropriate sources of video could be negotiated: a research project might explore the alignment of multiple production (audio or video) recordings to a master text, allowing easy comparison of how different parts of the text had been staged, which lines had been modified, where deliveries were inordinately long, where words were used in surprising or strange ways, and how pauses were used.
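
The text-comparison half of this idea can be sketched with standard tools. The example below assumes that a word-level transcript of each production already exists (produced manually or by a recogniser) and compares it with the master text using Python's difflib; it works on text only and is not the audio-to-text forced alignment that a speech recognition system would perform. The function name and the example lines are illustrative.

```python
from difflib import SequenceMatcher

def compare_to_master(master: str, production: str):
    """List word-level differences between a master text and one production."""
    m, p = master.lower().split(), production.lower().split()
    matcher = SequenceMatcher(a=m, b=p, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tag is 'replace', 'delete' or 'insert'
            changes.append((tag, " ".join(m[i1:i2]), " ".join(p[j1:j2])))
    return changes

master_line = "to be or not to be that is the question"
staged_line = "to be or not to be that is my question"
print(compare_to_master(master_line, staged_line))
```

If the production transcripts also carried word timings, each reported change could be linked back to the corresponding point in the recording, which is where the time alignment described above would come in.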

This researcher is also thinking about other ways of gathering research information from different productions, such as a multimedia wiki: he envisages a web site that allows productions (e.g. school productions) to submit their own recordings, and that maintains access-controlled sections to allow commentary and annotations from academics, schools or others.

Scenario: X is a theologian. He comments that his is a "highly interdisciplinary subject in the sense that for any other given [arts and humanities] discipline, there will be a branch of theology and religion that is similar. Thus, the humanities disciplines that have uses for the tools we are looking at will give hints of the potential uses in theology. For example, if the practice-based arts have a use for them, I can find an equivalent in theology. Consider performance: then theologians will consider performance within the context of ritual." [2] Thus, the tools suggested in the previous scenario (and elsewhere) may find more general application than may be immediately apparent.

4.3.4 Dissemination

Composite Scenario: X is an area studies researcher and his current project is compiling a collection of clips containing interesting objects and events in the moving images and/or the soundtrack. He will deposit his annotations and perhaps the clips in a digital repository at the end of the project (such as a research council or university repository). He believes the repository's current text catalogue will not encourage reuse of his dataset, because textual descriptions do not capture the richness of the time-based audiovisual content; in contrast, an interface which supported audiovisual browsing or even searching of the spoken word content would be more appropriate for encouraging the use of this kind of data.

Many researchers collect interviews, e.g. theatre researchers interviewing set designers, oral historians interviewing the target group and ethnoarchaeologists interviewing local people close to a dig or field survey site. They also suggest that new technologies could be used to make their data collections more accessible and thereby encourage reuse. For example, the spoken word content could be indexed based upon (automatically time-aligned) manual transcripts, if available, or based upon less accurate but automatically generated and time-aligned transcripts. This was perceived as a possible improvement over a very limited catalogue description of their collections. Such indexing systems are feasible for languages for which speech-to-text functionality and pronunciation dictionaries exist, though performance would vary depending upon the quality of the speech-to-text system and the careful design of the associated time-alignment algorithm.
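
The indexing idea can be illustrated with a very small inverted index. This is a sketch only, assuming that each recording already has a transcript of (word, start time) pairs; the recording identifiers, data layout and function names are hypothetical, and a real system would also need to cope with recognition errors, word variants and phrase queries.

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each word to the (recording, start time) positions where it is spoken."""
    index = defaultdict(list)
    for recording_id, words in transcripts.items():
        for word, start in words:
            index[word.lower()].append((recording_id, start))
    return index

def search(index, term):
    """Return every (recording, start time) at which the term occurs."""
    return index.get(term.lower(), [])

# Invented time-aligned transcripts for two interviews.
transcripts = {
    "interview_01": [("The", 0.0), ("set", 0.4), ("design", 0.8)],
    "interview_02": [("Design", 12.5), ("budget", 13.1)],
}
index = build_index(transcripts)
print(search(index, "design"))
```

Each hit gives a recording and a time offset, so a player could jump straight to the point where the word is spoken rather than relying on a catalogue description of the whole recording.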

Scenario: X publishes material that involves commercial, popular culture. She is in the process of writing a book chapter about a recent, fairly well-known art film. She wants to include a still from the film as an illustration in the book, but has had to dedicate months of requests, follow-ups and negotiations to include the one still, eventually at the price of 700. She doesn't resent that, as she often budgets for such contingencies. What is really frustrating, she finds, are the intellectual property owners who don't even acknowledge such requests.

4.3.5 Other uses

Scenario: X is a linguist and is able to spend a reasonable amount of time either self-recording research data or negotiating data release for research purposes. However, this is not a luxury afforded to students working on class projects or short pieces of research. The ability to search for spoken word audio or video online may be a mechanism by which students can rapidly construct small data sets for research purposes, e.g. a selection of UK broadcast media coverage about the Siege of Beslan. "It's not always a solution, because some research questions will require more demographic information about the participants in the audio than is available from the source websites, but this can also be a problem with data obtained from traditional audiovisual archives ... For some research questions it could significantly speed up the data collection process and free up time for additional research." [3]

Scenario: X often has to respond to questions about language usage from the media and the general public. Spoken word examples are sometimes more appropriate than examples based on written text. Audio and video search engines may provide a convenient mechanism for locating illustrative clips, whether in contemporary produced video or in the self-recordings individuals have uploaded to the Web.

Scenario: X is a linguist who occasionally uses clips from a standard linguistics database to illustrate certain phonological changes exhibited by specific languages or dialects for students and lecturers. The ability to search for spoken word audio on the Web may provide an alternative source of examples. However, getting the equipment together to play audiovisual clips in some of the older lecture halls can be so time-consuming that it acts as a deterrent to actually playing these clips in class, though he might give the appropriate links to students in a handout.

Composite Scenario: X works in a modern languages department. He observes that language teachers regularly make off-air recordings on third-generation videotapes and then spend considerable time watching them to find relevant sections for class or for use in comprehension tests. A package which could make these recordings searchable would be very useful. It would ideally support many foreign languages. Even a mechanism which filtered out non-spoken word audio might speed up the process of browsing the spoken content for relevant phrases, topics or linguistic contexts.

Scenario: X teaches a course on the late 20th century history of an English-speaking country. "A technology like Google for video would be very useful to me, for teaching as well as anything else. How far back does the data currently online go, as far as the 1960s-1980s? I might also look for relatively recent reports on political activity in the area." [2]

Scenario: X and Y are archaeologists who report that the use of video is very common for making trench recordings. (These show information such as soil changes or the relationship between points in the trench better than individual images.) They believe a significant amount of time is spent editing such recordings down to a final annotated set, in the evening or at some later date. They are curious about the possibilities for using speech technology (or some other technology) to support annotation of images and video in the field, envisaging a system linked to the large cameras used in excavations or the smaller portable cameras used by surveyors. Perhaps this would obviate the need to edit down the video collected: could one simply search the soundtrack for the relevant spoken annotation in order to locate the segment of interest, such as "trench layer 6"? [3] This is an interesting problem for exploration from a speech recognition (or, more generally, a mobile computing) point of view because the situation can be controlled in several ways: these two researchers would be willing to wear close-talking microphones to reduce noise, [3] there are limits upon the number of speakers (i.e. the set of researchers who would be digging) and it seems likely that there are constraints upon the annotation vocabulary and grammar that could be usefully exploited by technology solutions.

4.4 Technical expectations

Beyond the questions about general audiovisual usage, we observed a few common misapprehensions in the interviews. They involved mismatched expectations about the likely performance of black box tools for AV in envisaged deployments.

4.4.1 Error

Some interviewees tended to assume that automatic annotation systems would provide correct and unarguable results. For example, one suggestion that arose several times was to use an automatic speech-to-text system to generate a perfect AV concordance tool that would allow a researcher to examine all of the AV contexts in which specific words appear. This suggestion would be challenging to realise today because most automatic annotation techniques operate with some degree of error, and thus a concordance based on a speech-to-text transcript is not guaranteed to be perfect. Similarly, an automatic phonetic transcription system or shot boundary detection system will not always yield the same results as a careful and experienced human, and a system for searching video based on an automatic speech-to-text transcription of the soundtrack is not guaranteed to yield the same results as a system that searches a human transcription of the soundtrack. In some cases the apparent disagreement between automated system and human results may be due to genuine ambiguity in the data, which could be revealed by a comparison of the disagreement between multiple humans performing the same task; in other cases, errors arise due to limitations of the automatic system such as a lack of robustness (see next). The level of error which could be tolerated in an automatic annotation system will generally be application- and data-set specific.
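
One common way of putting a number on transcription error is the word error rate: the minimum number of word substitutions, insertions and deletions needed to turn the automatic transcript into a reference transcript, divided by the length of the reference. The report itself does not name a measure, so the choice of word error rate here is an assumption, and the function and example sentences are purely illustrative. The same function can be applied to two human transcripts of the same recording to estimate how much disagreement stems from ambiguity in the data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the transcripts, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

human = "the trench layer six shows burning"
machine = "the french layer six shows burning"
wer = word_error_rate(human, machine)
print(f"{wer:.3f}")
```

In this invented example one word in six is wrong, and it is exactly the word a concordance user might search for, which is why the tolerable error level depends so heavily on the application.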

4.4.2 Robustness

Many of the techniques discussed in this report make use of statistical methods. These methods extract information from a set of learning or training examples and make use of this information when analysing new examples. Since this set of training examples is usually incomplete, the information extraction process leads to a system which is often (though not always) brittle when challenged by new examples that are different from those it has seen before (i.e. it is not robust in the face of new examples). This means statistically-based techniques cannot always be used as black boxes that can be expected to operate equally well (i.e. at equal error rate levels) on all types of new examples: most deployments must investigate tuning or adaptation techniques to make the base systems perform better in the new deployment scenario.

For example, a speech recognition system developed using examples of carefully dictated, Received Pronunciation speech may perform quite acceptably when applied to carefully dictated, Received Pronunciation speech. However, in many if not most situations, the same system may struggle when faced with more diverse types of speech such as conversational speech or highly accented slang, sometimes even after adaptation. Even apparently similar speech can sometimes be problematic, as illustrated by the lack of robustness of a system trained on broadcast news shows when applied to speech from a news show the system hasn't seen before. At present, the ideal is to use a base system trained on data representative of the data which will be seen during deployment and to perform additional system adaptation.

4.4.3 A lack of appreciation for the demo effect

It is well accepted within technical communities that a new system will be demonstrated in the best possible light in terms of input and performance. Translating such a demonstration to an applied research situation can cause a shocking disappointment to researchers unprepared for it. It is for this reason that we label the maturity of the technologies mentioned in Appendix B.

4.4.4 "I can't do that [with that tool]"

A common complaint related to the usability of increasingly complex software and devices. One interviewer repeatedly and enthusiastically pointed out features and capabilities of media applications that he and the researchers mutually used. One music researcher shied away from using iTunes or her iPod when giving talks because she didn't believe she could easily set up extracts in iTunes or get random access within a track on an iPod. Although they are not common functions, both are possible with a little investigation. It is often simply a question of knowing that something is possible with common software.
