ICT Tools for Searching, Annotation and Analysis of Audiovisual Media


1 Project report. Audiovisual media, ICT tools, and humanities research

1.1 Introduction

Some of the most highly valued cultural forms in the West are stored in print form. Hence, much scholarly research focuses on what exists in printed form (e.g., the Bible, Shakespeare's plays, Descartes' Meditations, Beethoven's string quartets). This can give the impression that the humanities primarily refer to books and writing. In fact, much humanities research goes beyond print media. For reasons that go to the heart of their intellectual projects, scholars have been greatly concerned with ephemeral aspects of cultural materials such as speech, bodily movements, performances, and events. Visual, performance and mass media cultures generate transient materials, forms and processes that print represents poorly.

Media that record, store and transmit speech, music and moving images are roughly a century old. Electronic media and audiovisual recording technologies dramatically enlarge the horizon of cultural materials that can be analysed. Technologies that can handle sound, images and text in digitally encoded form are just twenty-five years old. As well as static text and pictures, networked computers now deliver sound and video to the desktop. Even more recently, the audiovisual capabilities of a normal office PC have opened up possibilities for the easy use of non-print resources in many areas of the humanities. For example, historians may employ archive film or video footage of events, interviews, etc.; artists, actors, musicians and others may study performances; linguists may wish to study the spoken language of such recordings, and so on. Libraries and universities around the world have been quick to explore the possibilities for making such audiovisual materials available to their researchers, and the internet allows users to access large quantities of audiovisual resources, often without the need to go via established institutional providers.

However, humanities research has not always been able to quickly pick up on the enlarged possibilities of the universal media machine. The reasons for this are complex. Software to play sound and images has been mainly commercially produced. It has been designed to allow people to simply listen to or view media, without interrupting, repeating, searching or reordering it. By contrast, humanities research with such media typically relies on slowing down, comparing, collecting and sorting sounds and images in many different ways. Similarly, software for production of audiovisual material (sound editors, software sound and image mixers, video editors, video capture and encoders, etc) does not make it easy to analyse sounds and images. Typically, it makes it much easier to put sound and images together than to take them apart. Finally, in the last decade or so, there have been important research advances and large commercial investment in software systems that automatically transcribe, annotate or analyse sound and moving images. Even so, their application in humanities research is not straightforward. For instance, automated music genre analysis systems are technologically sophisticated and commercially significant. However, if identification of genre is an issue at all for current research in musicology, it is in the analysis of the process and concept of genre-association. If genre-identification software is to be useful in such research, it must be repurposed and perhaps unpicked to do more than simply assign a genre to a piece of music. Comparable illustrations could be made for video, film and speech research.

1.1.1 Scope of the report

This report explores the intersection between audiovisual media and digital technologies in the humanities as it stands in mid-2006. What can be done, and what might be done, does not coincide with what is actually done. The report focuses on how research is carried out or could be carried out on materials that have already been recorded or captured in electronic form.

The scope of audiovisual media for the purposes of this report is time-based audio and visual source material, as rendered through digital recordings or other capture processes. It excludes material based on still images, research material that exists primarily as a visual artefact (such as the image of a musical score), and materials which are primarily symbolic encodings or notations (such as encodings of a dance in Laban notation or a musical score). Although there are overlapping concerns, we exclude materials for teaching and materials used in a creative process (as in the performing, visual, and compositional arts).

The report does not address other changes in research processes. For instance, it does not explore how acquiring scholarly work through downloads or e-journals, or disseminating results through electronic publishing, changes the nature of research. Instead, we focus on problems and possibilities of working with primary materials such as recordings, footage or broadcasts. These problems and possibilities arise for contemporary scholars in many humanities disciplines. As much as possible, we have avoided addressing highly technical problems specific to single disciplines.

To highlight the most relevant points of intersection, we have adopted a simple generic model of humanities research using AV materials (Figure 1). The model views humanities research as a process of repeatedly accessing, searching, marking up (annotating), transcribing, analysing, and presenting materials. As the figure suggests, the order of these operations varies. Scholars constantly cycle between different ways of working with audiovisual materials. Innovative research often combines them in unexpected variations or applies them to different materials.

1.1.2 Report website and project weblog

Resources in electronic form with hyperlinks for the examples and references contained in this report are available on the web as follows:

An online version of this report: http://www.phon.ox.ac.uk/avtools (mirrored at http://ict4av.lancs.ac.uk/report)

A weblog used in the process of gathering information for this project and including many more examples and links to recent developments: http://ict4av.lancs.ac.uk/

1.1.3 Other relevant reports

The recent British Academy Policy Review (2005) discusses factors which currently or potentially impact humanities and social sciences resources and access to them (see Section 4: Factors and Themes). Many of these factors, which include ICT advances, the Grid, access mechanisms, metadata, intellectual property and charging regimes, are directly relevant to the audiovisual resources considered in this report.

Relevant technology, copyright and privacy issues are discussed in the report of the EU-US Working Group on Spoken Word Audio Collections (SWAG, 2003).

A document from the British Universities Film and Video Council (BUFVC, 2004) highlighted the importance and varied uses of moving images and sounds and the difficulties faced by audiovisual archives. Key recommendations included collectively defining public sector audiovisual archive holdings as a distributed national collection and collectively asserting a public right of access to this collection for non-commercial use. More general archive-related initiatives such as the Archives Task Force and the UK Film Council's review of film heritage provision may also have consequences that are relevant.

The reader might find the following survey papers useful in addition: Goldman et al. (2005), Koumpis & Renals (2005), Lee & Chen (2005), Ostendorf et al. (2005). The survey of Goldman et al. (2005), a product of an EU/US (DELOS/NSF) working group on spoken word audio collections, also addresses policy issues relating to privacy and copyright and to the collection and preservation of spoken audio content.

1.2 Overview of the report

1.2.1 Organisation of the report

This main part of the report provides an overview and summary of conclusions. Appendices provide much more detail on different aspects of the model of research:

Appendix A: Accessing sources of audiovisual data. This section outlines the breadth and abundance of materials becoming available, and some of the difficulties that scholars encounter in making use of them. This includes the central problem of copyright law and Digital Rights Management.

Appendix B: Technologies for researching speech, music and moving image. This section summarises the major capabilities, possibilities, and lines of future development of digital technologies in humanities research with audiovisual media.

Appendix C: Current practices and expectations of humanities researchers. Based on a series of interviews and field visits to humanities researchers in a variety of disciplines, this part of the report outlines what researchers do and do not do, and what they would like to be able to do in their research.

1.2.2 Accessing audiovisual materials

The increasing volume of audiovisual materials is obvious. The quantity of data available exceeds the capacity of any library. Only a fraction of this is of interest to arts and humanities researchers, but it is no longer possible to identify a clear boundary between those materials which are of interest to scholars, and therefore should be preserved and made accessible, and those of no interest (if indeed it ever was possible to identify such a boundary). Researchers in the arts and humanities have a massive amount of material available to them, but it is less constant and less organised than the traditional text materials contained in libraries.

While there has been very significant growth in digital resources held by libraries and museums, often clustered around historical archives, accessing these materials, or the relevant parts of them, remains an issue. Often such collections are not easily searchable online, and the catalogues lack rich content descriptions. This can frustrate efforts to access material in a manner other than by the traditional identifiers of author, title and subject. In the space of ephemeral materials derived from popular culture, news media and general broadcast, an explosion of digital resources is occurring. Recorded music and film are mostly, sometimes solely, available in online formats. In consequence, the location of collections and repositories used by humanities scholars is shifting. Perhaps the most important access sites are no longer primarily institutionally managed. Instead, commercial services and user-produced archives and collections seem more important and relevant to much current scholarship. (Consider, for example, how significant Google has become in the everyday work of most researchers.) This situation yields greater certainty of access in some respects. For example, scholars are likely to have ready access to a much larger collection of recorded music through an online music library such as Naxos than most research libraries hold.

However, the format of the audiovisual materials affects access in practical ways. Issues of fidelity are of diminishing importance with audio, but remain critical with video. What is sufficient quality for teaching and broad analysis is not sufficient for automated analysis or close analysis. Issues of fidelity depend crucially on what the researcher aims to achieve. Some formats, especially streaming formats, frustrate research processes, though they impose no absolute obstacle since stream-ripping software is widely available and increasingly used. Greater use of portable storage and the networking of media devices and computers make it possible to access audiovisual materials in a wider range of contexts and in varying ways.

Copyright law and techniques of rights management are a much more significant factor in some areas of research. Copyright-restricted access is narrowing at a number of different levels. At a low level, Digital Rights Management (DRM) attempts to lock out any usage beyond the content provider's view of audiovisual media as entertainment. Access licences and pay-per-view schemes abound. Not surprisingly, a number of responses are emerging: new licensing schemes such as Creative Commons (2006) reserve some rights but with a general view that access should be open. However, access rights remain an important issue, and commercial usage of Creative Commons remains very small. Access can be a stumbling block for researchers who wish to work with audiovisual materials from established providers and, on the whole, is becoming a more serious problem.

1.2.3 Technologies: state of the art, gaps, obstacles

We have examined technologies of varying maturity, and do not limit ourselves to commercially deployed products or current ICT research. Some of these technologies were not necessarily developed for humanities research at all, but might be repurposed for scholarly work.

1.2.3.1 Searching and collecting

With vast collections of digital audiovisual material available, actually searching for and finding a resource can be a major barrier to research; if one is unable to locate a resource, one is effectively denied access. Nearly all practical, current, multimedia access depends on good-quality metadata for search. Content-based information retrieval is a field of active research, but for the most part has not yielded results such that the average, non-ICT-expert researcher can expect or obtain good results. Speech-based information retrieval is by far the most mature instance within content-based retrieval, and today's performance can be used as a rough indication of where video and music tools are headed within a decade. The performance of current automatic speech-recognition technology is at such a level as to make content-based retrieval practical for certain kinds of speech materials in restricted domains. Software tools or search engines to effect this do not currently exist outside the laboratory (though Blinkx TV (2006) is an on-line video search system that claims to perform speech search), but such tools can be expected to emerge in the near future.

For now, however, and certainly for music, video and film, searching is based upon catalogue data and other associated information; what one can find is a direct function of what metadata is associated with the resource. Until fairly recently, that was the sole responsibility of the archivist, but in recent years alternative strategies have come to the fore. Methods that search providers have utilised with audiovisual content include contextual (e.g., containing web page) and associated information (e.g., closed captioning information with a video). More recently, user-supplied metadata is starting to play a larger role.
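The dependence of search on metadata can be illustrated with a minimal sketch. The record structure, field names and sample entries below are hypothetical and are not drawn from any catalogue or search service discussed in this report; the point is only that a resource is found when the search term appears in whichever layers of metadata (catalogue, contextual, user-supplied) the search is allowed to consult.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    """Hypothetical catalogue entry for one audiovisual resource."""
    title: str
    catalogue: dict                      # archivist-supplied fields
    context: str = ""                    # associated text, e.g. captions
    user_tags: list = field(default_factory=list)  # user-supplied metadata

def search(records, term, use_context=False, use_tags=False):
    """Return titles of records whose consulted metadata contains the term."""
    term = term.lower()
    hits = []
    for record in records:
        texts = [record.title] + list(record.catalogue.values())
        if use_context:
            texts.append(record.context)
        if use_tags:
            texts.extend(record.user_tags)
        if any(term in text.lower() for text in texts):
            hits.append(record.title)
    return hits

# Invented sample data for illustration only
records = [
    Record("Interview 12", {"subject": "oral history"},
           context="speaker recalls the flood", user_tags=["dialect"]),
    Record("Newsreel 7", {"subject": "flood relief"}),
]

catalogue_only = search(records, "flood")
with_context = search(records, "flood", use_context=True)
```

With catalogue data alone, only the newsreel is found; the interview becomes findable once the associated text is consulted, although the recording itself has not changed.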

1.2.3.2 Annotation

The metadata associated with a resource can be sufficient for locating a resource, but once a resource is found, there is often the need to associate finer-grained metadata with certain points within the audiovisual content. This potentially rich process is what we call annotation.

It is quite easy to annotate most forms of audiovisual material: video, speech, and audio annotation tools abound. Even so, it is often very time consuming to make annotations, so the challenge is to allow users to do so in a way that has some enduring value. One response is the development of standards to render annotations durable and facilitate their reuse by others. Important developments in this area are MPEG-7 and Annodex, but neither has as yet been widely adopted. Collaborative annotation systems are another means towards durability of annotations, by establishing a form of consensus, and they can also save effort by involving more users. On the other hand, user surveys indicate that ad hoc annotation happens all the time, sometimes involving pencil and paper, and the unpredictable nature of research means that this will always be the case.
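As a rough illustration of what a durable, reusable annotation might look like, the sketch below stores time-coded notes in a small XML document and reads them back. The element and attribute names are invented for this example; they are not the MPEG-7 or Annodex schemas, and a real project would need to follow whichever standard it adopts.

```python
import xml.etree.ElementTree as ET

def annotations_to_xml(media_uri, annotations):
    """Serialise (start_s, end_s, author, text) tuples to an XML string.

    The <annotations>/<note> layout is a hypothetical schema for illustration.
    """
    root = ET.Element("annotations", {"media": media_uri})
    for start, end, author, text in annotations:
        note = ET.SubElement(root, "note", {
            "start": f"{start:.2f}",
            "end": f"{end:.2f}",
            "author": author,
        })
        note.text = text
    return ET.tostring(root, encoding="unicode")

def notes_at(xml_text, time_s):
    """Return the text of every note whose time span covers time_s."""
    root = ET.fromstring(xml_text)
    return [note.text for note in root.findall("note")
            if float(note.get("start")) <= time_s < float(note.get("end"))]

# Invented sample annotations from two hypothetical annotators
xml_text = annotations_to_xml("interview.wav", [
    (0.0, 12.5, "annotator-a", "introductions"),
    (12.5, 40.0, "annotator-b", "childhood memories"),
])
found = notes_at(xml_text, 20.0)
```

Because each note carries its own time span and author, annotations from several people can sit in the same file, and any other tool able to read XML can reuse them.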

1.2.3.3 Transcription

Transcription can be seen as an audio-only technology. As it is the process of fixing time-based events into a permanent medium, and video tends to be its own best document (what you see is what you get), speech and music have received the most attention and success for transcription purposes. As with speech search, which uses transcription as the basis for text-based search, we can look at the recent history of speech technology to get an idea of the future of music transcription.

Speech transcription, although continually improving in performance, has not fundamentally changed in fifteen years. There continue to be blocks to the dream of completely general speech transcription. One must choose a constraint, such as supporting a limited vocabulary, a single speaker, high training time or discrete speech (unnaturally separated, with pauses), in order to reach decent performance. On the other hand, while transcription of speech into accurate, properly formed and punctuated text might not be achievable, a transcription which provides information useful for some kinds of research is already possible.

Music transcription, now most commonly represented within the music information retrieval (MIR) community, faces similar blocks: performance is constrained by polyphonic streams, inaccurate tuning, and/or musical convention. Current technology is not remotely close to automatically transcribing any but the simplest monophonic music into proper music notation. On the other hand, as for speech, transcriptions of other kinds which do show useful information are already possible, and the key challenge for MIR is to find those alternative views to note-based transcription which provide the most readily useful information.

1.2.3.4 Analysis

The location of analysis in Figure 1 indicates our intended meaning for the term: while many of the tasks and processes of annotation and transcription are in some sense analytical, we mean here that part of research where the results of annotation and transcription are subject to the judgement and intervention of the scholar who seeks to extract useful information, draw lessons, and form conclusions. With respect to audiovisual materials, ICT tools play two distinct but possibly interrelated roles. The first might be described as microscopic analysis, where the tool makes explicit characteristics of or data about the material which is otherwise too small, too fast or otherwise hidden. The prime example is Fourier analysis and other systems which extract time-varying frequency information from an audio signal, important in the analysis of both speech and music. Another example important in music is the discovery of timing information to an accuracy of a hundredth of a second (or less). The second role for ICT tools is to facilitate navigation through audiovisual materials, especially multiple materials, multiple views of materials, or annotations or transcriptions in association with audiovisual materials. Tools make it easy for scholars to jump to specified locations in a source, to align similar materials, to see or hear them aligned, and to view or hear audiovisual material aligned with annotations or visualisations.
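To make the idea of extracting time-varying frequency information concrete, here is a minimal sketch of a short-time Fourier analysis written in plain Python. It is deliberately naive (a direct discrete Fourier transform per frame, far slower than the FFT used by real tools), and the frame size, hop size and synthetic two-tone test signal are assumptions chosen for illustration rather than settings taken from any tool reviewed here.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the discrete Fourier transform for bins 0..N/2."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def dominant_frequencies(samples, sample_rate, frame_size, hop):
    """For each frame, return (start_time_s, strongest_frequency_hz)."""
    # A Hann window tapers each frame to reduce spectral leakage
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_size)
              for t in range(frame_size)]
    track = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + t] * window[t] for t in range(frame_size)]
        mags = dft_magnitudes(frame)
        peak_bin = max(range(1, len(mags)), key=mags.__getitem__)  # skip DC
        track.append((start / sample_rate, peak_bin * sample_rate / frame_size))
    return track

def synthetic_signal(sample_rate):
    """Invented test signal: 0.1 s at 500 Hz followed by 0.1 s at 1000 Hz."""
    n = sample_rate // 10
    low = [math.sin(2 * math.pi * 500 * t / sample_rate) for t in range(n)]
    high = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]
    return low + high

track = dominant_frequencies(synthetic_signal(8000), 8000,
                             frame_size=256, hop=256)
```

The resulting track pairs each frame's start time with its strongest frequency, so the change of pitch halfway through the signal shows up as a change in the second value. Shorter frames locate events more precisely in time but resolve frequency more coarsely, which is the basic trade-off in this kind of analysis.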

1.2.3.5 Presentation

Presentation refers to all the different ways in which digital technologies display or render different audiovisual materials apart from simply reproducing them. For instance, the timeline in a video editor or the waveform in a sound editor are presentations of images and sound respectively. Technologies that enhance presentation often summarise the material in some way. Speech summarisation has been actively developed, and short textual or audio summaries from speech can be generated. Summarisation technologies for music have also been a topic of research, for example presenting the salient features of a pop song in a few seconds, but have not yet been put to use outside the laboratory. Technologies that translate directly between spoken languages also offer new forms of presentation that could be useful in certain research domains.

Tools that generate visualisations of audiovisual materials are common. At their simplest, they display timelines of camera shots or audio events. They present information derived from audiovisual material in some graphic form, enabling overall patterns or structure to be seen, or assisting in the identification of points of particular interest. For instance, a timeline can be used to create a diagram of the formal structure of a piece of music. Existing sound and video editing tools can be repurposed for this. VJ (video deejaying) software allows many different video clips to be assembled, compared and ordered very quickly. This is a very active area of software development and use, and could well yield useful tools for research with collections of video material.

1.2.3.6 Integration

Technologies that integrate all the preceding research processes are few and far between. Even in the domain of speech processing and analysis, the area where analytical tools are strongest, there are few examples of integrated analysis environments or packages of the kind that one finds in scientific software (for instance in bioinformatics, mathematics, statistics or engineering). The few tools that have begun to offer a fairly complete spectrum of analytical capabilities are large-scale, research driven initiatives. They are not currently very accessible to humanities researchers. The development of integrated analysis environments or knowledge studios for humanities researchers remains on the distant horizon.

1.2.4 User experience and expectations

Humanities researchers whom we interviewed were treated as technology users for the purposes of this report. We sought to gather information on the life cycle of audiovisual materials gathered for their research purposes, concentrating upon gathering resources, preparing data, analysis, and dissemination.

Audiovisual material generally falls into two categories: self-recorded or found material. In our observations, self-recorded material not only becomes a research resource, but can have a life as a work record, or take the form of research output or dissemination. Found (e.g., commercial or otherwise externally sourced) material usually only takes the form of a primary research resource, to be studied in and out of context. This situation is fairly natural: if the copyright lies with another party, it is often onerous for the researcher to obtain the rights for a small extract to appear in a research output. Both found and self-recorded material get classroom use.

Our user needs study interviewed 28 humanities researchers and several other technologists who work within the humanities. The research was carried out in three phases, starting with a general, cross-disciplinary study, and then moving progressively towards more specialised and audiovisual-specific research. General researchers were presented with screenshots and descriptions of certain exemplary projects garnered from the early phases of the technology review, and asked for their reactions. The specialist interviews focussed on specific needs and frustrations, and specific solutions were proposed or imagined in conjunction with the interviewers.

Self-recorded research sources often consisted of interviews, oral histories, or documentary markers. There were few difficulties with recording equipment as it stands today. The common issue, however, was the vast amount of material collected and the limited time available to record and sift through it. As such, nearly all such interviewees wanted a solution for transcription of the material.

Found data runs into several potential roadblocks. The first is simply knowing where to look. As we have found, and as the report should demonstrate, not all large audiovisual archives are in obvious locations or maintained by the most obvious bodies to those accustomed to more traditional textual scholarship. The second is that there can be access problems: although technical barriers to access are being lifted in the online world, not all of the most relevant archives are digitised or transparent to outsiders. Beyond simple access, access rights become terribly important in the digital world: DRM can create difficulties ranging from headaches and inconvenience to completely cutting off a legitimate line of inquiry on audiovisual material (e.g., automated signal processing and analysis on audio).

Once a data store is found and accessed, many find difficulty on the other side of the fence: there can be too much data for a single researcher to work on. A few researchers complained of coming across rich archives of video, but finding that manually tracking through for things that were personally interesting to them was too time-consuming for the rewards. Again, transcription was an oft-requested desideratum, and implicitly demonstrates text's superiority for browsability over audiovisual material. Some researchers give themselves over to serendipity with found media, allowing broadcast media or online sources to open up new avenues for their research.

Once a particular piece of audiovisual content is chosen for deeper analysis, after an initial viewing/audition, a common first step is to develop some sort of timeline-based annotation. Although many ICT tools exist for this, many researchers are satisfied with making a table with notable time events, matched with other relevant notations, on paper. Those who deal with oral histories and other interviews cite making detailed transcriptions as a major effort and (often) expense. Further processing and analysis becomes much more individual to the researcher's personal methods and motives, but some researchers did show some interest in collaborative annotation (whilst expressing some doubt as to its technical or legal feasibility). Dissemination and other forms of sharing the results of research were similarly up to individual researchers. Those who had made use of ICT in doing so were generally comfortable with the tools available, since the tasks involved are familiar and well documented.

Finally, as technical experts speaking with humanities researchers, we noted some common misapprehensions about ICT tools and what they imagined the tools could achieve. Most of the problems arose from a complete trust in the infallibility of computers: that a computer could express uncertainty or offer a wrong answer flies in the face of most people's common understanding of computers. A few other problems came from those who were well aware that computers are fallible: those researchers thought some operations were impossible with a given tool, when they were in fact possible, just obscured by the interface.

1.3 Conclusions and Recommendations

1. Network infrastructure and computing platform requirements for humanities research with audiovisual materials are growing and changing. Research with audiovisual materials typically requires higher network bandwidth, more storage, better audio and graphics processing capabilities, and better display technologies than text-based research. Researchers are very interested in, and quick to pick up on, devices and software that allow them to collect, view and search audiovisual materials. Poor or weak infrastructure thwarts experimentation with new research approaches. We see a role for the AHRC in providing the (relatively modest) support for improved hardware and networks for humanities researchers.

2. Many of the problems experienced by arts and humanities researchers working with AV are not purely technical, but involve broader issues. One of these is lack of knowledge and expertise, but the solution is not simply training in specific skills or with specific software tools (though these are important, and current efforts should continue, especially at the postgraduate level). Researchers in arts and humanities sometimes need broader knowledge of computing technology, its capabilities and limitations, and to be able to operate with statistical concepts of error and probability (as is common, for example, among researchers in some social science disciplines) in order to make proper use of ICT. The AHRC should seek to foster knowledge of the capabilities and limitations of computing technology, and appropriate knowledge of error and probability, among arts and humanities researchers using ICT.

3. Access and rights restrictions are important issues. Researchers are confused about their rights in dealing with audiovisual materials, and the law is indeed unclear in some respects. Clear guidance, where possible, to researchers about what they can and cannot do with audiovisual materials would be useful. We applaud the recent British Library manifesto on intellectual property (British Library, 2006a) and strongly recommend that the AHRC engage in public debate on this issue and use its influence to establish rights of access to audiovisual materials for research. Digital rights management systems could, if widely adopted or even imposed in the way in which some companies propose, prevent effective research with audiovisual materials, even though such research would not harm the companies' interests. In view of the increasing importance of such materials for arts and humanities research, the AHRC must be involved in public debates and use its influence to prevent this. It is essential that the legal rights of access and the practical ability to access materials be maintained and promoted.

4. The continuing digitisation of humanities-relevant AV resources and their exposure to the emerging AV search engines should make AV resources more appealing to researchers for whom AV is not an essential primary resource. Such digitisation will happen anyway through the auspices of companies such as Google. The AHRC should focus funds toward resource-creation projects that will not be covered by commercial advances.

5. Many of the researchers interviewed could readily think of uses for AV data in their research, but researchers for whom AV is not an essential primary resource were often less than enthusiastic about pursuing these possibilities given the relative ease of locating and filtering text sources from their desk. The AHRC should be active in promoting the research opportunities that access to AV resources allows. Researchers currently using AV in their research could be explicitly recognised as early adopters and facilitated to act as examples for others. Activities such as the AHRC Methods Network which engage with identified experts will assist, but it is important that the current focus on existing disciplinary communities does not allow important new areas to fall through the gaps. It will be important also to engage more directly with the developers of technologies, who might not readily envisage the applications in arts and humanities research to which their technologies might be put.

6. Commercial tools will develop rapidly and be usable for many purposes, often beyond those intended by the developers. However, we expect deficiencies and problems to arise when arts and humanities researchers apply those tools. Such problems include bibliographic inadequacies, lack of access to raw material, management of large quantities of diverse or unusual content, interoperation with other tools, black-box tools whose exact functioning is unknown, and access to intermediate results and material. Longer-term problems with closed formats and commercial software arise because there is generally no way to guarantee that such data will be readable or that such software will remain usable by future researchers. Although data format transparency has improved over the past few years with the widespread acceptance of XML as a carrier format, there are still dangers with closed formats or incomplete data output with an open format. We recommend that the AHRC use and encourage the use of open, published data formats wherever possible.

7. Similarly, we note that some AV digitisation projects and efforts to increase access are often undertaken with audiences other than arts and humanities researchers in mind. The AHRC should monitor such activities as they are funded in order that arts and humanities researchers are consulted as stakeholders by at least some of these projects. A particular humanities research need is the availability of as good and complete metadata as is possible during the whole archival, creation, and capture workflow: if that descriptive data is not captured, it is effectively lost forever.

8. Researchers face substantial problems in organising their own collections of materials so that they or others can use them. Content management has become a problem for individuals and groups of researchers. To date the AHRC has directed efforts at the generation of large and sometimes centralised collections of research materials while expecting researchers to cope as best they can with their own collections. However, private collections will continue to be an important part of many research projects in the arts and humanities. The AHRC (perhaps through the Methods Network) should consider how individual researchers can be aided in generating and organising their own collections of audiovisual materials. This should both facilitate their own research and facilitate the subsequent use of their collections by other researchers.

9. The AHRC should encourage reuse of researcher-collected data via non-text-based support for browsing and exploration of AV deposited within archives. For example, the Arts and Humanities Data Service (AHDS) could develop standards for the deposit of non-text materials, accompanied by appropriate metadata, and rich mechanisms for browsing and searching such deposits. It similarly could advise on or even require appropriate standards for the annotation of deposited audiovisual materials, and for the deposit of additional annotations of already deposited materials.
