ICT Tools for Searching, Annotation and Analysis of Audiovisual Media


3 Appendix B. Technologies for researching speech, music and moving image

This survey considers current technologies for some of the loci in Figure 1 (p.2): accessing, searching and collecting, annotation, transcription, and analysis. Consideration will be given not only to technologies currently in use, but also to those which are the subject of research or development and likely to come into use by 2010. We adopt the following classification to indicate the current stages of development of the tools discussed:

Category 1: Mature project

Category 2: Usable but still under development

Category 3: Technical demo

Category 4: Proof of concept

Category 5: Lab experiment

The reader should note that classifications assigned are not exact. For example, many state-of-the-art research technologies could be described as falling into categories 3, 4 and/or 5.

3.1 Other sources of information

This report gives an indicative rather than comprehensive survey of current tools and technologies. Other lists, surveys or collections of tools include the following.

A useful survey of free and shareware Internet tools for voice analysis is available at http://www-users.york.ac.uk/~dmh8/dmh_pevoc4.htm, although it is not clear how actively it is maintained (last update March 2005).
A comprehensive list of speech analysis (and transcription) software is maintained at http://liceu.uab.es/~joaquim/phonetics/fon_anal_acus/herram_anal_acus.html, some of which also handles video.
Another useful survey examining freeware, shareware and commercial digital speech processing tools is Gonet and Święciński (2002).
Among its advertised resources, the International Computer Music Association maintains a software library (in fact a collection of links to sources of software) (see http://www.computermusic.org), but some of this is now out of date.
PALATINE, a Subject Centre of the Higher Education Academy, also maintains a set of links to music software as part of its Directory (http://www.lancs.ac.uk/palatine/directory.html).

3.2 Searching and collecting

Research in principle starts with some kind of searching and collecting of materials. The search for relevant materials often relies on previous analysis, annotation and sometimes transcription. There is no absolute point of origin for searching since almost every search relies on a prior categorisation. Searching and collecting take very different forms, and the technologies needed vary widely.

3.2.1 Searching the spoken word

The most widely used and highly developed search systems work with text, and so searching spoken word collections often relies on previous annotation, transcription or content analysis (topics covered in later sections) to derive text from, or associate text with, the spoken word.

3.2.1.1 Transcript search

Systems supporting the free-text querying of textual transcripts are now ubiquitous and similar systems exist for searching speech by querying the time-aligned transcripts automatically derived by speech-to-text systems. Surprisingly, with relatively little engineering ingenuity, the errors and lack of punctuation in these automatically derived transcripts have little impact upon the effectiveness of the search performance once the error rate falls below an often quite achievable rate of around 40%, at least for the types of data and tasks investigated to date (for more details and possible explanations, see Allan, 2003). However, difficulties may arise in scenarios where the transcription system vocabulary is inadequate to capture all the words in the content, because speech clips containing a word which does not exist in the automatically derived transcripts can never be found: such a problem is more likely to arise with dynamically evolving collections such as daily news (Hauptmann, 2005), rather than static archives, though this is not an absolute rule. There has been research into techniques for handling this problem, including techniques for searching secondary phonetic transcripts: a query term which falls outside the system word vocabulary is (hopefully) located by searching for its pronunciation in the phonetic transcripts (Logan et al., 2003; Amir et al., 2002). (In fact, some companies adopt the use of phonetic searching as the primary search mechanism and claim that this gives better results than a word level search, but this has been hotly disputed by researchers due to the lack of publicly available results.)
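The out-of-vocabulary problem described above can be illustrated with a minimal sketch of word-level search over a time-aligned transcript. The data, clip names and function names are hypothetical and chosen only for illustration; real systems would add ranking, stemming and handling of recognition errors.

```python
from collections import defaultdict

# Toy time-aligned transcript: (clip_id, start_seconds, recognised_word).
# Hypothetical data, not the output of any real speech-to-text system.
TRANSCRIPT = [
    ("news_01", 0.0, "the"),
    ("news_01", 0.4, "prime"),
    ("news_01", 0.9, "minister"),
    ("news_02", 3.2, "minister"),
    ("news_02", 3.9, "resigned"),
]


def build_index(entries):
    """Map each lower-cased word to the (clip, time) positions where it occurs."""
    index = defaultdict(list)
    for clip_id, start, word in entries:
        index[word.lower()].append((clip_id, start))
    return index


def search(index, query):
    """Return {clip_id: earliest hit time} for clips containing every query word."""
    words = query.lower().split()
    if not words:
        return {}
    hits_per_word = []
    for word in words:
        per_clip = {}
        for clip_id, start in index.get(word, []):
            per_clip[clip_id] = min(start, per_clip.get(clip_id, start))
        hits_per_word.append(per_clip)
    # Keep only clips in which all query words were recognised.
    common = set(hits_per_word[0])
    for per_clip in hits_per_word[1:]:
        common &= set(per_clip)
    return {clip: min(p[clip] for p in hits_per_word) for clip in common}


INDEX = build_index(TRANSCRIPT)
```

In this sketch a query for "minister resigned" returns only the second clip, with the time from which playback could start, whereas a query for a word that the recogniser never produced returns nothing, however often it was actually spoken. This is the gap that the secondary phonetic transcripts are intended to fill.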

3.2.1.2 Browsing via metadata

Metadata (generated manually or automatically) can be added to indices of various types, analogous to but more flexible than those found for books. Thus, a user might choose to browse only segments corresponding to a particular speaker or those that have been associated with particular named entities such as people, places or locations. (There remains considerable art in designing interfaces for supporting efficient browsing through such metadata. Section 3.6.1, Summarisation, is of relevance here.)

Tools of these types have appeared in digital library scenarios since the 1990s (e.g., similar ideas, although less powerful technology, appear in the Princeton digital library (Wolf & Liang, 1997) and the Kansas digital library (Gauch et al., 1997)). Automated speech indexing technology is also beginning to appear in Web search tools, discussed in Section 3.2.4, Tools for locating AV on the web.

Companies offering category 1 tools include Aurix (2006, was 20/20 Speech), Scansoft Dragon MediaIndexer and Scansoft Audio Mining (Scansoft, 2006), BBN Audio Indexing System (BBN, 2004-6a), Nexidia/Fasttalk (Fasttalk, 2006) and also Autonomy (2006), which offers both speech search and video search solutions (via Softsound and what is/was Virage). Ted Leath's First Year PhD Report (Leath, 2005) makes a brief comparison of a subset of commercial products and research systems including BBN Rough'n'Ready, FastTalk/Nexidia and ScanSoft MediaIndexer. (Companies exploiting such technology for Web AV search are discussed in Section 3.2.4.) There are also numerous research projects in categories 3-5 that are attempting to develop more sophisticated systems, some of which are discussed in Section 3.7, Integration.

Another issue affecting search systems for multilingual spoken word collections relates to the handling of users who generate queries in languages different to that of the collection material. One solution is to include a human with appropriate language skills in the search process; the technical community is also attempting to address this problem under the Cross Lingual Information Retrieval umbrella. Typical solutions include translation of the query to match the language(s) in the collection, translation of the collection to match the query language, or representation of both query and collection using some intermediate or interlingua representation. Much of this work is in categories 3-5 and has addressed the broadcast news domain, although there has been limited category 4-5 work addressing more conversational and emotional speech as part of the MALACH project (Oard et al., 2002).
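The first of these solutions, translating the query into the language of the collection, can be sketched as follows. The dictionary, the collection and the function names are all hypothetical toy assumptions; a real system would use a full translation resource and would have to deal with ambiguous and untranslatable terms far more carefully.

```python
# Toy French-to-English dictionary (hypothetical, for illustration only).
TOY_DICTIONARY = {
    "guerre": ["war"],
    "paix": ["peace"],
    "ministre": ["minister"],
}

# Toy English-language collection of transcribed clips.
COLLECTION = {
    "clip_a": "the minister spoke about peace",
    "clip_b": "weather forecast for the weekend",
}


def translate_query(query, dictionary):
    """Replace each query term by its dictionary translations, if any."""
    translated = []
    for term in query.lower().split():
        # Terms missing from the dictionary are passed through unchanged.
        translated.extend(dictionary.get(term, [term]))
    return translated


def cross_lingual_search(query, dictionary, collection):
    """Return ids of clips whose transcript shares a word with the translated query."""
    terms = set(translate_query(query, dictionary))
    return sorted(
        clip_id
        for clip_id, text in collection.items()
        if terms & set(text.lower().split())
    )
```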

3.2.2 Searching for music and sound

It is important to be clear whether in searching for music we mean finding information about where music can be located or actually gaining access to the music itself, equivalent to the distinction between a search yielding the bibliographic information for an article or the full text of the article itself. The possibilities for the former are currently much greater than the latter.

Most searches, whether within a collection such as the Naxos Music Library (Naxos, 2006), or within a database such as Gracenote (2006), depend on metadata such as title, composer or performer. The Naxos Music Library gives access to the actual streamed sound (for subscribers), while the Gracenote database gives catalogue details of CD recordings. In both cases the metadata is restricted and problematic, and based very closely on information provided with CDs. For example, titles of pieces may appear in a different language from the original composition.

In a few cases, the metadata associated with a recording is expanded to arbitrary tags, for example in the Freesound project (2006). The efficacy of this depends entirely on the usefulness of the original tags, generally collected through some collaborative process (see Section 3.3.3, Collaborative annotation). An approach which does not depend on explicit tagging is to assume that the text in proximity to a reference to a music or sound file is usefully associated with that music. Thus a search for "Beethoven" might yield music files which have "beethoven" in their title or in the text of links referring to them, or which are linked from pages with "beethoven" in the title. Google's American site (not the UK one) offers a music search facility which is called up when a search is recognised as referring to an artist (see Google, 2005). Searching for "Beatles", for example, gives access to specific searches related to each of their songs, but searching for "Beethoven" does not currently trigger any music search. Altavista (2006) allows search results to be restricted to audio files and will indeed give access to recordings of Beethoven's music in response to a search for "Beethoven".

There has been considerable interest in searching for music using sound rather than text as the search term, an approach called "query by humming". While there have been a number of experimental systems (category 3-5), some of them available on the web (e.g., NYU, n.d.), none has reached the stage of a usable tool. There are very significant technical issues to be addressed before this can be achieved, and questions remain about the degree to which it would ever be a simple-to-use and effective tool (Pardo & Birmingham, 2003). If humming is not a good interface for finding music, an alternative is demonstrated in Muugle (Bosma et al., 2006) (category 5), which provides an on-screen music keyboard on which a user may play a query, which is then matched against the database. Input from a MIDI keyboard is also possible.
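One simple way to match a played or hummed query against a database of melodies is sketched below. This is an illustrative assumption, not the method used by Muugle or any system cited here: each melody is reduced to its sequence of pitch intervals, so that a query in a different key still matches, and candidates are ranked by edit distance. The two-melody database and all function names are hypothetical.

```python
def intervals(pitches):
    """Convert MIDI pitch numbers to successive intervals (transposition-invariant)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


# Toy database of opening phrases as MIDI pitch numbers.
MELODIES = {
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64, 62],
    "Frere Jacques": [60, 62, 64, 60, 60, 62, 64, 60],
}


def best_match(query_pitches, database):
    """Return the title whose interval sequence is closest to the query's."""
    query = intervals(query_pitches)
    return min(database,
               key=lambda title: edit_distance(query, intervals(database[title])))
```

A query consisting of the first melody played a tone higher has the same interval sequence as the stored version and so matches it exactly. Real hummed input is much harder, because pitch and rhythm must first be estimated from noisy audio, which is one source of the technical difficulties noted above.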

3.2.3 Searching video and film

Databases that list information about films and television shows are now common on the web. These databases rely on simple genre classifications, the names of producers/directors, publication information, subject keywords and sometimes other content-related information. For instance, the Internet Movie Database (IMDb, 2006) (category 1, finished product) provides reviews, plot summaries, much technical production information and sometimes trailers for over 800,000 films and television series (July 2006). Because volunteers have added so much information about plot summaries and characters to the database, it can be used to find films and television programs by subject, genre, etc. For computer gaming, projects such as the Open Directory Project (2006) (category 1) or Games-db (2006) (category 1) offer something similar but without much of the production-related information. Apart from player reviews, they focus on cheats, that is, instructions on how to play games more easily.

For current television content, new content alert systems based on program schedules provide automatic notification of broadcasts that fit certain criteria (e.g. MeeVee (2006) (category 1) or Radio Times (2006) (category 1)). The BBC has announced its commitment to making 1 million hours of television and radio searchable and available online. The BBC Programme Catalogue (BBC, 2006) (category 3) allows 75 years of broadcasting to be searched.

On the horizon of searchability, systems that bridge different media are under active development. For instance, search engines that range across television and web contents have been designed (e.g., Miyamori et al., 2006).

However, these systems do not actually search the content of film, video or broadcast. As in the case of spoken word resources, they still rely on previous cataloguing, annotation or transcription. Even the most advanced video upload sites such as Yahoo! Video (2006) (category 1) require submitters to supply the keywords used for indexing and cataloguing the clip. Web search engines will probably index video at a fine-grained level as collaborative annotation techniques develop (see Section 3.3.3, Collaborative annotation). This topic is discussed further at the end of the following section.

3.2.4 Searching for AV on the web

There are already some established methods of accessing audio and video on the Web, some targeted specifically at researchers and the arts/humanities. These include portal efforts such as the BUFVC's Moving Image Gateway (BUFVC, 2006), which collects links to websites involving moving images and sound and their use in higher/further education, and HUMBUL, which includes categories such as Modern Languages General, Sound/Audio (HUMBUL, 2006). Another example is the work of OLAC (The Open Language Archives Community), which has extended the Open Archives Initiative infrastructure in order to support creation of virtual digital libraries comprising distributed language resources: community efforts not only support standardised resource discovery (including spoken audio and also associated tools) but also recommend best practice for resource creation (Simons & Bird, 2003; Goldman et al., 2005).

More generally, online suppliers of content for online and offline viewing are rapidly increasing in number. For example, iTunes allows the download of video content (e.g. TV shows or media company podcasts) for transfer to a video-capable iPod, and TiVo now supports transfer of content recorded by the TiVo to an iPod or PlayStation Portable. Video delivered through these channels is subject to established charging mechanisms, some per month, some per content unit.

We distinguish these offerings from emerging Web search tools aimed at locating AV. Most of these fall into categories 1 or 2, and they share many similarities with search engines for text. One of the earliest of these was Speechbot, a general Web-deployed tool for indexing audio via speech recognition transcriptions. Speechbot supported many of the functions now familiar from text-based searching, allowing free text, advanced or power searches. It produced a results list in which each item comprised a 10-second errorful transcription around the located (and highlighted) query terms, the ability to play the corresponding 10-second extract, and the date of the recording. Speechbot is now unavailable due to the closure of the Compaq Cambridge Research Lab (US), but in the past couple of years a number of similar services have emerged. Many of these emerging services have been released as test or beta versions for audio and/or video, and changes to the functionality offered by any one site are appearing almost weekly at the time of writing. For this reason, we describe typical functionalities rather than describing specific systems in depth.

Some tools crawl the Web for audio and video made openly available on websites. For example, podscope offers the ability to search audio blogs and podcasts, as does Blinkx (2006). Truveo offers a similar service (Rev2.org, 2005).

Some tools support the search of video or audio submitted by users. For example, podscope allows users to submit content (Price, 2006a), while Google Video operates the Google Video Upload program (Google, 2006c), whereby video and optionally a transcript are submitted to the system.

Some tools index content legitimately provided by media companies and archives. For example, blinkx has major deals with ITN and Fox News Channel (net imperative, 2006). Yahoo! Video and Truveo accept videos through media RSS (Rev2.org, 2005). Google Video operates a Premium Program for major producers (Google, 2006d) and also has a pilot project with the US National Archives (News.com, 2006; Google, 2006a).

The tools perform the search in different ways. Some rely on metadata associated with videos, such as web page captions or user uploaded transcripts (the current version of Google video may fall into this category). Others extract closed captioning or use speech-to-text technology to allow more precise indexing as discussed earlier, returning results which play from the point of the first-matching query term (e.g. TV Eyes (Price, 2006b)). Most of the sites described offer services in English; services are also appearing in Mandarin (e.g., Blinkx, 2006) and Arabic (TV Eyes, 2003).

The business models of these companies are still evolving. Services such as blinkx appeared to be inserting advertisements into searchable content (net imperative, 2006). Others offer premium fee-based services e.g. TVEyes (Price, 2006b).

3.2.5 Content management systems

Video and audio are large media. On the web, in film and video databases, on DVDs, in legacy collections of video and film, there is no shortage of film footage or television content. Many scholars collect large amounts of this material on their own computers and on portable storage media. While professionally curated online archives usually have extensive catalogues and indexes, personal collections of audiovisual materials sometimes suffer from lack of organization.

At one level, the folder and directory structures available on desktop computers allow virtually any material to be organised. However, other means of organizing audiovisual materials are available. Most music and video player software such as iTunes, xmms or Windows Media Player embodies some idea of bookmarks, libraries or playlists. Annotation software often includes file management features, sometimes for thousands of files. Dedicated personal information management software such as DEVONthink (Devon Technologies, 2006) handles multimedia and text files equally. Commercial media management software such as CONTENTdm (Dimema, 2006) and Retrieva (2006) offers more sophisticated ways of organizing contents. Some software attempts to automatically index still images and text files added to it. For sound and image files, however, the search capabilities of such software use only tags and metadata.

3.3 Annotation

In the context of time-based media, annotation associates extra information, often textual but not necessarily so, with a particular point in an audiovisual document or media file. In humanities research, annotation has long been important, but in the context of sound and image, it takes on greater importance. Rich annotation of content is required to access and analyse audiovisual materials, especially given the growing quantities of this material. Annotation software for images, video, music and speech is widely available, but it does not always meet the needs of scholars, who annotate for different reasons. Sometimes annotation simply allows quick access to, or indexing of, different sections or scenes. Annotation has particular importance for film and video, where it is sometimes used for thematic or formal analysis of visual forms or narratives. At more fine-grained levels, some film scholars analyse a small number of film frames in detail, following camera movements, lighting, figures, and framing of scenes. Annotation tools designed for analysis of cinema are not widely available. Most video analysis software concentrates on a higher level of analysis.

3.3.1 Annotation and standards

There are many different approaches to standards in annotation. There are several well-known metadata standards applicable to humanities research, such as library standards like MARC and Z39.50, and other, broader standards like the Dublin Core. These are useful standards, but are dominated by the resource-level approach; most similar metadata standards describe content on the level of an entire entity within a library. This level of metadata is very useful, but does not satisfy the requirements of annotation as described above: the standards do not have robust models for marking points within the content.

MPEG-7 is an ISO standard (category 1), conceived in 1996, and finalised (in its first versions) in 2001-2002. It is intended to be a comprehensive multimedia content description framework, enabling detailed metadata description aimed at multiple levels within the content. It is worthwhile to go into a little detail on the standard and what it might offer to humanities researchers.

A key to understanding MPEG-7 is appreciating the goals that shaped its conception and the environment in which it was born. It was conceived in a time when the World Wide Web was just showing its potential to be a massively interconnected (multi-) media resource. Text search on the web was beginning, and throwing into relief the opacity of multimedia files: there was then no reliable way of giving human or computer access inside a multimedia resource without a human viewing it in its entirety. Spurred on by developments in the general area of query by example (including query by image content and query by humming), it was thought that MPEG could bring its considerable signal processing prowess to bear on those problems.

Along the way to the standard, people discovered that the problem of multimedia content description was not all that trivial, nor could it rely wholly upon signal processing. It had to bring in higher-level concerns, such as with knowledge representation and digital library and archivist expertise. In doing so, the nascent standard became much more complex, but had the potential to be much more complete.

The standard, as delivered, has a blend of high- and low-level approaches. The visual part of the standard kept closest to MPEG's old guard, concentrating on features unambiguously based upon signal processing and very compact representations. The newly created Multimedia Description Schemes subgroup (MDS) brought in a very rich, often complex set of description structures that could be adopted for many different applications. MPEG-7 Audio took a middle path, offering both generic, signal processing-inspired feature descriptors and high-level description schemes geared towards specific applications.

Technically, MPEG-7 offers a description representation framework expressible in XML. Data validation is offered by the computationally rich, but somewhat complex XML Schema standard. Users and application providers may customise the precise schema via a variety of methods. There are numerous descriptive elements available throughout the standard, which can be mixed and matched as appropriate. Most significantly, it allows for both simple and complex time- and space-based annotations, and it enables both automated and manual annotations.

Industrial take-up and generally available implementations of MPEG-7 have been inconsistent at best so far. The representation format offered by MPEG-7, however, seems to be one that would serve arts and humanities research very well. It is agnostic to media type and format. It is very general, and can be adapted to serve a variety of different applications. Despite its flexibility, it is far more than a guideline standard: it has very specific rules for ensuring compatibility and interoperability. If someone were to invent a framework serving the arts and humanities research community for its metadata needs, it would resemble MPEG-7, at least conceptually.

A fine-grained approach to the problem of re-using annotations relies on developing shared standards for annotation. Standards for annotation of video content have been developed. For example, Annodex (2006) (category 2) is an open standard for annotating and indexing networked media, and draws to some extent upon experience gained from MPEG-7. Annodex tries to do for video what URLs/URIs (i.e. web links) have done for text and images on the web: that is, to provide pointers or links into time-based video resources on the web. The Metavid project (2006) demonstrates Annodex in action on videos of the U.S. Congress.
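The idea of a link that points into a time-based resource can be sketched with a temporal fragment appended to an ordinary URL. The `#t=start,end` form used here follows the W3C Media Fragments URI syntax and is an assumption made for illustration; it is not necessarily the addressing scheme that Annodex itself defines. The example URL and function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def _seconds(value):
    """Format seconds compactly, e.g. 90 -> '90' and 120.5 -> '120.5'."""
    return format(value, ".3f").rstrip("0").rstrip(".")


def time_link(url, start, end=None):
    """Return url with a temporal fragment pointing at [start, end) in seconds."""
    fragment = "t=" + _seconds(start)
    if end is not None:
        fragment += "," + _seconds(end)
    return urlunsplit(urlsplit(url)._replace(fragment=fragment))


def parse_time_link(url):
    """Return (start, end) in seconds from a temporal fragment, or None if absent."""
    fragment = urlsplit(url).fragment
    if not fragment.startswith("t="):
        return None
    start, _, end = fragment[2:].partition(",")
    return float(start), (float(end) if end else None)
```

An annotation can then cite, say, the thirty seconds of a debate in which a particular remark is made, in the same way that a web page cites another page.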

3.3.2 Manual annotation

There are numerous tools (and formats) for creating linguistic annotations, many catalogued by the Linguistic Data Consortium (2001). (According to the LDC, "Linguistic annotation" covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions (audio, video and/or physiological recordings) or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, named entity identification, co-reference annotation, and so on. The focus is on tools which have been widely used for constructing annotated linguistic databases, and on the formats commonly adopted by such tools and databases.) Some of the analysis tools mentioned earlier also support annotation; see e.g. Gonet and Święciński (2002) or the long catalogue of tools listed by Llisterri (2006). There is also the open source Transcriber tool (2006) and numerous other commercial solutions for more general transcription of digital speech recordings, such as NCHSwiftSound (2006). These tools fall variously into categories 1-4.

For video, a typical annotation tool is Transana (category 1), developed by WCER, University of Wisconsin (2006), which allows researchers to identify analytically interesting clips, assign keywords to clips, arrange and rearrange clips, create complex collections of interrelated clips, explore relationships between applied keywords, and share their analysis with colleagues.

Annotation of music associates non-textual information with the original data more often than is the case for other media. For example, scholars needing to know where the beats come in a piece of music might associate a sequence of MIDI data with an audio or MIDI stream. Essentially the task remains the same: to associate some symbolic information with points or segments of the audiovisual medium. In the case of multi-channel or multi-track data, it is possible that annotations might be applied to separate channels or tracks, but we have found no instances of this. The kinds of annotations which researchers wish to make range from structural or quasi-semantic labels (e.g., "first subject" or "mellow") to technical/analytical data (e.g., harmonic analyses, key or tempo) to the identification of small-scale events (e.g., specific notes or drum beats). These annotations can be attached to time points in the original stream, or to segments. In the latter case the annotations might or might not form a hierarchical structure (with segments contained within segments) and might or might not contain overlapping segments. Annotation tools are unfortunately rarely explicit about which of these kinds of annotation are supported. Lesaffre, Leman, De Baets & Martens (2004) discuss some of the theoretical issues around musical annotation.
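The distinctions drawn here between point and segment annotations, and between hierarchical and overlapping segmentations, can be made concrete with a small data model. This is a sketch under our own assumptions; the class and function names are hypothetical and do not correspond to the format of any tool discussed in this report.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Point:
    """An annotation attached to a single instant, e.g. a drum beat."""
    time: float
    label: str


@dataclass(frozen=True)
class Segment:
    """An annotation attached to a time span, e.g. 'first subject'."""
    start: float
    end: float
    label: str


def overlaps(a, b):
    """True if the two segments share any stretch of time."""
    return a.start < b.end and b.start < a.end


def contains(outer, inner):
    """True if inner lies entirely within outer."""
    return outer.start <= inner.start and inner.end <= outer.end


def is_hierarchical(segments):
    """True if every pair of segments is either disjoint or nested."""
    return all(
        not overlaps(a, b) or contains(a, b) or contains(b, a)
        for a, b in combinations(segments, 2)
    )


def points_in(segment, points):
    """Return the point annotations that fall inside a segment."""
    return [p for p in points if segment.start <= p.time < segment.end]
```

A structural analysis (an exposition containing a first and a second subject) passes the `is_hierarchical` check, whereas adding a "mellow" label that straddles the boundary between the two subjects does not. A tool that stores only strictly nested segments could not represent the second case.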

Though not intended explicitly for annotation, music-editing or music-composition software can have annotation capabilities or be repurposed to perform annotation tasks. One example of the use of commercial sequencer software is Tanghe, Lesaffre, Degroeve, Leman, De Baets & Martens (2005), who used Cakewalk Sonar (by Twelve Tone Systems) to annotate the drum beats in extracts of sound recordings. Two MIDI tracks were added using the software, one indicating where the beats came and the other indicating each percussion stroke and the kind of instrument (bass drum, snare drum, etc.). An advantage of using the software was that the MIDI track could be played either with or without the original audio track, allowing the user to check by ear whether or not the percussion strokes had been correctly identified and correctly timed.

Tools intended for annotation of speech could be used also for annotating music, and while Lesaffre et al. dismiss these as not suitable because they do not support the kinds of annotation required for music, WaveSurfer (category 1; Sjölander & Beskow, 2000, 2006) has been used both directly and as a basis for specialised music-annotation tools. Similarly, generic tools for annotating video or audio, such as Project Pad (Northwestern University, 2006) (category 2), can be used for annotating music. Perhaps the most highly developed specific music annotation tool is the CLAM Music Annotator (MTG, 2006; Amatriain, Massaguer, Garcia & Mosquera, 2005) (category 2). This software allows different kinds of annotations to be attached to time points or to segments, and different kinds of annotations can attach to different segmentations. Annotation types can be defined using an XML schema, and software elements can be added to automate some annotation processes (see Section 3.3.4.2).

3.3.3 Collaborative annotation

Annotation of audiovisual materials can take a lot of time, and even if material has been annotated by one researcher, the problem remains of how any other researcher can make use of the annotation. It is therefore not a surprise that projects have investigated sharing the effort and the results. We see much activity in this area, and some promising early ideas. Annotation for the purpose of finding audiovisual material seems successful, but we have not seen anything like the sophisticated and consistent analysis that would be needed to write even a basic film or book review.

Simple collaborative annotation of audiovisual materials is now common on the web. Sites such as Google Video (Google, 2006b) (category 1) or Youtube (2006) (category 1) partly rely on tags supplied by contributors. Producers and consumers of audiovisual material such as photographs, speech, sound and music, or video tag them with keywords. These keywords then become searchable via web search engines or through subscription mechanisms (e.g., a user who subscribes to content tagged with Star Wars will receive notification whenever anything tagged with Star Wars has been added to the database). While people often choose very generic keywords, and the keywords often apply to large video files, the tags and keywords are clearly useful. There is a synergy between the descriptions supplied by different users. For example, one may annotate the style of the image, and another marks the presence of a street sign. Combinations of the annotations supplied by users allow database-driven websites such as flickr.com and youtube.com to provide reasonably powerful and selective search capabilities, more informative than one would expect from any single set of annotations. Currently, commercial video uploading and downloading services are growing rapidly, and they offer increasingly sophisticated annotation features (e.g. Viddler, 2006, category 3). However, by and large, the annotations only describe the most obvious features, which limits the searches that can be done.
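The synergy between tags supplied by different contributors, and the subscription mechanism mentioned above, can be sketched with a small in-memory tag store. The class, its methods and the example items are hypothetical; this is not a description of how any of the named sites is implemented.

```python
from collections import defaultdict


class TagStore:
    """Pools tags from many contributors and notifies tag subscribers."""

    def __init__(self):
        self.items_by_tag = defaultdict(set)   # tag -> items carrying it
        self.subscribers = defaultdict(set)    # tag -> users following it
        self.notifications = []                # (user, tag, item) events

    def subscribe(self, user, tag):
        self.subscribers[tag.lower()].add(user)

    def add_tags(self, item, tags):
        """Attach tags to an item; notify subscribers of newly tagged items."""
        for tag in tags:
            key = tag.lower()
            if item not in self.items_by_tag[key]:
                self.items_by_tag[key].add(item)
                for user in sorted(self.subscribers[key]):
                    self.notifications.append((user, key, item))

    def search(self, *tags):
        """Return the items that carry every one of the given tags."""
        sets = [self.items_by_tag.get(tag.lower(), set()) for tag in tags]
        return set.intersection(*sets) if sets else set()
```

If one contributor tags an item with a description of its style and another tags the same item with "street sign", a search combining both tags finds it, although neither contributor's annotations alone would have supported that query.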

A number of projects have attempted to design and construct collaborative software environments for video annotation. In collaborative video annotation, a number of people can work on the same video footage. Efficient Video Annotation (EVA) (Volkmer, 2006) (category 2) is a novel Web tool designed to support distributed collaborative indexing of semantic concepts in large image and video collections. Some video annotation tools such as Transana (WCER, 2006) already exist in multi-user versions. Another approach to collaborative annotation is to set annotation up as a game in which annotations are generated as a by-product. See, for example, The ESP Game: Labeling the Web (Carnegie Mellon University, 2005a), a collaborative gaming approach to image annotation that is described in more detail by von Ahn & Dabbish (2004). The same team has a later game, Peekaboom (Carnegie Mellon University, 2005b), which helps in generating labels for segmented images, useful for computer vision, for example. The designers claim to have generated over 10 million descriptive words for one million images.

A different approach is taken by the application mediaBase (Institute for Multimedia Literacy, 2005) (category 2), which requires some manual annotation or tagging of any media file put in the system. However, after this initial tagging, it encourages rich media authorship as a way of investigating relations between different media components. MediaBase publishes resulting compositions on the web, and they can be altered, edited, revised or added to by others. The goal of MixedMediaGrid (NCeSS, 2005) (category 4), an ESRC e-Science funded project, is to generate tools and techniques for social scientists to collaboratively analyse audio-visual qualitative data and related materials over the Grid. Certainly, these tools and techniques could be used in the humanities too. MediaMatrix (category 2) developed at Michigan State University (2005) is a similar online application that allows users to isolate, segment, and annotate digital media.

Similarly in music, a number of projects have suggested processes of collaborative annotation to allow researchers to pool effort and benefit from each other's annotations. Project Pad (Northwestern University, 2006) is designed explicitly to allow teams (envisaged as students, but they could be researchers) to share annotations. Collaborative annotations of music in education have been reported, but none in research. A collaborative music-annotation project intended for research has been set up at Pompeu Fabra University, using either the CLAM Music Annotator or a Wavesurfer-based client to a web portal (Herrera et al., 2005), but there is as yet little evidence that it is accumulating a large set of annotations.

The BBC also has a project in this area, with the aim that listeners will progressively annotate recordings of radio programmes (Ferne, 2005). This is an internal BBC research project, but a public launch has been mooted. Interestingly, this project uses a Wiki-like approach, allowing the public to edit existing annotations, including viewing histories and reverting to previous versions, but with the underlying assumption that there is a single canonical annotation.

Already established and an everyday part of music on the web, but not really a research tool, is the Gracenote database of CD tracks (Gracenote, 2006). The database supplies annotations for media players to supply information about artist and title which is not recorded in the electronic data on an audio CD. Publishers of CDs can supply the original information to Gracenote, but many CDs were published long before there were media players on computers, let alone before the Gracenote database existed. Commonly, when a media player finds there is no information on the database for a CD, the user is invited to supply this information, which is then sent to the database. Thus Gracenote is effectively a global collaborative annotation tool. However, in the area of classical music recordings, it is notoriously inaccurate, largely because the categories for the database (Artist, Song, etc.) do not map clearly to the commonly significant details of a classical composition (e.g., is the Artist the composer, the soloist, or the conductor?). Research uses for the database are therefore likely to be confined to popular music.

3.3.4 Automatic annotation

An alternative response to the time-consuming nature of manual annotation is to automate part of the process. Clearly, different kinds of annotations present different levels of difficulty in automation, and it is in the simple and explicit partitioning of audio, in particular, that automatic annotation has had the greatest success. The challenges of more semantic levels are much greater, though some projects in this area have had a degree of success, particularly with respect to music.

3.3.4.1 Audio partitioning

The goal of audio partitioning systems is to divide up the input audio into homogeneous segments and (typically) to determine their type. The class types considered may vary by application but a typical partitioning might distinguish pure music, pure speech, noise, combined speech and music and combined speech and noise (Tranter & Reynolds, 2006). The resultant partitioning may provide useful metadata for the purpose of flexible access, but such partitioning is also an important prerequisite for speech-to-text transcription systems (e.g. it enables the removal of audio that might otherwise generate transcription errors) (Gauvain & Lamel, 2003). For some applications, a more knowledge-based partitioning and filtering may be applied, such as removing advertisements and other audio segments that are either not of interest to the end user and/or are likely to degrade automated system performance downstream. Such technology falls into categories 3-5 and is typically available from companies and/or research labs with interests in speech-to-text transcription.
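
The output of such a system can be pictured with a minimal sketch. It assumes that some classifier (not shown, and hypothetical here) has already assigned one class label to each fixed-length frame of audio. The sketch merely collapses runs of identical labels into homogeneous segments and then filters out everything except speech, as might be done before speech-to-text transcription. The labels and frame length are invented for illustration.

```python
from itertools import groupby

def frames_to_segments(labels, frame_seconds=1.0):
    """Collapse per-frame class labels into (start, end, label) segments."""
    segments = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        start = position * frame_seconds
        end = (position + length) * frame_seconds
        segments.append((start, end, label))
        position += length
    return segments

def keep_only(segments, wanted):
    """Keep the segments whose label is in the wanted set."""
    return [segment for segment in segments if segment[2] in wanted]

# Hypothetical frame-level output of an audio classifier (one label per second).
frame_labels = ["music", "music", "speech", "speech", "speech", "noise", "speech"]
segments = frames_to_segments(frame_labels)
speech_only = keep_only(segments, {"speech"})
```

The hard part of the task, deciding the label of each frame, is deliberately left out of this sketch.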

3.3.4.2 Music

The past decade has seen the birth and rapid growth of the field of Music Information Retrieval (MIR), fed in part by the interest of music businesses in technologies to facilitate user interaction with large databases of downloadable music. While query by humming (see Section 3.2.2, Searching for music and sound) was an initial impetus to this field, more research has recently been directed at what are effectively various kinds of annotations of music. Some of these are concerned with partitioning (e.g., note onset detection or segmentation into broad sections) and some are concerned with richer information such as tempo, beat, harmony and tonality, and various kinds of similarity or classification. Two well-developed tools for MIR are Marsyas, by George Tzanetakis (Tzanetakis, n.d.; Tzanetakis & Cook, in press), and M2K, by J. Stephen Downie and others (Information Systems Research Laboratory, 2005), which functions within the D2K Data to Knowledge framework of the US National Center for Supercomputing Applications.

The achievements of recent MIR research are best shown in the results of the MIREX competition (MIREX, n.d.) associated with the international conferences on Music Information Retrieval (ISMIR, n.d.). The 2005 competition had ten categories: Audio Artist Identification, Audio Drum Detection, Audio Genre Classification, Audio Key Finding, Audio Melody Extraction, Audio Onset Detection, Audio Tempo Extraction, Symbolic Genre Classification, Symbolic Melodic Similarity, and Symbolic Key Finding. The audio competitions used recorded sound as the raw data, while the symbolic competitions used MIDI files. The best audio systems typically performed with accuracies of 70-80%, and though key finding approached 90% accuracy, this is still well below the level at which such software would produce reliable results with real saving of effort if details of individual cases are important. The best symbolic systems interestingly performed at similar levels of accuracy, despite the much lower complexity of the input data. Other tasks on symbolic data, on the other hand, such as pitch spelling (i.e., determining a note name and accidental for each note such as C sharp or D flat) can be performed with levels of accuracy of greater than 98% (Meredith, 2006), promising useful research tools. Most MIR software falls into categories 3-5. Only Marsyas has become sufficiently widely used to take on the status of category 2 (released, but not yet finished, software), but its use is currently as a toolkit for MIR research rather than a tool for musicologists.

It would be a mistake, however, to think that MIR research will not assist musicological and music-analytical research. While it is true that tools which automate the typical tasks of music analysis are, as yet, not in prospect, MIR tools do produce a wealth of potentially useful and interesting data about musical sound of a somewhat different nature (e.g., measures of acoustic roughness, and various kinds of correlations). With a change of focus by music analysts (and a certain amount of re-education, since the acoustics and mathematics involved are not part of the general knowledge of music analysts), these tools promise novel and fruitful areas of research which focus on the analysis of music as sound rather than music as notated structure.

3.3.4.3 Video

A video can be partitioned into shots. A shot is an uninterrupted sequence of video frames that is continuous in time, space and graphical configuration. For the last decade, many research projects have been working on the automated partitioning of footage into shots and topics, and on face recognition (particularly in news video processing). Some of this research has led to commercial products. Some of these systems use manual annotation to start with, and then automatically annotate and index any related video materials. For instance, the Marvel video annotation system (IBM, 2006) (category 3) demonstrates the ability to generate semantic and formal labels from television news footage. Marvel builds statistical models from visual features using training examples and applies the models to automatically annotate large repositories. Other projects seek to generate topic structures for TV content using TV viewers' comments on live web chat rooms (Miyamori et al., 2006).
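
As an illustration of what partitioning into shots involves, the following minimal sketch marks a shot boundary wherever the grey-level histogram of a frame differs sharply from that of the preceding frame. Histogram differencing is a commonly used baseline, chosen here only for simplicity; it is not claimed to be the method of any system cited above. Frames are represented as flat lists of pixel values, and the bin count, threshold and test data are assumptions made for the example.

```python
def histogram(frame, bins=4, max_value=256):
    """Normalised grey-level histogram of a frame given as a flat list of pixels."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return [count / len(frame) for count in counts]

def shot_boundaries(frames, threshold=0.5):
    """Return each index i at which frame i appears to start a new shot."""
    if not frames:
        return []
    boundaries = []
    previous = histogram(frames[0])
    for i in range(1, len(frames)):
        current = histogram(frames[i])
        # Half the summed absolute difference lies between 0 (same) and 1 (disjoint).
        distance = sum(abs(a - b) for a, b in zip(previous, current)) / 2
        if distance > threshold:
            boundaries.append(i)
        previous = current
    return boundaries

# Hypothetical four-pixel frames: three dark frames followed by two bright ones.
dark = [10, 20, 30, 40]
bright = [200, 210, 220, 250]
frames = [dark, dark, dark, bright, bright]
cuts = shot_boundaries(frames)
```

A hard cut of this kind is the easy case. Gradual transitions such as fades and dissolves, and shots with rapid motion or lighting changes, would need more than a single fixed threshold.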

3.4 Transcription

Transcription is typically applicable only to audio within time-based multimedia. More technically, as it is a process of writing down events in a canonical form, it applies to events that are transitory and constrained. As such, music, dance, and speech are the most commonly transcribed sources of those within the project's remit. Automatic general video transcription makes little sense in the near term because it essentially requires a model of the whole world. With constrained worlds, some transcription is possible, and there has been some automatic video understanding of sports on video as well.

3.4.1 Speech-to-text transcription

Speech-to-text (or automatic speech recognition) systems aim to convert a speech signal into a sequence of words. Progress in the field has been driven by standardised metrics, corpora and benchmark testing through NIST since the mid-1980s, with systems developed for ever more challenging tasks or speech domains: developing from the domain of single-person dictation systems to today's research into systems for the meetings and lectures domain. A brief history of speech (and speaker) recognition research can be found in Furui (2005a).

Some of the differences between speech domains can create additional difficulty for automatic systems. For example, speech from the lecture domain has much in common with speech from a more conversational domain, including false starts, extraneous filler words (like "okay") and filled pauses ("uh"). It also exhibits poor planning at higher structural levels as well as at the sentence level, often digressing from the primary theme. An evaluation in 2004 reported state-of-the-art transcription systems to achieve a word error rate (a measure of system accuracy which incorporates word deletions, insertions and substitutions) of 12% for broadcast news in English, but 19% for Arabic. For conversational telephone speech, the figures were 15% for English and 44% for Arabic (Le, 2004). The effect of differences in the manner of capture of the audio is illustrated in the figures from an evaluation in 2005 for meetings and lectures (in English), where the error rates were 26% and 28% respectively when speakers had individual headset microphones but 38% and 54% in the case of multiple distant microphones in the meeting room (Fiscus, 2005).
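
Word error rate as described above is conventionally computed as the minimum number of word substitutions, deletions and insertions needed to turn the system output into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation using a standard edit-distance recurrence. The example sentences are invented, and real evaluations apply text normalisation rules that are not shown here.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# Hypothetical reference and system output: one word dropped, one word added.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat today")
```

Because insertions are counted, the measure can exceed 100% when the system output is much longer than the reference.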

Development of a system for a new speech domain or application ideally builds upon a large amount of manually transcribed in-domain training data in order to build a speech transcription system tailored to that domain (often of the order of hundreds if not thousands of hours for state-of-the-art systems (Kim et al., 2005)). The level of accuracy of the transcriptions need not be perfect: techniques have recently been developed to handle less than perfect transcriptions such as closed captions (Kim et al., 2005), and technologists report that up to a 5-10% word error rate can be handled in a single transcript, or multiple transcriptions of different reliability exploited (Phil Woodland, personal communication). Where sufficient adequately transcribed data cannot be made available for financial or other reasons, as much adequately transcribed in-domain acoustic data as is feasible is obtained (which will sometimes be none), and models from a similar domain are adjusted or adapted in terms of their acoustic, vocabulary or word predictor components in order to match the new domain as well as possible. Vocabulary and language model (word predictor) adjustments can also be made based upon in-domain textual information such as transcripts, textbooks or other metadata where available.

There is a computation time versus accuracy trade-off: a real-time system will typically perform less well than a 10-times-real-time (10xRT) or even unconstrained system, but the degradation will vary with situation. Similarly, memory constraints can affect performance. State-of-the-art systems typically use hardware beyond that of today's average desktop. (The word-error rates for English speech referred to above were achieved with a constraint of 10xRT and 20xRT respectively (Le, 2004).)

It is important to note that speech recognition systems developed for one domain cannot, in many if not most situations, be employed as a black box that can handle any domain: even speech from the same domain that differs from the training data may be problematic (e.g. speech from previously unseen broadcast news shows in Le, 2004). There exist components of the system which are brittle or sensitive to such changes: the system has been trained to recognise certain types of speech and, whilst it may perform quite well on those types of speech, it may perform badly on speech which is different. Such differences may include (but are not limited to):

channel differences, such as speech which is recorded over the telephone versus speech which is recorded using a headset microphone;
individual speaker differences, including accent, vocal range;
style of data, whether conversational, dictated, produced and carefully pronounced (as in broadcast news);
vocabulary.

There exist system adaptation techniques to compensate for such differences to some extent (e.g. Gales 1996), but despite significant progress in this area the development of systems which are robust to differences in data is a key research goal at present (Le, 2004; Ostendorf et al., 2005).

Systems have also been developed for some domains in many other major European languages, e.g. the LIMSI-CNRS spoken language processing group has developed broadcast news transcription systems for French, German, Portuguese and Spanish in addition to English, Mandarin and Arabic (Gauvain & Lamel, 2003). Mention should also be made of the recently-started DARPA Global Autonomous Language Exploitation (GALE) program (see Linguistic Data Consortium, 1996-2005), which is developing technologies to absorb, analyse and interpret huge volumes of speech and text in multiple languages: as part of this, projects such as AGILE (Autonomous Global Integrated Language Exploitation, involving multiple sites including the University of Cambridge and the University of Edinburgh) are developing combined speech-to-text translation systems that can ingest foreign-language news programmes and TV shows and generate synchronised English subtitles (Machine Intelligence Laboratory, 2005). (Such technology is becoming commercially available for certain scenarios, though at a cost hinted to fall in the band of hundreds of thousands of US dollars: IBM has recently developed the TALES server system that perpetually monitors Arabic TV stations, dynamically transcribing and translating words into English subtitles; the video processed through TALES is delayed by about four minutes, yielding an accuracy of 60 to 70% compared to an estimated 95% human translator performance (PC Magazine, 2006).)

Differences in speech transcription performance across different domains mean that speech transcription tools fall into development categories 1-5 depending upon the difficulty of the domain. For example, desktop systems for dictated speech-to-text and desktop control are readily available (e.g. Dragon NaturallySpeaking), as are systems for constrained domains such as medical transcription (e.g. Philips SpeechMagic supports 23 languages and specialised vocabularies). The Microsoft SDK can be freely downloaded and used for the development of speech-driven applications and is supplied with recognisers for US English, simplified Chinese and Japanese (Microsoft, 2006). All of these tools fall into categories 1-2 but will perform well only in certain situations.

State-of-the-art speech-to-text systems are typically made available through joint projects with universities or commercial organisations such as Philips and ScanSoft. These tend to fall into categories 2-5. For the enthusiast with time to spare, the HTK project offers downloadable software that will let you build a reasonable word-level or phonetic-transcription system and now offers an API (called ATK) for building experimental applications (HTK, n.d.); SPHINX-4 (Sphinx, 1996-2004) is an alternative, and there are many other tools of interest, such as the CSLU Toolkit (Centre for Spoken Language Understanding, n.d.). These tools fall into categories 4-5.

Church (2003) presents a chart showing that speech-to-text transcription researchers have achieved 15 years of continuous error rate reduction, and we might wonder what the future holds. At present, the accuracy of current systems lags about an order of magnitude behind the accuracy of human transcribers on the same task (Moore, 2003; David Nahamoo quoted in Howard-Spink, n.d.). Moore has estimated that it would take a minimum of 600,000 hours of acoustic training data to approach a 0% error rate using current techniques, which he also estimates to be a minimum of four times a typical human's lifetime exposure to speech!

3.4.1.1 Speech-to-phonetic transcription

Researchers have also investigated automatically extracting textual transcriptions comprising a sequence of sub-word units (e.g. syllables or single sounds referred to as phones). This task has not been as heavily researched in recent years, but has relevance to search and indexing applications, since such sub-word transcriptions often form the basis for techniques for searching for out-of-vocabulary (OOV) query words with which the word-level transcription system is not familiar. New words appear in the news every day (e.g. "9/11" suddenly entered our vocabulary) and may not appear in the basic speech-to-text system vocabulary, so could not be recovered in a straight word transcription search. (A discussion of techniques for handling OOV queries in spoken audio can be found in Logan et al. (2003).) Recent work examining phonetic transcription includes Saraçar et al. (2000) and Saraçar & Khudanpur (2004). Phonetic transcription tools typically fall into development category 5 and exist within universities and research labs, though only for specific phone sets and not necessarily in forms which are easily packaged. The NICO toolkit (KTH, n.d.) also supports development of a neural network-based estimator of phoneme probabilities, though this would probably be of interest only to hardcore enthusiasts.
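
The idea of searching a sub-word transcription for a word outside the recogniser's vocabulary can be sketched as follows. The phone symbols, the phone transcript and the pronunciation of the query word are all invented for illustration. The sketch finds only exact matches, whereas a practical system would need approximate matching because automatically produced phone transcriptions contain errors.

```python
def find_phone_sequence(transcript, query):
    """Return the start indices at which the query phone sequence occurs."""
    n = len(query)
    if n == 0:
        return []
    return [i for i in range(len(transcript) - n + 1)
            if transcript[i:i + n] == query]

# Hypothetical phone-level transcript of the words "the blog was new".
phone_transcript = ["dh", "ax", "b", "l", "aa", "g", "w", "aa", "z", "n", "uw"]
# Hypothetical pronunciation of the out-of-vocabulary query word "blog".
query_phones = ["b", "l", "aa", "g"]
hits = find_phone_sequence(phone_transcript, query_phones)
```

The query word never needs to be in the word-level vocabulary: it is enough that a pronunciation for it can be supplied or guessed at search time.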

3.4.1.2 Transcription with video

As in some other problem domains, there is some convergence in research based on audio and video. Audiovisual speech-to-text systems, which combine information about the movement of the lips and possibly a wider region of interest around the mouth with audio information, have been found to improve over audio-only speech-to-text in certain conditions (e.g. noisy conditions and/or large vocabulary dictation tasks) (Potamianos, 2003). Category 2 tools of this type are under development for constrained domains such as finance and within-car use.

Allowing the combined use of audio and video was also found to improve the segmentation of stories on video, relative to purely speech transcript-based approaches for most systems at TRECVID 2004 (Kraaij et al., 2004); multimodal information retrieval systems can also outperform speech based retrieval systems, although speech-based retrieval contributes most of the performance to date (Hauptmann, 2005). Multimodality can also be usefully exploited in presentation e.g. the CueVideo system offers the end-user a choice between presentation formats such as visual-only storyboards (slideshows of key frames without audio) and moving storyboards with audio, allowing them to select the most appropriate presentation mode for the video content in use (Amir et al., 2002). All of these research areas are still at quite preliminary stages, with the exception of audiovisual speech recognition work, and fall mostly into categories 3-5. However, it seems likely that solutions which make use of multiple rather than single information sources, where this is an option, will prove most successful in the future.

3.4.1.3 Time-alignment of speech and text

A convenient property of the most popular (statistical) approach to speech recognition is that the same algorithm used for speech-to-text transcription can be used to time-align a word-level transcript (e.g. a script) with the corresponding speech signal, associating each word with its start and end time in the audio signal. (In a robust system, the algorithm may be lightly modified to allow for errors in the script (e.g. Chan & Woodland, 2004) and for distracting non-speech audio such as music or other background noise.)

3.4.2 Transcription-related annotation of speech

The speech-to-text transcriptions as discussed above have historically comprised an unpunctuated and unformatted stream of text. There has been considerable recent research into generating richer transcriptions annotated with a variety of information that can be extracted from the audio signal and/or an imperfect word level transcript. Such annotations may improve applications which involve presentation of transcripts (e.g. user reading of results returned by search systems), but may also improve downstream processing (e.g. machine translation). Areas of interest include:

3.4.2.1 Punctuation and structural information

There has been investigation into automatically generating punctuation as well as into generating speech-specific structural information such as marking interruption points, edit regions and boundaries of sentence-like units. Much of the latter work fell under the umbrella of the DARPA EARS program, under the structural metadata task (Liu et al., 2005).

3.4.2.2 Speaker-related information

Associated tasks include speaker detection and tracking (identifying a number of speakers and grouping utterances from the same speaker, although absolute speaker identities remain unknown), speaker identification (determining who is speaking at a particular time), speaker verification (determining whether a particular speaker corresponds to a claimed identity) and tasks related to speaker localisation (e.g. in meeting scenarios). Examples of such work include the summary paper by van Leeuwen et al. (2006) and Tranter & Reynolds (2006).

3.4.2.3 Named entity extraction

The task involves annotating transcripts to mark word sequences corresponding to items such as proper names, people, locations and organisations, or dates and times. The BBN Identifinder (BBN Technologies, 2004-6b) is a category 1 named entity extractor that has been quite widely used in the technical community.

3.4.2.4 Topic-related information

Tasks investigated include the detection of topic boundaries in a stream of data, clustering of related segments of data, the automatic detection of later occurrences of data relating to the story of interest and story link detection tests to determine whether two given stories are related. As this description suggests, these tasks make most sense for news data which comprises a sequence of stories although there has been related work for conversational speech such as that in the MALACH project. (Allan, 2001 includes a summary of topic-detection and tracking activity; Franz et al., 2003 describes MALACH-related work.)

3.4.2.5 Information extraction

This encompasses attempts to identify relationships between entities in one or more documents (e.g. coreferencing) and the extraction of domain specific event types (e.g. free kicks, goals in football matches). The MUMIS project at Sheffield University is attempting to extract such information across multiple, multimodal sources. The problems of information extraction are discussed in Cowie & Wilks (2000) and Grishman (1997); the problems of information extraction from errorful automated speech recognition transcripts are considered in Grishman (1998).

3.4.2.6 Other

There are preliminary investigations into the extraction of language information (see e.g. the 2003 or 2005 NIST language recognition evaluations described in papers such as Martin & Przybocki (2003)), dialect information (also addressed in the language recognition evaluations, with some sites treating each dialect in the same way they would treat a distinct language (Chen, 2006)), emotional information (e.g. speaker state) (see e.g. the useful list of emotion-related projects maintained by the EU HUMAINE project (Humaine, 2003-6)), dialogue act information (see e.g. Wright (1999) and Webb et al. (2005)), and prosody.

Progress in the first four of these annotation processes has been driven by evaluations. Named entity, topic and information extraction techniques have been most heavily investigated for text rather than errorful transcriptions of unplanned speech; developing more robust techniques for handling such text is the subject of ongoing research (Ostendorf et al., 2005). A few category 1-2 named entity extraction tools exist, but most of the rich metadata annotation research above falls into categories 4-5 and much of it has only been investigated for speech from a small set of domains, such as broadcast news and conversational speech. Emotion-related work in particular is very preliminary and falls into category 5.

3.4.3 Music transcription

For years, scholars have anticipated a tool which could transcribe musical performances to music notation. Indeed, the original aim of one of the earliest and best known projects in musical computing was a system which would transcribe the performance of a musician on a specially designed music keyboard into music notation (Longuet-Higgins, 1976). A tool which automatically transcribes even a simple musical performance into correct and accurate music notation remains a distant goal, however. Perhaps this should be no surprise, since only highly trained musicians can make any such transcription at all, and even so the process involves a high degree of approximation and guess-work. On the other hand, transcription into some form of notation which gives useful information is possible for restricted kinds of musical sound, and it can be a useful tool in, for example, ethnomusicological research where systems like the melograph (a device which derives a continuous pitch curve from monophonic sound) have been in use for some time. A recent review of the state of the art in music transcription is Klapuri (2004).

3.5 Analysis

The location of analysis in the diagram indicates our intended meaning for the term: while many of the tasks and processes of annotation and transcription are in some sense analytical, we mean here that part of research where the results of annotation and transcription are subject to the judgement and intervention of the scholar who seeks to extract useful information, draw lessons, and form conclusions.

3.5.1 Analysis of audio and music

With respect to sound, the most significant contribution of ICT tools to analysis has been in the now almost routine extraction of frequency-domain information from sound signals. Analyses which focus on acoustic properties, in phonetics and music, regularly make use of tools which employ Fourier analysis or other methods such as auto-correlation to determine the component frequencies of a signal and their relative strengths. In the case of non-static sounds, this information is most commonly presented in a sonogram (a two-dimensional display with time as the horizontal axis and frequency as the vertical). Many tools exist to effect such analysis: Wavesurfer (Sjölander & Beskow, 2006) is a good example of software from the research community, while Matlab (with its Signal Processing toolbox (The MathWorks, 1994-2006)) is probably the most commonly used commercial software. Musicians use such tools for many purposes, including the analysis of instrumental tone (e.g., Fitzgerald, 2003) and the analysis of pitch articulations and vibrato in performance (Rapaport, 2004).
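
The auto-correlation approach mentioned above can be illustrated with a minimal sketch which estimates the fundamental frequency of a short, steady signal by finding the time lag at which the signal best matches a shifted copy of itself. The function name, search range and synthetic test tone are assumptions made for the example. The tools named above are far more sophisticated, and this sketch is not a description of how any of them works.

```python
import math

def estimate_frequency(samples, sample_rate, min_hz=50.0, max_hz=1000.0):
    """Estimate fundamental frequency as the lag with the highest auto-correlation."""
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself delayed by `lag` samples.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic test signal: a quarter of a second of a pure 200 Hz sine tone.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(2000)]
pitch = estimate_frequency(tone, sample_rate)
```

On real recordings the estimate would be made repeatedly over short windows to give a pitch curve over time, and the result would be limited in resolution by the sample rate and complicated by noise and by sounds with weak fundamentals.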

The analysis of musical performance has become a topic of considerable interest, spurred by the two factors of a now substantial history of recorded music and ICT tools to facilitate the analysis of music-as-sound. (The most distinctive project in the UK in this area is the Centre for the History and Analysis of Recorded Music (CHARM), at Royal Holloway and King's College, University of London.) However, no distinct set of ICT tools seems to meet the needs of researchers in this area. It would appear that there are still considerable gaps between the information which software can derive and present about musical sound and the information which researchers want to discover. For example, it is rarely a simple and straightforward matter to distinguish where notes begin and end in a sonogram, and while information on the precise frequency composition of a sound can be derived, that does not always correlate simply with its perceived pitch composition. The most effective use of ICT in this area, therefore, comes when software can automate some of a task or present information in a manner which allows the researcher to bring to play more effectively or more rapidly his or her musical ear and judgement.

A nice example of this is a simple piece of software, MATCH (Dixon & Widmer, 2005; Dixon, 2005) (category 2), which aligns two performances of the same piece, bringing two benefits. One is continuous data on the relative timings of the performances: a researcher can quickly discover if a longer overall performance results from a slower tempo throughout or from longer pauses at certain points, for example. The other is that the alignment facilitates simple switching from one recording to another at equivalent points in the piece. Thus a researcher can quickly and easily compare how two performers treat the same passage of the piece.
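
The kind of alignment involved can be sketched with dynamic time warping, a standard technique for matching two sequences which unfold at different rates. The sketch below is a generic textbook version and does not reproduce MATCH itself. The two performances are represented by invented sequences of one number per analysis frame (here, a pitch value), which is a considerable simplification of the audio features a real system would use.

```python
def align(a, b):
    """Dynamic time warping: return (i, j) index pairs aligning a with b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest alignment of the first i items of a with
    # the first j items of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j - 1],
                                    cost[i - 1][j],
                                    cost[i][j - 1])
    # Trace the cheapest path back from the end of both sequences.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(moves)
    path.reverse()
    return path

# Hypothetical frame-by-frame features: performance B lingers on two notes.
performance_a = [60, 62, 64, 65]
performance_b = [60, 60, 62, 64, 64, 65]
path = align(performance_a, performance_b)
```

Each pair in the path links a frame of one performance to the equivalent frame of the other, which supports both benefits described above: the path shows where one performance takes longer, and it tells a player where to jump to when switching recordings.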

A second example is an equally simple suite of programs, to be used in conjunction with the music-analysis software Humdrum (Huron, 2002) and a sound editor, to capture information about the timing of beats in a performance (Sapp, n.d.). Software to automatically recognise and track beats does exist (mentioned above under automatic annotation), but none has yet reached a level of accuracy and reliability where it has been adopted as a research tool. Researchers still wish to determine by ear exactly where the beats fall. To use the software, the researcher taps the beats as the music is played back. The time of each tap is recorded by the software, and can subsequently be adjusted in a sound editor in parallel with the original music. (The suite of programs is thus a form of manual annotation software.)
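
Once beat times have been captured in this way, deriving timing information from them is straightforward. The following minimal sketch converts a list of tap times (in seconds) into inter-beat intervals and a beat-by-beat tempo in beats per minute. The tap times are invented, and the sketch is not part of the suite of programs described above.

```python
def inter_beat_intervals(tap_times):
    """Time in seconds between each pair of successive taps."""
    return [later - earlier for earlier, later in zip(tap_times, tap_times[1:])]

def tempo_curve(tap_times):
    """Local tempo in beats per minute for each inter-beat interval."""
    return [60.0 / interval for interval in inter_beat_intervals(tap_times)]

# Hypothetical tap times: a steady opening followed by a slowing down.
taps = [0.0, 0.5, 1.0, 1.5, 2.25, 3.0]
tempi = tempo_curve(taps)
```

In this invented example the tempo falls from 120 to 80 beats per minute after the fourth beat.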

For music scholars, analysis generally refers to the distinct sub-discipline of music analysis which examines the structure of individual musical compositions in depth. This generally depends on the score as the primary source, and so does not fall within the remit of this report as dealing with audiovisual materials. There is no particular reason, however, why analyses which take musical sound as the primary source should seek to examine aspects of performance (as in the CHARM project, for example) rather than the details of specific pieces. Indeed, for popular music and electroacoustic music, there generally is no primary source other than the sound, so music analysis which examines the sound is most appropriate. We can expect, therefore, that ICT tools, perhaps intended originally for MIR or for the analysis of performances, will come to be used in the sub-discipline of music analysis also.

3.5.2 Analysis of film

Two main avenues of software-augmented analysis of film and video exist. The first seeks to automate analysis of the visual forms and narrative structure of film and television. The second uses databases and presentation software (media players mainly) to facilitate new kinds of analysis. The two avenues have not yet converged. It will be interesting to see what happens if they do.

As for the first, tools for automated analysis of visual content, as mentioned above in the partitioning discussion, exist already. Some analysis is done in the interests of indexing and searching. For instance, the Virage VideoLogger software claims to automatically create structured indexes of content: "At the same time video is being encoded, VideoLogger's advanced capture and analysis technology works in real time to automatically create a structured index about the content. Time-synchronised to every encoded copy made, the index enables immediate, accurate search and retrieval of assets. In addition, because the video is data-driven, it can then be tied to applications for revenue generation, enhanced collaboration and expedited communication" (Virage, 2006) (category 1). More ambitious projects try to provide a semantic analysis. The MoCA Project (Automatic Movie Content Analysis) (Praktische Informatik IV, 2006) (category 3) seeks to provide automatic identification of the genre of a film by comparing visual statistics of frames and sequences with genre statistical profiles.

To date, the main software technology used in analysis of film and television has been the database coupled with DVD. The Digital Hitchcock project, by Stephen Mamber (UCLA, n.d.), is a well-known early example. It represents all 1,100 shots in Hitchcock's The Birds alongside Hitchcock's storyboard illustrations. Using commercial multimedia authoring software such as Macromedia Director, various projects such as the Labyrinth Project (Kinder, n.d.) have used a combination of presentation technologies (QuickTime, Flash, Shockwave) and online databases to analyse narrative in feature and documentary film.

3.6 Presentation

At almost every stage of the research process, researchers make use of different ways of visualising, summarising or tabulating audiovisual materials. Presentation refers to all the different ways in which digital technologies display or render different audiovisual materials apart from simply reproducing them. For instance, the timeline in a video editor and the waveform in a sound editor are presentations of images and sound respectively. In this sense, a Microsoft PowerPoint show is not really a presentation as such. Presentation is closely linked to analysis. In some ways, we could say analysis is nothing but a process of generating increasingly complex, conceptually ordered presentations.

3.6.1 Summarisation

It is not always appropriate to play back a full recording or clip to a user. (Consider filtering texts by eye versus filtering a set of audio clips: the latter is usually more time-consuming and not necessarily appropriate in a search system.) There has been some, albeit limited, work on generating summaries or shorter but informative representations of spoken word content, mostly in category 5. The limited amount of research may be due in part to difficulties in evaluating the quality of summaries, since the desired properties of the summary will often be application- or even user-specific.

A summary may be generated in one of three ways: using the audio alone, using an automatically generated speech-to-text transcription alone, or using both the audio and the speech-to-text transcription in combination. Techniques using the audio alone include time compression techniques such as eliminating silence or speeding up the clip (often maintaining pitch for intelligibility) (e.g., Tucker & Whittaker, 2005): the resulting compressed signal can then be played back. Techniques operating from errorful transcripts may be direct adoptions of techniques for general text summarisation (as is apparently the case in Pickering et al., 2003), though performance might be improved by incorporating knowledge of the errorful input (Furui, 2005b).
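The simplest of the audio-only techniques, eliminating silence, can be sketched as follows. This is an illustration under stated assumptions rather than the method of any cited system: the function name `drop_silence`, the fixed frame length and the peak-amplitude threshold are all hypothetical choices, and samples are assumed to be floating-point values between -1 and 1.

```python
def drop_silence(samples, frame_len, threshold):
    """Time-compress a signal by keeping only those frames whose peak
    absolute amplitude reaches `threshold`."""
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if max(abs(s) for s in frame) >= threshold:
            kept.extend(frame)
    return kept


signal = [0.0, 0.01, 0.5, -0.4, 0.0, 0.0, 0.3, 0.2]
print(drop_silence(signal, frame_len=2, threshold=0.1))
```

A practical system would normally retain short pauses and smooth the joins between retained frames so that the compressed speech remains natural and intelligible.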

Techniques exploiting both audio and transcript include the work by Koumpis and colleagues, who use both lexical and audio-derived prosodic information to identify elements to include in the summary (Koumpis & Renals, 2001). Summaries generated from the transcript, or from both audio and transcript, can be presented in textual form or in audio form (perhaps involving the use of a speech synthesis system). The choice of summary presentation format is likely to be application dependent: for example, audio summaries might be useful in situations such as over-the-phone access to a voicemail box, whereas textual summaries could be sent to a mobile phone via SMS (as in the VoiSum system (Koumpis & Renals, 2005)). Spoken word content summarisation and usability issues have been considered in some detail by Arons (1997) and by Furui (2005b).

The same considerations have motivated research into automatic summarisation of music. The common approach is to perform some kind of self-similarity analysis of the audio signal, often by means of a frequency-domain transformation, and then to extract those segments which are similar to other segments. These are likely to correspond to recurring passages such as the chorus of a song, and so to contain music which is salient and typical of the whole. A short segment of audio can then be constructed by stringing together characteristic extracts (see, for example, Peeters et al., 2002). There are many complications and issues in the process, however, and no tool has yet advanced beyond category 3.
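The self-similarity approach can be sketched in a few lines. This is a simplified illustration, not the method of Peeters et al.: it assumes each segment of audio has already been reduced to a feature vector (for example, averaged spectral magnitudes), uses cosine similarity as the comparison measure, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def self_similarity(segments):
    """Matrix whose entry [i][j] is the similarity of segment i to segment j."""
    return [[cosine(u, v) for v in segments] for u in segments]


def most_repeated_segment(segments):
    """Index of the segment most similar, on average, to all the others.
    Requires at least two segments."""
    sim = self_similarity(segments)
    n = len(segments)
    scores = [(sum(row) - row[i]) / (n - 1) for i, row in enumerate(sim)]
    return max(range(n), key=scores.__getitem__)


# Toy "song": segments 0, 2 and 3 share one spectral profile, segment 1 differs.
song = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
print(most_repeated_segment(song))
```

The segment with the highest average similarity is a candidate for a recurring passage such as a chorus; a summary would be built by extracting the audio for one or more such segments.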

Summarisation is not always necessary. For some applications, such as the display of search results, users may tolerate presentation of short search-matching sections of an errorful transcript together with an audio playback option for verification (described as "What You See Is Almost What You Hear" interfaces (Koumpis & Renals, 2005)).

3.6.2 Speech-to-Speech Translation

In systems that support cross-language searching, the issue arises of how to present material in a foreign language to a user who queried in a different language. Aside from the most obvious solution of involving a human in this process, a recent line of research (speech-to-speech translation) has addressed the translation of spoken content from one language to another. To date, this work has mostly addressed constrained domains far from those found in a typical audiovisual archive (such as travel, emergency medical diagnosis, defence-oriented force protection and security). The IBM MASTOR Multilingual Automatic Speech to Speech Translator project (IBM, n.d.), for example, falls into category 2. Speech-to-text translation systems (see Sections 3.4.1 & 3.4.2) are also only in the category 4-5 stages of development. However, developments in these areas may be relevant to future digital libraries projects.

3.6.3 Visualisation

Often it is useful to present the information in or derived from audiovisual material in some other graphic form, either to enable overall patterns or structure to be seen, or to assist in the identification of points of particular interest. The topic is particularly common in music research, where such systems enable one to see music, of which our experience is otherwise ephemeral. A discussion of different kinds of music visualisation is given in Isaacson (2005). The efficacy of visualisation is demonstrated in a study which examined the degree to which providing a visualisation of various acoustic properties (spectral magnitude, novelty, rhythm magnitude) aided users in finding particular points in a music recording (Wood & O'Keefe, 2005). The study concluded that visualisations did indeed help in navigation, but that visualisation of different properties was most effective for different pieces of music.

Obviously, the kinds of visualisation and their level of detail will vary from one purpose to another (the whole point is to present the data in a form suitable for the particular project), but some commonalities do emerge. One is the repurposing of editing software to produce visualisations of the composition or structure of some material. Film scholars use commercial software such as Final Cut Pro and Adobe Premiere not only to edit digital video footage (for example, to extract clips for presentation or personal archives), but also as a way of examining the composition of film at various levels. The editing timeline is a central component in most video editing software, representing the complete set of frames in a film. Using the timeline, scholars can zoom in and out between individual frames and the overall film, and can also view the overall structure of the film or analyse transitions between shots.

For music, scholars similarly use audio editors for visualisation. Every audio editor shows a waveform display of the signal, showing the peaks of the sound pressure wave or (at higher resolutions) the actual wave itself. This provides a quick and easy way of spotting sound and silence, and sometimes allows the beginnings and endings of sound events to be found also. It is not uncommon for audio software to generate sonograms also, which show the power of different frequencies in a two-dimensional display, often using colour, but the degree of fine control over their generation varies, as does the ability to export the resulting data.
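The peak display described above amounts to reducing many samples to one minimum and one maximum value per screen column. The following is a minimal sketch of that reduction, not the code of any particular editor; the function name `peak_envelope` and the simple block-splitting rule are assumptions.

```python
def peak_envelope(samples, columns):
    """Reduce a signal to (min, max) pairs, at most one per display column."""
    # Ceiling division, so that every sample falls into some column.
    per_column = max(1, -(-len(samples) // columns))
    envelope = []
    for start in range(0, len(samples), per_column):
        block = samples[start:start + per_column]
        envelope.append((min(block), max(block)))
    return envelope


signal = [0.0, 0.2, -0.5, 0.1, 0.0, 0.0, 0.9, -0.9]
print(peak_envelope(signal, columns=4))
```

Columns whose minimum and maximum are both close to zero correspond to silence, which is why such a display makes sound and silence easy to spot.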

Specialised software involving visualisation exists also. Video editing and mixing tools developed for VJing (selecting and mixing found video materials, and setting them to music) have addressed the problem of how to rapidly select and organise quite large collections of film and television footage. Software such as Resolume (n.d.), an instrument for live video performances (category 1), allows rapid selection, changing, combining and comparing of video clips on screen. For music, there are examples of projects which attempt to show higher-level or more semantic qualities in the audio stream. Examples are provided by aspects of the CLAM Music Annotator, mentioned above, which includes panels to visualise automatically extracted data on harmony and tonality in a time-varying two-dimensional colour display, where neighbouring regions are associated with related keys or harmonies and colours indicate strength of reference (Gomez & Bonada, 2005). A similar tool, which explicitly allows the development of new plug-ins for new visualisations, is the Sonic Visualiser from Queen Mary, University of London (Centre for Digital Music, n.d.).

Visualisation has also been used to show relations between pieces of music rather than within them. In particular, a number of projects use it as a means to solve the apparent problem of how to navigate a very large collection of popular-music recordings, related by various similarity measures, using metaphors of contour maps (Mörchen, Ultsch, Nöcker & Stamm, 2005), or factors like size and colour (Goto & Goto, 2005). In this, however, they do not significantly differ from data-visualisation projects in general.

3.7 Integration

It should be clear from the above that searching, annotating, transcribing, analysis and presentation are not discrete, atomic operations. However, very few attempts have been made to develop technologies that combine all aspects of research. Again, tools designed for working with speech lead the way. Apart from a few large scale research-led development projects, technology that integrates different aspects of the research process does not yet exist for working with music and video. The large-scale integrated projects which do exist are broad in scope and different aspects of the projects typically fall into different development categories.

3.7.1 Malach (Multilingual Access to Large Archives)

The Survivors of The Shoah Visual History Foundation (VHF) was founded by Steven Spielberg to enable survivors of the Holocaust to tell their stories and have them saved in a collection that could be used to teach about intolerance. Over 52,000 testimonies (116,000 hours of video) have been collected, in 30 languages, forming a 180 TB digital library of MPEG-1 video (Gustman et al., 2002). This has been manually catalogued, although not in the detail that was originally planned: the very detailed and extensive human cataloguing that had been envisioned, and that was completed for over 3000 testimonies, was found to take about 15 hours per hour of video, which even with good tools was estimated to cost over $150 million. The foundation therefore backed off to a real-time cataloguing methodology, in which one-minute clips are linked with descriptions and person objects. In parallel with this, a significant research project (the MALACH project, n.d.) investigated methods for fully or partially automating cataloguing. The project addressed automatic speech recognition to generate transcripts for speech in multiple languages, automatic segmentation of transcripts into shorter units suitable for retrieval, automatic classification to assign thesaurus terms to segments, and automatic translation to allow querying in multiple languages. It also undertook user studies to investigate the types of access that would be most useful to users of the collection. This represents a very challenging domain for speech recognition, since the interviews contain natural speech with age-related coarticulation, heavily accented language, and uncontrolled speaker and language switching by often very emotional speakers.
The automatically generated metadata was perceived to be usable by the project investigators, providing a flexible means of access for users whose needs may not have been addressed by the metadata schema used in manual annotation (Byrne, 2006). One of the major contributions of this project was an information retrieval test collection for spontaneous, conversational speech (625 hours of automatically transcribed speech) based upon real information needs derived from user requests. This should provide a standard test for conversational speech indexing systems (Oard et al., 2004).

3.7.2 Variations2

The Variations2 project, from Indiana University (2005), was initiated as research "to establish a next-generation [system] for research in digital library system architecture, metadata, network services, usability, intellectual property rights, and music pedagogy" (Dunn et al., 2006). The Variations2 system, as deployed, allows access to recordings as well as scanned and encoded musical scores. This clearly ambitious, end-to-end system is grounded in a mature, tested metadata model, in which contributors (performers) create instantiations of works (which, in turn, are by composer contributors). Instantiations (performances) appear on a container (such as a CD), which may give rise to other media objects (an encoded MPEG file). This model is as simple as the world allows it to be, but is powerful enough to enable such further developments as searching, collaborative annotation, and automated and manual analysis of music. These and other rich metadata definitions have been designed to be explicitly suitable for classical music, are much more complex than the categories of the Gracenote database, and are grounded in information-science research.
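The relationships in this metadata model can be made concrete with a small data-structure sketch. It is an illustration of the entities as described in the paragraph above only, and is not the Variations2 schema: the class and field names are assumptions, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contributor:
    name: str  # a composer or a performer


@dataclass
class Work:
    title: str
    composers: List[Contributor]


@dataclass
class MediaObject:
    filename: str  # e.g. an encoded MPEG file derived from a container


@dataclass
class Container:
    label: str  # e.g. a CD
    media_objects: List[MediaObject] = field(default_factory=list)


@dataclass
class Instantiation:
    work: Work  # the work being performed
    performers: List[Contributor]
    container: Container  # where this performance appears


# Placeholder data: one performance of one work, issued on one CD.
work = Work("Example Sonata", [Contributor("Composer A")])
cd = Container("Example CD", [MediaObject("track01.mpg")])
performance = Instantiation(work, [Contributor("Performer B")], cd)
print(performance.work.composers[0].name, "/", performance.performers[0].name)
```

Separating works from their instantiations is what allows a search for a piece to return every recorded performance of it, whichever container each happens to appear on.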

The Variations2 tools are mostly based on cross-platform Java applications. The audio tools include a timeline tool to support formal analysis of works, allowing users to add timepoints (annotations) between sections. The sections can then be visualised as the bubble-like arcs familiar to those studying musical structure. Although the goals tend more towards pedagogy than research, Variations2 is one of the best examples of an end-to-end music information retrieval system with scholarly underpinnings.

3.7.3 Informedia Digital Video Library project

The Informedia Digital Video Library (categories 2-5) (Carnegie Mellon University, 1994-2006) represents one of the most ambitious projects in the space. Funded by both the first and second phases of the NSF Digital Library Initiative, it has the overarching goal of achieving automated machine understanding of video, including search, retrieval, visualisation and summarisation in both contemporaneous and archival content collections. Informedia has aggregated a library of multiple terabytes of video, mostly broadcast television news and documentary content. The first phase of the project integrated technology for speech, image and natural language understanding to automatically transcribe, segment and index broadcast video for searching and retrieval purposes, as seen in the News-on-Demand application, which automatically processed broadcast news shows for the archive. The second phase investigated techniques for video information summarisation and visualisation, extending single video abstractions to summarising multiple documents in different collections and visualising very large video data sets. Separate projects investigated a variety of tasks including multilingual broadcast news archives and cross-cultural archives, including the connection to the ECHO project in Europe and collaboration with the Chinese University of Hong Kong. Related work is ongoing.

Informedia is notable for its wide-ranging exploration of the space, but also (as noted in a recent paper about lessons learned during the 10 years of the project) for having derived an infrastructure that allows daily processing without any manual intervention. This distinguishes the system from many of those described elsewhere, which are often research-deployed only or deployed with a limited number of users. Informedia has developed a robust grouping of components and has encountered problems that do not arise when investigating single research issues, including identifying techniques which are too computationally expensive, those which are overtrained to a particular dataset and those that go out of date over time (Hauptmann, 2005).

3.7.4 National Gallery of the Spoken Word

Similar issues are being investigated in the NSF Digital Library Initiative project developing the National Gallery of the Spoken Word (n.d.). The project as a whole is investigating issues related to digital watermarking, digitising and categorising, copyright, distribution and educational program development for 60,000 hours of historical recordings from the 20th century. One specific strand of the project is investigating recognition of, and search within, this data, and has developed an experimental online spoken document retrieval system called SpeechFind. The challenges of this collection include the variety of recording technologies, acoustic environments, speaking styles, names and places, accents and languages, and the time-varying grammar and word usage (Zhou & Hansen, 2002; Hansen et al., 2001).
