Complexity and Scale in Audio Archives

© 2009 Jerry Goldman and Andrew Gruen. The text of this article is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.

EDUCAUSE Review, vol. 44, no. 1 (January/February 2009): 6–7.

By Jerry Goldman and Andrew Gruen

Jerry Goldman is a professor of political science at Northwestern University. Andrew Gruen is a Gates Scholar at King’s College, University of Cambridge.

Comments on this article can be sent to the authors.

“I've been rich and I've been poor. Believe me, honey, rich is better.” —Sophie Tucker

National archives and public broadcast archives in Europe and the United States hold millions of hours of spoken-word materials, the bulk of them in analog form. In 2005, an EU-U.S. working group estimated that world holdings of audio materials in analog formats total approximately 100 million hours.1 These holdings will perish within a few decades unless we take steps to preserve them. Millions more hours come into existence in digital formats each year. The accelerating growth in spoken-word documents will generate demand for efficient archiving and retrieval strategies. But these resources will prove stillborn if we do not identify ways to reveal their contents.

The U.S. Supreme Court installed a recording system in the courtroom in October 1955 and archived the recordings at the National Archives and Records Administration (NARA) for research and academic study. Today, NARA holds about 9,000 hours of that audio, in (mostly) analog formats. The Oyez Project began as a way to deliver this audio to interested citizens in digital form, but it is much more than that today. A multimedia archive devoted to the Supreme Court of the United States and its work, the project became a microcosm of how future audio archives will store and deliver their holdings in a networked world.

At its inception, the Oyez dataset was small. The project digitized recordings of the oral arguments from key cases in U.S. constitutional law, wrote abstracts of each case, and put them in a HyperCard stack. However, as the project grew, it began to directly confront issues of both scale and complexity. In addition to streaming recordings and placing abstracts on the web, Oyez began to collect other pieces of text and audio, along with the metadata. The project also started to collect audio, text, photos, and videos related to the operation of the Court but not to any individual case. The generation of multiple versions of audio items further complicated the task of curating the growing collection of Oyez materials. The latest iteration of Oyez, version 5, contains three main data types: text, audio, and various forms of audio metadata including transcripts, time-synced transcripts, speaker information, speaker biographies, speaker photos, annotations, and commentary.

As data collection drastically increased, Oyez started looking for computational curation aids (instead of hiring more undergraduates). The project is turning to RDF in an attempt both to improve internal organization and to "future-proof" the dataset. Oyez built an RDF schema that describes the structure and concepts within the data. Unlike constructing traditional taxonomies, creating the RDF schema was like building a waterfall from the bottom up; the process started with the smallest constituent parts and then interrelated them into larger categories. Since the focus of Oyez is, primarily, to archive the U.S. Supreme Court, and since the work of the Court is broken into cases, the Oyez schema uses the individual case as its starting point. Cases are made up of events, people, and the roles those people play in each event. Oyez describes each constituent part of all the case categories, but more important, all of the types are interrelated: events are tied to specific cases, some events require particular roles to be present, and people fill roles for any given case.
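The bottom-up structure described above can be pictured as a set of subject-predicate-object statements, which is the basic shape of RDF data. The following is a minimal sketch in plain Python rather than the actual Oyez schema: every identifier in it (the case, event, role, and person names, and the predicates `partOfCase`, `inEvent`, and `filledBy`) is hypothetical and chosen only for illustration.

```python
# Hypothetical (subject, predicate, object) statements.
# None of these names come from the real Oyez schema.
TRIPLES = [
    ("case:example-v-sample", "rdf:type", "Case"),
    ("event:argument-1", "rdf:type", "OralArgument"),
    ("event:argument-1", "partOfCase", "case:example-v-sample"),
    ("role:advocate-1", "rdf:type", "AdvocateRole"),
    ("role:advocate-1", "inEvent", "event:argument-1"),
    ("role:advocate-1", "filledBy", "person:jane-doe"),
    ("role:justice-1", "rdf:type", "JusticeRole"),
    ("role:justice-1", "inEvent", "event:argument-1"),
    ("role:justice-1", "filledBy", "person:john-roe"),
]


def objects(subject, predicate):
    """Return every object linked from `subject` by `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    """Return every subject linked to `obj` by `predicate`."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def people_in_case(case_id):
    """Walk case -> events -> roles -> people, starting from the case."""
    people = set()
    for event in subjects("partOfCase", case_id):
        for role in subjects("inEvent", event):
            people.update(objects(role, "filledBy"))
    return sorted(people)


print(people_in_case("case:example-v-sample"))
```

Nothing in the sketch stores "the people in a case" directly; the answer falls out of the relationships among the smaller parts, which is the point of starting from the constituents and interrelating them.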

The benefits of RDF for the project are twofold. First, RDF should make the Oyez data machine "understandable"—that is, it should add semantics to indicate, for example, that a string of text like “10-22-1956” is a date. Second, the schema will be publicly accessible and, thus, freely extensible. By marking up Oyez material with semantic metadata, researchers can begin to ask questions of the data where the answers are held implicitly. For example, although it has never been the project's aim to discuss the effects of aging on appellate judging (and that data is not stored in the Oyez database explicitly), by using RDF, scholars can look at trends in the decision-making of every justice between the ages of, say, fifty-six and sixty-one. Because birthdates and the dates of events are known as dates, software can automatically locate records within the age range and can produce meaningful results with minimal input from the researcher.
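The aging example can be sketched in a few lines once dates are treated as dates rather than as strings. This is an illustration under stated assumptions, not the Oyez implementation: the `Vote` record, the justice label, the birthdate, and the sample decisions are all invented, and the parser assumes that a string such as "10-22-1956" is in month-day-year order.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Vote:
    """A hypothetical record of one justice's vote on one decision date."""
    justice: str
    decided_on: date


def parse_us_date(text):
    """Turn a string like '10-22-1956' into a date (assumes month-day-year)."""
    month, day, year = (int(part) for part in text.split("-"))
    return date(year, month, day)


def age_on(birth, when):
    """Age in whole years on the day `when`."""
    years = when.year - birth.year
    if (when.month, when.day) < (birth.month, birth.day):
        years -= 1  # birthday has not yet occurred in that year
    return years


def votes_in_age_range(votes, birthdates, low, high):
    """Keep the votes cast while the justice was between `low` and `high`."""
    return [
        v for v in votes
        if low <= age_on(birthdates[v.justice], v.decided_on) <= high
    ]


# Invented sample data for one fictional justice.
BIRTHDATES = {"justice-x": parse_us_date("01-15-1900")}
VOTES = [
    Vote("justice-x", parse_us_date("06-01-1955")),
    Vote("justice-x", parse_us_date("10-22-1956")),
    Vote("justice-x", parse_us_date("01-14-1961")),
    Vote("justice-x", parse_us_date("02-01-1962")),
]

selected = votes_in_age_range(VOTES, BIRTHDATES, 56, 61)
print([v.decided_on.isoformat() for v in selected])
```

The age of a justice is never stored anywhere; it is derived from two values that the software knows to be dates, which is what lets the question be asked without planning for it in advance.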

Because RDF schemas are public and extensible, students of related fields can both peer into how Oyez chose to organize its data and build upon the project's pre-existing structure. In the future, a congressional researcher could extend the aging study noted above to include both the Oyez data and other datasets, again without the need to have planned the datasets with such a study in mind. At the initiation of a new study, a scholar could compare two schemas, quickly identify points of comparison at the organizational level rather than at the record level, and let a piece of software create a new dataset. As schemas begin to reference across domains, all other semantic metadata becomes more valuable because a piece of software can make inferences that were previously found only by extremely rare (and extraordinarily talented) interdisciplinary researchers.
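A rough sketch of what comparison "at the organizational level" might look like: the researcher states once that two properties in different schemas mean the same thing, and software applies that statement to every record. The property names, the `court:` and `congress:` prefixes, and the sample records below are all hypothetical. In an RDF and OWL setting, a statement such as `owl:equivalentProperty` could play the role of the mapping dictionary.

```python
# Invented records from two independently designed datasets.
COURT_PEOPLE = [
    {"court:name": "Justice A", "court:birthDate": "1900-01-15"},
]
CONGRESS_PEOPLE = [
    {"congress:fullName": "Senator B", "congress:dateOfBirth": "1910-07-04"},
]

# Schema-level mapping: one statement per pair of equivalent properties,
# written once by the researcher rather than once per record.
SAME_PROPERTY = {
    "congress:fullName": "court:name",
    "congress:dateOfBirth": "court:birthDate",
}


def align(record, same_property):
    """Rename a record's properties into the other schema's terms."""
    return {same_property.get(key, key): value for key, value in record.items()}


def combine(primary, secondary, same_property):
    """Build a new dataset that uses the primary schema's property names."""
    return list(primary) + [align(r, same_property) for r in secondary]


combined = combine(COURT_PEOPLE, CONGRESS_PEOPLE, SAME_PROPERTY)
for person in combined:
    print(person["court:name"], person["court:birthDate"])
```

The mapping has two entries no matter how many records each dataset holds, which is why working at the schema level scales better than reconciling records one by one.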

Oyez is a metadata-rich prototype. But what about the more traditional archive, where we know little about the assets? Consider this example provided by Ant Miller at BBC-Archives:

Let us imagine an archive that has a set of audio assets about which we know very little. Perhaps the holdings are old, inherited or just 'found'. Chances are that before long the institution will have to digitize these holdings, and digitization means that the metadata will be essential. Once digitized as files, these assets will be as good as lost without metadata.

One could, given sufficient resources, do some human detective work: in short, catalog it. From hard experience, most archive institutions know that, done well, this is expensive, time-consuming work.

But perhaps automation can help reduce some of this burden. In some domains and subject areas, this is possible, and here's how one can try.

1) Digitize your content now, store it locally, and give it a globally unique ID. We think that it's a good idea to start building a metadata set right now: include all you can about the original carrier, the process of digitization (there's often a surprising amount of data available), and the organizational information associated with the asset. It is always a good idea to start a metadata set young, we think!

2) Apply speech to text. If you're looking for readable transcripts, your results will be disappointing, but for the purposes of metadata generation, 70 to 80% is fine. Improving that percent (which is possible) is an improvement that might be worthwhile. That transcript will be the key.

3) Cross-reference. Take your transcript and cross-reference it to textual sources that are roughly contemporaneous with the assets. News library holdings, other transcripts, editorials, even literature referring to the period are all useful. A proximal set of matches is what you're looking for.

4) Structure the results. These proximal matches can be used within a taxonomic structure; however, an ontology is preferred. By using a semantic structure with meaningful relationships between the terms, you can begin to extrapolate a set of useful reference terms beyond your original matching set. Berlin, Blockade, and Tempelhof are three useful terms. But an ontology describes that Tempelhof is an <airport> in the <city> of Berlin, <capital<divided>> of the <country> of <germany>, that was used in the <airlift> to relieve the <blockade>. Suddenly, with a good degree of confidence, you have a much stronger set of key terms associated with the asset. It is now placed within a richly structured information space.

5) Use your new index to build tools for now and later. Add your ontology keywords as simple tags on MPEGs to enable the Google generation to find your content now. But keep your ontology, too. Build it, grow it, try OWL and RDF implementations, and enjoy.

Either way now your orphaned audio can be referenced via words that were never recognized, maybe never even said, in the original.2
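Steps 3 and 4 of the procedure quoted above can be sketched in miniature. The ontology below encodes only the Tempelhof relationships mentioned in the quotation; the predicate names, the sample transcript, and the one-hop expansion rule are assumptions made for illustration, and real speech-to-text output and real ontologies would be far larger and noisier.

```python
# A tiny ontology built from the Tempelhof example; predicate names are invented.
ONTOLOGY = [
    ("Tempelhof", "is_a", "airport"),
    ("Tempelhof", "located_in", "Berlin"),
    ("Berlin", "is_a", "city"),
    ("Berlin", "located_in", "Germany"),
    ("Germany", "is_a", "country"),
    ("Tempelhof", "used_in", "airlift"),
    ("airlift", "relieved", "blockade"),
]


def vocabulary(ontology):
    """Every term that appears as a subject or an object."""
    return {term for s, _p, o in ontology for term in (s, o)}


def match_terms(transcript, terms):
    """Step 3 in miniature: find known terms in an imperfect transcript."""
    words = {w.strip(".,;:!?\"'").lower() for w in transcript.split()}
    return {t for t in terms if t.lower() in words}


def expand(terms, ontology, hops=1):
    """Step 4 in miniature: add terms related to the matches within `hops` links."""
    found = set(terms)
    frontier = set(terms)
    for _hop in range(hops):
        neighbors = set()
        for s, _p, o in ontology:
            if s in frontier and o not in found:
                neighbors.add(o)
            if o in frontier and s not in found:
                neighbors.add(s)
        found |= neighbors
        frontier = neighbors
    return found


# Hypothetical speech-to-text output; "Berlin" is never spoken.
TRANSCRIPT = "the planes landed at tempelhof every few minutes during the blockade"

matched = match_terms(TRANSCRIPT, vocabulary(ONTOLOGY))
keywords = expand(matched, ONTOLOGY, hops=1)
print(sorted(matched))
print(sorted(keywords))
```

In this sketch the transcript yields only two matches, yet the expanded keyword set also contains Berlin, airport, and airlift, words that were never said in the recording. Raising `hops` widens the net at the cost of confidence, so a cautious archive would probably keep it small.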

The challenges are enormous. But these challenges are what make some of us climb mountains or dive wrecks or make millions of hours of spoken-word collections accessible and useful in a data-rich but metadata-poor world.

  1. EU-US Working Group on Spoken-Word Audio Collections, Section 2.2.
  2. Ant Miller, e-mail to the authors, March 1, 2007.