If You Build It, They Will Scan: Oxford University’s Exploration of Community Collections

  • Traditional large digitization projects demand massive resources from the central unit (library, museum, or university) that has acquired funding for them.
  • Another model, enabled by easy access to cameras, scanners, and web tools, calls for public contributions to community collections of artifacts.
  • Community collections involving the public benefit from reduced costs and access to an astounding wealth of hitherto undiscovered material and knowledge.

In 2008 the University of Oxford ran a groundbreaking digitization project focused on getting members of the public to digitally capture, submit, catalogue, and assign usage rights to material they personally held to do with the First World War. The results demonstrated the potential of this approach to save money compared with traditional digitization projects. They also revealed that community collections could capture a wealth of hitherto undiscovered material held in private hands.

Mass Amateur Digitization and Mobilizing the Public

In 2008 the NPD Group’s Household Penetration Study: Ownership Landscape 2008 reported that nearly 75 percent of all U.S. households owned at least one digital camera. These ranged from compact point-and-shoot cameras to full digital single lens reflex (DSLR) cameras. Add to this figure the number of mobile phones with cameras and the public availability of flat-bed scanners or combination scanners/photocopiers/printers, and it would not be a wild claim to say that in North America, Western Europe, and other developed countries the ability to digitize visual material is almost ubiquitous. Or, to put it another way, an extraordinary resource is just waiting to be exploited - namely, mass amateur digitization. The question is how to tap into this resource for the benefit of research and teaching.

The concept of mobilizing large cohorts of volunteers to assist in public projects is not new:

  • In an area to the south of Oxford lie the remains of a volunteer project led by John Ruskin and his undergraduate "Hinksey diggers" to build a causeway linking the city with the nearby town of Abingdon.
  • In the late 1930s the Mass Observation movement in the U.K. used an army of volunteers across the country to record everyday life, conversations, and behaviors of the average Briton through diaries, correspondence, and questionnaires.
  • In more recent times came public participation in screen saver projects such as the LifeSaver initiative, which used a grid of personal computers to analyze data related to cancer. Indeed, the whole Web 2.0 phenomenon relies on voluntary users creating content.

So are institutions looking to create digital archives missing a trick here? Could they build on the potential for voluntary projects and the clear willingness of the public to assist in projects in which they feel some form of investment, and take advantage of the widespread availability of domestic digitization equipment? Or, to put it another way, could one create a "community collection" whereby members of the public generate the digital content? More importantly, can individual institutions take on such initiatives?

Traditional Digitization Projects Versus Community Collections

Before discussing the Oxford project that shows how mass amateur digitization might be achieved, it is perhaps worth considering why one might want to consider a community collection in the first place. The Internet is awash with digital objects (there are, for example, 29.5 million images of cats listed under the Google Image search), and it is perhaps a responsibility of higher education to only add objects of true value to that virtual mountain.

In 2001 one of us (Lee) posed the question as to whether the cultural heritage sector had spent wisely during the 1990s on the major digitization projects undertaken.1 In particular, could the costs for these large-scale digitization projects be justified when placed against competing demands (notably the clamor for online subscriptions to journals and other data sets)? Ten years later, in the current global financial predicament it would seem apt to resurrect the question and perhaps ask whether the model usually adopted for a digitization project is sustainable.

In most cases a digitization project is led by a large central unit (library, museum, or university) that has acquired funding to concentrate on a major collection they hold or have access to. The material is captured (for the most part as digital images) at professional standards either in-house or by a third party (with all the transport and insurance costs this might involve), returned, quality assured, post-processed, quality assured again, catalogued, archived, migrated to a delivery system, and so on. There is no doubt that the end product is of exceptionally high quality, and in the U.K. the wealth of material now made available under the recent digitization programs run by the Joint Information Systems Committee offers immense value to researchers and teachers.

However, things have changed, and one would be foolish to ignore these changes.

  • First, as noted, digitization can now be undertaken by the masses. Whereas in the past something like a large Kontron camera was needed, nowadays everyone carries digitization devices in their pockets.
  • Second, attitudes to searching and browsing and tactics for information retrieval are changing. The simple Google search box is standard, diverting a generation of users away from using catalogues and advanced searches. This encourages reassessing not only the interfaces designed, and the search and browse engines used, but also the level of complexity needed in terms of metadata.
  • Third, there are now far more ways to make digital content available than the standard website. Parts of a collection can now be exposed in, or fed to, third-party sites, as witnessed by the number of institutions that make image collections available on Flickr and Flickr Commons.

More importantly, perhaps it’s time to stop and think about the approaches to large digital content creation projects and ask three questions:

  1. What is affordable in such tight economic times?
    Digitization projects are generally resource intensive. Nobody could deny the quality or importance of an initiative such as the British Library’s project to digitize two million pages from 19th-century newspapers, but that project involved substantial amounts of funding, staff resources, and time. In higher education, certainly, the financial climate is putting considerable pressure on nonessentials. Although such visionary funding agencies as the Joint Information Systems Committee (JISC) make many such projects possible, even they are constantly looking to achieve more with less.
  2. Who can be part of the digitization process?
    Digitization could be considered an exclusive activity. Outside of the major libraries and museums of the world and those institutions that receive direct funding to run a digitization initiative, most units (and here one could include universities, colleges, local libraries, local museums, and archives) are simply excluded. This is partly due to a lack of sufficient infrastructure to run a major digitization project, and partly because high-profile collections have, over the centuries, understandably gravitated towards the main national heritage centers.
  3. Are digitization projects just doing the same old thing?
    Finally, the selection process for what to digitize is often circular, with previous demand for items directing what is chosen next. This is entirely understandable and justifiable, but one could argue that the items that should be digitized are exactly the ones that nobody ever asks for because nobody knows they exist.

It is important to maintain a sense of balance, however. We believe the digitization initiatives undertaken by the major cultural centers in the past 20 years have been extremely worthwhile and that scholars of the future will look back on these decades as the start of a golden age where access to resources opened up, helping researchers, teachers, and the plain curious. Furthermore, the approach of prioritizing the capture of material people want access to, and basing that on historical demand, is sensible. At the same time, many digitization projects have been focused on other needs, such as preservation, or releasing to researchers content that up to now has remained unnoticed or needs its profile raised.

Nonetheless, we suggest that another approach could be used, based upon the mass amateur digitization movement and the general willingness of the public to participate in initiatives they consider worthwhile. A project on which we worked did exactly that.

Oxford University’s Great War Archive Initiative

The First World War Poetry Digital Archive, a project run at Oxford University, launched on November 11, 2008. The archive released over 12,000 digital objects drawn from collections in the U.K. and U.S., for free worldwide educational use via the web. The project has a particular focus on the major British poets of the Western Front,2 but the archive also includes a wealth of historical material to provide context to the poetry (including audio and video) drawn from collections held at the Imperial War Museum in London and the U.K.’s National Archives. In this sense, then, it is a standard digitization project funded by a national agency (the JISC) and thus no different from many others that preceded it, or will follow it, with the possible exception that it makes a point of surrounding the collection with a series of educational resources and tools targeted specifically for teaching.

What makes the poetry archive relevant to this discussion is the extra project undertaken as part of the funding - the Great War Archive (GWA) initiative, which is an example of a community collection initiative. Originally intended as a small adjunct, the GWA rapidly became a major project in its own right and has attracted considerable attention worldwide.

The GWA focused entirely on what the public owned and not what was in the major collections. We issued a call to arms (or rather to attics, garages, and bottom drawers) through the main media channels in the U.K., asking members of the public to submit, via the web, digital surrogates of material they personally held to do with the First World War and to which they controlled the rights (family photographs, diaries, letters, artifacts owned or collected from the war). We also asked them to record the stories that had been passed down to them over the years about their family’s experiences.

Over a period of 16 weeks we made available a website that was a front end to a simple piece of software the project developed called CoCoCo. The software allowed anyone to upload objects following a set of simple steps that guided them through the provision of some basic metadata and necessitated agreeing to the license conditions. Behind the scenes CoCoCo also provided some administrative controls for further cataloguing and quality assurance.

With the metadata the trick was to get the most useful information from the people submitting but at the same time not make it so laborious as to dissuade them from participating. In short, we asked them to provide:

  • Contact details (which they could keep anonymous when the site went live)
  • Author (the person who "created" the item)
  • Creation place (if known)
  • Creation date (if known)
  • Content type (using a series of keywords)
  • Further information through a large, open "notes" field (family anecdotes, for example)
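The fields listed above could be modeled as a simple record. What follows is a minimal sketch in Python, not CoCoCo's actual schema: the class name, the field names, and the `problems` helper are all hypothetical and chosen only to mirror the list. Treating the author as mandatory is likewise an assumption, since the list marks only place and date as "if known."

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Submission:
    """Hypothetical record mirroring the metadata requested from contributors."""
    contact_details: str                  # required; may be hidden on the live site
    author: str                           # the person who "created" the item
    license_accepted: bool                # contributors had to agree to the license
    keep_anonymous: bool = False          # contributor's choice for the public site
    creation_place: Optional[str] = None  # optional ("if known")
    creation_date: Optional[str] = None   # free text, as dates may be approximate
    content_types: List[str] = field(default_factory=list)  # keywords
    notes: str = ""                       # open field, e.g. family anecdotes

    def problems(self) -> List[str]:
        """Return the reasons this record cannot be accepted (empty if none)."""
        issues = []
        if not self.contact_details.strip():
            issues.append("contact details are required")
        if not self.author.strip():  # assumption: author is mandatory
            issues.append("author is required")
        if not self.license_accepted:
            issues.append("license conditions must be accepted")
        return issues


# Illustrative values only; not taken from the archive.
complete = Submission(
    contact_details="contributor@example.org",
    author="Unknown soldier's family",
    license_accepted=True,
    content_types=["letter"],
    notes="Passed down from a grandfather.",
)
incomplete = Submission(contact_details="", author="A. N. Other", license_accepted=False)

print(complete.problems())
print(incomplete.problems())
```

Keeping place and date optional, and the notes field unstructured, reflects the balance described above: enough structure to be useful, without making submission laborious.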

In conjunction with this the project team ran a series of "submission road shows" around the country where we would base ourselves in a local museum or library and invite people to bring the objects along on a particular day (Figure 1). We would then talk to them about the item, get them to fill in a form with further information about themselves and what they had brought (the basic metadata again), and then we would photograph or scan the item or items. To get the word out, we targeted local newspapers and radio shows and produced a series of small, simple cards that we left in pubs, libraries, trains, and other public places (Figure 2). We also provided a "Submission Day Pack" for libraries we could not visit, which guided them through running their own submission days.

Figure 1. Submission Road Shows for the GWA Project

Figure 2. Cards Inviting Submissions to the GWA Project

To a certain extent this was a risky venture. What if nobody submitted anything? What if nobody turned up on the submission days? Was the system just going to get spammed by the world’s pornographers? Would we be inundated with material that was fake or irrelevant?

The results were the exact opposite. In the space of 16 weeks we "collected" over 6,500 items. These were all quality assured by two subject experts, and a technical imaging expert where appropriate, and only one submission was rejected because it was from the Boer War. The online administration system permitted adding additional information (for example, when a contributor was uncertain about the soldier’s regiment, subject experts could add the data based on the information in the photograph).

Submission days were packed, with people bringing in the items their families had treasured over the years (Figure 1). Most importantly, these were items that had never seen the light of day, up to now. For the most part, the items were catalogued by the public, scanned by the public, and the rights for distribution agreed to by the public.

The submission website was available for a limited time, as the project was very much an experiment to see if the approach would work. Afterward, though, people contacted us wishing to add yet more material. To accommodate them, we opened a Flickr group, which now has a further 1,600 items (this time under the individual’s choice of Creative Commons license).

An interesting observation in all of this was the blurring between the amateur and the professional. Although the digitization standards and the physical environments the public used (based on guidelines posted on the GWA submission site) were not comparable with professional work practices, and one would not want to rely on this process for archiving extremely rare items, they did the job, providing thousands of usable digital surrogates. Moreover, the wealth of information in the collective public knowledge base is astounding and demonstrated that many so-called amateurs have a lot to contribute to the academy. The comments and discussions on the Flickr site alone demonstrate the depth of public knowledge that can be tapped.

Although 6,500 items might sound like a lot, numbers are not everything. For example, if the archive consisted of several thousand blank field postcards (a template card issued to soldiers, where they could only select basic choices such as "I am quite well"; see Figure 3) or numerous other duplications, then our understanding of the war would not have advanced much. Thankfully, this was not the case. The project received 42 unique, unpublished diaries by soldiers from a range of battlefields, 63 memoirs, 255 unpublished letters, over 700 photographs, various pamphlets, local recruiting posters, images of rare objects (such as the original designs for the tomb of the unknown soldier), and so on.

Figure 3. Field Postcard

A particularly fine example relates to the scrapbook of the Reverend L. T. Pearson, a chaplain attached to the Royal Army Medical Corps. Pearson had brought a camera with him and recorded, through his photographs and ephemera he collected from the battlefields, his journey throughout the war up to the British occupation of Cologne in 1918–1919. The last picture in the album is his first view of the white cliffs of Dover as he returned to England. (See the case study, "Reverend Leonard Thomas Pearson’s Scrapbook.")

Individual items, which might not add much to our knowledge of the events of the First World War, do give insight into what the people endured.

Throughout the collection process, a GWA blog recorded other stories/items of interest (extremely useful when it came to promoting the site through the national media).

The impact that the GWA has had since its launch is attested to by the many military historians and national museums in the U.K., Canada, and Australia who contacted us throughout the project and afterwards. It has been reported in press publications worldwide and has generated widespread interest. Early usage statistics show that the GWA has drawn more users to the First World War Poetry Archive than the material that constitutes the more scholarly collections of the War Poets. Teachers regularly download material to illustrate their lessons on the First World War across a range of topics, to bring the subject alive and to captivate learners. For researchers - those attached to academic institutions, genealogists, local historians, and those simply interested in following their own interest in the subject - the hitherto unseen material is providing new avenues of research. The project is regularly contacted by members of the public who have been able to trace new histories within their families and communities as a result.

Cost Savings

On analyzing the costs of the GWA digitization project, we calculated that each item cost around £3.50 ($5.70) to collect, catalogue, QA, and distribute - mainly because the costs were shifted from the project to the public contributors. In comparison, the complete capture and distribution process for each image of the main Poetry Archive project (the rare items held in museums and libraries) came in at around £40.00 per image ($65.00). Moreover, the cost per item under the GWA (not just images, as we also received audio and text files) was derived by simply dividing the total cost of the project by the number of submissions. The total included, therefore, the initial set-up costs and development of the submission software CoCoCo. The latter was performed by a contracted developer and cost £3,000 ($4,900). Had the project continued, the unit cost would have reduced further as the number of items scaled up (the actual system for collection was never really put under much load stress). Moreover, if another initiative were to use CoCoCo, it would not have to cover the initial investment in development.
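The arithmetic behind these figures can be sketched as follows. This is an illustration under stated assumptions, not the project's own accounting: the total cost is not given in the text, so it is back-calculated here from the rounded unit cost and the "over 6,500" item count, and the reuse scenario assumes every cost other than software development stays the same.

```python
def unit_cost(total_cost: float, items: int) -> float:
    """Cost per item: total project cost divided by the number of submissions."""
    if items <= 0:
        raise ValueError("items must be positive")
    return total_cost / items


# Figures quoted in the text (GBP); both are rounded there.
ITEMS = 6_500                  # "over 6,500 items"
GWA_UNIT_COST = 3.50           # approximate cost per community-submitted item
ARCHIVE_UNIT_COST = 40.00      # approximate cost per professionally captured image
SOFTWARE_COST = 3_000.00       # one-off development cost of the submission software

# Assumption: implied total, reconstructed from the rounded figures above.
implied_total = GWA_UNIT_COST * ITEMS

# Share of each item's cost attributable to the one-off software development.
software_share = unit_cost(SOFTWARE_COST, ITEMS)

# Hypothetical reuse scenario: same volume, same other costs, no development bill.
reuse_unit_cost = unit_cost(implied_total - SOFTWARE_COST, ITEMS)

# How many times cheaper a community item was than a professionally captured one.
cost_ratio = ARCHIVE_UNIT_COST / GWA_UNIT_COST

print(f"Implied total:        GBP {implied_total:,.2f}")
print(f"Software share/item:  GBP {software_share:.2f}")
print(f"Unit cost on reuse:   GBP {reuse_unit_cost:.2f}")
print(f"Professional/community cost ratio: {cost_ratio:.1f}x")
```

On these assumptions, development accounts for under 50p of each item's cost, so most of the gap between the two approaches comes from shifting capture and cataloguing to contributors rather than from the software itself.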

Again, though, this is not comparing like with like, as the quality of the material coming from the professional reprographic studios as part of the Poetry Archive was much higher than that delivered by the public. On the other hand, the public shared items never before seen, many of considerable historical value, and the stories associated with the items which outlined their provenance and history. Submissions from across the U.K. reflected key regional interests, and once the Flickr site opened, submissions came from a global audience. Moreover, the quality of the scanning was good - certainly of "workable" quality. Similarly, the cataloguing provided by the public achieved an acceptable standard, so most of our time was spent adding extra information, not correcting errors.

Conclusions

The GWA presents a model for others to follow. Not only was it extremely cost effective, it also:

  • Released to researchers material previously unseen
  • Engaged the general public in a university project

It is true that the subject matter, the First World War, attracts great interest in the U.K., and no doubt the project benefited from the widespread interest in genealogy and tracing family roots. However, the workflows and systems developed could be reused for other subjects, and all are freely available (the collection software - CoCoCo - is open source and is available on request). Regardless of continued funding of large digitization projects, the GWA illustrates another area that could be explored - mobilization of the public to contribute digitized items to national archives.

These community collections could provide a cost-effective means of expanding research resources. In the U.K. there is already considerable interest in taking this model further, and one could envisage a central national service that would allow researchers and research projects to quickly and easily set up their own community collection sites. In building community collections, however, we are also building communities themselves. If such initiatives source input from the public, then serious consideration needs to be given to how such communities can be fostered and maintained, and how queries and questions can be answered if projects only run for a limited time. Funding for digitization projects is often for a set period only, to capture, catalogue, and deliver material over a period of months or years. Yet a community collection requires resourcing beyond supporting the public during the submission stage and making the material available - it also, we would argue, requires sustaining the community into the future by answering questions, providing further information, and assisting teachers and researchers.

Perhaps current funding models do not address the level of sustainability required for engaging the general public and need to be rethought to support longer term activity. Nevertheless, public collaboration projects within the field of digitization look to become increasingly popular, and the GWA initiative has provided the foundations for a critical understanding as to precisely what the benefits of such a collaboration may be, what challenges these projects will encounter, and how future efforts can benefit from such experience.

Endnotes
  1. Stuart D. Lee, "Digitization: Is It Worth It?" Computers and Libraries, vol. 21, no. 5 (May 2001), pp. 28-31.
  2. Wilfred Owen, Robert Graves, Edward Thomas, Isaac Rosenberg, Vera Brittain, Roland Leighton, David Jones, Edmund Blunden, Ivor Gurney, and Siegfried Sassoon.