Using Cloud Infrastructure as Part of a Digital Preservation Strategy with DuraCloud

Key Takeaways

  • Preservation of digital content in digital repositories and archives requires appropriate backup and replication, with geographic separation of the replicas as a best practice.
  • Cloud services can aid in replication and preservation of digital content, with concerns about risk mitigated by having trusted organizations provide oversight of data stored with cloud providers.
  • The DuraCloud open-source software platform being developed by DuraSpace aims to provide a fully integrated platform where services and data can be managed across multiple cloud providers, to prevent lock-in or reliance on any single provider.
  • DuraCloud will be offered as a hosted service providing data storage, data replication, and services to support data preservation, data transformation, and data access.

To ensure that multiple copies of digital content exist, digital repositories and archives must both offer backup and replication services. Backup alone is not an adequate solution for trusted digital archives, however; replicating content is best practice, and separating the replicas geographically is especially important. To mitigate the risk of technology failure, it is better still to store replicated content in systems whose underlying technologies differ from those of the original archive system. To avoid information loss due to obsolescence of content and metadata formats, archiving systems should provide mechanisms to monitor and transform content as needed. Lastly, appropriate security mechanisms are essential to prevent tampering with and unauthorized access to content.

With the emergence of cloud infrastructure as a service (IaaS), using cloud technologies to support data replication has become a realistic option. While well-documented concerns exist (trust, control of data location, guarantees against data loss),1 the scalability and low cost of utility cloud providers look increasingly attractive. Research commissioned by DuraSpace found that 50 percent of technology decision makers surveyed in the DSpace and Fedora communities expected to use cloud services within the next year. Participants in the study cited replication and preservation services as their top interests in the cloud. They also indicated that the prospect of trusted organizations providing oversight of data stored with cloud providers mitigated their concerns about risk.

DuraCloud, a software platform being developed by the DuraSpace not-for-profit organization, will provide easy entry into the cloud infrastructure by offering data storage, data replication, and services to support data preservation, data transformation, and data access. DuraSpace is planning to host DuraCloud as a service following the completion of a pilot phase with three partners, now under way and funded by the Library of Congress.2 Since the core components of DuraCloud will be released as open-source software, institutions or consortia will be able to install the DuraCloud core to create and manage their own cloud networks.

DuraCloud aims to provide trusted cloud mediation, with different levels of service, to make digital content (1) durable, meaning accessible for long periods of time, and (2) usable, meaning that it can be retrieved and accessed or dynamically transformed to fit within a variety of application and system contexts. DuraCloud provides a simple, open application programming interface (API) with back-end connectors to multiple cloud storage providers. Mediating between digital repositories and archive systems on one side and multiple cloud providers on the other hedges risks and overcomes obstacles to storing data at any one provider, such as data lock-in and a single point of failure for data storage.2 Currently, there are DuraCloud connectors to three commercial cloud services (Amazon Web Services, EMC Atmos Online Services, and the Rackspace Cloud). Key features of the software enable users to:

  • Transparently push content to multiple third-party storage providers. This allows organizations to take advantage of cost-effective Internet-based storage, using the DuraCloud software to send content to one or more underlying cloud storage providers.
  • Use value-added services. The DuraCloud platform adds value to what the underlying storage providers offer, with a particular focus on services that enable longevity of content and facilitate flexible use and reuse. These services are provided as a menu from which users can choose services to implement. Services planned include:
    • Preservation support: Replication, file format transformation, and bit integrity checking.
    • Access and reuse: Image viewing and editing, video streaming and editing, and faceted browse and search.
  • Leverage open-source technologies. DuraCloud is being built as open-source software, in keeping with the open-source principles promoted by both Fedora Commons and DSpace. Core DuraCloud software components will be released as open source in the summer of 2010.
  • Choose hosted or run-your-own. The DuraSpace organization plans to run the DuraCloud platform software as a service. Since the platform is built on open-source technologies, others can pick up the software and run local instances to create their own hybrid cloud or cloud consortium network.
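
The mediation idea behind the first feature, one call that pushes content to several back-end providers, can be illustrated with a minimal sketch. This is not the DuraCloud API: the `StorageProvider` interface, the `InMemoryProvider` stand-in, the `Mediator` class, and the container and content identifiers are hypothetical names chosen for illustration, and SHA-256 is an assumed checksum algorithm.

```python
import hashlib
from typing import Dict, Iterable, Protocol, Tuple


class StorageProvider(Protocol):
    """Hypothetical connector interface; not the DuraCloud API."""

    name: str

    def put(self, container: str, content_id: str, data: bytes) -> str:
        """Store the data and return the checksum the provider computed."""
        ...

    def get(self, container: str, content_id: str) -> bytes:
        """Return the stored data, or raise KeyError if it is missing."""
        ...


class InMemoryProvider:
    """Stand-in for a cloud back end so the sketch runs without a network."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: Dict[Tuple[str, str], bytes] = {}

    def put(self, container: str, content_id: str, data: bytes) -> str:
        self._objects[(container, content_id)] = data
        return hashlib.sha256(data).hexdigest()

    def get(self, container: str, content_id: str) -> bytes:
        return self._objects[(container, content_id)]


class Mediator:
    """Sends each item to every configured provider behind one interface."""

    def __init__(self, providers: Iterable[StorageProvider]) -> None:
        self.providers = list(providers)

    def store(self, container: str, content_id: str, data: bytes) -> Dict[str, str]:
        expected = hashlib.sha256(data).hexdigest()
        receipts: Dict[str, str] = {}
        for provider in self.providers:
            reported = provider.put(container, content_id, data)
            # Validate the checksum at this transfer point before moving on.
            if reported != expected:
                raise IOError(f"checksum mismatch at {provider.name}")
            receipts[provider.name] = reported
        return receipts

    def retrieve(self, container: str, content_id: str) -> bytes:
        # Fall back to the next replica if one provider lacks the item.
        for provider in self.providers:
            try:
                return provider.get(container, content_id)
            except KeyError:
                continue
        raise KeyError(content_id)
```

Because callers depend only on the mediator, a provider could in principle be added or dropped without changing the calling code, which is the hedge against lock-in that the text describes.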

The DuraCloud project has already demonstrated cloud replication capabilities using data from our initial pilot partners. We have ingested 10 terabytes of data from each partner and are currently testing cloud replication with up to three replicas across different cloud providers. We have implemented integrity checking (checksum calculation and validation) at every transfer point in the ingestion process. During the pilots we have also successfully navigated both policy-imposed and practical data-transfer limits that are lower than the actual storage limit for a file. We see cloud providers establishing limits on maximum single file size (in Amazon S3, the established limit is 5 gigabytes per object stored). Files that exceed the transfer limit can be stored after they have been broken into parts ("chunked") from which they can be reassembled, a capability that can be used across our multiple underlying storage providers. In addition, we are currently working on the "stitching" capability that puts the chunks back together for access. To enable graceful evolution of DuraCloud, we created a service plug-in architecture and demonstrated the deployment of an initial set of services, including format transformation and viewing of very large images, with a data mining service and a video streaming service up next.
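
The chunking, stitching, and checksum validation described above can be sketched as follows. This is an illustration under stated assumptions, not the DuraCloud implementation: the function names, the manifest layout, and the use of SHA-256 are all hypothetical, and the data is held in memory for brevity where a real tool would stream from disk.

```python
import hashlib
from typing import Dict, List, Tuple


def _digest(data: bytes) -> str:
    # SHA-256 is an assumption; the article does not name an algorithm.
    return hashlib.sha256(data).hexdigest()


def chunk(data: bytes, max_chunk_size: int) -> Tuple[Dict, List[bytes]]:
    """Split data into parts no larger than max_chunk_size.

    Returns a manifest (hypothetical layout) plus the parts themselves.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    parts = [
        data[offset:offset + max_chunk_size]
        for offset in range(0, len(data), max_chunk_size)
    ]
    manifest = {
        "total_size": len(data),
        "checksum": _digest(data),
        "chunks": [
            {"index": i, "size": len(part), "checksum": _digest(part)}
            for i, part in enumerate(parts)
        ],
    }
    return manifest, parts


def stitch(parts: List[bytes], manifest: Dict) -> bytes:
    """Reassemble the parts, validating every checksum along the way."""
    if len(parts) != len(manifest["chunks"]):
        raise ValueError("wrong number of chunks")
    for entry, part in zip(manifest["chunks"], parts):
        if _digest(part) != entry["checksum"]:
            raise ValueError(f"chunk {entry['index']} failed validation")
    data = b"".join(parts)
    if _digest(data) != manifest["checksum"]:
        raise ValueError("reassembled file failed validation")
    return data
```

Recording a checksum per chunk as well as one for the whole file means a corrupted part can be identified and re-transferred on its own, rather than repeating the entire upload.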

We have begun work to support repository replication and synchronization with existing local Fedora and DSpace repositories as well as other file-based content management systems. The synchronization tool will copy the underlying file directory from the repository or content management system to DuraCloud, keeping the cloud store synchronized with the primary local store if desired.
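
One way to picture the synchronization step is as a comparison between checksums of the local directory and checksums recorded for the cloud copy. The sketch below is an assumption about how such a tool could plan its work, not a description of the DuraCloud synchronization tool: `local_manifest`, `plan_sync`, and the `delete_removed` flag are hypothetical names, and the remote manifest is modeled as a plain dictionary.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def local_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    manifest: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            manifest[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def plan_sync(
    local: Dict[str, str],
    remote: Dict[str, str],
    delete_removed: bool = False,
) -> Tuple[List[str], List[str]]:
    """Decide which files to upload and which remote copies to delete.

    A file is uploaded when it is new or its checksum differs from the
    remote record. Remote-only files are deleted only when
    delete_removed is set, so the cloud copy can also act as a
    keep-everything archive.
    """
    to_upload = sorted(
        name for name, checksum in local.items()
        if remote.get(name) != checksum
    )
    to_delete = (
        sorted(name for name in remote if name not in local)
        if delete_removed
        else []
    )
    return to_upload, to_delete
```

Comparing checksums rather than timestamps keeps the plan independent of clock differences between the local store and the cloud provider, at the cost of reading every local file.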

The DuraSpace team will conclude the pilot program by the end of 2010. The DuraSpace organization believes IaaS will become as ubiquitous as electric utility infrastructure is today. It is therefore imperative that we begin learning how to use this infrastructure and connect it to our existing systems, so that digital repositories and archive systems can adapt to and take advantage of what the cloud has to offer.

Endnotes
  1. Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing," Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-2009-28, February 10, 2009.
  2. With funding from the Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP), initial DuraCloud pilots were formed with three partners: Biodiversity Heritage Library, New York Public Library, and WGBH Media Archive.