Developing Research Data Management Services

Key Takeaways

  • Collaboration is essential in developing research data management services, bringing together disparate but critical entities on how best to address the storage, curation, sharing, and access of research data.
  • Three entities emerged as key partners in the delivery of research data services at CU-Boulder: the University Libraries, Research Computing, and the Office of the Vice Chancellor for Research.
  • Sustainable sociotechnical infrastructure, needed for successful implementation of research data management services, requires substantial investment.
  • Institutional culture plays a significant role in determining what research data management services are needed, including how these services might need to evolve as new technologies emerge on campus.

In 2008, University of Colorado Boulder (CU-Boulder) librarians began informal discussions as they became aware of the growing data management needs of faculty and researchers, particularly in STEM disciplines. This demand arose in part from increasing requirements that data produced under federal grants be managed and archived; the University Libraries (the Libraries) were being asked to provide solutions. Conversations with faculty and an initial survey conducted by the Libraries indicated that researchers wanted a tool that would allow a cohesive presentation of their intellectual output and that would, above all, save them time spent on data management planning. In addition, researchers indicated that they wanted layers of authentication and that they did not regard data reuse as a primary purpose. They also expressed the need to manage orphan data as well as teaching and learning materials that accompanied their research projects.

As a result, interested librarians created an informal group to discuss responses and strategies to meet data management needs. Using Google Groups, CUBDataLibrarians shared reports, articles, workshop information, and more to help inform their efforts. Membership eventually grew to 28 and included staff from nearby federal labs and institutions, such as the U.S. Department of Commerce Boulder Labs and the National Center for Atmospheric Research (NCAR). In addition, a staff member from CU-Boulder Research Computing joined the group as a core member.

As at many research institutions, the announcement by the National Science Foundation (NSF) in January 2011 requiring all grant proposals to include a data management plan affected CU-Boulder significantly. Proposals to the NSF account for the largest share of CU-Boulder's federal submissions (547 of the 1,361 proposals submitted to federal agencies in FY 2012, or about 40 percent); the NSF requirement therefore had a direct impact on CU-Boulder faculty and researchers.1

As early as March 2011, Research Computing shared examples of a data management plan submitted to the NSF with the campus community.

Key Partnerships

The missions of both the University Libraries and Research Computing include services in support of advancing research activities at CU-Boulder. Research administration is handled primarily by the Office of the Vice Chancellor for Research. It is natural, then, that these three entities constitute the key partners for developing research data services at CU-Boulder.2

The existing close partnership between the Libraries and Research Computing allowed for the creation of an ad hoc group to explore research data management issues on campus. It also paved the way for a smooth transition to the formal creation of Research Data Services.

Informational pages describing data management needs and services initially resided within the Research Computing website, and the Libraries were in the process of developing their own data management web pages. The director of Research Computing and I agreed that the Research Data Services domain needed to be at the university level, not under Research Computing or Libraries, to indicate the support of senior administration.

Nonetheless, Research Data Services is a partnership between Research Computing and the Libraries that includes other groups with relevant expertise. The research data manager from Research Computing and the research data and metadata librarian from the Libraries serve as co-coordinators of the Research Data Services operations.

The Data Management Task Force

The Data Management Task Force (DMTF) was created in July 2011 by the Vice Chancellor for Research. The task force's charter included:

  • Act as a nexus for leading data management efforts
  • Make recommendations about the storage and curation of digital data produced in the course of CU-Boulder–based research
  • Address the roles of individual researchers, departments and institutes, staff, and the university as a whole
  • Consider a wide array of data during this process (observational, experimental, clinical, simulation, instrument)
  • Address storage and maintenance issues in both the short and long term and potential funding models for each
  • Provide specific recommendations about how CU-Boulder investigators can respond to NIH and NSF policies
  • Review policies and practices at other universities as well as the national context in formulating its recommendations for CU-Boulder

The DMTF was small initially, with seven members invited by the vice chancellor for research:

  • The associate vice chancellor for research, who served as the chair
  • Two members from the Libraries (the director of the sciences department and the director of metadata services [author Jina Wakimoto])
  • Two members from Research Computing (the director and the research data manager)
  • Two members from the National Snow and Ice Data Center (NSIDC), the data stewardship program manager and a senior associate scientist

Located at CU-Boulder, NSIDC has a well-developed data management service; its members' experience with cryospheric data sets informed the work of the task force. As the DMTF began discussing how to develop research data services to meet the needs of the entire campus, we determined that we could benefit from including faculty and researchers from various disciplines with different data management needs, thus representing campus research more broadly. Five new members accepted our invitation and joined the DMTF:

  • A faculty member in Civil, Environmental, and Architectural Engineering
  • A faculty member in Geography
  • A faculty member in Chemistry and Biochemistry (also director of the Nuclear Magnetic Resonance Spectroscopy Facility)
  • A new metadata librarian (now renamed research data and metadata librarian)
  • The lead developer from Faculty Information Systems, who was developing VIVO, a research and expertise discovery tool

The DMTF met monthly following its formation in July 2011 and worked until October 2012 to meet the expectations specified in its charge. Our efforts divided roughly into two phases:

  • In the first half, we focused on understanding the state of the campus and reviewing peer institutions.
  • In the second half, we focused on our recommendations, informed by an understanding of the national context, such as U.S. funding agencies' data policies and requirements, and of the research data life cycle.

The writing of the report — the outcome of our work — fell to a subset of the task force because of scheduling conflicts and members' varying degrees of expertise. Five members (three from the Libraries and two from Research Computing) served as primary writers, with each volunteering to write portions according to their expertise; the entire DMTF reviewed and edited the drafts until all members endorsed the final report. The writing team's composition, though not intended as a statement, reinforced the key partnership and leadership of the Libraries and Research Computing among the stakeholders for developing research data management.

Our approach is further described below.

The State of the Campus Survey

As a first step in assessing campus needs, the DMTF decided to survey researchers' data management practices. An anonymous Qualtrics survey was sent to the campus community in January 2012 via the Faculty and Research E-Memo list consisting of 4,411 names. We received 148 complete responses. While this low response rate (about 3 percent) means the results cannot be considered statistically representative, we decided that the survey data nonetheless would inform our work. The survey gathered information about:

  • Respondents and their research areas
  • Types of research data generated
  • Current storage amounts and projected growth
  • Length of time research data will need to be accessible
  • Data and metadata formats for storage
  • Maintenance of data or metadata documentation
  • Implementation of formal data management plans
  • People managing data
  • Storage and backup technology being used
  • Proportion of data that is sensitive, confidential, or proprietary
  • Interest in different types of data management services
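The response rate behind the caveat above can be checked with simple arithmetic. The short Python sketch below uses only the two figures reported in the text (4,411 names on the list and 148 complete responses); the function and variable names are illustrative and do not come from the task force's report.

```python
# Figures reported in the text: list size and number of complete responses.
LIST_SIZE = 4411
COMPLETE_RESPONSES = 148


def response_rate(responses: int, invited: int) -> float:
    """Return the response rate as a percentage of those invited."""
    if invited <= 0:
        raise ValueError("invited must be positive")
    return 100.0 * responses / invited


rate = response_rate(COMPLETE_RESPONSES, LIST_SIZE)
print(f"Response rate: {rate:.1f}%")
```

A rate this low says nothing by itself about who chose to respond, which is why the results are best read as indicative rather than representative.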

Survey results indicated that:

  • The vast majority of respondents at CU-Boulder manage their own research data (see figure 1).

    Figure 1. Who manages my data?

  • A wide variety of methods for storing data were used across the university (see figure 2). Further analysis showed that researchers in the basic and applied sciences were more likely to use LAN storage than those in the social sciences and liberal arts. Off-campus storage was used to a significant degree only in the basic and applied sciences.

    Figure 2. Data storage methodologies

  • Respondents reported a wide range of answers regarding the percentage of their total data that could be considered sensitive, confidential, or proprietary. No statistical pattern linked the amount of proprietary data with department or area of research.
  • Many different file types and data types are used for research data, and the total amount of data stored by researchers at CU-Boulder varies widely. Most researchers store between 1 GB and 10 TB. Not surprisingly, researchers in the basic and applied sciences were most likely to have large amounts of data.
  • Few researchers had created a data management plan (74 percent had none), and most did not maintain metadata or data documentation (64 percent did not).
  • The majority of respondents expressed a need for assistance with their data management. Types of data management services desired are listed in figure 3.

    Figure 3. Type of data management service

A complete report of the survey is available in Appendix D of the Data Management Task Force Report.3

ARL E-Science Institute

The task force reviewed a report resulting from the CU-Boulder Libraries' participation in the 2011–2012 Association of Research Libraries (ARL) E-Science Institute, which was designed in 2011 to help research libraries develop a strategic agenda for e-research support, with a particular focus on the sciences. The report recommends, among other things, involving cross-functional experts, funding the development of tools to support data sharing and preservation, and establishing a Research Data Services unit at CU-Boulder. This report informed the DMTF of the University Libraries' initiative for research data support and services and helped reinforce some of the DMTF's recommendations.

Institutional Repository

The task force also examined the possibility of using the existing CU-Boulder institutional repository to house research data. Considering the repository software platform (Ex Libris's DigiTool), the task force concluded that the current institutional repository is not capable of serving as a full life-cycle data management system; however, it could be used in a limited way to store and provide access to completed data sets.

Use Cases

The DMTF also collected four actual use cases from CU-Boulder faculty:

  • Instrument and sensor data in biometeorology and climatology
  • A data collection of characters from ancient Chinese manuscripts
  • Diverse data types including non-digital data with relations to other types of data and virtual objects for the Exchange for Local Observations and Knowledge in the Arctic (ELOKA)
  • Diverse data sets (instrument, laboratory notebook, etc.) in the NMR spectroscopy facility

These use cases were converted into requirements, which were used in comparing three possible data management systems available to us for testing: Data Conservancy, DigiTool, and Islandora. None of these systems adequately met all the data management needs without significant development; thus, the DMTF's final report did not include the comparison of these particular technical systems. This method, however, would be useful in future assessment of more fully developed systems.
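The general method of turning use cases into requirements and checking candidate systems against them can be sketched in a few lines of Python. Everything in this sketch is a hypothetical placeholder: the requirement names, the generic "System A" and "System B" labels, and their capabilities are invented for illustration and do not reflect the task force's actual requirements or its findings about any real system.

```python
from typing import Dict, List, Set

# Hypothetical requirements derived from use cases (illustrative placeholders only).
use_case_requirements: Dict[str, Set[str]] = {
    "sensor data": {"large file ingest", "time-series metadata"},
    "manuscript characters": {"image storage", "non-Latin text metadata"},
}

# Hypothetical capabilities of candidate systems (illustrative placeholders only).
system_capabilities: Dict[str, Set[str]] = {
    "System A": {"large file ingest", "image storage"},
    "System B": {"image storage", "non-Latin text metadata", "time-series metadata"},
}


def unmet_requirements(
    requirements: Dict[str, Set[str]],
    capabilities: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """For each system, list the requirements (across all use cases) it does not meet."""
    all_required: Set[str] = set().union(*requirements.values())
    return {
        system: sorted(all_required - supported)
        for system, supported in capabilities.items()
    }


gaps = unmet_requirements(use_case_requirements, system_capabilities)
for system, missing in gaps.items():
    print(f"{system}: {len(missing)} unmet -> {missing}")
```

Listing the unmet requirements, rather than computing a single score, keeps the comparison tied to the specific gaps that would need development work.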

Review of Peer Institutions

It was helpful to review policies and practices at other universities in formulating our recommendations. We looked at nearly 20 universities with research data management websites and selected six of them: Cornell University, Purdue University, University of Illinois at Urbana-Champaign, University of North Carolina at Chapel Hill, University of Virginia, and University of Wisconsin-Madison. All except Cornell belong to the group of AAU (Association of American Universities) public institutions that CU-Boulder considers its peers for comparisons of faculty salaries, instruction, and many other activities. Cornell, a private university, was included because it provides many data management services that some DMTF members regarded as a good model. We conducted an in-depth review of the organizational, service, and funding models of these six universities.

  • The review revealed a number of commonalities in organizational and service models. One particular commonality was the existence of a collaborative model that, in each case, included the institutional equivalents of the Libraries, Office of Information Technology, and associate vice chancellor for research.
  • Conversely, the review revealed a variety of funding models. In some cases, specific units, primarily libraries and campus IT, absorbed the costs, while other institutions received funding or obtained new personnel from campus administration. Reallocation of existing personnel was also a common theme.

Recommendations

Our efforts led to several recommendations.

  1. It was important to begin with a common understanding of objectives and some basic premises about the nature of research and scholarship as redefined in a digital environment to set the context for our recommendations. We reviewed literature on research data and funding agencies' data policies and requirements for data management plans as an introduction. As we were writing our recommendations, we continued to return to those national contexts to explain the rationale for or to reinforce some of our recommendations. One report of note was Digital Research Data Sharing and Management, by the National Science Board's Committee on Strategy and Budget Task Force on Data Policies.4 This report (henceforth referred to as the NSB report) includes a statement of principles to frame research data issues and to guide development of relevant data policies, which the DMTF found especially helpful. Rather than writing our own, we used these statements as the foundation for our efforts.
  2. The DMTF agreed early on to use the term "sociotechnical infrastructure" for its explicit recognition of the social element and contexts of the people, favoring it over "cyberinfrastructure." We refer to sociotechnical infrastructure as a required investment for successful implementation of research data management (RDM) services and for long-term sustainability.
  3. The initial drafts of our recommendations included sections on Policy, Governance, Service Model, Technical Solutions, and Funding Model. While those sections did not remain intact as our recommendations went through many iterations, the final version included all of those components in one form or another.
  4. A good portion of the writing team's meetings was spent in spelling out the suite of services necessary for good research data management. To describe the service components, seven data life-cycle models were studied and compared, and the associated data management needs were articulated. Based on the needs, we derived common components of services necessary to meet data management requirements from funding agencies. This approach also allowed us to delineate the services we could provide immediately with existing resources and expertise from the services necessary but requiring additional resources.
  5. Delivering the services requires a sound organizational infrastructure. We recommended three levels of organizational groups: an operational group, an executive committee to provide direction and support for the operational group, and an advisory committee to address governance issues and develop necessary policies.
  6. To sustain the sociotechnical infrastructure necessary to provide research data services, the addition of five dedicated FTEs was suggested. Opinions among DMTF members differed, ranging from making the recommendation for additional positions more detailed and explicit to leaving it out altogether. In the end, recognizing that successful research data management requires resources and expertise that were only minimally available or not available at all at CU-Boulder, we used softer language to suggest additional personnel.
  7. The initial drafts of recommendations did not include heavy emphasis on the promotion of responsible research data management. As the report was nearing the final draft, it became clear to the writing team that the report should begin with a recommendation that the academic leadership highlight and encourage research data management, following the principles recommended in the NSB report, in particular the first two statements: "Openness and transparency are critical" in scientific progress and research and "Open Data sharing is closely linked to Open Access publishing and they should be considered in concert." Further, we recommended that the academic leadership promote these principles by encouraging:
    • Alignment of faculty review systems (i.e., tenure and promotion) with these principles
    • Faculty to consider various forms of open access publishing
    • Faculty to adopt policies for sharing data in the most open way possible
    These were bold recommendations and not without controversy. Some DMTF members questioned the inclusion of open access publishing in this report. A detailed and thoughtful explanation of facets of open access, its relationship to data management, and both being regarded as fundamental practices for good and responsible scholarship helped everyone understand and accept the recommendations as written.
  8. The recommendations in the final report are presented in four sections:
    • Highlight and encourage research data management
    • Formally create a Research Data Services Organization
    • Develop research data governance and procedures
    • Establish a sustainable sociotechnical infrastructure necessary for full life-cycle data management
    Further, we included the responsible parties who should lead the effort in carrying out the recommendations in each section.

The full report of the DMTF [http://hdl.handle.net/10971/1398] can be accessed online [http://digitool.library.colostate.edu/R/?func=dbin-jump-full&object_id=174733].

Conclusion

Issues surrounding research data management are complex. CU-Boulder was not alone among institutions with strong research endeavors in attempting to develop services and infrastructure for research data management. The strong partnership that already existed between Research Computing and the University Libraries, along with expertise residing at NSIDC, paved the way for a smooth transition from an informal virtual group to the formal Research Data Services. Charged by the vice chancellor for research to act as a nexus for leading data management efforts and to make recommendations, the Data Management Task Force began its work with a significant list of expectations. Our approach was to begin by assessing data management needs through a campus-wide survey. The survey results and the Libraries' existing initiatives, as well as a review of peer institutions and a synthesis of data life-cycle models, informed the DMTF in formulating its recommendations.

Discussions within the writing team and the DMTF took many months, as did distilling them into final recommendations. Developing research data management services at an institution is not easy, and it requires substantial investment in sociotechnical infrastructure (our preferred term for cyberinfrastructure). It will also be a multiyear endeavor, as the environment and technology continue to evolve. In the end, the recommendations the DMTF presented were made with full recognition and consideration of our particular institutional culture at CU-Boulder. One size does not fit all where data is concerned.

Notes
  1. See FY 2012 (PDF) [http://www.colorado.edu/VCResearch/reports/12/annreport12.pdf], which resides chronologically with other annual reports of sponsored research on the Research Administration's Reports and Strategic Planning [http://www.colorado.edu/VCResearch/reports/index.html] page under Sponsored Projects.
  2. The mission of the Libraries is "to enrich and advance learning and discovery in the University, community, state, and nation by providing access to a broad array of resources for education, research, scholarship, and creative work to ensure the rich interchange of ideas in the pursuit of truth and learning." Research Computing's mission statement is "The integration of computing resources, software, and networking, along with data storage, information management, and human resources to advance scholarship and research is a fundamental goal of cyberinfrastructure (CI). The mission of the research computing group is to provide leadership in developing, deploying, and operating such an integrated CI to allow CU-Boulder to achieve further preeminence as a research university." The Office of the Vice Chancellor for Research "serves as a focal point for the investment in campus research, scholarship, and creative works, and strives to provide the infrastructure and administrative support necessary to promote and sustain our world class faculty and research programs."
  3. Research Data Management at the University of Colorado Boulder: Recommendations in Support of Fostering 21st Century Research Excellence (2012).
  4. National Science Board, Digital Research Data Sharing and Management, prepublication copy, December 14, 2011.