Five New Paradigms for Science and an Introduction to DataONE

Authors:: William Michener
Published:: Wednesday, March 21, 2012
Columns:: E-Content
Collection:: In Print

William Michener is Professor and Director of e-Science Initiatives for University Libraries at the University of New Mexico. He is Project Director for Data Observation Network for Earth (DataONE) and is involved in research related to sustainability of cyberinfrastructure, development of federated data systems, and community engagement and education.

We are entering a new era of science and scholarship. At least five paradigm shifts are driving many of the emerging trends associated with this new era. First, “grand challenge” questions are increasingly dominating the scientific research agenda. The National Science Foundation (NSF) budget, for example, designates significant funding for challenging problems including clean energy research; science, engineering, and education for sustainability; and creating cyberinfrastructure for the 21st century.¹ NSF has invested heavily in telescopes, gravitational observatories, and other community instruments for the astronomy and physics communities. The term “big science” is often used to refer to the use of these community-based infrastructure platforms that engage large, interdisciplinary teams of scientists in addressing extremely complex and challenging questions. The biological sciences and geosciences are now seeing similar investments in community infrastructure such as EarthScope (http://www.earthscope.org/), the Ocean Observatories Initiative (http://www.oceanobservatories.org/), and the National Ecological Observatory Network (http://www.neoninc.org/). Thus, in this new era, big science extends to all research domains.

Second, data are now being viewed as valuable products of the scientific enterprise, as evidenced by the requirements for data-management plans by the National Institutes of Health and NSF.² This represents a major departure from the past, when project success was judged almost entirely by the number of publications and number of students supported.

Third, libraries are going virtual and are becoming the new era’s repositories for knowledge, information, and data. Increasingly, books are shelved on moveable stacks, freeing up space for new collaboration spaces, as well as computing and visualization hardware. One consequence of this change is the move toward digital content collections that are readily accessible via the web, enabling even small libraries to develop and make accessible valuable digital material as part of curated collections.

Fourth, data-intensive science has been characterized as the fourth research paradigm, following on the heels of experimentation, theory, and computer simulation.³ It is now possible, for example, to perform dynamic simulations in which sensor networks provide real-time data that are used to update models and forecasts on the fly.

Fifth, with the emergence of data-intensive science, it can be argued that data management has become the “new statistics,” meaning that students now need to be trained in all aspects of the data life cycle so that they can proficiently manage massive volumes of complex data and use new analytical and visualization tools to interpret underlying patterns and processes.

Associated Challenges

The five paradigm shifts described above have created a need for new information infrastructure and research approaches. This is exemplified in the environmental sciences, where the scope and nature of biological, environmental, and earth sciences research are evolving in response to environmental challenges such as global climate change, invasive species, and emergent diseases. Scientific studies, as a consequence, are increasingly focusing on long-term, broad-scale, and complex questions. Large volumes of diverse data collected by remote-sensing platforms and embedded environmental sensor networks via collaborative, interdisciplinary science teams are required to address such questions. In addition, new approaches are necessary for managing, preserving, analyzing, and sharing the diverse array of data.

We face several challenges as we move into this new era of grand challenge science and scholarship. First, big science and the digital makeover of libraries require substantial funding, which has been difficult to realize in recent times when research sponsors and university systems have been financially strapped.

Second, numerous informatics-related challenges complicate the picture. In a recent survey of environmental scientists, Carol Tenopir and her colleagues ascertained that more than 80 percent of the respondents agreed that they would be “willing to share data across a broad group of researchers who use data in different ways.”⁴ However, a majority of respondents also noted that they experienced difficulties in doing so because of the absence of formal established processes to store data beyond the project, inadequate tools and support for data management during the life of the project, and the poor state of existing tools for preparing documentation. These challenges are amplified by the fact that most data sets reside in hard-to-discover data silos, including individual laptop and desktop computers, institutional repositories, and even large data centers that may not be readily accessible to the interested scientist. Hence, I postulate that science is presently hindered by the “80:20 problem”—that is, 80 percent of a scientist’s effort is spent discovering, acquiring, documenting, transforming, and integrating data, whereas only 20 percent of the effort is devoted to more intellectually stimulating pursuits such as analysis, visualization, and making new discoveries. New IT solutions are clearly needed.

New Solutions to the Informatics Challenges

The DataNet program at NSF was created to “catalyze the development of a system of science and engineering data collections that is open, extensible and evolvable.”⁵ To date, five DataNet awards have been made, three of them in late 2011; the two earlier awards (in 2009) went to DataONE (University of New Mexico) and the Data Conservancy (Johns Hopkins University). To focus on one example, DataONE (https://www.dataone.org/) was designed to provide an underlying information infrastructure that facilitates data preservation and reuse for research with a principal focus on the biological, environmental, and earth sciences. DataONE, which stands for Data Observation Network for Earth, supports rapid data discovery and access across diverse data centers distributed worldwide and will provide scientists with an integrated set of familiar tools that support all elements of the data life cycle (e.g., from data-management planning and acquisition through data integration, analysis, and visualization).

The cyberinfrastructure implemented by DataONE comprises three principal components: Member Nodes, Coordinating Nodes, and an Investigator Toolkit. Member Nodes include existing or new data repositories that install the DataONE Member Node application programming interfaces (APIs). Member Nodes encompass natural history collections, earth-observing institutions, research projects and networks, libraries, universities, and governmental and nongovernmental organizations. Each Member Node acquires and maintains data and frequently provides value-added support services (e.g., user help desk, visualization services) to a particular community of users.

Coordinating Nodes are designed to be tightly coordinated, stable platforms providing network-wide services to Member Nodes. They are responsible for cataloging content, managing replication of content, providing search and discovery mechanisms, managing access-control rules, and mapping identities among different identity providers. Three initial Coordinating Nodes are located at Oak Ridge Campus (a consortium comprising Oak Ridge National Laboratory and the University of Tennessee), the University of California, Santa Barbara, and the University of New Mexico. Coordinating Nodes maintain the integrity of the DataONE federation by ensuring sufficient replicas are made of digital objects (e.g., data plus associated metadata) to facilitate long-term preservation and by tracking those replicas to enable the identification of specific Member Nodes where the content can be retrieved. The Coordinating Node indexing services, in essence, provide a system-wide search mechanism enabling users to discover relevant content from all participating Member Nodes.
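The Coordinating Node responsibilities described above (cataloging content, ensuring enough replicas exist, tracking where those replicas live, and offering a network-wide search) can be illustrated with a small toy model. The sketch below is not the DataONE API: the class names `MemberNode` and `CoordinatingNode`, the methods `register`, `locate`, `ensure_replicas`, and `search`, and the `min_replicas` parameter are all hypothetical, chosen only to make the division of labor concrete.

```python
from dataclasses import dataclass, field


@dataclass
class MemberNode:
    """Hypothetical stand-in for a repository that holds digital objects."""
    name: str
    objects: dict = field(default_factory=dict)  # identifier -> content


class CoordinatingNode:
    """Toy catalog that tracks which Member Nodes hold each object.

    Assumption: a simple in-memory model, not the real DataONE service.
    """

    def __init__(self, member_nodes, min_replicas=2):
        self.member_nodes = {node.name: node for node in member_nodes}
        self.min_replicas = min_replicas
        self.metadata = {}  # identifier -> descriptive text used for search

    def register(self, identifier, description):
        """Catalog an object's metadata so it can be discovered later."""
        self.metadata[identifier] = description

    def locate(self, identifier):
        """Return the sorted names of Member Nodes that hold the object."""
        return sorted(
            name
            for name, node in self.member_nodes.items()
            if identifier in node.objects
        )

    def ensure_replicas(self, identifier):
        """Copy the object to more nodes until min_replicas is reached."""
        holders = self.locate(identifier)
        if not holders:
            raise KeyError(f"no Member Node holds {identifier!r}")
        content = self.member_nodes[holders[0]].objects[identifier]
        for name in sorted(self.member_nodes):
            if len(holders) >= self.min_replicas:
                break
            if name not in holders:
                self.member_nodes[name].objects[identifier] = content
                holders.append(name)
        return sorted(holders)

    def search(self, term):
        """Return identifiers whose cataloged description contains term."""
        term = term.lower()
        return sorted(
            identifier
            for identifier, description in self.metadata.items()
            if term in description.lower()
        )
```

In this sketch a user would call `search` to discover an identifier and then `locate` to learn which Member Nodes can serve it. Keeping the catalog separate from the stored content mirrors the split between Coordinating Nodes and Member Nodes described above. A real federation would also have to handle access control, identity mapping, and node failures, which this toy model omits.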

The Investigator Toolkit is a modular set of software and plug-ins that enables interaction with DataONE infrastructure through commonly used analysis and data-management tools. Components in the Investigator Toolkit include low-level software libraries intended for developers and more technically inclined investigators, plug-ins for desktop applications such as the R Project for Statistical Computing (http://www.r-project.org/), and operating system extensions such as file system drivers that expose DataONE as essentially a large network drive. The overarching goal of the Investigator Toolkit is to provide seamless interaction with the DataONE cyberinfrastructure for storing, retrieving, discovering, and visualizing data.

Ushering in the New Era

Innovative cyberinfrastructure platforms, a re-envisioning of the library’s role in support of scholarship, and scientific approaches that place high value on data stewardship are needed to resolve the numerous grand challenges faced by scientists and society. Platforms like DataONE, together with tools that reduce the time scientists spend on mundane data-management activities, are expected to significantly advance the nature and pace of science.

The new era of grand challenge science and scholarship offers significant potential to advance our state of knowledge, transform academia, and benefit society. Three specific actions can advance this transition. First, we need to promote this change by embracing interdisciplinary, transdisciplinary, collaborative, and data-intensive science. This requires lobbying for the necessary funding and providing support and recognition to those individuals who choose to join teams in addressing grand challenge problems. Second, we need to educate future generations of scientists by inculcating informatics throughout domain curricula, not just in the computer and information sciences. Third, we need to advocate for change with a focus on breaking down data, academic, and funding silos so that adequately funded, interdisciplinary teams of scientists are poised to tackle the grand challenges.

Notes

1. “NSF Presents President's Fiscal Year 2012 Budget Request of $7.76 Billion,” press release, February 14, 2011, <http://www.nsf.gov/news/news_summ.jsp?cntn_id=118642>.

2. NIH Data Sharing Policy and Implementation Guidance: <http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm>; NSF Dissemination and Sharing of Research Results: <http://www.nsf.gov/bfa/dias/policy/dmp.jsp>.

3. Tony Hey, Stewart Tansley, and Kristin Tolle, eds., The Fourth Paradigm: Data-Intensive Scientific Discovery (Redmond, Wash.: Microsoft Research, 2009), <http://research.microsoft.com/en-us/collaboration/fourthparadigm/>.

4. Carol Tenopir, Suzie Allard, Kimberly Douglass, Arsev Umur Aydinoglu, Lei Wu, et al., “Data Sharing by Scientists: Practices and Perceptions,” PLoS ONE, vol. 6, no. 6 (June 2011), pp. 1–21, <http://www.plosone.org/article/info:doi/10.1371/journal.pone.0021101>.

5. National Science Foundation, Office of Cyberinfrastructure, Directorate for Computer & Information Science & Engineering, “Sustainable Digital Data Preservation and Access Network Partners (DataNet),” <http://www.nsf.gov/pubs/2007/nsf07601/nsf07601.htm>.

EDUCAUSE Review, vol. 47, no. 2 (March/April 2012)

ParentTopics:: Data Administration and Management Research Digital Collections Digital Libraries