As Yogi Berra said, "Predictions are hard, especially about the future." Nevertheless, as a computer guy, I'd like to offer a few forward-looking observations about the emerging impact of information technology on scientific research. And I'd like to ask a couple of questions. Are some of the current uses of information technology in scientific research redefining traditional scientific research? Has the computer revolution produced a "new renaissance," one that has resulted in the creation of additional, new paradigms of science?
Traditional Scientific Research
Scientific research refers to a particular method for acquiring knowledge about natural phenomena. This method has two dimensions: one of observation and experimentation and one of description and explanation. Sometimes, observation precedes explanation, and sometimes a proposed explanation precedes experimental confirmation. A scientific explanation is often made by creating a model of (some definable part of) reality. As the statistician George Box observed: "Essentially, all models are wrong, but some are useful."1 That is, all models are only approximations and are subject to being superseded by more useful ones.
The observation and experimentation dimension of scientific research is called, appropriately enough, experimental science; the creation of models to describe and explain natural phenomena is called theoretical science. These two dimensions—experiment and theory—are sometimes called the two paradigms of science. We might say that observation/experiment is the ground of science and explanation/theory is the superstructure. A theoretical model is especially valued if it not only explains previously observed phenomena but also predicts new phenomena that are subsequently observed. Two examples from the history of science demonstrate the interplay of observation and explanation.
Astronomy provides one example of traditional scientific research. Perhaps as long ago as 3000 BCE, the Babylonians observed "wanderers" (i.e., planets) in the sky and tracked their movements among the fixed stars. Around 100 CE, Ptolemy created a model of the universe that, among other things, explained these wanderings. His model had a fixed earth at the center of the universe, with the sun, the moon, and the other planets orbiting it in circles. Some of those orbits had to be circles within circles to explain the "backing up" (retrograde motion) that some planets occasionally exhibited.
In the 1500s, Nicolaus Copernicus suggested a different model that put the sun at the center of the universe and placed the earth as the third planet orbiting it (i.e., two planets orbited closer to the sun, and three were farther away). One of the reasons this new model was not immediately accepted was that observation proved it to be as inaccurate as the old one. But when Johannes Kepler modified the model in the early 1600s by changing the orbits from circles to ellipses, it became much more accurate. And it was simpler: no more orbits of circles within circles. Galileo added experimental evidence in support of Kepler's model using the new "information technology" called the telescope. As a culmination of this phase of astronomy, Sir Isaac Newton's mathematical laws of motion and gravitation explained why the planets followed elliptical orbits, while the laws also provided equations that could predict the planets' motions.
A second example of traditional scientific research is provided by the biological sciences. Around 300 BCE, Aristotle began the work of classifying different life forms based on their similarities and differences. The insights of Charles Darwin and Alfred Russel Wallace in the 19th century explained those similarities and differences as resulting from evolution from previous life forms, where more similar organisms had more recent common ancestors. A few years later, Gregor Johann Mendel's experiments with peas suggested a mechanism for inheritance: genes, which were carriers of inheritable traits. In 1953, Francis Crick and James Watson identified the double helix structure of the chemical DNA in the nucleus of cells and established that it contained the genes of the organism. This discovery was followed by another that explained how cells use genes to create proteins, from which all parts of living organisms are formed. Also, genetic imperfections during reproduction and at other times can be seen to foster changes that (occasionally) produce organisms with better chances of survival—explaining, at least in part, how evolution works. This 20th-century understanding of how life works was at least as revolutionary as, and probably much more so than, the 17th-century understanding of how the solar system works.
Uses of Information Technology in New Scientific Research
These days, information technology means electronic information technology. Even the single word technology seems to mean electronic information technology (to the regret of the engineering profession). Although this article also focuses on electronic information technology, I acknowledge the long, important train of IT developments. For example, developments in optical information technology—such as telescopes, microscopes, and cameras—contributed greatly to the advancement of science. In addition, mathematical information technology—such as mechanical calculators and tables of logarithms—helped to relieve the drudgery of performing numerical calculations, which were increasingly important in science after Newton. The mechanical calculator, in particular, foreshadowed the dramatic advance to electronic computation.
Newton's laws galvanized the use of mathematical models for scientific theory in the "hard sciences," a term that became a synonym for sciences based on mathematics. It would be difficult to overestimate the impact of Newtonian science on the Western mind. Before Newton, witches were abroad in the land, casting spells and killing cattle; after Newton, we lived in a mechanical universe where such things were impossible. But technical problems remained: having a mathematical equation and solving it were often two different matters. A famous example of this difficulty is the three-body gravitation problem: the equations for the motions of the sun, the earth, and the moon have no general closed-form solution, even when every other gravitational force is ignored.
Consequently, mathematical models were not always practical by themselves. This problem began to be solved by the application of electronic information technology to scientific research: namely, the computational simulation of mathematical models. For example, the earth-moon-sun gravitation problem could be simulated by starting with initial positions and velocities for the three bodies, computing new positions and velocities "one time step later" by applying Newton's laws, and then repeating this process for some (usually large) number of time steps. Of course, this process is not exact. Computers can represent real numbers only approximately, and accumulated round-off error can destroy the accuracy of a computation. Also, the time step has to be small enough to control the discretization error that comes from replacing the continuous flow of time with finite steps, yet large enough that the total number of steps (and the round-off error they accumulate) remains manageable. Since computational scientists always want to improve the accuracy of their results and tackle bigger problems, they have an insatiable appetite for bigger computers that can perform massive numbers of these calculations in a reasonable time.
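The time-stepping scheme just described can be sketched in a few lines. The following toy program is mine, not drawn from any production code: it uses the simplest possible integrator (forward Euler), fixes the sun at the origin, and advances an "earth" around it one hour at a time.

```python
import math

G = 6.674e-11       # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30    # mass of the sun, kg
AU = 1.496e11       # astronomical unit, m

def step(pos, vel, dt):
    """Advance one orbiting body by one forward-Euler time step.

    The sun is held fixed at the origin; pos and vel are (x, y) tuples.
    """
    x, y = pos
    vx, vy = vel
    r = math.hypot(x, y)
    # acceleration from Newton's law of gravitation, pointing at the sun
    ax = -G * M_SUN * x / r**3
    ay = -G * M_SUN * y / r**3
    return (x + vx * dt, y + vy * dt), (vx + ax * dt, vy + ay * dt)

# Start the "earth" at 1 AU with its circular-orbit speed (about 29.8 km/s)
pos = (AU, 0.0)
vel = (0.0, math.sqrt(G * M_SUN / AU))
dt = 3600.0                     # a one-hour time step, in seconds

for _ in range(24 * 365):       # simulate roughly one year
    pos, vel = step(pos, vel, dt)

# Euler integration accumulates error, so expect the orbital radius to
# have drifted slightly away from 1 AU after a simulated year
print(math.hypot(*pos) / AU)
```

Shrinking dt reduces the error of each step but multiplies the number of steps needed, which is exactly the trade-off described above.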
A major step in the use of computational science was taken when it was realized that nonmathematical models could be simulated as well as mathematical ones. That is, rules other than mathematical equations could be incorporated into computer programs and stepped through simulated time "to see what happens" to the state of the model. For example, traffic simulations can predict where bottlenecks and other problems may occur. Even simple rules can generate complex behavior, making it very valuable to observe the output of the simulation to understand what the rules imply. On the other hand, given a set of data, either observed or human-entered, can computers help us find rules to explain the data?
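As a concrete illustration of such rule-based simulation, the sketch below steps a toy one-lane ring road whose only rule is "a car advances one cell if the cell ahead is empty." (This is an invented example of mine; the rule happens to coincide with the elementary cellular automaton known as rule 184, a standard toy traffic model.)

```python
def step(road):
    """One tick of a toy ring road: each car ('1') advances one cell
    if the cell ahead is empty ('0'); otherwise it waits."""
    n = len(road)
    new = ['0'] * n
    for i, cell in enumerate(road):
        if cell == '1':
            ahead = (i + 1) % n
            if road[ahead] == '0':
                new[ahead] = '1'   # road clear: move forward
            else:
                new[i] = '1'       # blocked: stay put (a jam)
    return ''.join(new)

road = '1110100000'
for t in range(6):                 # watch the jam at the front dissolve
    print(t, road)
    road = step(road)
```

Even this trivial rule shows jams forming and dissolving "to see what happens," which is the point: the behavior is implicit in the rule and becomes visible only by running the simulation.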
In a recent international workshop on cyberinfrastructure for science, the following statement was made: "Many fields [of science] have depended on computational science simulations, and many now are beginning to depend on computationally intensive data analysis."2 This statement succinctly expresses the relatively recent addition of big data information technology to big computing information technology in scientific research. Supercomputers, the fastest computers available at a given time, have always produced big data output from simulation runs, but now big data information technology also occurs in many other venues, such as sensor output, experiment output, and databases.
The reason big data has recently become important is that disk storage is now cheap enough that scientists can afford to store massive amounts of data. When disk storage was introduced in the 1950s, the cost was about one dollar per byte. Since then, at roughly constant cost, disk capacity has doubled about every year and a half, even faster than the Moore's law doubling of the number of transistors on a chip. We passed the gigabyte-per-dollar threshold in the last decade. Later this decade, a terabyte of disk storage may cost one dollar!
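The arithmetic behind these thresholds can be checked with a tiny script. (The 1956 starting year and the exact rates here are my own assumptions, chosen to match the round numbers in the text; the real price curve was of course lumpier.)

```python
import math

START_YEAR = 1956        # first commercial disk drive (an assumption)
START_COST = 1.0         # dollars per byte, per the text
DOUBLING_YEARS = 1.5     # capacity per dollar doubles this often

def bytes_per_dollar(year):
    """Bytes one dollar buys, under the constant-doubling model."""
    doublings = (year - START_YEAR) / DOUBLING_YEARS
    return (1.0 / START_COST) * 2 ** doublings

# By 2012 a dollar buys on the order of a hundred gigabytes...
print(bytes_per_dollar(2012) / 1e9, "GB per dollar in 2012")

# ...and the terabyte-per-dollar mark needs about 40 doublings
year_tb = START_YEAR + DOUBLING_YEARS * math.log2(1e12)
print(round(year_tb), "is roughly when $1 buys a terabyte")
```

Under these assumptions the model lands the terabyte-per-dollar milestone in the middle of the 2010s, consistent with the "later this decade" projection above.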
Big data storage management is a research topic in and of itself: we can currently store more data than we know how to process. Scientists are learning how to look for patterns and trends in massive databases, but other challenges remain. The February 2012 issue of Computer focused on an aspect of big data research related to what is called the CAP theorem.3 CAP stands for consistency, availability, and partition tolerance, and the theorem states that a distributed data system can guarantee at most two of those three attributes. Traditional relational database systems could maintain all three attributes, so various methods for dealing with the two-out-of-three trade-off are being studied, such as the one Google uses to enable searching over the Web. As Google software crawls the Web to prepare the keyword indices that enable speedy search results, multiple copies of the same page may be included when the page has been updated over time. Searching for such inconsistencies while creating the database would be impossibly time-consuming, so the inconsistencies are tolerated and then dealt with at a later stage of the process.
Big data has other ramifications beyond data-intensive science. The October 2011 McKinsey Quarterly article entitled "Are You Ready for the Era of 'Big Data'?" asks five questions for businesses to consider:
- What happens in a world of radical transparency, with data widely available?
- If you could test all of your decisions, how would that change the way you compete?
- How would your business change if you used big data for widespread, real-time customization?
- How can big data augment or even replace management?
- Could you create a new business model based on data?4
The New Astronomy
Astronomy provides an example of a science that may be revolutionized by big data, because astronomers are building a massive database of all the sky images produced by many telescopes. Moreover, these images will be kept in time sequence, so that computer processing can look for changes over time rather than just examine isolated snapshots.
The New Biology
Biological sciences offer another example of how information technology is revolutionizing a science. The application of information technology to biology has even been given a name: bioinformatics. In fact, bioinformatics may be more than just the application of information technology to biology; it may involve the co-creation of new biology and new information technology. Biology is now recognized as an information science, founded on the genome information bases common to all forms of life.
Over the last decade, following the completion of the human genome project, the speed and cost of computerizing (sequencing) a genome have improved exponentially, in part by means of advanced IT techniques. The first genome cost millions, if not billions, of dollars. Now the goal of a thousand-dollar genome is in sight and may be reached in a few years. Medical science looks to the day when every person will have a computerized genome, which will be used to personalize medical treatment in ways that are just coming into view. Of course, without massive, cheap data storage and the ability to process it, none of this would be possible.
The New Social Science
Sociologists and other social scientists conduct experiments in which human behavior is studied. They also administer surveys that seek to study human behavior via the responses made by the subjects. Traditionally, information technology has been utilized to implement statistical analyses of such studies. Recently, William Sims Bainbridge has argued that a radically new mode of social science research will be enabled by information technology—namely, the use of simulations as an experimental environment. Electronic games such as World of Warcraft and other electronic environments such as Second Life offer social scientists a broad vista in which to study human behavior, a vista that extends well beyond what could be accomplished ethically in the physical world.5 (In this case, unlike the computational simulations discussed earlier, the simulations themselves are not the object of study; rather, the human behavior within those simulated environments is.)
Computer Processing of Scientific Literature
The designation "big data" refers to more than just big size. It can also refer to big complexity. In fact, any data that we don't yet understand well enough to computerize can be called big data. One such area of big data is textual information. Natural language processing (NLP) has frustrated computer scientists for decades but is now starting to yield to computer processing (and translation). When IBM's Watson computer defeated two Jeopardy! champions in head-to-head competition, we saw an example of significant progress in NLP.
An experimental system developed at the National Library of Medicine (NLM) provides an example of a scientific research project involving NLP. This project has used the MEDLINE database of titles and abstracts of biomedical research articles to create a knowledge base called Semantic MEDLINE. NLM scientists have processed the MEDLINE database with three interlocking IT systems. First, they have employed a huge thesaurus of biomedical terms, also developed at NLM, called UMLS (Unified Medical Language System), which enables them to choose one term from each set of synonyms (a controlled vocabulary). Next, they have developed an NLP system to find the key sentences in each abstract and put them into a standard form. These key sentences describe the claims of the article in the form "subject-predicate-object." Each key sentence also points back to the original title-abstract record for later use. Finally, all the key sentences (currently 60 million extracted from 20 million MEDLINE abstracts) are used to create a knowledge base, which is a directed graph—that is, nodes interconnected with arrows. (The Web is also a directed graph, one whose nodes are the web pages and whose arrows are the embedded web links that point to other pages.) In this case, the nodes are the nouns (the subjects and the objects) from the key sentences, and the arrows are links labeled by the verbs (the predicates) from the key sentences.
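The data structure described here, a directed graph whose labeled arrows are predicates, can be sketched with a handful of triples. (These example sentences and record identifiers are invented for illustration, not drawn from Semantic MEDLINE itself.)

```python
from collections import defaultdict

# Invented subject-predicate-object key sentences, each pointing back
# to a made-up source record, as the article describes.
triples = [
    ("aspirin",   "TREATS",   "headache",       "record-1"),
    ("aspirin",   "INHIBITS", "cyclooxygenase", "record-2"),
    ("ibuprofen", "INHIBITS", "cyclooxygenase", "record-3"),
]

# Directed graph: each node maps to its outgoing labeled arrows.
graph = defaultdict(list)
for subj, pred, obj, source in triples:
    graph[subj].append((pred, obj, source))

# Browse outward from one node, following its labeled arrows.
for pred, obj, source in graph["aspirin"]:
    print(f"aspirin --{pred}--> {obj}   (from {source})")

# Combining key sentences from multiple abstracts: which subjects
# share the same INHIBITS target?
by_target = defaultdict(set)
for subj, pred, obj, _ in triples:
    if pred == "INHIBITS":
        by_target[obj].add(subj)
print(dict(by_target))
```

A real Semantic Web store would hold such triples in RDF and answer queries in a language like SPARQL; the dictionaries here are just the directed-graph idea at its smallest, including the kind of cross-abstract combination the article credits with producing new results.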
This knowledge base is in the form of the new IT data system called the Semantic Web. It can be queried and browsed in ways that are analogous to (but considerably more powerful than) relational database systems. New scientific results have been derived from it by automatically combining key sentences from multiple abstracts. Experts conjecture that such systems will be important new sources of scientific results in the future, especially for interdisciplinary studies.
Is Information Technology Redefining Scientific Research?
In The New Renaissance, Douglas S. Robertson asserted that computing is creating a renaissance as great as the three that preceded it—namely, those created by spoken language (500,000–50,000 years ago), written language (5,000 years ago), and printing (500 years ago).6 If he's right, a revolution in scientific research will be simply a part of the legacy of electronic information technology.
Indeed, the use of information technology is propelling scientific research forward in many ways, including ways not discussed here (e.g., citizen science and enhanced collaboration). Previous inventions such as the telescope and the microscope likewise propelled astronomy and biology forward, but these tools merely enhanced the observation/experiment paradigm.7 And new mathematics such as calculus in the 17th century and game theory in the 20th century merely enhanced the explanation/theory paradigm.
On the other hand, computational modeling and simulation has been called "the third paradigm" of science by observers who think it adds something beyond the paradigms of experiment and theory. Global weather simulations are so faithful to reality that hurricanes are spawned (by the simulation itself) when the conditions are right. And simulations of the interior of the earth suggested that the inner core was rotating at a slightly different rate than the rest of the earth, even though there was neither data nor theory to suggest that this was happening (later analysis of earthquake seismic records confirmed that the simulation was correct). Computational simulation would seem to be theoretical science in that it involves the construction and manipulation of a model of a part of reality. But what manipulation! Churning through centuries of simulated climate in hours, for example, is certainly theory on steroids.
More recently, big data has been called "the fourth paradigm" of science. Big data can be observed, at least by computers processing it and often by humans reviewing visualizations created from it. In the past, humans had to reduce the data, often with statistical processing, to be able to make sense of it. Perhaps new big data processing techniques will help us make sense of it without traditional reduction. One of the goals of big data discussed in the book The Fourth Paradigm is to make the scientific record a first-class scientific object.8 As discussed above, Semantic MEDLINE is one step in that direction for textual information. Perhaps we will see techniques for doing likewise for tabular information.
Conservative observers might say that computation and big data make only quantitative changes in what we can do, not qualitative ones. Liberal observers might respond that large-enough quantitative changes can produce qualitative change. The printing press, for example, produced only a quantitative change in the time and cost required to make many copies of a document. Yet the qualitative changes that this quantitative change engendered in Western civilization were huge, probably enabling science itself.
So, maybe there are only two paradigms of science. But maybe there are four. Or then again, maybe there are three! In the last chapter of his book Phase Change, Douglas S. Robertson shifts from considering phase changes in scientific disciplines to considering the scientific method. He suggests the three paradigms (though he doesn't call them that) of collecting, compressing, and organizing information. Collecting information is clearly another term for observation and experimentation. Compressing information encompasses both traditional theory (e.g., mathematical modeling) and computational simulation (the compressed information is the simulation program). Organizing information refers to big data that can't be compressed.
These are especially exciting times for science, and that excitement is due in no small measure to the effects that information technology is having on scientific disciplines and on the scientific method itself. In keeping with Yogi's dictum, it may be hard to predict what will be the most useful way of re-characterizing the scientific method in the future, when the dust kicked up by information technology settles.
1. George E. P. Box and Norman R. Draper, Empirical Model-Building and Response Surfaces (New York: Wiley, 1987), p. 424.
2. "Grid Computing: The Next Decade," Zakopane, Poland, January 4–6, 2012.
3. Computer, vol. 45, no. 2 (February 2012).
4. Brad Brown, Michael Chui, and James Manyika, "Are You Ready for the Era of 'Big Data'?" McKinsey Quarterly, October 2011.
5. William Sims Bainbridge, "The Scientific Research Potential of Virtual Worlds," Science, vol. 317, no. 5837 (July 27, 2007), pp. 472–476.
6. Douglas S. Robertson, The New Renaissance: Computers and the Next Level of Civilization (New York: Oxford University Press, 1998).
7. Douglas S. Robertson calls such inventions "phase changes," which allow paradigm shifts in the understanding of the sciences and mathematics. Although considering phase changes in disciplines rather than as paradigm additions to the scientific method, Robertson's book makes the claim, as do I in this article, for the revolutionary impact of information technology on science. For a book-length treatment with more examples and more details, see Douglas S. Robertson, Phase Change: The Computer Revolution in Science and Mathematics (New York: Oxford University Press, 2003).
8. Tony Hey, Stewart Tansley, and Kristin Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery (Redmond, Wash.: Microsoft Research, 2009).