Research Computing in the Cloud: Leveling the Playing Field


Advances in cloud computing allow almost unlimited access to high-end computing resources for researchers at every type of institution, creating a more level playing field for experimentation than has ever existed before.

Credit: Stavros Constantinou / Getty Images, © 2019

Imagine you're a researcher at a regional university or a small college, and you want to analyze a big data set or perform an experiment that requires massive amounts of computing power and storage. Until recently, you had the difficult task of trying to get access to equipment worth thousands or millions of dollars, equipment that your institution couldn't procure or support. But today, advances in cloud computing mean that you can "rent" such equipment, often for hundreds or thousands of dollars. Such is the dramatic revolution in access to high-end computing resources enabled by Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform, as well as others.

Traditional high-performance computing (HPC) presents major challenges for research faculty at most campuses. First is the high cost of acquisition. A small computing cluster can cost a hundred thousand dollars or more, and the sky's the limit. Second, most academic and IT departments lack the specialized expertise to support such equipment—in some cases, they may not even have an appropriate place to house it. The lack of in-house support means that campus researchers spend valuable time specifying, configuring, and operating HPC, and unless HPC is your research area, that time comes at the expense of the research itself. Large research universities may assign graduate students to some of this work, but even then, it is often not the most productive use of their time either.

Furthermore, HPC is often funded by one-time research grants and quickly becomes obsolete. Frustrated researchers find their once state-of-the-art computing clusters aging and look to institutional support to keep them updated—support that's often not available. Without technical and administrative support, in-house technology can be underutilized. HPC in one department might be used by researchers in another department, if they know it exists, if they have documentation and assistance, and if a scheduling process is available to support equipment sharing. In the absence of this infrastructure, valuable equipment in one department or college often sits idle while another department struggles to obtain funding for its own HPC.

The universal availability of commodity cloud services and high-speed networks can eliminate the requirement that departments must have local HPC resources. The infrastructure available from large cloud providers such as AWS dwarfs and outperforms all but the largest and most-specialized supercomputing facilities. Researchers can design and deploy experiments requiring hundreds or thousands of high-end processors in short timeframes, creating access for research faculty, graduate students, and even undergraduate students to HPC environments that were unimaginable just a few years ago.

Modern programming tools such as Docker and Kubernetes enable researchers to build, scale, and share massively parallel computational analyses and experiments. As one researcher told me: "I can re-create the exact same environment my colleagues at NASA are using, without having to configure a single piece of hardware. Most people just don't get how big a deal this is." Jupyter Notebooks have become a de facto standard for organizing, documenting, and sharing computational experiments.

Significant challenges to widespread adoption exist, of course. Cloud computing for research requires a different model for research support, just as it does for business information technology. Instead of one-time capital investments—which can often be made opportunistically from one-time funds, grants, and donations—cloud-based HPC requires ongoing financial support. With the cloud, you never stop getting a bill. However, you also eliminate the problems associated with supporting aging, obsolete equipment.

A lack of experience among researchers and IT support staff creates a fear of runaway costs. This is a little like the shock that once greeted new cell phone users when the bill for an overseas trip arrived. However, just as with cell phone providers, cloud providers offer sophisticated tools for estimating, controlling, and reducing the costs of cloud usage. While this concern is legitimate, it's often blown out of proportion.
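The arithmetic behind such an estimate is simple, which is part of why the fear is manageable. A minimal sketch follows; the node counts, hourly rate, and storage price are hypothetical placeholders, not any provider's actual pricing, so real budgeting should use the provider's own cost calculator:

```python
# Back-of-the-envelope cost estimate for a single cloud experiment.
# All rates below are hypothetical placeholders, not real pricing.

def estimate_cost(nodes, hours, hourly_rate,
                  storage_tb, storage_rate_tb_month, months=1):
    """Return the total estimated cost in dollars."""
    compute = nodes * hours * hourly_rate               # node-hours billed
    storage = storage_tb * storage_rate_tb_month * months
    return compute + storage

# Example: 100 nodes for 8 hours at a hypothetical $1.50/node-hour,
# plus 20 TB stored for one month at a hypothetical $25/TB-month.
cost = estimate_cost(nodes=100, hours=8, hourly_rate=1.50,
                     storage_tb=20, storage_rate_tb_month=25)
print(f"Estimated cost: ${cost:,.2f}")  # prints "Estimated cost: $1,700.00"
```

Even this toy version makes the key point: the dominant cost driver is node-hours, which the researcher controls directly by shutting resources down when a run ends.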

Cloud computing doesn't eliminate the support and expertise issue. While cloud providers have invested a lot of time and effort into providing training and simplified interfaces, most researchers are not going to create a cloud account on Monday and be running an experiment on Tuesday. But the volume of users and the amount of standardization that cloud providers have developed—along with tools such as Jupyter Notebook, Docker, and Kubernetes—have created a more common HPC infrastructure that enables a quicker ramp-up than in the past. It's entirely possible for researchers with a rudimentary understanding of these tools to use a computing environment created at another research lab, substitute their own data, and be up and running with an experiment in a few hours. In this way, research productivity is greatly accelerated.

Network bandwidth can be another issue. Moving large data sets on commodity networks, or even on regional research and education networks, simply doesn't work well for hundreds of terabytes or petabytes of data, which is the scale required by modern researchers in many fields. Often researchers resort to shipping hard drives; cloud services such as AWS Snowball and AWS Snowmobile have been developed to support the process. Three steps in the network path—from the lab to the campus border, from the campus to the ISP or regional research and education network, and from the network provider to the cloud provider—can each pose a challenge.
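The arithmetic behind that limitation is easy to check. A minimal sketch, assuming an idealized, fully dedicated link with no protocol overhead:

```python
# Time to move a data set over a network at a sustained rate.
# Assumes idealized throughput: no protocol overhead, no congestion.

def transfer_days(data_tb, gbps):
    """Days needed to move data_tb terabytes at gbps gigabits/second."""
    bits = data_tb * 1e12 * 8        # terabytes -> bits
    seconds = bits / (gbps * 1e9)    # sustained link rate in bits/second
    return seconds / 86_400          # seconds -> days

# One petabyte over a fully dedicated 1 Gbps campus link:
print(f"{transfer_days(1000, 1):.0f} days")    # prints "93 days"
# The same petabyte over a 100 Gbps research network:
print(f"{transfer_days(1000, 100):.1f} days")  # prints "0.9 days"
```

At commodity speeds, a petabyte takes months even under ideal conditions, which is why shipping physical storage remains competitive and why dedicated research networks matter.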

To begin to address these issues, the Pacific Research Platform (PRP), a collaboration among research universities and CENIC (operator of the CalREN network serving California), has been funded by the National Science Foundation to support the streaming of "elephant flows." The PRP uses dedicated, specialized network endpoints (Flash I/O Network Appliances, or FIONAs), connected to a dedicated network reserved for research, to optimize the continuous, high-speed streaming of large data sets. It also depends on end-to-end performance measurement to (1) ensure that theoretical throughput can be achieved in practice and (2) diagnose network bottlenecks that can occur, whether in the PRP network or—more often—in the "last mile" connection on campus between the campus border and the research network.

As is the case with any use of cloud services, information security considerations play a role. Intellectual property protection, personally identifiable information, and export controls can all present issues, requiring the appropriate review and analysis. Since many researchers lack a detailed understanding of these issues, the responsible use of cloud resources will often require collaboration with campus information security personnel. Just as with traditional enterprise computing, most research can be moved to a cloud provider if the appropriate protocols are in place, but some may benefit from being kept in the local environment.

Even though the cloud provides real advantages for many workloads, there may be good practical and financial reasons to stick with on-premises resources. A lab or a campus may want to fully amortize an investment in a local computing cluster rather than "paying twice" by adding cloud services fees into the mix. Furthermore, while a local cluster might be suboptimal for the largest computations, it can be a great "sandbox" environment for teaching students and developing computational techniques without having to "pay by the second." And of course, if the intent of the research or education is to better understand the implementation of high-performance computing, there can be great value in having a hands-on lab environment available for researchers and students.

A promising strategy today is a hybrid environment that takes advantage of both local resources and cloud resources as appropriate. The tools mentioned above—such as Jupyter Notebook, Docker, and Kubernetes—make executing the same code in the local environment and in the cloud relatively easy. Thus, researchers can develop and test their code on owned equipment and then, when they need more processor cores and memory, run their final experiments in the cloud once they are confident the code works as intended, reducing the chance of incurring cloud fees for an unsuccessful run.
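That workflow can be sketched in a few lines. The profile names and resource sizes below are invented for illustration: the point is that the experiment code itself never changes, only the resource profile it reads.

```python
# Sketch of a hybrid workflow: the same experiment code runs against a
# small local profile for development and a large cloud profile for the
# final run. Profile names and sizes are hypothetical, for illustration.

PROFILES = {
    "local-sandbox": {"workers": 4,   "memory_gb": 32},   # owned cluster
    "cloud-full":    {"workers": 512, "memory_gb": 4096}, # rented capacity
}

def run_experiment(profile_name):
    p = PROFILES[profile_name]
    # Real code would hand these values to a scheduler such as Kubernetes;
    # here we simply report the chosen scale.
    return f"Running with {p['workers']} workers, {p['memory_gb']} GB memory"

print(run_experiment("local-sandbox"))  # debug and validate cheaply
print(run_experiment("cloud-full"))     # final run, once the code is trusted
```

Because the code is identical in both cases, a bug found in the sandbox never turns into a billed cloud run.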

While there are still challenges and objections to using the cloud as a research instrument, the advantages in many cases are so compelling that we will continue to see research migrate in that direction. From large research universities to small liberal arts colleges, the research cloud will become a growing part of the research instrumentation portfolio, and campus IT departments will likely have to add cloud facilitation for research to the ever-growing list of capabilities expected of a modern IT department. The good news is that the cloud has the potential to allow almost unlimited access to high-end computing resources for researchers at every type of institution, creating a more level playing field for experimentation than has ever existed before.

Michael Berman is Chief Innovation Officer and Deputy CIO at California State University, Chancellor's Office. He is the 2019 Editor of the New Horizons column for EDUCAUSE Review.

© 2019 Michael Berman. The text of this article is licensed under the Creative Commons Attribution 4.0 International License.

EDUCAUSE Review 54, no. 1 (Winter 2019)