Seamless support for scientific data demands



November 18, 2004

Data-intensive shared resource facilities rely on Research Computing Support to provide and maintain a framework for complex experiments

From left to right: Research Computing Support team members Michael Gutteridge, Riani Wangadi, David Chambers and Suprianto Agus provide the computational infrastructure needed by the center's Scientific Imaging, Animal Health, Flow Cytometry, Proteomics, Genomics and Electron Microscopy shared resources.
Photo by Todd McNaught

By DANIELLE IPPOLITO

Confronted by the output of a confocal microscope or a mass spectrometer, even a medical student cramming for an anatomy exam would reconsider the definition of "data overload." Preventing either instrument from bogging down under its own data requires information-handling technology as sophisticated as the equipment itself.

Working behind the scenes of data-intensive shared resources that make use of these and other types of high-tech equipment, Research Computing Support staff members provide the computational infrastructure that supports the data churned out by the center's most prolific technology.

"There are around 20 information technology (IT) groups at the center working on different aspects of computational support," said Tim Hunt, the manager of the Research Computing shared resource. "We are unique in that we work strictly within a research context. Some shared resources need a lot of computational power, and we provide support including high-level, complex desktop support and research services."

Collaborative data management

Data-intensive shared resource facilities relying on their services include Scientific Imaging, Animal Health, Flow Cytometry, Proteomics, Genomics and Electron Microscopy.

"We rely on Research Computing Support to provide and maintain the data-management infrastructure to meet the demands of our experiments," said Dr. Philip Gafken, manager of the Proteomics shared resource.

"There can be a significant amount of data associated with a proteomics-based experiment. For me, the IT issues associated with storing and processing the data are daunting."

The lynchpin of proteomics experiments is the mass spectrometer, an instrument whose refined sensitivity allows investigators to identify hundreds to thousands of proteins within complex mixtures. For example, center investigators envision using mass spectrometry to sift through thousands of proteins in blood, improving cancer diagnosis by detecting changes in key proteins associated with cancer in the very earliest stages.

But the improved sensitivity provided by newer mass spectrometers also translates into a glut of data burdening the center's storage capacity.

"There is a collaborative effort between Research Computing Support and my group," Gafken said. "Once we collect all the data files, they need to be stored and processed. The processing all takes place on a computing cluster that Research Computing Support maintains for us. But the more data you throw at it, the slower it gets."

Hierarchical storage management

The space a given data point occupies on the center's network of hard drives is measured in bytes. A single proteomics experiment can consume five to 100 megabytes (millions of bytes) or more. An experiment investigating how proteins in a mixture change over time could require collecting five-megabyte samples 20 times in a single day.
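To put those figures in perspective, 20 five-megabyte samples add up to 100 megabytes in a single day. The short Python sketch below works through that arithmetic. The function names and the one-terabyte free-space figure are illustrative assumptions, not details of the center's systems.

```python
MB = 1_000_000  # one megabyte, taken as one million bytes


def daily_volume_bytes(sample_mb: float, samples_per_day: int) -> int:
    """Bytes produced in one day of repeated sampling."""
    return int(sample_mb * MB * samples_per_day)


def days_until_full(free_bytes: int, sample_mb: float, samples_per_day: int) -> int:
    """Whole days of sampling that fit in the given free space."""
    return free_bytes // daily_volume_bytes(sample_mb, samples_per_day)


# The time-course example from the article: 5 MB samples, 20 times a day.
per_day = daily_volume_bytes(5, 20)

# Hypothetical free space of one terabyte (an assumption for illustration).
hypothetical_free = 1_000_000_000_000
print(per_day // MB, "MB per day")
print(days_until_full(hypothetical_free, 5, 20), "days until full")
```

A single time-course experiment is modest on its own; the pressure on storage comes from many experiments, and many instruments, running side by side.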

"In the future, in addition to single proteins, we anticipate looking at yeast lysates (the entire contents of yeast cells grown in culture), for example, with more than 2000-3000 proteins. The amount of data from the mass spectrometer will increase dramatically," Gafken said.

"As we increase the number of samples we process, we put more demands on the computer cluster. The need to store and archive the data also grows. We arrange with Research Computing Support to budget in additional hardware.

"What's nice about Tim's group is that they monitor our storage and processing needs," Gafken said. "If they notice our processing times are getting longer or our storage space is getting low, they work to expand the computer cluster and the storage space to meet our needs."

The Scientific Imaging shared resource is another big spender in the storage department. When Scientific Imaging recently acquired a new microscope with advanced imaging capabilities, investigators marveled at the clarity of three-dimensional images taken from specimens in the micron size range. But Hunt and his colleagues were more impressed by the fact that some of these images could run to several hundred megabytes apiece.

"We work with Dr. Julio Vazquez, manager of Scientific Imaging, to be sure they have the storage capacity they need to run their equipment efficiently," Hunt said.

Advances in information technology help Research Computing Support keep pace with these changes in storage-capacity requirements. Expanding capacity involves a network design termed hierarchical storage management. An image collected from one of the Scientific Imaging microscopes, for instance, begins its life-cycle on a designated hard drive accessible through the center's ubiquitous "Fred" server. A complex series of images may require yoking together a cluster of hard drives.

Use of latest 'blade' technology

When the image file is no longer frequently accessed, it is shunted to less expensive media — cheaper hard drives and fast tape drives. When a researcher goes to retrieve an image or another set of data that has reached this stage of its life-cycle, the only noticeable difference is a matter of a few extra seconds.

"After logging into Fred, a user accessing older files could be directed to a tape without even knowing it," Hunt said.
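The idea behind hierarchical storage management can be sketched in a few lines of Python. This is a minimal illustration only: the tier names, the 30-day and 180-day thresholds, and the StoredFile, choose_tier and migrate names are all assumptions made for the example, and the article does not describe the center's actual migration policy or software.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical idle-time limits for each tier (assumed, not from the article).
TIERS = [
    (timedelta(days=30), "primary disk"),
    (timedelta(days=180), "cheaper disk"),
]
LAST_TIER = "tape"  # anything idle longer than every limit lands here


@dataclass
class StoredFile:
    path: str               # the user-visible path never changes
    last_access: datetime
    tier: str = "primary disk"


def choose_tier(last_access: datetime, now: datetime) -> str:
    """Pick the cheapest tier appropriate for how long a file has sat idle."""
    idle = now - last_access
    for limit, tier in TIERS:
        if idle < limit:
            return tier
    return LAST_TIER


def migrate(files: list[StoredFile], now: datetime) -> list[str]:
    """Move files to their target tier; return the paths that moved."""
    moved = []
    for f in files:
        target = choose_tier(f.last_access, now)
        if target != f.tier:
            f.tier = target
            moved.append(f.path)
    return moved
```

Because the user-visible path stays the same while only the tier changes, a researcher asks for the same file name whether it sits on disk or on tape, which matches the behavior Hunt describes.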

If conventional servers and storage alone serviced an investigator's collection of images, a single hard drive could reach maximum capacity in the course of a few experiments. Subsequently, data collection and analysis would probably grind to a halt as servers were taken out of commission for upgrades. But Research Computing Support staff keep the data flowing even as they buffer storage capacity by using both the latest "blade" technology from IBM and a storage area network (SAN). In effect, both blade and large servers provide the computational horsepower needed to drive the data collection and subsequent analysis behind microscopic imaging, proteomics and other data-intensive experiments.

Invisible, ever-present expertise

Besides having a smaller physical footprint than conventional technology, blade servers are easier to install and manage. Compute capacity can be expanded simply by adding another blade to the existing cluster rather than revamping the entire server, Hunt said.

Handling the center's most complicated data sets requires correspondingly large backup systems.

"Terabytes and terabytes of data (and over 12 million files), mostly through the Fred account on the Fred server, are backed up for long-term storage for copyright issues and customer access of data at the center," Hunt said.

Because Research Computing Support services provide the computational framework for data-intensive research, transitions must be seamless as users collect and retrieve data from various locations within the network.

"Research Computing Support handles the aspects of making available the infrastructure we need for doing large data manipulations," said Dr. Jeffrey Delrow, manager of the Genomics Shared Resource.

"The way things are set up, they are invisible to us. If problems arise and we need their expertise, they immediately snap into action. But as long as things are going well, we don't even know they're there."



Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
©2009 Fred Hutchinson Cancer Research Center, a nonprofit organization.