Friday, October 11, 2013

Biology's Big Problem: There's Too Much Data to Handle




Twenty years ago, sequencing the human genome was one of the most ambitious science projects ever attempted. Today, each human genome, which easily fits on a DVD, looks simple compared with the collected genomes of the microorganisms living in our bodies, the ocean, the soil and elsewhere. Its 3 billion DNA base pairs and about 20,000 genes seem paltry next to the roughly 100 billion bases and millions of genes that make up the microbes found in the human body.



And a host of other variables accompanies that microbial DNA, including the age and health status of the microbial host, when and where the sample was collected, and how it was collected and processed. Take the mouth, populated by hundreds of species of microbes, with as many as tens of thousands of organisms living on each tooth. Beyond the challenges of analyzing all of these, scientists need to figure out how to reliably and reproducibly characterize the environment where they collect the data.


“There are the clinical measurements that periodontists use to describe the gum pocket, chemical measurements, the composition of fluid in the pocket, immunological measures,” said David Relman, a physician and microbiologist at Stanford University who studies the human microbiome. “It gets complex really fast.”


Ambitious attempts to study complex systems like the human microbiome mark biology’s arrival in the world of big data. The life sciences have long been considered a descriptive field: 10 years ago, they were relatively data poor, and scientists could easily keep up with the data they generated. But with advances in genomics, imaging and other technologies, biologists are now generating data at crushing speeds.


One culprit is DNA sequencing, whose costs began to plunge about five years ago, falling even more quickly than the cost of computer chips. Since then, thousands of human genomes, along with those of thousands of other organisms, including plants, animals and microbes, have been deciphered. Public genome repositories, such as the one maintained by the National Center for Biotechnology Information, or NCBI, already house petabytes — millions of gigabytes — of data, and biologists around the world are churning out 15 petabases (a base is a letter of DNA) of sequence per year. If these were stored on regular DVDs, the resulting stack would be 2.2 miles tall.
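To get a sense of the arithmetic behind that image, here is a rough back-of-envelope check. The assumptions (one byte per base stored uncompressed, 4.7 GB per single-layer DVD, 1.2 mm per disc) are mine, not figures from the article, but they land in the same ballpark:

```python
# Back-of-envelope check of the DVD-stack comparison.
# Assumptions (not from the article): 1 byte per base, uncompressed;
# 4.7 GB per single-layer DVD; 1.2 mm disc thickness.
bases_per_year = 15e15          # 15 petabases of sequence per year
dvd_capacity_bytes = 4.7e9      # single-layer DVD
disc_thickness_m = 1.2e-3       # 1.2 mm per disc

total_bytes = bases_per_year * 1                  # 1 byte per base
num_dvds = total_bytes / dvd_capacity_bytes
stack_height_m = num_dvds * disc_thickness_m
stack_height_miles = stack_height_m / 1609.34

print(f"{num_dvds:,.0f} DVDs, stack about {stack_height_miles:.1f} miles tall")
# Roughly 3.2 million DVDs and a stack of about 2.4 miles, in the same
# ballpark as the 2.2-mile figure quoted above.
```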


“The life sciences are becoming a big data enterprise,” said Eric Green, director of the National Human Genome Research Institute in Bethesda, Md. In a short period of time, he said, biologists are finding themselves unable to extract full value from the large amounts of data becoming available.



Solving that bottleneck has enormous implications for human health and the environment. A deeper understanding of the microbial menagerie inhabiting our bodies and how those populations change with disease could provide new insight into Crohn’s disease, allergies, obesity and other disorders, and suggest new avenues for treatment. Soil microbes are a rich source of natural products like antibiotics and could play a role in developing crops that are hardier and more efficient.

Life scientists are embarking on countless other big data projects, including efforts to analyze the genomes of many cancers, to map the human brain, and to develop better biofuels and other crops. (The wheat genome is more than five times larger than the human genome, and it has six copies of every chromosome, compared with our two.)


However, these efforts are encountering some of the same criticisms that surrounded the Human Genome Project. Some have questioned whether massive projects, which necessarily take some funding away from smaller, individual grants, are worth the trade-off. Big data efforts have almost invariably generated data that is more complicated than scientists had expected, leading some to question the wisdom of funding projects to create more data before the data that already exists is properly understood. “It’s easier to keep doing what we are doing on a larger and larger scale than to try and think critically and ask deeper questions,” said Kenneth Weiss, a biologist at Pennsylvania State University.


In fields like physics, astronomy and computer science, researchers have been dealing with the challenges of massive datasets for decades. The big data revolution in biology, by contrast, arrived quickly, leaving the field little time to adapt.


“The revolution that happened in next-generation sequencing and biotechnology is unprecedented,” said Jaroslaw Zola, a computer engineer at Rutgers University in New Jersey, who specializes in computational biology.


Biologists must overcome a number of hurdles, from storing and moving data to integrating and analyzing it, which will require a substantial cultural shift. “Most people who know the disciplines don’t necessarily know how to handle big data,” Green said. If they are to make efficient use of the avalanche of data, that will have to change.


Big Complexity


When scientists first set out to sequence the human genome, the bulk of the work was carried out by a handful of large-scale sequencing centers. But the plummeting cost of genome sequencing helped democratize the field. Many labs can now afford to buy a genome sequencer, adding to the mountain of genomic information available for analysis. The distributed nature of genomic data has created its own challenges, including a patchwork of data that is difficult to aggregate and analyze. “In physics, a lot of effort is organized around a few big colliders,” said Michael Schatz, a computational biologist at Cold Spring Harbor Laboratory in New York. “In biology, there are something like 1,000 sequencing centers around the world. Some have one instrument, some have hundreds.”



David Relman, a physician and microbiologist at Stanford University, wants to understand how microbes influence human health. Image: Peter DaSilva for Quanta Magazine



As an example of the scope of the problem, scientists around the world have now sequenced thousands of human genomes. But someone who wanted to analyze all of them would first have to collect and organize the data. “It’s not organized in any coherent way to compute across it, and tools aren’t available to study it,” said Green.


Researchers need more computing power and more efficient ways to move their data around. Hard drives, often sent via postal mail, are still frequently the easiest way to transport data, and some argue that it’s cheaper to store biological samples than to sequence them and store the resulting data. Though the cost of sequencing technology has fallen fast enough for individual labs to own their own machines, the price of the processing power and storage needed to handle the output has not followed suit. “The cost of computing is threatening to become a limiting factor in biological research,” said Folker Meyer, a computational biologist at Argonne National Laboratory in Illinois, who estimates that computing costs ten times more than research. “That’s a complete reversal of what it used to be.”
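A quick, illustrative calculation shows why mailing drives persists; the numbers below (a 4 TB drive, a 100 Mbps sustained network link, one day in transit) are hypothetical stand-ins, not figures from the article:

```python
# Illustrative comparison of mailing a drive vs. a network transfer.
# All numbers are assumptions chosen for the example.
drive_bytes = 4e12              # a 4 TB drive full of sequence data
link_bps = 100e6                # 100 Mbps sustained transfer rate
shipping_seconds = 24 * 3600    # overnight courier, about one day

transfer_days = drive_bytes * 8 / link_bps / 86400
print(f"Network transfer: {transfer_days:.1f} days")        # about 3.7 days
print(f"Courier: {shipping_seconds / 86400:.1f} days")      # 1.0 day

# Effective bandwidth of the mailed drive:
effective_mbps = drive_bytes * 8 / shipping_seconds / 1e6
print(f"Mailed drive is equivalent to about {effective_mbps:.0f} Mbps")  # ~370 Mbps
```

For datasets measured in terabytes, the courier wins unless the network link is unusually fast, which is part of why postal hard drives remain a practical tool.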


Biologists say that the complexity of biological data sets it apart from big data in physics and other fields. “In high-energy physics, the data is well-structured and annotated, and the infrastructure has been perfected for years through well-designed and funded collaborations,” said Zola. Biological data is technically smaller, he said, but much more difficult to organize. Beyond simple genome sequencing, biologists can track a host of other cellular and molecular components, many of them poorly understood. Related technologies can measure the status of genes, whether they are turned on or off, as well as what RNAs and proteins they are producing. Add in data on clinical symptoms, chemical or other exposures, and demographics, and you have a very complicated analysis problem.


“The real power in some of these studies could be integrating different data types,” said Green. But software tools capable of cutting across fields need to improve. The rise of electronic medical records, for example, means more and more patient information is available for analysis, but scientists don’t yet have an efficient way of marrying it with genomic data, he said.


To make things worse, scientists don’t have a good understanding of how many of these different variables interact. Researchers studying social media networks, by contrast, know exactly what the data they are collecting means; each node in the network represents a Facebook account, for example, with links delineating friends. A gene regulatory network, which attempts to map how different genes control the expression of other genes, is smaller than a social network, with thousands rather than millions of nodes. But the data is harder to define. “The data from which we construct networks is noisy and imprecise,” said Zola. “When we look at biological data, we don’t know exactly what we are looking at yet.”
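A minimal sketch makes the contrast concrete. The transcription-factor and gene names and the confidence values below are invented for illustration; the point is that a social-network edge is a recorded fact, while a regulatory edge is an inference carrying uncertainty:

```python
# Invented example contrasting the two kinds of network data.
# A friendship edge either exists or it doesn't.
social_network = {
    ("alice", "bob"): True,
    ("alice", "carol"): True,
}

# A regulatory edge is inferred from noisy measurements, so it carries an
# estimated effect and a confidence rather than a simple yes/no.
gene_regulatory_network = {
    ("TF_A", "gene_X"): {"effect": +0.8, "confidence": 0.6},
    ("TF_A", "gene_Y"): {"effect": -0.3, "confidence": 0.2},  # possibly spurious
}

# Downstream analysis has to carry that uncertainty along, for example by
# keeping only edges above a confidence threshold.
likely_edges = {edge: info for edge, info in gene_regulatory_network.items()
                if info["confidence"] >= 0.5}
print(likely_edges)
```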


Despite the need for new analytical tools, a number of biologists said that the computational infrastructure continues to be underfunded. “Often in biology, a lot of money goes into generating data but a much smaller amount goes to analyzing it,” said Nathan Price, associate director of the Institute for Systems Biology in Seattle. While physicists have free access to university-sponsored supercomputers, most biologists don’t have the right training to use them. Even if they did, the existing computers aren’t optimized for biological problems. “Very frequently, national-scale supercomputers, especially those set up for physics workflows, are not useful for life sciences,” said Rob Knight, a microbiologist at the University of Colorado Boulder and the Howard Hughes Medical Institute involved in both the Earth Microbiome Project and the Human Microbiome Project. “Increased funding for infrastructure would be a huge benefit to the field.”


In an effort to deal with some of these challenges, in 2012 the National Institutes of Health launched the Big Data to Knowledge Initiative (BD2K), which aims, in part, to create data sharing standards and develop data analysis tools that can be easily distributed. The specifics of the program are still under discussion, but one of the aims will be to train biologists in data science.


“Everyone getting a Ph.D. in America needs more competency in data than they have now,” said Green. Bioinformatics experts are currently playing a major role in the cancer genome project and other big data efforts, but Green and others want to democratize the process. “The kinds of questions to be asked and answered by super-experts today, we want a routine investigator to ask 10 years from now,” said Green. “This is not a transient issue. It’s the new reality.”


Not everyone agrees that this is the path that biology should follow. Some scientists say that focusing so much funding on big data projects at the expense of more traditional, hypothesis-driven approaches could be detrimental to science. “Massive data collection has many weaknesses,” said Weiss. “It may not be powerful in understanding causation.” Weiss points to the example of genome-wide association studies, a popular genetic approach in which scientists try to find genes responsible for different diseases, such as diabetes, by measuring the frequency of relatively common genetic variants in people with and without the disease. The variants identified by these studies so far raise the risk of disease only slightly, but larger and more expensive versions of these studies are still being proposed and funded.
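In outline, such a study reduces to a frequency comparison between cases and controls. The toy calculation below uses invented allele counts (not data from any real study) to show the kind of number these studies produce:

```python
# Toy association test for a single genetic variant; all counts are invented.
from math import exp, log, sqrt

cases_with, cases_without = 300, 700        # risk-allele counts in people with the disease
controls_with, controls_without = 250, 750  # risk-allele counts in healthy controls

odds_ratio = (cases_with / cases_without) / (controls_with / controls_without)

# Rough 95% confidence interval via the log odds ratio
se = sqrt(1/cases_with + 1/cases_without + 1/controls_with + 1/controls_without)
low = exp(log(odds_ratio) - 1.96 * se)
high = exp(log(odds_ratio) + 1.96 * se)

print(f"odds ratio = {odds_ratio:.2f} (95% CI {low:.2f} to {high:.2f})")
# Here the odds ratio is about 1.29: a detectable but small increase in risk,
# the kind of modest effect described above.
```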


“Most of the time it finds trivial effects that don’t explain disease,” said Weiss. “Shouldn’t we take what we have discovered and divert resources to understand how it works and do something about it?” Scientists have already identified a number of genes that are definitely linked to diabetes, so why not try to better understand their role in the disorder, he said, rather than spend limited funds to uncover additional genes with a murkier role?


Many scientists think that the complexities of life science research require both large and small science projects, with large-scale data efforts providing new fodder for more traditional experiments. “The role of the big data projects is to sketch the outlines of the map, which then enables researchers on smaller-scale projects to go where they need to go,” said Knight.





Source: http://feeds.wired.com/c/35185/f/661370/s/32566d99/sc/4/l/0L0Swired0N0Cwiredscience0C20A130C10A0Cbig0Edata0Ebiology0C/story01.htm
