Published 3 July 2014 by Mohit Kumar Jolly
Big Data – not a big deal, just another tool
‘Big Data’, however fancy it might appear, is just another tool that can be useful for finding associations.
‘Big Data’, a buzzword these days in biological research, promises to collect and analyze large datasets (genomics, proteomics, metabolomics, etc.) to predict novel associations between genes and diseases. “But these are just associations, and unless we go back in the laboratory and establish causal connection, it’s hardly of any use. You can really get fooled badly by the big data”, said J. Michael Bishop. “There was no big data earlier, we still used to do good science”, opined Jules A. Hoffmann. “Big data is just another tool – not the only one certainly. Use it when you need it. You need not learn it yourself”, mentioned Brian P. Schmidt. Together with Bruce A. Beutler, these Nobel Laureates debated the role of big data at a panel discussion in Lindau.
They discussed their expectations of the ‘big data’ approach as well as their apprehensions about it. Schmidt mentioned that big data is one tool that can propose hypotheses, which can then be used to drive research. Bishop added that big data analysis and (reductionist) experiments in the laboratory often form a vicious cycle: “Let’s say you identify an oncogene using big data, then you go and verify that in the lab. If you’re lucky, you design a drug, and do clinical trials. Then, as expected, you’ll get drug resistance. Then you go back again, sequence the genome of the patient, and this cycle continues.”
Bishop shared two specific examples where the predicted associations were completely misleading:
“In the Google flu-tracking study, they almost made us believe that they would be able to predict the spread of flu in different areas. We had high hopes, but it all failed; later they realized that the metric they used for analysis was too squishy.”
More importantly, a recent big data study predicted an association between cholesterol levels and a pulmonary disease. The patients were treated with cholesterol-inhibiting drugs, but the clinical trials failed miserably.
Schmidt, a Nobel Laureate in Physics, unlike the three other Laureates on the panel (all of them in Physiology or Medicine), mentioned that he had been using the big data approach rigorously for a long time. “But in physics, we use really stringent statistical tests before concluding anything – I do not see that kind of stringency in big data biological studies. Also, biological systems have their own unique framework.” Bishop agreed, saying that big data analyses in tumor biology often give a list of potential oncogenes, but that such a list is not sufficient to identify the driver oncogene (the gene that is absolutely essential for causing tumors).
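To make the stringency point concrete, here is a minimal sketch of why a loose significance threshold produces long lists of candidates when many genes are tested at once. It is an illustration only, not something presented at the panel: the gene count and the 0.05 threshold are assumed numbers, the function names are hypothetical, and the Bonferroni correction is just one standard way of tightening a threshold.

```python
def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of tests that pass threshold `alpha` purely by chance,
    assuming none of the `n_tests` tested associations is real."""
    return alpha * n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the chance of even one false positive
    across all `n_tests` tests at or below `alpha` (Bonferroni correction)."""
    return alpha / n_tests


if __name__ == "__main__":
    n_genes = 20_000  # assumed number of genes screened (hypothetical)
    alpha = 0.05      # assumed conventional per-test threshold

    chance_hits = expected_false_positives(alpha, n_genes)
    strict = bonferroni_threshold(alpha, n_genes)

    print(f"Expected chance 'associations' at p < {alpha}: {chance_hits:.0f}")
    print(f"Bonferroni-corrected per-gene threshold: {strict:.2e}")
```

Under these assumed numbers, about 1,000 genes would look ‘associated’ with a disease by chance alone, while the corrected threshold is 20,000 times stricter than 0.05. Even a strict threshold only filters out chance associations; it does not tell us which of the remaining genes is the driver.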
Beutler said: “The big data approach is opposite to mine. I am a reductionist. I start my study with a phenotype, and try to identify what causes it.” Asked for his views on big data, he said, “I do not think we should argue among tools. Tools should be chosen according to the problem one is trying to address, not because everyone else uses them.”
Thus, two messages about the big data approach in biology came out very clearly:
(i) It can only give associations, not causal connections or mechanisms.
(ii) At most, it can be an extra tool that complements the canonical reductionist approach; it cannot replace it.
A possible bridge between ‘big data analysis’ and canonical ‘small scale analysis’ is therefore ‘meso scale analysis’, grounded in the physical sciences: we study the interactions among a finite number of molecular players to understand how those interactions give rise to the emergent phenotypes. This approach has been used for some time in simpler organisms (e.g. bacteria) to elucidate their operating principles, and it is now entering cancer research too.