Tuesday, July 16, 2013

5 basic rules for big data analysis (and how to avoid a dead-salmon-capable-of-social-interaction type of situation)

Just read a great post by Thomas Zoëga Ramsøy, who uses neuroscience to make 5 good points that an analyst or data scientist must consider when dealing with big data. The author has a PhD in Neurobiology and provides a fascinating example from neuroscience. Neuroscience has dealt with big data for a long time and, like most of the newer disciplines now discovering the joys of big data, has seen the data it must deal with grow every year.

"Put it this way, a typical data set for a functional MRI scan takes up over 500 MB per person, providing a new data point in thousands of minuscule subregions of the brain every 2 seconds or so. With an EEG, you get a new data point every 1 millisecond or so for typically 10 to 128 electrodes, and you can look at five or more different frequencies, producing millions of data points per person. If you want to think big data, neuroscience can take you there."

The author goes on to mention the biggest challenge a data scientist or statistician faces when given lots and lots of data.

"But with big data, you also get big challenges. This has been one of the largest issues in neuroscience. As any statistician will tell you, if you have an enormous statistical power, any test you run can easily turn out to be significant. Everything is significant in the land of Big Data!"

He presents the example of researchers finding statistically significant brain activation in a dead salmon during a supposed interspecies social interaction. Amazing stuff.

"Take one example, a scientific paper entitled “Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction” (and winner of the 2012 IgNobel Prize). Here, researchers reported that they could find brain activation during interspecies social interaction – in a dead salmon. You read right: brain activation in a really dead salmon! Just by doing the analysis wrong and not correcting the data, the researchers got a strong false positive effect."

Here are the 5 simple but effective rules he proposes to keep this from happening in the analysis of your big data of choice.
  1. Correct for multiple comparisons - Find the appropriate way of minimizing the chances that your significant results could have happened by pure chance!
  2. Look at the smaller samples - If you cannot find your effect in a smaller sample, is it really interesting? If you need to include thousands of people to find a significant effect, chances are that it’s not a very interesting phenomenon.
  3. Look at the extremes - Sometimes you might learn more from the extremes than the boring mean.
  4. Use different sources - If you’re using surveys, try other measures. Combine methods.
  5. Look for ways to disprove your pet theory/idea, not for ways to make it stick.
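
Rules 1 and 2 are the easiest to see in numbers, so here is a minimal Python sketch of both. None of this comes from the original post: the function names, the 10,000 "voxels" of pure noise, the 0.01 standard deviation effect and the choice of the Bonferroni correction (divide the significance threshold by the number of tests) are all my own assumptions for illustration. The author only says to find an appropriate correction, not which one.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal."""
    return 2.0 * STD_NORMAL.cdf(-abs(z))


def noise_p_values(n_tests, seed=0):
    """Rule 1 setup (hypothetical): p-values for n_tests pure-noise measurements."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [two_sided_p(rng.gauss(0.0, 1.0)) for _ in range(n_tests)]


def count_significant(p_values, alpha=0.05, bonferroni=False):
    """Count p-values below alpha, or below alpha / n_tests with Bonferroni."""
    threshold = alpha / len(p_values) if bonferroni else alpha
    return sum(p < threshold for p in p_values)


def p_for_tiny_effect(effect_sd, n_per_group):
    """Rule 2 setup (hypothetical): p-value of a two-sample z-test when the
    observed mean difference is effect_sd standard deviations and both
    groups have unit variance."""
    z = effect_sd / sqrt(2.0 / n_per_group)
    return two_sided_p(z)


# Rule 1: there is no real signal here, yet some tests still pass alpha = 0.05.
p_values = noise_p_values(10_000)
print("uncorrected hits:", count_significant(p_values))
print("Bonferroni hits: ", count_significant(p_values, bonferroni=True))

# Rule 2: the same negligible effect becomes "significant" once n is huge.
for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9}: p = {p_for_tiny_effect(0.01, n):.3g}")
```

With pure noise you should expect roughly 5% of the uncorrected tests to come out "significant", which is the dead-salmon trap in miniature, while the corrected count should be at or near zero. Bonferroni is the bluntest correction available and can be too strict when the tests are correlated, so treat it as a starting point rather than the answer. The second loop shows why rule 2 matters: the effect size never changes, only the sample size does.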
That is it. 5 simple rules to stay alive in the land of big data.