5 basic rules for big data analysis (and for avoiding a dead-salmon-capable-of-social-interaction type of situation)
"Put it this way, a typical data set for a functional MRI scan takes up over 500 MB per person, providing a new data point in thousands of minuscule subregions of the brain every 2 seconds or so. With an EEG, you get a new data point every 1 millisecond or so for typically 10 to 128 electrodes, and you can look at five or more different frequencies, producing millions of data points per person. If you want to think big data, neuroscience can take you there."
The author goes on to mention the biggest challenge a data scientist or statistician faces when given lots and lots of data.
He presents the example of statisticians finding a significant positive correlation in a dead salmon when it was near a fish of another species. Amazing stuff.
Here are the 5 simple but effective rules he proposes to kill any chance of this happening in the analysis of your big data of choice.
- Correct for multiple comparisons - Find the appropriate way of minimizing the chances that your significant results could have happened by pure chance!
- Look at the smaller samples - If you cannot find your effect in a smaller sample, is it really interesting? If you need to include thousands of people to find a significant effect, chances are that it’s not a very interesting phenomenon.
- Look at the extremes - Sometimes you might learn more from the extremes than the boring mean.
- Use different sources - If you’re using surveys, try other measures. Combine methods.
- Look for ways to disprove your pet theory/idea, not for ways to make it stick
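
To make the first rule concrete, here is a minimal sketch of what correcting for multiple comparisons can look like. It is not from the original article. It assumes 10,000 independent tests where nothing real is going on, so each p-value is just a uniform random number, and it compares the uncorrected count of "significant" results with two common corrections, Bonferroni and Benjamini-Hochberg. The function names, the number of tests and the 0.05 threshold are all illustrative choices.

```python
import random


def bonferroni(p_values, alpha=0.05):
    """Flag tests whose p-value is at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag tests rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that the k-th smallest p-value <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every test ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def simulate_null_p_values(n_tests, seed=0):
    """Assumption: with no real effect, p-values are uniform on [0, 1]."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_tests)]


# Hypothetical setup: 10,000 tests (think voxels) with no true effect anywhere.
p_values = simulate_null_p_values(10_000)
alpha = 0.05

uncorrected_hits = sum(p <= alpha for p in p_values)
bonferroni_hits = sum(bonferroni(p_values, alpha))
bh_hits = sum(benjamini_hochberg(p_values, alpha))

print("Uncorrected 'significant' results:", uncorrected_hits)
print("After Bonferroni correction:", bonferroni_hits)
print("After Benjamini-Hochberg correction:", bh_hits)
```

Without correction, roughly 5% of the tests should come out "significant" even though the data is pure noise, which is the dead salmon problem in miniature. Bonferroni is the stricter of the two corrections; Benjamini-Hochberg controls the false discovery rate instead and usually keeps more real effects when there are some to find.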