Sameer Manek

Random Correlations

I’ve noticed people frequently misusing data: they find a correlation between seemingly unrelated data sets and infer a relationship. While they’ll generally volunteer that they haven’t proven causality, they often claim that there must be some underlying relationship for the p-value to be so low.

I built a toy to try to show the error in this. Essentially, you can take almost any real-life data and find an apparent relationship, especially if you perform multiple tests. Here I take a number of data sets from Quandl and plot whichever pairs have very low p-values.

Wait for it to load, then hit the “Another Relationship!” button.
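If you want to see the same effect without any real data, here is a minimal sketch in plain Python. It is not the code behind the toy and it doesn't touch Quandl: the series are simulated, and the helper names (`random_walk`, `white_noise`, `approx_p_value`, `fraction_significant`) are made up for illustration. The p-value uses the Fisher z-transform, which is a normal approximation rather than the exact t-test.

```python
import itertools
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for H0: no correlation (Fisher z approximation).

    Like the usual test, this assumes independent observations,
    which is exactly the assumption that trending series violate.
    """
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def random_walk(rng, n):
    """Cumulative sum of independent Gaussian steps."""
    level, walk = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk


def white_noise(rng, n):
    """Independent Gaussian draws with no memory."""
    return [rng.gauss(0, 1) for _ in range(n)]


def fraction_significant(series, alpha=0.05):
    """Share of all pairs whose correlation has p < alpha."""
    pairs = list(itertools.combinations(series, 2))
    hits = sum(approx_p_value(pearson_r(x, y), len(x)) < alpha for x, y in pairs)
    return hits / len(pairs)


rng = random.Random(0)
n_series, n_points = 30, 200

walks = [random_walk(rng, n_points) for _ in range(n_series)]
noise = [white_noise(rng, n_points) for _ in range(n_series)]

frac_walks = fraction_significant(walks)
frac_noise = fraction_significant(noise)

print(f"random walks: {frac_walks:.0%} of pairs have p < 0.05")
print(f"white noise:  {frac_noise:.0%} of pairs have p < 0.05")
```

Every series here is generated independently, so any 'relationship' is spurious by construction. The white-noise pairs should land near the nominal 5% false positive rate, while the random-walk pairs should clear the bar far more often, because the test assumes independent observations and a random walk's values are anything but.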

The causes of these ‘relationships’ vary, but there are a few key factors that I think are generally worth checking. These don’t invalidate the slope or intercept, but they may call the test statistics (e.g., the p-value) into question.

  • Are the residuals normally distributed?
  • What if I detrend the data?
  • Are the residuals autocorrelated?
  • Do the residuals have constant variance?
  • Are there any points with a lot of leverage?
  • How many relationships did I test before finding this? Do I need to apply a multiple testing correction?
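Most of these checks are only a few lines each. Below is a rough sketch in plain Python, under some loud assumptions: a simple one-variable least-squares fit, data ordered in time, first differencing as the detrending method (one option among several), a Jarque-Bera statistic for normality, a crude split-half variance ratio for constant variance, and a Bonferroni correction for multiple testing. All the function names are hypothetical, and the demo data is simulated.

```python
import math
import random


def ols(x, y):
    """Simple least squares; returns (slope, intercept, residuals)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    return slope, intercept, residuals


def jarque_bera(residuals):
    """Normality check; returns (statistic, asymptotic p-value, chi-squared 2 df)."""
    n = len(residuals)
    m = sum(residuals) / n
    m2 = sum((e - m) ** 2 for e in residuals) / n
    m3 = sum((e - m) ** 3 for e in residuals) / n
    m4 = sum((e - m) ** 4 for e in residuals) / n
    skew, kurt = m3 / m2 ** 1.5, m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    return jb, math.exp(-jb / 2)


def first_differences(series):
    """Detrend by differencing: period-over-period changes instead of levels."""
    return [b - a for a, b in zip(series, series[1:])]


def durbin_watson(residuals):
    """About 2 means little lag-1 autocorrelation; near 0 means strong positive."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    return num / sum(e ** 2 for e in residuals)


def variance_ratio(residuals):
    """Crude constant-variance check: mean squared residual, late half over early half."""
    half = len(residuals) // 2
    early = sum(e ** 2 for e in residuals[:half]) / half
    late = sum(e ** 2 for e in residuals[half:]) / (len(residuals) - half)
    return late / early


def leverages(x):
    """Hat-matrix diagonal for a one-variable fit with an intercept."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((a - mx) ** 2 for a in x)
    return [1 / n + (a - mx) ** 2 / sxx for a in x]


def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Demo on two independent simulated random walks (no real relationship).
rng = random.Random(1)
x, y = [0.0], [0.0]
for _ in range(299):
    x.append(x[-1] + rng.gauss(0, 1))
    y.append(y[-1] + rng.gauss(0, 1))

_, _, res_levels = ols(x, y)
_, _, res_diffs = ols(first_differences(x), first_differences(y))

dw_levels = durbin_watson(res_levels)
dw_diffs = durbin_watson(res_diffs)
lev = leverages(x)
high_leverage = [i for i, h in enumerate(lev) if h > 4 / len(x)]  # 2p/n rule of thumb

print("Durbin-Watson on levels:     ", round(dw_levels, 2))
print("Durbin-Watson on differences:", round(dw_diffs, 2))
print("Jarque-Bera (stat, p):       ", jarque_bera(res_levels))
print("Late/early variance ratio:   ", round(variance_ratio(res_levels), 2))
print("High-leverage points:        ", len(high_leverage))
print("Per-test alpha for 100 tests:", bonferroni_alpha(0.05, 100))
```

On trending data like this, the Durbin-Watson statistic for the fit on levels should sit far below 2, which is a sign that the usual standard errors (and so the p-value) can't be trusted. Refitting on first differences should bring it back toward 2. For real work I'd reach for a statistics library rather than hand-rolled helpers, but the arithmetic is the same.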