Ever wondered how meaningful the R^2 value of a data series actually is? Sometimes you see reasonably good coefficients of determination (depending on your scientific discipline and research question, “reasonably good” means anything between .2 and .99) for comparatively small data sets. This is not how this works.
Let me illustrate the problem: If you take just two random points in the unit square (0,1)^2 and calculate their coefficient of determination (COD), the fit is always perfect, because a straight line passes exactly through any two points: the correlation coefficient r comes out as 1 (perfect correlation) or -1 (perfect anti-correlation), and R^2 = r^2 = 1 either way. If you take three points, the situation improves only mildly:
The graph above is the result of a Monte Carlo run with 10^7 vectors of three points each, drawn randomly from a uniform distribution over (0,1)^2. Afterwards, the COD has been calculated for each of those vectors (more precisely, the signed r value returned by the regression; see the code below). The blue line shows the histogram of the resulting values, the green line shows the cumulative distribution function (CDF) of their absolute value. If you say “I accept anything as reasonably correlated that has a R^2 of more than .8“, then this applies to about 40% of all randomly drawn sets of three points.
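A quick way to convince yourself of both claims is a few lines of NumPy. This is a sketch of mine using np.corrcoef rather than the full regression from the script below; as in that script, the .8 threshold is applied to the absolute r value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two points: the fitted line passes exactly through them,
# so r is always +1 or -1 and R^2 is always 1.
print(np.corrcoef(rng.random(2), rng.random(2))[0, 1])

# Three points: how often does |r| exceed 0.8?
rs = np.array([np.corrcoef(rng.random(3), rng.random(3))[0, 1]
               for _ in range(100_000)])
print(np.mean(np.abs(rs) > 0.8))  # comes out at roughly 0.4
```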
Using the CDF, we can define something similar to a p-value. (I know that there is no p-value for R^2, but let’s think of this measure as such.) We ask:
Given a certain number of data points, what R^2 do I have to require so that fewer than x% of random data series of identical length are better correlated?
The answer can be found with Monte Carlo again (a code sketch for reproducing these thresholds follows the table):
| # data points | p = 20% | p = 10% | p = 5% | p = 1% | p = 0.1% |
|---|---|---|---|---|---|
| 3 | 0.951 | 0.987 | 0.997 | 0.999 | >0.999 |
| 4 | 0.800 | 0.901 | 0.951 | 0.990 | 0.999 |
| 5 | 0.687 | 0.807 | 0.880 | 0.960 | 0.991 |
| 6 | 0.608 | 0.731 | 0.815 | 0.921 | 0.975 |
| 7 | 0.550 | 0.671 | 0.758 | 0.880 | 0.955 |
| 8 | 0.504 | 0.622 | 0.710 | 0.840 | 0.930 |
| 9 | 0.472 | 0.585 | 0.672 | 0.806 | 0.907 |
| 10 | 0.443 | 0.552 | 0.636 | 0.773 | 0.882 |
| 100 | 0.127 | 0.163 | 0.194 | 0.254 | 0.323 |
| 500 | 0.051 | 0.068 | 0.082 | 0.110 | 0.141 |
| 1000 | 0.042 | 0.053 | 0.063 | 0.083 | 0.105 |
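As promised above, the thresholds in the table can also be obtained directly as quantiles of the Monte Carlo sample instead of being read off the CDF plot. A minimal sketch (the helper name r_threshold is my own):

```python
import numpy as np

def r_threshold(n, p, trials=100_000, seed=0):
    """Threshold that only a fraction p of random n-point series exceed."""
    rng = np.random.default_rng(seed)
    rs = np.array([np.corrcoef(rng.random(n), rng.random(n))[0, 1]
                   for _ in range(trials)])
    return np.quantile(np.abs(rs), 1 - p)

print(r_threshold(5, 0.05))  # should land close to the 0.880 in the table
```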
So the next time you see somebody presenting “good correlation agreement” on a sufficiently small data set, remember that you can easily get good agreement for random numbers…
Technical Details
The distributions for all the random vector lengths listed above are shown below. Here is the code used to generate them:
```python
import sys

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sts


def test_one(n):
    """Fit a line through n uniformly random points and return the r value."""
    xs = np.random.random(n)
    ys = np.random.random(n)
    # linregress returns the correlation coefficient r, not its square
    slope, intercept, r_value, p_value, std_err = sts.linregress(xs, ys)
    return r_value


def show_hist(n, m):
    fig = plt.figure(figsize=(15, 10))
    vals = np.array([test_one(n) for _ in range(m)])
    hist, bins = np.histogram(vals, bins=10000, density=True)
    hist /= max(hist)
    xs = (bins[1:] + bins[:-1]) / 2
    plt.plot(xs, hist, label='Histogram')
    # Fold the negative half of the histogram onto the positive one and
    # integrate: this yields the CDF of the absolute r value.
    xs = xs[len(xs) // 2:]
    hist = np.cumsum((hist + hist[::-1])[len(xs):])
    hist /= max(hist)
    plt.plot(xs, hist, label='CDF of absolute R$^2$')
    # Print the threshold that only p% of the random series exceed.
    for pval in (20., 10., 5., 1., 0.1):
        smaller = np.where(hist < 1 - pval / 100)
        print(max(xs[smaller]), end=' ')
    print()
    plt.title('R$^2$ distribution for %d uniform random points '
              '(sampled from $10^%d$ random vectors)'
              % (n, int(round(np.log10(m)))))
    plt.legend()
    plt.savefig('rsq-mc-%d.png' % n)


if len(sys.argv) == 3:
    print('Rsq MC for %s uniform random points and %s random vectors.'
          % (sys.argv[1], sys.argv[2]))
    show_hist(int(sys.argv[1]), int(sys.argv[2]))
```
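To reproduce the plots, save the script under any name, say rsq_mc.py, and pass the number of data points and the number of random vectors on the command line, e.g. `python rsq_mc.py 3 10000000` for the three-point case shown above.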