When not to use R squared – Life on Numbers

Ever wondered how relevant the R^2 value of any data series actually is? Sometimes, you see reasonably good coefficients of determination (depending on your scientific discipline and research question, “reasonably good” means something between .2 and .99) for comparably small data sets. This is not how this works.

Let me illustrate the problem: If you only take two random coordinates in the interval (0,1)^2 and calculate their coefficient of determination (COD), then it’s always 1 (perfect correlation) or -1 (perfect anti-correlation). If you take three points, the situation improves only mildly:

Distribution of R^2 values for three random points.

The graph above is the result a Monte Carlo run with 10^7 vectors of three coordinates each that have been randomly drawn from a uniform distribution over (0,1)^2. Afterwards, the COD has been calculated for each of those vectors. The blue line shows the histogram of the resulting COD values, the green line shows the cumulative density function (CDF) for the absolute COD value. If you would say “I accept anything as reasonably correlated that has a R^2 of more than .8“, then this would apply to about 40% of any randomly chosen set of points.

Using the CDF, we can define something similar to a p-value. (I know that there is no p-value for R^2, but let’s think of this measure as such.) We say:

Given a certain number of data points, what R^2 do I have to require to haver fewer than x% random data series of identical length being better correlated?

The answer can be found with Monte Carlo again:

# data points	p = 20%	p = 10%	p = 5%	p = 1%	p = 0.1%
3	0.951	0.987	0.997	0.999	>0.999
4	0.800	0.901	0.951	0.990	0.999
5	0.687	0.807	0.880	0.960	0.991
6	0.608	0.731	0.815	0.921	0.975
7	0.550	0.671	0.758	0.880	0.955
8	0.504	0.622	0.710	0.840	0.930
9	0.472	0.585	0.672	0.806	0.907
10	0.443	0.552	0.636	0.773	0.882
100	0.127	0.163	0.194	0.254	0.323
500	0.051	0.068	0.082	0.110	0.141
1000	0.042	0.053	0.063	0.083	0.105

So the next time you see somebody presenting “good correlation agreement” on sufficiently small data sets, then remember, that you can easily get good agreement for random numbers…

Technical Details

See below for the distributions for all the random vector lengths shown above. Here is the code used to generate them:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sts
import sys

def test_one(n):
    xs = np.random.random(n)
    ys = np.random.random(n)
    slope, intercept, r_value, p_value, std_err = sts.linregress(xs, ys)
    return r_value

def show_hist(n, m):
    fig = plt.figure(figsize=(15, 10))
    vals = np.array([test_one(n) for _ in xrange(m)])
    hist, bins = np.histogram(vals, bins=10000, density=True)
    hist /= max(hist)
    xs = (bins[1:] + bins[:-1])/2
    plt.plot(xs, hist, label='Histogram')

    xs = xs[len(xs)/2:]
    hist = np.cumsum((hist + hist[::-1])[len(xs):])
    hist /= max(hist)
    plt.plot(xs, hist, label='CDF of absolute R$^2$')

    for pval in (20., 10., 5., 1., 0.1,):
        smaller = np.where(hist &lt; 1-pval/100)
        print max(xs[smaller]),
    print ''

    plt.title('R$^2$ distribution for %d uniform random points (sampled from $10^%d$ random vectors)' % (n, np.log(m)/np.log(10)))
    plt.legend()
    plt.savefig('rsq-mc-%d.png' % n)

if len(sys.argv) == 3:
	print 'Rsq MC for %s uniform random points and %s random vectors.' % (sys.argv[1], sys.argv[2])
	show_hist(int(sys.argv[1]), int(sys.argv[2]))

Technical Details

Leave a comment Cancel reply