Linear correlation

Linear correlation is a statistical measure that quantifies the strength and direction of a linear relationship between two quantitative random variables. Its purpose is to determine the extent to which variations in one variable are associated with proportional variations in another.

Pearson correlation coefficient

The most commonly used measure is the Pearson linear correlation coefficient, introduced by Karl Pearson. It is denoted by \(r\) for samples and by \(\rho\) for populations.

In the sample case, it is defined as:

$$r = \frac{ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 } \cdot \sqrt{ \sum_{i=1}^{n} (y_i - \bar{y})^2 } }$$

In population terms:

$$\rho = \operatorname{cov}(X, Y) / ( \sigma_X \sigma_Y )$$

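where \(\operatorname{cov}(X, Y)\) is the covariance of \(X\) and \(Y\), and \(\sigma_X\) and \(\sigma_Y\) are their standard deviations.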

The coefficient is dimensionless and satisfies: \(-1 \leq r \leq 1\)
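The sample definition above can be sketched directly in code. This is a minimal illustration, not a reference implementation: the function name `pearson_r` and the sample data are assumptions introduced here, and only the Python standard library is used.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient, computed from the definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square roots of the sums of squared deviations.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))


# Hypothetical sample data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))
```

For this data the deviation sums are 6, 10, and 6, so \(r = 6/\sqrt{60} \approx 0.7746\), which lies within \([-1, 1]\) as expected.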

Interpretation

The sign of \(r\) indicates the direction of the linear relationship, and the magnitude \(|r|\) indicates its strength. However, \(r\) does not provide information about the slope of a regression line, nor about non-linear relationships.

Relevant properties

  1. It is symmetric: \(r(X,Y) = r(Y,X)\).
  2. It is invariant, up to sign, under linear transformations of the form \(X' = aX + b\), \(Y' = cY + d\), with \(a, c \neq 0\): \(r\) is unchanged when \(ac > 0\) and changes sign when \(ac < 0\).
  3. It can be interpreted geometrically as the cosine of the angle between mean-centered data vectors in the Euclidean space \(\mathbb{R}^n\).
  4. It is sensitive to outliers.
  5. It does not imply causation.
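Several of these properties can be checked numerically. The sketch below computes \(r\) as the cosine between mean-centered vectors and then tests symmetry, the effect of linear transformations, and the effect of a single outlier. The helper names (`centered`, `cosine`, `pearson_r`) and all data values are illustrative assumptions. Note that rescaling with coefficients of opposite sign flips the sign of \(r\).

```python
import math


def centered(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def cosine(u, v):
    """Cosine of the angle between two vectors in R^n."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_r(xs, ys):
    """Pearson r as the cosine between mean-centered data vectors."""
    return cosine(centered(xs), centered(ys))


# Hypothetical sample data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_xy = pearson_r(xs, ys)
r_yx = pearson_r(ys, xs)                      # symmetry
r_same_sign = pearson_r([3 * x + 7 for x in xs],
                        [0.5 * y - 1 for y in ys])   # a, c > 0: unchanged
r_opposite_sign = pearson_r([-2 * x + 1 for x in xs], ys)  # a < 0 < c: sign flips
r_outlier = pearson_r(xs + [20], ys + [-30])  # one extreme point added

print(r_xy, r_yx, r_same_sign, r_opposite_sign, r_outlier)
```

In this example a single extreme point is enough to turn a positive correlation into a negative one, which illustrates the sensitivity to outliers.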

Statistical considerations

Linear correlation analysis typically assumes:

For statistical inference about \(\rho\), tests based on Student’s t distribution are commonly used under assumptions of bivariate normality.
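As an illustration, one commonly used statistic for testing the null hypothesis \(\rho = 0\) is shown below. Under that hypothesis and bivariate normality, it follows a Student's t distribution with \(n - 2\) degrees of freedom, where \(n\) is the sample size:

$$t = r \sqrt{\frac{n - 2}{1 - r^{2}}}$$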

Instructions

Enter the \(x\) and \(y\) coordinates of each point in the table. Check the boxes to display the elements of the linear regression.


Explanation

The linear equation is

$$y = mx + b$$

where \(x\) is the independent variable, \(y\) is the dependent variable, \(m\) is the slope, and \(b\) is the \(y\)-intercept. In a linear regression, the slope can be calculated as:

$$m = \frac{\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right) \left( y_{i} - \bar{y} \right)}{\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right)^{2} }$$

The expressions on the right are equivalent to those on the left and are often easier to calculate:

$$\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right)^{2} = \sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^{2}$$

$$\sum_{i=1}^{n} \left( y_{i} - \bar{y} \right)^{2} = \sum_{i=1}^{n} y_{i}^{2} - n\bar{y}^{2}$$

$$\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right) \left( y_{i} - \bar{y} \right) = \sum_{i=1}^{n} x_{i}y_{i} - n\bar{x}\bar{y}$$
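These three identities can be checked numerically. The sketch below compares each centered sum with its shortcut form on a small data set. The data and variable names are assumptions made for illustration.

```python
import math

# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Left-hand sides: sums of deviations from the means.
centered_xx = sum((x - x_bar) ** 2 for x in xs)
centered_yy = sum((y - y_bar) ** 2 for y in ys)
centered_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Right-hand sides: raw sums minus n times the product of the means.
raw_xx = sum(x * x for x in xs) - n * x_bar ** 2
raw_yy = sum(y * y for y in ys) - n * y_bar ** 2
raw_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar

identities_hold = all(
    math.isclose(left, right)
    for left, right in [(centered_xx, raw_xx),
                        (centered_yy, raw_yy),
                        (centered_xy, raw_xy)]
)
print(identities_hold)
```

The shortcut forms are convenient by hand, but in floating-point arithmetic they can lose precision when the means are large relative to the spread of the data. The centered forms are usually the safer choice in code.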

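Substituting these expressions into the slope formula, and using \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i}\) and \(\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_{i}\), we obtain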

$$\begin{split} m &= \frac{\sum_{i=1}^{n} x_{i}y_{i} - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^{2}} \\ &= \frac{\sum_{i=1}^{n} x_{i}y_{i} - \frac{1}{n}\sum_{i=1}^{n}x_{i}\sum_{i=1}^{n}y_{i}}{\sum_{i=1}^{n} x_{i}^{2} - \frac{1}{n} \left(\sum_{i=1}^{n}x_{i}\right)^{2}} \cdot \frac{n}{n} \\ &= \frac{ n \sum_{i=1}^{n} x_{i}y_{i} - \sum_{i=1}^{n}x_{i}\sum_{i=1}^{n}y_{i}}{ n \sum_{i=1}^{n} x_{i}^{2} - \left(\sum_{i=1}^{n}x_{i}\right)^{2}} \end{split}$$

$$\boxed{ \therefore m = \frac{ n \sum_{i=1}^{n} x_{i}y_{i} - \sum_{i=1}^{n}x_{i}\sum_{i=1}^{n}y_{i}}{ n \sum_{i=1}^{n} x_{i}^{2} - \left(\sum_{i=1}^{n}x_{i}\right)^{2}} }$$

The \(y\)-intercept \(b\) is calculated as

$$\begin{split}b &= \bar{y} - m \bar{x} \\ &= \bar{y} - \left( \frac{\sum_{i=1}^{n} x_{i}y_{i} - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^{2}} \right) \bar{x} \\ &= \bar{y} - \frac{\bar{x}\sum_{i=1}^{n} x_{i}y_{i} - n\bar{x}^{2}\bar{y}}{\sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^{2}} \\ &= \frac{\bar{y}\sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^{2}\bar{y} + n\bar{x}^{2}\bar{y} - \bar{x} \sum_{i=1}^{n}x_{i}y_{i}}{\sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^{2}} \\ &= \frac{ \frac{1}{n}\sum_{i=1}^{n} y_{i}\sum_{i=1}^{n} x_{i}^{2} - \frac{1}{n}\sum_{i=1}^{n}x_{i} \sum_{i=1}^{n}x_{i}y_{i}}{\sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^{2}} \cdot \frac{n}{n} \\ &= \frac{\sum_{i=1}^{n} y_{i} \sum_{i=1}^{n} x^{2}_{i} - \sum_{i=1}^{n} x_{i} \sum_{i=1}^{n} x_{i}y_{i}}{n \sum_{i=1}^{n} x^{2}_{i} - \left( \sum_{i=1}^{n} x_{i} \right)^{2}} \end{split}$$

$$\boxed {\therefore b = \frac{\sum_{i=1}^{n} y_{i} \sum_{i=1}^{n} x^{2}_{i} - \sum_{i=1}^{n} x_{i} \sum_{i=1}^{n} x_{i}y_{i}}{n \sum_{i=1}^{n} x^{2}_{i} - \left( \sum_{i=1}^{n} x_{i} \right)^{2}} }$$
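The two boxed formulas translate directly into code. The following is a minimal sketch, assuming the points are given as two equal-length sequences. The function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b from the raw-sum formulas."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with n >= 2")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Common denominator of both formulas; zero when all x values coincide.
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_y * sum_xx - sum_x * sum_xy) / denom
    return m, b


# Hypothetical sample data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = fit_line(xs, ys)
print(m, b)
```

For this data the sums are \(\sum x_i = 15\), \(\sum y_i = 20\), \(\sum x_i y_i = 66\), and \(\sum x_i^2 = 55\), giving \(m = 30/50 = 0.6\) and \(b = 110/50 = 2.2\). This agrees with \(b = \bar{y} - m\bar{x} = 4 - 0.6 \cdot 3\).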

The Pearson correlation coefficient is calculated by dividing the covariance by the square root of the product of the variances of both variables.

$$\begin{split}r &= \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} \\ &= \frac{\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right) \left( y_{i} - \bar{y} \right)}{\sqrt{\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right)^{2}\sum_{i=1}^{n} \left( y_{i} - \bar{y} \right)^{2}}}\end{split}$$

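In this expression the denominator is always positive, so the sign of \(r\) is the sign of the numerator. A term \(\left( x_{i} - \bar{x} \right)\left( y_{i} - \bar{y} \right)\) is negative when \(x_{i}\) and \(y_{i}\) lie on opposite sides of their respective means. When such terms dominate the sum, the covariance is negative, and so is the correlation.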