Linear correlation
Linear correlation is a statistical measure that quantifies the strength and direction of a linear relationship between two quantitative random variables. Its purpose is to determine the extent to which variations in one variable are associated with proportional variations in another.
Pearson correlation coefficient
The most commonly used measure is the Pearson linear correlation coefficient, introduced by Karl Pearson. It is denoted by \(r\) for samples and by \(\rho\) for populations.
In the sample case, it is defined as:
$$r = \frac{ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 } \cdot \sqrt{ \sum_{i=1}^{n} (y_i - \bar{y})^2 } }$$
In population terms:
$$\rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
where:
- \(\operatorname{cov}(X, Y)\) is the covariance,
- \(\sigma_X\) and \(\sigma_Y\) are the population standard deviations.
The coefficient is dimensionless and satisfies: \(-1 \leq r \leq 1\)
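The sample formula can be checked with a short Python sketch (standard library only; the data points are illustrative):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient, computed directly
    from the definition: centered cross-sum over the product of
    the square roots of the centered sums of squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A perfectly linear sample (y = 2x) gives r = 1.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```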
Interpretation
- \(r > 0\): positive linear association.
- \(r < 0\): negative linear association.
- \(r = 0\): no linear association.
- \( \vert r \vert = 1\): perfect linear relationship.
The magnitude \(|r|\) indicates the strength of the linear relationship. However, it does not provide information about the slope of a regression line, nor about non-linear relationships.
Relevant properties
- It is symmetric: \(r(X,Y) = r(Y,X)\).
- It is invariant under linear transformations \(X' = aX + b\), \(Y' = cY + d\) with \(a, c > 0\); if \(a\) and \(c\) have opposite signs, \(r\) changes sign but keeps its magnitude.
- It can be interpreted geometrically as the cosine of the angle between the mean-centered data vectors in the Euclidean space \(\mathbb{R}^n\).
- It is sensitive to outliers.
- It does not imply causation.
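The symmetry and linear-transformation properties can be verified numerically. A minimal sketch, with made-up data, where `pearson_r` implements the sample formula:

```python
import math

def pearson_r(xs, ys):
    # Sample Pearson correlation coefficient.
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
    sxx = sum((x - xb) ** 2 for x in xs)
    syy = sum((y - yb) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 4.0, 5.0, 7.0]
y = [2.1, 3.9, 8.3, 9.8, 14.2]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)              # symmetry: equals r_xy
x_pos = [3 * v + 1 for v in x]      # a = 3 > 0: r unchanged
r_pos = pearson_r(x_pos, y)
x_neg = [-2 * v for v in x]         # a = -2 < 0: r changes sign
r_neg = pearson_r(x_neg, y)
```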
Statistical considerations
Linear correlation analysis typically assumes:
- Quantitative variables.
- An approximately linear relationship.
- No substantial outliers.
- Independence of observations.
For statistical inference about \(\rho\), tests based on Student’s t distribution are commonly used under assumptions of bivariate normality.
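Under \(H_0\colon \rho = 0\) and bivariate normality, the statistic \(t = r\sqrt{(n-2)/(1-r^2)}\) follows a Student's t distribution with \(n - 2\) degrees of freedom. A minimal sketch (the sample values are made up for illustration):

```python
import math

def correlation_t_stat(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees
    of freedom, valid under bivariate normality."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical example: r = 0.8 observed in a sample of n = 12 pairs.
t = correlation_t_stat(0.8, 12)
```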
Instructions
Write the points in the table with their \(x\) and \(y\) coordinates. Click the boxes to display the elements of the linear regression.
Explanation
The linear equation is
$$y = mx + b$$
where \(x\) is the independent variable, \(y\) is the dependent variable, \(m\) is the slope, and \(b\) is the \(y\)-intercept. In a linear regression, the slope can be calculated as:
$$m = \frac{\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right) \left( y_{i} - \bar{y} \right)}{\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right)^{2} }$$
In the identities below, the expressions on the right are equivalent to those on the left and are often easier to compute, since they use raw sums rather than deviations from the means:
$$\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right)^{2} = \sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^{2}$$
$$\sum_{i=1}^{n} \left( y_{i} - \bar{y} \right)^{2} = \sum_{i=1}^{n} y_{i}^{2} - n\bar{y}^{2}$$
$$\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right) \left( y_{i} - \bar{y} \right) = \sum_{i=1}^{n} x_{i}y_{i} - n\bar{x}\bar{y}$$
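These three identities can be confirmed numerically on any sample. A quick check with illustrative data:

```python
x = [1.0, 2.0, 3.0, 5.0]
y = [2.0, 3.0, 7.0, 8.0]
n = len(x)
xb, yb = sum(x) / n, sum(y) / n

# Left-hand sides: sums of deviations from the means.
sxx_left = sum((xi - xb) ** 2 for xi in x)
syy_left = sum((yi - yb) ** 2 for yi in y)
sxy_left = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))

# Right-hand sides: raw sums minus the mean terms.
sxx_right = sum(xi ** 2 for xi in x) - n * xb ** 2
syy_right = sum(yi ** 2 for yi in y) - n * yb ** 2
sxy_right = sum(xi * yi for xi, yi in zip(x, y)) - n * xb * yb
```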
Substituting the first and third of these identities into the formula for the slope:
$$\begin{split} m &= \frac{\sum_{i=1}^{n} x_{i}y_{i} - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^{2}} \\ &= \frac{\sum_{i=1}^{n} x_{i}y_{i} - \frac{1}{n}\sum_{i=1}^{n}x_{i}\sum_{i=1}^{n}y_{i}}{\sum_{i=1}^{n} x_{i}^{2} - \frac{1}{n} \left(\sum_{i=1}^{n}x_{i}\right)^{2}} \cdot \frac{n}{n} \\ &= \frac{ n \sum_{i=1}^{n} x_{i}y_{i} - \sum_{i=1}^{n}x_{i}\sum_{i=1}^{n}y_{i}}{ n \sum_{i=1}^{n} x_{i}^{2} - \left(\sum_{i=1}^{n}x_{i}\right)^{2}} \end{split}$$
$$\boxed{ \therefore m = \frac{ n \sum_{i=1}^{n} x_{i}y_{i} - \sum_{i=1}^{n}x_{i}\sum_{i=1}^{n}y_{i}}{ n \sum_{i=1}^{n} x_{i}^{2} - \left(\sum_{i=1}^{n}x_{i}\right)^{2}} }$$
The \(y\)-intercept \(b\) follows from the fact that the least-squares line passes through the point of means \((\bar{x}, \bar{y})\):
$$\begin{split}b &= \bar{y} - m \bar{x} \\ &= \bar{y} - \left( \frac{\sum_{i=1}^{n} x_{i}y_{i} - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^{2}} \right) \bar{x} \\ &= \bar{y} - \frac{\bar{x}\sum_{i=1}^{n} x_{i}y_{i} - n\bar{x}^{2}\bar{y}}{\sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^{2}} \\ &= \frac{\bar{y}\sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^{2}\bar{y} + n\bar{x}^{2}\bar{y} - \bar{x} \sum_{i=1}^{n}x_{i}y_{i}}{\sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^{2}} \\ &= \frac{ \frac{1}{n}\sum_{i=1}^{n} y_{i}\sum_{i=1}^{n} x_{i}^{2} - \frac{1}{n}\sum_{i=1}^{n}x_{i} \sum_{i=1}^{n}x_{i}y_{i}}{\sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^{2}} \cdot \frac{n}{n} \\ &= \frac{\sum_{i=1}^{n} y_{i} \sum_{i=1}^{n} x^{2}_{i} - \sum_{i=1}^{n} x_{i} \sum_{i=1}^{n} x_{i}y_{i}}{n \sum_{i=1}^{n} x^{2}_{i} - \left( \sum_{i=1}^{n} x_{i} \right)^{2}} \end{split}$$
$$\boxed {\therefore b = \frac{\sum_{i=1}^{n} y_{i} \sum_{i=1}^{n} x^{2}_{i} - \sum_{i=1}^{n} x_{i} \sum_{i=1}^{n} x_{i}y_{i}}{n \sum_{i=1}^{n} x^{2}_{i} - \left( \sum_{i=1}^{n} x_{i} \right)^{2}} }$$
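The boxed formulas for \(m\) and \(b\) can be applied directly from the raw sums. A minimal sketch using points that lie exactly on \(y = 2x + 1\), so the formulas must recover \(m = 2\) and \(b = 1\):

```python
# Points chosen to lie exactly on the line y = 2x + 1.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
n = len(x)

sx = sum(x)                              # sum of x_i
sy = sum(y)                              # sum of y_i
sxx = sum(v * v for v in x)              # sum of x_i^2
sxy = sum(a * b for a, b in zip(x, y))   # sum of x_i * y_i

# Boxed formulas for the slope and the y-intercept.
m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
b = (sy * sxx - sx * sxy) / (n * sxx - sx ** 2)
```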
The Pearson correlation coefficient is the covariance divided by the product of the standard deviations of the two variables (equivalently, the square root of the product of their variances). In the sample version, the factors of \(n\) in the covariance and the standard deviations cancel, leaving:
$$\begin {split}r &= \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} \\ &= \frac{\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right) \left( y_{i} - \bar{y} \right)}{\sqrt{\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right)^{2}\sum_{i=1}^{n} \left( y_{i} - \bar{y} \right)^{2}}}\end{split}$$
A negative value of \(r\) arises when the deviations \(\left( x_{i} - \bar{x} \right)\) and \(\left( y_{i} - \bar{y} \right)\) tend to have opposite signs, making the covariance, and hence the correlation, negative.
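A small example of a negative correlation (illustrative data; `pearson_r` implements the sample formula):

```python
import math

def pearson_r(xs, ys):
    # Sample Pearson correlation coefficient.
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
    sxx = sum((x - xb) ** 2 for x in xs)
    syy = sum((y - yb) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# y decreases as x increases: the deviations have opposite signs,
# so the covariance is negative and r < 0.
r = pearson_r([1, 2, 3, 4, 5], [10, 8, 7, 4, 2])
```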