Module V: Statistical Techniques

1. Karl Pearson's Coefficient of Correlation (r)

Definition
Measures the strength and direction of the linear relationship between two variables \( X \) and \( Y \). Its value always lies between \(-1\) and \(+1\).
\[ r = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sqrt{\sum(x - \bar{x})^2 \cdot \sum(y - \bar{y})^2}} = \frac{\sum xy - n\bar{x}\bar{y}}{\sqrt{(\sum x^2 - n\bar{x}^2)(\sum y^2 - n\bar{y}^2)}} \]
Shortcut Formula: \[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]
Value of r | Interpretation
\( r = +1 \) | Perfect positive correlation
\( r = -1 \) | Perfect negative correlation
\( r = 0 \) | No linear correlation
\( 0 < r < 1 \) | Positive correlation
\( -1 < r < 0 \) | Negative correlation
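Example
Find Karl Pearson's correlation coefficient for the data:
X | 1 | 2 | 3 | 4 | 5
Y | 2 | 4 | 5 | 6 | 8
Solution: \( n = 5 \), \( \sum x = 15 \), \( \sum y = 25 \), \( \sum xy = 89 \), \( \sum x^2 = 55 \), \( \sum y^2 = 145 \).
\[ r = \frac{5(89) - (15)(25)}{\sqrt{[5(55) - 15^2][5(145) - 25^2]}} = \frac{70}{\sqrt{50 \times 100}} \approx 0.99 \]
Answer: \( r \approx 0.99 \) (strong positive correlation)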
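
The shortcut formula translates almost directly into code. The following is a minimal Python sketch, not a reference implementation; the function name `pearson_r` is hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Karl Pearson's r computed with the shortcut (raw sums) formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    # A constant series makes the denominator zero; r is undefined in that case.
    return numerator / denominator

r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 6, 8])
print(round(r, 2))  # 0.99
```

The raw-sums form is convenient by hand, but with very large values the deviation form \( \sum(x - \bar{x})(y - \bar{y}) \) is usually less prone to rounding error.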

2. Spearman's Rank Correlation Coefficient (R)

Definition
Used when data is given in ranks or when exact values are not available. Also useful for ordinal data.

Case 1: Non-Repeated Ranks

\[ R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} \]

where \( d = \) difference between ranks, \( n = \) number of pairs

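Example
Find Spearman's rank correlation coefficient for the ranks given by two judges:
Judge A | 1 | 2 | 3 | 4 | 5
Judge B | 2 | 1 | 4 | 3 | 5
Solution: \( d = -1, 1, -1, 1, 0 \), so \( \sum d^2 = 4 \) with \( n = 5 \).
\[ R = 1 - \frac{6(4)}{5(5^2 - 1)} = 1 - \frac{24}{120} = 0.8 \]
Answer: \( R = 0.8 \)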
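
As a quick check of the non-repeated-ranks formula, here is a small Python sketch. It assumes the inputs are already ranks with no ties; the function name `spearman_from_ranks` is hypothetical.

```python
def spearman_from_ranks(rank_a, rank_b):
    """Spearman's R for two rankings without ties: 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    if len(rank_a) != len(rank_b) or len(rank_a) < 2:
        raise ValueError("need two equal-length rankings with at least 2 items")
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]
print(round(spearman_from_ranks(judge_a, judge_b), 2))  # 0.8
```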

Case 2: Repeated Ranks

\[ R = 1 - \frac{6\left[\sum d^2 + \frac{m_1^3 - m_1}{12} + \frac{m_2^3 - m_2}{12} + \cdots\right]}{n(n^2 - 1)} \]

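where \( m_1, m_2, \ldots \) are the numbers of items sharing each repeated rank. Add one correction term for every group of tied values, in either series.

For repeated ranks, assign the average rank. E.g., if the 3rd and 4th positions are tied, both get rank \( \frac{3+4}{2} = 3.5 \).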
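Example
Find R with repeated ranks:
X | 40 | 50 | 50 | 60 | 70 | 70
Y | 30 | 40 | 50 | 50 | 60 | 80
Solution: Ranking each series from smallest (rank 1) to largest and averaging tied positions (ranking from largest to smallest gives the same \( R \)):
Rank of X | 1 | 2.5 | 2.5 | 4 | 5.5 | 5.5
Rank of Y | 1 | 2 | 3.5 | 3.5 | 5 | 6
d | 0 | 0.5 | -1 | 0.5 | 0.5 | -0.5
\( \sum d^2 = 2 \). There are three tied groups (50 and 70 in X, 50 in Y), each with \( m = 2 \), so each adds \( \frac{2^3 - 2}{12} = 0.5 \), for a total correction of 1.5.
\[ R = 1 - \frac{6(2 + 1.5)}{6(6^2 - 1)} = 1 - \frac{21}{210} = 0.9 \]
Answer: \( R = 0.9 \)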
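
The average-rank rule and the tie correction can be sketched in Python as below. This is an illustrative sketch under stated assumptions: values are ranked from smallest (rank 1) to largest, and the helper names `average_ranks`, `tie_correction`, and `spearman_with_ties` are hypothetical.

```python
from collections import Counter

def average_ranks(values):
    """Rank ascending (1 = smallest); tied values share the mean of their positions."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m tied values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

def spearman_with_ties(xs, ys):
    """Spearman's R using the repeated-ranks formula from this section."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    adjusted = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n * (n ** 2 - 1))

x = [40, 50, 50, 60, 70, 70]
y = [30, 40, 50, 50, 60, 80]
print(average_ranks(x))                    # [1.0, 2.5, 2.5, 4.0, 5.5, 5.5]
print(round(spearman_with_ties(x, y), 2))  # 0.9
```

Note that this follows the textbook correction formula given here. Library routines that compute Pearson's r on the average ranks can give a slightly different value when ties are present.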

3. Lines of Regression

Definition
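Regression lines are used to predict the value of one variable from another. There are two regression lines: Y on X (to estimate \( y \) for a given \( x \)) and X on Y (to estimate \( x \) for a given \( y \)).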

Line of Regression of Y on X

\[ y - \bar{y} = b_{yx}(x - \bar{x}) \]

where \( b_{yx} = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2} = r \cdot \frac{\sigma_y}{\sigma_x} \)

Line of Regression of X on Y

\[ x - \bar{x} = b_{xy}(y - \bar{y}) \]

where \( b_{xy} = \frac{\sum xy - n\bar{x}\bar{y}}{\sum y^2 - n\bar{y}^2} = r \cdot \frac{\sigma_x}{\sigma_y} \)

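Key Properties
Both regression lines pass through the point of means \( (\bar{x}, \bar{y}) \).
\( r^2 = b_{yx} \times b_{xy} \), and \( r \), \( b_{yx} \), \( b_{xy} \) all have the same sign.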
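Example
Find the regression lines for the data:
X | 1 | 2 | 3 | 4 | 5
Y | 2 | 4 | 5 | 6 | 8
Solution: \( \bar{x} = 3 \), \( \bar{y} = 5 \), \( \sum xy - n\bar{x}\bar{y} = 89 - 75 = 14 \), \( \sum x^2 - n\bar{x}^2 = 55 - 45 = 10 \), \( \sum y^2 - n\bar{y}^2 = 145 - 125 = 20 \).
\( b_{yx} = \frac{14}{10} = 1.4 \), so \( y - 5 = 1.4(x - 3) \). \( b_{xy} = \frac{14}{20} = 0.7 \), so \( x - 3 = 0.7(y - 5) \).
Y on X: \( y = 1.4x + 0.8 \)
X on Y: \( x = 0.7y - 0.5 \)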
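
Both regression coefficients share the same numerator, which the following Python sketch makes explicit. It is a minimal illustration; the function name `regression_lines` and the returned dictionary keys are hypothetical.

```python
def regression_lines(xs, ys):
    """Return (slope, intercept) for Y on X and for X on Y, plus b_yx * b_xy."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    s_xx = sum(x * x for x in xs) - n * mean_x ** 2
    s_yy = sum(y * y for y in ys) - n * mean_y ** 2
    b_yx = s_xy / s_xx  # slope of Y on X
    b_xy = s_xy / s_yy  # slope of X on Y
    return {
        "y_on_x": (b_yx, mean_y - b_yx * mean_x),  # y = b_yx * x + intercept
        "x_on_y": (b_xy, mean_x - b_xy * mean_y),  # x = b_xy * y + intercept
        "r_squared": b_yx * b_xy,
    }

lines = regression_lines([1, 2, 3, 4, 5], [2, 4, 5, 6, 8])
print([round(v, 2) for v in lines["y_on_x"]])  # [1.4, 0.8]
print([round(v, 2) for v in lines["x_on_y"]])  # [0.7, -0.5]
print(round(lines["r_squared"], 2))            # 0.98
```

The product of the two slopes, 0.98, is the square of \( r \approx 0.99 \) for the same data.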

4. Curve Fitting

First Degree (Linear): \( y = a + bx \)

Normal Equations: \[ \sum y = na + b\sum x \] \[ \sum xy = a\sum x + b\sum x^2 \]
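Example
Fit \( y = a + bx \) to the data:
X | 1 | 2 | 3 | 4
Y | 2 | 5 | 7 | 10
Solution: \( n = 4 \), \( \sum x = 10 \), \( \sum y = 24 \), \( \sum xy = 73 \), \( \sum x^2 = 30 \). The normal equations become \( 24 = 4a + 10b \) and \( 73 = 10a + 30b \), which give \( a = -0.5 \), \( b = 2.6 \).
Answer: \( y = -0.5 + 2.6x \)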
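
The two normal equations form a 2x2 linear system, so they can be solved in closed form. Below is a minimal Python sketch using Cramer's rule; the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x by solving the two normal equations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Determinant of [[n, sum_x], [sum_x, sum_x2]]; zero when all x are equal.
    det = n * sum_x2 - sum_x * sum_x
    if det == 0:
        raise ValueError("all x values are identical; the line is not determined")
    a = (sum_y * sum_x2 - sum_x * sum_xy) / det
    b = (n * sum_xy - sum_x * sum_y) / det
    return a, b

print(fit_line([1, 2, 3, 4], [2, 5, 7, 10]))  # (-0.5, 2.6)
```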

Second Degree (Parabola): \( y = a + bx + cx^2 \)

Normal Equations: \[ \sum y = na + b\sum x + c\sum x^2 \] \[ \sum xy = a\sum x + b\sum x^2 + c\sum x^3 \] \[ \sum x^2y = a\sum x^2 + b\sum x^3 + c\sum x^4 \]
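Example
Fit \( y = a + bx + cx^2 \) to the data:
X | 0 | 1 | 2 | 3
Y | 1 | 2 | 9 | 20
Solution: \( n = 4 \), \( \sum x = 6 \), \( \sum x^2 = 14 \), \( \sum x^3 = 36 \), \( \sum x^4 = 98 \), \( \sum y = 32 \), \( \sum xy = 80 \), \( \sum x^2y = 218 \). The normal equations become
\[ 32 = 4a + 6b + 14c \] \[ 80 = 6a + 14b + 36c \] \[ 218 = 14a + 36b + 98c \]
Eliminating \( a \) gives \( b + 3c = 6.4 \) and \( 15b + 49c = 106 \), so \( c = 2.5 \), \( b = -1.1 \), \( a = 0.9 \).
Answer: \( y = 0.9 - 1.1x + 2.5x^2 \)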
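
For the second-degree fit the three normal equations are tedious to solve by hand, so a small solver helps to check the arithmetic. The following Python sketch builds the normal equations from the power sums and solves them by Gaussian elimination with partial pivoting. The names `solve_linear_system` and `fit_parabola` are hypothetical, and this is one possible way to solve the system, not the only one.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def fit_parabola(xs, ys):
    """Least-squares fit of y = a + b*x + c*x^2 via the three normal equations."""
    # s[p] = sum of x^p (s[0] equals n); t[p] = sum of x^p * y
    s = [sum(x ** p for x in xs) for p in range(5)]
    t = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in range(3)]
    matrix = [
        [s[0], s[1], s[2]],
        [s[1], s[2], s[3]],
        [s[2], s[3], s[4]],
    ]
    return solve_linear_system(matrix, t)

a, b, c = fit_parabola([0, 1, 2, 3], [1, 2, 9, 20])
print(round(a, 4), round(b, 4), round(c, 4))
```

Calling `fit_parabola` with the X and Y values of the example returns the coefficients \( a \), \( b \), \( c \) in that order.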

Quick Reference: All Formulas

Concept | Formula
Karl Pearson's r | \( r = \frac{n\sum xy - \sum x\sum y}{\sqrt{[n\sum x^2-(\sum x)^2][n\sum y^2-(\sum y)^2]}} \)
Spearman's R (no repeat) | \( R = 1 - \frac{6\sum d^2}{n(n^2-1)} \)
Spearman's R (repeat) | Add \( \frac{m^3-m}{12} \) to \( \sum d^2 \) for each repeated rank
Regression Y on X | \( y - \bar{y} = b_{yx}(x - \bar{x}) \)
Regression X on Y | \( x - \bar{x} = b_{xy}(y - \bar{y}) \)
Relation | \( r^2 = b_{yx} \times b_{xy} \)
Linear fit | \( \sum y = na + b\sum x \); \( \sum xy = a\sum x + b\sum x^2 \)