Module V: Statistical Techniques
1. Karl Pearson's Coefficient of Correlation (r)
Definition
Measures the degree of linear relationship between two variables \( X \) and \( Y \). Value ranges from \(-1\) to \(+1\).
\[ r = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sqrt{\sum(x - \bar{x})^2 \cdot \sum(y - \bar{y})^2}} = \frac{\sum xy - n\bar{x}\bar{y}}{\sqrt{(\sum x^2 - n\bar{x}^2)(\sum y^2 - n\bar{y}^2)}} \]
Shortcut Formula:
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]
| Value of r | Interpretation |
| \( r = +1 \) | Perfect positive correlation |
| \( r = -1 \) | Perfect negative correlation |
| \( r = 0 \) | No linear correlation |
| \( 0 < r < 1 \) | Positive correlation |
| \( -1 < r < 0 \) | Negative correlation |
Find Karl Pearson's correlation coefficient for the X and Y values tabulated below
- Prepare calculation table:
| X | Y | XY | X² | Y² |
| 1 | 2 | 2 | 1 | 4 |
| 2 | 4 | 8 | 4 | 16 |
| 3 | 5 | 15 | 9 | 25 |
| 4 | 6 | 24 | 16 | 36 |
| 5 | 8 | 40 | 25 | 64 |
| Σ=15 | Σ=25 | Σ=89 | Σ=55 | Σ=145 |
- \( n = 5 \), \( \sum x = 15 \), \( \sum y = 25 \), \( \sum xy = 89 \), \( \sum x^2 = 55 \), \( \sum y^2 = 145 \)
- Apply formula:
\[ r = \frac{5(89) - 15 \times 25}{\sqrt{[5(55) - 225][5(145) - 625]}} = \frac{445 - 375}{\sqrt{(275-225)(725-625)}} \]
\[ = \frac{70}{\sqrt{50 \times 100}} = \frac{70}{\sqrt{5000}} = \frac{70}{70.71} \]
\( r = 0.99 \) (Strong positive correlation)
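The hand calculation above can be cross-checked with a short script. This is a minimal sketch of the shortcut formula using only the standard library; the function name `pearson_r` is an illustrative choice, not something defined in these notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r via the shortcut (raw sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Data from the worked example
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 6, 8]
r = pearson_r(x, y)
print(round(r, 2))  # 0.99
```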
2. Spearman's Rank Correlation Coefficient (R)
Definition
Used when data is given in ranks or when exact values are not available. Also useful for ordinal data.
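Case 1: Non-Repeated Ranks
\[ R = 1 - \frac{6\sum d^2}{n(n^2-1)} \]
where \( d \) is the difference between the two ranks of an item and \( n \) is the number of items.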
Find Spearman's rank correlation coefficient
| Judge A | 1 | 2 | 3 | 4 | 5 |
| Judge B | 2 | 1 | 4 | 3 | 5 |
- Calculate differences:
| Rank A | Rank B | d = A - B | d² |
| 1 | 2 | -1 | 1 |
| 2 | 1 | 1 | 1 |
| 3 | 4 | -1 | 1 |
| 4 | 3 | 1 | 1 |
| 5 | 5 | 0 | 0 |
| | | Σd² = | 4 |
- \( n = 5 \), \( \sum d^2 = 4 \)
- \[ R = 1 - \frac{6 \times 4}{5(25-1)} = 1 - \frac{24}{120} = 1 - 0.2 \]
\( R = 0.8 \)
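As a quick check of the arithmetic, the same calculation can be sketched in a few lines of Python. The function name `spearman_no_ties` is hypothetical, and the sketch assumes both inputs are already ranks with no repeats.

```python
def spearman_no_ties(rank_a, rank_b):
    """Spearman's R from two lists of ranks, assuming no repeated ranks."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Ranks given by the two judges in the worked example
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]
R = spearman_no_ties(judge_a, judge_b)
print(round(R, 2))  # 0.8
```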
Case 2: Repeated Ranks
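When values are tied, assign each of them the average of the ranks they would otherwise occupy. E.g., if the 3rd and 4th positions are tied, both get rank \( \frac{3+4}{2} = 3.5 \). For every value repeated \( m \) times, add the correction factor \( \frac{m^3-m}{12} \) to \( \sum d^2 \):
\[ R = 1 - \frac{6\left[\sum d^2 + \sum \frac{m^3-m}{12}\right]}{n(n^2-1)} \]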
Find R with repeated ranks
| X | 40 | 50 | 50 | 60 | 70 | 70 |
| Y | 30 | 40 | 50 | 50 | 60 | 80 |
- Assign ranks (1 = lowest):
| X | Rank X | Y | Rank Y | d | d² |
| 40 | 1 | 30 | 1 | 0 | 0 |
| 50 | 2.5 | 40 | 2 | 0.5 | 0.25 |
| 50 | 2.5 | 50 | 3.5 | -1 | 1 |
| 60 | 4 | 50 | 3.5 | 0.5 | 0.25 |
| 70 | 5.5 | 60 | 5 | 0.5 | 0.25 |
| 70 | 5.5 | 80 | 6 | -0.5 | 0.25 |
| | | | | Σd² = | 2 |
- X: value 50 occurs 2 times (\(m_1 = 2\)) and value 70 occurs 2 times (\(m_2 = 2\))
Y: value 50 occurs 2 times (\(m_3 = 2\))
- Correction factor for each tied group: \( \frac{m^3-m}{12} = \frac{2^3-2}{12} = \frac{6}{12} = 0.5 \), added three times (once per group)
- \[ R = 1 - \frac{6[2 + 0.5 + 0.5 + 0.5]}{6(36-1)} = 1 - \frac{6 \times 3.5}{210} = 1 - \frac{21}{210} \]
\( R = 0.9 \)
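The tie-handling steps (average ranks, then one correction factor per tied group) can be sketched as follows. The helper names `average_ranks`, `tie_correction` and `spearman_with_ties` are illustrative assumptions; the formula is the corrected one used in the worked example, with rank 1 given to the lowest value.

```python
from collections import Counter

def average_ranks(values):
    """Rank values (1 = lowest); tied values share the mean of their positions."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every value that occurs m > 1 times."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

def spearman_with_ties(xs, ys):
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    correction = tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))

# Data from the worked example
x = [40, 50, 50, 60, 70, 70]
y = [30, 40, 50, 50, 60, 80]
print(average_ranks(x))  # [1.0, 2.5, 2.5, 4.0, 5.5, 5.5]
R = spearman_with_ties(x, y)
print(round(R, 2))  # 0.9
```

With no repeated values the correction term is zero, so the function reduces to the Case 1 formula.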
3. Lines of Regression
Definition
Regression lines are used to predict the value of one variable from another. There are two regression lines.
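Line of Regression of Y on X
\[ y - \bar{y} = b_{yx}(x - \bar{x}), \qquad b_{yx} = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2} \]
Line of Regression of X on Y
\[ x - \bar{x} = b_{xy}(y - \bar{y}), \qquad b_{xy} = \frac{\sum xy - n\bar{x}\bar{y}}{\sum y^2 - n\bar{y}^2} \]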
Key Properties
- Both lines pass through \( (\bar{x}, \bar{y}) \)
- \( r^2 = b_{yx} \times b_{xy} \), so \( r = \pm\sqrt{b_{yx} \times b_{xy}} \)
- Sign of \( r \) = sign of \( b_{yx} \) = sign of \( b_{xy} \)
Find the regression lines for the data used in the Karl Pearson example (Section 1)
- From earlier: \( n=5 \), \( \sum x=15 \), \( \sum y=25 \), \( \sum xy=89 \), \( \sum x^2=55 \), \( \sum y^2=145 \)
\( \bar{x} = 3 \), \( \bar{y} = 5 \)
- \[ b_{yx} = \frac{89 - 5(3)(5)}{55 - 5(9)} = \frac{89-75}{55-45} = \frac{14}{10} = 1.4 \]
- Y on X: \( y - 5 = 1.4(x - 3) \) → \( y = 1.4x + 0.8 \)
- \[ b_{xy} = \frac{89 - 75}{145 - 5(25)} = \frac{14}{145-125} = \frac{14}{20} = 0.7 \]
- X on Y: \( x - 3 = 0.7(y - 5) \) → \( x = 0.7y - 0.5 \)
Y on X: \( y = 1.4x + 0.8 \)
X on Y: \( x = 0.7y - 0.5 \)
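The two regression coefficients and intercepts can be recomputed with a small script. This is a sketch under the same formulas used in the steps above; `regression_lines` is a hypothetical helper name. It also checks the property \( r^2 = b_{yx} \times b_{xy} \).

```python
def regression_lines(xs, ys):
    """Return (b_yx, intercept of Y on X, b_xy, intercept of X on Y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    s_xx = sum(x * x for x in xs) - n * mean_x ** 2
    s_yy = sum(y * y for y in ys) - n * mean_y ** 2
    b_yx = s_xy / s_xx  # slope of Y on X
    b_xy = s_xy / s_yy  # slope of X on Y
    # Both lines pass through (mean_x, mean_y), which fixes the intercepts
    return b_yx, mean_y - b_yx * mean_x, b_xy, mean_x - b_xy * mean_y

# Data from the worked example
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 6, 8]
b_yx, c_yx, b_xy, c_xy = regression_lines(x, y)
print(round(b_yx, 2), round(c_yx, 2))  # 1.4 0.8
print(round(b_xy, 2), round(c_xy, 2))  # 0.7 -0.5
print(round(b_yx * b_xy, 2))           # 0.98, i.e. r^2
```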
4. Curve Fitting
First Degree (Linear): \( y = a + bx \)
Normal Equations:
\[ \sum y = na + b\sum x \]
\[ \sum xy = a\sum x + b\sum x^2 \]
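As a brief sketch of where these come from: the method of least squares chooses \( a \) and \( b \) to minimise the sum of squared residuals \( S \). Setting both partial derivatives of \( S \) to zero gives the two normal equations.
\[ S = \sum (y - a - bx)^2 \]
\[ \frac{\partial S}{\partial a} = -2\sum (y - a - bx) = 0 \;\Rightarrow\; \sum y = na + b\sum x \]
\[ \frac{\partial S}{\partial b} = -2\sum x(y - a - bx) = 0 \;\Rightarrow\; \sum xy = a\sum x + b\sum x^2 \]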
Fit \( y = a + bx \) to the data
- Prepare table:
| X | Y | XY | X² |
| 1 | 2 | 2 | 1 |
| 2 | 5 | 10 | 4 |
| 3 | 7 | 21 | 9 |
| 4 | 10 | 40 | 16 |
| Σ=10 | Σ=24 | Σ=73 | Σ=30 |
- Normal equations:
\( 24 = 4a + 10b \) ... (1)
\( 73 = 10a + 30b \) ... (2)
- From (1): \( a = \frac{24-10b}{4} = 6 - 2.5b \)
Substitute in (2): \( 73 = 10(6-2.5b) + 30b = 60 - 25b + 30b \)
\( 73 = 60 + 5b \) → \( b = 2.6 \)
\( a = 6 - 2.5(2.6) = -0.5 \)
\( y = -0.5 + 2.6x \)
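The substitution steps above can be verified by solving the two normal equations directly. The sketch below uses Cramer's rule on the 2×2 system; `fit_line` is an illustrative name, not part of any library.

```python
def fit_line(xs, ys):
    """Solve the normal equations for y = a + b*x using Cramer's rule."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    det = n * sum_x2 - sum_x ** 2  # determinant of the coefficient matrix
    a = (sum_y * sum_x2 - sum_x * sum_xy) / det
    b = (n * sum_xy - sum_x * sum_y) / det
    return a, b

# Data from the worked example
x = [1, 2, 3, 4]
y = [2, 5, 7, 10]
a, b = fit_line(x, y)
print(a, b)  # -0.5 2.6
```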
Second Degree (Parabola): \( y = a + bx + cx^2 \)
Normal Equations:
\[ \sum y = na + b\sum x + c\sum x^2 \]
\[ \sum xy = a\sum x + b\sum x^2 + c\sum x^3 \]
\[ \sum x^2y = a\sum x^2 + b\sum x^3 + c\sum x^4 \]
Fit \( y = a + bx + cx^2 \) to the data
- Prepare table:
| X | Y | X² | X³ | X⁴ | XY | X²Y |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 1 | 1 | 1 | 2 | 2 |
| 2 | 9 | 4 | 8 | 16 | 18 | 36 |
| 3 | 20 | 9 | 27 | 81 | 60 | 180 |
| Σ=6 | Σ=32 | Σ=14 | Σ=36 | Σ=98 | Σ=80 | Σ=218 |
- Normal equations:
\( 32 = 4a + 6b + 14c \) ... (1)
\( 80 = 6a + 14b + 36c \) ... (2)
\( 218 = 14a + 36b + 98c \) ... (3)
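- Solving these simultaneous equations:
Dividing (1), (2) and (3) by 2:
\( 16 = 2a + 3b + 7c \) ... (4)
\( 40 = 3a + 7b + 18c \) ... (5)
\( 109 = 7a + 18b + 49c \) ... (6)
From (4): \( a = 8 - 1.5b - 3.5c \)
Substitute in (5): \( 2.5b + 7.5c = 16 \) → \( b = 6.4 - 3c \)
Substitute in (6): \( 7.5b + 24.5c = 53 \) → \( 7.5(6.4 - 3c) + 24.5c = 53 \) → \( 2c = 5 \) → \( c = 2.5 \)
\( b = 6.4 - 3(2.5) = -1.1 \), \( a = 8 - 1.5(-1.1) - 3.5(2.5) = 0.9 \)
Check in (1): \( 4(0.9) + 6(-1.1) + 14(2.5) = 3.6 - 6.6 + 35 = 32 \)
\( y = 0.9 - 1.1x + 2.5x^2 \)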
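Three simultaneous equations are easy to get wrong by hand, so it helps to solve the normal equations numerically and substitute the result back. The sketch below builds the system from the data and solves it by Gauss-Jordan elimination with exact fractions; `solve_linear_system` and `fit_parabola` are hypothetical helper names, and the solver assumes the system has a unique solution.

```python
from fractions import Fraction

def solve_linear_system(matrix, rhs):
    """Gauss-Jordan elimination with exact fractions (assumes a unique solution)."""
    n = len(rhs)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Swap in a row with a non-zero entry in this column, then normalise it
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        # Clear this column in every other row
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def fit_parabola(xs, ys):
    """Solve the three normal equations for y = a + b*x + c*x^2."""
    def sx(power):
        return sum(x ** power for x in xs)
    def sxy(power):
        return sum((x ** power) * y for x, y in zip(xs, ys))
    matrix = [[len(xs), sx(1), sx(2)],
              [sx(1), sx(2), sx(3)],
              [sx(2), sx(3), sx(4)]]
    return solve_linear_system(matrix, [sxy(0), sxy(1), sxy(2)])

# Data from the worked example
x = [0, 1, 2, 3]
y = [1, 2, 9, 20]
a, b, c = fit_parabola(x, y)
print(float(a), float(b), float(c))
# Substitute back into normal equation (1): the left side should equal sum(y)
print(4 * a + 6 * b + 14 * c == sum(y))
```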
Quick Reference: All Formulas
| Concept | Formula |
| Karl Pearson's r | \( r = \frac{n\sum xy - \sum x\sum y}{\sqrt{[n\sum x^2-(\sum x)^2][n\sum y^2-(\sum y)^2]}} \) |
| Spearman's R (no repeat) | \( R = 1 - \frac{6\sum d^2}{n(n^2-1)} \) |
| Spearman's R (repeat) | Add \( \frac{m^3-m}{12} \) to \( \sum d^2 \) for each value repeated \( m \) times |
| Regression Y on X | \( y - \bar{y} = b_{yx}(x - \bar{x}) \) |
| Regression X on Y | \( x - \bar{x} = b_{xy}(y - \bar{y}) \) |
| Relation | \( r^2 = b_{yx} \times b_{xy} \) |
| Linear fit | \( \sum y = na + b\sum x \); \( \sum xy = a\sum x + b\sum x^2 \) |