Principal Components Analysis for Data Reduction

In this markdown we show, step by step, how to reduce the dimensionality of data using principal components analysis via the singular value decomposition. The source can be found here. The grade data used is here.



Principal Components Analysis


Assume you have a lot of correlated data: what to do?

A simple 4-dimensional example

First, let’s generate multivariate normal random samples in a 4-dimensional space. We have to create a covariance matrix and make sure it is positive definite (the code below also loads the packages used throughout):

#MASS provides mvrnorm; ggplot2 and GGally provide the plotting functions used below
library(MASS)
library(ggplot2)
library(GGally)
Ndims <- 4
Nsamples <- 25
Sigma <- matrix(runif(Ndims^2, min = 0.1, max = 10), Ndims, Ndims)
Sigma <- Sigma %*% t(Sigma)  #a matrix times its transpose is positive semi-definite
samples <- data.frame(mvrnorm(n = Nsamples, runif(Ndims, min = -2, max = 2), Sigma))

Looking at the data with ggpairs shows that the dimensions are correlated with each other:

ggpairs(samples)

(Figure: ggpairs plot of the four simulated dimensions, showing their pairwise correlations)

The next thing we do is compute the singular value decomposition of the data matrix, in this case, samples. It is useful to make the columns of the matrix have a zero mean to help interpret the singular values.

#create mean-0 columns
norm_samples <- samples
columnMeans <- numeric(Ndims)
for (i in 1:Ndims){
  columnMeans[i] <- mean(norm_samples[,i])
  norm_samples[,i] <- norm_samples[,i] - columnMeans[i]
}

#compute SVD
sv_decomp <- svd(norm_samples)
svDF <- data.frame(singular_values = sv_decomp$d, sv = 1:Ndims)
svDF$frac <- sv_decomp$d / sum(sv_decomp$d)  #each singular value's share of the total
ggplot(svDF, aes(x=as.factor(sv), y=singular_values, label=paste(round(frac*100,2),"%"))) + geom_bar(stat="identity") + scale_x_discrete("Singular Value #") + scale_y_continuous("Singular Value") + geom_text(color="red", aes(y=(sv_decomp$d)*.5))

(Figure: bar chart of the singular values, each labeled with its percentage of the total)

From the figure, we see that the first singular value accounts for 62.73% of the sum of the singular values. One way to interpret this is that 62.73% of the variability in the data is captured by a single principal component. The first two singular values together account for 85.22% of the sum; that is, about 85.22% of the total variability in the data is represented by two components.

The upshot is that if we replace the 4-d space with an appropriate 2-d space, we can capture most of the structure of the original data.
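
As a side check (a minimal sketch reusing the sv_decomp object from above): the proportion of variance explained is conventionally computed from the squared singular values, which will typically be even more concentrated in the leading components than the raw singular-value fractions quoted above.

#sketch: proportion of total variance captured per component, using squared singular values
varFrac <- sv_decomp$d^2 / sum(sv_decomp$d^2)
round(cumsum(varFrac), 4)  #cumulative proportion of variance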

One PC Reconstruction

Reconstructing the data using only one principal component amounts to applying the singular value decomposition formula with every singular value except the first set to zero.

#one_var reconstruction
normData1 <- sv_decomp$u %*% diag(c(sv_decomp$d[1],rep(0,Ndims-1))) %*% t(sv_decomp$v)
origScale1 <- data.frame(normData1)
for (i in 1:Ndims){
  origScale1[,i] <- origScale1[,i] + columnMeans[i]  
}

plotDF <- cbind(samples,origScale1)
colnames(plotDF) <- c("O1","O2","O3","O4","X1","X2","X3","X4")

ggplot(plotDF,aes(x=O1,y=X1)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("1 Principal Component Reconstruction") + ggtitle(paste("Dimension 1, RMS Error:", round(sqrt(mean((plotDF$O1 - plotDF$X1)^2)),2)))

(Figure: dimension 1, one-component reconstruction vs. original data)

ggplot(plotDF,aes(x=O2,y=X2)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("1 Principal Component Reconstruction") + ggtitle(paste("Dimension 2, RMS Error:", round(sqrt(mean((plotDF$O2 - plotDF$X2)^2)),2)))

(Figure: dimension 2, one-component reconstruction vs. original data)

ggplot(plotDF,aes(x=O3,y=X3)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("1 Principal Component Reconstruction") + ggtitle(paste("Dimension 3, RMS Error:", round(sqrt(mean((plotDF$O3 - plotDF$X3)^2)),2)))

(Figure: dimension 3, one-component reconstruction vs. original data)

ggplot(plotDF,aes(x=O4,y=X4)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("1 Principal Component Reconstruction") + ggtitle(paste("Dimension 4, RMS Error:", round(sqrt(mean((plotDF$O4 - plotDF$X4)^2)),2)))

(Figure: dimension 4, one-component reconstruction vs. original data)

Two PC Reconstruction

We can do much better using two principal components:

#two_var reconstruction
normData2 <- sv_decomp$u %*% diag(c(sv_decomp$d[1:2],rep(0,Ndims-2))) %*% t(sv_decomp$v)
origScale2 <- data.frame(normData2)
for (i in 1:Ndims){
  origScale2[,i] <- origScale2[,i] + columnMeans[i]  
}

plotDF <- cbind(samples,origScale2)
colnames(plotDF) <- c("O1","O2","O3","O4","X1","X2","X3","X4")

ggplot(plotDF,aes(x=O1,y=X1)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("2 Principal Component Reconstruction") + ggtitle(paste("Dimension 1, RMS Error:", round(sqrt(mean((plotDF$O1 - plotDF$X1)^2)),2)))

(Figure: dimension 1, two-component reconstruction vs. original data)

ggplot(plotDF,aes(x=O2,y=X2)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("2 Principal Component Reconstruction") + ggtitle(paste("Dimension 2, RMS Error:", round(sqrt(mean((plotDF$O2 - plotDF$X2)^2)),2)))

(Figure: dimension 2, two-component reconstruction vs. original data)

ggplot(plotDF,aes(x=O3,y=X3)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("2 Principal Component Reconstruction") + ggtitle(paste("Dimension 3, RMS Error:", round(sqrt(mean((plotDF$O3 - plotDF$X3)^2)),2)))

(Figure: dimension 3, two-component reconstruction vs. original data)

ggplot(plotDF,aes(x=O4,y=X4)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("2 Principal Component Reconstruction") + ggtitle(paste("Dimension 4, RMS Error:", round(sqrt(mean((plotDF$O4 - plotDF$X4)^2)),2)))

(Figure: dimension 4, two-component reconstruction vs. original data)

Correcting Collinearity with Principal Components

Given data that are nearly perfectly correlated, simple linear regression models can fall on their face. Here we create data that are nearly collinear, and then produce a dependent variable that follows the simple formula \[y = x_1 + x_2 + 3\] plus standard normal noise.

dHW <- data.frame(x1 = rnorm(20))
dHW$x2 <- rnorm(20, mean = dHW$x1, sd = 0.01)  #x2 is nearly identical to x1
dHW$y  <- rnorm(20, mean = dHW$x1 + dHW$x2 + 3)
ggplot(dHW, aes(x=x1, y = x2)) + geom_point()

(Figure: scatter plot of x2 against x1, showing the near-perfect collinearity)

Running R’s standard linear model on this gives something pretty bad:

summary(simpMod <- lm(y ~ x1 + x2, data = dHW))
## 
## Call:
## lm(formula = y ~ x1 + x2, data = dHW)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9474 -0.5991  0.0162  0.5517  1.7395 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.830      0.214   13.22  2.3e-10 ***
## x1            16.372     28.057    0.58     0.57    
## x2           -13.857     28.012   -0.49     0.63    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.941 on 17 degrees of freedom
## Multiple R-squared:  0.896,  Adjusted R-squared:  0.884 
## F-statistic: 73.3 on 2 and 17 DF,  p-value: 4.38e-09

In other words we get \[y = 16.3721 x_1 - 13.8567 x_2 + 2.8299.\] The estimated standard errors of the coefficients show that neither the coefficient for \(x_1\) nor the coefficient for \(x_2\) is significant.
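
The source of the instability is easy to see numerically; a minimal check (sketch) of how strongly the two predictors are correlated:

#sketch: the predictors are nearly identical, which is what inflates the coefficient standard errors
cor(dHW$x1, dHW$x2)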

We can fix this with principal components. First, we compute the SVD of the predictors:

svdLin <- svd(dHW[,c("x1","x2")])

Then we multiply the original data by the matrix \(V\):

print(svdLin$v)
##        [,1]    [,2]
## [1,] 0.7065 -0.7077
## [2,] 0.7077  0.7065
newData <- dHW
newData[,c("x1","x2")] <- as.matrix(dHW[,c("x1","x2")]) %*% svdLin$v

Finally, we build the model on the new variables:

summary(pcMod <- lm(y ~ x1 + x2, data = newData))
## 
## Call:
## lm(formula = y ~ x1 + x2, data = newData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9474 -0.5991  0.0162  0.5517  1.7395 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.830      0.214   13.22  2.3e-10 ***
## x1             1.761      0.146   12.09  8.9e-10 ***
## x2           -21.376     39.647   -0.54      0.6    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.941 on 17 degrees of freedom
## Multiple R-squared:  0.896,  Adjusted R-squared:  0.884 
## F-statistic: 73.3 on 2 and 17 DF,  p-value: 4.38e-09

Now the new \(x_1\) is the original \(x_1\) multiplied by 0.7065 plus the original \(x_2\) multiplied by 0.7077, and the new \(x_2\) is the original \(x_1\) multiplied by -0.7077 plus the original \(x_2\) multiplied by 0.7065. We can solve for the coefficients in the original basis, call them \(b_1\) and \(b_2\), by solving the system of equations \[ 0.7065\, b_1 + 0.7077\, b_2 = 1.7613,\] \[ -0.7077\, b_1 + 0.7065\, b_2 = 0.\] Here the right-hand side of the second equation is set to zero because the coefficient on the new \(x_2\) was not significant.

The solution to this system gives a coefficient of 1.2444 for \(x_1\) in the original basis and 1.2465 for \(x_2\), or \[y = 1.2444\, x_1 + 1.2465\, x_2 + 2.8299.\] This model is much closer to the true model that we used to generate the data.
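
Equivalently, since \(V\) is orthogonal, the original-basis coefficients are just \(V\) times the principal-component coefficients. A minimal sketch of that back-transformation, reusing the pcMod and svdLin objects from above and zeroing the non-significant component as in the hand calculation:

#sketch: rotate the principal-component coefficients back to the original basis
pcCoefs <- coef(pcMod)[c("x1", "x2")]  #coefficients in the rotated basis
pcCoefs["x2"] <- 0                     #drop the non-significant component
svdLin$v %*% pcCoefs                   #b = V c; should match the hand-computed values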

How should we measure success?

Consider the following data from a course.

grades <- read.csv("PostExam2Grades.csv")
summary(grades)
##      Quiz.1         Exam.1          HW.1           HW.2      
##  Min.   : 0.0   Min.   :49.0   Min.   : 0.0   Min.   :  0.0  
##  1st Qu.: 3.0   1st Qu.:68.5   1st Qu.:29.8   1st Qu.: 59.5  
##  Median :11.0   Median :74.5   Median :38.5   Median : 77.5  
##  Mean   :10.0   Mean   :75.4   Mean   :35.0   Mean   : 71.8  
##  3rd Qu.:14.2   3rd Qu.:82.2   3rd Qu.:43.0   3rd Qu.: 90.0  
##  Max.   :20.0   Max.   :96.0   Max.   :55.0   Max.   :110.0  
##       HW.3           Quiz.2          Quiz.3           HW.4     
##  Min.   :  0.0   Min.   : 0.00   Min.   : 0.00   Min.   : 0.0  
##  1st Qu.: 67.5   1st Qu.: 7.50   1st Qu.: 5.00   1st Qu.:83.0  
##  Median : 89.0   Median :10.00   Median : 7.00   Median :90.0  
##  Mean   : 74.6   Mean   : 8.11   Mean   : 6.47   Mean   :80.9  
##  3rd Qu.: 98.0   3rd Qu.:10.00   3rd Qu.:10.00   3rd Qu.:90.0  
##  Max.   :100.0   Max.   :10.00   Max.   :10.00   Max.   :95.0  
##      Exam.2     
##  Min.   : 39.0  
##  1st Qu.: 74.8  
##  Median : 86.5  
##  Mean   : 84.5  
##  3rd Qu.: 95.0  
##  Max.   :115.0
ggplot(grades,aes(x=Exam.1,y=Exam.2, color = Quiz.1 + Quiz.2 + Quiz.3, size = HW.1 + HW.2 + HW.3 + HW.4)) + geom_point() + scale_colour_gradient2("Quiz Total",low="red",mid="orange",high="blue",midpoint=20) + scale_x_continuous("Exam 1 Score", lim=c(40,100)) + scale_y_continuous("Exam 2 Score") + scale_size(range=c(1,12), "Homework Total") 

(Figure: Exam 2 vs. Exam 1 scores, colored by quiz total and sized by homework total)

There are 9 total grades, grouped into 3 categories: homework, quizzes and exams. Let’s see how the different grades vary together:

ggpairs(grades)

(Figure: ggpairs plot of all nine grade columns)

Let’s throw a principal components analysis at this. We will use the built-in R function prcomp(), and scale the data because the grades are on different scales, e.g., quizzes are out of 10 and exams are out of 100.

print(gradePC <- prcomp(grades,scale=T))
## Standard deviations:
## [1] 2.0424 1.0808 1.0283 0.8683 0.7418 0.6328 0.6198 0.5367 0.4756
## 
## Rotation:
##           PC1     PC2      PC3      PC4      PC5       PC6     PC7
## Quiz.1 0.3215 -0.2183  0.44268 -0.07590  0.60799 -0.415984  0.1511
## Exam.1 0.3433 -0.3539  0.33747 -0.07707  0.04394  0.679980 -0.1107
## HW.1   0.3310  0.3706  0.42528 -0.18182 -0.25607  0.062867 -0.2030
## HW.2   0.3749  0.3767 -0.04686 -0.06509 -0.08394 -0.377252 -0.4672
## HW.3   0.3955  0.2741 -0.23545  0.14641 -0.08953  0.312574  0.1321
## Quiz.2 0.2937 -0.3609 -0.52117  0.16885  0.26490  0.006468 -0.5149
## Quiz.3 0.2620 -0.3657 -0.26255 -0.69879 -0.37759 -0.192117  0.2432
## HW.4   0.3617  0.3221 -0.30067  0.06354  0.29876  0.076804  0.5636
## Exam.2 0.2942 -0.3240  0.14388  0.63931 -0.49576 -0.278856  0.2148
##             PC8      PC9
## Quiz.1 -0.16210  0.23550
## Exam.1  0.40003 -0.02194
## HW.1   -0.50752 -0.41335
## HW.2    0.56776  0.14532
## HW.3   -0.31754  0.68382
## Quiz.2 -0.29522 -0.24669
## Quiz.3 -0.02433  0.03436
## HW.4    0.20757 -0.46386
## Exam.2  0.04591 -0.08524
summary(gradePC)
## Importance of components:
##                          PC1   PC2   PC3    PC4    PC5    PC6    PC7   PC8
## Standard deviation     2.042 1.081 1.028 0.8683 0.7418 0.6328 0.6198 0.537
## Proportion of Variance 0.463 0.130 0.117 0.0838 0.0611 0.0445 0.0427 0.032
## Cumulative Proportion  0.463 0.593 0.711 0.7945 0.8557 0.9002 0.9429 0.975
##                           PC9
## Standard deviation     0.4756
## Proportion of Variance 0.0251
## Cumulative Proportion  1.0000

Based on the principal components analysis, there are effectively only 5-7 independent "grades" hidden in the nine recorded ones: the first five components account for about 86% of the variance, and the first seven for about 94%. We can look at the rotation matrix to interpret the data. For instance, PC1 is a measure of a student's overall performance (it is a positively weighted sum of all the grades), so the higher this score, the better the student did overall.
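
One way to sanity-check this reading of PC1 (a minimal sketch; prcomp stores each student's component scores in gradePC$x) is to compare the PC1 scores with the total of the standardized grades; we would expect a strong positive correlation.

#sketch: PC1 scores should track overall (standardized) performance
overall <- rowSums(scale(grades))
cor(gradePC$x[, "PC1"], overall)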

PC2 is negative for exams and quizzes and positive for the homework assignments. Students with a high PC2 score did better on homework relative to exams and quizzes; a negative PC2 score indicates they rose to the occasion on quizzes and exams instead. The presence of a component like this indicates that a measurable part of the class performed differently on homework than on quizzes and exams.
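
A similar sketch for PC2: compare each student's PC2 score with a homework-minus-tests contrast built from the standardized grades (column names as in the summary above); a positive correlation would support this interpretation.

#sketch: PC2 should separate homework performance from quiz/exam performance
sg <- scale(grades)
hwVsTests <- rowMeans(sg[, c("HW.1", "HW.2", "HW.3", "HW.4")]) -
  rowMeans(sg[, c("Quiz.1", "Quiz.2", "Quiz.3", "Exam.1", "Exam.2")])
cor(gradePC$x[, "PC2"], hwVsTests)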

The other PCs can be interpreted similarly. The benefit of this kind of approach is that it makes the story the data is telling a bit clearer, because you have transformed to an uncorrelated basis; as we saw in the figures above, there are strong correlations in the original data.
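
The uncorrelated-basis claim is easy to verify directly: the correlation matrix of the component scores should be, up to numerical error, the identity. A minimal check:

#sketch: the principal component scores are mutually uncorrelated
round(cor(gradePC$x), 2)  #approximately the identity matrix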


