
The Basics

Introduction

Large, complex datasets such as the PISA dataset often have many different variables that can be used to predict an outcome. In scenarios like this it can be very difficult to work out which predictors are valuable to your analyses and which are not contributing much. Furthermore, multicollinearity may well be a problem if some of the variables are intercorrelated. Finally, if you are interested in looking for clusters in your data, it is not always clear which variables should be used in the cluster analysis. For these reasons, it may be necessary to reduce your data down to its most fundamental elements using Principal Component Analysis (PCA).

PCA is a statistical technique designed to reduce a large number of explanatory variables down to a smaller set of components that collectively explain most of the variance in the data. It is an unsupervised method that finds the directions of maximum variance, which can be thought of as axes in multidimensional space along which the datapoints are most spread out. Each of these axes is a principal component, and the components are all uncorrelated with each other. They are ordered by the amount of variance they capture: the first component contains the most, followed by the second, then the third, and so on.

This section will explain the key concepts of PCA using a simple 2-Dimensional example, before letting you try it out for yourself with a more complicated set of data.

Sample PISA Data

To illustrate the concept, we will choose a couple of plausibly related variables from the PISA dataset: Bullying and Discipline scores. These are standardised scores that indicate average levels of Bullying and Discipline within schools from 70 different countries across the world. For this tutorial, we have converted this data into a tibble format called bull_dis, which we will be using throughout the first two sections. To give you an idea of what this looks like, the first few countries in the data are shown below:

Country Bullying Discipline
Australia 1.0270260 -1.1297133
Austria -0.4584654 0.5749536
Belgium -0.8062746 -1.0716304
Canada 0.3172649 -0.8320444
Chile -0.3736581 -0.8233760
Colombia 0.2870083 -0.0723921

Visualising the Data

Once again, we will start by visualising our data with a scatterplot so that we become familiar with the datapoints. The plots in this section are interactive thanks to the ggplotly function, and you can hover over the points to view the country name.

# plot the raw Bullying and Discipline data
ggplotly(
  ggplot(bull_dis, aes(Bullying, Discipline, group = Country)) +
  geom_point(colour = "black") +
  xlab('Prevalence of Bullying') +
  ylab('Level of Discipline') +
  theme_minimal() +
  theme(legend.position='none'),
tooltip = "group")

 

Scaling the Variables

When we run a PCA, we need to make sure that all of the variables are scaled so that no single variable has an unfair weighting on the direction of maximum variance. Scaling - also known as standardising or z-scoring - involves expressing the data in standard deviation units. The prcomp() function in R can do this for you when you run your PCA via the scale. argument; however, if we want to visualise the components on the axes of the original variables, we must first scale them ourselves. If we use the scale() function on a column with default arguments, it will subtract the column mean from each score and divide the result by the standard deviation, giving us a z-score. This essentially expresses each value in terms of “number of standard deviations from the mean”.
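As a quick illustration of what scale() does, here is a minimal sketch using a small made-up vector (not the PISA data); the object x is purely for demonstration:

# a made-up vector for demonstration
x <- c(2, 4, 6, 8, 10)

# scale() with default arguments subtracts the mean and divides by the SD
as.numeric(scale(x))

# the equivalent manual z-score calculation gives the same result
(x - mean(x)) / sd(x)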

HOWEVER, our Bullying and Discipline variables are already z-scored, so we should not scale them again as this will alter the values. Likewise, we should not set the scale. argument when running our PCA.

Running the PCA

The number of components is equal to the number of dimensions in the data. In the sample PISA data we are using here, that means 2 components for 2 variables.

The first component (PC1) is the axis with the direction of maximum variance and the second component (PC2) is the axis that is directly perpendicular to the first component. Between them they will explain 100% of the variance, but the first component will explain the greatest amount.

#run the initial PCA analysis using the prcomp() function
pca1 <- prcomp(bull_dis[c(2,3)])

#view the summary
summary(pca1)
## Importance of components:
##                           PC1    PC2
## Standard deviation     1.0334 0.8534
## Proportion of Variance 0.5945 0.4055
## Cumulative Proportion  0.5945 1.0000

We can see here that PC1 explains about 59% of the variance and PC2 about 41%, with the total equaling 100%. This is indicated by the Proportion of Variance and Cumulative Proportion values.

Each component also has a direction and a magnitude, known as the eigenvector and eigenvalue respectively. The eigenvalue expresses the amount of variance along the component’s axis, so we can work it out by squaring the standard deviation of the component. In this case, the eigenvalue is approximately 1.068 for PC1 and 0.728 for PC2.
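These quantities can be pulled straight out of the prcomp() output; the following is a minimal sketch, assuming the pca1 object created above (the eigenvalues object name is just for illustration):

# eigenvalues: the variance along each component axis
eigenvalues <- pca1$sdev^2
round(eigenvalues, 3)

# proportion of variance explained by each component, matching summary(pca1)
round(eigenvalues / sum(eigenvalues), 4)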

Visualising the Components

These values can be difficult to interpret on their own, so plotting lines on top of the data using the eigenvectors and eigenvalues helps us to visualise how the components are made up.

# visualise the PCA lines in the original scatterplot
ggplotly(
  ggplot(bull_dis, aes(Bullying, Discipline, group = Country)) +
  geom_point(colour = "black") +

  #axis labels
  xlab('Prevalence of Bullying') +
  ylab('Level of Discipline') +
  theme_minimal() +
  theme(legend.position='none') +
  
  #add the pca component lines using the output values from the analysis
  geom_segment(aes(x=0, y=0, xend=pca1$sdev[1]^2*pca1$rotation[1,1], yend=pca1$sdev[1]^2*pca1$rotation[2,1]),
               color = 'dodgerblue') +
  geom_segment(aes(x=0, y=0, xend=pca1$sdev[2]^2*pca1$rotation[1,2], yend=pca1$sdev[2]^2*pca1$rotation[2,2]),
               color = 'dodgerblue'), 

  tooltip = "Country")

 

Here the longer line indicates the direction and magnitude of PC1, and the shorter line does the same for PC2. This allows us to see that they are indeed orthogonal, and we gain a clearer picture of how these abstract components relate to the original variables.
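We can also confirm the orthogonality numerically: the component scores stored in pca1$x should be essentially uncorrelated. A minimal check, assuming pca1 from above:

# correlation between the PC1 and PC2 scores should be approximately zero
round(cor(pca1$x[, 1], pca1$x[, 2]), 10)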

Loading Matrices

In order to see how our original variables load onto the principal components, we can view the loading matrix from the analysis. Loadings closer to ±1 mean that the variable is captured more by that particular component.

# print the loading matrix for pca1
pca1$rotation
##                   PC1        PC2
## Bullying   -0.8504578 -0.5260433
## Discipline  0.5260433 -0.8504578
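To see concretely what this matrix represents: prcomp() centres the data by default, and projecting the centred data onto these loading vectors (the columns of rotation) reproduces the component scores stored in pca1$x. A minimal sketch, assuming pca1 from above (the centred and manual_scores names are just for illustration):

# centre the original variables (prcomp centres, but does not scale, by default)
centred <- scale(as.matrix(bull_dis[, c("Bullying", "Discipline")]),
                 center = TRUE, scale = FALSE)

# multiplying by the loading (rotation) matrix reproduces the PC scores
manual_scores <- centred %*% pca1$rotation
all.equal(unname(manual_scores), unname(pca1$x))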

With just 2 variables this does not reveal much information to us, but it will become more useful as we start to work with multidimensional data later on. Let’s see how we can visualise this information to allow for easier interpretation.

Biplots

A biplot is a useful visualisation for seeing how the original variables load onto the principal components. The directions of the lines indicate how each variable aligns with each component: a horizontal line suggests a strong association with PC1, whereas a vertical line suggests a strong association with PC2.

Biplot and screeplot functions were obtained from https://github.com/vqv/ggbiplot. Some very minor modifications were made to customise axis colour and rename “group” to “Country” for labeling purposes.

#plot biplot using ggbiplot function
ggplotly(
  ggbiplot(pca1, alpha = 0.3, obs.scale = 1, 
          varname.size = 3.5, groups = bull_dis$Country) +
  theme(legend.position='none') +
  #ggbiplot defaults to giving each item in "groups" a different colour  
  #we can make points the same color when grouping by Country for ggplotly
  scale_color_manual(values = rep("black", length(bull_dis$Country))),
  #now we can display our tooltip label and keep all points the same colour! 
  tooltip = "Country")

 

Outliers

Re-Examine the Scatterplot

You may have noticed that some datapoints appear to be quite far away from the rest. PCA is highly sensitive to outliers, so it is important that any outliers are identified and addressed.

 

By examining the plot from the previous section, it appears that Brunei and the Philippines are potential outliers. There are numerous ways to determine whether a datapoint is indeed an outlier.

Z-Scores

One of the simplest tests is to see whether a datapoint passes a z-score threshold. This is typically set at a minimum of 2.5 standard deviations from the mean, but the threshold can be higher depending on how many potential outliers we have and how much we think they will affect the analyses. Let’s have a look at our data and isolate any values above 2.5 to start with (strictly, a z-score test should consider both tails, i.e. values beyond ±2.5, but we will keep things simple here).

pca_outliers <- bull_dis %>%
  filter(Bullying > 2.5 | Discipline > 2.5) 
Country Bullying Discipline
Brunei Darussalam 2.888896 -0.5015575
Philippines 4.662757 -1.1341054

We can see that there are a couple of outliers on the Bullying dimension, specifically Brunei and the Philippines. There are no outliers on the Discipline dimension.

However, when using PCA we will almost always be working with more than 2 variables. In these scenarios, distances between values in Euclidean space do not necessarily provide an accurate picture of whether a datapoint is an outlier. Let’s try another method to test for multivariate outliers using the PCA that we have run.

Mahalanobis Distance

Mahalanobis distance is an extension of the univariate z-score that we employed previously: rather than basing scores on just one variable, it takes the correlation structure between all of the variables into account. Datapoints are flagged as outliers when their distance exceeds a cut-off based on the 0.975 quantile of a chi-squared distribution, which relies on the assumption that the multivariate data are normally distributed. You can find more information on the mathematics behind Mahalanobis distance here: https://www.machinelearningplus.com/statistics/mahalanobis-distance.
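For intuition, the same idea can be sketched with base R’s mahalanobis() and qchisq() functions, applied here directly to the two original variables. This is only a minimal illustration of the classical distance and cut-off, not a replacement for the Moutlier() approach used below; the d2 object name is just for illustration:

# squared Mahalanobis distance of each country from the multivariate mean
d2 <- mahalanobis(bull_dis[, c("Bullying", "Discipline")],
                  center = colMeans(bull_dis[, c("Bullying", "Discipline")]),
                  cov = cov(bull_dis[, c("Bullying", "Discipline")]))

# flag countries whose squared distance exceeds the 0.975 chi-squared quantile
# (degrees of freedom = number of variables, here 2)
bull_dis$Country[d2 > qchisq(0.975, df = 2)]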

Let’s use the Moutlier() function from the chemometrics package to determine any outliers using Mahalanobis distance.

#run an outlier test using Mahalanobis distance
#quantile refers to the cut-off point, which defaults to the 0.975 quantile of a chi-squared distribution 

maha <- Moutlier(pca1$x, quantile = 0.975, plot = FALSE)

Now let’s plot the results. First, we need to make a table containing all of the relevant information from our outlier analysis.

#make table with outlier data
maha_table <- mutate(bull_dis, 
                MD = maha$md, 
                Outlier = ifelse(MD > maha$cutoff, "Yes", "No"),
                PC1 = pca1$x[,1],
                PC2 = pca1$x[,2])
Country Bullying Discipline MD Outlier PC1 PC2
Australia 1.0270260 -1.1297133 1.4024002 No -1.3994606 0.3109972
Austria -0.4584654 0.5749536 0.8467828 No 0.7606157 -0.3573173
Belgium -0.8062746 -1.0716304 1.4483129 No 0.1902382 1.2259957
Canada 0.3172649 -0.8320444 0.7987180 No -0.6392521 0.4312074
Chile -0.3736581 -0.8233760 0.9236362 No -0.0470913 0.7872907
Colombia 0.2870083 -0.0723921 0.3117384 No -0.2139101 -0.1989285

Now we can use this table to make a scatterplot with different colours for outliers and non-outliers.

#plot results
ggplotly(
  
  # grouping by country enables the tooltip to display country names in ggplotly
  ggplot(maha_table, aes(PC1, PC2, group = Country)) +
  # colour the outlier points so that they can be seen
  geom_point(aes(colour = Outlier)) +
  scale_colour_manual(values = c("black", "dodgerblue")) +
  theme_minimal() +
  # remove the default ggplotly legend
  theme(legend.position = "none"),
  
  # make the tooltip display country name and outlier status 
  tooltip = c("Country", "Outlier")
  )

 

Here we can see that Brunei and the Philippines have again been identified as outliers (blue dots), this time along with Kazakhstan, further suggesting that they may need to be addressed.

Removing the Outliers

There are many different ways to deal with outliers, but the most straightforward is to simply remove them. However, it is important to try to find a theoretical justification for doing this. In general, you should not remove outliers unless you suspect that there is an extraneous factor responsible for their extremity. For demonstration purposes, let’s assume this is the case and look at our results with the outlier countries excluded.

# remove outliers from data
no_outliers <- bull_dis %>%
  filter(Country != "Brunei Darussalam" &
           Country != "Kazakhstan" &
           Country != "Philippines")

# run pca
pca2 <- prcomp(no_outliers[2:3])

# print loadings
pca2$rotation
##                   PC1       PC2
## Bullying    0.3222180 0.9466655
## Discipline -0.9466655 0.3222180

The loadings are quite different to those we observed from the first PCA with the outliers included. But did this make a difference to the proportion of variance explained by each one?

# print variance explained
summary(pca2)
## Importance of components:
##                           PC1    PC2
## Standard deviation     0.8744 0.7244
## Proportion of Variance 0.5930 0.4069
## Cumulative Proportion  0.5930 1.0000

Although removing outliers changed the loadings quite a bit, it did not seem to make any major difference to the variance captured by each component. However, in some cases you will find that removing the outliers leads to a greater proportion of variance captured by a smaller number of components, which is usually desirable.

First Exercise!

Now it is your turn to visualise the data without the outliers. Use the ggbiplot function to create a representation of how the original variables load onto the components, and make it interactive with ggplotly. If you get stuck, use the hints to help or refer back to the previous section.

#create biplot
ggplotly(
  ggbiplot(data = ___, alpha = 0.3, groups = ___, color = "black", obs.scale = 1, varname.size = 4) +
  scale_color_manual(values = ___) +
  theme(legend.position = 'none'),

  tooltip = ___)
## DATA ##
# we want to plot the loadings from pca2
## DATA ##
# we want to plot the loadings from pca2

## GROUPS ## 
# if we group by country (stored in 'no_outliers'), we can display "Country" name
# on the ggplotly tooltip
## DATA ##
# we want to plot the loadings from pca2

## GROUPS & TOOLTIP ## 
# if we group by country (stored in 'no_outliers'), we can display "Country" 
# name on the ggplotly tooltip 

## POINT COLOUR ##
# since ggbiplot defaults to different colours for each 'group', we need to 
# manually set the colours to all black

# WARNING: next hint shows full solution
#create biplot
ggplotly(
  ggbiplot(pca2, alpha = 0.3, groups = no_outliers$Country, color = "black", obs.scale = 1, varname.size = 4) +
  scale_color_manual(values = rep("black", length(no_outliers$Country))) +
  theme(legend.position='none'),

  tooltip = "Country")

Nice! The lines do seem to have different orientations now, which shows us that removing the outliers changed the way that the original variables load onto the components. Remember to make sure that there is a proper theoretical justification before you remove outliers in real data (e.g. a country recovering from a major disaster). Now that we know a bit more about PCA, let’s move on to a more realistic multivariate example!

 

 

Selecting Components

 

A More Realistic Example

In the real world, PCA is used to reduce the dimensionality of data with many different variables - that is what it is designed for. When we have many variables, it is not so easy to conceptualise the resulting components, so it is important to apply the principles we have covered when it comes to component selection, outlier removal and interpretation. Below is a more complex set of data that needs to be reduced: the 7 variables relating to teachers within the PISA dataset. We want to reduce these down to the most fundamental components that explain most of the variance. Again, we have converted this into a tibble called teach_dat, which we will use throughout this section.

Country StudentTeacherRatio TeacherMasters TeacherBehaviour TeacherFeedback TeacherEngagment TeacherInterest TeacherSupport
Australia -0.0712881 -0.0678425 0.6094465 1.1355192 0.1540272 0.5320862 0.3500998
Belgium -0.9216817 -0.9057254 1.1798267 -1.7840853 -1.3490851 -0.3727032 -1.2767625
Chile 1.0724185 1.0555939 1.2706896 0.2263485 0.5063455 1.0809846 0.9033412
Colombia 2.2459838 2.2201889 0.5438208 0.5861280 0.2530077 1.1684291 0.6489175
Czech Republic -0.1783992 -0.1790092 -1.0070867 -1.4865771 -1.4756825 -1.8298620 -1.4459510
Denmark -0.1298961 -0.1496363 -0.7612653 0.0803892 -0.1760991 0.2768527 -0.2182383

Now it’s your turn. Using the data in teach_dat, run a principal component analysis on the 7 variables and assign it to a variable called pca3. Remember that the first column in teach_dat contains country names, so you will need to exclude it. If you get an error that doesn’t make sense, you can view the solution for help.

# run the multivariate PCA analysis and call it 'pca3'
pca3 <- 
# run the multivariate PCA analysis and call it 'pca3'
pca3 <- prcomp(___)
# run the multivariate PCA analysis and call it 'pca3'
pca3 <- prcomp(___)

## DATA ##
# the data is contained in columns 2:8 of 'teach_dat'

# WARNING: next hint shows full solution
# run the multivariate PCA analysis 
pca3 <- prcomp(teach_dat[2:8])

Now that you have run your PCA, it is useful to have a look at the loading matrix to see how the variables load onto the components. Use the $ operator to bring up rotation from within pca3.

# view the loading matrix from pca3
# view the loading matrix from pca3 
pca3$rotation

The loading matrix is much more interesting now, as we can see more clearly how certain components capture the variance within specific variables. For example, PC3 loads very highly on TeacherBehaviour only, suggesting that this component is mostly made up of this variable.
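One quick way to scan a larger loading matrix is to ask, for each component, which variable has the largest absolute loading. A minimal sketch, assuming the pca3 object from the exercise above:

# for each component, find the variable with the largest absolute loading
apply(abs(pca3$rotation), 2, function(loads) names(which.max(loads)))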

Now it is your turn to visualise the data. Let’s use another biplot to show the variables in relation to PC1 and PC2. For this exercise, let’s use an alpha level of 0.3 and make all points and lines black. The obs.scale, lab_alpha and varname.size inputs have already been set for visual purposes. If you get stuck, the first hint will provide some structure to the code for you to fill in the blanks. You can also refer back to the last section!

# create a biplot of the first two principal components 
# from pca3 using the ggbiplot function and
# make it interactive using the ggplotly function

ggplotly(
  ggbiplot(
  ),
)
ggplotly(
  ggbiplot(data = ___, alpha = ___, groups = ___,  
           lab_alpha = 0, obs.scale = 1) +
  theme(legend.position = ___) +
  scale_color_manual(values = ___),
tooltip = ___
)
ggplotly(
  ggbiplot(data = ___, alpha = ___, groups = ___,  
           lab_alpha = 0, obs.scale = 1) +
  theme(legend.position = ___) +
  scale_color_manual(values = ___),
tooltip = ___
)

## DATA ##
# The loadings of each variable onto each component are stored in pca3
ggplotly(
  ggbiplot(data = pca3, alpha = ___, groups = ___,  
           lab_alpha = 0, obs.scale = 1) +
  theme(legend.position = ___) +
  scale_color_manual(values = ___),
tooltip = ___
)

## DATA ##
# The loadings of each variable onto each component are stored in pca3

## GROUPS ##
# We want our groups argument to reflect all of the individual countries
# in our plot. This allows our ggplotly tooltip to display the country names.
# Where is this information stored?
ggplotly(
  ggbiplot(data = pca3, alpha = 0.3, groups = teach_dat$Country,  
           lab_alpha = 0, obs.scale = 1) +
  theme(legend.position = ___) +
  scale_color_manual(values = ___),
tooltip = "Country"
)

## DATA ##
# The loadings of each variable onto each component are stored in pca3

## GROUPS ##
# We want our groups argument to reflect all of the individual countries
# in our plot. This allows our ggplotly tooltip to display the country names.
# Where is this information stored?

## LEGEND ##
# ggbiplot() brings up a legend for the groups argument. In this case,
# we don't want to show this as our groups reflect every individual point. 
# How can we make the legend not appear?
ggplotly(
  ggbiplot(data = pca3, alpha = 0.3, groups = teach_dat$Country,  
           lab_alpha = 0, obs.scale = 1) +
  theme(legend.position = 'none') +
  scale_color_manual(values = ___),
tooltip = ___
)

## DATA ##
# The loadings of each variable onto each component are stored in pca3

## GROUPS ##
# We want our groups argument to reflect all of the individual countries
# in our plot. This allows our ggplotly tooltip to display the country names.
# Where is this information stored?

## LEGEND ##
# ggbiplot() brings up a legend for the groups argument. In this case,
# we don't want to show this as our groups reflect every individual point. 
# How can we make the legend not appear?

## COLOURS ##
# because ggbiplot assigns different colours to each item in the group by 
# default, we need to alter this so that all of the points are the same

#WARNING: next hint shows full solution
# create a biplot of the first two principal components 
# from pca3 using the ggbiplot function and
# make it interactive using the ggplotly function
ggplotly(
  ggbiplot(pca3, alpha = 0.3, groups = teach_dat$Country, 
           lab_alpha = 0, obs.scale = 1) +
  theme(legend.position = 'none') +
  scale_color_manual(values = rep("black", length(teach_dat$Country))),
tooltip = "Country"
)

Though it is difficult to see since some of the variables are very similarly aligned on these dimensions, you can view the country and variable names by hovering over them thanks to the ggplotly() function. This gives us an idea of how our variables make up our first two components, but it is a bit harder to visualise since we are viewing all 7 variables and only the first 2 components.

Choosing the Number of Components

There is no exact science to selecting the number of components to retain, and the appropriate decision will likely depend on the circumstances. However, there are several methods that can help with this decision that we will go over in this section. Before we look at some of these methods, let’s check the variance explained by our components. Use the summary() function to view this.

#print variance explained
#print variance explained
summary(pca3)

If we look at the row called Cumulative Proportion, we can see that we can capture a high amount of the variance in the data with fewer than 7 components. But how do we decide what the optimal number is?

Scree Plot

A scree plot can visualise either the proportion of variance explained by each component or the cumulative variance explained as the number of components increases. Just like with cluster analysis, we can select the number of components by looking for an “elbow” in the scree plot.

Let’s start with the proportion explained per component, using the ggscreeplot function from the ggbiplot repository linked earlier.

ggscreeplot(pca3, 'pev')

 

There seems to be a distinct flattening of the curve after 4 components, suggesting that each component after PC4 is not adding very much. Now let’s look at the cumulative proportion explained to compare. Using the same code as above, create a screeplot and change ‘pev’ to ‘cev’.

#create screeplot for cumulative proportion of explained variance
#change 'pev' to 'cev' from the screeplot code above
#create screeplot for cumulative proportion of explained variance
ggscreeplot(pca3, 'cev')

The flattening here is not as distinct, and both 3 and 4 components seem to be reasonable. Taking both plots into account, it would seem that 4 components is a justifiable cut-off. Let’s look at another method to see if we get a similar indication.

Kaiser’s Stopping Rule

Kaiser’s Stopping Rule is another straightforward way of judging the number of components: remove any component with an eigenvalue of less than 1. Remember from earlier that we can obtain eigenvalues by squaring the standard deviation of the component; let’s find the eigenvalues for our components and retain everything above 1.

#make a table showing all eigenvalues
#include 2 columns: "PC" and "Eigenvalue"
eigentable <- tibble("PC" = ___, "Eigenvalue" = ___) %>%
  #retain only components with eigenvalues greater than 1
  filter(___)

print(eigentable)
## COLUMNS ##
# the 'PC' column should contain the numbers 1:7
## COLUMNS ##
# the 'PC' column should contain the numbers 1:7
# the standard deviation of each component can be accessed with pca3$sdev
## COLUMNS ##
# the 'PC' column should contain the numbers 1:7
# the standard deviation of each component can be accessed with pca3$sdev
# the eigenvalue formula is standard deviation squared
## COLUMNS ##
# the 'PC' column should contain the numbers 1:7
# the standard deviation of each component can be accessed with pca3$sdev
# the eigenvalue formula is standard deviation squared

## FILTER ##
# Retain only rows with an eigenvalue greater than 1

# WARNING: next hint shows full solution
#make a table showing all eigenvalues 
eigentable <- tibble("PC" = c(1:7), "Eigenvalue" = pca3$sdev^2) %>%
  #retain all eigenvalues that are greater than 1
  filter(Eigenvalue>1)

print(eigentable)

Here we can see that only PC1 and PC2 have eigenvalues greater than 1, suggesting a different number to retain compared to the scree plots. This could perhaps indicate that 3 is a happy medium, but again there is no exact science to selecting the number of components and the analyst must use their discretion.

Cumulative Explained Variance Threshold

The final method we will cover here is a simple predetermined cut-off for the cumulative explained variance. This involves deciding how much variance you want to retain with your components and basing your retention on this. A common value to choose is 90%, which in this case would require 4 components. The cut-off can be adjusted, as you may have practical reasons to want to retain more variance or to use fewer components.
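A minimal sketch of applying such a threshold programmatically, assuming pca3 from above and a 90% target (the cum_var name is just for illustration):

# cumulative proportion of variance explained by the components
cum_var <- cumsum(pca3$sdev^2) / sum(pca3$sdev^2)

# smallest number of components whose cumulative proportion reaches 90%
which(cum_var >= 0.90)[1]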

It is clear that there is justification for a range of component numbers, and when choosing it is important to keep in mind the aims and context of your research. In the clustering section you will have an opportunity to see the effect of retaining different numbers of components when they are applied to further analyses.

Check for Outliers

Before we go any further, let’s quickly check for outliers in our data using the Mahalanobis distance method we covered in the last example. This time it is your turn to implement the code! Let’s do it step-by-step. Refer back if you need a reminder of the steps, or access the hints if you are stuck.

#run an outlier test using Mahalanobis distance
maha2 <- 
maha2 <- Moutlier(___, quantile = ___, plot = ___)
maha2 <- Moutlier(___, quantile = ___, plot = ___)

## DATA ##
# the data is contained in the x variable of pca3
maha2 <- Moutlier(pca3$x, quantile = ___, plot = ___)

## DATA ##
# the data is contained in the x variable of pca3

## QUANTILE ##
# the standard cut-off for outliers is the 0.975th quantile
maha2 <- Moutlier(pca3$x, quantile = 0.975, plot = ___)

## DATA ##
# the data is contained in the x variable of pca3

## QUANTILE ##
# the standard cut-off for outliers is the 0.975th quantile

## PLOT ##
# we don't want to plot the results, so we can set plot to FALSE

#WARNING: next hint shows full solution
#run an outlier test using Mahalanobis distance
maha2 <- Moutlier(pca3$x, quantile = 0.975, plot = FALSE)

The next step is to put our results into a dataframe and determine which countries exceed the outlier cut-off value. Let’s include a column with the Mahalanobis distance, a “Yes” or “No” column indicating whether the country is an outlier, and the scores of each country on PC1 and PC2.

# add the results to the teach_dat dataframe
maha_table2 <- mutate(
                .data = ___, 
                MD = ___, 
                Outlier = ___,
                PC1 = ___,
                PC2 = ___) 
## DATA ##
# we are adding the columns to teach_dat
## DATA ##
# we are adding the columns to teach_dat

## Mahalanobis Distance ##
# this is stored as 'md' within the maha2 variable
## DATA ##
# we are adding the columns to teach_dat

## Mahalanobis Distance ##
# this is stored as 'md' within the maha2 variable

## Outlier ##
# if MD is greater than the cut-off value (stored in maha2 as 'cutoff'),
# this should contain "Yes", otherwise it should contain "No"
## DATA ##
# we are adding the columns to teach_dat

## MAHALANOBIS DISTANCE ##
# this is stored as 'md' within the maha2 variable

## OUTLIERS ##
# if MD is greater than the cut-off value (stored in maha2 as 'cutoff'),
# this should contain "Yes", otherwise it should contain "No"

## PRINCIPAL COMPONENTS ##
# The scores for PC1 and PC2 are contained in columns 1 and 2 of pca3$x

# WARNING: next hint shows full solution
# add the results to the teach_dat dataframe
maha_table2 <- mutate(teach_dat, 
                MD = maha2$md, 
                Outlier = ifelse(MD > maha2$cutoff, "Yes", "No"),
                PC1 = pca3$x[,1],
                PC2 = pca3$x[,2])

Now let’s make a table showing just the outlier countries.

# make a table showing only the outliers
teach_outliers <- maha_table2 %>%
  # remove all non-outliers

  # remove all columns bar Country and Outlier
  
  # print table
## FILTER ##
# the filter() function can be used to retain rows that fit certain criteria.
# in this case, we can use the outlier column to find rows containing "Yes"
## FILTER ##
# the filter() function can be used to retain rows that fit certain criteria.
# in this case, we can use the outlier column to find rows containing "Yes"

## SELECT ##
# the select() function can be used to retain the desired columns 'Country' 
# and 'Outlier'

# WARNING: next hint shows full solution
# make a table showing only the outliers
teach_outliers <- maha_table2 %>%
  # remove all non-outliers
  filter(Outlier == "Yes") %>%
  # remove all columns bar Country and Outlier
  select(Country, Outlier) %>%
  # print table
  print()

Before we decide what to do with the outliers, let’s visualise them on a scatterplot. Refer back to the previous section or check the hints for guidance.

# create a scatterplot showing outliers in "dodgerblue" colour
ggplotly(
  ggplot(
  ),
)
ggplotly(
  ggplot(data = ___, aes(x = ___, y = ___, group = ___)) +
  # make a scatterplot with different colours for outliers and non-outliers 
  geom_point(aes(colour = ___)) +
  # select colours for points and outliers
  scale_color_manual(values = c("black", "dodgerblue")) +
  theme_minimal() +
  # hide the plotly legend for neatness
  theme(legend.position = ___),
  
  tooltip = ___)
ggplotly(
  ggplot(data = ___, aes(x = ___, y = ___, group = ___)) +
  # make a scatterplot with different colours for outliers and non-outliers 
  geom_point(aes(colour = ___)) +
  # select colours for points and outliers
  scale_color_manual(values = c("black", "dodgerblue")) +
  theme_minimal() +
  # hide the plotly legend for neatness
  theme(legend.position = ___),
  
  tooltip = ___)

## DATA ##
# we are plotting data from maha_table2
ggplotly(
  ggplot(data = maha_table2, aes(x = ___, y = ___, group = ___)) +
  # make a scatterplot with different colours for outliers and non-outliers 
  geom_point(aes(colour = ___)) +
  # select colours for points and outliers
  scale_color_manual(values = c("black", "dodgerblue")) +
  theme_minimal() +
  # hide the plotly legend for neatness
  theme(legend.position = ___),
  
  tooltip = ___)

## DATA ##
# we are plotting data from maha_table2

## X & Y ##
# we are plotting PC1 against PC2
ggplotly(
  ggplot(data = maha_table2, aes(x = PC1, y = PC2, group = ___)) +
  # make a scatterplot with different colours for outliers and non-outliers 
  geom_point(aes(colour = ___)) +
  # select colours for points and outliers
  scale_color_manual(values = c("black", "dodgerblue")) +
  theme_minimal() +
  # hide the plotly legend for neatness
  theme(legend.position = ___),
  
  tooltip = ___)

## DATA ##
# we are plotting data from maha_table2

## X & Y ##
# we are plotting PC1 against PC2

## GROUP & TOOLTIP ##
# we want to group datapoints by Country so that we can display country names on
# the tooltip
ggplotly(
  ggplot(data = maha_table2, aes(x = PC1, y = PC2, group = Country)) +
  # make a scatterplot with different colours for outliers and non-outliers 
  geom_point(aes(colour = ___)) +
  # select colours for points and outliers
  scale_color_manual(values = c("black", "dodgerblue")) +
  theme_minimal() +
  # hide the plotly legend for neatness
  theme(legend.position = ___),
  
  tooltip = ___)

## DATA ##
# we are plotting data from maha_table2

## X & Y ##
# we are plotting PC1 against PC2

## GROUP & TOOLTIP ##
# we want to group datapoints by Country so that we can display country names on
# the tooltip

## COLOUR ##
# we need to split the colour of the points by whether or not they are outliers
ggplotly(
  ggplot(data = maha_table2, aes(x = PC1, y = PC2, group = Country)) +
  # make a scatterplot with different colours for outliers and non-outliers 
  geom_point(aes(colour = Outlier)) +
  # select colours for points and outliers
  scale_color_manual(values = c("black", "dodgerblue")) +
  theme_minimal() +
  # hide the plotly legend for neatness
  theme(legend.position = ___),
  
  tooltip = c("Country", "Outlier"))

## DATA ##
# we are plotting data from maha_table2

## X & Y ##
# we are plotting PC1 against PC2

## GROUP & TOOLTIP ##
# we want to group datapoints by Country so that we can display country names on
# the tooltip

## COLOUR ##
# we need to split the colour of the points by whether or not they are outliers

## LEGEND ##
# as before, we need to set the legend.position to "none" 

# WARNING: next hint shows full solution
ggplotly(
  ggplot(maha_table2, aes(x = PC1, y = PC2, group = Country)) +
  geom_point(aes(colour = Outlier)) +
  scale_color_manual(values = c("black", "dodgerblue")) +
  theme_minimal() +
  theme(legend.position = "none"),
  
  tooltip = c("Country", "Outlier"))

Those don’t really look like outliers, do they? Since we now have 7 dimensions rather than 2, we can only ever view 2 of them at a time, so the flagged countries will not necessarily look extreme on any single pair of components. In your own analysis, you will have to decide on your outlier criteria and make a judgement based on the tests you run; inspecting other pairs of components, as sketched below, can also help. For now, let’s check what happens if we remove the flagged countries.
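Here is a minimal sketch of viewing a different pair of components - PC3 against PC4 - assuming the maha_table2 and pca3 objects from above (the maha_table3 name is just for illustration):

# add the PC3 and PC4 scores so that we can plot a different pair of components
maha_table3 <- mutate(maha_table2,
                PC3 = pca3$x[, 3],
                PC4 = pca3$x[, 4])

ggplotly(
  ggplot(maha_table3, aes(x = PC3, y = PC4, group = Country)) +
  geom_point(aes(colour = Outlier)) +
  scale_colour_manual(values = c("black", "dodgerblue")) +
  theme_minimal() +
  theme(legend.position = "none"),

  tooltip = c("Country", "Outlier"))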

# remove outliers
teach_dat_no_outlier <- teach_dat %>%
  filter(Country != "United States" & 
           Country != "Albania" & 
           Country != "Kazakhstan" &
           Country != "Panama")

# run the multivariate PCA analysis using the prcomp() function
pca4 <- prcomp(teach_dat_no_outlier[2:8])

# view the proportion of variance explained
summary(pca4)
## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6     PC7
## Standard deviation     1.8335 1.2532 0.8798 0.63102 0.5804 0.42093 0.01414
## Proportion of Variance 0.5079 0.2373 0.1170 0.06016 0.0509 0.02677 0.00003
## Cumulative Proportion  0.5079 0.7452 0.8621 0.92230 0.9732 0.99997 1.00000

It seems that removing the outliers from the Mahalanobis distance test slightly increased the amount of variance explained by PC1, but did not make much difference to the others. The decision regarding what to do with them will depend on whether there is a good justification to remove them, and it might be worthwhile to run some further analyses. At the end of the next section, we will be applying our PCA to a cluster analysis, and this will allow us to compare the consequences of retaining or excluding our outliers in a practical setting.
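As one such quick check, you could compare the cumulative proportion of variance explained with and without the flagged countries side by side; a minimal sketch, assuming pca3 and pca4 from above:

# cumulative proportion of variance explained, with and without the outliers
round(rbind(with_outliers    = cumsum(pca3$sdev^2) / sum(pca3$sdev^2),
            without_outliers = cumsum(pca4$sdev^2) / sum(pca4$sdev^2)), 3)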

 

WELL DONE! You have completed the PCA section of the tutorial! In the next section you will learn about different methods for clustering data, and how this can be combined with PCA to group countries based on many abstract variables. Keep up the good work!

 

 

Principal Component Analysis

Sean Westwood