## R Biplot Example Csv

## Introduction

We will consider principal components analysis (PCA) and multidimensional scaling (MDS) as examples of multivariate dimension reduction. Both techniques are included in the base R installation, respectively as **prcomp** and **cmdscale**. We will also use the (best practice) graphics package **ggplot2** for our plots.

We will use the **Guerry_85** file that contains observations on socio-economic characteristics for the 85 French departments in 1830.

Note: this is written with R beginners in mind; more seasoned R users can probably skip most of the comments on data structures and other R particulars. Also, as always in R, there are typically several ways to achieve a specific objective. What is shown here is just one way that works, and there often are others (that may even work faster, or scale better).

Items covered:

- scaling a multivariate data set (i.e., standardizing to mean zero and variance one) using **scale**
- computing principal components using **prcomp**
- extracting loadings, scores and proportion explained variance
- creating a **scree** plot to assess the proportion variance explained and to select the number of meaningful components
- using **ggplot2** to create a scatter plot with meaningful labels
- creating a **biplot** to interpret the relative contribution of two PC
- computing multivariate distance using **dist**
- carrying out multidimensional scaling using **cmdscale**
- plotting the results of multi-dimensional scaling using **ggplot2**

Packages used:

- **foreign**
- **ggplot2**

## Preliminaries

As is customary by now, we start by installing and loading the required packages, reading in the data, and providing a **summary** of the data. The input file is **Guerry_85.dbf** and we use **read.dbf** (from the **foreign** package) to turn it into an R data frame (your contents of the Guerry file may be slightly different).
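The original code for this step is not reproduced here, so the following is only a minimal sketch. Because the Guerry file may not be at hand, it first writes a small simulated stand-in to a temporary dbf file and then reads it back; the file name `guerry_standin.dbf` and the column `var2` are hypothetical.

```r
# install.packages(c("foreign", "ggplot2"))  # run once if the packages are missing
library(foreign)   # provides read.dbf() and write.dbf()

# Simulated stand-in for Guerry_85.dbf (NOT the real data)
set.seed(1)
toy <- data.frame(dept = 1:85, Crm_prs = rnorm(85), var2 = rnorm(85))
path <- file.path(tempdir(), "guerry_standin.dbf")
write.dbf(toy, path)

# With the real file this would be: dat <- read.dbf("Guerry_85.dbf")
dat <- read.dbf(path)
summary(dat)
```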

### Extracting the variables

We will extract a subset of the variables included in the data frame for use in the PCA. First, we define a character vector with all the variable names, and then we use the standard column subsetting of the initial data frame (note the empty space before the comma to specify that we select all observations or rows). We also summarize the new data frame.
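A sketch of this subsetting step, using a simulated stand-in for the Guerry data frame. Only `Crm_prs` and `dept` are names taken from the text; `var2` to `var7` are hypothetical placeholders for the remaining columns.

```r
# Simulated stand-in for the Guerry data frame (NOT the real data)
set.seed(1)
dat <- data.frame(dept = 1:85, matrix(rnorm(85 * 7), ncol = 7))
names(dat) <- c("dept", "Crm_prs", paste0("var", 2:7))

vlist <- c("Crm_prs", paste0("var", 2:6))  # names of the six variables to keep
vd <- dat[, vlist]                          # nothing before the comma = all rows
summary(vd)
```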

#### Standardizing the variables

We standardize the variables by means of the **scale** command. We again also provide a summary. Obviously, the mean is zero for all variables. We check the variance for one selected variable (Crm_prs) and it is indeed one.

Note that the resulting object is a **matrix** and not a data.frame (the $ notation does not work to extract the Crm_prs column). We specify the column explicitly by giving its name (in quotes), preceded by a comma with empty space before the comma (meaning all the rows are selected).
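The standardization and the two checks described above might look as follows. The data are again a simulated stand-in, and the column names other than `Crm_prs` are hypothetical.

```r
# Simulated stand-in for the six Guerry variables (NOT the real data)
set.seed(1)
vd <- as.data.frame(matrix(rnorm(85 * 6), ncol = 6))
names(vd) <- c("Crm_prs", paste0("var", 2:6))

vds <- scale(vd)          # center each column to mean 0, rescale to variance 1
summary(vds)
is.matrix(vds)            # TRUE: a matrix, so vds$Crm_prs would not work
var(vds[, "Crm_prs"])     # all rows, one named column
```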

## Principal Component Analysis

The computations for PCA are carried out by means of the **prcomp** function. Since we already scaled our variables, we do not need to specify this as an argument and the only item passed to the function is the name of the matrix containing the scaled variables, **vds** in our example (see the help file for other options).

The result of this computation is an object of the special class **prcomp**. It contains lots of information, which we can check with the usual **str** command.

A **summary** of the principal component object yields the standard deviation associated with each component (the variance corresponds to the eigenvalue), and the corresponding proportion and cumulative proportion of the explained variance. The standard deviations are also contained in the **sdev** attribute.

The standard deviations can also be extracted separately as **prc$sdev**.

The results are not that great. Three components (out of six!) are needed to explain 75% of the variance, and even five do not explain 95%. This is due largely to the low correlation among the variables (**cor**).
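The calls discussed in this section can be sketched as below. The input is a simulated stand-in for the standardized Guerry variables, so the numbers will not match those described in the text; `vds` and `prc` follow the names used above.

```r
# Simulated stand-in for the standardized Guerry variables (NOT the real data)
set.seed(1)
vd <- as.data.frame(matrix(rnorm(85 * 6), ncol = 6))
names(vd) <- c("Crm_prs", paste0("var", 2:6))   # var2..var6 are hypothetical
vds <- scale(vd)

prc <- prcomp(vds)     # variables are already standardized
class(prc)
str(prc)               # components: sdev, rotation, center, scale, x
summary(prc)           # sd, proportion and cumulative proportion of variance
prc$sdev               # standard deviations; their squares are the eigenvalues
round(cor(vds), 2)     # correlations among the original variables
```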

### Scree plot

The scree plot shows the proportion variance explained as a decreasing function of the principal components (each component explains a little less than the previous component). This is used to “eyeball” a reasonable number of components to use in further analysis. Note that there is no point in using all the principal components because then there would be no dimension reduction. The whole objective is to capture most of the variance in the data by means of a small number of components.

To select this number, one looks for an **elbow** or **kink** in the scree plot, i.e., a meaningful change in the slope such that the additional variance explained is small relative to the previous PC.

The **prcomp** function does not include an explicit function to create a scree plot, but it is relatively straightforward to write a small function to accomplish this goal. The function **scree_plot** below does just that.

The function takes as argument the principal component object. The default is to create a scree plot. However, setting the option **cumulative=TRUE** creates the complement, i.e., a curve showing the cumulative variance explained. Note how that is computed using the **cumsum** command in the function. The plot is a simple line plot (type = “b”) with titles appropriate for each plot (this illustrates the use of **if**).
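The original function body is not reproduced here. The sketch below is one possible reconstruction from the description (a `cumulative` option, `cumsum`, a line plot with `type = "b"`, and `if` to switch titles). The axis labels and titles are assumptions, and the data are a simulated stand-in.

```r
scree_plot <- function(pc, cumulative = FALSE) {
  pve <- pc$sdev^2 / sum(pc$sdev^2)       # proportion of variance per component
  if (cumulative) {
    y <- cumsum(pve)                      # running total of the proportions
    ttl <- "Cumulative Proportion of Variance Explained"
  } else {
    y <- pve
    ttl <- "Scree Plot"
  }
  plot(y, type = "b", ylim = c(0, 1), main = ttl,
       xlab = "Principal Component", ylab = "Proportion of Variance")
  invisible(y)                            # return the plotted values quietly
}

# Simulated stand-in for the standardized Guerry variables (NOT the real data)
set.seed(1)
vds <- scale(matrix(rnorm(85 * 6), ncol = 6))
prc <- prcomp(vds)

pve <- scree_plot(prc)                        # scree plot
cpve <- scree_plot(prc, cumulative = TRUE)    # cumulative version
```

Returning the plotted values invisibly is a design choice of this sketch; it makes it easy to inspect the proportions without recomputing them.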

In our example, the scree plot is created using **scree_plot(prc)**, as shown below.

Unlike most textbook examples, this plot does not have a clear kink. In part, this is due to the low correlations between the variables, which does not lend itself to identifying common dimensions that explain a lot of the underlying variance.

The graph with the cumulative proportion of the explained variance is obtained by setting **cumulative=TRUE**, as shown below.

Here again, there is no clear kink. This is a graphical description of the variance proportions we saw above.

### Loadings

The loadings (i.e., the coefficients that apply to each of the original variables to obtain the principal component score for an observation) are contained in the **rotation** attribute of the PC object. We extract it in the usual fashion. Each column of the matrix corresponds to a principal component, and the rows of that column hold the loadings of the original variables.
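Extracting the loadings might look like this (simulated stand-in data; column names other than `Crm_prs` are hypothetical).

```r
# Simulated stand-in for the standardized Guerry variables (NOT the real data)
set.seed(1)
vd <- as.data.frame(matrix(rnorm(85 * 6), ncol = 6))
names(vd) <- c("Crm_prs", paste0("var", 2:6))
prc <- prcomp(scale(vd))

loadings <- prc$rotation      # rows = variables, columns = PC1 ... PC6
round(loadings, 3)
round(loadings[, 1:2], 3)     # loadings for the first two components only
```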

The interpretation of the loadings can be tricky, but sometimes there is a clear interpretation when only a subset of the variables shows high values for the coefficients, or when the signs for a given variable are very different between the components. We return to this below when we consider the **biplot**.

More informative than the loadings is the matrix of squared correlations between the original variables and the principal components. We return to this below.

### Principal component scores

The score for a principal component for each observation is obtained by multiplying the original values for the variables that went into the components by the matching loading (remember that all the variables were standardized, so it is the standardized version that gets multiplied by the loadings). A small number of these principal component scores can then be used instead of the full set of variables to represent a sizeable fraction of the variance in the data.

The component scores can be used as regular variables, in that they can be plotted, mapped, etc. However, keep in mind that they are orthogonal by construction, so the slope in a bivariate scatter plot of two components will always be zero (i.e., the linear fit will be a horizontal line).

The scores are contained in the **x** attribute of the principal component object.

They are stored as an object of class matrix.

And, of course, they are uncorrelated.
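A sketch of these three checks, together with a verification that the scores are the standardized data multiplied by the loadings. The data are a simulated stand-in, and `pcs` is the name assumed here for the score matrix.

```r
# Simulated stand-in for the standardized Guerry variables (NOT the real data)
set.seed(1)
vds <- scale(matrix(rnorm(85 * 6), ncol = 6))
prc <- prcomp(vds)

pcs <- prc$x                  # principal component scores
is.matrix(pcs)                # stored as a matrix
round(cor(pcs), 3)            # identity matrix: the scores are uncorrelated

# Scores = standardized data times loadings
manual <- vds %*% prc$rotation
max(abs(manual - pcs))        # numerically zero
```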

A more informative way to interpret the connection between the original variables and the scores is the squared correlation matrix.

The elements along each row give the proportion of the variance of the variable in that row explained by the respective principal component. The values in each column for a principal component give the squared correlation between the original variable and that component, suggesting the relative importance of the former in interpreting the latter.

The sum across each of the rows (i.e., variables) equals 1.
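The squared correlation matrix and its row sums can be computed along these lines (simulated stand-in data; `v_c` is a hypothetical name).

```r
# Simulated stand-in for the standardized Guerry variables (NOT the real data)
set.seed(1)
vd <- as.data.frame(matrix(rnorm(85 * 6), ncol = 6))
names(vd) <- c("Crm_prs", paste0("var", 2:6))
vds <- scale(vd)
prc <- prcomp(vds)

v_c <- cor(vds, prc$x)^2   # rows = variables, columns = components
round(v_c, 3)
rowSums(v_c)               # each equals 1 when all components are kept
```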

#### Converting the scores to a data frame

As we saw above, the scores are a **matrix** object, not a data frame. If we want to output these results (e.g., to join with a map in GeoDa), we need to turn them into a data frame.

In the few lines below, we first create a data frame from the matrix and then add the department ID (**dept**) as an additional variable. The resulting data frame can be written out to a csv file using **write.csv** and it can also be used by the plotting commands in **ggplot** (see further below).
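A minimal sketch of the conversion. The identifier column is named `AREA_ID` here because that is the label used in the plotting step; the output file name `pcs.csv` and the stand-in `dept` vector are assumptions.

```r
# Simulated stand-in for the Guerry data (NOT the real data)
set.seed(1)
dept <- 1:85                                  # stand-in for dat$dept
vds <- scale(matrix(rnorm(85 * 6), ncol = 6))
prc <- prcomp(vds)

pcs1 <- as.data.frame(prc$x)                  # matrix -> data frame (PC1 ... PC6)
pcs1$AREA_ID <- as.factor(dept)               # add the department identifier
head(pcs1)

out <- file.path(tempdir(), "pcs.csv")        # hypothetical output file name
write.csv(pcs1, out, row.names = FALSE)
```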

#### Plotting the PC scores

We can now construct a scatter plot for any pair of principal component scores. For example, below we use **ggplot** to easily add the labels corresponding to the departments to the plot. It turns out this is a bit tricky in the standard **plot** command, and it gives us an excuse to start exploring the functionality of **ggplot**.

The **ggplot** grammar follows Wilkinson’s grammar of graphics, which is an elegant way to abstract the construction of a wide array of statistical graphs. A characteristic of the **ggplot** approach is that a graph is created incrementally. In its bare bones essentials, there are two commands. First is the specification of the data and variables for the plot, entered as arguments to the **ggplot** command. In our case, the data set is the just created **pcs1**, and the two components PC1 and PC2 correspond to the x and y coordinates in the plot. In **ggplot**, this is specified through **aes** (for aesthetics).

Beyond this first command, a graph is built up by adding (literally, using the **+** sign) different geometric objects. In our case, we specify that the graph is a point graph (for x-y coordinates, this will give a scatter plot) by means of **geom_point** (all the different types are prefaced by **geom**). In addition, we add the labels as text using **geom_text**. We specify the **AREA_ID** as the label (as part of an **aes** attribute) and we use **nudge_y** to move the label above the point (the default is to place it directly on top of the point). For further specifics and options to make this look really fancy, check out the documentation of **ggplot2** (or the excellent book on this package by its creator, Hadley Wickham).
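Put together, the plot described above could be built as follows. This is a sketch on simulated stand-in data, and the `nudge_y` value of 0.2 is an arbitrary choice.

```r
library(ggplot2)

# Simulated stand-in for the PC scores data frame (NOT the real data)
set.seed(1)
prc <- prcomp(scale(matrix(rnorm(85 * 6), ncol = 6)))
pcs1 <- as.data.frame(prc$x)
pcs1$AREA_ID <- as.factor(1:85)

p <- ggplot(pcs1, aes(x = PC1, y = PC2)) +          # data and x-y variables
  geom_point() +                                    # scatter plot layer
  geom_text(aes(label = AREA_ID), nudge_y = 0.2)    # labels shifted upward
print(p)
```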

We can write the data frame to a csv file (using **write.csv**) and then merge this with a layer for the French departments in the Guerry layer in GeoDa. Using linking and brushing, we can examine the extent to which neighbors in multivariate space (points close together in the scatter plot) are also geographical neighbors. We will revisit this issue when we deal with clusters.

### Biplot

An alternative way to visualize the results of a principal components analysis is by means of a so-called **biplot**. For any pair of components, this combines the scatter plot with a set of vectors showing the loadings for each variable. The vector is centered at zero. Its x dimension shows the importance of the loading for that variable for the principal component on the x-axis. The y dimension does the same for the component on the y-axis.

An easy case to interpret is when the loading is large for one component and small for the other, which results in a vector that runs almost parallel to one of the axes. Other easy cases are when the signs of the loadings are opposite. For example, a positive sign for PC1 and a negative sign for PC2 would give a vector pointing down and to the right. The information in the biplot confirms what we saw earlier in the matrix of squared correlations.

The biplot is invoked by the **biplot** command. The default is to plot the first two components, so that the only argument to the function is the principal component object. In our example, this is **prc**. We add the option **scale=0** to make sure the arrows are scaled such that they reflect the loadings.
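The call described above, shown on simulated stand-in data (column names other than `Crm_prs` are hypothetical):

```r
# Simulated stand-in for the standardized Guerry variables (NOT the real data)
set.seed(1)
vd <- as.data.frame(matrix(rnorm(85 * 6), ncol = 6))
names(vd) <- c("Crm_prs", paste0("var", 2:6))
prc <- prcomp(scale(vd))

biplot(prc, scale = 0)   # first two components; arrows reflect the loadings
```

Other pairs of components can be selected with the `choices` argument of `biplot`, e.g. `choices = c(1, 3)`.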

## Multidimensional scaling

Multidimensional scaling consists of finding a lower dimensional representation of the data that respects the multidimensional distance between observation pairs as much as possible.

In R, this is computed using the **cmdscale** command. It takes as input a dissimilarity or distance matrix, computed using the **dist** command. As a default, this uses p-dimensional Euclidean distance, but several other options are available as well (see the documentation). In most circumstances, it is most appropriate to base the distance computation on standardized variables (use **scale** first).

The output of the MDS procedure is a matrix (not a data frame) with the coordinates in the lower-dimensional space (typically two dimensional) for each observation. These can be readily plotted. Note that unlike what holds for the principal components, it does not make sense to plot or map the coordinates in the MDS plot by themselves. However, similar to what holds for principal components, points that are close in the MDS plot are close in multivariate space (but not necessarily in geographical space).

### Creating the distance matrix

In our example, we first create the distance matrix by passing the standardized values in **vds** to the **dist** function. We will then use the resulting object as input into the MDS procedure.
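A sketch of the distance computation on simulated stand-in data; `vdiss` is the object name used in the next step of the text.

```r
# Simulated stand-in for the standardized Guerry variables (NOT the real data)
set.seed(1)
vds <- scale(matrix(rnorm(85 * 6), ncol = 6))

vdiss <- dist(vds)        # Euclidean distance by default
class(vdiss)              # "dist": stores the lower triangle only
attr(vdiss, "Size")       # number of observations
length(vdiss)             # number of distinct pairs, 85 * 84 / 2
```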

### MDS calculation

In the default case (considered here), the **cmdscale** command takes **vdiss** as the only argument.

The result is an n by 2 matrix.
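The MDS step itself, again sketched on simulated stand-in data (`mds` is a hypothetical object name):

```r
# Simulated stand-in for the standardized Guerry variables (NOT the real data)
set.seed(1)
vds <- scale(matrix(rnorm(85 * 6), ncol = 6))
vdiss <- dist(vds)

mds <- cmdscale(vdiss)    # default k = 2 dimensions
is.matrix(mds)
dim(mds)                  # n rows, 2 columns
head(mds)
```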

### Visualization

#### Convert matrix to data frame

We follow the same procedure as for the principal components to convert the matrix of coordinates into a data frame and to add the department identifiers. We can use the result to write out to a csv file (to merge with a GeoDa layer) or to plot using **ggplot**.
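The conversion and plot might be sketched as follows. The data are a simulated stand-in, and the column names `V1` and `V2`, the object names and the output file name are assumptions.

```r
library(ggplot2)

# Simulated stand-in for the standardized Guerry variables (NOT the real data)
set.seed(1)
vds <- scale(matrix(rnorm(85 * 6), ncol = 6))
mds <- cmdscale(dist(vds))

mds1 <- as.data.frame(mds)
names(mds1) <- c("V1", "V2")            # MDS coordinates
mds1$AREA_ID <- as.factor(1:85)         # stand-in for the department IDs
write.csv(mds1, file.path(tempdir(), "mds.csv"), row.names = FALSE)

p_mds <- ggplot(mds1, aes(x = V1, y = V2)) +
  geom_point() +
  geom_text(aes(label = AREA_ID), nudge_y = 0.2)
print(p_mds)
```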

University of Chicago, Center for Spatial Data Science

The primary objectives of an exploratory factor analysis (EFA) are to determine (1) the number of common factors influencing a set of measures, (2) the strength of the relationship between each factor and each observed measure, and (3) the factor scores.

Some common uses of EFA are:

- To reduce a large number of variables to a smaller number of factors for modeling purposes, where the large number of variables precludes modeling all the measures individually.
- To establish that multiple tests measure the same factor, thereby giving justification for administering fewer tests. Factor analysis originated a century ago with Charles Spearman's attempts to show that a wide variety of mental tests could be explained by a single underlying intelligence factor.
- To validate a scale or index by demonstrating that its constituent items load on the same factor, and to drop proposed scale items which cross-load on more than one factor.
- To select a subset of variables from a larger set, based on which original variables have the highest correlations with the principal component factors.
- To create a set of factors to be treated as uncorrelated variables as one approach to handling multicollinearity in such procedures as multiple regression.
- To identify the nature of the constructs underlying responses in a specific content area.
- To determine what sets of items “hang together” in a questionnaire.
- To demonstrate the dimensionality of a measurement scale. Researchers often wish to develop scales that respond to a single characteristic.
- To determine what features are most important when classifying a group of items.
- To generate “factor scores” representing values of the underlying constructs for use in other analyses.

## An example

We illustrate how to conduct exploratory factor analysis using the data from the classic 1939 study by Karl J. Holzinger and Frances Swineford. In the study, twenty-six tests intended to measure a general factor and five specific factors were administered to seventh and eighth grade students in two schools, the Grant-White School ($n = 145$) and the Pasteur School ($n = 156$). Data used in this example include nineteen tests intended to measure four domains: spatial ability, verbal ability, speed, and memory. In addition, only data from the 145 students in the Grant-White School are used.

The data are saved in the file `GrantWhite.csv`. The 26 tests are described below; 19 of them are used in the example.

| Variable | Description |
|---|---|
| visual | scores on visual perception test, test 1 |
| cubes | scores on cubes test, test 2 |
| paper | scores on paper form board test, test 3 |
| lozenge | scores on lozenges test, test 4 |
| general | scores on general information test, test 5 |
| paragrap | scores on paragraph comprehension test, test 6 |
| sentence | scores on sentence completion test, test 7 |
| wordc | scores on word classification test, test 8 |
| wordm | scores on word meaning test, test 9 |
| add | scores on add test, test 10 |
| code | scores on code test, test 11 |
| counting | scores on counting groups of dots test, test 12 |
| straight | scores on straight and curved capitals test, test 13 |
| wordr | scores on word recognition test, test 14 |
| numberr | scores on number recognition test, test 15 |
| figurer | scores on figure recognition test, test 16 |
| object | scores on object-number test, test 17 |
| numberf | scores on number-figure test, test 18 |
| figurew | scores on figure-word test, test 19 |
| deduct | scores on deduction test, test 20 |
| numeric | scores on numerical puzzles test, test 21 |
| problemr | scores on problem reasoning test, test 22 |
| series | scores on series completion test, test 23 |
| arithmet | scores on Woody-McCall mixed fundamentals, form I test, test 24 |
| paperrev | scores on additional paper form board test, test 25 |
| flagssub | scores on flags test, test 26 |

## Exploratory factor analysis

The usual exploratory factor analysis involves (1) Preparing data, (2) Determining the number of factors, (3) Estimation of the model, (4) Factor rotation, (5) Factor score estimation and (6) Interpretation of the analysis.

### Preparing data

In EFA, a correlation matrix is analyzed. The following R code calculates the correlation matrix. In order to simplify the other steps, we save the correlation matrix in the data file `GWcorr.csv`, which will be used later.
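The original code is not reproduced here. The sketch below shows the idea on a simulated stand-in for the 19 test scores: the loading pattern, the column names `test1` to `test19` and the temporary output path are all assumptions. With the real data, `gw` would instead come from `read.csv("GrantWhite.csv")`, restricted to the 19 tests used.

```r
# Simulated stand-in for the 19 Grant-White test scores (NOT the real data)
set.seed(123)
n <- 145; p <- 19
lam <- matrix(0, p, 4)
lam[cbind(1:p, rep(1:4, times = c(4, 5, 4, 6)))] <- 0.7   # hypothetical loadings
gw <- matrix(rnorm(n * 4), n, 4) %*% t(lam) +
  matrix(rnorm(n * p, sd = sqrt(0.51)), n, p)
colnames(gw) <- paste0("test", 1:p)

gw_cor <- cor(gw)                                    # 19 x 19 correlation matrix
round(gw_cor[1:5, 1:5], 2)
write.csv(gw_cor, file.path(tempdir(), "GWcorr.csv"))
```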

### Determining the number of factors

With the correlation matrix, we first decide on the number of factors. There are several ways to do this, but all of them are based on the eigenvalues of the correlation matrix. From R, we have the eigenvalues below. First, note that the number of eigenvalues is the same as the number of variables. Second, the sum of all the eigenvalues is equal to the number of variables.
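The two properties just stated can be checked directly with `eigen`. The correlation matrix here comes from a simulated stand-in, so the eigenvalues themselves are not those of the Grant-White data.

```r
# Simulated stand-in for the 19 Grant-White test scores (NOT the real data)
set.seed(123)
n <- 145; p <- 19
lam <- matrix(0, p, 4)
lam[cbind(1:p, rep(1:4, times = c(4, 5, 4, 6)))] <- 0.7   # hypothetical loadings
gw <- matrix(rnorm(n * 4), n, 4) %*% t(lam) +
  matrix(rnorm(n * p, sd = sqrt(0.51)), n, p)

ev <- eigen(cor(gw))$values   # eigenvalues, sorted in decreasing order
round(ev, 3)
length(ev)                    # same as the number of variables
sum(ev)                       # also equals the number of variables
```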

The basic idea can be related to the variance explained as in regression analysis. With the correlation matrix, we can take the variance of each variable as 1. For a total of $p$ variables, the total variance is therefore $p$. For factor analysis, we try to find a small number of factors that can explain a large portion of the total variance. The eigenvalues correspond to the variance of each factor. If the eigenvalue corresponding to a factor is large, that means the variance explained by the factor is large. Therefore, the eigenvalues can be used to select the number of factors.

#### Rule 1

The first rule to decide the number of factors is to use the number of eigenvalues larger than 1. In this example, we have four eigenvalues larger than 1. Therefore, we can have 4 factors.

#### Rule 2

Another way is to select the number of factors whose cumulative eigenvalues account for 80% of the total variance. That is, if we add up the eigenvalues of the selected factors, the total should be at least 80% of the sum of all eigenvalues.
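Both rules reduce to one-liners once the eigenvalues are available. The sketch below applies them to a simulated stand-in, so the resulting counts are not those of the Grant-White data; `k1` and `k2` are hypothetical names.

```r
# Simulated stand-in for the 19 Grant-White test scores (NOT the real data)
set.seed(123)
n <- 145; p <- 19
lam <- matrix(0, p, 4)
lam[cbind(1:p, rep(1:4, times = c(4, 5, 4, 6)))] <- 0.7   # hypothetical loadings
gw <- matrix(rnorm(n * 4), n, 4) %*% t(lam) +
  matrix(rnorm(n * p, sd = sqrt(0.51)), n, p)
ev <- eigen(cor(gw))$values

k1 <- sum(ev > 1)                     # Rule 1: eigenvalues larger than 1
cum_prop <- cumsum(ev) / sum(ev)      # cumulative share of total variance
k2 <- which(cum_prop >= 0.8)[1]       # Rule 2: first count reaching 80%
c(rule1 = k1, rule2 = k2)
```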

#### Cattell's Scree plot

Cattell's scree plot is a plot of the eigenvalues on the Y axis against the number of factors on the X axis. The plot looks like the side of a mountain, and 'scree' refers to the debris fallen from a mountain and lying at its base. As one moves to the right, toward later components/factors, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward a less steep decline, Cattell's scree test says to drop all further components/factors after the one starting the elbow. For this example, we can identify 4 factors based on the scree plot below.
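A scree plot of this kind can be drawn with base graphics. The sketch uses a simulated stand-in; the dashed reference line at 1 is an optional addition that ties the plot to Rule 1.

```r
# Simulated stand-in for the 19 Grant-White test scores (NOT the real data)
set.seed(123)
n <- 145; p <- 19
lam <- matrix(0, p, 4)
lam[cbind(1:p, rep(1:4, times = c(4, 5, 4, 6)))] <- 0.7   # hypothetical loadings
gw <- matrix(rnorm(n * 4), n, 4) %*% t(lam) +
  matrix(rnorm(n * p, sd = sqrt(0.51)), n, p)
ev <- eigen(cor(gw))$values

plot(seq_along(ev), ev, type = "b", main = "Scree plot",
     xlab = "Number of factors", ylab = "Eigenvalue")
abline(h = 1, lty = 2)   # reference line for the eigenvalue-greater-than-1 rule
```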

### Estimation of model / Factor analysis

Once the number of factors is decided, we can conduct exploratory factor analysis using the R function `factanal()`. The R input and output for this example are given below.
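The original input and output are not reproduced here. A minimal sketch of the call, fitted to the correlation matrix of a simulated stand-in, is shown below; the object name `fa.res` follows the text, while everything about the data is an assumption.

```r
# Simulated stand-in for the 19 Grant-White test scores (NOT the real data)
set.seed(123)
n <- 145; p <- 19
lam <- matrix(0, p, 4)
lam[cbind(1:p, rep(1:4, times = c(4, 5, 4, 6)))] <- 0.7   # hypothetical loadings
gw <- matrix(rnorm(n * 4), n, 4) %*% t(lam) +
  matrix(rnorm(n * p, sd = sqrt(0.51)), n, p)
colnames(gw) <- paste0("test", 1:p)
gw_cor <- cor(gw)

# Maximum likelihood factor analysis from a correlation matrix;
# n.obs is needed for the chi-square test of model fit
fa.res <- factanal(covmat = gw_cor, factors = 4, n.obs = n)
fa.res
```

`factanal()` applies a varimax rotation by default; `rotation = "none"` gives the unrotated solution.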

In EFA, each observed variable consists of two parts: the common factor part and the uniqueness part. The common factor part is based on the four factors, which are also called the common factors. The uniqueness part, also called the unique factor, is specific to each observed variable.

Using the variable `visual` as an example, we have

$$ visual = 0.536 \times Factor1 + 0.176 \times Factor2 + 0.392 \times Factor3 - 0.249 \times Factor4 + u_{visual} $$

Note that the factor loadings come from the `Loadings` section of the output. The loadings are the regression coefficients obtained when each observed variable (manifest indicator) is regressed on the latent factors. The variance of each unique factor is given in the `Uniquenesses` section. For $u_{visual}$, the variance is 0.465. The uniquenesses of the other variables are read in the same way.
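As a rough consistency check on these numbers, assume that the four factors are uncorrelated and that `visual` is standardized to unit variance (assumptions not stated explicitly above). The uniqueness should then equal one minus the sum of the squared loadings:

$$ Var(u_{visual}) = 1 - (0.536^2 + 0.176^2 + 0.392^2 + 0.249^2) = 1 - 0.534 = 0.466 $$

This agrees with the reported 0.465 up to rounding of the loadings.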

The remaining section relates to the variance explained by the factors. `SS loadings` is the sum of squared loadings for each factor, i.e., the total variance across all 19 variables explained by that factor. The first factor therefore explains 5.722 units of variance, or about 30.1% (5.722/19). `Proportion Var` is the proportion of the variance in the observed variables/indicators explained by each factor. `Cumulative Var` is the cumulative proportion of variance explained by the factors up to and including that one.
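These three rows can be recomputed from the loadings matrix. The sketch uses a simulated stand-in, so only the last line refers to a number reported in the text; `ss`, `prop_var` and `cum_var` are hypothetical names.

```r
# Simulated stand-in for the 19 Grant-White test scores (NOT the real data)
set.seed(123)
n <- 145; p <- 19
lam <- matrix(0, p, 4)
lam[cbind(1:p, rep(1:4, times = c(4, 5, 4, 6)))] <- 0.7   # hypothetical loadings
gw <- matrix(rnorm(n * 4), n, 4) %*% t(lam) +
  matrix(rnorm(n * p, sd = sqrt(0.51)), n, p)
fa.res <- factanal(covmat = cor(gw), factors = 4, n.obs = n)

L <- unclass(fa.res$loadings)   # plain 19 x 4 matrix of loadings
ss <- colSums(L^2)              # "SS loadings"
prop_var <- ss / nrow(L)        # "Proportion Var"
cum_var <- cumsum(prop_var)     # "Cumulative Var"
round(rbind(ss, prop_var, cum_var), 3)

5.722 / 19                      # share reported in the text for the first factor
```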

A test is conducted of whether the factor model is sufficient to explain the observed data. The null hypothesis is that a 4-factor model is sufficient. For this model, the chi-square statistic is 102.06 with 101 degrees of freedom. The p-value for the chi-square test is 0.452, which is larger than .05. Therefore, we fail to reject the null hypothesis that the 4-factor model fits the data well.
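The reported degrees of freedom and p-value can be reproduced from the numbers above. The formula for the degrees of freedom of a factor model with p variables and k factors is the standard one and is not stated in the text.

```r
p <- 19   # number of observed variables
k <- 4    # number of factors

dof <- ((p - k)^2 - (p + k)) / 2     # degrees of freedom of the k-factor model
dof

pval <- pchisq(102.06, df = dof, lower.tail = FALSE)   # upper-tail probability
round(pval, 3)
```

In a fitted `factanal` object, the same quantities are stored in the `STATISTIC`, `dof` and `PVAL` components.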

### Factor rotation

Although we have identified 4 factors and found that the 4-factor model fits well, we cannot find a clear pattern in the factor loadings that would give a deeper understanding of the factors. Factor rotation can make the output more understandable and is usually necessary to facilitate the interpretation of the factors. The aim is to find a simple solution in which each factor has a small number of large loadings and a large number of zero (or small) loadings. There are many different rotation methods, such as varimax, quartimax, equimax, and oblique rotations. The PROMAX rotation is one kind of oblique rotation and is widely used. After PROMAX rotation, the factors are allowed to be correlated.

The output of the PROMAX rotation is shown below. In the output, we use `print(fa.res, cut=0.2)` to show only factor loadings that are greater than 0.2 in absolute value. Note that after rotation, many loadings are actually smaller than 0.2. The pattern of the factor loadings is much clearer now. For example, the variable `visual` has a large loading of 0.747 on `Factor 2` but loadings smaller than 0.2 on the other three factors. In this case, we might say that the variable `visual` is mainly influenced by `Factor 2`.

Unlike the variable `visual`, the variable `straight` has large loadings on both `Factor 2` and `Factor 4`. In other words, `straight` measures both factors rather than just a single factor.

We can also see that the primary indicators for `Factor 1` are `general`, `paragrap`, `sentence`, `wordc`, and `wordm`. For `Factor 4`, the indicators include `add`, `code`, `counting`, and `straight`.

The correlations among the factors are given in the `Factor Correlation` section. For example, the correlation between `Factor 1` and `Factor 2` is 0.368. Note that the test of the model after rotation is the same as without rotation.
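The rotated analysis discussed in this section can be sketched as follows, again on a simulated stand-in, so the loadings and factor correlations will differ from those quoted. The unrotated fit `fa.none` is a hypothetical addition used only to confirm that rotation leaves the model test unchanged.

```r
# Simulated stand-in for the 19 Grant-White test scores (NOT the real data)
set.seed(123)
n <- 145; p <- 19
lam <- matrix(0, p, 4)
lam[cbind(1:p, rep(1:4, times = c(4, 5, 4, 6)))] <- 0.7   # hypothetical loadings
gw <- matrix(rnorm(n * 4), n, 4) %*% t(lam) +
  matrix(rnorm(n * p, sd = sqrt(0.51)), n, p)
colnames(gw) <- paste0("test", 1:p)
gw_cor <- cor(gw)

fa.res <- factanal(covmat = gw_cor, factors = 4, n.obs = n, rotation = "promax")
# cutoff hides loadings below 0.2 in absolute value; the text's cut=0.2
# reaches the same argument through partial matching
print(fa.res, cutoff = 0.2)

# Rotation changes the loadings but not the fit of the model
fa.none <- factanal(covmat = gw_cor, factors = 4, n.obs = n, rotation = "none")
all.equal(fa.none$STATISTIC, fa.res$STATISTIC)
```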

### Interpret the results from EFA

Based on the rotated factor loadings, we can name the factors in the model. This can be done by identifying the significant loadings. For example, `Factor 1` is indicated by `general`, `paragrap`, `sentence`, `wordc`, and `wordm`, all of which are related to the verbal aspect of cognitive ability. One way to name the factor is to call it a verbal factor. Similarly, the second can be called the spatial factor, the third the memory factor, and the last the speed factor.

### Factor scores

Sometimes, the purpose of factor analysis is to estimate the score of each latent construct/factor for each participant. Factor scores can be used in further data analysis. In general, there are two methods for estimating factor scores: the regression method and the Bartlett method. The second method generally works better. For example, the following code obtains the Bartlett factor scores. As an example, the linear regression is also fitted.
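The original code is not reproduced here, so the following is only a sketch of how Bartlett scores can be requested from `factanal()` and then used in a regression. The data are a simulated stand-in, and the variable `outcome` is entirely hypothetical; the text does not say which regression was fitted.

```r
# Simulated stand-in for the 19 Grant-White test scores (NOT the real data)
set.seed(123)
n <- 145; p <- 19
lam <- matrix(0, p, 4)
lam[cbind(1:p, rep(1:4, times = c(4, 5, 4, 6)))] <- 0.7   # hypothetical loadings
gw <- matrix(rnorm(n * 4), n, 4) %*% t(lam) +
  matrix(rnorm(n * p, sd = sqrt(0.51)), n, p)
colnames(gw) <- paste0("test", 1:p)

# Factor scores require the raw data rather than a correlation matrix
fa.res <- factanal(gw, factors = 4, rotation = "promax", scores = "Bartlett")
fs <- fa.res$scores          # 145 x 4 matrix: Factor1 ... Factor4
head(fs)

# Hypothetical outcome regressed on the factor scores
reg_dat <- data.frame(outcome = rnorm(n), fs)
fit <- lm(outcome ~ ., data = reg_dat)
summary(fit)
```

Setting `scores = "regression"` instead would give scores from the regression method.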