The Problem
When clustering data using principal component analysis, it is often of interest to visually inspect how well the data points separate in 2-D space based on principal component scores. While this is fairly straightforward to visualize with a scatterplot, the plot can become cluttered quickly with annotations as shown in the following figure:
Solution using ggrepel
The ggrepel package by Kamil Slowikowski implements functions to repel overlapping text labels away from each other and away from the data points that they label. It is an easy-to-use package that works well in this example, as shown in the following figure:
Solution using plotly
An alternative solution is to use interactive plots that are usable from the R console, in the RStudio viewer pane, in R Markdown documents, and in Shiny apps. Annotations can be viewed by hovering the mouse pointer over a point or dragging a rectangle around the relevant area to zoom in. Interactive plots using plotly allow you to de-clutter the plotting area, include extra annotation information, and create interactive web-based visualizations directly from R. Once uploaded to a plotly account, plotly graphs (and the data behind them) can be viewed and modified in a web browser.
The resulting plot is clean and not cluttered with text annotations. While the ggrepel package provides a nice solution in this example, the plotly solution will be even more useful with a larger number of data points.
The Code
Principal Component Analysis and Hierarchical Clustering
# cor = TRUE indicates that PCA is performed on
# standardized data (mean = 0, variance = 1)
pcaCars <- princomp(mtcars, cor = TRUE)

# view objects stored in pcaCars
names(pcaCars)

# proportion of variance explained
summary(pcaCars)

# scree plot
plot(pcaCars, type = "l")

# cluster cars
carsHC <- hclust(dist(pcaCars$scores), method = "ward.D2")

# dendrogram
plot(carsHC)

# cut the dendrogram into 3 clusters
carsClusters <- cutree(carsHC, k = 3)

# add cluster to data frame of scores
carsDf <- data.frame(pcaCars$scores, "cluster" = factor(carsClusters))
carsDf <- transform(carsDf, cluster_name = paste("Cluster", carsClusters))
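The proportion of variance that summary() reports can also be computed directly from the component standard deviations, which is handy if you want to show it in the axis labels. The following is a minimal sketch; the names varExplained and axisLabels are introduced here for illustration and are not part of the original code.

```r
# PCA on standardized data, as in the code above
pcaCars <- princomp(mtcars, cor = TRUE)

# each component's variance is its squared standard deviation;
# dividing by the total gives the proportion of variance explained
varExplained <- pcaCars$sdev^2 / sum(pcaCars$sdev^2)

# percentages for the first two components
round(100 * varExplained[1:2], 1)

# hypothetical axis labels that carry the percentages
axisLabels <- sprintf("PC%d (%.1f%%)", 1:2, 100 * varExplained[1:2])
axisLabels
```

These labels could then be passed to xlab() and ylab() in place of the plain "PC1" and "PC2" strings.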
First figure using ggplot2
library(ggplot2)

p1 <- ggplot(carsDf, aes(x = Comp.1, y = Comp.2)) +
  theme_classic() +
  geom_hline(yintercept = 0, color = "gray70") +
  geom_vline(xintercept = 0, color = "gray70") +
  geom_point(aes(color = cluster), alpha = 0.55, size = 3) +
  xlab("PC1") +
  ylab("PC2") +
  xlim(-5, 6) +
  ggtitle("PCA Clusters from Hierarchical Clustering of Cars Data")

# label each point with the car name, nudged slightly above the point
p1 + geom_text(aes(y = Comp.2 + 0.25, label = rownames(carsDf)))
Second figure using ggplot2 with ggrepel
library(ggplot2)
library(ggrepel)

# same labels as before, but repelled from each other and from the points
p1 + geom_text_repel(aes(y = Comp.2 + 0.25, label = rownames(carsDf)))
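Label placement in ggrepel is randomized, so the figure can look slightly different on each run. The sketch below is one possible variation, not the code behind the original figure: it fixes the layout with seed, keeps every label with max.overlaps = Inf (this assumes ggrepel 0.9.0 or later), and drops the manual y offset so that the labels are anchored at the points themselves. The column name car and the object p2 are introduced here for illustration.

```r
library(ggplot2)
library(ggrepel)

# rebuild the scores data frame so this example runs on its own
pcaCars <- princomp(mtcars, cor = TRUE)
carsHC <- hclust(dist(pcaCars$scores), method = "ward.D2")
carsDf <- data.frame(pcaCars$scores,
                     cluster = factor(cutree(carsHC, k = 3)),
                     car = rownames(mtcars))

p2 <- ggplot(carsDf, aes(x = Comp.1, y = Comp.2)) +
  theme_classic() +
  geom_point(aes(color = cluster), alpha = 0.55, size = 3) +
  geom_text_repel(aes(label = car),
                  seed = 42,           # reproducible label positions
                  max.overlaps = Inf,  # never drop a label
                  box.padding = 0.4) + # extra space around each label
  xlab("PC1") +
  ylab("PC2")
p2
```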
Interactive plot using plotly
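library(plotly)

# plotly 4.0 and later expects data frame columns to be given as
# formulas (~column); bare column names only worked in older versions
p <- plot_ly(carsDf, x = ~Comp.1, y = ~Comp.2,
             text = rownames(carsDf),
             type = "scatter", mode = "markers",
             color = ~cluster_name,
             marker = list(size = 11))

p <- layout(p, title = "PCA Clusters from Hierarchical Clustering of Cars Data",
            xaxis = list(title = "PC 1"),
            yaxis = list(title = "PC 2"))

p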
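The hover text is also the natural place for the extra annotation information mentioned earlier. As a rough sketch, assuming plotly 4.0 or later, the tooltip below shows each car's name together with two of its original variables. The choice of mpg and hp, the hover column and the pHover object are illustrative and not part of the original post.

```r
library(plotly)

# rebuild the scores data frame so this example runs on its own
pcaCars <- princomp(mtcars, cor = TRUE)
carsHC <- hclust(dist(pcaCars$scores), method = "ward.D2")
carsDf <- data.frame(pcaCars$scores,
                     cluster_name = paste("Cluster", cutree(carsHC, k = 3)))

# hypothetical tooltip: car name plus two of the original variables;
# "<br>" starts a new line inside a plotly tooltip
carsDf$hover <- paste0(rownames(mtcars),
                       "<br>mpg: ", mtcars$mpg,
                       "<br>hp: ", mtcars$hp)

pHover <- plot_ly(carsDf, x = ~Comp.1, y = ~Comp.2,
                  type = "scatter", mode = "markers",
                  color = ~cluster_name,
                  text = ~hover,
                  hoverinfo = "text",  # show only the custom text
                  marker = list(size = 11))
pHover
```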