Chapter 2 Introduction to R

It is often said that the majority of time is spent on data cleaning and 20% on analysis. A common trap when start using R is that the library have not been installed or the data are in different folder to the working directory. Installing library is easy with the command install.packages. The command getwd() will tell you which directory you are in.

Sometimes, the user encounter issues with R. This is not an uncommon problem and often not due to serious errors but forgetting that R is case-sensitive. Also when copying codes from the net across, R does not recognise the “inverted comma” but has its own “inverted comma”. Unlike Python, R does not tolerate space between variables. R also does not like variables to be named starting with a number.

R treats the variables in certain ways depend on their class. Some function requires that the variables are numeric. The Breast Cancer data from mlbench has 11 columns with the 9 covariates being factors and ordered factors. These issues will be dealt further in the chapter on data wrangling. One way to get an overview of the structure of the data is to use the glimpse function from dplyr.

Another issue is that some libraries have assign certain names to their function and which is also used by other libraries. In the course of opening multiple libraries, R may be confused and use the last library open and which may lead to error result. We will illustrate this with the forest function in the Statistics chapter.

A selling point of R is that there is a large community of users who often post their solutions online. it’s worth spending time on stack overflow to look at similar problems that others have and the approaches to solving them. A clue to pose questions is to look at the error. If you wish to post questions for others to help then it’s encouraged that the question include some data so that the person providing the solution can understand the nature of the errors. Useful blogs are also available at https://www.r-bloggers.com/.

This chapter focusses on the ggplot2 and its related libraries for plotting (Wickham 2016). Base R also has plotting function but lacks the flexibility of ggplot2. Plotting is introduced early to enage clinicians who may not have the patience read the following chapter on data wrangling prior. The book is not intended to be used in a sequential fashion as the reader may find elements of this chapter relevant to them and jump to another chapter such as chapter 3 on statistics.

2.1 Plot using base R

Below we illustrate bar plot using base R.

data("Leukemia", package="Stat2Data")
#AML dataset-treatment
colnames(Leukemia)

## [1] "Age"    "Smear"  "Infil"  "Index"  "Blasts" "Temp"   "Resp"   "Time"  
## [9] "Status"

#base R
hist(Leukemia$Age, 8, xlab = "Age",main="Acute Myeloid Leukemia Dataset")

Line plot can be performed in base R using abline argument. The font is defined by cex argument and colour defined by col argument. The title is defined by main argument. The X and Y axes can be labelled using xlab and ylab arguments within the plot function. Below we illustrate example of drawing a line to a point and in the segment below we illustrate example of drawing a line.

#set parameters of plot
X=seq(0,1,by=.1) #define a set of point from 0 to 1, separated by 0.1.
Y=seq(0,1,by=.1)

#define A and B
A=.065; B=.44

#location of point
plot(X,Y, main="ROC curve")
points(A,B,pch=8,col="red",cex=2) #add point
abline(coef = c(0,1)) #add diagonal line

#draw line to a point
segments(x0=0,y0=0,x1=A,y1=B,col="blue")
segments(x0=A,y0=B,x1=1,y=1,col="blue")

This is an illustration of using base R to plot regression line. Later, we will illustrate using geom_smooth call from ggplot2.

TPR_1=.44; FPR_1=.065

plot(X,Y,main="Likelihood ratio graph", xlab="1-Specificity",ylab="Sensitivity",cex=.25)
  points(.02,.8,pch=8,col="red",cex=2) #add point
  df1<-data.frame(c1=c(0,TPR_1),c2=c(0,FPR_1))
  reg1<-lm(c1~c2,data=df1)
  df2<-data.frame(c1=c(TPR_1,1),c2=c(FPR_1,1))
  reg2<-lm(c1~c2,data=df2)
  abline(reg1) #draw line using coefficient reg1
  abline(reg2) #draw line using coefficient reg2
  text(x=FPR_1,y=TPR_1+.3,label="Superior",cex=.7)
  text(x=FPR_1+.2,y=TPR_1+.2,label="Absence",cex=.7)
  text(x=.0125,y=TPR_1-.1,label="Presence",cex=.7)
  text(x=FPR_1+.1,y=TPR_1,label="Inferior",cex=.7)

Function can be plotted using variations of the above. This requires a formula to describe variable Y. Shading in base R is performed with the polygon function.

AUC_Logistic<-function (A,B,C,D){
  #binary data
  #A=True pos %B=False positive   %C=False negative   %D=True negative
  
  TPR=A/(A+C)
  FPR=1-(D/(D+B))
  
  #binomial distribution sqrt(np(1-p))
  #Statist. Med. 2002; 21:1237-1256 (DOI: 10.1002/sim.1099)
  STDa=sqrt((A+C)*TPR*(1-TPR));
  STDn=sqrt((B+D)*FPR*(1-FPR));
  a=STDn/STDa;
  theta=log((TPR/(1-TPR))/((FPR/(1-FPR))^a));
  
  #define a set of point from 0 to 1, separated by 0.001.
  X=seq(0,1,by=0.001)  
  
  #logistic regression model
  Y1=(X^a)/(X^a+(((1-X)^a)*exp(-1*theta)));
  AUC=round(pracma::trapz(X,Y1),2)
  AUC

#SE using Hanley & McNeil
#Preferred method if more than one TPR,FPR data point known
#Hanley is less conservative than Bamber
  Nn=B+D;
  Na=A+C;
  Q1=AUC/(2-AUC);
  Q2=2*AUC^2/(1+AUC);
  
SEhanley=sqrt(((AUC*(1-AUC))+((Na-1)*(Q1-AUC^2))+((Nn-1)*(Q2-AUC^2)))/(Na*Nn))

#SE using Bamber
#Ns is the number of patients in the smallest group
if (A+C>B+D) {
  Ns=B+D
  } else {
   Ns=A+C
  }
SEbamber=sqrt((AUC*(1-AUC))/(Ns-1))

# plot smoothed ROC
plot(X,Y1,main="ROC curve", xlab="1-Specificity",ylab="Sensitivity",cex=.25)
points(FPR,TPR,pch=8,col="red",cex=2) #add point

Y2=0
polygon(c(X[X >= 0 & X <= 1],
          rev(X[X >= 0 & X <= 1])),
        c(Y1[X >= 0 & X <= 1],
          rev(Y2[X >= 0 & X <= 1])),
        col = "#6BD7AF")


print(paste("The Area under the ROC curve using the logistic function is", 
            AUC,". The Area under the ROC curve using rank sum method is", round(.5*(TPR+(1-FPR)),2)))
}

AUC_Logistic(10,20,2,68)

## [1] "The Area under the ROC curve using the logistic function is 0.84 . The Area under the ROC curve using rank sum method is 0.8"

2.2 ggplot2

The plot below uses ggplot2 or grammar of graphics. The plot is built layer by layer like constructing a sentence. Plotting is a distinct advantage of R over commercial software with GUI (graphical user interface) like SPSS. A wide variety of media organisations (BBC, Economist) are using ggplot2 with their own stylised theme. The plot has a certain structure such the name of the data and aesthetics for x and y axes. For illustration purpose the aesthetics are labelled with x and y statements. The fill argument in the aesthetics indicate the variables for coloring. The colors chosen for this graph were imposed by the journal Epilepsia (Seneviratne et al. 2019). To run the examples, check that you have install the libraries. an error can occurred if you don’t have the required library. The meta-character # is used to signal that the line is meant to comment the code ie R will not read it. The install.packages command only need to be run once.

2.2.1 Histogram

The flexibility of ggplot2 is shown here in this histogram. The legend can be altered using the scale_fill_manual function. If other colours are preferred then under values add the preferred colours.

There are different ways to use ggplot2: quick plot or qplot with limited options and full ggplot2 with all the options. The choice of the method depends on individual preference and as well as reason for plotting.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.0.5

#qplot
qplot(Age, data=Leukemia, bins=8)

Now a more complex version of histogram with ggplot with color added.

#ggplot2
ggplot(data=Leukemia,aes(Age,fill=as.factor(Resp)))+
  geom_histogram(bins=8)

Note the legend is a bit untidy. The legend can be changed using scale_fill_manual. The color can be specified using rgb argument. This is important as some journals prefer certain color.

#adding legend, changing the values 0 and 1 to treatment response
ggplot(data=Leukemia,aes(Age,fill=as.factor(Resp)))+
  geom_histogram(bins=8)+
  scale_fill_manual(name="Response",values=c("#999999","#56B4E9"),
                    breaks=c(0,1),labels=c("No Treatment","Treatment"))

Adding title is easy with ggtitle.

#adding title
ggplot(data=Leukemia,aes(Age,fill=as.factor(Resp)))+
  geom_histogram(bins=8)+
  scale_fill_manual(name="Response",
                    values=c("#555555","#56B4E9"),
                    breaks=c(0,1),
                    labels=c("No Treatment","Treatment"))+
  ggtitle("Acute Myeloid Leukemia Treatment dataset")

2.2.2 Bar plot

Previously, we had used base R for bar plot. Here we use the geom_bar argument in ggplot.

library(ggplot2)
library(extrafont)

## Registering fonts with R

#this data came from a paper published in Epilespia 2019 on cost of looking 
#after patients with pseudoseizure 

##                              Lower  Upper
##                    percentage 0.95   0.95  
## Duration_PNES.lmg  0.0974     0.0057 0.2906
## pseudostatus.lmg   0.2283     0.0687 0.3753
## Hx_anxiety.lmg     0.0471     0.0057 0.1457
## Hx_depression.lmg  0.0059     0.0041 0.1082
## DSP.lmg            0.0582     0.0071 0.1500
## seizure_burden.lmg 0.0179     0.0041 0.1058
## sex.lmg            0.0413     0.0030 0.1519
df<-data.frame("Predictors"=c("Duration_PNES","pseudostatus","Hx_anxiety",
"Hx_depression","DSP","seizure_burden","sex"),
    "Importance"=c(0.09737,0.22825, 0.047137,0.00487,0.058153,0.01786,0.04131),
    "Lower.95"=c(0.0057,0.0687,0.0057,0.0041,0.0071,0.0041,0.0030),
    "Upper.95"=c(0.2906,0.3753,0.1457,0.1082,0.1500,0.1058,0.1519))

#check dimensions of data frame
dim(df)

## [1] 7 4

#check variables in data frame
colnames(df)

## [1] "Predictors" "Importance" "Lower.95"   "Upper.95"

#bar plot uses geom_bar 
ggplot(df, aes(x=Predictors,y=Importance))+
  geom_bar(stat="identity")

This bar plot may be considered as untidy as the variables have not been sorted. Reordering the data requires ordering the factors. The colors were requested by the journal Epilesia in order to avoid recognitin of the bar from color blindness.

#reordering the data
df3<-df2<-df
df3$Predictors<-factor(df2$Predictors, levels=df2[order(df$Importance),"Predictors"])
#adding color
p<-ggplot(df3, aes(x=Predictors,y=Importance,fill=Predictors))+
  geom_bar(colour="black",
    stat="identity",
    fill=
      c("#e4b84b","#ce8080","#511c23","#e37c1d","#ffde75","#abb47d","#a1c5cb"))
p

This bar plot is now ordered but the labels on the axis seem to run into each other. One solution is to tile the axis title using element_text. Note that the text size can also be specified within this argument.

#rotate legend on x axis label by 45
p+theme(axis.title.y = element_text(face="bold",  size=12),
        axis.title.x = element_text(face="bold", size=12), 
        axis.text.x  = element_text(angle=45, vjust=0.5, size=10))

The title can be broken up using the backward slash.

#adding break in title
p1<-p+geom_errorbar(aes(ymin=Lower.95,ymax=Upper.95,width=0.2))+
  labs(y="R2 exaplained (%)")+
  theme(text=element_text(size=10))+
  ggtitle(" Relative Importance of Regressors \n Cost among patients with non-epileptic seizure")
p1

2.2.3 Pie chart

This example uses the data above on contribution of non-epileptic seizure variables to hospitalised cost.

library(ggplot2)
df3$Percentage=round(df3$Importance/sum(df3$Importance)*100,0)
ggplot(df3, aes(x="" ,y=Percentage,fill=Predictors))+
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void()

2.2.4 Scatter plot

The above data is used here to illustrate scatter plot. We can denote the color difference among the Predictors by adding color argument in the aesthetics.

#color in qplot
qplot(data=df3, Predictors, Importance,color=Predictors)

Adding color in ggplot is the same as in qplot.

#color ggplot
ggplot(df3, aes(x=Predictors,y=Importance,color=Predictors))+
  geom_point()+
  theme(axis.title.y = element_text(face="bold",  size=10),
        axis.title.x = element_text(face="bold", size=10), 
        axis.text.x  = element_text(angle=45, vjust=0.5, size=10))

The size argument within aes can be used like color. In this case, it’s used to denote the importance of the predictors.

#size
ggplot(df3, aes(x=Predictors,y=Importance,color=Predictors,size=Predictors))+
  geom_point()+
  theme(axis.title.y = element_text(face="bold",  size=12),
        axis.title.x = element_text(face="bold", size=12), 
        axis.text.x  = element_text(angle=45, vjust=0.5, size=12))

## Warning: Using size for a discrete variable is not advised.

This is a more complicated example of scatter plot combined with formula of regression line. The paste0 function is used to add the equation to the plot. The data comes from GBD 2016 publication on lifetime risk of stroke. A comparison with plotting from base R is also provided.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## v purrr   0.3.4

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

load("./Data-Use/world_stroke.Rda")
#fitting a regression model
fit<-lm(MeanLifetimeRisk~LifeExpectancy,data=world_sfdf)
fitsum<-summary(fit)
#base R scatter plot with fitted line
x=world_sfdf$LifeExpectancy #define x
y=world_sfdf$MeanLifetimeRisk #define y
plot(x,y, data=world_sfdf, main = "Lifetime Stroke Risk",
     xlab = "Life Expectancy", ylab = "Life time Risk",
     pch = 19)
abline(lm(y ~ x, data = world_sfdf), col = "blue")

The ggplot version is now provided. Note that the line is fitted in the geom_smooth argument. An interesting aspect of this plot is that the data can be describe as heterosecadic in which the variance changes throughout the plot.

#ggplot2 scatter plot with fitted line
SR<-ggplot(world_sfdf,  aes(x=LifeExpectancy,y=MeanLifetimeRisk))+
  geom_smooth(method="lm")+geom_point()+
  xlab("Life Expectancy")+
  ggtitle(paste0("Life time Risk", "=", 
                 round(fitsum$coefficients[1],2),"+",
                 round(fitsum$coefficients[2],2)," x ","Life Expectancy"))
SR

## `geom_smooth()` using formula 'y ~ x'

To use the name of the country as label, we use geom_text. The world_sfdf data is now partitioned to show only data from Europe. An interesting pattern emerge. There is clumping of the data around Belgium and Portugal. The nudge_x and nudge_y function are used to adjust the labels and the size argument adjust the label.

library(tidyverse)
library(sf)

## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1

world_sfdf %>% filter(Continent=="EUROPE") %>%
ggplot( aes(x=LifeExpectancy,y=MeanLifetimeRisk))+
  geom_smooth(method="lm")+
  xlab("Life Expectancy")+
  geom_text(aes(label=Country, nudge_x=.35, nudge_y=.5, avoid_overlap=T), 
            size=2)

## `geom_smooth()` using formula 'y ~ x'

In order to understand the reason for deviations from the fitted line above, it is possible possible to add additional step to explore the relationship for each income group. This graph illustrates that the high income countries have a ceiling in the relationship between lifetime risk and life expectancy from age of 70 onward.

SRIncome<-ggplot(world_sfdf, aes(x=LifeExpectancy,y=MeanLifetimeRisk))+
  geom_smooth(method="lm", 
              aes(group=Income, linetype=Income, colour= Income))+ 
  geom_point()+
  xlab("Life Expectancy")

SRIncome

## `geom_smooth()` using formula 'y ~ x'

2.2.5 arrange plot in grids

Plots can be arrange in tabular format for presentation or journal submission.In base R multiple plots can be combined using par function and specify the number of columns by mfrow. The number of columns can be specified with ncol call when using gridExtra library.

#Leukemia data
par(mfrow=c(1,2)) #row of 1 and 2 columns

x=Leukemia$Age #define x
y=Leukemia$Blasts #define y

plot(x,y, data=Leukemia, main = "Leukemia data",
     xlab = "Age", ylab = "Blasts",
     pch = 19)
abline(lm(y ~ x, data = Leukemia), col = "blue")

y1=Leukemia$Smear
plot(x,y1, data=Leukemia, main = "Leukemia data",
     xlab = "Age", ylab = "Smear",
     pch = 19)
abline(lm(y1 ~ x, data = Leukemia), col = "blue")

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

SRContinent<-ggplot(world_sfdf, aes(x=LifeExpectancy,y=MeanLifetimeRisk))+
  geom_smooth(method="lm", 
              aes(group=Continent, linetype=Continent, colour= Continent))+
  geom_point()+
  xlab("Life Expectancy")+
  ylim(c(0,50))

grid.arrange(SRIncome, SRContinent, ncol=1)

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

An alternative way to display multiple plots is to use patchwork library.

library(patchwork)
SRIncome/SRContinent

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

2.2.6 Line plot

The data below is generated in the section on data simulation. The data were simulated using summary data from recent clot retrieval trials in stroke (Berkhemer et al. 2015, @pmid25671797)

library(ggplot2)
library(tidyverse)
dtTime<-read.csv("./Data-Use/dtTime_simulated.csv")
dtTrial<-read.csv("./Data-Use/dtTrial_simulated.csv")
#summarise data using group_by
dtTime2<-dtTime %>%
  group_by(period, T) %>%
  summarise(meanY=mean(Y),
            sdY=sd(Y),
            upperY=meanY+sdY,
            lowerY=meanY-sdY)

## `summarise()` has grouped output by 'period'. You can override using the
## `.groups` argument.

#individual line plot
ggplot(dtTime,aes(x=as.factor(period),y=Y))+
  geom_line(aes(color=as.factor(T),group=id))+
  scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")+ylab("NIHSS")

The line plot can also be represented as boxplot without the connecting lines.

#box plot
gg<-ggplot(dtTime,aes(x=as.factor(period),y=Y))+
  geom_boxplot(aes(color=as.factor(T)))+  
  xlab("Time")+ylab("NIHSS")
gg+scale_fill_discrete(name="Treatment")

To perform line plot on the grouped data, first fit the regression line to the grouped data.

#linear regression Y1 predict Y2 where Y1 and Y2 are grouped data 
#from the simulated data above.
fit<-lm(Y1~Y0+T, data=dtTrial)
dtTrial2<-filter(dtTrial, T==1)
fit2<-lm(Y2~Y1, data=dtTrial2)
#line plot by group
pd <- position_dodge2(width = 0.2) # move them .2 to the left and right
gbase  = ggplot(dtTime2, aes(y=meanY, colour=as.factor(T))) + geom_errorbar(aes(ymin=lowerY, ymax=upperY), width=.3, position=pd)+
geom_point(position=pd) 
gline = gbase + geom_line(position=pd) 
dtTime2$period=as.numeric(dtTime2$period)
unique((dtTime2$period))

## [1] 0 1 2 3

gline = gline %+% dtTime2
print(gline + aes(x=period))

2.2.7 Facet wrap

Facet wrap is a good way to visually explore different aspects pf the data. Using the dtTime data above, the plots are separated by trial assignment.

ggplot(dtTime,aes(x=as.factor(period),y=Y))+
  geom_line(aes(color=as.factor(T),group=id))+
  scale_color_manual(values = c("#e38e17", "#8e17e3"))+ 
  facet_wrap(~T)+ 
  xlab("Time")+ylab("NIHSS")

2.2.8 Polygons

The geom_polygon is often used in thematic plot of maps. It can be used to show polygons outside of map. It requires one data frame for coordinate and another for the values.

#simulate data
ids <- factor(c("1", "2", "3", "4","5","6"))

values <- data.frame(
  id = ids,
  value = c(3, 3.1, 3.2, 3.3,3.4,3.5)
)

a=seq(0,1.5,by=0.5)
x=c(a,a-.5,a+.5,a+.2, a-.3,a+.1)

positions <- data.frame(
  id = rep(ids, each = 4),
  x=c(a,a-.5,a+.5,a+.2, a-.3,a+.1),
  y=sample(x)
  )

# Currently we need to manually merge the two together
comb <- merge(values, positions, by = c("id"))

p <- ggplot(comb, aes(x = x, y = y)) +
  geom_polygon(aes(fill = value, group = id))
  
p

2.2.9 Gantt chart

Gantt chart can be used to illustrate project timeline. It needs a minimum of 4 data columns: Activity, Project, a start date and end date. This example below is meant as a template. If you have 6 rows of data then add 2 extra rows of data including colours.

library(tidyverse)
gantt_df<-data.frame(item=seq(1:4), 
    activity=c("Ethics submission","Length","Recruitment","Follow up"),
    category=c("Ethics","Duration","Recruitment","Follow up"),
    Start=c("2020-06-01","2021-01-01","2021-01-01","2022-01-01"),
    End=c("2021-01-01","2023-01-01","2022-01-01","2023-01-01"))

#Set factor level to order the activities on the plot
gantt_df <- mutate(gantt_df,
      activity=factor(activity,levels=activity[1:nrow(gantt_df)]),
      category=factor(category,levels=category[1:nrow(gantt_df)]))  
plot_gantt <- qplot(ymin = Start,
                    ymax = End,
                    x = activity,
                    colour = category,
                    geom = "linerange",
                    data = gantt_df,
                    size = I(10)) +    #width of line
    scale_colour_manual(values = c("blue", "red", "purple", "yellow")) +
    coord_flip() +
    xlab("") +
    ylab("") +
    ggtitle("Project plan")
plot_gantt

2.2.10 Heatmap

The ggplot2 library can also be used for creating heatmap. This plot uses the geom_tile function.

library(ggplot2)
library(plyr)

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:purrr':
## 
##     compact

library(reshape)

## 
## Attaching package: 'reshape'

## The following objects are masked from 'package:plyr':
## 
##     rename, round_any

## The following object is masked from 'package:dplyr':
## 
##     rename

## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

#swiss fertility dataset from 1888
data(swiss, package = "datasets") 

swiss$swiss_canton<-row.names(swiss) #assign column name to row name
rownames(swiss)<-NULL #remove row name
data.m <- melt(swiss)

## Using swiss_canton as id variables

 data.m <- ddply(data.m, .(variable), transform, rescale = rescale(value))
 q <- ggplot(data.m, aes(variable, swiss_canton)) + 
         geom_tile(aes(fill = rescale), colour = "white")+
        scale_fill_gradient(low = "white", high = "steelblue")+
   ggtitle("Swiss Fertility Data 1888")
 q

2.3 ggplot2 extra

The following plots retains the framework of ggplot2. Their uses require installing additional libraries.

2.3.1 Alluvial and Sankey diagram

The Sankey flow diagram uses the width of the arrow used to indicate the flow rate. It is often used to indicate energy dissipation in the system. There are several libraries providing Sankey plot such as networkD3 library. Alluvial plot is a subset of Sankey diagram but differs in having a structured workflow. The ggalluvial library is chosen here for demonstration as it forms part of the ggplot2 framework.

library(ggalluvial)
library(Stat2Data)
data("Leukemia") #treatment of leukemia
#partition Age into 8 ordered factors
Leukemia$AgeCat<-cut_interval(Leukemia$Age, n=8, ordered_result=TRUE)
#axis1
ggplot(data=Leukemia, aes (y=Smear,axis1=AgeCat, axis2=Resp,axis3=Status))+
  geom_alluvium(aes(fill=AgeCat),width = 1/12)+
  geom_label(stat = "stratum", infer.label = TRUE) +
  geom_label(stat = "stratum", infer.label = TRUE)+
  scale_x_discrete(limits = c("AgeCat","Resp", "Status"),label=c("Age Category","Treatment Response","Mortality"), expand = c(.1, .1)) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  ggtitle("Outcome after Leukemia Treatment")

2.3.2 Survival plot

The survminer library extends ggplot2 style to survival plot. It requires several libraries such as survival for survival analysis and lubridate to parse time. A description of survival analysis is provided in the Statistics section.

library(survminer)

## Loading required package: ggpubr

## 
## Attaching package: 'ggpubr'

## The following object is masked from 'package:plyr':
## 
##     mutate

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:reshape':
## 
##     stamp

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(survival)
 
data(cancer, package="survival")

#data from survival package on NCCTG lung cancer trial
#https://stat.ethz.ch/R-manual/R-devel/library/survival/html/lung.html
#time in days
#status cesored=1, dead=2sex:
#sex:Male=1 Female=2
sfit<- survfit(Surv(time, status) ~ sex, data = cancer)
ggsurvplot(sfit, data=cancer,
           surv.median.line = "hv",
           pval=TRUE,
           pval.size=3, 
           conf.int = TRUE,
           legend.labs=c("Male","Female"),xlab="Time (Days)", 
           break.time.by=50,
           font.x=5,
           font.y=5,
           ggtheme = theme_bw(),  
           risk.table=T,
           risk.table.font=2, #adjust font
           risk.table.height=.3  #adjust table height
           )

2.3.3 ggraph and tidygraph

The igraph library does the heavy lifting in graph theory analysis. This aspect will be expanded on in the chapter on Graph Theory. However, the plotting function with igraph is still not optimal. The ggraph and tidygraph libraries extend the ggplot2 style to graph.

library(tidygraph)

## 
## Attaching package: 'tidygraph'

## The following object is masked from 'package:reshape':
## 
##     rename

## The following objects are masked from 'package:plyr':
## 
##     arrange, mutate, rename

## The following object is masked from 'package:stats':
## 
##     filter

library(ggraph) 

#relationship among members of acute stroke team
tpa<-readr::read_csv("./Data-Use/TPA_edge010817.csv")%>%
  rename(from=V1, to=V2)

## 
## -- Column specification --------------------------------------------------------
## cols(
##   V1 = col_character(),
##   V2 = col_character(),
##   time.line = col_double(),
##   time.capacity = col_double()
## )

#node by degree centrality
graph<-as_tbl_graph(tpa) %>%  mutate(degree = centrality_degree())
ggraph(graph, layout = 'fr') + 
 geom_edge_link() + 
  #label size by degree centrality
  geom_node_point(aes(size=degree))+
  #label node
  geom_node_text(aes(label=name),repel=T)+
  ggtitle("Acute Stroke Team Network")

2.3.4 ggparty-decision tree

Decision tree can be plotted using ggplot2 and ggparty framework. This example uses data from the Breast Cancer dataset. It explores the effect of different histological features on class of breast tissue type.

library(ggparty)

## Warning: package 'ggparty' was built under R version 4.0.5

## Loading required package: partykit

## Loading required package: grid

## Loading required package: libcoin

## Loading required package: mvtnorm

## 
## Attaching package: 'ggparty'

## The following object is masked from 'package:ggraph':
## 
##     geom_node_label

library(partykit)
library(tidyverse)

data("BreastCancer",package = "mlbench")

#check data type
str(BreastCancer)

## 'data.frame':    699 obs. of  11 variables:
##  $ Id             : chr  "1000025" "1002945" "1015425" "1016277" ...
##  $ Cl.thickness   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
##  $ Cell.size      : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
##  $ Cell.shape     : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
##  $ Marg.adhesion  : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
##  $ Epith.c.size   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare.nuclei    : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
##  $ Bl.cromatin    : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
##  $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses        : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
##  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

BC<- BreastCancer %>% select(-Id)

treeBC<-ctree(Class~., data=BC, control = ctree_control(testtype = "Teststatistic"))

#plot tree using plot from base R
plot(treeBC)

#plot tree using ggparty

ggparty(treeBC) +
  geom_edge() +
  geom_edge_label() +
  geom_node_label(aes(label = splitvar),
                  ids = "inner") +
  geom_node_label(aes(label = info),
                  ids = "terminal")

ggparty(treeBC) +
  geom_edge() +
  geom_edge_label() +
  # map color to level and size to nodesize for all nodes
  geom_node_splitvar(aes(col = factor(level),
                         size = nodesize)) +
  geom_node_info(aes(col = factor(level),
                     size = nodesize))

2.3.5 ggROC- ROC curve

This example uses data from the Wisconsin Breast Cancer Dataset. It examines the use of fine needle aspirate to diagnose breast cancer.

library(ggROC)
library(pROC)

## Warning: package 'pROC' was built under R version 4.0.5

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following object is masked from 'package:ggROC':
## 
##     ggroc

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(party)

## Loading required package: modeltools

## Loading required package: stats4

## 
## Attaching package: 'modeltools'

## The following object is masked from 'package:plyr':
## 
##     empty

## Loading required package: strucchange

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

## 
## Attaching package: 'strucchange'

## The following object is masked from 'package:stringr':
## 
##     boundary

## 
## Attaching package: 'party'

## The following objects are masked from 'package:partykit':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp

#Wisonconsin Breast Cancer data
data("BreastCancer",package = "mlbench")
colnames(BreastCancer)

##  [1] "Id"              "Cl.thickness"    "Cell.size"       "Cell.shape"     
##  [5] "Marg.adhesion"   "Epith.c.size"    "Bare.nuclei"     "Bl.cromatin"    
##  [9] "Normal.nucleoli" "Mitoses"         "Class"

rocgg <- roc(BreastCancer$Class, as.numeric(BreastCancer$Mitoses))

## Setting levels: control = benign, case = malignant

## Setting direction: controls < cases

rocgg2 <- roc(BreastCancer$Class, as.numeric(BreastCancer$Cl.thickness))

## Setting levels: control = benign, case = malignant
## Setting direction: controls < cases

ggroc(rocgg)

#list of roc curves
roc.model <- roc(Class ~Epith.c.size+Cl.thickness+Cell.shape, 
                 data = BreastCancer)

## Setting levels: control = benign, case = malignant
## Setting direction: controls < cases

## Setting levels: control = benign, case = malignant

## Setting direction: controls < cases

## Setting levels: control = benign, case = malignant

## Setting direction: controls < cases

ggroc(roc.model)+ggtitle("Wisconsin Breast Cancer Dataset")

2.3.6 Map

Several simple examples are provided here. They illustrate the different plotting methods used according to the type of data. It is important to check the structure of the data using class() function.

2.3.6.1 Thematic map

The ggplot2 library can also be used to generate thematic (choropleth) map. Here we use map_data function from ggplot2 to obtain a map of USA. Geocoded data are contained in the long and lat columns. The US map data is in standard dataframe format. In this case, the geom_map function is used for mapping. The USArrests data contains a column for murder, assault, rape and urban population. The assault data presented here is normalised by the population data. This section will be expanded further in the Geospatial Analysis chapter.

library(dplyr)
library(ggplot2)
arrest<-data("USArrests") 
arrest<-USArrests%>% add_rownames("region") %>% 
  mutate(region=tolower(region))

## Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
## Please use `tibble::rownames_to_column()` instead.

US <- map_data("state") 
map_assault<-ggplot()+ 
  geom_map(data=US, map=US,
                    aes(x=long, y=lat, map_id=region),
                    fill="#ffffff", color="#ffffff", size=0.15)+
  #add USArrests data here
  geom_map(data=arrest, map=US,
              aes(fill=Assault/UrbanPop, map_id=region),
                    color="#ffffff", size=0.15)+
  scale_fill_continuous(low='thistle2', high='darkred', 
                                 guide='colorbar')

## Warning: Ignoring unknown aesthetics: x, y

map_assault

This is a more complex example and uses state by state COVID-19 data CDC website. Steps to extract the COVID-10 is shown in the next chapter. The shapefile for USA can also be extracted from tigris library. A challenge with plotting a map of US is that the country extends far North to Alaska and East to pacific islands.

library(ggplot2)
library(dplyr)
covid<-read.csv("./Data-Use/Covid_bystate_Table130420.csv") %>% mutate(region=tolower(Jurisdiction))
map_covid<-ggplot()+ 
  geom_map(data=US, map=US,
                    aes(x=long, y=lat, map_id=region),
                    fill="#ffffff", color="#ffffff", size=0.15)+
  #add covid data here
  geom_map(data=covid, map=US,
              aes(fill=CumulativeIncidence31.03.20, map_id=region),
                    color="#ffffff", size=0.15)+
  scale_fill_continuous(low='thistle2', high='darkred', 
                                 guide='colorbar')

## Warning: Ignoring unknown aesthetics: x, y

map_covid

In the simple example below we will generate a map of Australian State territories color by size of area. The ggplot2 combines with sf library and uses the shape file data in the geom_sf call.

library(ggplot2)
library(sf)
#shape file
Aust<-st_read("./Data-Use/GCCSA_2016_AUST.shp")

## Reading layer `GCCSA_2016_AUST' from data source `C:\Users\phant\Documents\Book\HealthcareRbook\Data-Use\GCCSA_2016_AUST.shp' using driver `ESRI Shapefile'
## Simple feature collection with 34 features and 5 fields (with 18 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 96.81694 ymin: -43.74051 xmax: 167.998 ymax: -9.142176
## geographic CRS: GDA94

colnames(Aust) #find out column variables

## [1] "GCC_CODE16" "GCC_NAME16" "STE_CODE16" "STE_NAME16" "AREASQKM16"
## [6] "geometry"

ggplot() + 
  geom_sf(data=Aust,aes(fill=AREASQKM16))+
  labs(x="Longitude (WGS84)", y="Latitude", title="Map of Australia") + 
  theme_bw()

2.3.6.2 tmap

The tmap library works in conjunction with ggplot2 and sf. The tm_shape function takes in the shape data. The tm_polygon function color the shape file with the column data of interest.

library(tmap)
load("./Data-Use/world_stroke.Rda")
#data from GBD 2016 investigators
colnames(world_sfdf)

##  [1] "FIPS"                                
##  [2] "ISO2"                                
##  [3] "ISO3"                                
##  [4] "UN"                                  
##  [5] "NAME"                                
##  [6] "AREA"                                
##  [7] "POP2005"                             
##  [8] "REGION"                              
##  [9] "SUBREGION"                           
## [10] "LON"                                 
## [11] "LAT"                                 
## [12] "MeanLifetimeRisk"                    
## [13] "Country code"                        
## [14] "y2015"                               
## [15] "Continent"                           
## [16] "Region, subregion, country or area *"
## [17] "18+"                                 
## [18] "Code"                                
## [19] "Region"                              
## [20] "Income"                              
## [21] "LendingCategory"                     
## [22] "Other"                               
## [23] "LifeExpectancy"                      
## [24] "MeanLifetimeRiskH"                   
## [25] "MeanLifetimeRiskI"                   
## [26] "geometry"                            
## [27] "Country"

class(world_sfdf) #contains simple features

## [1] "sf"         "tbl_df"     "tbl"        "data.frame"

#map of country income
m<-tm_shape(world_sfdf)+tm_polygons("Income")
m

#save object as png
#tmap_save(m,file="world_income.png"")

#save as leaflet object
#tmap_save(m,file="world_income.html"")

#map of lifetime  stroke risk
n<-tm_shape(world_sfdf)+tm_polygons("MeanLifetimeRisk")
n

2.3.6.3 Voronoi

The ggplot2 and sf libraries are extended to include drawing of voronoi. Voronoi is a special type of polygon. It can be seen as a mathematical approach to partition regions such that all points within the polygons are closest (depending on distance metrics) to the seed points. Voronoi has been used in disease mapping (John Snow mapping of Broad Street Cholera outbreak) and meteorology (Alfred Thiessen polygon method for measuring rainfall in basin). This is a more complex coding task. It uses the geom_voronoi call from ggvoronoi library. Some libraries have vignettes to help you implement the codes. The vignette in the ggvoronoi library can be called using vignette(“ggvoronoi”). The osmdata library will be used to provide map from OpenStreetMap. A related library is OpenStreetMap. The latter library uses raster file whereas osmdata provides vectorised map data. In the chapter on Geospatial Analysis we will expand on this theme with interactive map. One issue with the use of voronoi is that there are infinite sets and so a boundary needs to set. In the example below, the boundary was set for Greater Melbourne.

library(dplyr)
library(ggvoronoi)
library(ggplot2)
library(sf)

#subset data with dplyr for metropolitan melbourne
msclinic<-read.csv("./Data-Use/msclinic.csv") %>% filter(clinic==1, metropolitan==1)

Victoria<-st_read("./Data-Use/GCCSA_2016_AUST.shp") %>% filter(STE_NAME16=="Victoria")

## Reading layer `GCCSA_2016_AUST' from data source `C:\Users\phant\Documents\Book\HealthcareRbook\Data-Use\GCCSA_2016_AUST.shp' using driver `ESRI Shapefile'
## Simple feature collection with 34 features and 5 fields (with 18 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 96.81694 ymin: -43.74051 xmax: 167.998 ymax: -9.142176
## geographic CRS: GDA94

m<-ggplot(msclinic)+
  geom_point(aes(x=lon,y=lat))+ 
  #add hospital location
  geom_voronoi(aes(x=lon,y=lat,fill=distance),fill=NA, color="black")+ 
  #create voronoi from hospital location
  geom_sf(data=Victoria,aes(fill=AREASQKM16)) +
  labs(x="Longitude (WGS84)", y="Latitude", 
       title="Voronoi Map of MS Clinics in Melbourne")
m

This map is not so useful as the features of Victoria overwhelm the features of Greater Melbourne.

2.3.7 ggwordcloud

The ggwordcloud library extend the ggplot2 family to creating wordcloud. The following is an illustration of wordcloud created from comments on Youtube video “Stroke Heroes Act Fast”.

library(ggwordcloud)
library(tm)

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

library(tidyverse)
heroes_df<-read.csv("./Data-Use/Feb_01_1_49_59 PM_2018_AEDT_YoutubeData.csv",stringsAsFactors = FALSE)
#cleaning data
keywords <- heroes_df$Comment 
keywords <- iconv(keywords, to = 'utf-8')
#create corpus
myCorpus <- VCorpus(VectorSource(keywords))
#lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
#remove numer
myCorpus <- tm_map(myCorpus, removeNumbers)
#remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
#remove stopwords
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"),lazy=TRUE) 
#remove white space
myCorpus <- tm_map(myCorpus, stripWhitespace, lazy=TRUE)
#term document matrix
dtm <- DocumentTermMatrix(myCorpus,control = list(wordLengths=c(3, 20)))
#remove sparse terms
dtm<-removeSparseTerms(dtm, 0.95)
#remove words of length <=3
tdm=TermDocumentMatrix(myCorpus,
                       control = list(minWordLength=4,maxWordLength=20) )
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
#remove words with frequency <=1
d <- data.frame(word = names(v),freq=v) %>% filter(freq>1)
#wordcloud
ggplot(data = d, 
       aes(label = word, size = freq, col = as.character(freq))) + 
  geom_text_wordcloud(rm_outside = TRUE, max_steps = 1,
                      grid_size = 1, eccentricity = .8)+
    scale_size_area(max_size = 12)+
    scale_color_brewer(palette = "Paired", direction = -1)+
  theme_void()

2.3.8 gganimate

The library gganimate can be used to generate videos in the form of gif file. The data needs to be collated from wide to long format. For the purpose of this book, the code has been turned off.

library(ggplot2)
library(gganimate)

#ensure data in long format, Y =NIHSS
u <- ggplot(dtTime2, aes(period, meanY , color = T, frame = period)) +
  geom_bar(stat="identity") +
  geom_text(aes(x = 11.9, label = period), hjust = 0) +   xlim(0,13)+
  coord_cartesian(clip = 'off') + 
  facet_wrap(~meanY)+
  labs(title = 'ECR Trials', y = 'Number') +
transition_reveal(period)
u
#create gif
#animate(u,fps=15,duration=15)
#anim_save(u,file="./Data-Use/simulated_ECR_trial.gif", width=800, height=800)

2.3.9 ggneuro

There are several ways of plotting mri scans. The ggneuro library is illustrated here as it relates to the ggplot2 family. The Colin \(T_1\) scan is a high resolution scan from MNI.

library(ggneuro)
library(neurobase)

## Loading required package: oro.nifti

## oro.nifti 0.11.0

## 
## Attaching package: 'oro.nifti'

## The following object is masked from 'package:tidygraph':
## 
##     slice

## The following object is masked from 'package:dplyr':
## 
##     slice

colin_1mm<-untar("./Data-Use/colin_1mm.tgz")
colinIm<-readNIfTI("colin_1mm") 
ggortho(colinIm)

2.4 plotly

Plotly has its own API and uses Dash to upload the figure on the web. It has additional ability for interaction as well as create a video. Examples are provided with calling plotly directly or via ggplot2. In the examples below, the plots are performed using ggplot2 and then pass onto plotly using ggplotly function. The example uses the Leukemia dataset from Stat2Data library.

2.4.1 Scatter plot with plotly

The plotly syntax uses a ~ after the = symbol to identify a variable for plotting.

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:neurobase':
## 
##     colorbar

## The following object is masked from 'package:oro.nifti':
## 
##     slice

## The following object is masked from 'package:reshape':
## 
##     rename

## The following objects are masked from 'package:plyr':
## 
##     arrange, mutate, rename, summarise

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(Stat2Data)
data("Leukemia") #treatment of leukemia
#scatter plot directly from plotly
plot_ly(x=~Age,y=~Smear, #percentage of blast
        color=~as.factor(Status),  #dead or alive
        symbol=~Resp, 
        symbols=c('circle' ,'square'), #Response to treatment
        data=Leukemia)

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

## Warning: `arrange_()` was deprecated in dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

2.4.2 Bar plot with plotly

Plotly can uses a ggplot object directly.

#using the epilepsy data in p generated above for bar plot
ggplotly(p)

## Warning: `group_by_()` was deprecated in dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help

2.4.3 Heatmap

#using swiss data above to create heatmap
ggplotly(q)

2.4.4 map

library(ggplot2)
library(plotly)
library(dplyr)

covid<-read.csv("./Data-Use/Covid_bystate_Table130420.csv") %>% mutate(region=tolower(Jurisdiction)) #consistent

ggplotly(
  ggplot()+ 
  geom_map(data=US, map=US,
                    aes(x=long, y=lat, map_id=region),
                    fill="#ffffff", color="#ffffff", size=0.15)+
  #add covid data here
  geom_map(data=covid, map=US,
              aes(fill=NumberCases31.03.20, map_id=region),
                    color="#ffffff", size=0.15)+
  scale_fill_continuous(low='thistle2', high='darkred', 
                                 guide='colorbar')
)

## Warning: Ignoring unknown aesthetics: x, y

You can write citations, too. For example, we are using the bookdown package (Xie 2020) in this sample book, which was built on top of R Markdown and knitr (Xie 2015).

References

Berkhemer, O. A., P. S. Fransen, D. Beumer, L. A. van den Berg, H. F. Lingsma, A. J. Yoo, W. J. Schonewille, et al. 2015. “A randomized trial of intraarterial treatment for acute ischemic stroke.” N. Engl. J. Med. 372 (1): 11–20.

Seneviratne, U., Z. M. Low, Z. X. Low, A. Hehir, S. Paramaswaran, M. Foong, H. Ma, and T. G. Phan. 2019. “Medical health care utilization cost of patients presenting with psychogenic nonepileptic seizures.” Epilepsia 60 (2): 349–57.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. http://yihui.name/knitr/.

Xie, Yihui. 2020. Bookdown: Authoring Books and Technical Documents with R Markdown. https://github.com/rstudio/bookdown.