Chapter 3 Data Wrangling

This chapter deals with the nitty-gritty of data analysis. There are no nice plots like those in the previous chapter. In fact, data wrangling is a major aspect of data science.

3.1 Data

This section on atomic vectors, arrays, matrices and lists is often considered boring and ignored.

3.1.1 Vector, Arrays, Matrix

3.1.1.1 Vector and list

All elements of an atomic vector or array share the same type. A vector has one dimension, a matrix has two dimensions and an array can have two or more. A list can contain different types of data. A complex example of a structured list is the JSON format shown later in this chapter. In base R, c is the function for creating a vector; the function list is used to create a list. It is best to avoid using c as a name for a data frame or vector (Wickham 2019).

a<-c(1,2,3)
is.vector(a)
## [1] TRUE
class(a) #numeric vector
## [1] "numeric"
b<-c("apple","orange","banana")
is.vector(b) #character vector
## [1] TRUE
class(b)
## [1] "character"
d<-c(1,2,"banana")
is.list(d) #character vector
## [1] FALSE
class(d) #FALSE
## [1] "character"
e<-list(a,b,c(TRUE,FALSE,FALSE))
is.list(e) #TRUE
## [1] TRUE

3.1.1.2 Matrix and Arrays

Arrays hold data of a single type arranged in rows, columns and, for higher dimensions, slices; an array carries a description of its number of dimensions. An array can be accessed by indexing. Later in this chapter, we will illustrate the importance of arrays for manipulating MRI scans. A volumetric MRI scan has 3 dimensions [,,]: the first index corresponds to the sagittal plane, the second to the coronal plane and the third to the axial plane. In the example below, this knowledge of arrays can be used to reverse the ordering of information in an MRI scan (flip it on its axis between left and right).

#vector
vector1<-c(1,2,3)
vector2<-c(4,5,6,7,8,9)
array1<-array(c(vector1,vector2),dim = c(3,3,2))
array1
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
array2<-array(c(vector1,vector2),dim = c(2,2,3))
array2
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
## 
## , , 3
## 
##      [,1] [,2]
## [1,]    9    2
## [2,]    1    3
#check if array or matrix
is.matrix(array1)
## [1] FALSE
is.array(array1)
## [1] TRUE

An array can be accessed by indexing its structure.

array1[,,1]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

The second slice of array1

array1[,,2]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

First row of array1

array1[1,,]
##      [,1] [,2]
## [1,]    1    1
## [2,]    4    4
## [3,]    7    7

First column of array1

array1[,1,]
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3

3.1.2 apply, lapply, sapply

3.1.2.1 apply

The apply function works on data in matrix or array format, applying a function over the specified margins.

#sum data across rows ie c(1)
apply(array1,c(1),sum)
## [1] 24 30 36
#sum data across columns(2)
apply(array1,c(2),sum)
## [1] 12 30 48
#sum data across rows and columns c(1,2)
apply(array1,c(1,2),sum)
##      [,1] [,2] [,3]
## [1,]    2    8   14
## [2,]    4   10   16
## [3,]    6   12   18

3.1.2.2 lapply

The lapply call applies a function to each element of a list and returns a list. In the section below on medical images, the lapply function will be used to create a list of NIfTI files which can then be opened painlessly with an additional call to readNIfTI. Here a simpler example is used.

a<-c(1,2,3,4,5,6,7,8)
lapply(a, function(x) x^3)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 8
## 
## [[3]]
## [1] 27
## 
## [[4]]
## [1] 64
## 
## [[5]]
## [1] 125
## 
## [[6]]
## [1] 216
## 
## [[7]]
## [1] 343
## 
## [[8]]
## [1] 512

3.1.2.3 sapply

The sapply call applies a function to a vector, matrix or list. It simplifies the result, returning a vector or matrix where possible.

a<-c(1,2,3,4)
sapply(a, function(x) x^2)
## [1]  1  4  9 16

3.1.2.4 tapply

The tapply function applies a function to subsets of the data defined by a grouping variable.

dat.array1<-as.data.frame.array(array1)
dat.array1
##   V1 V2 V3 V4 V5 V6
## 1  1  4  7  1  4  7
## 2  2  5  8  2  5  8
## 3  3  6  9  3  6  9
dat.array1$V6<- dat.array1$V6 %% 2
dat.array1$V6
## [1] 1 0 1
tapply(dat.array1$V1, dat.array1$V6, sum) 
## 0 1 
## 2 4

3.1.2.5 rapply

The rapply function, or recursive apply, applies a function to the elements of a list, contingent on the classes argument. In this example, the function multiplies each element of the list by 2, contingent on that element being numeric.

a<-list(1,2,3.4,"D",5,"NA")

rapply(a, function(x) x*2, classes="numeric")
## [1]  2.0  4.0  6.8 10.0

3.1.3 Simple function

A function is written by creating a name for the function, calling function(x) with its arguments, and using curly brackets to define the body.

# function for arithmetic sum
ArithSum<-function (x) {
  sum(x)
}

vector1<-c(1,2,3)
ArithSum(vector1)
## [1] 6

3.1.4 for loop

vector2<-c(4,5,6)

for (i in vector2){
  print(i)
}
## [1] 4
## [1] 5
## [1] 6

Perform a math operation using a for loop. Note that the body below does not use the loop index i, so the same mean is computed on each iteration.

for (i in vector2){
  m= sum(vector2)/length(vector2)
}

m
## [1] 5

The next statement can be used to skip an iteration of the loop. This simple example skips even numbers.

for(i in vector2){
  if(!i %% 2){
    next
  }
  print(i)
}
## [1] 5
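
The related break statement exits the loop entirely rather than skipping a single iteration. A minimal sketch using the same vector:

for(i in vector2){
  if(i %% 2 == 0 && i > 4){
    break #stop at the first even number greater than 4
  }
  print(i)
}
## [1] 4
## [1] 5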

Perform a math operation along the columns.

A=c(1,2,3)
B=c(4,5,6)
df=data.frame(A,B)

output_col <- vector("double", ncol(df))  # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output_col[[i]] <- mean(df[[i]])      # 3. body
}
output_col
## [1] 2 5

Perform a math operation along the rows. Note that looping over rows requires seq_len(nrow(df)) rather than seq_along(df), which runs over columns.

output_row <- vector("double", nrow(df))    # 1. output
for (i in seq_len(nrow(df))) {              # 2. sequence over rows
  output_row[[i]] <- mean(unlist(df[i, ]))  # 3. body
}
output_row
## [1] 2.5 3.5 4.5

3.1.5 Functional

A functional is a function that takes another function as its input. Later, we will revisit functionals to perform analysis on a list of NIfTI files.

Fou<-lapply(df, function(A) mean(A)/sd(A))

#returns a list
Fou
## $A
## [1] 2
## 
## $B
## [1] 5

The unlist function simplifies the result to a named vector.

Fou2<-unlist(lapply(df, function(A) mean(A)/sd(A)))

#returns a named vector
Fou2
## A B 
## 2 5

3.1.5.1 Mapply

The Map call applies a function over corresponding elements of the data; mapply is its simplifying cousin. Here each column of df is divided by the column means computed with lapply.

MeanFou<-lapply(df, mean)

MeanFou[]<-Map('/', df, MeanFou)

The equivalent code with lapply is provided below.

MeanFou2<-lapply(df, function(M) M/mean(M))

MeanFou2
## $A
## [1] 0.5 1.0 1.5
## 
## $B
## [1] 0.8 1.0 1.2

3.2 Data storage

Often one assumes that opening RStudio is sufficient to locate the file and run the analysis. One way of setting the working directory at the console is to click on the Session tab, then Set Working Directory, and choose the location of the file. Another way of doing this seamlessly is to use the here library. It is easy to find the files in your directory using the list.files() call.
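
A minimal sketch of the here approach, which builds file paths relative to the project root (the csv file name here is hypothetical):

library(here)
here() #returns the project root
A<-read.csv(here("Data-Use","File.csv")) #hypothetical file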

To list all files in a directory, call list.files() without arguments. To list files matching a pattern:

#list files matching pattern
list.files(pattern=".Rmd")
##  [1] "01-intro.Rmd"                        "02-Data-Wrangling.Rmd"              
##  [3] "03-Statistics.Rmd"                   "04-multivariate-analysis.Rmd"       
##  [5] "05-machinelearning.Rmd"              "06-machinelearningpt2.Rmd"          
##  [7] "07-Bayesian-analysis.Rmd"            "08-operational-research.Rmd"        
##  [9] "09-graph-theory.Rmd"                 "10-geospatial-analysis.Rmd"         
## [11] "11-App.Rmd"                          "12-Appendix.Rmd"                    
## [13] "13-references.Rmd"                   "Applications-of-R-in-Healthcare.Rmd"
## [15] "index.Rmd"

3.2.1 Data frame

A data frame is a convenient way of formatting data in table format. It is worth checking the structure of the data: some libraries prefer to work with data frames while others prefer a matrix or array structure.

a<-c(1,2,3)
b<-c("apple","orange","banana")
e<-data.frame(a,b)
rownames(e)
## [1] "1" "2" "3"

3.2.2 Excel data

Excel data are stored as csv, xls and xlsx files. Csv files can be opened in base R using the read.csv function or with the read_csv function from the readr library. I would urge you to get used to manipulating data in R, as the code serves as a record of the changes made to the data. The original Excel data should not be touched outside of R. A problem with Excel is that its autocorrect function changes gene names such as SEPT1 and MARCH1 (interpreting them as dates); SEPT1 has since been relabeled as SEPTIN1.

A<-read.csv("File.csv")

B<-readr::read_csv ("File.csv")

The readxl library can be used to open files ending in xls and xlsx.

C<-readxl::read_xlsx("File.xlsx",skip=1) #skip first row 

D<-readxl::read_xlsx("File.xlsx",sheet=2) #read data from sheet 2

3.2.2.1 Date and time

Date and time can be handled in base R. The lubridate library is useful for parsing date data. It is possible to get an overview of the functions of the library by typing help(package=lubridate). Errors with parsing can occur if there are characters in the column containing date data.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
dfdate<-data.frame("DateofEvent"=c("12/03/2005","12/04/2006",NA),
                   "Result"=c(4,5,6))
class(dfdate$DateofEvent)
## [1] "character"
dfdate$DateofEvent
## [1] "12/03/2005" "12/04/2006" NA
#remove NA using filter
dfdate %>% filter(!is.na(DateofEvent))
##   DateofEvent Result
## 1  12/03/2005      4
## 2  12/04/2006      5
#re assigning date data type
dfdate$DateofEvent2<-as.POSIXct(dfdate$DateofEvent)
class(dfdate$DateofEvent2)
## [1] "POSIXct" "POSIXt"
dfdate$DateofEvent2
## [1] "0012-03-20 LMT" "0012-04-20 LMT" NA

Problems can occur when date and time are located in separate columns. The first issue is that time data are assigned the default date 1899-12-30, because that is the base date in MS Office. This issue is compounded when there are 2 time values, eg a patient has stroke onset at 22:00 and is scanned at 02:00. The data become 1899-12-31 22:00 and 1899-12-31 02:00, implying that scanning occurred before stroke onset. This could have been avoided at the data collection stage by having 2 separate date columns. There are several solutions, including ifelse, but care must be taken: the ifelse statement converts date-time data to numeric class. This can be resolved by embedding the ifelse statement within as.Date, which requires an origin argument. The logical comparison may also fail when the default dates differ, eg 1904-02-08 02:00, since 1904-02-08 02:00 is greater than 1899-12-31 22:00.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v stringr 1.4.0
## v tidyr   1.1.4     v forcats 0.5.0
## v readr   1.4.0
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()
df<-data.frame("Onset_date"=c("2014-03-06","2013-06-09"), "Onset_Time"=c("1899-12-31 08:03:00","1899-12-31 22:00:00"), "Scan_Time"=c("1904-02-08 10:00:00","1899-12-31 02:00:00")) %>% 
  mutate(
  #use update argument to reassign year, month and day
  Scantime=update(ymd_hms(Scan_Time), year=year(ymd(Onset_date)), month=month(ymd(Onset_date)), mday=day(ymd(Onset_date))),
 
  Onsettime=update(ymd_hms(Onset_Time), year=year(ymd(Onset_date)), month=month(ymd(Onset_date)), mday=day(ymd(Onset_date))),
  
  Scan_date=ifelse(Onsettime>Scantime,1,0),
  Scantime=Scantime+Scan_date*hms("24:00:00"),
  
 DiffHour=Scantime-Onsettime #difference in hours
)

df %>% select(-Scan_date) %>% gt::gt()
Onset_date   Onset_Time            Scan_Time             Scantime              Onsettime             DiffHour
2014-03-06   1899-12-31 08:03:00   1904-02-08 10:00:00   2014-03-06 10:00:00   2014-03-06 08:03:00   1.95
2013-06-09   1899-12-31 22:00:00   1899-12-31 02:00:00   2013-06-10 02:00:00   2013-06-09 22:00:00   4

One way of storing data in R format is to save the file as .Rda. This format ensures that no one can accidentally overwrite or delete a value. For very large data, it is also quicker to save an .Rda file than a csv file.
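
A minimal sketch of saving and reloading (the object and path are illustrative):

save(dfdate, file="./Data-Use/dfdate.Rda") #write the object to disk
load("./Data-Use/dfdate.Rda") #restores dfdate into the workspace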

3.2.3 Foreign data

The foreign library is traditionally used to handle data from SPSS (.sav), Stata (.dta) and SAS (.sas). One should look at the file extension to determine the necessary library.

library(foreign)
#write and read Stata
write.dta(dfdate,file="./Data-Use/dfdate_temp.dta")
a<-read.dta("./Data-Use/dfdate_temp.dta")
a

The foreign library can handle older SAS files but not the current version. The current version of SAS files requires the sas7bdat library; the clue is a file ending in .sas7bdat.
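
A sketch of reading such a file (the path is hypothetical):

library(sas7bdat)
s<-read.sas7bdat("./Data-Use/example.sas7bdat") #hypothetical file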

3.2.4 json format

JSON is short for JavaScript Object Notation. These files have a hierarchical structured format. A json file is in text format and can also be examined using Notepad. These files can be read using the RJSONIO or rjson libraries in R.

library(RJSONIO)
j<-fromJSON("./Data-Use/0411.geojson")
j<-lapply(j, function(x) {
  x[sapply(x,is.null)]<-NA
  unlist(x)
})
k<-as.data.frame(do.call("cbind",j)) #list to data frame

3.3 Tidy data

Attention to the collection of data is important as it shapes how the analysis is performed. In general, each row represents an observation and each column represents a variable or attribute of that observation. Sometimes there is a temptation to embed 2 types of attributes into one column.

df2<-data.frame(Sex=c("Male","Female"), Test=c("positive 5 negative 5",
                        " negative 0 negative 10"))
df2
##      Sex                    Test
## 1   Male   positive 5 negative 5
## 2 Female  negative 0 negative 10

The above example should instead be entered this way. This change allows one to group variables by Test status: 'positive' or 'negative'. One can easily perform a t-test here (not recommended in this case as the data contain only 2 rows).

df2<-data.frame(Sex=c("Male","Female"), `Test Positive` =c(5,0), `Test Negative`=c(5, 10))
df2
##      Sex Test.Positive Test.Negative
## 1   Male             5             5
## 2 Female             0            10

The example below illustrates how to collapse columns using base R.

dfa<-data.frame(City=c("Melbourne","Sydney","Adelaide"),
                State=c("Victoria","NSW","South Australia"))
#collapsing City and State columns and generate new column address
dfa$addresses<-paste0(dfa$City,",", dfa$State) #separate by comma
dfa$addresses2<-paste0(dfa$City,",", dfa$State,", Australia")
dfa
##        City           State                addresses
## 1 Melbourne        Victoria       Melbourne,Victoria
## 2    Sydney             NSW               Sydney,NSW
## 3  Adelaide South Australia Adelaide,South Australia
##                            addresses2
## 1       Melbourne,Victoria, Australia
## 2               Sydney,NSW, Australia
## 3 Adelaide,South Australia, Australia

This example is the same as above but uses verbs from tidyr. This is useful for collapsing addresses for geocoding.

library(tidyr)
dfa1<-dfa %>% unite ("new_address",City:State,sep = ",")
dfa1
##                new_address                addresses
## 1       Melbourne,Victoria       Melbourne,Victoria
## 2               Sydney,NSW               Sydney,NSW
## 3 Adelaide,South Australia Adelaide,South Australia
##                            addresses2
## 1       Melbourne,Victoria, Australia
## 2               Sydney,NSW, Australia
## 3 Adelaide,South Australia, Australia

Using the data above, let's split the addresses column.

library(tidyr)
dfa2<-dfa1 %>% separate(addresses, c("City2", "State2"))
## Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
dfa2
##                new_address     City2   State2
## 1       Melbourne,Victoria Melbourne Victoria
## 2               Sydney,NSW    Sydney      NSW
## 3 Adelaide,South Australia  Adelaide    South
##                            addresses2
## 1       Melbourne,Victoria, Australia
## 2               Sydney,NSW, Australia
## 3 Adelaide,South Australia, Australia

3.3.1 Factors

There are several types of factors in R: ordered and unordered. It is important to pay attention to how factors are coded. Sometimes male is represented as 1 and female as 0; sometimes female is represented as 2. This discussion may seem trivial, but a paper in the high impact journal JAMA was retracted because of miscoding of the trial assignment as 1 and 2 rather than 0 and 1. This error reversed the results of the logistic regression (Aboumatar and Wise 2019) and led to a report that an outpatient management program for chronic obstructive pulmonary disease resulted in fewer admissions. Below is an example of what can occur when data are transformed into factors and back to numbers. Note that the coding goes from 0 and 1 to 1 and 2.

In certain analyses, libraries prefer the dependent or outcome variable to be binary coded in numeric format, eg for logistic regression and random forest. The e1071 library for performing support vector machine analysis prefers the outcome variable as a factor.

library(Stat2Data)
data("Leukemia") #treatment of leukemia

Leukemia %>% dplyr::glimpse()
## Rows: 51
## Columns: 9
## $ Age    <int> 20, 25, 26, 26, 27, 27, 28, 28, 31, 33, 33, 33, 34, 36, 37, 40,~
## $ Smear  <int> 78, 64, 61, 64, 95, 80, 88, 70, 72, 58, 92, 42, 26, 55, 71, 91,~
## $ Infil  <int> 39, 61, 55, 64, 95, 64, 88, 70, 72, 58, 92, 38, 26, 55, 71, 91,~
## $ Index  <int> 7, 16, 12, 16, 6, 8, 20, 14, 5, 7, 5, 12, 7, 14, 15, 9, 12, 4, ~
## $ Blasts <dbl> 0.6, 35.0, 7.5, 21.0, 7.5, 0.6, 4.8, 10.0, 2.3, 5.7, 2.6, 2.5, ~
## $ Temp   <int> 990, 1030, 982, 1000, 980, 1010, 986, 1010, 988, 986, 980, 984,~
## $ Resp   <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, ~
## $ Time   <int> 18, 31, 31, 31, 36, 1, 9, 39, 20, 4, 45, 36, 12, 8, 1, 15, 24, ~
## $ Status <int> 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ~

The variable Resp takes values 0 and 1. Below it is converted to a factor with levels 0 and 1.

Leukemia$Resp
##  [1] 1 1 1 1 1 0 1 1 1 0 1 1 0 1 0 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0
## [39] 0 1 0 0 0 1 0 0 1 0 1 1 0
Leukemia$Response.factor<-as.factor(Leukemia$Resp)
head(Leukemia$Response.factor)
## [1] 1 1 1 1 1 0
## Levels: 0 1

Note that in the conversion back to numeric 'dummy' values, the data take the form 1 and 2. This has changed the original dummy values of 0 and 1. It is important to examine the data before running an analysis.

Leukemia$Response.numeric<-as.numeric(Leukemia$Response.factor)
Leukemia$Response.numeric
##  [1] 2 2 2 2 2 1 2 2 2 1 2 2 1 2 1 2 2 1 2 2 1 1 1 1 2 2 1 1 2 1 2 1 1 1 1 1 1 1
## [39] 1 2 1 1 1 2 1 1 2 1 2 2 1
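
A safer route back to the original 0/1 coding is to convert via character first (a minimal sketch):

Leukemia$Response.numeric2<-as.numeric(as.character(Leukemia$Response.factor))
head(Leukemia$Response.numeric2)
## [1] 1 1 1 1 1 0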

For factor variables whose levels are words rather than numbers, it is necessary to convert to class character before recoding to dummy values.

data("BreastCancer", package="mlbench")
BreastCancer %>% glimpse()
## Rows: 699
## Columns: 11
## $ Id              <chr> "1000025", "1002945", "1015425", "1016277", "1017023",~
## $ Cl.thickness    <ord> 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, 8, 7, 4, 4, ~
## $ Cell.size       <ord> 1, 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, 1, 7, 4, 1, 1,~
## $ Cell.shape      <ord> 1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1, 5, 6, 1, 1,~
## $ Marg.adhesion   <ord> 1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, 10, 4, 1, 1,~
## $ Epith.c.size    <ord> 2, 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2, 7, 6, 2, 2, ~
## $ Bare.nuclei     <fct> 1, 10, 2, 4, 1, 10, 10, 1, 1, 1, 1, 1, 3, 3, 9, 1, 1, ~
## $ Bl.cromatin     <fct> 3, 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3, 5, 4, 2, 3, ~
## $ Normal.nucleoli <fct> 1, 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, 5, 3, 1, 1, ~
## $ Mitoses         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 4, 1, 1, 1, ~
## $ Class           <fct> benign, benign, benign, benign, benign, malignant, ben~

The steps for conversion are illustrated below. Conversion of multiple columns of factors and ordered factors can be done in one step using the lapply function, as sketched after this example.

BreastCancer$Class<-as.character(BreastCancer$Class)
BreastCancer$Class[BreastCancer$Class=="benign"]<-0
BreastCancer$Class[BreastCancer$Class=="malignant"]<-1
BreastCancer$Class<-as.numeric(BreastCancer$Class)
head(BreastCancer$Class)
## [1] 0 0 0 0 0 1
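
A sketch of the one-step lapply conversion, applied here to the ordered-factor columns of BreastCancer (columns 2 to 6), converting via character to preserve the underlying values:

BreastCancer[,2:6]<-lapply(BreastCancer[,2:6],
                           function(x) as.numeric(as.character(x)))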

This illustration describes conversion of a continuous variable into ordered factors.

library(Stat2Data)
data("Leukemia") #treatment of leukemia
#partition Age into 8 ordered factors
Leukemia$AgeCat<-ggplot2::cut_interval(Leukemia$Age, n=8, ordered_result=TRUE)
class(Leukemia$AgeCat)
## [1] "ordered" "factor"

3.3.2 Multiple files

Merging of files can be done using the dplyr verbs inner_join, full_join, left_join and right_join. Note that this can also be done in base R with merge or using data.table syntax. These joins can be chained using the %>% operator.

DF1<-data.frame(ID=c(1,2,3),Age=c(20,30,40))
DF2<-data.frame(ID=c(1,2,3),Sex=c(1,0,0))
DF3<-data.frame(ID=c(1,2,3), Diabetes=c(1,1,0))

DF <-DF1 %>% left_join(DF2, by="ID") %>% left_join(DF3, by="ID")
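
The base R equivalent nests calls to merge (a sketch producing the same result):

DF.base<-merge(merge(DF1,DF2,by="ID"),DF3,by="ID")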

3.3.3 Pivot

A variety of different expressions are used to describe data format, such as wide and long formats. In some cases the distinction between such formats is not clear. The tidyr verbs for performing these operations are pivot_wider and pivot_longer; data.table uses different verbs, dcast and melt. In general, most regression analyses are performed with data in wide format, where each row represents a unique ID. Longitudinal analyses are performed with data in long format, where there are several rows with the same ID. The next chapter on Statistics shows an example of data generated in wide format and converted to long format using plyr. Here we will demonstrate the use of tidyr to pivot longer or wider.

The best way to think about how data should be presented is that data are analyzed by columns, not rows. The data below were extracted from the CDC COVID-19 website. Details of how this task was performed are given below under Web scraping.

library(dplyr)
library(tidyr)
library(stringr)
usa<-read.csv("./Data-Use/Covid_bystate_Table130420.csv")
# for demonstration we will select 3 columns of interest
usa_long <-usa %>% 
  select(Jurisdiction,NumberCases31.03.20,NumberCases07.04.20) %>% pivot_longer(-Jurisdiction,names_to = "Date",values_to = "Number.Cases")  
usa_long$Date <- str_replace(usa_long$Date,"NumberCases","")

#data in wide format
head(usa %>%select(Jurisdiction,NumberCases31.03.20,NumberCases07.04.20),6) 
##   Jurisdiction NumberCases31.03.20 NumberCases07.04.20
## 1      Alabama                 999                2197
## 2       Alaska                 133                 213
## 3      Arizona                1289                2575
## 4     Arkansas                 560                 993
## 5   California                8131               15865
## 6     Colorado                2966                5429
#data in long format
head(usa_long,6) 
## # A tibble: 6 x 3
##   Jurisdiction Date     Number.Cases
##   <chr>        <chr>           <int>
## 1 Alabama      31.03.20          999
## 2 Alabama      07.04.20         2197
## 3 Alaska       31.03.20          133
## 4 Alaska       07.04.20          213
## 5 Arizona      31.03.20         1289
## 6 Arizona      07.04.20         2575
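
The reverse operation, returning to wide format, can be sketched with pivot_wider:

usa_wide<-usa_long %>% 
  pivot_wider(names_from = Date, values_from = Number.Cases,
              names_prefix = "NumberCases")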

3.4 Regular Expressions

Here is a short tutorial on regular expressions, beginning with base R. This section is based on experience cleaning a data frame containing many different words used to describe one disease or one drug.

3.4.1 base R

#create example dataframe
df<-data.frame(
  drug=c("valium 1mg","verapamil sr","betaloc zoc","tramadol","valium (diazepam)"),
  infection=c("pneumonia","aspiration pneumonia","tracheobronchitis","respiratory tract infection","respiratory.tract.infection"))
df
##                drug                   infection
## 1        valium 1mg                   pneumonia
## 2      verapamil sr        aspiration pneumonia
## 3       betaloc zoc           tracheobronchitis
## 4          tramadol respiratory tract infection
## 5 valium (diazepam) respiratory.tract.infection

Now that we have a data frame, we can use pattern matching to replace part of a phrase. This step can be done simply using the gsub command. First create a list so that the computer searches for the phrases in the list.

#create list to remove phrase
redun=c("1mg", "zoc", "sr")
pat=paste0("\\b(",paste0(redun,collapse = "|"),")\\b")
df$drug1<-gsub(pat,"",df$drug)
df$drug1
## [1] "valium "           "verapamil "        "betaloc "         
## [4] "tramadol"          "valium (diazepam)"
#create list to remove phrase
redunc1=c("respiratory tract infection", "tracheobronchitis", "aspiration")
pat=paste0("\\b(",paste0(redunc1,collapse = "|"),")\\b")
df$infection1<-gsub(pat,"",df$infection)
df$infection1
## [1] "pneumonia"                   " pneumonia"                 
## [3] ""                            ""                           
## [5] "respiratory.tract.infection"

This section deals with metacharacters. Examples of metacharacters include $ . + * ? ^ () {} []. To match these metacharacters literally, they require a double backslash \.

#create list to remove phrase
redun=c("1mg", "zoc", "sr")
pat=paste0("\\b(",paste0(redun, collapse = "|"),")\\b")   
df$drug2<-gsub(pat,"",df$drug)

#[a-z] matches any single lower-case letter
#[a-z]+ matches one or more consecutive letters
df$drug2<-gsub("\\(|[a-z]+\\)","",df$drug2)
df$drug2
## [1] "valium "    "verapamil " "betaloc "   "tramadol"   "valium "

Back to our data frame df, we want to remove or change the different words accounting for pneumonia.

redunc=c("\\.")
redunc1=c("respiratory tract infection", "tracheobronchitis", "aspiration")
pat=paste0("\\b(",paste0(redunc,collapse = "|"),")\\b")
df$infection2<-gsub(pat," ",df$infection)
pat=paste0("\\b(",paste0(redunc1,collapse = "|"),")\\b")
df$infection2<-gsub(pat," ",df$infection2)
df$infection2
## [1] "pneumonia"   "  pneumonia" " "           " "           " "

3.4.2 stringr

The following examples are taken from Excel after conversion from pdf; errors were introduced in the conversion process. A full list of the options available can be found at https://stringr.tidyverse.org/articles/regular-expressions.html

library(stringr)
#error introduced by double space 
a<-c("8396 (7890 to 8920)","6 301 113(6 085 757 to 6 517 308)",
     "4 841 208 (4 533 619 to 5 141 654)",
     "1 407 701 (127 445 922 to 138 273 863)",
     "4 841 208\n(4 533 619 to\n5 141 654)")

b<-str_replace (a, "\\(c.*\\)","")

#this is a complex example to clean and requires several steps. Note that the original data in the list a is now assigned to b. 
b<-str_replace(a,"\n","") %>% 
  #remove (
  str_replace("\\(.*","") %>%
  str_replace("\n.*","") %>%
  #remove )
  str_replace("\\)","") %>%
  #remove empty space
  str_replace("\\s","") %>%
  str_replace("\\s","")%>% as.numeric()
b
## [1]    8396 6301113 4841208 1407701 4841208

Another example: this time the 2 numbers in the column are separated by a slash sign. Suppose you want to keep the denominator. First remove the number before the slash sign; the * metacharacter matches zero or more repetitions of the preceding pattern.

df.d<-data.frame(seizure.rate=c("59/90", "90/100", "3/23"))
df.d$seizure.number<-str_replace(df.d$seizure.rate,"[0-9]*","") 
df.d$seizure.number
## [1] "/90"  "/100" "/23"

Now combine with the next step to remove the slash sign.

#We used [0-9] to denote any number from 0 to 9. For text, one can use [A-Z].
df.d$seizure.number<-str_replace(df.d$seizure.rate,"^[0-9]*","")%>%
  str_replace("/","\\")
df.d$seizure.number
## [1] "90"  "100" "23"

Removing the denominator requires a different approach. First remove the last number then the slash sign.

df.d$case<-str_replace(df.d$seizure.rate,"/[0-9]*"," ")
df.d$case
## [1] "59 " "90 " "3 "

The example below has several words mixed into numeric vector columns. The words are a mixture of upper and lower case. Note that "NA" is treated as a word character string while NA is treated as Not Available by R. This recognition is important as removing them requires different actions: the character "NA" can be removed by str_replace while NA requires the is.na operator.

A<-c(1,2,3,"NA",4,"no COW now")
B<-c(1,2,NA,4,"NA","check")
C<-c(1,"not now",2,3, NA ,5)

D<-data.frame(A,B,C) 

#str_replace on one column
D$A1<-str_replace(D$A,"[A-Z]+","") %>% str_replace("[a-z]+","")

#change to lower case
D$A2<-str_to_lower(D$A) %>% str_replace("[a-z]+","")

#remove space before replacement
D$A3<-str_to_lower(D$A) %>% str_replace("\\s+","") %>% str_replace("[a-z]+","")

#note that this action does not remove the third word
D$A4<-str_to_lower(D$A) %>% str_replace("\\s","") %>% str_replace("[a-z]+","")

#repeat removal of empty space
D$A5<-str_to_lower(D$A) %>% str_replace("\\s","") %>% 
  str_replace("\\s","") %>%  str_replace("[a-z]+","")

#apply str_replace_all rather than repeat
D$A6<-str_to_lower(D$A) %>% str_replace_all("\\s","") %>% 
    str_replace("[a-z]+","")

#now combine into vector. Note the use of c to combine the vector and replace 
#the comma with equal sign
D$A7<-str_to_lower(D$A) %>%
 str_replace_all(c("\\s"="","[a-z]+"=""))

D
##            A     B       C    A1       A2   A3   A4 A5 A6 A7
## 1          1     1       1     1        1    1    1  1  1  1
## 2          2     2 not now     2        2    2    2  2  2  2
## 3          3  <NA>       2     3        3    3    3  3  3  3
## 4         NA     4       3                                  
## 5          4    NA    <NA>     4        4    4    4  4  4  4
## 6 no COW now check       5   now  cow now  now  now

The lessons from above can be combined when creating a data frame. The mutate_if function enables multiple columns to be changed at once. Adding across multiple columns which contain NA can be handled with rowSums and dplyr::select. These examples are illustrated below.

#use the mutate function 

E<-data.frame(A,B,C) %>%
  mutate (A=str_to_lower(A) %>% str_replace_all(c("\\s"="","[a-z]+"="")),
          B=str_to_lower(B) %>%str_replace_all(c("\\s"="","[a-z]+"="")),
          C=str_to_lower(C) %>%str_replace_all(c("\\s"="","[a-z]+"="")))%>%
  
  #change character columns to numeric
  mutate_if(is.character, as.numeric)%>%
  
  #add across columns and avoid NA
  mutate(ABC=rowSums(dplyr::select(.,A:C),na.rm = T))

Sometimes, you may only want to keep the number and ignore the words in the column. This can be done using the str_extract function.

df.e<-data.frame(disability=c("1 - No significant disability despite symptoms; able to carry out all usual duties and activities","5 - Severe disability, bedridden, incontinent and requiring constant nursing care and attention","
1 - No significant disability despite symptoms; able to carry out all usual 
duties and activities"))
df.e$disability2<-str_extract(df.e$disability,"\\w") #extract first word character (here the leading digit)
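
To capture numbers with more than one digit, quantify the character class (a sketch):

df.e$disability3<-str_extract(df.e$disability,"[0-9]+") #extract the full number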

3.5 PDF to Excel

Sometimes data from public government sites come in PDF form instead of Excel. Conversion from pdf to Excel or text can be difficult, especially with special characters, eg Danish. There are several libraries for doing this: pdftables (requires an API key) and pdftools. The example below uses pdftools, available at https://docs.ropensci.org/pdftools/. The document is the 2018 Danish Stroke Registry report. The tabulizer package is excellent for converting table data; however, it depends on rJava and requires deft handling.

library(pdftools)
## Using poppler version 0.73.0
txt<-pdf_text("./Data-Use/4669_dap_aarsrapport-2018_24062019final.pdf")
cat(txt[17]) #browse data page 13+4 filler pages
## 3. Indikatorresultater på lands-, regions- og afdelingsniveau
## Indikator 1a: Andel af patienter med akut apopleksi som indlægges inden for 3 timer
## efter symptomdebut. Standard: = 30%
## Indikator 1b: Andel af patienter med akut apopleksi som indlægges inden for 4,5
## timer efter symptomdebut. Standard: = 40%
##                                      Inden for 3 timer
##                                                   Uoplyst   Aktuelle år         Tidligere år
##                            Standard   Tæller/      antal      2018          2017            2016
##                             opfyldt   nævner        (%)    %    95% CI   % (95% CI)     % (95% CI)
##   Danmark                     ja    4730 / 11794   49 (0)  40  (39 - 41) 39 (38-40)     37 (36-38)
##   Hovedstaden                 ja     1502 / 3439   49 (1)  44  (42 - 45) 40 (38-42)     40 (39-42)
##   Sjælland                    ja      760 / 1917     0 (0) 40  (37 - 42) 39 (36-41)     40 (38-43)
##   Syddanmark                  ja      942 / 2433     0 (0) 39  (37 - 41) 39 (37-41)     35 (33-37)
##   Midtjylland                 ja      918 / 2590     0 (0) 35  (34 - 37) 36 (34-38)     35 (33-37)
##   Nordjylland                 ja      577 / 1341     0 (0) 43  (40 - 46) 41 (39-44)     35 (32-37)
##   Bopæl uden for Danmark      ja          31 / 74    0 (0) 42  (31 - 54) 51 (36-66)     39 (26-53)
##   Hovedstaden                 ja     1502 / 3439   49 (1)  44  (42 - 45) 40 (38-42)     40 (39-42)
##   Albertslund                 ja          17 / 52    0 (0) 33  (20 - 47) 30 (18-44)     43 (27-59)
##   Allerød                     ja          22 / 53    0 (0) 42  (28 - 56) 32 (20-46)     35 (22-50)
##   Ballerup                    ja        43 / 106     0 (0) 41  (31 - 51) 48 (38-58)     40 (31-51)
##   Bornholms Regionskommune    ja          38 / 98    0 (0) 39  (29 - 49) 28 (19-38)     32 (22-43)
##   Brøndby                     ja        45 / 113     0 (0) 40  (31 - 49) 31 (22-42)     34 (23-47)
##   Dragør                      ja          18 / 43    0 (0) 42  (27 - 58) 47 (30-65)     33 (17-54)
##   Egedal                      ja          45 / 95    0 (0) 47  (37 - 58) 40 (30-52)     48 (37-59)
##   Fredensborg                 ja          36 / 99    0 (0) 36  (27 - 47) 37 (27-47)     43 (33-53)
##   Frederiksberg               ja        74 / 141   13 (8)  52  (44 - 61) 42 (35-50)     52 (44-61)
##   Frederikssund               ja        53 / 141     0 (0) 38  (30 - 46) 39 (31-48)     42 (33-51)
##   Furesø                      ja        55 / 110     0 (0) 50  (40 - 60) 56 (44-67)     50 (38-62)
##   Gentofte                    ja        65 / 123     1 (1) 53  (44 - 62) 46 (37-56)     38 (30-47)
##   Gladsaxe                    ja        68 / 129     0 (0) 53  (44 - 62) 48 (38-57)     36 (27-44)
##   Glostrup                    ja          38 / 72    0 (0) 53  (41 - 65) 40 (27-54)     48 (33-63)
##   Gribskov                    ja        46 / 129     0 (0) 36  (27 - 45) 35 (26-44)     36 (28-46)
##   Halsnæs                     ja          45 / 92    0 (0) 49  (38 - 60) 34 (25-45)     34 (25-44)
##   Helsingør                   ja        50 / 129     0 (0) 39  (30 - 48) 32 (24-40)     38 (30-45)
##   Herlev                      ja          25 / 50    0 (0) 50  (36 - 64) 52 (38-65)     33 (22-46)
##   Hillerød                    ja        47 / 121     0 (0) 39  (30 - 48) 39 (30-49)     40 (30-50)
##   Hvidovre                    ja        57 / 135     0 (0) 42  (34 - 51) 38 (29-48)     47 (37-57)
##                                                                                                    13
screenshot13<-
  pdf_render_page("./Data-Use/4669_dap_aarsrapport-2018_24062019final.pdf", 
                  page =17)
png::writePNG(screenshot13, "./Data-Use/Danish-Stroke-page13.png")

knitr::include_graphics("./Data-Use/Danish-Stroke-page13.png")

3.5.1 Scanned text or picture

Importing data from scanned text requires Optical Character Recognition (OCR). The tesseract library provides an R interface for OCR. In the example below, a picture is taken from the same CDC website containing mortality data (https://www.cdc.gov/coronavirus/2019-ncov/covid-data/covidview/04102020/nchs-data.html). The screenshot of this website was then cleaned in Paint. The data are available in the Data-Use folder.

library(tesseract)
eng <- tesseract("eng") #english
text <- tesseract::ocr("./Data-Use/Covid_PNG100420.png", engine = eng)
cat(text)
## NCHS Mortality Surveillance Data
## Data as of April 9, 2020
## For the Week Ending April 4, 2020 (Week 14)
## 
## COVID-19 Deaths Pneumonia Deaths* Influenza Deaths
## Year Week TotalDeaths Number %ofTotal Number %ofTotal Number  %of Total
## 2019 40 52,452 0 0 2,703 5.15 16 0.03
## 2019 4) 52,860 0 0 2,770 5.24 16 0.03
## 2019 42 54,129 0 0 2,977 5.50 18 0.03
## 2019 43 53,914 0 0 2,985 5.54 30 0.06
## 2019 44 53,980 0 0 2,908 5.39 31 0.06
## 2019 AS 55,468 0 0 3,063 5.52 31 0.06
## 2019 46 55,684 0 0 3,096 5.56 39 0.07
## 2019 47 55,986 0 0 2,993 5.35 50 0.09
## 2019 48 55,238 0 0 2,976 5.38 65 0.12
## 2019 49 56,990 0 0 3,305 5.80 99 0.17
## 2019 50 57,276 0 0 3,448 6.02 111 0.19
## 2019 51 56,999 0 0 3,345 5.87 125 0.22
## 2019 52 57,956 0 0 3,478 5.99 198 0.34
## 2020 1 58,961 0 0 3,998 6.77 416 0.71
## 2020 2 58,962 0 0 3,995 6.76 450 0.76
## 2020 3 57,371 0 0 3,903 6.78 441 0.77
## 2020 4 56,666 0 0 3,742 6.56 468 0.83
## 2020 5 56,381 0 0 3,617 6.42 452 0.80
## 2020 6 56,713 0 0 3,599 6.35 482 0.85
## 2020 7 55,237 0 0 3,577 6.48 487 0.88

3.6 Web scraping

The reader may ask why web scraping matters for healthcare; a pertinent example related to COVID-19 data is provided below. The rvest library is helpful for scraping data from an internet page. The rvest library assumes that web contents have an xml document-tree representation. The different options available for web scraping with rvest are described at https://rvest.tidyverse.org/reference/. The user can use CSS selectors to scrape content, as sketched below. The RSelenium library is also useful for web scraping. For dynamic web pages, especially those containing embedded JavaScript, CasperJS does a better job.
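
A minimal sketch of the CSS-selector approach (the URL and selector here are hypothetical):

library(rvest)
page<-read_html("https://example.com") #hypothetical page
headings<-page %>% html_nodes("h2.title") %>% html_text() #hypothetical selector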

The cdccovidview library provides access to the CDC website on COVID-19. In the example below, we will do this manually. Data from the CDC website on COVID-19 are downloaded, cleaned and saved in csv format. It is important to pay attention to the data: the first row contains the header and is removed, and several columns contain commas, which can be removed using the exercises above. Further, the data are updated on a weekly basis, so the year and week need to be converted into a date-time format using lubridate.

library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library(tidyverse)
#assign handle to web page accessed 12/4/20
#cdc<-read_html("https://www.cdc.gov/coronavirus/2019-ncov/covid-data/
#               covidview/04102020/nchs-data.html")
# scrape all div tags
#html_tag <- cdc %>% html_nodes("div")
# scrape header h1 tags
#html_list<-html_tag %>% html_nodes("h1") %>% html_text()
#there is only one table on this web page
#Table1<- cdc %>% html_node("table") %>% html_table(fill = TRUE)
#Table1 has a header row
#Table1<-Table1[-1,]
#The data in the Total Deaths column has a comma 
#Table1$Total.Deaths<-as.numeric(gsub(",","",Table1$`Total Deaths`))
#now combine the year and week column to Date
#Table1$Date<-lubridate::parse_date_time(paste(Table1$Year, Table1$Week, 'Mon', sep="/"),'Y/W/a')
#there are still commas remaining in some columns. This is a useful exercise for the reader. A solution is provided in the next example.  
#write.csv(Table1,file="./Data-Use/Covid_Table100420.csv")

The next example is from the CDC COVID-19 website. It poses different challenges as there are several columns with the same names; in this case we will rename the columns by index. There are several columns containing commas. Rather than removing them column by column, we will write a function with lapply to do it over the whole table (apply would return a matrix, whereas lapply returns a list that is easily wrapped back into a data frame). One column contains a percentage enclosed in brackets. This can be removed using the metacharacter example above, ie a double backslash in front of the opening bracket and again before the closing bracket.

library(rvest)
library(tidyverse)
cdc<-
read_html("https://www.cdc.gov/mmwr/volumes/69/wr/mm6915e4.htm?s_cid=mm6915e4_w")

# scrape all div tags
html_tag <- cdc %>% html_nodes("div")
# scrape header h1 tags
html_list<-html_tag %>% html_nodes("h1") %>% html_text()
#there is only one table on this web page
Table2<- cdc %>% html_node("table") %>% html_table(fill = TRUE)
#first row is header
names(Table2) <- as.matrix(Table2[1, ])
Table2<-Table2[-c(1:2,55),]#rows 1 and 2 are redundant
#rename the columns by index 
names(Table2)[2] <-"NumberCases31.03.20"
names(Table2)[3]<-"CumulativeIncidence31.03.20"
names(Table2)[4]<-"NumberCases07.04.20"
names(Table2)[5]<-"NumberDeath07.04.20"
names(Table2)[6]<-"CumulativeIncidence07.04.20"
#rather than removing column by column we will write a function with lapply to remove commas over the table. the apply function returns a matrix whereas lapply returns a dataframe.
Table2<-as.data.frame(lapply(Table2, function(y) gsub(",", "", y))) 
Table2<-as.data.frame(lapply(Table2, function(x)
  gsub("\\(|[0-9]+\\)","",x)))
#write.csv(Table2,file="./Data-Use/Covid_bystate_Table130420.csv")

3.7 Manipulating Medical Images

3.7.1 DICOM and nifti format

R can handle a variety of different data formats. Medical images are stored as DICOM files for handling and converted to NIfTI files for analysis. The workhorses are the oro.dicom and oro.nifti libraries. A nifti object is an S4 class with multiple slots for data; these slots can be accessed by typing @ after the handle of the file. The slots also contain information on whether the data have been scaled. This can be checked by accessing the scl_slope and scl_inter slots; the slope and intercept provide a means of returning an image to its correct values.

library(oro.nifti)
## oro.nifti 0.11.0
## 
## Attaching package: 'oro.nifti'
## The following object is masked from 'package:dplyr':
## 
##     slice
mca<-readNIfTI("./Data-Use/mca_notpa.nii.gz", reorient = FALSE) 

mca@scl_slope
## [1] 1

To find available slots

#find available of slots
slotNames(mca)
##  [1] ".Data"          "sizeof_hdr"     "data_type"      "db_name"       
##  [5] "extents"        "session_error"  "regular"        "dim_info"      
##  [9] "dim_"           "intent_p1"      "intent_p2"      "intent_p3"     
## [13] "intent_code"    "datatype"       "bitpix"         "slice_start"   
## [17] "pixdim"         "vox_offset"     "scl_slope"      "scl_inter"     
## [21] "slice_end"      "slice_code"     "xyzt_units"     "cal_max"       
## [25] "cal_min"        "slice_duration" "toffset"        "glmax"         
## [29] "glmin"          "descrip"        "aux_file"       "qform_code"    
## [33] "sform_code"     "quatern_b"      "quatern_c"      "quatern_d"     
## [37] "qoffset_x"      "qoffset_y"      "qoffset_z"      "srow_x"        
## [41] "srow_y"         "srow_z"         "intent_name"    "magic"         
## [45] "extender"       "reoriented"

In the example below we simulate an image of dimensions 5 by 5 by 5.

library(oro.nifti)
set.seed(1234)
dims = rep(5, 3)
SimArr = array(rnorm(5*5*5), dim = dims)
SimIm = oro.nifti::nifti(SimArr)
print(SimIm)
## NIfTI-1 format
##   Type            : nifti
##   Data Type       : 2 (UINT8)
##   Bits per Pixel  : 8
##   Slice Code      : 0 (Unknown)
##   Intent Code     : 0 (None)
##   Qform Code      : 0 (Unknown)
##   Sform Code      : 0 (Unknown)
##   Dimension       : 5 x 5 x 5
##   Pixel Dimension : 1 x 1 x 1
##   Voxel Units     : Unknown
##   Time Units      : Unknown

View the simulated image.

neurobase::ortho2(SimIm)

This section provides a brief introduction to viewing NIfTI files. Data are stored as rows, columns and slices. To view a sagittal image, assign a number to the row index.

#plot mca #sagittal
image(mca[50,,]) 

To see a coronal image, assign a number to the column index.

#plot mca #coronal
image(mca[,70,]) 

To see an axial image, assign a number to the slice index.

#plot mca #axial
image(mca[,,35]) 

3.7.2 Manipulating array of medical images

These arrays of medical images should be treated no differently from any other arrays. The imaging data are stored as an array within the .Data slot of the nifti object.

Data can be subset using square brackets. The image is indexed by x (right to left), y (front to back) and z (superior to inferior).

library(oro.nifti)
#extract data as array using @ function
img<-readNIfTI("./Data-Use/mca_notpa.nii.gz", reorient = FALSE) 
k<-img@.Data

#change x orientation to right to left 91*109*91
k1<-k[91:1,,] 
#access slice 35 to verify that the image orientation has been switched.
image(k1[,,35])

With the image now flipped to the other side, we can create a new image by returning the array to the .Data slot.

img2<-img
img2@.Data <- k1
img2
## NIfTI-1 format
##   Type            : nifti
##   Data Type       : 16 (FLOAT32)
##   Bits per Pixel  : 32
##   Slice Code      : 0 (Unknown)
##   Intent Code     : 0 (None)
##   Qform Code      : 0 (Unknown)
##   Sform Code      : 1 (Scanner_Anat)
##   Dimension       : 91 x 109 x 91
##   Pixel Dimension : 2 x 2 x 2
##   Voxel Units     : mm
##   Time Units      : sec

Arrays can be manipulated to split an image into 2 images. Below is a function to split the B0 and B1000 volumes from a diffusion series.

#split dwi file into b0 and b1000
#assume that b1000 is the second volume
#dwi<-readNIfTI("....nii.gz",reorient = F)

DWIsplit<-function(D) {
  dwi<-readNIfTI(D,reorient = F) #dim(dwi) eg [1] 384 384  32   2
  k<-dwi@.Data
  b1000<-oro.nifti::nifti(k[,,,2]) #second volume, dim [1] 384 384  32
  writeNIfTI(b1000,"b1000")
}

Measurement of volume requires information on the dimensions of the voxels.

#measure volume
A="./Ext-Data/3000F_mca_blur.nii"

VoxelDim<-function(A){
  library(oro.nifti)
  img<-readNIfTI(A,reorient = F)
  VoxDim<-pixdim(img)
  Volume<-sum(img>.5)*VoxDim[2]*VoxDim[3]*VoxDim[4]/1000 #ml
  Volume
}

VoxelDim(A)
## Malformed NIfTI - not reading NIfTI extension, use at own risk!
## [1] 69

Find the unsigned angle between the 2 vectors k and k1.

Morpho::angle.calc(k,k1)
## [1] 1.570771

Determine centre of gravity of an object.

#centre of gravity
neurobase::cog(img)
##     dim1     dim2     dim3 
## 65.74983 53.09204 46.67048

3.7.3 Combining arrays

This is an illustration of combining arrays using cbind. Note that cbind flattens each 3-dimensional array into a single column.

#stack arrays
k2<-cbind(k,k1)
dim(k2) ### [1] 902629      2
## [1] 902629      2

The abind function produces a different output, binding the arrays along the last dimension. Later we will repeat the same exercise using the list function.

#combine multi-dimensional arrays 
ab<-abind::abind(k,k1)
dim(ab) ### [1]  91 109 182
## [1]  91 109 182

This example uses multiple files. Rather than opening one file at a time, create a list of file names from pattern matching.

library(oro.nifti)
library(abind)
library(CHNOSZ) # for working with arrays
## CHNOSZ version 1.4.0 (2020-11-11)
## reset: creating "thermo" object
## OBIGT: loading default database with 1880 aqueous, 3418 total species
## 
## Attaching package: 'CHNOSZ'
## The following object is masked from 'package:oro.nifti':
## 
##     slice
## The following object is masked from 'package:dplyr':
## 
##     slice
library(RNiftyReg)
## 
## Attaching package: 'RNiftyReg'
## The following objects are masked from 'package:oro.nifti':
## 
##     pixdim, pixdim<-
#create a list using pattern matching
mca.list<-list.files(path="./Ext-Data/",pattern = "*.nii", full.names = TRUE)

#length of list
length(mca.list)
## [1] 42
#read multiple files using lapply function

#use lappy to read in the nifti files
#note lapply returns a list
mca.list.nii <- lapply(mca.list, readNIfTI)
## Malformed NIfTI - not reading NIfTI extension, use at own risk!
class(mca.list.nii)
## [1] "list"

This example illustrates how to view the first image in the list.

#view each image in the list
neurobase::ortho2(mca.list.nii[[1]])

In this example, three of the segmented images from the list (elements 5 to 7) are averaged and viewed.

#view average image
mca_ave3<-(mca.list.nii[[5]]+mca.list.nii[[6]]+mca.list.nii[[7]])/3
neurobase::ortho2(mca_ave3)

The output below is the same as above but is computed on arrays. The img_data function from the oro.nifti library extracts the image array from the .Data slot.

#extract multiple arrays using lapply
mca.list.array<-lapply(mca.list.nii, img_data)

m3<-(mca.list.array[[5]]+mca.list.array[[6]]+mca.list.array[[7]])/3

#compare this with the output from above
neurobase::double_ortho(m3, mca_ave3)

Arrays can be extracted from list using list2array function.

class(mca.list.array)
## [1] "list"
#convert list to array
#CHNOSZ function
m.listarray<-list2array(mca.list.array)#91 109  91  25

class(m.listarray)
## [1] "array"

3.7.4 Math operation on multidimensional array

Here we use the apply function to average over every element of the multidimensional array. The first argument of apply is the array, the second argument is the margin (the dimensions to preserve) and the last argument is the function. To analyse by row the margin argument is c(1), by column c(2), and so on; here c(1,2,3) averages across the fourth dimension, ie across the images.

ma<-apply(m.listarray,c(1,2,3), mean)
neurobase::ortho2(ma)

Thresholding can be performed using mask_img function. This function can also be used to create a mask.

ma_mask=neurobase::mask_img(ma, ma>0.1)

neurobase::double_ortho(ma_mask, ma)

3.7.5 Math operation on list

In this example we use lapply to apply a function to each element of a list. This is an example of a functional: a function which takes a function as an input and returns a vector as an output. Here the functional operates on one element of the list at a time, performing the same calculation as the function VoxelDim described above.

vox=unlist(lapply(mca.list.nii, 
           function(A) sum(A>.5)*
             #obtain voxel dimensions
             oro.nifti::pixdim(A)[2]*
             oro.nifti::pixdim(A)[3]*
             oro.nifti::pixdim(A)[4]/1000
           ))

vox
##  [1]   0.000   0.000   0.000   0.000  69.000   3.624   8.768  58.200 326.760
## [10]   1.608   0.000   3.904   3.328  66.776 106.976  70.928  26.912  11.768
## [19]  17.592  14.720 151.872  87.528   0.000  24.192  80.248 152.792  69.552
## [28]  25.576   0.000   8.644   4.230   3.781   3.643   5.720   5.061   8.644
## [37]   4.230   3.781   3.643   5.720   5.061  43.968

3.7.6 Vectorising nifti object

One way of handling imaging data for analysis is to flatten the image into a vector, then create an empty array of the same size to return the data to image form.

#flatten 3D
niivector = as.vector(img[,,]) #902629

#Create empty array of same size to fill up
niinew = array(0, dim=dim(img))

#return to 3D
niinew = array(niivector, dim=dim(img))

#confirm
neurobase::ortho2(niinew)

Another way of creating a vector from the image is to use the c function.

#vector can also be created using c
niivector2 = c(img[,,]) #902629

3.7.7 tar file

Image files can be large and are often stored as tar files. The tar (tgz files), untar, zip (gz files) and unzip functions are from the utils library.

colin_1mm<-untar("./Data-Use/colin_1mm.tgz")
colinIm<-readNIfTI("colin_1mm") #1 x 1 x 1
class(colinIm)
## [1] "nifti"
## attr(,"package")
## [1] "oro.nifti"
neurobase::ortho2(colinIm)

The readNIfTI call can open a gz file without the need to call the unzip function.

library(RNiftyReg)
epi_t2<- readNIfTI(system.file("extdata", "epi_t2.nii.gz", package="RNiftyReg"))
class(epi_t2)
## [1] "nifti"
## attr(,"package")
## [1] "oro.nifti"
neurobase::ortho2(epi_t2)

3.7.8 Image registration

There are many different libraries for performing registration.

library(RNiftyReg)

#example from data
source<-readNifti("./Data-Use/mca10_pca10_border10.nii.gz")
pixdim(source)
## [1] 2 2 2
colin_1mm<-untar("./Data-Use/colin_1mm.tgz")
target<-readNifti("colin_1mm")
pixdim(target)
## [1] 1 1 1 1
target
## Image array of mode "double" (54.2 Mb)
## - 181 x 217 x 181 x 1 voxels
## - 1 x 1 x 1 mm x 1 s per voxel
#register source to target 
result <- niftyreg(source, target)
#affine transformation
result$forwardTransforms
## [[1]]
## NiftyReg affine matrix:
##   0.77338767    0.56943178    0.47119385    4.53472805
##  -0.59504575    0.72087985   -0.80796480  -19.90197945
##  -0.56219202   -0.06933179    0.55241060   21.51253319
##   0.00000000    0.00000000    0.00000000    1.00000000
#image in target space
result$image
## Image array of mode "double" (54.2 Mb)
## - 181 x 217 x 181 voxels
## - 1 x 1 x 1 mm per voxel
neurobase::ortho2(result$image)

Output from RNiftyReg are niftiImage objects. They can be converted to oro.nifti objects using nii2oro function.

otarget<-nii2oro(target)
oimage<-nii2oro(result$image)

overlay(otarget, y=oimage,z = 90, plot.type = "single" )

3.7.9 Rescaling

Perform affine registration, then resample an image using the resulting transformation.

#affine
ica1000<-readNifti("./Ext-Data/1000M_ica.nii")
ica1000
## Image array of mode "double" (6.9 Mb)
## - 91 x 109 x 91 voxels
## - 2 x 2 x 2 mm per voxel
colin_affine<-buildAffine(source=ica1000, target=target)
#apply transformation from above
#assume 1000M_ica.nii and source are in the same space
colin_like<-applyTransform(colin_affine,ica1000)
neurobase::ortho2(colin_like)

Resampling an image to different dimensions: this example differs from the one above, in which rescaling was performed as part of registration to a higher resolution image. Here the rescale function from the RNiftyReg library is used with a scale factor of 0.5 in each dimension, changing the voxel dimensions from 2x2x2 mm to 4x4x4 mm.

ica1000.rescale<-RNiftyReg::rescale(ica1000,c(.5,.5,.5))

ica1000
## Image array of mode "double" (6.9 Mb)
## - 91 x 109 x 91 voxels
## - 2 x 2 x 2 mm per voxel
#compare with rescale

ica1000.rescale
## Image array of mode "double" (855.6 Kb)
## - 45 x 54 x 45 voxels
## - 4 x 4 x 4 mm per voxel

3.7.10 MNI template

There are several different MRI templates. The well-known one is the MNI 152 template (Mazziotta J 2001), which was developed from right-handed male medical students. The MNI 152 is also known as the International Consortium for Brain Mapping (ICBM) 152 template.

3.7.10.1 Atlases

A list of available atlases for human and animals is available at https://loni.usc.edu/research/atlases.

3.7.10.2 AAL atlas

The automated anatomical labeling (AAL) atlas contains 45 volumes of interest in each hemisphere (Tzourio-Mazoyer N 2002). The atlas is aligned to MNI 152 coordinates. The updated AAL has additional parcellation of the orbitofrontal cortex; the AAL3 update includes further parcellation of the thalamus.

library(rgl)
library(misc3d)
library(MNITemplate)
library(aal)
library(neurobase)
library(magick)
## Linking to ImageMagick 6.9.11.34
## Enabled features: cairo, freetype, fftw, ghostscript, lcms, pango, rsvg, webp
## Disabled features: fontconfig, x11
library(oro.nifti)

img = aal_image()
template = readMNI(res = "2mm")
cut <- 4500
dtemp <- dim(template)

# All of the sections you can label
labs = aal_get_labels()

# highlight - in this case the Cingulate_Post_R (right posterior cingulate)
cingulate = labs$index[grep("Cingulate_Post_R", labs$name)]

#mask of object for rendering
mask = remake_img(vec = img %in% cingulate, img = img)

#contour for MNI template
contour3d(template, x=1:dtemp[1], y=1:dtemp[2], z=1:dtemp[3], level = cut, alpha = 0.1, draw = TRUE)

#contour for mask
contour3d(mask, level = c(0.5), alpha = c(0.5), add = TRUE, color=c("red") )

### add text
text3d(x=dtemp[1]/2, y=dtemp[2]/2, z = dtemp[3]*0.98, text="Top")
text3d(x=-0.98, y=dtemp[2]/2, z = dtemp[3]/2, text="Right")

#create movie
#movie file is saved to temporary folder
#movie3d(spin3d(),duration=30)


#add digital map of mca territory
MCA<-readNIfTI("./Data-Use/MCA_average28_MAP_100.nii")
## Malformed NIfTI - not reading NIfTI extension, use at own risk!
#check range 
range(MCA)
## [1] -2.531219e-05  2.320454e+01
contour3d(template, x=1:dtemp[1], y=1:dtemp[2], z=1:dtemp[3], level = cut, alpha = 0.1, draw = TRUE)

#contour for mask
contour3d(MCA, level = c(0.5), alpha = c(0.5), add = TRUE, color=c("Yellow"))
#add right angular gyrus
angular = labs$index[grep("Angular_R", labs$name)]
#mask of object for rendering
mask2 = remake_img(vec = img %in% angular, img = img)
contour3d(mask2, level = c(0.5), alpha = c(0.5), add = TRUE, color=c("red"))
#add cingulate
contour3d(mask, level = c(0.5), alpha = c(0.5), add = TRUE, color=c("blue"))
### add text
text3d(x=dtemp[1]/2, y=dtemp[2]/2, z = dtemp[3]*0.98, text="Top")
text3d(x=-0.98, y=dtemp[2]/2, z = dtemp[3]/2, text="Right")

#create movie
#movie file is saved to temporary folder
#movie3d(spin3d(),duration=5)
#rglwidget()

3.7.10.3 Eve template

The Eve template is from Johns Hopkins (Oishi et al. 2009). It is a single-subject, high-resolution white matter atlas that has been morphed into MNI 152 coordinates. The atlas is parcellated into 176 regions based on the ICBM-DTI-81 atlas.
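The template image itself, as opposed to its label map, can be displayed with readEve. A brief sketch, assuming the default brain-extracted image:

library(EveTemplate)
eve_t1<-readEve() #Eve template, 181 x 217 x 181, 1 mm voxels
neurobase::ortho2(eve_t1)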

library(EveTemplate)

eve_labels = readEveMap(type = "II")
eve_labels
## NIfTI-1 format
##   Type            : nifti
##   Data Type       : 4 (INT16)
##   Bits per Pixel  : 16
##   Slice Code      : 0 (Unknown)
##   Intent Code     : 0 (None)
##   Qform Code      : 2 (Aligned_Anat)
##   Sform Code      : 1 (Scanner_Anat)
##   Dimension       : 181 x 217 x 181
##   Pixel Dimension : 1 x 1 x 1
##   Voxel Units     : mm
##   Time Units      : Unknown
neurobase::ortho2(eve_labels)

library(RColorBrewer)
library(tidyverse)
unique_labs = eve_labels %>% 
  c %>% 
  unique %>% 
  sort 
breaks = unique_labs
rf <- colorRampPalette(rev(brewer.pal(11,'Spectral')))
cols <- rf(length(unique_labs))

neurobase::ortho2(eve_labels, col = cols, breaks = c(-1, breaks))

3.7.10.4 Sensorimotor tract atlas

Templates for the corticofugal tracts arising from the primary motor cortex (M1), dorsal premotor cortex, ventral premotor cortex, supplementary motor area (SMA), pre-supplementary motor area (preSMA), and primary somatosensory cortex are available (Archer, Coombes, and Vaillancourt 2018). Below is an illustration of the right M1 tract obtained from LRNLAB.

RightM1<-readNifti("./Ext-Data/Right-M1-S-MATT.nii")

neurobase::ortho2(RightM1)

#downsample the image from 1 mm to 2 mm voxels
RightM1.rescale<-RNiftyReg::rescale(RightM1,c(.5,.5,.5))

RightM1o<-nii2oro(RightM1)

#dimensions of RightM1o (182 x 218 x 182) and eve_labels (181 x 217 x 181) differ,
#so this overlay would fail:
#overlay(eve_labels, y=RightM1o, z=90,plot.type="single")
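One way to make this overlay work, sketched here on the assumption that both images are already MNI-aligned and only their grids differ, is to reuse the buildAffine/applyTransform pattern from Section 3.7.9 to resample the tract map onto the Eve grid (object names are illustrative):

library(RNiftyReg)
library(EveTemplate)
#the Eve template supplies the target grid (181 x 217 x 181, 1 mm)
eve<-readEve()
#identity affine relating the two grids; assumes both are MNI-aligned
RightM1.affine<-buildAffine(source=RightM1, target=eve)
RightM1.eve<-applyTransform(RightM1.affine, RightM1)
#dimensions now match eve_labels, so the overlay succeeds
overlay(eve_labels, y=nii2oro(RightM1.eve), z=90, plot.type="single")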

A figure can be cross-referenced by its code chunk label with the fig: prefix. Similarly, tables generated by knitr::kable() can be referenced with the tab: prefix, e.g., see Table 3.1.
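For example, in the R Markdown source a labelled chunk like the one below (the label drugTable is hypothetical) produces a table that can be cited in the text as Table \@ref(tab:drugTable); figures are cited as Figure \@ref(fig:label), where label is the label of the figure-producing chunk.

```{r drugTable}
knitr::kable(
  df, caption = 'Table containing uncleaned data above',
  booktabs = TRUE
)
```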

knitr::kable(
  df, caption = 'Table containing uncleaned data above',
  booktabs = TRUE
)
Table 3.1: Table containing uncleaned data above
drug               infection                    drug1              infection1                   drug2      infection2
valium 1mg         pneumonia                    valium             pneumonia                    valium     pneumonia
verapamil sr       aspiration pneumonia         verapamil          pneumonia                    verapamil  pneumonia
betaloc zoc        tracheobronchitis            betaloc                                         betaloc
tramadol           respiratory tract infection  tramadol                                        tramadol
valium (diazepam)  respiratory.tract.infection  valium (diazepam)  respiratory.tract.infection  valium
knitr::kable(
  Table2, caption = 'Table containing cleaned data above',
  booktabs = TRUE
)
Table 3.2: Table containing cleaned data above
Jurisdiction  NumberCases 31.03.20  CumulativeIncidence 31.03.20  NumberCases 07.04.20  NumberDeath 07.04.20 (case fatality, %)  CumulativeIncidence 07.04.20  Absolute change in cumulative incidence
Alabama 999 20.4 2197 39 1. 44.9 24.5
Alaska 133 18.0 213 6 2. 28.9 10.8
Arizona 1289 18.0 2575 73 2. 35.9 17.9
Arkansas 560 18.6 993 18 1. 32.9 14.4
California 8131 20.6 15865 374 2. 40.1 19.6
Colorado 2966 52.1 5429 179 3. 95.3 43.2
Connecticut 3128 87.6 7781 277 3. 217.8 130.2
Delaware 319 33.0 928 16 1. 95.9 63.0
District of Columbia 495 70.5 1211 24 2. 172.4 101.9
Florida 6490 30.5 14302 296 2. 67.1 36.7
Georgia 4585 43.6 9713 351 3. 92.3 48.7
Hawaii 185 13.0 362 5 1. 25.5 12.5
Idaho 525 29.9 1210 15 1. 69.0 39.0
Illinois 5994 47.0 13549 380 2. 106.3 59.3
Indiana 2159 32.3 5507 173 3. 82.3 50.0
Iowa 497 15.7 1048 26 2. 33.2 17.5
Kansas 428 14.7 900 27 3. 30.9 16.2
Kentucky 591 13.2 1149 65 5. 25.7 12.5
Louisiana 5237 112.4 16284 582 3. 349.4 237.1
Maine 303 22.6 519 12 2. 38.8 16.1
Maryland 1660 27.5 5529 124 2. 91.5 64.0
Massachusetts 6620 95.9 15202 356 2. 220.3 124.3
Michigan 7615 76.2 18970 845 4. 189.8 113.6
Minnesota 689 12.3 1154 39 3. 20.6 8.3
Mississippi 1073 35.9 2003 67 3. 67.1 31.1
Missouri 1327 21.7 3037 53 1. 49.6 27.9
Montana 203 19.1 332 6 1. 31.3 12.1
Nebraska 177 9.2 478 10 2. 24.8 15.6
Nevada 1113 36.7 2087 71 3. 68.8 32.1
New Hampshire 367 27.1 747 13 1. 55.1 28.0
New Jersey 18696 209.9 44416 1232 2. 498.6 288.7
New Mexico 315 15.0 794 13 1. 37.9 22.9
New York† 32656 293.1 61897 1378 2. 555.5 262.4
New York City 41771 497.3 76876 4111 5. 915.3 418.0
North Carolina 1584 15.3 3221 46 1. 31.0 15.8
North Dakota 126 16.6 237 4 1. 31.2 14.6
Ohio 2199 18.8 4782 167 3. 40.9 22.1
Oklahoma 565 14.3 1472 67 4. 37.3 23.0
Oregon 690 16.5 1181 33 2. 28.2 11.7
Pennsylvania 4843 37.8 14559 240 1. 113.7 75.9
Rhode Island 520 49.2 1414 30 2. 133.7 84.6
South Carolina 1083 21.3 2417 51 2. 47.5 26.2
South Dakota 108 12.2 320 6 1. 36.3 24.0
Tennessee 2239 33.1 4139 72 1. 61.1 28.1
Texas 3266 11.4 8262 154 1. 28.8 17.4
Utah 934 29.5 1804 13 0. 57.1 27.5
Vermont 293 46.8 575 23 4. 91.8 45.0
Virginia 1484 17.4 3645 75 2. 42.8 25.4
Washington 4896 65.0 8682 394 4. 115.2 50.2
West Virginia 162 9.0 412 4 1. 22.8 13.8
Wisconsin 1351 23.2 2578 92 3. 44.3 21.1
Wyoming 120 20.8 221 0 (—) 38.3 17.5
American Samoa 0 0.0 0 0 (—) 0.0 0.0
Federated States of Micronesia 0 0.0 0 0 (—) 0.0 0.0
Guam 71 42.8 122 4 3. 73.6 30.8
Marshall Islands 0 0.0 0 0 (—) 0.0 0.0
Northern Mariana Islands 2 3.5 8 2 25. 14.1 10.5
Palau 0 0.0 0 0 (—) 0.0 0.0
Puerto Rico 239 7.5 573 23 4. 17.9 10.5
U.S. Virgin Islands 30 28.0 45 1 2. 42.1 14.0
U.S. Total 186101 56.2 395926 12757 3. 119.6 63.4

References

Aboumatar, H., and R. A. Wise. 2019. “Notice of Retraction. Aboumatar et al. Effect of a Program Combining Transitional Care and Long-term Self-management Support on Outcomes of Hospitalized Patients With Chronic Obstructive Pulmonary Disease: A Randomized Clinical Trial. JAMA. 2018;320(22):2335-2343.” JAMA 322 (14): 1417–8.

Oishi, Kenichi, Andreia Faria, Hangyi Jiang, et al. 2009. “Atlas-Based Whole Brain White Matter Analysis Using Large Deformation Diffeomorphic Metric Mapping: Application to Normal Elderly and Alzheimer’s Disease Participants.” Neuroimage 46 (2): 486–99. https://doi.org/10.1016/j.neuroimage.2009.01.002.

Archer, Derek B, Stephen A Coombes, and David E Vaillancourt. 2018. “A Template and Probabilistic Atlas of the Human Sensorimotor Tracts Using Diffusion MRI.” Cerebral Cortex 28 (5): 1685–99. https://doi.org/10.1093/cercor/bhx066.

Mazziotta J, Evans A, Toga A. 2001. “A Probabilistic Atlas and Reference System for the Human Brain: International Consortium for Brain Mapping (ICBM).” Philos Trans R Soc Lond B Biol Sci 356 (1412): 1293–1322. https://doi.org/10.1098/rstb.2001.0915.

Tzourio-Mazoyer N, Papathanassiou D, Landeau B. 2002. “Automated Anatomical Labeling of Activations in SPM Using a Macroscopic Anatomical Parcellation of the MNI MRI Single-Subject Brain.” Neuroimage 15 (1): 273–89. https://doi.org/10.1006/nimg.2001.0978.

Wickham, Hadley. 2019. Advanced R. 2nd ed. Chapman & Hall/CRC the R Series. Chapman & Hall/CRC.