Chapter 1 Preamble

This book is written in the hope that clinicians and junior doctors will take a new approach towards understanding medicine and data. One of the key aspects of the journey to becoming a doctor is learning about medical diagnosis. This is a complex process and should take into account the history and corroborating history, examination and investigations (Centor, Geha, and Manesh 2019). However, early in medical school, students are taught lists of associations and the frequency of signs and symptoms related to a disease, rather than the recognition of patterns. Medical students later become junior doctors and pride themselves on the ability to generate these lists. An analogy to these lists is performing univariable regression on each sign or symptom to explore its occurrence in the condition compared with an alternative diagnosis. However, the findings from univariable regression do not convey how the variables relate to one another. An example is the teaching that vasculitis is associated with stroke. This has led many junior doctors to search for rare associations such as vasculitis among young patients with stroke. In an audit over 10 years at Monash Health, there was only one case of vasculitis presenting as stroke (Kempster, McLean, and Phan 2016). Another example is searching for temporal arteritis among elderly patients presenting with stroke. This search is performed without realising that few patients with arteritis have stroke and few patients with stroke have arteritis. Often patients develop arteritis and its associated symptoms first. Stroke may present later in the course of arteritis, sometimes even after treatment for arteritis. By contrast, a senior clinician would discuss why certain diagnoses and not others were considered in the differential diagnosis, given the occurrence of selected symptoms and signs.
In statistical analysis, patterns can be found using a variety of multivariate methods (Hastie, Tibshirani, and Friedman 2009). These methods are not taught except in advanced statistics or machine learning courses. Consequently, there is a lack of appreciation of these methods and of how they can be applied in healthcare. In this book a wide variety of multivariate methods will be briefly demonstrated to give clinicians a glimpse of the possibilities.

Statistics is taught at a rudimentary level in medical school and during specialty training. Given this lack of emphasis on an important subject, it is not surprising that students and junior doctors do not embrace statistics. When analysis is done, it is usually through commercial statistical software such as SPSS, which has a graphical user interface (GUI) and encourages the user to perform analysis by clicking. R is a statistical program which takes the opposite direction. It requires the user to be able to code and to understand why a task should be done. This barrier means that many students, junior doctors and clinicians do not appreciate the advantages of R: free open source software, a large online community willing to share code, and direct access to the statisticians and bioinformaticians who write the software. Many new ideas in statistics have libraries available in R. By contrast, it can take several years for commercial software to catch up. R can be used to scrape data from the internet or to interface with data platforms such as the Google Maps application programming interface (API), YouTube and Twitter. RStudio, the integrated development environment (IDE) for R, provides a platform for Shiny app development, creation of web documents and writing of books (such as this one).

This book takes a non-traditional approach to teaching R. It emphasises learning R by example. Data science courses often spend time explaining how R treats data as vectors and manipulates data symbolically. Data manipulation is the foundation of data science but can bore those new to R. That aspect is left to the next chapter on data wrangling. This chapter is an introduction to ggplot2. Another aspect of learning R is that the libraries come with many free datasets. Clinicians do not find the diamond, car or gapminder datasets useful as they are not related to medicine. On the other hand, the actual Titanic data, with the passenger list and each passenger's fate, may be of interest. In this book we will try to use datasets which are directly related to medicine or to topics of high interest such as COVID-19. Some of the data provided here come from publications by the Neurology Department, some are simulated, some have come from the internet (COVID-19) and some are provided by R (e.g. fertility, cancer (lung, breast), leukemia, lymphoma, coronary artery disease, diabetes, hepatitis and microbiome). The datasets provided by R are available in the following packages: datasets, Stat2Data and mlbench. Additional datasets are available from external websites such as Kaggle and the UCI Machine Learning Repository. For example, the heart disease data are available from https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/. The reader is encouraged to visit these websites to obtain data for learning R. Datasets from these websites are labelled in this book as coming from the ExtData folder. Unless indicated, the data used in this book can be found in the Data-Use folder. Researchers working on animal datasets may find the principles of analysis described here useful for animal research.
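As a minimal sketch of how a dataset that ships with R can be loaded and inspected, the example below uses infert, a case-control study of infertility from the base datasets package. The choice of infert as the "fertility" example is an assumption made for illustration, as is the optional use of PimaIndiansDiabetes from mlbench, which only runs if that package is already installed.

```r
# List the names of the datasets shipped with the base 'datasets' package.
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Load one medically relevant example (an illustrative choice, not
# necessarily the dataset used later in the book).
data("infert", package = "datasets")

# Inspect its size, its variables and the number of cases and controls.
dim(infert)
str(infert)
table(infert$case)

# Optional: a diabetes dataset from mlbench, loaded only if the package
# is already installed on this machine.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("PimaIndiansDiabetes", package = "mlbench")
  summary(PimaIndiansDiabetes$diabetes)
}
```

The same pattern, data("name", package = "pkg") followed by str() or summary(), applies to any dataset bundled with an installed package.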

This book was written using Yihui Xie’s bookdown package. The files are written in R Markdown, which allows R scripts to be embedded in the text, the main aim of writing this book. The repository for this book is https://github.com/GNtem2/HealthcareRbook. Once completed the book will be available on Netlify. R can be installed from The Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/. It is free. The user can choose between three platforms: Windows, Mac (OS X) and Linux. Following installation of R, you should install RStudio, available at https://rstudio.com/. Some of the libraries are available from the Bioconductor website rather than CRAN.
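The difference between installing from CRAN and from Bioconductor can be sketched as follows. This is an illustration under stated assumptions rather than the setup script for this book: the helper is_installed(), the run_install switch and the choice of limma as an example Bioconductor package are all hypothetical. Nothing is downloaded unless run_install is set to TRUE.

```r
# Hypothetical helper: TRUE if a package can be loaded on this machine.
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

# Safety switch (an assumption for this sketch): set to TRUE to allow
# packages to be downloaded and installed.
run_install <- FALSE

# CRAN packages named in this chapter.
cran_pkgs <- c("ggplot2", "mlbench", "Stat2Data")

# Keep only those that are not yet installed.
missing_cran <- cran_pkgs[!vapply(cran_pkgs, is_installed, logical(1))]

if (run_install && length(missing_cran) > 0) {
  install.packages(missing_cran)  # installs from CRAN
}

if (run_install) {
  # Bioconductor packages are installed through the BiocManager package,
  # which itself comes from CRAN.
  if (!is_installed("BiocManager")) install.packages("BiocManager")
  BiocManager::install("limma")  # example package only, an assumption
}
```

Checking what is missing before installing avoids repeatedly downloading packages that are already present.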

References

Centor, Robert M., Rabih Geha, and Reza Manesh. 2019. “The Pursuit of Diagnostic Excellence.” JAMA Network Open 2 (12): e1918040. https://doi.org/10.1001/jamanetworkopen.2019.18040.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer. http://www-stat.stanford.edu/~tibs/ElemStatLearn/.

Kempster, P. A., C. A. McLean, and T. G. Phan. 2016. “Ten Year Clinical Experience with Stroke and Cerebral Vasculitis.” Journal of Clinical Neuroscience 27 (May): 119–25.