Preface
Think of a biologist. Who do you see? Take a minute to make a mental list of their characteristics. Try to be specific: gender, skin, age, height, hair, clothes, personality. Who do you see?
Now think of a computer programmer or data scientist. Write down their characteristics. How do these people differ in your mind? Can you imagine them being the same person? Can you picture yourself in both roles?
The goal of this book is to bridge these two worlds. In writing this book, I assume you are a practising biologist or a student of biology, or you are just motivated by biological phenomena. It doesn’t matter if you are a recent high school graduate entering an undergraduate biology program, a graduate student embarking on an independent research dissertation, or a senior scientist with specialized expertise in the science of life. As long as you are interested in learning how to code, this book is written for you.
The goal of this book is to provide a ‘how-to’ guide to connect you to the world of data science. We focus on the fundamentals of the R programming language and its applications in biology. In writing this book, I assume you do not have much coding experience. Whether you are a new biology student or a seasoned professional, this book was written for you.
There are many great introductions to the R coding language available in print and online. But these tend to be general and abstract, sometimes going on tangents that are not so relevant to what you want to do as a biologist. What makes this book different is that it is written with the biologist in mind. Specifically, my goal was to write the book that I wish I had had as an undergraduate student learning how to collect and analyze data. With the benefit of hindsight, I’ve tried to cut out all the programming details that haven’t been of much use to me as a data scientist, and focus on the most common methods. I’ve tried to connect to biological questions and examples as much as possible, without getting too side-tracked with biological details. This decision-making process is based on my research and teaching experience in a range of topics in Biology and Health Sciences at Queen’s University – Environmental Science, Epidemiology, Genomics, Ecology, and Evolution.
A comprehensive coding volume would require thousands of printed pages and take decades to master. In choosing the content for this book, I have focused on everything that I wish I knew when I first started learning to program in R. Many of the functions and packages included here were not available when I started, but have some exceptional functionality. I will continue to add new tricks and techniques that I find useful.
Why this book?
Maybe you are curious about coding for data analysis but you aren’t sure if you want to invest the time and energy you will need to become competent in these methods. Many students in biology programs do not receive strong quantitative skills training in math, statistics, or computer science. In fact, many of us choose to go into biology programs because we are scared of the quantitative focus of the ‘hard’ sciences like physics and chemistry. Only much later do we realize how valuable these skills can be for investigating biological phenomena. Modern biology is defined by ‘big data’ sources including high-throughput sequencing, real-time environmental measurements, satellite imaging, animal tracking, and monitoring human health. Along with more traditional data types, these data are increasingly made available in online databases that are too big to navigate manually. Coding is not simply helpful to biologists – it’s becoming essential.
To help demonstrate the tremendous value of coding, I focus on examples drawn from real biological studies. I try to provide real-world examples of how one can apply programming tools and techniques to curate, analyze, and visualize biological data. These tend to be areas in which I have researched and published papers – opportunities that were presented to me because of my ability to analyze data in a reproducible and open framework. However, a key theme of this book is that these skills are highly transferable, not only across the biological sciences but to other disciplines.
Here are a few examples of the diversity of data, analyses, and visualizations in my own collaborations, which all use data analysis and visualizations in R:
A paper examining rapid evolution of flowering: https://doi.org/10.1126/science.1242121
A de novo genome assembly: https://doi.org/10.1093/g3journal/jkab339
A meta-analysis of evolution of invasive species: https://doi.org/10.1111/mec.13162
Tracking COVID-19 outbreaks using whole-genome sequencing: https://doi.org/10.1038/s41598-021-83355-1
A study of metabolites in nasal swabs that can differentiate COVID-19 from other viral infections in human patients: https://www.nature.com/articles/s41598-022-14050-y
An analysis of 3,429 herbarium images and >1 million weather records to reconstruct evolution of an invasive plant: https://www.pnas.org/doi/full/10.1073/pnas.2107584119
A model of species range limits: https://royalsocietypublishing.org/doi/full/10.1098/rstb.2021.0020
Acknowledgements
This book was written at Queen’s University in Kingston, Ontario, Canada, originally known as Katarokwi, part of the traditional lands of the Anishinaabe and Haudenosaunee. I am very grateful that fate has brought me to this land to learn the Teaching of the Seven Grandfathers. When you need a break from coding, I encourage you to look up the Anishinaabe tradition of the Teaching of the Seven Grandfathers. Remember that coding is a superpower, and as your coding skills improve, you will have the responsibility to use your powers for the good of others.
This book was written at Queen’s University, but it began in 2009, when I first learned to code in R and began to collect resources and make notes to help teach these tools to others. In 2015, I converted these personal notes into a course at the University of Tübingen in southern Germany. I’m grateful to my friend and colleague, Dr. Oliver Bossdorf, for encouraging me to develop and deliver that course. In 2017, I added new modules and developed a website of self-tutorials for a fourth-year course at Queen’s University called Introduction to Computation and Big Data in Biology. Over the next four years, this content was revised and refined for a third-year course on biostatistics and three graduate-level courses. In 2022, I separated these notes into four books, the first of which became the R Crash Course for Biologists. Feedback from dozens of graduate and undergraduate students helped me to understand which concepts were most difficult for new learners. The following graduate students provided especially detailed and helpful feedback: Mia Akbar, María José Gómez Quijano, Charlotte Ngo, Claire Smith, Mike Vermeulen, and Sherise Vialva. The courses I’ve taught require a lot of work from students in the form of weekly quizzes and assignments. As such, Dr. Brian Cumming deserves much credit for supporting these courses with Teaching Assistants to help me develop and deliver the content effectively. A special thanks to my partner in life and academia, Dr. Sarah Yakimowski, who provided support and feedback on a wide range of aspects, from basic teaching philosophy to the cover design and layout. And a final thanks to you, the reader, for your interest in this book. I hope you find it useful, and I hope you will let me know what you think via email (robert.colautti@queensu.ca) or social media.