---
title: "R Best Practices for ADF&G Biometricians"
author: "ADF&G Standards Group"
date: "June 30, 2023"
format: html
editor: visual
---
This document outlines coding best practices in R for biometricians and quantitative biologists within ADF&G, with the hope that a common framework will facilitate collaboration and code review. The main objective is to make code easily readable, understandable, and reproducible. A second objective is to give all ADF&G staff members a common starting point while learning to use R to achieve their job responsibilities. Coding best practices work in tandem with the Department's operational planning and reporting policies to facilitate reproducible research.
## Conventions used in this document
When point-and-click sequences are referenced in this text, the sequence will be enclosed in "quotation marks". Button names will be shown in *italic*. Button names in the point-and-click sequence will be separated by the greater than symbol (\>). Any actions required will be described using italics enclosed in parentheses, such as (*italics enclosed in parentheses*). For example, reference the following point-and-click sequence: "*Button name* \> *Button name* \> (*Required action described*)". Keystrokes will be described with regular font enclosed in quotation marks, e.g., "Control+Shift+R".
Example R scripts will be shown in code blocks:
```{r}
#example
age <- 48
```
R commands in the text below will be shown using inline code blocks (e.g. `<-`).
## R and RStudio Best Practices
1. **Use RStudio Projects:**
- To create a new RStudio project: "*File* \> *New Project...* \> *New Directory* \> *New Project* \> (*Provide a Directory name and location*) \> *Create Project*".
- Checking the box next to "*Create a git repository*" will initialize a git repository in the project directory.
- Checking the box next to "*Use renv with this project*" will create a package environment for your project, which will protect your analysis from package updates and retirements.
- RStudio projects make file paths within the project relative to the location of the RStudio project file (.Rproj). This provides stability because relative file locations (e.g., `"./raw data.xlsx"`) will not break if the project folder is moved to a new location.
- This approach is more stable than absolute paths created using `setwd()`. Use of `setwd()` should be reserved for sharing small pieces of code in situations where the code will not be further developed.
- Refer to the RStudio tips and tricks section below to learn more about file paths.
2. **Don't save your work environment:**
- To ensure your work environment is never saved when exiting RStudio, go to "*Tools* \> *Global Options* \> *Save workspace to .RData on exit*, (*Change selection to 'Never'*)".
- This practice ensures programming errors are not masked by orphaned objects in your workspace.
- If the analyses are lengthy, complex, or time consuming, the deliberate practice of saving specific output files, which can be sourced later, is more stable.
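A minimal sketch of saving one specific output instead of the whole workspace. `saveRDS()` and `readRDS()` are base R functions; the output folder, the file name, and the toy model are assumptions invented for illustration.

```r
# Hypothetical output folder; a temporary directory stands in for the project's output folder
output_dir <- file.path(tempdir(), "output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for a time-consuming analysis (toy linear model on a built-in data set)
fit <- lm(dist ~ speed, data = cars)

# Save only the object that is needed later
fit_path <- file.path(output_dir, "fit_dist_speed.rds")
saveRDS(fit, fit_path)

# In a later session, read the object back instead of rerunning the analysis
fit_reloaded <- readRDS(fit_path)
```

Because only named objects are written to disk, a fresh R session starts empty and any object a script depends on has to be created or read in explicitly.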
3. **Use a consistent folder structure within each RStudio project:**
    - Project/Analysis name
        - scripts -- This folder contains the R scripts used to perform fisheries analysis.
        - data -- This folder contains any data sets used in the project or analysis and any code required to prepare the data for analysis.
            - The raw data file for the project should be a read-only file (i.e. never adjust the contents of a raw data file). If changes are required, they should be done via code prior to the analysis. If errors are located, they should be communicated back to the database manager for correction.
            - If data are confidential, write a comment in the script explaining how to call the data, where it is stored on an ADF&G server, or whom to contact.
        - reports -- This folder contains FDS reports, operational plans, and any other documents associated with the analysis.
        - README file -- This file stored in the main project directory can be very helpful for orienting a new user.
    - You can add to this basic structure with additional folders such as: functions (R scripts containing functions, which can be sourced when needed in other scripts), output (summary tables, .csv files, Excel files, JAGS posteriors, .RData files, simulation objects, figures, etc.), R Markdown (R Markdown files and outputs), Quarto (Quarto files and outputs), literature (references that are pertinent to the project), and models (statistical model scripts).
    - Multiple year analyses are generally set up such that there is a separate folder for each year, within which resides the above folder structure (e.g., 2019 folder contains scripts, data, and reports folders).
    - This practice helps to create a common framework to facilitate collaboration, code review, and the reproducibility of projects.
::: {#fig-folders layout-nrow="2"}
![Folder structure when project data is analyzed at once.](folder_general.png){#fig-folder_general}
![Folder structure for separate, annual analyses of each year's data.](folder_annual.png){#fig-folder_annual}
Typical folder structures for ADF&G fisheries analyses.
:::
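The basic folder structure can also be created from code. The sketch below is one possible approach using base R only; the project name `example_analysis`, the use of `tempdir()` as the parent location, and the `README.md` file name are assumptions made for illustration.

```r
# Hypothetical project location; replace tempdir() with the project's parent folder
project_dir <- file.path(tempdir(), "example_analysis")

# Core folders named in the folder structure above
folders <- c("scripts", "data", "reports")
for (folder in folders) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# README file in the main project directory (file name is an assumption)
readme_path <- file.path(project_dir, "README.md")
writeLines("Describe the project, its data sources, and how to run the scripts.", readme_path)

# List what was created
list.files(project_dir)
```

Optional folders such as functions or output can be added by extending the `folders` vector.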
4. **Use informative headers:**
- For better understanding among multiple staff (current and future) and long-term maintenance of analyses, each script should begin with a header that includes at least the following information:
- Synopsis of what the code does and which project it pertains to
- Author
- Date of creation
- Other information a new analyst might find helpful: required packages, parameter names/descriptions, major modification history, dependencies, etc.
- Refer to the RStudio tips and tricks section below to learn about using snippets to create informative headers.
```{r}
# Find the most common words and word pairs in the DSF Gotham Culture survey results.
# Author: Adam Reimer
# Version: 2023-05-05
# Packages
packs <- c("tidyverse", "stringr", "tidytext", "textdata", "ggraph")
lapply(packs, require, character.only = TRUE)
# Note: The data associated with the project is stored outside of this repository to protect employee privacy.
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
```
5. **Use a standard template for your script:**
- Following a general coding template will help us all read each other's code:
- Header
- Required packages
- Read data/Clean data
- Analysis
- Write/Save output figures and tables
- Refer to the RStudio tips and tricks section below to learn about snippets to create a standard template and about using outlines to organize your R script.
6. **Have a naming convention for local variables, global variables, constants, and functions:**
- Keep names short but descriptive. Meaningful and understandable variable and function names help collaborators differentiate items in the coding environment. In general, variable names should be nouns and function names should be verbs.
- The best naming convention will vary depending on the application. The most important thing is to be consistent within each script and/or project. On large projects, a glossary of parameter names can help keep the naming convention understandable.
- Rename/reformat columns, as needed, immediately after importing data.
- Do not use the same name for multiple purposes. For example, variables with the same name in global and local environments can lead to confusing results.
- Avoid numbers in variable names unless they are informative.
- E.g., length500 for a variable that contains the number of fish greater than 500 mm in length is informative.
- Incremental names are not informative (e.g., using S1, S2, and S3 as variables containing the number of fish greater than 300 mm, 400 mm, and 500 mm). If incremental names are used, their meaning should be described in comments.
- Use consistent parameter names in the code and the methods of the associated report.
- Refer to the RStudio tips and tricks section below to learn about the `clean_names()` function and a tip about capital letters in object names.
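As a rough illustration of renaming columns immediately after import, the sketch below uses base R only. The data frame and the helper `tidy_names()` are hypothetical and invented for this example; the `clean_names()` function mentioned above is a packaged alternative with more complete behavior.

```r
# Hypothetical raw column names as they might arrive from a spreadsheet
raw <- data.frame(
  "Fish ID" = 1:3,
  "Length (mm)" = c(512, 430, 610),
  check.names = FALSE
)

# Illustrative helper (not from a package): lowercase the names, replace runs of
# non-alphanumeric characters with "_", and trim leading or trailing underscores
tidy_names <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)
}

# Rename once, right after import, so later code uses consistent names
names(raw) <- tidy_names(names(raw))
names(raw)
```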
7. **Indentation and whitespace:**
- Proper indentation and judicious use of whitespace (e.g., spaces between sections) are very important in enhancing the readability of the code.
- Be consistent throughout the entire script.
- Refer to the RStudio tips and tricks section below to learn tips for including whitespace, see an example of having proper indentation and a judicious use of whitespace, and to learn a tip about reindenting or reformatting code.
8. **Code should be well documented:**
- Commenting makes your code more understandable.
- Comments should be brief, relevant, and used conservatively within the script.
- Examples of places to comment code:
- Adding links to the origin of copied code and/or to credit the original author
- Explaining a complex piece of code
- Stating obscure packages used in the R script and explaining the functionality they provide
- Explaining what a block of code or what each step of a pipe does
- Adding details about units, precision, or the treatment of null values
- For example: Instead of naming a variable catch_kg, make a comment at the beginning of the script that catch is in kg unless otherwise stated, and then name the variable catch.
- To document changes from previous years
- Document major updates and issues
- The simplest way to document major updates is to include a modification history in the header (e.g., May 2022: changed the statistical analysis to consider year).
- Git/GitHub is a great tool to track updates, note future code improvements, and report issues. See [ADFG-DSF/Git_book: A Quarto Book describing best practices for Git use at ADF&G (github.com)](https://github.com/ADFG-DSF/Git_book) for more guidance on using GitHub.
9. **Write functions for repetitive tasks:**
- Whenever possible, write universal functions for small discrete tasks.
- If you have many functions, place them all into one script and source that script in your main code.
- An extension of this practice is to write utility scripts for larger discrete tasks and then source those scripts in your main code.
- This practice reduces the clutter of the script containing the main analysis so that the main code is easier to follow.
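A small sketch of keeping functions in their own script and sourcing that script from the main code. The file name `functions.R`, its location in `tempdir()`, and the function `convert_mm_to_in()` are assumptions for illustration; in a project the file would more likely live in a functions folder and be written by hand rather than with `writeLines()`.

```r
# Hypothetical functions script; written here only so the example is self-contained
functions_path <- file.path(tempdir(), "functions.R")

writeLines(c(
  "# Convert a length in millimeters to inches",
  "convert_mm_to_in <- function(length_mm) {",
  "  length_mm / 25.4",
  "}"
), functions_path)

# In the main analysis script, source the functions once near the top
source(functions_path)

# The function is now available to the rest of the script
lengths_in <- convert_mm_to_in(c(254, 508))
```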
10. **Format dates using the international date/time standard format (ISO 8601):**
- Dates regularly cause errors. Format all date columns into the international standard format (ISO 8601), which is yyyy-mm-dd, at the beginning of a script to avoid problems later in the analysis.
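One way to convert dates at the start of a script, sketched with base R. The input vector and its month/day/year layout are assumptions; the `format` argument must match however the dates were actually recorded.

```r
# Hypothetical sample dates recorded in month/day/year format
sample_date_raw <- c("6/30/2023", "7/4/2023", "12/15/2023")

# as.Date() parses the text into Date objects, which print as yyyy-mm-dd
sample_date <- as.Date(sample_date_raw, format = "%m/%d/%Y")

# format() returns the ISO 8601 text representation when a character column is needed
sample_date_iso <- format(sample_date, "%Y-%m-%d")
```

Keeping the column as class `Date` rather than character also allows sorting and date arithmetic later in the analysis.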
11. **Use Quarto or R Markdown:**
- When appropriate, use Quarto or R Markdown to generate output documents in Word, PDF, and html formats. Using Quarto or R Markdown allows you to produce high-quality, self-contained output documents directly from code, reducing the likelihood of copy and paste or version errors. Using Quarto or R Markdown is also efficient since you can easily regenerate the entire output following a data update or modification of the code.
- To create a new Quarto document: "*File* \> *New File* \> *Quarto Document...* \> *Document* or *Presentation* or *Interactive* \> (*Pick the type of document/presentation to create and type in a Title for the document/presentation; Pick the Engine and Editor as well if need be*) \> *Create*".
- To create a new R Markdown document: "*File* \> *New File* \> *R Markdown...* \> *Document* or *Presentation* or *Shiny* or *From Template* \> (*Pick the type of document/presentation to create, type in a Title for the document/presentation, and check or uncheck box about using current date when rendering document; Or pick the template to use*) \> *OK*".
## RStudio tips and tricks
- RStudio can provide a script outline with navigation
- Insert section names such as `Read data` or `Format data` using the keyboard shortcut "Control+Shift+R". These names will appear in your script as a comment with a separator following the name.
- Section names can also be typed in using at least one hash sign, then the section name, and then at least 4 dashes or equal signs. For example, `# Load data ----` or `## Format data ====` would both create a section name.
- You can create subsections by preceding your section name with `*` or sub-subsections by preceding your section name with `**`.
- Section names are visible in an outline. This outline can be accessed by clicking the button with 5 horizontal lines at the top right corner of the source pane in RStudio. Section names in the outline can be clicked on to navigate through a lengthy piece of code.
- RStudio snippets can be used to insert a standard header and code template in a new R script.
- To create a header/code template snippet, go to "*Tools* \> *Edit Code Snippets...* \> (*Paste the following rscript snippet at the bottom of the pop-up*) \> *Save*".
```{r}
snippet rscript
	# ${1:Description}
	# Author: ${2:Name}
	# Version: `r Sys.Date()`
	# Packages
	packs <- c("${3:Packages}")
	lapply(packs, require, character.only = TRUE)
	# Parameters
	# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	# Read data ---------------------------------------------------------------
	${4:code}
	# * Format data ----------------------------------------------------------
	# Analysis ----------------------------------------------------------------
	# Results -----------------------------------------------------------------
```
- To use the rscript snippet, type `rscript` slowly in a blank R script and hit "Enter" when the command auto-completes. A standard header and code template will be added to your script with the cursor placed where the code description should be entered (see @fig-snippet_output). After you have entered an appropriate description, pressing the "Tab" key will take the cursor to the area where the author's name should be entered. A third "Tab" keystroke will take the cursor to the area where packages can be specified, and a fourth "Tab" keystroke will take the cursor to the section where you can begin writing your code. Notice the snippet creates section names and the associated preliminary outline.
![The output of the rscript snippet in a blank R script.](snippet_output.png){#fig-snippet_output}
- You can specify package names (e.g. `dplyr::rename` vs. `rename`) when using uncommon functions to make the dependency clear and/or avoid conflicts between packages.
- Use pipes (`%>%` or `|>`) to chain together multiple, related operations where the intermediate steps are of little interest. Note that the piping operator `%>%` comes from the `magrittr` package and requires `magrittr`, or a package that re-exports it such as `dplyr`, to be loaded, whereas the native piping operator `|>` (available in R 4.1.0 and later) requires no packages to be loaded.
- The `clean_names()` function in the `janitor` package is an easy way to rename all columns consistently.
- Object names in R are case sensitive. Use capital letters sparingly if at all. It can be difficult to remember arbitrary capitalization in variable and function names.
- Use `<-` (not `=`) for object assignments. This helps distinguish object assignments from passing values/objects to function arguments. Make sure to use `=` (not `<-`) in function arguments to avoid making "accidental" objects.
- Use `TRUE` and `FALSE`, not `T` or `F` (the latter can be reassigned, the former cannot).
- Place spaces around all binary operators (`=`, `+`, `-`, `<-`, etc.). Do not place a space before a comma, but always place one after a comma. Place a space before a left parenthesis or left curly bracket, except in a function call. There should be a hard return after pipe operators.
```{r}
# An example of R script spacing
human_age <- function(dog_age) {
  # args: dog_age = the age of a medium-sized dog in years
  # return: the equivalent human age (in years) for the dog
  # from https://www.akc.org/expert-advice/health/how-to-calculate-dog-years-to-human-years/
  stopifnot(dog_age >= 0, dog_age <= 20)

  output <-
    if (dog_age <= 2) {
      dog_age * 12
    } else {
      24 + (dog_age - 2) * 5
    }

  output
}

# An example of hard returns in a pipe sequence
data %>%
  mutate(...) %>%
  ggplot(...)
```
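The pipe sequence above is schematic. A runnable sketch of the same layout, using the native pipe and base R functions only, might look like the following. It assumes R 4.1.0 or later, and the `catch` data frame is hypothetical data invented for illustration.

```r
# Hypothetical catch data, invented for illustration
catch <- data.frame(
  species = c("coho", "coho", "sockeye", "sockeye"),
  length_mm = c(520, 480, 450, 510)
)

# One operation per line, with a hard return after each pipe operator
mean_length <-
  catch |>
  subset(length_mm > 460) |>
  with(tapply(length_mm, species, mean))

mean_length
```

Each step receives the result of the previous step as its first argument, so no intermediate objects are left in the environment.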
- RStudio can automatically reformat your code. Go to "*Code* \> *Reformat Code*". The keystroke shortcut is "Ctrl+Shift+A".
- The autocomplete functionality in RStudio is a great way to avoid misspelling functions, data objects, and file paths.
- R abides by the Unix standard of using forward slashes (/) to navigate directories, while Windows systems use backward slashes (\\). R treats a backward slash inside a character string as an escape character, so a Windows pathway containing single backward slashes will either fail to parse or be misread. If you are using a Windows pathway in an R object, one option is to convert backward slashes to forward slashes (e.g., change `read.csv(".\data.csv")` to `read.csv("./data.csv")`); another is to double each backward slash.
- When typing file paths, `"/"` refers to your root directory (e.g., C drive if working locally on a Windows machine), `"./"` refers to your current working directory (location of your .Rproj file), `"../"` refers to the parent directory, `"../../"` to the grandparent, and so on. The `here` package is an option to simplify relative file paths.
- You can comment out multiple lines of code in RStudio by highlighting and pressing "Ctrl+Shift+C".
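Two of the tips above, preferring `TRUE`/`FALSE` over `T`/`F` and using `=` rather than `<-` inside function calls, can be demonstrated with a short base R sketch. The object names are illustrative only.

```r
# T and F are ordinary variables that can be overwritten; TRUE and FALSE are reserved words
T <- 0
t_is_true <- isTRUE(T)           # FALSE, because T now holds 0
rm(T)                            # remove the local T so the base definition is visible again

# Using `<-` inside a function call creates an object as a side effect
mean_a <- mean(x <- c(1, 2, 3))  # works, but leaves `x` behind in the environment
x_created <- exists("x", inherits = FALSE)

# Using `=` passes the value to the argument without creating an object
rm(x)
mean_b <- mean(x = c(1, 2, 3))
x_absent <- !exists("x", inherits = FALSE)
```

Both calls to `mean()` return the same value; the difference is only whether a stray `x` is left in the workspace, where it could mask a later programming error.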
## Resources
Although this document was based on recommendations from ADF&G employees, many of the recommendations were summarized from outside resources including:
- <https://style.tidyverse.org/>
- <https://www.geeksforgeeks.org/coding-standards-and-guidelines/>
- <https://github.com/Kristories/awesome-guidelines>
- <https://google.github.io/styleguide/Rguide.xml>