---
title: "Introduction to R and Basic Data Analysis"
subtitle: "ACTEX Learning - AFDP: R Session 1.2"
author: "Federica Gazzelloni and Farid Flici"
date: "2024/10/03"
format:
  html:
    toc: true
editor: visual
execute:
  warning: false
  message: false
---
```{r}
#| echo: false
library(ggplot2)
book_theme <- theme_minimal() +
theme(plot.title=element_text(face="bold"))
ggplot2::theme_set(book_theme)
```
# Getting into Practice with R
`Data manipulation` is a fundamental skill in the field of `data science`. It involves the process of `transforming`, `organizing`, and `cleaning` raw data to make it suitable for analysis and visualization. `R` can be used in tasks such as determining `premium rates` for mortality and morbidity products, `evaluating the probability` of financial loss or return, providing business `risk consulting`, and `planning` for pensions and retirement. In essence, `R` is a powerful tool for actuaries and data scientists to perform complex data analysis and modeling.
# R Packages
R packages are collections of functions, data, and compiled code in a well-defined format. They are stored in a directory called `library`. A fresh R installation includes a set of packages that are loaded automatically when R starts. These packages are referred to as the base packages. The base packages are always available, and they do not need to be loaded explicitly.
Other packages are available for download from [CRAN (Comprehensive R Archive Network)](https://cran.r-project.org/web/packages/available_packages_by_name.html) or from other repositories such as [GitHub](https://github.com/), which usually hosts development versions. To install a package from CRAN, use the `install.packages()` function with the package name in quotes, for example `install.packages("dplyr")`. To load an installed package, use the `library()` function.
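A common pattern is to install a package only when it is missing and then load it. The sketch below uses `stats` as the package name purely as an assumption, because it ships with R and so nothing is downloaded; in practice the name would be a CRAN package such as `dplyr`.

```{r}
# Package name used for illustration; "stats" ships with R, so nothing is downloaded
pkg <- "stats"

# Install from CRAN only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Attach the package for the current session
library(pkg, character.only = TRUE)

# A function can also be called without attaching its package, using ::
median_value <- stats::median(c(1, 3, 10))
median_value
```

Installation is needed once per R installation, while `library()` has to be called in every new session.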
# Data Types
R has several data types, including numeric, character, logical, integer, and complex. The most common data types are:
- `Numeric` - real numbers
- `Character` - text
- `Logical` - TRUE or FALSE
- `Integer` - whole numbers
Data can be organized in the following structures:
- `Vectors` of numbers, characters, or logical values
- `Matrices` which are arrays with two dimensions
- `Arrays` which are multi-dimensional generalizations of matrices
- `Data Frames` which are tables whose columns can hold different types
- `Lists` which are collections of objects
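The types and structures listed above can be created and inspected as follows. This is a minimal sketch; all object names (`ex_num`, `ex_mat`, and so on) are made up for illustration.

```{r}
# Atomic vectors of the four common types
ex_num <- c(1.5, 2, 3.25)        # numeric (stored as double)
ex_int <- c(1L, 2L, 3L)          # integer (note the L suffix)
ex_chr <- c("a", "b", "c")       # character
ex_lgl <- c(TRUE, FALSE, TRUE)   # logical

# Check the type of each vector
sapply(list(ex_num, ex_int, ex_chr, ex_lgl), typeof)

# Two-dimensional matrix and three-dimensional array
ex_mat <- matrix(1:6, nrow = 2)
ex_arr <- array(1:24, dim = c(2, 3, 4))
dim(ex_mat)
dim(ex_arr)

# Data frame: columns of equal length, possibly of different types
ex_df <- data.frame(id = ex_int, label = ex_chr, flag = ex_lgl)
str(ex_df)

# List: a collection of arbitrary objects
ex_list <- list(numbers = ex_num, table = ex_df)
names(ex_list)
```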
# Basic R Syntax
R is a case-sensitive language, meaning that it distinguishes between uppercase and lowercase letters. It uses the `#` symbol to add comments to the code. Comments are ignored by the R interpreter and are used to explain the code.
```{r}
# Basic arithmetic and variable assignment
x <- 10
y <- 5
sum_xy <- x + y
sum_xy
```
```{r}
# Simple interest formula
P <- 1000 # Principal
r <- 0.05 # Interest rate
t <- 2 # Time in years
A <- P * (1 + r * t)
A
```
# R Functions
R has strong functional programming features: most work is done by calling functions. Functions are blocks of code that perform a specific task. They take input, process it, and return output.
There are two types of functions in R: `base functions` and `user-defined functions`. Base functions are built into R, while user-defined functions are created by the user.
## Base Functions
For example, the `mean()` function takes a vector of numbers as input and returns their average. To access the documentation of any function in R, use the `help()` function or the `?` operator.
```{r}
#| eval: false
?mean
help(mean)
```
The `c()` function is used to combine values into a vector or list.
```{r}
mean(c(1, 2, 3, 4, 5))
```
Useful base R functions:
- `mean()` - calculates the average of a set of numbers
- `sum()` - calculates the sum of a set of numbers
- `sd()` - calculates the standard deviation of a set of numbers
- `var()` - calculates the variance of a set of numbers
- `min()` - returns the minimum value in a set of numbers
- `max()` - returns the maximum value in a set of numbers
- `length()` - returns the length of a vector
- `str()` - displays the structure of an R object
- `class()` - returns the class of an object
- `typeof()` - returns the type of an object
- `summary()` - provides a summary of an object
- `plot()` - creates a plot of data ...
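Several of the functions listed above can be tried on a small vector. The loss amounts below are hypothetical values chosen only for illustration.

```{r}
# Hypothetical loss amounts
ex_losses <- c(1200, 850, 4300, 150, 2600)

length(ex_losses)    # number of elements
sum(ex_losses)       # total
mean(ex_losses)      # average
min(ex_losses)       # smallest value
max(ex_losses)       # largest value
sd(ex_losses)        # sample standard deviation
var(ex_losses)       # sample variance (the square of sd)
summary(ex_losses)   # minimum, quartiles, mean, and maximum
```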
## User-Defined Functions
```{r}
#| eval: false
name <- function(variables) {
# Code block
}
```
```{r}
x <- c(1, 2, 3, 4, 5)
avg <- function(x) {
sum(x)/length(x)
}
```
```{r}
avg(x)
mean(x)
```
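As a further illustration, the simple interest calculation from the Basic R Syntax section could be wrapped in a user-defined function. This is only a sketch: the function name `simple_interest` and its argument names are chosen here for illustration, and the default of one year is an assumption.

```{r}
# Accumulated amount under simple interest
# principal: initial amount; rate: annual interest rate; years: time in years
simple_interest <- function(principal, rate, years = 1) {
  principal * (1 + rate * years)
}

# Same inputs as the earlier example (P = 1000, r = 0.05, t = 2)
amount_two_years <- simple_interest(1000, 0.05, 2)
amount_two_years

# Relying on the default of one year
amount_one_year <- simple_interest(1000, 0.05)
amount_one_year
```

Arguments with default values, such as `years = 1`, can be omitted when the function is called.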
# Essential Data Preparation Techniques: Wrangling, Manipulation, and Transformation
Data wrangling, manipulation, and transformation are essential techniques for preparing and refining data for analysis.
- **Data wrangling** involves cleaning raw data—handling missing values, correcting errors, and merging datasets—to ensure accuracy and consistency.
- **Data manipulation** builds on this by adjusting, filtering, and transforming specific aspects of the data, such as selecting relevant columns, filtering rows based on conditions, and creating new variables.
- **Data transformation** goes deeper by reshaping the structure of the data (e.g., converting between wide and long formats), normalizing values, or applying mathematical operations like log transformations to ensure that the data is suitable for complex analysis or modeling.
These processes, often performed using R packages like dplyr, tidyr, and data.table, are crucial steps in converting raw, disorganized data into meaningful insights. Together, they form the foundation for effective data analytics, which uses cleaned and well-structured data to perform descriptive, predictive, and inferential analyses.
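The reshaping and log transformation mentioned above can be sketched with `tidyr` and `dplyr`. The toy data frame `ex_wide` and its column names are hypothetical.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column of claim amounts per year
ex_wide <- data.frame(
  policy = c("P1", "P2"),
  claims_2022 = c(1000, 250),
  claims_2023 = c(1500, 400)
)

ex_long <- ex_wide %>%
  # Wide to long: one row per policy and year
  pivot_longer(
    cols = starts_with("claims_"),
    names_to = "year",
    names_prefix = "claims_",
    values_to = "claim_amount"
  ) %>%
  # Log transformation of the claim amount
  mutate(log_amount = log(claim_amount))

ex_long
```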
As an example, consider a dataset with columns for `ID`, `Age`, and `Salary`. We can remove rows with missing values in the `Age` column and create a new column called `IncomeGroup` based on the `Salary` column.
```{r}
# Create a data frame
data <- data.frame(
ID = 1:5,
Age = c(25, 30, NA, 45, 35),
Salary = c(50000, 60000, 55000, NA, 70000)
)
data
```
Handling missing values and creating a new variable can be done using the `tidyverse` set of packages. The `dplyr` package provides functions (or verbs) for data manipulation, such as `filter()`, `mutate()`, `select()`, and more. The `magrittr` package provides the pipe operator (`%>%`), used to chain functions together and make the code more readable. Since version 4.1.0, base R also includes its own native pipe operator (`|>`).
Load the `tidyverse` package:
```{r}
library(tidyverse)
```
```{r}
cleaned_data <- data %>%
# Remove rows with missing Age values
filter(!is.na(Age)) %>%
# Create a new column based on Salary
mutate(IncomeGroup = ifelse(Salary > 60000, "High", "Low"))
cleaned_data
```
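For simple chains like this one, the `magrittr` pipe and the native pipe give the same result. The sketch below assumes R 4.1.0 or later (required for `|>`) and uses a small hypothetical data frame, `ex_people`.

```{r}
library(dplyr)

# Hypothetical data with one missing Age
ex_people <- data.frame(
  ID = 1:3,
  Age = c(25, NA, 40),
  Salary = c(50000, 60000, 70000)
)

# magrittr pipe
with_magrittr <- ex_people %>% filter(!is.na(Age))

# Native pipe (R >= 4.1.0)
with_native <- ex_people |> filter(!is.na(Age))

identical(with_magrittr, with_native)
```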
Another useful function is `clean_names()` from the `janitor` package. It cleans the column names of a data frame by converting them to lowercase, replacing spaces with underscores, and removing special characters.
```{r}
cleaned_data %>%
# Clean column names (here the janitor package is called with ::)
janitor::clean_names()
```
# Example 1: Simulate Claims Data
In this example we will simulate a dataset of `claims data`. We will assume that the claims follow an exponential distribution with a mean of 5000. We will simulate 1000 claims and plot the data.
R provides functions to generate random numbers from different distributions, such as `rexp()` for the exponential distribution, `rnorm()` for the normal distribution, `runif()` for the uniform distribution, and others.
We can create a variable by the name of `avg_claim_size` to store the average claim size and a variable `num_claims` to store the number of claims to simulate.
```{r}
avg_claim_size <- 5000
num_claims <- 1000
```
For reproducibility, we can use the `set.seed()` function, which ensures that the random numbers generated are the same each time the code is run. We then use the `rexp()` function to generate random numbers from an exponential distribution; its `rate` argument is the reciprocal of the mean. Finally, we plot the data using the `plot()` function.
```{r}
set.seed(123)
claims <- rexp(n = num_claims,
rate = 1/avg_claim_size)
plot(claims)
```
To access the help page for the `rexp()` function, use the following code: `?rexp`
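Two points from this example can be checked directly: a fixed seed reproduces the same draws, and an exponential sample with `rate = 1/5000` has a mean close to 5000. The seeds and the sample size of 100,000 below are arbitrary choices for illustration.

```{r}
# The same seed gives the same random numbers
set.seed(1)
draw_a <- runif(3)
set.seed(1)
draw_b <- runif(3)
identical(draw_a, draw_b)

# For an exponential distribution, mean = 1 / rate
set.seed(2024)
ex_sim <- rexp(n = 100000, rate = 1 / 5000)
mean(ex_sim)   # should be close to 5000

# Other generators follow the same naming pattern
ex_norm <- rnorm(5, mean = 0, sd = 1)
ex_unif <- runif(5, min = 0, max = 1)
```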
## Inspecting Data
To inspect data we can use functions such as `str()`, `class()`, or `typeof()`. The `str()` function provides a compact display of the internal structure of an R object. The `class()` function returns the class of the object. The `typeof()` function returns the type of the object. Another function that can be used to inspect data is `glimpse()` from the `dplyr` package.
```{r}
str(claims)
```
```{r}
class(claims)
```
```{r}
typeof(claims)
```
The `claims` data is a numeric vector.
To have a basic summary of the data, we can use the `summary()` function.
```{r}
summary(claims)
```
## Data Visualization
Base R provides a set of functions for data visualization, such as `plot()`, `barplot()`, `hist()`, and `boxplot()`, to create scatter plots, line plots, bar plots, histograms, and box plots.
For example, to make a histogram of the claims data, use the `hist()` function. When given a single number, the `breaks` argument suggests how many intervals to divide the data into; R may adjust it to obtain round break points.
```{r}
hist(claims, breaks = 30)
```
In addition, for a more customized plot, we can use the `ggplot2` package, part of the `tidyverse` set of packages. The `ggplot()` function initializes a plot, and the `geom_*()` functions add layers to it; the layers are chained using the `+` operator. More information on `ggplot2` can be found in the **ggplot2: Elegant Graphics for Data Analysis** book by Hadley Wickham.
We can produce the same histogram as before using `ggplot2`, but first we need to convert the claims vector to be a data frame.
```{r}
set.seed(1123)
claims_df <- data.frame(claim_id = sample(x = 1:num_claims,
size = num_claims,
replace = FALSE),
claim_amount = claims)
head(claims_df)
```
To create the histogram using `ggplot2`, we map variables to the plot's aesthetics with the `aes()` function (a histogram needs only the x-axis variable, since the counts on the y-axis are computed for us) and use the `geom_histogram()` function to draw it. The `labs()` function adds a title and axis labels. Other customizations, such as `color` and `fill`, can also be added.
```{r}
ggplot(data = claims_df,
aes(x = claim_amount)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
labs(title = "Simulated Claims Data",
x = "Claim Amount",
y = "Frequency")
```
# Example 2: Motor Insurance Claims Data
For this example, we will use the French Motor Third-Party Liability datasets: `freMTPLfreq` and `freMTPLsev`. The datasets contain the frequency and severity of claims for a motor insurance portfolio.
The datasets are from the `CASdatasets` R-package, which provides a collection of datasets for actuarial science used in the book **"Computational Actuarial Science with R"** by Arthur Charpentier and Rob Kaas. We do not need to install the package as the data we need is already available in the `data` folder of this project. For more information on how to install the `CASdatasets` package please check the `helper00.R` file in the project folder.
Data can be accessed by loading it into the R environment with the `load()` function.
```{r}
# freMTPLfreq has been reduced to 1000 rows for demonstration purposes
load("data/freMTPLfreq_1000.rda")
load("data/freMTPLsev.rda")
```
R can read and write many file formats, such as .RData, .rda, .rds, .csv, .txt, .xls, and .xlsx. They are loaded with functions such as `load()`, `readRDS()`, `read.csv()`, `read.table()`, and `read_excel()` (from the `readxl` package). Each file type has its own storage format and a matching function to read it.
The R data format, usually with the extension .RData or .rda, is designed for storing a complete R workspace or selected objects from a workspace in a form that R can load back. Once the file is loaded, the data is stored in the `Environment` as an object, which can be accessed by its name.
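The round trip between R objects and files can be sketched with temporary files, so that nothing is written into the project folder. The data frame `ex_tbl` and the file paths are hypothetical.

```{r}
# Hypothetical table to save and read back
ex_tbl <- data.frame(id = 1:3, amount = c(100.5, 200, 50))

# Temporary file paths, one per format
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
rda_path <- tempfile(fileext = ".rda")

# CSV: plain text, readable by other software
write.csv(ex_tbl, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)

# RDS: a single R object, assigned to a name of our choice when read
saveRDS(ex_tbl, rds_path)
from_rds <- readRDS(rds_path)

# RDA: one or more objects, restored under their original names
save(ex_tbl, file = rda_path)
loaded_names <- load(rda_path)   # load() returns the names of the restored objects
loaded_names
```

Unlike `readRDS()`, `load()` restores objects under the names they were saved with, which is why the datasets in this example appear in the environment as `freMTPLfreq` and `freMTPLsev`.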
Here we look at the first few rows of each dataset:
```{r}
head(freMTPLfreq)
```
```{r}
head(freMTPLsev)
```
## Inspecting Data
The `freMTPLfreq` dataset contains the risk features and the claim number, while the `freMTPLsev` dataset contains the claim amount and the corresponding policy ID.
We can use the `glimpse()` function from the `dplyr` package to get a quick overview of the datasets.
```{r}
glimpse(head(freMTPLfreq))
```
```{r}
glimpse(head(freMTPLsev))
```
Check whether there are any missing values in the datasets. The `is.na()` function flags each missing value, and the `any()` function returns `TRUE` if at least one value is flagged.
```{r}
any(is.na(freMTPLfreq))
any(is.na(freMTPLsev))
```
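When `any()` does report missing values, it helps to know where they are. A minimal sketch on a hypothetical data frame, counting missing values per column and the number of complete rows:

```{r}
# Hypothetical data with missing values
ex_missing <- data.frame(
  Age = c(25, NA, 45),
  Salary = c(NA, NA, 70000)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(ex_missing))
na_per_column

# Number of rows without any missing value
n_complete <- sum(complete.cases(ex_missing))
n_complete
```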
## Data Preparation
The two datasets have a common variable, `PolicyID`, which can be used as a key to merge the datasets to create a single dataset that contains both the frequency and severity of claims.
Something to notice is that `PolicyID` is stored as a character variable in the `freMTPLfreq` dataset and as an integer variable in the `freMTPLsev` dataset. So, in order to proceed with merging the two sets, we need to transform the `PolicyID` variable in the `freMTPLfreq` dataset to an integer type.
## Mutate and Merge Data
The `mutate()` function is used to create a new variable or to modify an existing one. Here it converts `PolicyID` to an integer type.
```{r}
freMTPLfreq <- freMTPLfreq %>%
mutate(PolicyID = as.integer(PolicyID))
```
We could have done this with base R as well, using the `as.integer()` function.
```{r}
freMTPLfreq$PolicyID <- as.integer(freMTPLfreq$PolicyID)
```
The `$` operator accesses a variable (column) in a data frame. The `class()` function checks the class of the modified object.
```{r}
freMTPLfreq$PolicyID %>% class()
```
We can now combine the two datasets using the `merge()` function from base R which merges two datasets based on a common variable, in this case, the `PolicyID` variable.
There are other functions that can be used to merge datasets, such as `inner_join()`, `left_join()`, `right_join()`, and `full_join()` from the `dplyr` package, depending on the desired output.
```{r}
claims_data_raw <- freMTPLfreq %>%
merge(freMTPLsev, by = "PolicyID")
claims_data_raw %>% dim()
```
```{r}
claims_data_raw %>% names()
```
```{r}
claims_data_raw %>%
summary()
```
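The `dplyr` join functions mentioned above differ in which rows they keep. The sketch below uses two small hypothetical tables rather than the real datasets: policy 2 has two claims, and policy 4 appears only in the claims table.

```{r}
library(dplyr)

# Hypothetical policy and claims tables
ex_policies <- data.frame(PolicyID = 1:3, DriverAge = c(30, 45, 60))
ex_claims <- data.frame(PolicyID = c(2L, 2L, 4L),
                        ClaimAmount = c(500, 1200, 800))

# Keep only policies that have a claim
ex_inner <- inner_join(ex_policies, ex_claims, by = "PolicyID")

# Keep all policies; ClaimAmount is NA where there is no claim
ex_left <- left_join(ex_policies, ex_claims, by = "PolicyID")

# Keep everything from both tables
ex_full <- full_join(ex_policies, ex_claims, by = "PolicyID")

# By default, merge() keeps only matching rows, like inner_join()
ex_merged <- merge(ex_policies, ex_claims, by = "PolicyID")

c(inner = nrow(ex_inner), left = nrow(ex_left),
  full = nrow(ex_full), merge = nrow(ex_merged))
```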
## Filter and Summarize Data
We can filter the data to include only claims greater than 1000 and summarize the total claims by driver age. The `group_by()` function groups the data by a specific variable, and the `reframe()` function (available in `dplyr` 1.1.0 and later) calculates summary statistics for each group; `summarise()` is the usual choice when each group returns a single row.
```{r}
# Filter data for claims greater than 1000
high_claims <- claims_data_raw %>% filter(ClaimAmount > 1000)
```
```{r}
ggplot(data = high_claims,
aes(x = factor(DriverAge), y = ClaimAmount)) +
geom_col() +
labs(title = "High Claims by Driver Age",
x = "Driver Age",
y = "Claim Amount")
```
## Group and Summarize Total Claims by DriverAge
```{r}
claims_summary <- claims_data_raw %>%
group_by(DriverAge) %>%
reframe(TotalClaims = sum(ClaimAmount))
claims_summary
```
```{r}
ggplot(data = claims_summary,
aes(x = factor(DriverAge), y = TotalClaims)) +
# One bar per driver age, with height equal to the total claim amount
geom_col() +
labs(title = "Total Claims by Driver Age",
x = "Driver Age",
y = "Claim Amount")
```
## Premium Calculation
Let's apply a simple `Premium formula` to the claims data. The premium formula is given by:

$$
\text{Premium} = \frac{\text{projected claims} + \text{fixed expenses}}{1 - \text{expenses as } \% \text{ of premium}}
$$

We calculate the projected claims as the product of the number of claims (`ClaimNb`) and the claim amount (`ClaimAmount`). Then we simulate fixed expenses as a random number between 100 and 500, and expenses as a percentage of premium between 10% and 30%, using the `runif()` function.
```{r}
# Setting seed for reproducibility
set.seed(42)
claims_data <- claims_data_raw %>%
  mutate(
    # Projected claims: number of claims times claim amount
    projected_claims = ClaimNb * ClaimAmount,
    # Simulate fixed expenses between 100 and 500
    fixed_expenses = runif(n(), 100, 500),
    # Simulate expenses percentage between 10% and 30%
    expenses_as_percent_of_premium = runif(n(), 0.1, 0.3)
  )
```
With the `mutate()` function we create the `Premium` variable using the formula provided. The `dim()` function is used to check the dimensions of the dataset.
```{r}
# Calculate Premium using the given formula
claims_data <- claims_data %>%
mutate(
Premium = (projected_claims + fixed_expenses) / (1 - expenses_as_percent_of_premium)
)
claims_data %>% dim()
```
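A single worked case makes the premium formula easier to check by hand. The figures below (projected claims of 2000, fixed expenses of 300, and an expense ratio of 20%) are hypothetical, and the helper name `premium_from_costs` is introduced only for this sketch.

```{r}
# Premium = (projected claims + fixed expenses) / (1 - expense ratio)
premium_from_costs <- function(projected_claims, fixed_expenses, expense_ratio) {
  (projected_claims + fixed_expenses) / (1 - expense_ratio)
}

# Hypothetical inputs: (2000 + 300) / (1 - 0.2)
ex_premium <- premium_from_costs(2000, 300, 0.2)
ex_premium

# Sanity check: after paying both kinds of expenses,
# what remains of the premium covers the projected claims
ex_remaining <- ex_premium - 0.2 * ex_premium - 300
ex_remaining
```

Dividing by one minus the expense ratio grosses the costs up, so that the variable expenses, which are themselves a share of the premium, are covered as well.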
To select specific columns from a dataset we can use the `select()` function from the `dplyr` package. Then, the `head()` function is used to display the first few rows of the dataset.
```{r}
# View the results
claims_data %>%
select(PolicyID,
projected_claims,
fixed_expenses,
expenses_as_percent_of_premium,
Premium) %>%
head()
```
## Visualize Projected Claims vs Premium
```{r}
ggplot(data = claims_data %>% filter(Premium < 10000),
aes(x = projected_claims, y = Premium)) +
# Add points
geom_point() +
# Add a smooth line
geom_smooth()+
labs(title = "Projected Claims vs Premium",
x = "Projected Claims",
y = "Premium")
```
## Group Data by Age and Calculate Average Premium
Let's check the range of the `DriverAge` variable and group the data by age to calculate the average premium for each age group.
```{r}
range(claims_data$DriverAge)
```
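Finally, we group the data into age bands and calculate the average premium for each band. The `cut_interval()` function from `ggplot2` splits the age range into `n` intervals of equal width, while the base R `cut()` function lets us set the band limits explicitly with the `breaks` argument.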
```{r}
claims_data %>%
mutate(DriverAge_group = cut_interval(DriverAge,
n = 5)) %>%
group_by(DriverAge_group) %>%
reframe(avg_Premium = mean(Premium))
```
```{r}
claims_data_ag <- claims_data %>%
mutate(DriverAge_group = cut(DriverAge,
breaks = c(18, 25, 35,
45, 55, 65,
75, 85, 95, 99),
include.lowest = TRUE)) %>%
arrange(DriverAge_group) %>%
group_by(DriverAge_group) %>%
reframe(avg_Premium = mean(Premium))
claims_data_ag
```
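The behaviour of `cut()` at the band limits is easiest to see on a few values. In this sketch the ages and breaks are hypothetical; intervals are open on the left and closed on the right by default, so `include.lowest = TRUE` is needed to keep the smallest break (18) in the first band.

```{r}
# Hypothetical ages, including values that sit exactly on a break
ex_ages <- c(18, 25, 26, 40, 99)
ex_breaks <- c(18, 25, 35, 45, 99)

# With include.lowest = TRUE the first band is [18,25]
ex_bands <- cut(ex_ages, breaks = ex_breaks, include.lowest = TRUE)
table(ex_bands)

# Without it, an age equal to the lowest break falls outside every band
ex_bands_open <- cut(ex_ages, breaks = ex_breaks)
is.na(ex_bands_open)
```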
```{r}
claims_data_ag %>%
ggplot(aes(x = DriverAge_group, y = avg_Premium)) +
geom_col(fill = "blue") +
labs(title = "Average Premium by Age Group",
x = "Driver Age Group",
y = "Average Premium")
```
# Summary
In this lesson, we have introduced the basics of R programming, including functions, data types, and syntax. We have explored data manipulation techniques using base R functions and the `tidyverse` set of packages. We have also worked with simulated and real-world claims data to perform data wrangling, manipulation, and visualization.
R is a powerful tool for actuaries and data scientists, offering a wide range of functions and packages for data analysis, modeling, and visualization. By mastering R programming, actuaries can enhance their analytical skills, improve decision-making, and drive innovation in the insurance industry.
In the next lesson, we will delve deeper into data analysis and modeling techniques, exploring predictive modeling using R.
# Resources
- Charpentier, A., & Kaas, R. (2014). [Computational Actuarial Science with R. CRC Press](https://www.amazon.it/Computational-Actuarial-Science-Arthur-Charpentier/dp/1466592591)
- Wickham, H. (2016). [ggplot2: Elegant Graphics for Data Analysis](https://ggplot2-book.org/)
- Tidyverse (2021). [Tidyverse: Easily Install and Load the 'Tidyverse'](https://www.tidyverse.org/)