new cran release
MHaringa committed Oct 9, 2024
1 parent 8f6cb60 commit 363137b
Showing 308 changed files with 6,321 additions and 57,191 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,7 +1,7 @@
Package: insurancerating
Type: Package
Title: Analytic Insurance Rating Techniques
Version: 0.7.4.9000
Version: 0.7.5
Authors@R: c(person("Martin", "Haringa", role = c("aut", "cre"), email = "mtharinga@gmail.com"))
Maintainer: Martin Haringa <mtharinga@gmail.com>
BugReports: https://github.com/MHaringa/insurancerating/issues
@@ -11,7 +11,7 @@ Description: Functions to build, evaluate, and visualize insurance rating
data-driven strategy for constructing insurance tariff classes, drawing on
the work of Antonio and Valdez (2012) <doi:10.1007/s10182-011-0152-7>.
License: GPL (>= 2)
URL: https://github.com/mharinga/insurancerating, https://mharinga.github.io/insurancerating/, https://github.com/MHaringa/insurancerating
URL: https://mharinga.github.io/insurancerating/, https://github.com/MHaringa/insurancerating
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.3.2
2 changes: 1 addition & 1 deletion NEWS.md
@@ -1,4 +1,4 @@
# insurancerating (development version)
# insurancerating 0.7.5

* `rating_factors()` now always returns correct output when the exposure column in the data is not named `exposure`
* `intercept_only` added to `update_glm()`: it applies the manual changes and refits only the intercept, ensuring that the changes have no impact on the other variables.
4 changes: 2 additions & 2 deletions R/model_add_prediction.R
@@ -18,10 +18,10 @@
#' @examples
#' mod1 <- glm(nclaims ~ age_policyholder, data = MTPL,
#' offset = log(exposure), family = poisson())
#' add_prediction(MTPL, mod1)
#' mtpl_pred <- add_prediction(MTPL, mod1)
#'
#' # Include confidence bounds
#' add_prediction(MTPL, mod1, conf_int = TRUE)
#' mtpl_pred_ci <- add_prediction(MTPL, mod1, conf_int = TRUE)
#'
#' @export
add_prediction <- function(data, ..., var = NULL, conf_int = FALSE,
77 changes: 69 additions & 8 deletions README.Rmd
@@ -14,7 +14,8 @@ knitr::opts_chunk$set(

<!-- badges: start -->

[![CRAN Status](https://www.r-pkg.org/badges/version/insurancerating)](https://cran.r-project.org/package=insurancerating) [![Downloads](https://cranlogs.r-pkg.org/badges/insurancerating?color=blue)](https://cran.rstudio.com/package=insurancerating)
[![CRAN Status](https://www.r-pkg.org/badges/version/insurancerating)](https://cran.r-project.org/package=insurancerating) [![Downloads](https://cranlogs.r-pkg.org/badges/insurancerating?color=blue)](https://cran.r-project.org/package=insurancerating)


<!-- badges: end -->

@@ -67,10 +68,12 @@ The following indicators are calculated:

**Note on Exposure and Risk Premium**

In the context of insurance:
In insurance, *exposure* refers to the level of risk an insurer takes on when providing coverage for a certain asset, like a vehicle, over a period of time. For example, in vehicle insurance, exposure is often measured in *vehicle-years*, indicating how long the vehicle is covered and the likelihood of a claim being made.

For example, in vehicle insurance:

- The term *exposure* refers to the subject or asset that is being insured. For example, an insured vehicle is considered an exposure.
- If a vehicle is insured as of July 1st for a particular year, the insurance company would record this as an exposure of 0.5 for that year. This means that the vehicle was insured for half the year.
- If a car is insured for a full year, its exposure is counted as 1.
- If a vehicle is insured for six months, its exposure would be 0.5.

Additionally, the term *risk premium* is used interchangeably with *pure premium* or *burning cost*. These terms represent the amount of premium that is required to cover the expected loss, without including any additional expenses or profit margins.
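
For illustration, here is a minimal, hypothetical sketch (not part of the package) of how exposure in vehicle-years and a simple risk premium per vehicle-year could be computed in base R; the data and column names are made up:

```{r exposure_sketch, eval = FALSE}
# Hypothetical data: three policies covering (parts of) 2024, with claim amounts.
policies <- data.frame(
  start  = as.Date(c("2024-01-01", "2024-07-01", "2024-10-01")),
  end    = as.Date(c("2024-12-31", "2024-12-31", "2024-12-31")),
  amount = c(1200, 0, 450)
)

# Exposure in vehicle-years: insured days divided by the number of days in the year.
year_days <- as.numeric(as.Date("2025-01-01") - as.Date("2024-01-01"))
policies$exposure <- as.numeric(policies$end - policies$start + 1) / year_days
policies$exposure  # roughly 1.00, 0.50, 0.25

# Risk premium (pure premium / burning cost) per vehicle-year:
sum(policies$amount) / sum(policies$exposure)
```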

@@ -85,6 +88,8 @@ univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure)
Although the above table is small and easy to understand, the same information might be presented more effectively with a graph, as shown below.

```{r uni3, eval = TRUE, message = FALSE}
#| fig.alt: >
#| Show all available univariate plots
univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) |>
autoplot()
@@ -106,6 +111,8 @@ In `autoplot.univariate()`, `show_plots` specifies both which plots to display a
For instance, to display the exposure and claim frequency plots:

```{r uni4}
#| fig.alt: >
#| Show claim frequency and exposure
univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) |>
autoplot(show_plots = c(6,1))
@@ -115,6 +122,8 @@ univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) |>
Checking whether claim frequency remains consistent over the years is important for identifying trends or irregularities:

```{r uni5}
#| fig.alt: >
#| Show claim frequency over the years
set.seed(1)
sample_years <- sample(2016:2019, nrow(MTPL2), replace = TRUE)
@@ -129,6 +138,8 @@ MTPL2 |>
To remove the bars from the plot and display only the line graph, use `background = FALSE`:

```{r uni6}
#| fig.alt: >
#| Show claim frequency and exposure without histogram
univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) |>
autoplot(show_plots = c(6,1), background = FALSE)
@@ -138,6 +149,8 @@ univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) |>
`sort` arranges the levels of the risk factor in descending order based on exposure:

```{r uni7}
#| fig.alt: >
#| Show claim frequency and arrange levels in descending order
univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) |>
autoplot(show_plots = 1, background = FALSE, sort = TRUE)
@@ -147,6 +160,8 @@ univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) |>
`sort_manual` allows you to arrange the levels of a discrete risk factor according to your preferred order. This is useful when the levels have a natural sequence or when you want to exclude certain levels from the output.

```{r uni8}
#| fig.alt: >
#| Show claim frequency and arrange levels according to your preferred order
univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) |>
autoplot(show_plots = c(6,1), background = FALSE,
@@ -157,6 +172,8 @@ univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) |>
The graph below illustrates additional options:

```{r uni9, fig.width = 10, fig.height = 5}
#| fig.alt: >
#| Show graph with additional options
univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) |>
autoplot(show_plots = c(6,1), background = FALSE, sort = TRUE, ncol = 2,
@@ -168,6 +185,8 @@ univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) |>
Alternatively, you can create a bar graph to display the number of claims; this is the last `univariate()` plot with options presented here:

```{r uni10, eval = TRUE, message = FALSE, warning = FALSE}
#| fig.alt: >
#| Show number of claims
univariate(MTPL2, x = area, nclaims = nclaims) |>
autoplot(show_plots = 8, coord_flip = TRUE, sort = TRUE)
@@ -177,6 +196,8 @@ univariate(MTPL2, x = area, nclaims = nclaims) |>
In addition to `univariate()`, another option for one-way analysis is `histbin()`. This function allows you to create a histogram for continuous variables:

```{r hist1, eval = TRUE, message = FALSE, warning = FALSE}
#| fig.alt: >
#| Histogram for the distribution of the premium
histbin(MTPL2, premium, bins = 20)
@@ -185,6 +206,8 @@ histbin(MTPL2, premium, bins = 20)
In the context of insurance, it is common to encounter outliers in the data, and one way to address this issue is by grouping the outliers into a single bin:

```{r hist2, eval = TRUE, message = FALSE, warning = FALSE}
#| fig.alt: >
#| Histogram for the distribution of the premium with grouped outliers
histbin(MTPL2, premium, bins = 10, right = 110)
@@ -201,6 +224,8 @@ To do this, we fit a Generalized Additive Model (GAM) for *age_policyholder*. A
`fit_gam()` below displays the claim frequency (i.e. number of claims / exposure) for different age groups:

```{r cont1, eval = TRUE, message = FALSE, warning = FALSE}
#| fig.alt: >
#| Claim frequency for different age groups
age_policyholder_frequency <- fit_gam(data = MTPL,
nclaims = nclaims,
@@ -223,6 +248,8 @@ The first method is to bin the GAM output using evolutionary trees, which group
`construct_tariff_classes()` generates bins using evolutionary trees:

```{r cont2, eval = TRUE}
#| fig.alt: >
#| Claim frequency for different age groups with bins
clusters_freq <- construct_tariff_classes(age_policyholder_frequency)
@@ -254,7 +281,7 @@ glimpse(dat)
```

The last line above sets the base level of the factors (specifically `age_policyholder_freq_evt` and `age_policyholder_freq_man`) to the one with the highest exposure. For example, for `age_policyholder_freq_evt`, the age group (39, 84] is chosen as the base level because it has the most exposure.
`biggest_reference()` in the last line above establishes the baseline for the factors, specifically `age_policyholder_freq_evt` and `age_policyholder_freq_man`, using the one with the highest exposure. For instance, for `age_policyholder_freq_evt`, the age group `(39,84]` is selected as the baseline since it has the highest exposure.
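
As a hedged sketch only (the actual call sits in the collapsed code above), setting the base level could look like the following, assuming `biggest_reference()` takes a factor and an exposure weight:

```{r biggest_reference_sketch, eval = FALSE}
# Sketch, not the original code: set the base level of the binned age factors
# to the level carrying the highest exposure.
library(dplyr)

dat <- dat |>
  mutate(across(
    c(age_policyholder_freq_evt, age_policyholder_freq_man),
    ~ biggest_reference(.x, exposure)
  ))
```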

## Risk premium models

Expand All @@ -264,7 +291,7 @@ GLMs are favored because they allow for the modeling of complex relationships be

### Example 1

The following code generates two different models for claim frequency. `rating_factors()` displays the fitted coefficients:
The following code generates two different models for claim frequency.

```{r rp11, eval = TRUE}
@@ -278,13 +305,21 @@ mod_freq2 <- glm(nclaims ~ age_policyholder_freq_evt + age_policyholder,
family = "poisson",
data = dat)
```

A fitted linear model has coefficients for the different categories of the factor terms, usually one less than the total number of categories. `rating_factors()` includes the baseline for the factors with a coefficient of 1:

```{r rp11a}
rating_factors(mod_freq1, mod_freq2)
```

`autoplot.riskfactor()` generates a figure of the coefficients. The base level for the factor `age_policyholder_freq_cat` is the group with the highest exposure, which is displayed first.

```{r rp12, eval = TRUE}
#| fig.alt: >
#| Show rating factors
rating_factors(mod_freq1, mod_freq2) |>
autoplot()
@@ -294,6 +329,8 @@ rating_factors(mod_freq1, mod_freq2) |>
The figure above displays the age groups in a non-natural order, with the group aged 39 to 84 appearing before the group aged 18 to 25. To arrange the ages in their natural order, include `model_data` in `rating_factors()` to sort the clustering in the original sequence. Please note that ordering the factor `age_policyholder_freq_evt` will only work if `biggest_reference()` is used to set the base level of the factor to the level with the highest exposure.

```{r rp13, eval = TRUE}
#| fig.alt: >
#| Show rating factors in natural order
rating_factors(mod_freq1, mod_freq2, model_data = dat) |>
autoplot()
@@ -303,6 +340,9 @@ rating_factors(mod_freq1, mod_freq2, model_data = dat) |>
The following graph presents additional options, for example, including the exposure displayed as a bar graph:

```{r rp14, eval = TRUE}
#| fig.alt: >
#| Show rating factors in natural order, including the exposure displayed
#| as a bar graph
rating_factors(mod_freq1, mod_freq2, model_data = dat, exposure = exposure) |>
autoplot(linetype = TRUE)
@@ -367,6 +407,8 @@ rating_factors(burn_unrestricted)
While the table above is concise and easy to interpret, the same information can be presented more effectively through a graph, as shown below. This visualization makes it easier to assess whether the coefficients follow the desired trend:

```{r rp33, fig.width = 10}
#| fig.alt: >
#| Show coefficients for the age of the policyholder
rating_factors(burn_unrestricted, model_data = MTPL_premium, exposure = exposure) |>
autoplot(risk_factor = "age_policyholder_freq_man")
@@ -394,6 +436,9 @@ In `smooth_coef()`, `x_cut` refers to the risk factor with clusters, in this cas
`autoplot()` generates a figure for the smoothed estimates. The blue segments represent the estimates from the unrestricted model, while the black line displays the smoothed coefficients. The red segments indicate the newly estimated coefficients based on the polynomial and the selected age groups. These age groups can be chosen to align with commercial objectives:

```{r rp35, eval = TRUE, message = FALSE, warning = FALSE}
#| fig.alt: >
#| Show smoothed coefficients by means of a polynomial
#| for the age of the policyholder
burn_unrestricted |>
smooth_coef(x_cut = "age_policyholder_freq_man",
@@ -408,6 +453,8 @@ As illustrated above, the fitted polynomial yields excessively high coefficients
The degree can be adjusted to a lower-order polynomial (in this case, set to 1), resulting in a straight line, which is not ideal:

```{r rp36, eval = TRUE, message = FALSE, warning = FALSE}
#| fig.alt: >
#| Show smoothed coefficients by means of a lower-order polynomial
burn_unrestricted |>
smooth_coef(x_cut = "age_policyholder_freq_man",
@@ -421,6 +468,8 @@ burn_unrestricted |>
In most cases, and particularly in this situation, a better alternative is to use a GAM rather than a polynomial:

```{r rp37, eval = TRUE, message = FALSE, warning = FALSE}
#| fig.alt: >
#| Show smoothed coefficients by means of a GAM
burn_unrestricted |>
smooth_coef(x_cut = "age_policyholder_freq_man",
@@ -434,6 +483,8 @@ burn_unrestricted |>
For ages above 80, the fitted line decreases rapidly, even though there is very little exposure in this age group, so the estimate there is driven by only a few observations. Therefore, the GAM should be weighted by the exposure, resulting in a weighted GAM:

```{r rp38, eval = TRUE, message = FALSE, warning = FALSE, fig.width = 10, fig.height = 8}
#| fig.alt: >
#| Show smoothed coefficients by means of a weighted GAM
burn_unrestricted |>
smooth_coef(x_cut = "age_policyholder_freq_man",
@@ -450,7 +501,9 @@ We now observe a pattern that looks quite desirable (especially when compared to

To achieve this, `smooth_coef()` offers options for monotonic increasing ("mpi") or monotonic decreasing ("mpd") trends. These are modeled using shape-constrained additive models (SCAMs).

```{r rp39, eval = TRUE, message = FALSE, warning = FALSE, fig.width = 10, fig.height=8}
```{r rp39, eval = TRUE, message = FALSE, warning = FALSE, fig.width = 10, fig.height = 8}
#| fig.alt: >
#| Show smoothed coefficients by means of a gam vs mpd
gam <- burn_unrestricted |>
smooth_coef(x_cut = "age_policyholder_freq_man",
@@ -506,6 +559,8 @@ rating_factors(burn_restricted3)
And visualize them:

```{r rp312, fig.width = 10}
#| fig.alt: >
#| Show rating factors according to the burning model
# Show rating factors
rating_factors(burn_restricted3) |> autoplot()
@@ -554,6 +609,8 @@ The RMSE (Root Mean Square Error) is the square root of the average squared diff
`bootstrap_rmse()` computes the RMSE for bootstrap replicates, conducting this process `n` times. Specifically, in each iteration, a sample is drawn with replacement from the dataset, and the model is refitted using this sample. The root mean squared error is then calculated. The following visualizes this:
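
For reference, a single (non-bootstrapped) RMSE can be computed by hand; this is a hedged illustration rather than package code:

```{r rmse_by_hand, eval = FALSE}
# Illustration only: RMSE of the frequency model, comparing observed claim
# counts in `dat` with the fitted values of `mod_freq1`.
sqrt(mean((dat$nclaims - fitted(mod_freq1))^2))
```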

```{r rp42, eval = TRUE}
#| fig.alt: >
#| Show bootstrapped rmse
bootstrap_rmse(mod_freq1, dat, n = 100, show_progress = FALSE) |>
autoplot()
@@ -575,6 +632,8 @@ check_overdispersion(mod_freq1)
`check_residuals()` calculates standardized residuals from GLMs, scaling them between 0 and 1, making them easier to interpret, similar to residuals from linear models:

```{r rp44, message = FALSE, eval = TRUE}
#| fig.alt: >
#| Show uniform QQ plot for calculated standardized residuals
check_residuals(mod_freq1, n_simulations = 600) |>
autoplot()
@@ -583,4 +642,6 @@ check_residuals(mod_freq1, n_simulations = 600) |>

`check_residuals()` helps identify deviations from the expected distribution and generates a uniform quantile-quantile (QQ) plot. The simulated residuals in the QQ plot above show no significant deviation from a Poisson distribution. Keep in mind that formal tests for residual distribution usually yield significant results, so visual inspections like QQ plots are preferred.

Diagnosing issues in GLMs is challenging because standard residual plots often don't work well. This is due to the expected data distribution changing with fitted values, which can make it seem as if there are problems such as non-normality or heteroscedasticity, even if the model is correct. To address this, `check_residuals()` uses a simulation-based approach to create standardized residuals that can be intuitively understood. This explanation is adapted from the [vignette for DHARMa](https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html).
Diagnosing issues in GLMs is challenging because standard residual plots often don't work well. This is due to the expected data distribution changing with fitted values, which can make it seem as if there are problems such as non-normality or heteroscedasticity, even if the model is correct. To address this, `check_residuals()` uses a simulation-based approach to create standardized residuals that can be intuitively understood. This explanation is adapted from the [vignette for DHARMa](https://cran.r-project.org/package=DHARMa/vignettes/DHARMa.html).
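
The idea behind these simulated residuals can be sketched in a few lines of base R. This is a hedged illustration of the general approach, not the implementation used by `check_residuals()` or DHARMa:

```{r simulated_residuals_sketch, eval = FALSE}
# Sketch: simulate new responses from the fitted Poisson model and locate each
# observed count within its simulated distribution (a randomized PIT). If the
# model is correct, these scaled residuals are approximately uniform on [0, 1].
set.seed(123)
n_sim <- 250
mu   <- fitted(mod_freq1)                                  # fitted means
sims <- replicate(n_sim, rpois(length(mu), lambda = mu))   # simulated counts
obs  <- dat$nclaims                                        # observed counts

scaled_res <- rowMeans(sims < obs) + runif(length(obs)) * rowMeans(sims == obs)

# Many values piling up near 0 or 1 would indicate model misfit.
hist(scaled_res, breaks = 20)
```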

