Imagine a brilliant technical prodigy: fluent in several programming languages, adept at statistical inference, capable with all manner of machine- and deep-learning procedures. They don't work for Google, but they are still pretty good. You ask them for help figuring something out in Excel, or pulling together a report from a SQL database, and the task ends up being…beneath their abilities…and they let you know it. If they help out, they do it begrudgingly. There is a bit of an edge; they want it known that they think you should know how to do this yourself. They will stoop down to your level for the time being, all while implying you should be able to handle this trivial task. If you can't handle this simple thing, what can you do?

Is this familiar at all? Maybe you've witnessed it, or maybe you've been the prodigy. In either case, here are a few quick thoughts to meditate on – a set of workplace virtues of sorts which may help ground and orient your interactions.

*Take a stance of solidarity with your coworkers – you’re all swimming in the same fishbowl. See others’ challenges or goals as challenges or goals you share. Solidarity is effective at creating strong and harmonious human relationships based on long-term commitment to common objectives; it elevates you above the rational self-interested individualism of calculating and acting according to your own personal advantage.*

*Cultivate a forward-moving energy and focus it on solving those shared problems, not on undermining or sassing others. Properly focused energy can animate your work; we should exhort others to cultivate it as well and hold up exemplars of it to ourselves and each other.*

*Help out your colleagues because it is virtuous and “good for the soul.” You will feel good doing it, because helping others is the essence of your species-being and so a good in itself.*

*Embody the disposition to properly direct your own decision making and to be virtuous in self-activity. Autonomy elevates and empowers you; heteronomy lowers and demeans you. If you cannot direct yourself well when a coworker makes a simple request for help, even on a technically uninteresting task, then you are not ready for autonomy and self-direction.*

*People don’t bring the same things you bring to the table, but it does not follow that they don’t bring anything to the table. Assume everyone brings something which you do not – can you get a meeting with a decision maker at a prospective client company? Are you aware of new IFRS-17 standards and what is needed to implement them? If everyone had the same abilities as you, then there wouldn’t be a compelling reason to have you there.*


It looks likely now that Congress is stepping in with hearings in both chambers, where they will give the Robinhood and hedge fund CEOs a good dressing down for the TV viewing public.

My contrarian take on the whole $GME (GameStop) episode can be summed up in meme form:

Some info and reasoning in support of my meme position:

- Volume on $GME over the past 3 days (1/27 – 1/29) has ranged from 50 to 100 million shares. Looking at the price charts and figuring around $300 per share as a rough average price, that’s about $15-30 billion in trading each day. Retail investors on r/wallstreetbets are obviously not mustering $15-30 billion of buying power. I think it more likely that the denizens of reddit are an entertaining sideshow in what is otherwise a hedge fund vs hedge fund fight. The news media is focusing on the sideshow, while some crafty institutions go on pillorying the shorts.
- Now, an interesting (to me) story is how not one but *multiple* hedge funds (Melvin Capital and Citron Research being the two big names) failed so incredibly at the most basic of risk-management and fiduciary responsibilities, in Melvin's case allowing this *one* position to drive an over 50% loss in the fund for the month.
- But you also can't entirely blame the HFs who shorted a stock to 230% of float. They have to and do disclose their positions, so apparently fund participants were ok with them risking the squeeze by backing into that corner.
- Short funds have now assumed the evil mantle for Wall Street? Maybe Netflix needs to add “The Big Short” back for 2021. Short funds play an important role in markets by performing and publishing research exposing frauds, scams, pyramid schemes, and various other kinds of trickery, as well as plain old poor management. Think of short funds as non-governmental financial regulators who back up their research with their own capital. They are probably the least contemptible of actors in the financial markets. Of course they don’t do it out of a sense of volunteerism – they are looking to get paid off their research – but it does end up being one of those synergistic arrangements where the economics and business strategy align with the public interest to produce a net positive outcome for society.
- Let’s keep in mind who has the role of David in this saga – small time day-traders, primarily those who rely on a free trading app which temporarily halted buying in $GME and some other equities. We’re not talking about homeowners losing their homes, workers being laid off, retirees’ pension funds evaporating, or anything of that sort. I personally do not know many everyday working people who spend their days day-trading meme stocks.
- In the context of the previous two points, Congresspersons are tripping over each other to get to a camera and decry the Wall St. "sore losers" who regularly do all sorts of evil to the rest of us (remember, this is the short funds we are talking about now, who represent all the evil of Wall St.) but cry for mom and dad (the SEC and Congress, I think) to step in when the tables get turned on them by salt-of-the-earth day-traders on reddit. There are going to be hearings now. This is a surprisingly pressing matter in the midst of a global pandemic, with evictions happening, stimulus negotiations seemingly stalled, and all sorts of other things that are…less important than a free trading platform startup not having the margin needed to fulfill a surge of orders.
- I’m really skeptical of the otherwise-baseless assertion that Robinhood (and some others) halted buying on $GME solely because of real or supposed business relationships they had with companies on the short side. Sorry but the shorts have no such leverage over a brokerage, and Robinhood has no incentive to completely undermine their business and their customer base. I’d place my own bet that this was plain old incompetence on the part of Robinhood – probably their not satisfying capital and margin requirements from the clearing houses. Keep in mind they have a track record of screwing up in different ways.
- The redditors are really exuberant now (and their ranks are swelling…), but as the shorts unwind their positions who’s going to ultimately be left holding that really shitty bag? They may all “like the stock,” hold fast with diamond hands etc., etc., but as discussed above, we know they are not the ones controlling the shares which the shorts need.

Update 2/4/2021:

A couple updates since I posted this on Sunday:

- Robinhood released this statement on Friday evening (so before I had written this; I wasn't aware of it at the time) explaining that the buying limitations were indeed imposed due to suddenly increased capital requirements from the clearinghouses, which in turn were driven by significantly increased volatility in their customers' holdings (which probably included a disproportionate amount of $GME stock, which was seeing a 1,700% increase in volatility).
- This article from the WSJ on Wednesday evening about one hedge fund (Senvest) which bought up 25% of GameStop in the run-up (for under $10 a share) and turned a quick $700M profit in the squeeze…
- $GME quickly descending from the high of $468 to ~$92 at close on 2/3. WSB still talking about “holding the line” and “buying the dip.”

I remember, when I started using machine learning methods, how time-consuming and – even worse – *manual* it could be to perform a hyperparameter search. The whole benefit of machine learning is that the algorithm should optimize the model-learning task for us, right? The problem, of course, becomes one of compute resources. Suppose we only have simple brute-force grid search at our disposal. With just one hyperparameter to tune, this approach is practical – we may only need to test 5 candidate values. But as the number of hyperparameters (the "dimension") grows, the number of candidate hyperparameterizations grows exponentially, since the candidate points for each dimension multiply together. Suppose instead we have 5 hyperparameters to tune – again using just five points for each dimension would now result in \(5^5 = 3,125\) model evaluations to test all the possible combinations. Sometimes 5 points is realistic – for example, with some discrete parameters like the maximum tree depth in a random forest. But for something continuous it usually is not, so I am really understating how quickly a grid will blow up, making brute-force approaches impractical.
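To make the blow-up concrete, here is a quick sketch (the parameter names are purely illustrative) counting the grid points for 1 versus 5 hyperparameters, with 5 candidate values each:

```r
## illustrative only: 5 candidate values per hyperparameter
candidates <- seq(0.1, 0.5, length.out = 5)

## one hyperparameter: 5 evaluations
grid_1d <- expand.grid(eta = candidates)
nrow(grid_1d)  # 5

## five hyperparameters: 5^5 = 3,125 evaluations
grid_5d <- expand.grid(eta = candidates, gamma = candidates,
                       subsample = candidates, colsample = candidates,
                       max_depth = candidates)
nrow(grid_5d)  # 3125
```

Each row of `grid_5d` would require a full model fit (often with cross-validation), which is where the compute budget disappears.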

Pretty quickly one goes from the brute-force approach to more involved strategies for grid search. That could mean starting with coarse grids and zooming into promising areas with higher-resolution grids in subsequent searches; it could mean iterating between two or three subsets of the hyperparameters which tend to "move together" – like the learning rate and the number of rounds in a GBM. These strategies become highly manual, and frankly it becomes a real effort to keep track of the different runs and results. We don't want to have to think this much, and risk making a mistake, when tuning algorithms!
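A manual coarse-to-fine loop might look something like the following sketch, where `score_model()` is a hypothetical stand-in for an expensive cross-validated model fit:

```r
## hypothetical helper: pretend this is an expensive xgb.cv() run
## returning out-of-fold loss for a given learning rate
score_model <- function(eta) (eta - 0.013)^2

## stage 1: coarse grid over a wide range
coarse <- seq(0.001, 0.3, length.out = 5)
best_coarse <- coarse[which.min(sapply(coarse, score_model))]

## stage 2: finer grid zoomed in around the stage-1 winner
fine <- seq(best_coarse / 2, best_coarse * 2, length.out = 9)
best_fine <- fine[which.min(sapply(fine, score_model))]
best_fine
```

Multiply this by several interacting hyperparameters, and several zoom stages, and the bookkeeping burden becomes obvious.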

MBO differs from grid search in a couple of ways. First, we search the entire continuous range of a hyperparameter, not a discretized set of points within that range. Second, and more importantly, it is a probabilistic method which uses information from early evaluations to improve the selection of subsequent tests. In this regard it is similar to the low-res/high-res search strategy, but automated. As good Bayesians, we like methods that incorporate prior information to improve later decisions.

As mentioned above, the method for selecting later test points based on information from the early tests is Gaussian process smoothing, or *kriging*. One popular application of Gaussian processes is geospatial smoothing and regression. We are doing basically the same thing here, except that instead of geographic (lat-long) space, our space is defined by the ranges of a set of hyperparameters. We refer to this as the hyperparameter space, and MBO is going to help us search it for the point which yields the optimal result from a machine learning algorithm.
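As a toy illustration of the kriging idea (using the `DiceKriging` package that is loaded later in this post, on made-up one-dimensional data), we can fit a GP to a handful of observations and get both a prediction and an uncertainty at unobserved points – exactly what the surrogate model will do over the hyperparameter space:

```r
library(DiceKriging)

## five observed "evaluations" of a toy objective over one hyperparameter
design   <- data.frame(x = c(0.1, 0.3, 0.5, 0.7, 0.9))
response <- sin(2 * pi * design$x)

## fit a GP with the same Matern 3/2 covariance used later by mbo()
fit <- km(design = design, response = response,
          covtype = "matern3_2", control = list(trace = FALSE))

## interpolate between the observed points, with standard errors
newx <- data.frame(x = seq(0, 1, by = 0.05))
pred <- predict(fit, newdata = newx, type = "UK")
head(cbind(newx, mean = pred$mean, sd = pred$sd))
```

The `sd` column is the key: it is large far from observed points and shrinks to zero at them, which is what lets MBO balance exploring uncertain regions against exploiting promising ones.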

So let's take a look at how Bayes helps us tune machine learning algorithms with some code.

The main package we need is `mlrMBO`, which provides the `mbo()` method for optimizing an arbitrary function sequentially. We also need several others for various helpers: `smoof` to define the objective function which will be optimized; `ParamHelpers` to define the parameter space in which we will perform the Bayesian search for a global optimum; and `DiceKriging`, which provides the Gaussian process interpolation (known in geostatistics as "kriging") capability.

We will use the `xgboost` flavor of GBM as our machine learning methodology to be tuned, but you could adapt what I'm demonstrating here to any algorithm with multiple hyperparameters (or even a single one, if the run-time for a single iteration were high enough to warrant it). `mlrMBO` is completely agnostic to your choice of methodology, but the flip side is that a bit of coding setup is required on the data scientist's part (good thing we like coding and don't like manual work).

```r
library(CASdatasets)
library(dplyr)
library(tibble)
library(magrittr)
library(ggplot2)
library(scatterplot3d)
library(kableExtra)
library(tidyr)
library(mlrMBO)
library(ParamHelpers)
library(DiceKriging)
library(smoof)
library(xgboost)
```

I'll use my go-to insurance ratemaking dataset for demonstration purposes – the French motor dataset from `CASdatasets`.

```r
data("freMPL1")
data("freMPL2")
data("freMPL3")
fre_df <- rbind(freMPL1, freMPL2, freMPL3 %>% select(-DeducType))
rm(freMPL1, freMPL2, freMPL3)
```

Let's take a look at our target variable, `ClaimAmount`.

```r
gridExtra::grid.arrange(
  fre_df %>%
    filter(ClaimAmount > 0) %>%
    ggplot(aes(x = ClaimAmount)) +
    geom_density() +
    ggtitle("Observed Loss Distribution"),
  fre_df %>%
    filter(ClaimAmount > 0, ClaimAmount < 1.5e4) %>%
    ggplot(aes(x = ClaimAmount)) +
    geom_density() +
    ggtitle("Observed Severity Distribution"),
  nrow = 1
)
```

We have something like a compound distribution – a probability mass at 0, and some long-tailed distribution of loss dollars for observations with incurred claims. But let's also look beyond the smoothed graphical view.

```r
min(fre_df$ClaimAmount)
## [1] -3407.7

sum(fre_df$ClaimAmount < 0)
## [1] 690
```

We also appear to have some claims < 0 - perhaps recoveries (vehicle salvage) exceeded payments. For the sake of focusing on the MBO, we will adjust these records by flooring values at 0. I'll also convert some factor columns to numeric types which make more sense for modeling.

```r
fre_df %<>%
  mutate(ClaimAmount = case_when(ClaimAmount < 0 ~ 0, TRUE ~ ClaimAmount)) %>%
  mutate(VehMaxSpeed_num = sub(".*-", "", VehMaxSpeed) %>% substr(., 1, 3) %>% as.numeric,
         VehAge_num = sub("*.-", "", VehAge) %>% sub('\\+', '', .) %>% as.numeric,
         VehPrice_num = as.integer(VehPrice)) %>%  # the factor levels appear to be ordered so I will use this
  group_by(SocioCateg) %>%  # high cardinality, will encode as a proportion of total
  mutate(SocioCateg_prop = (sum(n()) / 4) / nrow(.) * 1e5) %>%
  ungroup()

## matrices; no intercept needed, and don't forget to exclude post-dictors
fre_mat <- model.matrix(ClaimAmount ~ . -1 -ClaimInd -Exposure -RecordBeg -RecordEnd
                        -VehMaxSpeed -VehPrice -VehAge -SocioCateg,
                        data = fre_df)

## xgb.DMatrix, faster sparse matrix
fre_dm <- xgb.DMatrix(data = fre_mat,
                      label = fre_df$ClaimAmount,
                      base_margin = log(fre_df$Exposure))  ## base_margin == offset
## we use log earned exposure because the xgboost Tweedie
## implementation includes a log-link for the variance power
```

To avoid confusion, there are two objective functions we could refer to. Statistically, our objective function (aka our loss function) is the negative log-likelihood for an assumed Tweedie-distributed random variable. The `xgboost` algorithm will minimize this objective (equivalent to maximizing likelihood) for a given set of hyperparameters on each run. Our other objective function is the R function defined below: it calls `xgb.cv()`, runs the learning procedure with cross-validation, stops when the out-of-fold likelihood does not improve, and returns the best objective evaluation (log-loss metric) based on the out-of-fold samples.

Note that the function below also includes a defined hyperparameter space – a set of tuning parameters with possible ranges for their values. There are seven traditional tuning parameters for xgboost in the space below, and I've also added the Tweedie variance "power" parameter as an eighth. This parameter can take a value in (1, 2) for a Poisson-gamma compound distribution, but I first narrowed it down to a smaller range based on a quick profile of the loss distribution (using `tweedie::tweedie.profile()`, omitted here).

```r
# Adapted for Tweedie likelihood from this very good post:
# https://www.simoncoulombe.com/2019/01/bayesian/
# objective function: we want to minimize the neg log-likelihood by tuning hyperparameters
obj.fun <- makeSingleObjectiveFunction(
  name = "xgb_cv_bayes",
  fn = function(x) {
    set.seed(42)
    cv <- xgb.cv(params = list(
      booster          = "gbtree",
      eta              = x["eta"],
      max_depth        = x["max_depth"],
      min_child_weight = x["min_child_weight"],
      gamma            = x["gamma"],
      subsample        = x["subsample"],
      colsample_bytree = x["colsample_bytree"],
      max_delta_step   = x["max_delta_step"],
      tweedie_variance_power = x["tweedie_variance_power"],
      objective   = 'reg:tweedie',
      eval_metric = paste0("tweedie-nloglik@", x["tweedie_variance_power"])),
      data = dm,                   ## must be set in global.Env()
      nround = 7000,               ## set this large and use early stopping
      nthread = 26,                ## adjust based on your machine
      nfold = 5,
      prediction = FALSE,
      showsd = TRUE,
      early_stopping_rounds = 25,  ## stop if the evaluation metric does not improve
                                   ## on the out-of-fold sample for 25 rounds
      verbose = 1,
      print_every_n = 500)
    cv$evaluation_log %>% pull(4) %>% min  ## column 4 is the eval metric: Tweedie negative log-likelihood
  },
  par.set = makeParamSet(
    makeNumericParam("eta",                    lower = 0.005, upper = 0.01),
    makeNumericParam("gamma",                  lower = 1,     upper = 5),
    makeIntegerParam("max_depth",              lower = 2,     upper = 10),
    makeIntegerParam("min_child_weight",       lower = 300,   upper = 2000),
    makeNumericParam("subsample",              lower = 0.20,  upper = 0.8),
    makeNumericParam("colsample_bytree",       lower = 0.20,  upper = 0.8),
    makeNumericParam("max_delta_step",         lower = 0,     upper = 5),
    makeNumericParam("tweedie_variance_power", lower = 1.75,  upper = 1.85)
  ),
  minimize = TRUE  ## negative log-likelihood
)
```

The core piece here is the call to `mbo()`. This accepts an initial design – i.e. a set of locations chosen to be "space-filling" within our hyperparameter space (we do not want random generation, which could leave areas of the space with no points nearby) – created using `ParamHelpers::generateDesign()`. The `makeMBOControl()` method is used to create an object which simply tells `mbo()` how many optimization steps to run after the initial design is tested; these are the runs which are determined probabilistically through Gaussian process smoothing, aka kriging. Finally, I create a plot of the optimization path and return the objects in a list for later use.

The covariance structure used in the Gaussian process is what makes GPs "Bayesian": the prior at a new location is defined by the nearby observed values together with a covariance function which encodes the expected level of smoothness. We use a Matérn 3/2 kernel – a moderately smooth covariance often used in geospatial applications and well-suited to our own spatial task. It is equivalent to the product of an exponential and a polynomial of degree 1. This is the `mbo` default for a numerical hyperparameter space. If your hyperparameters include some which are non-numeric (for example, you may have a hyperparameter for "method" and a set of methods to choose from), then instead of kriging a random forest is used to estimate the value of the objective function between points, and from this the optimizing proposals are chosen. That would no longer be a strictly "Bayesian" approach, though I think it would still be Bayesian in spirit.

The Gaussian process models our objective function's output as a function of the hyperparameter values, using the initial design samples. For this reason it is referred to (especially in the deep learning community) as a *surrogate model* – it serves as a cheap surrogate for running another evaluation of our objective function at some new point. For any point not evaluated directly, the estimated/interpolated surface provides an expectation. This benefits us because points that are likely to perform poorly (based on the surrogate model's estimate) will be discarded, and we will only move on to directly evaluating points in promising regions of the hyperparameter space.

Creating a wrapper function is optional - but to perform multiple runs in an analysis, most of the code here would need to be repeated. To be concise, I write it once so it can be called for subsequent runs (perhaps on other datasets, or if you get back a boundary solution you did not anticipate).

```r
do_bayes <- function(n_design = NULL, opt_steps = NULL, of = obj.fun, seed = 42) {
  set.seed(seed)

  des <- generateDesign(n = n_design,
                        par.set = getParamSet(of),
                        fun = lhs::randomLHS)

  control <- makeMBOControl() %>%
    setMBOControlTermination(., iters = opt_steps)

  ## kriging with a matern(3,2) covariance function is the default surrogate model
  ## for numerical domains, but if you wanted to override this you could modify the
  ## makeLearner() call below to define your own GP surrogate model with more or
  ## less smoothness, or use an entirely different method
  run <- mbo(fun = of,
             design = des,
             learner = makeLearner("regr.km",
                                   predict.type = "se",
                                   covtype = "matern3_2",
                                   control = list(trace = FALSE)),
             control = control,
             show.info = TRUE)

  opt_plot <- run$opt.path$env$path %>%
    mutate(Round = row_number()) %>%
    mutate(type = case_when(Round <= n_design ~ "Design",
                            TRUE ~ "mlrMBO optimization")) %>%
    ggplot(aes(x = Round, y = y, color = type)) +
    geom_point() +
    labs(title = "mlrMBO optimization") +
    ylab("-log(likelihood)")

  print(run$x)
  return(list(run = run, plot = opt_plot))
}
```

Normally for this problem I would perform more evaluations, in both the initial and optimizing phases. Something around 5-7 times the number of parameters being tuned for the initial design, and half of that for the number of optimization steps, could be a rule of thumb. You need some points in the space to have something to interpolate between!

Here's my initial design of 15 points.

```r
des <- generateDesign(n = 15, par.set = getParamSet(obj.fun), fun = lhs::randomLHS)
kable(des, format = "html", digits = 4) %>%
  kable_styling(font_size = 10) %>%
  kable_material_dark()
```

| eta | gamma | max_depth | min_child_weight | subsample | colsample_bytree | max_delta_step | tweedie_variance_power |
|---|---|---|---|---|---|---|---|
| 0.0064 | 1.7337 | 5 | 1259 | 0.3795 | 0.3166 | 4.0727 | 1.8245 |
| 0.0074 | 3.2956 | 6 | 1073 | 0.4897 | 0.2119 | 1.0432 | 1.7833 |
| 0.0062 | 1.4628 | 9 | 1451 | 0.7799 | 0.5837 | 3.0967 | 1.8219 |
| 0.0059 | 4.5578 | 3 | 703 | 0.2135 | 0.3206 | 1.6347 | 1.8346 |
| 0.0081 | 4.0202 | 8 | 445 | 0.4407 | 0.5495 | 2.2957 | 1.7879 |
| 0.0067 | 3.5629 | 3 | 330 | 0.4207 | 0.4391 | 0.9399 | 1.7998 |
| 0.0092 | 4.4374 | 7 | 1937 | 0.6411 | 0.7922 | 3.9315 | 1.7704 |
| 0.0099 | 1.8265 | 7 | 1595 | 0.2960 | 0.3961 | 2.5070 | 1.8399 |
| 0.0055 | 2.4464 | 9 | 805 | 0.5314 | 0.2457 | 3.6427 | 1.7624 |
| 0.0096 | 2.2831 | 10 | 1662 | 0.6873 | 0.6075 | 0.2518 | 1.8457 |
| 0.0079 | 2.8737 | 4 | 894 | 0.2790 | 0.4954 | 0.4517 | 1.7965 |
| 0.0087 | 2.8600 | 6 | 599 | 0.5717 | 0.6537 | 4.9145 | 1.7557 |
| 0.0053 | 1.0082 | 2 | 1863 | 0.6021 | 0.7436 | 2.7048 | 1.7686 |
| 0.0071 | 4.9817 | 2 | 1380 | 0.3599 | 0.4507 | 4.3572 | 1.8156 |
| 0.0086 | 3.6847 | 9 | 1124 | 0.7373 | 0.6935 | 1.9759 | 1.8043 |

And here is a view of how well those points fill out three of the eight dimensions.

```r
scatterplot3d(des$eta, des$gamma, des$min_child_weight,
              type = "h", color = "blue", pch = 16)
```

We can see large areas with no nearby points – if the global optimum lies there, we *may* still end up with proposals in that area that lead us to find it, but it sure would be helpful to gather some information there and guarantee it. Here's a better design with roughly five points per hyperparameter (42 in total).

```r
des <- generateDesign(n = 42, par.set = getParamSet(obj.fun), fun = lhs::randomLHS)
scatterplot3d(des$eta, des$gamma, des$min_child_weight,
              type = "h", color = "blue", pch = 16)
```

This would take longer to run, but we will rely less heavily on interpolation over long distances during the optimizing phase because we have more information observed through experiments. Choosing your design is about the trade-off between desired accuracy and computational expense. So use as many points in the initial design as you can afford time for (aiming for at least 5-7 per parameter), and maybe half as many for the number of subsequent optimization steps.
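One way to budget, as a rough sketch: estimate the mean run time of one evaluation and back out how many design points and optimization steps fit your time budget. The numbers here are made up for illustration:

```r
n_params      <- 8    # size of the hyperparameter space above
mins_per_eval <- 12   # assumed mean time for one cross-validated fit

n_design <- 6 * n_params           # ~5-7 points per parameter -> 48
n_opt    <- ceiling(n_design / 2)  # about half as many optimization steps -> 24
total_hours <- (n_design + n_opt) * mins_per_eval / 60

c(n_design = n_design, n_opt = n_opt, hours = total_hours)
# -> 48 design points, 24 optimization steps, 14.4 hours
```

If the total comes out beyond your budget, shrink the design first and the optimization steps second, since the design is what the surrogate model interpolates from.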

Now that we are all set up, let's run the procedure using our `do_bayes()` function above and evaluate the result. As discussed above, I recommend sizing your random design and optimization steps according to the size of your hyperparameter space, using 5-7 points per hyperparameter as a rule of thumb. You can also figure out roughly how much time a single evaluation takes (which will depend on the hyperparameter values, so this should be an estimate of the mean time), as well as how much time you can budget, and then choose the values that work for you. Here I use 25 total runs – 15 initial evaluations, and 10 optimization steps.

(Note: The verbose output for each evaluation is shown below for your interest)

```r
dm <- fre_dm
runs <- do_bayes(n_design = 15, of = obj.fun, opt_steps = 10, seed = 42)
```

```
## Computing y column(s) for design. Not provided.

## [1]	train-tweedie-nloglik@1.82975:342.947351+9.068011	test-tweedie-nloglik@1.82975:342.950897+36.281204
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.82975 for early stopping.
## Will train until test_tweedie_nloglik@1.82975 hasn't improved in 25 rounds.
##
## [501]	train-tweedie-nloglik@1.82975:54.102369+1.254521	test-tweedie-nloglik@1.82975:54.103011+5.019251
## [1001]	train-tweedie-nloglik@1.82975:17.471831+0.173606	test-tweedie-nloglik@1.82975:17.486700+0.699454
## [1501]	train-tweedie-nloglik@1.82975:15.143893+0.056641	test-tweedie-nloglik@1.82975:15.371196+0.251584
## [2001]	train-tweedie-nloglik@1.82975:14.794773+0.047399	test-tweedie-nloglik@1.82975:15.161293+0.229903
## [2501]	train-tweedie-nloglik@1.82975:14.583392+0.051122	test-tweedie-nloglik@1.82975:15.069248+0.245231
## [3001]	train-tweedie-nloglik@1.82975:14.419826+0.051263	test-tweedie-nloglik@1.82975:14.996046+0.258229
## [3501]	train-tweedie-nloglik@1.82975:14.281899+0.049542	test-tweedie-nloglik@1.82975:14.944954+0.278042
## [4001]	train-tweedie-nloglik@1.82975:14.162422+0.046876	test-tweedie-nloglik@1.82975:14.902480+0.299742
## [4501]	train-tweedie-nloglik@1.82975:14.056329+0.045482	test-tweedie-nloglik@1.82975:14.861813+0.318562
## Stopping. Best iteration:
## [4597]	train-tweedie-nloglik@1.82975:14.037272+0.045602	test-tweedie-nloglik@1.82975:14.852832+0.320272
##
## [1]	train-tweedie-nloglik@1.78053:341.751507+9.223267	test-tweedie-nloglik@1.78053:341.757257+36.900142
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.78053 for early stopping.
## Will train until test_tweedie_nloglik@1.78053 hasn't improved in 25 rounds.
##
## [501]	train-tweedie-nloglik@1.78053:20.067214+0.271193	test-tweedie-nloglik@1.78053:20.130995+1.160850
## [1001]	train-tweedie-nloglik@1.78053:15.076158+0.069745	test-tweedie-nloglik@1.78053:15.599540+0.313749
## [1501]	train-tweedie-nloglik@1.78053:14.517137+0.066595	test-tweedie-nloglik@1.78053:15.309150+0.339692
## [2001]	train-tweedie-nloglik@1.78053:14.163563+0.061668	test-tweedie-nloglik@1.78053:15.147976+0.382249
## [2501]	train-tweedie-nloglik@1.78053:13.890958+0.068184	test-tweedie-nloglik@1.78053:15.034968+0.404933
## [3001]	train-tweedie-nloglik@1.78053:13.663806+0.063876	test-tweedie-nloglik@1.78053:14.950204+0.428143
## [3501]	train-tweedie-nloglik@1.78053:13.467250+0.063427	test-tweedie-nloglik@1.78053:14.885284+0.460394
## [4001]	train-tweedie-nloglik@1.78053:13.293906+0.060834	test-tweedie-nloglik@1.78053:14.837956+0.493384
## Stopping. Best iteration:
## [4190]	train-tweedie-nloglik@1.78053:13.231073+0.059948	test-tweedie-nloglik@1.78053:14.818203+0.503775
##
## [1]	train-tweedie-nloglik@1.8352:342.471167+9.032795	test-tweedie-nloglik@1.8352:342.480517+36.144871
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.8352 for early stopping.
## Will train until test_tweedie_nloglik@1.8352 hasn't improved in 25 rounds.
##
## [501]	train-tweedie-nloglik@1.8352:28.357122+0.477149	test-tweedie-nloglik@1.8352:28.895080+2.147141
## [1001]	train-tweedie-nloglik@1.8352:14.938225+0.058305	test-tweedie-nloglik@1.8352:15.482690+0.376248
## [1501]	train-tweedie-nloglik@1.8352:14.236916+0.048579	test-tweedie-nloglik@1.8352:14.920337+0.307540
## [2001]	train-tweedie-nloglik@1.8352:13.941143+0.047951	test-tweedie-nloglik@1.8352:14.796813+0.353247
## [2501]	train-tweedie-nloglik@1.8352:13.719723+0.047280	test-tweedie-nloglik@1.8352:14.722402+0.396149
## [3001]	train-tweedie-nloglik@1.8352:13.536447+0.045626	test-tweedie-nloglik@1.8352:14.666037+0.429260
## Stopping. Best iteration:
## [3171]	train-tweedie-nloglik@1.8352:13.480152+0.046018	test-tweedie-nloglik@1.8352:14.652766+0.442854
##
## [1]	train-tweedie-nloglik@1.80549:340.990906+9.109456	test-tweedie-nloglik@1.80549:340.997974+36.447143
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80549 for early stopping.
## Will train until test_tweedie_nloglik@1.80549 hasn't improved in 25 rounds.
##
## [501]	train-tweedie-nloglik@1.80549:17.171271+0.149287	test-tweedie-nloglik@1.80549:17.612757+0.749392
## [1001]	train-tweedie-nloglik@1.80549:14.128140+0.059918	test-tweedie-nloglik@1.80549:14.910039+0.330671
## [1501]	train-tweedie-nloglik@1.80549:13.571417+0.056644	test-tweedie-nloglik@1.80549:14.694386+0.417752
## Stopping. Best iteration:
## [1847]	train-tweedie-nloglik@1.80549:13.292618+0.053193	test-tweedie-nloglik@1.80549:14.617792+0.481956
##
## [1]	train-tweedie-nloglik@1.77453:340.924103+9.220978	test-tweedie-nloglik@1.77453:340.936652+36.904872
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.77453 for early stopping.
## Will train until test_tweedie_nloglik@1.77453 hasn't improved in 25 rounds.
##
## [501]	train-tweedie-nloglik@1.77453:15.810031+0.105204	test-tweedie-nloglik@1.77453:16.556760+0.587635
## [1001]	train-tweedie-nloglik@1.77453:13.707083+0.061060	test-tweedie-nloglik@1.77453:14.973095+0.458299
## [1501]	train-tweedie-nloglik@1.77453:13.042243+0.062065	test-tweedie-nloglik@1.77453:14.767002+0.587105
## Stopping. Best iteration:
## [1888]	train-tweedie-nloglik@1.77453:12.664030+0.058594	test-tweedie-nloglik@1.77453:14.693107+0.666969
##
## [1]	train-tweedie-nloglik@1.81211:341.573523+9.099887	test-tweedie-nloglik@1.81211:341.579096+36.409106
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.81211 for early stopping.
## Will train until test_tweedie_nloglik@1.81211 hasn't improved in 25 rounds.
##
## [501]	train-tweedie-nloglik@1.81211:21.698555+0.287551	test-tweedie-nloglik@1.81211:21.802563+1.318389
## [1001]	train-tweedie-nloglik@1.81211:15.316608+0.057473	test-tweedie-nloglik@1.81211:15.507609+0.267001
## [1501]	train-tweedie-nloglik@1.81211:15.002261+0.055740	test-tweedie-nloglik@1.81211:15.310712+0.234783
## [2001]	train-tweedie-nloglik@1.81211:14.813123+0.056066	test-tweedie-nloglik@1.81211:15.225331+0.251187
## [2501]	train-tweedie-nloglik@1.81211:14.666221+0.055383	test-tweedie-nloglik@1.81211:15.168456+0.269639
## [3001]	train-tweedie-nloglik@1.81211:14.541463+0.053780	test-tweedie-nloglik@1.81211:15.124595+0.291749
## [3501]	train-tweedie-nloglik@1.81211:14.435767+0.053397	test-tweedie-nloglik@1.81211:15.090134+0.312649
## [4001]	train-tweedie-nloglik@1.81211:14.342162+0.053199	test-tweedie-nloglik@1.81211:15.060593+0.328036
## [4501]	train-tweedie-nloglik@1.81211:14.255980+0.053683	test-tweedie-nloglik@1.81211:15.030732+0.343309
## Stopping. Best iteration:
## [4585]	train-tweedie-nloglik@1.81211:14.242503+0.053874	test-tweedie-nloglik@1.81211:15.024390+0.345340
##
## [1]	train-tweedie-nloglik@1.81758:341.854852+9.086251	test-tweedie-nloglik@1.81758:341.863275+36.356180
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.81758 for early stopping.
## Will train until test_tweedie_nloglik@1.81758 hasn't improved in 25 rounds.
##
## [501]	train-tweedie-nloglik@1.81758:25.013295+0.397345	test-tweedie-nloglik@1.81758:25.676643+1.808151
## [1001]	train-tweedie-nloglik@1.81758:14.504071+0.055626	test-tweedie-nloglik@1.81758:15.187462+0.380966
## [1501]	train-tweedie-nloglik@1.81758:13.875887+0.049737	test-tweedie-nloglik@1.81758:14.765052+0.371576
## [2001]	train-tweedie-nloglik@1.81758:13.542809+0.049598	test-tweedie-nloglik@1.81758:14.638334+0.423159
## Stopping. Best iteration:
## [2270]	train-tweedie-nloglik@1.81758:13.398557+0.049952	test-tweedie-nloglik@1.81758:14.601097+0.448479
##
## [1]	train-tweedie-nloglik@1.78506:341.348248+9.195115	test-tweedie-nloglik@1.78506:341.357587+36.789846
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.78506 for early stopping.
## Will train until test_tweedie_nloglik@1.78506 hasn't improved in 25 rounds.
##
## [501]	train-tweedie-nloglik@1.78506:18.174807+0.195105	test-tweedie-nloglik@1.78506:19.060744+1.022248
## [1001]	train-tweedie-nloglik@1.78506:13.397231+0.063749	test-tweedie-nloglik@1.78506:14.732965+0.451939
## Stopping. Best iteration:
## [1460]	train-tweedie-nloglik@1.78506:12.673491+0.062628	test-tweedie-nloglik@1.78506:14.487809+0.572331
##
## [1]	train-tweedie-nloglik@1.84189:341.976398+8.988454	test-tweedie-nloglik@1.84189:341.984406+35.983697
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.84189 for early stopping.
## Will train until test_tweedie_nloglik@1.84189 hasn't improved in 25 rounds.
##
## [501]	train-tweedie-nloglik@1.84189:18.400557+0.165181	test-tweedie-nloglik@1.84189:18.735286+0.850735
## [1001]	train-tweedie-nloglik@1.84189:14.588454+0.047146	test-tweedie-nloglik@1.84189:15.120865+0.271691
## [1501]	train-tweedie-nloglik@1.84189:14.106823+0.046601	test-tweedie-nloglik@1.84189:14.891420+0.316660
## Stopping. Best iteration:
## [1936]	train-tweedie-nloglik@1.84189:13.822906+0.048588	test-tweedie-nloglik@1.84189:14.787369+0.361366
##
## [1]	train-tweedie-nloglik@1.76821:343.904297+9.327030	test-tweedie-nloglik@1.76821:343.909997+37.314570
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.76821 for early stopping.
## Will train until test_tweedie_nloglik@1.76821 hasn't improved in 25 rounds.
##
## [501]	train-tweedie-nloglik@1.76821:153.005731+4.071955	test-tweedie-nloglik@1.76821:153.008331+16.291001
## [1001]	train-tweedie-nloglik@1.76821:70.474121+1.777711	test-tweedie-nloglik@1.76821:70.475147+7.112157
## [1501]	train-tweedie-nloglik@1.76821:35.485048+0.776015	test-tweedie-nloglik@1.76821:35.485494+3.104626
## [2001]	train-tweedie-nloglik@1.76821:21.549418+0.338670	test-tweedie-nloglik@1.76821:21.549704+1.354912
## [2501]	train-tweedie-nloglik@1.76821:17.189798+0.147854	test-tweedie-nloglik@1.76821:17.210399+0.612728
## [3001]	train-tweedie-nloglik@1.76821:16.418288+0.094189	test-tweedie-nloglik@1.76821:16.555264+0.433128
## [3501]	train-tweedie-nloglik@1.76821:16.109250+0.081212	test-tweedie-nloglik@1.76821:16.344176+0.387444
## [4001]	train-tweedie-nloglik@1.76821:15.925080+0.078790	test-tweedie-nloglik@1.76821:16.233377+0.369088
## [4501]	train-tweedie-nloglik@1.76821:15.793141+0.077723	test-tweedie-nloglik@1.76821:16.158620+0.370770
## [5001]	train-tweedie-nloglik@1.76821:15.686279+0.075960	test-tweedie-nloglik@1.76821:16.110775+0.376137
## [5501]	train-tweedie-nloglik@1.76821:15.597672+0.076042	test-tweedie-nloglik@1.76821:16.076505+0.388222
## Stopping. Best iteration:
## [5707]	train-tweedie-nloglik@1.76821:15.563175+0.074569	test-tweedie-nloglik@1.76821:16.059571+0.391720
##
## [1]	train-tweedie-nloglik@1.7599:342.328162+9.312946	test-tweedie-nloglik@1.7599:342.334070+37.259847
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.7599 for early stopping.
## Will train until test_tweedie_nloglik@1.7599 hasn't improved in 25 rounds.
##
## [501]	train-tweedie-nloglik@1.7599:21.034394+0.261867	test-tweedie-nloglik@1.7599:21.110715+1.265011
## [1001]	train-tweedie-nloglik@1.7599:16.515162+0.077981	test-tweedie-nloglik@1.7599:16.665819+0.360601
## [1501]	train-tweedie-nloglik@1.7599:16.270615+0.075134	test-tweedie-nloglik@1.7599:16.505939+0.319116
## [2001]	train-tweedie-nloglik@1.7599:16.125761+0.075163	test-tweedie-nloglik@1.7599:16.432678+0.320781
## [2501]	train-tweedie-nloglik@1.7599:16.014767+0.073620	test-tweedie-nloglik@1.7599:16.385424+0.331610
## [3001]	train-tweedie-nloglik@1.7599:15.920065+0.071297	test-tweedie-nloglik@1.7599:16.346298+0.346168
## Stopping. Best iteration:
## [3228]	train-tweedie-nloglik@1.7599:15.881553+0.070426	test-tweedie-nloglik@1.7599:16.329813+0.351423
##
## [1]	train-tweedie-nloglik@1.80226:341.820270+9.143934	test-tweedie-nloglik@1.80226:341.825208+36.586124
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80226 for early stopping.
## Will train until test_tweedie_nloglik@1.80226 hasn't improved in 25 rounds.
##
## [501]	train-tweedie-nloglik@1.80226:26.608221+0.443646	test-tweedie-nloglik@1.80226:26.664685+1.977213
## [1001]	train-tweedie-nloglik@1.80226:15.824172+0.068891	test-tweedie-nloglik@1.80226:15.918101+0.371526
## [1501]	train-tweedie-nloglik@1.80226:15.499650+0.061893	test-tweedie-nloglik@1.80226:15.647616+0.269262
## [2001]	train-tweedie-nloglik@1.80226:15.374731+0.061671	test-tweedie-nloglik@1.80226:15.573182+0.258539
## [2501]	train-tweedie-nloglik@1.80226:15.289750+0.061845	test-tweedie-nloglik@1.80226:15.534944+0.261417
## Stopping. Best iteration:
## [2579]	train-tweedie-nloglik@1.80226:15.278433+0.061600	test-tweedie-nloglik@1.80226:15.529791+0.262305
##
## [1]	train-tweedie-nloglik@1.84974:343.394391+8.997226	test-tweedie-nloglik@1.84974:343.399872+36.001197
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.84974 for early stopping.
```
## Will train until test_tweedie_nloglik@1.84974 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.84974:35.638326+0.676807 test-tweedie-nloglik@1.84974:35.855040+2.870666 ## [1001] train-tweedie-nloglik@1.84974:15.927161+0.074464 test-tweedie-nloglik@1.84974:16.216283+0.425658 ## [1501] train-tweedie-nloglik@1.84974:14.851506+0.047610 test-tweedie-nloglik@1.84974:15.276635+0.237773 ## [2001] train-tweedie-nloglik@1.84974:14.524470+0.041982 test-tweedie-nloglik@1.84974:15.112846+0.258227 ## [2501] train-tweedie-nloglik@1.84974:14.292380+0.045034 test-tweedie-nloglik@1.84974:15.027395+0.291822 ## [3001] train-tweedie-nloglik@1.84974:14.103252+0.044065 test-tweedie-nloglik@1.84974:14.970938+0.334465 ## Stopping. Best iteration: ## [3384] train-tweedie-nloglik@1.84974:13.977265+0.046385 test-tweedie-nloglik@1.84974:14.943196+0.365468 ## ## [1] train-tweedie-nloglik@1.79106:341.213971+9.169863 test-tweedie-nloglik@1.79106:341.220727+36.687101 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79106 for early stopping. ## Will train until test_tweedie_nloglik@1.79106 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.79106:17.835510+0.165372 test-tweedie-nloglik@1.79106:18.048213+0.847644 ## [1001] train-tweedie-nloglik@1.79106:14.868611+0.068062 test-tweedie-nloglik@1.79106:15.391133+0.315593 ## [1501] train-tweedie-nloglik@1.79106:14.301947+0.058922 test-tweedie-nloglik@1.79106:15.099987+0.329528 ## [2001] train-tweedie-nloglik@1.79106:13.923899+0.058205 test-tweedie-nloglik@1.79106:14.931884+0.371135 ## [2501] train-tweedie-nloglik@1.79106:13.617700+0.062158 test-tweedie-nloglik@1.79106:14.828744+0.421567 ## [3001] train-tweedie-nloglik@1.79106:13.364167+0.063944 test-tweedie-nloglik@1.79106:14.745604+0.455195 ## Stopping. 
Best iteration: ## [3159] train-tweedie-nloglik@1.79106:13.292696+0.064719 test-tweedie-nloglik@1.79106:14.719766+0.463486 ## ## [1] train-tweedie-nloglik@1.75281:343.292597+9.363249 test-tweedie-nloglik@1.75281:343.301770+37.470216 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.75281 for early stopping. ## Will train until test_tweedie_nloglik@1.75281 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.75281:33.808357+0.652361 test-tweedie-nloglik@1.75281:34.678754+3.054606 ## [1001] train-tweedie-nloglik@1.75281:16.164603+0.097345 test-tweedie-nloglik@1.75281:17.212687+0.750414 ## [1501] train-tweedie-nloglik@1.75281:14.603756+0.074138 test-tweedie-nloglik@1.75281:15.953423+0.574503 ## [2001] train-tweedie-nloglik@1.75281:13.968725+0.071299 test-tweedie-nloglik@1.75281:15.654203+0.618075 ## [2501] train-tweedie-nloglik@1.75281:13.520048+0.068583 test-tweedie-nloglik@1.75281:15.482204+0.654811 ## Stopping. Best iteration: ## [2566] train-tweedie-nloglik@1.75281:13.469965+0.068246 test-tweedie-nloglik@1.75281:15.476034+0.661855

## [mbo] 0: eta=0.00745; gamma=3.51; max_depth=4; min_child_weight=1387; subsample=0.629; colsample_bytree=0.376; max_delta_step=0.64; tweedie_variance_power=1.83 : y = 14.9 : 365.7 secs : initdesign

## [mbo] 0: eta=0.00941; gamma=1.73; max_depth=6; min_child_weight=1729; subsample=0.665; colsample_bytree=0.277; max_delta_step=0.95; tweedie_variance_power=1.78 : y = 14.8 : 401.0 secs : initdesign

## [mbo] 0: eta=0.00594; gamma=2.63; max_depth=9; min_child_weight=1844; subsample=0.406; colsample_bytree=0.723; max_delta_step=4.15; tweedie_variance_power=1.84 : y = 14.7 : 504.6 secs : initdesign

## [mbo] 0: eta=0.00858; gamma=4.15; max_depth=6; min_child_weight=974; subsample=0.772; colsample_bytree=0.686; max_delta_step=3.91; tweedie_variance_power=1.81 : y = 14.6 : 199.2 secs : initdesign

## [mbo] 0: eta=0.00993; gamma=4.57; max_depth=8; min_child_weight=1068; subsample=0.688; colsample_bytree=0.441; max_delta_step=2.05; tweedie_variance_power=1.77 : y = 14.7 : 239.7 secs : initdesign

## [mbo] 0: eta=0.00698; gamma=1.95; max_depth=3; min_child_weight=1116; subsample=0.581; colsample_bytree=0.647; max_delta_step=4.48; tweedie_variance_power=1.81 : y = 15 : 366.1 secs : initdesign

## [mbo] 0: eta=0.00637; gamma=3.72; max_depth=10; min_child_weight=1934; subsample=0.457; colsample_bytree=0.779; max_delta_step=2.48; tweedie_variance_power=1.82 : y = 14.6 : 415.2 secs : initdesign

## [mbo] 0: eta=0.00794; gamma=2.22; max_depth=9; min_child_weight=840; subsample=0.757; colsample_bytree=0.515; max_delta_step=3.11; tweedie_variance_power=1.79 : y = 14.5 : 206.0 secs : initdesign

## [mbo] 0: eta=0.00825; gamma=4.93; max_depth=7; min_child_weight=616; subsample=0.325; colsample_bytree=0.292; max_delta_step=3.4; tweedie_variance_power=1.84 : y = 14.8 : 213.4 secs : initdesign

## [mbo] 0: eta=0.00871; gamma=2.37; max_depth=3; min_child_weight=1553; subsample=0.221; colsample_bytree=0.324; max_delta_step=0.248; tweedie_variance_power=1.77 : y = 16.1 : 359.8 secs : initdesign

## [mbo] 0: eta=0.0073; gamma=4.34; max_depth=2; min_child_weight=1484; subsample=0.526; colsample_bytree=0.618; max_delta_step=1.99; tweedie_variance_power=1.76 : y = 16.3 : 233.8 secs : initdesign

## [mbo] 0: eta=0.00608; gamma=1.07; max_depth=2; min_child_weight=1280; subsample=0.293; colsample_bytree=0.402; max_delta_step=2.82; tweedie_variance_power=1.8 : y = 15.5 : 196.2 secs : initdesign

## [mbo] 0: eta=0.00519; gamma=3.23; max_depth=5; min_child_weight=514; subsample=0.5; colsample_bytree=0.527; max_delta_step=1.51; tweedie_variance_power=1.85 : y = 14.9 : 380.0 secs : initdesign

## [mbo] 0: eta=0.0091; gamma=1.49; max_depth=6; min_child_weight=321; subsample=0.362; colsample_bytree=0.21; max_delta_step=1.12; tweedie_variance_power=1.79 : y = 14.7 : 300.6 secs : initdesign

## [mbo] 0: eta=0.00543; gamma=2.87; max_depth=10; min_child_weight=731; subsample=0.243; colsample_bytree=0.595; max_delta_step=4.85; tweedie_variance_power=1.75 : y = 15.5 : 350.0 secs : initdesign

## [1] train-tweedie-nloglik@1.81205:340.715326+9.076589 test-tweedie-nloglik@1.81205:340.723718+36.318898 
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.81205 for early stopping. 
## Will train until test_tweedie_nloglik@1.81205 hasn't improved in 25 rounds. 
## 
## [501] train-tweedie-nloglik@1.81205:15.918866+0.102574 test-tweedie-nloglik@1.81205:16.461088+0.571469 
## [1001] train-tweedie-nloglik@1.81205:13.785034+0.049437 test-tweedie-nloglik@1.81205:14.672956+0.330847 
## [1501] train-tweedie-nloglik@1.81205:13.252387+0.053019 test-tweedie-nloglik@1.81205:14.476279+0.413990 
## [2001] train-tweedie-nloglik@1.81205:12.879603+0.050442 test-tweedie-nloglik@1.81205:14.391401+0.501251 
## Stopping. Best iteration: 
## [2023] train-tweedie-nloglik@1.81205:12.864722+0.050294 test-tweedie-nloglik@1.81205:14.387596+0.505142

## [mbo] 1: eta=0.00954; gamma=1.9; max_depth=8; min_child_weight=1225; subsample=0.731; colsample_bytree=0.349; max_delta_step=2.2; tweedie_variance_power=1.81 : y = 14.4 : 249.3 secs : infill_cb

## [1] train-tweedie-nloglik@1.80087:343.092389+9.184359 test-tweedie-nloglik@1.80087:343.097986+36.745005 
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80087 for early stopping. 
## Will train until test_tweedie_nloglik@1.80087 hasn't improved in 25 rounds. 
## 
## [501] train-tweedie-nloglik@1.80087:113.222479+2.921018 test-tweedie-nloglik@1.80087:113.224371+11.686362 
## [1001] train-tweedie-nloglik@1.80087:41.464624+0.928879 test-tweedie-nloglik@1.80087:41.465188+3.716334 
## [1501] train-tweedie-nloglik@1.80087:20.439788+0.295234 test-tweedie-nloglik@1.80087:20.440313+1.181873 
## [2001] train-tweedie-nloglik@1.80087:15.711758+0.113962 test-tweedie-nloglik@1.80087:15.898547+0.431068 
## [2501] train-tweedie-nloglik@1.80087:14.325745+0.070814 test-tweedie-nloglik@1.80087:15.013139+0.345294 
## [3001] train-tweedie-nloglik@1.80087:13.676998+0.056270 test-tweedie-nloglik@1.80087:14.686207+0.365084 
## [3501] train-tweedie-nloglik@1.80087:13.235437+0.057295 test-tweedie-nloglik@1.80087:14.506423+0.408045 
## [4001] train-tweedie-nloglik@1.80087:12.878366+0.056469 test-tweedie-nloglik@1.80087:14.409738+0.476704 
## Stopping. Best iteration: 
## [4439] train-tweedie-nloglik@1.80087:12.602206+0.057227 test-tweedie-nloglik@1.80087:14.353508+0.526717

## [mbo] 2: eta=0.00911; gamma=1.77; max_depth=8; min_child_weight=868; subsample=0.791; colsample_bytree=0.549; max_delta_step=0.314; tweedie_variance_power=1.8 : y = 14.4 : 422.2 secs : infill_cb

## [1] train-tweedie-nloglik@1.79891:341.616339+9.151591 test-tweedie-nloglik@1.79891:341.621362+36.614048 
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79891 for early stopping. 
## Will train until test_tweedie_nloglik@1.79891 hasn't improved in 25 rounds. 
## 
## [501] train-tweedie-nloglik@1.79891:21.260748+0.323534 test-tweedie-nloglik@1.79891:21.276269+1.295378 
## [1001] train-tweedie-nloglik@1.79891:14.082018+0.051754 test-tweedie-nloglik@1.79891:14.875722+0.333678 
## [1501] train-tweedie-nloglik@1.79891:13.233551+0.051804 test-tweedie-nloglik@1.79891:14.484654+0.410128 
## [2001] train-tweedie-nloglik@1.79891:12.712292+0.049849 test-tweedie-nloglik@1.79891:14.331855+0.503634 
## Stopping. Best iteration: 
## [2290] train-tweedie-nloglik@1.79891:12.472694+0.053111 test-tweedie-nloglik@1.79891:14.298177+0.575406

## [mbo] 3: eta=0.00942; gamma=3.15; max_depth=9; min_child_weight=843; subsample=0.783; colsample_bytree=0.271; max_delta_step=0.888; tweedie_variance_power=1.8 : y = 14.3 : 252.1 secs : infill_cb

## [1] train-tweedie-nloglik@1.8023:340.514740+9.107514 test-tweedie-nloglik@1.8023:340.523969+36.446491 
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.8023 for early stopping. 
## Will train until test_tweedie_nloglik@1.8023 hasn't improved in 25 rounds. 
## 
## [501] train-tweedie-nloglik@1.8023:14.909694+0.100825 test-tweedie-nloglik@1.8023:16.029717+0.614204 
## [1001] train-tweedie-nloglik@1.8023:12.276371+0.058075 test-tweedie-nloglik@1.8023:14.299033+0.636373 
## Stopping. Best iteration: 
## [1016] train-tweedie-nloglik@1.8023:12.243908+0.059515 test-tweedie-nloglik@1.8023:14.297354+0.646123

## [mbo] 4: eta=0.00999; gamma=2.8; max_depth=9; min_child_weight=342; subsample=0.683; colsample_bytree=0.49; max_delta_step=1.58; tweedie_variance_power=1.8 : y = 14.3 : 155.0 secs : infill_cb

## [1] train-tweedie-nloglik@1.79965:340.693542+9.123636 test-tweedie-nloglik@1.79965:340.698126+36.502615 
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79965 for early stopping. 
## Will train until test_tweedie_nloglik@1.79965 hasn't improved in 25 rounds. 
## 
## [501] train-tweedie-nloglik@1.79965:15.513652+0.107246 test-tweedie-nloglik@1.79965:16.291323+0.599968 
## [1001] train-tweedie-nloglik@1.79965:13.127623+0.057521 test-tweedie-nloglik@1.79965:14.478948+0.468251 
## [1501] train-tweedie-nloglik@1.79965:12.450628+0.057462 test-tweedie-nloglik@1.79965:14.320771+0.610130 
## Stopping. Best iteration: 
## [1521] train-tweedie-nloglik@1.79965:12.427706+0.057687 test-tweedie-nloglik@1.79965:14.317007+0.616377

## [mbo] 5: eta=0.01; gamma=2.63; max_depth=10; min_child_weight=1081; subsample=0.698; colsample_bytree=0.493; max_delta_step=1.18; tweedie_variance_power=1.8 : y = 14.3 : 205.2 secs : infill_cb

## [1] train-tweedie-nloglik@1.79543:340.537445+9.134006 test-tweedie-nloglik@1.79543:340.552142+36.558473 
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79543 for early stopping. 
## Will train until test_tweedie_nloglik@1.79543 hasn't improved in 25 rounds. 
## 
## [501] train-tweedie-nloglik@1.79543:15.423422+0.099615 test-tweedie-nloglik@1.79543:16.142575+0.549508 
## [1001] train-tweedie-nloglik@1.79543:13.201343+0.056114 test-tweedie-nloglik@1.79543:14.503582+0.419961 
## [1501] train-tweedie-nloglik@1.79543:12.487963+0.057690 test-tweedie-nloglik@1.79543:14.305727+0.556182 
## Stopping. Best iteration: 
## [1697] train-tweedie-nloglik@1.79543:12.274374+0.058514 test-tweedie-nloglik@1.79543:14.272447+0.606904

## [mbo] 6: eta=0.00999; gamma=1.42; max_depth=9; min_child_weight=584; subsample=0.769; colsample_bytree=0.272; max_delta_step=2.29; tweedie_variance_power=1.8 : y = 14.3 : 217.2 secs : infill_cb

## [1] train-tweedie-nloglik@1.79417:341.503680+9.166254 test-tweedie-nloglik@1.79417:341.508606+36.672458 
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79417 for early stopping. 
## Will train until test_tweedie_nloglik@1.79417 hasn't improved in 25 rounds. 
## 
## [501] train-tweedie-nloglik@1.79417:19.509591+0.261787 test-tweedie-nloglik@1.79417:19.639546+1.096469 
## [1001] train-tweedie-nloglik@1.79417:13.144418+0.052379 test-tweedie-nloglik@1.79417:14.496626+0.423258 
## [1501] train-tweedie-nloglik@1.79417:12.073966+0.057633 test-tweedie-nloglik@1.79417:14.226620+0.590800 
## Stopping. Best iteration: 
## [1514] train-tweedie-nloglik@1.79417:12.054614+0.057727 test-tweedie-nloglik@1.79417:14.224718+0.595632

## [mbo] 7: eta=0.00991; gamma=1.58; max_depth=10; min_child_weight=453; subsample=0.77; colsample_bytree=0.352; max_delta_step=0.902; tweedie_variance_power=1.79 : y = 14.2 : 182.2 secs : infill_cb

## [1] train-tweedie-nloglik@1.80093:342.887091+9.178540 test-tweedie-nloglik@1.80093:342.892096+36.721835 
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80093 for early stopping. 
## Will train until test_tweedie_nloglik@1.80093 hasn't improved in 25 rounds. 
## 
## [501] train-tweedie-nloglik@1.80093:85.321825+2.153553 test-tweedie-nloglik@1.80093:85.322955+8.616076 
## [1001] train-tweedie-nloglik@1.80093:26.996887+0.505121 test-tweedie-nloglik@1.80093:26.997190+2.021150 
## [1501] train-tweedie-nloglik@1.80093:16.226591+0.120683 test-tweedie-nloglik@1.80093:16.298045+0.500625 
## [2001] train-tweedie-nloglik@1.80093:14.522975+0.055452 test-tweedie-nloglik@1.80093:15.077755+0.326529 
## [2501] train-tweedie-nloglik@1.80093:13.705078+0.062120 test-tweedie-nloglik@1.80093:14.643362+0.337972 
## [3001] train-tweedie-nloglik@1.80093:13.174421+0.062708 test-tweedie-nloglik@1.80093:14.423376+0.381127 
## [3501] train-tweedie-nloglik@1.80093:12.755801+0.063702 test-tweedie-nloglik@1.80093:14.289940+0.439437 
## Stopping. Best iteration: 
## [3829] train-tweedie-nloglik@1.80093:12.516629+0.066593 test-tweedie-nloglik@1.80093:14.231650+0.485118

## [mbo] 8: eta=0.00908; gamma=1.13; max_depth=10; min_child_weight=301; subsample=0.709; colsample_bytree=0.2; max_delta_step=0.399; tweedie_variance_power=1.8 : y = 14.2 : 391.0 secs : infill_cb

## [1] train-tweedie-nloglik@1.80353:340.954138+9.115118 test-tweedie-nloglik@1.80353:340.965655+36.477642 
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80353 for early stopping. 
## Will train until test_tweedie_nloglik@1.80353 hasn't improved in 25 rounds. 
## 
## [501] train-tweedie-nloglik@1.80353:16.280642+0.145262 test-tweedie-nloglik@1.80353:17.426506+0.818395 
## [1001] train-tweedie-nloglik@1.80353:12.257854+0.055505 test-tweedie-nloglik@1.80353:14.196385+0.579480 
## Stopping. Best iteration: 
## [1075] train-tweedie-nloglik@1.80353:12.083582+0.060371 test-tweedie-nloglik@1.80353:14.174686+0.630563

## [mbo] 9: eta=0.00868; gamma=1.42; max_depth=10; min_child_weight=302; subsample=0.799; colsample_bytree=0.376; max_delta_step=1.39; tweedie_variance_power=1.8 : y = 14.2 : 165.5 secs : infill_cb

## [1] train-tweedie-nloglik@1.8168:341.102509+9.068973 test-tweedie-nloglik@1.8168:341.113751+36.288728 
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.8168 for early stopping. 
## Will train until test_tweedie_nloglik@1.8168 hasn't improved in 25 rounds. 
## 
## [501] train-tweedie-nloglik@1.8168:16.397836+0.150262 test-tweedie-nloglik@1.8168:17.661101+0.874430 
## [1001] train-tweedie-nloglik@1.8168:12.316689+0.061103 test-tweedie-nloglik@1.8168:14.303786+0.625352 
## Stopping. Best iteration: 
## [1019] train-tweedie-nloglik@1.8168:12.277460+0.060180 test-tweedie-nloglik@1.8168:14.299007+0.638777

## [mbo] 10: eta=0.00858; gamma=1.26; max_depth=10; min_child_weight=467; subsample=0.798; colsample_bytree=0.641; max_delta_step=1.38; tweedie_variance_power=1.82 : y = 14.3 : 170.7 secs : infill_cb

## $eta 
## [1] 0.008678109 
## 
## $gamma 
## [1] 1.423488 
## 
## $max_depth 
## [1] 10 
## 
## $min_child_weight 
## [1] 302 
## 
## $subsample 
## [1] 0.7986374 
## 
## $colsample_bytree 
## [1] 0.3759013 
## 
## $max_delta_step 
## [1] 1.391834 
## 
## $tweedie_variance_power 
## [1] 1.803532

Results in hand, we want to check some diagnostics, starting with the objective function evaluation for all of the runs.

runs$plot

The plot above (see our `do_bayes()` function for how we extracted this info) shows the best test evaluation result for each run – the initial design is colored red and the optimization runs are in blue. The hyperparameter values which produced those evaluations were the ones chosen through kriging. We can see from this that none of the random evaluations gave a top result, but together they did provide solid information on *where* in the hyperparameter space we should focus our search in order to optimize the algorithm. Every subsequent proposal was better than all of the random ones.
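If you want to rebuild a plot like this yourself, here is a minimal sketch (assuming `runs$run` is the `MBOSingleObjResult` we inspect below): coerce the optimization path to a data frame and color the evaluations by stage.

```r
# Sketch: objective value per evaluation, colored by stage.
# Assumes runs$run is an MBOSingleObjResult; in its opt.path, dob == 0
# marks the random initial design and dob > 0 the sequential proposals.
library(ggplot2)

op <- as.data.frame(runs$run$opt.path)
op$stage <- ifelse(op$dob == 0, "initdesign", "mbo")

ggplot(op, aes(x = seq_along(y), y = y, colour = stage)) +
  geom_point() +
  labs(x = "evaluation", y = "test tweedie-nloglik")
```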

The “default” viz below comes from the `plot()` S3 class method for `MBOSingleObjResult` and shows a few useful things, although the formatting could be improved. Most importantly, the top left plot shows the “scaled” values for each set of hyperparameters, for each run. Use this to confirm your recommended solution does not include any hyperparameter at the boundary of the values tested – if it does, then expand the range of that parameter in your objective function and re-run. In my example below, you can see that the optimal solution (the green line) includes a value for `max_depth` at the maximum of 10, and a `min_child_weight` at or near the minimum (300) of the range I had allowed. Unless I were intentionally using these bounds to limit model complexity and improve generalization, I should try expanding the ranges of these hyperparameters and running again.
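A hypothetical widening of those two ranges might look like the following; the actual `ParamSet` lives inside our `do_bayes()` function, so the parameter names and bounds shown here are illustrative only.

```r
# Illustrative only: widen the two boundary-hitting hyperparameter ranges
# in the ParamHelpers set passed to mlrMBO. The other hyperparameters in
# the real param set would stay as before.
library(ParamHelpers)

par.set <- makeParamSet(
  makeIntegerParam("max_depth", lower = 2, upper = 16),            # was capped at 10
  makeIntegerParam("min_child_weight", lower = 100, upper = 2000)  # was floored near 300
)
```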

class(runs$run) %>% print

## [1] "MBOSingleObjResult" "MBOResult"

plot(runs$run)

If you print the result object you can confirm the recommended solution included these boundary values:

print(runs$run)

## Recommended parameters: 
## eta=0.00868; gamma=1.42; max_depth=10; min_child_weight=302; subsample=0.799; colsample_bytree=0.376; max_delta_step=1.39; tweedie_variance_power=1.8 
## Objective: y = 14.175 
## 
## Optimization path 
## 15 + 10 entries in total, displaying last 10 (or less): 
##    eta         gamma    max_depth min_child_weight subsample colsample_bytree max_delta_step tweedie_variance_power 
## 16 0.009543908 1.904103 8         1225             0.7305367 0.3488211        2.1958369      1.812051 
## 17 0.009114593 1.773073 8         868              0.7911231 0.5494522        0.3138641      1.800874 
## 18 0.009421640 3.150876 9         843              0.7830561 0.2705329        0.8878380      1.798907 
## 19 0.009990520 2.801814 9         342              0.6832292 0.4898784        1.5775174      1.802298 
## 20 0.009996788 2.626826 10        1081             0.6979587 0.4934878        1.1786900      1.799649 
## 21 0.009986297 1.419945 9         584              0.7690373 0.2716777        2.2940200      1.795434 
## 22 0.009913078 1.579795 10        453              0.7699498 0.3517838        0.9017585      1.794166 
## 23 0.009082225 1.130765 10        301              0.7085142 0.2004467        0.3985858      1.800927 
## 24 0.008678109 1.423488 10        302              0.7986374 0.3759013        1.3918335      1.803532 
## 25 0.008579770 1.258053 10        467              0.7982686 0.6407762        1.3812183      1.816803 
##    y        dob eol error.message exec.time cb       error.model train.time prop.type propose.time se         mean 
## 16 14.38760 1   NA                249.294   14.08443             0.417      infill_cb 1.041        0.26275330 14.34718 
## 17 14.35351 2   NA                422.219   14.20226             0.150      infill_cb 1.771        0.21492201 14.41718 
## 18 14.29818 3   NA                252.086   14.20628             0.123      infill_cb 1.537        0.19145391 14.39773 
## 19 14.29735 4   NA                155.010   14.19001             0.174      infill_cb 1.474        0.15554575 14.34556 
## 20 14.31701 5   NA                205.205   14.19583             0.333      infill_cb 1.299        0.15011713 14.34595 
## 21 14.27245 6   NA                217.224   14.20543             0.122      infill_cb 1.356        0.12818859 14.33362 
## 22 14.22472 7   NA                182.158   14.19185             0.101      infill_cb 1.510        0.10787970 14.29973 
## 23 14.23165 8   NA                390.982   14.16640             0.180      infill_cb 1.438        0.12182032 14.28822 
## 24 14.17469 9   NA                165.500   14.16323             0.113      infill_cb 1.142        0.08824657 14.25147 
## 25 14.29901 10  NA                170.738   14.12875             0.097      infill_cb 1.179        0.14723428 14.27598 
##    lambda 
## 16 1 
## 17 1 
## 18 1 
## 19 1 
## 20 1 
## 21 1 
## 22 1 
## 23 1 
## 24 1 
## 25 1

Assuming we are happy with the result, we should then have what we need to proceed with model training. *However*, since `xgb.cv()` now uses early stopping and `nrounds` is not a tuning parameter, we did not capture this needed information in our MBO result. So we need to run one more evaluation the old-fashioned way, calling `xgb.cv()` directly using the best hyperparameters we found.

best.params <- runs$run$x
print(best.params)

## $eta 
## [1] 0.008678109 
## 
## $gamma 
## [1] 1.423488 
## 
## $max_depth 
## [1] 10 
## 
## $min_child_weight 
## [1] 302 
## 
## $subsample 
## [1] 0.7986374 
## 
## $colsample_bytree 
## [1] 0.3759013 
## 
## $max_delta_step 
## [1] 1.391834 
## 
## $tweedie_variance_power 
## [1] 1.803532

We add the model parameters which were fixed during optimization to this list:

best.params$booster <- "gbtree"
best.params$objective <- "reg:tweedie"

Now we cross-validate the number of rounds to use, fixing our best hyperparameters:

optimal.cv <- xgb.cv(params = best.params,
                     data = fre_dm,
                     nrounds = 6000,
                     nthread = 26,
                     nfold = 5,
                     prediction = FALSE,
                     showsd = TRUE,
                     early_stopping_rounds = 25,
                     verbose = 1,
                     print_every_n = 500)

## [1] train-tweedie-nloglik@1.80353:340.950989+7.692406 test-tweedie-nloglik@1.80353:340.929736+30.743659 
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80353 for early stopping. 
## Will train until test_tweedie_nloglik@1.80353 hasn't improved in 25 rounds. 
## 
## [501] train-tweedie-nloglik@1.80353:16.281491+0.131605 test-tweedie-nloglik@1.80353:17.436350+0.607008 
## [1001] train-tweedie-nloglik@1.80353:12.254035+0.041286 test-tweedie-nloglik@1.80353:14.250821+0.548260 
## Stopping. Best iteration: 
## [1060] train-tweedie-nloglik@1.80353:12.114213+0.035136 test-tweedie-nloglik@1.80353:14.242658+0.593849

Obtain the best number of rounds…

best.params$nrounds <- optimal.cv$best_ntreelimit
best.params[[11]] %>% print

## [1] 1060

…and finally, train the final learner:

final.model <- xgboost(params = best.params[-11], ## do not include nrounds here
                       data = fre_dm,
                       nrounds = best.params$nrounds,
                       verbose = 1,
                       print_every_n = 500)

## [1] train-tweedie-nloglik@1.80353:340.952393 
## [501] train-tweedie-nloglik@1.80353:16.252325 
## [1001] train-tweedie-nloglik@1.80353:12.189503 
## [1060] train-tweedie-nloglik@1.80353:12.041411

xgb.importance(model = final.model) %>% xgb.plot.importance()

Bayesian optimization is a smart approach for tuning more complex learning algorithms with many hyperparameters when compute resources are slowing down the analysis. It is commonly used in deep learning, but it can also be useful when working with machine learning algorithms like GBMs (shown here), random forests, or support vector machines – really anything that would take you too long to tune with a naive grid search. Even if you are working with a relatively simple algorithm – say a lasso regression, which involves a single hyperparameter \(\lambda\) to control the shrinkage/penalty – you may have only a small amount of compute available to you. If so, it could still make sense to use MBO to cut down the number of evaluations needed to find the optimum.


- What is in the market currently?
- What levels of rate disruption to the existing book are acceptable?
- Would indicated rating relativities produce unaffordable rates for any customer segments?
- Do rating differences between customer segments rise to the level of being “unfairly discriminatory”?
- What will a regulator approve, or conversely, object to?
- What are the IT/systems implementation constraints?
- Deploying models as containerized prediction APIs would eliminate such a question… but instead the prediction formulas are usually written in a stored procedure with SQL tables to hold parameter values

- Are the proposed rating factors “intuitive”, or is there a believable causal relationship to the predicted losses?

Many or all of these questions will always need consideration post-modeling since they are not driven by shortcomings in the modeling approach or data quality. Often other questions also motivate adjustments to the modeling, such as:

- Will the proposed factors impact the customer experience negatively?
- For example, will there be “reversals” such that rates increase, then decrease, then increase again as a policy renews?

- Do we have sufficient experience to give high confidence in our parameter estimates?
- Where is our experience thin?
- Are we extrapolating wildly beyond the range of our experience, producing rates that are too high or low in the tails?

These considerations are of a different sort – they could be avoided in whole or in part with better modeling techniques.

Enter Generalized Additive Models (GAMs) as a more flexible approach than the vanilla GLM. I'm a big fan of using GAMs for most things where a GLM might be the go-to – in both pricing and non-pricing (operational, underwriting, marketing, etc.) applications. GAMs are just an extension of the GLM and as such are accessible to actuaries and others who are already familiar with that framework.

From the prediction perspective, the output of a GLM is an estimate of the expected value (mean) for an observation of the form

\[

\mu_i = g^{-1}(\eta_i) = g^{-1}(\beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p})

\]

\(g^{-1}\) is the inverse of a monotonic link function, and the linear predictor (\(\eta\)) is a linear combination of covariates and estimated coefficients (aka weights, parameters). This is a pretty inflexible model. In fact, on the link scale it's a linear model – the linear predictor surface is a hyperplane slicing through a feature space of dimension \(p\). Compare this to the relationship modeled by a GAM:

\[

\mu_i = g^{-1}(\eta_i) = g^{-1}(\beta_0 + f_1(x_{i,1}) + f_2(x_{i,2}) + \dots + f_p(x_{i,p}))

\]

Here we only assume that each \(x\) enters the linear predictor via **some function** \(f_j\), rather than directly in proportion to some \(\beta\). This looser assumption allows for more flexible models, provided we can reliably estimate the \(p\) non-linear functions \(f_j\) somehow.

It turns out that we can produce useful estimates for these functions by representing each one with a basis expansion of \(x\). We will take a practical look at how that is done using the approach of thin plate splines, but first let's establish our baseline for comparison using the GLMs we are familiar with.
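Concretely (this is the standard construction from the GAM literature, not something specific to our dataset), each smooth is written as a weighted sum of known basis functions:

\[

f_j(x) = \sum_{k=1}^{K} \beta_{j,k} b_{j,k}(x)

\]

where the \(b_{j,k}\) are fixed, known functions (for example the thin plate spline basis), so estimation reduces to finding the coefficients \(\beta_{j,k}\) – typically subject to a wiggliness penalty that guards against overfitting.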

First we load some standard libraries for manipulating and visualizing data. The `mgcv` package was an important development in our computational ability to estimate GAMs and remains probably the most popular package in use today.

library(CASdatasets)
library(dplyr)
library(tibble)
library(magrittr)
library(ggplot2)
library(tidyr)
library(mgcv)

We will use the French motor dataset from the `CASdatasets` library for fooling around here.

data("freMPL6")

Let's have a glimpse at the data

glimpse(freMPL6)

## Rows: 42,400 
## Columns: 20 
## $ Exposure          <dbl> 0.333, 0.666, 0.207, 0.163, 0.077, 0.347, 0.652, 0.355, 0.644, 0.379, 0.333, 0.666, 0.407, 0.333, 0.666, … 
## $ LicAge            <int> 468, 472, 169, 171, 180, 165, 169, 429, 433, 194, 318, 322, 522, 444, 448, 403, 401, 406, 530, 534, 267, … 
## $ RecordBeg         <date> 2004-01-01, 2004-05-01, 2004-01-01, 2004-03-16, 2004-12-03, 2004-01-01, 2004-05-06, 2004-01-01, 2004-05-… 
## $ RecordEnd         <date> 2004-05-01, NA, 2004-03-16, 2004-05-15, NA, 2004-05-06, NA, 2004-05-09, NA, 2004-05-18, 2004-05-01, NA, … 
## $ Gender            <fct> Male, Male, Male, Male, Male, Female, Female, Female, Female, Male, Male, Male, Female, Male, Male, Male,… 
## $ MariStat          <fct> Other, Other, Other, Other, Other, Alone, Alone, Other, Other, Other, Other, Other, Other, Other, Other, … 
## $ SocioCateg        <fct> CSP50, CSP50, CSP50, CSP50, CSP50, CSP50, CSP50, CSP50, CSP50, CSP50, CSP22, CSP22, CSP50, CSP50, CSP50, … 
## $ VehUsage          <fct> Private, Private, Private+trip to office, Private+trip to office, Private+trip to office, Private+trip to… 
## $ DrivAge           <int> 67, 68, 32, 32, 33, 32, 32, 59, 59, 51, 44, 45, 61, 55, 55, 55, 54, 54, 62, 63, 40, 40, 55, 55, 38, 38, 7… 
## $ HasKmLimit        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … 
## $ ClaimAmount       <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 75.94985, 0.00000, 0.00000, 59.24230, 0.00000, 0.00000, 0.00… 
## $ ClaimNbResp       <int> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 0, 0, 3, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, … 
## $ ClaimNbNonResp    <int> 0, 0, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … 
## $ ClaimNbParking    <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … 
## $ ClaimNbFireTheft  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … 
## $ ClaimNbWindscreen <int> 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, … 
## $ OutUseNb          <int> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … 
## $ RiskArea          <int> 9, 9, 7, 7, 7, 6, 6, 9, 9, 11, 9, 9, 7, 5, 5, 9, 9, 9, 6, 6, 10, 10, 11, 11, 11, 11, 10, 10, 7, 7, 10, 10… 
## $ BonusMalus        <int> 50, 50, 72, 68, 68, 63, 59, 51, 50, 50, 50, 50, 50, 53, 50, 95, 118, 100, 50, 50, 50, 50, 58, 72, 50, 50,… 
## $ ClaimInd          <int> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …

We have some typical policy elements here, including a pre-calculated exposure column, as well as some claim elements joined to the exposures. I'm curious how `Exposure` was calculated given the NAs that exist in the `RecordEnd` column, but we will just trust it for our purposes.

I notice also that we have a column named `ClaimInd` rather than `ClaimCount` – let's confirm this is indeed a binary indicator of 0 or 1 (or more?) claims in the exposure period.

```r
table(freMPL6$ClaimInd)
```

```
## 
##     0     1 
## 38186  4214
```

One thing I want to change is to express `LicAge` in integral years rather than months. We would not throw out information like this normally, but for our purposes it will simplify some plotting.

```r
freMPL6 %<>%
  as_tibble() %>%
  mutate(LicAge = floor(LicAge / 12))
```

Let's start by looking at annualized claim frequency as a function of `LicAge` in years.

```r
dat_summ <- freMPL6 %>%
  group_by(LicAge) %>%
  summarise(claim_cnt = sum(ClaimInd),
            Exposure  = sum(Exposure),
            obs_freq  = claim_cnt / Exposure)
```

```
## `summarise()` ungrouping output (override with `.groups` argument)
```

```r
dat_summ %>%
  ggplot(aes(x = LicAge, y = obs_freq)) +
  geom_line() +
  ggtitle("Observed Claim Frequency by License Age")
```

This looks close to but not quite a linear effect. So now let's move into some modeling.

The GLM is in a sense just a special case of a GAM, where we are limited to estimates of the form \(f(x) = \beta x\). Insurance practitioners will often approach a modeling analysis by fitting a GLM as follows…

Fit a model to the raw feature(s)

```r
lame_glm <- glm(ClaimInd ~ LicAge,
                offset = log(Exposure),
                family = poisson(link = "log"),
                ## skip tests of the poisson assumption; save for a future blog post
                data = freMPL6)
```

And then check for “significant” predictors

```r
summary(lame_glm)
```

```
## 
## Call:
## glm(formula = ClaimInd ~ LicAge, family = poisson(link = "log"), 
##     data = freMPL6, offset = log(Exposure))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7390  -0.5252  -0.3913  -0.2396   3.2793  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.278794   0.034191 -37.401  < 2e-16 ***
## LicAge      -0.009577   0.001177  -8.139 3.98e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18267  on 42399  degrees of freedom
## Residual deviance: 18200  on 42398  degrees of freedom
## AIC: 26632
## 
## Number of Fisher Scoring iterations: 6
```

Finally, visually inspect the fit against the mean observed response (we use the summarised data for this)

```r
dat_summ %>%
  mutate(lame_fit = predict(lame_glm, ., type = "response") / Exposure) %>%
  gather(key = "metric", value = "freq", obs_freq, lame_fit) %>%
  ggplot(aes(x = LicAge, y = freq, color = metric)) +
  geom_line() +
  ggtitle("Fit vs Observed")
```

Good enough, back to studying for an exam.

Well maybe not…what's going on at the tails of our domain? Are we just seeing the process variance due to low exposure? How good is this linear curve?

```r
gridExtra::grid.arrange(
  dat_summ %>%
    mutate(lame_fit = predict(lame_glm, ., type = "response") / Exposure) %>%
    gather(key = "metric", value = "freq", obs_freq, lame_fit) %>%
    ggplot(aes(x = LicAge, y = freq, color = metric)) +
    geom_line() +
    ggtitle("Fit vs Observed") +
    theme(axis.title.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.text.x = element_blank(),
          legend.position = c(.15, .2)),
  dat_summ %>%
    ggplot(aes(x = LicAge, y = Exposure)) +
    geom_bar(stat = 'identity'),
  heights = c(2, 1)
)
```

The exposure is certainly lower in each tail, but too low? Is the pattern in the observed mean at the tails purely “noise”, or is there a credible “signal” here? Of course we can't tell from a graph – we need some statistics. At this point, depending on the practitioner, the linear fit might be accepted and some post-modeling selections proposed to better fit the observed experience, if it is believed to be signal and not noise.

Given the data above, an attentive modeler would probably introduce some “feature engineering” into the analysis by creating a new column, a log-transform of `LicAge`, to test in a second (or perhaps the first) candidate model. For many, some standard transformations of raw features are created at the outset, before any models are fit.

```r
freMPL6 %<>% ## mutate and update data
  mutate(log_LicAge = log(LicAge))

better_glm <- glm(ClaimInd ~ log_LicAge, ## new term
                  offset = log(Exposure),
                  family = poisson(link = "log"),
                  data = freMPL6)
```

We recreate the summarised data for plotting

```r
dat_summ <- freMPL6 %>%
  group_by(LicAge) %>%
  summarise(claim_cnt  = sum(ClaimInd),
            Exposure   = sum(Exposure),
            obs_freq   = claim_cnt / Exposure,
            log_LicAge = first(log_LicAge))
```

```
## `summarise()` ungrouping output (override with `.groups` argument)
```

and compare the two curves graphically

```r
gridExtra::grid.arrange(
  dat_summ %>%
    mutate(lame_fit   = predict(lame_glm, ., type = "response") / Exposure,
           better_fit = predict(better_glm, ., type = "response") / Exposure) %>%
    gather(key = "metric", value = "freq", obs_freq, lame_fit, better_fit) %>%
    ggplot(aes(x = LicAge, y = freq, color = metric)) +
    geom_line() +
    ggtitle("Fit vs Observed") +
    theme(axis.title.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.text.x = element_blank(),
          legend.position = c(.15, .2)),
  dat_summ %>%
    ggplot(aes(x = LicAge, y = Exposure)) +
    geom_bar(stat = 'identity'),
  heights = c(2, 1)
)
```

The model with `log(LicAge)` looks like an improvement. These models are not nested, and we are using 1 degree of freedom for each curve, so model comparison is straightforward – we would simply choose the model with the lower residual deviance or, equivalently when model complexity is the same, the lower AIC.

```r
AIC(lame_glm, better_glm)
```

```
##            df      AIC
## lame_glm    2 26632.23
## better_glm  2 26615.96
```

Now we are running into the natural limitations of our GLM. We can model using the raw features or create simple functional transforms – log, square root, and so on are some common choices. Another option is to create a polynomial basis expansion of \(x\) – an orthogonalized basis of \(\{x^1, x^2, \dots, x^k\}\) – and test each of these polynomials in the model. This is actually a popular approach; it is an option built into some widely used modeling software and can be done in R with the `poly()` function. If you're already familiar with this, then you are familiar with GAMs, since it is what I would call a poor-man's GAM. Some of the shortcomings of this approach are:

- A polynomial basis is still a limited set of shapes for the basis functions
- Including multiple polynomials often produces wild behavior in the tails
- Each basis function is either fully in or fully omitted from the model – that is, assuming we stick with the `glm()` method (we could move into regularization with lasso or ridge regressions)
- Terms are typically tested manually for inclusion, starting from the lowest order and increasing
- Just my experience, but we often begin to over-fit the data once we go beyond an order of 2 or maybe 3

Here is how a polynomial basis expansion can be implemented

```r
poly_glm <- glm(ClaimInd ~ poly(LicAge, degree = 3),
                offset = log(Exposure),
                family = poisson(link = "log"),
                data = freMPL6)

dat_summ %>%
  mutate(poly_fit = predict(poly_glm, ., type = "response") / Exposure) %>%
  gather(key = "metric", value = "freq", obs_freq, poly_fit) %>%
  ggplot(aes(x = LicAge, y = freq, color = metric)) +
  geom_line() +
  ggtitle("GLM w/ 3rd order Polynomial Basis Expansion vs Observed") +
  theme(axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank(),
        legend.position = c(.15, .2))
```

As you can see, the fitted curve shows a characteristic cubic shape which could become dangerous if this is the type of variable where some large outliers in \(x\) are possible.
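
To see that danger concretely, here is an illustrative sketch on simulated data (not the insurance data above): a cubic fitted to a roughly logarithmic relationship behaves reasonably inside the training range, but can veer far from the truth just beyond it.

```r
set.seed(42)
# simulated training data: x in [1, 30], roughly logarithmic response
x <- runif(300, 1, 30)
y <- log(x) + rnorm(300, sd = 0.2)

poly_fit <- lm(y ~ poly(x, degree = 3))

# within the training range the fit is reasonable
inside <- predict(poly_fit, newdata = data.frame(x = 15))

# a modest outlier in x lets the cubic term dominate the prediction
outside <- predict(poly_fit, newdata = data.frame(x = 60))

c(inside = inside, outside = outside, truth_at_60 = log(60))
```

The exact numbers depend on the seed, but the extrapolation error at `x = 60` dwarfs the in-range error – exactly the tail behavior to watch for with polynomial bases.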

Take a look at the model summary and make note of the terms in the model

```r
summary(poly_glm)
```

```
## 
## Call:
## glm(formula = ClaimInd ~ poly(LicAge, degree = 3), family = poisson(link = "log"), 
##     data = freMPL6, offset = log(Exposure))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8152  -0.5262  -0.3917  -0.2411   3.2924  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.53972    0.01548 -99.459  < 2e-16 ***
## poly(LicAge, degree = 3)1 -25.16797    3.16467  -7.953 1.82e-15 ***
## poly(LicAge, degree = 3)2   7.58463    3.25564   2.330   0.0198 *  
## poly(LicAge, degree = 3)3  -6.66467    3.28154  -2.031   0.0423 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18267  on 42399  degrees of freedom
## Residual deviance: 18189  on 42396  degrees of freedom
## AIC: 26625
## 
## Number of Fisher Scoring iterations: 6
```

Our leading model so far seems to be the GLM with a log-transform – but it looks like we under-estimate the risk at the low end of `LicAge`. Suppose our carrier plans to begin writing more non-standard risks in the lower age group in future years. That is a compelling reason to obtain the most accurate predictions we can for this segment, where our historical experience is thin. Is this the best we can do?

Let us turn to 'proper' GAMs, specifically using thin plate spline bases. When estimating GAMs, you are really just fitting a GLM with some basis expansions of your covariates, similar to our example using `poly()` above. The differences between that approach and proper GAMs (with TPS) are:

- Basis expansions will be created for us starting from a much larger function space (about the same dimension as there are observations in the dataset, if I understand correctly), and then truncated down to the \(k\) most important functions in this space via eigen decomposition
- The basis functions will not have a simple closed form expression like \(b(x) = x^3\), but rather a more complicated representation in terms of a radial basis kernel transformation applied to \(x\). We don't deal with the math behind this.
- The coefficient estimates will be regularized or “shrunk” via inclusion of a complexity penalty in the optimization
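
As a quick sanity check of the “a GAM is a GLM underneath” idea – on a small simulated Poisson dataset, not our insurance data – a `gam()` call with no `s()` terms reduces exactly to a `glm()`:

```r
library(mgcv)

set.seed(1)
# simulate a small Poisson dataset (illustrative only)
n <- 500
x <- runif(n, 0, 10)
y <- rpois(n, lambda = exp(-1 + 0.1 * x))

glm_fit <- glm(y ~ x, family = poisson(link = "log"))
gam_fit <- gam(y ~ x, family = poisson(link = "log"))  # no s() => a plain GLM

# same model, same coefficients (up to numerical tolerance)
all.equal(unname(coef(glm_fit)), unname(coef(gam_fit)), tolerance = 1e-6)
```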

From an estimation perspective, the quantity being minimized to “fit” the GAM is (for a gaussian error)

\[\min_{f}\sum_{i=1}^{N}[y_i-f(x_i)]^2 + \lambda J(f)\]

We've seen pieces of this before – the first term is squared error loss. The \(J(f)\) in the second term is a measure of “wiggliness” of fitted function \(f\). Thinking about how one measures wiggliness, it is intuitive that we would want to consider second derivatives of \(f\) evaluated all along \(x\), squaring them so that concave and convex areas do not cancel each other out, and sum them all up. Well that is effectively how it is done:

\[J(f) = \int f''(x)^2 \, dx\]

\(\lambda\) is the smoothing parameter which controls how much penalty enters the function being optimized. It is a hyperparameter, so it must be tuned/estimated for us – this is done either through GCV (the default) or through (RE)ML, which the package author recommends since it is less prone to undersmoothing, though it is also a bit slower. Those familiar with penalized regressions such as the lasso and ridge estimators (as implemented in the `glmnet` package, for example) will recognize \(\lambda\), which is also used there to denote the complexity parameter. Instead of the \(L_1\) or \(L_2\) norms as measures of complexity (lasso and ridge, respectively), here we have the quantity \(J(f)\) measuring wiggliness.
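
To build some intuition for \(J(f)\), here is a small sketch of my own (not from `mgcv`) that approximates the integral numerically with finite differences: a straight line has essentially zero wiggliness, while a sine wave accumulates a large penalty.

```r
# numerically approximate J(f) = integral of f''(x)^2 over [0, 1]
wiggliness <- function(f, lower = 0, upper = 1) {
  h <- 1e-4
  # second derivative via central finite differences
  f2 <- function(x) (f(x + h) - 2 * f(x) + f(x - h)) / h^2
  integrate(function(x) f2(x)^2, lower + h, upper - h)$value
}

wiggliness(function(x) 2 * x + 1)        # linear: essentially no penalty
wiggliness(function(x) sin(2 * pi * x))  # wavy: a large penalty
```

The penalized fit trades the first term (fidelity to the data) against this second term, with \(\lambda\) setting the exchange rate.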

A GAM is fit using the `gam()` and `s()` functions as follows

```r
gam_tps_1 <- freMPL6 %>%
  gam(ClaimInd ~ s(LicAge, bs = "tp"), # TPS basis
      offset = log(Exposure),
      family = poisson(link = "log"),
      data = .)
```

Let's see what `summary()` tells us about our model

```r
summary(gam_tps_1)
```

```
## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## ClaimInd ~ s(LicAge, bs = "tp")
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.54069    0.01549  -99.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df Chi.sq p-value    
## s(LicAge) 6.065  7.077  97.18  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0313   Deviance explained = 0.517%
## UBRE = -0.57107  Scale est. = 1         n = 42400
```

We notice that we spend about 6 degrees of freedom estimating the smooth term `s(LicAge)`.

Compare the GAM fit to the GLM with logged feature

```r
gridExtra::grid.arrange(
  dat_summ %>%
    mutate(better_fit = predict(better_glm, ., type = "response") / Exposure,
           gam_fit_1  = predict(gam_tps_1, ., type = "response")) %>%
    gather(key = "metric", value = "freq", obs_freq, better_fit, gam_fit_1) %>%
    ggplot(aes(x = LicAge, y = freq, color = metric)) +
    geom_line() +
    ggtitle("Fit vs Observed") +
    theme(axis.title.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.text.x = element_blank(),
          legend.position = c(.15, .2)),
  dat_summ %>%
    ggplot(aes(x = LicAge, y = Exposure)) +
    geom_bar(stat = 'identity'),
  heights = c(2, 1)
)
```

Recall a GAM models the response as a linear combination of some smooth functions \(f(x)\), for each \(x\) in the set of covariates \(\{x_1…x_p\}\).

Here we only have one \(x\), and the spline basis \(f_1(x_1)\) is estimated as

\[f_1(x_1) = \sum^{k}_{j=1}b_j(x_1)\beta_j\]

\(k\) represents the basis dimension for the smooth. What is the dimension of our selected spline basis? Well since we know we should be estimating one \(\beta\) for each basis function, let's check

```r
coef(gam_tps_1)
```

```
## (Intercept) s(LicAge).1 s(LicAge).2 s(LicAge).3 s(LicAge).4 s(LicAge).5 
##  -1.5406883   0.3491719   0.6261047  -0.2770389   0.3752651   0.2409351 
## s(LicAge).6 s(LicAge).7 s(LicAge).8 s(LicAge).9 
##  -0.3306367   0.1200613  -0.9979342  -0.4636694
```

So we have a spline basis of dimension 9, producing a smooth

\[f_{LicAge}(LicAge) = \sum^{9}_{j=1}b_j(LicAge)\beta_j\]

The smooth is itself a linear combination: each estimated weight (\(\beta_j\)) scales a basis function \(b_j(x)\). Add up all the \(b_j(x_1)\beta_j\) and you get the smooth function \(f_1(x_1)\).

Let's plot the 9 basis functions \(b_j(x_1)\) which we used for the basis expansion of `LicAge`

```r
dat_summ %>%
  cbind(., predict(gam_tps_1, newdata = ., type = "lpmatrix")) %>%
  as_tibble() %>%
  gather(key = "b_j", value = "b_j_x", 7:15) %>%
  ggplot(aes(x = LicAge, y = b_j_x, color = b_j)) +
  geom_line(linetype = "dashed") +
  scale_y_continuous(name = "b_j(x)") +
  geom_line(data = . %>% mutate(gam_fit_1 = predict(gam_tps_1, ., type = "response")),
            aes(LicAge, gam_fit_1), color = "black")
```

Notice that the first basis function is linear.

Now let's confirm our understanding by getting predictions a few different ways; we should get the same result from each.

```r
cbind(
  # default output is on the linear predictor scale; apply the inverse link function
  predict(gam_tps_1, newdata = tibble(LicAge = 1:10)) %>% as.vector() %>% exp(),
  # ask for output already on the response scale
  predict(gam_tps_1, newdata = tibble(LicAge = 1:10), type = "response"),
  # dot product of the intercept and spline basis for x with the corresponding
  # estimated weights, then apply the inverse link
  (predict(gam_tps_1, newdata = tibble(LicAge = 1:10), type = "lpmatrix") %*%
     coef(gam_tps_1)) %>% as.vector() %>% exp()
)
```

```
##         [,1]      [,2]      [,3]
## 1  0.4168438 0.4168438 0.4168438
## 2  0.3889351 0.3889351 0.3889351
## 3  0.3629308 0.3629308 0.3629308
## 4  0.3388892 0.3388892 0.3388892
## 5  0.3169581 0.3169581 0.3169581
## 6  0.2972915 0.2972915 0.2972915
## 7  0.2800022 0.2800022 0.2800022
## 8  0.2651402 0.2651402 0.2651402
## 9  0.2526882 0.2526882 0.2526882
## 10 0.2425674 0.2425674 0.2425674
```

How did we end up with 9 basis functions? By default this is set for us by the library and is essentially arbitrary – for a univariate smooth like ours, the argument `k` controls the basis dimension and defaults to a value of 10, producing a basis of dimension 9 (`k - 1`). These are selected in order of the amount of variance explained in the radial basis transformation of \(x\) mentioned earlier. This is a vector space of very high dimension, so for computational efficiency we want to select a small subset of the most important functions in that space.

Since smooth functions are estimated with penalized regression (by default), `k` is really setting an upper limit on the degrees of freedom spent in the estimation. The d.f. upper limit is `k - 1`, since 1 d.f. is recovered due to an “identifiability constraint” – a constant basis function is removed from the basis expansion to maintain model identifiability, since an intercept is already included in the model specification outside of the smooth term. In practice, we just want to check that the effective degrees of freedom of the smooth is not too close to the upper limit of `k - 1`. If it is, we would increase `k` to select a few more basis functions from that big vector space and include them in the basis expansion of \(x\), allowing more wiggliness. This approach of creating a large vector space and eigen-decomposing it to select a subset of important functions is the computational innovation of Simon Wood and the `mgcv` package.
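
As an illustrative check of this on simulated data (not the insurance set): compare the edf against the `k - 1` limit, and refit with a larger `k` if the smooth looks saturated. `mgcv` also provides `gam.check()` as a diagnostic for this.

```r
library(mgcv)

set.seed(7)
n <- 1000
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# default k = 10 caps the smooth at 9 effective degrees of freedom
fit_k10 <- gam(y ~ s(x, bs = "tp"), method = "REML")

# enlarging k widens the candidate basis; the penalty still decides the final edf
fit_k20 <- gam(y ~ s(x, bs = "tp", k = 20), method = "REML")

c(edf_k10 = summary(fit_k10)$edf, edf_k20 = summary(fit_k20)$edf)
```

If the edf hugs the `k - 1` ceiling under the default, the larger basis gives the penalty room to choose the complexity rather than having the basis dimension impose it.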

Let's look at the summary again to inspect how complex our model is.

```r
summary(gam_tps_1)
```

```
## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## ClaimInd ~ s(LicAge, bs = "tp")
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.54069    0.01549  -99.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df Chi.sq p-value    
## s(LicAge) 6.065  7.077  97.18  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0313   Deviance explained = 0.517%
## UBRE = -0.57107  Scale est. = 1         n = 42400
```

The value “edf” refers to the estimated or effective degrees of freedom and is a measure of curve complexity – it is sort of like how many parameters are being estimated from the data. In a normal MLE [generalized] linear regression, 1 parameter estimate burns 1 d.f., but in a penalized regression context the complexity penalty has the effect of “shrinking” parameters out of the model. These shrunken estimates do not use a full degree of freedom because they do not fully key-in to the data – they are a compromise between the data and the penalty being imposed against complexity (similar to a prior in the bayesian framework). This is the case with our smooth term – despite a basis dimension of 9, it is more like we are estimating 6 full parameters from the data, due to shrinkage.

Let's look at where we are spending these degrees of freedom

```r
gam_tps_1$edf
```

```
## (Intercept) s(LicAge).1 s(LicAge).2 s(LicAge).3 s(LicAge).4 s(LicAge).5 
##  1.00000000  0.99555993  0.69450590  1.00662557  0.16930169  0.80850259 
## s(LicAge).6 s(LicAge).7 s(LicAge).8 s(LicAge).9 
## -0.01463086  0.38449257  1.02098007  1.00000000
```

```r
gam_tps_1$edf %>% sum()
```

```
## [1] 7.065337
```

Because the reported edf are approximate, terms may fall slightly below 0 or above 1. We can interpret these as indicating the corresponding basis function was shrunken nearly out of the smooth term, or not shrunken at all, respectively. Here we have one basis function shrunken almost completely out of the model.

Our `edf` sum balances with what is reported in the model summary (6.065, plus 1 for the intercept). But we should keep in mind that we are also estimating a smoothing parameter which controls the amount of penalty applied. If we want edf's that account for this parameter as well, then we must look at `edf2`.

```r
gam_tps_1$edf2
```

```
## NULL
```

```r
gam_tps_1$edf2 %>% sum()
```

```
## [1] 0
```

So far we have only considered the univariate case. The real power of GAMs + thin plate spline smooths comes through when we move into smooths of 2 (or more) covariates and the plots become much cooler.

```r
gam_tps_2X <- freMPL6 %>%
  gam(ClaimInd ~ s(LicAge, DrivAge, bs = "tp", k = 15), # a smooth surface on (LicAge, DrivAge)
      offset = log(Exposure),
      family = poisson(link = "log"),
      method = "REML", ## the recommended method, less prone to under-smoothing than GCV
      data = .)

## plotting with persp()
steps <- 30
LicAge  <- with(freMPL6, seq(min(LicAge), max(LicAge), length = steps))
DrivAge <- with(freMPL6, seq(min(DrivAge), max(DrivAge), length = steps))
newdat  <- expand.grid(LicAge = LicAge, DrivAge = DrivAge)
fit_2X  <- matrix(predict(gam_tps_2X, newdat, type = "response"), steps, steps)

persp(LicAge, DrivAge, fit_2X, theta = 120, col = "yellow", ticktype = "detailed")
```

Here we have a “thin plate” spline prediction surface estimated jointly on `DrivAge * LicAge`. The model specification is equivalent to including an interaction between two covariates in the GLM context. We can see that, generally speaking, risk decreases with `DrivAge` and with `LicAge`. But the story is a little more nuanced than either a tilted plane (2 linear terms) or a “saddle” (2 linear terms plus an interaction between them), with a ridge and some depressions appearing on the surface, and some tugging upward at the corners of the surface.

We can increase or decrease the tension of the thin plate, leading to more or less pressure required to flex it. This is the physical interpretation of the penalty-controlling parameter \(\lambda\) we saw earlier. Normally the optimal lambda is estimated for us; let's manually override that and make it fairly large to induce more penalty on wiggliness.

```r
gam_tps_2X <- freMPL6 %>%
  gam(ClaimInd ~ s(LicAge, DrivAge, bs = "tp", k = 15, sp = 100), # large lambda = more tension
      offset = log(Exposure),
      family = poisson(link = "log"),
      method = "REML",
      data = .)

fit_2X <- matrix(predict(gam_tps_2X, newdat, type = "response"), steps, steps)
persp(LicAge, DrivAge, fit_2X, theta = 120, col = "yellow", ticktype = "detailed")
```

Here we can see that the “pressure” of the signal in our data is only strong enough to bend the plate at the corner where both `DrivAge` and `LicAge` are low; elsewhere we have essentially a flat plate, i.e. a plane. We can infer from this that we did not shrink out the linear parts of the smooths, only the wiggly bits. That is the default behavior – shrinkage is applied in the “range space” containing the non-linear basis functions, not to the “null space” with the linear components of each term. But there are a couple of ways to also apply shrinkage here if we like.

There are two ways to induce shrinkage on the linear terms – shrinkage smoothers and the double penalty approach.

The shrinkage smoother approach is used by setting `bs = "ts"` in a call to `s()`. We can do this for some or all terms – one advantage of this approach. It assumes *a priori* that the non-linear bits should be shrunk more than the linear terms.

```r
gam_tps_shrink <- freMPL6 %>%
  gam(ClaimInd ~ s(LicAge, DrivAge, bs = "ts", k = 15), # shrinkage version of the TPS basis
      offset = log(Exposure),
      family = poisson(link = "log"),
      method = "REML",
      data = .)

fit_shrink <- matrix(predict(gam_tps_shrink, newdat, type = "response"), steps, steps)
persp(LicAge, DrivAge, fit_shrink, theta = 120, col = "yellow", ticktype = "detailed")
```

The double penalty approach is used by setting `select = TRUE` in the `gam()` call. The disadvantages of this approach are that it is either on or off for all smooth terms, and that instead of one smoothing parameter we must now estimate two – one each for the null space (linear terms) and the range space (non-linear terms). The advantages are that it treats both linear and non-linear terms the same from the point of view of shrinkage, and that it tends to produce more robust estimates, according to the package author.

```r
gam_tps_dblpen <- freMPL6 %>%
  gam(ClaimInd ~ s(LicAge, DrivAge, bs = "tp", k = 15),
      offset = log(Exposure),
      family = poisson(link = "log"),
      method = "REML",
      select = TRUE, ## an extra penalty placed on the linear terms of smooths
      data = .)

fit_dblpen <- matrix(predict(gam_tps_dblpen, newdat, type = "response"), steps, steps)
persp(LicAge, DrivAge, fit_dblpen, theta = 120, col = "yellow", ticktype = "detailed")
```

The double penalty approach produces more regularization on the linear components than the shrinkage splines, resulting in a more level surface. These options become powerful when you have more than a handful of features and want to perform automatic feature selection, as in a lasso regression.
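
To illustrate that feature-selection behavior with a sketch on simulated data (not the insurance models above): under `select = TRUE`, a covariate carrying no signal can be penalized essentially out of the model, its edf dropping toward zero.

```r
library(mgcv)

set.seed(123)
n  <- 1000
x1 <- runif(n)  # carries the signal
x2 <- runif(n)  # pure noise, unrelated to y
y  <- sin(2 * pi * x1) + rnorm(n, sd = 0.3)

fit_sel <- gam(y ~ s(x1) + s(x2), method = "REML", select = TRUE)

# edf per smooth: s(x1) keeps several d.f., s(x2) is shrunk toward zero
summary(fit_sel)$edf
```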

So far we've seen how we can improve our estimator with more flexible shapes to capture non-linear patterns and with regularization to reduce estimator variance by controlling over-fitting of the data. So let's turn back to the motivation of better reflecting business constraints.

Looking at both our univariate and bivariate models, we see that better data fitting can mean curves/surfaces that are wavy. This can be totally fine depending on the application – for example, if we are modeling the result of some operational process where customers and regulators are not seeing or feeling the effects of modeling decisions so directly. On the other hand, this may be undesirable in a pricing application from the customer experience perspective – in particular, when we have rating factors which increment over time (driver age, license age), the result may be rates that swing up and down with each policy renewal/year.

Can we achieve more flexible curves while also maintaining monotonicity as we had with our log-transform GLM?

```r
library(scam)
```

To do this we need to turn to the `scam` package. `scam` is very similar to `mgcv` in that it allows us to fit GAMs with a variety of spline bases, but `scam` is special because it also allows us to place shape constraints on these models (“scam” stands for shape constrained additive models).

Let's take a look at a model for `LicAge` where we impose a monotonic decreasing condition on the spline basis

```r
scam_mono_d <- freMPL6 %>%
  scam(ClaimInd ~ s(LicAge, bs = "mpd"), ## monotonic decreasing constraint on the basis expansion
       offset = log(Exposure),
       family = poisson(link = "log"),
       data = .)

dat_summ %>%
  mutate(gam_fit_1  = predict(gam_tps_1, ., type = "response"),
         scam_fit_1 = predict(scam_mono_d, ., type = "response")) %>%
  gather(key = "metric", value = "freq", obs_freq, gam_fit_1, scam_fit_1) %>%
  ggplot(aes(x = LicAge, y = freq, color = metric)) +
  geom_line()
```

We've expressed our desire for a non-wavy curve as a shape constraint within the spline basis itself, and estimated the optimal curve from the data under that condition. This is also possible in the bivariate case by using the argument `bs = "tedmd"`, though I will not do so here because it generates warnings due to not being an appropriate model for this data.

Finally let's look at the in-sample estimate of predictive performance for three of our models – the GLM w/ log-transform, the GAM without shape constraints, and our model incorporating business considerations via a monotone decreasing spline basis.

```r
AIC(better_glm, gam_tps_1, scam_mono_d)
```

```
##                   df      AIC
## better_glm  2.000000 26615.96
## gam_tps_1   7.065337 26614.70
## scam_mono_d 5.958760 26616.27
```

Based on AIC (we wouldn't base these decisions entirely on AIC), which penalizes prediction improvement for added model complexity, we would select the unconstrained GAM as our best model. But based on the business considerations and the desire to write more risks in the low age range prospectively, we would choose the GAM with the monotonic constraint.

We started with a GLM which fit the observed experience pretty well with a simple log feature. The unconstrained GAM is able to better match the experience at the low end because it is not constrained by any functional form. But it also finds a wavy pattern throughout the experience and maybe this is not something we would choose to implement (or even a true effect we would believe). By imposing a constraint on the spline basis we can achieve a model which captures the relative risk better than the GLM, but maintains the desired shape, and still performs comparably with the GLM based on AIC (a penalized estimate of out-of-sample prediction performance).

In my mind, one of the greatest advantages of the more flexible GAMs/non-linear methods over a GLM is the perception by our stakeholders and business partners of the quality of our analysis. With less flexible linear methods we must attach more qualifications when sharing visuals of the model output – statements like “this curve is the best fit according to the model, but some adjustments are certainly warranted here and there.” It doesn't sound like your model does a very good job then! I prefer sharing results where I can confidently say that the curves are a best estimate of the underlying signal under reasonable assumptions/constraints – allowing non-linearity but assuming monotonicity; “credibility-weighting” observed experience in areas with thin data. We may even save our own and our colleagues' time from meetings spent fiddling with curves in Excel, trying to find the best compromise fit between higher observed loss and the thin experience which generated it.

In a future Part 2 post I plan to look at additional modeling capabilities, focusing on estimating random effects/mixed models within `mgcv`, and from there moving into fully Bayesian GAMs with `brms` and Stan.

There is a fair amount of open data available from the city and other sources which could be leveraged by residents to improve their neighborhoods, provide transparency and oversight of institutions both public and private, or just to have fun and inform themselves. So toward at least one and hopefully more of these purposes, I write to announce an early (beta) release of haRtisan.

haRtisan is a web app which lets users access, query, and visualize data from and related to the city. As of this version, this includes just one city dataset providing building permit information. Users can filter by neighborhood, as well as by permit characteristics such as the date of application, class of work (e.g. residential, commercial), permit status (e.g. submitted, issued, approved), and others. Click on a permit marker and a popup shows important info about that permit.

Several data elements or filters have choices which are not easily interpreted – for example, *Work Class* can be “Residential” or “Commercial”, but it can also be “Alteration” (which one would think could itself be residential or commercial), “New”, “Tent,” and so on. These options simply reflect all the values that have been entered by the city in the past, and are not a design choice of mine. However, I will be working to get a “data dictionary” added which might clarify how these assignments are decided.

Permit data from the city is updated nightly on data.hartford.gov, and haRtisan automatically pulls it down nightly as well before processing, cleaning, and storing it. Importantly, this includes a *geo-tagging* step – adding latitude and longitude information based on an address search using the OpenStreetMap (OSM) service. Sometimes OSM cannot find an address and does not return a hit – in these cases, I instead use coordinates which are sometimes (though infrequently) provided in the original dataset by the city. When neither option is available, I do not include the records in the map or in the table below. Testing over 5 years, I find this approach results in omission of less than 0.1% of permits. Nevertheless, in a future update I will add a way to see the records that could not be mapped, in order that no permits will ever “slip through the cracks.”
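
The fallback logic can be sketched in a few lines of R (illustrative names only – this is not the actual haRtisan code): prefer the OSM geocode, fall back to the city-provided coordinates, and flag records with neither so they can be excluded.

```r
# hypothetical sketch of the coordinate-fallback step
resolve_coords <- function(osm_lat, osm_lon, city_lat, city_lon) {
  # prefer OSM coordinates; fall back to city-provided ones where OSM missed
  lat <- ifelse(!is.na(osm_lat), osm_lat, city_lat)
  lon <- ifelse(!is.na(osm_lon), osm_lon, city_lon)
  # records with no coordinates from either source are not mappable
  data.frame(lat = lat, lon = lon, mappable = !is.na(lat) & !is.na(lon))
}

resolve_coords(
  osm_lat  = c(41.76,  NA,    NA),
  osm_lon  = c(-72.67, NA,    NA),
  city_lat = c(NA,     41.77, NA),
  city_lon = c(NA,    -72.68, NA)
)
```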

Relying on the city’s data feed constitutes a dependency – if the city does not update its data for a week, then data within haRtisan is not updated either. If addresses or other elements are inaccurate as provided by the city, then they are inaccurate here.

Think of this version as the most minimal functionality possible while still being useful – the term often used is MVP (Minimum Viable Product). I wanted to put this out while I continue to work on adding functionality both to this Building Permits piece as well as by adding other datasets. **If you have thoughts on what data you would like to see included or feedback on what you see so far, please leave a comment here.**

The app can be accessed from the navigation bar at the top of my site (click on “haRtisan”) or by clicking on the image at the top of this post.

The pandemic quarantine was really wearing on me back in September and my PTO balance was stuck at January levels, so a week of forecasted perfect bike-riding weather gave me the impulse to go do a tour for the first time in years. I pulled my old 2006 Fuji Touring out of the attic and started to get some gear together. New panniers were *en route*, but I was missing a front rack, and every shop nearby was out of stock of these and so many other things – apparently I wasn’t the only one rediscovering cycling. The velo gods must have been smiling on me, because just as I was starting to get desperate and considering postponing, I dug through the bins at my local bike co-op and came upon a much sought-after “Jim Blackburn” front rack. Tour on.

If you’re ever thinking about doing an extended bike tour – or even a long day trip – I’ll share something that might save you some discomfort. You need to be conditioned before you embark, and I don’t mean aerobically. You need to condition your posterior with some time in the saddle. Not doing so will mean saddle sores and general soreness regardless of whatever padded bike shorts (chamois) or saddle you use. A week or two of regular riding is what is called for – but being the type of person who always goes for it, I prepared with just two rides a few days before leaving.

I embarked from Hartford around noon on a Wednesday, and about 10 miles in I made the mistake of riding too much in the gutter, for which I was rewarded with a punctured rear tire. Not more than 10 miles later, around Marlborough, CT, I get my second flat – not a great start. Second word of advice here – bring lots of patch kits and spare tubes, not just an optimistic few. By the evening I’m sitting outside eating an Italian dinner in Norwich with some other cyclists who were eating there as well and struck up a conversation when they saw me. Another 25 miles or so and I stop for sleep somewhere off Woodville Road in Woodville, RI.

There is a right way and many wrong ways to stealth camp. The right way is to scout a location out of sight around dusk – in the woods, the back of an old cemetery, somewhere hidden and preferably at a vantage point. A wrong way is to trudge into the woods nearest you when your headlight starts to die at about midnight on a rural road. My spot could have been worse, but after waking up and making some coffee, I exited the woods to find I had chosen a spot almost directly across from a home where the owner’s dog was allowed to roam freely near the road. He must not have been accustomed to people emerging from the woods in the morning, because he was definitely letting his owner know I was there as I rode away. Just down the road I stop to visit the Collins family at a historic cemetery.

I pedal 25 miles that morning and reward my efforts with brunch in Narragansett while recharging the power bank and devices. Crossing the two bridges here is tricky since bikes are not allowed, but detouring north through Providence is too many pointless miles. Heading east, the strategy is to make a run for it on the Jamestown Bridge from Plum Beach, carefully negotiating the narrow protected walkway and apologizing to the city workers who stop to let me know bikes are not allowed on the other side. This tack isn’t an option for the Newport bridge, though, so after admiring the Jamestown windmill I hitch a ride at the entrance ramp from a worker with a pickup truck and some landscaping equipment in the bed, who declines my offer of a few bucks for the trouble. I’ve taken more than a few adventures in my life, and it often seems to be working-class people who are willing to help out when needed. By the way, there is a bus service here which I think runs hourly, but only in season and during certain hours…worth looking into in advance, but on the fly I preferred a streamlined solution.

Now in Newport I notice I’ve popped a couple spokes so it is down to a bike shop (Ten Speed Spokes) to get them replaced. I was rolling the dice traveling without a cassette removal tool which I definitely would not have done on a very rural tour, but with towns usually not more than 10-20 miles apart I figured I could make it to a shop on a bumpy wheel if and when this happened. Luckily there were no more broken spokes for the rest of the ride. After the fix I grab an early dinner at “Pour Judgement” where I could again sit outside and avoid offending anyone as I’ve now been riding for two days. Forty miles after dinner gets me to Shoolman Nature Preserve where I again set up camp in the dark. This was a great nature spot where I could get way off the trail and not worry about early morning hikers.

Second morning waking up in my tent and I’m just outside the Cape at this point – I don’t want to be denied entry, so I’m thinking I need a shower. I stop in the very blue-collar, old-school New England town of Wareham and get beach recommendations from the owner of a diner where I eat. I like the energy in Wareham – nobody is in a rush, the historic buildings remain standing, the waitress is dealing with relationship drama in the service area between waiting tables – it retains its charm from a century earlier, when the major industries were a nail factory and cranberries. Apparently it is also in the top 5% of MA towns for crime rates. The owner and his wife want to tell me about every beach in the area, but I extract from him that he thinks they have showers at Onset Beach. Onset is a separately incorporated tax district/village within Wareham, geographically and economically distinct. It turns out the showers are shut off due to Covid, so the shower becomes a bath in the ocean – just as good.

Ready to make my appearance on the Cape, I pick up the Canal Service Road near Buzzards Bay, an awesome, heavily used recreational path. This leads to a bridge across the canal, and then you continue on the other side’s service road for a few miles before arriving in Sandwich. I honestly didn’t have time to stop and look, but there seemed to be some outdoor music event happening on the beach right where I exit the path. As I push on I find myself on the single road (aside from a highway) running east, Rt. 6A – two lanes and about two inches of shoulder – for 10 miles in what must have been weekender traffic. Definitely the most dangerous leg of the trip so far. I think the preferred bike route heads south before swinging east, whereas I wanted to go straight east toward my campground in Brewster for the night. Maybe there is a lesser-known trail Google Maps could not tell me about…anyway, do whatever is necessary to avoid 6A between Sandwich and about Barnstable, although somewhere around Barnstable you get more residences and nice sidewalks to ride on if needed.

In Brewster there are two campgrounds and I forget which one I chose, but I was happy to have a hot shower and an outlet all to myself for the night. I had a beer with a couple guys who work in IT and another with myself before going to bed.

The next day was by far the best riding of the trip and will be the reason I return to the Cape. From Brewster I head east and pick up the Cape Cod Rail Trail (CCRT), which connects villages from south of Yarmouth right up to the National Seashore in Wellfleet. With the nice weather it is actually a little bit crowded with people on bikes. Orleans in particular stands out as a great village, with bike shops right near the trail and people all around on foot, and I stop here for second breakfast before shooting up to P-town. My biggest mistake on the trip was not leaving enough time to enjoy the stops and explore towns and villages along the route.

As you enter Wellfleet you are entering National Seashore protected land. I take Ocean View Dr. north and start to notice little dirt “ways” with funky signs listing what look like family names. My route says to turn left on Thoreau Way, which at first I think is a private driveway – little more than a dirt trail one vehicle wide. As one of the locals driving through explains when I ask if I’m lost, the land I’m riding through is *sort of* private, but with public access rights, being in a National Park. Another local, an older lady in a car, rolls down the window to let me know “bikes rock!” as we pass each other carefully. Excellent. I’m having so much fun riding through a maze of rocky, hilly dirt trails and getting lost that I forget to take all but one picture. Next visit I will be taking a lot more time to explore the Seashore area over a couple days.

Entrance to a network of dirt “ways” in the National Seashore

The only photo I thought to take here – on Thoreau Way

Leaving the forested area of the Seashore, I shoot up to Provincetown on a more typical road to make the 4:00 ferry to Boston. If liberal-minded people in September were taking a principled stance behind a social responsibility to sacrifice pleasures in the name of social distancing, you wouldn’t know it visiting P-town, which was an all-out party in the street and in the establishments. Only not-OK when a cruder class of person violates these expectations, I suppose? Anyway, I am out as fast as I arrive, enjoying a beer on the ferry.

Entering P-town

Immediately exiting P-town

There are still a lot of miles to cover – from Boston up to Salem, camping off a wooded trail at a golf course (Highland Park), then taking a hard left turn for the two-day return leg heading southwest through Mass and entering CT through the “Quiet Corner.” I’m also leaving out the best camping location – a marked “rest area” located on a bike trail, where I build my one and only campfire of the trip. But this post is turning out longer than intended, so I’m going to leave the rest to the image gallery below. All in all this was a 6-day tour of somewhere around 350 miles. I’ll be planning some more trips in 2021 with my soon-to-be-rebuilt bike – the stripped frame is currently at the powder coater, so I’ll post soon on the rebuild.

I’ve set this blog up as an outlet for recording my thoughts on a few topics which interest me and take up enough of my time to warrant writing on. If you couldn’t guess, they are:

- **Data science** – mostly statistical inference, probability, and predictive modeling…but I expect also to write on topics such as R, scripting, DevOps, Linux, and application development;
- **Insurance** – specifically on current industry topics as they intersect with politics and society;
- **Bikes** – two wheels and two pedals (or four, if we include tandems). I like to ride them, wrench on them, “collect” (aka hoard) them, and sometimes take an extended bikepacking trip with one;
- **The Meaning of Life** – the big one, but I felt it was time for someone to take it on. Anything that might somehow relate either directly or indirectly to the meaning of life is fair game here.

I enjoy learning new (and old but new to me) technologies and applying them at work and at home. Here are the technologies I’m using so far for this blog itself:

- **Virtual Hardware**: I spun up an Ubuntu 18.04 VM hosted in Azure to host the blog/database. I wanted a server instead of a WordPress service so that I can have more control of the backend and build and deploy small applications to incorporate into blog posts for demonstrations. It’s super easy to provision resources in the cloud – knowing what resources to provision and how to make them work together is the hard part (and why ‘Cloud Architect’ is a well-paid profession). I chose the two-core D2s for $70 per month for this first month, but I’ll be scaling down to maybe the single-core B1ms for ~$30 per month.
- **LAMP stack**: Linux, Apache, MySQL, and PHP. From a little browsing I’ve gathered that this is the go-to open-source stack for dynamic websites. I didn’t need to be told to use Linux because I’ve been promoting it at work for the past 5 years, though RHEL is the preferred distro there and I’m running Ubuntu here. Software setup was simple following this guide.
- **WordPress**: self-evident; installed with 2 commands from the shell and configured in under 15 minutes. Here’s another guide – I cheat as much as I can.
- **Writee**: a free WordPress theme. I’m not sure I will stick with this one, but it seemed good to get started and geared toward blogs.

That’s it for now, will catch up soon with some recent content I’ve been saving up for once I got this going.
