# Tune R tidymodels with the Python optuna package

By Niels van der Velden in reticulate R Python Machine Learning

February 16, 2022

## Introduction

I really like `tidymodels`

to build Machine Learning models in R because of its `tidy`

way of implementing all the data processing steps. However, when it comes to hyperparameter tuning I found the options very limited and not as easy to use as for instance the Python `optuna`

package (see my
previous post). Luckily, there is the `reticulate`

package which allows you to run R code in Python which makes it possible to tune R models using any Python package.

In this post I will show how to use `tidymodels`

to set op a `xgboost`

model to predict flower species. I will then tune the hyperparameters directly in R using a grid search and in Python using `optuna`

.

## Build a tidymodel

Required packages

```
#Load all libraries
library(reticulate)
library(tidyverse)
library(tidymodels)
library(xgboost)
library(ggplot2)
library(knitr)
```

To build a simple Machine Learning model I will use the Iris dataset. This is a built-in dataset in R that contains measurements on 4 different attributes (in centimeters) for 50 flowers from 3 different species. We can use the different attributes to fit a `xgboost`

model that will predict the flower species.

```
set.seed(123)
#Load data
df <- as_tibble(iris)
head(df)
```

```
## # A tibble: 6 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
```

To train and test the model we split the data into train, test and 10-Fold Cross-Validation set.

```
# Split data
split <- initial_split(df, strata = Species)
train <- training(split)
test <- testing(split)
folds <- vfold_cv(train)
```

An `xgboost`

model is fitted and the average accuracy is calculated for the K-folds. Using the standard parameters an accuracy of 0.911 is achieved.

```
#Model
xg_model <-
boost_tree(
trees = 1000
) %>%
set_engine("xgboost") %>%
set_mode("classification")
#Recipe
xg_recipe <-
recipe(Species ~., data=split)
#Workflow
xg_workflow <-
workflow() %>%
add_recipe(xg_recipe) %>%
add_model(xg_model)
#Fit
xg_fit <-
xg_workflow %>%
last_fit(split)
#Accuracy on test data
#accuracy <- xg_fit %>% collect_metrics()
xg_fit_rs <-
xg_workflow %>%
fit_resamples(folds)
accuracy_xg <- xg_fit_rs %>% collect_metrics()
accuracy_xg
```

```
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.911 10 0.0255 Preprocessor1_Model1
## 2 roc_auc hand_till 0.960 10 0.0216 Preprocessor1_Model1
```

Next we can tune the parameters using a grid search with a size of `30`

```
set.seed(123)
#Set tuning grid
xgb_grid <- grid_latin_hypercube(
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), train),
learn_rate(),
size = grid_size
)
#Tune the model using a latin hypercube.
#Model
xg_model <-
boost_tree(
trees = 1000,
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
sample_size = tune(),
mtry = tune(),
learn_rate = tune(),
) %>%
set_engine("xgboost") %>%
set_mode("classification")
#Workflow
xg_workflow <-
workflow() %>%
add_recipe(xg_recipe) %>%
add_model(xg_model)
xgb_res <- tune_grid(
xg_workflow ,
resamples = folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE)
)
best_auc <- select_best(xgb_res, "roc_auc")
best_auc
```

```
## # A tibble: 1 x 7
## mtry min_n tree_depth learn_rate loss_reduction sample_size .config
## <int> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 2 6 3 0.00000689 2.14e-10 0.875 Preprocessor1_Mo~
```

The model is fit with the best found parameters and accuracy is calculated on the K-folds.

```
final_xgb <- finalize_workflow(
xg_workflow,
best_auc
)
set.seed(123)
xg_fit_rs <-
final_xgb %>%
fit_resamples(folds)
accuracy_xg <- xg_fit_rs %>% collect_metrics()
accuracy_xg
```

```
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.928 10 0.0181 Preprocessor1_Model1
## 2 roc_auc hand_till 0.987 10 0.00581 Preprocessor1_Model1
```

After tuning an accuracy of `0.928`

was achieved which is a slight improvement over the 0.911 using the standard parameters.

## Tune R model using Python

The grid search only lead to a slight improvement of the accuracy. Can we achieve better results using the `optuna`

package? Because we need to run `optuna`

in Python we need to wrap the `tidymodel`

that we created above into a function. Notice that I used the `rlang`

!! injection operator to define the variables. This is because `NULL`

values result in an error due to delayed evaluation of the variables by the `boost_tree()`

function.

```
tune_r_model <- function(trees = 1000, tree_depth = NULL, min_n = NULL, loss_reduction = NULL, sample_size = NULL, mtry = NULL, learn_rate = NULL){
set.seed(123)
xg_model <-
boost_tree(
trees = !!trees,
tree_depth = !!tree_depth,
min_n = !!min_n,
loss_reduction = !!loss_reduction,
sample_size = !!sample_size,
mtry = !!mtry,
learn_rate = !!learn_rate
) %>%
set_engine("xgboost") %>%
set_mode("classification")
#Recipe
xg_recipe <-
recipe(Species ~., data=split)
#Workflow
xg_workflow <-
workflow() %>%
add_recipe(xg_recipe) %>%
add_model(xg_model)
xg_fit_rs <-
xg_workflow %>%
fit_resamples(folds)
accuracy_xg <- xg_fit_rs %>% collect_metrics()
return(accuracy_xg$mean[1])
}
```

Now that we wrapped the model in an R function we can use `optuna`

to suggest values for the different parameters. Using the `reticulate`

package (see:
link) we can run Python code directly in Rstudio and call the `tune_R_model`

function that we have defined above simply by adding an `r.`

in front of the functions name. In order to compare the results of `optuna`

with the grid search that was performed using `tidymodels`

I use a search space for each suggested value with upper and lower bounds equal to the min and max values that were used in the grid. I then run `30`

optimization trials which is equal to the size of the grid.

```
import optuna
from optuna.samplers import TPESampler
import math
def objective(trial):
tree_depth = trial.suggest_int('tree_depth',1, 15)
min_n = trial.suggest_int('min_n',2, 40)
loss_reduction = trial.suggest_float("loss_reduction", 1e-8, 1.0, log=True)
sample_size = trial.suggest_float("sample_size", 0.2, 1.0)
mtry = trial.suggest_int("mtry", 1, 5)
learn_rate = trial.suggest_float("learn_rate", 1e-8, 1.0, log=True)
out = r.tune_r_model(
trees = 1000,
tree_depth = tree_depth,
min_n = min_n,
loss_reduction = loss_reduction,
sample_size = sample_size,
mtry = mtry,
learn_rate = learn_rate
)
return out
study = optuna.create_study(
direction="maximize",
sampler=TPESampler(seed=123) #For reproducible results
)
study.optimize(objective, n_trials=r.grid_size)
print("The best found Accuracy = {}".format(round(study.best_value,3)))
```

```
## The best found Accuracy = 0.937
```

```
print("With Parameters: {}".format(study.best_params))
```

```
## With Parameters: {'tree_depth': 9, 'min_n': 5, 'loss_reduction': 1.7181734243476516e-07, 'sample_size': 0.7143436449949954, 'mtry': 2, 'learn_rate': 1.0998448330505116e-05}
```

The best accuracy is `0.937`

which is quite a bit better then an accuracy of `0.928`

which was found using the grid search.

## Conclusion

The `reticulate`

package makes it very easy to combine R and Python code. To demonstrate this I build a Machine Learning model to predict Iris flower species in R using the `tidymodels`

package. I then tuned the model using the python `optuna`

package. This resulted in a higher accuracy model compared to performing a random grid search directly in R using `tidymodels`

.