CatBoost
is a very popular and high-quality gradient boosting library.
With the latest releases of
bonsai
and
parsnip
,
you can now train CatBoost models from R using the same tidymodels interface you already use for xgboost, LightGBM, and the rest of the boost_tree() family.
Installing CatBoost#
The one wrinkle is installation.
The CatBoost R package is not on CRAN,
so you can’t reach for install.packages("catboost") directly.
Grab the URL for your platform from the CatBoost R installation guide and install it with the remotes package. For example, on an Apple Silicon or Intel Mac:
install.packages("remotes")
remotes::install_url(
"https://github.com/catboost/catboost/releases/download/v1.2.10/catboost-R-darwin-universal2-1.2.10.tgz",
INSTALL_opts = c("--no-multiarch", "--no-test-load", "--no-staged-install")
)Swap in the release version, operating system, and architecture that match your setup. The guide lists the full URL pattern and the binaries available for each release.
Once CatBoost itself is installed, the tidymodels packages is just the usual packages:
# install.packages("pak")
pak::pak(c("tidymodels", "bonsai"))You’ll need bonsai 0.4.1 (or later) and parsnip 1.4.0 (or later), which is where the CatBoost engine landed and got polished.
Fitting a CatBoost model#
CatBoost is supported as an engine for boost_tree().
Loading bonsai registers the engine,
and from there it behaves like any other parsnip model spec:
library(tidymodels)── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
✔ broom 1.0.12.9000 ✔ recipes 1.3.2
✔ dials 1.4.3 ✔ rsample 1.3.2
✔ dplyr 1.2.1 ✔ tailor 0.1.0
✔ ggplot2 4.0.3 ✔ tidyr 1.3.2
✔ infer 1.1.0 ✔ tune 2.1.0.9000
✔ modeldata 1.5.1 ✔ workflows 1.3.0
✔ parsnip 1.6.0 ✔ workflowsets 1.1.1
✔ purrr 1.2.2 ✔ yardstick 1.4.0
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ recipes::step() masks stats::step()
library(bonsai)
cat_spec <-
boost_tree(trees = 500, learn_rate = 0.05) |>
set_engine("catboost") |>
set_mode("regression")
cat_fit <- fit(cat_spec, mpg ~ ., data = mtcars)
cat_fitparsnip model object
CatBoost model (500 trees)
Loss function: RMSE
Fit to 10 feature(s)
Tuning#
The CatBoost engine supports the standard boost_tree() tuning parameters,
and the recent releases made tuning both faster and more correct.
A typical tuning setup looks like this:
cat_spec <-
boost_tree(trees = tune(), learn_rate = tune()) |>
set_engine("catboost") |>
set_mode("regression")
cat_wf <- workflow(mpg ~ ., cat_spec)
set.seed(123)
folds <- vfold_cv(mtcars, v = 5)
tune_res <- tune_grid(cat_wf, resamples = folds, grid = 20)
show_best(tune_res, metric = "rmse")# A tibble: 5 × 8
trees learn_rate .metric .estimator mean n std_err .config
<int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1473 0.0379 rmse standard 2.76 5 0.419 pre0_mod15_post0
2 527 0.0513 rmse standard 2.81 5 0.443 pre0_mod06_post0
3 1053 0.0941 rmse standard 2.83 5 0.447 pre0_mod11_post0
4 211 0.127 rmse standard 2.84 5 0.400 pre0_mod03_post0
5 632 0.234 rmse standard 2.86 5 0.480 pre0_mod07_post0
Thanks to the
submodel trick
tuning trees doesn’t require refitting the model from scratch at every candidate value.
Wrapping up#
CatBoost is a great addition to the gradient boosting options available in tidymodels, especially if you work with categorical features or want a strong out-of-the-box model.
For the full details, see the bonsai changelog and the parsnip 1.4.0 release notes .
