[ECC DS Week 9] Tutorial: CatBoost Overview
0. Introduction
- CatBoost is a gradient boosting algorithm that uses decision trees as its base models
- CatBoost can be applied to problems such as:
  - classification (binary and multiclass)
  - regression
  - ranking
- These tasks differ in their objective function - the quantity minimized during gradient descent
- CatBoost also comes with prebuilt evaluation metrics for measuring model accuracy
📍 Strengths of CatBoost
1. ๋ฒ์ฃผํ ๋ณ์์ ๋ํ ์ง์
-
๋ฒ์ฃผํ
๋ณ์๊ฐ ์๋ ๋ฐ์ดํฐ์ ๊ฒฝ์ฐ CatBoost์ ์ ํ๋๊ฐ ๋ค๋ฅธ ์๊ณ ๋ฆฌ์ฆ์ ๋นํด ๋ ์ข์ -
๋ฒ์ฃผํ ๋ณ์์ ๋ํ ์ ์ฒ๋ฆฌ(ex> one-hot encoding)๊ฐ ํ์ x
- ๋ช๋ช
ํ์ดํผ ํ๋ผ๋ฏธํฐ
๋ง ์ง์ ํ๋ฉด ok
- ๋ช๋ช
2. ๊ณผ์ ํฉ(overfitting) ํธ๋ค๋ง์ด ์ฉ์ด
-
CatBoost๋ ๊ธฐ์กด ๋ถ์คํ ์๊ณ ๋ฆฌ์ฆ์ ๋์์ธ
์์ํ๋
๋ถ์คํ ๊ตฌํ์ ํ์ฉ -
์๋ฅผ ๋ค์ด, gradient boosting์ ์์ ๋ฐ์ดํฐ์ ์ ๋ํด ์ฝ๊ฒ ๊ณผ์ ํฉ๋จ
- Catboost์์๋ ์ด๋ฌํ ๊ฒฝ์ฐ์ ๋ํ ํน๋ณํ ๋ณํ์ด ์กด์ฌ -> ๊ณผ์ ํฉ ๋ฐฉ์ง
3. ๋น ๋ฅด๊ณ GPU ํ์ต์ ํ์ฉํ๊ธฐ์ ์ฉ์ด
- GPU ํ์ต์ ์ง์
4. ๋ค๋ฅธ ์ ์ฉํ ํผ์ณ๋ค
-
๊ฒฐ์ธก์น์ ๋ํ ์ฒ๋ฆฌ
-
ํ๋ฅญํ ์๊ฐํ
1. Installing CatBoost
# version 0.14.2 or later is required for this example
!pip install catboost
Collecting catboost
  Downloading catboost-1.2-cp310-cp310-manylinux2014_x86_64.whl (98.6 MB)
Successfully installed catboost-1.2
import catboost
print(catboost.__version__)
1.2
2. ๋ถ๋ฅ ์์ ์ํ
2-1. ๋ผ์ด๋ธ๋ฌ๋ฆฌ & ๋ฐ์ดํฐ import
### ๋ถ๋ฅ๋ฅผ ์ํ ๋ชจ๋ import
from catboost import CatBoostClassifier
### ๋ฐ์ดํฐ ์ค๋น
from catboost import datasets
train_df, test_df = datasets.amazon() # ์ค์ง ๋ฒ์ฃผํ ๋ณ์๋ง ์๋ ๊ทผ์ฌํ ๋ฐ์ดํฐ์
train_df.shape, test_df.shape
((32769, 10), (58921, 10))
### ๋ฐ์ดํฐ ํ์ธ
train_df.head()
   ACTION  RESOURCE  MGR_ID  ROLE_ROLLUP_1  ROLE_ROLLUP_2  ROLE_DEPTNAME  \
0       1     39353   85475         117961         118300         123472
1       1     17183    1540         117961         118343         123125
2       1     36724   14457         118219         118220         117884
3       1     36135    5396         117961         118343         119993
4       1     42680    5905         117929         117930         119569

   ROLE_TITLE  ROLE_FAMILY_DESC  ROLE_FAMILY  ROLE_CODE
0      117905            117906       290919     117908
1      118536            118536       308574     118539
2      117879            267952        19721     117880
3      118321            240983       290919     118322
4      119323            123932        19793     119325
- train_df contains the label (target) column, yet test_df has the same number of columns
- does the test_df dataset contain the target as well?
test_df.head()
   id  RESOURCE  MGR_ID  ROLE_ROLLUP_1  ROLE_ROLLUP_2  ROLE_DEPTNAME  \
0   1     78766   72734         118079         118080         117878
1   2     40644    4378         117961         118327         118507
2   3     75443    2395         117961         118300         119488
3   4     43219   19986         117961         118225         118403
4   5     42093   50015         117961         118343         119598

   ROLE_TITLE  ROLE_FAMILY_DESC  ROLE_FAMILY  ROLE_CODE
0      117879            118177        19721     117880
1      118863            122008       118398     118865
2      118172            301534       249618     118175
3      120773            136187       118960     120774
4      118422            300136       118424     118425
- No, it does not contain the target value
- instead, an extra id column is present
- The dataset stores everything as numbers, but these features are really codes for various employee attributes such as the manager ID and the company role code
- they should therefore be interpreted as categorical features
2-2. Preparing the data
y = train_df['ACTION']
X = train_df.drop(columns = 'ACTION')
### ๋๋ X๋ฅผ ๋ค์๊ณผ ๊ฐ์ด ์ค๋นํด๋ OK
# X = train_df.drop('ACTION', axis = 1)
X_test = test_df.drop(columns = 'id') # ๋ถํ์ํ id ์ปฌ๋ผ ์ ๊ฑฐ
### ์ดํ ๋์ผํ ๋ฐ์ดํฐ๋ฅผ ๋ค์ ์์ฑํ ์ ์๋๋ก seed ๊ฐ ์ค์
SEED = 1
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.25, random_state = SEED)
2-3. Modeling
a) 1st - baseline model
%%time
# ์ํ ์๊ฐ์ ์ธก์ ํ์!
### ํ๋ผ๋ฏธํฐ ๋ชฉ๋ก
params = {'loss_function': 'Logloss', # ์์คํจ์(๋ชฉ์ ํจ์, objective fuction)
'eval_metric':'AUC', # ํ๊ฐ ์งํ(metric)
'verbose': 200, # 200ํ ๋ฐ๋ณตํ ๋๋ง๋ค ๊ต์ก ๊ณผ์ ์ ๋ํ ์ ๋ณด๋ฅผ stdout์ผ๋ก ์ถ๋ ฅ
'random_seed': SEED # seed ์ค์
}
cbc_1 = CatBoostClassifier(**params) # ๋ชจ๋ธ ๊ฐ์ฒด ์ ์ธ
cbc_1.fit(X_train, y_train, # ํ์ตํ ๋ฐ์ดํฐ
eval_set = (X_valid, y_valid), # ๊ฒ์ฆ์ฉ ๋ฐ์ดํฐ
use_best_model = True, # ๋ชจ๋ธ์ด ํญ์ ์ต์ ํ๋ผ๋ฏธํฐ๋ก ํ๋๋ ์ํ๋ฅผ ์ ์งํ๋๋ก
plot = True # ์๊ฐํ
);
Learning rate set to 0.069882
0:    test: 0.5400959    best: 0.5400959 (0)    total: 57.5ms    remaining: 57.4s
200:  test: 0.8020842    best: 0.8020842 (200)  total: 2.04s     remaining: 8.13s
400:  test: 0.8237941    best: 0.8237941 (400)  total: 5.24s     remaining: 7.83s
600:  test: 0.8328464    best: 0.8330283 (598)  total: 7.02s     remaining: 4.66s
800:  test: 0.8366271    best: 0.8370599 (785)  total: 8.88s     remaining: 2.21s
999:  test: 0.8417832    best: 0.8417832 (999)  total: 10.6s     remaining: 0us
bestTest = 0.8417831567
bestIteration = 999
CPU times: user 15.5 s, sys: 942 ms, total: 16.5 s
Wall time: 10.7 s
<catboost.core.CatBoostClassifier at 0x7f05f08678b0>
- The model might score better with more training iterations (increase iterations; default = 1000)
- Above all, we should tell CatBoost which features are categorical
- In the model above no categorical features were specified, so CatBoost treated them all as numerical
- this imposes an artificial order between the categories (class 2 > class 1)
- Pass cat_features = [i1,i2,...,in] to declare which features are categorical
### Column indices that CatBoost should treat as categorical
# every feature in this dataset is categorical
cat_features = list(range(X.shape[1]))
print(cat_features)
[0, 1, 2, 3, 4, 5, 6, 7, 8]
### ๋ฐฉ๋ฒ 2)
cat_features_names = X.columns # categorical features์ ์ด๋ฆ์ ๊ตฌ์ฒด์ ์ผ๋ก ๋ช
์
cat_features = [X.columns.get_loc(col) for col in cat_features_names]
print(cat_features)
[0, 1, 2, 3, 4, 5, 6, 7, 8]
### ๋ฐฉ๋ฒ 3)
condition = True # ๋ฒ์ฃผํ ํน์ง์ ์ด๋ฆ์ผ๋ก๋ง ์ถฉ์กฑ๋์ด์ผ ํ๋ ์กฐ๊ฑด์ ์ง์
cat_features_names = [col for col in X.columns if condition]
cat_features = [X.columns.get_loc(col) for col in cat_features_names]
print(cat_features)
[0, 1, 2, 3, 4, 5, 6, 7, 8]
b) 2nd - specifying the categorical features
### Retrain with the list of categorical features made explicit
%%time
params = {'loss_function': 'Logloss',
'eval_metric': 'AUC',
'cat_features': cat_features, # declare the categorical features
'verbose': 200,
'random_seed': SEED
}
cbc_2 = CatBoostClassifier(**params)
cbc_2.fit(X_train, y_train,
eval_set = (X_valid, y_valid),
use_best_model = True,
plot = True
);
Learning rate set to 0.069882
0:    test: 0.5637606    best: 0.5637606 (0)    total: 69.5ms    remaining: 1m 9s
200:  test: 0.8955617    best: 0.8955872 (198)  total: 15.2s     remaining: 1m
400:  test: 0.8985912    best: 0.8987220 (386)  total: 32s       remaining: 47.9s
600:  test: 0.9004468    best: 0.9005457 (595)  total: 43s       remaining: 28.6s
800:  test: 0.8997008    best: 0.9007469 (631)  total: 55.5s     remaining: 13.8s
999:  test: 0.8985767    best: 0.9007469 (631)  total: 1m 9s     remaining: 0us
bestTest = 0.9007468588
bestIteration = 631
Shrink model to first 632 iterations.
CPU times: user 1min 35s, sys: 2.02 s, total: 1min 37s
Wall time: 1min 10s
<catboost.core.CatBoostClassifier at 0x7f05f08dc6a0>
- Performance improved compared to the previous run
- Overall training time increased, but the best score was reached in fewer iterations (iteration 631)
- Setting early_stopping_rounds = N stops training early: if the metric has not improved for N rounds, training is halted
### Apply early stopping
%%time
params = {'loss_function': 'Logloss',
'eval_metric': 'AUC',
'cat_features': cat_features,
'early_stopping_rounds': 200,
'verbose': 200,
'random_seed': SEED
}
cbc_2 = CatBoostClassifier(**params)
cbc_2.fit(X_train, y_train,
eval_set = (X_valid, y_valid),
use_best_model = True,
plot = True
);
Learning rate set to 0.069882
0:    test: 0.5637606    best: 0.5637606 (0)    total: 46ms      remaining: 45.9s
200:  test: 0.8955617    best: 0.8955872 (198)  total: 10.6s     remaining: 42s
400:  test: 0.8985912    best: 0.8987220 (386)  total: 24.3s     remaining: 36.3s
600:  test: 0.9004468    best: 0.9005457 (595)  total: 35.2s     remaining: 23.4s
800:  test: 0.8997008    best: 0.9007469 (631)  total: 50.2s     remaining: 12.5s
Stopped by overfitting detector (200 iterations wait)
bestTest = 0.9007468588
bestIteration = 631
Shrink model to first 632 iterations.
CPU times: user 1min 18s, sys: 1.6 s, total: 1min 20s
Wall time: 53.3 s
<catboost.core.CatBoostClassifier at 0x7f05f08dfa30>
c) 3rd - training on GPU
- By default, CatBoost runs its computations on the CPU
- To compute on the GPU instead, set task_type = 'GPU'
※ In Colab, change the runtime type to GPU before running!
### Use the GPU
%%time
params = {'loss_function': 'Logloss',
'eval_metric': 'AUC',
'cat_features': cat_features,
'task_type': 'GPU',
'verbose': 200,
'random_seed': SEED
}
cbc_3 = CatBoostClassifier(**params)
cbc_3.fit(X_train, y_train,
eval_set = (X_valid, y_valid),
use_best_model = True,
plot = True
);
Learning rate set to 0.054241
Default metric period is 5 because AUC is/are not implemented for GPU
0:    test: 0.6184174    best: 0.6184174 (0)    total: 78.7ms    remaining: 1m 18s
200:  test: 0.8792616    best: 0.8792616 (200)  total: 10.7s     remaining: 42.6s
400:  test: 0.8832826    best: 0.8833058 (390)  total: 20.7s     remaining: 30.9s
600:  test: 0.8845304    best: 0.8845304 (600)  total: 33.5s     remaining: 22.3s
800:  test: 0.8854544    best: 0.8854544 (800)  total: 42.4s     remaining: 10.5s
999:  test: 0.8866701    best: 0.8867319 (995)  total: 58.1s     remaining: 0us
bestTest = 0.886731863
bestIteration = 995
Shrink model to first 996 iterations.
CPU times: user 52.6 s, sys: 14.3 s, total: 1min 6s
Wall time: 59.7 s
<catboost.core.CatBoostClassifier at 0x7f05f08dfbb0>
- Performance did not improve significantly
- Some hyperparameters are only available in GPU mode:
  - grow_policy: the tree-growing policy
  - min_data_in_leaf: the minimum number of training samples in a leaf
  - max_leaves: the maximum number of leaves in the resulting tree
- On some datasets, GPU training takes far less time
- GPU training can be accelerated further by setting border_count = N
  - N: the number of splits considered for each feature
  - the official documentation suggests setting it to 32 for GPU training
  - in many cases this barely affects model quality but greatly improves training speed
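What border_count controls can be pictured as feature quantization: before split search, each numeric feature is discretized into at most N buckets. A rough stdlib sketch of quantile-style binning, for intuition only (CatBoost's actual border-selection algorithm differs):

```python
def quantile_borders(values, border_count):
    """Pick up to `border_count` split thresholds at evenly spaced quantiles."""
    srt = sorted(values)
    borders = []
    for i in range(1, border_count + 1):
        q = srt[i * (len(srt) - 1) // (border_count + 1)]
        if not borders or q > borders[-1]:  # keep borders strictly increasing
            borders.append(q)
    return borders

def quantize(value, borders):
    """Map a raw value to its bin index (the number of borders it exceeds)."""
    return sum(value > b for b in borders)

data = [0.1, 0.4, 0.4, 0.7, 1.5, 2.0, 3.3, 9.9]
borders = quantile_borders(data, 3)
print(borders, [quantize(v, borders) for v in data])
```

With fewer borders there are fewer candidate splits to evaluate per feature, which is why a small border_count speeds up training at a possible (often negligible) cost in accuracy.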
d) 4th - tuning GPU parameters
%%time
params = {'loss_function': 'Logloss',
'eval_metric': 'AUC',
'cat_features': cat_features,
'task_type': 'GPU',
'border_count': 32,
'verbose': 200,
'random_seed': SEED
}
cbc_4 = CatBoostClassifier(**params)
cbc_4.fit(X_train, y_train,
eval_set = (X_valid, y_valid),
use_best_model = True,
plot = True
);
Learning rate set to 0.054241
Default metric period is 5 because AUC is/are not implemented for GPU
0:    test: 0.6184174    best: 0.6184174 (0)    total: 34.2ms    remaining: 34.1s
200:  test: 0.8792616    best: 0.8792616 (200)  total: 8.22s     remaining: 32.7s
400:  test: 0.8832826    best: 0.8833058 (390)  total: 17.8s     remaining: 26.6s
600:  test: 0.8845304    best: 0.8845304 (600)  total: 27.9s     remaining: 18.5s
800:  test: 0.8854544    best: 0.8854544 (800)  total: 38.6s     remaining: 9.59s
999:  test: 0.8866701    best: 0.8867319 (995)  total: 53.8s     remaining: 0us
bestTest = 0.886731863
bestIteration = 995
Shrink model to first 996 iterations.
CPU times: user 50.7 s, sys: 11.6 s, total: 1min 2s
Wall time: 54.6 s
<catboost.core.CatBoostClassifier at 0x7f05f08df4f0>
- ์ฑ๋ฅ์ ํฐ ์ฐจ์ด๋ ์์ง๋ง ์๋๋ ํจ์ฌ ๋นจ๋ผ์ก์
e) 5th, 6th - feature selection
- In some cases you may suspect that certain features feed wrong information to the model
- To experiment with this, you can fabricate noisy pieces of data, or specify ignored_features = [i1,i2,...,in], the list of column indices the model should ignore
### ์คํ 1) ๋ฐ์ดํฐ ์กฐ๊ฐ ๋ง๋ค๊ธฐ
import numpy as np
import warnings
warnings.filterwarnings("ignore")
np.random.seed(SEED)
noise_cols = [f'noise_{i}' for i in range(5)]
for col in noise_cols:
X_train[col] = y_train * np.random.rand(X_train.shape[0])
X_valid[col] = np.random.rand(X_valid.shape[0])
X_train.head()
       RESOURCE  MGR_ID  ROLE_ROLLUP_1  ROLE_ROLLUP_2  ROLE_DEPTNAME  \
16773     27798    1350         117961         118052         122938
23491     80701    4571         117961         118225         119924
32731     34039    5113         117961         118300         119890
7855      42085    4733         118290         118291         120126
16475     16358    6046         117961         118446         120317

       ROLE_TITLE  ROLE_FAMILY_DESC  ROLE_FAMILY  ROLE_CODE   noise_0  \
16773      117905            117906       290919     117908  0.417022
23491      118685            279443       308574     118687  0.720324
32731      119433            133686       118424     119435  0.000114
7855       118980            166203       118295     118982  0.302333
16475      307024            306404       118331     118332  0.146756

        noise_1   noise_2   noise_3   noise_4
16773  0.097850  0.665600  0.979025  0.491624
23491  0.855900  0.311763  0.929346  0.391708
32731  0.287838  0.896624  0.704050  0.606467
7855   0.264320  0.482195  0.028493  0.182570
16475  0.022876  0.009307  0.726750  0.623357
%%time
params = {'loss_function': 'Logloss',
'eval_metric': 'AUC',
'cat_features': cat_features,
'verbose': 200,
'random_seed': SEED
}
cbc_5 = CatBoostClassifier(**params)
cbc_5.fit(X_train, y_train,
eval_set = (X_valid, y_valid),
use_best_model = True,
plot = True
);
Learning rate set to 0.069882
0:    test: 0.4990944    best: 0.4990944 (0)    total: 78.3ms    remaining: 1m 18s
200:  test: 0.5831370    best: 0.5894476 (7)    total: 9.93s     remaining: 39.5s
400:  test: 0.5831376    best: 0.5894476 (7)    total: 17s       remaining: 25.4s
600:  test: 0.5831376    best: 0.5894476 (7)    total: 25.4s     remaining: 16.9s
800:  test: 0.5831378    best: 0.5894476 (7)    total: 32.9s     remaining: 8.16s
999:  test: 0.5831381    best: 0.5894476 (7)    total: 38.7s     remaining: 0us
bestTest = 0.5894475816
bestIteration = 7
Shrink model to first 8 iterations.
CPU times: user 1min 4s, sys: 1.25 s, total: 1min 5s
Wall time: 39.2 s
<catboost.core.CatBoostClassifier at 0x7f05eeb78ac0>
- ์ฑ๋ฅ์ด ํฌ๊ฒ ํ๋ฝํจ
### ์คํ 2) ๋ฌด์ํ ์ปฌ๋ผ ๋ชฉ๋ก ์ง์
ignored_features = list(range(X_train.shape[1] - 5, X_train.shape[1]))
print(ignored_features)
[9, 10, 11, 12, 13]
%%time
params = {'loss_function': 'Logloss',
'eval_metric': 'AUC',
'cat_features': cat_features,
'ignored_features': ignored_features, # features to ignore
'early_stopping_rounds': 200,
'verbose': 200,
'random_seed': SEED
}
cbc_6 = CatBoostClassifier(**params)
cbc_6.fit(X_train, y_train,
eval_set = (X_valid, y_valid),
use_best_model = True,
plot = True
);
Learning rate set to 0.069882
0:    test: 0.5637606    best: 0.5637606 (0)    total: 42.2ms    remaining: 42.1s
200:  test: 0.8955617    best: 0.8955872 (198)  total: 10.4s     remaining: 41.3s
400:  test: 0.8985912    best: 0.8987220 (386)  total: 21.7s     remaining: 32.4s
600:  test: 0.9004468    best: 0.9005457 (595)  total: 33.4s     remaining: 22.2s
800:  test: 0.8997008    best: 0.9007469 (631)  total: 45s       remaining: 11.2s
Stopped by overfitting detector (200 iterations wait)
bestTest = 0.9007468588
bestIteration = 631
Shrink model to first 632 iterations.
CPU times: user 1min 18s, sys: 1.44 s, total: 1min 20s
Wall time: 47 s
<catboost.core.CatBoostClassifier at 0x7f05eeb7b8b0>
### ์คํ 1๋ก ๋ง๋ noise๋ฅผ ์ ๊ฑฐํ๊ณ ๋ฐ์ดํฐ๋ฅผ ์๋ ์ํ๋ก ๋๋ฆฌ๊ธฐ
X_train = X_train.drop(columns = noise_cols)
X_valid = X_valid.drop(columns = noise_cols)
X_train.head()
       RESOURCE  MGR_ID  ROLE_ROLLUP_1  ROLE_ROLLUP_2  ROLE_DEPTNAME  \
16773     27798    1350         117961         118052         122938
23491     80701    4571         117961         118225         119924
32731     34039    5113         117961         118300         119890
7855      42085    4733         118290         118291         120126
16475     16358    6046         117961         118446         120317

       ROLE_TITLE  ROLE_FAMILY_DESC  ROLE_FAMILY  ROLE_CODE
16773      117905            117906       290919     117908
23491      118685            279443       308574     118687
32731      119433            133686       118424     119435
7855       118980            166203       118295     118982
16475      307024            306404       118331     118332
f) 7th - using the Pool object
- CatBoost can also accept a Pool object as its training data
- a Pool bundles the data and labels with a one-dimensional array of categorical column indices (specified as integers) or names (specified as strings)
from catboost import Pool
train_data = Pool(data = X_train,
label = y_train,
cat_features = cat_features
)
valid_data = Pool(data = X_valid,
label = y_valid,
cat_features = cat_features
)
%%time
params = {'loss_function':'Logloss',
'eval_metric':'AUC',
# 'cat_features': cat_features, # already specified in the Pool objects
'early_stopping_rounds': 200,
'verbose': 200,
'random_seed': SEED
}
cbc_7 = CatBoostClassifier(**params)
cbc_7.fit(train_data, # instead of X_train, y_train
eval_set = valid_data, # instead of (X_valid, y_valid)
use_best_model = True,
plot = True
);
Learning rate set to 0.069882
0:    test: 0.5637606    best: 0.5637606 (0)    total: 44.6ms    remaining: 44.6s
200:  test: 0.8955617    best: 0.8955872 (198)  total: 13.6s     remaining: 54.2s
400:  test: 0.8985912    best: 0.8987220 (386)  total: 27.7s     remaining: 41.4s
600:  test: 0.9004468    best: 0.9005457 (595)  total: 49.6s     remaining: 32.9s
800:  test: 0.8997008    best: 0.9007469 (631)  total: 1m 2s     remaining: 15.5s
Stopped by overfitting detector (200 iterations wait)
bestTest = 0.9007468588
bestIteration = 631
Shrink model to first 632 iterations.
CPU times: user 1min 19s, sys: 1.65 s, total: 1min 20s
Wall time: 1min 4s
<catboost.core.CatBoostClassifier at 0x7f05f08df190>
📍 Why use a Pool object?
- For example, part of the data may be outdated or inaccurate
  - Pool.set_weight() assigns a weight to each instance (= row) of the data
- Pool can also be used to split the data into groups
  - assign set_group_id() and use Pool.set_group_weight() to apply different weights to different groups
- A baseline can be supplied
  - Pool.set_baseline() provides an initial prediction value for every input object
  - training then starts from the supplied values rather than from zero
- A Pool object is a convenient way to attach this kind of side information to the data
g) 8th - cross validation
from catboost import cv
%%time
params = {'loss_function': 'Logloss',
'eval_metric': 'AUC',
'verbose': 200,
'random_seed': SEED
}
all_train_data = Pool(data = X,
label = y,
cat_features = cat_features
)
scores = cv(pool = all_train_data,
params = params,
fold_count = 4,
seed = SEED,
shuffle = True,
stratified = True, # if True, folds preserve each class's sample proportion
plot = True
)
Training on fold [0/4]
0:    test: 0.5000000    best: 0.5000000 (0)    total: 15.7ms    remaining: 15.7s
200:  test: 0.8948050    best: 0.8948050 (200)  total: 11.1s     remaining: 44s
400:  test: 0.8993043    best: 0.8993043 (400)  total: 22.7s     remaining: 33.9s
600:  test: 0.9019037    best: 0.9019037 (600)  total: 36.6s     remaining: 24.3s
800:  test: 0.9027905    best: 0.9031492 (781)  total: 48.8s     remaining: 12.1s
999:  test: 0.9036792    best: 0.9036792 (999)  total: 1m 1s     remaining: 0us
bestTest = 0.9036791642
bestIteration = 999
Training on fold [1/4]
0:    test: 0.5000000    best: 0.5000000 (0)    total: 16.8ms    remaining: 16.7s
200:  test: 0.8835559    best: 0.8840146 (166)  total: 10.4s     remaining: 41.3s
400:  test: 0.8852191    best: 0.8853875 (382)  total: 23.1s     remaining: 34.5s
600:  test: 0.8859059    best: 0.8859447 (591)  total: 36.8s     remaining: 24.4s
800:  test: 0.8860087    best: 0.8865844 (746)  total: 49.7s     remaining: 12.3s
999:  test: 0.8841890    best: 0.8865844 (746)  total: 1m 2s     remaining: 0us
bestTest = 0.8865843778
bestIteration = 746
Training on fold [2/4]
0:    test: 0.5000000    best: 0.5000000 (0)    total: 60.1ms    remaining: 1m
200:  test: 0.8762431    best: 0.8762994 (198)  total: 11.4s     remaining: 45.5s
400:  test: 0.8825299    best: 0.8825365 (399)  total: 24.9s     remaining: 37.1s
600:  test: 0.8859397    best: 0.8859462 (593)  total: 38.5s     remaining: 25.6s
800:  test: 0.8876071    best: 0.8876071 (800)  total: 52.4s     remaining: 13s
999:  test: 0.8890818    best: 0.8890895 (998)  total: 1m 7s     remaining: 0us
bestTest = 0.8890894812
bestIteration = 998
Training on fold [3/4]
0:    test: 0.5000000    best: 0.5000000 (0)    total: 15.7ms    remaining: 15.7s
200:  test: 0.8848750    best: 0.8848750 (200)  total: 11.9s     remaining: 47.2s
400:  test: 0.8886395    best: 0.8886395 (400)  total: 33.3s     remaining: 49.7s
600:  test: 0.8917459    best: 0.8917475 (599)  total: 49.4s     remaining: 32.8s
800:  test: 0.8926586    best: 0.8928882 (763)  total: 1m 3s     remaining: 15.8s
999:  test: 0.8919993    best: 0.8928882 (763)  total: 1m 18s    remaining: 0us
bestTest = 0.892888207
bestIteration = 763
CPU times: user 6min 21s, sys: 33.8 s, total: 6min 55s
Wall time: 4min 30s
### ํผ์ฒ ์ค์๋ ์๊ฐํ
import pandas as pd
feature_importance_df = cbc_7.get_feature_importance(prettified = True)
feature_importance_df
         Feature Id  Importances
0          RESOURCE    19.191502
1     ROLE_DEPTNAME    15.756340
2            MGR_ID    15.621862
3     ROLE_ROLLUP_2    13.129965
4  ROLE_FAMILY_DESC    10.059007
5        ROLE_TITLE     7.790703
6       ROLE_FAMILY     6.412647
7     ROLE_ROLLUP_1     6.224750
8         ROLE_CODE     5.813223
### ์๊ฐํ
from matplotlib import pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 6));
sns.barplot(x= 'Importances', y="Feature Id", data=feature_importance_df);
plt.title('CatBoost features importance:')
Text(0.5, 1.0, 'CatBoost features importance:')
<Figure size 1200x600 with 1 Axes>
- ๋ ์์ธํ ์ดํด๋ณด์.
!pip install shap
Collecting shap
  Downloading shap-0.41.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (572 kB)
Collecting slicer==0.0.7 (from shap)
Successfully installed shap-0.41.0 slicer-0.0.7
import shap
explainer = shap.TreeExplainer(cbc_7) # the trained model object
shap_values = explainer.shap_values(train_data) # the training Pool object
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[:100,:], X_train.iloc[:100,:])
- This is an interactive plot
  - parameters can be chosen for both axes to analyze the model
- Check the summary plot
shap.summary_plot(shap_values, X_train)
<Figure size 800x510 with 2 Axes>
Looks like it matters who your manager is (MGR_ID) :D
- In the diagram above, every employee (instance/row of the dataset) appears as one dot per feature
  - a dot's x coordinate is the impact that feature has on the model's prediction for that employee
  - a dot's color encodes the value of that feature for that employee
  - dots that do not fit are stacked to show vertical density
- Here the ROLE_ROLLUP_1 and ROLE_CODE features have the least impact on the model's predictions, and for most employees the impact is close to zero
2-5. Final prediction
%%time
from sklearn.model_selection import StratifiedKFold
n_fold = 4 # number of data folds
folds = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=SEED)
params = {'loss_function':'Logloss',
'eval_metric':'AUC',
'verbose': 200,
'random_seed': SEED
}
test_data = Pool(data=X_test,
cat_features=cat_features)
scores = []
prediction = np.zeros(X_test.shape[0])
for fold_n, (train_index, valid_index) in enumerate(folds.split(X, y)):
X_train, X_valid = X.iloc[train_index], X.iloc[valid_index] # train and validation data splits
y_train, y_valid = y[train_index], y[valid_index]
train_data = Pool(data=X_train,
label=y_train,
cat_features=cat_features)
valid_data = Pool(data=X_valid,
label=y_valid,
cat_features=cat_features)
model = CatBoostClassifier(**params)
model.fit(train_data,
eval_set=valid_data,
use_best_model=True
)
score = model.get_best_score()['validation']['AUC']
scores.append(score)
y_pred = model.predict_proba(test_data)[:, 1]
prediction += y_pred
prediction /= n_fold
print('CV mean: {:.4f}, CV std: {:.4f}'.format(np.mean(scores), np.std(scores)))
Learning rate set to 0.069882
0:    test: 0.5797111    best: 0.5797111 (0)    total: 89.7ms    remaining: 1m 29s
200:  test: 0.8638646    best: 0.8638646 (200)  total: 11.6s     remaining: 46.1s
400:  test: 0.8690869    best: 0.8691354 (377)  total: 22.8s     remaining: 34.1s
600:  test: 0.8730484    best: 0.8730484 (600)  total: 34.1s     remaining: 22.6s
800:  test: 0.8758933    best: 0.8761898 (791)  total: 45.1s     remaining: 11.2s
999:  test: 0.8758019    best: 0.8767012 (846)  total: 56.1s     remaining: 0us
bestTest = 0.8767012179
bestIteration = 846
Shrink model to first 847 iterations.
Learning rate set to 0.069883
0:    test: 0.5000000    best: 0.5000000 (0)    total: 31.2ms    remaining: 31.2s
200:  test: 0.8946113    best: 0.8947113 (198)  total: 10.6s     remaining: 42s
400:  test: 0.9000046    best: 0.9000046 (400)  total: 22s       remaining: 32.8s
600:  test: 0.9016458    best: 0.9017362 (584)  total: 33.5s     remaining: 22.2s
800:  test: 0.9017343    best: 0.9020394 (723)  total: 45.1s     remaining: 11.2s
999:  test: 0.8998936    best: 0.9021903 (817)  total: 56.4s     remaining: 0us
bestTest = 0.9021902605
bestIteration = 817
Shrink model to first 818 iterations.
Learning rate set to 0.069883
0:    test: 0.5000000    best: 0.5000000 (0)    total: 14.2ms    remaining: 14.2s
200:  test: 0.9042458    best: 0.9043000 (199)  total: 10.6s     remaining: 42.3s
400:  test: 0.9046762    best: 0.9059032 (260)  total: 22s       remaining: 32.8s
600:  test: 0.9027506    best: 0.9059032 (260)  total: 33s       remaining: 21.9s
800:  test: 0.9008662    best: 0.9059032 (260)  total: 44.1s     remaining: 10.9s
999:  test: 0.8987709    best: 0.9059032 (260)  total: 54.5s     remaining: 0us
bestTest = 0.9059031548
bestIteration = 260
Shrink model to first 261 iterations.
Learning rate set to 0.069883
0:    test: 0.5000000    best: 0.5000000 (0)    total: 28.8ms    remaining: 28.8s
200:  test: 0.8951500    best: 0.8951673 (199)  total: 10.4s     remaining: 41.5s
400:  test: 0.8960953    best: 0.8963737 (320)  total: 21.9s     remaining: 32.8s
600:  test: 0.8974762    best: 0.8976249 (598)  total: 33.2s     remaining: 22s
800:  test: 0.8980579    best: 0.8981457 (794)  total: 44.5s     remaining: 11.1s
999:  test: 0.8972463    best: 0.8983501 (946)  total: 55.9s     remaining: 0us
bestTest = 0.8983501224
bestIteration = 946
Shrink model to first 947 iterations.
CV mean: 0.8958, CV std: 0.0113
CPU times: user 6min 27s, sys: 7.5 s, total: 6min 34s
Wall time: 3min 45s