0. Introduction

  • CatBoost is an algorithm that improves on gradient boosting with decision trees as the base models

  • CatBoost can be used to solve the following kinds of problems

    • Classification (binary, multiclass)

    • Regression

    • Ranking

  • These tasks differ in their objective function

    • The objective function is the quantity minimized during gradient descent
  • CatBoost also ships with built-in evaluation metrics for measuring model accuracy

  • Official documentation
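To make the difference between an objective and a metric concrete, here is a minimal pure-Python sketch. It computes Logloss (the kind of quantity a classifier minimizes) and AUC (a ranking metric used for evaluation) on a few made-up labels and probabilities. The helper names `logloss` and `auc` and the sample values are illustrative assumptions, not part of CatBoost's API.

```python
import math

def logloss(y_true, p_pred):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    ) / len(y_true)

def auc(y_true, p_pred):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 1, 0]            # made-up labels
p_pred = [0.9, 0.4, 0.6, 0.3, 0.2]  # made-up predicted probabilities

print(f"Logloss: {logloss(y_true, p_pred):.4f}")
print(f"AUC:     {auc(y_true, p_pred):.4f}")
```

Logloss depends on how well calibrated the probabilities are, while AUC depends only on their ordering, which is why the two can move in different directions during training.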

📌 Strengths of CatBoost

1. Support for categorical features

  • On data with categorical features, CatBoost is often more accurate than other algorithms

  • No preprocessing of categorical features (e.g., one-hot encoding) is required

    • Setting a few hyperparameters is enough

2. Easier handling of overfitting

  • CatBoost uses ordered boosting, an alternative to the classic boosting scheme

  • For example, gradient boosting easily overfits on small datasets

    • CatBoost has a special modification for such cases, which helps prevent overfitting

3. Fast, with easy-to-use GPU training

  • GPU training is supported

4. Other useful features

  • Handling of missing values

  • Good visualization
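One idea behind CatBoost's handling of categorical features is the ordered target statistic: a row's category is replaced by a smoothed mean of the targets seen in earlier rows only, so a row never sees its own label. The sketch below is a simplified pure-Python illustration of that idea. The function name, the prior, and the smoothing weight are hypothetical, and the library's real implementation (which works over random permutations, among other details) differs.

```python
def ordered_target_stat(categories, targets, prior=0.5, weight=1.0):
    """Encode each category using only the targets of earlier rows.

    encoded[i] = (sum of earlier targets in the same category + weight * prior)
                 / (count of earlier rows in the same category + weight)
    """
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + weight * prior) / (n + weight))
        # Update the running statistics only after encoding the current row.
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded

cats = ["a", "b", "a", "a", "b"]  # made-up category codes
ys = [1, 0, 1, 0, 1]              # made-up binary targets
print(ordered_target_stat(cats, ys))
```

The first occurrence of each category gets the prior, and later occurrences drift toward the mean target observed so far for that category.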

1. Installing CatBoost

# This example requires version 0.14.2 or later

!pip install catboost 
Successfully installed catboost-1.2
import catboost
print(catboost.__version__)
1.2

2. Classification

2-1. Importing libraries & data

### Import the module for classification

from catboost import CatBoostClassifier
### Prepare the data

from catboost import datasets

train_df, test_df = datasets.amazon() # a nice dataset with only categorical features

train_df.shape, test_df.shape
((32769, 10), (58921, 10))
### Inspect the data

train_df.head()
   ACTION  RESOURCE  MGR_ID  ROLE_ROLLUP_1  ROLE_ROLLUP_2  ROLE_DEPTNAME  \
0       1     39353   85475         117961         118300         123472   
1       1     17183    1540         117961         118343         123125   
2       1     36724   14457         118219         118220         117884   
3       1     36135    5396         117961         118343         119993   
4       1     42680    5905         117929         117930         119569   

   ROLE_TITLE  ROLE_FAMILY_DESC  ROLE_FAMILY  ROLE_CODE  
0      117905            117906       290919     117908  
1      118536            118536       308574     118539  
2      117879            267952        19721     117880  
3      118321            240983       290919     118322  
4      119323            123932        19793     119325  
  • train_df contains the label (target) column, yet it has the same number of columns as test_df

    • Does test_df also contain the target?
test_df.head()
   id  RESOURCE  MGR_ID  ROLE_ROLLUP_1  ROLE_ROLLUP_2  ROLE_DEPTNAME  \
0   1     78766   72734         118079         118080         117878   
1   2     40644    4378         117961         118327         118507   
2   3     75443    2395         117961         118300         119488   
3   4     43219   19986         117961         118225         118403   
4   5     42093   50015         117961         118343         119598   

   ROLE_TITLE  ROLE_FAMILY_DESC  ROLE_FAMILY  ROLE_CODE  
0      117879            118177        19721     117880  
1      118863            122008       118398     118865  
2      118172            301534       249618     118175  
3      120773            136187       118960     120774  
4      118422            300136       118424     118425  
  • It does not contain the target

    • Instead, it has an extra id column
  • The values are stored as numbers, but these features are actually codes for various employee attributes, such as manager ID and company role codes

    • They should therefore be treated as categorical features

2-2. Preparing the data

y = train_df['ACTION']
X = train_df.drop(columns = 'ACTION') 

### Alternatively, X can be prepared like this
# X = train_df.drop('ACTION', axis = 1)
X_test = test_df.drop(columns = 'id') # drop the unneeded id column
### Set a seed so the same split can be reproduced later

SEED = 1
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.25, random_state = SEED)

2-3. Modeling

a) 1st - Baseline model

%%time 
# Measure the execution time

### Parameters
params = {'loss_function': 'Logloss', # loss function (objective function)
          'eval_metric':'AUC', # evaluation metric
          'verbose':  200, # print training progress to stdout every 200 iterations
          'random_seed': SEED # set the seed
         }

cbc_1 = CatBoostClassifier(**params) # create the model object
cbc_1.fit(X_train, y_train, # training data
          eval_set = (X_valid, y_valid), # validation data
          use_best_model = True, # keep only the trees up to the best iteration on eval_set
          plot = True # plot the training curves
         );
Learning rate set to 0.069882
0:	test: 0.5400959	best: 0.5400959 (0)	total: 57.5ms	remaining: 57.4s
200:	test: 0.8020842	best: 0.8020842 (200)	total: 2.04s	remaining: 8.13s
400:	test: 0.8237941	best: 0.8237941 (400)	total: 5.24s	remaining: 7.83s
600:	test: 0.8328464	best: 0.8330283 (598)	total: 7.02s	remaining: 4.66s
800:	test: 0.8366271	best: 0.8370599 (785)	total: 8.88s	remaining: 2.21s
999:	test: 0.8417832	best: 0.8417832 (999)	total: 10.6s	remaining: 0us

bestTest = 0.8417831567
bestIteration = 999

CPU times: user 15.5 s, sys: 942 ms, total: 16.5 s
Wall time: 10.7 s
<catboost.core.CatBoostClassifier at 0x7f05f08678b0>
  • Training for more iterations could give a better result (increase iterations; default = 1000)

  • More importantly, we have to tell the model which features are categorical

    • In the model above the categorical features were not declared, so CatBoost treated all of them as numeric

    • That imposes an ordering on the categories (class 2 > class 1)

    • Declare them with cat_features = [i1,i2,...,in]

### Indices of the columns CatBoost should treat as categorical
# Every feature in this dataset is categorical

cat_features = list(range(X.shape[1]))
print(cat_features)
[0, 1, 2, 3, 4, 5, 6, 7, 8]
### Method 2)

cat_features_names = X.columns # list the names of the categorical features explicitly
cat_features = [X.columns.get_loc(col) for col in cat_features_names]
print(cat_features)
[0, 1, 2, 3, 4, 5, 6, 7, 8]
### Method 3)

condition = True # placeholder: a condition on the column name that only categorical features satisfy

cat_features_names = [col for col in X.columns if condition]
cat_features = [X.columns.get_loc(col) for col in cat_features_names]
print(cat_features)
[0, 1, 2, 3, 4, 5, 6, 7, 8]

b) 2nd - Declaring the categorical features

### Retrain with the list of categorical features

%%time

params = {'loss_function': 'Logloss',
          'eval_metric': 'AUC',
          'cat_features':  cat_features, # declare the categorical features
          'verbose':  200,
          'random_seed':  SEED
         }

cbc_2 = CatBoostClassifier(**params)
cbc_2.fit(X_train, y_train,
          eval_set = (X_valid, y_valid),
          use_best_model = True,
          plot = True
         );
Learning rate set to 0.069882
0:	test: 0.5637606	best: 0.5637606 (0)	total: 69.5ms	remaining: 1m 9s
200:	test: 0.8955617	best: 0.8955872 (198)	total: 15.2s	remaining: 1m
400:	test: 0.8985912	best: 0.8987220 (386)	total: 32s	remaining: 47.9s
600:	test: 0.9004468	best: 0.9005457 (595)	total: 43s	remaining: 28.6s
800:	test: 0.8997008	best: 0.9007469 (631)	total: 55.5s	remaining: 13.8s
999:	test: 0.8985767	best: 0.9007469 (631)	total: 1m 9s	remaining: 0us

bestTest = 0.9007468588
bestIteration = 631

Shrink model to first 632 iterations.
CPU times: user 1min 35s, sys: 2.02 s, total: 1min 37s
Wall time: 1min 10s
<catboost.core.CatBoostClassifier at 0x7f05f08dc6a0>
  • Performance improved over the previous model (best validation AUC 0.9007 versus 0.8418)

  • Total training time went up, but the best score was reached in fewer iterations (631)

  • Setting early_stopping_rounds = N stops training early

    • If the metric does not improve for N rounds, training stops
### Apply early stopping

%%time

params = {'loss_function': 'Logloss',
          'eval_metric': 'AUC',
          'cat_features': cat_features,
          'early_stopping_rounds': 200,
          'verbose': 200,
          'random_seed': SEED
         }

cbc_2 = CatBoostClassifier(**params)
cbc_2.fit(X_train, y_train, 
          eval_set = (X_valid, y_valid), 
          use_best_model = True, 
          plot = True
         );
Learning rate set to 0.069882
0:	test: 0.5637606	best: 0.5637606 (0)	total: 46ms	remaining: 45.9s
200:	test: 0.8955617	best: 0.8955872 (198)	total: 10.6s	remaining: 42s
400:	test: 0.8985912	best: 0.8987220 (386)	total: 24.3s	remaining: 36.3s
600:	test: 0.9004468	best: 0.9005457 (595)	total: 35.2s	remaining: 23.4s
800:	test: 0.8997008	best: 0.9007469 (631)	total: 50.2s	remaining: 12.5s
Stopped by overfitting detector  (200 iterations wait)

bestTest = 0.9007468588
bestIteration = 631

Shrink model to first 632 iterations.
CPU times: user 1min 18s, sys: 1.6 s, total: 1min 20s
Wall time: 53.3 s
<catboost.core.CatBoostClassifier at 0x7f05f08dfa30>
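The stopping rule in the log above (best iteration 631, then a 200-iteration wait) can be sketched in a few lines of plain Python. This is a simplified illustration of patience-based early stopping for a higher-is-better metric, not CatBoost's overfitting detector itself. The function name and the AUC values are made up.

```python
def best_iteration_with_patience(scores, patience):
    """Return (best_iteration, stop_iteration) for a higher-is-better metric.

    Training stops once `patience` iterations have passed since the best
    score without a new best appearing.
    """
    best_score, best_iter = float("-inf"), -1
    for i, score in enumerate(scores):
        if score > best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            return best_iter, i
    return best_iter, len(scores) - 1

# Made-up validation AUC per iteration: peaks at index 3, then degrades.
val_auc = [0.56, 0.80, 0.89, 0.90, 0.899, 0.898, 0.897, 0.896]
print(best_iteration_with_patience(val_auc, patience=3))
```

With use_best_model = True the model is then shrunk back to the best iteration, so the extra waiting iterations cost time but not quality.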

c) 3rd - Using the GPU

  • By default, CatBoost computes on the CPU

  • To compute on the GPU, set task_type = 'GPU'

※ When running on Colab, change the runtime type to GPU!

### Use the GPU

%%time

params = {'loss_function': 'Logloss',
          'eval_metric': 'AUC',
          'cat_features': cat_features,
          'task_type':  'GPU',
          'verbose':  200,
          'random_seed': SEED
         }

cbc_3 = CatBoostClassifier(**params)
cbc_3.fit(X_train, y_train,
          eval_set = (X_valid, y_valid), 
          use_best_model = True,
          plot = True
         );
Learning rate set to 0.054241
Default metric period is 5 because AUC is/are not implemented for GPU
0:	test: 0.6184174	best: 0.6184174 (0)	total: 78.7ms	remaining: 1m 18s
200:	test: 0.8792616	best: 0.8792616 (200)	total: 10.7s	remaining: 42.6s
400:	test: 0.8832826	best: 0.8833058 (390)	total: 20.7s	remaining: 30.9s
600:	test: 0.8845304	best: 0.8845304 (600)	total: 33.5s	remaining: 22.3s
800:	test: 0.8854544	best: 0.8854544 (800)	total: 42.4s	remaining: 10.5s
999:	test: 0.8866701	best: 0.8867319 (995)	total: 58.1s	remaining: 0us
bestTest = 0.886731863
bestIteration = 995
Shrink model to first 996 iterations.
CPU times: user 52.6 s, sys: 14.3 s, total: 1min 6s
Wall time: 59.7 s
<catboost.core.CatBoostClassifier at 0x7f05f08dfbb0>
  • Neither result improved much: wall time was 59.7 s versus 1 min 10 s on the CPU, and the best AUC was 0.8867 versus 0.9007

  • Some hyperparameters could only be set in GPU mode in older CatBoost releases (check the documentation for your version)

    • grow_policy: the tree-growing policy

    • min_data_in_leaf: minimum number of training samples in a leaf

    • max_leaves: maximum number of leaves in the resulting tree

  • On some datasets GPU training takes far less time

  • Setting border_count = N can speed up GPU training further

    • N: the number of splits considered for each feature

    • The official documentation suggests 32 for GPU training

    • In many cases this does not affect model quality but speeds up training considerably
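To see what "the number of splits considered for each feature" means, the sketch below quantizes a made-up numeric feature into buckets using a fixed number of borders. It places the borders at evenly spaced quantiles as a stand-in: CatBoost offers its own border-selection strategies, so treat this as an illustration of the idea and not the library's algorithm. The function names are hypothetical.

```python
import bisect
import statistics

def quantile_borders(values, border_count):
    """Return `border_count` split thresholds placed at evenly spaced quantiles."""
    return statistics.quantiles(values, n=border_count + 1)

def to_bin(value, borders):
    """Index of the bucket a value falls into (0 .. len(borders))."""
    return bisect.bisect_right(borders, value)

feature = [3, 7, 1, 9, 4, 8, 2, 6, 5, 10]  # made-up numeric feature
borders = quantile_borders(feature, border_count=3)
bins = [to_bin(v, borders) for v in feature]

print(borders)  # 3 thresholds -> 4 buckets
print(bins)
```

Fewer borders mean fewer candidate splits to score per feature, which is where the speed-up comes from. The trade-off is a coarser view of numeric features.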

d) 4th - Tuning GPU parameters

%%time

params = {'loss_function': 'Logloss',
          'eval_metric': 'AUC',
          'cat_features': cat_features,
          'task_type':  'GPU',
          'border_count': 32,
          'verbose': 200,
          'random_seed': SEED
         }

cbc_4 = CatBoostClassifier(**params)
cbc_4.fit(X_train, y_train, 
          eval_set = (X_valid, y_valid), 
          use_best_model = True, 
          plot = True
         );
Learning rate set to 0.054241
0:	test: 0.6184174	best: 0.6184174 (0)	total: 34.2ms	remaining: 34.1s
Default metric period is 5 because AUC is/are not implemented for GPU
200:	test: 0.8792616	best: 0.8792616 (200)	total: 8.22s	remaining: 32.7s
400:	test: 0.8832826	best: 0.8833058 (390)	total: 17.8s	remaining: 26.6s
600:	test: 0.8845304	best: 0.8845304 (600)	total: 27.9s	remaining: 18.5s
800:	test: 0.8854544	best: 0.8854544 (800)	total: 38.6s	remaining: 9.59s
999:	test: 0.8866701	best: 0.8867319 (995)	total: 53.8s	remaining: 0us
bestTest = 0.886731863
bestIteration = 995
Shrink model to first 996 iterations.
CPU times: user 50.7 s, sys: 11.6 s, total: 1min 2s
Wall time: 54.6 s
<catboost.core.CatBoostClassifier at 0x7f05f08df4f0>
  • AUC is unchanged, and wall time dropped slightly (59.7 s to 54.6 s)

e) 5th, 6th - Feature selection

  • Sometimes you may suspect that some features carry misleading information

  • To test this, you can either build many slices of the data or pass ignored_features = [i1,i2,...,in], a list of column indices for the model to ignore

### Experiment 1) Build a data slice with misleading noise features

import numpy as np
import warnings
warnings.filterwarnings("ignore")
np.random.seed(SEED)

noise_cols = [f'noise_{i}' for i in range(5)]
for col in noise_cols:
    X_train[col] = y_train * np.random.rand(X_train.shape[0])
    X_valid[col] = np.random.rand(X_valid.shape[0])
X_train.head()
       RESOURCE  MGR_ID  ROLE_ROLLUP_1  ROLE_ROLLUP_2  ROLE_DEPTNAME  \
16773     27798    1350         117961         118052         122938   
23491     80701    4571         117961         118225         119924   
32731     34039    5113         117961         118300         119890   
7855      42085    4733         118290         118291         120126   
16475     16358    6046         117961         118446         120317   

       ROLE_TITLE  ROLE_FAMILY_DESC  ROLE_FAMILY  ROLE_CODE   noise_0  \
16773      117905            117906       290919     117908  0.417022   
23491      118685            279443       308574     118687  0.720324   
32731      119433            133686       118424     119435  0.000114   
7855       118980            166203       118295     118982  0.302333   
16475      307024            306404       118331     118332  0.146756   

        noise_1   noise_2   noise_3   noise_4  
16773  0.097850  0.665600  0.979025  0.491624  
23491  0.855900  0.311763  0.929346  0.391708  
32731  0.287838  0.896624  0.704050  0.606467  
7855   0.264320  0.482195  0.028493  0.182570  
16475  0.022876  0.009307  0.726750  0.623357  
%%time

params = {'loss_function': 'Logloss',
          'eval_metric': 'AUC',
          'cat_features': cat_features,
          'verbose': 200,
          'random_seed': SEED
         }

cbc_5 = CatBoostClassifier(**params)
cbc_5.fit(X_train, y_train, 
          eval_set = (X_valid, y_valid), 
          use_best_model = True, 
          plot = True
         );
Learning rate set to 0.069882
0:	test: 0.4990944	best: 0.4990944 (0)	total: 78.3ms	remaining: 1m 18s
200:	test: 0.5831370	best: 0.5894476 (7)	total: 9.93s	remaining: 39.5s
400:	test: 0.5831376	best: 0.5894476 (7)	total: 17s	remaining: 25.4s
600:	test: 0.5831376	best: 0.5894476 (7)	total: 25.4s	remaining: 16.9s
800:	test: 0.5831378	best: 0.5894476 (7)	total: 32.9s	remaining: 8.16s
999:	test: 0.5831381	best: 0.5894476 (7)	total: 38.7s	remaining: 0us

bestTest = 0.5894475816
bestIteration = 7

Shrink model to first 8 iterations.
CPU times: user 1min 4s, sys: 1.25 s, total: 1min 5s
Wall time: 39.2 s
<catboost.core.CatBoostClassifier at 0x7f05eeb78ac0>
  • Performance dropped sharply (best validation AUC 0.5894)
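The drop has a simple explanation: in the training split each noise column is the label times a random number, so it is zero exactly when the label is zero, while in the validation split it is unrelated to the label. The pure-Python sketch below reproduces that construction on made-up labels and checks a one-line rule on both splits. The names and sample sizes are illustrative.

```python
import random

random.seed(1)

y_train = [random.randint(0, 1) for _ in range(1000)]  # made-up binary labels
y_valid = [random.randint(0, 1) for _ in range(1000)]

# Same construction as the experiment: label-dependent in train, pure noise in valid.
noise_train = [y * random.random() for y in y_train]
noise_valid = [random.random() for _ in y_valid]

def rule_accuracy(noise, labels):
    """Accuracy of the rule 'predict 1 whenever the noise feature is > 0'."""
    hits = sum(int(n > 0) == y for n, y in zip(noise, labels))
    return hits / len(labels)

print(f"train accuracy of the rule: {rule_accuracy(noise_train, y_train):.3f}")
print(f"valid accuracy of the rule: {rule_accuracy(noise_valid, y_valid):.3f}")
```

A model that latches onto such a column looks perfect on the training data and falls to roughly chance level on validation, which matches the AUC seen above.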
### Experiment 2) Specify the columns to ignore

ignored_features = list(range(X_train.shape[1] - 5, X_train.shape[1]))
print(ignored_features)
[9, 10, 11, 12, 13]
%%time

params = {'loss_function': 'Logloss',
          'eval_metric': 'AUC',
          'cat_features': cat_features,
          'ignored_features': ignored_features, # features to ignore
          'early_stopping_rounds': 200,
          'verbose': 200,
          'random_seed': SEED
         }

cbc_6 = CatBoostClassifier(**params)
cbc_6.fit(X_train, y_train, 
          eval_set = (X_valid, y_valid), 
          use_best_model = True, 
          plot = True
         );
Learning rate set to 0.069882
0:	test: 0.5637606	best: 0.5637606 (0)	total: 42.2ms	remaining: 42.1s
200:	test: 0.8955617	best: 0.8955872 (198)	total: 10.4s	remaining: 41.3s
400:	test: 0.8985912	best: 0.8987220 (386)	total: 21.7s	remaining: 32.4s
600:	test: 0.9004468	best: 0.9005457 (595)	total: 33.4s	remaining: 22.2s
800:	test: 0.8997008	best: 0.9007469 (631)	total: 45s	remaining: 11.2s
Stopped by overfitting detector  (200 iterations wait)

bestTest = 0.9007468588
bestIteration = 631

Shrink model to first 632 iterations.
CPU times: user 1min 18s, sys: 1.44 s, total: 1min 20s
Wall time: 47 s
<catboost.core.CatBoostClassifier at 0x7f05eeb7b8b0>
### Remove the noise added in Experiment 1 and restore the original data

X_train = X_train.drop(columns = noise_cols)
X_valid = X_valid.drop(columns = noise_cols)
X_train.head()
       RESOURCE  MGR_ID  ROLE_ROLLUP_1  ROLE_ROLLUP_2  ROLE_DEPTNAME  \
16773     27798    1350         117961         118052         122938   
23491     80701    4571         117961         118225         119924   
32731     34039    5113         117961         118300         119890   
7855      42085    4733         118290         118291         120126   
16475     16358    6046         117961         118446         120317   

       ROLE_TITLE  ROLE_FAMILY_DESC  ROLE_FAMILY  ROLE_CODE  
16773      117905            117906       290919     117908  
23491      118685            279443       308574     118687  
32731      119433            133686       118424     119435  
7855       118980            166203       118295     118982  
16475      307024            306404       118331     118332  

f) 7th - Using Pool objects

  • CatBoost also accepts a Pool object as training data

  • Pool object

    • cat_features: a one-dimensional array of categorical column indices (integers) or names (strings)
from catboost import Pool

train_data = Pool(data = X_train,
                  label = y_train,
                  cat_features = cat_features
                 )

valid_data = Pool(data = X_valid,
                  label = y_valid,
                  cat_features = cat_features
                 )
%%time

params = {'loss_function':'Logloss',
          'eval_metric':'AUC',
#         'cat_features': cat_features, # already set in the Pool objects
          'early_stopping_rounds': 200,
          'verbose': 200,
          'random_seed': SEED
         }

cbc_7 = CatBoostClassifier(**params)
cbc_7.fit(train_data, # instead of X_train, y_train
          eval_set = valid_data, # instead of (X_valid, y_valid)
          use_best_model = True, 
          plot = True
         );
Learning rate set to 0.069882
0:	test: 0.5637606	best: 0.5637606 (0)	total: 44.6ms	remaining: 44.6s
200:	test: 0.8955617	best: 0.8955872 (198)	total: 13.6s	remaining: 54.2s
400:	test: 0.8985912	best: 0.8987220 (386)	total: 27.7s	remaining: 41.4s
600:	test: 0.9004468	best: 0.9005457 (595)	total: 49.6s	remaining: 32.9s
800:	test: 0.8997008	best: 0.9007469 (631)	total: 1m 2s	remaining: 15.5s
Stopped by overfitting detector  (200 iterations wait)

bestTest = 0.9007468588
bestIteration = 631

Shrink model to first 632 iterations.
CPU times: user 1min 19s, sys: 1.65 s, total: 1min 20s
Wall time: 1min 4s
<catboost.core.CatBoostClassifier at 0x7f05f08df190>

📌 Why use Pool objects

  • For example, part of the data may be outdated or inaccurate

    • Pool.set_weight() assigns weights to individual instances (rows)
  • Pool can also split the data into groups

    • Set the groups with set_group_id(), then use Pool.set_group_weight() to apply different weights to different groups
  • A baseline can be supplied

    • Pool.set_baseline() provides initial formula values for all input objects

    • Training then starts from those values instead of from zero

  • In short, a Pool object is a convenient wrapper for the data

g) 8th - Cross-validation

from catboost import cv
%%time

params = {'loss_function': 'Logloss',
          'eval_metric': 'AUC',
          'verbose': 200,
          'random_seed': SEED
         }

all_train_data = Pool(data = X,
                      label = y,
                      cat_features = cat_features
                     )

scores = cv(pool = all_train_data,
            params = params, 
            fold_count = 4,
            seed = SEED, 
            shuffle = True,
            stratified = True, # if True, folds preserve the proportion of samples in each class
            plot = True
           )
Training on fold [0/4]
0:	test: 0.5000000	best: 0.5000000 (0)	total: 15.7ms	remaining: 15.7s
200:	test: 0.8948050	best: 0.8948050 (200)	total: 11.1s	remaining: 44s
400:	test: 0.8993043	best: 0.8993043 (400)	total: 22.7s	remaining: 33.9s
600:	test: 0.9019037	best: 0.9019037 (600)	total: 36.6s	remaining: 24.3s
800:	test: 0.9027905	best: 0.9031492 (781)	total: 48.8s	remaining: 12.1s
999:	test: 0.9036792	best: 0.9036792 (999)	total: 1m 1s	remaining: 0us

bestTest = 0.9036791642
bestIteration = 999

Training on fold [1/4]
0:	test: 0.5000000	best: 0.5000000 (0)	total: 16.8ms	remaining: 16.7s
200:	test: 0.8835559	best: 0.8840146 (166)	total: 10.4s	remaining: 41.3s
400:	test: 0.8852191	best: 0.8853875 (382)	total: 23.1s	remaining: 34.5s
600:	test: 0.8859059	best: 0.8859447 (591)	total: 36.8s	remaining: 24.4s
800:	test: 0.8860087	best: 0.8865844 (746)	total: 49.7s	remaining: 12.3s
999:	test: 0.8841890	best: 0.8865844 (746)	total: 1m 2s	remaining: 0us

bestTest = 0.8865843778
bestIteration = 746

Training on fold [2/4]
0:	test: 0.5000000	best: 0.5000000 (0)	total: 60.1ms	remaining: 1m
200:	test: 0.8762431	best: 0.8762994 (198)	total: 11.4s	remaining: 45.5s
400:	test: 0.8825299	best: 0.8825365 (399)	total: 24.9s	remaining: 37.1s
600:	test: 0.8859397	best: 0.8859462 (593)	total: 38.5s	remaining: 25.6s
800:	test: 0.8876071	best: 0.8876071 (800)	total: 52.4s	remaining: 13s
999:	test: 0.8890818	best: 0.8890895 (998)	total: 1m 7s	remaining: 0us

bestTest = 0.8890894812
bestIteration = 998

Training on fold [3/4]
0:	test: 0.5000000	best: 0.5000000 (0)	total: 15.7ms	remaining: 15.7s
200:	test: 0.8848750	best: 0.8848750 (200)	total: 11.9s	remaining: 47.2s
400:	test: 0.8886395	best: 0.8886395 (400)	total: 33.3s	remaining: 49.7s
600:	test: 0.8917459	best: 0.8917475 (599)	total: 49.4s	remaining: 32.8s
800:	test: 0.8926586	best: 0.8928882 (763)	total: 1m 3s	remaining: 15.8s
999:	test: 0.8919993	best: 0.8928882 (763)	total: 1m 18s	remaining: 0us

bestTest = 0.892888207
bestIteration = 763

CPU times: user 6min 21s, sys: 33.8 s, total: 6min 55s
Wall time: 4min 30s
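The effect of stratified = True can be shown without any library. The sketch below is a simplified pure-Python version of stratified fold assignment: indices are grouped by class, shuffled, and dealt round-robin across folds. The function name and labels are made up, and CatBoost's own splitting logic may differ in detail.

```python
import random
from collections import defaultdict

def stratified_folds(labels, fold_count, seed):
    """Assign each row index to a fold so class proportions stay similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(fold_count)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, idx in enumerate(indices):
            folds[pos % fold_count].append(idx)
    return folds

labels = [1] * 16 + [0] * 4  # made-up imbalanced labels (80% positive)
for fold in stratified_folds(labels, fold_count=4, seed=1):
    positives = sum(labels[i] for i in fold)
    print(f"fold size={len(fold)}, positives={positives}")
```

Each fold keeps the 80/20 class ratio of the full data, which matters for imbalanced targets like ACTION, where a plain random split could leave a fold with very few negatives.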
### Feature importance

import pandas as pd

feature_importance_df = cbc_7.get_feature_importance(prettified = True)
feature_importance_df
         Feature Id  Importances
0          RESOURCE    19.191502
1     ROLE_DEPTNAME    15.756340
2            MGR_ID    15.621862
3     ROLE_ROLLUP_2    13.129965
4  ROLE_FAMILY_DESC    10.059007
5        ROLE_TITLE     7.790703
6       ROLE_FAMILY     6.412647
7     ROLE_ROLLUP_1     6.224750
8         ROLE_CODE     5.813223
### Visualization

from matplotlib import pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6));
sns.barplot(x= 'Importances', y="Feature Id", data=feature_importance_df);
plt.title('CatBoost features importance:')
Text(0.5, 1.0, 'CatBoost features importance:')
<Figure size 1200x600 with 1 Axes>
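The importances in the table above add up to 100, so they can be read as percentage shares. The short sketch below copies the values from that table and prints a running total, which makes it easy to see how many features account for most of the importance.

```python
# Importances copied from the table above.
importances = {
    "RESOURCE": 19.191502,
    "ROLE_DEPTNAME": 15.756340,
    "MGR_ID": 15.621862,
    "ROLE_ROLLUP_2": 13.129965,
    "ROLE_FAMILY_DESC": 10.059007,
    "ROLE_TITLE": 7.790703,
    "ROLE_FAMILY": 6.412647,
    "ROLE_ROLLUP_1": 6.224750,
    "ROLE_CODE": 5.813223,
}

total = sum(importances.values())
cumulative = 0.0
for name, value in importances.items():
    cumulative += value
    print(f"{name:<18}{value:>8.2f}{cumulative:>8.2f}")
print(f"total: {total:.2f}")
```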
  • Let's take a closer look.
!pip install shap
Successfully installed shap-0.41.0 slicer-0.0.7
import shap
explainer = shap.TreeExplainer(cbc_7) # the model object
shap_values = explainer.shap_values(train_data) # the training Pool object

shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[:100,:], X_train.iloc[:100,:])
  • This is an interactive plot

    • You can switch the parameters on both the horizontal and vertical axes to analyze the model
  • Check the summary plot

shap.summary_plot(shap_values, X_train)
<Figure size 800x510 with 2 Axes>

Looks like it matters who your manager is (MGR_ID) :D

  • In the diagram above, every employee (an instance/row of the dataset) appears as one dot in each row

    • The x position of a dot is the impact of that feature on the model's prediction

    • The color of a dot is the value of that feature for that employee

    • Dots that do not fit on the row pile up to show density

  • Here the ROLE_ROLLUP_1 and ROLE_CODE features have little impact on the prediction, and for most employees almost none
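The "impact of a feature on the prediction" that SHAP reports is a Shapley value. The sketch below computes exact Shapley values for a hypothetical three-feature toy model by brute force over all feature orderings. It is meant only to show what the numbers mean: TreeExplainer uses a much faster tree-specific algorithm, and the model, inputs, and baseline here are made up.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orders.

    Features that have not been "revealed" yet keep their baseline value.
    """
    n = len(x)
    contrib = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = predict(current)
        for j in order:
            current[j] = x[j]
            new = predict(current)
            contrib[j] += new - prev
            prev = new
    return [c / len(orders) for c in contrib]

def toy_model(v):
    """Hypothetical model with an interaction between features 0 and 1."""
    return 2.0 * v[0] + v[0] * v[1] + 0.0 * v[2]

x = [1.0, 3.0, 5.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The values sum to the difference between the prediction and the baseline prediction, and a feature the model ignores (feature 2 here) gets zero, which is the situation described above for features with almost no impact.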

2-5. Final prediction

%%time

from sklearn.model_selection import StratifiedKFold

n_fold = 4 # number of data folds
folds = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=SEED)

params = {'loss_function':'Logloss',
          'eval_metric':'AUC',
          'verbose': 200,
          'random_seed': SEED
         }

test_data = Pool(data=X_test,
                 cat_features=cat_features)

scores = []
prediction = np.zeros(X_test.shape[0])
for fold_n, (train_index, valid_index) in enumerate(folds.split(X, y)):
    
    X_train, X_valid = X.iloc[train_index], X.iloc[valid_index] # train and validation data splits
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index] # positional indexing, consistent with X
    
    train_data = Pool(data=X_train, 
                      label=y_train,
                      cat_features=cat_features)
    valid_data = Pool(data=X_valid, 
                      label=y_valid,
                      cat_features=cat_features)
    
    model = CatBoostClassifier(**params)
    model.fit(train_data,
              eval_set=valid_data, 
              use_best_model=True
             )
    
    score = model.get_best_score()['validation']['AUC']
    scores.append(score)

    y_pred = model.predict_proba(test_data)[:, 1]
    prediction += y_pred

prediction /= n_fold
print('CV mean: {:.4f}, CV std: {:.4f}'.format(np.mean(scores), np.std(scores)))
Learning rate set to 0.069882
0:	test: 0.5797111	best: 0.5797111 (0)	total: 89.7ms	remaining: 1m 29s
200:	test: 0.8638646	best: 0.8638646 (200)	total: 11.6s	remaining: 46.1s
400:	test: 0.8690869	best: 0.8691354 (377)	total: 22.8s	remaining: 34.1s
600:	test: 0.8730484	best: 0.8730484 (600)	total: 34.1s	remaining: 22.6s
800:	test: 0.8758933	best: 0.8761898 (791)	total: 45.1s	remaining: 11.2s
999:	test: 0.8758019	best: 0.8767012 (846)	total: 56.1s	remaining: 0us

bestTest = 0.8767012179
bestIteration = 846

Shrink model to first 847 iterations.
Learning rate set to 0.069883
0:	test: 0.5000000	best: 0.5000000 (0)	total: 31.2ms	remaining: 31.2s
200:	test: 0.8946113	best: 0.8947113 (198)	total: 10.6s	remaining: 42s
400:	test: 0.9000046	best: 0.9000046 (400)	total: 22s	remaining: 32.8s
600:	test: 0.9016458	best: 0.9017362 (584)	total: 33.5s	remaining: 22.2s
800:	test: 0.9017343	best: 0.9020394 (723)	total: 45.1s	remaining: 11.2s
999:	test: 0.8998936	best: 0.9021903 (817)	total: 56.4s	remaining: 0us

bestTest = 0.9021902605
bestIteration = 817

Shrink model to first 818 iterations.
Learning rate set to 0.069883
0:	test: 0.5000000	best: 0.5000000 (0)	total: 14.2ms	remaining: 14.2s
200:	test: 0.9042458	best: 0.9043000 (199)	total: 10.6s	remaining: 42.3s
400:	test: 0.9046762	best: 0.9059032 (260)	total: 22s	remaining: 32.8s
600:	test: 0.9027506	best: 0.9059032 (260)	total: 33s	remaining: 21.9s
800:	test: 0.9008662	best: 0.9059032 (260)	total: 44.1s	remaining: 10.9s
999:	test: 0.8987709	best: 0.9059032 (260)	total: 54.5s	remaining: 0us

bestTest = 0.9059031548
bestIteration = 260

Shrink model to first 261 iterations.
Learning rate set to 0.069883
0:	test: 0.5000000	best: 0.5000000 (0)	total: 28.8ms	remaining: 28.8s
200:	test: 0.8951500	best: 0.8951673 (199)	total: 10.4s	remaining: 41.5s
400:	test: 0.8960953	best: 0.8963737 (320)	total: 21.9s	remaining: 32.8s
600:	test: 0.8974762	best: 0.8976249 (598)	total: 33.2s	remaining: 22s
800:	test: 0.8980579	best: 0.8981457 (794)	total: 44.5s	remaining: 11.1s
999:	test: 0.8972463	best: 0.8983501 (946)	total: 55.9s	remaining: 0us

bestTest = 0.8983501224
bestIteration = 946

Shrink model to first 947 iterations.
CV mean: 0.8958, CV std: 0.0113
CPU times: user 6min 27s, sys: 7.5 s, total: 6min 34s
Wall time: 3min 45s
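The summary line in the log can be reproduced from the four bestTest values, and the fold averaging of test predictions is just a column-wise mean. The sketch below does both in plain Python. The fold scores are copied from the log above, while the per-fold probabilities are made-up values for three hypothetical test rows.

```python
import statistics

# Best validation AUC of each fold, copied from the log above.
fold_scores = [0.8767012179, 0.9021902605, 0.9059031548, 0.8983501224]

cv_mean = statistics.fmean(fold_scores)
cv_std = statistics.pstdev(fold_scores)  # population std, like np.std's default
print(f"CV mean: {cv_mean:.4f}, CV std: {cv_std:.4f}")

# Averaging per-fold test probabilities (made-up values for three test rows).
fold_probs = [
    [0.90, 0.20, 0.60],
    [0.80, 0.30, 0.70],
    [0.85, 0.25, 0.65],
    [0.95, 0.25, 0.65],
]
prediction = [sum(col) / len(fold_probs) for col in zip(*fold_probs)]
print(prediction)
```

Averaging the probabilities of models trained on different folds is a simple form of ensembling. The spread of the fold scores gives a rough idea of how much a single validation split can vary.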

📚 Resources

  1. CatBoost documentation

  2. CatBoost tutorials repository

  3. Introduction to gradient boosting on decision trees with Catboost

  4. Working with categorical data: Catboost

  5. Interpretable Machine Learning with XGBoost
