0. Introduction: Home Credit Default Risk Competition

  • Predicting whether a client will repay a loan or have difficulty doing so is a critical business need

  • Competition goal: use historical loan application data to predict whether an applicant will be able to repay a loan

    • A typical classification problem

  • Supervised

    • The labels are included in the train data

    • We train the model to predict the labels from the features

  • Classification

    • The label is a binary variable

    • 0 (the loan was repaid on time) vs 1 (the client had difficulty repaying the loan)

0-1. Understanding the Data Structure

  • The data is provided by Home Credit

    • A service that extends credit (loans) to the unbanked population

a) ๋ฐ์ดํ„ฐ ์†Œ์Šค

1. application_train/application_test

  • The main train/test data, with information about each loan application at Home Credit

  • Every loan has its own row and is identified by the SK_ID_CURR feature

  • The train data comes with the target

    • 0: the loan was repaid

    • 1: the loan was not repaid

2. bureau

  • Data concerning the client's previous credits from other financial institutions

  • Each previous credit has its own row, but one loan in the application data can have multiple previous credits

3. bureau_balance

  • Monthly data about the previous credits in bureau

  • Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length

4. previous_application

  • Previous applications for loans at Home Credit by clients who have loans in the application data

  • Each current loan in the application data can have multiple previous loans

  • Each previous application has one row and is identified by the SK_ID_PREV feature

5. POS_CASH_BALANCE

  • ๊ณ ๊ฐ์ด Home Credit์—์„œ ๋ณด์œ ํ•œ ์ด์ „ ํŒ๋งค ์‹œ์  ๋˜๋Š” ํ˜„๊ธˆ ๋Œ€์ถœ์— ๋Œ€ํ•œ ์›”๋ณ„ ๋ฐ์ดํ„ฐ

  • ๊ฐ ํ–‰์€ ์ด์ „ ํŒ๋งค ์‹œ์  ๋˜๋Š” ํ˜„๊ธˆ ๋Œ€์ถœ์˜ ํ•œ ๋‹ฌ์ด๋ฉฐ, ํ•˜๋‚˜์˜ ์ด์ „ ๋Œ€์ถœ์—๋Š” ์—ฌ๋Ÿฌ ํ–‰์ด ์žˆ์„ ์ˆ˜ ์žˆ์Œ

6. credit_card_balance

  • ๊ณ ๊ฐ๋“ค์ด ํ™ˆ ํฌ๋ ˆ๋”ง์— ๊ฐ€์ง€๊ณ  ์žˆ๋˜ ์ด์ „ ์‹ ์šฉ์นด๋“œ์— ๋Œ€ํ•œ ์›”๋ณ„ ๋ฐ์ดํ„ฐ

  • ๊ฐ ํ–‰์€ ์‹ ์šฉ์นด๋“œ ์ž”์•ก์˜ ํ•œ ๋‹ฌ์ด๋ฉฐ, ํ•˜๋‚˜์˜ ์‹ ์šฉ์นด๋“œ๋Š” ์—ฌ๋Ÿฌ ํ–‰์„ ๊ฐ€์งˆ ์ˆ˜ ์žˆ์Œ

7. installments_payments

  • Payment history for previous loans at Home Credit

  • There is one row for every made payment and one row for every missed payment

(+)

  • The HomeCredit_columns_description.csv file, containing definitions for every column

  • An example submission file

b) ์ด๋ฏธ์ง€ ๊ตฌ์กฐ๋„

0-2. Evaluation Metric: ROC AUC


image

  • Each line on the graph is the curve for a single model; moving along a line corresponds to changing the threshold used to classify a positive instance

  • The threshold starts at 0 in the upper right and moves to 1 in the lower left

  • A curve that is to the left of and above another curve indicates a better model

    • The blue model is better than the green model, and both are better than the red diagonal, which represents a model that guesses at random.

  • The Area Under the Curve (AUC)

  • Simply the area under the ROC curve

    • i.e. the integral of the curve
  • Ranges from 0 to 1; the higher the value, the better the model

  • A model that guesses at random has an ROC AUC of 0.5


  • When a classifier is evaluated with ROC AUC, it produces probabilities between 0 and 1 rather than hard 0 or 1 predictions (i.e. it does not commit to a binary answer)

    • When the classes have an imbalanced distribution, accuracy is not the best metric
  • EX>

    If you wanted a model that detects terrorists with 99.9999% accuracy, you could simply predict that nobody is a terrorist; the accuracy would still come out high.

  • To handle such situations, we use more informative metrics such as ROC AUC or the F1 score
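The accuracy-vs-ROC-AUC point can be sketched on toy data (the labels and scores below are made up for illustration, not competition data): a degenerate classifier that always predicts the majority class gets high accuracy on an imbalanced label vector, while ROC AUC shows it ranks no better than chance.

```python
# Minimal sketch with toy data: accuracy rewards always predicting the
# majority class, while ROC AUC (which scores the probability ranking)
# stays at the 0.5 chance level.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 990 + [1] * 10)        # ~1% positive class
always_zero = np.zeros_like(y_true)            # degenerate "classifier"
random_scores = rng.random(len(y_true))        # random probabilities

print(accuracy_score(y_true, always_zero))     # 0.99, looks great
# ROC AUC takes scores, not hard labels; constant scores give exactly 0.5
print(roc_auc_score(y_true, np.full(len(y_true), 0.5)))
print(roc_auc_score(y_true, random_scores))    # near 0.5, chance level
```

This is why the competition is scored with ROC AUC on predicted probabilities rather than accuracy on hard 0/1 predictions.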


0-3. Notebooks Worth Referencing

1. Preparing Libraries & Data

1-1. Libraries & modules

### for data manipulation
import numpy as np
import pandas as pd 

### categorical variable encoder
from sklearn.preprocessing import LabelEncoder

### file system control
import os

import warnings
warnings.filterwarnings('ignore')

### visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

1-2. Loading the Data

  • All available data files

    • The main file for training (target ⭕)

    • The main file for testing (target ❌)

    • An example submission file

    • 6 other files containing additional information about each loan

from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Check the list of available data

print(os.listdir('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data'))
['HomeCredit_columns_description.csv', 'POS_CASH_balance.csv', 'application_test.csv', 'application_train.csv', 'bureau.csv', 'bureau_balance.csv', 'credit_card_balance.csv', 'installments_payments.csv', 'previous_application.csv', 'sample_submission.csv']
# Training data

app_train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/application_train.csv')
print('Training data shape: ', app_train.shape)
Training data shape:  (307511, 122)
app_train.head()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows ร— 122 columns

  • train ๋ฐ์ดํ„ฐ์—๋Š” 307511๊ฐœ์˜ ๊ด€์ธก์น˜(๊ฐ๊ฐ ๋ณ„๋„์˜ ๋Œ€์ถœ)์™€ target(์˜ˆ์ธกํ•˜๊ณ ์ž ํ•˜๋Š” ๋ ˆ์ด๋ธ”)```์„ ํฌํ•จํ•œ 122๊ฐœ์˜ feature(๋ณ€์ˆ˜)๊ฐ€ ์กด์žฌ
# ํ‰๊ฐ€์šฉ ๋ฐ์ดํ„ฐ

app_test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/application_test.csv')
print('Testing data shape: ', app_test.shape)
Testing data shape:  (48744, 121)
app_test.head()
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100001 Cash loans F N Y 0 135000.0 568800.0 20560.5 450000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
1 100005 Cash loans M N Y 0 99000.0 222768.0 17370.0 180000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
2 100013 Cash loans M Y Y 0 202500.0 663264.0 69777.0 630000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 1.0 4.0
3 100028 Cash loans F N Y 2 315000.0 1575000.0 49018.5 1575000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
4 100038 Cash loans M Y N 1 180000.0 625500.0 32067.0 625500.0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN

5 rows ร— 121 columns

  • The test data is smaller and has no target values.

2. Exploratory Data Analysis (EDA)

  • An open-ended process in which we compute statistics and make figures to find trends, anomalies, patterns, or relationships within the data

  • Goal: understand what the data can tell us

  • It generally starts with a high-level overview and then narrows in on specific areas as we find interesting parts of the data

  • The findings may be interesting in their own right, or they may inform modeling choices, for example by helping us decide which features to use

2-1. Examining the Distribution of the Target

  • Goal: predict either 0 (the loan was repaid on time) or 1 (the client had payment difficulties)

  • We can examine the number of loans falling into each category

app_train['TARGET'].value_counts()
0    282686
1     24825
Name: TARGET, dtype: int64
app_train['TARGET'].astype(int).plot.hist()
<Axes: ylabel='Frequency'>

  • This is a heavily imbalanced class problem

    • There are far more loans that were repaid on time (0) than loans that were not repaid (1)
  • In more sophisticated machine learning models, we could weight the classes by their representation in the data to reflect this imbalance

2-2. Checking Missing Values

  • Number and percentage of missing values in each column
# Function to calculate missing values by column

def missing_values_table(df):
        # Total number of missing values per column
        mis_val = df.isnull().sum()
        
        # Percentage of missing values relative to the whole dataset
        mis_val_percent = 100 * df.isnull().sum() / len(df) 
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis = 1) # concatenate along the columns
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
            columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort by the percentage of missing values, descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
                '% of Total Values', ascending = False).round(1) # round to one decimal place
        
        # Print summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        return mis_val_table_ren_columns
# Summary of the missing values
missing_values = missing_values_table(app_train)
missing_values.head(20)
Your selected dataframe has 122 columns.
There are 67 columns that have missing values.
Missing Values % of Total Values
COMMONAREA_MEDI 214865 69.9
COMMONAREA_AVG 214865 69.9
COMMONAREA_MODE 214865 69.9
NONLIVINGAPARTMENTS_MEDI 213514 69.4
NONLIVINGAPARTMENTS_MODE 213514 69.4
NONLIVINGAPARTMENTS_AVG 213514 69.4
FONDKAPREMONT_MODE 210295 68.4
LIVINGAPARTMENTS_MODE 210199 68.4
LIVINGAPARTMENTS_MEDI 210199 68.4
LIVINGAPARTMENTS_AVG 210199 68.4
FLOORSMIN_MODE 208642 67.8
FLOORSMIN_MEDI 208642 67.8
FLOORSMIN_AVG 208642 67.8
YEARS_BUILD_MODE 204488 66.5
YEARS_BUILD_MEDI 204488 66.5
YEARS_BUILD_AVG 204488 66.5
OWN_CAR_AGE 202929 66.0
LANDAREA_AVG 182590 59.4
LANDAREA_MEDI 182590 59.4
LANDAREA_MODE 182590 59.4
  • When it comes time to build our machine learning models, we will have to fill in these missing values (known as imputation)

  • In later work we will use models such as XGBoost that can handle missing values with no need for imputation

  • Another option is to drop columns with a high percentage of missing values

    • However, we cannot know in advance whether those columns will be helpful to the model

    • For now, we keep all the columns
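If we did decide to drop high-missing columns, it is a one-liner on the ratio returned by isnull().mean(). A minimal sketch on a toy DataFrame (the column names and the 50% cutoff are illustrative; the notebook itself keeps every column):

```python
# Minimal sketch: drop columns whose missing ratio exceeds a cutoff.
import numpy as np
import pandas as pd

df = pd.DataFrame({'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
                   'some_missing':   [1.0, np.nan, 3.0, 4.0],
                   'complete':       [1.0, 2.0, 3.0, 4.0]})

missing_ratio = df.isnull().mean()        # fraction of NaN per column
kept = df.loc[:, missing_ratio <= 0.5]    # keep columns with <= 50% missing
print(list(kept.columns))                 # ['some_missing', 'complete']
```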

2-3. Column Types

  • int64 and float64 are numeric variables

    • discrete or continuous

    • reference

  • object columns contain strings and are categorical features

# Number of columns of each type
app_train.dtypes.value_counts()
float64    65
int64      41
object     16
dtype: int64
  • ๊ฐ objectํ˜•(๋ฒ”์ฃผํ˜•) ์ปฌ๋Ÿผ๋“ค์˜ ๊ณ ์œ  ํ•ญ๋ชฉ ์ˆ˜
# ๊ฐ object ์ปฌ๋Ÿผ์˜ ๊ณ ์œ  ํด๋ž˜์Šค ์ˆ˜
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
  • Most of the categorical variables have a relatively small number of unique entries

2-4. Encoding Categorical Variables

  • Most machine learning models cannot handle categorical variables

    • Some models, such as LightGBM, are exceptions

    • We have to encode (represent) these variables as numbers before feeding them to a model


1. Label Encoding

  • Assign an integer to each unique category of the categorical variable

  • No new columns are created

image

2. One-hot Encoding

  • Create a new column for each unique category of the categorical variable

  • Each observation gets a 1 in the column for its own category and a 0 in all the other new columns

image

a) Label Encoding and One-Hot Encoding

  • For categorical variables (dtype == object) with exactly 2 unique categories we use label encoding, and for categorical variables with more than 2 unique categories we use one-hot encoding

  • For label encoding we use Scikit-Learn's LabelEncoder, and for one-hot encoding we use pandas' get_dummies(df) function

### Create a LabelEncoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object': # if the column is categorical
        if len(list(app_train[col].unique())) <= 2:
            ### Label encoding
            le.fit(app_train[col]) # fit the encoder
            # Transform the data
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            
            le_count += 1
            
print('%d columns were label encoded.' % le_count)
3 columns were label encoded.
# One-hot encode the remaining (unencoded) categorical variables
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
Training Features shape:  (307511, 243)
Testing Features shape:  (48744, 239)

b) Aligning the Train/Test Data

  • The train and test data must have the same features (variables)

  • One-hot encoding created more columns in the train data because some categorical variables have categories that do not appear in the test data

  • To remove the columns in the train data that are not in the test data, we need to align the dataframes

    • First, extract the target column from the train data

      • It is not in the test data, but we need to keep this information
    • When aligning, we must set axis = 1 to align the dataframes by columns rather than by rows

train_labels = app_train['TARGET'] # target values

### Align the train/test data
# Keep only the columns present in both dataframes so that train and test match
app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)

# Add the target back into the train data
app_train['TARGET'] = train_labels

print('Training Features shape: ', app_train.shape) 
print('Testing Features shape: ', app_test.shape)
Training Features shape:  (307511, 240)
Testing Features shape:  (48744, 239)
  • The train and test data now have the same features (apart from the target variable)

  • One-hot encoding greatly increased the number of features

2-5. Further EDA

a) Anomalies

  • While performing EDA, we should watch out for anomalies within the data

    • They may be due to mistyped numbers, errors in measuring equipment, or valid but extreme measurements
  • We can detect anomalies by looking at a column's statistics with the describe() function

  • The numbers in the DAYS_BIRTH column are negative because they are recorded relative to the current loan application

    • To see the statistics in years, we can multiply by -1 and divide by the number of days in a year (365)
(app_train['DAYS_BIRTH'] / -365).describe()
count    307511.000000
mean         43.936973
std          11.956133
min          20.517808
25%          34.008219
50%          43.150685
75%          53.923288
max          69.120548
Name: DAYS_BIRTH, dtype: float64
  • There do not appear to be any outliers for the age on either the high or low end.
app_train['DAYS_EMPLOYED'].describe()
count    307511.000000
mean      63815.045904
std      141275.766519
min      -17912.000000
25%       -2760.000000
50%       -1213.000000
75%        -289.000000
max      365243.000000
Name: DAYS_EMPLOYED, dtype: float64
  • The maximum value is about 1000 years?!

    • There are anomalies.
### Check the distribution of the data

app_train['DAYS_EMPLOYED'].plot.hist(title = 'Days Employment Histogram');
plt.xlabel('Days Employment');

  • Let's subset the anomalous clients and see whether they tend to have higher or lower rates of default than the other clients.
anom = app_train[app_train['DAYS_EMPLOYED'] == 365243]
non_anom = app_train[app_train['DAYS_EMPLOYED'] != 365243]
print('The non-anomalies default on %0.2f%% of loans' % (100 * non_anom['TARGET'].mean()))
print('The anomalies default on %0.2f%% of loans' % (100 * anom['TARGET'].mean()))
print('There are %d anomalous days of employment' % len(anom))
The non-anomalies default on 8.66% of loans
The anomalies default on 5.40% of loans
There are 55374 anomalous days of employment
  • The clients with the anomalous value turn out to have a lower rate of default

✔ Handling the anomalies

  • One of the safest approaches is to set the anomalies to a missing value before machine learning and then fill them in (imputation)

    • We fill them all with the same value, on the assumption that they have something in common
  • Here we replace the anomalous values with np.nan (missing values), then create a new boolean column indicating whether the value was anomalous

# Create a flag column recording whether the value was anomalous
app_train['DAYS_EMPLOYED_ANOM'] = app_train["DAYS_EMPLOYED"] == 365243

# Replace the anomalous values with np.nan
app_train['DAYS_EMPLOYED'].replace({365243: np.nan}, inplace = True)

app_train['DAYS_EMPLOYED'].plot.hist(title = 'Days Employment Histogram');
plt.xlabel('Days Employment');

  • The distribution now looks much more in line with what we would expect, and we have created a new column (DAYS_EMPLOYED_ANOM) to tell the model that these values were originally anomalous

    • Because we will probably fill in the NaN values with the median of the column
  • Apart from the anomalies, everything looks fine

app_test['DAYS_EMPLOYED_ANOM'] = app_test["DAYS_EMPLOYED"] == 365243
app_test["DAYS_EMPLOYED"].replace({365243: np.nan}, inplace = True)

print('There are %d anomalies in the test data out of %d entries' % (app_test["DAYS_EMPLOYED_ANOM"].sum(), len(app_test)))
There are 9274 anomalies in the test data out of 48744 entries

b) Correlations

  • One way to try to understand the data is to look for correlations between the features and the target

  • The correlation coefficient is not the best method to represent the relevance of a feature, but it gives us an idea of possible relationships within the data

  • A general interpretation of the absolute value of the correlation coefficient

    • 0.00 ~ 0.19: very weak correlation

    • 0.20 ~ 0.39: weak correlation

    • 0.40 ~ 0.59: moderate correlation

    • 0.60 ~ 0.79: strong correlation

    • 0.80 ~ 1.00: very strong correlation

# Compute the correlations with the target and sort
correlations = app_train.corr()['TARGET'].sort_values()

print('Most Positive Correlations:\n', correlations.tail(15))
print()
print('\nMost Negative Correlations:\n', correlations.head(15))
Most Positive Correlations:
 OCCUPATION_TYPE_Laborers                             0.043019
FLAG_DOCUMENT_3                                      0.044346
REG_CITY_NOT_LIVE_CITY                               0.044395
FLAG_EMP_PHONE                                       0.045982
NAME_EDUCATION_TYPE_Secondary / secondary special    0.049824
REG_CITY_NOT_WORK_CITY                               0.050994
DAYS_ID_PUBLISH                                      0.051457
CODE_GENDER_M                                        0.054713
DAYS_LAST_PHONE_CHANGE                               0.055218
NAME_INCOME_TYPE_Working                             0.057481
REGION_RATING_CLIENT                                 0.058899
REGION_RATING_CLIENT_W_CITY                          0.060893
DAYS_EMPLOYED                                        0.074958
DAYS_BIRTH                                           0.078239
TARGET                                               1.000000
Name: TARGET, dtype: float64


Most Negative Correlations:
 EXT_SOURCE_3                           -0.178919
EXT_SOURCE_2                           -0.160472
EXT_SOURCE_1                           -0.155317
NAME_EDUCATION_TYPE_Higher education   -0.056593
CODE_GENDER_F                          -0.054704
NAME_INCOME_TYPE_Pensioner             -0.046209
DAYS_EMPLOYED_ANOM                     -0.045987
ORGANIZATION_TYPE_XNA                  -0.045987
FLOORSMAX_AVG                          -0.044003
FLOORSMAX_MEDI                         -0.043768
FLOORSMAX_MODE                         -0.043226
EMERGENCYSTATE_MODE_No                 -0.042201
HOUSETYPE_MODE_block of flats          -0.040594
AMT_GOODS_PRICE                        -0.039645
REGION_POPULATION_RELATIVE             -0.037227
Name: TARGET, dtype: float64
  • DAYS_BIRTH has the strongest positive correlation with the target

    • DAYS_BIRTH is the client's age at the time of the loan -> negative values
  • The correlation is positive, but the actual values of the feature are negative

    • In other words, as the client gets older, they are less likely to default on their loan (i.e. target == 0)

    • If we take the absolute value of the feature, the correlation becomes negative

c) Effect of Age on Repayment (Target)

app_train['DAYS_BIRTH'] = abs(app_train['DAYS_BIRTH']) # days since birth, used to compute the age
app_train['DAYS_BIRTH'].corr(app_train['TARGET'])
-0.07823930830982694
  • Negative correlation with the target

    • As the client gets older, they tend to repay their loans on time more often
### Plot a histogram
# Visualize with a histogram to look more closely
# Put the x axis in years

# Set the style of plots
plt.style.use('fivethirtyeight')

# Distribution of the age in years
plt.hist(app_train['DAYS_BIRTH'] / 365, edgecolor = 'k', bins = 25)
plt.title('Age of Client')
plt.xlabel('Age (years)')
plt.ylabel('Count')
Text(0, 0.5, 'Count')

  • Other than showing that there are no outliers in any age range, the age distribution by itself does not tell us much
### KDE plot
# To visualize the effect of age on the target, draw a kernel density estimate (KDE) plot
# A KDE shows the distribution of a single variable and can be thought of as a smoothed histogram

plt.figure(figsize = (10, 8))

# target = 0 (loan repaid on time)
sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, 'DAYS_BIRTH'] / 365, 
            label = 'target == 0')

# target = 1 (loan not repaid on time)
sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, 'DAYS_BIRTH'] / 365, 
            label = 'target == 1')

plt.xlabel('Age (years)'); plt.ylabel('Density'); plt.title('Distribution of Ages');

  • The target == 1 curve skews towards the younger end of the age range

    • This is not a significant correlation (correlation coefficient: -0.07), but age does affect the target, so the variable is likely to be useful in a machine learning model
### Average failure to repay by age bracket
# To make the graph, first cut the ages into bins of 5 years each, then compute the average of the target in each bin
# This gives the proportion of loans that were not repaid in each age bracket

# Put the age information into a separate dataframe
age_data = app_train[['TARGET', 'DAYS_BIRTH']]
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365

# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], 
                                  bins = np.linspace(20, 70, num = 11))
age_data.head(10)
TARGET DAYS_BIRTH YEARS_BIRTH YEARS_BINNED
0 1 9461 25.920548 (25.0, 30.0]
1 0 16765 45.931507 (45.0, 50.0]
2 0 19046 52.180822 (50.0, 55.0]
3 0 19005 52.068493 (50.0, 55.0]
4 0 19932 54.608219 (50.0, 55.0]
5 0 16941 46.413699 (45.0, 50.0]
6 0 13778 37.747945 (35.0, 40.0]
7 0 18850 51.643836 (50.0, 55.0]
8 0 20099 55.065753 (55.0, 60.0]
9 0 14469 39.641096 (35.0, 40.0]
# Group by the bins and compute the averages for each age bracket
age_groups  = age_data.groupby('YEARS_BINNED').mean()
age_groups
TARGET DAYS_BIRTH YEARS_BIRTH
YEARS_BINNED
(20.0, 25.0] 0.123036 8532.795625 23.377522
(25.0, 30.0] 0.111436 10155.219250 27.822518
(30.0, 35.0] 0.102814 11854.848377 32.479037
(35.0, 40.0] 0.089414 13707.908253 37.555913
(40.0, 45.0] 0.078491 15497.661233 42.459346
(45.0, 50.0] 0.074171 17323.900441 47.462741
(50.0, 55.0] 0.066968 19196.494791 52.593136
(55.0, 60.0] 0.055314 20984.262742 57.491131
(60.0, 65.0] 0.052737 22780.547460 62.412459
(65.0, 70.0] 0.037270 24292.614340 66.555108
### Visualization

plt.figure(figsize = (8, 8))

# Graph the age bins and the average of the target as a bar plot
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])

# Plot labeling
plt.xticks(rotation = 75); plt.xlabel('Age Group (years)'); plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group');

  • Younger applicants are more likely to fail to repay their loans

    • The youngest three groups have failure rates above 10%, while the oldest group is below 5%
  • Because younger clients are less likely to repay, it might be advisable to provide them with more guidance or financial planning tips.

d) Exterior Sources

  • The three variables with the strongest negative correlations with the target are EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3

  • According to the documentation, these features represent normalized scores from external data sources

    • They can be thought of as a kind of cumulative credit rating built from numerous data sources
### Check the correlations

ext_data = app_train[['TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
ext_data_corrs = ext_data.corr()
ext_data_corrs
TARGET EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH
TARGET 1.000000 -0.155317 -0.160472 -0.178919 -0.078239
EXT_SOURCE_1 -0.155317 1.000000 0.213982 0.186846 0.600610
EXT_SOURCE_2 -0.160472 0.213982 1.000000 0.109167 0.091996
EXT_SOURCE_3 -0.178919 0.186846 0.109167 1.000000 0.205478
DAYS_BIRTH -0.078239 0.600610 0.091996 0.205478 1.000000
### Visualization

plt.figure(figsize = (8, 6))

# Heatmap
sns.heatmap(ext_data_corrs, cmap = plt.cm.RdYlBu_r, vmin = -0.25, 
            annot = True, vmax = 0.6)
plt.title('Correlation Heatmap');

  • All three EXT_SOURCE features have negative correlations with the target

    • As the value of an EXT_SOURCE increases, the client is more likely to repay the loan
  • We can also see that DAYS_BIRTH is positively correlated with EXT_SOURCE_1

    • The client's age may be one of the factors in this score
### Visualize the distribution of each feature by target value
# to see the effect of each variable on the target

plt.figure(figsize = (10, 12))

# iterate through the sources
for i, source in enumerate(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']):
    # create a new subplot for each source
    plt.subplot(3, 1, i + 1)

    # ์ œ๋•Œ ์ƒํ™˜๋จ(target = 0)
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, source], label = 'target == 0')
    # ์ œ๋•Œ ์ƒํ™˜๋˜์ง€ ๋ชปํ•จ(target = 1)
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, source], label = 'target == 1')
    
    plt.title('Distribution of %s by Target Value' % source)
    plt.xlabel('%s' % source); plt.ylabel('Density');
    
plt.tight_layout(h_pad = 2.5)

  • EXT_SOURCE_3 shows the greatest difference between the target values

  • The visualizations make it clear that these features have some relationship with an applicant's likelihood of repaying a loan

  • These variables are likely to be useful in a machine learning model that predicts whether an applicant will repay a loan on time

2-6. Pairs Plot

  • As a final exploratory plot, we can draw a pairplot of the EXT_SOURCE variables and the DAYS_BIRTH variable

  • It shows not only the distribution of each single variable but also the relationships between pairs of variables

# Copy the data for plotting
plot_data = ext_data.drop(columns = ['DAYS_BIRTH']).copy()

# Add the age of the client in years
plot_data['YEARS_BIRTH'] = age_data['YEARS_BIRTH']

# Drop missing values and limit to the first 100000 rows
plot_data = plot_data.dropna().loc[:100000, :]

### Function to calculate the correlation coefficient between two variables
def corr_func(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1] # compute the correlation coefficient
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.2, .8), xycoords = ax.transAxes, 
                size = 10)

### Visualization
# Create the PairGrid object
grid = sns.PairGrid(data = plot_data, diag_sharey=False,
                    hue = 'TARGET', 
                    vars = [x for x in list(plot_data.columns) if x != 'TARGET'])
# Upper triangle: scatter plots
grid.map_upper(plt.scatter, alpha = 0.2)
# Diagonal: kdeplots
grid.map_diag(sns.kdeplot)
# Lower triangle: density plots
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r);

# Title for the plot
plt.suptitle('Ext Source and Age Features Pairs Plot', size = 20, y = 1.05);

  • Red: loans that were not repaid (target = 1) / Blue: loans that were repaid (target = 0)

  • There appears to be a moderate positive linear relationship between EXT_SOURCE_1 and DAYS_BIRTH (or equivalently YEARS_BIRTH)

    • This suggests that the feature may take the client's age into account

3. Feature Engineering

  • Kaggle competitions are won by feature engineering

    • Those who win are those who can create the most useful features out of the data

    • For structured data, this is mostly true, as the winning models all tend to be variants of gradient boosting


  • This represents one of the recurring patterns in machine learning

    • Feature engineering has a greater return than model building and hyperparameter tuning

    • related article


  • Choosing the right model and optimal settings is important, but the model can only learn from the data it is given

    • Making sure this data is as relevant to the task as possible is the job of the data scientist

    • references

  • Feature engineering refers to the general process and includes both feature construction (adding new features from the existing data) and feature selection (choosing only the most important features or using other dimensionality reduction methods)

  • We will do a lot of feature engineering when we use the other data sources, but in this notebook we will try only two simple feature construction methods:

    • Polynomial features

    • Domain knowledge features

3-1. Polynomial Features

  • One simple feature-construction method is polynomial features

    • Creates features that are powers of existing features as well as interactions between existing features

    • For example, we can create variables such as EXT_SOURCE_1^2 and EXT_SOURCE_2^2, as well as EXT_SOURCE_1 x EXT_SOURCE_2, EXT_SOURCE_1 x EXT_SOURCE_2^2, EXT_SOURCE_1^2 x EXT_SOURCE_2^2, and so on

  • Features that are combinations of several individual variables are called interaction terms, because they capture the interactions between variables

    • Two variables may have little effect on the target by themselves, yet combining them into a single interaction variable can reveal a meaningful relationship with the target
  • Interaction terms are commonly used in statistical models to capture the effects of multiple variables, but they are used less often in machine learning

    • Here we will try a few to see whether they help the model predict whether clients will repay their loans

  • The following code creates polynomial features from the EXT_SOURCE variables and the DAYS_BIRTH variable

  • Scikit-learn has a useful class called PolynomialFeatures that creates polynomials and interaction terms up to a specified degree

    • Here we use degree = 3
  • When creating polynomial features, using too high a degree is not recommended: the number of features grows exponentially with the degree, and overfitting can become a problem

### Select the columns for the polynomial features
poly_features = app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'TARGET']]
poly_features_test = app_test[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]

### Imputer for handling missing values
# Using the removed Imputer class raises an error -> switched to SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median') # replace missing values with the median

### Polynomial variables
poly_target = poly_features['TARGET']
poly_features = poly_features.drop(columns = ['TARGET'])

### Impute missing values
poly_features = imputer.fit_transform(poly_features) # fit & transform
## The test data must be transformed in exactly the "same" way, with the imputer fitted on the training data
poly_features_test = imputer.transform(poly_features_test) # transform only

### Create the polynomial variables
from sklearn.preprocessing import PolynomialFeatures
poly_transformer = PolynomialFeatures(degree = 3)
### Fit the transformer on the polynomial features
poly_transformer.fit(poly_features)

### Transform the features
poly_features = poly_transformer.transform(poly_features) # transform the training data
poly_features_test = poly_transformer.transform(poly_features_test) # transform the test data
print('Polynomial Features shape: ', poly_features.shape)
Polynomial Features shape:  (307511, 35)
  • ๊ฝค ๋งŽ์€ ์ƒˆ๋กœ์šด feature๋“ค์ด ์ƒˆ๋กœ ์ƒ์„ฑ๋จ

  • feature์˜ ์ด๋ฆ„์„ ์–ป์œผ๋ ค๋ฉด get_feature_names_out() ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉ

poly_transformer.get_feature_names_out(input_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH'])[:15]
array(['1', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH',
       'EXT_SOURCE_1^2', 'EXT_SOURCE_1 EXT_SOURCE_2',
       'EXT_SOURCE_1 EXT_SOURCE_3', 'EXT_SOURCE_1 DAYS_BIRTH',
       'EXT_SOURCE_2^2', 'EXT_SOURCE_2 EXT_SOURCE_3',
       'EXT_SOURCE_2 DAYS_BIRTH', 'EXT_SOURCE_3^2',
       'EXT_SOURCE_3 DAYS_BIRTH', 'DAYS_BIRTH^2'], dtype=object)
  • 35๊ฐœ์˜ feature์ด ์žˆ์œผ๋ฉฐ ๊ฐœ๋ณ„ feature์˜ ์ฐจ์ˆ˜(degree)๋Š” ์ตœ๋Œ€ 3๊นŒ์ง€ ๋†’์•„์ง€๋ฉฐ(3์ฐจํ•ญ๊นŒ์ง€) interaction terms๊ฐ€ ์ƒ์„ฑ๋จ

  • ์ƒˆ๋กœ์šด feature๋“ค์ด target๊ณผ ์ƒ๊ด€ ๊ด€๊ณ„๊ฐ€ ์žˆ๋Š”์ง€ ํ™•์ธ

### Create a DataFrame of the features
poly_features = pd.DataFrame(poly_features, 
                             columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2', 
                                                                           'EXT_SOURCE_3', 'DAYS_BIRTH']))

### Add the target
poly_features['TARGET'] = poly_target

### Find the correlations with the target
poly_corrs = poly_features.corr()['TARGET'].sort_values()

### Strongest negative / positive correlations
print(poly_corrs.head(10))
print(poly_corrs.tail(5))
EXT_SOURCE_2 EXT_SOURCE_3                -0.193939
EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3   -0.189605
EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH     -0.181283
EXT_SOURCE_2^2 EXT_SOURCE_3              -0.176428
EXT_SOURCE_2 EXT_SOURCE_3^2              -0.172282
EXT_SOURCE_1 EXT_SOURCE_2                -0.166625
EXT_SOURCE_1 EXT_SOURCE_3                -0.164065
EXT_SOURCE_2                             -0.160295
EXT_SOURCE_2 DAYS_BIRTH                  -0.156873
EXT_SOURCE_1 EXT_SOURCE_2^2              -0.156867
Name: TARGET, dtype: float64
DAYS_BIRTH     -0.078239
DAYS_BIRTH^2   -0.076672
DAYS_BIRTH^3   -0.074273
TARGET          1.000000
1                    NaN
Name: TARGET, dtype: float64
  • Some of the new variables have a greater (absolute) correlation with the target than the original features

    • The correlation becomes stronger

    • When building machine learning models we can compare performance with and without these features, to check whether the polynomial variables actually help the model learn

  • Add these features to copies of the train and test data, then evaluate models with and without the polynomial variables

    • We won't know until we try
### ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์— test data ์‚ฝ์ž…
poly_features_test = pd.DataFrame(poly_features_test, 
                                  columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2', 
                                                                                'EXT_SOURCE_3', 'DAYS_BIRTH']))
### ํ•™์Šต์šฉ df์— ๋‹คํ•ญ ๋ณ€์ˆ˜ ๋ณ‘ํ•ฉ
poly_features['SK_ID_CURR'] = app_train['SK_ID_CURR']
app_train_poly = app_train.merge(poly_features, on = 'SK_ID_CURR', how = 'left')

### ํ‰๊ฐ€์šฉ df์— ๋‹คํ•ญ ๋ณ€์ˆ˜ ๋ณ‘ํ•ฉ
poly_features_test['SK_ID_CURR'] = app_test['SK_ID_CURR']
app_test_poly = app_test.merge(poly_features_test, on = 'SK_ID_CURR', how = 'left')

### df ์ •๋ ฌ
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)

### ์ƒˆ๋กœ ์ƒ์„ฑ๋œ df์˜ ํ˜•ํƒœ ์ถœ๋ ฅ
print('Training data with polynomial features shape: ', app_train_poly.shape)
print('Testing data with polynomial features shape:  ', app_test_poly.shape)
Training data with polynomial features shape:  (307511, 275)
Testing data with polynomial features shape:   (48744, 275)
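`align(join='inner', axis=1)` keeps only the columns the two DataFrames share, which is why the train-only TARGET column disappears from app_train_poly above. A minimal sketch with made-up column names:

```python
import pandas as pd

df_train = pd.DataFrame({'A': [1], 'B': [2], 'TARGET': [0]})  # hypothetical train frame
df_test = pd.DataFrame({'A': [3], 'B': [4], 'C': [5]})        # hypothetical test frame

# Keep only the columns present in both frames
df_train, df_test = df_train.align(df_test, join='inner', axis=1)
print(list(df_train.columns), list(df_test.columns))  # ['A', 'B'] ['A', 'B']
```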

3-2. Domain Knowledge Features

  • Since we are not credit experts, calling this domain knowledge may not be quite right; it is better described as an attempt to apply limited financial knowledge

    • In that spirit, we can create a few features that try to capture what we think may matter in telling whether a client will default

      • CREDIT_INCOME_PERCENT: the credit amount as a percentage of the client's income

      • ANNUITY_INCOME_PERCENT: the loan annuity as a percentage of the client's income

      • CREDIT_TERM: the length of the payment (in months, since the annuity is the amount due each month)

      • DAYS_EMPLOYED_PERCENT: the days employed as a percentage of the client's age in days

### ๋„๋ฉ”์ธ ์ง€์‹์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์—ฌ๋Ÿฌ ๋ณ€์ˆ˜๋“ค์„ ์ƒ์„ฑ

app_train_domain = app_train.copy()
app_test_domain = app_test.copy()

app_train_domain['CREDIT_INCOME_PERCENT'] = app_train_domain['AMT_CREDIT'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['ANNUITY_INCOME_PERCENT'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['CREDIT_TERM'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_CREDIT']
app_train_domain['DAYS_EMPLOYED_PERCENT'] = app_train_domain['DAYS_EMPLOYED'] / app_train_domain['DAYS_BIRTH']
app_test_domain['CREDIT_INCOME_PERCENT'] = app_test_domain['AMT_CREDIT'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['ANNUITY_INCOME_PERCENT'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['CREDIT_TERM'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_CREDIT']
app_test_domain['DAYS_EMPLOYED_PERCENT'] = app_test_domain['DAYS_EMPLOYED'] / app_test_domain['DAYS_BIRTH']

โบ ์ƒˆ๋กœ์šด ๋ณ€์ˆ˜๋“ค์„ ์‹œ๊ฐํ™”

  • ๋„๋ฉ”์ธ ์ง€์‹ ๋ณ€์ˆ˜๋“ค์„ ๊ทธ๋ž˜ํ”„์—์„œ ์‹œ๊ฐ์ ์œผ๋กœ ํƒ๊ตฌ

  • Target ๊ฐ’์— ๋”ฐ๋ผ ์ƒ‰์„ ๋‹ค๋ฅด๊ฒŒ ์ ์šฉํ•œ kdeplot์œผ๋กœ ์‹œ๊ฐํ™”

plt.figure(figsize = (12, 20))

### ์ƒˆ๋กœ ์ƒ์„ฑํ•œ feature๋“ค์„ ์ˆœํ™˜์‹œํ‚ค๋ฉฐ
for i, feature in enumerate(['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT']):
    # ๊ฐ ๋ณ€์ˆ˜๋“ค์— ๋Œ€ํ•ด subplot ์ƒ์„ฑ
    plt.subplot(4, 1, i + 1)
    # ์ œ๋•Œ ์ƒํ™˜๋œ ๋Œ€์ถœ
    sns.kdeplot(app_train_domain.loc[app_train_domain['TARGET'] == 0, feature], label = 'target == 0')
    # ์ œ๋•Œ ์ƒํ™˜๋˜์ง€ ๋ชปํ•œ ๋Œ€์ถœ
    sns.kdeplot(app_train_domain.loc[app_train_domain['TARGET'] == 1, feature], label = 'target == 1')
    
    # Plot Labeling
    plt.title('Distribution of %s by Target Value' % feature)
    plt.xlabel('%s' % feature); plt.ylabel('Density');
    
plt.tight_layout(h_pad = 2.5)

  • ์œ„์˜ ๊ทธ๋ž˜ํ”„๋“ค์„ ๋ณด๋ฉด ์ƒˆ๋กœ ๋งŒ๋“  feature๋“ค์ด ์œ ์šฉํ•œ์ง€ ๋ถˆํ™•์‹คํ•จ

    • ๋”ฐ๋ผ์„œ, ๊ทธ๋ƒฅ ์ด feature๋“ค์„ ๊ฐ€์ง€๊ณ  ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ™•์ธํ•ด๋ด…๋‹ˆ๋‹ค.(Just Do it)

(+) ๊ฐœ์ธ์ ์ธ ์ƒ๊ฐ

  • feature๋“ค์˜ ์™œ๊ณก ์ •๋„๊ฐ€ ๋„ˆ๋ฌด ์‹ฌํ•จ(skewed data)

    • ํšŒ๊ท€ ๋ฌธ์ œ์—์„œ๋Š” ์˜คํžˆ๋ ค ๋ชจ๋ธ์ด overfitting๋˜๋Š” ๋ฌธ์ œ๋ฅผ ์ดˆ๋ž˜ํ•  ์ˆ˜ ์žˆ์Œ
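One common remedy for heavy skew (not applied in this notebook) is a log transform; a sketch on synthetic right-skewed data, assuming np.log1p is appropriate because these ratio features are non-negative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))  # synthetic skewed ratio

# Skewness drops sharply after the log1p transform
print(round(float(s.skew()), 2), round(float(np.log1p(s).skew()), 2))
```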

4. Baseline Models

  • A naive baseline model could guess the same value for every example in the test set

    • Since we are asked to predict the probability of not repaying the loan (target = 1), if we were entirely unsure we could guess 0.5 for every observation in the test set (just splitting it down the middle)
  • As a slightly more sophisticated model than that baseline, we use LogisticRegression

4-1. Logistic Regression

  • To get a baseline, we encode the categorical variables and then use all the features

    • Preprocess by filling in the missing values and normalizing the range of the features => feature scaling

from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

### First drop the target from the training data
if 'TARGET' in app_train:
    train = app_train.drop(columns = ['TARGET'])
else:
    train = app_train.copy()
    
### List of features
features = list(train.columns)

### Copy the test data
test = app_test.copy()

### Missing values: replace with the median
imputer = SimpleImputer(strategy = 'median')

### Rescale the data to the 0 ~ 1 range
scaler = MinMaxScaler(feature_range = (0, 1))

### Impute missing values
imputer.fit(train) # fit the imputer on the training data,
train = imputer.transform(train) # then transform!
test = imputer.transform(app_test) # transform with the imputer fitted on the training data, exactly "as-is"

### Scaling
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)

print('Training data shape: ', train.shape)
print('Testing data shape: ', test.shape)
Training data shape:  (307511, 240)
Testing data shape:  (48744, 240)
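The impute-then-scale steps above can also be chained in a scikit-learn Pipeline (an alternative not used in this notebook), which guarantees the test data always goes through exactly the same fitted steps; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

preprocess = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),   # fill NaNs with the column median
    ('scaler', MinMaxScaler(feature_range=(0, 1))),  # then rescale to [0, 1]
])

X_train = np.array([[1.0, np.nan], [2.0, 4.0], [3.0, 6.0]])
X_scaled = preprocess.fit_transform(X_train)
print(X_scaled)  # the NaN became the column median 5.0, then everything was min-max scaled
```

At prediction time, `preprocess.transform(X_test)` applies both fitted steps in order, so the train/test leakage rule from above is enforced automatically.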
  • For our first model we will use Scikit-Learn's `LogisticRegression`

  • The only hyperparameter we change from the default model is to lower C, the regularization parameter that controls the amount of overfitting

    • A lower value reduces overfitting
### Apply logistic regression

from sklearn.linear_model import LogisticRegression

# create the model object
log_reg = LogisticRegression(C = 0.0001)

# fit the model (train)
log_reg.fit(train, train_labels)
LogisticRegression(C=0.0001)
  • Now that the model is trained, we can make predictions

  • Use the predict_proba() method to predict the probability that the loan will not be repaid (target = 1)

    • The first column is the probability that the target is 0, and the second column is the probability that it is 1

    • For a single row, the two columns must sum to 1

    • Since we want the probability that the loan is not repaid (target = 1), we select the second column

### Predict
# select only the probability of not repaying the loan

log_reg_pred = log_reg.predict_proba(test)[:, 1]
### Submission DataFrame

submit = app_test[['SK_ID_CURR']]
submit['TARGET'] = log_reg_pred

submit.head()
SK_ID_CURR TARGET
0 100001 0.078515
1 100005 0.137926
2 100013 0.082194
3 100028 0.080921
4 100038 0.132618
  • ์˜ˆ์ธก์€ ๋Œ€์ถœ์ด ์ƒํ™˜๋˜์ง€ ์•Š์„(target = 1) ํ™•๋ฅ ์ด 0๊ณผ 1 ์‚ฌ์ด์ž„์„ ๋‚˜ํƒ€๋ƒ„

    • ๋Œ€์ถœ์ด ์œ„ํ—˜ํ•˜๋‹ค๋Š” ๊ฒƒ์„ ๊ฒฐ์ •ํ•˜๊ธฐ ์œ„ํ•œ ํ™•๋ฅ  ์ž„๊ณ„๊ฐ’์„ ์„ค์ •ํ•  ์ˆ˜ ์žˆ์Œ
# Save the submission to a csv file
submit.to_csv('log_reg_baseline.csv', index = False)
  • The score is about 0.671

4-2. Improved Model: Random Forest

  • Let's improve performance by using a RandomForest on the same training data.

  • A random forest is a much more powerful model, especially when we use hundreds of trees

  • Set n_estimators = 100

from sklearn.ensemble import RandomForestClassifier

### Create the RandomForest object
random_forest = RandomForestClassifier(n_estimators = 100, random_state = 50, 
                                       verbose = 1, n_jobs = -1)
### Train the model
random_forest.fit(train, train_labels)

### Check the feature importances
feature_importance_values = random_forest.feature_importances_
feature_importances = pd.DataFrame({'feature': features, 'importance': feature_importance_values})

### Make predictions with the test data
predictions = random_forest.predict_proba(test)[:, 1]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  3.5min finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.8s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    2.3s finished
### Competition submission code

submit = app_test[['SK_ID_CURR']]
submit['TARGET'] = predictions

submit.to_csv('random_forest_baseline.csv', index = False)
  • The score is about 0.678

- Making Predictions with Engineered Features

  • The only way to see whether the polynomial features and domain knowledge improved the model is to actually train a test model on these features

    • Compare with the performance of the model without these features, to measure the effect of the feature engineering

โบTesting polynomial features

poly_features_names = list(app_train_poly.columns) # Polynomial feature๋“ค์˜ ๋ชฉ๋ก

### ๊ฒฐ์ธก์น˜๊ฐ€ ์กด์žฌํ•˜๋Š” ๊ฒฝ์šฐ ์ค‘์•™๊ฐ’์œผ๋กœ ๋Œ€์ฒด
imputer = SimpleImputer(strategy = 'median')
poly_features = imputer.fit_transform(app_train_poly)
poly_features_test = imputer.transform(app_test_poly)

### Scaling
scaler = MinMaxScaler(feature_range = (0, 1))
poly_features = scaler.fit_transform(poly_features)
poly_features_test = scaler.transform(poly_features_test)
### ๋ชจ๋ธ ๊ฐ์ฒด ์ƒ์„ฑ
random_forest_poly = RandomForestClassifier(n_estimators = 100, random_state = 50, 
                                            verbose = 1, n_jobs = -1)

### ๋ชจ๋ธ ํ•™์Šต
random_forest_poly.fit(poly_features, train_labels)

### ์˜ˆ์ธก
predictions = random_forest_poly.predict_proba(poly_features_test)[:, 1]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  4.9min finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.5s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    1.0s finished
### Submission code
submit = app_test[['SK_ID_CURR']]
submit['TARGET'] = predictions

submit.to_csv('random_forest_baseline_engineered.csv', index = False)
  • Score: about 0.678

    • Performance is the same as when the polynomial variables were not applied

โบ Testing domain features

# ์ผ๋‹จ target์„ ๋นผ๊ณ  ๋ฐ์ดํ„ฐ ๊ฐ€๊ณต
app_train_domain = app_train_domain.drop(columns = 'TARGET')
domain_features_names = list(app_train_domain.columns)

### ๊ฒฐ์ธก์น˜๊ฐ€ ์กด์žฌํ•˜๋Š” ๊ฒฝ์šฐ "์ค‘์•™๊ฐ’"์œผ๋กœ ๋Œ€์ฒด
imputer = SimpleImputer(strategy = 'median')
domain_features = imputer.fit_transform(app_train_domain)
domain_features_test = imputer.transform(app_test_domain)

### Scaling
scaler = MinMaxScaler(feature_range = (0, 1))
domain_features = scaler.fit_transform(domain_features)
domain_features_test = scaler.transform(domain_features_test)

### ๋ชจ๋ธ ๊ฐ์ฒด ์ƒ์„ฑ
random_forest_domain = RandomForestClassifier(n_estimators = 100, random_state = 50, 
                                              verbose = 1, n_jobs = -1)

### ๋ชจ๋ธ ํ•™์Šต
random_forest_domain.fit(domain_features, train_labels)

### feature ์ค‘์š”๋„ ํ™•์ธ
feature_importance_values_domain = random_forest_domain.feature_importances_
feature_importances_domain = pd.DataFrame({'feature': domain_features_names, 
                                           'importance': feature_importance_values_domain})

### ์˜ˆ์ธก
predictions = random_forest_domain.predict_proba(domain_features_test)[:, 1]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  3.1min finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    1.3s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    2.9s finished
### Submission code
submit = app_test[['SK_ID_CURR']]
submit['TARGET'] = predictions

submit.to_csv('random_forest_baseline_domain.csv', index = False)
  • ์ œ์ถœํ–ˆ์„ ๋•Œ 0.679์ ์„ ์–ป์—ˆ์Œ

    • ๊ฐ€๊ณต๋œ feature๋“ค์ด ํ•ด๋‹น ๋ชจ๋ธ์—์„œ๋Š” ๋„์›€์ด ๋˜์ง€ ์•Š๋Š”๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธ

4-3. Feature Importances

  • A simple way to see which variables are the most relevant is to look at the RandomForest's feature importances

  • Given the correlations seen in the Exploratory Data Analysis (EDA), we can expect the most important features to be EXT_SOURCE and DAYS_BIRTH

  • These feature importances can later be used for dimensionality reduction

Visualizing feature importances

  • Plot the importances returned by the model

    • This works with any measure of feature importance, as long as higher means more important

  • Arguments>

    • df (DataFrame):

      • the feature importances

      • the variables are in a feature column and the importances in an importance column

  • Return>

    • Plots the 15 most important features
def plot_feature_importances(df):
    ### Sort the features by importance
    df = df.sort_values('importance', ascending = False).reset_index()
    
    ### Normalize the feature importances (so they sum to 1)
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    ### Draw a horizontal bar chart (importance ∝ bar length)
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()
    ax.barh(list(reversed(list(df.index[:15]))), # reverse the index so the most important features appear at the top
            df['importance_normalized'].head(15), 
            align = 'center', edgecolor = 'k')
    
    ### Axis settings
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))
    
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()
    
    return df
feature_importances_sorted = plot_feature_importances(feature_importances)

  • ์˜ˆ์ƒ๋Œ€๋กœ ๊ฐ€์žฅ ์ค‘์š”ํ•œ feature๋“ค์€ EXT_SOURCE์™€ DAYS_BIRTH์ž„

  • ์œ„์˜ ๊ทธ๋ž˜ํ”„๋ฅผ ๋ณด๋ฉด ๋ชจ๋ธ์— ๋†’์€ importance์„ ์ง€๋‹Œ feature๋“ค์€ ์†Œ์ˆ˜์— ๋ถˆ๊ณผํ•œ ๊ฒƒ์„ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ์Œ

    • ์„ฑ๋Šฅ ์ €ํ•˜ ์—†์ด ๋งŽ์€ feature๋“ค์„ ์‚ญ์ œํ•  ์ˆ˜ ์žˆ์Œ์„ ์‹œ์‚ฌ(์กฐ๊ธˆ ๋–จ์–ด์งˆ ์ˆ˜๋Š” ์žˆ์Œ)
  • feature importance๋Š” ๋ชจ๋ธ์„ ํ•ด์„ํ•˜๊ฑฐ๋‚˜ ์ฐจ์› ์ถ•์†Œ๋ฅผ ํ•˜๋Š” ๊ฐ€์žฅ ์ •๊ตํ•œ ๋ฐฉ๋ฒ•์€ ์•„๋‹ˆ์ง€๋งŒ, ์˜ˆ์ธก์„ ํ•  ๋•Œ ๋ชจ๋ธ์ด ๊ณ ๋ คํ•˜๋Š” ์š”์†Œ๋ฅผ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์คŒ

feature_importances_domain_sorted = plot_feature_importances(feature_importances_domain)

  • All four of the hand-engineered features made it into the top 15 feature importances

    • CREDIT_TERM, ANNUITY_INCOME_PERCENT, CREDIT_INCOME_PERCENT, DAYS_EMPLOYED_PERCENT
  • The domain knowledge did have an effect on model performance.

5. Further Training: Light Gradient Boosting Machine

  • LightGBM library

  • The Gradient Boosting Machine is currently the leading model for learning on structured datasets

(especially on Kaggle)

  • Train/test an LGBM model using cross validation

  • Parameters>

    • features(pd.DataFrame):

      • the training features to use for training

      • must include the TARGET column

    • test_features(pd.DataFrame):

      • the test data on which the model will make predictions
    • encoding(str, default = 'ohe'):

      • the method for encoding the variables

      • ohe for one-hot encoding (-> categorical variables), le for label encoding (-> integer codes)

    • n_folds(int, default = 5):

      • the number of folds to use for cross validation
  • Return>

    • submission(pd.DataFrame):

      • a df with the SK_ID_CURR values and TARGET probabilities predicted by the model
    • feature_importances(pd.DataFrame):

      • a df containing the importances of the features in the model
    • valid_metrics(pd.DataFrame):

      • the training and validation scores (ROC AUC) for each fold and overall
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder  # needed below for encoding = 'le'
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import gc
def model(features, test_features, encoding = 'ohe', n_folds = 5):
    ### Prepare the data

    # extract the ids
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    # extract the labels -> for training
    labels = features['TARGET']
    # drop the ids and target
    features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
    test_features = test_features.drop(columns = ['SK_ID_CURR'])
    # --------------------------------------------------------------
    

    ### Encoding

    ## Categorical variables: One-Hot Encoding
    if encoding == 'ohe':
        features = pd.get_dummies(features)
        test_features = pd.get_dummies(test_features)
        
        # align the columns of the two dataframes
        features, test_features = features.align(test_features, join = 'inner', axis = 1)
        # no categorical indices to record separately
        cat_indices = 'auto'
    
    ## Categorical variables: Label Encoding (map to integer codes)
    elif encoding == 'le':
        # create the encoder object
        label_encoder = LabelEncoder()

        # list for recording the categorical column indices
        cat_indices = []

        # iterate over each column
        for i, col in enumerate(features):
            if features[col].dtype == 'object': # categorical variable
                # map the categorical variable to integers
                features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
                test_features[col] = label_encoder.transform(np.array(test_features[col].astype(str)).reshape((-1,)))

                # record the categorical column index
                cat_indices.append(i)

    ## raise an error for an invalid encoding scheme
    else:
        raise ValueError("Encoding must be either 'ohe' or 'le'")
    
    # --------------------------------------------------------------------------


    ### ๊ฐ€๊ณต๋œ ๋ฐ์ดํ„ฐ ํ˜•ํƒœ ํ™•์ธํ•˜๊ธฐ
    print('Training Data Shape: ', features.shape)
    print('Testing Data Shape: ', test_features.shape)

    ### feature ์ด๋ฆ„ ์ถ”์ถœ
    feature_names = list(features.columns)
    
    ### np array๋กœ ๋ณ€ํ™˜
    features = np.array(features)
    test_features = np.array(test_features)

    # -------------------------------------------------------------------------

    ### ๊ต์ฐจ ๊ฒ€์ฆ
    # KFold ๊ฐ์ฒด ์ƒ์„ฑ
    k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)
    
    # feature ์ค‘์š”๋„๋ฅผ ์ €์žฅํ•  ๋นˆ array ์ƒ์„ฑ
    feature_importance_values = np.zeros(len(feature_names))
    
    # ์˜ˆ์ธก๊ฐ’์„ ์ €์žฅํ•˜๊ธฐ ์œ„ํ•œ ๋นˆ array ์ƒ์„ฑ
    test_predictions = np.zeros(test_features.shape[0])
    
    # ํ‘œ๋ณธ ์™ธ(out of fold) ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ์˜ˆ์ธก์„ ์œ„ํ•œ arrau
    out_of_fold = np.zeros(features.shape[0])
    
    # --------------------------------------------------------------------------

    ### Modeling
    # lists for recording the validation/training scores
    valid_scores = []
    train_scores = []
    
    # iterate over each fold
    for train_indices, valid_indices in k_fold.split(features):
        ## prepare the data
        # training data for this fold
        train_features, train_labels = features[train_indices], labels[train_indices]
        # validation data for this fold
        valid_features, valid_labels = features[valid_indices], labels[valid_indices]

        ## create the model object
        model = lgb.LGBMClassifier(n_estimators = 10000, objective = 'binary', 
                                   class_weight = 'balanced', learning_rate = 0.05, 
                                   reg_alpha = 0.1, reg_lambda = 0.1, 
                                   subsample = 0.8, n_jobs = -1, random_state = 50)
        
        ## train the model
        # (recent LightGBM versions removed early_stopping_rounds/verbose from fit();
        #  there, use callbacks = [lgb.early_stopping(100), lgb.log_evaluation(200)])
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = cat_indices,
                  early_stopping_rounds = 100, verbose = 200)
        # record the best-performing iteration
        best_iteration = model.best_iteration_
        # record the feature importances
        feature_importance_values += model.feature_importances_ / k_fold.n_splits
        
        ## predict
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
        # store the out-of-fold predictions
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        # record the best scores
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        
        ## free memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        
    # --------------------------------------------------------------------------

    ### ์ œ์ถœ์šฉ ํŒŒ์ผ ๋งŒ๋“ค๊ธฐ
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    
    ### ํ”ผ์ฒ˜ ์ค‘์š”๋„ dataframe
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
    
    ### ์ „์ฒด ๋ฐ์ดํ„ฐ์˜ validation score ์ธก์ •
    valid_auc = roc_auc_score(labels, out_of_fold)
    
    ### ์ „์ฒด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ score๋ฅผ metric์— ์ถ”๊ฐ€
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    
    ### validation score๋ฅผ ์œ„ํ•œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์ƒ์„ฑ
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    
    ### training ๋ฐ validation score๊ฐ€ ์ €์žฅ๋œ df
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores}) 
    
    return submission, feature_importances, metrics

โบ ๊ธฐ๋ณธ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต

### ์„ฑ๋Šฅ ํ‰๊ฐ€

submission, fi, metrics = model(app_train, app_test)
print('Baseline metrics')
print(metrics)
Training Data Shape:  (307511, 239)
Testing Data Shape:  (48744, 239)
[200]	train's auc: 0.798723	train's binary_logloss: 0.547797	valid's auc: 0.755039	valid's binary_logloss: 0.563266
[400]	train's auc: 0.82838	train's binary_logloss: 0.518334	valid's auc: 0.755107	valid's binary_logloss: 0.545575
[200]	train's auc: 0.798409	train's binary_logloss: 0.548179	valid's auc: 0.758332	valid's binary_logloss: 0.563587
[400]	train's auc: 0.828244	train's binary_logloss: 0.518308	valid's auc: 0.758563	valid's binary_logloss: 0.545588
[200]	train's auc: 0.797648	train's binary_logloss: 0.549331	valid's auc: 0.763246	valid's binary_logloss: 0.564236
[200]	train's auc: 0.798855	train's binary_logloss: 0.547952	valid's auc: 0.757131	valid's binary_logloss: 0.562234
[200]	train's auc: 0.797918	train's binary_logloss: 0.548584	valid's auc: 0.758065	valid's binary_logloss: 0.564721
Baseline metrics
      fold     train     valid
0        0  0.816657  0.755215
1        1  0.816900  0.758754
2        2  0.808111  0.763630
3        3  0.811887  0.757583
4        4  0.811617  0.758344
5  overall  0.813034  0.758705
### feature ์ค‘์š”๋„ ํ™•์ธ

fi_sorted = plot_feature_importances(fi)

### ์ œ์ถœ ํŒŒ์ผ ์ƒ์„ฑ

submission.to_csv('baseline_lgb.csv', index = False)
  • The score is about 0.735

โบ Domain Feature๋“ค์„ ์ถ”๊ฐ€

### ์„ฑ๋Šฅ ํ‰๊ฐ€

app_train_domain['TARGET'] = train_labels

# Test the domain knolwedge features
submission_domain, fi_domain, metrics_domain = model(app_train_domain, app_test_domain)
print('Baseline with domain knowledge features metrics')
print(metrics_domain)
Training Data Shape:  (307511, 243)
Testing Data Shape:  (48744, 243)
[200]	train's auc: 0.804779	train's binary_logloss: 0.541283	valid's auc: 0.762511	valid's binary_logloss: 0.557227
[200]	train's auc: 0.804016	train's binary_logloss: 0.542318	valid's auc: 0.765768	valid's binary_logloss: 0.557819
[200]	train's auc: 0.8038	train's binary_logloss: 0.542856	valid's auc: 0.7703	valid's binary_logloss: 0.557925
[400]	train's auc: 0.834559	train's binary_logloss: 0.511454	valid's auc: 0.770511	valid's binary_logloss: 0.538558
[200]	train's auc: 0.804603	train's binary_logloss: 0.541718	valid's auc: 0.765497	valid's binary_logloss: 0.556274
[200]	train's auc: 0.804782	train's binary_logloss: 0.541397	valid's auc: 0.765076	valid's binary_logloss: 0.558641
Baseline with domain knowledge features metrics
      fold     train     valid
0        0  0.815523  0.763069
1        1  0.807075  0.766062
2        2  0.832138  0.770730
3        3  0.811100  0.765884
4        4  0.819404  0.765249
5  overall  0.817048  0.766186
### Feature importances

fi_sorted = plot_feature_importances(fi_domain)

  • We can confirm that some of the features we created were picked up as important variables

    • Applying additional domain knowledge is worth considering
submission_domain.to_csv('baseline_lgb_domain_features.csv', index = False)
  • ์ ์ˆ˜๋Š” ์•ฝ 0.754 ์ •๋„

6. Conclusions

6-1. Performance comparison by model

  1. Without feature engineering
  • LogisticRegression: 0.671

  • RandomForest: 0.678

  • LGBM: 0.735

  2. With feature engineering
  • RandomForest: 0.678 (polynomial features), 0.679 (domain features)

    • No notable performance improvement
  • LGBM: 0.754

    • A slight performance improvement

6-2. Overall process recap

  1. First, build a clear understanding of the data, the various tasks, and the criteria by which submissions are judged

  2. Perform simple EDA to identify relationships, trends, or outliers that could help with modeling

  • Along the way, carry out the necessary preprocessing steps, such as encoding categorical variables and imputing missing values
  3. Construct new features from the existing data and check whether doing so helps the model

  4. Complete data exploration, data preparation, and feature engineering

  5. Implement the models

  • LogisticRegression, RandomForest

  • Train on the engineered variables -> check performance

📌 General outline of a machine learning project

  1. Understand the problem and the data

  2. Data cleaning and formatting (organizing the information)

  • For this competition, most of this had already been done
  3. EDA

  4. Build a baseline model, train it, and predict

  5. Improve the model

  6. Interpret the results
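The project outline above, compressed into a minimal runnable sketch (synthetic data and a scikit-learn pipeline as a generic stand-in; all names here are illustrative, not the notebook's code):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1-2. Understand + clean/format: a tiny synthetic "application" table with a missing value
rng = np.random.default_rng(0)
df = pd.DataFrame({'income': rng.normal(50000, 10000, 500),
                   'contract': rng.choice(['Cash loans', 'Revolving loans'], 500),
                   'target': rng.integers(0, 2, 500)})
df.loc[0, 'income'] = np.nan

X, y = df[['income', 'contract']], df['target']
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# 3-4. EDA would happen here; then a baseline model with imputation + encoding
pre = ColumnTransformer([
    ('num', Pipeline([('imp', SimpleImputer(strategy='median')),
                      ('sc', StandardScaler())]), ['income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['contract']),
])
clf = Pipeline([('pre', pre), ('lr', LogisticRegression(max_iter=1000))]).fit(X_tr, y_tr)

# 5-6. Improve/interpret: check the validation AUC against the baseline
auc = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])
print(f'validation AUC: {auc:.3f}')
```

Since the toy target is random noise, the AUC here hovers near 0.5; the point is only the shape of the workflow, not the score.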

📚 References