📚 Reference

1. Introduction: Manual Feature Engineering

  • This kernel covers creating features from the Home Credit Default Risk competition data

  • The early part of the kernel builds models using only the application data

    • The best-performing model built on that data alone scored about 0.74 on the leaderboard
  • To gather more information, we draw on the other dataframes

    • Here we use the bureau and bureau_balance data

    • bureau

      • A client's previous loans at other financial institutions, as reported to Home Credit

      • Each previous loan gets its own row

    • bureau_balance

      • Monthly data about the previous loans

      • Each month gets its own row

  • Manual feature engineering can be a tedious process, and it often requires domain expertise

    • Since our knowledge of what actually drives loans and defaults is limited, we instead focus on getting as much information as possible into the final training dataframe

    • In other words, this kernel takes the approach of letting the model, rather than a human, decide which features matter

    • We build as many features as we can, let the model use all of them, and later reduce them using the feature importances from the model or PCA

  • Each step of manual feature engineering involves a large amount of Pandas code and some patience, especially patience with data manipulation

    • Although automated feature engineering tools are starting to be used, for now feature engineering still requires this kind of hands-on preprocessing
# ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ์„ ์œ„ํ•œ Pandas ๋ฐ Numpy
import pandas as pd
import numpy as np

# ์‹œ๊ฐํ™”๋ฅผ ์œ„ํ•œ matplotlib ๋ฐ seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# pandas์—์„œ ๋‚˜์˜ค๋Š” ๊ฒฝ๊ณ ๋ฌธ ๋ฌด์‹œ
import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')

1-1. Example: Counts of a client's previous loans

  • To illustrate a typical piece of manual feature engineering, we first simply count each client's previous loans at other financial institutions

📌 Frequently used Pandas commands

  • groupby

    • Groups a dataframe by the values of a column

    • Here we group the dataframe by client, using the values of the SK_ID_CURR column

  • agg

    • Computes aggregates (such as the mean) of the grouped data

    • We can compute the mean directly with grouped_df.mean(), or pass agg a list to get the mean, max, min, sum, and so on (grouped_df.agg(['mean', 'max', 'min', 'sum']))

  • merge

    • Matches the aggregated values to the corresponding client

    • We merge the aggregated values into the original training data on the SK_ID_CURR column, filling in NaN where a client has no matching value

  • We also use the rename command with a dict to rename columns

    • This is useful for keeping track of the newly created variables
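As a toy illustration of the groupby, agg, merge, and rename pattern described above (the mini dataframes and new column names here are made up for the example):

```python
import pandas as pd

# Hypothetical mini versions of a loan table and a client table
loans = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                      'AMT_CREDIT': [100.0, 200.0, 50.0]})
clients = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})

# groupby + agg: one row per client with several statistics
stats = loans.groupby('SK_ID_CURR')['AMT_CREDIT'].agg(['mean', 'max', 'min', 'sum']).reset_index()

# rename: keep track of what the new columns mean
stats = stats.rename(columns = {'mean': 'credit_mean', 'max': 'credit_max',
                                'min': 'credit_min', 'sum': 'credit_sum'})

# merge: left join onto the client table; client 3 has no loans, so it gets NaN
clients = clients.merge(stats, on = 'SK_ID_CURR', how = 'left')
print(clients)
```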
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
### Load the bureau data

bureau = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/3주차/data/bureau.csv')
bureau.head()
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN
### groupby on the client id (SK_ID_CURR)
# Count the previous loans and rename the column

previous_loan_counts = bureau.groupby('SK_ID_CURR', as_index = False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'previous_loan_counts'})
previous_loan_counts.head()
SK_ID_CURR previous_loan_counts
0 100001 7
1 100002 8
2 100003 4
3 100004 2
4 100005 3
### Join to the training dataframe
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/3주차/data/application_train.csv')
train = train.merge(previous_loan_counts, on = 'SK_ID_CURR', how = 'left')

### Fill missing values with 0
train['previous_loan_counts'] = train['previous_loan_counts'].fillna(0)
train.head()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR previous_loan_counts
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 8.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 4.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 2.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 NaN NaN NaN NaN NaN NaN 0.0
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

5 rows × 123 columns

  • Scroll all the way to the right to see the newly created column (previous_loan_counts)!

1-2. Assessing variable usefulness with the r-value

  • To judge whether newly created variables are useful, we first compute the Pearson correlation coefficient (r-value) between each variable and the target

📌 Pearson correlation coefficient

  • Expresses the linear relationship between two variables as a value from -1 (a perfect negative linear relationship) to +1 (a perfect positive linear relationship)

  • The r-value is not the best way to assess the usefulness of a new variable, but it can give a rough idea of whether the variable will help a machine learning model

  • The larger a variable's r-value with respect to the target, the more likely the variable is to influence the target

  • We look for the variables whose r-value with the target has the largest absolute value
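As a minimal sketch of the r-value (with made-up numbers), pandas computes the Pearson correlation with Series.corr:

```python
import pandas as pd

# Hypothetical toy data: the feature decreases exactly as the target increases
df = pd.DataFrame({'TARGET':  [0, 0, 1, 1],
                   'feature': [5.0, 5.0, 2.0, 2.0]})

# Series.corr uses the Pearson correlation coefficient by default
r = df['TARGET'].corr(df['feature'])
print(r)  # -1.0: a perfect negative linear relationship
```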

📌 Kernel density estimate (KDE) plots

  • Visualize the relationship with the target

  • Show the distribution of a single variable (think of it as a smoothed histogram)

  • Can serve as an indicator of whether a variable is relevant to a machine learning model


  • To see how the distribution changes with the value of a categorical variable, we color the plot by category

    • For example, we can draw a kernel density estimate of previous_loan_counts colored by whether TARGET is 0 or 1
### Function to plot a kernel density estimate

def kde_target(var_name, df):
    '''
    input:
      var_name: str, column to plot
      df: DataFrame, dataframe containing the column

    return: None
    '''

    ### Compute statistics (correlation, medians)

    # correlation between the new variable and the target
    corr = df['TARGET'].corr(df[var_name])
    # median of each group
    avg_repaid = df.loc[df['TARGET'] == 0, var_name].median()     # loan repaid
    avg_not_repaid = df.loc[df['TARGET'] == 1, var_name].median() # loan not repaid


    ### Plot

    # set up the figure
    plt.figure(figsize = (12, 6))
    # plot the distribution, colored by the value of the target
    sns.kdeplot(df.loc[df['TARGET'] == 0, var_name], label = 'TARGET == 0')
    sns.kdeplot(df.loc[df['TARGET'] == 1, var_name], label = 'TARGET == 1')
    # labeling
    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)
    plt.legend()


    ### Print the results

    # correlation
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))
    # medians
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid =     %0.4f' % avg_repaid)
    print()

Plotting the EXT_SOURCE_3 variable

kde_target('EXT_SOURCE_3', train)
The correlation between EXT_SOURCE_3 and the TARGET is -0.1789
Median value for loan that was not repaid = 0.3791
Median value for loan that was repaid =     0.5460

Plotting previous_loan_counts

  • Plot the newly created variable
kde_target('previous_loan_counts', train)
The correlation between previous_loan_counts and the TARGET is -0.0100
Median value for loan that was not repaid = 3.0000
Median value for loan that was repaid =     4.0000

  • We can see that the newly created variable (previous_loan_counts) is not important

    • The correlation is too small

    • The distributions barely differ across the target values

1-3. Creating new variables

a) Aggregate statistics of the numeric variables

  • agg() can compute the mean, max, min, sum, and so on of a dataframe

    • We can also write a separate function and pass it to agg
  • To use the numeric variables in the bureau dataframe, we compute aggregate statistics for all of them

    • To do this, we groupby() the client ID (SK_ID_CURR), compute the aggregates of the grouped dataframe with agg(), and then merge() the result into the train dataset
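The note above that a separately written function can be passed to agg can be sketched like this (value_range and the toy data are hypothetical):

```python
import pandas as pd

# A user-defined aggregate: the spread between a group's max and min
def value_range(x):
    return x.max() - x.min()

df = pd.DataFrame({'SK_ID_CURR': [1, 1, 2, 2],
                   'AMT_CREDIT': [100.0, 300.0, 50.0, 60.0]})

# agg accepts custom functions alongside the built-in statistic names
agg = df.groupby('SK_ID_CURR')['AMT_CREDIT'].agg(['mean', value_range]).reset_index()
print(agg)
```

The resulting column is named after the function (`value_range`), just as the built-in statistics name their columns.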
### Compute the aggregate statistics
# Group the dataframe by client id and aggregate

bureau_agg = bureau.drop(columns = ['SK_ID_BUREAU']).groupby('SK_ID_CURR', as_index = False).agg(['count', 'mean', 'max', 'min', 'sum']).reset_index()
bureau_agg.head()
SK_ID_CURR DAYS_CREDIT CREDIT_DAY_OVERDUE ... DAYS_CREDIT_UPDATE AMT_ANNUITY
count mean max min sum count mean max min ... count mean max min sum count mean max min sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 ... 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 ... 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 ... 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 ... 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 ... 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

5 rows × 61 columns

  • It is a good idea to give the newly created columns descriptive names

    • The code below appends the kind of statistic to the original column names
  • At this point we are working with a dataframe that has a multi-level column index

    • Multi-level indexes are confusing and hard to work with, so we convert to a single-level index as quickly as possible
### Rename the columns

# List to hold the new column names
columns = ['SK_ID_CURR']

for var in bureau_agg.columns.levels[0]: # iterate over the original variables only
                                         # (not the stats such as min and max)
    # skip the id column
    if var != 'SK_ID_CURR':
        # iterate over the kinds of statistics
        for stat in bureau_agg.columns.levels[1][:-1]:
            # make a new column name for the variable and stat
            columns.append('bureau_%s_%s' % (var, stat)) # bureau_variable_stat
### Assign the list as the dataframe's column names

bureau_agg.columns = columns
bureau_agg.head()
SK_ID_CURR bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min ... bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 ... 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 ... 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 ... 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 ... 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 ... 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

5 rows × 61 columns

### Merge with the training data

train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
train.head()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 8.0 -499.875 -7.0 -1185.0 -3999.0 7.0 0.0 0.0 0.0 0.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 4.0 -816.000 -43.0 -2131.0 -3264.0 0.0 NaN NaN NaN 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 2.0 -532.000 -382.0 -682.0 -1064.0 0.0 NaN NaN NaN 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 1.0 -783.000 -783.0 -783.0 -783.0 0.0 NaN NaN NaN 0.0

5 rows × 183 columns

b) Correlations of the aggregates with the target

  • Examine the correlations between the newly created values and the target
# List to hold the correlations of the newly created variables
new_corrs = []

# Iterate over the new variables...
for col in columns:
    # correlation with the target
    corr = train['TARGET'].corr(train[col])
    # append as a (name, value) tuple
    new_corrs.append((col, corr))
  • Sort the correlations by absolute value

    • using the sorted() function
### Sort the correlations by absolute value

new_corrs = sorted(new_corrs, key = lambda x: abs(x[1]), reverse = True) # descending order
new_corrs[:15] # the 15 variables with the highest correlations
[('bureau_DAYS_CREDIT_mean', 0.08972896721998114),
 ('bureau_DAYS_CREDIT_min', 0.0752482510301036),
 ('bureau_DAYS_CREDIT_UPDATE_mean', 0.06892735266968673),
 ('bureau_DAYS_ENDDATE_FACT_min', 0.05588737984392077),
 ('bureau_DAYS_CREDIT_ENDDATE_sum', 0.0537348956010205),
 ('bureau_DAYS_ENDDATE_FACT_mean', 0.05319962585758616),
 ('bureau_DAYS_CREDIT_max', 0.04978205463997299),
 ('bureau_DAYS_ENDDATE_FACT_sum', 0.048853502611115894),
 ('bureau_DAYS_CREDIT_ENDDATE_mean', 0.046982754334835494),
 ('bureau_DAYS_CREDIT_UPDATE_min', 0.042863922470730155),
 ('bureau_DAYS_CREDIT_sum', 0.041999824814846716),
 ('bureau_DAYS_CREDIT_UPDATE_sum', 0.04140363535306002),
 ('bureau_DAYS_CREDIT_ENDDATE_max', 0.036589634696329094),
 ('bureau_DAYS_CREDIT_ENDDATE_min', 0.034281109921616024),
 ('bureau_DAYS_ENDDATE_FACT_count', -0.030492306653325495)]
  • None of the newly created variables has a significant correlation with the target
### Plot the most correlated variable, bureau_DAYS_CREDIT_mean

kde_target('bureau_DAYS_CREDIT_mean', train)
The correlation between bureau_DAYS_CREDIT_mean and the TARGET is 0.0897
Median value for loan that was not repaid = -835.3333
Median value for loan that was repaid =     -1067.0000

  • This column captures 'how many days before the current application did the client apply for credit at the credit bureau?'

  • We interpret it as the number of days between a previous loan and the current loan application at Home Credit

    • A larger negative number means the previous loan was taken out further in the past

    • Clients who took out loans further in the past tend to be more likely to repay their loan at Home Credit (a weak positive correlation)

      • But the correlation is so low that it may be nothing but noise

✔ The multiple comparisons problem

  • When we have very many variables, we expect some of them to correlate with the target purely by chance, a phenomenon known as the multiple comparisons problem

  • We can make hundreds of features, and a few will appear related to the target simply because of random noise in the data

    • Such variables look related to the target on the train data, but the relationships do not generalize to the test data, so they can cause the model to overfit

    • We therefore need to think carefully about the features we create
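The effect can be demonstrated with purely random data (the sizes here are arbitrary): none of the features below carries any signal, yet the strongest of the 500 chance correlations is noticeably far from zero.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)

# A random binary target and 500 features of pure noise
target = pd.Series(rng.randint(0, 2, size = 200))
noise = pd.DataFrame(rng.randn(200, 500),
                     columns = ['noise_%d' % i for i in range(500)])

# Correlation of each random feature with the random target
corrs = noise.apply(lambda col: target.corr(col))
print('max |r| among 500 random features: %0.3f' % corrs.abs().max())
```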

c) A function to aggregate the numeric variables

  • We condense the previous steps into a function that computes the aggregate statistics of the numeric variables

  • It calculates the aggregates of all numeric variables across a dataframe


  • Parameters

    • df (dataframe): the dataframe to aggregate

    • group_var (string): the column to group by

    • df_name (string): the string used to rename the columns

  • Returns

    • agg (dataframe)

      • A dataframe with the aggregate statistics of all the numeric variables

      • Each grouped instance gets the statistics (mean, min, max, sum)

      • The columns are renamed to distinguish the newly created features

def agg_numeric(df, group_var, df_name):
    # Remove id columns other than the grouping variable
    for col in df:
        if col != group_var and 'SK_ID' in col:
            df = df.drop(columns = col)
    group_ids = df[group_var]
    numeric_df = df.select_dtypes('number')
    numeric_df[group_var] = group_ids
    
    # Group by the specified variable and compute the aggregate statistics
    agg = numeric_df.groupby(group_var).agg(['count','mean','max','min','sum']).reset_index()
    
    # Make the new column names
    columns = [group_var]
    
    # For each variable...
    for var in agg.columns.levels[0]:
        # skip the id column
        if var != group_var:
            # for each kind of statistic
            for stat in agg.columns.levels[1][:-1]:
                # make a new column name for the variable and stat
                columns.append('%s_%s_%s' % (df_name, var, stat))
    agg.columns = columns
    
    return agg
### Compute the aggregate statistics of the numeric variables

bureau_agg_new = agg_numeric(bureau.drop(columns = ['SK_ID_BUREAU']), 
                             group_var = 'SK_ID_CURR', 
                             df_name = 'bureau')
bureau_agg_new.head()
SK_ID_CURR bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min ... bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 ... 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 ... 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 ... 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 ... 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 ... 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

5 rows × 61 columns

### The dataframe built by hand earlier

bureau_agg.head()
SK_ID_CURR bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min ... bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 ... 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 ... 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 ... 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 ... 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 ... 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

5 rows × 61 columns

  • We can confirm that the two dataframes were created identically

d) A function to compute correlations

  • Computes the correlation between each variable and the target
### Function to calculate the correlations with the target for a dataframe

def target_corrs(df):
    # List to hold the correlations
    corrs = []
    # For each column...
    for col in df.columns:
        print(col)

        # skip the target column (its correlation with itself is 1)
        if col != 'TARGET':
            # correlation with the target
            corr = df['TARGET'].corr(df[col])
            # append as a (name, value) tuple
            corrs.append((col, corr))
            
    # Sort the correlations by absolute value, descending
    corrs = sorted(corrs, key = lambda x: abs(x[1]), reverse = True)
    
    return corrs

1-4. Categorical variables

1️⃣ These are mostly string data, so statistics such as the mean or max are hard to apply

  • Instead, we count the number of values in each category

👉 Example

  • We turn data like this...
| SK_ID_CURR | Loan type |
|------------|-----------|
| 1          | home      |
| 1          | home      |
| 1          | home      |
| 1          | credit    |
| 2          | credit    |
| 3          | credit    |
| 3          | cash      |
| 3          | cash      |
| 4          | credit    |
| 4          | home      |
| 4          | home      |


  • ...into the following, using each client's count of loans in each category

| SK_ID_CURR | credit count | cash count | home count | total count |
|------------|--------------|------------|------------|-------------|
| 1          | 1            | 0          | 3          | 4           |
| 2          | 1            | 0          | 0          | 1           |
| 3          | 1            | 2          | 0          | 3           |
| 4          | 1            | 0          | 2          | 3           |

2๏ธโƒฃ ๊ทธ ๋‹ค์Œ, ๊ณ ๊ฐ๋ณ„ ๋Œ€์ถœ ํšŸ์ˆ˜๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๊ฐ’๋“ค์„ ์ •๊ทœํ™”(normalize)

  • ๊ณ ๊ฐ๋ณ„ ๋Œ€์ถœ ํšŸ์ˆ˜์˜ ํ•ฉ๊ณ„๊ฐ€ 1์ด ๋˜๋„๋ก ์Šค์ผ€์ผ ์กฐ์ •
SK_ID_CURR credit count cash count home count total count credit count norm cash count norm home count norm

|โ€”โ€”โ€”โ€”|โ€”โ€”โ€”โ€”โ€“|โ€”โ€”โ€”โ€”|โ€”โ€”โ€”โ€”|โ€”โ€”โ€”โ€”-|โ€”โ€”โ€”โ€”โ€”โ€”-|โ€”โ€”โ€”โ€”โ€”โ€“|โ€”โ€”โ€”โ€”โ€”โ€“|

1 1 0 3 4 0.25 0 0.75
2 1 0 0 1 1.00 0 0
3 1 2 0 3 0.33 0.66 0
4 1 0 2 3 0.33 0 0.66
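The two toy tables above can be reproduced with pd.crosstab, which handles both the raw counts and the row-normalized version (the data is the made-up example from the tables):

```python
import pandas as pd

loans = pd.DataFrame({'SK_ID_CURR': [1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4],
                      'Loan type': ['home', 'home', 'home', 'credit', 'credit',
                                    'credit', 'cash', 'cash', 'credit', 'home', 'home']})

# Raw counts of each loan type per client
counts = pd.crosstab(loans['SK_ID_CURR'], loans['Loan type'])

# normalize='index' rescales each client's row to sum to 1
counts_norm = pd.crosstab(loans['SK_ID_CURR'], loans['Loan type'], normalize = 'index')

print(counts)
print(counts_norm)
```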

a) Encoding the categorical variables

  • Encoding the categorical variables lets us use the information they contain

  • We apply one-hot encoding

    • only to the categorical variables (dtype=object)
### One-hot Encoding

categorical = pd.get_dummies(bureau.select_dtypes('object'))
categorical['SK_ID_CURR'] = bureau['SK_ID_CURR']
categorical.head()
CREDIT_ACTIVE_Active CREDIT_ACTIVE_Bad debt CREDIT_ACTIVE_Closed CREDIT_ACTIVE_Sold CREDIT_CURRENCY_currency 1 CREDIT_CURRENCY_currency 2 CREDIT_CURRENCY_currency 3 CREDIT_CURRENCY_currency 4 CREDIT_TYPE_Another type of loan CREDIT_TYPE_Car loan ... CREDIT_TYPE_Loan for business development CREDIT_TYPE_Loan for purchase of shares (margin lending) CREDIT_TYPE_Loan for the purchase of equipment CREDIT_TYPE_Loan for working capital replenishment CREDIT_TYPE_Microloan CREDIT_TYPE_Mobile operator loan CREDIT_TYPE_Mortgage CREDIT_TYPE_Real estate loan CREDIT_TYPE_Unknown type of loan SK_ID_CURR
0 0 0 1 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354
1 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354
2 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354
3 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354
4 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354

5 rows × 24 columns

### Group by client id

categorical_grouped = categorical.groupby('SK_ID_CURR').agg(['sum', 'mean'])
categorical_grouped.head()
CREDIT_ACTIVE_Active CREDIT_ACTIVE_Bad debt CREDIT_ACTIVE_Closed CREDIT_ACTIVE_Sold CREDIT_CURRENCY_currency 1 ... CREDIT_TYPE_Microloan CREDIT_TYPE_Mobile operator loan CREDIT_TYPE_Mortgage CREDIT_TYPE_Real estate loan CREDIT_TYPE_Unknown type of loan
sum mean sum mean sum mean sum mean sum mean ... sum mean sum mean sum mean sum mean sum mean
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

5 rows × 46 columns

  • sum: the total number of the client's loans in that category

  • mean: the normalized count

One-hot encoding makes these values easy to compute

  • Here we use a function similar to the one we used earlier to rename the columns

    • It works on columns organized as a multi-level index
  • We first loop over the first level (level 0), which holds the names of the one-hot encoded columns, and then loop over the computed statistics

  • We then rename each column by joining its level 0 name with the kind of statistic

    • For example, the column with CREDIT_ACTIVE_Active at level 0 and sum at level 1 becomes CREDIT_ACTIVE_Active_count

Turning this into a function

### Level 0: names of the one-hot encoded columns

categorical_grouped.columns.levels[0][:10]
Index(['CREDIT_ACTIVE_Active', 'CREDIT_ACTIVE_Bad debt',
       'CREDIT_ACTIVE_Closed', 'CREDIT_ACTIVE_Sold',
       'CREDIT_CURRENCY_currency 1', 'CREDIT_CURRENCY_currency 2',
       'CREDIT_CURRENCY_currency 3', 'CREDIT_CURRENCY_currency 4',
       'CREDIT_TYPE_Another type of loan', 'CREDIT_TYPE_Car loan'],
      dtype='object')
### Level 1: the statistics to compute (sum, mean)

categorical_grouped.columns.levels[1]
Index(['sum', 'mean'], dtype='object')
group_var = 'SK_ID_CURR'

# List to hold the new column names
columns = []

# Iterate over the variables...
for var in categorical_grouped.columns.levels[0]:
    # skip the client id column (the grouping variable)
    if var != group_var:
        # iterate over the kinds of statistics
        for stat in ['count', 'count_norm']:
            # make a new column name
            columns.append('%s_%s' % (var, stat))
### Rename the variables

categorical_grouped.columns = columns
categorical_grouped.head()
CREDIT_ACTIVE_Active_count CREDIT_ACTIVE_Active_count_norm CREDIT_ACTIVE_Bad debt_count CREDIT_ACTIVE_Bad debt_count_norm CREDIT_ACTIVE_Closed_count CREDIT_ACTIVE_Closed_count_norm CREDIT_ACTIVE_Sold_count CREDIT_ACTIVE_Sold_count_norm CREDIT_CURRENCY_currency 1_count CREDIT_CURRENCY_currency 1_count_norm ... CREDIT_TYPE_Microloan_count CREDIT_TYPE_Microloan_count_norm CREDIT_TYPE_Mobile operator loan_count CREDIT_TYPE_Mobile operator loan_count_norm CREDIT_TYPE_Mortgage_count CREDIT_TYPE_Mortgage_count_norm CREDIT_TYPE_Real estate loan_count CREDIT_TYPE_Real estate loan_count_norm CREDIT_TYPE_Unknown type of loan_count CREDIT_TYPE_Unknown type of loan_count_norm
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

5 rows × 46 columns

  • count: the total number of loans in the category

  • count_norm: the normalized count

### Merge with the training data

train = train.merge(categorical_grouped, left_on = 'SK_ID_CURR', 
                    right_index = True, how = 'left')
train.head()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... CREDIT_TYPE_Microloan_count CREDIT_TYPE_Microloan_count_norm CREDIT_TYPE_Mobile operator loan_count CREDIT_TYPE_Mobile operator loan_count_norm CREDIT_TYPE_Mortgage_count CREDIT_TYPE_Mortgage_count_norm CREDIT_TYPE_Real estate loan_count CREDIT_TYPE_Real estate loan_count_norm CREDIT_TYPE_Unknown type of loan_count CREDIT_TYPE_Unknown type of loan_count_norm
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 229 columns

train.shape
(307511, 229)
train.iloc[:10, 123:]
bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min bureau_CREDIT_DAY_OVERDUE_sum ... CREDIT_TYPE_Microloan_count CREDIT_TYPE_Microloan_count_norm CREDIT_TYPE_Mobile operator loan_count CREDIT_TYPE_Mobile operator loan_count_norm CREDIT_TYPE_Mortgage_count CREDIT_TYPE_Mortgage_count_norm CREDIT_TYPE_Real estate loan_count CREDIT_TYPE_Real estate loan_count_norm CREDIT_TYPE_Unknown type of loan_count CREDIT_TYPE_Unknown type of loan_count_norm
0 8.0 -874.000000 -103.0 -1437.0 -6992.0 8.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 4.0 -1400.750000 -606.0 -2586.0 -5603.0 4.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2.0 -867.000000 -408.0 -1326.0 -1734.0 2.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1.0 -1149.000000 -1149.0 -1149.0 -1149.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 3.0 -757.333333 -78.0 -1097.0 -2272.0 3.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 18.0 -1271.500000 -239.0 -2882.0 -22887.0 18.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 2.0 -1939.500000 -1138.0 -2741.0 -3879.0 2.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 4.0 -1773.000000 -1309.0 -2508.0 -7092.0 4.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10 rows ร— 106 columns

b) ๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ๋“ค์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ ํ•จ์ˆ˜

  • ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ๊ทธ๋ฃนํ™”

  • ์ดํ›„ ๊ฐ๊ฐ์˜ ๋ฒ”์ฃผ์— ๋”ฐ๋ผ counts์™€ normalized_counts๋ฅผ ๊ณ„์‚ฐ


  • ํŒŒ๋ผ๋ฏธํ„ฐ(Parameters)>

    • df (dataframe): ์—ฐ์‚ฐ์˜ ๋Œ€์ƒ์ด ๋˜๋Š” ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„

    • group_var(string): groupby์˜ ๊ธฐ์ค€์ด๋˜๋Š” column

    • df_name(string): column ๋ช…์„ ์žฌ์ •์˜ํ•˜๋Š”๋ฐ ์“ฐ์ด๋Š” ๋ณ€์ˆ˜

  • ์ถœ๋ ฅ๊ฐ’(Returns)

    • categorical: ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ group_var์— ๋Œ€ํ•ด ๊ฐ ๋ฒ”์ฃผ๋“ค์˜ counts ๋ฐ normalized_counts ๊ฐ’์ด ํฌํ•จ๋œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„
### ๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ํ•จ์ˆ˜

def count_categorical(df, group_var, df_name):    
    # ๋ฒ”์ฃผํ˜• column๋“ค์„ ์„ ํƒ
    categorical = pd.get_dummies(df.select_dtypes('object'))
    # ํ™•์‹คํžˆ id๊ฐ€ column์— ์žˆ๋„๋ก ์ง€์ •ํ•˜๊ธฐ
    categorical[group_var] = df[group_var]

    # group_var๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๊ทธ๋ฃนํ™”ํ•˜๊ณ  sum๊ณผ mean์„ ๊ณ„์‚ฐ
    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])
    
    column_names = []
    
    # level = 0์˜ column๋“ค์— ๋”ฐ๋ผ ๋ฐ˜๋ณต๋ฌธ์„ ์‹คํ–‰
    for var in categorical.columns.levels[0]:
        # level = 1์˜ ํ†ต๊ณ„๊ฐ’๋“ค(sum -> count, mean -> count_norm)์— ๋Œ€ํ•ด ๋ฐ˜๋ณต๋ฌธ์„ ์‹คํ–‰
        for stat in ['count', 'count_norm']:
            # ์ปฌ๋Ÿผ๋ช… ์žฌ์ •์˜
            column_names.append('%s_%s_%s' % (df_name, var, stat))
    
    categorical.columns = column_names
    
    return categorical
bureau_counts = count_categorical(bureau, 
                                  group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_counts.head()
bureau_CREDIT_ACTIVE_Active_count bureau_CREDIT_ACTIVE_Active_count_norm bureau_CREDIT_ACTIVE_Bad debt_count bureau_CREDIT_ACTIVE_Bad debt_count_norm bureau_CREDIT_ACTIVE_Closed_count bureau_CREDIT_ACTIVE_Closed_count_norm bureau_CREDIT_ACTIVE_Sold_count bureau_CREDIT_ACTIVE_Sold_count_norm bureau_CREDIT_CURRENCY_currency 1_count bureau_CREDIT_CURRENCY_currency 1_count_norm ... bureau_CREDIT_TYPE_Microloan_count bureau_CREDIT_TYPE_Microloan_count_norm bureau_CREDIT_TYPE_Mobile operator loan_count bureau_CREDIT_TYPE_Mobile operator loan_count_norm bureau_CREDIT_TYPE_Mortgage_count bureau_CREDIT_TYPE_Mortgage_count_norm bureau_CREDIT_TYPE_Real estate loan_count bureau_CREDIT_TYPE_Real estate loan_count_norm bureau_CREDIT_TYPE_Unknown type of loan_count bureau_CREDIT_TYPE_Unknown type of loan_count_norm
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

5 rows ร— 46 columns
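count์™€ count_norm์ด ์–ด๋–ป๊ฒŒ ๊ณ„์‚ฐ๋˜๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋„๋ก, ๋ณธ๋ฌธ์˜ count_categorical ๋กœ์ง์„ ๊ทธ๋Œ€๋กœ ์˜ฎ๊ฒจ ์ž„์˜๋กœ ๋งŒ๋“  ์†Œํ˜• ๋ฐ์ดํ„ฐ์— ์ ์šฉํ•ด ๋ณธ ์Šค์ผ€์น˜(๋ฐ์ดํ„ฐ ๊ฐ’์€ ๊ฐ€์ •):

```python
import pandas as pd

def count_categorical(df, group_var, df_name):
    # ๋ฒ”์ฃผํ˜• column๋“ค์„ one-hot encoding
    categorical = pd.get_dummies(df.select_dtypes('object'))
    categorical[group_var] = df[group_var]
    # sum -> count, mean -> count_norm
    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])
    categorical.columns = ['%s_%s_%s' % (df_name, var, stat)
                           for var in categorical.columns.levels[0]
                           for stat in ['count', 'count_norm']]
    return categorical

# ์ž„์˜์˜ ์˜ˆ์‹œ: ๊ณ ๊ฐ 1์€ ๋Œ€์ถœ 3๊ฑด(Active 1, Closed 2), ๊ณ ๊ฐ 2๋Š” 1๊ฑด(Active)
toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 1, 2],
                    'CREDIT_ACTIVE': ['Active', 'Closed', 'Closed', 'Active']})
result = count_categorical(toy, group_var='SK_ID_CURR', df_name='bureau')
print(result)
# ๊ณ ๊ฐ 1: Closed count = 2, Active count_norm = 1/3
```

count๋Š” ํ•ด๋‹น ๋ฒ”์ฃผ์˜ ๋“ฑ์žฅ ํšŸ์ˆ˜, count_norm์€ ๊ทธ ๊ณ ๊ฐ์˜ ์ „์ฒด ๊ธฐ๋ก ์ค‘ ๋น„์œจ์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ.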

c) ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์— ์—ฐ์‚ฐ ์ ์šฉํ•˜๊ธฐ

  • bureau_balance ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ํ™œ์šฉ

    • ์ด ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์€ ์›”๋ณ„ ๊ฐ ๊ณ ๊ฐ์˜ ๊ณผ๊ฑฐ ํƒ€ ๊ธˆ์œต๊ธฐ๊ด€ ๋Œ€์ถœ ๋ฐ์ดํ„ฐ๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ์Œ
  • ๊ณ ๊ฐ๋“ค์˜ ID์ธ SK_ID_CURR์— ๋”ฐ๋ผ ๊ทธ๋ฃนํ™”ํ•˜๊ธฐ๋ณด๋‹ค, ์ด์ „ ๋Œ€์ถœ์˜ ID์ธ SK_ID_BUREAU๋ฅผ ํ™œ์šฉํ•˜์—ฌ 1์ฐจ ๊ทธ๋ฃนํ™”๋ฅผ ์ง„ํ–‰ํ•  ๊ฒƒ์ž„

    • ๊ทธ๋ฃนํ™”ํ•œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์€ ๊ฐ๊ฐ์˜ ๋Œ€์ถœ์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ํ–‰๋ณ„๋กœ ํฌํ•จํ•  ๊ฒƒ์ž„
  • ๊ทธ ๋‹ค์Œ, SK_ID_CURR์„ ํ™œ์šฉํ•˜์—ฌ ๊ทธ๋ฃนํ™”ํ•œ ๋’ค, ๊ฐ ๊ณ ๊ฐ๋ณ„ ๋Œ€์ถœ์˜ ๋Œ€ํ‘œ๊ฐ’๋“ค์„ ๊ณ„์‚ฐ

  • ์ตœ์ข… ์‚ฐ์ถœ๋ฌผ์€ ๊ฐ๊ฐ์˜ ํ–‰์— ๊ณ ๊ฐ๋ณ„๋กœ ๋Œ€์ถœ์— ๋Œ€ํ•œ ๋Œ€ํ‘œ๊ฐ’๋“ค์„ ํฌํ•จ

### ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

bureau_balance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/bureau_balance.csv')
bureau_balance.head()
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C
### 1. ๊ฐ๊ฐ์˜ ์ด์ „ ๋Œ€์ถœ์— ๋Œ€ํ•œ ์ƒํƒœ ๊ฐœ์ˆ˜ ํŒŒ์•…

bureau_balance_counts = count_categorical(bureau_balance, group_var = 'SK_ID_BUREAU', 
                                          df_name = 'bureau_balance')
bureau_balance_counts.head()
bureau_balance_STATUS_0_count bureau_balance_STATUS_0_count_norm bureau_balance_STATUS_1_count bureau_balance_STATUS_1_count_norm bureau_balance_STATUS_2_count bureau_balance_STATUS_2_count_norm bureau_balance_STATUS_3_count bureau_balance_STATUS_3_count_norm bureau_balance_STATUS_4_count bureau_balance_STATUS_4_count_norm bureau_balance_STATUS_5_count bureau_balance_STATUS_5_count_norm bureau_balance_STATUS_C_count bureau_balance_STATUS_C_count_norm bureau_balance_STATUS_X_count bureau_balance_STATUS_X_count_norm
SK_ID_BUREAU
5001709 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 86 0.886598 11 0.113402
5001710 5 0.060241 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 48 0.578313 30 0.361446
5001711 3 0.750000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 1 0.250000
5001712 10 0.526316 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 9 0.473684 0 0.000000
5001713 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 22 1.000000
### 2. ์ˆ˜์น˜ํ˜• ๋ณ€์ˆ˜ ์ฒ˜๋ฆฌ
# MONTHS_BALANCE: ์‹ ์ฒญ์ผ์„ ๊ธฐ์ค€์œผ๋กœ ๋ช‡ ๊ฐœ์›” ์ „์˜ ๊ธฐ๋ก์ธ์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฐ’

## 2-1. ๊ฐ๊ฐ์˜ 'SK_ID_BUREAU'๋ณ„ ๋Œ€ํ‘œ๊ฐ’ ๊ณ„์‚ฐ
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', 
                                 df_name = 'bureau_balance')
bureau_balance_agg.head()
SK_ID_BUREAU bureau_balance_MONTHS_BALANCE_count bureau_balance_MONTHS_BALANCE_mean bureau_balance_MONTHS_BALANCE_max bureau_balance_MONTHS_BALANCE_min bureau_balance_MONTHS_BALANCE_sum
0 5001709 97 -48.0 0 -96 -4656
1 5001710 83 -41.0 0 -82 -3403
2 5001711 4 -1.5 0 -3 -6
3 5001712 19 -9.0 0 -18 -171
4 5001713 22 -10.5 0 -21 -231
## 2-2. ๊ฐ ๊ณ ๊ฐ๋ณ„ ๊ณ„์‚ฐ

# ๋Œ€์ถœ์„ ๊ธฐ์ค€์œผ๋กœ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ๊ทธ๋ฃนํ™”
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, 
                                          left_on = 'SK_ID_BUREAU', how = 'outer')

# SK_ID_CURR์„ ํฌํ•จํ•˜์—ฌ ๋ณ‘ํ•ฉ
bureau_by_loan = bureau_by_loan.merge(bureau[['SK_ID_BUREAU', 'SK_ID_CURR']], 
                                      on = 'SK_ID_BUREAU', how = 'left')

bureau_by_loan.head()
SK_ID_BUREAU bureau_balance_MONTHS_BALANCE_count bureau_balance_MONTHS_BALANCE_mean bureau_balance_MONTHS_BALANCE_max bureau_balance_MONTHS_BALANCE_min bureau_balance_MONTHS_BALANCE_sum bureau_balance_STATUS_0_count bureau_balance_STATUS_0_count_norm bureau_balance_STATUS_1_count bureau_balance_STATUS_1_count_norm ... bureau_balance_STATUS_3_count_norm bureau_balance_STATUS_4_count bureau_balance_STATUS_4_count_norm bureau_balance_STATUS_5_count bureau_balance_STATUS_5_count_norm bureau_balance_STATUS_C_count bureau_balance_STATUS_C_count_norm bureau_balance_STATUS_X_count bureau_balance_STATUS_X_count_norm SK_ID_CURR
0 5001709 97 -48.0 0 -96 -4656 0 0.000000 0 0.0 ... 0.0 0 0.0 0 0.0 86 0.886598 11 0.113402 NaN
1 5001710 83 -41.0 0 -82 -3403 5 0.060241 0 0.0 ... 0.0 0 0.0 0 0.0 48 0.578313 30 0.361446 162368.0
2 5001711 4 -1.5 0 -3 -6 3 0.750000 0 0.0 ... 0.0 0 0.0 0 0.0 0 0.000000 1 0.250000 162368.0
3 5001712 19 -9.0 0 -18 -171 10 0.526316 0 0.0 ... 0.0 0 0.0 0 0.0 9 0.473684 0 0.000000 162368.0
4 5001713 22 -10.5 0 -21 -231 0 0.000000 0 0.0 ... 0.0 0 0.0 0 0.0 0 0.000000 22 1.000000 150635.0

5 rows ร— 23 columns

bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), 
                                       group_var = 'SK_ID_CURR', df_name = 'client')
bureau_balance_by_client.head()
SK_ID_CURR client_bureau_balance_MONTHS_BALANCE_count_count client_bureau_balance_MONTHS_BALANCE_count_mean client_bureau_balance_MONTHS_BALANCE_count_max client_bureau_balance_MONTHS_BALANCE_count_min client_bureau_balance_MONTHS_BALANCE_count_sum client_bureau_balance_MONTHS_BALANCE_mean_count client_bureau_balance_MONTHS_BALANCE_mean_mean client_bureau_balance_MONTHS_BALANCE_mean_max client_bureau_balance_MONTHS_BALANCE_mean_min ... client_bureau_balance_STATUS_X_count_count client_bureau_balance_STATUS_X_count_mean client_bureau_balance_STATUS_X_count_max client_bureau_balance_STATUS_X_count_min client_bureau_balance_STATUS_X_count_sum client_bureau_balance_STATUS_X_count_norm_count client_bureau_balance_STATUS_X_count_norm_mean client_bureau_balance_STATUS_X_count_norm_max client_bureau_balance_STATUS_X_count_norm_min client_bureau_balance_STATUS_X_count_norm_sum
0 100001.0 7 24.571429 52 2 172 7 -11.785714 -0.5 -25.5 ... 7 4.285714 9 0 30.0 7 0.214590 0.500000 0.0 1.502129
1 100002.0 8 13.750000 22 4 110 8 -21.875000 -1.5 -39.5 ... 8 1.875000 3 0 15.0 8 0.161932 0.500000 0.0 1.295455
2 100005.0 3 7.000000 13 3 21 3 -3.000000 -1.0 -6.0 ... 3 0.666667 1 0 2.0 3 0.136752 0.333333 0.0 0.410256
3 100010.0 2 36.000000 36 36 72 2 -46.000000 -19.5 -72.5 ... 2 0.000000 0 0 0.0 2 0.000000 0.000000 0.0 0.000000
4 100013.0 4 57.500000 69 40 230 4 -28.250000 -19.5 -34.0 ... 4 10.250000 40 0 41.0 4 0.254545 1.000000 0.0 1.018182

5 rows ร— 106 columns

๐Ÿ“Œ ์ •๋ฆฌ

  • bureau_balance ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ณผ์ •์œผ๋กœ ๊ฐ€๊ณต๋จ

    1. ๊ฐ๊ฐ์˜ ๋Œ€์ถœ์— ๋Œ€ํ•ด ์ˆ˜์น˜ํ˜•(numeric) ๋Œ€ํ‘œ๊ฐ’ ๊ณ„์‚ฐ

    2. ๊ฐ๊ฐ์˜ ๋Œ€์ถœ์— ๋Œ€ํ•ด ๋ฒ”์ฃผํ˜•(categorical) ๋ฐ์ดํ„ฐ๋“ค์˜ ๊ฐœ์ˆ˜๋ฅผ ํŒŒ์•…

    3. ๊ฐ๊ฐ์˜ ๋Œ€์ถœ์— ๋Œ€ํ•œ ๋Œ€ํ‘œ๊ฐ’๋“ค๊ณผ ๊ฐฏ์ˆ˜๋ฅผ ๋ณ‘ํ•ฉ

    4. ๊ฐ ๊ณ ๊ฐ๋ณ„๋กœ 3์˜ ๊ฒฐ๊ณผ์— ๋Œ€ํ•œ ์ˆ˜์น˜ํ˜• ๋Œ€ํ‘œ๊ฐ’ ๊ณ„์‚ฐ

  • ์ตœ์ข… ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์€ ๊ฐ ๊ณ ๊ฐ์— ๋Œ€ํ•œ ๊ฐœ๋ณ„ ํ–‰์œผ๋กœ ๊ตฌ์„ฑ๋˜๋ฉฐ, ๊ฐ ํ–‰์€ ์ด์ „ ๋ชจ๋“  ๋Œ€์ถœ๋“ค์˜ ์›”๋ณ„ ์ •๋ณด๋“ค์˜ ํ†ต๊ณ„์น˜๋“ค๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Œ


  • client_bureau_balance_MONTHS_BALANCE_mean_mean: ๊ฐ๊ฐ์˜ ๋Œ€์ถœ์— ๋Œ€ํ•œ MONTHS_BALANCE์˜ ํ‰๊ท ๊ฐ’์„ ๊ณ„์‚ฐ -> ํด๋ผ์ด์–ธํŠธ๋ณ„ ๋Œ€์ถœ์˜ ํ‰๊ท ๊ฐ’์„ ๊ณ„์‚ฐ

  • client_bureau_balance_STATUS_X_count_norm_sum: ๊ฐ๊ฐ์˜ ๋Œ€์ถœ์— ๋Œ€ํ•ด STATUS == 'X'์ธ ๊ฒƒ์˜ ๋นˆ๋„๋ฅผ ์ด STATUS ๊ฐœ์ˆ˜๋กœ ๋‚˜๋ˆˆ ๋‹ค์Œ, ๊ณ ๊ฐ๋ณ„๋กœ ๊ทธ ๊ฐ’์„ ํ•ฉ์‚ฐ

  • ๋ชจ๋“  ๋ณ€์ˆ˜๋ฅผ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์— ๋ชจ์€ ๋’ค์— ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•  ์˜ˆ์ •
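์œ„ ์ •๋ฆฌ์˜ 1~4๋‹จ๊ณ„(๋Œ€์ถœ๋ณ„ ์ง‘๊ณ„ -> ๊ณ ๊ฐ id ๊ฒฐํ•ฉ -> ๊ณ ๊ฐ๋ณ„ ์žฌ์ง‘๊ณ„)๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์Šค์ผ€์น˜ํ•ด ๋ณผ ์ˆ�˜ ์žˆ์Œ(agg_numeric ๋Œ€์‹  pandas groupby๋ฅผ ์ง์ ‘ ์‚ฌ์šฉํ•œ ๊ฐ„๋žตํ•œ ๋ฒ„์ „์ด๋ฉฐ, ๋ฐ์ดํ„ฐ ๊ฐ’์€ ์ž„์˜๋กœ ๋งŒ๋“  ๊ฐ€์ •):

```python
import pandas as pd

# ์ž„์˜์˜ ์›”๋ณ„ ๋ฐ์ดํ„ฐ: ๋Œ€์ถœ(SK_ID_BUREAU)๋ณ„ MONTHS_BALANCE ๊ธฐ๋ก
balance = pd.DataFrame({'SK_ID_BUREAU': [10, 10, 11, 11, 12],
                        'MONTHS_BALANCE': [0, -1, 0, -3, -2]})
# ๋Œ€์ถœ -> ๊ณ ๊ฐ(SK_ID_CURR) ๋งคํ•‘
bureau = pd.DataFrame({'SK_ID_BUREAU': [10, 11, 12],
                       'SK_ID_CURR': [100, 100, 200]})

# 1๋‹จ๊ณ„: ๋Œ€์ถœ๋ณ„ ๋Œ€ํ‘œ๊ฐ’ ๊ณ„์‚ฐ
by_loan = balance.groupby('SK_ID_BUREAU')['MONTHS_BALANCE'] \
                 .agg(['count', 'mean']).reset_index()
# 2๋‹จ๊ณ„: SK_ID_CURR ๊ฒฐํ•ฉ
by_loan = by_loan.merge(bureau, on='SK_ID_BUREAU', how='left')
# 3๋‹จ๊ณ„: ๊ณ ๊ฐ๋ณ„๋กœ '๋Œ€ํ‘œ๊ฐ’์˜ ๋Œ€ํ‘œ๊ฐ’' ๊ณ„์‚ฐ(์˜ˆ: mean_mean์— ํ•ด๋‹น)
by_client = by_loan.groupby('SK_ID_CURR')[['count', 'mean']].agg(['mean', 'max'])
print(by_client)
# ๊ณ ๊ฐ 100: ๋Œ€์ถœ๋ณ„ ํ‰๊ท (-0.5, -1.5)์˜ ํ‰๊ท  = -1.0
```

client_bureau_balance_MONTHS_BALANCE_mean_mean๊ณผ ๊ฐ™์€ ์ด์ค‘ ์ ‘๋ฏธ์‚ฌ๊ฐ€ ๋ฐ”๋กœ ์ด ๋‘ ๋ฒˆ์˜ ์ง‘๊ณ„์—์„œ ๋‚˜์˜จ๋‹ค๋Š” ์ ์„ ๋ณด์—ฌ์ฃผ๋Š” ์˜ˆ์‹œ์ž„.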

2. ์ด์ „๊นŒ์ง€ ์ƒ์„ฑํ•œ ํ•จ์ˆ˜ ํ™œ์šฉํ•˜๊ธฐ

  • ๋ชจ๋“  ๋ณ€์ˆ˜๋“ค์„ ์ดˆ๊ธฐํ™” ํ•œ ๋’ค, ์ƒ์„ฑ๋œ ํ•จ์ˆ˜๋“ค์„ ์‚ฌ์šฉํ•˜์—ฌ ์ฒ˜์Œ๋ถ€ํ„ฐ ๋งŒ๋“ค์–ด๋‚˜๊ฐ€์ž.
# ์˜ค๋ž˜๋œ ๊ฐ์ฒด๋“ค(objects)์„ ์ œ๊ฑฐํ•จ์œผ๋กœ์จ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™•๋ณด

import gc
gc.enable()

del train, bureau, bureau_balance, bureau_agg, bureau_agg_new, bureau_balance_agg, bureau_balance_counts, bureau_by_loan, bureau_balance_by_client, bureau_counts
gc.collect()
0
# ์›๋ณธ ๋ฐ์ดํ„ฐ ๋‹ค์‹œ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ(์ดˆ๊ธฐํ™”)

train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/application_train.csv')
bureau = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/bureau.csv')
bureau_balance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/bureau_balance.csv')

๐Ÿ“Œ Bureau ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ๋‚ด ๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ์˜ ๊ฐฏ์ˆ˜ ์„ธ๊ธฐ

bureau_counts = count_categorical(bureau, group_var = 'SK_ID_CURR', 
                                  df_name = 'bureau')
bureau_counts.head()
bureau_CREDIT_ACTIVE_Active_count bureau_CREDIT_ACTIVE_Active_count_norm bureau_CREDIT_ACTIVE_Bad debt_count bureau_CREDIT_ACTIVE_Bad debt_count_norm bureau_CREDIT_ACTIVE_Closed_count bureau_CREDIT_ACTIVE_Closed_count_norm bureau_CREDIT_ACTIVE_Sold_count bureau_CREDIT_ACTIVE_Sold_count_norm bureau_CREDIT_CURRENCY_currency 1_count bureau_CREDIT_CURRENCY_currency 1_count_norm ... bureau_CREDIT_TYPE_Microloan_count bureau_CREDIT_TYPE_Microloan_count_norm bureau_CREDIT_TYPE_Mobile operator loan_count bureau_CREDIT_TYPE_Mobile operator loan_count_norm bureau_CREDIT_TYPE_Mortgage_count bureau_CREDIT_TYPE_Mortgage_count_norm bureau_CREDIT_TYPE_Real estate loan_count bureau_CREDIT_TYPE_Real estate loan_count_norm bureau_CREDIT_TYPE_Unknown type of loan_count bureau_CREDIT_TYPE_Unknown type of loan_count_norm
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

5 rows ร— 46 columns

๐Ÿ“ŒBureau ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ๋Œ€ํ‘œ๊ฐ’ ๊ณ„์‚ฐ

bureau_agg = agg_numeric(bureau.drop(columns = ['SK_ID_BUREAU']), 
                         group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_agg.head()
SK_ID_CURR bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min ... bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 ... 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 ... 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 ... 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 ... 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 ... 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

5 rows ร— 61 columns

๐Ÿ“Œ Bureau Balance ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์˜ ๊ฐ ๋Œ€์ถœ ๋ณ„ ๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ ๊ฐœ์ˆ˜

bureau_balance_counts = count_categorical(bureau_balance, group_var = 'SK_ID_BUREAU', 
                                          df_name = 'bureau_balance')
bureau_balance_counts.head()
bureau_balance_STATUS_0_count bureau_balance_STATUS_0_count_norm bureau_balance_STATUS_1_count bureau_balance_STATUS_1_count_norm bureau_balance_STATUS_2_count bureau_balance_STATUS_2_count_norm bureau_balance_STATUS_3_count bureau_balance_STATUS_3_count_norm bureau_balance_STATUS_4_count bureau_balance_STATUS_4_count_norm bureau_balance_STATUS_5_count bureau_balance_STATUS_5_count_norm bureau_balance_STATUS_C_count bureau_balance_STATUS_C_count_norm bureau_balance_STATUS_X_count bureau_balance_STATUS_X_count_norm
SK_ID_BUREAU
5001709 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 86 0.886598 11 0.113402
5001710 5 0.060241 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 48 0.578313 30 0.361446
5001711 3 0.750000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 1 0.250000
5001712 10 0.526316 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 9 0.473684 0 0.000000
5001713 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 22 1.000000

๐Ÿ“Œ Bureau Balance ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ๊ฐ ๋Œ€์ถœ๋ณ„ ๋Œ€ํ‘œ๊ฐ’

bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', 
                                 df_name = 'bureau_balance')
bureau_balance_agg.head()
SK_ID_BUREAU bureau_balance_MONTHS_BALANCE_count bureau_balance_MONTHS_BALANCE_mean bureau_balance_MONTHS_BALANCE_max bureau_balance_MONTHS_BALANCE_min bureau_balance_MONTHS_BALANCE_sum
0 5001709 97 -48.0 0 -96 -4656
1 5001710 83 -41.0 0 -82 -3403
2 5001711 4 -1.5 0 -3 -6
3 5001712 19 -9.0 0 -18 -171
4 5001713 22 -10.5 0 -21 -231

๐Ÿ“Œ Bureau Balance ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ๊ณ ๊ฐ๋ณ„ ๋Œ€ํ‘œ๊ฐ’

# ๊ฐ ๋Œ€์ถœ๋ณ„ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ๋ณ‘ํ•ฉ
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, 
                                          left_on = 'SK_ID_BUREAU', how = 'outer')

# ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์— SK_ID_CURR์„ ํฌํ•จ
bureau_by_loan = bureau[['SK_ID_BUREAU', 'SK_ID_CURR']].merge(bureau_by_loan, 
                                                              on = 'SK_ID_BUREAU', 
                                                              how = 'left')

# ๊ณ ๊ฐ๋ณ„ ๋Œ€ํ‘œ๊ฐ’ ๊ณ„์‚ฐ
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), 
                                       group_var = 'SK_ID_CURR', df_name = 'client')

๐Ÿ“Œ ๊ณ„์‚ฐ๋œ ํŠน์„ฑ(Feature)๋“ค์„ ํ›ˆ๋ จ์šฉ ๋ฐ์ดํ„ฐ์™€ ๋ณ‘ํ•ฉ

original_features = list(train.columns) # ์›๋ž˜ ๋ณ€์ˆ˜๋“ค
print('Original Number of Features: ', len(original_features))
Original Number of Features:  122
# bureau_count ๋ณ‘ํ•ฉ
train = train.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# bureau ๋Œ€ํ‘œ๊ฐ’ ๋ณ‘ํ•ฉ
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# ์›”๋ณ„, ๊ณ ๊ฐ๋ณ„ ์ •๋ณด ๋ณ‘ํ•ฉ
train = train.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
new_features = list(train.columns) # ๋ณ€์ˆ˜ ์ถ”๊ฐ€ ํ›„ ๋ณ€์ˆ˜ ๋ชฉ๋ก
print('Number of features using previous loans from other institutions data: ', len(new_features))
Number of features using previous loans from other institutions data:  333
  • ๋งŽ์€ ๋ณ€์ˆ˜๋“ค์ด ์ƒˆ๋กœ ์ƒ์„ฑ๋˜์—ˆ๋‹ค.

3. Feature Engineering ๊ฒฐ๊ณผ๋ฌผ

  • ๊ฒฐ์ธก์น˜์˜ ๋น„์œจ, target๊ณผ์˜ ์ƒ๊ด€๊ณ„์ˆ˜, ๋ณ€์ˆ˜ ๊ฐ„ ์ƒ๊ด€๊ณ„์ˆ˜๋“ค์„ ํŒŒ์•…

    • ๊ฐ ๋ณ€์ˆ˜ ๊ฐ„ ๋†’์€ ์ƒ๊ด€๊ด€๊ณ„๋Š” ๋ณ€์ˆ˜ ๊ฐ„ collinear ๊ด€๊ณ„๋ฅผ ๊ฐ€์ง€๋Š” ์ง€ ์—ฌ๋ถ€๋ฅผ ๋ณด์—ฌ์ค„ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋Š” ๋ณ€์ˆ˜๋“ค์ด ์„œ๋กœ ๊ฐ•ํ•œ ์—ฐ๊ด€๊ด€๊ณ„๋ฅผ ๊ฐ€์ง์„ ์˜๋ฏธ

    • ์ข…์ข… collinearํ•œ ๋‘ ๋ณ€์ˆ˜๋ฅผ ๋ชจ๋‘ ๊ฐ–๋Š” ๊ฒƒ์€ ์ค‘๋ณต์ด๊ธฐ ๋•Œ๋ฌธ์— ํ•˜๋‚˜๋ฅผ ์ œ๊ฑฐํ•ด์•ผํ•  ํ•„์š”๊ฐ€ ์žˆ๊ธฐ๋„ ํ•จ(๋‹ค์ค‘๊ณต์„ ์„ฑ ๋ฌธ์ œ ๋“ฑ)

  • feature selection์€ ๋ชจ๋ธ ํ•™์Šต ๋ฐ ์ผ๋ฐ˜ํ™”๋ฅผ ์œ„ํ•ด ๋ณ€์ˆ˜๋“ค์„ ์ œ๊ฑฐํ•˜๋Š” ๊ณผ์ •

    • ๋ถˆํ•„์š”ํ•˜๊ฑฐ๋‚˜ ์ค‘๋ณต๋œ ๋ณ€์ˆ˜๋“ค์€ ์ œ๊ฑฐํ•˜๊ณ , ์ค‘์š”ํ•œ ๋ณ€์ˆ˜๋“ค์€ ๋ณด์กดํ•˜๋Š” ๊ฒƒ์ด ๋ชฉ์ 

    • Curse of dimensionality(์ฐจ์›์˜ ์ €์ฃผ): feature์˜ ๊ฐœ์ˆ˜๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์„ ๋•Œ(์ง€๋‚˜์น˜๊ฒŒ ๊ณ ์ฐจ์›์ผ ๋•Œ) ์ƒ๊ธฐ๋Š” ๋ฌธ์ œ

    • ๋ณ€์ˆ˜์˜ ์ˆ˜๊ฐ€ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ๋ณ€์ˆ˜์™€ ๋ชฉํ‘œ๊ฐ’ ์‚ฌ์ด์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ์˜ ์ˆ˜๊ฐ€ ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์œผ๋กœ ์ฆ๊ฐ€

    • Feature์˜ ์ˆ˜๋ฅผ ์ค„์ด๋Š” ๊ฒƒ์€ ๋ชจ๋ธ ํ•™์Šต๊ณผ ๋”๋ถˆ์–ด ์ผ๋ฐ˜ํ™”๋ฅผ ๋„์šธ ์ˆ˜ ์žˆ์Œ

      • ๊ฒฐ์ธก์น˜๋“ค์˜ ๋ฐฑ๋ถ„์œจ์„ ํ™œ์šฉํ•˜์—ฌ ๋Œ€๋ถ€๋ถ„์˜ ๊ฐ’์ด ์กด์žฌํ•˜์ง€ ์•Š๋Š” feature๋“ค์„ ์ œ๊ฑฐํ•  ์ˆ˜ ์žˆ์Œ

      • Gradient Boosting Machine๊ณผ RandomForest ๋ชจ๋ธ๋กœ๋ถ€ํ„ฐ ๋ฐ˜ํ™˜๋œ feature importance๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Œ
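feature importance๋ฅผ ํ™œ์šฉํ•œ๋‹ค๋Š” ์•„์ด๋””์–ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์Šค์ผ€์น˜ํ•ด ๋ณผ ์ˆ˜ ์žˆ์Œ(๋ณธ๋ฌธ์˜ ์‹ค์ œ ๋ชจ๋ธ์ด ์•„๋‹ˆ๋ผ, ์ž„์˜๋กœ ๋งŒ๋“  ๋ฐ์ดํ„ฐ์™€ RandomForest๋ฅผ ์‚ฌ์šฉํ•œ ์ตœ์†Œ ์˜ˆ์‹œ):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# ์ž„์˜์˜ ์˜ˆ์‹œ ๋ฐ์ดํ„ฐ: target์€ 'signal' column์—๋งŒ ์˜์กด
X = pd.DataFrame({'signal': rng.rand(200), 'noise': rng.rand(200)})
y = (X['signal'] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns)
# 'signal'์˜ importance๊ฐ€ 'noise'๋ณด๋‹ค ํ›จ์”ฌ ํฌ๊ฒŒ ๋‚˜์˜ด
print(importances.sort_values(ascending=False))
```

importance๊ฐ€ 0์— ๊ฐ€๊นŒ์šด ๋ณ€์ˆ˜๋“ค์„ ์ œ๊ฑฐ ํ›„๋ณด๋กœ ์‚ผ๋Š” ๊ฒƒ์ด ์œ„ ๊ธ€๋จธ๋ฆฌํ‘œ์—์„œ ๋งํ•˜๋Š” ์ ‘๊ทผ๋ฒ•์ž„.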

3-1. ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ

### column๋ณ„ ๊ฒฐ์ธก์น˜์˜ ๊ฐœ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•œ ํ•จ์ˆ˜

def missing_values_table(df):
        # ๊ฒฐ์ธก์น˜์˜ ์ด ๊ฐœ์ˆ˜
        mis_val = df.isnull().sum()
        
        # ๊ฒฐ์ธก์น˜์˜ ๋น„์œจ
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # ๊ฒฐ๊ณผ ํ…Œ์ด๋ธ”
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis = 1)
        
        # ์ปฌ๋Ÿผ๋ช… ์žฌ์ •์˜
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋ฆผ์ฐจ์ˆœ์œผ๋กœ ์ •๋ ฌ
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending = False).round(1)
        
        # ์š”์•ฝํ†ต๊ณ„๋Ÿ‰ ์ถœ๋ ฅ
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        return mis_val_table_ren_columns
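์œ„ ํ•จ์ˆ˜์˜ ํ•ต์‹ฌ ๊ณ„์‚ฐ ๊ณผ์ •์„ ์ž„์˜๋กœ ๋งŒ๋“  ์ž‘์€ ๋ฐ์ดํ„ฐ๋กœ ํ’€์–ด์„œ ํ™•์ธํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Œ(์š”์•ฝ ์ถœ๋ ฅ๋ฌธ์€ ์ƒ๋žตํ•œ ์Šค์ผ€์น˜):

```python
import numpy as np
import pandas as pd

# ์ž„์˜์˜ ์˜ˆ์‹œ: a๋Š” 75%, b๋Š” 25% ๊ฒฐ์ธก, c๋Š” ๊ฒฐ์ธก ์—†์Œ
toy = pd.DataFrame({'a': [1, np.nan, np.nan, np.nan],
                    'b': [1, 2, np.nan, 4],
                    'c': [1, 2, 3, 4]})

mis_val = toy.isnull().sum()                  # column๋ณ„ ๊ฒฐ์ธก์น˜ ๊ฐœ์ˆ˜
mis_val_percent = 100 * mis_val / len(toy)    # column๋ณ„ ๊ฒฐ์ธก์น˜ ๋น„์œจ
mt = pd.concat([mis_val, mis_val_percent], axis=1)
mt = mt.rename(columns={0: 'Missing Values', 1: '% of Total Values'})
# ๊ฒฐ์ธก์น˜๊ฐ€ ์žˆ๋Š” column๋งŒ ๋‚จ๊ธฐ๊ณ  ๋น„์œจ ๋‚ด๋ฆผ์ฐจ์ˆœ์œผ๋กœ ์ •๋ ฌ
mt = mt[mt.iloc[:, 1] != 0].sort_values('% of Total Values',
                                        ascending=False).round(1)
print(mt)  # a: 75.0, b: 25.0 (c๋Š” ์ œ์™ธ)
```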
missing_train = missing_values_table(train)
missing_train.head(10)
Your selected dataframe has 333 columns.
There are 278 columns that have missing values.
Missing Values % of Total Values
bureau_AMT_ANNUITY_min 227502 74.0
bureau_AMT_ANNUITY_max 227502 74.0
bureau_AMT_ANNUITY_mean 227502 74.0
client_bureau_balance_STATUS_4_count_min 215280 70.0
client_bureau_balance_STATUS_3_count_norm_mean 215280 70.0
client_bureau_balance_MONTHS_BALANCE_count_min 215280 70.0
client_bureau_balance_STATUS_4_count_max 215280 70.0
client_bureau_balance_STATUS_4_count_mean 215280 70.0
client_bureau_balance_STATUS_3_count_norm_min 215280 70.0
client_bureau_balance_STATUS_3_count_norm_max 215280 70.0
  • ๋ˆ„๋ฝ๋œ ๊ฐ’์˜ ๋น„์œจ์ด ๋†’์€ column์ด ์—ฌ๋Ÿฌ ๊ฐœ ์žˆ์Œ์„ ์•Œ ์ˆ˜ ์žˆ์Œ

    • ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ ๋ฐ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์— ์žˆ์–ด 90% ์ด์ƒ์˜ ๋ˆ„๋ฝ๊ฐ’์„ ๊ฐ€์ง„ column๋“ค์„ ์ œ๊ฑฐ
missing_train_vars = list(missing_train.index[missing_train['% of Total Values'] > 90])
len(missing_train_vars)
0
  • ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ๋„ ๋™์ผํ•œ ์ž‘์—… ์ˆ˜ํ–‰

3-2. ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ

# ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/application_test.csv')

# bureau ๋ฐ์ดํ„ฐ์˜ ๊ฐœ์ˆ˜๋“ค์„ ๊ณ„์‚ฐํ•œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„๊ณผ ๋ณ‘ํ•ฉ
test = test.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# bureau ๋ฐ์ดํ„ฐ์˜ ๋Œ€ํ‘œ๊ฐ’๋“ค์„ ๊ณ„์‚ฐํ•œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„๊ณผ ๋ณ‘ํ•ฉ
test = test.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# bureau balance ๋ฐ์ดํ„ฐ์˜ ๊ฐœ์ˆ˜๋“ค์„ ๊ณ„์‚ฐํ•œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„๊ณผ ๋ณ‘ํ•ฉ
test = test.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
print('Shape of Testing Data: ', test.shape)
Shape of Testing Data:  (48744, 332)

3-3. ๋ฐ์ดํ„ฐ ๊ฐœ์ˆ˜ ๋งž์ถฐ์ฃผ๊ธฐ

  • ํ›ˆ๋ จ์šฉ ๋ฐ์ดํ„ฐ์™€ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์ด ๊ฐ™์€ column๋“ค์„ ๊ฐ€์ง€๋„๋ก ๋งž์ถฐ๋ณด์ž.

    • ์—ฌ๊ธฐ์„œ๋Š” ๋ฌธ์ œ๊ฐ€ ๋˜์ง€ ์•Š์ง€๋งŒ, ๋ณ€์ˆ˜๋“ค์„ ์›-ํ•ซ ์ธ์ฝ”๋”ฉํ•  ๋•Œ์—๋Š” ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„๋“ค์ด ๋™์ผํ•œ column์„ ๊ฐ€์ง€๋„๋ก ๋งž์ถฐ์•ผ ํ•จ
train_labels = train['TARGET']

# ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ align
# 'target' column์€ ์ผ๋‹จ ์ œ๊ฑฐํ•˜๊ณ  ๋งž์ถฐ์ฃผ๊ธฐ
train, test = train.align(test, join = 'inner', axis = 1)

train['TARGET'] = train_labels # train data์— ๋Œ€ํ•ด์„œ๋Š” ๋‹ค์‹œ target๊ฐ’ ๊ฒฐํ•ฉ
print('Training Data Shape: ', train.shape)
print('Testing Data Shape: ', test.shape)
Training Data Shape:  (307511, 333)
Testing Data Shape:  (48744, 332)
  • ๋ฐ์ดํ„ฐ ํ˜•ํƒœ๊ฐ€ ํ†ต์ผ๋˜์—ˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

3-4. ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ

### ๊ฒฐ์ธก์น˜ ์ƒํƒœ(?) ํ™•์ธ

missing_test = missing_values_table(test)
missing_test.head(10)
Your selected dataframe has 332 columns.
There are 275 columns that have missing values.
Missing Values % of Total Values
COMMONAREA_MEDI 33495 68.7
COMMONAREA_MODE 33495 68.7
COMMONAREA_AVG 33495 68.7
NONLIVINGAPARTMENTS_MEDI 33347 68.4
NONLIVINGAPARTMENTS_AVG 33347 68.4
NONLIVINGAPARTMENTS_MODE 33347 68.4
FONDKAPREMONT_MODE 32797 67.3
LIVINGAPARTMENTS_MEDI 32780 67.2
LIVINGAPARTMENTS_MODE 32780 67.2
LIVINGAPARTMENTS_AVG 32780 67.2
missing_test_vars = list(missing_test.index[missing_test['% of Total Values'] > 90])
len(missing_test_vars)
0
missing_columns = list(set(missing_test_vars + missing_train_vars))
print('There are %d columns with more than 90%% missing in either the training or testing data.' % len(missing_columns))
There are 0 columns with more than 90% missing in either the training or testing data.
# ๊ฒฐ์ธก์น˜๊ฐ€ ์žˆ๋Š” column ์‚ญ์ œ

train = train.drop(columns = missing_columns)
test = test.drop(columns = missing_columns)
  • 90% ์ด์ƒ ๋ˆ„๋ฝ๋œ ๊ฐ’์„ ๊ฐ€์ง„ column๋“ค์ด ์—†๊ธฐ ๋•Œ๋ฌธ์—, ์ด๋ฒˆ์—๋Š” ์–ด๋– ํ•œ column๋“ค๋„ ์ œ๊ฑฐ๋˜์ง€ ์•Š์•˜์Œ

    • feature selection์„ ์œ„ํ•ด์„œ๋Š” ์•„๋งˆ๋„ ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์„ ์ ์šฉํ•ด์•ผ ํ•  ๊ฒƒ ๊ฐ™์Œ
### ๊ฐ€๊ณต๋œ ๋ฐ์ดํ„ฐ ์ œ์žฅ

train.to_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/train_bureau_raw.csv', index = False)
test.to_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/test_bureau_raw.csv', index = False)

3-5. ์ƒ๊ด€๊ณ„์ˆ˜(Correlations)

  • target๊ณผ ๋ณ€์ˆ˜๋“ค ๊ฐ„์˜ ์ƒ๊ด€๊ณ„์ˆ˜

    • ์ƒˆ๋กญ๊ฒŒ ์ƒ์„ฑ๋œ ๋ณ€์ˆ˜๋“ค ์ค‘ ๊ธฐ์กด ํ›ˆ๋ จ์šฉ ๋ฐ์ดํ„ฐ(application ๋ฐ์ดํ„ฐ์— ์žˆ๋˜) ๋ณ€์ˆ˜๋“ค๋ณด๋‹ค ๋” ๋†’์€ ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๊ฐ€์ง€๋Š” ๋ณ€์ˆ˜๋ฅผ ์ฐพ์„ ์ˆ˜ ์žˆ์Œ
# ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์ƒ์—์„œ ๋ชจ๋“  ๋ณ€์ˆ˜๋“ค ๊ฐ„์˜ ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๊ณ„์‚ฐ

corrs = train.corr()
corrs = corrs.sort_values('TARGET', ascending = False) # ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ

pd.DataFrame(corrs['TARGET'].head(10))
TARGET
TARGET 1.000000
bureau_DAYS_CREDIT_mean 0.089729
client_bureau_balance_MONTHS_BALANCE_min_mean 0.089038
DAYS_BIRTH 0.078239
bureau_CREDIT_ACTIVE_Active_count_norm 0.077356
client_bureau_balance_MONTHS_BALANCE_mean_mean 0.076424
bureau_DAYS_CREDIT_min 0.075248
client_bureau_balance_MONTHS_BALANCE_min_min 0.073225
client_bureau_balance_MONTHS_BALANCE_sum_mean 0.072606
bureau_DAYS_CREDIT_UPDATE_mean 0.068927
pd.DataFrame(corrs['TARGET'].dropna().tail(10))
TARGET
client_bureau_balance_MONTHS_BALANCE_count_min -0.048224
client_bureau_balance_STATUS_C_count_norm_mean -0.055936
client_bureau_balance_STATUS_C_count_max -0.061083
client_bureau_balance_STATUS_C_count_mean -0.062954
client_bureau_balance_MONTHS_BALANCE_count_max -0.068792
bureau_CREDIT_ACTIVE_Closed_count_norm -0.079369
client_bureau_balance_MONTHS_BALANCE_count_mean -0.080193
EXT_SOURCE_1 -0.155317
EXT_SOURCE_2 -0.160472
EXT_SOURCE_3 -0.178919
  • target๊ณผ ๊ฐ€์žฅ ํฐ ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๊ฐ€์ง€๋Š” ๋ณ€์ˆ˜๋Š” ์ƒˆ๋กญ๊ฒŒ ์ƒ์„ฑ๋œ ๋ณ€์ˆ˜

    • ๊ทธ๋Ÿฌ๋‚˜ ์ƒ๊ด€๊ณ„์ˆ˜๊ฐ€ ๋†’๋‹ค๋Š” ๊ฒƒ์ด ๊ทธ ๋ณ€์ˆ˜๊ฐ€ ์œ ์šฉํ•˜๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•˜์ง€๋Š” ์•Š์œผ๋ฉฐ, ์ˆ˜๋ฐฑ ๊ฐœ์˜ ๋ณ€์ˆ˜๋“ค์„ ์ƒ์„ฑํ–ˆ์„ ๊ฒฝ์šฐ์—๋Š”, ๊ทธ์ € random noise ๋•Œ๋ฌธ์— ์ƒ๊ด€๊ด€๊ณ„์— ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ผ ์ˆ˜๋„ ์žˆ์Œ์„ ์ฃผ์˜ํ•ด์•ผ ํ•จ
  • ๋น„ํŒ์ ์œผ๋กœ ์ƒ๊ด€๊ณ„์ˆ˜๋“ค์„ ๋“ค์—ฌ๋‹ค๋ดค์„ ๋•Œ, ๊ทธ๋ž˜๋„ ์ƒˆ๋กญ๊ฒŒ ์ƒ์„ฑ๋œ ๋ช‡๋ช‡ ๋ณ€์ˆ˜๋“ค์€ ์œ ์šฉํ•  ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ž„

    • ๋ณ€์ˆ˜๋“ค์˜ ์œ ์šฉ์„ฑ์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด, ํ•™์Šต๋œ ๋ชจ๋ธ๋กœ๋ถ€ํ„ฐ feature Importance๋ฅผ ์‚ดํŽด๋ณผ ์˜ˆ์ •

    • ์ƒˆ๋กญ๊ฒŒ ์ƒ์„ฑ๋œ ๋ณ€์ˆ˜๋“ค์— ๋Œ€ํ•œ kde ๊ทธ๋ž˜ํ”„๋ฅผ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ์Œ

# ์›๋ฌธ์—๋Š” client_bureau_balance_counts_mean ์„ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์ง€๋งŒ ํ•ด๋‹น ๋ณ€์ˆ˜๊ฐ€ ์—†์–ด,
# 'client_bureau_balance_MONTHS_BALANCE_count_mean'์œผ๋กœ ๋Œ€์ฒด

kde_target(var_name = 'client_bureau_balance_MONTHS_BALANCE_count_mean', df = train)
The correlation between client_bureau_balance_MONTHS_BALANCE_count_mean and the TARGET is -0.0802
Median value for loan that was not repaid = 19.3333
Median value for loan that was repaid =     25.1429

  • ์ด ๋ณ€์ˆ˜๋Š” ๊ฐ ๊ณ ๊ฐ์— ๋Œ€ํ•ด, ๋Œ€์ถœ๋ณ„ ์›”๋ณ„ ๊ธฐ๋ก ๊ฐœ์ˆ˜(count)์˜ ํ‰๊ท ์„ ์˜๋ฏธํ•จ

    • ์˜ˆ๋ฅผ ๋“ค์–ด, ๋งŒ์•ฝ ๊ณ ๊ฐ์˜ ๊ณผ๊ฑฐ ์›”๋ณ„๋กœ 3, 4, 5์˜ ๊ธฐ๋ก์„ ๊ฐ–๊ณ  ์žˆ๋Š” 3๊ฐœ์˜ ๋Œ€์ถœ์„ ๊ฐ–๊ณ  ์žˆ๋‹ค๋ฉด, ์ด ๋ณ€์ˆ˜์— ์žˆ์–ด ๋ฐ์ดํ„ฐ๊ฐ’์€ 4๊ฐ€ ๋จ
  • ๋ถ„ํฌ ๊ทธ๋ž˜ํ”„์— ๊ธฐ์ดˆํ•˜์—ฌ, ๊ณผ๊ฑฐ ์›”๋ณ„ ๊ธฐ๋ก ๊ฐœ์ˆ˜์˜ ํ‰๊ท ์ด ๋†’์€ ์‚ฌ๋žŒ๋“ค์ด Home Credit์—์„œ์˜ ๋Œ€์ถœ์„ ์ž˜ ์ƒํ™˜ํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์ž„

    • ๋” ๋งŽ์€ ์‹ ์šฉ ๊ธฐ๋ก์„ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๊ณ ๊ฐ๋“ค์ด ์ผ๋ฐ˜์ ์œผ๋กœ ๋Œ€์ถœ๊ธˆ์„ ์ƒํ™˜ํ•  ๊ฐ€๋Šฅ์„ฑ์ด ๋” ๋†’๋‹ค๋Š” ๊ฒƒ์„ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Œ
kde_target(var_name='bureau_CREDIT_ACTIVE_Active_count_norm', df=train)
The correlation between bureau_CREDIT_ACTIVE_Active_count_norm and the TARGET is 0.0774
Median value for loan that was not repaid = 0.5000
Median value for loan that was repaid =     0.3636

  • ํ•ด๋‹น ๋ณ€์ˆ˜๋Š” CREDIT_ACTIVE์˜ ๊ฐ’์ด โ€˜Activeโ€™์ธ ๊ฒƒ์˜ ๊ฐœ์ˆ˜๋ฅผ ์ „์ฒด ๋Œ€์ถœ์˜ ๊ฐœ์ˆ˜๋กœ ๋‚˜๋ˆˆ ๊ฐ’

  • ํ•ด๋‹น ๋ณ€์ˆ˜์˜ ๊ฒฝ์šฐ ๋ชจ๋“  ๊ณณ์—์„œ ๋ถˆ๊ทœ์น™์ ์ž„

  • ์ƒ๊ด€๊ด€๊ณ„ ๋˜ํ•œ ๋งค์šฐ ๋‚ฎ์Œ

๐Ÿ“Œ Collinear Variables

  • target๊ณผ ๋ณ€์ˆ˜ ๊ฐ„์˜ ์ƒ๊ด€๊ณ„์ˆ˜๋งŒ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ, ๊ฐ ๋ณ€์ˆ˜ ๊ฐ„ ์ƒ๊ด€๊ณ„์ˆ˜๊นŒ์ง€ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Œ

    • ์ด๋ฅผ ํ†ตํ•ด ์ œ๊ฑฐํ•ด์•ผ ํ•  ์ˆ˜๋„ ์žˆ๋Š” collinear ๊ด€๊ณ„๋“ค์„ ๊ฐ€์ง€๋Š” ๋ณ€์ˆ˜๋“ค์ด ์žˆ๋Š”์ง€ ์—ฌ๋ถ€๋ฅผ ์•Œ๋ ค์คŒ
  • 0.8 ์ด์ƒ์˜ ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๊ฐ€์ง€๋Š” ๋ณ€์ˆ˜๋“ค์„ ์ฐพ์•„๋ณด์ž.

# ์ž„๊ณ„๊ฐ’ ์„ค์ •
threshold = 0.8

# ์ƒ๊ด€๊ณ„์ˆ˜๊ฐ€ ๋†’์€ ๋ณ€์ˆ˜๋“ค์„ ์ €์žฅํ•˜๊ธฐ ์œ„ํ•œ ๋นˆ dictionary ์ƒ์„ฑ
above_threshold_vars = {}

# ๊ฐ๊ฐ์˜ ์นผ๋Ÿผ๋งˆ๋‹ค ์ž„๊ณ„์น˜ ์ด์ƒ์˜ ๋ณ€์ˆ˜๋“ค์„ ์ €์žฅ
for col in corrs:
    above_threshold_vars[col] = list(corrs.index[corrs[col] > threshold])

์ƒ๊ด€๊ณ„์ˆ˜๊ฐ€ ๋†’์€ ๋ณ€์ˆ˜๋“ค ์ค‘ 1๊ฐœ์˜ ๋ณ€์ˆ˜๋งŒ ์ œ๊ฑฐ

# ์ œ๊ฑฐํ•  column๋“ค ๋ฐ ์ด๋ฏธ ๊ฒ€์‚ฌ๋œ column๋“ค์˜ ๋ชฉ๋ก์„ ์ €์žฅํ•˜๊ธฐ ์œ„ํ•œ list ์ƒ์„ฑ
cols_to_remove = []
cols_seen = []
cols_to_remove_pair = []

for key, value in above_threshold_vars.items():
    # ์ด๋ฏธ ๊ฒ€์‚ฌ๋œ column ์ €์žฅ
    cols_seen.append(key)
    for x in value:
        if x == key:
            # ์ž๊ธฐ ์ž์‹ ๊ณผ์˜ ์ƒ๊ด€๊ณ„์ˆ˜(= 1)๋Š” ๊ฑด๋„ˆ๋œ€
            continue
        else:
            # ๊ฐ ์Œ ์ค‘ ํ•˜๋‚˜์˜ column๋งŒ์„ ์ œ๊ฑฐ
            if x not in cols_seen:
                cols_to_remove.append(x)
                cols_to_remove_pair.append(key)
            
cols_to_remove = list(set(cols_to_remove))
print('Number of columns to remove: ', len(cols_to_remove))
Number of columns to remove:  134
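์œ„์™€ ๊ฐ™์€ '์Œ๋‹น ํ•œ ๊ฐœ๋งŒ ์ œ๊ฑฐ' ๋กœ์ง์€ ์ž„์˜๋กœ ๋งŒ๋“  ์ž‘์€ ๋ฐ์ดํ„ฐ๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Œ(๋ณ€์ˆ˜ ๊ฐ’์€ ๊ฐ€์ •):

```python
import pandas as pd

# ์ž„์˜์˜ ์˜ˆ์‹œ: b๋Š” a์˜ 2๋ฐฐ(์™„์ „ ์ƒ๊ด€), c๋Š” a, b์™€ ์ƒ๊ด€์ด ๋‚ฎ์Œ
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [2, 4, 6, 8],
                   'c': [4, 3, 9, 1]})
corrs = df.corr()
threshold = 0.8

above_threshold_vars = {col: list(corrs.index[corrs[col] > threshold])
                        for col in corrs}

cols_to_remove, cols_seen = [], []
for key, value in above_threshold_vars.items():
    cols_seen.append(key)
    for x in value:
        # ์ž๊ธฐ ์ž์‹ ์€ ๊ฑด๋„ˆ๋›ฐ๊ณ , ๊ฐ ์Œ์—์„œ ๋‚˜์ค‘์— ๋“ฑ์žฅํ•œ ์ชฝ๋งŒ ์ œ๊ฑฐ ๋Œ€์ƒ์— ์ถ”๊ฐ€
        if x != key and x not in cols_seen:
            cols_to_remove.append(x)

cols_to_remove = list(set(cols_to_remove))
print(cols_to_remove)  # ['b'] — (a, b) ์Œ์—์„œ b๋งŒ ์ œ๊ฑฐ ๋Œ€์ƒ
```

(a, b) ๋‘ ๋ณ€์ˆ˜๊ฐ€ collinearํ•˜๋”๋ผ๋„ ๋‘˜ ๋‹ค ์ œ๊ฑฐํ•˜์ง€ ์•Š๊ณ  ํ•˜๋‚˜๋งŒ ์ œ๊ฑฐํ•œ๋‹ค๋Š” ์ ์„ ๋ณด์—ฌ์ฃผ๋Š” ์˜ˆ์‹œ์ž„.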
### ์ƒ๊ด€๋„๊ฐ€ ๋†’์€ ๋ณ€์ˆ˜๋“ค์„ ์ œ๊ฑฐ

train_corrs_removed = train.drop(columns = cols_to_remove)
test_corrs_removed = test.drop(columns = cols_to_remove)

print('Training Corrs Removed Shape: ', train_corrs_removed.shape)
print('Testing Corrs Removed Shape: ', test_corrs_removed.shape)
Training Corrs Removed Shape:  (307511, 199)
Testing Corrs Removed Shape:  (48744, 198)
train_corrs_removed.to_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/train_bureau_corrs_removed.csv', index = False)
test_corrs_removed.to_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/test_bureau_corrs_removed.csv', index = False)

4. ๋ชจ๋ธ๋ง(Modeling)

  • ํ•ด๋‹น ๋ถ€๋ถ„์€ ์›๋ฌธ์„ ๊ทธ๋Œ€๋กœ ํƒ‘์žฌํ•˜์˜€์Šต๋‹ˆ๋‹ค.

  • ๋ชจ๋ธ์€ LGBM์„ ํ™œ์šฉ + ๊ต์ฐจ ๊ฒ€์ฆ

  • ์œ„์—์„œ ๊ฐ€๊ณตํ•œ ๋ฐ์ดํ„ฐ๋“ค๋กœ ๋ชจ๋ธ์„ ํ•™์Šต/์˜ˆ์ธก/ํ‰๊ฐ€ ํ›„ ์„ฑ๋Šฅ ๋น„๊ต

import numpy as np
import pandas as pd

import lightgbm as lgb

from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

import gc

import matplotlib.pyplot as plt
def model(features, test_features, encoding = 'ohe', n_folds = 5):
    
    """Train and test a light gradient boosting model using
    cross validation. 
    
    Parameters
    --------
        features (pd.DataFrame): 
            dataframe of training features to use 
            for training a model. Must include the TARGET column.
        test_features (pd.DataFrame): 
            dataframe of testing features to use
            for making predictions with the model. 
        encoding (str, default = 'ohe'): 
            method for encoding categorical variables. Either 'ohe' for one-hot encoding or 'le' for integer label encoding
        n_folds (int, default = 5): number of folds to use for cross validation
        
    Return
    --------
        submission (pd.DataFrame): 
            dataframe with `SK_ID_CURR` and `TARGET` probabilities
            predicted by the model.
        feature_importances (pd.DataFrame): 
            dataframe with the feature importances from the model.
        valid_metrics (pd.DataFrame): 
            dataframe with training and validation metrics (ROC AUC) for each fold and overall.
        
    """
    
    # Extract the ids
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    
    # Extract the labels for training
    labels = features['TARGET']
    
    # Remove the ids and target
    features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
    test_features = test_features.drop(columns = ['SK_ID_CURR'])
    
    
    # One Hot Encoding
    if encoding == 'ohe':
        features = pd.get_dummies(features)
        test_features = pd.get_dummies(test_features)
        
        # Align the dataframes by the columns
        features, test_features = features.align(test_features, join = 'inner', axis = 1)
        
        # No categorical indices to record
        cat_indices = 'auto'
    
    # Integer label encoding
    elif encoding == 'le':
        
        # Create a label encoder
        label_encoder = LabelEncoder()
        
        # List for storing categorical indices
        cat_indices = []
        
        # Iterate through each column
        for i, col in enumerate(features):
            if features[col].dtype == 'object':
                # Fit on the union of train and test values so categories
                # unseen in training do not raise an error at transform time
                label_encoder.fit(pd.concat([features[col], test_features[col]]).astype(str))
                features[col] = label_encoder.transform(features[col].astype(str))
                test_features[col] = label_encoder.transform(test_features[col].astype(str))

                # Record the categorical indices
                cat_indices.append(i)
    
    # Catch error if label encoding scheme is not valid
    else:
        raise ValueError("Encoding must be either 'ohe' or 'le'")
        
    print('Training Data Shape: ', features.shape)
    print('Testing Data Shape: ', test_features.shape)
    
    # Extract feature names
    feature_names = list(features.columns)
    
    # Convert to np arrays
    features = np.array(features)
    test_features = np.array(test_features)
    
    # Create the kfold object (shuffle must be True for random_state to apply)
    k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)
    
    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))
    
    # Empty array for test predictions
    test_predictions = np.zeros(test_features.shape[0])
    
    # Empty array for out of fold validation predictions
    out_of_fold = np.zeros(features.shape[0])
    
    # Lists for recording validation and training scores
    valid_scores = []
    train_scores = []
    
    # Iterate through each fold
    for train_indices, valid_indices in k_fold.split(features):
        
        # Training data for the fold
        train_features, train_labels = features[train_indices], labels[train_indices]
        # Validation data for the fold
        valid_features, valid_labels = features[valid_indices], labels[valid_indices]
        
        # Create the model
        model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary', 
                                   class_weight = 'balanced', learning_rate = 0.05, 
                                   reg_alpha = 0.1, reg_lambda = 0.1, 
                                   subsample = 0.8, n_jobs = -1, random_state = 50)
        
        # Train the model with early stopping on the validation set
        # (recent LightGBM versions take these as callbacks, not fit kwargs)
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = cat_indices,
                  callbacks = [lgb.early_stopping(100), lgb.log_evaluation(200)])
        
        # Record the best iteration
        best_iteration = model.best_iteration_
        
        # Record the feature importances
        feature_importance_values += model.feature_importances_ / k_fold.n_splits
        
        # Make predictions
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
        
        # Record the out of fold predictions
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        
        # Record the best score
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']
        
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        
        # Clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        
    # Make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    
    # Make the feature importance dataframe
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
    
    # Overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)
    
    # Add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    
    # Needed for creating dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    
    # Dataframe of validation scores
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores}) 
    
    return submission, feature_importances, metrics
def plot_feature_importances(df):
    """
    Plot importances returned by a model. This can work with any measure of
    feature importance provided that higher importance is better. 
    
    Args:
        df (dataframe): feature importances. Must have the features in a column
        called `feature` and the importances in a column called `importance`.
        
    Returns:
        shows a plot of the 15 most important features
        
        df (dataframe): feature importances sorted by importance (highest to lowest)
        with a column for normalized importance
    """
    
    # Sort features according to importance
    df = df.sort_values('importance', ascending = False).reset_index()
    
    # Normalize the feature importances to add up to one
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()
    
    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:15]))), 
            df['importance_normalized'].head(15), 
            align = 'center', edgecolor = 'k')
    
    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))
    
    # Plot labeling
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()
    
    return df

Control

The first step in any experiment is establishing a control. For this we will use the function defined above (that implements a Gradient Boosting Machine model) and the single main data source (application).

train_control = pd.read_csv('../input/application_train.csv')
test_control = pd.read_csv('../input/application_test.csv')

Fortunately, once we have taken the time to write a function, using it is simple (if thereโ€™s a central theme in this notebook, itโ€™s use functions to make things simpler and reproducible!). The function above returns a submission dataframe we can upload to the competition, a fi dataframe of feature importances, and a metrics dataframe with validation and test performance.

submission, fi, metrics = model(train_control, test_control)
metrics

The control slightly overfits because the training score is higher than the validation score. We can address this in later notebooks when we look at regularization (we already perform some regularization in this model by using reg_lambda and reg_alpha as well as early stopping).
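The train-validation gap mentioned here can be read directly off the metrics dataframe that `model` returns. A minimal sketch with illustrative numbers (only the column names follow the function above):

```python
import pandas as pd

# Illustrative metrics in the shape returned by model()
metrics = pd.DataFrame({'fold': [0, 1, 'overall'],
                        'train': [0.82, 0.81, 0.815],
                        'valid': [0.76, 0.76, 0.760]})

# Train AUC minus validation AUC; a large positive gap signals overfitting
metrics['gap'] = metrics['train'] - metrics['valid']
print(metrics[['fold', 'gap']])
```

A gap that grows as we add features is a hint to strengthen regularization or prune features.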

We can visualize the feature importance with another function, plot_feature_importances. The feature importances may be useful when itโ€™s time for feature selection.

fi_sorted = plot_feature_importances(fi)
submission.to_csv('control.csv', index = False)

The control scores 0.745 when submitted to the competition.

Test One

Letโ€™s conduct the first test. We will just need to pass in the data to the function, which does most of the work for us.

submission_raw, fi_raw, metrics_raw = model(train, test)
metrics_raw

Based on these numbers, the engineered features perform better than the control case. However, we will have to submit the predictions to the leaderboard before we can say if this better validation performance transfers to the testing data.

fi_raw_sorted = plot_feature_importances(fi_raw)

Examining the feature importances, it looks as if a few of the features we constructed are among the most important. Let’s find the percentage of the top 100 most important features that we made in this notebook. Rather than compare against the raw original features, we need to compare against the one-hot encoded original features; these are already recorded for us in fi (from the control run on the original data).

top_100 = list(fi_raw_sorted['feature'])[:100]
new_features = [x for x in top_100 if x not in list(fi['feature'])]

print('%% of Top 100 Features created from the bureau data = %d.00' % len(new_features))

Over half of the top 100 features were made by us! That should give us confidence that all the hard work we did was worthwhile.

submission_raw.to_csv('test_one.csv', index = False)

Test one scores 0.759 when submitted to the competition.

Test Two

That was easy, so letโ€™s do another run! Same as before but with the highly collinear variables removed.
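The collinearity filter was built earlier in the kernel; as a reminder, the idea can be sketched roughly as follows (the helper name and the 0.8 threshold are illustrative, not the kernel's exact code):

```python
import numpy as np
import pandas as pd

def remove_collinear(df, threshold=0.8):
    """Drop one column from every pair whose absolute
    Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy example: 'b' is an exact multiple of 'a', so it is dropped
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
reduced = remove_collinear(df)
```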

submission_corrs, fi_corrs, metrics_corr = model(train_corrs_removed, test_corrs_removed)
metrics_corr

These results are better than the control, but slightly lower than the raw features.

fi_corrs_sorted = plot_feature_importances(fi_corrs)
submission_corrs.to_csv('test_two.csv', index = False)

Test Two scores 0.753 when submitted to the competition.

Results

After all that work, we can say that including the extra information did improve performance! The model is definitely not optimized to our data, but we still had a noticeable improvement over the original dataset when using the calculated features. Letโ€™s officially summarize the performances:

| Experiment | Train AUC | Validation AUC | Test AUC |
|------------|-----------|----------------|----------|
| Control    | 0.815     | 0.760          | 0.745    |
| Test One   | 0.837     | 0.767          | 0.759    |
| Test Two   | 0.826     | 0.765          | 0.753    |

(Note that these scores may change from run to run of the notebook. I have not observed that the general ordering changes however.)

All of our hard work translates to a small improvement of 0.014 ROC AUC over the original testing data. Removing the highly collinear variables slightly decreases performance so we will want to consider a different method for feature selection. Moreover, we can say that some of the features we built are among the most important as judged by the model.
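One alternative selection method is to keep only the features that account for most of the cumulative importance. A minimal sketch, assuming the dataframe shape returned by `plot_feature_importances` (the 95% cutoff and the toy values are illustrative):

```python
import pandas as pd

# Toy importances in the shape returned by model()
fi = pd.DataFrame({'feature': ['f1', 'f2', 'f3', 'f4'],
                   'importance': [40.0, 30.0, 20.0, 10.0]})

fi = fi.sort_values('importance', ascending=False).reset_index(drop=True)
fi['cumulative'] = (fi['importance'] / fi['importance'].sum()).cumsum()

# Keep the smallest prefix of features covering 95% of total importance
keep = fi.loc[fi['cumulative'] <= 0.95, 'feature'].tolist()
```

Unlike a collinearity filter, this directly uses what the model found useful, at the cost of depending on one trained model's importances.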

In a competition such as this, even an improvement of this size is enough to move us up 100s of spots on the leaderboard. By making numerous small improvements such as in this notebook, we can gradually achieve better and better performance. I encourage others to use the results here to make their own improvements, and I will continue to document the steps I take to help others.

Next Steps

Going forward, we can now use the functions we developed in this notebook on the other datasets. There are still 4 other data files to use in our model! In the next notebook, we will incorporate the information from these other data files (which contain information on previous loans at Home Credit) into our training data. Then we can build the same model and run more experiments to determine the effect of our feature engineering. There is plenty more work to be done in this competition, and plenty more gains in performance to be had! Iโ€™ll see you in the next notebook.