[ECC DS Week 3] 2. Introduction: Manual Feature Engineering
📚 Reference
1. Introduction: Manual Feature Engineering
- This kernel covers how to create features based on the data from the Home Credit Default Risk Competition.
- The earlier part of the kernel built models using only the `application` data; the best model based on that data alone scored about 0.74 on the leaderboard.
- To gather more information, we make use of the other dataframes.
- Here, the `bureau` and `bureau_balance` data are used.
- `bureau`
  - A client's previous loans at other financial institutions, as reported to Home Credit.
  - Each previous loan has its own row.
- `bureau_balance`
  - Monthly data about the previous loans.
  - Each month has its own row.
-
Manual Feature Engineering์ ์ด๋ป๊ฒ ๋ณด๋ฉด ์ง๋ฃจํ ๊ณผ์ ์ผ ์ ์์ + ํด๋น ์์ ์ ๋๋ฉ์ธ ์ง์(domain expertise)์ ํ์๋ก ํ๊ธฐ๋ ํจ
-
๋์ถ ๋ฐ ์ฑ๋ฌด ๋ถ์ดํ์ ์ฃผ๋ ์์ธ์ ๋ํ ์ง์์ ๊ฐ์ถ๋๋ฐ๋ ํ๊ณ๊ฐ ์๊ธฐ ๋๋ฌธ์, ์ต์ข ํ์ต์ฉ ๋ฐ์ดํฐํ๋ ์์์ ๊ฐ๋ฅํ ๋ง์ ์ ๋ณด๋ค์ ์ป๋๋ฐ ์ฃผ์์ ์ ๋์์
-
์ฆ, ์ด ์ปค๋์ ์ด๋ค feature๋ค์ด ์ค์ํ์ง๋ฅผ ๊ฒฐ์ ํ๋ ๊ฒ์ ์์ด์ ์ฌ๋๋ณด๋ค ๋ชจ๋ธ์ด ๊ณ ๋ฅด๋๋ก ํ๋ ์ ๊ทผ๋ฐฉ์์ ํํจ
-
์ต๋ํ ๋ง์ feature๋ค์ ๋ง๋ค๊ณ , ๋ชจ๋ธ์ ์ด๋ฌํ feature๋ค์ ์ ๋ถ ํ์ฉ -> ์ถํ ๋ชจ๋ธ์์ ์ป์ด์ง feature importance๋ PCA๋ฅผ ํตํด feautre reduction์ ํ ์ ์์
-
-
Manual Feature Engineering์ ๊ฐ ๊ณผ์ ์ ๋ง์ ์์ Pandas ์ฝ๋์ ์ฝ๊ฐ์ ์ธ๋ด์ฌ, ํนํ ๋ฐ์ดํฐ ์ฒ๋ฆฌ์ ์์ด์ ๋ง์ ์ธ๋ด์ฌ์ ํ์๋ก ํจ
- ๋น๋ก ์๋ํ๋ Feature Engineering ๋๊ตฌ๋ค์ด ํ์ฉ๋๊ธฐ ์์ํ์ง๋ง, ๋น๋ถ๊ฐ Feature Engineering์ ์ฌ์ ํ ์ ์ฒ๋ฆฌ ์์ ์ ํ์๋ก ํจ
# Pandas and NumPy for data manipulation
import pandas as pd
import numpy as np
# matplotlib and seaborn for visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress warnings from pandas
import warnings
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')
1-1. Example: Counts of a client's previous loans
- To illustrate a general pattern for manual feature engineering, we first simply count each client's previous loans at other financial institutions.
📌 Frequently used Pandas commands
- `groupby`: group the dataframe by the values of a column
  - Here we group the dataframe per client using the `SK_ID_CURR` column.
- `agg`: compute aggregates of the grouped data
  - Compute the mean directly with `grouped_df.mean()`, or pass a list to `agg` to get the mean, max, min, sum, etc. at once (`grouped_df.agg(['mean', 'max', 'min', 'sum'])`).
- `merge`: match the aggregated values to the corresponding client
  - Merge the aggregated values into the original training data on the `SK_ID_CURR` column, inserting NaN where a client has no value.
- `rename`: rename columns using a dict
  - Useful for keeping track of the newly created variables.
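The groupby → agg → rename → merge pattern above can be sketched on a toy dataframe (the data and column names here are made up for illustration, they are not the competition data):

```python
import pandas as pd

# Hypothetical stand-in for bureau: one row per previous loan
loans = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2, 3, 3, 3],
    'AMT_CREDIT_SUM': [100.0, 200.0, 50.0, 10.0, 20.0, 30.0],
})

# 1) groupby + 2) agg: one statistic per entry in the list
stats = loans.groupby('SK_ID_CURR')['AMT_CREDIT_SUM'] \
             .agg(['mean', 'max', 'min', 'sum']).reset_index()

# 3) rename with a dict so the new features stay traceable
stats = stats.rename(columns={'mean': 'loan_amt_mean', 'max': 'loan_amt_max',
                              'min': 'loan_amt_min', 'sum': 'loan_amt_sum'})

# 4) merge back onto the main table; clients without loans get NaN
clients = pd.DataFrame({'SK_ID_CURR': [1, 2, 3, 4]})
clients = clients.merge(stats, on='SK_ID_CURR', how='left')
print(clients)
```

Client 4 has no rows in `loans`, so after the left merge its aggregate columns are NaN, exactly the situation `fillna(0)` handles later.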
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
### Load the bureau data
bureau = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/3주차/data/bureau.csv')
bureau.head()
SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
### Group by client id (SK_ID_CURR)
# Count the previous loans and rename the column
previous_loan_counts = bureau.groupby('SK_ID_CURR', as_index = False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'previous_loan_counts'})
previous_loan_counts.head()
SK_ID_CURR | previous_loan_counts | |
---|---|---|
0 | 100001 | 7 |
1 | 100002 | 8 |
2 | 100003 | 4 |
3 | 100004 | 2 |
4 | 100005 | 3 |
### Merge (join) with the training dataframe
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/3주차/data/application_train.csv')
train = train.merge(previous_loan_counts, on = 'SK_ID_CURR', how = 'left')
### Fill missing values with 0
train['previous_loan_counts'] = train['previous_loan_counts'].fillna(0)
train.head()
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | previous_loan_counts | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 |
4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows ร 123 columns
- Scroll all the way to the right to see the newly created column (`previous_loan_counts`)!
1-2. Assessing a variable's usefulness with the r-value
- To judge whether the newly created variables are useful, first compute the Pearson correlation coefficient (r-value) between each variable and the target.
📌 Pearson correlation coefficient
- Expresses the linear relationship between two variables as a value between -1 (a perfect negative linear relationship) and +1 (a perfect positive linear relationship).
- The r-value is not the best way to assess a variable's usefulness, but it gives a rough idea of whether the variable may help a machine learning model.
- The larger a variable's r-value with respect to the target, the more likely the variable influences the target.
- We therefore look for the variables whose r-value with the target has the largest absolute value.
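As a quick illustration of what the r-value measures, a minimal sketch on synthetic data (not the competition data): one feature built to carry signal, one that is pure noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A binary target plus two candidate features: one carries signal, one is noise
target = pd.Series(rng.integers(0, 2, size=1000))
signal = pd.Series(2.0 * target + rng.normal(0, 0.5, size=1000))
noise = pd.Series(rng.normal(size=1000))

# Series.corr computes the Pearson correlation coefficient by default
r_signal = target.corr(signal)
r_noise = target.corr(noise)
print('signal r = %.3f, noise r = %.3f' % (r_signal, r_noise))
```

The signal feature produces an r-value near 0.9, while the noise feature stays near 0, which is the pattern we hope to see when ranking new features.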
📌 Kernel Density Estimate Plots
- Visualize the relationship with the target.
- A KDE shows the distribution of a single variable (think of it as a smoothed histogram).
- It can serve as an indicator of whether a variable is relevant to a machine learning model.
- To see how the distribution differs across the values of a categorical variable, color the curves by category.
  - For example, we can draw the KDE of `previous_loan_counts` colored by whether `TARGET` is 0 or 1.
### Function for plotting kernel density estimates
def kde_target(var_name, df):
    '''
    input:
        var_name: str, column to plot
        df: DataFrame, dataframe to use
    return: None
    '''
    ### Compute statistics (correlation, medians)
    # Correlation between the new variable and the target
    corr = df['TARGET'].corr(df[var_name])

    # Median of each group
    avg_repaid = df.loc[df['TARGET'] == 0, var_name].median()      # loan repaid
    avg_not_repaid = df.loc[df['TARGET'] == 1, var_name].median()  # loan not repaid

    ### Visualization
    # Set up the figure
    plt.figure(figsize = (12, 6))

    # Plot separate curves colored by the target value
    sns.kdeplot(df.loc[df['TARGET'] == 0, var_name], label = 'TARGET == 0')
    sns.kdeplot(df.loc[df['TARGET'] == 1, var_name], label = 'TARGET == 1')

    # Labeling
    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)
    plt.legend()

    ### Print the results
    # Correlation
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))
    # Medians
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid = %0.4f' % avg_repaid)
    print()
Visualizing `EXT_SOURCE_3`
- As a test, use `EXT_SOURCE_3`, identified in the previous notebook as the most important variable for both the RandomForest and the GradientBoostingMachine.
kde_target('EXT_SOURCE_3', train)
The correlation between EXT_SOURCE_3 and the TARGET is -0.1789 Median value for loan that was not repaid = 0.3791 Median value for loan that was repaid = 0.5460
Visualizing `previous_loan_counts`
- Visualize the newly created variable.
kde_target('previous_loan_counts', train)
The correlation between previous_loan_counts and the TARGET is -0.0100 Median value for loan that was not repaid = 3.0000 Median value for loan that was repaid = 4.0000
- The newly created variable (`previous_loan_counts`) does not appear to be important:
  - The correlation coefficient is too small.
  - There is almost no difference in the distributions across the target values.
1-3. Creating new variables
a) Aggregate statistics of the numeric variables
- `agg()` computes the mean, max, min, sum, etc. of a dataframe; you can also write your own function and pass it in.
- To make use of the numeric variables in the bureau dataframe, compute aggregate statistics for all of them.
  - To do this, `groupby()` the client id (`SK_ID_CURR`), compute the aggregates of the grouped dataframe with `agg()`, and then merge the result into the train data with `merge()`.
### Compute the aggregate statistics
# Group the dataframe by client id, then aggregate
bureau_agg = bureau.drop(columns = ['SK_ID_BUREAU']).groupby('SK_ID_CURR', as_index = False).agg(['count', 'mean', 'max', 'min', 'sum']).reset_index()
bureau_agg.head()
SK_ID_CURR | DAYS_CREDIT | CREDIT_DAY_OVERDUE | ... | DAYS_CREDIT_UPDATE | AMT_ANNUITY | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | max | min | sum | count | mean | max | min | ... | count | mean | max | min | sum | count | mean | max | min | sum | ||
0 | 100001 | 7 | -735.000000 | -49 | -1572 | -5145 | 7 | 0.0 | 0 | 0 | ... | 7 | -93.142857 | -6 | -155 | -652 | 7 | 3545.357143 | 10822.5 | 0.0 | 24817.5 |
1 | 100002 | 8 | -874.000000 | -103 | -1437 | -6992 | 8 | 0.0 | 0 | 0 | ... | 8 | -499.875000 | -7 | -1185 | -3999 | 7 | 0.000000 | 0.0 | 0.0 | 0.0 |
2 | 100003 | 4 | -1400.750000 | -606 | -2586 | -5603 | 4 | 0.0 | 0 | 0 | ... | 4 | -816.000000 | -43 | -2131 | -3264 | 0 | NaN | NaN | NaN | 0.0 |
3 | 100004 | 2 | -867.000000 | -408 | -1326 | -1734 | 2 | 0.0 | 0 | 0 | ... | 2 | -532.000000 | -382 | -682 | -1064 | 0 | NaN | NaN | NaN | 0.0 |
4 | 100005 | 3 | -190.666667 | -62 | -373 | -572 | 3 | 0.0 | 0 | 0 | ... | 3 | -54.333333 | -11 | -121 | -163 | 3 | 1420.500000 | 4261.5 | 0.0 | 4261.5 |
5 rows ร 61 columns
- It is better to give the newly created columns new names.
  - The code below appends the kind of aggregate statistic to each original column name.
- Note that we are now dealing with a dataframe whose columns form a multi-level index.
  - This can be confusing and is hard to modify, so we convert it to a single-level index as quickly as possible.
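As a side note, a multi-level column index like this can also be flattened in one pass by joining each (variable, statistic) pair in a list comprehension — a minimal sketch on a made-up frame, not the notebook's exact data:

```python
import pandas as pd

df = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                   'A': [1, 2, 3],
                   'B': [4.0, 5.0, 6.0]})

# agg with a list of statistics produces multi-level columns: (variable, stat)
agg = df.groupby('SK_ID_CURR').agg(['mean', 'max'])

# Flatten each (variable, stat) pair into a single-level name like 'bureau_A_mean'
agg.columns = ['bureau_%s_%s' % (var, stat) for var, stat in agg.columns]
agg = agg.reset_index()
print(agg.columns.tolist())
```

Iterating directly over `agg.columns` yields the (variable, stat) tuples in column order, which avoids relying on the sorted `levels` attribute.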
### Rename the columns
# List to store the new column names
columns = ['SK_ID_CURR']

for var in bureau_agg.columns.levels[0]:  # level 0 holds the original variable names only
    # Skip the id column
    if var != 'SK_ID_CURR':
        # Loop over the kinds of aggregate statistics
        for stat in bureau_agg.columns.levels[1][:-1]:
            # Build the new column name from the variable and the statistic
            columns.append('bureau_%s_%s' % (var, stat))  # bureau_variable_stat

### Assign the list as the dataframe's column names
bureau_agg.columns = columns
bureau_agg.head()
SK_ID_CURR | bureau_DAYS_CREDIT_count | bureau_DAYS_CREDIT_mean | bureau_DAYS_CREDIT_max | bureau_DAYS_CREDIT_min | bureau_DAYS_CREDIT_sum | bureau_CREDIT_DAY_OVERDUE_count | bureau_CREDIT_DAY_OVERDUE_mean | bureau_CREDIT_DAY_OVERDUE_max | bureau_CREDIT_DAY_OVERDUE_min | ... | bureau_DAYS_CREDIT_UPDATE_count | bureau_DAYS_CREDIT_UPDATE_mean | bureau_DAYS_CREDIT_UPDATE_max | bureau_DAYS_CREDIT_UPDATE_min | bureau_DAYS_CREDIT_UPDATE_sum | bureau_AMT_ANNUITY_count | bureau_AMT_ANNUITY_mean | bureau_AMT_ANNUITY_max | bureau_AMT_ANNUITY_min | bureau_AMT_ANNUITY_sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100001 | 7 | -735.000000 | -49 | -1572 | -5145 | 7 | 0.0 | 0 | 0 | ... | 7 | -93.142857 | -6 | -155 | -652 | 7 | 3545.357143 | 10822.5 | 0.0 | 24817.5 |
1 | 100002 | 8 | -874.000000 | -103 | -1437 | -6992 | 8 | 0.0 | 0 | 0 | ... | 8 | -499.875000 | -7 | -1185 | -3999 | 7 | 0.000000 | 0.0 | 0.0 | 0.0 |
2 | 100003 | 4 | -1400.750000 | -606 | -2586 | -5603 | 4 | 0.0 | 0 | 0 | ... | 4 | -816.000000 | -43 | -2131 | -3264 | 0 | NaN | NaN | NaN | 0.0 |
3 | 100004 | 2 | -867.000000 | -408 | -1326 | -1734 | 2 | 0.0 | 0 | 0 | ... | 2 | -532.000000 | -382 | -682 | -1064 | 0 | NaN | NaN | NaN | 0.0 |
4 | 100005 | 3 | -190.666667 | -62 | -373 | -572 | 3 | 0.0 | 0 | 0 | ... | 3 | -54.333333 | -11 | -121 | -163 | 3 | 1420.500000 | 4261.5 | 0.0 | 4261.5 |
5 rows ร 61 columns
### Merge with the training data
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
train.head()
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | bureau_DAYS_CREDIT_UPDATE_count | bureau_DAYS_CREDIT_UPDATE_mean | bureau_DAYS_CREDIT_UPDATE_max | bureau_DAYS_CREDIT_UPDATE_min | bureau_DAYS_CREDIT_UPDATE_sum | bureau_AMT_ANNUITY_count | bureau_AMT_ANNUITY_mean | bureau_AMT_ANNUITY_max | bureau_AMT_ANNUITY_min | bureau_AMT_ANNUITY_sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 8.0 | -499.875 | -7.0 | -1185.0 | -3999.0 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 4.0 | -816.000 | -43.0 | -2131.0 | -3264.0 | 0.0 | NaN | NaN | NaN | 0.0 |
2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 2.0 | -532.000 | -382.0 | -682.0 | -1064.0 | 0.0 | NaN | NaN | NaN | 0.0 |
3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 1.0 | -783.000 | -783.0 | -783.0 | -783.0 | 0.0 | NaN | NaN | NaN | 0.0 |
5 rows ร 183 columns
b) Correlations of the aggregate values with the target
- Analyze the correlation between the newly created values and the target.
# List to store the correlations of the newly created variables
new_corrs = []

# Iterate over the new variables...
for col in columns:
    # Correlation with the target
    corr = train['TARGET'].corr(train[col])

    # Append to the list as a tuple
    new_corrs.append((col, corr))
- Sort the correlations by absolute value, using the `sorted()` function.
### Sort the correlations by absolute value
new_corrs = sorted(new_corrs, key = lambda x: abs(x[1]), reverse = True)  # descending order
new_corrs[:15]  # top 15 variables by absolute correlation
[('bureau_DAYS_CREDIT_mean', 0.08972896721998114), ('bureau_DAYS_CREDIT_min', 0.0752482510301036), ('bureau_DAYS_CREDIT_UPDATE_mean', 0.06892735266968673), ('bureau_DAYS_ENDDATE_FACT_min', 0.05588737984392077), ('bureau_DAYS_CREDIT_ENDDATE_sum', 0.0537348956010205), ('bureau_DAYS_ENDDATE_FACT_mean', 0.05319962585758616), ('bureau_DAYS_CREDIT_max', 0.04978205463997299), ('bureau_DAYS_ENDDATE_FACT_sum', 0.048853502611115894), ('bureau_DAYS_CREDIT_ENDDATE_mean', 0.046982754334835494), ('bureau_DAYS_CREDIT_UPDATE_min', 0.042863922470730155), ('bureau_DAYS_CREDIT_sum', 0.041999824814846716), ('bureau_DAYS_CREDIT_UPDATE_sum', 0.04140363535306002), ('bureau_DAYS_CREDIT_ENDDATE_max', 0.036589634696329094), ('bureau_DAYS_CREDIT_ENDDATE_min', 0.034281109921616024), ('bureau_DAYS_ENDDATE_FACT_count', -0.030492306653325495)]
- None of the newly created variables has a significant correlation with the target.
### Visualize bureau_DAYS_CREDIT_mean, the variable with the highest absolute correlation
kde_target('bureau_DAYS_CREDIT_mean', train)
The correlation between bureau_DAYS_CREDIT_mean and the TARGET is 0.0897 Median value for loan that was not repaid = -835.3333 Median value for loan that was repaid = -1067.0000
- This column captures "how many days before the current application did the client apply for credit at the credit bureau?"
  - It can be interpreted as the number of days between a previous loan and the loan application at Home Credit.
  - A larger negative value means the previous loan was taken further in the past.
- We can see that clients whose previous loans are further in the past tend to be more likely to repay their Home Credit loan (a weak positive correlation).
- But the correlation is very low, so it may be nothing but noise.
⚠ The Multiple Comparisons Problem
- When there are very many variables, some of them are expected to correlate with the target purely by chance; this is known as the multiple comparisons problem.
- We can create hundreds of features, and a few of them will look related to the target simply because of random noise in the data.
- Such variables appear related to the target in the train set, but the relationship may not generalize to the test set, which can cause the model to overfit.
- Creating features therefore requires careful consideration.
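The multiple comparisons point can be demonstrated with a small simulation (purely synthetic data): even when every feature is random noise, the single largest correlation out of many looks deceptively meaningful.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows, n_features = 500, 200

# Random features and a random binary target: no real relationship exists
features = pd.DataFrame(rng.normal(size=(n_rows, n_features)))
target = pd.Series(rng.integers(0, 2, size=n_rows))

# Pearson correlation of every random feature with the target
corrs = features.apply(lambda col: target.corr(col))

# The best |r| among 200 noise features is well away from zero
best = corrs.abs().max()
print('largest |r| among %d random features: %.4f' % (n_features, best))
```

With 500 rows, the standard error of r is about 1/√500 ≈ 0.045, so the maximum over 200 noise features will typically land above 0.1: a "top" feature selected this way can be nothing but chance.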
c) A function for aggregating the numeric variables
- Wrap the work above into a function that computes the aggregate statistics of the numeric variables.
- It computes the aggregates of the numeric variables across a whole dataframe.
- Parameters:
  - `df (dataframe)`: the dataframe to operate on
  - `group_var (string)`: the column to group by
  - `df_name (string)`: a string used when renaming the columns
- Returns:
  - `agg (dataframe)`: a dataframe holding the aggregate statistics of all the numeric variables
    - Each grouped instance gets its aggregates (count, mean, max, min, sum).
    - The columns are renamed to distinguish the newly created features.
def agg_numeric(df, group_var, df_name):
    # Remove id columns other than the grouping variable
    for col in df:
        if col != group_var and 'SK_ID' in col:
            df = df.drop(columns = col)

    group_ids = df[group_var]
    numeric_df = df.select_dtypes('number')
    numeric_df[group_var] = group_ids

    # Group by the specified variable and compute the aggregate statistics
    agg = numeric_df.groupby(group_var).agg(['count', 'mean', 'max', 'min', 'sum']).reset_index()

    # Build the new column names
    columns = [group_var]

    # Loop over the variable names (level 0)
    for var in agg.columns.levels[0]:
        # Skip the id column
        if var != group_var:
            # Loop over the kinds of statistics (level 1)
            for stat in agg.columns.levels[1][:-1]:
                # Build the new column name from the df name, variable, and statistic
                columns.append('%s_%s_%s' % (df_name, var, stat))

    agg.columns = columns
    return agg
### Aggregate statistics of the numeric variables
bureau_agg_new = agg_numeric(bureau.drop(columns = ['SK_ID_BUREAU']),
group_var = 'SK_ID_CURR',
df_name = 'bureau')
bureau_agg_new.head()
SK_ID_CURR | bureau_DAYS_CREDIT_count | bureau_DAYS_CREDIT_mean | bureau_DAYS_CREDIT_max | bureau_DAYS_CREDIT_min | bureau_DAYS_CREDIT_sum | bureau_CREDIT_DAY_OVERDUE_count | bureau_CREDIT_DAY_OVERDUE_mean | bureau_CREDIT_DAY_OVERDUE_max | bureau_CREDIT_DAY_OVERDUE_min | ... | bureau_DAYS_CREDIT_UPDATE_count | bureau_DAYS_CREDIT_UPDATE_mean | bureau_DAYS_CREDIT_UPDATE_max | bureau_DAYS_CREDIT_UPDATE_min | bureau_DAYS_CREDIT_UPDATE_sum | bureau_AMT_ANNUITY_count | bureau_AMT_ANNUITY_mean | bureau_AMT_ANNUITY_max | bureau_AMT_ANNUITY_min | bureau_AMT_ANNUITY_sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100001 | 7 | -735.000000 | -49 | -1572 | -5145 | 7 | 0.0 | 0 | 0 | ... | 7 | -93.142857 | -6 | -155 | -652 | 7 | 3545.357143 | 10822.5 | 0.0 | 24817.5 |
1 | 100002 | 8 | -874.000000 | -103 | -1437 | -6992 | 8 | 0.0 | 0 | 0 | ... | 8 | -499.875000 | -7 | -1185 | -3999 | 7 | 0.000000 | 0.0 | 0.0 | 0.0 |
2 | 100003 | 4 | -1400.750000 | -606 | -2586 | -5603 | 4 | 0.0 | 0 | 0 | ... | 4 | -816.000000 | -43 | -2131 | -3264 | 0 | NaN | NaN | NaN | 0.0 |
3 | 100004 | 2 | -867.000000 | -408 | -1326 | -1734 | 2 | 0.0 | 0 | 0 | ... | 2 | -532.000000 | -382 | -682 | -1064 | 0 | NaN | NaN | NaN | 0.0 |
4 | 100005 | 3 | -190.666667 | -62 | -373 | -572 | 3 | 0.0 | 0 | 0 | ... | 3 | -54.333333 | -11 | -121 | -163 | 3 | 1420.500000 | 4261.5 | 0.0 | 4261.5 |
5 rows ร 61 columns
### The dataframe built by hand earlier
bureau_agg.head()
SK_ID_CURR | bureau_DAYS_CREDIT_count | bureau_DAYS_CREDIT_mean | bureau_DAYS_CREDIT_max | bureau_DAYS_CREDIT_min | bureau_DAYS_CREDIT_sum | bureau_CREDIT_DAY_OVERDUE_count | bureau_CREDIT_DAY_OVERDUE_mean | bureau_CREDIT_DAY_OVERDUE_max | bureau_CREDIT_DAY_OVERDUE_min | ... | bureau_DAYS_CREDIT_UPDATE_count | bureau_DAYS_CREDIT_UPDATE_mean | bureau_DAYS_CREDIT_UPDATE_max | bureau_DAYS_CREDIT_UPDATE_min | bureau_DAYS_CREDIT_UPDATE_sum | bureau_AMT_ANNUITY_count | bureau_AMT_ANNUITY_mean | bureau_AMT_ANNUITY_max | bureau_AMT_ANNUITY_min | bureau_AMT_ANNUITY_sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100001 | 7 | -735.000000 | -49 | -1572 | -5145 | 7 | 0.0 | 0 | 0 | ... | 7 | -93.142857 | -6 | -155 | -652 | 7 | 3545.357143 | 10822.5 | 0.0 | 24817.5 |
1 | 100002 | 8 | -874.000000 | -103 | -1437 | -6992 | 8 | 0.0 | 0 | 0 | ... | 8 | -499.875000 | -7 | -1185 | -3999 | 7 | 0.000000 | 0.0 | 0.0 | 0.0 |
2 | 100003 | 4 | -1400.750000 | -606 | -2586 | -5603 | 4 | 0.0 | 0 | 0 | ... | 4 | -816.000000 | -43 | -2131 | -3264 | 0 | NaN | NaN | NaN | 0.0 |
3 | 100004 | 2 | -867.000000 | -408 | -1326 | -1734 | 2 | 0.0 | 0 | 0 | ... | 2 | -532.000000 | -382 | -682 | -1064 | 0 | NaN | NaN | NaN | 0.0 |
4 | 100005 | 3 | -190.666667 | -62 | -373 | -572 | 3 | 0.0 | 0 | 0 | ... | 3 | -54.333333 | -11 | -121 | -163 | 3 | 1420.500000 | 4261.5 | 0.0 | 4261.5 |
5 rows ร 61 columns
- We can confirm that the two dataframes are identical.
d) A function for computing correlations
- Compute the correlation between each variable and the target.
### Function to compute correlations with the target within a dataframe
def target_corrs(df):
    # List to store the correlations
    corrs = []

    # Iterate over the columns...
    for col in df.columns:
        print(col)
        # Skip the target column (its correlation with itself is 1)
        if col != 'TARGET':
            # Compute the correlation
            corr = df['TARGET'].corr(df[col])

            # Append to the list as a tuple
            corrs.append((col, corr))

    # Sort by absolute value, descending
    corrs = sorted(corrs, key = lambda x: abs(x[1]), reverse = True)

    return corrs
1-4. Categorical Variables
1️⃣ Categorical variables are mostly strings, so statistics such as the mean or the max are hard to apply to them.
- Instead, count the occurrences of each category.
📌 Example
- Data like this...
SK_ID_CURR | Loan type |
---|---|
1 | home |
1 | home |
1 | home |
1 | credit |
2 | credit |
3 | credit |
3 | cash |
3 | cash |
4 | credit |
4 | home |
4 | home |
- ... can be transformed as follows, using each client's count of loans per category:
SK_ID_CURR | credit count | cash count | home count | total count |
---|---|---|---|---|
1 | 1 | 0 | 3 | 4 |
2 | 1 | 0 | 0 | 1 |
3 | 1 | 2 | 0 | 3 |
4 | 1 | 0 | 2 | 3 |
2️⃣ Next, normalize the counts by each client's total number of loans.
- Scale so that each client's normalized counts sum to 1.
SK_ID_CURR | credit count | cash count | home count | total count | credit count norm | cash count norm | home count norm |
---|---|---|---|---|---|---|---|
1 | 1 | 0 | 3 | 4 | 0.25 | 0 | 0.75 |
2 | 1 | 0 | 0 | 1 | 1.00 | 0 | 0 |
3 | 1 | 2 | 0 | 3 | 0.33 | 0.66 | 0 |
4 | 1 | 0 | 2 | 3 | 0.33 | 0 | 0.66 |
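The two tables above can be reproduced with `get_dummies` plus a grouped sum — a minimal sketch using the hypothetical `Loan type` data (illustrative only, not the bureau data):

```python
import pandas as pd

loans = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4],
    'Loan type': ['home', 'home', 'home', 'credit', 'credit',
                  'credit', 'cash', 'cash', 'credit', 'home', 'home'],
})

# One-hot encode the category, then count occurrences per client with a grouped sum
dummies = pd.get_dummies(loans['Loan type'])
dummies['SK_ID_CURR'] = loans['SK_ID_CURR']
counts = dummies.groupby('SK_ID_CURR').sum()
counts['total count'] = counts.sum(axis=1)

# Normalize so that each client's counts sum to 1
norm = counts[['credit', 'cash', 'home']].div(counts['total count'], axis=0)
print(counts)
print(norm.round(2))
```

Client 1 gets home count 3, total count 4, and home count norm 0.75, matching the tables.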
a) Encoding categorical variables
- Encoding the categorical variables lets us capture the information the data contains.
- Apply one-hot encoding
  - Applied only to the categorical variables (`dtype=object`).
### One-hot Encoding
categorical = pd.get_dummies(bureau.select_dtypes('object'))
categorical['SK_ID_CURR'] = bureau['SK_ID_CURR']
categorical.head()
CREDIT_ACTIVE_Active | CREDIT_ACTIVE_Bad debt | CREDIT_ACTIVE_Closed | CREDIT_ACTIVE_Sold | CREDIT_CURRENCY_currency 1 | CREDIT_CURRENCY_currency 2 | CREDIT_CURRENCY_currency 3 | CREDIT_CURRENCY_currency 4 | CREDIT_TYPE_Another type of loan | CREDIT_TYPE_Car loan | ... | CREDIT_TYPE_Loan for business development | CREDIT_TYPE_Loan for purchase of shares (margin lending) | CREDIT_TYPE_Loan for the purchase of equipment | CREDIT_TYPE_Loan for working capital replenishment | CREDIT_TYPE_Microloan | CREDIT_TYPE_Mobile operator loan | CREDIT_TYPE_Mortgage | CREDIT_TYPE_Real estate loan | CREDIT_TYPE_Unknown type of loan | SK_ID_CURR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 215354 |
1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 215354 |
2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 215354 |
3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 215354 |
4 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 215354 |
5 rows ร 24 columns
### Group by client id
categorical_grouped = categorical.groupby('SK_ID_CURR').agg(['sum', 'mean'])
categorical_grouped.head()
CREDIT_ACTIVE_Active | CREDIT_ACTIVE_Bad debt | CREDIT_ACTIVE_Closed | CREDIT_ACTIVE_Sold | CREDIT_CURRENCY_currency 1 | ... | CREDIT_TYPE_Microloan | CREDIT_TYPE_Mobile operator loan | CREDIT_TYPE_Mortgage | CREDIT_TYPE_Real estate loan | CREDIT_TYPE_Unknown type of loan | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sum | mean | sum | mean | sum | mean | sum | mean | sum | mean | ... | sum | mean | sum | mean | sum | mean | sum | mean | sum | mean | |
SK_ID_CURR | |||||||||||||||||||||
100001 | 3 | 0.428571 | 0 | 0.0 | 4 | 0.571429 | 0 | 0.0 | 7 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100002 | 2 | 0.250000 | 0 | 0.0 | 6 | 0.750000 | 0 | 0.0 | 8 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100003 | 1 | 0.250000 | 0 | 0.0 | 3 | 0.750000 | 0 | 0.0 | 4 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100004 | 0 | 0.000000 | 0 | 0.0 | 2 | 1.000000 | 0 | 0.0 | 2 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100005 | 2 | 0.666667 | 0 | 0.0 | 1 | 0.333333 | 0 | 0.0 | 3 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
5 rows ร 46 columns
- `sum`: the total number of the client's loans in that category
- `mean`: the counts, normalized
- One-hot encoding makes these quantities easy to compute.
- Here we rename the column headers with an approach similar to the one used before, again dealing with columns that form a multi-level index.
  - First loop over the first level (level 0), which carries the names of the one-hot encoded columns, then loop over the computed statistics.
  - Then combine the level 0 name with the kind of statistic to redefine the column headers.
    - For example, the column with `CREDIT_ACTIVE_Active` at level 0 and `sum` at level 1 becomes `CREDIT_ACTIVE_Active_count`.
Step by step in code:
### Level 0: the names of the one-hot encoded columns
categorical_grouped.columns.levels[0][:10]
Index(['CREDIT_ACTIVE_Active', 'CREDIT_ACTIVE_Bad debt', 'CREDIT_ACTIVE_Closed', 'CREDIT_ACTIVE_Sold', 'CREDIT_CURRENCY_currency 1', 'CREDIT_CURRENCY_currency 2', 'CREDIT_CURRENCY_currency 3', 'CREDIT_CURRENCY_currency 4', 'CREDIT_TYPE_Another type of loan', 'CREDIT_TYPE_Car loan'], dtype='object')
### Level 1: the computed statistics (sum, mean)
categorical_grouped.columns.levels[1]
Index(['sum', 'mean'], dtype='object')
group_var = 'SK_ID_CURR'

# List to store the new column headers
columns = []

# Iterate over the variables...
for var in categorical_grouped.columns.levels[0]:
    # Skip the client id column (the grouping variable)
    if var != group_var:
        # Loop over the kinds of statistics
        for stat in ['count', 'count_norm']:
            # Define the new column header
            columns.append('%s_%s' % (var, stat))

### Rename the variables
categorical_grouped.columns = columns
categorical_grouped.head()
CREDIT_ACTIVE_Active_count | CREDIT_ACTIVE_Active_count_norm | CREDIT_ACTIVE_Bad debt_count | CREDIT_ACTIVE_Bad debt_count_norm | CREDIT_ACTIVE_Closed_count | CREDIT_ACTIVE_Closed_count_norm | CREDIT_ACTIVE_Sold_count | CREDIT_ACTIVE_Sold_count_norm | CREDIT_CURRENCY_currency 1_count | CREDIT_CURRENCY_currency 1_count_norm | ... | CREDIT_TYPE_Microloan_count | CREDIT_TYPE_Microloan_count_norm | CREDIT_TYPE_Mobile operator loan_count | CREDIT_TYPE_Mobile operator loan_count_norm | CREDIT_TYPE_Mortgage_count | CREDIT_TYPE_Mortgage_count_norm | CREDIT_TYPE_Real estate loan_count | CREDIT_TYPE_Real estate loan_count_norm | CREDIT_TYPE_Unknown type of loan_count | CREDIT_TYPE_Unknown type of loan_count_norm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | |||||||||||||||||||||
100001 | 3 | 0.428571 | 0 | 0.0 | 4 | 0.571429 | 0 | 0.0 | 7 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100002 | 2 | 0.250000 | 0 | 0.0 | 6 | 0.750000 | 0 | 0.0 | 8 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100003 | 1 | 0.250000 | 0 | 0.0 | 3 | 0.750000 | 0 | 0.0 | 4 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100004 | 0 | 0.000000 | 0 | 0.0 | 2 | 1.000000 | 0 | 0.0 | 2 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100005 | 2 | 0.666667 | 0 | 0.0 | 1 | 0.333333 | 0 | 0.0 | 3 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
5 rows ร 46 columns
- `count`: the total number of loans
- `count_norm`: the count, normalized
### Merge with the training data
train = train.merge(categorical_grouped, left_on = 'SK_ID_CURR',
right_index = True, how = 'left')
train.head()
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | CREDIT_TYPE_Microloan_count | CREDIT_TYPE_Microloan_count_norm | CREDIT_TYPE_Mobile operator loan_count | CREDIT_TYPE_Mobile operator loan_count_norm | CREDIT_TYPE_Mortgage_count | CREDIT_TYPE_Mortgage_count_norm | CREDIT_TYPE_Real estate loan_count | CREDIT_TYPE_Real estate loan_count_norm | CREDIT_TYPE_Unknown type of loan_count | CREDIT_TYPE_Unknown type of loan_count_norm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows ร 229 columns
train.shape
(307511, 229)
train.iloc[:10, 123:]
bureau_DAYS_CREDIT_count | bureau_DAYS_CREDIT_mean | bureau_DAYS_CREDIT_max | bureau_DAYS_CREDIT_min | bureau_DAYS_CREDIT_sum | bureau_CREDIT_DAY_OVERDUE_count | bureau_CREDIT_DAY_OVERDUE_mean | bureau_CREDIT_DAY_OVERDUE_max | bureau_CREDIT_DAY_OVERDUE_min | bureau_CREDIT_DAY_OVERDUE_sum | ... | CREDIT_TYPE_Microloan_count | CREDIT_TYPE_Microloan_count_norm | CREDIT_TYPE_Mobile operator loan_count | CREDIT_TYPE_Mobile operator loan_count_norm | CREDIT_TYPE_Mortgage_count | CREDIT_TYPE_Mortgage_count_norm | CREDIT_TYPE_Real estate loan_count | CREDIT_TYPE_Real estate loan_count_norm | CREDIT_TYPE_Unknown type of loan_count | CREDIT_TYPE_Unknown type of loan_count_norm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8.0 | -874.000000 | -103.0 | -1437.0 | -6992.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 4.0 | -1400.750000 | -606.0 | -2586.0 | -5603.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 2.0 | -867.000000 | -408.0 | -1326.0 | -1734.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1.0 | -1149.000000 | -1149.0 | -1149.0 | -1149.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 3.0 | -757.333333 | -78.0 | -1097.0 | -2272.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
6 | 18.0 | -1271.500000 | -239.0 | -2882.0 | -22887.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7 | 2.0 | -1939.500000 | -1138.0 | -2741.0 | -3879.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
8 | 4.0 | -1773.000000 | -1309.0 | -2508.0 | -7092.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 rows × 106 columns
b) A function for handling categorical variables

- Groups the dataframe by a column
- Then computes the `counts` and `normalized_counts` of each category
- Parameters:
  - `df (dataframe)`: the dataframe to operate on
  - `group_var (string)`: the column to group by
  - `df_name (string)`: the variable used to rename the columns
- Returns:
  - `categorical`: a dataframe containing the counts and normalized counts of each category in `df`, grouped by `group_var`
### Function for handling categorical variables
def count_categorical(df, group_var, df_name):
    # Select the categorical (object) columns and one-hot encode them
    categorical = pd.get_dummies(df.select_dtypes('object'))

    # Make sure the id column is in the dataframe
    categorical[group_var] = df[group_var]

    # Group by group_var and compute the sum and mean
    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])

    column_names = []

    # Iterate over the columns in level 0
    for var in categorical.columns.levels[0]:
        # Iterate over the statistics in level 1
        for stat in ['count', 'count_norm']:
            # Rename the column
            column_names.append('%s_%s_%s' % (df_name, var, stat))

    categorical.columns = column_names

    return categorical
bureau_counts = count_categorical(bureau,
group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_counts.head()
bureau_CREDIT_ACTIVE_Active_count | bureau_CREDIT_ACTIVE_Active_count_norm | bureau_CREDIT_ACTIVE_Bad debt_count | bureau_CREDIT_ACTIVE_Bad debt_count_norm | bureau_CREDIT_ACTIVE_Closed_count | bureau_CREDIT_ACTIVE_Closed_count_norm | bureau_CREDIT_ACTIVE_Sold_count | bureau_CREDIT_ACTIVE_Sold_count_norm | bureau_CREDIT_CURRENCY_currency 1_count | bureau_CREDIT_CURRENCY_currency 1_count_norm | ... | bureau_CREDIT_TYPE_Microloan_count | bureau_CREDIT_TYPE_Microloan_count_norm | bureau_CREDIT_TYPE_Mobile operator loan_count | bureau_CREDIT_TYPE_Mobile operator loan_count_norm | bureau_CREDIT_TYPE_Mortgage_count | bureau_CREDIT_TYPE_Mortgage_count_norm | bureau_CREDIT_TYPE_Real estate loan_count | bureau_CREDIT_TYPE_Real estate loan_count_norm | bureau_CREDIT_TYPE_Unknown type of loan_count | bureau_CREDIT_TYPE_Unknown type of loan_count_norm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | |||||||||||||||||||||
100001 | 3 | 0.428571 | 0 | 0.0 | 4 | 0.571429 | 0 | 0.0 | 7 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100002 | 2 | 0.250000 | 0 | 0.0 | 6 | 0.750000 | 0 | 0.0 | 8 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100003 | 1 | 0.250000 | 0 | 0.0 | 3 | 0.750000 | 0 | 0.0 | 4 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100004 | 0 | 0.000000 | 0 | 0.0 | 2 | 1.000000 | 0 | 0.0 | 2 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100005 | 2 | 0.666667 | 0 | 0.0 | 1 | 0.333333 | 0 | 0.0 | 3 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
5 rows × 46 columns
c) Applying the operations to another dataframe

- Use the `bureau_balance` dataframe
  - This dataframe contains monthly data about each client's previous loans with other financial institutions
- Rather than grouping by the client id, `SK_ID_CURR`, first group by `SK_ID_BUREAU`, the id of each previous loan
  - The grouped dataframe will contain one row of information for each loan
- Then group by `SK_ID_CURR` and compute the aggregations of the loans for each client
- The final result has one row for each client, with statistics computed over all of their loans
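In plain pandas, the two-level aggregation described above can be sketched as follows (a minimal illustration on toy data; the column names and chosen statistics are assumptions, not the kernel's `agg_numeric`):

```python
import pandas as pd

# Toy monthly records: client 1 has two loans, client 2 has one
df = pd.DataFrame({
    'SK_ID_CURR':     [1, 1, 1, 1, 2, 2],
    'SK_ID_BUREAU':   [10, 10, 11, 11, 12, 12],
    'MONTHS_BALANCE': [-1, 0, -3, -2, -5, -4],
})

# Step 1: aggregate by loan (SK_ID_BUREAU)
by_loan = (df.groupby(['SK_ID_CURR', 'SK_ID_BUREAU'])['MONTHS_BALANCE']
             .agg(['count', 'mean'])
             .reset_index())

# Step 2: aggregate the per-loan statistics by client (SK_ID_CURR)
by_client = by_loan.groupby('SK_ID_CURR')[['count', 'mean']].agg(['mean', 'max'])
print(by_client)
```

Each client row now holds statistics of statistics, which is exactly how nested column names such as `..._count_mean` arise.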
### Load the data
bureau_balance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48๊ธฐ ๋ฐ๊ณผB/3์ฃผ์ฐจ/data/bureau_balance.csv')

bureau_balance.head()
SK_ID_BUREAU | MONTHS_BALANCE | STATUS | |
---|---|---|---|
0 | 5715448 | 0 | C |
1 | 5715448 | -1 | C |
2 | 5715448 | -2 | C |
3 | 5715448 | -3 | C |
4 | 5715448 | -4 | C |
### 1. Count the value of each status for each previous loan
bureau_balance_counts = count_categorical(bureau_balance, group_var = 'SK_ID_BUREAU',
df_name = 'bureau_balance')
bureau_balance_counts.head()
bureau_balance_STATUS_0_count | bureau_balance_STATUS_0_count_norm | bureau_balance_STATUS_1_count | bureau_balance_STATUS_1_count_norm | bureau_balance_STATUS_2_count | bureau_balance_STATUS_2_count_norm | bureau_balance_STATUS_3_count | bureau_balance_STATUS_3_count_norm | bureau_balance_STATUS_4_count | bureau_balance_STATUS_4_count_norm | bureau_balance_STATUS_5_count | bureau_balance_STATUS_5_count_norm | bureau_balance_STATUS_C_count | bureau_balance_STATUS_C_count_norm | bureau_balance_STATUS_X_count | bureau_balance_STATUS_X_count_norm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_BUREAU | ||||||||||||||||
5001709 | 0 | 0.000000 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 86 | 0.886598 | 11 | 0.113402 |
5001710 | 5 | 0.060241 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 48 | 0.578313 | 30 | 0.361446 |
5001711 | 3 | 0.750000 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.000000 | 1 | 0.250000 |
5001712 | 10 | 0.526316 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 9 | 0.473684 | 0 | 0.000000 |
5001713 | 0 | 0.000000 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.000000 | 22 | 1.000000 |
### 2. Handling numeric variables
# MONTHS_BALANCE: number of months relative to the application date

## 2-1. Compute aggregations for each 'SK_ID_BUREAU'
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU',
df_name = 'bureau_balance')
bureau_balance_agg.head()
SK_ID_BUREAU | bureau_balance_MONTHS_BALANCE_count | bureau_balance_MONTHS_BALANCE_mean | bureau_balance_MONTHS_BALANCE_max | bureau_balance_MONTHS_BALANCE_min | bureau_balance_MONTHS_BALANCE_sum | |
---|---|---|---|---|---|---|
0 | 5001709 | 97 | -48.0 | 0 | -96 | -4656 |
1 | 5001710 | 83 | -41.0 | 0 | -82 | -3403 |
2 | 5001711 | 4 | -1.5 | 0 | -3 | -6 |
3 | 5001712 | 19 | -9.0 | 0 | -18 | -171 |
4 | 5001713 | 22 | -10.5 | 0 | -21 | -231 |
## 2-2. Computations for each client

# Merge the per-loan dataframes
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True,
left_on = 'SK_ID_BUREAU', how = 'outer')
# Merge to include SK_ID_CURR
bureau_by_loan = bureau_by_loan.merge(bureau[['SK_ID_BUREAU', 'SK_ID_CURR']],
on = 'SK_ID_BUREAU', how = 'left')
bureau_by_loan.head()
SK_ID_BUREAU | bureau_balance_MONTHS_BALANCE_count | bureau_balance_MONTHS_BALANCE_mean | bureau_balance_MONTHS_BALANCE_max | bureau_balance_MONTHS_BALANCE_min | bureau_balance_MONTHS_BALANCE_sum | bureau_balance_STATUS_0_count | bureau_balance_STATUS_0_count_norm | bureau_balance_STATUS_1_count | bureau_balance_STATUS_1_count_norm | ... | bureau_balance_STATUS_3_count_norm | bureau_balance_STATUS_4_count | bureau_balance_STATUS_4_count_norm | bureau_balance_STATUS_5_count | bureau_balance_STATUS_5_count_norm | bureau_balance_STATUS_C_count | bureau_balance_STATUS_C_count_norm | bureau_balance_STATUS_X_count | bureau_balance_STATUS_X_count_norm | SK_ID_CURR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5001709 | 97 | -48.0 | 0 | -96 | -4656 | 0 | 0.000000 | 0 | 0.0 | ... | 0.0 | 0 | 0.0 | 0 | 0.0 | 86 | 0.886598 | 11 | 0.113402 | NaN |
1 | 5001710 | 83 | -41.0 | 0 | -82 | -3403 | 5 | 0.060241 | 0 | 0.0 | ... | 0.0 | 0 | 0.0 | 0 | 0.0 | 48 | 0.578313 | 30 | 0.361446 | 162368.0 |
2 | 5001711 | 4 | -1.5 | 0 | -3 | -6 | 3 | 0.750000 | 0 | 0.0 | ... | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.000000 | 1 | 0.250000 | 162368.0 |
3 | 5001712 | 19 | -9.0 | 0 | -18 | -171 | 10 | 0.526316 | 0 | 0.0 | ... | 0.0 | 0 | 0.0 | 0 | 0.0 | 9 | 0.473684 | 0 | 0.000000 | 162368.0 |
4 | 5001713 | 22 | -10.5 | 0 | -21 | -231 | 0 | 0.000000 | 0 | 0.0 | ... | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.000000 | 22 | 1.000000 | 150635.0 |
5 rows × 23 columns
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']),
group_var = 'SK_ID_CURR', df_name = 'client')
bureau_balance_by_client.head()
SK_ID_CURR | client_bureau_balance_MONTHS_BALANCE_count_count | client_bureau_balance_MONTHS_BALANCE_count_mean | client_bureau_balance_MONTHS_BALANCE_count_max | client_bureau_balance_MONTHS_BALANCE_count_min | client_bureau_balance_MONTHS_BALANCE_count_sum | client_bureau_balance_MONTHS_BALANCE_mean_count | client_bureau_balance_MONTHS_BALANCE_mean_mean | client_bureau_balance_MONTHS_BALANCE_mean_max | client_bureau_balance_MONTHS_BALANCE_mean_min | ... | client_bureau_balance_STATUS_X_count_count | client_bureau_balance_STATUS_X_count_mean | client_bureau_balance_STATUS_X_count_max | client_bureau_balance_STATUS_X_count_min | client_bureau_balance_STATUS_X_count_sum | client_bureau_balance_STATUS_X_count_norm_count | client_bureau_balance_STATUS_X_count_norm_mean | client_bureau_balance_STATUS_X_count_norm_max | client_bureau_balance_STATUS_X_count_norm_min | client_bureau_balance_STATUS_X_count_norm_sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100001.0 | 7 | 24.571429 | 52 | 2 | 172 | 7 | -11.785714 | -0.5 | -25.5 | ... | 7 | 4.285714 | 9 | 0 | 30.0 | 7 | 0.214590 | 0.500000 | 0.0 | 1.502129 |
1 | 100002.0 | 8 | 13.750000 | 22 | 4 | 110 | 8 | -21.875000 | -1.5 | -39.5 | ... | 8 | 1.875000 | 3 | 0 | 15.0 | 8 | 0.161932 | 0.500000 | 0.0 | 1.295455 |
2 | 100005.0 | 3 | 7.000000 | 13 | 3 | 21 | 3 | -3.000000 | -1.0 | -6.0 | ... | 3 | 0.666667 | 1 | 0 | 2.0 | 3 | 0.136752 | 0.333333 | 0.0 | 0.410256 |
3 | 100010.0 | 2 | 36.000000 | 36 | 36 | 72 | 2 | -46.000000 | -19.5 | -72.5 | ... | 2 | 0.000000 | 0 | 0 | 0.0 | 2 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
4 | 100013.0 | 4 | 57.500000 | 69 | 40 | 230 | 4 | -28.250000 | -19.5 | -34.0 | ... | 4 | 10.250000 | 40 | 0 | 41.0 | 4 | 0.254545 | 1.000000 | 0.0 | 1.018182 |
5 rows × 106 columns
๐ Summary

- The bureau_balance dataframe was processed as follows:
  1. Compute numeric aggregations for each loan
  2. Count the occurrences of each categorical value for each loan
  3. Merge the aggregations and counts for each loan
  4. Compute numeric aggregations of the result of step 3 for each client
- The final dataframe has one row per client, and each row consists of statistics computed over the monthly records of all of that client's previous loans
- `client_bureau_balance_MONTHS_BALANCE_mean_mean`: the mean of `MONTHS_BALANCE` is computed for each loan, and those means are then averaged over each client's loans
- `client_bureau_balance_STATUS_X_count_norm_sum`: for each loan, the number of occurrences of `STATUS` == X is divided by the total number of `STATUS` values for that loan; those fractions are then summed over each client's loans
We will hold off on calculating the correlations until we have all the variables together in one dataframe.
2. Putting the Functions to Use

- Reset all the variables, then build everything from scratch using the functions written so far.
# Free up memory by deleting old objects
import gc
gc.enable()
del train, bureau, bureau_balance, bureau_agg, bureau_agg_new, bureau_balance_agg, bureau_balance_counts, bureau_by_loan, bureau_balance_by_client, bureau_counts
gc.collect()
0
# Read in the original data again (reset)
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48๊ธฐ ๋ฐ๊ณผB/3์ฃผ์ฐจ/data/application_train.csv')

bureau = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48๊ธฐ ๋ฐ๊ณผB/3์ฃผ์ฐจ/data/bureau.csv')

bureau_balance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48๊ธฐ ๋ฐ๊ณผB/3์ฃผ์ฐจ/data/bureau_balance.csv')
๐ Counting the categorical variables in the bureau dataframe
bureau_counts = count_categorical(bureau, group_var = 'SK_ID_CURR',
df_name = 'bureau')
bureau_counts.head()
bureau_CREDIT_ACTIVE_Active_count | bureau_CREDIT_ACTIVE_Active_count_norm | bureau_CREDIT_ACTIVE_Bad debt_count | bureau_CREDIT_ACTIVE_Bad debt_count_norm | bureau_CREDIT_ACTIVE_Closed_count | bureau_CREDIT_ACTIVE_Closed_count_norm | bureau_CREDIT_ACTIVE_Sold_count | bureau_CREDIT_ACTIVE_Sold_count_norm | bureau_CREDIT_CURRENCY_currency 1_count | bureau_CREDIT_CURRENCY_currency 1_count_norm | ... | bureau_CREDIT_TYPE_Microloan_count | bureau_CREDIT_TYPE_Microloan_count_norm | bureau_CREDIT_TYPE_Mobile operator loan_count | bureau_CREDIT_TYPE_Mobile operator loan_count_norm | bureau_CREDIT_TYPE_Mortgage_count | bureau_CREDIT_TYPE_Mortgage_count_norm | bureau_CREDIT_TYPE_Real estate loan_count | bureau_CREDIT_TYPE_Real estate loan_count_norm | bureau_CREDIT_TYPE_Unknown type of loan_count | bureau_CREDIT_TYPE_Unknown type of loan_count_norm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | |||||||||||||||||||||
100001 | 3 | 0.428571 | 0 | 0.0 | 4 | 0.571429 | 0 | 0.0 | 7 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100002 | 2 | 0.250000 | 0 | 0.0 | 6 | 0.750000 | 0 | 0.0 | 8 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100003 | 1 | 0.250000 | 0 | 0.0 | 3 | 0.750000 | 0 | 0.0 | 4 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100004 | 0 | 0.000000 | 0 | 0.0 | 2 | 1.000000 | 0 | 0.0 | 2 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
100005 | 2 | 0.666667 | 0 | 0.0 | 1 | 0.333333 | 0 | 0.0 | 3 | 1.0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
5 rows × 46 columns
๐ Computing aggregations of the bureau dataframe
bureau_agg = agg_numeric(bureau.drop(columns = ['SK_ID_BUREAU']),
group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_agg.head()
SK_ID_CURR | bureau_DAYS_CREDIT_count | bureau_DAYS_CREDIT_mean | bureau_DAYS_CREDIT_max | bureau_DAYS_CREDIT_min | bureau_DAYS_CREDIT_sum | bureau_CREDIT_DAY_OVERDUE_count | bureau_CREDIT_DAY_OVERDUE_mean | bureau_CREDIT_DAY_OVERDUE_max | bureau_CREDIT_DAY_OVERDUE_min | ... | bureau_DAYS_CREDIT_UPDATE_count | bureau_DAYS_CREDIT_UPDATE_mean | bureau_DAYS_CREDIT_UPDATE_max | bureau_DAYS_CREDIT_UPDATE_min | bureau_DAYS_CREDIT_UPDATE_sum | bureau_AMT_ANNUITY_count | bureau_AMT_ANNUITY_mean | bureau_AMT_ANNUITY_max | bureau_AMT_ANNUITY_min | bureau_AMT_ANNUITY_sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100001 | 7 | -735.000000 | -49 | -1572 | -5145 | 7 | 0.0 | 0 | 0 | ... | 7 | -93.142857 | -6 | -155 | -652 | 7 | 3545.357143 | 10822.5 | 0.0 | 24817.5 |
1 | 100002 | 8 | -874.000000 | -103 | -1437 | -6992 | 8 | 0.0 | 0 | 0 | ... | 8 | -499.875000 | -7 | -1185 | -3999 | 7 | 0.000000 | 0.0 | 0.0 | 0.0 |
2 | 100003 | 4 | -1400.750000 | -606 | -2586 | -5603 | 4 | 0.0 | 0 | 0 | ... | 4 | -816.000000 | -43 | -2131 | -3264 | 0 | NaN | NaN | NaN | 0.0 |
3 | 100004 | 2 | -867.000000 | -408 | -1326 | -1734 | 2 | 0.0 | 0 | 0 | ... | 2 | -532.000000 | -382 | -682 | -1064 | 0 | NaN | NaN | NaN | 0.0 |
4 | 100005 | 3 | -190.666667 | -62 | -373 | -572 | 3 | 0.0 | 0 | 0 | ... | 3 | -54.333333 | -11 | -121 | -163 | 3 | 1420.500000 | 4261.5 | 0.0 | 4261.5 |
5 rows × 61 columns
๐ Counts of the categorical variables in bureau balance, for each loan
bureau_balance_counts = count_categorical(bureau_balance, group_var = 'SK_ID_BUREAU',
df_name = 'bureau_balance')
bureau_balance_counts.head()
bureau_balance_STATUS_0_count | bureau_balance_STATUS_0_count_norm | bureau_balance_STATUS_1_count | bureau_balance_STATUS_1_count_norm | bureau_balance_STATUS_2_count | bureau_balance_STATUS_2_count_norm | bureau_balance_STATUS_3_count | bureau_balance_STATUS_3_count_norm | bureau_balance_STATUS_4_count | bureau_balance_STATUS_4_count_norm | bureau_balance_STATUS_5_count | bureau_balance_STATUS_5_count_norm | bureau_balance_STATUS_C_count | bureau_balance_STATUS_C_count_norm | bureau_balance_STATUS_X_count | bureau_balance_STATUS_X_count_norm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_BUREAU | ||||||||||||||||
5001709 | 0 | 0.000000 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 86 | 0.886598 | 11 | 0.113402 |
5001710 | 5 | 0.060241 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 48 | 0.578313 | 30 | 0.361446 |
5001711 | 3 | 0.750000 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.000000 | 1 | 0.250000 |
5001712 | 10 | 0.526316 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 9 | 0.473684 | 0 | 0.000000 |
5001713 | 0 | 0.000000 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.000000 | 22 | 1.000000 |
๐ Aggregations of bureau balance for each loan
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU',
df_name = 'bureau_balance')
bureau_balance_agg.head()
SK_ID_BUREAU | bureau_balance_MONTHS_BALANCE_count | bureau_balance_MONTHS_BALANCE_mean | bureau_balance_MONTHS_BALANCE_max | bureau_balance_MONTHS_BALANCE_min | bureau_balance_MONTHS_BALANCE_sum | |
---|---|---|---|---|---|---|
0 | 5001709 | 97 | -48.0 | 0 | -96 | -4656 |
1 | 5001710 | 83 | -41.0 | 0 | -82 | -3403 |
2 | 5001711 | 4 | -1.5 | 0 | -3 | -6 |
3 | 5001712 | 19 | -9.0 | 0 | -18 | -171 |
4 | 5001713 | 22 | -10.5 | 0 | -21 | -231 |
๐ Aggregations of bureau balance for each client

# Merge the per-loan dataframes
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True,
left_on = 'SK_ID_BUREAU', how = 'outer')
# Include SK_ID_CURR in the dataframe
bureau_by_loan = bureau[['SK_ID_BUREAU', 'SK_ID_CURR']].merge(bureau_by_loan,
on = 'SK_ID_BUREAU',
how = 'left')
# Compute aggregations for each client
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']),
group_var = 'SK_ID_CURR', df_name = 'client')
๐ Merging the computed features into the training data

original_features = list(train.columns) # original variables
print('Original Number of Features: ', len(original_features))
Original Number of Features: 122
# Merge with bureau_counts
train = train.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# Merge with the bureau aggregations
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# Merge with the monthly, per-client information
train = train.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')

new_features = list(train.columns) # list of variables after the additions
print('Number of features using previous loans from other institutions data: ', len(new_features))
Number of features using previous loans from other institutions data: 333
- Many new variables have been created.
3. Results of the Feature Engineering

- Examine the percentage of missing values, the correlations with the target, and the correlations between variables
  - A high correlation between two variables can indicate that they are collinear, meaning the variables are strongly related to each other
  - Keeping both variables of a collinear pair is often redundant, so it may be necessary to remove one of them (e.g. because of multicollinearity)
- Feature selection is the process of removing variables to help the model learn and generalize
  - The goal is to remove useless or redundant variables while preserving the important ones
  - Curse of dimensionality: the problems that arise when there are too many features (too many features -> excessively high dimensionality)
    - As the number of variables increases, the number of data points needed to learn the relationship between the variables and the target grows exponentially
  - Reducing the number of features can improve generalization as well as model training
  - The percentage of missing values can be used to remove features that are mostly empty
  - The feature importances returned by a Gradient Boosting Machine or a Random Forest model can also be used
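As a sketch of the importance-based approach mentioned above (a minimal illustration on synthetic data; the feature names, threshold, and model settings are assumptions, not this kernel's code):

```python
# Minimal sketch of importance-based feature selection on synthetic data.
# The names, threshold, and model settings here are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.randn(500, 6), columns=[f'f{i}' for i in range(6)])
y = (X['f0'] + 0.5 * X['f1'] > 0).astype(int)  # only f0 and f1 carry signal

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)

# Keep only the features above an (illustrative) importance threshold
keep = importances[importances > 0.05].index.tolist()
X_reduced = X[keep]
print(sorted(keep))
```

In practice the importances would come from a model trained on the real data, and the threshold (or a "keep top k" rule) would be tuned with cross validation.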
3-1. Handling Missing Values
### Function to compute the number of missing values in each column
def missing_values_table(df):
    # Total number of missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # Table of results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis = 1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})

    # Sort the results in descending order
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending = False).round(1)

    # Print summary information
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
           "There are " + str(mis_val_table_ren_columns.shape[0]) +
           " columns that have missing values.")

    return mis_val_table_ren_columns
missing_train = missing_values_table(train)
missing_train.head(10)
Your selected dataframe has 333 columns. There are 278 columns that have missing values.
Missing Values | % of Total Values | |
---|---|---|
bureau_AMT_ANNUITY_min | 227502 | 74.0 |
bureau_AMT_ANNUITY_max | 227502 | 74.0 |
bureau_AMT_ANNUITY_mean | 227502 | 74.0 |
client_bureau_balance_STATUS_4_count_min | 215280 | 70.0 |
client_bureau_balance_STATUS_3_count_norm_mean | 215280 | 70.0 |
client_bureau_balance_MONTHS_BALANCE_count_min | 215280 | 70.0 |
client_bureau_balance_STATUS_4_count_max | 215280 | 70.0 |
client_bureau_balance_STATUS_4_count_mean | 215280 | 70.0 |
client_bureau_balance_STATUS_3_count_norm_min | 215280 | 70.0 |
client_bureau_balance_STATUS_3_count_norm_max | 215280 | 70.0 |
- There are several columns with a high percentage of missing values
  - Remove columns that have more than 90% missing values in either the training or the testing data
missing_train_vars = list(missing_train.index[missing_train['% of Total Values'] > 90])
len(missing_train_vars)
0
- Perform the same operation on the test data
3-2. Processing the Test Data

# Load the test data
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48๊ธฐ ๋ฐ๊ณผB/3์ฃผ์ฐจ/data/application_test.csv')
# Merge with the dataframe of counts computed from the bureau data
test = test.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# Merge with the dataframe of aggregations computed from the bureau data
test = test.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# Merge with the per-client statistics computed from the bureau balance data
test = test.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
print('Shape of Testing Data: ', test.shape)
Shape of Testing Data: (48744, 332)
3-3. Aligning the Dataframes

- Make the training and testing dataframes have the same columns.
  - It causes no problem here, but after one-hot encoding variables, the dataframes must be aligned to have the same columns

train_labels = train['TARGET']

# Align the dataframes
# Remove the 'TARGET' column first, then align
train, test = train.align(test, join = 'inner', axis = 1)

train['TARGET'] = train_labels # reattach the target to the training data
print('Training Data Shape: ', train.shape)
print('Testing Data Shape: ', test.shape)
Training Data Shape: (307511, 333) Testing Data Shape: (48744, 332)
- We can confirm that the dataframes now share the same columns (the training data keeps the extra TARGET column).
3-4. Handling Missing Values in the Test Data

### Check the missing values
missing_test = missing_values_table(test)
missing_test.head(10)
Your selected dataframe has 332 columns. There are 275 columns that have missing values.
Missing Values | % of Total Values | |
---|---|---|
COMMONAREA_MEDI | 33495 | 68.7 |
COMMONAREA_MODE | 33495 | 68.7 |
COMMONAREA_AVG | 33495 | 68.7 |
NONLIVINGAPARTMENTS_MEDI | 33347 | 68.4 |
NONLIVINGAPARTMENTS_AVG | 33347 | 68.4 |
NONLIVINGAPARTMENTS_MODE | 33347 | 68.4 |
FONDKAPREMONT_MODE | 32797 | 67.3 |
LIVINGAPARTMENTS_MEDI | 32780 | 67.2 |
LIVINGAPARTMENTS_MODE | 32780 | 67.2 |
LIVINGAPARTMENTS_AVG | 32780 | 67.2 |
missing_test_vars = list(missing_test.index[missing_test['% of Total Values'] > 90])
len(missing_test_vars)
0
missing_columns = list(set(missing_test_vars + missing_train_vars))
print('There are %d columns with more than 90%% missing in either the training or testing data.' % len(missing_columns))
There are 0 columns with more than 90% missing in either the training or testing data.
# Drop the columns with excessive missing values
train = train.drop(columns = missing_columns)
test = test.drop(columns = missing_columns)
- Since there were no columns with more than 90% of their values missing, no columns were removed this time
  - A different method will probably be needed for feature selection
### Save the processed data
train.to_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48๊ธฐ ๋ฐ๊ณผB/3์ฃผ์ฐจ/data/train_bureau_raw.csv', index = False)
test.to_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48๊ธฐ ๋ฐ๊ณผB/3์ฃผ์ฐจ/data/test_bureau_raw.csv', index = False)
3-5. Correlations

- Correlations between the variables and the target
  - Among the newly created variables, we can look for any that are more strongly correlated with the target than the variables in the original training data (the `application` data)

# Compute the correlations between all variables in the dataframe
corrs = train.corr()

corrs = corrs.sort_values('TARGET', ascending = False) # sort in descending order
pd.DataFrame(corrs['TARGET'].head(10))
TARGET | |
---|---|
TARGET | 1.000000 |
bureau_DAYS_CREDIT_mean | 0.089729 |
client_bureau_balance_MONTHS_BALANCE_min_mean | 0.089038 |
DAYS_BIRTH | 0.078239 |
bureau_CREDIT_ACTIVE_Active_count_norm | 0.077356 |
client_bureau_balance_MONTHS_BALANCE_mean_mean | 0.076424 |
bureau_DAYS_CREDIT_min | 0.075248 |
client_bureau_balance_MONTHS_BALANCE_min_min | 0.073225 |
client_bureau_balance_MONTHS_BALANCE_sum_mean | 0.072606 |
bureau_DAYS_CREDIT_UPDATE_mean | 0.068927 |
pd.DataFrame(corrs['TARGET'].dropna().tail(10))
TARGET | |
---|---|
client_bureau_balance_MONTHS_BALANCE_count_min | -0.048224 |
client_bureau_balance_STATUS_C_count_norm_mean | -0.055936 |
client_bureau_balance_STATUS_C_count_max | -0.061083 |
client_bureau_balance_STATUS_C_count_mean | -0.062954 |
client_bureau_balance_MONTHS_BALANCE_count_max | -0.068792 |
bureau_CREDIT_ACTIVE_Closed_count_norm | -0.079369 |
client_bureau_balance_MONTHS_BALANCE_count_mean | -0.080193 |
EXT_SOURCE_1 | -0.155317 |
EXT_SOURCE_2 | -0.160472 |
EXT_SOURCE_3 | -0.178919 |
- The variables most strongly correlated with the target are newly created ones
  - However, a high correlation does not mean the variable is useful; when hundreds of variables are created, some may appear correlated with the target simply because of random noise, so caution is needed
- Looking at the correlations critically, several of the newly created variables still seem useful
- To evaluate the usefulness of the variables, we will look at the feature importances of a trained model
- We can also draw kde plots of the newly created variables
# The original kernel uses client_bureau_balance_counts_mean, but that variable does not exist here,
# so 'client_bureau_balance_MONTHS_BALANCE_count_mean' is used instead
kde_target(var_name = 'client_bureau_balance_MONTHS_BALANCE_count_mean', df = train)
The correlation between client_bureau_balance_MONTHS_BALANCE_count_mean and the TARGET is -0.0802 Median value for loan that was not repaid = 19.3333 Median value for loan that was repaid = 25.1429
- This variable is the average number of monthly records per loan for each client.
  - For example, if a client had three previous loans with 3, 4, and 5 months of records, the value of this variable would be 4
- Based on the distribution plot, clients with a higher average number of monthly records per loan appear more likely to repay their Home Credit loans
  - This may indicate that clients with a longer credit history are generally more likely to repay their loans
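The worked example above (records of 3, 4, and 5 months averaging to 4) can be reproduced with a small toy dataframe (illustrative data; `n_months` is a hypothetical helper name):

```python
import pandas as pd

# Toy data: one client (id 1) with three previous loans of 3, 4, and 5 monthly records
toy = pd.DataFrame({
    'SK_ID_BUREAU':   [10] * 3 + [11] * 4 + [12] * 5,
    'SK_ID_CURR':     [1] * 12,
    'MONTHS_BALANCE': list(range(-2, 1)) + list(range(-3, 1)) + list(range(-4, 1)),
})

# Step 1: number of monthly records for each loan
per_loan = toy.groupby('SK_ID_BUREAU')['MONTHS_BALANCE'].count().rename('n_months')

# Step 2: mean of those per-loan counts for each client
per_client = (toy[['SK_ID_BUREAU', 'SK_ID_CURR']].drop_duplicates()
              .merge(per_loan.reset_index(), on='SK_ID_BUREAU')
              .groupby('SK_ID_CURR')['n_months'].mean())

print(per_client.loc[1])  # 4.0
```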
kde_target(var_name='bureau_CREDIT_ACTIVE_Active_count_norm', df=train)
The correlation between bureau_CREDIT_ACTIVE_Active_count_norm and the TARGET is 0.0774 Median value for loan that was not repaid = 0.5000 Median value for loan that was repaid = 0.3636
- This variable is the count of loans whose `CREDIT_ACTIVE` value is 'Active', divided by the client's total number of loans
- The distribution of this variable is irregular throughout
- Its correlation with the target is also very weak
๐ Collinear Variables

- Besides the correlations between the variables and the target, we can compute the correlations of each variable with every other variable
  - This tells us whether there are collinear variables that should perhaps be removed
- Let's find the variable pairs with a correlation above 0.8.
# Set the threshold
threshold = 0.8

# Empty dictionary to hold the highly correlated variables
above_threshold_vars = {}

# For each column, record the variables above the threshold
for col in corrs:
    above_threshold_vars[col] = list(corrs.index[corrs[col] > threshold])
Remove only one variable from each pair of highly correlated variables
# Lists to track the columns to remove and the columns already examined
cols_to_remove = []
cols_seen = []
cols_to_remove_pair = []

for key, value in above_threshold_vars.items():
    # Record the columns already examined
    cols_seen.append(key)
    for x in value:
        # Skip the self-correlation
        if x == key:
            continue
        else:
            # Only remove one column from each pair
            if x not in cols_seen:
                cols_to_remove.append(x)
                cols_to_remove_pair.append(key)

cols_to_remove = list(set(cols_to_remove))
print('Number of columns to remove: ', len(cols_to_remove))
Number of columns to remove: 134
### Remove the highly correlated variables
train_corrs_removed = train.drop(columns = cols_to_remove)
test_corrs_removed = test.drop(columns = cols_to_remove)
print('Training Corrs Removed Shape: ', train_corrs_removed.shape)
print('Testing Corrs Removed Shape: ', test_corrs_removed.shape)
Training Corrs Removed Shape: (307511, 199) Testing Corrs Removed Shape: (48744, 198)
train_corrs_removed.to_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48๊ธฐ ๋ฐ๊ณผB/3์ฃผ์ฐจ/data/train_bureau_corrs_removed.csv', index = False)
test_corrs_removed.to_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48๊ธฐ ๋ฐ๊ณผB/3์ฃผ์ฐจ/data/test_bureau_corrs_removed.csv', index = False)
4. Modeling

- This section is kept as it appears in the original kernel.
- The model is LightGBM with cross validation
- Train, predict, and evaluate the model on each of the datasets processed above and compare performance
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
import gc
import matplotlib.pyplot as plt
def model(features, test_features, encoding = 'ohe', n_folds = 5):
"""Train and test a light gradient boosting model using
cross validation.
Parameters
--------
features (pd.DataFrame):
dataframe of training features to use
for training a model. Must include the TARGET column.
test_features (pd.DataFrame):
dataframe of testing features to use
for making predictions with the model.
encoding (str, default = 'ohe'):
method for encoding categorical variables. Either 'ohe' for one-hot encoding or 'le' for integer label encoding
n_folds (int, default = 5): number of folds to use for cross validation
Return
--------
submission (pd.DataFrame):
dataframe with `SK_ID_CURR` and `TARGET` probabilities
predicted by the model.
feature_importances (pd.DataFrame):
dataframe with the feature importances from the model.
valid_metrics (pd.DataFrame):
dataframe with training and validation metrics (ROC AUC) for each fold and overall.
"""
# Extract the ids
train_ids = features['SK_ID_CURR']
test_ids = test_features['SK_ID_CURR']
# Extract the labels for training
labels = features['TARGET']
# Remove the ids and target
features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
test_features = test_features.drop(columns = ['SK_ID_CURR'])
# One Hot Encoding
if encoding == 'ohe':
features = pd.get_dummies(features)
test_features = pd.get_dummies(test_features)
# Align the dataframes by the columns
features, test_features = features.align(test_features, join = 'inner', axis = 1)
# No categorical indices to record
cat_indices = 'auto'
# Integer label encoding
elif encoding == 'le':
# Create a label encoder
label_encoder = LabelEncoder()
# List for storing categorical indices
cat_indices = []
# Iterate through each column
for i, col in enumerate(features):
if features[col].dtype == 'object':
# Map the categorical features to integers
features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
test_features[col] = label_encoder.transform(np.array(test_features[col].astype(str)).reshape((-1,)))
# Record the categorical indices
cat_indices.append(i)
# Catch error if label encoding scheme is not valid
else:
raise ValueError("Encoding must be either 'ohe' or 'le'")
print('Training Data Shape: ', features.shape)
print('Testing Data Shape: ', test_features.shape)
# Extract feature names
feature_names = list(features.columns)
# Convert to np arrays
features = np.array(features)
test_features = np.array(test_features)
# Create the kfold object (random_state is dropped since it has no effect when shuffle = False)
k_fold = KFold(n_splits = n_folds, shuffle = False)
# Empty array for feature importances
feature_importance_values = np.zeros(len(feature_names))
# Empty array for test predictions
test_predictions = np.zeros(test_features.shape[0])
# Empty array for out of fold validation predictions
out_of_fold = np.zeros(features.shape[0])
# Lists for recording validation and training scores
valid_scores = []
train_scores = []
# Iterate through each fold
for train_indices, valid_indices in k_fold.split(features):
# Training data for the fold
train_features, train_labels = features[train_indices], labels[train_indices]
# Validation data for the fold
valid_features, valid_labels = features[valid_indices], labels[valid_indices]
# Create the model
model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary',
class_weight = 'balanced', learning_rate = 0.05,
reg_alpha = 0.1, reg_lambda = 0.1,
subsample = 0.8, n_jobs = -1, random_state = 50)
# Train the model
model.fit(train_features, train_labels, eval_metric = 'auc',
eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
eval_names = ['valid', 'train'], categorical_feature = cat_indices,
early_stopping_rounds = 100, verbose = 200)
# Record the best iteration
best_iteration = model.best_iteration_
# Record the feature importances
feature_importance_values += model.feature_importances_ / k_fold.n_splits
# Make predictions
test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
# Record the out of fold predictions
out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
# Record the best score
valid_score = model.best_score_['valid']['auc']
train_score = model.best_score_['train']['auc']
valid_scores.append(valid_score)
train_scores.append(train_score)
# Clean up memory
gc.enable()
del model, train_features, valid_features
gc.collect()
# Make the submission dataframe
submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
# Make the feature importance dataframe
feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
# Overall validation score
valid_auc = roc_auc_score(labels, out_of_fold)
# Add the overall scores to the metrics
valid_scores.append(valid_auc)
train_scores.append(np.mean(train_scores))
# Needed for creating dataframe of validation scores
fold_names = list(range(n_folds))
fold_names.append('overall')
# Dataframe of validation scores
metrics = pd.DataFrame({'fold': fold_names,
'train': train_scores,
'valid': valid_scores})
return submission, feature_importances, metrics
```python
def plot_feature_importances(df):
    """
    Plot importances returned by a model. This can work with any measure of
    feature importance provided that higher importance is better.

    Args:
        df (dataframe): feature importances. Must have the features in a column
        called `feature` and the importances in a column called `importance`

    Returns:
        shows a plot of the 15 most important features

        df (dataframe): feature importances sorted by importance (highest to lowest)
        with a column for normalized importance
    """

    # Sort features according to importance
    df = df.sort_values('importance', ascending = False).reset_index()

    # Normalize the feature importances to add up to one
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()

    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:15]))),
            df['importance_normalized'].head(15),
            align = 'center', edgecolor = 'k')

    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))

    # Plot labeling
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()

    return df
```
### Control

The first step in any experiment is establishing a control. For this we will use the function defined above (which implements a Gradient Boosting Machine model) and the single main data source (`application`).
```python
train_control = pd.read_csv('../input/application_train.csv')
test_control = pd.read_csv('../input/application_test.csv')
```
Fortunately, once we have taken the time to write a function, using it is simple (if there's a central theme in this notebook, it's *use functions to make things simpler and reproducible!*). The function above returns a `submission` dataframe we can upload to the competition, a `fi` dataframe of feature importances, and a `metrics` dataframe with validation and test performance.
```python
submission, fi, metrics = model(train_control, test_control)
metrics
```
The control slightly overfits, as the training score is higher than the validation score. We can address this in later notebooks when we look at regularization (we already perform some regularization in this model by using `reg_lambda` and `reg_alpha` as well as early stopping).
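To make that concrete, the train-validation gap can be computed directly from the `metrics` dataframe returned by `model`. A minimal sketch (the AUC values below are illustrative, not the actual run results):

```python
import pandas as pd

# Illustrative values only, in the shape of the metrics dataframe returned by model()
metrics = pd.DataFrame({'fold': [0, 1, 'overall'],
                        'train': [0.820, 0.810, 0.815],
                        'valid': [0.761, 0.759, 0.760]})

# A positive gap between training and validation AUC is the signature of overfitting
metrics['overfit_gap'] = metrics['train'] - metrics['valid']
print(metrics)
```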
We can visualize the feature importances with another function, `plot_feature_importances`. The feature importances may be useful when it's time for feature selection.
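When that time comes, one common recipe is to rank features by normalized importance and keep only those accounting for some cumulative share of the total. A minimal sketch (the feature names and the 95% threshold are assumptions for illustration, not values from this notebook):

```python
import pandas as pd

# Hypothetical importances, in the format of the fi dataframe returned by model()
fi = pd.DataFrame({'feature': ['EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'FLAG_DOCUMENT_3'],
                   'importance': [120.0, 100.0, 60.0, 0.0]})

# Sort and normalize, just as plot_feature_importances does
fi = fi.sort_values('importance', ascending = False).reset_index(drop = True)
fi['importance_normalized'] = fi['importance'] / fi['importance'].sum()
fi['cumulative_importance'] = fi['importance_normalized'].cumsum()

# Keep the smallest set of features covering 95% of the total importance
keep = fi.loc[fi['cumulative_importance'] <= 0.95, 'feature'].tolist()
print(keep)
```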
```python
fi_sorted = plot_feature_importances(fi)
submission.to_csv('control.csv', index = False)
```
The control scores 0.745 when submitted to the competition.
### Test One

Let's conduct the first test. We will just need to pass in the data to the function, which does most of the work for us.
```python
submission_raw, fi_raw, metrics_raw = model(train, test)
metrics_raw
```
Based on these numbers, the engineered features perform better than the control case. However, we will have to submit the predictions to the leaderboard before we can say if this better validation performance transfers to the testing data.
```python
fi_raw_sorted = plot_feature_importances(fi_raw)
```
Examining the feature importances, it looks as if a few of the features we constructed are among the most important. Let's find the percentage of the top 100 most important features that we made in this notebook. However, rather than just comparing to the original features, we need to compare to the one-hot encoded original features. These are already recorded for us in `fi` (from the original data).
```python
top_100 = list(fi_raw_sorted['feature'])[:100]
new_features = [x for x in top_100 if x not in list(fi['feature'])]

print('%% of Top 100 Features created from the bureau data = %d.00' % len(new_features))
```
Over half of the top 100 features were made by us! That should give us confidence that all the hard work we did was worthwhile.
```python
submission_raw.to_csv('test_one.csv', index = False)
```
Test one scores 0.759 when submitted to the competition.
### Test Two

That was easy, so let's do another run! Same as before, but with the highly collinear variables removed.
```python
submission_corrs, fi_corrs, metrics_corr = model(train_corrs_removed, test_corrs_removed)
metrics_corr
```
These results are better than the control, but slightly lower than the raw features.
```python
fi_corrs_sorted = plot_feature_importances(fi_corrs)
submission_corrs.to_csv('test_two.csv', index = False)
```
Test Two scores 0.753 when submitted to the competition.
### Results

After all that work, we can say that including the extra information did improve performance! The model is definitely not optimized to our data, but we still had a noticeable improvement over the original dataset when using the calculated features. Let's officially summarize the performances:
| Experiment | Train AUC | Validation AUC | Test AUC |
|------------|-----------|----------------|----------|
| Control    | 0.815     | 0.760          | 0.745    |
| Test One   | 0.837     | 0.767          | 0.759    |
| Test Two   | 0.826     | 0.765          | 0.753    |
(Note that these scores may change from run to run of the notebook. I have not observed that the general ordering changes however.)
All of our hard work translates to a small improvement of 0.014 ROC AUC over the original testing data. Removing the highly collinear variables slightly decreases performance so we will want to consider a different method for feature selection. Moreover, we can say that some of the features we built are among the most important as judged by the model.
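One simple alternative method (sketched below with hypothetical importances; in practice we would apply it to the `fi_raw` dataframe from Test One) is to drop only the features the model never split on, i.e. those with zero recorded importance:

```python
import pandas as pd

# Hypothetical importances, in the format of the fi dataframes used above
fi = pd.DataFrame({'feature': ['previous_loan_counts', 'bureau_DAYS_CREDIT_mean',
                               'bureau_CREDIT_ACTIVE_Closed_count', 'bureau_CREDIT_TYPE_Car_loan_count'],
                   'importance': [45.0, 30.0, 12.0, 0.0]})

# Features with zero importance were never used by the model and can be dropped safely
zero_features = fi.loc[fi['importance'] == 0.0, 'feature'].tolist()
print(zero_features)

# train = train.drop(columns = zero_features)  # then applied to the training dataframe
```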
In a competition such as this, even an improvement of this size is enough to move us up hundreds of spots on the leaderboard. By making numerous small improvements like those in this notebook, we can gradually achieve better and better performance. I encourage others to use the results here to make their own improvements, and I will continue to document the steps I take to help others.
### Next Steps

Going forward, we can now use the functions we developed in this notebook on the other datasets. There are still 4 other data files to use in our model! In the next notebook, we will incorporate the information from these other data files (which contain information on previous loans at Home Credit) into our training data. Then we can build the same model and run more experiments to determine the effect of our feature engineering. There is plenty more work to be done in this competition, and plenty more gains in performance to be had! I'll see you in the next notebook.