📚 Reference

1. Introduction: Manual Feature Engineering

  • This kernel covers creating features from the Home Credit Default Risk competition data

  • The early part of the kernel builds models using only the application data

    • The best-performing model built on that data alone scored about 0.74 on the leaderboard
  • To gather more information, we draw on the other dataframes

    • Here we use the bureau and bureau_balance data

    • bureau

      • A client's previous loans at other financial institutions, as reported to Home Credit

      • Each previous loan gets its own row

    • bureau_balance

      • Monthly data about the previous loans

      • Each month gets its own row

  • Manual feature engineering can be a tedious process, and it often requires domain expertise

    • Since our knowledge of what actually drives loans and defaults is limited, we instead focus on getting as much information as possible into the final training dataframe

    • In other words, this kernel takes the approach of letting the model, rather than a human, decide which features matter

    • We build as many features as we can, let the model use all of them, and later reduce them using the feature importances from the model or PCA

  • Each step of manual feature engineering involves a large amount of Pandas code and some patience, especially patience with data manipulation

    • Although automated feature engineering tools are starting to be used, for now feature engineering still requires this kind of hands-on preprocessing
# ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ์„ ์œ„ํ•œ Pandas ๋ฐ Numpy
import pandas as pd
import numpy as np

# ์‹œ๊ฐํ™”๋ฅผ ์œ„ํ•œ matplotlib ๋ฐ seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# pandas์—์„œ ๋‚˜์˜ค๋Š” ๊ฒฝ๊ณ ๋ฌธ ๋ฌด์‹œ
import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')

1-1. Example: Counts of a client's previous loans

  • To illustrate a typical piece of manual feature engineering, we first simply count each client's previous loans at other financial institutions

📌 Frequently used Pandas commands

  • groupby

    • Groups a dataframe by the values of a column

    • Here we group the dataframe by client, using the values of the SK_ID_CURR column

  • agg

    • Computes aggregates (such as the mean) of the grouped data

    • We can compute the mean directly with grouped_df.mean(), or pass agg a list to get the mean, max, min, sum, and so on (grouped_df.agg(['mean', 'max', 'min', 'sum']))

  • merge

    • Matches the aggregated values to the corresponding client

    • We merge the aggregated values into the original training data on the SK_ID_CURR column, filling in NaN where a client has no matching value

  • We also use the rename command with a dict to rename columns

    • This is useful for keeping track of the newly created variables
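As a toy illustration of the groupby, agg, merge, and rename pattern described above (the mini dataframes and new column names here are made up for the example):

```python
import pandas as pd

# Hypothetical mini versions of a loan table and a client table
loans = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                      'AMT_CREDIT': [100.0, 200.0, 50.0]})
clients = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})

# groupby + agg: one row per client with several statistics
stats = loans.groupby('SK_ID_CURR')['AMT_CREDIT'].agg(['mean', 'max', 'min', 'sum']).reset_index()

# rename: keep track of what the new columns mean
stats = stats.rename(columns = {'mean': 'credit_mean', 'max': 'credit_max',
                                'min': 'credit_min', 'sum': 'credit_sum'})

# merge: left join onto the client table; client 3 has no loans, so it gets NaN
clients = clients.merge(stats, on = 'SK_ID_CURR', how = 'left')
print(clients)
```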
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
### Load the bureau data

bureau = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/3주차/data/bureau.csv')
bureau.head()
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN
### groupby on the client id (SK_ID_CURR)
# Count the previous loans and rename the column

previous_loan_counts = bureau.groupby('SK_ID_CURR', as_index = False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'previous_loan_counts'})
previous_loan_counts.head()
SK_ID_CURR previous_loan_counts
0 100001 7
1 100002 8
2 100003 4
3 100004 2
4 100005 3
### Join to the training dataframe
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/3주차/data/application_train.csv')
train = train.merge(previous_loan_counts, on = 'SK_ID_CURR', how = 'left')

### Fill missing values with 0
train['previous_loan_counts'] = train['previous_loan_counts'].fillna(0)
train.head()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR previous_loan_counts
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 8.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 4.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 2.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 NaN NaN NaN NaN NaN NaN 0.0
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

5 rows × 123 columns

  • Scroll all the way to the right to see the newly created column (previous_loan_counts)!

1-2. Assessing variable usefulness with the r-value

  • To judge whether newly created variables are useful, we first compute the Pearson correlation coefficient (r-value) between each variable and the target

📌 Pearson correlation coefficient

  • Expresses the linear relationship between two variables as a value from -1 (a perfect negative linear relationship) to +1 (a perfect positive linear relationship)

  • The r-value is not the best way to assess the usefulness of a new variable, but it can give a rough idea of whether the variable will help a machine learning model

  • The larger a variable's r-value with respect to the target, the more likely the variable is to influence the target

  • We look for the variables whose r-value with the target has the largest absolute value
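As a minimal sketch of the r-value (with made-up numbers), pandas computes the Pearson correlation with Series.corr:

```python
import pandas as pd

# Hypothetical toy data: the feature decreases exactly as the target increases
df = pd.DataFrame({'TARGET':  [0, 0, 1, 1],
                   'feature': [5.0, 5.0, 2.0, 2.0]})

# Series.corr uses the Pearson correlation coefficient by default
r = df['TARGET'].corr(df['feature'])
print(r)  # -1.0: a perfect negative linear relationship
```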

📌 Kernel density estimate (KDE) plots

  • Visualize the relationship with the target

  • Show the distribution of a single variable (think of it as a smoothed histogram)

  • Can serve as an indicator of whether a variable is relevant to a machine learning model


  • To see how the distribution changes with the value of a categorical variable, we color the plot by category

    • For example, we can draw a kernel density estimate of previous_loan_counts colored by whether TARGET is 0 or 1
### Function to plot a kernel density estimate

def kde_target(var_name, df):
    '''
    input:
      var_name: str, column to plot
      df: DataFrame, dataframe containing the column

    return: None
    '''

    ### Compute statistics (correlation, medians)

    # correlation between the new variable and the target
    corr = df['TARGET'].corr(df[var_name])
    # median of each group
    avg_repaid = df.loc[df['TARGET'] == 0, var_name].median()     # loan repaid
    avg_not_repaid = df.loc[df['TARGET'] == 1, var_name].median() # loan not repaid


    ### Plot

    # set up the figure
    plt.figure(figsize = (12, 6))
    # plot the distribution, colored by the value of the target
    sns.kdeplot(df.loc[df['TARGET'] == 0, var_name], label = 'TARGET == 0')
    sns.kdeplot(df.loc[df['TARGET'] == 1, var_name], label = 'TARGET == 1')
    # labeling
    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)
    plt.legend()


    ### Print the results

    # correlation
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))
    # medians
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid =     %0.4f' % avg_repaid)
    print()

Plotting the EXT_SOURCE_3 variable

kde_target('EXT_SOURCE_3', train)
The correlation between EXT_SOURCE_3 and the TARGET is -0.1789
Median value for loan that was not repaid = 0.3791
Median value for loan that was repaid =     0.5460

Plotting previous_loan_counts

  • Plot the newly created variable
kde_target('previous_loan_counts', train)
The correlation between previous_loan_counts and the TARGET is -0.0100
Median value for loan that was not repaid = 3.0000
Median value for loan that was repaid =     4.0000

  • We can see that the newly created variable (previous_loan_counts) is not important

    • The correlation is too small

    • The distributions barely differ across the target values

1-3. Creating new variables

a) Aggregate statistics of the numeric variables

  • agg() can compute the mean, max, min, sum, and so on of a dataframe

    • We can also write a separate function and pass it to agg
  • To use the numeric variables in the bureau dataframe, we compute aggregate statistics for all of them

    • To do this, we groupby() the client ID (SK_ID_CURR), compute the aggregates of the grouped dataframe with agg(), and then merge() the result into the train dataset
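The note above that a separately written function can be passed to agg can be sketched like this (value_range and the toy data are hypothetical):

```python
import pandas as pd

# A user-defined aggregate: the spread between a group's max and min
def value_range(x):
    return x.max() - x.min()

df = pd.DataFrame({'SK_ID_CURR': [1, 1, 2, 2],
                   'AMT_CREDIT': [100.0, 300.0, 50.0, 60.0]})

# agg accepts custom functions alongside the built-in statistic names
agg = df.groupby('SK_ID_CURR')['AMT_CREDIT'].agg(['mean', value_range]).reset_index()
print(agg)
```

The resulting column is named after the function (`value_range`), just as the built-in statistics name their columns.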
### Compute the aggregate statistics
# Group the dataframe by client id and aggregate

bureau_agg = bureau.drop(columns = ['SK_ID_BUREAU']).groupby('SK_ID_CURR', as_index = False).agg(['count', 'mean', 'max', 'min', 'sum']).reset_index()
bureau_agg.head()
SK_ID_CURR DAYS_CREDIT CREDIT_DAY_OVERDUE ... DAYS_CREDIT_UPDATE AMT_ANNUITY
count mean max min sum count mean max min ... count mean max min sum count mean max min sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 ... 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 ... 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 ... 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 ... 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 ... 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

5 rows × 61 columns

  • It is a good idea to give the newly created columns descriptive names

    • The code below appends the kind of statistic to the original column names
  • At this point we are working with a dataframe that has a multi-level column index

    • Multi-level indexes are confusing and hard to work with, so we convert to a single-level index as quickly as possible
### Rename the columns

# List to hold the new column names
columns = ['SK_ID_CURR']

for var in bureau_agg.columns.levels[0]: # iterate over the original variables only
                                         # (not the stats such as min and max)
    # skip the id column
    if var != 'SK_ID_CURR':
        # iterate over the kinds of statistics
        for stat in bureau_agg.columns.levels[1][:-1]:
            # make a new column name for the variable and stat
            columns.append('bureau_%s_%s' % (var, stat)) # bureau_variable_stat
### Assign the list as the dataframe's column names

bureau_agg.columns = columns
bureau_agg.head()
SK_ID_CURR bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min ... bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 ... 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 ... 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 ... 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 ... 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 ... 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

5 rows × 61 columns

### Merge with the training data

train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
train.head()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 8.0 -499.875 -7.0 -1185.0 -3999.0 7.0 0.0 0.0 0.0 0.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 4.0 -816.000 -43.0 -2131.0 -3264.0 0.0 NaN NaN NaN 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 2.0 -532.000 -382.0 -682.0 -1064.0 0.0 NaN NaN NaN 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 1.0 -783.000 -783.0 -783.0 -783.0 0.0 NaN NaN NaN 0.0

5 rows × 183 columns

b) Correlations of the aggregates with the target

  • Examine the correlations between the newly created values and the target
# List to hold the correlations of the newly created variables
new_corrs = []

# Iterate over the new variables...
for col in columns:
    # correlation with the target
    corr = train['TARGET'].corr(train[col])
    # append as a (name, value) tuple
    new_corrs.append((col, corr))
  • Sort the correlations by absolute value

    • using the sorted() function
### Sort the correlations by absolute value

new_corrs = sorted(new_corrs, key = lambda x: abs(x[1]), reverse = True) # descending order
new_corrs[:15] # the 15 variables with the highest correlations
[('bureau_DAYS_CREDIT_mean', 0.08972896721998114),
 ('bureau_DAYS_CREDIT_min', 0.0752482510301036),
 ('bureau_DAYS_CREDIT_UPDATE_mean', 0.06892735266968673),
 ('bureau_DAYS_ENDDATE_FACT_min', 0.05588737984392077),
 ('bureau_DAYS_CREDIT_ENDDATE_sum', 0.0537348956010205),
 ('bureau_DAYS_ENDDATE_FACT_mean', 0.05319962585758616),
 ('bureau_DAYS_CREDIT_max', 0.04978205463997299),
 ('bureau_DAYS_ENDDATE_FACT_sum', 0.048853502611115894),
 ('bureau_DAYS_CREDIT_ENDDATE_mean', 0.046982754334835494),
 ('bureau_DAYS_CREDIT_UPDATE_min', 0.042863922470730155),
 ('bureau_DAYS_CREDIT_sum', 0.041999824814846716),
 ('bureau_DAYS_CREDIT_UPDATE_sum', 0.04140363535306002),
 ('bureau_DAYS_CREDIT_ENDDATE_max', 0.036589634696329094),
 ('bureau_DAYS_CREDIT_ENDDATE_min', 0.034281109921616024),
 ('bureau_DAYS_ENDDATE_FACT_count', -0.030492306653325495)]
  • None of the newly created variables has a significant correlation with the target
### Plot the most correlated variable, bureau_DAYS_CREDIT_mean

kde_target('bureau_DAYS_CREDIT_mean', train)
The correlation between bureau_DAYS_CREDIT_mean and the TARGET is 0.0897
Median value for loan that was not repaid = -835.3333
Median value for loan that was repaid =     -1067.0000

  • This column captures 'how many days before the current application did the client apply for credit at the credit bureau?'

  • We interpret it as the number of days between a previous loan and the current loan application at Home Credit

    • A larger negative number means the previous loan was taken out further in the past

    • Clients who took out loans further in the past tend to be more likely to repay their loan at Home Credit (a weak positive correlation)

      • But the correlation is so low that it may be nothing but noise

✔ The multiple comparisons problem

  • When we have very many variables, we expect some of them to correlate with the target purely by chance, a phenomenon known as the multiple comparisons problem

  • We can make hundreds of features, and a few will appear related to the target simply because of random noise in the data

    • Such variables look related to the target on the train data, but the relationships do not generalize to the test data, so they can cause the model to overfit

    • We therefore need to think carefully about the features we create
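The effect can be demonstrated with purely random data (the sizes here are arbitrary): none of the features below carries any signal, yet the strongest of the 500 chance correlations is noticeably far from zero.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)

# A random binary target and 500 features of pure noise
target = pd.Series(rng.randint(0, 2, size = 200))
noise = pd.DataFrame(rng.randn(200, 500),
                     columns = ['noise_%d' % i for i in range(500)])

# Correlation of each random feature with the random target
corrs = noise.apply(lambda col: target.corr(col))
print('max |r| among 500 random features: %0.3f' % corrs.abs().max())
```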

c) A function to aggregate the numeric variables

  • We condense the previous steps into a function that computes the aggregate statistics of the numeric variables

  • It calculates the aggregates of all numeric variables across a dataframe


  • Parameters

    • df (dataframe): the dataframe to aggregate

    • group_var (string): the column to group by

    • df_name (string): the string used to rename the columns

  • Returns

    • agg (dataframe)

      • A dataframe with the aggregate statistics of all the numeric variables

      • Each grouped instance gets the statistics (mean, min, max, sum)

      • The columns are renamed to distinguish the newly created features

def agg_numeric(df, group_var, df_name):
    # Remove id columns other than the grouping variable
    for col in df:
        if col != group_var and 'SK_ID' in col:
            df = df.drop(columns = col)
    group_ids = df[group_var]
    numeric_df = df.select_dtypes('number')
    numeric_df[group_var] = group_ids
    
    # Group by the specified variable and compute the aggregate statistics
    agg = numeric_df.groupby(group_var).agg(['count','mean','max','min','sum']).reset_index()
    
    # Make the new column names
    columns = [group_var]
    
    # For each variable...
    for var in agg.columns.levels[0]:
        # skip the id column
        if var != group_var:
            # for each kind of statistic
            for stat in agg.columns.levels[1][:-1]:
                # make a new column name for the variable and stat
                columns.append('%s_%s_%s' % (df_name, var, stat))
    agg.columns = columns
    
    return agg
### Compute the aggregate statistics of the numeric variables

bureau_agg_new = agg_numeric(bureau.drop(columns = ['SK_ID_BUREAU']), 
                             group_var = 'SK_ID_CURR', 
                             df_name = 'bureau')
bureau_agg_new.head()
SK_ID_CURR bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min ... bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 ... 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 ... 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 ... 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 ... 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 ... 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

5 rows × 61 columns

### The dataframe built by hand earlier

bureau_agg.head()
SK_ID_CURR bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min ... bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 ... 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 ... 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 ... 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 ... 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 ... 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

5 rows × 61 columns

  • We can confirm that the two dataframes were created identically

d) A function to compute correlations

  • Computes the correlation between each variable and the target
### Function to calculate the correlations with the target for a dataframe

def target_corrs(df):
    # List to hold the correlations
    corrs = []
    # For each column...
    for col in df.columns:
        print(col)

        # skip the target column (its correlation with itself is 1)
        if col != 'TARGET':
            # correlation with the target
            corr = df['TARGET'].corr(df[col])
            # append as a (name, value) tuple
            corrs.append((col, corr))
            
    # Sort the correlations by absolute value, descending
    corrs = sorted(corrs, key = lambda x: abs(x[1]), reverse = True)
    
    return corrs

1-4. Categorical variables

1️⃣ These are mostly string data, so statistics such as the mean or max are hard to apply

  • Instead, we count the number of values in each category

👉 Example

  • We turn data like this...
| SK_ID_CURR | Loan type |
|------------|-----------|
| 1          | home      |
| 1          | home      |
| 1          | home      |
| 1          | credit    |
| 2          | credit    |
| 3          | credit    |
| 3          | cash      |
| 3          | cash      |
| 4          | credit    |
| 4          | home      |
| 4          | home      |


  • ...into the following, using each client's count of loans in each category

| SK_ID_CURR | credit count | cash count | home count | total count |
|------------|--------------|------------|------------|-------------|
| 1          | 1            | 0          | 3          | 4           |
| 2          | 1            | 0          | 0          | 1           |
| 3          | 1            | 2          | 0          | 3           |
| 4          | 1            | 0          | 2          | 3           |

2๏ธโƒฃ ๊ทธ ๋‹ค์Œ, ๊ณ ๊ฐ๋ณ„ ๋Œ€์ถœ ํšŸ์ˆ˜๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๊ฐ’๋“ค์„ ์ •๊ทœํ™”(normalize)

  • ๊ณ ๊ฐ๋ณ„ ๋Œ€์ถœ ํšŸ์ˆ˜์˜ ํ•ฉ๊ณ„๊ฐ€ 1์ด ๋˜๋„๋ก ์Šค์ผ€์ผ ์กฐ์ •
SK_ID_CURR credit count cash count home count total count credit count norm cash count norm home count norm

|โ€”โ€”โ€”โ€”|โ€”โ€”โ€”โ€”โ€“|โ€”โ€”โ€”โ€”|โ€”โ€”โ€”โ€”|โ€”โ€”โ€”โ€”-|โ€”โ€”โ€”โ€”โ€”โ€”-|โ€”โ€”โ€”โ€”โ€”โ€“|โ€”โ€”โ€”โ€”โ€”โ€“|

1 1 0 3 4 0.25 0 0.75
2 1 0 0 1 1.00 0 0
3 1 2 0 3 0.33 0.66 0
4 1 0 2 3 0.33 0 0.66
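The two toy tables above can be reproduced with pd.crosstab, which handles both the raw counts and the row-normalized version (the data is the made-up example from the tables):

```python
import pandas as pd

loans = pd.DataFrame({'SK_ID_CURR': [1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4],
                      'Loan type': ['home', 'home', 'home', 'credit', 'credit',
                                    'credit', 'cash', 'cash', 'credit', 'home', 'home']})

# Raw counts of each loan type per client
counts = pd.crosstab(loans['SK_ID_CURR'], loans['Loan type'])

# normalize='index' rescales each client's row to sum to 1
counts_norm = pd.crosstab(loans['SK_ID_CURR'], loans['Loan type'], normalize = 'index')

print(counts)
print(counts_norm)
```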

a) Encoding the categorical variables

  • Encoding the categorical variables lets us use the information they contain

  • We apply one-hot encoding

    • only to the categorical variables (dtype=object)
### One-hot Encoding

categorical = pd.get_dummies(bureau.select_dtypes('object'))
categorical['SK_ID_CURR'] = bureau['SK_ID_CURR']
categorical.head()
CREDIT_ACTIVE_Active CREDIT_ACTIVE_Bad debt CREDIT_ACTIVE_Closed CREDIT_ACTIVE_Sold CREDIT_CURRENCY_currency 1 CREDIT_CURRENCY_currency 2 CREDIT_CURRENCY_currency 3 CREDIT_CURRENCY_currency 4 CREDIT_TYPE_Another type of loan CREDIT_TYPE_Car loan ... CREDIT_TYPE_Loan for business development CREDIT_TYPE_Loan for purchase of shares (margin lending) CREDIT_TYPE_Loan for the purchase of equipment CREDIT_TYPE_Loan for working capital replenishment CREDIT_TYPE_Microloan CREDIT_TYPE_Mobile operator loan CREDIT_TYPE_Mortgage CREDIT_TYPE_Real estate loan CREDIT_TYPE_Unknown type of loan SK_ID_CURR
0 0 0 1 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354
1 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354
2 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354
3 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354
4 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354

5 rows × 24 columns

### Group by client id

categorical_grouped = categorical.groupby('SK_ID_CURR').agg(['sum', 'mean'])
categorical_grouped.head()
CREDIT_ACTIVE_Active CREDIT_ACTIVE_Bad debt CREDIT_ACTIVE_Closed CREDIT_ACTIVE_Sold CREDIT_CURRENCY_currency 1 ... CREDIT_TYPE_Microloan CREDIT_TYPE_Mobile operator loan CREDIT_TYPE_Mortgage CREDIT_TYPE_Real estate loan CREDIT_TYPE_Unknown type of loan
sum mean sum mean sum mean sum mean sum mean ... sum mean sum mean sum mean sum mean sum mean
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

5 rows × 46 columns

  • sum: the total number of the client's loans in that category

  • mean: the normalized count

One-hot encoding makes these values easy to compute

  • Here we use a function similar to the one we used earlier to rename the columns

    • It works on columns organized as a multi-level index
  • We first loop over the first level (level 0), which holds the names of the one-hot encoded columns, and then loop over the computed statistics

  • We then rename each column by joining its level 0 name with the kind of statistic

    • For example, the column with CREDIT_ACTIVE_Active at level 0 and sum at level 1 becomes CREDIT_ACTIVE_Active_count

Turning this into a function

### Level 0: names of the one-hot encoded columns

categorical_grouped.columns.levels[0][:10]
Index(['CREDIT_ACTIVE_Active', 'CREDIT_ACTIVE_Bad debt',
       'CREDIT_ACTIVE_Closed', 'CREDIT_ACTIVE_Sold',
       'CREDIT_CURRENCY_currency 1', 'CREDIT_CURRENCY_currency 2',
       'CREDIT_CURRENCY_currency 3', 'CREDIT_CURRENCY_currency 4',
       'CREDIT_TYPE_Another type of loan', 'CREDIT_TYPE_Car loan'],
      dtype='object')
### Level 1: the statistics to compute (sum, mean)

categorical_grouped.columns.levels[1]
Index(['sum', 'mean'], dtype='object')
group_var = 'SK_ID_CURR'

# List to hold the new column names
columns = []

# Iterate over the variables...
for var in categorical_grouped.columns.levels[0]:
    # skip the client id column (the grouping variable)
    if var != group_var:
        # iterate over the kinds of statistics
        for stat in ['count', 'count_norm']:
            # make a new column name
            columns.append('%s_%s' % (var, stat))
### Rename the variables

categorical_grouped.columns = columns
categorical_grouped.head()
CREDIT_ACTIVE_Active_count CREDIT_ACTIVE_Active_count_norm CREDIT_ACTIVE_Bad debt_count CREDIT_ACTIVE_Bad debt_count_norm CREDIT_ACTIVE_Closed_count CREDIT_ACTIVE_Closed_count_norm CREDIT_ACTIVE_Sold_count CREDIT_ACTIVE_Sold_count_norm CREDIT_CURRENCY_currency 1_count CREDIT_CURRENCY_currency 1_count_norm ... CREDIT_TYPE_Microloan_count CREDIT_TYPE_Microloan_count_norm CREDIT_TYPE_Mobile operator loan_count CREDIT_TYPE_Mobile operator loan_count_norm CREDIT_TYPE_Mortgage_count CREDIT_TYPE_Mortgage_count_norm CREDIT_TYPE_Real estate loan_count CREDIT_TYPE_Real estate loan_count_norm CREDIT_TYPE_Unknown type of loan_count CREDIT_TYPE_Unknown type of loan_count_norm
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

5 rows × 46 columns

  • count: the total number of loans in the category

  • count_norm: the normalized count

### Merge with the training data

train = train.merge(categorical_grouped, left_on = 'SK_ID_CURR', 
                    right_index = True, how = 'left')
train.head()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... CREDIT_TYPE_Microloan_count CREDIT_TYPE_Microloan_count_norm CREDIT_TYPE_Mobile operator loan_count CREDIT_TYPE_Mobile operator loan_count_norm CREDIT_TYPE_Mortgage_count CREDIT_TYPE_Mortgage_count_norm CREDIT_TYPE_Real estate loan_count CREDIT_TYPE_Real estate loan_count_norm CREDIT_TYPE_Unknown type of loan_count CREDIT_TYPE_Unknown type of loan_count_norm
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 229 columns

train.shape
(307511, 229)
train.iloc[:10, 123:]
bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min bureau_CREDIT_DAY_OVERDUE_sum ... CREDIT_TYPE_Microloan_count CREDIT_TYPE_Microloan_count_norm CREDIT_TYPE_Mobile operator loan_count CREDIT_TYPE_Mobile operator loan_count_norm CREDIT_TYPE_Mortgage_count CREDIT_TYPE_Mortgage_count_norm CREDIT_TYPE_Real estate loan_count CREDIT_TYPE_Real estate loan_count_norm CREDIT_TYPE_Unknown type of loan_count CREDIT_TYPE_Unknown type of loan_count_norm
0 8.0 -874.000000 -103.0 -1437.0 -6992.0 8.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 4.0 -1400.750000 -606.0 -2586.0 -5603.0 4.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2.0 -867.000000 -408.0 -1326.0 -1734.0 2.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1.0 -1149.000000 -1149.0 -1149.0 -1149.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 3.0 -757.333333 -78.0 -1097.0 -2272.0 3.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 18.0 -1271.500000 -239.0 -2882.0 -22887.0 18.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 2.0 -1939.500000 -1138.0 -2741.0 -3879.0 2.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 4.0 -1773.000000 -1309.0 -2508.0 -7092.0 4.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10 rows ร— 106 columns

b) ๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ๋“ค์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ ํ•จ์ˆ˜

  • ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ๊ทธ๋ฃนํ™”

  • ์ดํ›„ ๊ฐ๊ฐ์˜ ๋ฒ”์ฃผ์— ๋”ฐ๋ผ counts์™€ normalized_counts๋ฅผ ๊ณ„์‚ฐ


  • ํŒŒ๋ผ๋ฏธํ„ฐ(Parameters)>

    • df (dataframe): ์—ฐ์‚ฐ์˜ ๋Œ€์ƒ์ด ๋˜๋Š” ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„

    • group_var(string): groupby์˜ ๊ธฐ์ค€์ด๋˜๋Š” column

    • df_name(string): column ๋ช…์„ ์žฌ์ •์˜ํ•˜๋Š”๋ฐ ์“ฐ์ด๋Š” ๋ณ€์ˆ˜

  • ์ถœ๋ ฅ๊ฐ’(Returns)

    • categorical: ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ group_var์— ๋Œ€ํ•ด ๊ฐ ๋ฒ”์ฃผ๋“ค์˜ counts ๋ฐ normalized_counts ๊ฐ’์ด ํฌํ•จ๋œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„
### ๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ํ•จ์ˆ˜

def count_categorical(df, group_var, df_name):    
    # ๋ฒ”์ฃผํ˜• column๋“ค์„ ์„ ํƒ
    categorical = pd.get_dummies(df.select_dtypes('object'))
    # ํ™•์‹คํžˆ id๊ฐ€ column์— ์žˆ๋„๋ก ์ง€์ •ํ•˜๊ธฐ
    categorical[group_var] = df[group_var]

    # group_var๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๊ทธ๋ฃนํ™”ํ•˜๊ณ  sum๊ณผ mean์„ ๊ณ„์‚ฐ
    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])
    
    column_names = []
    
    # level = 0์˜ column๋“ค์— ๋”ฐ๋ผ ๋ฐ˜๋ณต๋ฌธ์„ ์‹คํ–‰
    for var in categorical.columns.levels[0]:
        # level = 1์˜ ํ†ต๊ณ„๊ฐ’๋“ค(sum -> count, mean -> count_norm)์— ๋Œ€ํ•ด ๋ฐ˜๋ณต๋ฌธ์„ ์‹คํ–‰
        for stat in ['count', 'count_norm']:
            # ์ปฌ๋Ÿผ๋ช… ์žฌ์ •์˜
            column_names.append('%s_%s_%s' % (df_name, var, stat))
    
    categorical.columns = column_names
    
    return categorical
bureau_counts = count_categorical(bureau, 
                                  group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_counts.head()
bureau_CREDIT_ACTIVE_Active_count bureau_CREDIT_ACTIVE_Active_count_norm bureau_CREDIT_ACTIVE_Bad debt_count bureau_CREDIT_ACTIVE_Bad debt_count_norm bureau_CREDIT_ACTIVE_Closed_count bureau_CREDIT_ACTIVE_Closed_count_norm bureau_CREDIT_ACTIVE_Sold_count bureau_CREDIT_ACTIVE_Sold_count_norm bureau_CREDIT_CURRENCY_currency 1_count bureau_CREDIT_CURRENCY_currency 1_count_norm ... bureau_CREDIT_TYPE_Microloan_count bureau_CREDIT_TYPE_Microloan_count_norm bureau_CREDIT_TYPE_Mobile operator loan_count bureau_CREDIT_TYPE_Mobile operator loan_count_norm bureau_CREDIT_TYPE_Mortgage_count bureau_CREDIT_TYPE_Mortgage_count_norm bureau_CREDIT_TYPE_Real estate loan_count bureau_CREDIT_TYPE_Real estate loan_count_norm bureau_CREDIT_TYPE_Unknown type of loan_count bureau_CREDIT_TYPE_Unknown type of loan_count_norm
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

5 rows ร— 46 columns
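count์™€ count_norm์ด ์–ด๋–ป๊ฒŒ ๊ณ„์‚ฐ๋˜๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋„๋ก, ๋ณธ๋ฌธ์˜ count_categorical ๋กœ์ง์„ ๊ทธ๋Œ€๋กœ ์˜ฎ๊ฒจ ์ž„์˜๋กœ ๋งŒ๋“  ์†Œํ˜• ๋ฐ์ดํ„ฐ์— ์ ์šฉํ•ด ๋ณธ ์Šค์ผ€์น˜(๋ฐ์ดํ„ฐ ๊ฐ’์€ ๊ฐ€์ •):

```python
import pandas as pd

def count_categorical(df, group_var, df_name):
    # ๋ฒ”์ฃผํ˜• column๋“ค์„ one-hot encoding
    categorical = pd.get_dummies(df.select_dtypes('object'))
    categorical[group_var] = df[group_var]
    # sum -> count, mean -> count_norm
    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])
    categorical.columns = ['%s_%s_%s' % (df_name, var, stat)
                           for var in categorical.columns.levels[0]
                           for stat in ['count', 'count_norm']]
    return categorical

# ์ž„์˜์˜ ์˜ˆ์‹œ: ๊ณ ๊ฐ 1์€ ๋Œ€์ถœ 3๊ฑด(Active 1, Closed 2), ๊ณ ๊ฐ 2๋Š” 1๊ฑด(Active)
toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 1, 2],
                    'CREDIT_ACTIVE': ['Active', 'Closed', 'Closed', 'Active']})
result = count_categorical(toy, group_var='SK_ID_CURR', df_name='bureau')
print(result)
# ๊ณ ๊ฐ 1: Closed count = 2, Active count_norm = 1/3
```

count๋Š” ํ•ด๋‹น ๋ฒ”์ฃผ์˜ ๋“ฑ์žฅ ํšŸ์ˆ˜, count_norm์€ ๊ทธ ๊ณ ๊ฐ์˜ ์ „์ฒด ๊ธฐ๋ก ์ค‘ ๋น„์œจ์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ.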

c) ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์— ์—ฐ์‚ฐ ์ ์šฉํ•˜๊ธฐ

  • bureau_balance ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ํ™œ์šฉ

    • ์ด ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์€ ์›”๋ณ„ ๊ฐ ๊ณ ๊ฐ์˜ ๊ณผ๊ฑฐ ํƒ€ ๊ธˆ์œต๊ธฐ๊ด€ ๋Œ€์ถœ ๋ฐ์ดํ„ฐ๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ์Œ
  • ๊ณ ๊ฐ๋“ค์˜ ID์ธ SK_ID_CURR์— ๋”ฐ๋ผ ๊ทธ๋ฃนํ™”ํ•˜๊ธฐ๋ณด๋‹ค, ์ด์ „ ๋Œ€์ถœ์˜ ID์ธ SK_ID_BUREAU๋ฅผ ํ™œ์šฉํ•˜์—ฌ 1์ฐจ ๊ทธ๋ฃนํ™”๋ฅผ ์ง„ํ–‰ํ•  ๊ฒƒ์ž„

    • ๊ทธ๋ฃนํ™”ํ•œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์€ ๊ฐ๊ฐ์˜ ๋Œ€์ถœ์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ํ–‰๋ณ„๋กœ ํฌํ•จํ•  ๊ฒƒ์ž„
  • ๊ทธ ๋‹ค์Œ, SK_ID_CURR์„ ํ™œ์šฉํ•˜์—ฌ ๊ทธ๋ฃนํ™”ํ•œ ๋’ค, ๊ฐ ๊ณ ๊ฐ๋ณ„ ๋Œ€์ถœ์˜ ๋Œ€ํ‘œ๊ฐ’๋“ค์„ ๊ณ„์‚ฐ

  • ์ตœ์ข… ์‚ฐ์ถœ๋ฌผ์€ ๊ฐ๊ฐ์˜ ํ–‰์— ๊ณ ๊ฐ๋ณ„๋กœ ๋Œ€์ถœ์— ๋Œ€ํ•œ ๋Œ€ํ‘œ๊ฐ’๋“ค์„ ํฌํ•จ

### ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

bureau_balance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/bureau_balance.csv')
bureau_balance.head()
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C
### 1. ๊ฐ๊ฐ์˜ ์ด์ „ ๋Œ€์ถœ์— ๋Œ€ํ•œ ์ƒํƒœ ๊ฐœ์ˆ˜ ํŒŒ์•…

bureau_balance_counts = count_categorical(bureau_balance, group_var = 'SK_ID_BUREAU', 
                                          df_name = 'bureau_balance')
bureau_balance_counts.head()
bureau_balance_STATUS_0_count bureau_balance_STATUS_0_count_norm bureau_balance_STATUS_1_count bureau_balance_STATUS_1_count_norm bureau_balance_STATUS_2_count bureau_balance_STATUS_2_count_norm bureau_balance_STATUS_3_count bureau_balance_STATUS_3_count_norm bureau_balance_STATUS_4_count bureau_balance_STATUS_4_count_norm bureau_balance_STATUS_5_count bureau_balance_STATUS_5_count_norm bureau_balance_STATUS_C_count bureau_balance_STATUS_C_count_norm bureau_balance_STATUS_X_count bureau_balance_STATUS_X_count_norm
SK_ID_BUREAU
5001709 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 86 0.886598 11 0.113402
5001710 5 0.060241 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 48 0.578313 30 0.361446
5001711 3 0.750000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 1 0.250000
5001712 10 0.526316 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 9 0.473684 0 0.000000
5001713 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 22 1.000000
### 2. ์ˆ˜์น˜ํ˜• ๋ณ€์ˆ˜ ์ฒ˜๋ฆฌ
# MONTHS_BALANCE: ์‹ ์ฒญ์ผ์„ ๊ธฐ์ค€์œผ๋กœ ๋ช‡ ๊ฐœ์›” ์ „์˜ ๊ธฐ๋ก์ธ์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฐ’

## 2-1. ๊ฐ๊ฐ์˜ 'SK_ID_BUREAU'๋ณ„ ๋Œ€ํ‘œ๊ฐ’ ๊ณ„์‚ฐ
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', 
                                 df_name = 'bureau_balance')
bureau_balance_agg.head()
SK_ID_BUREAU bureau_balance_MONTHS_BALANCE_count bureau_balance_MONTHS_BALANCE_mean bureau_balance_MONTHS_BALANCE_max bureau_balance_MONTHS_BALANCE_min bureau_balance_MONTHS_BALANCE_sum
0 5001709 97 -48.0 0 -96 -4656
1 5001710 83 -41.0 0 -82 -3403
2 5001711 4 -1.5 0 -3 -6
3 5001712 19 -9.0 0 -18 -171
4 5001713 22 -10.5 0 -21 -231
## 2-2. ๊ฐ ๊ณ ๊ฐ๋ณ„ ๊ณ„์‚ฐ

# ๋Œ€์ถœ์„ ๊ธฐ์ค€์œผ๋กœ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ๊ทธ๋ฃนํ™”
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, 
                                          left_on = 'SK_ID_BUREAU', how = 'outer')

# SK_ID_CURR์„ ํฌํ•จํ•˜์—ฌ ๋ณ‘ํ•ฉ
bureau_by_loan = bureau_by_loan.merge(bureau[['SK_ID_BUREAU', 'SK_ID_CURR']], 
                                      on = 'SK_ID_BUREAU', how = 'left')

bureau_by_loan.head()
SK_ID_BUREAU bureau_balance_MONTHS_BALANCE_count bureau_balance_MONTHS_BALANCE_mean bureau_balance_MONTHS_BALANCE_max bureau_balance_MONTHS_BALANCE_min bureau_balance_MONTHS_BALANCE_sum bureau_balance_STATUS_0_count bureau_balance_STATUS_0_count_norm bureau_balance_STATUS_1_count bureau_balance_STATUS_1_count_norm ... bureau_balance_STATUS_3_count_norm bureau_balance_STATUS_4_count bureau_balance_STATUS_4_count_norm bureau_balance_STATUS_5_count bureau_balance_STATUS_5_count_norm bureau_balance_STATUS_C_count bureau_balance_STATUS_C_count_norm bureau_balance_STATUS_X_count bureau_balance_STATUS_X_count_norm SK_ID_CURR
0 5001709 97 -48.0 0 -96 -4656 0 0.000000 0 0.0 ... 0.0 0 0.0 0 0.0 86 0.886598 11 0.113402 NaN
1 5001710 83 -41.0 0 -82 -3403 5 0.060241 0 0.0 ... 0.0 0 0.0 0 0.0 48 0.578313 30 0.361446 162368.0
2 5001711 4 -1.5 0 -3 -6 3 0.750000 0 0.0 ... 0.0 0 0.0 0 0.0 0 0.000000 1 0.250000 162368.0
3 5001712 19 -9.0 0 -18 -171 10 0.526316 0 0.0 ... 0.0 0 0.0 0 0.0 9 0.473684 0 0.000000 162368.0
4 5001713 22 -10.5 0 -21 -231 0 0.000000 0 0.0 ... 0.0 0 0.0 0 0.0 0 0.000000 22 1.000000 150635.0

5 rows ร— 23 columns

bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), 
                                       group_var = 'SK_ID_CURR', df_name = 'client')
bureau_balance_by_client.head()
SK_ID_CURR client_bureau_balance_MONTHS_BALANCE_count_count client_bureau_balance_MONTHS_BALANCE_count_mean client_bureau_balance_MONTHS_BALANCE_count_max client_bureau_balance_MONTHS_BALANCE_count_min client_bureau_balance_MONTHS_BALANCE_count_sum client_bureau_balance_MONTHS_BALANCE_mean_count client_bureau_balance_MONTHS_BALANCE_mean_mean client_bureau_balance_MONTHS_BALANCE_mean_max client_bureau_balance_MONTHS_BALANCE_mean_min ... client_bureau_balance_STATUS_X_count_count client_bureau_balance_STATUS_X_count_mean client_bureau_balance_STATUS_X_count_max client_bureau_balance_STATUS_X_count_min client_bureau_balance_STATUS_X_count_sum client_bureau_balance_STATUS_X_count_norm_count client_bureau_balance_STATUS_X_count_norm_mean client_bureau_balance_STATUS_X_count_norm_max client_bureau_balance_STATUS_X_count_norm_min client_bureau_balance_STATUS_X_count_norm_sum
0 100001.0 7 24.571429 52 2 172 7 -11.785714 -0.5 -25.5 ... 7 4.285714 9 0 30.0 7 0.214590 0.500000 0.0 1.502129
1 100002.0 8 13.750000 22 4 110 8 -21.875000 -1.5 -39.5 ... 8 1.875000 3 0 15.0 8 0.161932 0.500000 0.0 1.295455
2 100005.0 3 7.000000 13 3 21 3 -3.000000 -1.0 -6.0 ... 3 0.666667 1 0 2.0 3 0.136752 0.333333 0.0 0.410256
3 100010.0 2 36.000000 36 36 72 2 -46.000000 -19.5 -72.5 ... 2 0.000000 0 0 0.0 2 0.000000 0.000000 0.0 0.000000
4 100013.0 4 57.500000 69 40 230 4 -28.250000 -19.5 -34.0 ... 4 10.250000 40 0 41.0 4 0.254545 1.000000 0.0 1.018182

5 rows ร— 106 columns

๐Ÿ“Œ ์ •๋ฆฌ

  • bureau_balance ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ณผ์ •์œผ๋กœ ๊ฐ€๊ณต๋จ

    1. ๊ฐ๊ฐ์˜ ๋Œ€์ถœ์— ๋Œ€ํ•ด ์ˆ˜์น˜ํ˜•(numeric) ๋Œ€ํ‘œ๊ฐ’ ๊ณ„์‚ฐ

    2. ๊ฐ๊ฐ์˜ ๋Œ€์ถœ์— ๋Œ€ํ•ด ๋ฒ”์ฃผํ˜•(categorical) ๋ฐ์ดํ„ฐ๋“ค์˜ ๊ฐœ์ˆ˜๋ฅผ ํŒŒ์•…

    3. ๊ฐ๊ฐ์˜ ๋Œ€์ถœ์— ๋Œ€ํ•œ ๋Œ€ํ‘œ๊ฐ’๋“ค๊ณผ ๊ฐฏ์ˆ˜๋ฅผ ๋ณ‘ํ•ฉ

    4. ๊ฐ ๊ณ ๊ฐ๋ณ„๋กœ 3์˜ ๊ฒฐ๊ณผ์— ๋Œ€ํ•œ ์ˆ˜์น˜ํ˜• ๋Œ€ํ‘œ๊ฐ’ ๊ณ„์‚ฐ

  • ์ตœ์ข… ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์€ ๊ฐ ๊ณ ๊ฐ์— ๋Œ€ํ•œ ๊ฐœ๋ณ„ ํ–‰์œผ๋กœ ๊ตฌ์„ฑ๋˜๋ฉฐ, ๊ฐ ํ–‰์€ ์ด์ „ ๋ชจ๋“  ๋Œ€์ถœ๋“ค์˜ ์›”๋ณ„ ์ •๋ณด๋“ค์˜ ํ†ต๊ณ„์น˜๋“ค๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Œ


  • client_bureau_balance_MONTHS_BALANCE_mean_mean: ๊ฐ๊ฐ์˜ ๋Œ€์ถœ์— ๋Œ€ํ•œ MONTHS_BALANCE์˜ ํ‰๊ท ๊ฐ’์„ ๊ณ„์‚ฐ -> ํด๋ผ์ด์–ธํŠธ๋ณ„ ๋Œ€์ถœ์˜ ํ‰๊ท ๊ฐ’์„ ๊ณ„์‚ฐ

  • client_bureau_balance_STATUS_X_count_norm_sum: ๊ฐ๊ฐ์˜ ๋Œ€์ถœ์— ๋Œ€ํ•ด STATUS == 'X'์ธ ๊ฒƒ์˜ ๋นˆ๋„๋ฅผ ์ด STATUS ๊ฐœ์ˆ˜๋กœ ๋‚˜๋ˆˆ ๋‹ค์Œ, ๊ณ ๊ฐ๋ณ„๋กœ ๊ทธ ๊ฐ’์„ ํ•ฉ์‚ฐ

  • ๋ชจ๋“  ๋ณ€์ˆ˜๋ฅผ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์— ๋ชจ์€ ๋’ค์— ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•  ์˜ˆ์ •
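์œ„ ์ •๋ฆฌ์˜ 1~4๋‹จ๊ณ„(๋Œ€์ถœ๋ณ„ ์ง‘๊ณ„ -> ๊ณ ๊ฐ id ๊ฒฐํ•ฉ -> ๊ณ ๊ฐ๋ณ„ ์žฌ์ง‘๊ณ„)๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์Šค์ผ€์น˜ํ•ด ๋ณผ ์ˆ�˜ ์žˆ์Œ(agg_numeric ๋Œ€์‹  pandas groupby๋ฅผ ์ง์ ‘ ์‚ฌ์šฉํ•œ ๊ฐ„๋žตํ•œ ๋ฒ„์ „์ด๋ฉฐ, ๋ฐ์ดํ„ฐ ๊ฐ’์€ ์ž„์˜๋กœ ๋งŒ๋“  ๊ฐ€์ •):

```python
import pandas as pd

# ์ž„์˜์˜ ์›”๋ณ„ ๋ฐ์ดํ„ฐ: ๋Œ€์ถœ(SK_ID_BUREAU)๋ณ„ MONTHS_BALANCE ๊ธฐ๋ก
balance = pd.DataFrame({'SK_ID_BUREAU': [10, 10, 11, 11, 12],
                        'MONTHS_BALANCE': [0, -1, 0, -3, -2]})
# ๋Œ€์ถœ -> ๊ณ ๊ฐ(SK_ID_CURR) ๋งคํ•‘
bureau = pd.DataFrame({'SK_ID_BUREAU': [10, 11, 12],
                       'SK_ID_CURR': [100, 100, 200]})

# 1๋‹จ๊ณ„: ๋Œ€์ถœ๋ณ„ ๋Œ€ํ‘œ๊ฐ’ ๊ณ„์‚ฐ
by_loan = balance.groupby('SK_ID_BUREAU')['MONTHS_BALANCE'] \
                 .agg(['count', 'mean']).reset_index()
# 2๋‹จ๊ณ„: SK_ID_CURR ๊ฒฐํ•ฉ
by_loan = by_loan.merge(bureau, on='SK_ID_BUREAU', how='left')
# 3๋‹จ๊ณ„: ๊ณ ๊ฐ๋ณ„๋กœ '๋Œ€ํ‘œ๊ฐ’์˜ ๋Œ€ํ‘œ๊ฐ’' ๊ณ„์‚ฐ(์˜ˆ: mean_mean์— ํ•ด๋‹น)
by_client = by_loan.groupby('SK_ID_CURR')[['count', 'mean']].agg(['mean', 'max'])
print(by_client)
# ๊ณ ๊ฐ 100: ๋Œ€์ถœ๋ณ„ ํ‰๊ท (-0.5, -1.5)์˜ ํ‰๊ท  = -1.0
```

client_bureau_balance_MONTHS_BALANCE_mean_mean๊ณผ ๊ฐ™์€ ์ด์ค‘ ์ ‘๋ฏธ์‚ฌ๊ฐ€ ๋ฐ”๋กœ ์ด ๋‘ ๋ฒˆ์˜ ์ง‘๊ณ„์—์„œ ๋‚˜์˜จ๋‹ค๋Š” ์ ์„ ๋ณด์—ฌ์ฃผ๋Š” ์˜ˆ์‹œ์ž„.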

2. ์ด์ „๊นŒ์ง€ ์ƒ์„ฑํ•œ ํ•จ์ˆ˜ ํ™œ์šฉํ•˜๊ธฐ

  • ๋ชจ๋“  ๋ณ€์ˆ˜๋“ค์„ ์ดˆ๊ธฐํ™” ํ•œ ๋’ค, ์ƒ์„ฑ๋œ ํ•จ์ˆ˜๋“ค์„ ์‚ฌ์šฉํ•˜์—ฌ ์ฒ˜์Œ๋ถ€ํ„ฐ ๋งŒ๋“ค์–ด๋‚˜๊ฐ€์ž.
# ์˜ค๋ž˜๋œ ๊ฐ์ฒด๋“ค(objects)์„ ์ œ๊ฑฐํ•จ์œผ๋กœ์จ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™•๋ณด

import gc
gc.enable()

del train, bureau, bureau_balance, bureau_agg, bureau_agg_new, bureau_balance_agg, bureau_balance_counts, bureau_by_loan, bureau_balance_by_client, bureau_counts
gc.collect()
0
# ์›๋ณธ ๋ฐ์ดํ„ฐ ๋‹ค์‹œ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ(์ดˆ๊ธฐํ™”)

train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/application_train.csv')
bureau = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/bureau.csv')
bureau_balance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/bureau_balance.csv')

๐Ÿ“Œ Bureau ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ๋‚ด ๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ์˜ ๊ฐฏ์ˆ˜ ์„ธ๊ธฐ

bureau_counts = count_categorical(bureau, group_var = 'SK_ID_CURR', 
                                  df_name = 'bureau')
bureau_counts.head()
bureau_CREDIT_ACTIVE_Active_count bureau_CREDIT_ACTIVE_Active_count_norm bureau_CREDIT_ACTIVE_Bad debt_count bureau_CREDIT_ACTIVE_Bad debt_count_norm bureau_CREDIT_ACTIVE_Closed_count bureau_CREDIT_ACTIVE_Closed_count_norm bureau_CREDIT_ACTIVE_Sold_count bureau_CREDIT_ACTIVE_Sold_count_norm bureau_CREDIT_CURRENCY_currency 1_count bureau_CREDIT_CURRENCY_currency 1_count_norm ... bureau_CREDIT_TYPE_Microloan_count bureau_CREDIT_TYPE_Microloan_count_norm bureau_CREDIT_TYPE_Mobile operator loan_count bureau_CREDIT_TYPE_Mobile operator loan_count_norm bureau_CREDIT_TYPE_Mortgage_count bureau_CREDIT_TYPE_Mortgage_count_norm bureau_CREDIT_TYPE_Real estate loan_count bureau_CREDIT_TYPE_Real estate loan_count_norm bureau_CREDIT_TYPE_Unknown type of loan_count bureau_CREDIT_TYPE_Unknown type of loan_count_norm
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

5 rows ร— 46 columns

๐Ÿ“ŒBureau ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ๋Œ€ํ‘œ๊ฐ’ ๊ณ„์‚ฐ

bureau_agg = agg_numeric(bureau.drop(columns = ['SK_ID_BUREAU']), 
                         group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_agg.head()
SK_ID_CURR bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min ... bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 ... 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 ... 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 ... 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 ... 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 ... 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

5 rows ร— 61 columns

๐Ÿ“Œ Bureau Balance ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์˜ ๊ฐ ๋Œ€์ถœ ๋ณ„ ๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ ๊ฐœ์ˆ˜

bureau_balance_counts = count_categorical(bureau_balance, group_var = 'SK_ID_BUREAU', 
                                          df_name = 'bureau_balance')
bureau_balance_counts.head()
bureau_balance_STATUS_0_count bureau_balance_STATUS_0_count_norm bureau_balance_STATUS_1_count bureau_balance_STATUS_1_count_norm bureau_balance_STATUS_2_count bureau_balance_STATUS_2_count_norm bureau_balance_STATUS_3_count bureau_balance_STATUS_3_count_norm bureau_balance_STATUS_4_count bureau_balance_STATUS_4_count_norm bureau_balance_STATUS_5_count bureau_balance_STATUS_5_count_norm bureau_balance_STATUS_C_count bureau_balance_STATUS_C_count_norm bureau_balance_STATUS_X_count bureau_balance_STATUS_X_count_norm
SK_ID_BUREAU
5001709 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 86 0.886598 11 0.113402
5001710 5 0.060241 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 48 0.578313 30 0.361446
5001711 3 0.750000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 1 0.250000
5001712 10 0.526316 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 9 0.473684 0 0.000000
5001713 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 22 1.000000

๐Ÿ“Œ Bureau Balance ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ๊ฐ ๋Œ€์ถœ๋ณ„ ๋Œ€ํ‘œ๊ฐ’

bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', 
                                 df_name = 'bureau_balance')
bureau_balance_agg.head()
SK_ID_BUREAU bureau_balance_MONTHS_BALANCE_count bureau_balance_MONTHS_BALANCE_mean bureau_balance_MONTHS_BALANCE_max bureau_balance_MONTHS_BALANCE_min bureau_balance_MONTHS_BALANCE_sum
0 5001709 97 -48.0 0 -96 -4656
1 5001710 83 -41.0 0 -82 -3403
2 5001711 4 -1.5 0 -3 -6
3 5001712 19 -9.0 0 -18 -171
4 5001713 22 -10.5 0 -21 -231

๐Ÿ“Œ Bureau Balance ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ๊ณ ๊ฐ๋ณ„ ๋Œ€ํ‘œ๊ฐ’

# ๊ฐ ๋Œ€์ถœ๋ณ„ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ๋ณ‘ํ•ฉ
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, 
                                          left_on = 'SK_ID_BUREAU', how = 'outer')

# ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์— SK_ID_CURR์„ ํฌํ•จ
bureau_by_loan = bureau[['SK_ID_BUREAU', 'SK_ID_CURR']].merge(bureau_by_loan, 
                                                              on = 'SK_ID_BUREAU', 
                                                              how = 'left')

# ๊ณ ๊ฐ๋ณ„ ๋Œ€ํ‘œ๊ฐ’ ๊ณ„์‚ฐ
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), 
                                       group_var = 'SK_ID_CURR', df_name = 'client')

๐Ÿ“Œ ๊ณ„์‚ฐ๋œ ํŠน์„ฑ(Feature)๋“ค์„ ํ›ˆ๋ จ์šฉ ๋ฐ์ดํ„ฐ์™€ ๋ณ‘ํ•ฉ

original_features = list(train.columns) # ์›๋ž˜ ๋ณ€์ˆ˜๋“ค
print('Original Number of Features: ', len(original_features))
Original Number of Features:  122
# bureau_count ๋ณ‘ํ•ฉ
train = train.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# bureau ๋Œ€ํ‘œ๊ฐ’ ๋ณ‘ํ•ฉ
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# ์›”๋ณ„, ๊ณ ๊ฐ๋ณ„ ์ •๋ณด ๋ณ‘ํ•ฉ
train = train.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
new_features = list(train.columns) # ๋ณ€์ˆ˜ ์ถ”๊ฐ€ ํ›„ ๋ณ€์ˆ˜ ๋ชฉ๋ก
print('Number of features using previous loans from other institutions data: ', len(new_features))
Number of features using previous loans from other institutions data:  333
  • ๋งŽ์€ ๋ณ€์ˆ˜๋“ค์ด ์ƒˆ๋กœ ์ƒ์„ฑ๋˜์—ˆ๋‹ค.

3. Feature Engineering ๊ฒฐ๊ณผ๋ฌผ

  • ๊ฒฐ์ธก์น˜์˜ ๋น„์œจ, target๊ณผ์˜ ์ƒ๊ด€๊ณ„์ˆ˜, ๋ณ€์ˆ˜ ๊ฐ„ ์ƒ๊ด€๊ณ„์ˆ˜๋“ค์„ ํŒŒ์•…

    • ๊ฐ ๋ณ€์ˆ˜ ๊ฐ„ ๋†’์€ ์ƒ๊ด€๊ด€๊ณ„๋Š” ๋ณ€์ˆ˜ ๊ฐ„ collinear ๊ด€๊ณ„๋ฅผ ๊ฐ€์ง€๋Š” ์ง€ ์—ฌ๋ถ€๋ฅผ ๋ณด์—ฌ์ค„ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋Š” ๋ณ€์ˆ˜๋“ค์ด ์„œ๋กœ ๊ฐ•ํ•œ ์—ฐ๊ด€๊ด€๊ณ„๋ฅผ ๊ฐ€์ง์„ ์˜๋ฏธ

    • ์ข…์ข… collinearํ•œ ๋‘ ๋ณ€์ˆ˜๋ฅผ ๋ชจ๋‘ ๊ฐ–๋Š” ๊ฒƒ์€ ์ค‘๋ณต์ด๊ธฐ ๋•Œ๋ฌธ์— ํ•˜๋‚˜๋ฅผ ์ œ๊ฑฐํ•ด์•ผํ•  ํ•„์š”๊ฐ€ ์žˆ๊ธฐ๋„ ํ•จ(๋‹ค์ค‘๊ณต์„ ์„ฑ ๋ฌธ์ œ ๋“ฑ)

  • feature selection์€ ๋ชจ๋ธ ํ•™์Šต ๋ฐ ์ผ๋ฐ˜ํ™”๋ฅผ ์œ„ํ•ด ๋ณ€์ˆ˜๋“ค์„ ์ œ๊ฑฐํ•˜๋Š” ๊ณผ์ •

    • ๋ถˆํ•„์š”ํ•˜๊ฑฐ๋‚˜ ์ค‘๋ณต๋œ ๋ณ€์ˆ˜๋“ค์€ ์ œ๊ฑฐํ•˜๊ณ , ์ค‘์š”ํ•œ ๋ณ€์ˆ˜๋“ค์€ ๋ณด์กดํ•˜๋Š” ๊ฒƒ์ด ๋ชฉ์ 

    • Curse of dimensionality(์ฐจ์›์˜ ์ €์ฃผ): feature์˜ ๊ฐœ์ˆ˜๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์„ ๋•Œ(์ง€๋‚˜์น˜๊ฒŒ ๊ณ ์ฐจ์›์ผ ๋•Œ) ์ƒ๊ธฐ๋Š” ๋ฌธ์ œ

    • ๋ณ€์ˆ˜์˜ ์ˆ˜๊ฐ€ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ๋ณ€์ˆ˜์™€ ๋ชฉํ‘œ๊ฐ’ ์‚ฌ์ด์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ์˜ ์ˆ˜๊ฐ€ ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์œผ๋กœ ์ฆ๊ฐ€

    • Feature์˜ ์ˆ˜๋ฅผ ์ค„์ด๋Š” ๊ฒƒ์€ ๋ชจ๋ธ ํ•™์Šต๊ณผ ๋”๋ถˆ์–ด ์ผ๋ฐ˜ํ™”๋ฅผ ๋„์šธ ์ˆ˜ ์žˆ์Œ

      • ๊ฒฐ์ธก์น˜๋“ค์˜ ๋ฐฑ๋ถ„์œจ์„ ํ™œ์šฉํ•˜์—ฌ ๋Œ€๋ถ€๋ถ„์˜ ๊ฐ’์ด ์กด์žฌํ•˜์ง€ ์•Š๋Š” feature๋“ค์„ ์ œ๊ฑฐํ•  ์ˆ˜ ์žˆ์Œ

      • Gradient Boosting Machine๊ณผ RandomForest ๋ชจ๋ธ๋กœ๋ถ€ํ„ฐ ๋ฐ˜ํ™˜๋œ feature importance๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Œ
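feature importance๋ฅผ ํ™œ์šฉํ•œ๋‹ค๋Š” ์•„์ด๋””์–ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์Šค์ผ€์น˜ํ•ด ๋ณผ ์ˆ˜ ์žˆ์Œ(๋ณธ๋ฌธ์˜ ์‹ค์ œ ๋ชจ๋ธ์ด ์•„๋‹ˆ๋ผ, ์ž„์˜๋กœ ๋งŒ๋“  ๋ฐ์ดํ„ฐ์™€ RandomForest๋ฅผ ์‚ฌ์šฉํ•œ ์ตœ์†Œ ์˜ˆ์‹œ):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# ์ž„์˜์˜ ์˜ˆ์‹œ ๋ฐ์ดํ„ฐ: target์€ 'signal' column์—๋งŒ ์˜์กด
X = pd.DataFrame({'signal': rng.rand(200), 'noise': rng.rand(200)})
y = (X['signal'] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns)
# 'signal'์˜ importance๊ฐ€ 'noise'๋ณด๋‹ค ํ›จ์”ฌ ํฌ๊ฒŒ ๋‚˜์˜ด
print(importances.sort_values(ascending=False))
```

importance๊ฐ€ 0์— ๊ฐ€๊นŒ์šด ๋ณ€์ˆ˜๋“ค์„ ์ œ๊ฑฐ ํ›„๋ณด๋กœ ์‚ผ๋Š” ๊ฒƒ์ด ์œ„ ๊ธ€๋จธ๋ฆฌํ‘œ์—์„œ ๋งํ•˜๋Š” ์ ‘๊ทผ๋ฒ•์ž„.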

3-1. ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ

### column๋ณ„ ๊ฒฐ์ธก์น˜์˜ ๊ฐœ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•œ ํ•จ์ˆ˜

def missing_values_table(df):
        # ๊ฒฐ์ธก์น˜์˜ ์ด ๊ฐœ์ˆ˜
        mis_val = df.isnull().sum()
        
        # ๊ฒฐ์ธก์น˜์˜ ๋น„์œจ
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # ๊ฒฐ๊ณผ ํ…Œ์ด๋ธ”
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis = 1)
        
        # ์ปฌ๋Ÿผ๋ช… ์žฌ์ •์˜
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋ฆผ์ฐจ์ˆœ์œผ๋กœ ์ •๋ ฌ
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending = False).round(1)
        
        # ์š”์•ฝํ†ต๊ณ„๋Ÿ‰ ์ถœ๋ ฅ
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        return mis_val_table_ren_columns
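์œ„ ํ•จ์ˆ˜์˜ ํ•ต์‹ฌ ๊ณ„์‚ฐ ๊ณผ์ •์„ ์ž„์˜๋กœ ๋งŒ๋“  ์ž‘์€ ๋ฐ์ดํ„ฐ๋กœ ํ’€์–ด์„œ ํ™•์ธํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Œ(์š”์•ฝ ์ถœ๋ ฅ๋ฌธ์€ ์ƒ๋žตํ•œ ์Šค์ผ€์น˜):

```python
import numpy as np
import pandas as pd

# ์ž„์˜์˜ ์˜ˆ์‹œ: a๋Š” 75%, b๋Š” 25% ๊ฒฐ์ธก, c๋Š” ๊ฒฐ์ธก ์—†์Œ
toy = pd.DataFrame({'a': [1, np.nan, np.nan, np.nan],
                    'b': [1, 2, np.nan, 4],
                    'c': [1, 2, 3, 4]})

mis_val = toy.isnull().sum()                  # column๋ณ„ ๊ฒฐ์ธก์น˜ ๊ฐœ์ˆ˜
mis_val_percent = 100 * mis_val / len(toy)    # column๋ณ„ ๊ฒฐ์ธก์น˜ ๋น„์œจ
mt = pd.concat([mis_val, mis_val_percent], axis=1)
mt = mt.rename(columns={0: 'Missing Values', 1: '% of Total Values'})
# ๊ฒฐ์ธก์น˜๊ฐ€ ์žˆ๋Š” column๋งŒ ๋‚จ๊ธฐ๊ณ  ๋น„์œจ ๋‚ด๋ฆผ์ฐจ์ˆœ์œผ๋กœ ์ •๋ ฌ
mt = mt[mt.iloc[:, 1] != 0].sort_values('% of Total Values',
                                        ascending=False).round(1)
print(mt)  # a: 75.0, b: 25.0 (c๋Š” ์ œ์™ธ)
```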
missing_train = missing_values_table(train)
missing_train.head(10)
Your selected dataframe has 333 columns.
There are 278 columns that have missing values.
Missing Values % of Total Values
bureau_AMT_ANNUITY_min 227502 74.0
bureau_AMT_ANNUITY_max 227502 74.0
bureau_AMT_ANNUITY_mean 227502 74.0
client_bureau_balance_STATUS_4_count_min 215280 70.0
client_bureau_balance_STATUS_3_count_norm_mean 215280 70.0
client_bureau_balance_MONTHS_BALANCE_count_min 215280 70.0
client_bureau_balance_STATUS_4_count_max 215280 70.0
client_bureau_balance_STATUS_4_count_mean 215280 70.0
client_bureau_balance_STATUS_3_count_norm_min 215280 70.0
client_bureau_balance_STATUS_3_count_norm_max 215280 70.0
  • ๋ˆ„๋ฝ๋œ ๊ฐ’์˜ ๋น„์œจ์ด ๋†’์€ column์ด ์—ฌ๋Ÿฌ ๊ฐœ ์žˆ์Œ์„ ์•Œ ์ˆ˜ ์žˆ์Œ

    • ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ ๋ฐ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์— ์žˆ์–ด 90% ์ด์ƒ์˜ ๋ˆ„๋ฝ๊ฐ’์„ ๊ฐ€์ง„ column๋“ค์„ ์ œ๊ฑฐ
missing_train_vars = list(missing_train.index[missing_train['% of Total Values'] > 90])
len(missing_train_vars)
0
  • ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ๋„ ๋™์ผํ•œ ์ž‘์—… ์ˆ˜ํ–‰

3-2. ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ

# ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/application_test.csv')

# bureau ๋ฐ์ดํ„ฐ์˜ ๊ฐœ์ˆ˜๋“ค์„ ๊ณ„์‚ฐํ•œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„๊ณผ ๋ณ‘ํ•ฉ
test = test.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# bureau ๋ฐ์ดํ„ฐ์˜ ๋Œ€ํ‘œ๊ฐ’๋“ค์„ ๊ณ„์‚ฐํ•œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„๊ณผ ๋ณ‘ํ•ฉ
test = test.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# bureau balance ๋ฐ์ดํ„ฐ์˜ ๊ฐœ์ˆ˜๋“ค์„ ๊ณ„์‚ฐํ•œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„๊ณผ ๋ณ‘ํ•ฉ
test = test.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
print('Shape of Testing Data: ', test.shape)
Shape of Testing Data:  (48744, 332)

3-3. ๋ฐ์ดํ„ฐ ๊ฐœ์ˆ˜ ๋งž์ถฐ์ฃผ๊ธฐ

  • ํ›ˆ๋ จ์šฉ ๋ฐ์ดํ„ฐ์™€ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์ด ๊ฐ™์€ column๋“ค์„ ๊ฐ€์ง€๋„๋ก ๋งž์ถฐ๋ณด์ž.

    • ์—ฌ๊ธฐ์„œ๋Š” ๋ฌธ์ œ๊ฐ€ ๋˜์ง€ ์•Š์ง€๋งŒ, ๋ณ€์ˆ˜๋“ค์„ ์›-ํ•ซ ์ธ์ฝ”๋”ฉํ•  ๋•Œ์—๋Š” ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„๋“ค์ด ๋™์ผํ•œ column์„ ๊ฐ€์ง€๋„๋ก ๋งž์ถฐ์•ผ ํ•จ
train_labels = train['TARGET']

# ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ align
# 'target' column์€ ์ผ๋‹จ ์ œ๊ฑฐํ•˜๊ณ  ๋งž์ถฐ์ฃผ๊ธฐ
train, test = train.align(test, join = 'inner', axis = 1)

train['TARGET'] = train_labels # train data์— ๋Œ€ํ•ด์„œ๋Š” ๋‹ค์‹œ target๊ฐ’ ๊ฒฐํ•ฉ
print('Training Data Shape: ', train.shape)
print('Testing Data Shape: ', test.shape)
Training Data Shape:  (307511, 333)
Testing Data Shape:  (48744, 332)
  • ๋ฐ์ดํ„ฐ ํ˜•ํƒœ๊ฐ€ ํ†ต์ผ๋˜์—ˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

3-4. ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ

### ๊ฒฐ์ธก์น˜ ์ƒํƒœ(?) ํ™•์ธ

missing_test = missing_values_table(test)
missing_test.head(10)
Your selected dataframe has 332 columns.
There are 275 columns that have missing values.
Missing Values % of Total Values
COMMONAREA_MEDI 33495 68.7
COMMONAREA_MODE 33495 68.7
COMMONAREA_AVG 33495 68.7
NONLIVINGAPARTMENTS_MEDI 33347 68.4
NONLIVINGAPARTMENTS_AVG 33347 68.4
NONLIVINGAPARTMENTS_MODE 33347 68.4
FONDKAPREMONT_MODE 32797 67.3
LIVINGAPARTMENTS_MEDI 32780 67.2
LIVINGAPARTMENTS_MODE 32780 67.2
LIVINGAPARTMENTS_AVG 32780 67.2
missing_test_vars = list(missing_test.index[missing_test['% of Total Values'] > 90])
len(missing_test_vars)
0
missing_columns = list(set(missing_test_vars + missing_train_vars))
print('There are %d columns with more than 90%% missing in either the training or testing data.' % len(missing_columns))
There are 0 columns with more than 90% missing in either the training or testing data.
# ๊ฒฐ์ธก์น˜๊ฐ€ ์žˆ๋Š” column ์‚ญ์ œ

train = train.drop(columns = missing_columns)
test = test.drop(columns = missing_columns)
  • 90% ์ด์ƒ ๋ˆ„๋ฝ๋œ ๊ฐ’์„ ๊ฐ€์ง„ column๋“ค์ด ์—†๊ธฐ ๋•Œ๋ฌธ์—, ์ด๋ฒˆ์—๋Š” ์–ด๋– ํ•œ column๋“ค๋„ ์ œ๊ฑฐ๋˜์ง€ ์•Š์•˜์Œ

    • feature selection์„ ์œ„ํ•ด์„œ๋Š” ์•„๋งˆ๋„ ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์„ ์ ์šฉํ•ด์•ผ ํ•  ๊ฒƒ ๊ฐ™์Œ
### ๊ฐ€๊ณต๋œ ๋ฐ์ดํ„ฐ ์ œ์žฅ

train.to_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/train_bureau_raw.csv', index = False)
test.to_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/test_bureau_raw.csv', index = False)

3-5. ์ƒ๊ด€๊ณ„์ˆ˜(Correlations)

  • target๊ณผ ๋ณ€์ˆ˜๋“ค ๊ฐ„์˜ ์ƒ๊ด€๊ณ„์ˆ˜

    • ์ƒˆ๋กญ๊ฒŒ ์ƒ์„ฑ๋œ ๋ณ€์ˆ˜๋“ค ์ค‘ ๊ธฐ์กด ํ›ˆ๋ จ์šฉ ๋ฐ์ดํ„ฐ(application ๋ฐ์ดํ„ฐ์— ์žˆ๋˜) ๋ณ€์ˆ˜๋“ค๋ณด๋‹ค ๋” ๋†’์€ ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๊ฐ€์ง€๋Š” ๋ณ€์ˆ˜๋ฅผ ์ฐพ์„ ์ˆ˜ ์žˆ์Œ
# ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์ƒ์—์„œ ๋ชจ๋“  ๋ณ€์ˆ˜๋“ค ๊ฐ„์˜ ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๊ณ„์‚ฐ

corrs = train.corr()
corrs = corrs.sort_values('TARGET', ascending = False) # ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ

pd.DataFrame(corrs['TARGET'].head(10))
TARGET
TARGET 1.000000
bureau_DAYS_CREDIT_mean 0.089729
client_bureau_balance_MONTHS_BALANCE_min_mean 0.089038
DAYS_BIRTH 0.078239
bureau_CREDIT_ACTIVE_Active_count_norm 0.077356
client_bureau_balance_MONTHS_BALANCE_mean_mean 0.076424
bureau_DAYS_CREDIT_min 0.075248
client_bureau_balance_MONTHS_BALANCE_min_min 0.073225
client_bureau_balance_MONTHS_BALANCE_sum_mean 0.072606
bureau_DAYS_CREDIT_UPDATE_mean 0.068927
pd.DataFrame(corrs['TARGET'].dropna().tail(10))
TARGET
client_bureau_balance_MONTHS_BALANCE_count_min -0.048224
client_bureau_balance_STATUS_C_count_norm_mean -0.055936
client_bureau_balance_STATUS_C_count_max -0.061083
client_bureau_balance_STATUS_C_count_mean -0.062954
client_bureau_balance_MONTHS_BALANCE_count_max -0.068792
bureau_CREDIT_ACTIVE_Closed_count_norm -0.079369
client_bureau_balance_MONTHS_BALANCE_count_mean -0.080193
EXT_SOURCE_1 -0.155317
EXT_SOURCE_2 -0.160472
EXT_SOURCE_3 -0.178919
  • target๊ณผ ๊ฐ€์žฅ ํฐ ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๊ฐ€์ง€๋Š” ๋ณ€์ˆ˜๋Š” ์ƒˆ๋กญ๊ฒŒ ์ƒ์„ฑ๋œ ๋ณ€์ˆ˜

    • ๊ทธ๋Ÿฌ๋‚˜ ์ƒ๊ด€๊ณ„์ˆ˜๊ฐ€ ๋†’๋‹ค๋Š” ๊ฒƒ์ด ๊ทธ ๋ณ€์ˆ˜๊ฐ€ ์œ ์šฉํ•˜๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•˜์ง€๋Š” ์•Š์œผ๋ฉฐ, ์ˆ˜๋ฐฑ ๊ฐœ์˜ ๋ณ€์ˆ˜๋“ค์„ ์ƒ์„ฑํ–ˆ์„ ๊ฒฝ์šฐ์—๋Š”, ๊ทธ์ € random noise ๋•Œ๋ฌธ์— ์ƒ๊ด€๊ด€๊ณ„์— ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ผ ์ˆ˜๋„ ์žˆ์Œ์„ ์ฃผ์˜ํ•ด์•ผ ํ•จ
  • ๋น„ํŒ์ ์œผ๋กœ ์ƒ๊ด€๊ณ„์ˆ˜๋“ค์„ ๋“ค์—ฌ๋‹ค๋ดค์„ ๋•Œ, ๊ทธ๋ž˜๋„ ์ƒˆ๋กญ๊ฒŒ ์ƒ์„ฑ๋œ ๋ช‡๋ช‡ ๋ณ€์ˆ˜๋“ค์€ ์œ ์šฉํ•  ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ž„

    • ๋ณ€์ˆ˜๋“ค์˜ ์œ ์šฉ์„ฑ์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด, ํ•™์Šต๋œ ๋ชจ๋ธ๋กœ๋ถ€ํ„ฐ feature Importance๋ฅผ ์‚ดํŽด๋ณผ ์˜ˆ์ •

    • ์ƒˆ๋กญ๊ฒŒ ์ƒ์„ฑ๋œ ๋ณ€์ˆ˜๋“ค์— ๋Œ€ํ•œ kde ๊ทธ๋ž˜ํ”„๋ฅผ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ์Œ

# ์›๋ฌธ์—๋Š” client_bureau_balance_counts_mean ์„ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์ง€๋งŒ ํ•ด๋‹น ๋ณ€์ˆ˜๊ฐ€ ์—†์–ด,
# 'client_bureau_balance_MONTHS_BALANCE_count_mean'์œผ๋กœ ๋Œ€์ฒด

kde_target(var_name = 'client_bureau_balance_MONTHS_BALANCE_count_mean', df = train)
The correlation between client_bureau_balance_MONTHS_BALANCE_count_mean and the TARGET is -0.0802
Median value for loan that was not repaid = 19.3333
Median value for loan that was repaid =     25.1429

  • ์ด ๋ณ€์ˆ˜๋Š” ๊ฐ ๊ณ ๊ฐ์— ๋Œ€ํ•ด, ๋Œ€์ถœ๋ณ„ ์›”๋ณ„ ๊ธฐ๋ก ๊ฐœ์ˆ˜(count)์˜ ํ‰๊ท ์„ ์˜๋ฏธํ•จ

    • ์˜ˆ๋ฅผ ๋“ค์–ด, ๋งŒ์•ฝ ๊ณ ๊ฐ์˜ ๊ณผ๊ฑฐ ์›”๋ณ„๋กœ 3, 4, 5์˜ ๊ธฐ๋ก์„ ๊ฐ–๊ณ  ์žˆ๋Š” 3๊ฐœ์˜ ๋Œ€์ถœ์„ ๊ฐ–๊ณ  ์žˆ๋‹ค๋ฉด, ์ด ๋ณ€์ˆ˜์— ์žˆ์–ด ๋ฐ์ดํ„ฐ๊ฐ’์€ 4๊ฐ€ ๋จ
  • ๋ถ„ํฌ ๊ทธ๋ž˜ํ”„์— ๊ธฐ์ดˆํ•˜์—ฌ, ๊ณผ๊ฑฐ ์›”๋ณ„ ๊ธฐ๋ก ๊ฐœ์ˆ˜์˜ ํ‰๊ท ์ด ๋†’์€ ์‚ฌ๋žŒ๋“ค์ด Home Credit์—์„œ์˜ ๋Œ€์ถœ์„ ์ž˜ ์ƒํ™˜ํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์ž„

    • ๋” ๋งŽ์€ ์‹ ์šฉ ๊ธฐ๋ก์„ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๊ณ ๊ฐ๋“ค์ด ์ผ๋ฐ˜์ ์œผ๋กœ ๋Œ€์ถœ๊ธˆ์„ ์ƒํ™˜ํ•  ๊ฐ€๋Šฅ์„ฑ์ด ๋” ๋†’๋‹ค๋Š” ๊ฒƒ์„ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Œ
kde_target(var_name='bureau_CREDIT_ACTIVE_Active_count_norm', df=train)
The correlation between bureau_CREDIT_ACTIVE_Active_count_norm and the TARGET is 0.0774
Median value for loan that was not repaid = 0.5000
Median value for loan that was repaid =     0.3636

  • ํ•ด๋‹น ๋ณ€์ˆ˜๋Š” CREDIT_ACTIVE์˜ ๊ฐ’์ด โ€˜Activeโ€™์ธ ๊ฒƒ์˜ ๊ฐœ์ˆ˜๋ฅผ ์ „์ฒด ๋Œ€์ถœ์˜ ๊ฐœ์ˆ˜๋กœ ๋‚˜๋ˆˆ ๊ฐ’

  • ํ•ด๋‹น ๋ณ€์ˆ˜์˜ ๊ฒฝ์šฐ ๋ชจ๋“  ๊ณณ์—์„œ ๋ถˆ๊ทœ์น™์ ์ž„

  • ์ƒ๊ด€๊ด€๊ณ„ ๋˜ํ•œ ๋งค์šฐ ๋‚ฎ์Œ

๐Ÿ“Œ Collinear Variables

  • target๊ณผ ๋ณ€์ˆ˜ ๊ฐ„์˜ ์ƒ๊ด€๊ณ„์ˆ˜๋งŒ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ, ๊ฐ ๋ณ€์ˆ˜ ๊ฐ„ ์ƒ๊ด€๊ณ„์ˆ˜๊นŒ์ง€ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Œ

    • ์ด๋ฅผ ํ†ตํ•ด ์ œ๊ฑฐํ•ด์•ผ ํ•  ์ˆ˜๋„ ์žˆ๋Š” collinear ๊ด€๊ณ„๋“ค์„ ๊ฐ€์ง€๋Š” ๋ณ€์ˆ˜๋“ค์ด ์žˆ๋Š”์ง€ ์—ฌ๋ถ€๋ฅผ ์•Œ๋ ค์คŒ
  • 0.8 ์ด์ƒ์˜ ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๊ฐ€์ง€๋Š” ๋ณ€์ˆ˜๋“ค์„ ์ฐพ์•„๋ณด์ž.

# ์ž„๊ณ„๊ฐ’ ์„ค์ •
threshold = 0.8

# ์ƒ๊ด€๊ณ„์ˆ˜๊ฐ€ ๋†’์€ ๋ณ€์ˆ˜๋“ค์„ ์ €์žฅํ•˜๊ธฐ ์œ„ํ•œ ๋นˆ dictionary ์ƒ์„ฑ
above_threshold_vars = {}

# ๊ฐ๊ฐ์˜ ์นผ๋Ÿผ๋งˆ๋‹ค ์ž„๊ณ„์น˜ ์ด์ƒ์˜ ๋ณ€์ˆ˜๋“ค์„ ์ €์žฅ
for col in corrs:
    above_threshold_vars[col] = list(corrs.index[corrs[col] > threshold])

์ƒ๊ด€๊ณ„์ˆ˜๊ฐ€ ๋†’์€ ๋ณ€์ˆ˜๋“ค ์ค‘ 1๊ฐœ์˜ ๋ณ€์ˆ˜๋งŒ ์ œ๊ฑฐ

# ์ œ๊ฑฐํ•  column๋“ค ๋ฐ ์ด๋ฏธ ๊ฒ€์‚ฌ๋œ column๋“ค์˜ ๋ชฉ๋ก์„ ์ €์žฅํ•˜๊ธฐ ์œ„ํ•œ list ์ƒ์„ฑ
cols_to_remove = []
cols_seen = []
cols_to_remove_pair = []

for key, value in above_threshold_vars.items():
    # ์ด๋ฏธ ๊ฒ€์‚ฌ๋œ column ์ €์žฅ
    cols_seen.append(key)
    for x in value:
        if x == key:
            # ์ž๊ธฐ ์ž์‹ ๊ณผ์˜ ์ƒ๊ด€๊ณ„์ˆ˜(= 1)๋Š” ๊ฑด๋„ˆ๋œ€
            continue
        else:
            # ๊ฐ ์Œ ์ค‘ ํ•˜๋‚˜์˜ column๋งŒ์„ ์ œ๊ฑฐ
            if x not in cols_seen:
                cols_to_remove.append(x)
                cols_to_remove_pair.append(key)
            
cols_to_remove = list(set(cols_to_remove))
print('Number of columns to remove: ', len(cols_to_remove))
Number of columns to remove:  134
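์œ„์™€ ๊ฐ™์€ '์Œ๋‹น ํ•œ ๊ฐœ๋งŒ ์ œ๊ฑฐ' ๋กœ์ง์€ ์ž„์˜๋กœ ๋งŒ๋“  ์ž‘์€ ๋ฐ์ดํ„ฐ๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Œ(๋ณ€์ˆ˜ ๊ฐ’์€ ๊ฐ€์ •):

```python
import pandas as pd

# ์ž„์˜์˜ ์˜ˆ์‹œ: b๋Š” a์˜ 2๋ฐฐ(์™„์ „ ์ƒ๊ด€), c๋Š” a, b์™€ ์ƒ๊ด€์ด ๋‚ฎ์Œ
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [2, 4, 6, 8],
                   'c': [4, 3, 9, 1]})
corrs = df.corr()
threshold = 0.8

above_threshold_vars = {col: list(corrs.index[corrs[col] > threshold])
                        for col in corrs}

cols_to_remove, cols_seen = [], []
for key, value in above_threshold_vars.items():
    cols_seen.append(key)
    for x in value:
        # ์ž๊ธฐ ์ž์‹ ์€ ๊ฑด๋„ˆ๋›ฐ๊ณ , ๊ฐ ์Œ์—์„œ ๋‚˜์ค‘์— ๋“ฑ์žฅํ•œ ์ชฝ๋งŒ ์ œ๊ฑฐ ๋Œ€์ƒ์— ์ถ”๊ฐ€
        if x != key and x not in cols_seen:
            cols_to_remove.append(x)

cols_to_remove = list(set(cols_to_remove))
print(cols_to_remove)  # ['b'] — (a, b) ์Œ์—์„œ b๋งŒ ์ œ๊ฑฐ ๋Œ€์ƒ
```

(a, b) ๋‘ ๋ณ€์ˆ˜๊ฐ€ collinearํ•˜๋”๋ผ๋„ ๋‘˜ ๋‹ค ์ œ๊ฑฐํ•˜์ง€ ์•Š๊ณ  ํ•˜๋‚˜๋งŒ ์ œ๊ฑฐํ•œ๋‹ค๋Š” ์ ์„ ๋ณด์—ฌ์ฃผ๋Š” ์˜ˆ์‹œ์ž„.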
### ์ƒ๊ด€๋„๊ฐ€ ๋†’์€ ๋ณ€์ˆ˜๋“ค์„ ์ œ๊ฑฐ

train_corrs_removed = train.drop(columns = cols_to_remove)
test_corrs_removed = test.drop(columns = cols_to_remove)

print('Training Corrs Removed Shape: ', train_corrs_removed.shape)
print('Testing Corrs Removed Shape: ', test_corrs_removed.shape)
Training Corrs Removed Shape:  (307511, 199)
Testing Corrs Removed Shape:  (48744, 198)
train_corrs_removed.to_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/train_bureau_corrs_removed.csv', index = False)
test_corrs_removed.to_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/3แ„Œแ…ฎแ„Žแ…ก/data/test_bureau_corrs_removed.csv', index = False)

4. ๋ชจ๋ธ๋ง(Modeling)

  • ํ•ด๋‹น ๋ถ€๋ถ„์€ ์›๋ฌธ์„ ๊ทธ๋Œ€๋กœ ํƒ‘์žฌํ•˜์˜€์Šต๋‹ˆ๋‹ค.

  • ๋ชจ๋ธ์€ LGBM์„ ํ™œ์šฉ + ๊ต์ฐจ ๊ฒ€์ฆ

  • ์œ„์—์„œ ๊ฐ€๊ณตํ•œ ๋ฐ์ดํ„ฐ๋“ค๋กœ ๋ชจ๋ธ์„ ํ•™์Šต/์˜ˆ์ธก/ํ‰๊ฐ€ ํ›„ ์„ฑ๋Šฅ ๋น„๊ต

import numpy as np
import pandas as pd

import lightgbm as lgb

from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

import gc

import matplotlib.pyplot as plt
def model(features, test_features, encoding = 'ohe', n_folds = 5):
    
    """Train and test a light gradient boosting model using
    cross validation. 
    
    Parameters
    --------
        features (pd.DataFrame): 
            dataframe of training features to use 
            for training a model. Must include the TARGET column.
        test_features (pd.DataFrame): 
            dataframe of testing features to use
            for making predictions with the model. 
        encoding (str, default = 'ohe'): 
            method for encoding categorical variables. Either 'ohe' for one-hot encoding or 'le' for integer label encoding
        n_folds (int, default = 5): number of folds to use for cross validation
        
    Return
    --------
        submission (pd.DataFrame): 
            dataframe with `SK_ID_CURR` and `TARGET` probabilities
            predicted by the model.
        feature_importances (pd.DataFrame): 
            dataframe with the feature importances from the model.
        valid_metrics (pd.DataFrame): 
            dataframe with training and validation metrics (ROC AUC) for each fold and overall.
        
    """
    
    # Extract the ids
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    
    # Extract the labels for training
    labels = features['TARGET']
    
    # Remove the ids and target
    features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
    test_features = test_features.drop(columns = ['SK_ID_CURR'])
    
    
    # One Hot Encoding
    if encoding == 'ohe':
        features = pd.get_dummies(features)
        test_features = pd.get_dummies(test_features)
        
        # Align the dataframes by the columns
        features, test_features = features.align(test_features, join = 'inner', axis = 1)
        
        # No categorical indices to record
        cat_indices = 'auto'
    
    # Integer label encoding
    elif encoding == 'le':
        
        # Create a label encoder
        label_encoder = LabelEncoder()
        
        # List for storing categorical indices
        cat_indices = []
        
        # Iterate through each column
        for i, col in enumerate(features):
            if features[col].dtype == 'object':
                # Fit on the union of train and test values so categories
                # unseen in training do not raise an error at transform time
                label_encoder.fit(pd.concat([features[col], test_features[col]]).astype(str))
                features[col] = label_encoder.transform(features[col].astype(str))
                test_features[col] = label_encoder.transform(test_features[col].astype(str))

                # Record the categorical indices
                cat_indices.append(i)
    
    # Catch error if label encoding scheme is not valid
    else:
        raise ValueError("Encoding must be either 'ohe' or 'le'")
        
    print('Training Data Shape: ', features.shape)
    print('Testing Data Shape: ', test_features.shape)
    
    # Extract feature names
    feature_names = list(features.columns)
    
    # Convert to np arrays
    features = np.array(features)
    test_features = np.array(test_features)
    
    # Create the kfold object (shuffle must be True for random_state to apply)
    k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)
    
    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))
    
    # Empty array for test predictions
    test_predictions = np.zeros(test_features.shape[0])
    
    # Empty array for out of fold validation predictions
    out_of_fold = np.zeros(features.shape[0])
    
    # Lists for recording validation and training scores
    valid_scores = []
    train_scores = []
    
    # Iterate through each fold
    for train_indices, valid_indices in k_fold.split(features):
        
        # Training data for the fold
        train_features, train_labels = features[train_indices], labels[train_indices]
        # Validation data for the fold
        valid_features, valid_labels = features[valid_indices], labels[valid_indices]
        
        # Create the model
        model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary', 
                                   class_weight = 'balanced', learning_rate = 0.05, 
                                   reg_alpha = 0.1, reg_lambda = 0.1, 
                                   subsample = 0.8, n_jobs = -1, random_state = 50)
        
        # Train the model with early stopping on the validation set
        # (recent LightGBM versions take these as callbacks, not fit kwargs)
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = cat_indices,
                  callbacks = [lgb.early_stopping(100), lgb.log_evaluation(200)])
        
        # Record the best iteration
        best_iteration = model.best_iteration_
        
        # Record the feature importances
        feature_importance_values += model.feature_importances_ / k_fold.n_splits
        
        # Make predictions
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
        
        # Record the out of fold predictions
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        
        # Record the best score
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']
        
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        
        # Clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        
    # Make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    
    # Make the feature importance dataframe
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
    
    # Overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)
    
    # Add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    
    # Needed for creating dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    
    # Dataframe of validation scores
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores}) 
    
    return submission, feature_importances, metrics
def plot_feature_importances(df):
    """
    Plot importances returned by a model. This can work with any measure of
    feature importance provided that higher importance is better. 
    
    Args:
        df (dataframe): feature importances. Must have the features in a column
        called `feature` and the importances in a column called `importance`.
        
    Returns:
        shows a plot of the 15 most important features
        
        df (dataframe): feature importances sorted by importance (highest to lowest)
        with a column for normalized importance
    """
    
    # Sort features according to importance
    df = df.sort_values('importance', ascending = False).reset_index()
    
    # Normalize the feature importances to add up to one
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()
    
    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:15]))), 
            df['importance_normalized'].head(15), 
            align = 'center', edgecolor = 'k')
    
    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))
    
    # Plot labeling
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()
    
    return df

Control

The first step in any experiment is establishing a control. For this we will use the function defined above (that implements a Gradient Boosting Machine model) and the single main data source (application).

train_control = pd.read_csv('../input/application_train.csv')
test_control = pd.read_csv('../input/application_test.csv')

Fortunately, once we have taken the time to write a function, using it is simple (if thereโ€™s a central theme in this notebook, itโ€™s use functions to make things simpler and reproducible!). The function above returns a submission dataframe we can upload to the competition, a fi dataframe of feature importances, and a metrics dataframe with validation and test performance.

submission, fi, metrics = model(train_control, test_control)
metrics

The control slightly overfits because the training score is higher than the validation score. We can address this in later notebooks when we look at regularization (we already perform some regularization in this model by using reg_lambda and reg_alpha as well as early stopping).
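The train-validation gap mentioned here can be read directly off the metrics dataframe that `model` returns. A minimal sketch with illustrative numbers (only the column names follow the function above):

```python
import pandas as pd

# Illustrative metrics in the shape returned by model()
metrics = pd.DataFrame({'fold': [0, 1, 'overall'],
                        'train': [0.82, 0.81, 0.815],
                        'valid': [0.76, 0.76, 0.760]})

# Train AUC minus validation AUC; a large positive gap signals overfitting
metrics['gap'] = metrics['train'] - metrics['valid']
print(metrics[['fold', 'gap']])
```

A gap that grows as we add features is a hint to strengthen regularization or prune features.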

We can visualize the feature importance with another function, plot_feature_importances. The feature importances may be useful when itโ€™s time for feature selection.

fi_sorted = plot_feature_importances(fi)
submission.to_csv('control.csv', index = False)

The control scores 0.745 when submitted to the competition.

Test One

Letโ€™s conduct the first test. We will just need to pass in the data to the function, which does most of the work for us.

submission_raw, fi_raw, metrics_raw = model(train, test)
metrics_raw

Based on these numbers, the engineered features perform better than the control case. However, we will have to submit the predictions to the leaderboard before we can say if this better validation performance transfers to the testing data.

fi_raw_sorted = plot_feature_importances(fi_raw)

Examining the feature importances, it looks as if a few of the features we constructed are among the most important. Let’s find the percentage of the top 100 most important features that we made in this notebook. Rather than compare against the raw original features, we need to compare against the one-hot encoded original features; these are already recorded for us in fi (from the control run on the original data).

top_100 = list(fi_raw_sorted['feature'])[:100]
new_features = [x for x in top_100 if x not in list(fi['feature'])]

print('%% of Top 100 Features created from the bureau data = %d.00' % len(new_features))

Over half of the top 100 features were made by us! That should give us confidence that all the hard work we did was worthwhile.

submission_raw.to_csv('test_one.csv', index = False)

Test one scores 0.759 when submitted to the competition.

Test Two

That was easy, so letโ€™s do another run! Same as before but with the highly collinear variables removed.
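The collinearity filter was built earlier in the kernel; as a reminder, the idea can be sketched roughly as follows (the helper name and the 0.8 threshold are illustrative, not the kernel's exact code):

```python
import numpy as np
import pandas as pd

def remove_collinear(df, threshold=0.8):
    """Drop one column from every pair whose absolute
    Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy example: 'b' is an exact multiple of 'a', so it is dropped
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
reduced = remove_collinear(df)
```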

submission_corrs, fi_corrs, metrics_corr = model(train_corrs_removed, test_corrs_removed)
metrics_corr

These results are better than the control, but slightly lower than the raw features.

fi_corrs_sorted = plot_feature_importances(fi_corrs)
submission_corrs.to_csv('test_two.csv', index = False)

Test Two scores 0.753 when submitted to the competition.

Results

After all that work, we can say that including the extra information did improve performance! The model is definitely not optimized to our data, but we still had a noticeable improvement over the original dataset when using the calculated features. Letโ€™s officially summarize the performances:

| Experiment | Train AUC | Validation AUC | Test AUC |
|------------|-----------|----------------|----------|
| Control    | 0.815     | 0.760          | 0.745    |
| Test One   | 0.837     | 0.767          | 0.759    |
| Test Two   | 0.826     | 0.765          | 0.753    |

(Note that these scores may change from run to run of the notebook. I have not observed that the general ordering changes however.)

All of our hard work translates to a small improvement of 0.014 ROC AUC over the original testing data. Removing the highly collinear variables slightly decreases performance so we will want to consider a different method for feature selection. Moreover, we can say that some of the features we built are among the most important as judged by the model.
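One alternative selection method is to keep only the features that account for most of the cumulative importance. A minimal sketch, assuming the dataframe shape returned by `plot_feature_importances` (the 95% cutoff and the toy values are illustrative):

```python
import pandas as pd

# Toy importances in the shape returned by model()
fi = pd.DataFrame({'feature': ['f1', 'f2', 'f3', 'f4'],
                   'importance': [40.0, 30.0, 20.0, 10.0]})

fi = fi.sort_values('importance', ascending=False).reset_index(drop=True)
fi['cumulative'] = (fi['importance'] / fi['importance'].sum()).cumsum()

# Keep the smallest prefix of features covering 95% of total importance
keep = fi.loc[fi['cumulative'] <= 0.95, 'feature'].tolist()
```

Unlike a collinearity filter, this directly uses what the model found useful, at the cost of depending on one trained model's importances.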

In a competition such as this, even an improvement of this size is enough to move us up 100s of spots on the leaderboard. By making numerous small improvements such as in this notebook, we can gradually achieve better and better performance. I encourage others to use the results here to make their own improvements, and I will continue to document the steps I take to help others.

Next Steps

Going forward, we can now use the functions we developed in this notebook on the other datasets. There are still 4 other data files to use in our model! In the next notebook, we will incorporate the information from these other data files (which contain information on previous loans at Home Credit) into our training data. Then we can build the same model and run more experiments to determine the effect of our feature engineering. There is plenty more work to be done in this competition, and plenty more gains in performance to be had! Iโ€™ll see you in the next notebook.