0. Introduction

  • Explains how to ensemble base learning models (with a focus on stacking)

  • The predictions of a few base classifiers are used as the first (base) level, and a second-level model is then trained to predict from the outputs of those first-level predictions

### Import libraries
import pandas as pd
import numpy as np
import re
import sklearn
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.offline as py
import plotly.io as pio
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

import warnings
warnings.filterwarnings('ignore')

### The five base models used for stacking
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC
from sklearn.model_selection import KFold # module path updated to match the installed sklearn version

1. Variable Exploration, Engineering, and Preprocessing

  • Explore the data -> identify possible feature-engineering steps

  • Convert (encode) all categorical features into numeric form

### Data Loading
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/1แ„Œแ…ฎแ„Žแ…ก/data/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/1แ„Œแ…ฎแ„Žแ…ก/data/test.csv')

# Store PassengerId separately for easy access later
PassengerId = test['PassengerId']

train.head(3)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S

1-1. Feature Engineering

### Full dataset
full_data = [train, test]
### Create new features

# Name length
train['Name_length'] = train['Name'].apply(len)
test['Name_length'] = test['Name'].apply(len)

# Whether the passenger has a cabin
train['Has_Cabin'] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

# FamilySize = SibSp + Parch + 1 (the passenger themselves)
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

# ํ˜ผ์ž์ธ ์Šน๊ฐ
for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

# CategoricalAge: age bands
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size = age_null_count) # random ages within one std of the mean
    dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list # .loc avoids pandas chained-assignment issues
    dataset['Age'] = dataset['Age'].astype(int)
train['CategoricalAge'] = pd.cut(train['Age'], 5)
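
The Age cut-points hard-coded later (16/32/48/64) match these bins; a quick way to inspect them (a sketch, not in the original notebook):
print(train['CategoricalAge'].cat.categories) # e.g. (-0.08, 16], (16, 32], (32, 48], (48, 64], (64, 80]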

# Title
# Function that extracts the title from a passenger's name
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name) # raw string avoids an invalid-escape warning
    if title_search: # if a title was found
        return title_search.group(1)
    return ""

for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

# Merge uncommon titles into 'Rare'
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 
                                                 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
### Handling missing values

# Embarked: fill with 'S'
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')

# Fare: fill with the median, then split into 4 quartile bins
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)
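
The Fare thresholds hard-coded below (7.91, 14.454, 31) are simply these quartile edges; to inspect them (a sketch, not in the original notebook):
print(train['CategoricalFare'].cat.categories) # e.g. (-0.001, 7.91], (7.91, 14.454], (14.454, 31.0], (31.0, 512.329]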
### Categorical variables -> numeric variables
# using the map function

for dataset in full_data:
    # Sex
    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
    
    # Title
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
    
    # Embarked
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
    
    # Fare
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    
    # Age
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
### Feature selection

drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']

train = train.drop(drop_elements, axis = 1)
train = train.drop(['CategoricalAge', 'CategoricalFare'], axis = 1)

test  = test.drop(drop_elements, axis = 1)
  • feature๋“ค์ด ๋ชจ๋‘ ์ˆซ์žํ˜•

    • ๊ธฐ๊ณ„ํ•™์Šต์— ์ ํ•ฉํ•œ ํ˜•์‹
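
As a quick sanity check (a minimal sketch, not part of the original notebook), we can assert that no string columns remain:
assert train.select_dtypes(include='object').empty # all remaining features are numeric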

1-2. Visualization

### Inspecting the data

train.head(3)
Survived Pclass Sex Age Parch Fare Embarked Name_length Has_Cabin FamilySize IsAlone Title
0 0 3 1 1 0 0 0 23 0 2 0 1
1 1 1 0 2 0 3 1 51 1 2 0 3
2 1 3 0 1 0 1 0 22 0 1 1 2
### ์ƒ๊ด€๊ณ„์ˆ˜(ํ”ผ์–ด์“ด ์ƒ๊ด€๊ณ„์ˆ˜) ์‹œ๊ฐํ™”

colormap = plt.cm.RdBu
plt.figure(figsize = (14,12))
plt.title('Pearson Correlation of Features', y = 1.05, size = 15)
sns.heatmap(train.astype(float).corr(),linewidths=0.1,vmax=1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)
<Axes: title={'center': 'Pearson Correlation of Features'}>

  • Not many features are strongly correlated with one another

    • This suggests the train set does not contain much redundant or superfluous data

    • Each feature carries its own unique information

  • The correlation between the FamilySize and Parch variables is fairly high

    • For now, both are kept in the analysis (see the quick check below)
### pairplot
# Observe how the data is distributed from one feature to another
g = sns.pairplot(train[[u'Survived', u'Pclass', u'Sex', u'Age', u'Parch', 
                        u'Fare', u'Embarked', u'FamilySize', u'Title']], 
                 hue='Survived', palette = 'seismic', height = 1.2, # `size` was renamed to `height` in newer seaborn
                 diag_kind = 'kde', diag_kws=dict(shade=True), plot_kws=dict(s=10) )
g.set(xticklabels=[])
<seaborn.axisgrid.PairGrid at 0x7f5b5cf21fa0>

2. ์•™์ƒ๋ธ” &์Šคํƒœํ‚น ๋ชจ๋ธ๋“ค

2-1. Using a Python Class

  • A Python class is used for added convenience

  • Classes:

    • Extend the code for creating objects, and implement specific functionality and methods on that class

    • Remove the duplication that comes from writing the same methods repeatedly

  • Write a SklearnHelper class

    • Extends the built-in methods (train, predict, fit, etc.) common to all sklearn classifiers
# Define parameters
ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0 # for reproducibility
NFOLDS = 5 # number of cross-validation folds
kf = KFold(n_splits = NFOLDS, shuffle = True, random_state = SEED)

# Class to extend the sklearn classifiers
class SklearnHelper(object):
    # constructor (initializes the class object)
    def __init__(self, clf, seed=0, params=None):
        params['random_state'] = seed
        self.clf = clf(**params)
    # train
    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)
    # predict
    def predict(self, x):
        return self.clf.predict(x)
    # fit
    def fit(self,x,y):
        return self.clf.fit(x,y)
    # feature importances (note: prints rather than returns the values)
    def feature_importances(self,x,y):
        print(self.clf.fit(x,y).feature_importances_)

2-2. Out-of-Fold Predictions

  • Stacking uses the predictions of the base classifiers (the first-level predictions) as input for training a second-level model

  • However, one cannot simply train the base models on the full training data, generate predictions on the full test set, and then output these for the second-level training

    • The base-model predictions would already have "seen" the data they are predicting, so there is a risk of overfitting

➕ (Additional note) KFold CV vs. OOF

  • KFold CV: one method of cross-validation (a methodological viewpoint)

  • OOF: a prediction technique that ensembles each model's outputs across the KFold splits to obtain better predictions

(Figure: illustration of KFold splits and out-of-fold predictions)

### Function for OOF predictions

def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,)) # out-of-fold predictions on the train set
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))

    # for each fold
    for i, (train_index, test_index) in enumerate(kf.split(x_train)): # split on the function argument, not the global DataFrame
        ### split the train data
        x_tr = x_train[train_index] # X_train
        y_tr = y_train[train_index] # y_train
        x_te = x_train[test_index] # X_valid

        clf.train(x_tr, y_tr) 

        oof_train[test_index] = clf.predict(x_te) 
        oof_test_skf[i, :] = clf.predict(x_test)

    oof_test[:] = oof_test_skf.mean(axis=0) # average the test-set predictions over the folds
    
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

3. Generating the First-level Models

  • Models used

    • RandomForestClassifier

    • ExtraTreesClassifier

    • AdaBoost

    • GradientBoostingClassifier

    • Support Vector Machine(SVM)

3-1. Parameters

  • n_jobs:

    • The number of cores used for the training process

    • Setting it to -1 uses all cores

  • n_estimators:

    • The number of classification trees in the learning model

    • The default was 10 in older scikit-learn versions (100 since 0.22)

  • max_depth:

    • The maximum depth of a tree, i.e. how far nodes may be expanded

    • Setting it too high risks overfitting, since the trees are grown too deep

  • verbose:

    • Controls whether text is printed during the learning process

    • A value of 0 suppresses all text

    • A value of 3 prints the tree-learning process at every iteration

### Defining the parameters

# RandomForest parameters
rf_params = {
    'n_jobs': -1,
    'n_estimators': 500,
     'warm_start': True, 
     #'max_features': 0.2,
    'max_depth': 6,
    'min_samples_leaf': 2,
    'max_features' : 'sqrt',
    'verbose': 0
}

# ExtraTrees Parameters
et_params = {
    'n_jobs': -1,
    'n_estimators':500,
    #'max_features': 0.5,
    'max_depth': 8,
    'min_samples_leaf': 2,
    'verbose': 0
}

# AdaBoost parameters
ada_params = {
    'n_estimators': 500,
    'learning_rate' : 0.75
}

# GradientBoosting parameters
gb_params = {
    'n_estimators': 500,
     #'max_features': 0.2,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}

# Support Vector Classifier parameters 
svc_params = {
    'kernel' : 'linear',
    'C' : 0.025
}

3-2. Creating the Model Objects

  • Using the SklearnHelper class
rf = SklearnHelper(clf = RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf = ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf = AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf = GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf = SVC, seed=SEED, params=svc_params)

3-3. Preparing the Train/Test Sets

y_train = train['Survived'].ravel()
train = train.drop(['Survived'], axis=1)
x_train = train.values # creates an array of the train data
x_test = test.values # creates an array of the test data

np.ravel(a, order='C')

  • Returns a contiguous flattened array

3-4. Running the First-level Predictions

  • These outputs are used later as new features
et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost 
gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier

print("Training is complete")
Training is complete

3-5. Feature Importance

  • According to the sklearn documentation, most of the classifiers expose a feature_importances_ attribute

    • Returns the importance of each feature

  • Examine the importance of the features produced by each classifier (the linear-kernel SVC does not expose this attribute, so it is skipped)

■ Feature importance of each classifier

rf_feature = rf.feature_importances(x_train,y_train)
print()

et_feature = et.feature_importances(x_train, y_train)
print()

ada_feature = ada.feature_importances(x_train, y_train)
print()

gb_feature = gb.feature_importances(x_train,y_train)
print()
[0.10479793 0.20555629 0.03619519 0.0206513  0.0474984  0.02867349
 0.12913927 0.04875588 0.07241203 0.01115511 0.29516512]

[0.11750552 0.38233278 0.02881993 0.01696399 0.05547146 0.02834788
 0.04725486 0.08408039 0.04599613 0.02178079 0.17144627]

[0.028 0.01  0.018 0.068 0.038 0.01  0.692 0.012 0.056 0.    0.068]

[0.0893909  0.01072666 0.05164544 0.01271789 0.04953828 0.02331858
 0.17598644 0.03803139 0.1119689  0.00617294 0.43050258]

### Store the attribute values (copied from an earlier run, so they differ slightly from the output above)

rf_features = [0.10474135, 0.21837029, 0.04432652, 0.02249159, 0.05432591, 0.02854371,
               0.07570305, 0.01088129, 0.24247496, 0.13685733, 0.06128402]
et_features = [0.12165657, 0.37098307, 0.03129623, 0.01591611, 0.05525811, 0.028157,
               0.04589793, 0.02030357, 0.17289562, 0.04853517, 0.08910063]
ada_features = [0.028, 0.008, 0.012, 0.05866667, 0.032, 0.008,
                0.04666667, 0., 0.05733333, 0.73866667, 0.01066667]
gb_features = [0.06796144, 0.03889349, 0.07237845, 0.02628645, 0.11194395, 0.04778854,
               0.05965792, 0.02774745, 0.07462718, 0.4593142, 0.01340093]
### ์‹œ๊ฐํ™”๋ฅผ ์œ„ํ•ด dataframe ์ƒ์„ฑ

cols = train.columns.values # ์ปฌ๋Ÿผ๋ช… ์ง€์ •

feature_dataframe = pd.DataFrame({
    'features': cols,
    'Random Forest feature importances': rf_features,
    'Extra Trees  feature importances': et_features,
    'AdaBoost feature importances': ada_features,
    'Gradient Boost feature importances': gb_features
})
### Interactive plots via Plotly
# Scatter plot 

### RandomForest
trace = go.Scatter(
    y = feature_dataframe['Random Forest feature importances'].values,
    x = feature_dataframe['features'].values,
    mode = 'markers',
    marker = dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
        # size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['Random Forest feature importances'].values,
        colorscale = 'Portland',
        showscale = True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout = go.Layout(
    autosize =  True,
    title = 'Random Forest Feature Importance',
    hovermode = 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis = dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
# py.iplot(fig,filename='scatter2010')
pio.write_html(fig,file = 'rf.html',auto_open = True)
fig.show(renderer='colab') # <-- required for the plot to render in Colab

(Plot: Random Forest feature importance)

### ExtraTreesClassifier
trace = go.Scatter(
    y = feature_dataframe['Extra Trees  feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
        # size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['Extra Trees  feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Extra Trees Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
# py.iplot(fig,filename='scatter2010')
pio.write_html(fig,file = 'et.html',auto_open = True)
fig.show(renderer='colab') # <-- required for the plot to render in Colab

(Plot: Extra Trees feature importance)

### AdaBoost
trace = go.Scatter(
    y = feature_dataframe['AdaBoost feature importances'].values,
    x = feature_dataframe['features'].values,
    mode = 'markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
        # size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['AdaBoost feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'AdaBoost Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
# py.iplot(fig,filename='scatter2010')
pio.write_html(fig,file = 'ada.html',auto_open = True)
fig.show(renderer='colab') # <-- required for the plot to render in Colab

(Plot: AdaBoost feature importance)

### GradientBoosting
trace = go.Scatter(
    y = feature_dataframe['Gradient Boost feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
#       size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['Gradient Boost feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Gradient Boosting Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
# py.iplot(fig,filename='scatter2010')
pio.write_html(fig,file = 'gb.html',auto_open = True)
fig.show(renderer='colab') # <-- required for the plot to render in Colab

(Plot: Gradient Boosting feature importance)

โ–  ๋ชจ๋“  feature ์ค‘์š”๋„์˜ ํ‰๊ท  ๊ณ„์‚ฐ

feature_dataframe['mean'] = feature_dataframe.mean(axis = 1, numeric_only = True) # numeric_only skips the string 'features' column (required in newer pandas)
# axis=1: row-wise --> for each feature, average its importances across the models

feature_dataframe.head(3)
features Random Forest feature importances Extra Trees feature importances AdaBoost feature importances Gradient Boost feature importances mean
0 Pclass 0.104741 0.121657 0.028 0.067961 0.080590
1 Sex 0.218370 0.370983 0.008 0.038893 0.159062
2 Age 0.044327 0.031296 0.012 0.072378 0.040000
### Bar-chart visualization

y = feature_dataframe['mean'].values
x = feature_dataframe['features'].values

data = [go.Bar(
            x = x,
            y = y,
            width = 0.5,
            marker = dict(
                color = feature_dataframe['mean'].values,
              colorscale='Portland',
              showscale=True,
              reversescale = False
            ),
            opacity=0.6
        )]

layout = go.Layout(
    autosize = True,
    title= 'Barplots of Mean Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis = dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend = False
)
fig = go.Figure(data=data, layout=layout)
# py.iplot(fig,filename='scatter2010')
pio.write_html(fig,file = 'barplot.html',auto_open = True)
fig.show(renderer='colab') # <-- required for the plot to render in Colab

(Plot: bar chart of mean feature importance)

4. Running the Second-level Predictions

⭐ The first-level output becomes the new features

  • Build a new set of features so that the first-level predictions can be used as training data for the next classifier

    • The first-level predictions of the earlier classifiers become new columns, on which the next classifier is trained
base_predictions_train = pd.DataFrame({
    'RandomForest': rf_oof_train.ravel(),
    'ExtraTrees': et_oof_train.ravel(),
    'AdaBoost': ada_oof_train.ravel(),
    'GradientBoost': gb_oof_train.ravel()
    })

base_predictions_train.head()
RandomForest ExtraTrees AdaBoost GradientBoost
0 0.0 0.0 0.0 0.0
1 1.0 1.0 1.0 1.0
2 1.0 0.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 0.0 0.0 0.0 0.0

4-1. Correlation Heatmap of the Second-level Training Set

data = [
    go.Heatmap(
        z = base_predictions_train.astype(float).corr().values ,
        x = base_predictions_train.columns.values,
        y = base_predictions_train.columns.values,
        colorscale='Viridis',
        showscale=True,
        reversescale = True
    )
]

layout = go.Layout(
    autosize = True,
    title= 'Correlation Heatmap of the Second Level Training set',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    # yaxis = dict(
    #     title= 'Feature Importance',
    #     ticklen= 5,
    #     gridwidth= 2
    # ),
    # showlegend = False
)
fig = go.Figure(data=data, layout=layout)
# py.iplot(data, filename='labelled-heatmap')
pio.write_html(fig,file = 'heatmap.html',auto_open = True)
fig.show(renderer='colab') # <-- required for the plot to render in Colab

(Plot: correlation heatmap of the second-level training set)

  • The less correlated the base-model outputs are with one another, the better the final score tends to be.

x_train = np.concatenate(( et_oof_train, rf_oof_train, ada_oof_train, 
                          gb_oof_train, svc_oof_train), axis=1)
x_test = np.concatenate(( et_oof_test, rf_oof_test, ada_oof_test, 
                         gb_oof_test, svc_oof_test), axis=1)
  • The first-level train and test predictions are stored as x_train and x_test

    • A second-level model can now be fit on them (shape check below)
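
A quick shape check (a sketch, not in the original notebook):
print(x_train.shape, x_test.shape) # expected: (891, 5) (418, 5) -- one OOF column per base model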

4-2. Second-level Learning via XGBoost

  • XGBoost:

    • Built to optimize large-scale boosted tree algorithms
gbm = xgb.XGBClassifier(
    #learning_rate = 0.02,
    n_estimators= 2000,
    max_depth= 4,
    min_child_weight= 2,
    #gamma=1,
    gamma=0.9,                        
    subsample=0.8,
    colsample_bytree=0.8,
    objective= 'binary:logistic',
    nthread= -1,
    scale_pos_weight=1).fit(x_train, y_train)

predictions = gbm.predict(x_test)

๐Ÿ“Œ XGBoost์˜ parameters

  • max_depth:

    • ํŠธ๋ฆฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊นŠ๊ฒŒ ํ‚ค์šฐ๊ณ  ์‹ถ์€์ง€

    • ๊ฐ’์„ ๋„ˆ๋ฌด ๋†’๊ฒŒ ์„ค์ •ํ•˜๋ฉด ๊ณผ์ ํ•ฉ์˜ ์œ„ํ—˜์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Œ

  • gamma:

    • ํŠธ๋ฆฌ์˜ ๋ฆฌํ”„ ๋…ธ๋“œ์—์„œ ์ถ”๊ฐ€์ ์ธ ํŒŒํ‹ฐ์…˜์„ ๋งŒ๋“œ๋Š” ๋ฐ ํ•„์š”ํ•œ ์ตœ์†Œํ•œ์˜ loss

    • ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ํด์ˆ˜๋ก ๋” ๋ณด์ˆ˜์ ์ž„

  • eta:

    • ๊ณผ์ ํ•ฉ์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ ๋ถ€์ŠคํŒ… ๋‹จ๊ณ„์—์„œ ์‚ฌ์šฉ๋˜๋Š” ๋‹จ๊ณ„ ํฌ๊ธฐ ์ˆ˜์ถ•

5. Submission

  • Run the final predictions for the competition submission
StackingSubmission = pd.DataFrame({
    'PassengerId': PassengerId,
    'Survived': predictions })

StackingSubmission.to_csv("StackingSubmission.csv", index = False)

6. Conclusion

  • ์ถ”๊ฐ€์ ์œผ๋กœ ์•„๋ž˜์˜ ๋ฐฉ๋ฒ•๋“ค์„ ์‹œ๋„ํ•ด ๋ณผ ์ˆ˜ ์žˆ์Œ

    • ์ตœ์ ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ฐ’์„ ์ฐพ๊ธฐ ์œ„ํ•ด ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๋Š” ๋ฐ ์ ํ•ฉํ•œ ๊ต์ฐจ ๊ฒ€์ฆ ์ „๋žต ๊ตฌํ˜„

    • feature๋“ค์ด ์„œ๋กœ uncorrelated ํ• ์ˆ˜๋ก ์ตœ์ข… ์ ์ˆ˜๊ฐ€ ์ข‹์Œ

๐Ÿ“šReference

ํƒœ๊ทธ: , ,

์นดํ…Œ๊ณ ๋ฆฌ:

์—…๋ฐ์ดํŠธ: