[ECC DS Week 1] Introduction to Ensembling/Stacking in Python
0. Introduction
- Explains how to ensemble basic learning models, with a focus on stacking (see the reference sketch below).
- The predictions of a few base classifiers are used as a first level (the base), and a second-level model is then trained to predict the target from the outputs of those first-level predictions.
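For reference, recent scikit-learn versions also ship a built-in StackingClassifier that implements the same two-level idea this notebook constructs by hand; a minimal sketch (the estimator choices here are illustrative, not the ones used below):
### (Reference) sklearn's built-in stacking
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# First level: base classifiers; second level: a meta-model trained on their
# out-of-fold predictions (cv=5 produces those predictions internally).
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier()), ('svc', SVC())],
    final_estimator=LogisticRegression(),
    cv=5,
)
# stack.fit(X, y); stack.predict(X_new)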
### Import libraries
import pandas as pd
import numpy as np
import re
import sklearn
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.offline as py
import plotly.io as pio
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import warnings
warnings.filterwarnings('ignore')
### The 5 base models used for stacking
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC
from sklearn.model_selection import KFold # import path updated to match the current sklearn version
1. Feature Exploration, Engineering, and Preprocessing
- Explore the data -> look for possible feature-engineering approaches.
- Encode all categorical features numerically.
### Data Loading
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/1주차/data/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/1주차/data/test.csv')
# Save PassengerId separately for easy access later
PassengerId = test['PassengerId']
train.head(3)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
1-1. Feature Engineering
### Full data
full_data = [train, test]
### Create new features
# Name length
train['Name_length'] = train['Name'].apply(len)
test['Name_length'] = test['Name'].apply(len)
# Whether the passenger has a cabin
train['Has_Cabin'] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
# FamilySize = SibSp + Parch
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
# Passengers traveling alone
for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
# CategoricalAge: bin ages into groups
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size = age_null_count) # random ages within one std of the mean
    dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list # .loc avoids chained-assignment issues
    dataset['Age'] = dataset['Age'].astype(int)
train['CategoricalAge'] = pd.cut(train['Age'], 5)
# Title
# A function that extracts the title from a passenger's name
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search: # if a title was found
        return title_search.group(1)
    return ""
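A quick check of the helper on the first training name shown earlier:
print(get_title('Braund, Mr. Owen Harris'))  # -> Mr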
for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)
# Group uncommon titles into 'Rare'
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don',
                                                 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
### Impute missing values
# Embarked: fill with 'S' (the most common port)
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
# Fare: fill with the median, then split into 4 quartile bins
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)
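For reference, the hard-coded Fare thresholds used in the encoding below (7.91, 14.454, 31) are the quartile edges produced by pd.qcut here; a quick check (the exact edges depend on the training data):
print(train['CategoricalFare'].cat.categories)
# roughly: [(-0.001, 7.91], (7.91, 14.454], (14.454, 31.0], (31.0, 512.329]]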
### Categorical features -> numerical features
# using the map function
for dataset in full_data:
    # Sex
    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
    # Title
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
    # Embarked
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
    # Fare (thresholds are the CategoricalFare quartile edges)
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    # Age
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
### Feature selection
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
train = train.drop(drop_elements, axis = 1)
train = train.drop(['CategoricalAge', 'CategoricalFare'], axis = 1)
test = test.drop(drop_elements, axis = 1)
- All features are now numeric
  - a format suitable for feeding into machine learning models.
1-2. Visualization
### Check the data
train.head(3)
| | Survived | Pclass | Sex | Age | Parch | Fare | Embarked | Name_length | Has_Cabin | FamilySize | IsAlone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 23 | 0 | 2 | 0 | 1 |
| 1 | 1 | 1 | 0 | 2 | 0 | 3 | 1 | 51 | 1 | 2 | 0 | 3 |
| 2 | 1 | 3 | 0 | 1 | 0 | 1 | 0 | 22 | 0 | 1 | 1 | 2 |
### Pearson correlation heatmap
colormap = plt.cm.RdBu
plt.figure(figsize = (14,12))
plt.title('Pearson Correlation of Features', y = 1.05, size = 15)
sns.heatmap(train.astype(float).corr(),linewidths=0.1,vmax=1.0,
square=True, cmap=colormap, linecolor='white', annot=True)
- Not many features are strongly correlated with one another.
  - This implies there is little redundant or superfluous data in the training set.
  - Each feature carries largely unique information.
- The correlation between FamilySize and Parch is fairly high.
  - For now, all features are kept for the analysis.
### pairplot
# Observe the distribution of the data from one feature to another
g = sns.pairplot(train[['Survived', 'Pclass', 'Sex', 'Age', 'Parch',
                        'Fare', 'Embarked', 'FamilySize', 'Title']],
                 hue='Survived', palette='seismic', height=1.2,  # 'size' was renamed to 'height' in newer seaborn
                 diag_kind='kde', diag_kws=dict(fill=True),      # 'shade' was renamed to 'fill'
                 plot_kws=dict(s=10))
g.set(xticklabels=[])
2. Ensembling & Stacking Models
2-1. Using a Python Class
- Use Python classes for convenience.
- A class:
  - extends code for creating objects and implements functionality and methods specific to that class;
  - removes duplication of identical methods.
- Create a SklearnHelper class
  - that extends the built-in methods (train, predict, fit, etc.) common to all sklearn classifiers.
# Define parameters
ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0 # for reproducibility
NFOLDS = 5 # number of cross-validation folds
kf = KFold(n_splits = NFOLDS, shuffle = True, random_state = SEED)
# A class to extend sklearn classifiers
class SklearnHelper(object):
    # constructor (initializes the class object)
    def __init__(self, clf, seed=0, params=None):
        params['random_state'] = seed
        self.clf = clf(**params)

    # train
    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    # predict
    def predict(self, x):
        return self.clf.predict(x)

    # fit
    def fit(self, x, y):
        return self.clf.fit(x, y)

    # feature importances (returned as well as printed, so the result can be stored)
    def feature_importances(self, x, y):
        importances = self.clf.fit(x, y).feature_importances_
        print(importances)
        return importances
2-2. Out-of-Fold (OOF) Predictions
- Stacking uses the predictions of the base classifiers (the first-level predictions) as input for training the second-level model.
- However, we cannot simply train the base models on the full training data, generate predictions on the full test set, and output these for second-level training:
  - the base model predictions would then already have seen the test set, so there is a risk of overfitting.
⇒ (Additional note) KFold, CV, and OOF
- KFold CV: one method of cross-validation (a methodological viewpoint).
- OOF: a prediction technique that ensembles each model's per-fold outputs from KFold to obtain better predictions.
### Function for out-of-fold (OOF) predictions
def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))           # OOF predictions on the training set
    oof_test = np.zeros((ntest,))             # fold-averaged predictions on the test set
    oof_test_skf = np.empty((NFOLDS, ntest))  # per-fold predictions on the test set

    # for each fold
    for i, (train_index, test_index) in enumerate(kf.split(x_train)):  # split on x_train itself, not the global DataFrame
        ### split the training data
        x_tr = x_train[train_index]  # fold-train features
        y_tr = y_train[train_index]  # fold-train labels
        x_te = x_train[test_index]   # fold-validation features

        clf.train(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)

    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
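As a sanity check on the mechanics: each base model contributes an (ntrain, 1) column of out-of-fold train predictions and an (ntest, 1) column of fold-averaged test predictions. A sketch (left commented because rf, x_train, y_train, and x_test are only defined in sections 3-2 and 3-3):
# oof_tr, oof_te = get_oof(rf, x_train, y_train, x_test)
# assert oof_tr.shape == (ntrain, 1)
# assert oof_te.shape == (ntest, 1)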
3. Building the First-Level Models
- Models used:
  - RandomForestClassifier
  - ExtraTreesClassifier
  - AdaBoost
  - GradientBoostingClassifier
  - Support Vector Machine (SVM)
3-1. Parameters
- n_jobs:
  - the number of cores used for the training process;
  - -1 uses all available cores.
- n_estimators:
  - the number of classification trees in the learning model;
  - defaults to 10 (in older sklearn versions).
- max_depth:
  - the maximum depth of a tree, i.e., how far nodes may be expanded;
  - set too high, it can grow the tree too deep and risk overfitting.
- verbose:
  - controls whether text is printed during the training process;
  - 0 suppresses all output, while 3 prints the tree-learning process at every iteration.
### Define the parameters
# RandomForest parameters
rf_params = {
'n_jobs': -1,
'n_estimators': 500,
'warm_start': True,
#'max_features': 0.2,
'max_depth': 6,
'min_samples_leaf': 2,
'max_features' : 'sqrt',
'verbose': 0
}
# ExtraTrees Parameters
et_params = {
'n_jobs': -1,
'n_estimators':500,
#'max_features': 0.5,
'max_depth': 8,
'min_samples_leaf': 2,
'verbose': 0
}
# AdaBoost parameters
ada_params = {
'n_estimators': 500,
'learning_rate' : 0.75
}
# GradientBoosting parameters
gb_params = {
'n_estimators': 500,
#'max_features': 0.2,
'max_depth': 5,
'min_samples_leaf': 2,
'verbose': 0
}
# Support Vector Classifier parameters
svc_params = {
'kernel' : 'linear',
'C' : 0.025
}
3-2. Creating the Model Objects
- Using the SklearnHelper class
rf = SklearnHelper(clf = RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf = ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf = AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf = GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf = SVC, seed=SEED, params=svc_params)
3-3. Preparing the Train/Test Sets
y_train = train['Survived'].ravel()
train = train.drop(['Survived'], axis=1)
x_train = train.values # Creates an array of the train data
x_test = test.values # Creates an array of the test data
np.ravel(a, order='C')
- Returns a contiguous flattened array (illustrated below).
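A tiny illustration of row-major ('C' order) flattening:
a = np.array([[1, 2], [3, 4]])
print(np.ravel(a, order='C'))  # -> [1 2 3 4]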
3-4. Generating the First-Level Predictions
- These are used later as new features.
et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost
gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier
print("Training is complete")
Training is complete
3-5. Feature Importance
- According to the sklearn docs, most classifiers expose a feature_importances_ attribute that returns the feature importances.
- Check the importances of the features produced by each classifier.
⇒ Feature importances of each classifier
rf_feature = rf.feature_importances(x_train,y_train)
print()
et_feature = et.feature_importances(x_train, y_train)
print()
ada_feature = ada.feature_importances(x_train, y_train)
print()
gb_feature = gb.feature_importances(x_train,y_train)
print()
[0.10479793 0.20555629 0.03619519 0.0206513 0.0474984 0.02867349 0.12913927 0.04875588 0.07241203 0.01115511 0.29516512]
[0.11750552 0.38233278 0.02881993 0.01696399 0.05547146 0.02834788 0.04725486 0.08408039 0.04599613 0.02178079 0.17144627]
[0.028 0.01 0.018 0.068 0.038 0.01 0.692 0.012 0.056 0. 0.068]
[0.0893909 0.01072666 0.05164544 0.01271789 0.04953828 0.02331858 0.17598644 0.03803139 0.1119689 0.00617294 0.43050258]
### Store the attribute values (copied from an earlier run's printed output)
rf_features = [0.10474135, 0.21837029, 0.04432652, 0.02249159, 0.05432591, 0.02854371
,0.07570305, 0.01088129 , 0.24247496, 0.13685733 , 0.06128402]
et_features = [ 0.12165657, 0.37098307 ,0.03129623 , 0.01591611 , 0.05525811 , 0.028157
,0.04589793 , 0.02030357 , 0.17289562 , 0.04853517, 0.08910063]
ada_features = [0.028 , 0.008 , 0.012 , 0.05866667, 0.032 , 0.008
,0.04666667 , 0. , 0.05733333, 0.73866667, 0.01066667]
gb_features = [ 0.06796144 , 0.03889349 , 0.07237845 , 0.02628645 , 0.11194395, 0.04778854
,0.05965792 , 0.02774745, 0.07462718, 0.4593142 , 0.01340093]
### Build a dataframe for visualization
cols = train.columns.values # column names
feature_dataframe = pd.DataFrame({
'features': cols,
'Random Forest feature importances': rf_features,
'Extra Trees feature importances': et_features,
'AdaBoost feature importances': ada_features,
'Gradient Boost feature importances': gb_features
})
### Interactive plots via Plotly
# Scatter plots
### RandomForest
trace = go.Scatter(
y = feature_dataframe['Random Forest feature importances'].values,
x = feature_dataframe['features'].values,
mode = 'markers',
marker = dict(
sizemode = 'diameter',
sizeref = 1,
size = 25,
# size= feature_dataframe['AdaBoost feature importances'].values,
#color = np.random.randn(500), #set color equal to a variable
color = feature_dataframe['Random Forest feature importances'].values,
colorscale = 'Portland',
showscale = True
),
text = feature_dataframe['features'].values
)
data = [trace]
layout = go.Layout(
autosize = True,
title = 'Random Forest Feature Importance',
hovermode = 'closest',
# xaxis= dict(
# title= 'Pop',
# ticklen= 5,
# zeroline= False,
# gridwidth= 2,
# ),
yaxis = dict(
title= 'Feature Importance',
ticklen= 5,
gridwidth= 2
),
showlegend= False
)
fig = go.Figure(data=data, layout=layout)
# py.iplot(fig,filename='scatter2010')
pio.write_html(fig,file = 'rf.html',auto_open = True)
fig.show(renderer='colab') # <-- needed for the plot to display in Colab
### ExtraTreesClassifier
trace = go.Scatter(
y = feature_dataframe['Extra Trees feature importances'].values,
x = feature_dataframe['features'].values,
mode='markers',
marker=dict(
sizemode = 'diameter',
sizeref = 1,
size = 25,
# size= feature_dataframe['AdaBoost feature importances'].values,
#color = np.random.randn(500), #set color equal to a variable
color = feature_dataframe['Extra Trees feature importances'].values,
colorscale='Portland',
showscale=True
),
text = feature_dataframe['features'].values
)
data = [trace]
layout= go.Layout(
autosize= True,
title= 'Extra Trees Feature Importance',
hovermode= 'closest',
# xaxis= dict(
# title= 'Pop',
# ticklen= 5,
# zeroline= False,
# gridwidth= 2,
# ),
yaxis=dict(
title= 'Feature Importance',
ticklen= 5,
gridwidth= 2
),
showlegend= False
)
fig = go.Figure(data=data, layout=layout)
# py.iplot(fig,filename='scatter2010')
pio.write_html(fig,file = 'et.html',auto_open = True)
fig.show(renderer='colab') # <-- needed for the plot to display in Colab
### AdaBoost
trace = go.Scatter(
y = feature_dataframe['AdaBoost feature importances'].values,
x = feature_dataframe['features'].values,
mode = 'markers',
marker=dict(
sizemode = 'diameter',
sizeref = 1,
size = 25,
# size= feature_dataframe['AdaBoost feature importances'].values,
#color = np.random.randn(500), #set color equal to a variable
color = feature_dataframe['AdaBoost feature importances'].values,
colorscale='Portland',
showscale=True
),
text = feature_dataframe['features'].values
)
data = [trace]
layout= go.Layout(
autosize= True,
title= 'AdaBoost Feature Importance',
hovermode= 'closest',
# xaxis= dict(
# title= 'Pop',
# ticklen= 5,
# zeroline= False,
# gridwidth= 2,
# ),
yaxis=dict(
title= 'Feature Importance',
ticklen= 5,
gridwidth= 2
),
showlegend= False
)
fig = go.Figure(data=data, layout=layout)
# py.iplot(fig,filename='scatter2010')
pio.write_html(fig,file = 'ada.html',auto_open = True)
fig.show(renderer='colab') # <-- needed for the plot to display in Colab
### GradientBoosting
trace = go.Scatter(
y = feature_dataframe['Gradient Boost feature importances'].values,
x = feature_dataframe['features'].values,
mode='markers',
marker=dict(
sizemode = 'diameter',
sizeref = 1,
size = 25,
# size= feature_dataframe['AdaBoost feature importances'].values,
#color = np.random.randn(500), #set color equal to a variable
color = feature_dataframe['Gradient Boost feature importances'].values,
colorscale='Portland',
showscale=True
),
text = feature_dataframe['features'].values
)
data = [trace]
layout= go.Layout(
autosize= True,
title= 'Gradient Boosting Feature Importance',
hovermode= 'closest',
# xaxis= dict(
# title= 'Pop',
# ticklen= 5,
# zeroline= False,
# gridwidth= 2,
# ),
yaxis=dict(
title= 'Feature Importance',
ticklen= 5,
gridwidth= 2
),
showlegend= False
)
fig = go.Figure(data=data, layout=layout)
# py.iplot(fig,filename='scatter2010')
pio.write_html(fig,file = 'gb.html',auto_open = True)
fig.show(renderer='colab') # <-- needed for the plot to display in Colab
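The four scatter-plot cells above differ only in the dataframe column and the title, so the repetition could be factored into a small helper; a minimal sketch (the name plot_importance_scatter is ours, not from the original notebook):
### (Sketch) helper to replace the four near-identical scatter cells
def plot_importance_scatter(col, title, filename):
    trace = go.Scatter(
        y = feature_dataframe[col].values,
        x = feature_dataframe['features'].values,
        mode = 'markers',
        marker = dict(sizemode = 'diameter', sizeref = 1, size = 25,
                      color = feature_dataframe[col].values,
                      colorscale = 'Portland', showscale = True),
        text = feature_dataframe['features'].values
    )
    layout = go.Layout(autosize = True, title = title, hovermode = 'closest',
                       yaxis = dict(title = 'Feature Importance', ticklen = 5, gridwidth = 2),
                       showlegend = False)
    fig = go.Figure(data=[trace], layout=layout)
    pio.write_html(fig, file = filename, auto_open = True)
    fig.show(renderer='colab')

# e.g. plot_importance_scatter('Random Forest feature importances',
#                              'Random Forest Feature Importance', 'rf.html')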
⇒ Compute the mean of all feature importances
feature_dataframe['mean'] = feature_dataframe.mean(axis=1, numeric_only=True)
# axis=1: row-wise --> for each feature, average the values across the classifier columns
# (numeric_only=True skips the non-numeric 'features' column; required in newer pandas)
feature_dataframe.head(3)
| | features | Random Forest feature importances | Extra Trees feature importances | AdaBoost feature importances | Gradient Boost feature importances | mean |
|---|---|---|---|---|---|---|
| 0 | Pclass | 0.104741 | 0.121657 | 0.028 | 0.067961 | 0.080590 |
| 1 | Sex | 0.218370 | 0.370983 | 0.008 | 0.038893 | 0.159062 |
| 2 | Age | 0.044327 | 0.031296 | 0.012 | 0.072378 | 0.040000 |
### Bar plot of the mean importances
y = feature_dataframe['mean'].values
x = feature_dataframe['features'].values
data = [go.Bar(
x = x,
y = y,
width = 0.5,
marker = dict(
color = feature_dataframe['mean'].values,
colorscale='Portland',
showscale=True,
reversescale = False
),
opacity=0.6
)]
layout = go.Layout(
autosize = True,
title= 'Barplots of Mean Feature Importance',
hovermode= 'closest',
# xaxis= dict(
# title= 'Pop',
# ticklen= 5,
# zeroline= False,
# gridwidth= 2,
# ),
yaxis = dict(
title= 'Feature Importance',
ticklen= 5,
gridwidth= 2
),
showlegend = False
)
fig = go.Figure(data=data, layout=layout)
# py.iplot(fig,filename='scatter2010')
pio.write_html(fig,file = 'barplot.html',auto_open = True)
fig.show(renderer='colab') # <-- needed for the plot to display in Colab
4. Second-Level Predictions
⭐ First-level outputs as new features
- Build a new feature set so that the first-level predictions can be used as training data for the next classifier.
- The next classifier is trained on the first-level predictions of the earlier classifiers, which become its new columns.
base_predictions_train = pd.DataFrame({
'RandomForest': rf_oof_train.ravel(),
'ExtraTrees': et_oof_train.ravel(),
'AdaBoost': ada_oof_train.ravel(),
'GradientBoost': gb_oof_train.ravel()
})
base_predictions_train.head()
| | RandomForest | ExtraTrees | AdaBoost | GradientBoost |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 1.0 | 0.0 | 1.0 | 1.0 |
| 3 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 |
4-1. Correlation Heatmap of the Second-Level Training Set
data = [
go.Heatmap(
z = base_predictions_train.astype(float).corr().values ,
x = base_predictions_train.columns.values,
y = base_predictions_train.columns.values,
colorscale='Viridis',
showscale=True,
reversescale = True
)
]
layout = go.Layout(
autosize = True,
title= 'Correlation Heatmap of the Second Level Training set',
hovermode= 'closest',
# xaxis= dict(
# title= 'Pop',
# ticklen= 5,
# zeroline= False,
# gridwidth= 2,
# ),
# yaxis = dict(
# title= 'Feature Importance',
# ticklen= 5,
# gridwidth= 2
# ),
# showlegend = False
)
fig = go.Figure(data=data, layout=layout)
# py.iplot(data, filename='labelled-heatmap')
pio.write_html(fig,file = 'heatmap.html',auto_open = True)
fig.show(renderer='colab') # <-- needed for the plot to display in Colab
- The less correlated the features are with one another, the better the resulting score tends to be.
x_train = np.concatenate(( et_oof_train, rf_oof_train, ada_oof_train,
gb_oof_train, svc_oof_train), axis=1)
x_test = np.concatenate(( et_oof_test, rf_oof_test, ada_oof_test,
gb_oof_test, svc_oof_test), axis=1)
- The first-level train and test predictions are stored as x_train and x_test,
  - so a second-level model can now be fitted to them (a quick shape check follows below).
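A quick sanity check of the stacked shapes (the row counts shown assume the standard Titanic split of 891 train and 418 test rows):
### (Sketch) shape check of the second-level inputs
print(x_train.shape, x_test.shape)  # one column per base model -> expected (891, 5) (418, 5)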
4-2. Second-Level Learning with XGBoost
- XGBoost:
  - built to optimize large-scale boosted-tree algorithms.
gbm = xgb.XGBClassifier(
#learning_rate = 0.02,
n_estimators= 2000,
max_depth= 4,
min_child_weight= 2,
#gamma=1,
gamma=0.9,
subsample=0.8,
colsample_bytree=0.8,
objective= 'binary:logistic',
nthread= -1,
scale_pos_weight=1).fit(x_train, y_train)
predictions = gbm.predict(x_test)
📌 XGBoost parameters
- max_depth:
  - how deep the trees are allowed to grow;
  - setting the value too high carries a risk of overfitting.
- gamma:
  - the minimum loss reduction required to make a further partition on a leaf node of the tree;
  - the larger it is, the more conservative the algorithm.
- eta:
  - the step-size shrinkage used at each boosting step to prevent overfitting.
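Before submitting, it can be worth estimating the meta-model's accuracy on the stacked training features. A minimal sketch using sklearn's cross_val_score (not part of the original notebook; the estimate is somewhat optimistic because the OOF columns were derived from the same labels):
### (Sketch) cross-validated accuracy of the second-level model
from sklearn.model_selection import cross_val_score

scores = cross_val_score(gbm, x_train, y_train, cv=5, scoring='accuracy')  # x_train = stacked OOF matrix
print(scores.mean(), scores.std())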
5. Submission
- Generate the final predictions for submission to the competition.
StackingSubmission = pd.DataFrame({
'PassengerId': PassengerId,
'Survived': predictions })
StackingSubmission.to_csv("StackingSubmission.csv", index = False)
6. Conclusion
- Additional approaches worth trying:
  - Implement a suitable cross-validation strategy when training the models to find the optimal parameter values (see the sketch below).
  - The more uncorrelated the features fed to the final model are with one another, the better the final score.
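As one concrete way to act on the first suggestion, the second-level XGBoost's hyperparameters could be tuned with sklearn's GridSearchCV; a minimal sketch (the grid values are illustrative, not tuned):
### (Sketch) hyperparameter search for the second-level model
from sklearn.model_selection import GridSearchCV

param_grid = {  # illustrative grid, not tuned values
    'max_depth': [2, 4, 6],
    'n_estimators': [500, 1000, 2000],
    'learning_rate': [0.01, 0.02, 0.05],
}
search = GridSearchCV(xgb.XGBClassifier(objective='binary:logistic'),
                      param_grid, cv=kf, scoring='accuracy')
search.fit(x_train, y_train)  # x_train here is the stacked first-level matrix
print(search.best_params_, search.best_score_)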