0. ํ”„๋กœ์ ํŠธ ์†Œ๊ฐœ

  • ๋…ธํŠธ๋ถ์˜ ๋ชฉํ‘œ: ์˜ˆ์ธก ๋ชจ๋ธ๋ง ๋ฌธ์ œ์—์„œ ์›Œํฌ ํ”Œ๋กœ์šฐ๊ฐ€ ์–ด๋–ป๊ฒŒ ์ž‘๋™ํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ์•„์ด๋””์–ด๋ฅผ ์ œ๊ณต

  • feature ํ™•์ธ ๋ฐฉ๋ฒ•, ์ƒˆ๋กœ์šด feature์˜ ์ถ”๊ฐ€ ๋ฐ Machine Learning ๊ฐœ๋… ์ ์šฉ

1. Exploratory Data Analysis(EDA)

### Import Libraries

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # ํ˜„์žฌ ์ฝ”๋žฉ์—์„œ๋Š” 0.12.0 ๋ฒ„์ „

plt.style.use('fivethirtyeight')

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

โŒ Version Issue

  • seaborn version issue๋กœ ์ธํ•ด ์›๋ณธ ๋…ธํŠธ๋ถ์—์„œ ์ผ๋ถ€ ํ•จ์ˆ˜ ๋ณ€๊ฒฝ(factorplot -> pointplot, catplot)

  • plotting์„ ์ˆ˜ํ–‰ํ•  ๋•Œ ๋ช‡๋ช‡ ํ•จ์ˆ˜๋Š” ์œ„์น˜ ๋งค๊ฐœ๋ณ€์ˆ˜ ์ง€์ •์ด ์ž˜ ๋˜์ง€ x -> ํ‚ค์›Œ๋“œ ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ๋ณ€๊ฒฝ

### ๋ฐ์ดํ„ฐ ์ค€๋น„ํ•˜๊ธฐ

data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48๊ธฐ ๋ฐ๊ณผB/1์ฃผ์ฐจ/data/train.csv')
### ์ผ๋ถ€ ๋ฐ์ดํ„ฐ ํ™•์ธ

data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
### ๊ฒฐ์ธก์น˜ ํ™•์ธ

data.isnull().sum() 
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
  • Age, Cabin, Embarked์— ๊ฒฐ์ธก์น˜๊ฐ€ ์กด์žฌํ•œ๋‹ค.

1-1. Features

### Survived

f,ax = plt.subplots(1,2,figsize = (18,8))

data['Survived'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Survived')
ax[0].set_ylabel('')

sns.countplot(x = 'Survived', data = data, ax = ax[1])
ax[1].set_title('Survived')

plt.show()

  • ์‚ฌ๊ณ ์—์„œ ์‚ด์•„๋‚จ์€ ์Šน๊ฐ(Survived = 1)์ด ๋งŽ์ง€ ์•Š์€ ๊ฒƒ์€ ๋ถ„๋ช…ํ•˜๋‹ค.

    • 891๋ช…์˜ ์Šน๊ฐ ์ค‘ 350๋ช…๋งŒ์ด ์‚ด์•„๋‚จ์Œ

๐Ÿ“Œ Feature์˜ ์ข…๋ฅ˜

1. ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜(Categorical Features)

  • ๋‘ ๊ฐœ ์ด์ƒ์˜ ๋ฒ”์ฃผ๊ฐ€ ์žˆ๋Š” ๋ณ€์ˆ˜

    • ํ•ด๋‹น ํ”ผ์ณ์˜ ๊ฐ ๊ฐ’์€ ๋ฒ”์ฃผ๋ณ„๋กœ ๋ถ„๋ฅ˜ํ•  ์ˆ˜ ์žˆ์Œ
  • ๋ณ€์ˆ˜๋ฅผ ๋ถ„๋ฅ˜ํ•˜๊ฑฐ๋‚˜ ์ˆœ์„œ๋ฅผ ์ง€์ •ํ•  ์ˆ˜ ์—†์Œ

  • Sex, Embarked๊ฐ€ ํ•ด๋‹น๋จ

2. ์ˆœ์„œํ˜• ๋ณ€์ˆ˜(Ordinal Features)

  • ๋ฒ”์ฃผํ˜• ๊ฐ’๊ณผ ์œ ์‚ฌํ•˜์ง€๋งŒ ๊ฐ’ ์‚ฌ์ด์— ์ƒ๋Œ€์  ์ˆœ์„œ ๋˜๋Š” ์ •๋ ฌ์ด ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ์ ์ด ์ฐจ์ด

    • ex> ๋†’์ด(๋†’์ด), ์ค‘๊ฐ„(์ค‘๊ฐ„), ์งง์€ ๊ฐ’๊ณผ ๊ฐ™์€ ๊ธฐ๋Šฅ์ด ์žˆ๋Š” ๊ฒฝ์šฐ ๋†’์ด๋Š” ์ˆœ์„œํ˜• ๋ณ€์ˆ˜
  • ๋ณ€์ˆ˜์— ์ƒ๋Œ€์ ์ธ ์ •๋ ฌ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Œ

  • PClass๊ฐ€ ํ•ด๋‹น๋จ

3. ์—ฐ์†ํ˜• ๋ณ€์ˆ˜(Continous Feature)

  • feature๊ฐ€ ๋‘ ์  ์‚ฌ์ด ๋˜๋Š” feature column์˜ ์ตœ์†Œ๊ฐ’ ๋˜๋Š” ์ตœ๋Œ€๊ฐ’ ์‚ฌ์ด์˜ ๊ฐ’์„ ์ทจํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ

  • Age๊ฐ€ ํ•ด๋‹น๋จ

1-2. ๋‹ค์–‘ํ•œ feature๋“ค ์‚ฌ์ด์˜ ๊ด€๊ณ„

Sex

  • Categorical ๋ณ€์ˆ˜
data.groupby(['Sex','Survived'])['Survived'].count()
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
f,ax = plt.subplots(1,2,figsize = (18,8))

data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax = ax[0])
ax[0].set_title('Survived vs Sex')

sns.countplot(x = 'Sex',hue = 'Survived',data = data,ax = ax[1])
ax[1].set_title('Sex:Survived vs Dead')

plt.show()

  • ๋ฐฐ์— ํƒ€๊ณ  ์žˆ๋˜ ๋‚จ์„ฑ์˜ ์ˆ˜๋Š” ์—ฌ์„ฑ์˜ ์ˆ˜๋ณด๋‹ค ํ›จ์”ฌ ๋งŽ๋‹ค.

    • ๊ทธ๋Ÿฌ๋‚˜ ๊ตฌ์กฐ๋œ ์—ฌ์„ฑ์˜ ์ˆ˜๋Š” ๊ตฌ์กฐ๋œ ๋‚จ์„ฑ์˜ ๊ฑฐ์˜ ๋‘ ๋ฐฐ์ž„
  • ์—ฌ์„ฑ์˜ ์ƒ์กด์œจ์€ ์•ฝ 75%์ธ ๋ฐ˜๋ฉด ๋‚จ์„ฑ์˜ ์ƒ์กด์œจ์€ ์•ฝ 18~19%์ž„

  • Sex๋Š” ๋ชจ๋ธ๋ง ์‹œ ๊ต‰์žฅํžˆ ์ค‘์š”ํ•œ ๋ณ€์ˆ˜์ธ ๊ฒƒ ๊ฐ™์Œ

PClass

  • ์ˆœ์„œํ˜• ๋ณ€์ˆ˜
pd.crosstab(data.Pclass,data.Survived,margins = True).style.background_gradient(cmap = 'summer_r')
Survived 0 1 All
Pclass      
1 80 136 216
2 97 87 184
3 372 119 491
All 549 342 891
f,ax = plt.subplots(1,2,figsize = (18,8))

data['Pclass'].value_counts().plot.bar(color=['#CD7F32','#FFDF00','#D3D3D3'],ax=ax[0])
ax[0].set_title('Number Of Passengers By Pclass')
ax[0].set_ylabel('Count')

sns.countplot(x = 'Pclass',hue='Survived',data=data,ax=ax[1])
ax[1].set_title('Pclass:Survived vs Dead')

plt.show()

  • PClass๊ฐ€ 1์ธ ์Šน๊ฐ๋“ค์ด ์šฐ์„ ์ ์œผ๋กœ ๊ตฌ์กฐ๋˜์—ˆ์Œ์„ ์ง์ž‘ํ•  ์ˆ˜ ์žˆ์Œ

    • PClass๊ฐ€ 3์ธ ์Šน๊ฐ ์ˆ˜๊ฐ€ ํ›จ์”ฌ ๋” ๋งŽ์•˜์ง€๋งŒ, ์—ฌ์ „ํžˆ ๊ทธ๋“ค ์ค‘ ์ƒ์กด์ž ์ˆ˜๋Š” 25% ์ •๋„๋กœ ๋งค์šฐ ๋‚ฎ์Œ

    • Pclass๊ฐ€ 1์ธ ๊ฒฝ์šฐ ์ƒ์กด์œจ์ด ์•ฝ 63%์ธ ๋ฐ˜๋ฉด Pclass๊ฐ€ 2์ธ ๊ฒฝ์šฐ ์•ฝ 48%์ž„

Sex + Pclass

pd.crosstab([data.Sex,data.Survived],data.Pclass,margins=True).style.background_gradient(cmap='summer_r')
  Pclass 1 2 3 All
Sex Survived        
female 0 3 6 72 81
1 91 70 72 233
male 0 77 91 300 468
1 45 17 47 109
All 216 184 491 891
sns.pointplot(x = 'Pclass', y = 'Survived',hue='Sex',data = data) 
plt.show()

  • ๋ฒ”์ฃผํ˜• ๊ฐ’๋“ค์„ ์‰ฝ๊ฒŒ ๋ถ„๋ฆฌํ•˜์—ฌ ํŒŒ์•…ํ•˜๊ธฐ ์œ„ํ•ด pointplot()์„ ์‚ฌ์šฉ

  • Pclass = 1์ธ ์—ฌ์„ฑ์˜ ์ƒ์กด์œจ์ด ์•ฝ 95~96%(94๋ช… ์ค‘ 3๋ช…๋งŒ ์‚ฌ๋ง)

    • PClass์™€ ์ƒ๊ด€์—†์ด ๊ตฌ์กฐ ๊ณผ์ •์—์„œ ์—ฌ์„ฑ์—๊ฒŒ

    ์šฐ์„  ์ˆœ์œ„๊ฐ€ ๋ถ€์—ฌ๋œ ๊ฒƒ์€ ๋ถ„๋ช…ํ•จ

  • PClass๋„ ์ค‘์š”ํ•œ feature๋ผ๊ณ  ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ์Œ

Age

  • ์—ฐ์†ํ˜• ๋ณ€์ˆ˜
print('Oldest Passenger was of:',data['Age'].max(),'Years')
print('Youngest Passenger was of:',data['Age'].min(),'Years')
print('Average Age on the ship:',data['Age'].mean(),'Years')
Oldest Passenger was of: 80.0 Years
Youngest Passenger was of: 0.42 Years
Average Age on the ship: 29.69911764705882 Years
f,ax = plt.subplots(1,2,figsize=(18,8))

sns.violinplot(x = "Pclass",y = "Age", hue="Survived", data=data,split=True,ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))

sns.violinplot(x = "Sex",y = "Age", hue="Survived", data=data,split=True,ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))

plt.show()

โœ” Observations

  • ์–ด๋ฆฐ์ด์˜ ์ˆ˜๋Š” PClass์— ๋”ฐ๋ผ ์ฆ๊ฐ€ํ•˜๋ฉฐ 10์„ธ ๋ฏธ๋งŒ์˜ ์Šน๊ฐ(์ฆ‰, ์–ด๋ฆฐ์ด)์˜ ์ƒ์กด์œจ์€ PClass์™€ ์ƒ๊ด€์—†์ด ์–‘ํ˜ธํ•œ ๊ฒƒ์œผ๋กœ ๋ณด์ž„

  • Pclass = 1์—์„œ 20-50์„ธ ์Šน๊ฐ์˜ ์ƒ์กด ๊ฐ€๋Šฅ์„ฑ์€ ๋†’๊ณ  ์—ฌ์„ฑ์ด ํ›จ์”ฌ ๋” ๋†’์Œ

  • ๋‚จ์„ฑ์˜ ๊ฒฝ์šฐ, ์ƒ์กด ๊ฐ€๋Šฅ์„ฑ์€ ๋‚˜์ด๊ฐ€ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ๊ฐ์†Œ

โœ” ๊ฒฐ์ธก์น˜(NaN) ์ฒ˜๋ฆฌ

  • Age feature์—๋Š” 177๊ฐœ์˜ null ๊ฐ’์ด ์žˆ์Œ

  • NaN ๊ฐ’์„ ๋Œ€์ฒดํ•˜๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ํ‰๊ท  ์—ฐ๋ น์„ ํ• ๋‹นํ•  ์ˆ˜ ์žˆ์Œ

    • but ์‚ฌ๋žŒ๋“ค์˜ ์—ฐ๋ น์€ ๋งค์šฐ ๋‹ค์–‘ํ•จ
  • ์ด๋ฆ„ ์•ž์˜ ๋ถ™์€ ํ‚ค์›Œ๋“œ(Mr, Mrs.)๋ฅผ ํ†ตํ•ด ๊ทธ๋ฃนํ™” ํ›„ ๊ฐ ๊ทธ๋ฃน์˜ ํ‰๊ท ๊ฐ’์„ ํ• ๋‹น ๊ฐ€๋Šฅ

### ์ด๋ฆ„์—์„œ ํ‚ค์›Œ๋“œ ์ถ”์ถœํ•˜๊ธฐ

data['Initial'] = data.Name.str.extract(r'([A-Za-z]+)\.') # ์ด๋ฆ„์—์„œ .(dot) ์•ž์˜ ํ˜ธ์นญ๋งŒ ์ถ”์ถœ(raw string์œผ๋กœ escape ๊ฒฝ๊ณ  ๋ฐฉ์ง€)
### Sex์™€ ํ•จ๊ป˜ ์ด๋‹ˆ์…œ ํ™•์ธํ•˜๊ธฐ

pd.crosstab(data.Initial,data.Sex).T.style.background_gradient(cmap='summer_r') 
Initial Capt Col Countess Don Dr Jonkheer Lady Major Master Miss Mlle Mme Mr Mrs Ms Rev Sir
Sex                                  
female 0 0 1 0 1 0 1 0 0 182 2 1 0 125 1 0 0
male 1 2 0 1 6 1 0 2 40 0 0 0 517 0 0 6 1
### ์ž˜๋ชป ํ‘œ๊ธฐ๋œ ์ด๋‹ˆ์…œ ๋ณ€๊ฒฝ
# Mile์ด๋‚˜ Mme ๋“ฑ

data['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'],inplace = True)
### ์ด๋‹ˆ์…œ ๋ณ„ ํ‰๊ท  ๋‚˜์ด

data.groupby('Initial')['Age'].mean() 
Initial
Master     4.574167
Miss      21.860000
Mr        32.739609
Mrs       35.981818
Other     45.888889
Name: Age, dtype: float64
### ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ(NaN ์ฑ„์šฐ๊ธฐ)
# ํ‰๊ท  ์—ฐ๋ น์˜ ์˜ฌ๋ฆผ ๊ฐ’์„ ํ™œ์šฉ

data.loc[(data.Age.isnull()) & (data.Initial=='Mr'),'Age'] = 33
data.loc[(data.Age.isnull()) & (data.Initial=='Mrs'),'Age'] = 36
data.loc[(data.Age.isnull()) & (data.Initial=='Master'),'Age'] = 5
data.loc[(data.Age.isnull()) & (data.Initial=='Miss'),'Age'] = 22
data.loc[(data.Age.isnull()) & (data.Initial=='Other'),'Age'] = 46
data.Age.isnull().any() 

# ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ ์™„๋ฃŒ!
False
### ์‹œ๊ฐํ™”

f,ax = plt.subplots(1,2,figsize=(20,10))

data[data['Survived']==0].Age.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='red')
ax[0].set_title('Survived= 0')
x1 = list(range(0,85,5))
ax[0].set_xticks(x1)

data[data['Survived']==1].Age.plot.hist(ax=ax[1],color='green',bins=20,edgecolor='black')
ax[1].set_title('Survived= 1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)

plt.show()

โœ” Observations

  • ์œ ์•„(์—ฐ๋ น์ด 5์„ธ ๋ฏธ๋งŒ)๋Š” ์ƒ์กด๋ฅ ์ด ๋†’์Œ

  • ๊ฐ€์žฅ ๋‚˜์ด๊ฐ€ ๋งŽ์€ ์Šน๊ฐ์€ ๊ตฌ์กฐ๋˜์—ˆ์Œ(80์„ธ)

  • ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ๋งํ•œ ์Šน๊ฐ๋“ค์˜ ์—ฐ๋ น๋Œ€๋Š” 30 ~ 40๋Œ€

Embarked

  • ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜
pd.crosstab([data.Embarked,data.Pclass],[data.Sex,data.Survived],margins=True).style.background_gradient(cmap='summer_r')
  Sex female male All
  Survived 0 1 0 1
Embarked Pclass          
C 1 1 42 25 17 85
2 0 7 8 2 17
3 8 15 33 10 66
Q 1 0 1 1 0 2
2 0 2 1 0 3
3 9 24 36 3 72
S 1 2 46 51 28 127
2 6 61 82 15 164
3 55 33 231 34 353
All 81 231 468 109 889
### Embarked์— ๋”ฐ๋ฅธ ์ƒ์กด๋ฅ 

sns.pointplot(x = 'Embarked',y = 'Survived',data = data)
fig = plt.gcf()
fig.set_size_inches(5,3)
plt.show()

  • ํ•ญ๊ตฌ C์—์„œ์˜ ์ƒ์กด๋ฅ ์ด 0.55 ์ •๋„๋กœ ๊ฐ€์žฅ ๋†’์Œ

  • ํ•ญ๊ตฌ S์—์„œ์˜ ์ƒ์กด๋ฅ ์ด ๊ฐ€์žฅ ๋‚ฎ์Œ

f,ax = plt.subplots(2,2,figsize = (20,15))

sns.countplot(x = 'Embarked',data = data,ax = ax[0,0])
ax[0,0].set_title('No. Of Passengers Boarded')

sns.countplot(x = 'Embarked',hue = 'Sex',data = data,ax = ax[0,1])
ax[0,1].set_title('Male-Female Split for Embarked')

sns.countplot(x = 'Embarked',hue = 'Survived',data = data,ax = ax[1,0])
ax[1,0].set_title('Embarked vs Survived')

sns.countplot(x = 'Embarked',hue = 'Pclass',data = data,ax = ax[1,1])
ax[1,1].set_title('Embarked vs Pclass')

plt.subplots_adjust(wspace = 0.2,hspace = 0.5)
plt.show()

โœ” Observations

  • S์—์„œ ํƒ‘์Šนํ•œ ์Šน๊ฐ๋“ค ์ค‘ ๋Œ€๋‹ค์ˆ˜๋Š” Pclass = 3 ์ถœ์‹ 

  • C์—์„œ ์˜จ ์Šน๊ฐ๋“ค์€ ๊ทธ๋“ค ์ค‘ ์ƒ๋‹นํ•œ ๋น„์œจ์ด ์‚ด์•„๋‚จ์Œ

    • Pclass = 1๊ณผ Pclass = 2 ์Šน๊ฐ ์ „์›์„ ๊ตฌ์กฐํ•œ ๊ฒƒ์ผ ์ˆ˜ ์žˆ์Œ
  • Embarked = S๋Š” ๋Œ€๋ถ€๋ถ„ ๋ถ€์ž๋“ค์ด ํƒ‘์Šนํ•œ ํ•ญ๊ตฌ๋กœ ๋ณด์ž„

    • ์—ฌ์ „ํžˆ ์ด๊ณณ์—์„œ๋Š” ์ƒ์กด ๊ฐ€๋Šฅ์„ฑ์ด ๋‚ฎ์Œ

    • PClass = 3 ์Šน๊ฐ์˜ 81%๊ฐ€ ์ •๋„ ์‚ด์•„๋‚จ์ง€ ๋ชปํ•จ

  • Q์—์„œ ํƒ‘์Šนํ•œ ์Šน๊ฐ์˜ ์•ฝ 95%๋Š” Pclass = 3 ์ถœ์‹ 

sns.catplot(x = 'Pclass',y = 'Survived',hue = 'Sex',col = 'Embarked',data = data, kind = 'point')
plt.show()

โœ” Observations

  • PClass์— ๊ด€๊ณ„์—†์ด PClass = 1๊ณผ PClass = 2์˜ ์—ฌ์„ฑ์˜ ์ƒ์กด ํ™•๋ฅ ์€ ๊ฑฐ์˜ 1์ด๋‹ค.

  • Pclass = 3์— ๋Œ€ํ•ด Embarked = S๋Š” ๋‚จ๋…€ ๋ชจ๋‘ ์ƒ์กด์œจ์ด ๋งค์šฐ ๋‚ฎ์Œ

  • Embarked = Q๋Š” ๊ฑฐ์˜ ๋ชจ๋‘ PClass = 3 ์ถœ์‹ 

    • ๋‚จ์„ฑ์—๊ฒŒ ๊ฐ€์žฅ ๋ถˆ์šดํ•œ ๊ฒƒ์œผ๋กœ ๋ณด์ž„

โœ” ๊ฒฐ์ธก์น˜(NaN) ์ฒ˜๋ฆฌ

  • ๋งŽ์€ ์Šน๊ฐ๋“ค์ด S ํ•ญ๊ตฌ์—์„œ ํƒ‘์Šนํ•˜์˜€์Œ

    • NaN์„ S๋กœ ๋Œ€์ฒด
data['Embarked'].fillna('S',inplace = True)
data.Embarked.isnull().any()

# ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ๊ฐ€ ์ •์ƒ์ ์œผ๋กœ ์ˆ˜ํ–‰๋จ
False

SibSp

  • ์ด์‚ฐ ๋ณ€์ˆ˜(Discrete Feature)

  • ํ˜ผ์ž ํƒ”๋Š”์ง€ ์•„๋‹ˆ๋ฉด ๊ทธ์˜ ๊ฐ€์กฑ ๊ตฌ์„ฑ์›๊ณผ ํ•จ๊ป˜ ํƒ”๋Š”์ง€

  • ํ˜•์ œ => ํ˜•์ œ, ์ž๋งค, ์˜๋ถ“๋™์ƒ, ์˜๋ถ“์–ธ๋‹ˆ

  • ๋ฐฐ์šฐ์ž => ๋‚จํŽธ, ์•„๋‚ด

pd.crosstab([data.SibSp],data.Survived).style.background_gradient(cmap='summer_r')
Survived 0 1
SibSp    
0 398 210
1 97 112
2 15 13
3 12 4
4 15 3
5 5 0
8 7 0
### ์‹œ๊ฐํ™”

f,ax = plt.subplots(1,2,figsize = (20,8))

sns.barplot(x = 'SibSp',y = 'Survived',data = data, ax = ax[0])
ax[0].set_title('SibSp vs Survived')

sns.pointplot(x = 'SibSp',y = 'Survived',data = data, ax = ax[1])
ax[1].set_title('SibSp vs Survived')
plt.close(2)

plt.show()

### PClass์— ๋”ฐ๋ฅธ ์ž์‹ ์ˆ˜

pd.crosstab(data.SibSp,data.Pclass).style.background_gradient(cmap='summer_r')
Pclass 1 2 3
SibSp      
0 137 120 351
1 71 55 83
2 5 8 15
3 3 1 12
4 0 0 18
5 0 0 5
8 0 0 7

โœ” Observations

  • barplot ๋ฐ pointplot(๊ตฌ factorplot)์€ ์Šน๊ฐ์ด ํ˜ผ์ž ํƒ‘์Šนํ•œ ๊ฒฝ์šฐ(SibSp = 0) ์ƒ์กด์œจ์ด 34.5%์ž„์„ ๋ณด์—ฌ์คŒ

    • ํ˜•์ œ์ž๋งค์˜ ์ˆ˜๊ฐ€ ์ฆ๊ฐ€ํ•˜๋ฉด ๊ทธ๋ž˜ํ”„๋Š” ๋Œ€๋žต ๊ฐ์†Œ

    • ์ฆ‰, ๋งŒ์•ฝ ๊ฐ€์กฑ์„ ํƒœ์šด๋‹ค๋ฉด, ์ž์‹ ์„ ๋จผ์ € ๊ตฌํ•˜๋Š” ๋Œ€์‹  ๊ทธ๋“ค์„ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด ๋…ธ๋ ฅํ•  ๊ฒƒ

  • ๊ตฌ์„ฑ์›์ด 5-8๋ช…์ธ ๊ฐ€์กฑ์˜ ์ƒ์กด์œจ์€ 0%

  • ์ด์œ : PClass

  • crosstab => SibSp > 3๋ฅผ ๊ฐ€์ง„ ์‚ฌ๋žŒ์ด ๋ชจ๋‘ Pclass = 3์ž„์„ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ์Œ

Parch

pd.crosstab(data.Parch,data.Pclass).style.background_gradient(cmap='summer_r')
Pclass 1 2 3
Parch      
0 163 134 381
1 31 32 55
2 21 16 43
3 0 2 3
4 1 0 3
5 0 0 5
6 0 0 1
  • ๊ฐ€์กฑ ๊ตฌ์„ฑ์›์˜ ์ˆ˜๊ฐ€ ๋งŽ์•„์งˆ์ˆ˜๋ก ์ฃผ๋กœ PClass = 3์— ์†ํ•จ
### ์‹œ๊ฐํ™”

f,ax = plt.subplots(1,2,figsize = (20,8))

sns.barplot(x = 'Parch',y = 'Survived',data=data,ax=ax[0])
ax[0].set_title('Parch vs Survived')

sns.pointplot(x = 'Parch',y = 'Survived',data=data,ax=ax[1])
ax[1].set_title('Parch vs Survived')
plt.close(2)

plt.show()

Fare

  • ์—ฐ์†ํ˜• ๋ณ€์ˆ˜
print('Highest Fare was:',data['Fare'].max())
print('Lowest Fare was:',data['Fare'].min())
print('Average Fare was:',data['Fare'].mean())
Highest Fare was: 512.3292
Lowest Fare was: 0.0
Average Fare was: 32.204207968574636
f,ax = plt.subplots(1,3,figsize=(20,8))

sns.distplot(data[data['Pclass']==1].Fare,ax=ax[0])
ax[0].set_title('Fares in Pclass 1')

sns.distplot(data[data['Pclass']==2].Fare,ax=ax[1])
ax[1].set_title('Fares in Pclass 2')

sns.distplot(data[data['Pclass']==3].Fare,ax=ax[2])
ax[2].set_title('Fares in Pclass 3')
plt.show()

  • Pclass = 1์˜ ์Šน๊ฐ ์š”๊ธˆ์˜ ๋ถ„์‚ฐ์ด ํฐ ๊ฒƒ์œผ๋กœ ๋ณด์ด๋ฉฐ, ๊ธฐ์ค€์ด ๊ฐ์†Œํ•จ์— ๋”ฐ๋ผ ๋ถ„์‚ฐ์ด ๊ณ„์†ํ•ด์„œ ๊ฐ์†Œํ•˜๊ณ  ์žˆ์Œ

  • ์—ฐ์† ๋ณ€์ˆ˜์ด๊ธฐ์—, binning์„ ์‚ฌ์šฉํ•˜์—ฌ ์ด์‚ฐ ๊ฐ’์œผ๋กœ ๋ณ€ํ™˜ํ•  ์ˆ˜ ์žˆ์Œ

1-3. ๊ฒฝํ–ฅ์„ฑ ํŒŒ์•…

์ „์ฒด ๋ณ€์ˆ˜๋“ค์— ๋Œ€ํ•œ ์š”์•ฝ

  • Sex: ์—ฌ์„ฑ์ด ๋‚จ์„ฑ์— ๋น„ํ•ด ์ƒ์กด ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์Œ

  • PClass

    • ๋“ฑ๊ธ‰์ด ์ข‹์„์ˆ˜๋ก ์ƒ์กด๋ฅ ์ด ๋†’์•„์ง€๋Š” ์ถ”์„ธ๋ฅผ ๋ณด์ž„

    • PClass = 3์˜ ์ƒ์กด๋ฅ ์ด ๋งค์šฐ ๋‚ฎ์Œ

    • ์—ฌ์„ฑ์˜ ๊ฒฝ์šฐ PClass = 1์—์„œ์˜ ์ƒ์กด๋ฅ ์€ ๊ฑฐ์˜ 1์ด๊ณ , PClass = 2์˜ ๊ฒฝ์šฐ๋„ ์ƒ์กด๋ฅ ์ด ๋†’์Œ

  • Age

    • 5-10์„ธ ๋ฏธ๋งŒ ์–ด๋ฆฐ์ด๋“ค์˜ ๊ฒฝ์šฐ ์ƒ์กด ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์Œ

    • 15์„ธ-30์„ธ ์Šน๊ฐ๋“ค์ด ๋งŽ์ด ์ฃฝ์Œ

  • Embarked

    • PClass = 1 ์Šน๊ฐ ๋Œ€๋ถ€๋ถ„์ด S์—์„œ ํƒ‘์Šนํ•˜์˜€๋‹ค๋งŒ, C์—์„œ์˜ ์ƒ์กด๋ฅ ์ด ํ›จ์”ฌ ์ข‹์Œ

    • Q์—์„œ ํƒ‘์Šนํ•œ ์†๋‹˜๋“ค์€ ๋ชจ๋‘ PClass = 3

  • Parch + SibSp

    • 1-2๋ช…์˜ ํ˜•์ œ์ž๋งค๊ฐ€ ์žˆ๊ฑฐ๋‚˜, ๋ฐฐ์šฐ์ž๊ฐ€ ์žˆ๊ฑฐ๋‚˜, 1-3๋ช…์˜ ๋ถ€๋ชจ๊ฐ€ ์žˆ๋Š” ์Šน๊ฐ๋“ค์ด ํ˜ผ์ž ํƒ€๊ฑฐ๋‚˜ ๋Œ€๊ฐ€์กฑ์ธ ๊ฒฝ์šฐ๋ณด๋‹ค ์ƒ์กด๋ฅ ์ด ๋†’์Œ

Feature๋“ค ๊ฐ„์˜ ์ƒ๊ด€๊ฒŒ์ˆ˜

sns.heatmap(data.corr(numeric_only=True),annot=True,cmap='RdYlGn',linewidths=0.2) # correlation matrix(์ตœ์‹  pandas์—์„œ๋Š” ์ˆ˜์น˜ํ˜• ์—ด๋งŒ ์“ฐ๋„๋ก numeric_only ์ง€์ • ํ•„์š”)

fig = plt.gcf()
fig.set_size_inches(10,8)

plt.show()

  • ๊ฐ feature๋“ค ๊ฐ„์˜ ํฐ ์ƒ๊ด€๊ด€๊ณ„๋Š” ์—†์Œ์„ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ์Œ

  • ๊ฐ€์žฅ ํฐ ์ƒ๊ด€๊ด€๊ณ„

    • SibSp์™€ Parch(0.41)

๋ชจ๋“  feature๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Œ

โœ” Heatmap์— ๋Œ€ํ•œ ํ•ด์„

  • ์•ŒํŒŒ๋ฒณ์ด๋‚˜ ๋ฌธ์ž์—ด ์‚ฌ์ด์—์„œ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ํŒŒ์•…ํ•  ์ˆ˜ ์—†์Œ

    • numeric feature๋“ค๋งŒ ๋น„๊ต๋จ
  • ์–‘์˜ ์ƒ๊ด€๊ด€๊ณ„(positive correlation)

    • feature A์˜ ์ฆ๊ฐ€๊ฐ€ feature B์˜ ์ฆ๊ฐ€๋กœ ์ด์–ด์ง€๋Š” ๊ฒฝ์šฐ

    • 1์€ ์™„์ „ํ•œ ์–‘์˜ ์ƒ๊ด€ ๊ด€๊ณ„๋ฅผ ์˜๋ฏธ

  • ์Œ์˜ ์ƒ๊ด€๊ด€๊ณ„(negative correlation)

    • feature A์˜ ์ฆ๊ฐ€๊ฐ€ feature B์˜ ๊ฐ์†Œ๋กœ ์ด์–ด์ง€๋Š” ๊ฒฝ์šฐ

    • -1์€ ์™„์ „ํ•œ ์–‘์˜ ์ƒ๊ด€ ๊ด€๊ณ„๋ฅผ ์˜๋ฏธ

โœ” ๋‹ค์ค‘๊ณต์„ ์„ฑ(multicollinearity)

  • ๋‘ ํŠน์ง•์ด ๋งค์šฐ/ ์™„๋ฒฝํ•˜๊ฒŒ ์ƒ๊ด€๋จ

    • ํ•œ feature์˜ ์ฆ๊ฐ€๊ฐ€ ๋‹ค๋ฅธ feature์˜ ์ฆ๊ฐ€๋กœ ์ด์–ด์ง

    • ์ฆ‰, ๋‘ feature ๋ชจ๋‘ ๋งค์šฐ ์œ ์‚ฌํ•œ ์ •๋ณด๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ์œผ๋ฉฐ ์ •๋ณด์˜ ์ฐจ์ด๊ฐ€ ๊ฑฐ์˜/ ์ „ํ˜€ ์—†์Œ

  • ๋‘ ๊ฐœ์˜ feature๋“ค์ด ์ค‘๋ณต๋˜๊ธฐ์—, ๋‘˜ ๋‹ค ์‚ฌ์šฉํ•˜๋Š” ๋Œ€์‹  ํ•˜๋‚˜๋งŒ ์‚ฌ์šฉํ•ด๋„ ๋ฌด๋ฐฉํ•จ

2. ํŠน์„ฑ ๊ณตํ•™(Feature Engineering) & ๋ฐ์ดํ„ฐ ํด๋ Œ์ง•

  • ๋ชจ๋“  feature๋“ค์ด ์ค‘์š”ํ•œ ๊ฒƒ์€ ์•„๋‹˜

    • ์ œ๊ฑฐํ•ด์•ผ ํ•  ์ค‘๋ณต๋œ ์„ฑ๊ฒฉ์˜ feature๋“ค์ด ๋งŽ์ด ์žˆ์„ ์ˆ˜ ์žˆ์Œ
  • ๋‹ค๋ฅธ feature๋“ค์„ ๊ด€์ฐฐํ•˜๊ฑฐ๋‚˜ ์ •๋ณด๋ฅผ ์ถ”์ถœํ•˜์—ฌ ์ƒˆ๋กœ์šด feature๋กœ ๊ฐ€์ ธ์˜ค๊ฑฐ๋‚˜ ์ถ”๊ฐ€ํ•  ์ˆ˜ ์žˆ์Œ

2-1. ์ƒˆ๋กœ์šด feature ์ถ”๊ฐ€

Age_band

โœ” Age feature์˜ ๋ฌธ์ œ์ 

  • Age๋Š” ์—ฐ์†ํ˜• ๋ณ€์ˆ˜

    • ML์—์„œ ๋ฌธ์ œ๊ฐ€ ์ƒ๊ธธ ์ˆ˜ ์žˆ์Œ
  • binning ๋˜๋Š” ์ •๊ทœํ™”๋ฅผ ํ†ตํ•ด ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜๋กœ ๋ณ€ํ™˜ ํ•„์š”

    • binning์„ ํ™œ์šฉ

    • ์—ฐ๋ น ๋ฒ”์œ„๋ฅผ ๋‹จ์ผ ๋นˆ์œผ๋กœ groupํ™” ํ•˜๊ฑฐ๋‚˜ ๋‹จ์ผ ๊ฐ’ ํ• ๋‹น

    • 0 - 80์„ธ๋ฅผ 5๊ฐœ์˜ bin์œผ๋กœ ๋‚˜๋ˆ„๊ธฐ

### ์—ฐ๋ น๋Œ€ ๋‚˜๋ˆ„๊ธฐ

data['Age_band'] = 0
data.loc[data['Age']<= 16,'Age_band'] = 0
data.loc[(data['Age']>16)&(data['Age']<=32),'Age_band']=1
data.loc[(data['Age']>32)&(data['Age']<=48),'Age_band']=2
data.loc[(data['Age']>48)&(data['Age']<=64),'Age_band']=3
data.loc[data['Age']>64,'Age_band']=4
data.head(2)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Initial Age_band
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S Mr 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C Mrs 2
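์œ„์˜ ์ˆ˜๋™ loc ๋ฐฉ์‹ ๋Œ€์‹  pd.cut์œผ๋กœ ๊ฐ™์€ ๊ตฌ๊ฐ„์„ ํ•œ ๋ฒˆ์— ์ง€์ •ํ•  ์ˆ˜๋„ ์žˆ๋‹ค. ์•„๋ž˜๋Š” ํ•ฉ์„ฑ Age ๊ฐ’์œผ๋กœ ๋™์ž‘์„ ํ™•์ธํ•˜๋Š” ์Šค์ผ€์น˜(๊ฐ€์ •: ์‹ค์ œ data๊ฐ€ ์•„๋‹Œ ์˜ˆ์‹œ ๋‚˜์ด 4๊ฐœ).

```python
import pandas as pd

# ๊ฐ€์ •: ๋ณธ๋ฌธ์˜ loc ๋ฐฉ์‹๊ณผ ๋™์ผํ•œ ๊ตฌ๊ฐ„ ๊ฒฝ๊ณ„๋ฅผ pd.cut์— ์ „๋‹ฌ
ages = pd.Series([10.0, 22.0, 38.0, 70.0])
age_band = pd.cut(ages,
                  bins=[0, 16, 32, 48, 64, 80],  # (0,16], (16,32], (32,48], (48,64], (64,80]
                  labels=[0, 1, 2, 3, 4]).astype(int)
print(age_band.tolist())  # [0, 1, 2, 4]
```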
### ๊ฐ ์—ฐ๋ น๋Œ€์— ์†ํ•˜๋Š” ์Šน๊ฐ ์ˆ˜

data['Age_band'].value_counts().to_frame().style.background_gradient(cmap='summer')
  Age_band
1 382
2 325
0 104
3 69
4 11
sns.catplot(x = 'Age_band', y = 'Survived',data=data,col='Pclass', kind = 'point')
plt.show()

  • PClass์™€ ์ƒ๊ด€ ์—†์ด ๋‚˜์ด๊ฐ€ ์ฆ๊ฐ€ํ• ์ˆ˜๋ก ์ƒ์กด๋ฅ ์ด ๋‚ฎ์•„์ง

Family_Size & Alone

  • Parch + SibSp

  • ์ƒ์กด๋ฅ ์ด ๊ฐ€์กฑ ๊ตฌ์„ฑ์› ์ˆ˜์™€ ๊ด€๋ จ์ด ์žˆ๋Š”์ง€ ํ™•์ธ

  • ๋‹จ๋…์œผ๋กœ ์Šน๊ฐ์ด ํ˜ผ์ž์ธ์ง€ ์•„๋‹Œ์ง€๋ฅผ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Œ

data['Family_Size'] = 0
data['Family_Size'] = data['Parch'] + data['SibSp'] #family size

data['Alone'] = 0
data.loc[data.Family_Size == 0,'Alone'] = 1 #Alone

### ์‹œ๊ฐํ™”

f,ax = plt.subplots(1,2,figsize = (18,6))
sns.pointplot(x = 'Family_Size',y = 'Survived',data = data, ax = ax[0])
ax[0].set_title('Family_Size vs Survived')
sns.pointplot(x = 'Alone',y = 'Survived',data = data, ax = ax[1])
ax[1].set_title('Alone vs Survived')
plt.close(2)
plt.close(3)
plt.show()

  • Family_Size = 0: ์Šน๊ฐ์ด ํ˜ผ์ž์ž„์„ ์˜๋ฏธ

  • ํ˜ผ์ž์ด๊ฑฐ๋‚˜ family_size = 0์ด๋ฉด ์ƒ์กด๋ฅ ์ด ๋งค์šฐ ๋‚ฎ์Œ

  • family_size > 4์ธ ๊ฒฝ์šฐ๋„ ์ƒ์กด๋ฅ  ๊ฐ์†Œ

sns.catplot(x = 'Alone',y = 'Survived',data=data,hue='Sex',col='Pclass', kind = 'point')
plt.show()

  • ๊ฐ€์กฑ์ด ์žˆ๋Š” ์‚ฌ๋žŒ๋ณด๋‹ค ํ˜ผ์ž์ธ ์—ฌ์„ฑ์˜ ํ™•๋ฅ ์ด ๋†’์€ Pclass = 3์„ ์ œ์™ธํ•˜๊ณ ๋Š” Sex, Pclass ๊ตฌ๋ถ„ ์—†์ด ํ˜ผ์ž ์žˆ๋Š” ๊ฒƒ์ด ์œ„ํ—˜ํ•จ

Fare_Range

  • Fare ๋˜ํ•œ ์—ฐ์†ํ˜• ๋ณ€์ˆ˜ -> ์ „์ฒ˜๋ฆฌ ํ•„์š”

  • ์ „์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•ด pandas.qcut()์„ ํ™œ์šฉ

  • pd.qcut(data, bins)

    • ํ†ต๊ณผํ•œ bin์˜ ์ˆ˜์— ๋”ฐ๋ผ ๊ฐ’์„ ๋ถ„ํ• /๋ฐฐ์—ด

    • 5๊ฐœ์˜ bin์— ๋Œ€ํ•ด ์ „๋‹ฌ ์‹œ ๊ฐ’์ด 5๊ฐœ์˜ bin ๋˜๋Š” ๊ฐ’ ๋ฒ”์œ„๋กœ ๊ท ๋“ฑํ•˜๊ฒŒ ๋ฐฐ์—ด๋จ

data['Fare_Range'] = pd.qcut(data['Fare'], 4)
data.groupby(['Fare_Range'])['Survived'].mean().to_frame().style.background_gradient(cmap='summer_r')
  Survived
Fare_Range  
(-0.001, 7.91] 0.197309
(7.91, 14.454] 0.303571
(14.454, 31.0] 0.454955
(31.0, 512.329] 0.581081
  • fare_range๊ฐ€ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ์ƒ์กด๋ฅ ์ด ์ฆ๊ฐ€

  • Fare_range ๊ฐ’์„ Age_band์™€ ๊ฐ™์ด ๋ฒ”์ฃผํ˜• ๊ฐ’์œผ๋กœ ๋ณ€๊ฒฝ

data['Fare_cat'] = 0
data.loc[data['Fare'] <= 7.91,'Fare_cat'] = 0
data.loc[(data['Fare'] > 7.91) & (data['Fare'] <= 14.454),'Fare_cat'] = 1
data.loc[(data['Fare'] > 14.454) & (data['Fare'] <= 31),'Fare_cat'] = 2
data.loc[(data['Fare'] > 31) & (data['Fare'] <= 513),'Fare_cat'] = 3
sns.pointplot(x = 'Fare_cat',y = 'Survived',data=data,hue='Sex')
plt.show()

  • Fare_cat์ด ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ์ƒ์กด๋ฅ ์ด ์ฆ๊ฐ€ํ•จ

  • Sex์™€ ๋”๋ถˆ์–ด ๋ชจ๋ธ๋ง ์‹œ ์ค‘์š”ํ•œ feature๋กœ ์˜ˆ์ƒ๋จ

2-2. feature ๋ณ€ํ™˜

  • ๋ชจ๋ธ์— ์ ํ•ฉํ•  ์ˆ˜ ์žˆ๋Š” ํ˜•ํƒœ๋กœ feature๋“ค์„ ๋ณ€ํ™˜

String -> ์ˆ˜์น˜ํ˜•

  • ML ๋ชจํ˜•์€ ์ˆ˜์น˜ํ˜• ๋ณ€์ˆ˜๋“ค๋งŒ ์ฒ˜๋ฆฌ ๊ฐ€๋Šฅ
data['Sex'].replace(['male','female'],[0,1],inplace = True)
data['Embarked'].replace(['S','C','Q'],[0,1,2],inplace = True)
data['Initial'].replace(['Mr','Mrs','Miss','Master','Other'],[0,1,2,3,4],inplace = True)
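replace๋กœ ์ •์ˆ˜ ๋ผ๋ฒจ์„ ๋ถ€์—ฌํ•˜๋ฉด Embarked ๊ฐ™์€ ์ˆœ์ˆ˜ ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜์— ์ธ์œ„์ ์ธ ์ˆœ์„œ(S < C < Q)๊ฐ€ ์ƒ๊ธด๋‹ค. ๋Œ€์•ˆ์œผ๋กœ one-hot encoding(pd.get_dummies)์„ ์“ธ ์ˆ˜๋„ ์žˆ๋‹ค๋Š” ์ ๋งŒ ์ž‘์€ ์Šค์ผ€์น˜๋กœ ๋ณด์—ฌ์ค€๋‹ค(๊ฐ€์ •: ํ•ฉ์„ฑ Embarked ์—ด 4ํ–‰, ๋ณธ๋ฌธ์˜ ๋ฐฉ์‹์„ ๋Œ€์ฒดํ•˜๋Š” ๊ฒƒ์€ ์•„๋‹˜).

```python
import pandas as pd

# ๊ฐ€์ •: ํ•ฉ์„ฑ Embarked ์—ด. one-hot encoding์€ ๋ฒ”์ฃผ๋งˆ๋‹ค ๋…๋ฆฝ์ ์ธ 0/1 ์—ด์„ ๋งŒ๋“ค์–ด
# ์ •์ˆ˜ ๋ผ๋ฒจ๋ง๊ณผ ๋‹ฌ๋ฆฌ ๋ฒ”์ฃผ ๊ฐ„ ์ธ์œ„์  ์ˆœ์„œ๊ฐ€ ์ƒ๊ธฐ์ง€ ์•Š์Œ
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
dummies = pd.get_dummies(df['Embarked'], prefix='Embarked')
print(dummies.columns.tolist())  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```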

2-3. ๋ถˆํ•„์š”ํ•œ feature ์ œ๊ฑฐ

  • Name: ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜๋กœ ๋ณ€ํ™˜ ๋ถˆ๊ฐ€

  • Age: Age_band ๋ณ€์ˆ˜๋กœ ๋Œ€์ฒด

  • Ticket: ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜๋กœ ๋ณ€ํ™˜ํ•˜๊ธฐ์—” ๋„ˆ๋ฌด ๋‹ค์ฑ„๋กœ์›€

  • Fare: Fare_cat ๋ณ€์ˆ˜๋กœ ๋Œ€์ฒด

  • Cabin: ๋งŽ์€ ๊ฒฐ์ธก์น˜(NaN), ํ•œ ์Šน๊ฐ์ด ์—ฌ๋Ÿฌ ๊ฐœ์˜ Cabin

  • Fare_range: fare_cat ๋ณ€์ˆ˜๋กœ ๋Œ€์ฒด

  • PassengerId: ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜๋กœ ๋ณ€ํ™˜ ๋ถˆ๊ฐ€

data.drop(['Name','Age','Ticket','Fare','Cabin','Fare_Range','PassengerId'],axis = 1,inplace = True)
### ์ตœ์ข… ๋ณ€์ˆ˜๋“ค์˜ heatmap

sns.heatmap(data.corr(),annot=True,cmap='RdYlGn',linewidths=0.2,annot_kws={'size':20})

fig = plt.gcf()
fig.set_size_inches(18,15)
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14)
plt.show()

  • ๋ช‡๋ช‡ ๋ณ€์ˆ˜๋“ค์ด ์–‘์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๊ฐ€์ง์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ

    • SibSp & Family_Size

    • Parch & Family_Size

  • ๋ณ€์ˆ˜๋“ค ๊ฐ„์˜ ์Œ์˜ ์ƒ๊ด€๊ด€๊ณ„๋„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ

    • Alone & Family_Size

3. ์˜ˆ์ธก์  ๋ชจ๋ธ๋ง

  • ๋ถ„๋ฅ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•˜์—ฌ ์Šน๊ฐ์˜ ์ƒ์กด ์—ฌ๋ถ€๋ฅผ ์˜ˆ์ธก

  • ํ™œ์šฉ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค

    • ์„œํฌํŠธ ๋ฒกํ„ฐ ๋จธ์‹ (Support Vector Machine)

    • ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€(LogisticRegression)

    • ๊ฒฐ์ • ํŠธ๋ฆฌ(Decision Tree)

    • K-์ตœ๊ทผ์ ‘ ์ด์›ƒ(K-Nearest Neighbors)

    • ๊ฐ€์šฐ์Šค ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ(Gaussian Naive Bayes)

    • ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ(RandomForest)

### ML์— ํ•„์š”ํ•œ ๋ชจ๋“  library import

from sklearn.linear_model import LogisticRegression 
from sklearn import svm 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.naive_bayes import GaussianNB 
from sklearn.tree import DecisionTreeClassifier 

from sklearn.model_selection import train_test_split 
from sklearn import metrics # accuracy measure
from sklearn.metrics import confusion_matrix # for confusion matrix
### ๋ฐ์ดํ„ฐ ๋ถ„ํ• 

train, test = train_test_split(data,test_size = 0.3,random_state = 0,stratify = data['Survived'])

train_X = train[train.columns[1:]]
train_Y = train[train.columns[:1]]
test_X = test[test.columns[1:]]
test_Y = test[test.columns[:1]]

X = data[data.columns[1:]]
Y = data['Survived']

3-1. ๊ธฐ๋ณธ ์•Œ๊ณ ๋ฆฌ์ฆ˜

Support Vector Machines

  • C: ๋งˆ์ง„ ์˜ค๋ฅ˜๋ฅผ ์–ผ๋งˆ๋‚˜ ํ—ˆ์šฉํ•  ๊ฒƒ์ธ๊ฐ€

    • ํด์ˆ˜๋ก ๋งˆ์ง„์ด ์ข์•„์ง€๊ณ  ํ—ˆ์šฉ๋˜๋Š” ๋งˆ์ง„ ์˜ค๋ฅ˜๊ฐ€ ์ค„์–ด๋“ฆ

    • ์ž‘์„์ˆ˜๋ก ๋งˆ์ง„์ด ๋„“์–ด์ง€๊ณ  ํ—ˆ์šฉ๋˜๋Š” ๋งˆ์ง„ ์˜ค๋ฅ˜๊ฐ€ ๋Š˜์–ด๋‚จ

  • kernel: ์ปค๋„ ํ•จ์ˆ˜ ์ข…๋ฅ˜

    • โ€˜linearโ€™, โ€˜polyโ€™, โ€˜rbfโ€™, โ€˜sigmoidโ€™
  • gamma: ์ปค๋„ ๊ณ„์ˆ˜ ์ง€์ •

    • kernel์ด โ€˜polyโ€™, โ€˜rbfโ€™, โ€˜sigmoidโ€™์ผ ๋•Œ ์œ ํšจ

โ€ป Reference: ํ•ธ์ฆˆ์˜จ ๋จธ์‹ ๋Ÿฌ๋‹
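C์˜ ํšจ๊ณผ๋Š” ์ž‘์€ ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ๋กœ๋„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ์•„๋ž˜๋Š” ์„ ํ˜• ๋ถ„๋ฆฌ ๊ฐ€๋Šฅํ•œ 6๊ฐœ ์ ์— ๋Œ€ํ•ด C๋ฅผ ๋ฐ”๊ฟ” support vector ์ˆ˜๋ฅผ ๋น„๊ตํ•˜๋Š” ์Šค์ผ€์น˜(๊ฐ€์ •: ์‹ค์ œ train ๋ฐ์ดํ„ฐ๊ฐ€ ์•„๋‹Œ ์ž„์˜์˜ 2์ฐจ์› ์ขŒํ‘œ).

```python
from sklearn import svm

# ๊ฐ€์ •: ์„ ํ˜• ๋ถ„๋ฆฌ ๊ฐ€๋Šฅํ•œ ์ž‘์€ ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ
X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [0, 0, 0, 1, 1, 1]

hard = svm.SVC(kernel='linear', C=100).fit(X, y)   # ํฐ C: ์ข์€ ๋งˆ์ง„, ๋งˆ์ง„ ์˜ค๋ฅ˜ ํ—ˆ์šฉ ์ ์Œ
soft = svm.SVC(kernel='linear', C=0.01).fit(X, y)  # ์ž‘์€ C: ๋„“์€ ๋งˆ์ง„, ๋งˆ์ง„ ์˜ค๋ฅ˜ ํ—ˆ์šฉ ๋งŽ์Œ

# ๋งˆ์ง„์ด ๋„“์–ด์งˆ์ˆ˜๋ก ๋งˆ์ง„ ์•ˆ์ชฝ์— ๋“ค์–ด์˜ค๋Š” support vector ์ˆ˜๊ฐ€ ๋Š˜์–ด๋‚จ
print(sum(hard.n_support_), sum(soft.n_support_))
```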

### Radial Support Vector Machines(rbf-SVM)

model = svm.SVC(kernel = 'rbf',C = 1,gamma = 0.1) # ๋ชจ๋ธ ๊ฐ์ฒด ์ƒ์„ฑ

model.fit(train_X,train_Y) # ํ•™์Šต
prediction1 = model.predict(test_X) # ์˜ˆ์ธก
print('Accuracy for rbf SVM is ',metrics.accuracy_score(prediction1,test_Y)) # ํ‰๊ฐ€
Accuracy for rbf SVM is  0.835820895522388
### Linear Support Vector Machine(linear-SVM)

model = svm.SVC(kernel = 'linear',C = 0.1,gamma = 0.1)
model.fit(train_X,train_Y)
prediction2=model.predict(test_X)
print('Accuracy for linear SVM is',metrics.accuracy_score(prediction2,test_Y))
Accuracy for linear SVM is 0.8171641791044776

Logistic Regression

model = LogisticRegression()
model.fit(train_X,train_Y)
prediction3 = model.predict(test_X)
print('The accuracy of the Logistic Regression is',metrics.accuracy_score(prediction3,test_Y))
The accuracy of the Logistic Regression is 0.8134328358208955

๊ฒฐ์ • ํŠธ๋ฆฌ

model = DecisionTreeClassifier()
model.fit(train_X,train_Y)
prediction4 = model.predict(test_X)
print('The accuracy of the Decision Tree is',metrics.accuracy_score(prediction4,test_Y))
The accuracy of the Decision Tree is 0.8059701492537313

K-Nearest Neighbours(KNN)

model = KNeighborsClassifier() 
model.fit(train_X,train_Y)
prediction5 = model.predict(test_X)
print('The accuracy of the KNN is',metrics.accuracy_score(prediction5,test_Y))
The accuracy of the KNN is 0.8134328358208955
### n_neignbors ๊ฐ’์„ ๋ณ€๊ฒฝํ•˜๋ฉฐ KNN ๋ชจ๋ธ์˜ ์ •ํ™•๋„ ํ™•์ธํ•˜๊ธฐ

a_index = list(range(1,11))
accuracies = [] # ์ •ํ™•๋„ ์ €์žฅ์šฉ list(pandas 2.0๋ถ€ํ„ฐ Series.append๊ฐ€ ์ œ๊ฑฐ๋˜์–ด list์— ๋ชจ์€ ๋’ค Series๋กœ ๋ณ€ํ™˜)

for i in a_index:
    model = KNeighborsClassifier(n_neighbors = i)
    model.fit(train_X, train_Y)
    prediction = model.predict(test_X)
    accuracies.append(metrics.accuracy_score(prediction,test_Y))

a = pd.Series(accuracies, index = a_index)

plt.plot(a_index, a)
plt.xticks(a_index)
fig = plt.gcf()
fig.set_size_inches(12,6)
plt.show()

print()
print('Accuracies for different values of n are:',a.values,'with the max value as ',a.values.max())


Accuracies for different values of n are: [0.73134328 0.76119403 0.79477612 0.80597015 0.81343284 0.80223881
 0.82835821 0.83208955 0.84701493 0.82835821] with the max value as  0.8470149253731343

Gaussian Naive Bayes

model = GaussianNB()
model.fit(train_X,train_Y)
prediction6 = model.predict(test_X)
print('The accuracy of the NaiveBayes is',metrics.accuracy_score(prediction6,test_Y))
The accuracy of the NaiveBayes is 0.8134328358208955

Random Forests

model = RandomForestClassifier(n_estimators = 100)
model.fit(train_X,train_Y)
prediction7 = model.predict(test_X)
print('The accuracy of the Random Forests is',metrics.accuracy_score(prediction7,test_Y))
The accuracy of the Random Forests is 0.8059701492537313
  • ๋ชจ๋ธ์˜ ์ •ํ™•์„ฑ์ด ๋ถ„๋ฅ˜๊ธฐ์˜ ์„ฑ๋Šฅ์„ ๊ฒฐ์ •ํ•˜๋Š” ์œ ์ผํ•œ ์š”์†Œ๋Š” ์•„๋‹˜

    • ๋ถ„๋ฅ˜๊ธฐ๊ฐ€ ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ๋ชจ๋“  ์ธ์Šคํ„ด์Šค๋กœ ํ•™์Šต๋˜๋Š” ๊ฒƒ์€ ์•„๋‹˜

    • train ๋ฐ test ๋ฐ์ดํ„ฐ๊ฐ€ ๋ณ€๊ฒฝ๋จ์— ๋”ฐ๋ผ ์ •ํ™•๋„๋„ ๋ณ€๊ฒฝ๋จ

๋ชจํ˜• ๋ถ„์‚ฐ(model variance)

  • ์ผ๋ฐ˜ํ™”๋œ ๋ชจ๋ธ์„ ์–ป๊ธฐ ์œ„ํ•ด ๊ต์ฐจ ๊ฒ€์ฆ์„ ํ™œ์šฉ

3-2. ๊ต์ฐจ ๊ฒ€์ฆ(Cross Validation)

  • ๋งŽ์€ ๊ฒฝ์šฐ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ถˆ๊ท ํ˜•ํ•จ

    • ์ตœ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ๋ชจ๋“  ์ธ์Šคํ„ด์Šค์—์„œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ trainํ•˜๊ณ  test ํ•ด์•ผ ํ•จ

    • ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๋Œ€ํ•ด ์•Œ๋ ค์ง„ ๋ชจ๋“  ์ •ํ™•๋„์˜ ํ‰๊ท ์„ ์–ป๊ธฐ

K-Fold ๊ต์ฐจ ๊ฒ€์ฆ

  • ๋จผ์ € ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ k-๋ถ€๋ถ„ ์ง‘ํ•ฉ์œผ๋กœ ๋‚˜๋ˆ„๊ธฐ

  • ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ (k = 5) ๋ถ€๋ถ„์œผ๋กœ ๋‚˜๋ˆˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๋ฉด test๋ฅผ ์œ„ํ•ด 1๊ฐœ์˜ ๋ถ€๋ถ„์„ ์ •ํ•˜๊ณ  4๊ฐœ์˜ ๋ถ€๋ถ„์— ๊ฑธ์ณ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ train

  • ๊ฐ ๋ฐ˜๋ณต์—์„œ test ๋ถ€๋ถ„์„ ๋ณ€๊ฒฝํ•˜๊ณ  ๋‹ค๋ฅธ ๋ถ€๋ถ„์— ๋Œ€ํ•ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ›ˆ๋ จ

    • ์ดํ›„ ๊ฐ ๋ฐ˜๋ณต๋งˆ๋‹ค ์–ป์€ ์ •ํ™•๋„์™€ ์˜ค์ฐจ๋ฅผ ํ‰๊ท 
  • ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ผ๋ถ€ train ๋ฐ์ดํ„ฐ์— ์ ํ•ฉํ•˜์ง€ ์•Š๊ฑฐ๋‚˜(underfitting) ์ง€๋‚˜์น˜๊ฒŒ ์ ํ•ฉ๋˜๋Š” ๊ฒƒ(overfitting) ๋ฐฉ์ง€

### import libraries
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score # score evaluation
from sklearn.model_selection import cross_val_predict # prediction

kfold = KFold(n_splits = 10, random_state = 22, shuffle = True) # k=10, split the data into 10 equal parts
xyz = []
accuracy = []
std = []
classifiers = ['Linear Svm','Radial Svm','Logistic Regression','KNN',
               'Decision Tree','Naive Bayes','Random Forest']

models = [svm.SVC(kernel='linear'),svm.SVC(kernel='rbf'),LogisticRegression(),
          KNeighborsClassifier(n_neighbors=9),DecisionTreeClassifier(),GaussianNB(),
          RandomForestClassifier(n_estimators=100)]
for i in models:
    model = i
    cv_result = cross_val_score(model,X,Y, cv = kfold,scoring = "accuracy") # ๊ต์ฐจ ๊ฒ€์ฆ ์ˆ˜ํ–‰
    xyz.append(cv_result.mean())
    std.append(cv_result.std())
    accuracy.append(cv_result)

new_models_dataframe2 = pd.DataFrame({'CV Mean':xyz,'Std':std},index=classifiers)       
new_models_dataframe2
CV Mean Std
Linear Svm 0.784607 0.057841
Radial Svm 0.828377 0.057096
Logistic Regression 0.799176 0.040154
KNN 0.808140 0.035630
Decision Tree 0.805855 0.042848
Naive Bayes 0.795843 0.054861
Random Forest 0.811486 0.049518
### ์‹œ๊ฐํ™”

plt.subplots(figsize = (12,6))
box = pd.DataFrame(accuracy,index = classifiers)
box.T.boxplot()
<Axes: >

new_models_dataframe2['CV Mean'].plot.barh(width = 0.8)

plt.title('Average CV Mean Accuracy')
fig = plt.gcf()
fig.set_size_inches(8,5)
plt.show()

  • ๋ถ„๋ฅ˜ ์ •ํ™•๋„๋Š” ๋ถˆ๊ท ํ˜•์œผ๋กœ ์ธํ•ด ๋•Œ๋•Œ๋กœ ์˜คํ•ด์˜ ์†Œ์ง€๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ์Œ

  • ๋ชจ๋ธ์ด ์–ด๋””์„œ ์ž˜๋ชป๋˜์—ˆ๋Š”์ง€, ๋˜๋Š” ๋ชจ๋ธ์ด ์–ด๋–ค ํด๋ž˜์Šค๋ฅผ ์ž˜๋ชป ์˜ˆ์ธกํ–ˆ๋Š”์ง€๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ์˜ค์ฐจ ํ–‰๋ ฌ(confusion matrix)์˜ ๋„์›€์œผ๋กœ ์š”์•ฝ๋œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Œ

์˜ค์ฐจ ํ–‰๋ ฌ

  • ๋ถ„๋ฅ˜๊ธฐ์— ์˜ํ•ด ๋งŒ๋“ค์–ด์ง„ ์ •ํ™•ํ•œ ๋ถ„๋ฅ˜์™€ ์ž˜๋ชป๋œ ๋ถ„๋ฅ˜์˜ ์ˆ˜๋ฅผ ์ œ๊ณต
f,ax = plt.subplots(3,3,figsize=(12,10))

y_pred = cross_val_predict(svm.SVC(kernel = 'rbf'),X,Y,cv = 10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,0],annot=True,fmt='2.0f')
ax[0,0].set_title('Matrix for rbf-SVM')

y_pred = cross_val_predict(svm.SVC(kernel='linear'),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,1],annot=True,fmt='2.0f')
ax[0,1].set_title('Matrix for Linear-SVM')

y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=9),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,2],annot=True,fmt='2.0f')
ax[0,2].set_title('Matrix for KNN')

y_pred = cross_val_predict(RandomForestClassifier(n_estimators=100),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,0],annot=True,fmt='2.0f')
ax[1,0].set_title('Matrix for Random-Forests')

y_pred = cross_val_predict(LogisticRegression(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,1],annot=True,fmt='2.0f')
ax[1,1].set_title('Matrix for Logistic Regression')

y_pred = cross_val_predict(DecisionTreeClassifier(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,2],annot=True,fmt='2.0f')
ax[1,2].set_title('Matrix for Decision Tree')

y_pred = cross_val_predict(GaussianNB(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[2,0],annot=True,fmt='2.0f')
ax[2,0].set_title('Matrix for Naive Bayes')

plt.subplots_adjust(hspace=0.2,wspace=0.2)
plt.show()

โœ” ์˜ค์ฐจ ํ–‰๋ ฌ ํ•ด์„

  • ์™ผ์ชฝ ๋Œ€๊ฐ์„ : ๊ฐ ํด๋ž˜์Šค์— ๋Œ€ํ•ด ์ˆ˜ํ–‰๋œ ์ •ํ™•ํ•œ ์˜ˆ์ธก์˜ ์ˆ˜/ ์˜ค๋ฅธ์ชฝ ๋Œ€๊ฐ์„ : ์ž˜๋ชป๋œ ์˜ˆ์ธก์˜ ์ˆ˜๋ฅผ ํ‘œ์‹œ

  • rbf-SVM์— ๋Œ€ํ•œ ์ฒซ ๋ฒˆ์งธ ๊ทธ๋ฆผ ํ•ด์„

  • ์ •ํ™•ํ•œ ์˜ˆ์ธก์˜ ์ˆ˜๋Š” 491(์‚ฌ๋ง์ž์˜ ๊ฒฝ์šฐ) + 247(์ƒ์กด์ž์˜ ๊ฒฝ์šฐ)์ด๋ฉฐ ํ‰๊ท  CV ์ •ํ™•๋„๋Š” (491+247)/891 = 82.8%

  • ์˜ค๋ฅ˜ -> 58๋ช…์˜ ์‚ฌ๋ง์ž๋ฅผ ์ƒ์กด์ž๋กœ, 95๋ช…์˜ ์ƒ์กด์ž๋ฅผ ์‚ฌ๋ง์ž๋กœ ์ž˜๋ชป ๋ถ„๋ฅ˜

  • ๋ชจ๋“  ํ–‰๋ ฌ์„ ์‚ดํŽด๋ณธ ํ›„, ์šฐ๋ฆฌ๋Š” rbf-SVM์ด ์‚ฌ๋งํ•œ ์Šน๊ฐ์„ ์ •ํ™•ํ•˜๊ฒŒ ์˜ˆ์ธกํ•  ๊ฐ€๋Šฅ์„ฑ์ด ๋” ๋†’์ง€๋งŒ, ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ๋Š” ์ƒ์กดํ•œ ์Šน๊ฐ์„ ์ •ํ™•ํ•˜๊ฒŒ ์˜ˆ์ธกํ•  ๊ฐ€๋Šฅ์„ฑ์ด ๋” ๋†’๋‹ค๊ณ  ๋งํ•  ์ˆ˜ ์žˆ์Œ

โœ” ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹

  • ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ: ์„œ๋กœ ๋‹ค๋ฅธ ๋ถ„๋ฅ˜๊ธฐ์— ๋Œ€ํ•œ ์„œ๋กœ ๋‹ค๋ฅธ ๋งค๊ฐœ ๋ณ€์ˆ˜

  • ์ด๋ฅผ ์กฐ์ •ํ•˜์—ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ•™์Šต ์†๋„ ๋ณ€๊ฒฝ ๋“ฑ ๋” ๋‚˜์€ ๋ชจ๋ธ์„ ์–ป์„ ์ˆ˜ ์žˆ์Œ => ํŠœ๋‹

### SVM ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹

from sklearn.model_selection import GridSearchCV

C = [0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
gamma = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
kernel = ['rbf','linear']
hyper = {'kernel':kernel,'C':C,'gamma':gamma}

# ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹
gd = GridSearchCV(estimator = svm.SVC(),param_grid = hyper,verbose = True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Fitting 5 folds for each of 240 candidates, totalling 1200 fits
0.8282593685267716
SVC(C=0.4, gamma=0.3)
  • C = 0.4, gamma = 0.3์ผ ๋•Œ ์ •ํ™•๋„๊ฐ€ 82.82%๋กœ ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ
### RandomForest ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹

n_estimators = range(100,1000,100)
hyper = {'n_estimators':n_estimators}

gd = GridSearchCV(estimator=RandomForestClassifier(random_state=0),param_grid=hyper,verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Fitting 5 folds for each of 9 candidates, totalling 45 fits
0.819327098110602
RandomForestClassifier(n_estimators=300, random_state=0)
  • Accuracy is best at about 81.9% with n_estimators = 300

3-3. ์•™์ƒ๋ธ”(Ensembling)

  • ๋‹ค์–‘ํ•œ ๋‹จ์ˆœํ•œ ๋ชจ๋ธ๋“ค์ด ๊ฒฐํ•ฉํ•˜์—ฌ ํ•˜๋‚˜์˜ ๊ฐ•๋ ฅํ•œ ๋ชจ๋ธ์„ ๋งŒ๋“œ๋Š” ๊ฒƒ

    • ๋ชจ๋ธ์˜ ์ •ํ™•๋„๋‚˜ ์„ฑ๋Šฅ์„ ๋†’์ด๋Š” ์ข‹์€ ๋ฐฉ๋ฒ•
  • ๋ฐฉ๋ฒ•

    • Voting Classifier

    • Bagging

    • Boosting

VotingClassifier

  • ๋‹ค์–‘ํ•œ ๊ธฐ๊ณ„ ํ•™์Šต ๋ชจ๋ธ์˜ ์˜ˆ์ธก์„ ๊ฒฐํ•ฉํ•˜๋Š” ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•

  • ๋ชจ๋“  ํ•˜์œ„ ๋ชจ๋ธ์˜ ์˜ˆ์ธก์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ‰๊ท  ์˜ˆ์ธก ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณต

    • ์ด๋•Œ ํ•˜์œ„ ๋ชจ๋ธ ๋˜๋Š” ๊ธฐ๋ณธ ๋ชจ๋ธ์€ ๋ชจ๋‘ ๋‹ค๋ฅธ ์œ ํ˜•
from sklearn.ensemble import VotingClassifier

ensemble_lin_rbf = VotingClassifier(estimators = [('KNN',KNeighborsClassifier(n_neighbors=10)),
                                              ('RBF',svm.SVC(probability=True,kernel='rbf',C=0.5,gamma=0.1)),
                                              ('RFor',RandomForestClassifier(n_estimators=500,random_state=0)),
                                              ('LR',LogisticRegression(C=0.05)),
                                              ('DT',DecisionTreeClassifier(random_state=0)),
                                              ('NB',GaussianNB()),
                                              ('svm',svm.SVC(kernel='linear',probability=True))
                                             ], 
                       voting = 'soft').fit(train_X,train_Y) # soft voting
print('The accuracy for ensembled model is:',ensemble_lin_rbf.score(test_X,test_Y))
cross = cross_val_score(ensemble_lin_rbf,X,Y, cv = 10, scoring = "accuracy")
print('The cross validated score is',cross.mean())
The accuracy for ensembled model is: 0.8171641791044776
The cross validated score is 0.8249188514357053
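
The `voting='soft'` option above averages class probabilities instead of counting 0/1 votes; a toy example (hypothetical probabilities, not from the notebook) shows how the two schemes can disagree:

```python
import numpy as np

# hypothetical class-1 probabilities from three sub-models for one passenger
probs = np.array([0.9, 0.4, 0.45])

# soft voting: average the probabilities, then threshold at 0.5
soft_vote = int(probs.mean() >= 0.5)

# hard voting: each model casts a 0/1 vote, the majority wins
hard_vote = int((probs >= 0.5).sum() > len(probs) / 2)

print(soft_vote, hard_vote)  # prints "1 0": one confident model outweighs two weak ones
```

This is why soft voting is usually preferred when the sub-models expose calibrated probabilities (`probability=True` for the SVMs above).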

Bagging

  • A general ensembling method

  • Works by applying similar classifiers to small partitions (subsets) of the dataset and then averaging all their predictions

  • The averaging reduces variance
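
The variance-reduction claim can be illustrated numerically: averaging n independent noisy predictors shrinks variance by roughly a factor of n (a synthetic sketch, unrelated to the Titanic data):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 simulations of 10 noisy "model predictions" scattered around a true value of 1.0
preds = rng.normal(loc=1.0, scale=0.5, size=(1000, 10))

single_var = preds[:, 0].var()         # variance of one model's predictions
bagged_var = preds.mean(axis=1).var()  # variance of the 10-model average

print(single_var / bagged_var)  # roughly 10: averaging ~n models cuts variance by ~n
```

Real bagged models are not fully independent (they share training data), so the reduction in practice is smaller, but the direction is the same.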

โœ” Baged KNN

  • ๋ฐฐ๊น…์€ ๋ถ„์‚ฐ์ด ๋†’์€ ๋ชจํ˜•์—์„œ ๊ฐ€์žฅ ์ž˜ ์ž‘๋™

ex> ์˜์‚ฌ ๊ฒฐ์ • ํŠธ๋ฆฌ, ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ

  • n_neighbors๋กœ ์ž‘์€ ๊ฐ’์„ ๊ฐ€์ง€๋Š” KNN๊ณผ ํ•จ๊ป˜ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Œ
from sklearn.ensemble import BaggingClassifier

# note: in scikit-learn >= 1.2 the base_estimator argument is renamed to estimator
model = BaggingClassifier(base_estimator = KNeighborsClassifier(n_neighbors=3),random_state=0,n_estimators=700)
model.fit(train_X,train_Y)
prediction=model.predict(test_X)
print('The accuracy for bagged KNN is:',metrics.accuracy_score(prediction,test_Y))

result = cross_val_score(model,X,Y,cv = 10,scoring='accuracy')
print('The cross validated score for bagged KNN is:',result.mean())
The accuracy for bagged KNN is: 0.832089552238806
The cross validated score for bagged KNN is: 0.8104244694132333

โœ” Bagged DecisionTree

model = BaggingClassifier(base_estimator=DecisionTreeClassifier(),random_state=0,n_estimators=100)
model.fit(train_X,train_Y)
prediction = model.predict(test_X)
print('The accuracy for bagged Decision Tree is:',metrics.accuracy_score(prediction,test_Y))

result = cross_val_score(model,X,Y,cv = 10,scoring = 'accuracy')
print('The cross validated score for bagged Decision Tree is:',result.mean())
The accuracy for bagged Decision Tree is: 0.8208955223880597
The cross validated score for bagged Decision Tree is: 0.8171410736579275

Boosting

  • An ensemble technique that uses sequential learning of classifiers

    • A weak model is improved step by step
  • It first trains on the whole dataset

  • Some instances are predicted correctly, while others are not

  • In the next iteration, the learner concentrates on (gives more weight to) the wrongly predicted instances and tries to correct them

    • New classifiers are added and trained until the accuracy reaches its limit

โœ” AdaBoost(Adaptive Boosting)

  • ์•ฝํ•œ ๋ถ„๋ฅ˜๊ธฐ: ์˜์‚ฌ ๊ฒฐ์ • ํŠธ๋ฆฌ

    • base_estimator ์˜ต์…˜์—์„œ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ๋ณ€๊ฒฝ ๊ฐ€๋Šฅ
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators = 200,random_state = 0,learning_rate = 0.1)
result = cross_val_score(ada,X,Y,cv = 10,scoring = 'accuracy')
print('The cross validated score for AdaBoost is:',result.mean())
The cross validated score for AdaBoost is: 0.8249188514357055

โœ” Stochastic Gradient Boosting

  • ์•ฝํ•œ ๋ถ„๋ฅ˜๊ธฐ: ์˜์‚ฌ ๊ฒฐ์ • ํŠธ๋ฆฌ
from sklearn.ensemble import GradientBoostingClassifier

grad = GradientBoostingClassifier(n_estimators=500,random_state=0,learning_rate=0.1)
result = cross_val_score(grad,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for Gradient Boosting is:',result.mean())
The cross validated score for Gradient Boosting is: 0.8115230961298376

โœ” XGBoost

import xgboost as xg

xgboost = xg.XGBClassifier(n_estimators=900,learning_rate=0.1)
result = cross_val_score(xgboost,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for XGBoost is:',result.mean())
The cross validated score for XGBoost is: 0.8160299625468165
  • AdaBoost์—์„œ ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ
### AdaBoost hyperparameter tuning

n_estimators = list(range(100,1100,100))
learn_rate = [0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
hyper = {'n_estimators':n_estimators,'learning_rate':learn_rate}

gd = GridSearchCV(estimator=AdaBoostClassifier(),param_grid=hyper,verbose=True)
gd.fit(X,Y)

print(gd.best_score_)
print(gd.best_estimator_)
Fitting 5 folds for each of 120 candidates, totalling 600 fits
0.8293892411022534
AdaBoostClassifier(learning_rate=0.1, n_estimators=100)
  • Accuracy is best at about 82.94% with learning_rate = 0.1 and n_estimators = 100

Confusion matrix for the best model

ada = AdaBoostClassifier(n_estimators=100,random_state=0,learning_rate=0.1)
result = cross_val_predict(ada,X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,result),cmap = 'winter',annot=True,fmt='2.0f')
plt.show()

3-4. Feature Importance

f,ax = plt.subplots(2,2,figsize=(15,12))

### RandomForest
model = RandomForestClassifier(n_estimators=500,random_state=0) 
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,0])
ax[0,0].set_title('Feature Importance in Random Forests')

### AdaBoost
model=AdaBoostClassifier(n_estimators=100,learning_rate=0.1,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,1],color='#ddff11')
ax[0,1].set_title('Feature Importance in AdaBoost')

### Gradient Boosting
model=GradientBoostingClassifier(n_estimators=500,learning_rate=0.1,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,0],cmap='RdYlGn_r')
ax[1,0].set_title('Feature Importance in Gradient Boosting')

### XGBoost
model=xg.XGBClassifier(n_estimators=900,learning_rate=0.1)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,1],color='#FD0F00')
ax[1,1].set_title('Feature Importance in XgBoost')

plt.show()

โœ” Observations

  • ๊ณตํ†ต์ ์œผ๋กœ ์ค‘์š”ํ•˜๋‹ค๊ณ  ๋‚˜ํƒ€๋‚˜๋Š” feature๋“ค: Initial, Fare_cat, Pclass, Family_Size

  • Sex ๊ธฐ๋Šฅ์€ ์ค‘์š”ํ•˜์ง€ ์•Š์€ feature๋กœ ๋ณด์ž„

    • RandomForest์—์„œ๋งŒ Sex๊ฐ€ ์ค‘์š”ํ•ด ๋ณด์ž„

    • ๊ทธ๋Ÿฌ๋‚˜ ๋งŽ์€ ๋ถ„๋ฅ˜๊ธฐ์—์„œ ๋งจ ์œ„์— Initial feature๊ฐ€ ์žˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ, ์ด๋Š” ๋‘˜ ๋‹ค ์„ฑ๋ณ„์„ ์–ธ๊ธ‰
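
As a caveat, the impurity-based `feature_importances_` plotted above can be biased toward features with many possible split points; scikit-learn's permutation importance is a model-agnostic cross-check (shown here on synthetic data standing in for the notebook's X, Y):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# synthetic data standing in for the notebook's X, Y
Xt, yt = make_classification(n_samples=300, n_features=5, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xt, yt)

# shuffle each column in turn and measure the resulting drop in accuracy
perm = permutation_importance(rf, Xt, yt, n_repeats=5, random_state=0)
print(perm.importances_mean)  # one mean importance score per feature
```

Running the same check on the actual X, Y would tell us whether Sex really matters once its overlap with Initial is accounted for.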
