0. Introduction

  • A project that predicts the probability that a driver will file an insurance claim in the next year

  • Uses plot.ly, an interactive visualization library for Python

  • plot.ly features used in this notebook

    • Simple vertical barplot: used to inspect the distribution of the target variable

    • Correlation heatmap: checks the correlations among the features

    • Scatter plot: compares the feature importances produced by the RandomForest and GradientBoosting models

    • Horizontal barplot: lists the features in descending order of importance

    • 3D scatter plot

📌 Goals of this notebook

  1. Data quality check: visualize and evaluate all missing/null values (values equal to -1)

  2. Feature inspection and filtering

  • Correlation and feature mutual information plots against the target variable

  • Inspection of binary, categorical, and other variables

  • Feature importance ranking via learning models

  • Building Random Forest and Gradient Boosted models that rank the features based on what they learn during training

1. Import Libraries & Data Loading

Notes on rendering plotly output

Reference

### Import the required libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected = True)
import plotly.graph_objs as go
import plotly.tools as tls

import plotly.io as pio
pio.renderers.default = "svg"

import chart_studio
chart_studio.tools.set_credentials_file(username = 'username', api_key = 'api_key')

import warnings
from collections import Counter
from sklearn.feature_selection import mutual_info_classif
warnings.filterwarnings('ignore')
### Function that lets plotly figures render in a Colab notebook

def configure_plotly_browser_state():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))
### Load the data

train = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/2주차/data/train.csv")
train.head()
id target ps_ind_01 ps_ind_02_cat ps_ind_03 ps_ind_04_cat ps_ind_05_cat ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ... ps_calc_11 ps_calc_12 ps_calc_13 ps_calc_14 ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
0 7 0 2 2 5 1 0 0 1 0 ... 9 1 5 8 0 1 1 0 0 1
1 9 0 1 1 7 0 0 0 0 1 ... 3 1 1 9 0 1 1 0 1 0
2 13 0 5 4 9 1 0 0 0 1 ... 4 2 7 7 0 1 1 0 1 0
3 16 0 0 1 2 0 0 1 0 0 ... 2 2 4 9 0 0 0 0 0 0
4 17 0 0 2 0 1 0 1 0 0 ... 3 1 1 3 0 0 0 1 1 0

5 rows × 59 columns

### Count the rows and columns of the train data

rows = train.shape[0]
columns = train.shape[1]
print("The train dataset contains {0} rows and {1} columns".format(rows, columns))
The train dataset contains 595212 rows and 59 columns

2. Data Quality Check

2-1. Checking for Null/Missing Values

### Check for null values
# Apply any() twice so that the isnull check covers every column

train.isnull().any().any()
False
  • The null check returns False, but keep in mind that -1 also denotes a missing value in this dataset
### Check for missing values
# Replace every -1 with NaN so that the columns containing missing values are easy to identify

train_copy = train.copy()
train_copy = train_copy.replace(-1, np.nan) # -1 -> NaN
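
Once -1 has been turned into NaN, the affected columns can be listed directly. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not the real train data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; -1 marks a missing value, as in the train data.
toy = pd.DataFrame({
    "ps_ind_01": [2, 1, 5, 0],
    "ps_reg_03": [0.7, -1, -1, 0.6],
    "ps_car_03_cat": [-1, -1, 1, -1],
})

toy_nan = toy.replace(-1, np.nan)          # -1 -> NaN
missing_counts = toy_nan.isnull().sum()    # number of missing values per column
missing_cols = missing_counts[missing_counts > 0].index.tolist()

print(missing_counts)
print(missing_cols)
```

The same two lines applied to train_copy would give the per-column counts for the real data.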

2-2. Visualizing Missing Values

  • Uses the missingno package
import missingno as msno

# Null or missing values by column
msno.matrix(df = train_copy.iloc[:,2:39], figsize = (20, 14), color = (0.42, 0.1, 0.05))


  • Blank white bands superimposed on the vertical dark red bands (non-missing data) mark the missing values in a given column

  • In this plot, 7 of the 59 features can be seen to contain null values

    • In fact, 13 columns in total contain missing values

      • A missingno matrix plot can comfortably fit only about 40 features in one figure, so some columns are left out

      • To visualize every null, change the figsize argument and split the DataFrame

  • The 7 columns with nulls that are visible in the plot:

    • ps_ind_05_cat

    • ps_reg_03

    • ps_car_03_cat

    • ps_car_05_cat

    • ps_car_07_cat

    • ps_car_09_cat

    • ps_car_14

  • Most of the missing values occur in columns with the _cat suffix

  • In ps_reg_03, ps_car_03_cat, and ps_car_05_cat, a large share of the values is missing

    -> A blanket replacement of the nulls with -1 therefore does not look like a good strategy
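
Splitting the DataFrame so that every column gets plotted could be done along these lines. This is only a sketch: `column_chunks` is a hypothetical helper, the chunk size of 30 is an arbitrary choice, and the column names below are placeholders.

```python
def column_chunks(columns, size):
    """Yield consecutive slices of `columns`, each holding at most `size` names."""
    for start in range(0, len(columns), size):
        yield columns[start:start + size]

# Placeholder names standing in for list(train_copy.columns)
cols = [f"col_{i}" for i in range(59)]
chunks = list(column_chunks(cols, 30))
print([len(chunk) for chunk in chunks])

# With the real data, each chunk could then be plotted on its own, e.g.:
# for chunk in column_chunks(list(train_copy.columns), 30):
#     msno.matrix(df=train_copy[chunk], figsize=(20, 14))
```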

2-3. Inspecting the Target Variable

  • The target value, also known as the class, label, or correct answer, is what the learned function should generalize to and predict well

  • In a supervised learning model it is used together with the given data (here, all of the train data except the id column) to learn the function that best maps the data to the target, in the hope that this function also predicts well on new, unseen data

configure_plotly_browser_state() # Call this before rendering each graph.

data = [go.Bar(
            x = train["target"].value_counts().index.values,
            y = train["target"].value_counts().values,
            text ='Distribution of target variable'
    )]

layout = go.Layout(
    title='Target variable distribution'
)

fig = go.Figure(data = data, layout = layout)

py.iplot(fig, filename = 'basic-bar')
  • The bar chart shows that the target variable has a highly imbalanced distribution
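
The degree of imbalance can also be expressed as class proportions. A minimal sketch, using a made-up target column rather than the real one (the 96/4 split is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical target column: 96 zeros and 4 ones (not the real class ratio).
target = pd.Series([0] * 96 + [1] * 4, name="target")

share = target.value_counts(normalize=True)  # fraction of rows per class
print(share)
```

On the real data, train["target"].value_counts(normalize=True) would give the corresponding proportions.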

2-4. Checking Data Types (dtype)

  • Check which data types make up the train data

    • Integer, character, or float
  • To count the unique types in a Python sequence, use Counter from the collections module

Counter(train.dtypes.values)
Counter({dtype('int64'): 49, dtype('float64'): 10})
  • The train data has 59 columns in total

    • They consist of just 2 data types: integer and float
  • Column names carry suffixes such as _bin and _cat

    • _bin: binary feature

    • _cat: categorical feature

    • no suffix: continuous or ordinal feature (reg, as in ps_reg_01, is a feature-group tag rather than a type suffix)

train_float = train.select_dtypes(include = ['float64'])
train_int = train.select_dtypes(include = ['int64'])
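
Besides splitting by dtype, the columns can be grouped by the naming convention described above. A small sketch, where `feature_kind` is a hypothetical helper and the column list is only a sample of the real names:

```python
from collections import Counter

def feature_kind(name):
    """Classify a column by its suffix (hypothetical helper)."""
    if name.endswith("_bin"):
        return "binary"
    if name.endswith("_cat"):
        return "categorical"
    return "continuous/ordinal"

# A handful of column names taken from the train.head() output
cols = ["ps_ind_01", "ps_ind_02_cat", "ps_ind_06_bin",
        "ps_reg_03", "ps_car_13", "ps_calc_15_bin"]

kinds = Counter(feature_kind(c) for c in cols)
print(kinds)
```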

2-5. Correlation Plots

a) Visualizing correlations among the float features

  • Uses sns.heatmap()

  • pandas DataFrames have a built-in corr() method that computes the Pearson correlation

colormap = plt.cm.magma
plt.figure(figsize = (16,12))
plt.title('Pearson correlation of continuous features', y = 1.05, size = 15)
sns.heatmap(train_float.corr(),linewidths = 0.1,vmax = 1.0, square = True, 
            cmap = colormap, linecolor = 'white', annot = True)


  • Most pairs of features have a correlation of zero or close to zero

  • Pairs of features with a positive linear correlation

    • (ps_reg_01, ps_reg_03)

    • (ps_reg_02, ps_reg_03)

    • (ps_car_12, ps_car_13)

    • (ps_car_13, ps_car_15)
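
Pairs like these can also be pulled out of the correlation matrix programmatically instead of being read off the heatmap. A minimal sketch on synthetic data (the columns `a`, `b`, `c` are made up; with the real data, `train_float.corr()` would take the place of `toy.corr()`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),  # strongly tied to a
    "c": rng.normal(size=200),            # independent noise
})

corr = toy.corr()
# Keep only the strict upper triangle so each pair appears once,
# then rank the pairs by absolute correlation.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```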

b) Visualizing correlations among the integer features

  • Use the Plotly library to generate an interactive heatmap of the correlation values

  • As with the earlier plotly plot, create the heatmap object by calling go.Heatmap

    • The x and y axes take the column names

    • The correlation values are supplied on the z axis

### Static heatmap

train_int = train_int.drop(["id", "target"], axis = 1)
colormap = plt.cm.bone
plt.figure(figsize = (21,16))
plt.title('Pearson correlation of integer features', y = 1.05, size = 15)
sns.heatmap(train_int.corr(),linewidths = 0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=False)


### Create the interactive heatmap

configure_plotly_browser_state() # Call this before rendering each graph.

data = [
    go.Heatmap(
        z= train_int.corr().values,
        x=train_int.columns.values,
        y=train_int.columns.values,
        colorscale='Viridis',
        reversescale = False,
        opacity = 1.0 )
]

layout = go.Layout(
    title='Pearson Correlation of Integer-type features',
    xaxis = dict(ticks='', nticks=36),
    yaxis = dict(ticks='' ),
    width = 900, height = 700)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='labelled-heatmap')
  • Many cells in the correlation plot have a value of 0

    • Many features have no linear correlation with one another at all
  • This matters for dimensionality-reduction transforms such as principal component analysis (PCA), which need some degree of correlation among the features

2-6. Mutual Information Plot

  • A useful tool for inspecting the mutual information between the target variable and each feature it is computed against

  • For classification problems, call sklearn's mutual_info_classif() function

    • It relies on nonparametric methods based on entropy estimation from k-nearest-neighbor distances

    • It measures the dependency between two random variables

      • Values range from 0 (the random variables are independent) to higher values (indicating some dependency)
    • This gives an idea of how much information about the target may be contained in each feature

mf = mutual_info_classif(train_float.values, train.target.values,
                         n_neighbors = 3, random_state = 17)
print(mf)
[0.02599971 0.00767074 0.00617141 0.01855302 0.00158483 0.00338192
 0.01668813 0.0134428  0.01334669 0.01348572]
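
To make the "0 means independent" end of the scale concrete, here is a small sketch that computes mutual information for two discrete sequences straight from their joint counts. This plug-in estimator is a simplification chosen for illustration; it is not the k-nearest-neighbor estimator that mutual_info_classif applies to continuous features, and `discrete_mutual_info` is a hypothetical helper.

```python
from collections import Counter
from math import log

def discrete_mutual_info(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

x = [0, 0, 1, 1] * 25
independent = [0, 1, 0, 1] * 25   # carries no information about x
copy_of_x = list(x)               # fully determined by x

mi_independent = discrete_mutual_info(x, independent)
mi_copy = discrete_mutual_info(x, copy_of_x)
print(mi_independent, mi_copy)
```

The independent pair scores 0, while the exact copy scores log(2), the entropy of a balanced binary variable.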

2-7. Inspecting Binary Variables

  • Binary variable: a variable that takes only one of two values, 1 or 0

  • Store all columns that contain binary values, then generate a vertical barplot as follows

bin_col = [col for col in train.columns if '_bin' in col] # extract the binary columns
zero_list = [] # counts of 0
one_list = [] # counts of 1
for col in bin_col:
    zero_list.append((train[col] == 0).sum())
    one_list.append((train[col] == 1).sum())
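
The loop above can also be written without explicit iteration. A sketch on a hypothetical toy frame, assuming the binary columns contain only 0 and 1 so that the column sum equals the number of ones:

```python
import pandas as pd

# Hypothetical toy frame with two binary columns and one categorical column
toy = pd.DataFrame({
    "a_bin": [0, 1, 0, 0],
    "b_bin": [1, 1, 0, 1],
    "c_cat": [2, 0, 1, 1],
})

bin_col = [col for col in toy.columns if col.endswith("_bin")]
one_counts = toy[bin_col].sum()        # number of 1s per column (values are 0/1 only)
zero_counts = len(toy) - one_counts    # everything else is 0
print(zero_counts.tolist(), one_counts.tolist())
```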
### Visualization

configure_plotly_browser_state() # Call this before rendering each graph.

# 0
trace1 = go.Bar(
    x = bin_col,
    y = zero_list ,
    name = 'Zero count'
)
# 1
trace2 = go.Bar(
    x = bin_col,
    y = one_list,
    name = 'One count'
)

data = [trace1, trace2]
layout = go.Layout(
    barmode = 'stack',
    title = 'Count of 1 and 0 in binary variables'
)

fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename='stacked-bar')
  • In ps_ind_10_bin, ps_ind_11_bin, ps_ind_12_bin, and ps_ind_13_bin, zeros make up by far the largest share of the values

2-8. Inspecting Categorical and Ordinal Variables

✅ Feature importance via Random Forest

  • Fit a RandomForestClassifier to the train data and, once training finishes, look at how the model ranks the features

  • This is a quick way to get useful feature importances: the ensemble model (weak decision-tree learners combined under bootstrap aggregation) needs little parameter tuning and is fairly robust to imbalanced data

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators = 150, max_depth = 8, min_samples_leaf = 4, 
                            max_features = 0.2, n_jobs = -1, random_state = 0)
rf.fit(train.drop(['id', 'target'],axis = 1), train.target)
features = train.drop(['id', 'target'],axis = 1).columns.values
print("----- Training Done -----")
----- Training Done -----

Visualization

  • After training the RandomForest, read the feature_importances_ attribute to obtain the list of feature importances and plot it as a plotly scatter plot

  • Call Scatter and define the y and x axes as in the earlier plotly plots

    • marker attribute: defines and controls the size, color, and color scale of the points
### Visualization (scatter plot)

configure_plotly_browser_state() # Call this before rendering each graph.

trace = go.Scatter(
    y = rf.feature_importances_,
    x = features,
    mode = 'markers',
    marker = dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 13,
        #size= rf.feature_importances_,
        #color = np.random.randn(500), #set color equal to a variable
        color = rf.feature_importances_,
        colorscale='Portland',
        showscale=True
    ),
    text = features
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Random Forest Feature Importance',
    hovermode= 'closest',
     xaxis= dict(
         ticklen= 5,
         showgrid=False,
        zeroline=False,
        showline=False
     ),
    yaxis=dict(
        title= 'Feature Importance',
        showgrid=False,
        zeroline=False,
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')
### Visualization (barplot)

configure_plotly_browser_state() # Call this before rendering each graph.

x, y = (list(x) for x in zip(*sorted(zip(rf.feature_importances_, features), 
                                                            reverse = False)))
trace2 = go.Bar(
    x = x, # feature importance
    y = y, # feature name
    marker = dict(
        color = x,
        colorscale = 'Viridis',
        reversescale = True
    ),
    name = 'Random Forest Feature importance',
    orientation = 'h',
)

layout = dict(
    title = 'Barplot of Feature importances',
     width = 900, height = 2000,
    yaxis = dict(
        showgrid = False,
        showline = False,
        showticklabels = True,
      # domain=[0, 0.85],
    ))

fig1 = go.Figure(data=[trace2])
fig1['layout'].update(layout)

py.iplot(fig1, filename='plots')

Decision Tree Visualization

  • For simplicity, fit a DecisionTreeClassifier(max_depth = 3)

  • Use sklearn's tree.export_graphviz to export the fitted tree in Graphviz format for visualization

from sklearn import tree
from IPython.display import Image as PImage
from subprocess import check_call
from PIL import Image, ImageDraw, ImageFont
import re

decision_tree = tree.DecisionTreeClassifier(max_depth = 3)
decision_tree.fit(train.drop(['id', 'target'],axis=1), train.target)

# Export our trained model as a .dot file
with open("tree1.dot", 'w') as f:
    # export_graphviz writes straight to the open file handle, so its return value is not needed
    tree.export_graphviz(decision_tree,
                         out_file=f,
                         max_depth = 4,
                         impurity = False,
                         feature_names = train.drop(['id', 'target'],axis=1).columns.values,
                         class_names = ['No', 'Yes'],
                         rounded = True,
                         filled= True )
        
#Convert .dot to .png to allow display in web notebook
check_call(['dot','-Tpng','tree1.dot','-o','tree1.png'])

# Annotating chart with PIL
img = Image.open("tree1.png")
draw = ImageDraw.Draw(img)
img.save('sample-out.png')
PImage("sample-out.png",)

✅ Feature importance via Gradient Boosting

  • The model is built in a forward stage-wise fashion: at each stage, regression trees are fit to the gradient of the loss function (which defaults to the deviance, i.e. log loss, in sklearn's implementation)
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators = 100, max_depth = 3, min_samples_leaf = 4, 
                                max_features = 0.2, random_state = 0)
gb.fit(train.drop(['id', 'target'],axis = 1), train.target)
features = train.drop(['id', 'target'],axis = 1).columns.values
print("----- Training Done -----")
----- Training Done -----
### Visualization (scatter plot)

configure_plotly_browser_state() # Call this before rendering each graph.

trace = go.Scatter(
    y = gb.feature_importances_,
    x = features,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 13,
        #size= rf.feature_importances_,
        #color = np.random.randn(500), #set color equal to a variable
        color = gb.feature_importances_,
        colorscale='Portland',
        showscale=True
    ),
    text = features
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Gradient Boosting Machine Feature Importance',
    hovermode= 'closest',
     xaxis= dict(
         ticklen= 5,
         showgrid=False,
        zeroline=False,
        showline=False
     ),
    yaxis=dict(
        title= 'Feature Importance',
        showgrid=False,
        zeroline=False,
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')
### Visualization (barplot)

configure_plotly_browser_state() # Call this before rendering each graph.

x, y = (list(x) for x in zip(*sorted(zip(gb.feature_importances_, features), 
                                                            reverse = False)))
trace2 = go.Bar(
    x=x ,
    y=y,
    marker=dict(
        color=x,
        colorscale = 'Viridis',
        reversescale = True
    ),
    name='Gradient Boosting Classifier Feature importance',
    orientation='h',
)

layout = dict(
    title='Barplot of Feature importances',
     width = 900, height = 2000,
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
    ))

fig1 = go.Figure(data=[trace2])
fig1['layout'].update(layout)
py.iplot(fig1, filename='plots')
  • Both the RandomForest and the GradientBoosting models pick ps_car_13 as the most important feature
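
Beyond the top feature, the two rankings can be compared side by side. The sketch below uses made-up feature names and importance values (they are not results from the models above); with the fitted models, `rf.feature_importances_` and `gb.feature_importances_` indexed by `features` would take their place.

```python
import pandas as pd

features = ["feat_a", "feat_b", "feat_c", "feat_d"]
# Hypothetical importance values, NOT output from the fitted models.
rf_imp = pd.Series([0.40, 0.30, 0.20, 0.10], index=features)
gb_imp = pd.Series([0.35, 0.15, 0.45, 0.05], index=features)

ranks = pd.DataFrame({
    "rf_rank": rf_imp.rank(ascending=False).astype(int),  # 1 = most important
    "gb_rank": gb_imp.rank(ascending=False).astype(int),
})
ranks["rank_gap"] = (ranks["rf_rank"] - ranks["gb_rank"]).abs()
print(ranks.sort_values("rf_rank"))
```

Features with a large rank_gap are the ones on which the two models disagree most.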

3. Conclusion

  • The Porto Seguro dataset was examined fairly extensively by checking null values and data quality and by investigating the linear correlations among the features

  • The distributions of some features were inspected, and a couple of learning models (RandomForest and GradientBoosting classifiers) were fit to identify the features the models consider important
