0. Introduction

  • The goal is to gain useful insights for the Porto Seguro competition

  • Methods for preparing the data for modeling

  • Main sections

    • Data visualization

    • Defining the metadata

    • Descriptive statistics

    • Handling imbalanced classes

    • Data quality checks

    • Exploratory data visualization

    • Feature engineering

    • Feature selection

    • Feature scaling

1. Loading packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import SimpleImputer # version issue
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier

pd.set_option('display.max_columns', 100)

2. Loading data

train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/2주차/data/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/2주차/data/test.csv')

3. Data at first sight

  • Excerpt from the competition's data description

    • Features that belong to similar groupings are tagged as such in the feature names (e.g. ind, reg, car, calc)

    • Feature names include the postfix bin to indicate binary features and cat to indicate categorical features

      • Features without these designations are either continuous or ordinal
    • -1: missing value

  • target column: whether or not a claim was filed for that policy holder
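
The naming convention above can be turned into a small helper. This is only a sketch: `feature_kind` and `feature_group` are hypothetical names introduced here, and the suffix check (`endswith`) is an assumption that is slightly stricter than the substring check used later in the notebook.

```python
def feature_kind(name):
    """Classify a column by the postfix convention of the data description."""
    if name.endswith("_bin"):
        return "binary"
    if name.endswith("_cat"):
        return "categorical"
    return "continuous or ordinal"


def feature_group(name):
    """Return the group tag (ind, reg, car, calc), or None for id/target."""
    parts = name.split("_")
    return parts[1] if len(parts) > 1 and parts[0] == "ps" else None


for col in ["ps_ind_06_bin", "ps_car_11_cat", "ps_reg_03", "ps_calc_04"]:
    print(col, feature_group(col), feature_kind(col))
```

Columns without a `ps_` prefix, such as `id` and `target`, get no group tag.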

∎ Training data

train.head()
id target ps_ind_01 ps_ind_02_cat ps_ind_03 ps_ind_04_cat ps_ind_05_cat ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_14 ps_ind_15 ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_reg_01 ps_reg_02 ps_reg_03 ps_car_01_cat ps_car_02_cat ps_car_03_cat ps_car_04_cat ps_car_05_cat ps_car_06_cat ps_car_07_cat ps_car_08_cat ps_car_09_cat ps_car_10_cat ps_car_11_cat ps_car_11 ps_car_12 ps_car_13 ps_car_14 ps_car_15 ps_calc_01 ps_calc_02 ps_calc_03 ps_calc_04 ps_calc_05 ps_calc_06 ps_calc_07 ps_calc_08 ps_calc_09 ps_calc_10 ps_calc_11 ps_calc_12 ps_calc_13 ps_calc_14 ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
0 7 0 2 2 5 1 0 0 1 0 0 0 0 0 0 0 11 0 1 0 0.7 0.2 0.718070 10 1 -1 0 1 4 1 0 0 1 12 2 0.400000 0.883679 0.370810 3.605551 0.6 0.5 0.2 3 1 10 1 10 1 5 9 1 5 8 0 1 1 0 0 1
1 9 0 1 1 7 0 0 0 0 1 0 0 0 0 0 0 3 0 0 1 0.8 0.4 0.766078 11 1 -1 0 -1 11 1 1 2 1 19 3 0.316228 0.618817 0.388716 2.449490 0.3 0.1 0.3 2 1 9 5 8 1 7 3 1 1 9 0 1 1 0 1 0
2 13 0 5 4 9 1 0 0 0 1 0 0 0 0 0 0 12 1 0 0 0.0 0.0 -1.000000 7 1 -1 0 -1 14 1 1 2 1 60 1 0.316228 0.641586 0.347275 3.316625 0.5 0.7 0.1 2 2 9 1 8 2 7 4 2 7 7 0 1 1 0 1 0
3 16 0 0 1 2 0 0 1 0 0 0 0 0 0 0 0 8 1 0 0 0.9 0.2 0.580948 7 1 0 0 1 11 1 1 3 1 104 1 0.374166 0.542949 0.294958 2.000000 0.6 0.9 0.1 2 4 7 1 8 4 2 2 2 4 9 0 0 0 0 0 0
4 17 0 0 2 0 1 0 1 0 0 0 0 0 0 0 0 9 1 0 0 0.7 0.6 0.840759 11 1 -1 0 -1 14 1 1 2 1 82 3 0.316070 0.565832 0.365103 2.000000 0.4 0.6 0.0 2 2 6 3 10 2 12 3 1 1 3 0 0 0 1 1 0
train.tail()
id target ps_ind_01 ps_ind_02_cat ps_ind_03 ps_ind_04_cat ps_ind_05_cat ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_14 ps_ind_15 ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_reg_01 ps_reg_02 ps_reg_03 ps_car_01_cat ps_car_02_cat ps_car_03_cat ps_car_04_cat ps_car_05_cat ps_car_06_cat ps_car_07_cat ps_car_08_cat ps_car_09_cat ps_car_10_cat ps_car_11_cat ps_car_11 ps_car_12 ps_car_13 ps_car_14 ps_car_15 ps_calc_01 ps_calc_02 ps_calc_03 ps_calc_04 ps_calc_05 ps_calc_06 ps_calc_07 ps_calc_08 ps_calc_09 ps_calc_10 ps_calc_11 ps_calc_12 ps_calc_13 ps_calc_14 ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
595207 1488013 0 3 1 10 0 0 0 0 0 1 0 0 0 0 0 13 1 0 0 0.5 0.3 0.692820 10 1 -1 0 1 1 1 1 0 1 31 3 0.374166 0.684631 0.385487 2.645751 0.4 0.5 0.3 3 0 9 0 9 1 12 4 1 9 6 0 1 1 0 1 1
595208 1488016 0 5 1 3 0 0 0 0 0 1 0 0 0 0 0 6 1 0 0 0.9 0.7 1.382027 9 1 -1 0 -1 15 0 0 2 1 63 2 0.387298 0.972145 -1.000000 3.605551 0.2 0.2 0.0 2 4 8 6 8 2 12 4 1 3 8 1 0 1 0 1 1
595209 1488017 0 1 1 10 0 0 1 0 0 0 0 0 0 0 0 12 1 0 0 0.9 0.2 0.659071 7 1 -1 0 -1 1 1 1 2 1 31 3 0.397492 0.596373 0.398748 1.732051 0.4 0.0 0.3 3 2 7 4 8 0 10 3 2 2 6 0 0 1 0 0 0
595210 1488021 0 5 2 3 1 0 0 0 1 0 0 0 0 0 0 12 1 0 0 0.9 0.4 0.698212 11 1 -1 0 -1 11 1 1 2 1 101 3 0.374166 0.764434 0.384968 3.162278 0.0 0.7 0.0 4 0 9 4 9 2 11 4 1 4 2 0 1 1 1 0 0
595211 1488027 0 0 1 8 0 0 1 0 0 0 0 0 0 0 0 7 1 0 0 0.1 0.2 -1.000000 7 0 -1 0 -1 0 1 0 2 1 34 2 0.400000 0.932649 0.378021 3.741657 0.4 0.0 0.5 2 3 10 4 10 2 5 4 4 3 8 0 1 0 0 0 0
  • What the data contains

    • Binary variables: yes or no

    • Categorical variables whose category values are integers

    • Other variables with integer or floating-point values

    • Variables containing -1 (-1 denotes a missing value)

    • The target variable and an ID variable

### Check the number of rows and columns

train.shape
(595212, 59)
  • There are 59 variables and 595212 observations
### Remove duplicate rows

# drop_duplicates returns a new DataFrame, so the result has to be assigned
train = train.drop_duplicates()
train.shape
(595212, 59)
### Check the shape of the test data

test.shape
(892816, 58)
  • One variable is missing

    • It is the target variable

    ~(naturally, the test data should not contain it..)~

  • Later, dummy variables can be created for the 14 categorical variables

    • The bin variables are already binary, so they do not need to be dummified
### Check the data info

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 595212 entries, 0 to 595211
Data columns (total 59 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   id              595212 non-null  int64  
 1   target          595212 non-null  int64  
 2   ps_ind_01       595212 non-null  int64  
 3   ps_ind_02_cat   595212 non-null  int64  
 4   ps_ind_03       595212 non-null  int64  
 5   ps_ind_04_cat   595212 non-null  int64  
 6   ps_ind_05_cat   595212 non-null  int64  
 7   ps_ind_06_bin   595212 non-null  int64  
 8   ps_ind_07_bin   595212 non-null  int64  
 9   ps_ind_08_bin   595212 non-null  int64  
 10  ps_ind_09_bin   595212 non-null  int64  
 11  ps_ind_10_bin   595212 non-null  int64  
 12  ps_ind_11_bin   595212 non-null  int64  
 13  ps_ind_12_bin   595212 non-null  int64  
 14  ps_ind_13_bin   595212 non-null  int64  
 15  ps_ind_14       595212 non-null  int64  
 16  ps_ind_15       595212 non-null  int64  
 17  ps_ind_16_bin   595212 non-null  int64  
 18  ps_ind_17_bin   595212 non-null  int64  
 19  ps_ind_18_bin   595212 non-null  int64  
 20  ps_reg_01       595212 non-null  float64
 21  ps_reg_02       595212 non-null  float64
 22  ps_reg_03       595212 non-null  float64
 23  ps_car_01_cat   595212 non-null  int64  
 24  ps_car_02_cat   595212 non-null  int64  
 25  ps_car_03_cat   595212 non-null  int64  
 26  ps_car_04_cat   595212 non-null  int64  
 27  ps_car_05_cat   595212 non-null  int64  
 28  ps_car_06_cat   595212 non-null  int64  
 29  ps_car_07_cat   595212 non-null  int64  
 30  ps_car_08_cat   595212 non-null  int64  
 31  ps_car_09_cat   595212 non-null  int64  
 32  ps_car_10_cat   595212 non-null  int64  
 33  ps_car_11_cat   595212 non-null  int64  
 34  ps_car_11       595212 non-null  int64  
 35  ps_car_12       595212 non-null  float64
 36  ps_car_13       595212 non-null  float64
 37  ps_car_14       595212 non-null  float64
 38  ps_car_15       595212 non-null  float64
 39  ps_calc_01      595212 non-null  float64
 40  ps_calc_02      595212 non-null  float64
 41  ps_calc_03      595212 non-null  float64
 42  ps_calc_04      595212 non-null  int64  
 43  ps_calc_05      595212 non-null  int64  
 44  ps_calc_06      595212 non-null  int64  
 45  ps_calc_07      595212 non-null  int64  
 46  ps_calc_08      595212 non-null  int64  
 47  ps_calc_09      595212 non-null  int64  
 48  ps_calc_10      595212 non-null  int64  
 49  ps_calc_11      595212 non-null  int64  
 50  ps_calc_12      595212 non-null  int64  
 51  ps_calc_13      595212 non-null  int64  
 52  ps_calc_14      595212 non-null  int64  
 53  ps_calc_15_bin  595212 non-null  int64  
 54  ps_calc_16_bin  595212 non-null  int64  
 55  ps_calc_17_bin  595212 non-null  int64  
 56  ps_calc_18_bin  595212 non-null  int64  
 57  ps_calc_19_bin  595212 non-null  int64  
 58  ps_calc_20_bin  595212 non-null  int64  
dtypes: float64(10), int64(49)
memory usage: 267.9 MB
  • The data types are mostly integer or float

  • There are no null values in the data

    • This is because missing values have been replaced by -1
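
Because missing values are stored as -1, the usual null checks report nothing. The sketch below illustrates this on a made-up three-row frame (the values are illustrative, not taken from the competition data) and shows two ways to surface the hidden missing values.

```python
import numpy as np
import pandas as pd

# Toy frame that mimics the -1 convention
toy = pd.DataFrame({
    "ps_reg_03": [0.72, -1.0, 0.58],
    "ps_car_03_cat": [-1, -1, 0],
    "ps_ind_01": [2, 1, 0],
})

n_null = int(toy.isnull().sum().sum())   # pandas sees no NaN at all
missing_share = (toy == -1).mean()       # share of -1 per column
as_nan = toy.replace(-1, np.nan)         # make the missing values explicit

print(n_null)
print(missing_share)
print(as_nan.isnull().sum())
```

Replacing -1 by NaN is only safe because -1 is never a legitimate value in this data set.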

4. Metadata

  • To make data management easier, meta information about the variables is stored in a DataFrame

  • This is useful for selecting the variables to use in analysis, visualization, modeling, etc.

  • What is stored:

    • role: input, ID, target

    • level: nominal, interval, ordinal, binary

    • keep: True or False

    • dtype: int, float, str

data = []

for f in train.columns:

    ### role
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'
         
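    ### level
    if 'bin' in f or f == 'target':
        level = 'binary'
    elif 'cat' in f or f == 'id':
        level = 'nominal'
    # is_float_dtype / is_integer_dtype do not depend on the platform's default int width
    elif pd.api.types.is_float_dtype(train[f]):
        level = 'interval'
    elif pd.api.types.is_integer_dtype(train[f]):
        level = 'ordinal'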
        
    # Initialize keep to True for all variables except for id
    keep = True
    if f == 'id':
        keep = False
    
    ### dtype 
    dtype = train[f].dtype
    
    # Creating a Dict that contains all the metadata for the variable
    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep,
        'dtype': dtype
    }
    data.append(f_dict)
meta = pd.DataFrame(data, columns=['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace=True)
meta
role level keep dtype
varname
id id nominal False int64
target target binary True int64
ps_ind_01 input ordinal True int64
ps_ind_02_cat input nominal True int64
ps_ind_03 input ordinal True int64
ps_ind_04_cat input nominal True int64
ps_ind_05_cat input nominal True int64
ps_ind_06_bin input binary True int64
ps_ind_07_bin input binary True int64
ps_ind_08_bin input binary True int64
ps_ind_09_bin input binary True int64
ps_ind_10_bin input binary True int64
ps_ind_11_bin input binary True int64
ps_ind_12_bin input binary True int64
ps_ind_13_bin input binary True int64
ps_ind_14 input ordinal True int64
ps_ind_15 input ordinal True int64
ps_ind_16_bin input binary True int64
ps_ind_17_bin input binary True int64
ps_ind_18_bin input binary True int64
ps_reg_01 input interval True float64
ps_reg_02 input interval True float64
ps_reg_03 input interval True float64
ps_car_01_cat input nominal True int64
ps_car_02_cat input nominal True int64
ps_car_03_cat input nominal True int64
ps_car_04_cat input nominal True int64
ps_car_05_cat input nominal True int64
ps_car_06_cat input nominal True int64
ps_car_07_cat input nominal True int64
ps_car_08_cat input nominal True int64
ps_car_09_cat input nominal True int64
ps_car_10_cat input nominal True int64
ps_car_11_cat input nominal True int64
ps_car_11 input ordinal True int64
ps_car_12 input interval True float64
ps_car_13 input interval True float64
ps_car_14 input interval True float64
ps_car_15 input interval True float64
ps_calc_01 input interval True float64
ps_calc_02 input interval True float64
ps_calc_03 input interval True float64
ps_calc_04 input ordinal True int64
ps_calc_05 input ordinal True int64
ps_calc_06 input ordinal True int64
ps_calc_07 input ordinal True int64
ps_calc_08 input ordinal True int64
ps_calc_09 input ordinal True int64
ps_calc_10 input ordinal True int64
ps_calc_11 input ordinal True int64
ps_calc_12 input ordinal True int64
ps_calc_13 input ordinal True int64
ps_calc_14 input ordinal True int64
ps_calc_15_bin input binary True int64
ps_calc_16_bin input binary True int64
ps_calc_17_bin input binary True int64
ps_calc_18_bin input binary True int64
ps_calc_19_bin input binary True int64
ps_calc_20_bin input binary True int64
### Extract all nominal variables that have not been dropped (example)

meta[(meta.level == 'nominal') & (meta.keep)].index
Index(['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat',
       'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat', 'ps_car_05_cat',
       'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat',
       'ps_car_10_cat', 'ps_car_11_cat'],
      dtype='object', name='varname')
### Number of variables per role and level

pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()
role level count
0 id nominal 1
1 input binary 17
2 input interval 10
3 input nominal 14
4 input ordinal 16
5 target binary 1

5. Descriptive statistics

  • The describe method can be applied to the DataFrame

  • However, computing the mean, standard deviation, ... for categorical variables and the id variable makes little sense

    • Categorical variables -> explored visually later
  • The meta DataFrame makes it easy to select the variables for which to compute descriptive statistics

    • To keep things clear, this is done per data type

5-1. Interval variables

v = meta[(meta.level == 'interval') & (meta.keep)].index
train[v].describe()
ps_reg_01 ps_reg_02 ps_reg_03 ps_car_12 ps_car_13 ps_car_14 ps_car_15 ps_calc_01 ps_calc_02 ps_calc_03
count 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000
mean 0.610991 0.439184 0.551102 0.379945 0.813265 0.276256 3.065899 0.449756 0.449589 0.449849
std 0.287643 0.404264 0.793506 0.058327 0.224588 0.357154 0.731366 0.287198 0.286893 0.287153
min 0.000000 0.000000 -1.000000 -1.000000 0.250619 -1.000000 0.000000 0.000000 0.000000 0.000000
25% 0.400000 0.200000 0.525000 0.316228 0.670867 0.333167 2.828427 0.200000 0.200000 0.200000
50% 0.700000 0.300000 0.720677 0.374166 0.765811 0.368782 3.316625 0.500000 0.400000 0.500000
75% 0.900000 0.600000 1.000000 0.400000 0.906190 0.396485 3.605551 0.700000 0.700000 0.700000
max 0.900000 1.800000 4.037945 1.264911 3.720626 0.636396 3.741657 0.900000 0.900000 0.900000

✅ reg variables

  • Only ps_reg_03 has missing values

  • The ranges (minimum to maximum) differ between the variables

    • Scaling (e.g. StandardScaler) could be applied

    • Whether it is needed depends on the classifier used

✅ car variables

  • ps_car_12 and ps_car_14 have missing values (their minimum is -1)

  • Scaling could be applied

✅ calc variables

  • No missing values

  • All three variables have very similar distributions

Overall, the ranges of the interval variables are rather small

  • Perhaps some transformation (e.g. a log transformation) was already applied to anonymize the data?

5-2. Ordinal variables

v = meta[(meta.level == 'ordinal') & (meta.keep)].index
train[v].describe()
ps_ind_01 ps_ind_03 ps_ind_14 ps_ind_15 ps_car_11 ps_calc_04 ps_calc_05 ps_calc_06 ps_calc_07 ps_calc_08 ps_calc_09 ps_calc_10 ps_calc_11 ps_calc_12 ps_calc_13 ps_calc_14
count 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000
mean 1.900378 4.423318 0.012451 7.299922 2.346072 2.372081 1.885886 7.689445 3.005823 9.225904 2.339034 8.433590 5.441382 1.441918 2.872288 7.539026
std 1.983789 2.699902 0.127545 3.546042 0.832548 1.117219 1.134927 1.334312 1.414564 1.459672 1.246949 2.904597 2.332871 1.202963 1.694887 2.746652
min 0.000000 0.000000 0.000000 0.000000 -1.000000 0.000000 0.000000 0.000000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 0.000000 5.000000 2.000000 2.000000 1.000000 7.000000 2.000000 8.000000 1.000000 6.000000 4.000000 1.000000 2.000000 6.000000
50% 1.000000 4.000000 0.000000 7.000000 3.000000 2.000000 2.000000 8.000000 3.000000 9.000000 2.000000 8.000000 5.000000 1.000000 3.000000 7.000000
75% 3.000000 6.000000 0.000000 10.000000 3.000000 3.000000 3.000000 9.000000 4.000000 10.000000 3.000000 10.000000 7.000000 2.000000 4.000000 9.000000
max 7.000000 11.000000 4.000000 13.000000 3.000000 5.000000 6.000000 10.000000 9.000000 12.000000 7.000000 25.000000 19.000000 10.000000 13.000000 23.000000
  • Only ps_car_11 has missing values

  • Scaling could be applied to deal with the different ranges
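
To make the scaling remarks concrete, here is a minimal sketch with `StandardScaler` on a made-up array whose two columns have very different ranges. The numbers are illustrative only; in practice the scaler would be fitted on the train set and then reused on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy columns: one in [0, 1], one in [0, 13]
X = np.array([[0.7, 11.0],
              [0.8, 3.0],
              [0.0, 12.0],
              [0.9, 8.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # subtract the column mean, divide by the column std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

Tree-based models such as random forests are largely insensitive to this kind of rescaling, while linear or distance-based models usually benefit from it, which is why the need for scaling depends on the classifier.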

5-3. Binary variables

v = meta[(meta.level == 'binary') & (meta.keep)].index
train[v].describe()
target ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
count 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000
mean 0.036448 0.393742 0.257033 0.163921 0.185304 0.000373 0.001692 0.009439 0.000948 0.660823 0.121081 0.153446 0.122427 0.627840 0.554182 0.287182 0.349024 0.153318
std 0.187401 0.488579 0.436998 0.370205 0.388544 0.019309 0.041097 0.096693 0.030768 0.473430 0.326222 0.360417 0.327779 0.483381 0.497056 0.452447 0.476662 0.360295
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000
75% 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
  • In the train data, the share of target = 1 (true) is 3.645% -> strongly imbalanced data

  • The means show that most binary variables are 0 for the majority of records

6. Handling imbalanced classes

  • Records with target = 1 are far fewer than records with target = 0 -> imbalanced data

  • This can lead to a model that has high accuracy but is of little practical use, e.g. one that predicts target = 0 almost everywhere

  • Possible strategies:

    • oversampling records with target = 1

    • undersampling records with target = 0

Because the train set is rather large, undersampling is used here
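
Why accuracy is misleading here can be checked with plain arithmetic. The sketch below uses the class counts reported in this section and evaluates a trivial "model" that always predicts target = 0; no real model is involved.

```python
# Class counts of the full train set, as reported in this section
n_neg, n_pos = 573518, 21694

# A trivial classifier that always predicts target = 0
accuracy = n_neg / (n_neg + n_pos)   # every target = 0 record is "correct"
recall_pos = 0 / n_pos               # no target = 1 record is ever found

print(f"accuracy: {accuracy:.4f}, recall on target = 1: {recall_pos:.1f}")
```

An accuracy of about 96% with zero recall on the class of interest is exactly the situation resampling tries to avoid.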

📚 References

Sampling techniques for handling imbalanced data

desired_apriori = 0.10 

### Get the indices for each value of the target
idx_0 = train[train.target == 0].index # False
idx_1 = train[train.target == 1].index # True

### Check the target distribution
nb_0 = len(train.loc[idx_0])
nb_1 = len(train.loc[idx_1])

### Added code
print(nb_0)
print(nb_1)
573518
21694
  • The target is very imbalanced: 21694 records with target = 1 against 573518 records with target = 0.
### Choose the undersampling rate

undersampling_rate = ((1 - desired_apriori)*nb_1) / (nb_0*desired_apriori) 
# Fraction of the target = 0 records to keep so that target = 1 makes up desired_apriori of the data
undersampled_nb_0 = int(undersampling_rate * nb_0)

print('Rate to undersample records with target = 0: {}'.format(undersampling_rate))
print('Number of records with target = 0 after undersampling: {}'.format(undersampled_nb_0))
Rate to undersample records with target = 0: 0.34043569687437886
Number of records with target = 0 after undersampling: 195246
  • The imbalance is considerably reduced

    (share of target = 0: 96% -> 90%, reached by keeping about 34% of the target = 0 records)
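
The rate follows from requiring nb_1 / (nb_1 + rate * nb_0) = desired_apriori and solving for rate. The sketch below repeats the notebook's formula with the counts printed above and checks the class shares before and after; `share_before` and `share_after` are names introduced here for illustration.

```python
desired_apriori = 0.10
nb_0, nb_1 = 573518, 21694   # counts printed above

# Solve nb_1 / (nb_1 + rate * nb_0) = desired_apriori for rate
rate = ((1 - desired_apriori) * nb_1) / (nb_0 * desired_apriori)
kept_0 = int(rate * nb_0)    # number of target = 0 records that remain

share_before = nb_1 / (nb_0 + nb_1)
share_after = nb_1 / (kept_0 + nb_1)

print(f"rate: {rate:.4f}, kept target = 0 records: {kept_0}")
print(f"share of target = 1: {share_before:.3%} -> {share_after:.3%}")
```

The share of target = 1 rises from roughly 3.6% to the requested 10%, while about two thirds of the target = 0 records are discarded.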

### Perform the undersampling
# Plain random sampling is used
undersampled_idx = shuffle(idx_0, random_state = 37, n_samples = undersampled_nb_0) 
# random_state keeps the result from changing between runs

# Build the index list: the sampled target = 0 indices plus all target = 1 indices
idx_list = list(undersampled_idx) + list(idx_1)

# Return the undersampled data (final data)
train = train.loc[idx_list].reset_index(drop=True)

7. Data Quality Checks

7-1. Checking missing values

  • Missing values are encoded as -1
vars_with_missing = [] # names of the variables that contain missing values

for f in train.columns: # for each column
    missings = train[train[f] == -1][f].count()
    if missings > 0:
        vars_with_missing.append(f)
        
        missings_perc = missings / train.shape[0] # share of missing values in this column
        print('Variable {} has {} records ({:.2%}) with missing values'.format(f, missings, missings_perc))

print()
# Number of variables that contain missing values
print('In total, there are {} variables with missing values'.format(len(vars_with_missing)))
Variable ps_ind_02_cat has 103 records (0.05%) with missing values
Variable ps_ind_04_cat has 51 records (0.02%) with missing values
Variable ps_ind_05_cat has 2256 records (1.04%) with missing values
Variable ps_reg_03 has 38580 records (17.78%) with missing values
Variable ps_car_01_cat has 62 records (0.03%) with missing values
Variable ps_car_02_cat has 2 records (0.00%) with missing values
Variable ps_car_03_cat has 148367 records (68.39%) with missing values
Variable ps_car_05_cat has 96026 records (44.26%) with missing values
Variable ps_car_07_cat has 4431 records (2.04%) with missing values
Variable ps_car_09_cat has 230 records (0.11%) with missing values
Variable ps_car_11 has 1 records (0.00%) with missing values
Variable ps_car_14 has 15726 records (7.25%) with missing values

In total, there are 12 variables with missing values
  • ps_car_03_cat and ps_car_05_cat have a large share of records with missing values (68% and 44%)

    • Drop these variables
  • For the other categorical variables with missing values, the missing value -1 can be left as it is

  • ps_reg_03 (continuous) has missing values for 18% of all records

    • Replace with the mean
  • ps_car_11 (ordinal) has only 1 record with a missing value in the undersampled data

    • Replace with the mode
  • ps_car_12 (continuous) does not appear in the output above, although its minimum of -1 in section 5-1 shows a missing value in the full train set

    • Replace with the mean (this has no effect when no -1 is left)
  • ps_car_14 (continuous) has missing values for 7% of all records

    • Replace with the mean
### Drop the variables with too many missing values
vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
train.drop(vars_to_drop, inplace = True, axis = 1)
meta.loc[(vars_to_drop),'keep'] = False  # update the metadata

### Data processing: impute the missing values
mean_imp = SimpleImputer(missing_values = -1, strategy = 'mean')
mode_imp = SimpleImputer(missing_values = -1, strategy = 'most_frequent')
train['ps_reg_03'] = mean_imp.fit_transform(train[['ps_reg_03']]).ravel()
train['ps_car_12'] = mean_imp.fit_transform(train[['ps_car_12']]).ravel()
train['ps_car_14'] = mean_imp.fit_transform(train[['ps_car_14']]).ravel()
train['ps_car_11'] = mode_imp.fit_transform(train[['ps_car_11']]).ravel()

7-2. Checking the cardinality of categorical variables

  • Cardinality: the number of distinct values in a variable

  • Dummy variables will be created from the categorical variables later, so we need to check whether any variable has many distinct values

    • Such a variable would produce many dummy variables and should be handled differently
v = meta[(meta.level == 'nominal') & (meta.keep)].index

for f in v:
    dist_values = train[f].value_counts().shape[0]
    print('Variable {} has {} distinct values'.format(f, dist_values))
Variable ps_ind_02_cat has 5 distinct values
Variable ps_ind_04_cat has 3 distinct values
Variable ps_ind_05_cat has 8 distinct values
Variable ps_car_01_cat has 13 distinct values
Variable ps_car_02_cat has 3 distinct values
Variable ps_car_04_cat has 10 distinct values
Variable ps_car_06_cat has 18 distinct values
Variable ps_car_07_cat has 3 distinct values
Variable ps_car_08_cat has 2 distinct values
Variable ps_car_09_cat has 6 distinct values
Variable ps_car_10_cat has 3 distinct values
Variable ps_car_11_cat has 104 distinct values
  • ps_car_11_cat has many distinct values (104)

  • Smoothing is computed as in the following paper by Daniele Micci-Barreca

    • https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
  • Parameters:

    • trn_series: the categorical feature of the training set, as a pd.Series

    • tst_series: the categorical feature of the test set, as a pd.Series

    • target: the target data, as a pd.Series

    • min_samples_leaf (int): minimum number of samples a category needs before its own average is taken into account

    • smoothing (int): smoothing effect that balances the category average against the prior (overall) average
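
The interplay of min_samples_leaf and smoothing is easier to see on single numbers. The sketch below isolates the blending formula used in target_encode; `smoothed_mean` is a hypothetical helper, and the prior of 0.10 and category mean of 0.30 are made-up values.

```python
import math


def smoothed_mean(n, category_mean, prior, min_samples_leaf=1, smoothing=1):
    """Blend a category's target mean with the prior, weighted by its count n."""
    # Sigmoid weight: 0.5 when n == min_samples_leaf, close to 1 for large n
    weight = 1 / (1 + math.exp(-(n - min_samples_leaf) / smoothing))
    return prior * (1 - weight) + category_mean * weight


prior = 0.10
for n in (5, 100, 1000):
    encoded = smoothed_mean(n, 0.30, prior, min_samples_leaf=100, smoothing=10)
    print(n, round(encoded, 4))
```

A rare category (n = 5) is encoded almost as the prior, a category with exactly min_samples_leaf records lands halfway between prior and category mean, and a frequent category keeps its own mean.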

def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))
def target_encode(trn_series = None, 
                  tst_series = None, 
                  target = None, 
                  min_samples_leaf = 1, 
                  smoothing = 1,
                  noise_level = 0):
   
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name

    # [assert statement]
    # - asserts that a condition is true
    # - if the condition is false an AssertionError is raised and execution stops
    
    temp = pd.concat([trn_series, target], axis = 1)
    
    ### Mean of the target per category
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    
    ### Compute the smoothing weight
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))

    ### Apply the smoothing to the category averages
    prior = target.mean()
    # The larger the count, the less the prior (overall mean) is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis = 1, inplace = True)
    
    ### Apply the encoding to the train and test sets
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns = {'index': target.name, target.name: 'average'}),
        on = trn_series.name,
        how = 'left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # Restore the original index (pd.merge does not keep it)
    ft_trn_series.index = trn_series.index 
    
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # Restore the original index (pd.merge does not keep it)
    ft_tst_series.index = tst_series.index
    
    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)
### Encoding the categorical variable
# categorical -> numeric via target encoding (not dummy variables)

train_encoded, test_encoded = target_encode(train["ps_car_11_cat"], 
                             test["ps_car_11_cat"], 
                             target = train.target, 
                             min_samples_leaf = 100,
                             smoothing = 10,
                             noise_level = 0.01)
### Replace the variable with its encoded values
train['ps_car_11_cat_te'] = train_encoded
train.drop('ps_car_11_cat', axis = 1, inplace = True)

# Update the metadata
meta.loc['ps_car_11_cat','keep'] = False

test['ps_car_11_cat_te'] = test_encoded
test.drop('ps_car_11_cat', axis=1, inplace = True)

8. Exploratory Data Visualization

8-1. Categorical variables

  • Look at the share of customers with target = 1 per category
v = meta[(meta.level == 'nominal') & (meta.keep)].index

for f in v:
    ### Set up the figure (a separate plt.figure() call would only create an empty extra figure)
    fig, ax = plt.subplots(figsize = (20,10))

    ### Compute the share of target = 1 per category
    cat_perc = train[[f, 'target']].groupby([f],as_index = False).mean()
    cat_perc.sort_values(by = 'target', ascending = False, inplace = True) # sort in descending order
    
    ### Draw the bar plot
    # bars are ordered by descending target mean
    sns.barplot(ax = ax, x = f, y = 'target', data = cat_perc, order = cat_perc[f])
    plt.ylabel('% target', fontsize = 18)
    plt.xlabel(f, fontsize = 18)
    plt.tick_params(axis = 'both', which = 'major', labelsize = 18)
    plt.show()
[Figures: one bar plot per nominal variable, showing the share of target = 1 for each category]

  • Rather than replacing missing values with e.g. the mode, it is better to keep them as a separate category value

  • Customers with a missing value appear to have a much higher (in some cases much lower) probability of filing an insurance claim

8-2. Interval variables

  • Check the correlations between the variables

  • A heatmap is a good way to visualize the correlations between variables

### Function for the heatmap visualization

def corr_heatmap(v):
    ### Correlation coefficients between the variables
    correlations = train[v].corr() 

    ### Visualization
    cmap = sns.diverging_palette(220, 10, as_cmap = True) # diverging colormap

    fig, ax = plt.subplots(figsize = (10,10))
    sns.heatmap(correlations, cmap = cmap, vmax = 1.0, center = 0, 
                fmt='.2f', square=True, linewidths=.5, annot=True, 
                cbar_kws={"shrink": .75})
    
    plt.show()
v = meta[(meta.level == 'interval') & (meta.keep)].index
corr_heatmap(v)

  • There are strong correlations between the following variables

    • ps_reg_02 & ps_reg_03 (0.7)

    • ps_car_12 & ps_car_13 (0.67)

    • ps_car_12 & ps_car_14 (0.58)

    • ps_car_13 & ps_car_15 (0.67)

  • Seaborn has handy tools for visualizing the (linear) relationship between variables

    • A pairplot could be used to visualize the relationships between the variables

    • Here we look at the highly correlated variables one pair at a time instead
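
Instead of reading the strongly correlated pairs off the heatmap, they can also be listed programmatically. The sketch below does this on synthetic data; the column names a, b, c and the 0.5 threshold are assumptions chosen for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),   # strongly tied to a
    "c": rng.normal(size=500),             # independent noise
})

corr = toy.corr()

# Visit every unordered pair once and keep those above the threshold
pairs = {(x, y): corr.loc[x, y] for x, y in combinations(corr.columns, 2)}
strong = {pair: r for pair, r in pairs.items() if abs(r) > 0.5}

for (x, y), r in strong.items():
    print(f"{x} & {y}: {r:.2f}")
```

On the notebook's data, the same loop could be run over `train[v].corr()` for the interval variables.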

### Work on a sample to speed up plotting

s = train.sample(frac = 0.1)

✅ ps_reg_02 & ps_reg_03

sns.lmplot(x = 'ps_reg_02', y = 'ps_reg_03', data = s, hue = 'target', 
           palette = 'Set1', scatter_kws = {'alpha':0.3})
plt.show()

  • As the regression lines show, there is a linear relationship between the two variables

  • Thanks to the hue parameter, we can see that the regression lines for target = 0 and target = 1 are the same

✅ ps_car_12 and ps_car_13

sns.lmplot(x = 'ps_car_12', y = 'ps_car_13', data = s, hue = 'target', 
           palette = 'Set1', scatter_kws = {'alpha':0.3})
plt.show()

✅ ps_car_12 and ps_car_14

sns.lmplot(x = 'ps_car_12', y = 'ps_car_14', data = s, hue = 'target', 
           palette = 'Set1', scatter_kws = {'alpha':0.3})
plt.show()

✅ ps_car_13 and ps_car_15

sns.lmplot(x = 'ps_car_15', y = 'ps_car_13', data = s, hue = 'target', 
           palette = 'Set1', scatter_kws = {'alpha':0.3})
plt.show()

  • We could run principal component analysis (PCA) on these variables to reduce the number of dimensions

    • Since the number of correlated variables is fairly small, we leave them as they are
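For reference, the PCA option mentioned above might look roughly like the sketch below. It is not applied in this notebook. The synthetic matrix `X` (two correlated columns plus one independent column) and the 95% explained-variance target are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: columns 0 and 1 share a common factor, column 2 is independent
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 1))
X = np.hstack([
    base + rng.normal(scale=0.3, size=(1000, 1)),
    base + rng.normal(scale=0.3, size=(1000, 1)),
    rng.normal(size=(1000, 1)),
])

# PCA is scale-sensitive, so standardize first
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps just enough components to reach that share of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```

The trade-off is that the resulting components are linear mixes of the original variables, which makes them harder to interpret than the raw features.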

8-3. Ordinal variables

### Correlation coefficients

v = meta[(meta.level == 'ordinal') & (meta.keep)].index
corr_heatmap(v)

  • No strongly correlated ordinal variables show up

    • Instead, we could check how the distributions look when grouped by target value
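One possible way to inspect the distribution of an ordinal variable per target value is a row-normalized crosstab. The sketch below uses a tiny hand-made frame. The column name `ord_var` is hypothetical and stands in for any of the ordinal features.

```python
import pandas as pd

# Hand-made stand-in data
demo = pd.DataFrame({
    'ord_var': [0, 0, 1, 1, 2, 2, 2, 1],
    'target':  [0, 0, 0, 1, 1, 1, 0, 0],
})

# normalize='index' turns counts into shares within each target class
dist = pd.crosstab(demo['target'], demo['ord_var'], normalize='index')
print(dist)
```

Each row sums to 1, so the two rows can be compared directly even though the target classes differ in size.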

9. Feature engineering

9-1. Creating dummy variables

  • The values of a categorical variable do not represent any order or magnitude

    • For example, category 2 is not twice the value of category 1
  • Dummy variables are used to deal with this

    • The first dummy variable is dropped, because its information can be derived from the other dummy variables generated for the categories of the original variable
v = meta[(meta.level == 'nominal') & (meta.keep)].index
print('Before dummification we have {} variables in train'.format(train.shape[1]))

train = pd.get_dummies(train, columns = v, drop_first = True)
print('After dummification we have {} variables in train'.format(train.shape[1]))
Before dummification we have 57 variables in train
After dummification we have 109 variables in train
  • Creating the dummy variables added 52 variables to the train set.
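To see why dropping the first dummy loses no information, consider the small sketch below. The single-column frame `demo` is made up for illustration. The dropped level is recovered as the rows where all remaining dummies are 0.

```python
import pandas as pd

demo = pd.DataFrame({'cat': [0, 1, 2, 1, 0]})

full = pd.get_dummies(demo, columns=['cat'])                      # cat_0, cat_1, cat_2
reduced = pd.get_dummies(demo, columns=['cat'], drop_first=True)  # cat_1, cat_2

# The dropped level is 1 exactly when every kept dummy is 0
recovered = (reduced.sum(axis=1) == 0).astype(int)

print(reduced.columns.tolist())
print(recovered.tolist())
```

Keeping all dummies would make them perfectly collinear, which matters mainly for linear models; tree-based models are less sensitive to it.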

9-2. Creating interaction variables

v = meta[(meta.level == 'interval') & (meta.keep)].index
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
interactions = pd.DataFrame(data=poly.fit_transform(train[v]), columns=poly.get_feature_names_out(v), index=train.index)  # reuse train's index so the concat below aligns rows
interactions.drop(v, axis=1, inplace=True)  # Remove the original columns

# Concat the interaction variables to the train data
print('Before creating interactions we have {} variables in train'.format(train.shape[1]))
train = pd.concat([train, interactions], axis=1)
print('After creating interactions we have {} variables in train'.format(train.shape[1]))
Before creating interactions we have 109 variables in train
After creating interactions we have 164 variables in train
  • The interaction variables have been added to the train data

  • Thanks to the get_feature_names_out() method, the new variables can be given column names

10. Feature selection

10-1. Removing features with low or zero variance

  • Sklearn provides a convenient tool for this

    • VarianceThreshold
  • By default, it removes features with zero variance

    • The earlier steps confirmed that there are no zero-variance variables

    • Removing features with a variance below 0.01 would remove 28 variables

selector = VarianceThreshold(threshold = .01)
selector.fit(train.drop(['id', 'target'], axis = 1)) 

# get_support() is True for kept features, so invert the mask to get the removed ones
removed = ~selector.get_support()

v = train.drop(['id', 'target'], axis = 1).columns[removed]

### Variables whose variance is below the threshold
print('{} variables have too low variance.'.format(len(v)))
print('These variables are {}'.format(list(v)))
28 variables have too low variance.
These variables are ['ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_car_12', 'ps_car_14', 'ps_car_11_cat_te', 'ps_ind_05_cat_2', 'ps_ind_05_cat_5', 'ps_car_01_cat_1', 'ps_car_01_cat_2', 'ps_car_04_cat_3', 'ps_car_04_cat_4', 'ps_car_04_cat_5', 'ps_car_04_cat_6', 'ps_car_04_cat_7', 'ps_car_06_cat_2', 'ps_car_06_cat_5', 'ps_car_06_cat_8', 'ps_car_06_cat_12', 'ps_car_06_cat_16', 'ps_car_06_cat_17', 'ps_car_09_cat_4', 'ps_car_10_cat_1', 'ps_car_10_cat_2', 'ps_car_12^2', 'ps_car_12 ps_car_14', 'ps_car_14^2']
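Many of the variables flagged above are binary flags or dummies. A 0/1 variable with a share p of ones has variance p(1 - p), so a threshold of 0.01 removes binary columns whose rarer value occurs in roughly 1% of rows or less. The sketch below illustrates this on synthetic flags. The shares 0.005 and 0.30 are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 10_000
X = np.column_stack([
    rng.random(n) < 0.005,  # rare flag: variance about 0.005 * 0.995
    rng.random(n) < 0.30,   # common flag: variance about 0.30 * 0.70
]).astype(float)

selector = VarianceThreshold(threshold=0.01).fit(X)

print(selector.variances_)     # per-column variance
print(selector.get_support())  # True for columns that are kept
```

A rare flag can still carry signal for a rare target, which is one reason to treat this filter with some caution.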
  • Selecting on variance would cost us quite a few variables

    • But because we do not have that many variables, we let the classifier choose

    • For data sets with many more variables, this step can reduce processing time

  • Scikit-learn also provides SelectFromModel as another feature selection method

### Use all variables
# Fit a RandomForest

X_train = train.drop(['id', 'target'], axis = 1)
y_train = train['target']

feat_labels = X_train.columns

rf = RandomForestClassifier(n_estimators = 1000, random_state = 0, n_jobs=-1)

rf.fit(X_train, y_train)
importances = rf.feature_importances_

indices = np.argsort(rf.feature_importances_)[::-1] # indices sorted by importance, in descending order

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,feat_labels[indices[f]], importances[indices[f]]))
 1) ps_car_11_cat_te               0.021066
 2) ps_car_13                      0.017323
 3) ps_car_12 ps_car_13            0.017271
 4) ps_car_13^2                    0.017238
 5) ps_car_13 ps_car_14            0.017201
 6) ps_reg_03 ps_car_13            0.017106
 7) ps_car_13 ps_car_15            0.016783
 8) ps_reg_01 ps_car_13            0.016777
 9) ps_reg_03 ps_car_14            0.016258
10) ps_reg_03 ps_car_12            0.015567
11) ps_reg_03 ps_car_15            0.015179
12) ps_car_14 ps_car_15            0.015002
13) ps_car_13 ps_calc_02           0.014744
14) ps_reg_01 ps_reg_03            0.014728
15) ps_car_13 ps_calc_01           0.014714
16) ps_reg_02 ps_car_13            0.014641
17) ps_car_13 ps_calc_03           0.014635
18) ps_reg_01 ps_car_14            0.014488
19) ps_reg_03^2                    0.014317
20) ps_reg_03                      0.014165
21) ps_reg_03 ps_calc_03           0.013774
22) ps_reg_03 ps_calc_02           0.013765
23) ps_reg_03 ps_calc_01           0.013694
24) ps_calc_10                     0.013626
25) ps_car_14 ps_calc_02           0.013594
26) ps_car_14 ps_calc_01           0.013580
27) ps_car_14 ps_calc_03           0.013519
28) ps_calc_14                     0.013414
29) ps_car_12 ps_car_14            0.012919
30) ps_ind_03                      0.012910
31) ps_car_14^2                    0.012752
32) ps_car_14                      0.012751
33) ps_reg_02 ps_car_14            0.012746
34) ps_calc_11                     0.012582
35) ps_reg_02 ps_reg_03            0.012560
36) ps_ind_15                      0.012182
37) ps_car_12 ps_car_15            0.010928
38) ps_car_15 ps_calc_03           0.010903
39) ps_car_15 ps_calc_02           0.010843
40) ps_car_15 ps_calc_01           0.010814
41) ps_car_12 ps_calc_01           0.010496
42) ps_calc_13                     0.010441
43) ps_car_12 ps_calc_03           0.010354
44) ps_car_12 ps_calc_02           0.010304
45) ps_reg_02 ps_car_15            0.010219
46) ps_reg_01 ps_car_15            0.010213
47) ps_calc_02 ps_calc_03          0.010074
48) ps_calc_01 ps_calc_03          0.010060
49) ps_calc_01 ps_calc_02          0.010000
50) ps_calc_07                     0.009818
51) ps_calc_08                     0.009797
52) ps_reg_01 ps_car_12            0.009466
53) ps_reg_02 ps_car_12            0.009299
54) ps_reg_02 ps_calc_01           0.009288
55) ps_reg_02 ps_calc_03           0.009221
56) ps_reg_02 ps_calc_02           0.009146
57) ps_reg_01 ps_calc_03           0.009059
58) ps_calc_06                     0.009044
59) ps_reg_01 ps_calc_02           0.009037
60) ps_reg_01 ps_calc_01           0.009012
61) ps_calc_09                     0.008800
62) ps_ind_01                      0.008605
63) ps_calc_05                     0.008318
64) ps_calc_04                     0.008128
65) ps_reg_01 ps_reg_02            0.008024
66) ps_calc_12                     0.007976
67) ps_car_15                      0.006139
68) ps_car_15^2                    0.006136
69) ps_calc_03                     0.006017
70) ps_calc_01^2                   0.006009
71) ps_calc_03^2                   0.005974
72) ps_calc_01                     0.005964
73) ps_calc_02^2                   0.005945
74) ps_calc_02                     0.005919
75) ps_car_12^2                    0.005356
76) ps_car_12                      0.005345
77) ps_reg_02^2                    0.004986
78) ps_reg_02                      0.004973
79) ps_reg_01                      0.004159
80) ps_reg_01^2                    0.004139
81) ps_car_11                      0.003798
82) ps_ind_05_cat_0                0.003557
83) ps_ind_17_bin                  0.002843
84) ps_calc_17_bin                 0.002674
85) ps_calc_16_bin                 0.002590
86) ps_calc_19_bin                 0.002548
87) ps_calc_18_bin                 0.002504
88) ps_ind_16_bin                  0.002403
89) ps_car_01_cat_11               0.002393
90) ps_ind_04_cat_0                0.002380
91) ps_ind_04_cat_1                0.002359
92) ps_ind_07_bin                  0.002333
93) ps_car_09_cat_2                0.002312
94) ps_ind_02_cat_1                0.002275
95) ps_car_01_cat_7                0.002130
96) ps_calc_20_bin                 0.002095
97) ps_car_09_cat_0                0.002090
98) ps_ind_02_cat_2                0.002088
99) ps_ind_06_bin                  0.002058
100) ps_car_06_cat_1                0.002007
101) ps_calc_15_bin                 0.001989
102) ps_car_07_cat_1                0.001957
103) ps_ind_08_bin                  0.001937
104) ps_car_09_cat_1                0.001804
105) ps_car_06_cat_11               0.001804
106) ps_ind_18_bin                  0.001719
107) ps_ind_09_bin                  0.001719
108) ps_car_01_cat_10               0.001605
109) ps_car_01_cat_9                0.001595
110) ps_car_01_cat_4                0.001545
111) ps_car_01_cat_6                0.001544
112) ps_car_06_cat_14               0.001532
113) ps_ind_05_cat_6                0.001494
114) ps_ind_02_cat_3                0.001430
115) ps_car_07_cat_0                0.001372
116) ps_car_01_cat_8                0.001345
117) ps_car_08_cat_1                0.001343
118) ps_car_02_cat_1                0.001328
119) ps_car_02_cat_0                0.001307
120) ps_car_06_cat_4                0.001241
121) ps_ind_05_cat_4                0.001199
122) ps_ind_02_cat_4                0.001163
123) ps_car_01_cat_5                0.001143
124) ps_car_06_cat_6                0.001105
125) ps_car_06_cat_10               0.001063
126) ps_ind_05_cat_2                0.001036
127) ps_car_04_cat_1                0.001030
128) ps_car_04_cat_2                0.000992
129) ps_car_06_cat_7                0.000986
130) ps_car_01_cat_3                0.000896
131) ps_car_09_cat_3                0.000878
132) ps_car_01_cat_0                0.000877
133) ps_ind_14                      0.000854
134) ps_car_06_cat_15               0.000847
135) ps_car_06_cat_9                0.000791
136) ps_ind_05_cat_1                0.000750
137) ps_car_06_cat_3                0.000711
138) ps_car_10_cat_1                0.000696
139) ps_ind_12_bin                  0.000684
140) ps_ind_05_cat_3                0.000665
141) ps_car_09_cat_4                0.000623
142) ps_car_01_cat_2                0.000553
143) ps_car_04_cat_8                0.000550
144) ps_car_06_cat_17               0.000512
145) ps_car_06_cat_16               0.000475
146) ps_car_04_cat_9                0.000443
147) ps_car_06_cat_12               0.000427
148) ps_car_06_cat_13               0.000403
149) ps_car_01_cat_1                0.000381
150) ps_ind_05_cat_5                0.000312
151) ps_car_06_cat_5                0.000273
152) ps_ind_11_bin                  0.000215
153) ps_car_04_cat_6                0.000201
154) ps_ind_13_bin                  0.000152
155) ps_car_04_cat_3                0.000149
156) ps_car_06_cat_2                0.000143
157) ps_car_04_cat_5                0.000097
158) ps_car_06_cat_8                0.000094
159) ps_car_04_cat_7                0.000080
160) ps_ind_10_bin                  0.000074
161) ps_car_10_cat_2                0.000060
162) ps_car_04_cat_4                0.000045
  • With SelectFromModel we can specify which prefit classifier to use and the threshold on the feature importances

  • With the get_support method we can then limit the number of variables in the train data

### Apply SelectFromModel
# Keep only the selected features

sfm = SelectFromModel(rf, threshold = 'median', prefit = True)

print('Number of features before selection: {}'.format(X_train.shape[1]))

# Select the variables
n_features = sfm.transform(X_train).shape[1]
print('Number of features after selection: {}'.format(n_features))

selected_vars = list(feat_labels[sfm.get_support()])
Number of features before selection: 162
/usr/local/lib/python3.9/dist-packages/sklearn/base.py:432: UserWarning: X has feature names, but SelectFromModel was fitted without feature names
  warnings.warn(
Number of features after selection: 81
train = train[selected_vars + ['target']]
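The same selection pattern can be tried end to end on a small synthetic problem. In the sketch below, the data from `make_classification` and the column names `f0` to `f9` are assumptions for illustration. Selecting columns through the boolean mask from `get_support()` rather than calling `transform` on the DataFrame should also sidestep the feature-name warning seen in the output above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the train features and target
X_arr, y = make_classification(n_samples=500, n_features=10,
                               n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(10)])

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# prefit=True reuses the fitted forest; 'median' keeps features at or above
# the median importance
sfm = SelectFromModel(rf, threshold='median', prefit=True)
mask = sfm.get_support()
selected = X.loc[:, mask]

print(selected.columns.tolist())
```

With a median threshold about half of the features are kept, which is consistent with the drop from 162 to 81 above.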

11. Feature scaling

  • StandardScaler can be applied to rescale the train data.

📌 Explanation from a statistics professor

  • Scaling is said to be unnecessary except for models whose performance depends on the distances between data points.

(If anything, standardizing is said to make the results harder to interpret later on.)

  • Cases where it is a must (scale-sensitive algorithms, distance- or penalty-based): Ridge, Lasso, SVM, etc.
scaler = StandardScaler()
scaler.fit_transform(train.drop(['target'], axis=1))
array([[-0.45941104, -1.26665356,  1.05087653, ..., -0.72553616,
        -1.01071913, -1.06173767],
       [ 1.55538958,  0.95034274, -0.63847299, ..., -1.06120876,
        -1.01071913,  0.27907892],
       [ 1.05168943, -0.52765479, -0.92003125, ...,  1.95984463,
        -0.56215309, -1.02449277],
       ...,
       [-0.9631112 ,  0.58084336,  0.48776003, ..., -0.46445747,
         0.18545696,  0.27907892],
       [-0.9631112 , -0.89715418, -1.48314775, ..., -0.91202093,
        -0.41263108,  0.27907892],
       [-0.45941104, -1.26665356,  1.61399304, ...,  0.28148164,
        -0.11358706, -0.72653353]])
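Note that `fit_transform` above returns a plain NumPy array and does not modify train. A minimal sketch of one way to keep the column names and reuse the same scaling on new data is shown below. The frames `train_demo` and `test_demo` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train_demo = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 4.0],
                           'x2': [10.0, 20.0, 30.0, 40.0]})
test_demo = pd.DataFrame({'x1': [2.5], 'x2': [25.0]})

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(train_demo)

# Wrap the arrays back into DataFrames to keep column names and index
train_scaled = pd.DataFrame(scaler.transform(train_demo),
                            columns=train_demo.columns, index=train_demo.index)
test_scaled = pd.DataFrame(scaler.transform(test_demo),
                           columns=test_demo.columns, index=test_demo.index)

print(train_scaled)
print(test_scaled)
```

Fitting the scaler on the training data alone and then applying it to the test data keeps test-set statistics from leaking into preprocessing.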
