[ECC DS Week 2] 1. Data Preparation & Exploration
0. Introduction
- Goal: gain useful insights for the Porto Seguro competition
- Covers how to prepare the data for modeling
- Main sections:
  - Data visualization
  - Metadata definition
  - Descriptive statistics
  - Handling imbalanced classes
  - Checking data quality
  - Exploratory data visualization
  - Feature engineering
  - Feature selection
  - Feature scaling
-
1. Loading packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer # in newer sklearn versions, Imputer was replaced by SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
pd.set_option('display.max_columns', 100)
2. Loading data
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/2주차/data/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/2주차/data/test.csv')
3. Data at first sight
- Excerpt from the competition's data description:
  - Features that belong to similar groups are tagged as such in the feature name (e.g. ind, reg, car, calc)
  - Feature names include the postfix bin for binary features and cat for categorical features
  - Features without these designations are either continuous or ordinal
  - -1: missing value
  - target column: whether or not a claim was filed for that policy holder

⇒ training data
train.head()
id | target | ps_ind_01 | ps_ind_02_cat | ps_ind_03 | ps_ind_04_cat | ps_ind_05_cat | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ps_ind_09_bin | ps_ind_10_bin | ps_ind_11_bin | ps_ind_12_bin | ps_ind_13_bin | ps_ind_14 | ps_ind_15 | ps_ind_16_bin | ps_ind_17_bin | ps_ind_18_bin | ps_reg_01 | ps_reg_02 | ps_reg_03 | ps_car_01_cat | ps_car_02_cat | ps_car_03_cat | ps_car_04_cat | ps_car_05_cat | ps_car_06_cat | ps_car_07_cat | ps_car_08_cat | ps_car_09_cat | ps_car_10_cat | ps_car_11_cat | ps_car_11 | ps_car_12 | ps_car_13 | ps_car_14 | ps_car_15 | ps_calc_01 | ps_calc_02 | ps_calc_03 | ps_calc_04 | ps_calc_05 | ps_calc_06 | ps_calc_07 | ps_calc_08 | ps_calc_09 | ps_calc_10 | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7 | 0 | 2 | 2 | 5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 1 | 0 | 0.7 | 0.2 | 0.718070 | 10 | 1 | -1 | 0 | 1 | 4 | 1 | 0 | 0 | 1 | 12 | 2 | 0.400000 | 0.883679 | 0.370810 | 3.605551 | 0.6 | 0.5 | 0.2 | 3 | 1 | 10 | 1 | 10 | 1 | 5 | 9 | 1 | 5 | 8 | 0 | 1 | 1 | 0 | 0 | 1 |
1 | 9 | 0 | 1 | 1 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0.8 | 0.4 | 0.766078 | 11 | 1 | -1 | 0 | -1 | 11 | 1 | 1 | 2 | 1 | 19 | 3 | 0.316228 | 0.618817 | 0.388716 | 2.449490 | 0.3 | 0.1 | 0.3 | 2 | 1 | 9 | 5 | 8 | 1 | 7 | 3 | 1 | 1 | 9 | 0 | 1 | 1 | 0 | 1 | 0 |
2 | 13 | 0 | 5 | 4 | 9 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 1 | 0 | 0 | 0.0 | 0.0 | -1.000000 | 7 | 1 | -1 | 0 | -1 | 14 | 1 | 1 | 2 | 1 | 60 | 1 | 0.316228 | 0.641586 | 0.347275 | 3.316625 | 0.5 | 0.7 | 0.1 | 2 | 2 | 9 | 1 | 8 | 2 | 7 | 4 | 2 | 7 | 7 | 0 | 1 | 1 | 0 | 1 | 0 |
3 | 16 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 1 | 0 | 0 | 0.9 | 0.2 | 0.580948 | 7 | 1 | 0 | 0 | 1 | 11 | 1 | 1 | 3 | 1 | 104 | 1 | 0.374166 | 0.542949 | 0.294958 | 2.000000 | 0.6 | 0.9 | 0.1 | 2 | 4 | 7 | 1 | 8 | 4 | 2 | 2 | 2 | 4 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 17 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 1 | 0 | 0 | 0.7 | 0.6 | 0.840759 | 11 | 1 | -1 | 0 | -1 | 14 | 1 | 1 | 2 | 1 | 82 | 3 | 0.316070 | 0.565832 | 0.365103 | 2.000000 | 0.4 | 0.6 | 0.0 | 2 | 2 | 6 | 3 | 10 | 2 | 12 | 3 | 1 | 1 | 3 | 0 | 0 | 0 | 1 | 1 | 0 |
train.tail()
id | target | ps_ind_01 | ps_ind_02_cat | ps_ind_03 | ps_ind_04_cat | ps_ind_05_cat | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ps_ind_09_bin | ps_ind_10_bin | ps_ind_11_bin | ps_ind_12_bin | ps_ind_13_bin | ps_ind_14 | ps_ind_15 | ps_ind_16_bin | ps_ind_17_bin | ps_ind_18_bin | ps_reg_01 | ps_reg_02 | ps_reg_03 | ps_car_01_cat | ps_car_02_cat | ps_car_03_cat | ps_car_04_cat | ps_car_05_cat | ps_car_06_cat | ps_car_07_cat | ps_car_08_cat | ps_car_09_cat | ps_car_10_cat | ps_car_11_cat | ps_car_11 | ps_car_12 | ps_car_13 | ps_car_14 | ps_car_15 | ps_calc_01 | ps_calc_02 | ps_calc_03 | ps_calc_04 | ps_calc_05 | ps_calc_06 | ps_calc_07 | ps_calc_08 | ps_calc_09 | ps_calc_10 | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
595207 | 1488013 | 0 | 3 | 1 | 10 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 13 | 1 | 0 | 0 | 0.5 | 0.3 | 0.692820 | 10 | 1 | -1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 31 | 3 | 0.374166 | 0.684631 | 0.385487 | 2.645751 | 0.4 | 0.5 | 0.3 | 3 | 0 | 9 | 0 | 9 | 1 | 12 | 4 | 1 | 9 | 6 | 0 | 1 | 1 | 0 | 1 | 1 |
595208 | 1488016 | 0 | 5 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 6 | 1 | 0 | 0 | 0.9 | 0.7 | 1.382027 | 9 | 1 | -1 | 0 | -1 | 15 | 0 | 0 | 2 | 1 | 63 | 2 | 0.387298 | 0.972145 | -1.000000 | 3.605551 | 0.2 | 0.2 | 0.0 | 2 | 4 | 8 | 6 | 8 | 2 | 12 | 4 | 1 | 3 | 8 | 1 | 0 | 1 | 0 | 1 | 1 |
595209 | 1488017 | 0 | 1 | 1 | 10 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 1 | 0 | 0 | 0.9 | 0.2 | 0.659071 | 7 | 1 | -1 | 0 | -1 | 1 | 1 | 1 | 2 | 1 | 31 | 3 | 0.397492 | 0.596373 | 0.398748 | 1.732051 | 0.4 | 0.0 | 0.3 | 3 | 2 | 7 | 4 | 8 | 0 | 10 | 3 | 2 | 2 | 6 | 0 | 0 | 1 | 0 | 0 | 0 |
595210 | 1488021 | 0 | 5 | 2 | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 1 | 0 | 0 | 0.9 | 0.4 | 0.698212 | 11 | 1 | -1 | 0 | -1 | 11 | 1 | 1 | 2 | 1 | 101 | 3 | 0.374166 | 0.764434 | 0.384968 | 3.162278 | 0.0 | 0.7 | 0.0 | 4 | 0 | 9 | 4 | 9 | 2 | 11 | 4 | 1 | 4 | 2 | 0 | 1 | 1 | 1 | 0 | 0 |
595211 | 1488027 | 0 | 0 | 1 | 8 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 1 | 0 | 0 | 0.1 | 0.2 | -1.000000 | 7 | 0 | -1 | 0 | -1 | 0 | 1 | 0 | 2 | 1 | 34 | 2 | 0.400000 | 0.932649 | 0.378021 | 3.741657 | 0.4 | 0.0 | 0.5 | 2 | 3 | 10 | 4 | 10 | 2 | 5 | 4 | 4 | 3 | 8 | 0 | 1 | 0 | 0 | 0 | 0 |
- Components:
  - Binary variables → yes or no
  - Categorical variables whose category values are integers
  - Other variables with integer or floating-point values
  - Variables containing -1 (-1 denotes a missing value)
  - The target variable and the ID variable

### Checking the number of rows and columns
train.shape
(595212, 59)
- There are 59 variables and 595212 observations

### Removing duplicated data
train = train.drop_duplicates()  # assign back: drop_duplicates() is not in-place by default
train.shape
(595212, 59)
### Checking the shape of the test data
test.shape
(892816, 58)
- One variable is missing: the target variable
  ~(of course, it shouldn't be in the test data..)~
- Later we can create dummy variables for the 14 categorical variables
  - The bin variables are already binary, so no dummification is needed
### Checking the data info
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 595212 entries, 0 to 595211
Data columns (total 59 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   id              595212 non-null  int64
 1   target          595212 non-null  int64
 2   ps_ind_01       595212 non-null  int64
 3   ps_ind_02_cat   595212 non-null  int64
 4   ps_ind_03       595212 non-null  int64
 5   ps_ind_04_cat   595212 non-null  int64
 6   ps_ind_05_cat   595212 non-null  int64
 7   ps_ind_06_bin   595212 non-null  int64
 8   ps_ind_07_bin   595212 non-null  int64
 9   ps_ind_08_bin   595212 non-null  int64
 10  ps_ind_09_bin   595212 non-null  int64
 11  ps_ind_10_bin   595212 non-null  int64
 12  ps_ind_11_bin   595212 non-null  int64
 13  ps_ind_12_bin   595212 non-null  int64
 14  ps_ind_13_bin   595212 non-null  int64
 15  ps_ind_14       595212 non-null  int64
 16  ps_ind_15       595212 non-null  int64
 17  ps_ind_16_bin   595212 non-null  int64
 18  ps_ind_17_bin   595212 non-null  int64
 19  ps_ind_18_bin   595212 non-null  int64
 20  ps_reg_01       595212 non-null  float64
 21  ps_reg_02       595212 non-null  float64
 22  ps_reg_03       595212 non-null  float64
 23  ps_car_01_cat   595212 non-null  int64
 24  ps_car_02_cat   595212 non-null  int64
 25  ps_car_03_cat   595212 non-null  int64
 26  ps_car_04_cat   595212 non-null  int64
 27  ps_car_05_cat   595212 non-null  int64
 28  ps_car_06_cat   595212 non-null  int64
 29  ps_car_07_cat   595212 non-null  int64
 30  ps_car_08_cat   595212 non-null  int64
 31  ps_car_09_cat   595212 non-null  int64
 32  ps_car_10_cat   595212 non-null  int64
 33  ps_car_11_cat   595212 non-null  int64
 34  ps_car_11       595212 non-null  int64
 35  ps_car_12       595212 non-null  float64
 36  ps_car_13       595212 non-null  float64
 37  ps_car_14       595212 non-null  float64
 38  ps_car_15       595212 non-null  float64
 39  ps_calc_01      595212 non-null  float64
 40  ps_calc_02      595212 non-null  float64
 41  ps_calc_03      595212 non-null  float64
 42  ps_calc_04      595212 non-null  int64
 43  ps_calc_05      595212 non-null  int64
 44  ps_calc_06      595212 non-null  int64
 45  ps_calc_07      595212 non-null  int64
 46  ps_calc_08      595212 non-null  int64
 47  ps_calc_09      595212 non-null  int64
 48  ps_calc_10      595212 non-null  int64
 49  ps_calc_11      595212 non-null  int64
 50  ps_calc_12      595212 non-null  int64
 51  ps_calc_13      595212 non-null  int64
 52  ps_calc_14      595212 non-null  int64
 53  ps_calc_15_bin  595212 non-null  int64
 54  ps_calc_16_bin  595212 non-null  int64
 55  ps_calc_17_bin  595212 non-null  int64
 56  ps_calc_18_bin  595212 non-null  int64
 57  ps_calc_19_bin  595212 non-null  int64
 58  ps_calc_20_bin  595212 non-null  int64
dtypes: float64(10), int64(49)
memory usage: 267.9 MB
- The data types are mainly integer or float
- There are no null values in the data
  - Missing values were replaced with -1
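Because the sentinel -1 is an ordinary number, `info()` reports no nulls. One quick way to surface the real missing counts is to map the sentinel back to NaN first; a minimal sketch on a toy frame (the column names merely mimic the real ones):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`: -1 marks a missing value, as in the competition data
df = pd.DataFrame({'ps_reg_03': [0.7, -1.0, 0.5, -1.0],
                   'ps_car_02_cat': [1, 0, -1, 1]})

# Re-encode the sentinel as NaN so pandas' missing-value tooling applies
missing_counts = df.replace(-1, np.nan).isnull().sum()
print(missing_counts)
```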
4. Metadata
- To make data management easier, we store meta-information about the variables in a DataFrame
- Useful when selecting variables for analysis, visualization, modeling, etc.
- What we store:
  - role: input, ID, target
  - level: nominal, interval, ordinal, binary
  - keep: True or False
  - dtype: int, float, str
data = []
for f in train.columns:
    ### role
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'

    ### level
    if 'bin' in f or f == 'target':
        level = 'binary'
    elif 'cat' in f or f == 'id':
        level = 'nominal'
    elif train[f].dtype == float:
        level = 'interval'
    elif train[f].dtype == int:
        level = 'ordinal'

    # Initialize keep to True for all variables except for id
    keep = True
    if f == 'id':
        keep = False

    ### dtype
    dtype = train[f].dtype

    # Creating a dict that contains all the metadata for the variable
    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep,
        'dtype': dtype
    }
    data.append(f_dict)

meta = pd.DataFrame(data, columns=['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace=True)
meta
role | level | keep | dtype | |
---|---|---|---|---|
varname | ||||
id | id | nominal | False | int64 |
target | target | binary | True | int64 |
ps_ind_01 | input | ordinal | True | int64 |
ps_ind_02_cat | input | nominal | True | int64 |
ps_ind_03 | input | ordinal | True | int64 |
ps_ind_04_cat | input | nominal | True | int64 |
ps_ind_05_cat | input | nominal | True | int64 |
ps_ind_06_bin | input | binary | True | int64 |
ps_ind_07_bin | input | binary | True | int64 |
ps_ind_08_bin | input | binary | True | int64 |
ps_ind_09_bin | input | binary | True | int64 |
ps_ind_10_bin | input | binary | True | int64 |
ps_ind_11_bin | input | binary | True | int64 |
ps_ind_12_bin | input | binary | True | int64 |
ps_ind_13_bin | input | binary | True | int64 |
ps_ind_14 | input | ordinal | True | int64 |
ps_ind_15 | input | ordinal | True | int64 |
ps_ind_16_bin | input | binary | True | int64 |
ps_ind_17_bin | input | binary | True | int64 |
ps_ind_18_bin | input | binary | True | int64 |
ps_reg_01 | input | interval | True | float64 |
ps_reg_02 | input | interval | True | float64 |
ps_reg_03 | input | interval | True | float64 |
ps_car_01_cat | input | nominal | True | int64 |
ps_car_02_cat | input | nominal | True | int64 |
ps_car_03_cat | input | nominal | True | int64 |
ps_car_04_cat | input | nominal | True | int64 |
ps_car_05_cat | input | nominal | True | int64 |
ps_car_06_cat | input | nominal | True | int64 |
ps_car_07_cat | input | nominal | True | int64 |
ps_car_08_cat | input | nominal | True | int64 |
ps_car_09_cat | input | nominal | True | int64 |
ps_car_10_cat | input | nominal | True | int64 |
ps_car_11_cat | input | nominal | True | int64 |
ps_car_11 | input | ordinal | True | int64 |
ps_car_12 | input | interval | True | float64 |
ps_car_13 | input | interval | True | float64 |
ps_car_14 | input | interval | True | float64 |
ps_car_15 | input | interval | True | float64 |
ps_calc_01 | input | interval | True | float64 |
ps_calc_02 | input | interval | True | float64 |
ps_calc_03 | input | interval | True | float64 |
ps_calc_04 | input | ordinal | True | int64 |
ps_calc_05 | input | ordinal | True | int64 |
ps_calc_06 | input | ordinal | True | int64 |
ps_calc_07 | input | ordinal | True | int64 |
ps_calc_08 | input | ordinal | True | int64 |
ps_calc_09 | input | ordinal | True | int64 |
ps_calc_10 | input | ordinal | True | int64 |
ps_calc_11 | input | ordinal | True | int64 |
ps_calc_12 | input | ordinal | True | int64 |
ps_calc_13 | input | ordinal | True | int64 |
ps_calc_14 | input | ordinal | True | int64 |
ps_calc_15_bin | input | binary | True | int64 |
ps_calc_16_bin | input | binary | True | int64 |
ps_calc_17_bin | input | binary | True | int64 |
ps_calc_18_bin | input | binary | True | int64 |
ps_calc_19_bin | input | binary | True | int64 |
ps_calc_20_bin | input | binary | True | int64 |
### Extracting all nominal variables that are not dropped (example)
meta[(meta.level == 'nominal') & (meta.keep)].index
Index(['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat', 'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat', 'ps_car_10_cat', 'ps_car_11_cat'], dtype='object', name='varname')
### Number of variables per role and level
pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()
role | level | count | |
---|---|---|---|
0 | id | nominal | 1 |
1 | input | binary | 17 |
2 | input | interval | 10 |
3 | input | nominal | 14 |
4 | input | ordinal | 16 |
5 | target | binary | 1 |
5. Descriptive statistics
- We can apply the describe method to the DataFrame
- However, computing the mean, standard deviation, … of categorical variables and the id variable is not very meaningful
  - Categorical variables → we will inspect them visually later
- With the meta file we can easily select the variables for which to compute descriptive statistics
  - To keep things clear, we do this per data type
5-1. Interval variables
v = meta[(meta.level == 'interval') & (meta.keep)].index
train[v].describe()
ps_reg_01 | ps_reg_02 | ps_reg_03 | ps_car_12 | ps_car_13 | ps_car_14 | ps_car_15 | ps_calc_01 | ps_calc_02 | ps_calc_03 | |
---|---|---|---|---|---|---|---|---|---|---|
count | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 |
mean | 0.610991 | 0.439184 | 0.551102 | 0.379945 | 0.813265 | 0.276256 | 3.065899 | 0.449756 | 0.449589 | 0.449849 |
std | 0.287643 | 0.404264 | 0.793506 | 0.058327 | 0.224588 | 0.357154 | 0.731366 | 0.287198 | 0.286893 | 0.287153 |
min | 0.000000 | 0.000000 | -1.000000 | -1.000000 | 0.250619 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.400000 | 0.200000 | 0.525000 | 0.316228 | 0.670867 | 0.333167 | 2.828427 | 0.200000 | 0.200000 | 0.200000 |
50% | 0.700000 | 0.300000 | 0.720677 | 0.374166 | 0.765811 | 0.368782 | 3.316625 | 0.500000 | 0.400000 | 0.500000 |
75% | 0.900000 | 0.600000 | 1.000000 | 0.400000 | 0.906190 | 0.396485 | 3.605551 | 0.700000 | 0.700000 | 0.700000 |
max | 0.900000 | 1.800000 | 4.037945 | 1.264911 | 3.720626 | 0.636396 | 3.741657 | 0.900000 | 0.900000 | 0.900000 |
⇒ reg variables

- `ps_reg_03` appears to have missing values
- The ranges (min to max) of the variables differ
- We could apply scaling (e.g. StandardScaler)
  - Depends on the classifier we will use
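Whether scaling matters depends on the classifier (tree ensembles largely ignore it), but as a small sketch of what `StandardScaler` would do to columns with different ranges (toy data, not the actual train frame):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy columns with different ranges, like ps_reg_01 vs ps_car_15
X = np.array([[0.7, 3.6],
              [0.5, 2.0],
              [0.9, 3.3],
              [0.6, 2.8]])

X_scaled = StandardScaler().fit_transform(X)
# after scaling, each column has mean 0 and (population) std 1
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```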
⇒ car variables

- `ps_car_12` and `ps_car_14` contain missing values
- Scaling could be applied

⇒ calc variables

- No missing values
- All three variables have very similar distributions

Overall, we can see that the ranges of the interval variables are rather small

- Perhaps some transformation (e.g. a log transformation) was already applied to anonymize the data?
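One hint in that direction (a community observation, not something stated in the official data description): squaring a few `ps_car_15` values from `train.head()` yields near-integers, consistent with a square-root transform having been applied:

```python
import numpy as np

# ps_car_15 values taken from train.head() / train.tail() above
vals = np.array([3.605551, 2.449490, 3.316625, 2.000000, 3.741657])

squared = vals ** 2
print(np.round(squared, 3))  # approximately 13, 6, 11, 4, 14
```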
5-2. Ordinal variables
v = meta[(meta.level == 'ordinal') & (meta.keep)].index
train[v].describe()
ps_ind_01 | ps_ind_03 | ps_ind_14 | ps_ind_15 | ps_car_11 | ps_calc_04 | ps_calc_05 | ps_calc_06 | ps_calc_07 | ps_calc_08 | ps_calc_09 | ps_calc_10 | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 |
mean | 1.900378 | 4.423318 | 0.012451 | 7.299922 | 2.346072 | 2.372081 | 1.885886 | 7.689445 | 3.005823 | 9.225904 | 2.339034 | 8.433590 | 5.441382 | 1.441918 | 2.872288 | 7.539026 |
std | 1.983789 | 2.699902 | 0.127545 | 3.546042 | 0.832548 | 1.117219 | 1.134927 | 1.334312 | 1.414564 | 1.459672 | 1.246949 | 2.904597 | 2.332871 | 1.202963 | 1.694887 | 2.746652 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 2.000000 | 0.000000 | 5.000000 | 2.000000 | 2.000000 | 1.000000 | 7.000000 | 2.000000 | 8.000000 | 1.000000 | 6.000000 | 4.000000 | 1.000000 | 2.000000 | 6.000000 |
50% | 1.000000 | 4.000000 | 0.000000 | 7.000000 | 3.000000 | 2.000000 | 2.000000 | 8.000000 | 3.000000 | 9.000000 | 2.000000 | 8.000000 | 5.000000 | 1.000000 | 3.000000 | 7.000000 |
75% | 3.000000 | 6.000000 | 0.000000 | 10.000000 | 3.000000 | 3.000000 | 3.000000 | 9.000000 | 4.000000 | 10.000000 | 3.000000 | 10.000000 | 7.000000 | 2.000000 | 4.000000 | 9.000000 |
max | 7.000000 | 11.000000 | 4.000000 | 13.000000 | 3.000000 | 5.000000 | 6.000000 | 10.000000 | 9.000000 | 12.000000 | 7.000000 | 25.000000 | 19.000000 | 10.000000 | 13.000000 | 23.000000 |
- Only `ps_car_11` has missing values
- Scaling could be applied to the variables with different ranges
5-3. Binary variables
v = meta[(meta.level == 'binary') & (meta.keep)].index
train[v].describe()
target | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ps_ind_09_bin | ps_ind_10_bin | ps_ind_11_bin | ps_ind_12_bin | ps_ind_13_bin | ps_ind_16_bin | ps_ind_17_bin | ps_ind_18_bin | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 |
mean | 0.036448 | 0.393742 | 0.257033 | 0.163921 | 0.185304 | 0.000373 | 0.001692 | 0.009439 | 0.000948 | 0.660823 | 0.121081 | 0.153446 | 0.122427 | 0.627840 | 0.554182 | 0.287182 | 0.349024 | 0.153318 |
std | 0.187401 | 0.488579 | 0.436998 | 0.370205 | 0.388544 | 0.019309 | 0.041097 | 0.096693 | 0.030768 | 0.473430 | 0.326222 | 0.360417 | 0.327779 | 0.483381 | 0.497056 | 0.452447 | 0.476662 | 0.360295 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
- The proportion of 1s (= true) in the train data is 3.645% → highly imbalanced data
- For most of the binary variables, the values are mostly zero
6. Handling imbalanced classes
- The proportion of records with target = 1 is far smaller than with target = 0 → imbalanced data
- This can lead to a model that is accurate but noisy
- Strategies to address this:
  - Oversampling the records with target = 1
  - Undersampling the records with target = 0
- Since the current train set is of considerable size, we go with undersampling

📚 Reference: sampling techniques for handling imbalanced data
desired_apriori = 0.10
### Getting the indices for each target value
idx_0 = train[train.target == 0].index # False
idx_1 = train[train.target == 1].index # True
### Checking the target distribution
nb_0 = len(train.loc[idx_0])
nb_1 = len(train.loc[idx_1])
### Added code
print(nb_0)
print(nb_1)
573518
21694
- We can confirm that the target has a highly imbalanced distribution.
### Choosing the undersampling rate
undersampling_rate = ((1 - desired_apriori)*nb_1) / (nb_0*desired_apriori)
# fraction of the target = 0 records to keep so that target = 1 makes up desired_apriori of the data
undersampled_nb_0 = int(undersampling_rate * nb_0)

print('Rate to undersample records with target = 0: {}'.format(undersampling_rate))
print('Number of records with target = 0 after undersampling: {}'.format(undersampled_nb_0))
Rate to undersample records with target = 0: 0.34043569687437886
Number of records with target = 0 after undersampling: 195246
- The imbalance is greatly reduced: the target = 0 records shrink from about 96% of the original data to about 33% of it
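The rate formula can be sanity-checked by hand: requiring `desired_apriori = nb_1 / (nb_1 + rate * nb_0)` and solving for `rate` gives exactly the expression used in the next cell. A self-contained sketch with the counts from above hard-coded:

```python
# counts printed earlier in this notebook
nb_0, nb_1 = 573518, 21694
desired_apriori = 0.10

# solving desired_apriori = nb_1 / (nb_1 + rate * nb_0) for rate:
rate = ((1 - desired_apriori) * nb_1) / (nb_0 * desired_apriori)
kept_0 = int(rate * nb_0)

# plugging the numbers back in recovers the desired 10% share of target = 1
share_1 = nb_1 / (nb_1 + kept_0)
print(rate, kept_0, share_1)
```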
### Performing the undersampling
# random undersampling of the target = 0 records
undersampled_idx = shuffle(idx_0, random_state = 37, n_samples = undersampled_nb_0)
# random_state keeps the result reproducible across runs

# combine the sampled target = 0 indices with all target = 1 indices
idx_list = list(undersampled_idx) + list(idx_1)

# build the undersampled (final) dataset
train = train.loc[idx_list].reset_index(drop=True)
7. Data Quality Checks
7-1. Checking for missing values
- Missing values are encoded as `-1`
vars_with_missing = []  # variables that contain missing values

for f in train.columns:  # for each column
    missings = train[train[f] == -1][f].count()
    if missings > 0:
        vars_with_missing.append(f)
        missings_perc = missings / train.shape[0]  # proportion of missing values in the column

        print('Variable {} has {} records ({:.2%}) with missing values'.format(f, missings, missings_perc))

print()
# total number of variables with missing values
print('In total, there are {} variables with missing values'.format(len(vars_with_missing)))
Variable ps_ind_02_cat has 103 records (0.05%) with missing values
Variable ps_ind_04_cat has 51 records (0.02%) with missing values
Variable ps_ind_05_cat has 2256 records (1.04%) with missing values
Variable ps_reg_03 has 38580 records (17.78%) with missing values
Variable ps_car_01_cat has 62 records (0.03%) with missing values
Variable ps_car_02_cat has 2 records (0.00%) with missing values
Variable ps_car_03_cat has 148367 records (68.39%) with missing values
Variable ps_car_05_cat has 96026 records (44.26%) with missing values
Variable ps_car_07_cat has 4431 records (2.04%) with missing values
Variable ps_car_09_cat has 230 records (0.11%) with missing values
Variable ps_car_11 has 1 records (0.00%) with missing values
Variable ps_car_14 has 15726 records (7.25%) with missing values

In total, there are 12 variables with missing values
- `ps_car_03_cat` and `ps_car_05_cat` have a high proportion of records with missing values
  - Drop these variables
- For the other categorical variables with missing values, we can leave the missing value -1 as it is
- `ps_reg_03` (continuous) is missing for 18% of the records
  - Replace with the mean
- `ps_car_11` (ordinal) has only a handful of records with missing values (1 in the output above)
  - Replace with the mode
- `ps_car_12` (continuous) has only 1 record with a missing value
  - Replace with the mean
- `ps_car_14` (continuous) is missing for 7% of all records
  - Replace with the mean
### Dropping variables with too many missing values
vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
train.drop(vars_to_drop, inplace = True, axis = 1)
meta.loc[(vars_to_drop),'keep'] = False # update the metadata

### Imputing the missing values
mean_imp = SimpleImputer(missing_values = -1, strategy = 'mean')
mode_imp = SimpleImputer(missing_values = -1, strategy = 'most_frequent')
train['ps_reg_03'] = mean_imp.fit_transform(train[['ps_reg_03']]).ravel()
train['ps_car_12'] = mean_imp.fit_transform(train[['ps_car_12']]).ravel()
train['ps_car_14'] = mean_imp.fit_transform(train[['ps_car_14']]).ravel()
train['ps_car_11'] = mode_imp.fit_transform(train[['ps_car_11']]).ravel()
7-2. Checking the cardinality of the categorical variables
- Cardinality: the number of distinct values a variable contains
- Since we will create dummy variables from the categorical variables later, we need to check how many distinct values each one has
  - A variable with many distinct values would produce many dummy variables, so it must be handled differently
v = meta[(meta.level == 'nominal') & (meta.keep)].index

for f in v:
    dist_values = train[f].value_counts().shape[0]
    print('Variable {} has {} distinct values'.format(f, dist_values))
Variable ps_ind_02_cat has 5 distinct values
Variable ps_ind_04_cat has 3 distinct values
Variable ps_ind_05_cat has 8 distinct values
Variable ps_car_01_cat has 13 distinct values
Variable ps_car_02_cat has 3 distinct values
Variable ps_car_04_cat has 10 distinct values
Variable ps_car_06_cat has 18 distinct values
Variable ps_car_07_cat has 3 distinct values
Variable ps_car_08_cat has 2 distinct values
Variable ps_car_09_cat has 6 distinct values
Variable ps_car_10_cat has 3 distinct values
Variable ps_car_11_cat has 104 distinct values
- `ps_car_11_cat` has many distinct values
- Smoothing is computed as in the following paper by Daniele Micci-Barreca
  - https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
- Parameters:
  - trn_series: training categorical feature as a pd.Series
  - tst_series: test categorical feature as a pd.Series
  - target: target data as a pd.Series
  - min_samples_leaf (int): minimum number of samples required to take a category's average into account
  - smoothing (int): smoothing effect to balance the categorical average against the prior mean
def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series = None,
                  tst_series = None,
                  target = None,
                  min_samples_leaf = 1,
                  smoothing = 1,
                  noise_level = 0):
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    '''[assert statement]
    - states that a condition must hold
    - if the condition is false, it is an error => execution is stopped
    '''
    temp = pd.concat([trn_series, target], axis = 1)

    ### mean (and count) of the target per category value
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])

    ### compute the smoothing factor
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))

    ### apply the smoothing to every category average
    prior = target.mean()
    # the larger the count, the less the overall mean (prior) is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis = 1, inplace = True)

    ### apply the smoothed averages to the train and test series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns = {'index': target.name, target.name: 'average'}),
        on = trn_series.name,
        how = 'left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # restore the original index
    ft_trn_series.index = trn_series.index

    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # restore the original index
    ft_tst_series.index = tst_series.index

    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)
### Encoding the categorical variable
# categorical -> numeric via target encoding (rather than many dummy variables)
train_encoded, test_encoded = target_encode(train["ps_car_11_cat"],
test["ps_car_11_cat"],
target = train.target,
min_samples_leaf = 100,
smoothing = 10,
noise_level = 0.01)
### Replacing with the encoded values
train['ps_car_11_cat_te'] = train_encoded
train.drop('ps_car_11_cat', axis = 1, inplace = True)
# update the metadata
meta.loc['ps_car_11_cat','keep'] = False
test['ps_car_11_cat_te'] = test_encoded
test.drop('ps_car_11_cat', axis=1, inplace = True)
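The core of `target_encode` above is the smoothed mean. A self-contained toy illustration of just that formula (category names and parameter values here are invented for the demo):

```python
import numpy as np
import pandas as pd

# toy data: a frequent category 'a' (claim rate 0.2) and a rare category 'b'
df = pd.DataFrame({'cat': ['a'] * 200 + ['b'] * 2,
                   'target': [1] * 40 + [0] * 160 + [1, 1]})

prior = df['target'].mean()
stats = df.groupby('cat')['target'].agg(['mean', 'count'])

# smoothing factor, as in target_encode (min_samples_leaf=100, smoothing=10)
s = 1 / (1 + np.exp(-(stats['count'] - 100) / 10))
encoded = prior * (1 - s) + stats['mean'] * s

# 'a' keeps (almost) its own mean; the rare 'b' is pulled toward the prior
print(encoded)
```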
8. Exploratory Data Visualization
8-1. Categorical variables
Check the proportion of customers with `target = 1` for each category
v = meta[(meta.level == 'nominal') & (meta.keep)].index

for f in v:
    ### figure setup
    fig, ax = plt.subplots(figsize = (20,10))

    ### proportion of target = 1 per category value
    cat_perc = train[[f, 'target']].groupby([f], as_index = False).mean()
    cat_perc.sort_values(by = 'target', ascending = False, inplace = True)  # sort in descending order

    ### bar plot, categories ordered by descending mean target
    sns.barplot(ax = ax, x = f, y = 'target', data = cat_perc, order = cat_perc[f])
    plt.ylabel('% target', fontsize = 18)
    plt.xlabel(f, fontsize = 18)
    plt.tick_params(axis = 'both', which = 'major', labelsize = 18)
    plt.show()
- Rather than replacing the missing values with the mode, it looks better to keep them as a separate category
- Customers with a missing value appear to be much more (in some cases much less) likely to file an insurance claim
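The bar plots can also be read numerically with the same groupby used to build them: the mean of target per category, with -1 kept as its own category. A toy sketch (the values are invented; only the pattern matters):

```python
import pandas as pd

# toy stand-in for train[['ps_car_07_cat', 'target']]
df = pd.DataFrame({'ps_car_07_cat': [-1, -1, 1, 1, 1, 0, 0, 1],
                   'target':        [ 1,  0, 0, 0, 0, 0, 0, 0]})

# claim rate per category; the -1 (missing) group gets its own rate
rates = df.groupby('ps_car_07_cat')['target'].mean()
print(rates)
```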
8-2. Interval variables
- Check the correlations between the variables
- A heatmap is a good way to visualize how the variables are correlated
### function for the heatmap visualization
def corr_heatmap(v):
    ### correlation matrix of the variables
    correlations = train[v].corr()

    ### visualization
    cmap = sns.diverging_palette(220, 10, as_cmap = True)  # diverging colormap
    fig, ax = plt.subplots(figsize = (10,10))
    sns.heatmap(correlations, cmap = cmap, vmax = 1.0, center = 0,
                fmt='.2f', square=True, linewidths=.5, annot=True,
                cbar_kws={"shrink": .75})
    plt.show()
v = meta[(meta.level == 'interval') & (meta.keep)].index
corr_heatmap(v)
- There are strong correlations between these variables:
  - `ps_reg_02` & `ps_reg_03` (0.7)
  - `ps_car_12` & `ps_car_13` (0.67)
  - `ps_car_12` & `ps_car_14` (0.58)
  - `ps_car_13` & `ps_car_15` (0.67)
- Seaborn has useful tools to visualize the (linear) relationship between variables
  - A pairplot can visualize the relationships between variables
  - Let's look at the highly correlated variables individually!
### Taking a sample to speed up processing
s = train.sample(frac = 0.1)
โ ps_reg_02 & ps_reg_03
sns.lmplot(x = 'ps_reg_02', y = 'ps_reg_03', data = s, hue = 'target',
palette = 'Set1', scatter_kws = {'alpha':0.3})
plt.show()
- As the regression lines show, there is a linear relationship between the variables
- Thanks to the `hue` parameter, we can see that the regression lines for target = 0 and target = 1 are the same
โ ps_car_12 and ps_car_13
sns.lmplot(x = 'ps_car_12', y = 'ps_car_13', data = s, hue = 'target',
palette = 'Set1', scatter_kws = {'alpha':0.3})
plt.show()
โ ps_car_12 and ps_car_14
sns.lmplot(x = 'ps_car_12', y = 'ps_car_14', data = s, hue = 'target',
palette = 'Set1', scatter_kws = {'alpha':0.3})
plt.show()
โ ps_car_13 and ps_car_15
sns.lmplot(x = 'ps_car_15', y = 'ps_car_13', data = s, hue = 'target',
palette = 'Set1', scatter_kws = {'alpha':0.3})
plt.show()
- We could run a principal component analysis (PCA) on these variables to reduce the dimensionality
  - Since the number of correlated variables is rather small, we just leave them as they are!
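If one did want to try it, a minimal PCA sketch on synthetic correlated columns (sklearn's `PCA` on toy data, not the real interval variables) shows how a strong correlation collapses into one component:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x = rng.rand(500)
# second column strongly correlated with the first, like ps_reg_02 / ps_reg_03
X = np.column_stack([x, 0.8 * x + 0.05 * rng.randn(500)])

pca = PCA(n_components=2).fit(X)
# the first component carries almost all the variance
print(pca.explained_variance_ratio_)
```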
8-3. Ordinal variables
### Correlations
v = meta[(meta.level == 'ordinal') & (meta.keep)].index
corr_heatmap(v)
- No highly correlated ordinal variables are visible
  - Instead, we could check how the distributions look when grouped by the target value
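That grouped check could look like the following sketch (a toy frame stands in for `train`; on the real data one could also draw e.g. `sns.boxplot(x='target', y=col, data=train)`):

```python
import pandas as pd

# toy ordinal column whose distribution differs by target class
df = pd.DataFrame({'target':    [0, 0, 0, 0, 1, 1, 1, 1],
                   'ps_ind_15': [7, 8, 6, 7, 3, 4, 2, 3]})

# distribution of the ordinal variable within each target class
print(df.groupby('target')['ps_ind_15'].describe())
```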
9. Feature engineering
9-1. Creating dummy variables
- The values of a categorical variable do not represent any order or magnitude
  - Category 2 is not twice the value of category 1
- To deal with this we create dummy variables
  - A category can be derived from the other dummy variables created for the original variable → drop the first dummy variable
v = meta[(meta.level == 'nominal') & (meta.keep)].index
print('Before dummification we have {} variables in train'.format(train.shape[1]))
train = pd.get_dummies(train, columns = v, drop_first = True)
print('After dummification we have {} variables in train'.format(train.shape[1]))
Before dummification we have 57 variables in train
After dummification we have 109 variables in train
- Creating the dummy variables added 52 extra variables.
9-2. Creating interaction variables
v = meta[(meta.level == 'interval') & (meta.keep)].index
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
interactions = pd.DataFrame(data=poly.fit_transform(train[v]), columns=poly.get_feature_names_out(v))
interactions.drop(v, axis=1, inplace=True) # Remove the original columns
# Concat the interaction variables to the train data
print('Before creating interactions we have {} variables in train'.format(train.shape[1]))
train = pd.concat([train, interactions], axis=1)
print('After creating interactions we have {} variables in train'.format(train.shape[1]))
Before creating interactions we have 109 variables in train
After creating interactions we have 164 variables in train
- The interaction variables were added to the train data
- Thanks to the get_feature_names_out() method, we can assign these names to the new variables
10. Feature selection
10-1. Removing features with zero or low variance
- Sklearn has a convenient method for this: VarianceThreshold
  - By default it removes features with zero variance
  - We confirmed in an earlier step that there are no zero-variance variables
- Removing the features with less than 1% variance drops 28 variables
selector = VarianceThreshold(threshold = .01)
selector.fit(train.drop(['id', 'target'], axis = 1))
f = np.vectorize(lambda x : not x) # invert the boolean values
v = train.drop(['id', 'target'], axis = 1).columns[f(selector.get_support())]
### Extract the variables whose variance is below the threshold
print('{} variables have too low variance.'.format(len(v)))
print('These variables are {}'.format(list(v)))
28 variables have too low variance.
These variables are ['ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_car_12', 'ps_car_14', 'ps_car_11_cat_te', 'ps_ind_05_cat_2', 'ps_ind_05_cat_5', 'ps_car_01_cat_1', 'ps_car_01_cat_2', 'ps_car_04_cat_3', 'ps_car_04_cat_4', 'ps_car_04_cat_5', 'ps_car_04_cat_6', 'ps_car_04_cat_7', 'ps_car_06_cat_2', 'ps_car_06_cat_5', 'ps_car_06_cat_8', 'ps_car_06_cat_12', 'ps_car_06_cat_16', 'ps_car_06_cat_17', 'ps_car_09_cat_4', 'ps_car_10_cat_1', 'ps_car_10_cat_2', 'ps_car_12^2', 'ps_car_12 ps_car_14', 'ps_car_14^2']
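As a side note, the boolean mask from get_support() can also be inverted with plain `~` instead of np.vectorize; a minimal sketch on synthetic data (the column names are made up):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'near_constant': [1] + [0] * 99,   # variance ~0.0099, below the threshold
    'varied': rng.normal(size=100),    # variance ~1
})

selector = VarianceThreshold(threshold=0.01)
selector.fit(df)

# get_support() is True for kept columns; invert it to list the dropped ones
dropped = df.columns[~selector.get_support()]
print(list(dropped))  # ['near_constant']
```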
- If we were to select based on variance, we would lose many variables
- But since we don't have too many variables, we will let the classifier choose instead
  - For data sets with many variables, this can reduce the processing time
- Sklearn provides another feature selection method: SelectFromModel
### Use all variables
# Using a RandomForest
X_train = train.drop(['id', 'target'], axis = 1)
y_train = train['target']
feat_labels = X_train.columns
rf = RandomForestClassifier(n_estimators = 1000, random_state = 0, n_jobs=-1)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
indices = np.argsort(rf.feature_importances_)[::-1] # sort indices in reverse (descending) order
for f in range(X_train.shape[1]):
print("%2d) %-*s %f" % (f + 1, 30,feat_labels[indices[f]], importances[indices[f]]))
1) ps_car_11_cat_te 0.021066 2) ps_car_13 0.017323 3) ps_car_12 ps_car_13 0.017271 4) ps_car_13^2 0.017238 5) ps_car_13 ps_car_14 0.017201 6) ps_reg_03 ps_car_13 0.017106 7) ps_car_13 ps_car_15 0.016783 8) ps_reg_01 ps_car_13 0.016777 9) ps_reg_03 ps_car_14 0.016258 10) ps_reg_03 ps_car_12 0.015567 11) ps_reg_03 ps_car_15 0.015179 12) ps_car_14 ps_car_15 0.015002 13) ps_car_13 ps_calc_02 0.014744 14) ps_reg_01 ps_reg_03 0.014728 15) ps_car_13 ps_calc_01 0.014714 16) ps_reg_02 ps_car_13 0.014641 17) ps_car_13 ps_calc_03 0.014635 18) ps_reg_01 ps_car_14 0.014488 19) ps_reg_03^2 0.014317 20) ps_reg_03 0.014165 21) ps_reg_03 ps_calc_03 0.013774 22) ps_reg_03 ps_calc_02 0.013765 23) ps_reg_03 ps_calc_01 0.013694 24) ps_calc_10 0.013626 25) ps_car_14 ps_calc_02 0.013594 26) ps_car_14 ps_calc_01 0.013580 27) ps_car_14 ps_calc_03 0.013519 28) ps_calc_14 0.013414 29) ps_car_12 ps_car_14 0.012919 30) ps_ind_03 0.012910 31) ps_car_14^2 0.012752 32) ps_car_14 0.012751 33) ps_reg_02 ps_car_14 0.012746 34) ps_calc_11 0.012582 35) ps_reg_02 ps_reg_03 0.012560 36) ps_ind_15 0.012182 37) ps_car_12 ps_car_15 0.010928 38) ps_car_15 ps_calc_03 0.010903 39) ps_car_15 ps_calc_02 0.010843 40) ps_car_15 ps_calc_01 0.010814 41) ps_car_12 ps_calc_01 0.010496 42) ps_calc_13 0.010441 43) ps_car_12 ps_calc_03 0.010354 44) ps_car_12 ps_calc_02 0.010304 45) ps_reg_02 ps_car_15 0.010219 46) ps_reg_01 ps_car_15 0.010213 47) ps_calc_02 ps_calc_03 0.010074 48) ps_calc_01 ps_calc_03 0.010060 49) ps_calc_01 ps_calc_02 0.010000 50) ps_calc_07 0.009818 51) ps_calc_08 0.009797 52) ps_reg_01 ps_car_12 0.009466 53) ps_reg_02 ps_car_12 0.009299 54) ps_reg_02 ps_calc_01 0.009288 55) ps_reg_02 ps_calc_03 0.009221 56) ps_reg_02 ps_calc_02 0.009146 57) ps_reg_01 ps_calc_03 0.009059 58) ps_calc_06 0.009044 59) ps_reg_01 ps_calc_02 0.009037 60) ps_reg_01 ps_calc_01 0.009012 61) ps_calc_09 0.008800 62) ps_ind_01 0.008605 63) ps_calc_05 0.008318 64) ps_calc_04 0.008128 65) ps_reg_01 ps_reg_02 0.008024 66) 
ps_calc_12 0.007976 67) ps_car_15 0.006139 68) ps_car_15^2 0.006136 69) ps_calc_03 0.006017 70) ps_calc_01^2 0.006009 71) ps_calc_03^2 0.005974 72) ps_calc_01 0.005964 73) ps_calc_02^2 0.005945 74) ps_calc_02 0.005919 75) ps_car_12^2 0.005356 76) ps_car_12 0.005345 77) ps_reg_02^2 0.004986 78) ps_reg_02 0.004973 79) ps_reg_01 0.004159 80) ps_reg_01^2 0.004139 81) ps_car_11 0.003798 82) ps_ind_05_cat_0 0.003557 83) ps_ind_17_bin 0.002843 84) ps_calc_17_bin 0.002674 85) ps_calc_16_bin 0.002590 86) ps_calc_19_bin 0.002548 87) ps_calc_18_bin 0.002504 88) ps_ind_16_bin 0.002403 89) ps_car_01_cat_11 0.002393 90) ps_ind_04_cat_0 0.002380 91) ps_ind_04_cat_1 0.002359 92) ps_ind_07_bin 0.002333 93) ps_car_09_cat_2 0.002312 94) ps_ind_02_cat_1 0.002275 95) ps_car_01_cat_7 0.002130 96) ps_calc_20_bin 0.002095 97) ps_car_09_cat_0 0.002090 98) ps_ind_02_cat_2 0.002088 99) ps_ind_06_bin 0.002058 100) ps_car_06_cat_1 0.002007 101) ps_calc_15_bin 0.001989 102) ps_car_07_cat_1 0.001957 103) ps_ind_08_bin 0.001937 104) ps_car_09_cat_1 0.001804 105) ps_car_06_cat_11 0.001804 106) ps_ind_18_bin 0.001719 107) ps_ind_09_bin 0.001719 108) ps_car_01_cat_10 0.001605 109) ps_car_01_cat_9 0.001595 110) ps_car_01_cat_4 0.001545 111) ps_car_01_cat_6 0.001544 112) ps_car_06_cat_14 0.001532 113) ps_ind_05_cat_6 0.001494 114) ps_ind_02_cat_3 0.001430 115) ps_car_07_cat_0 0.001372 116) ps_car_01_cat_8 0.001345 117) ps_car_08_cat_1 0.001343 118) ps_car_02_cat_1 0.001328 119) ps_car_02_cat_0 0.001307 120) ps_car_06_cat_4 0.001241 121) ps_ind_05_cat_4 0.001199 122) ps_ind_02_cat_4 0.001163 123) ps_car_01_cat_5 0.001143 124) ps_car_06_cat_6 0.001105 125) ps_car_06_cat_10 0.001063 126) ps_ind_05_cat_2 0.001036 127) ps_car_04_cat_1 0.001030 128) ps_car_04_cat_2 0.000992 129) ps_car_06_cat_7 0.000986 130) ps_car_01_cat_3 0.000896 131) ps_car_09_cat_3 0.000878 132) ps_car_01_cat_0 0.000877 133) ps_ind_14 0.000854 134) ps_car_06_cat_15 0.000847 135) ps_car_06_cat_9 0.000791 136) ps_ind_05_cat_1 0.000750 
137) ps_car_06_cat_3 0.000711 138) ps_car_10_cat_1 0.000696 139) ps_ind_12_bin 0.000684 140) ps_ind_05_cat_3 0.000665 141) ps_car_09_cat_4 0.000623 142) ps_car_01_cat_2 0.000553 143) ps_car_04_cat_8 0.000550 144) ps_car_06_cat_17 0.000512 145) ps_car_06_cat_16 0.000475 146) ps_car_04_cat_9 0.000443 147) ps_car_06_cat_12 0.000427 148) ps_car_06_cat_13 0.000403 149) ps_car_01_cat_1 0.000381 150) ps_ind_05_cat_5 0.000312 151) ps_car_06_cat_5 0.000273 152) ps_ind_11_bin 0.000215 153) ps_car_04_cat_6 0.000201 154) ps_ind_13_bin 0.000152 155) ps_car_04_cat_3 0.000149 156) ps_car_06_cat_2 0.000143 157) ps_car_04_cat_5 0.000097 158) ps_car_06_cat_8 0.000094 159) ps_car_04_cat_7 0.000080 160) ps_ind_10_bin 0.000074 161) ps_car_10_cat_2 0.000060 162) ps_car_04_cat_4 0.000045
- With SelectFromModel we can specify a threshold on the feature importances of the prefit classifier being used
- Applying the get_support method then lets us limit the number of variables in the train data
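The threshold mechanics can be sketched on synthetic data: with `threshold='median'`, features whose importance is at least the median survive, i.e. roughly half of them (here 3 of 6, assuming no ties in the importances).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only the first two columns drive the label
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# prefit=True: reuse the already-fitted forest instead of refitting
sfm = SelectFromModel(rf, threshold='median', prefit=True)
print(sfm.transform(X).shape[1])  # roughly half (3) of the features survive
```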
### Apply SelectFromModel
# Selectively use only certain features
sfm = SelectFromModel(rf, threshold = 'median', prefit = True)
print('Number of features before selection: {}'.format(X_train.shape[1]))
# ๋ณ์ ์ ํ
n_features = sfm.transform(X_train).shape[1]
print('Number of features after selection: {}'.format(n_features))
selected_vars = list(feat_labels[sfm.get_support()])
Number of features before selection: 162
/usr/local/lib/python3.9/dist-packages/sklearn/base.py:432: UserWarning: X has feature names, but SelectFromModel was fitted without feature names warnings.warn(
Number of features after selection: 81
train = train[selected_vars + ['target']]
11. Feature scaling
- We can apply a StandardScaler to adjust the range of the train data.
📌 A statistics professor's explanation
- Apart from model algorithms whose performance depends on the distances between data points, you do not strictly have to do this.
  (In fact, interpreting the results after standardization can be harder.)
- Cases where it is a must (distance-based algorithms): Ridge, Lasso, SVM, etc.
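What StandardScaler does can be verified on a tiny made-up matrix: each column is shifted and rescaled to mean 0 and unit variance, regardless of its original scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
scaled = StandardScaler().fit_transform(X)

# Each column now has mean 0 and standard deviation 1
print(scaled.mean(axis=0))  # ~[0, 0]
print(scaled.std(axis=0))   # ~[1, 1]
```

This is why distance-based models such as Ridge, Lasso, and SVM benefit: no single column dominates purely because of its units.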
scaler = StandardScaler()
scaler.fit_transform(train.drop(['target'], axis=1))
array([[-0.45941104, -1.26665356, 1.05087653, ..., -0.72553616, -1.01071913, -1.06173767], [ 1.55538958, 0.95034274, -0.63847299, ..., -1.06120876, -1.01071913, 0.27907892], [ 1.05168943, -0.52765479, -0.92003125, ..., 1.95984463, -0.56215309, -1.02449277], ..., [-0.9631112 , 0.58084336, 0.48776003, ..., -0.46445747, 0.18545696, 0.27907892], [-0.9631112 , -0.89715418, -1.48314775, ..., -0.91202093, -0.41263108, 0.27907892], [-0.45941104, -1.26665356, 1.61399304, ..., 0.28148164, -0.11358706, -0.72653353]])