📢 All cell execution outputs have been cleared because of compute-resource problems and unresolved errors. To check each cell's output, please refer to the link below.

0. Competition Introduction

  • Costa Rican Household Poverty Level Prediction

  • 📌 Goal

    • Develop a machine learning model that can predict a household's poverty level using both individual and household characteristics
  • 📌 Notebook overview

    • Problem definition

    • EDA on the dataset

    • Testing/selecting/optimizing several machine learning models

    • Inspecting the model's output, drawing conclusions

    • Automating feature engineering

0-1. Data Description

  • The data is provided as two files, train.csv and test.csv

    • train set: 9557 rows * 143 columns

    • test set: 23856 rows * 142 columns

  • Each row represents a single individual or household, and each column represents one of their attributes


  • Target composition (the classes)

    • 4 poverty levels
    
    1 = extreme poverty household
    
    2 = moderate poverty household
    
    3 = vulnerable household
    
    4 = non-vulnerable household
    
    
  • Columns

    • 143 columns in total

    • Full column descriptions

    • Id: unique identifier for each individual -> not used

    • idhogar: unique identifier for each household -> used to group individuals by household

    • parentesco1: whether the person is the head of household (a kind of flag variable)
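
As a toy illustration of how these two identifiers work together (the mini-frame below is made up; the real data is loaded later in the notebook), idhogar groups individuals into households and parentesco1 flags each household's head:

```python
import pandas as pd

# Hypothetical mini-frame with the identifier columns described above
df = pd.DataFrame({
    'Id': ['p1', 'p2', 'p3'],
    'idhogar': ['h1', 'h1', 'h2'],   # p1 and p2 share a household
    'parentesco1': [1, 0, 1],        # p1 and p3 are heads of household
})

# One row per household: keep only the heads
heads = df.loc[df['parentesco1'] == 1]
print(heads['idhogar'].tolist())  # ['h1', 'h2']
```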

0-2. Goal

  • Predict the degree of poverty at the household level

    • The data is provided at the individual level; each individual has their own characteristics as well as information about their household
  • Aggregating each household's individual-level data is required to prepare the dataset

  • Perform a prediction for every individual in the test set

    • "ONLY the heads of household are used in scoring"

    • i.e., poverty is predicted per household

โญ ์ค‘์š”

  • ํ•œ ๊ฐ€๊ตฌ์˜ ๋ชจ๋“  ๊ตฌ์„ฑ์›์ด train ๋ฐ์ดํ„ฐ์—์„œ ๋™์ผํ•œ ๋ ˆ์ด๋ธ”์„ ๊ฐ€์ ธ์•ผ ํ•จ

    • ๋งŒ์•ฝ ๋‹ค๋ฅธ ๊ฒฝ์šฐ parentesco1 == 1.0 ํ–‰์œผ๋กœ ๊ตฌ๋ถ„ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฐ ๊ฐ€๊ตฌ์˜ ๊ฐ€์žฅ์— ๋Œ€ํ•œ ๋ผ๋ฒจ์„ ์‚ฌ์šฉ

  • ๋ชจ๋ธ ํ›ˆ๋ จ ์‹œ ๊ฐ€์žฅ์˜ ๋นˆ๊ณค ์ˆ˜์ค€์ด๋ผ๋Š” ๋ผ๋ฒจ์„ ๊ฐ ๊ฐ€๊ตฌ์— ๋ถ™์—ฌ์„œ ๊ฐ€์ • ๋‹จ์œ„๋กœ ํ›ˆ๋ จ์‹œํ‚ฌ ์˜ˆ์ •

    • ์›๋ณธ ๋ฐ์ดํ„ฐ์—๋Š” ๊ฐ€๊ตฌ ๋ฐ ๊ฐœ์ธ์˜ ํŠน์„ฑ์ด ํ˜ผํ•ฉ๋˜์–ด ์žˆ์Œ โ‡€ ๊ฐœ๋ณ„ ๋ฐ์ดํ„ฐ์˜ ๊ฒฝ์šฐ ๊ฐ ๊ฐ€๊ตฌ์— ๋Œ€ํ•ด ์ด๋ฅผ ์ง‘๊ณ„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ฐพ์•„์•ผ ํ•จ

    • ๊ฐœ์ธ ์ค‘ ์ผ๋ถ€๋Š” ๊ฐ€์žฅ์ด ์—†๋Š” ๊ฐ€๊ตฌ์— ์†ํ•จ โ‡€ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋กœ ์‚ฌ์šฉ ๋ถˆ๊ฐ€

0-3. ํ‰๊ฐ€ ์ง€ํ‘œ(metric) - Macro F1 Score

  • ๊ถ๊ทน์ ์œผ๋กœ ์šฐ๋ฆฌ๋Š” ๊ฐ€๊ตฌ์˜ ๋นˆ๊ณค ์ˆ˜์ค€(์ •์ˆ˜๋กœ ๊ตฌ๋ถ„๋จ)์„ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๊ณ„ ํ•™์Šต ๋ชจ๋ธ์„ ๊ตฌ์ถ•ํ•˜๊ณ ์ž ํ•จ

  • ํ‰๊ฐ€ ์ง€ํ‘œ๋กœ Macro F1 Score์„ ํ™œ์šฉํ•  ์˜ˆ์ •

    • ์ •๋ฐ€๋„์™€ ์žฌํ˜„์œจ์˜ ์กฐํ™” ํ‰๊ท 
  • ํ‘œ์ค€ F1 ์ ์ˆ˜

    • Reference

    • ๋‹ค์ค‘ ๋ถ„๋ฅ˜ ๋ฌธ์ œ์˜ ๊ฒฝ์šฐ ๊ฐ ํด๋ž˜์Šค ๋ณ„ F1 score์„ ํ‰๊ท ๋‚ด์–ด ํ™œ์šฉ

  • Macro F1 score

    • label์˜ ๋ถˆ๊ท ํ˜•์„ ๊ณ ๋ คํ•˜์ง€ ์•Š๊ณ  ๊ฐ ํด๋ž˜์Šค์˜ F1 ์ ์ˆ˜๋ฅผ ํ‰๊ท 

      • ๊ฐ ๋ ˆ์ด๋ธ”์˜ ๋ฐœ์ƒ ๋นˆ๋„๋Š” ๋งคํฌ๋กœ๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ ๊ณ„์‚ฐ์— ๋ฐ˜์˜๋˜์ง€ ์•Š์Œ(์‚ฌ์šฉํ•˜๋ ค๋ฉด weighted ์˜ต์…˜์„ ํ™œ์šฉ)
    • ์ฝ”๋“œ

    
    from sklearn.metrics import f1_score
    
    f1_score(y_true, y_predicted, average = 'macro')
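
To see concretely what macro averaging does, here is a small hand computation on made-up labels (not competition data): each class's F1 is computed separately and then averaged with equal weight, so a rare class drags the score down as much as a common one.

```python
y_true = [1, 1, 1, 1, 2]   # class 2 is rare
y_pred = [1, 1, 1, 1, 1]   # this predictor ignores class 2 entirely

def f1_for_class(cls):
    # Per-class F1 from true/false positives and false negatives
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

f1s = [f1_for_class(c) for c in sorted(set(y_true))]
macro_f1 = sum(f1s) / len(f1s)   # every class counts equally
print(f1s, macro_f1)  # [0.888..., 0.0] -> macro is only 0.444...
```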
    
    

0-4. Roadmap (overall process)

  1. Understand the problem

  2. Exploratory Data Analysis (EDA)

  3. Feature Engineering

  4. Compare baseline ML models

  5. Try more complex machine learning models

  6. Optimize the selected model

  7. Make and inspect predictions

  8. Draw conclusions, lay out next steps

1. Getting Started

1-1. Importing libraries

### Data transformation (processing)
import pandas as pd
import numpy as np

### Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 18
plt.rcParams['patch.edgecolor'] = 'k'

1-2. Preparing the data

from google.colab import drive
drive.mount('/content/drive')
pd.options.display.max_columns = 150 # display at most 150 columns

train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/4แ„Œแ…ฎแ„Žแ…ก/Costa Rica/data/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48แ„€แ…ต แ„ƒแ…ฆแ„€แ…ชB/4แ„Œแ…ฎแ„Žแ…ก/Costa Rica/data/test.csv')
train.head()
  • There appears to be no particular ordering among the columns.

2. EDA (Exploratory Data Analysis) & Preprocessing

2-1. Checking the data

### Training data

train.info()
  • 130๊ฐœ์˜ integerํ˜• ์ปฌ๋Ÿผ, 8๊ฐœ์˜ floatํ˜• ์ปฌ๋Ÿผ ๋ฐ 5๊ฐœ์˜ object ์ปฌ๋Ÿผ์ด ์กด์žฌ

  • integer ์ปฌ๋Ÿผ์˜ ๊ฒฝ์šฐ 0 ๋˜๋Š” 1์„ ์‚ฌ์šฉํ•˜๋Š” bool ๋ณ€์ˆ˜ ๋˜๋Š” ์ˆœ์„œํ˜•(ordinal) ๋ณ€์ˆ˜๋ฅผ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Œ

  • object ์ปฌ๋Ÿผ์˜ ๊ฒฝ์šฐ ๊ธฐ๊ณ„ ํ•™์Šต ๋ชจ๋ธ์— ์ง์ ‘ ๊ณต๊ธ‰๋  ์ˆ˜ ์—†์Œ -> ์ „์ฒ˜๋ฆฌ ํ•„์š”

### Test data

test.info()

a) Integer Columns

  • Count the unique values in each column and display the result as a bar chart
train.select_dtypes(np.int64).nunique().value_counts().sort_index().plot.bar(color = 'blue', 
                                                                             figsize = (8, 6),
                                                                             edgecolor = 'k', 
                                                                             linewidth = 2)
plt.xlabel('Number of Unique Values')
plt.ylabel('Count')
plt.title('Count of Unique Values in Integer Columns')
  • Columns with only 2 unique values represent booleans (0 or 1)

    • Most of the boolean information is recorded at the household level

    • e.g. the refrig column

      • Indicates with 0/1 whether the household has a refrigerator
  • Columns recorded at the individual level need to be aggregated

b) Float Columns

  • Float variables mostly represent continuous quantities

  • A distribution plot (dist plot) lets us check the float variables' distributions

    • An 'OrderedDict' is used to map poverty levels to colors
  • The plot below shows the distribution of each float column, colored by target value

    We can check whether the distributions differ meaningfully by household poverty level

from collections import OrderedDict # dictionary that remembers insertion order

### Plot formatting
plt.figure(figsize = (20, 16))
plt.style.use('fivethirtyeight')

### Color settings
colors = OrderedDict({1: 'red', 2: 'orange', 3: 'blue', 4: 'green'})
poverty_mapping = OrderedDict({1: 'extreme', 2: 'moderate', 
                               3: 'vulnerable', 4: 'non vulnerable'})

### For each float variable...
for i, col in enumerate(train.select_dtypes('float')):
    ax = plt.subplot(4, 2, i + 1) # subplot position
    # for each household poverty level
    for poverty_level, color in colors.items():
        # draw a line in a different color per level
        sns.kdeplot(train.loc[train['Target'] == poverty_level, col].dropna(),
                    ax = ax, color = color, label = poverty_mapping[poverty_level])
        
    plt.title(f'{col.capitalize()} Distribution')
    plt.xlabel(f'{col}')
    plt.ylabel('Density')

plt.subplots_adjust(top = 2)
  • ์‹œ๊ฐํ™”๋ฅผ ํ†ตํ•ด ์–ด๋–ค ๋ณ€์ˆ˜๊ฐ€ ๋ชจํ˜•๊ณผ ๊ฐ€์žฅ ๊ด€๋ จ์ด ์žˆ์„์ง€๋ฅผ ๋Œ€๋žต์ ์œผ๋กœ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ์Œ

    • ex>

      • meaneduc(๊ฐ€๊ตฌ ๋‚ด ์„ฑ์ธ์˜ ํ‰๊ท  ๊ต์œก์„ ๋‚˜ํƒ€๋ƒ„)์€ ๋นˆ๊ณค ์ˆ˜์ค€๊ณผ ๊ด€๋ จ์ด ์žˆ๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์ž„

      • meaneduc์ด ๋†’์„์ˆ˜๋ก ๋นˆ๊ณค ์ˆ˜์ค€์ด ๋‚ฎ์€ ๊ฐ€๊ตฌ๊ฐ€ ์ ์Œ

c) Object Columns

train.select_dtypes('object').head()
  • Id, idhogar: used to identify rows

  • dependency: dependency rate, (number of household members under 19 or over 64)/(number of members aged 19 to 64)

  • edjefe: years of education of the male head of household, based on escolari (years of schooling) and head of household/gender; yes = 1, no = 0

  • edjefa: years of education of the female head of household, based on escolari (years of schooling) and head of household/gender; yes = 1, no = 0

  • For these three variables, fix the values with a yes = 1, no = 0 mapping and convert to floating point

mapping = {"yes": 1, "no": 0}

### Apply the same transformation to both train and test
for df in [train, test]:
    # convert the values using the mapping
    df['dependency'] = df['dependency'].replace(mapping).astype(np.float64)
    df['edjefa'] = df['edjefa'].replace(mapping).astype(np.float64)
    df['edjefe'] = df['edjefe'].replace(mapping).astype(np.float64)
train[['dependency', 'edjefa', 'edjefe']].describe()
  • We can confirm the conversion worked correctly.
### Visualization

plt.figure(figsize = (16, 12))

### ๊ฐ ๋ณ€์ˆ˜๋“ค ๋ณ„๋กœ..
for i, col in enumerate(['dependency', 'edjefa', 'edjefe']):
    ax = plt.subplot(3, 1, i + 1)
    # ๊ฐ ๊ฐ€๊ตฌ๋ณ„ ๋นˆ๊ณค ์ˆ˜์ค€๋ณ„๋กœ
    for poverty_level, color in colors.items():
        # ๋นˆ๊ณค ์ˆ˜์ค€๋ณ„๋กœ ๊ฐ๊ฐ ๋‹ค๋ฅธ ์ƒ‰์œผ๋กœ ํ‘œ์‹œ
        sns.kdeplot(train.loc[train['Target'] == poverty_level, col].dropna(), 
                    ax = ax, color = color, label = poverty_mapping[poverty_level])
        
    plt.title(f'{col.capitalize()} Distribution')
    plt.xlabel(f'{col}')
    plt.ylabel('Density')

plt.subplots_adjust(top = 2)
  • These variables are now correctly represented as numbers

    • They can be fed into an ML model
### Initialize the test data's Target values to null for now

test['Target'] = np.nan
data = pd.concat([train, test], ignore_index = True) # DataFrame.append was removed in recent pandas

2-2. Checking the label (target) distribution

  • The distribution turns out to be highly imbalanced

  • The classes are encoded as integers (1-4)

  • To count the labels correctly, consider only rows with parentesco1 == 1 (marking each household's head)

# heads of household
heads = data.loc[data['parentesco1'] == 1].copy()

# labels (target) for training
train_labels = data.loc[(data['Target'].notnull()) & (data['parentesco1'] == 1), ['Target', 'idhogar']]

# tally the targets (check the distribution)
label_counts = train_labels['Target'].value_counts().sort_index()

# bar chart of each label's count
label_counts.plot.bar(figsize = (8, 6), 
                      color = list(colors.values()),
                      edgecolor = 'k', linewidth = 2)

# axis and title settings
plt.xlabel('Poverty Level'); plt.ylabel('Count'); 
plt.xticks([x - 1 for x in poverty_mapping.keys()], 
           list(poverty_mapping.values()), rotation = 60)
plt.title('Poverty Level Breakdown');

label_counts
  • target์˜ ๋ถ„ํฌ๊ฐ€ ๋งค์šฐ ๋ถˆ๊ท ํ˜•ํ•จ

    • non_vernerable์œผ๋กœ ๋ถ„๋ฅ˜๋˜๋Š” ๊ฐ€๊ตฌ์˜ ์ˆ˜๊ฐ€ ๋‹ค๋ฅธ ํด๋ž˜์Šค์— ๋น„ํ•ด ํ˜„์ €ํžˆ ๋งŽ์Œ

    • extreme์€ ๋งค์šฐ ์ ์Œ

  • ๋ถˆ๊ท ํ˜• ํด๋ž˜์Šค ๋ถ„๋ฅ˜ ๋ฌธ์ œ์˜ ๊ฒฝ์šฐ ML ๋ชจ๋ธ์ด ํ›จ์”ฌ ์ ์€ ์˜ˆ์ œ๋ฅผ ๋ณด๊ธฐ ๋•Œ๋ฌธ์— ์†Œ์ˆ˜ ํด๋ž˜์Šค๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐ ์–ด๋ ค์›€์„ ๊ฒช์„ ์ˆ˜ ์žˆ์Œ

  • ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด Oversampling์„ ์ ์šฉํ•  ์ˆ˜ ์žˆ์Œ
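
A minimal sketch of the oversampling idea on a toy frame (in practice one might instead use a dedicated tool such as imbalanced-learn's RandomOverSampler, and apply it only to the training split): each minority class is resampled with replacement up to the majority-class count.

```python
import pandas as pd

# Toy imbalanced frame standing in for the training data
df = pd.DataFrame({'Target': [4, 4, 4, 4, 4, 4, 1, 1], 'x': range(8)})

counts = df['Target'].value_counts()
majority_n = counts.max()

# Sample each class (with replacement) up to the majority count
balanced = pd.concat(
    [df[df['Target'] == cls].sample(majority_n, replace = True, random_state = 42)
     for cls in counts.index],
    ignore_index = True)

print(balanced['Target'].value_counts().to_dict())  # every class now has 6 rows
```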

2-3. ์ž˜๋ชป๋œ label ์ฒ˜๋ฆฌ

  • ์ผ๋ฐ˜์ ์œผ๋กœ ๋ฐ์ดํ„ฐ ๊ณผํ•™ ํ”„๋กœ์ ํŠธ์—์„œ 80%์˜ ์‹œ๊ฐ„์„ ๋ฐ์ดํ„ฐ๋ฅผ ์ •๋ฆฌํ•˜๊ณ  ์ด์ƒ ์ง•ํ›„/์˜ค๋ฅ˜๋ฅผ ์ˆ˜์ •ํ•˜๋Š” ๋ฐ ํ• ์• 

  • ์‚ฌ๋žŒ์˜ ์ž…๋ ฅ ์˜ค๋ฅ˜, ์ธก์ • ์˜ค๋ฅ˜ ๋˜๋Š” ์ •ํ™•ํ•˜์ง€๋งŒ ๋ˆˆ์— ๋„๋Š” ๊ทน๋‹จ๊ฐ’ ๋“ฑ

  • ๊ฐ™์€ ๊ฐ€๊ตฌ์ž„์—๋„ ๊ฐœ์ธ์€ ๋นˆ๊ณค ์ˆ˜์ค€์ด ๋‹ค๋ฅธ, ์ž˜๋ชป๋œ ๋ผ๋ฒจ๋“ค์ด ์กด์žฌํ•จ

    • ๊ฐ€์žฅ์˜ ๋ผ๋ฒจ์„ ์ง„์ •ํ•œ(true) ๋ผ๋ฒจ๋กœ ํ™œ์šฉ
  • ์ž˜๋ชป๋œ label์„ ์ฐพ์•„๋‚ด๊ธฐ ์œ„ํ•ด ๊ฐ€๊ตฌ๋ณ„๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„๋ฅ˜ํ•œ ๋‹ค์Œ,target์˜ ๊ณ ์œ ๊ฐ’์ด ํ•˜๋‚˜๋งŒ ์žˆ๋Š”์ง€ ํ™•์ธ

# group by household and count unique target values
all_equal = train.groupby('idhogar')['Target'].apply(lambda x: x.nunique() == 1)

# households whose target values are not all the same
not_equal = all_equal[all_equal != True]
print('There are {} households where the members do not all have the same label.'.format(len(not_equal)))
  • Let's look at an example.
train[train['idhogar'] == not_equal.index[0]][['idhogar', 'parentesco1', 'Target']]
  • Checking parentesco1 == 1 shows that the correct label for every member of this household is 3 (vulnerable).

  • Reassign the correct poverty level (target label) to all members of the household

cf> Households without a head

households_leader = train.groupby('idhogar')['parentesco1'].sum()

# ๊ฐ€์žฅ์ด ์—†๋Š” ๊ฐ€๊ตฌ
households_no_head = train.loc[train['idhogar'].isin(households_leader[households_leader == 0].index), :]

print('There are {} households without a head.'.format(households_no_head['idhogar'].nunique()))
# ๊ฐ€์žฅ์ด ์—†๋Š” ๊ฐ€๊ตฌ๋“ค ์ค‘ ๊ตฌ์„ฑ์›๋“ค ๊ฐ„์˜ ๋ ˆ์ด๋ธ”์ด ๋‹ค๋ฅธ ๊ฐ€๊ตฌ ์ฐพ๊ธฐ

households_no_head_equal = households_no_head.groupby('idhogar')['Target'].apply(lambda x: x.nunique() == 1)
print('{} Households with no head have different labels.'.format(sum(households_no_head_equal == False)))
  • Fortunately, no such data exists.

์ž˜๋ชป๋œ label ์ฒ˜๋ฆฌ

### ๋ผ๋ฒจ๊ฐ’์ด ๊ฐ™์ง€ ์•Š์€ ๊ฐ ๊ฐ€๊ตฌ๋ณ„๋กœ..
for household in not_equal.index:
    # ๊ฐ€์žฅ์„ ํ†ตํ•ด ์ œ๋Œ€๋กœ ๋œ label๊ฐ’ ์ฐพ๊ธฐ
    true_target = int(train[(train['idhogar'] == household) & (train['parentesco1'] == 1.0)]['Target'])
    
    # ์ œ๋Œ€๋กœ ๋œ ๊ฐ’์œผ๋กœ ์žฌํ• ๋‹น
    train.loc[train['idhogar'] == household, 'Target'] = true_target
    
    
# ๊ฐ€๊ตฌ๋ณ„๋กœ ๊ทธ๋ฃนํ™”ํ•˜๊ณ  ๊ณ ์œ ๊ฐ’์˜ ์ˆ˜ ํŒŒ์•…
all_equal = train.groupby('idhogar')['Target'].apply(lambda x: x.nunique() == 1)

# target ๊ฐ’์ด ํ†ต์ผ๋˜์ง€ ์•Š์€ ๊ฐ€๊ตฌ
not_equal = all_equal[all_equal != True]
print('There are {} households where the family members do not all have the same target.'.format(len(not_equal)))
  • We can confirm the labels are now consistent.

2-4. Handling Missing Values

  • Missing values must be imputed before using an ML model, and the best imputation strategy should be considered per variable
# count missing values per column (variable)
missing = pd.DataFrame(data.isnull().sum()).rename(columns = {0: 'total'})

# proportion of missing values
missing['percent'] = missing['total'] / len(data)

missing.sort_values('percent', ascending = False).head(10).drop('Target') 
  • Let's handle the three columns with the highest missing-value rates.

  • v18q1: number of tablets the household owns

    • Makes sense to view at the household level

    • Select only the head-of-household rows

  • v2a1: monthly rent payment

  • rez_esc: years behind in school

📌 Value-counts plotting function

  • Visualizes the value counts of a single column

  • Optionally shows heads of household only

def plot_value_counts(df, col, heads_only = False):
    # select heads of household only
    if heads_only:
        df = df.loc[df['parentesco1'] == 1].copy()
        
    plt.figure(figsize = (8, 6))
    df[col].value_counts().sort_index().plot.bar(color = 'blue',
                                                 edgecolor = 'k',
                                                 linewidth = 2)
    plt.xlabel(f'{col}')
    plt.ylabel('Count')
    plt.title(f'{col} Value Counts')
    
    plt.show()

a) v18q1

plot_value_counts(heads, 'v18q1')
  • ์ผ๋‹จ์€ ๊ฐ€์žฅ ์ผ๋ฐ˜์ ์œผ๋กœ ์†Œ์œ ํ•  ์ˆ˜ ์žˆ๋Š” ํƒœ๋ธ”๋ฆฟ์˜ ์ˆ˜๋Š” 1์ธ ๊ฒƒ ๊ฐ™์Œ

    • ํ•˜์ง€๋งŒ, ๊ฒฐ์ธก ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ๋„ ์ƒ๊ฐํ•  ํ•„์š”๊ฐ€ ์žˆ์Œ

    • ํ•ด๋‹น ๋ฒ”์ฃผ๊ฐ€ NaN์ธ ๊ฐ€๊ตฌ๋Š” ํƒœ๋ธ”๋ฆฟ์„ ์†Œ์œ ํ•˜์ง€ ์•Š์„ ์ˆ˜๋„ ์žˆ์Œ

  • v18q ๊ฐ’์„ ๊ธฐ์ค€์œผ๋กœ ๊ทธ๋ฃนํ™” ํ•œ ๋‹ค์Œ v18q1์— ๋Œ€ํ•œ null ๊ฐ’์˜ ์ˆ˜๋ฅผ ๊ณ„์‚ฐ

    • null ๊ฐ’์ด ๊ฐ€์กฑ์ด ํƒœ๋ธ”๋ฆฟ์„ ์†Œ์œ ํ•˜์ง€ ์•Š์Œ์„ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฒƒ์ธ์ง€๋ฅผ ํ™•์ธ ๊ฐ€๋Šฅ
heads.groupby('v18q')['v18q1'].apply(lambda x: x.isnull().sum())
  • Households with a missing v18q1 turn out to be exactly those that do not own a tablet

    • So these missing values can be filled with 0
### Fill the missing values

data['v18q1'] = data['v18q1'].fillna(0)

b) v2a1

  • Besides examining the missing rent payments, it is also interesting to look at the distribution of the tipovivi columns, which record the home's ownership/rental status

    • Let's check the ownership status of the households whose rent payment is missing.
# variables indicating home ownership
own_variables = [x for x in data if x.startswith('tipo')]


# ์ฃผํƒ ์ž„๋Œ€๋ฃŒ ์ง€๊ธ‰(๊ฒฐ์ธก์น˜ O) vs ์ฃผํƒ ์†Œ์œ (Plottinh)
data.loc[data['v2a1'].isnull(), own_variables].sum().plot.bar(figsize = (10, 8),
                                                              color = 'green',
                                                              edgecolor = 'k', 
                                                              linewidth = 2)
plt.xticks([0, 1, 2, 3, 4],
           ['Owns and Paid Off', 'Owns and Paying', 'Rented', 'Precarious', 'Other'],
          rotation = 60)
plt.title('Home Ownership Status for Households Missing Rent Payments', size = 18);
  • ์ฃผํƒ ์†Œ์œ  ๋ณ€์ˆ˜ ์„ค๋ช…:

tipovivi1 = 1: ์ „์•ก์„ ์ง€๋ถˆํ•œ ๋ณธ์ธ ์†Œ์œ ์˜ ์ง‘

tipovivi2 = 1: ์†Œ์œ , ํ• ๋ถ€๋กœ ์ง€๋ถˆ

tipovivi3 = 1: ์ž„๋Œ€ ์ฃผํƒ

tipovivi4 = 1: ๋ถˆํ™•์‹ค

tipovivi5 = 1: ๊ธฐํƒ€(๋Œ€์—ฌ)

  • Most households that pay no rent own their own home

  • In some cases we cannot tell why the value is missing


  • For owned homes with a missing rent payment, the amount can be set to 0

  • For other homes we can leave the value missing, but add a flag (boolean) column indicating that the household has a missing value

# ์ง‘์„ ์†Œ์œ ํ•œ ๊ฐ€๊ตฌ์˜ ๊ฒฝ์šฐ ์›”์„ธ ๋‚ฉ๋ถ€์•ก ๊ฒฐ์ธก์น˜๋ฅผ 0์œผ๋กœ ํ‘œ๊ธฐ
data.loc[(data['tipovivi1'] == 1), 'v2a1'] = 0

# ๋ˆ„๋ฝ๋œ ์ž„๋Œ€๋ฃŒ ์ง€๊ธ‰์„ ํ‘œ๊ธฐํ•˜๋Š” ์ปฌ๋Ÿผ ์ƒ์„ฑ
data['v2a1-missing'] = data['v2a1'].isnull()

data['v2a1-missing'].value_counts()

c) rez_esc

  • A missing value may mean the household has no children currently attending school

    • Let's check this by comparing the ages of people with and without a value in this column.
### ages of people whose value is not missing

data.loc[data['rez_esc'].notnull()]['age'].describe()
  • ๊ฒฐ์ธก์น˜๊ฐ€ ๊ฐ€์žฅ ๋งŽ์€ ๋‚˜์ด๋Š” 17์„ธ์ž„

    • ์ด๊ฒƒ๋ณด๋‹ค ๋‚˜์ด๊ฐ€ ๋งŽ์€ ์‚ฌ๋žŒ์ด๋ผ๋ฉด, ์šฐ๋ฆฌ๋Š” ๋‹จ์ˆœํžˆ ์ด ์‚ฌ๋žŒ๋“ค์ด ํ•™๊ต์— ๋‹ค๋‹ˆ์ง€ ์•Š๋Š”๋‹ค๊ณ  ๊ฐ€์ •ํ•  ์ˆ˜๋„ ์žˆ์Œ
### ๋ฐ์ดํ„ฐ๊ฐ€ ๊ฒฐ์ธก์ธ ์‚ฌ๋žŒ๋“ค์˜ ๋‚˜์ด

data.loc[data['rez_esc'].isnull()]['age'].describe()
  • According to the competition information, this variable is defined only for individuals between 7 and 19 years old

    • Anyone younger or older than that range can simply have the value set to 0

    • For everyone else, flag the missing values with a bool column

# people over 19 or under 7 -> fill the missing value with 0
data.loc[((data['age'] > 19) | (data['age'] < 7)) & (data['rez_esc'].isnull()), 'rez_esc'] = 0

# flag people between 7 and 19 whose value is missing
data['rez_esc-missing'] = data['rez_esc'].isnull()
  • There is one outlier in rez_esc

    • According to the competition description, the maximum value of this variable is 5

    • Reassign values greater than 5 to 5

### Handle the outlier

data.loc[data['rez_esc'] > 5, 'rez_esc'] = 5

2-5. Visualizing categorical variables

  • There are several ways to show how two categorical variables interact: scatter plots, stacked bar charts, box plots, etc.

  • We'll visualize with a scatter plot in which, for each x value, the percentage of each y value is encoded as the point size

### Function for the scatter-plot visualization

def plot_categoricals(x, y, data, annotate = True):
    """
    - Shows the counts of two categorical variables.
    - size: the count of each group
    - percentage: relative to the given value of y
    """
    
    # Raw counts 
    raw_counts = pd.DataFrame(data.groupby(y)[x].value_counts(normalize = False))
    raw_counts = raw_counts.rename(columns = {x: 'raw_count'})
    
    # compute counts for each group of x and y
    counts = pd.DataFrame(data.groupby(y)[x].value_counts(normalize = True)) # normalized
    
    # rename the column and reset the index
    counts = counts.rename(columns = {x: 'normalized_count'}).reset_index()
    counts['percent'] = 100 * counts['normalized_count']
    
    # Add the raw count
    counts['raw_count'] = list(raw_counts['raw_count'])
    
    # scatter plot sized by percentage
    plt.figure(figsize = (14, 10))
    plt.scatter(counts[x], counts[y], edgecolor = 'k', color = 'lightgreen',
                s = 100 * np.sqrt(counts['raw_count']), marker = 'o',
                alpha = 0.6, linewidth = 1.5)
    
    if annotate:
        # annotate the plot with text
        for i, row in counts.iterrows():
            plt.annotate(xy = (row[x] - (1 / counts[x].nunique()), 
                               row[y] - (0.15 / counts[y].nunique())),
                         color = 'navy',
                         text = f"{round(row['percent'], 1)}%")
        
    # axis settings
    plt.yticks(counts[y].unique())
    plt.xticks(counts[x].unique())
    
    # map the min and max in square-root space to evenly spaced sizes
    sqr_min = int(np.sqrt(raw_counts['raw_count'].min()))
    sqr_max = int(np.sqrt(raw_counts['raw_count'].max()))
    
    # 5 sizes for legend
    msizes = list(range(sqr_min, sqr_max,
                        int(( sqr_max - sqr_min) / 5)))
    markers = []
    
    # Markers for legend
    for size in msizes:
        markers.append(plt.scatter([], [], s = 100 * size, 
                                   label = f'{int(round(np.square(size) / 100) * 100)}', 
                                   color = 'lightgreen',
                                   alpha = 0.6, edgecolor = 'k', linewidth = 1.5))
        
    # Legend and formatting
    plt.legend(handles = markers, title = 'Counts',
               labelspacing = 3, handletextpad = 2,
               fontsize = 16,
               loc = (1.10, 0.19))
    
    plt.annotate(f'* Size represents raw count while % is for a given y value.',
                 xy = (0, 1), xycoords = 'figure points', size = 10)
    
    # Adjust axes limits
    plt.xlim((counts[x].min() - (6 / counts[x].nunique()), 
              counts[x].max() + (6 / counts[x].nunique())))
    plt.ylim((counts[y].min() - (4 / counts[y].nunique()), 
              counts[y].max() + (4 / counts[y].nunique())))
    plt.grid(None)
    plt.xlabel(f"{x}"); plt.ylabel(f"{y}"); plt.title(f"{y} vs {x}");
plot_categoricals('rez_esc', 'Target', data)
  • Marker size: raw count

  • How to read it

    • Pick a given y value, then read across the whole row

    • Example: for poverty level 1, 93% of individuals are not a year or more behind in school (about 800 individuals in total), while about 0.4% are 5 years behind (about 50 individuals in that category)
  • The plot shows both the overall counts and the within-category percentages

plot_categoricals('escolari', 'Target', data, annotate = False)
  • Let's fill the remaining missing values with the median.

  • We can inspect the missing values by visualizing the target distribution for the flagged rows

plot_value_counts(data[(data['rez_esc-missing'] == 1)], 'Target')
  • This distribution appears to match the distribution over the full data
plot_value_counts(data[(data['v2a1-missing'] == 1)], 'Target')
  • The proportion of target = 2 (moderate) is higher

    • This could itself be an indicator of greater poverty

3. Feature Engineering

  • For training we need all the information summarized per household

    • That means a groupby() of the individuals within each household and an agg() of the individual-level variables
  • Later, feature engineering can also be automated

3-1. Defining the columns (variables)

  • Using the data descriptions, we must determine which columns are at the individual level and which are at the household level

    • Review the variables themselves
  • Some variables must be handled differently, so we define them separately

    • With the variables defined at each level, we can aggregate them as needed
  • Process

  1. Split the variables into household level and individual level

  2. Find aggregations suited to the individual-level data

  • ordinal variables: statistical aggregates

  • boolean variables: can be aggregated, but only a few statistics make sense

  3. Join the individual-level aggregates onto the household-level data

๐Ÿ“Œ ๋ณ€์ˆ˜๋“ค์˜ ๋ฒ”์ฃผ ์ •์˜ํ•˜๊ธฐ

  • ๋ณ€์ˆ˜์—๋Š” ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๋ฒ”์ฃผ๊ฐ€ ์žˆ์Œ
  1. Individual Variables: ๊ฐœ์ธ๋ณ„ ํŠน์„ฑ
  • boolean: yes or no(0 ๋˜๋Š” 1)

  • ordered discrete: ์ˆœ์„œ๊ฐ€ ์žˆ๋Š” ์ •์ˆ˜

  1. Household variables
  • boolean: Yes or No

  • ordered discrete: ์ˆœ์„œ๊ฐ€ ์žˆ๋Š” ์ •์ˆ˜

  • ์—ฐ์†ํ˜• ์ˆ˜์น˜

  1. Squared Variables: ๋ฐ์ดํ„ฐ์˜ (๋ณ€์ˆ˜)^2์—์„œ ํŒŒ์ƒ

  2. Id variables: ๋ฐ์ดํ„ฐ๋ฅผ ์‹๋ณ„์šฉ, ํ”ผ์ณ๋กœ ์‚ฌ์šฉ x

### Id variables

id_ = ['Id', 'idhogar', 'Target']
### Individual Variables

ind_bool = ['v18q', 'dis', 'male', 'female', 'estadocivil1', 'estadocivil2', 'estadocivil3', 
            'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7', 
            'parentesco1', 'parentesco2',  'parentesco3', 'parentesco4', 'parentesco5', 
            'parentesco6', 'parentesco7', 'parentesco8',  'parentesco9', 'parentesco10', 
            'parentesco11', 'parentesco12', 'instlevel1', 'instlevel2', 'instlevel3', 
            'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 
            'instlevel9', 'mobilephone', 'rez_esc-missing']

ind_ordered = ['rez_esc', 'escolari', 'age']
### Household variables

hh_bool = ['hacdor', 'hacapo', 'v14a', 'refrig', 'paredblolad', 'paredzocalo', 
           'paredpreb','pisocemento', 'pareddes', 'paredmad',
           'paredzinc', 'paredfibras', 'paredother', 'pisomoscer', 'pisoother', 
           'pisonatur', 'pisonotiene', 'pisomadera',
           'techozinc', 'techoentrepiso', 'techocane', 'techootro', 'cielorazo', 
           'abastaguadentro', 'abastaguafuera', 'abastaguano',
            'public', 'planpri', 'noelec', 'coopele', 'sanitario1', 
           'sanitario2', 'sanitario3', 'sanitario5',   'sanitario6',
           'energcocinar1', 'energcocinar2', 'energcocinar3', 'energcocinar4', 
           'elimbasu1', 'elimbasu2', 'elimbasu3', 'elimbasu4', 
           'elimbasu5', 'elimbasu6', 'epared1', 'epared2', 'epared3',
           'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3', 
           'tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5', 
           'computer', 'television', 'lugar1', 'lugar2', 'lugar3',
           'lugar4', 'lugar5', 'lugar6', 'area1', 'area2', 'v2a1-missing']

hh_ordered = [ 'rooms', 'r4h1', 'r4h2', 'r4h3', 'r4m1','r4m2','r4m3', 'r4t1',  'r4t2', 
              'r4t3', 'v18q1', 'tamhog','tamviv','hhsize','hogar_nin',
              'hogar_adul','hogar_mayor','hogar_total',  'bedrooms', 'qmobilephone']

hh_cont = ['v2a1', 'dependency', 'edjefe', 'edjefa', 'meaneduc', 'overcrowding']
sqr_ = ['SQBescolari', 'SQBage', 'SQBhogar_total', 'SQBedjefe', 
        'SQBhogar_nin', 'SQBovercrowding', 'SQBdependency', 'SQBmeaned', 'agesq']
  • Check for duplicates / missing variables
x = ind_bool + ind_ordered + id_ + hh_bool + hh_ordered + hh_cont + sqr_

from collections import Counter

print('There are no repeats: ', np.all(np.array(list(Counter(x).values())) == 1))
print('We covered every variable: ', len(x) == data.shape[1]) # check this matches the number of columns

โบ Squared Variables

  • ์„ ํ˜• ๋ชจํ˜•์ด ๋น„์„ ํ˜• ๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ๋ณ€์ˆ˜๊ฐ€ ํ”ผ์ณ ์—”์ง€๋‹ˆ์–ด๋ง์˜ ์ผ๋ถ€๋กœ ์ œ๊ณฑ๋˜๊ฑฐ๋‚˜ ๋ณ€ํ™˜๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์กด์žฌ

  • ๋” ๋ณต์žกํ•œ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•  ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— ์ด๋Ÿฌํ•œ squared feature๋“ค์ด ์ค‘๋ณต๋จ

  • ์ œ๊ณฑ๋˜์ง€ ์•Š์€ feature๋“ค๊ณผ ๋†’์€ ์ƒ๊ด€๊ด€๊ณ„ -> ๊ด€๋ จ ์—†๋Š” ์ •๋ณด๋ฅผ ์ถ”๊ฐ€ํ•˜๊ณ  ํ•™์Šต์„ ๋Š๋ฆฌ๊ฒŒ ํ•จ์œผ๋กœ์จ ๋ชจ๋ธ ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ๋ฏธ์น  ์ˆ˜ ์žˆ์Œ

  • ex> SQBage vs age

sns.lmplot(x = 'age', y = 'SQBage', data = data, fit_reg = False);
plt.title('Squared Age versus Age');
  • The two variables are very highly correlated.

    • There is no need to keep both in the data
# remove the squared variables
data = data.drop(columns = sqr_)
data.shape

โบ Id Variables

  • ๋ฐ์ดํ„ฐ ์‹๋ณ„์— ํ•„์š” -> ์œ ์ง€

โบ Household Variables

heads = data.loc[data['parentesco1'] == 1, :]
heads = heads[id_ + hh_bool + hh_cont + hh_ordered]
heads.shape
  • Most household-level variables can be kept as they are

  • We can also drop some redundant variables and add features derived from the existing data

์ค‘๋ณต๋œ Household Variables

  • ์ƒ๊ด€ ๊ด€๊ณ„๊ฐ€ ๋„ˆ๋ฌด ๋†’์€ ๋ณ€์ˆ˜๊ฐ€ ์žˆ์œผ๋ฉด ์ƒ๊ด€ ๊ด€๊ณ„๊ฐ€ ๋†’์€ ๋ณ€์ˆ˜ ์Œ ์ค‘ ํ•˜๋‚˜๋ฅผ ์ œ๊ฑฐํ•  ์ˆ˜ ์žˆ์Œ
# ์ƒ๊ด€๊ณ„์ˆ˜ ํ–‰๋ ฌ
corr_matrix = heads.corr()

# ์ƒ์‚ผ๊ฐํ–‰๋ ฌ๋งŒ ๋‚จ๊ธฐ๊ธฐ
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))

# ์ƒ๊ด€ ๊ด€๊ณ„๊ฐ€ 0.95๋ณด๋‹ค ํฐ feeature column์˜ ์ธ๋ฑ์Šค ์ฐพ๊ธฐ
to_drop = [column for column in upper.columns if any(abs(upper[column]) > 0.95)]

to_drop
corr_matrix.loc[corr_matrix['tamhog'].abs() > 0.9, corr_matrix['tamhog'].abs() > 0.9]
sns.heatmap(corr_matrix.loc[corr_matrix['tamhog'].abs() > 0.9, corr_matrix['tamhog'].abs() > 0.9],
            annot=True, cmap = plt.cm.autumn_r, fmt='.3f');
  • ์ง‘์˜ ํฌ๊ธฐ์™€ ๊ด€๋ จ๋œ ๋ช‡ ๊ฐ€์ง€ ๋ณ€์ˆ˜๊ฐ€ ์žˆ์Œ

    • r4t3: ๊ฐ€๊ตฌ ๋‚ด ์ด ์ธ์›์ˆ˜

    • tamhog: ๊ฐ€๊ตฌ ํฌ๊ธฐ

    • tamviv: ๊ฐ€๊ตฌ์— ์‚ฌ๋Š” ์‚ฌ๋žŒ๋“ค์˜ ์ˆ˜

    • hhsize: ๊ฐ€๊ตฌ ํฌ๊ธฐ

    • hogar_total: ๊ฐ€๊ตฌ ๊ตฌ์„ฑ์› ์ˆ˜

  • ์ด ๋ณ€์ˆ˜๋“ค์€ ๋ชจ๋‘ ์„œ๋กœ ๋†’์€ ์ƒ๊ด€ ๊ด€๊ณ„๋ฅผ ๊ฐ€์ง

    • hhsize๋Š” tamhog์™€ hogar_total๊ณผ ์™„๋ฒฝํ•œ ์–‘์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๊ฐ€์ง

    • rt4t3๊ณผ hhsize๋Š” ๊ฑฐ์˜ ์™„๋ฒฝํ•œ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๊ฐ€์ง

    ๋‘ ๋ณ€์ˆ˜ ์ค‘ ํ•˜๋‚˜๋ฅผ ์ œ๊ฑฐํ•  ์ˆ˜ ์žˆ์Œ

heads = heads.drop(columns = ['tamhog', 'hogar_total', 'r4t3'])

  • hhsize vs tamviv
sns.lmplot(x = 'tamviv', y = 'hhsize', data = data, fit_reg = False);
plt.title('Household size vs number of persons living in the household');
  • In some households more people live in the home than there are household members

    • tamviv counts everyone living in the household (including non-family members) while hhsize counts only the household members, which explains the gap

heads['hhsize-diff'] = heads['tamviv'] - heads['hhsize']
plot_categoricals('hhsize-diff', 'Target', heads)
  • Most households show no difference, but a small number do have more people living in the home than household members

  • The coopele variable
corr_matrix.loc[corr_matrix['coopele'].abs() > 0.9, corr_matrix['coopele'].abs() > 0.9]
  • coopele indicates where the household's electricity comes from

  • There are four options; a household covered by neither of these two columns either has no electricity (noelec) or gets it from a private plant (planpri)


    0: No electricity

    1: Electricity from cooperative

    2: Electricity from CNFL, ICA, ESPH/JASEC

    3: Electricity from private plant

  • This ordered variable has an inherent ordering, chosen based on domain knowledge

  • After creating the new ordinal variable, the original columns may be dropped

  • For missing data, record NaN and add a boolean column indicating that the information is unavailable for this variable

elec = []

# assign the values (mapping)
for i, row in heads.iterrows():
    if row['noelec'] == 1:
        elec.append(0)
    elif row['coopele'] == 1:
        elec.append(1)
    elif row['public'] == 1:
        elec.append(2)
    elif row['planpri'] == 1:
        elec.append(3)
    else:
        elec.append(np.nan)
        
# handle missing values
heads['elec'] = elec
heads['elec-missing'] = heads['elec'].isnull()

# drop the original columns
# heads = heads.drop(columns = ['noelec', 'coopele', 'public', 'planpri'])
plot_categoricals('elec', 'Target', heads)
  • target์˜ ๋ชจ๋“  ๊ฐ’์— ๋Œ€ํ•ด ๊ฐ€์žฅ ์ผ๋ฐ˜์ ์ธ ์ „๊ธฐ ๊ณต๊ธ‰์›์ด ๋‚˜์—ด๋œ ๊ณต๊ธ‰์—…์ฒด ์ค‘ ํ•˜๋‚˜์ž„์„ ์•Œ ์ˆ˜ ์žˆ์Œ

  • area2 ๋ณ€์ˆ˜

  • ์ง‘์ด ์‹œ๊ณจ ์ง€์—ญ์— ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธ

    • ์ง‘์ด ๋„์‹œ ์ง€์—ญ์— ์žˆ๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ์—ด(area1)์ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ค‘๋ณต๋ฉ

ํ•ด๋‹น ์—ด ์‚ญ์ œ

heads = heads.drop(columns = 'area2')

heads.groupby('area1')['Target'].value_counts(normalize = True)
  • ๋„์‹œ ์ง€์—ญ์˜ ๊ฐ€๊ตฌ(value = 1)๋Š” ๋†์ดŒ ์ง€์—ญ์˜ ๊ฐ€๊ตฌ(value = 0)๋ณด๋‹ค ๋นˆ๊ณค ์ˆ˜์ค€์ด ๋‚ฎ์„ ๊ฐ€๋Šฅ์„ฑ์ด ๋” ๋†’์€ ๊ฒƒ์œผ๋กœ ํŒ๋‹จ๋จ

โบ Ordinal Variables

  • ์ง‘์˜ ๋ฒฝ, ์ง€๋ถ•, ๋ฐ”๋‹ฅ์—๋Š” ๊ฐ๊ฐ 3๊ฐœ์˜ column์ด ์กด์žฌํ•จ

    • โ€˜badโ€™, โ€˜regularโ€™, โ€˜goodโ€™
  • ๋ณ€์ˆ˜๋ฅผ boolean ํ˜•์œผ๋กœ ๋ƒ…๋‘˜ ์ˆ˜๋„ ์žˆ์ง€๋งŒ, bad < regular < good์ด๋ผ๋Š” ๋ช…ํ™•ํ•œ ์ˆœ์„œ๊ฐ€ ์กด์žฌํ•จ

    • ์ˆœ์„œํ˜• ๋ณ€์ˆ˜๋กœ ๋ฐ”๊พธ๋Š” ๊ฒƒ์ด ๋” ์ข‹์•„ ๋ณด์ž„

    • np.argmax()๋ฅผ ํ†ตํ•ด ๊ฐ ๊ฐ€๊ตฌ์— ๋Œ€ํ•ด 0์ด ์•„๋‹Œ ์—ด์„ ์‰ฝ๊ฒŒ ์ฐพ์„ ์ˆ˜ ์žˆ์Œ
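np.argmax가 one-hot 인코딩된 행에서 순서형 값을 만들어내는 원리를 보여주는 간단한 예시:

```python
import numpy as np

# one-hot 인코딩된 행들: 각 행에서 1이 있는 열의 인덱스가 곧 순서형 값이 됨
one_hot = np.array([[1, 0, 0],    # bad     -> 0
                    [0, 0, 1],    # good    -> 2
                    [0, 1, 0]])   # regular -> 1

ordinal = np.argmax(one_hot, axis=1)
```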

### Wall ordinal variable

heads['walls'] = np.argmax(np.array(heads[['epared1', 'epared2', 'epared3']]),
                           axis = 1)

# heads = heads.drop(columns = ['epared1', 'epared2', 'epared3'])
plot_categoricals('walls', 'Target', heads)
### Roof ordinal variable

heads['roof'] = np.argmax(np.array(heads[['etecho1', 'etecho2', 'etecho3']]),
                           axis = 1)
heads = heads.drop(columns = ['etecho1', 'etecho2', 'etecho3'])
### Floor ordinal variable

heads['floor'] = np.argmax(np.array(heads[['eviv1', 'eviv2', 'eviv3']]),
                           axis = 1)
# heads = heads.drop(columns = ['eviv1', 'eviv2', 'eviv3'])

3-2. ๋ณ€์ˆ˜(feature) ๊ตฌ์„ฑํ•˜๊ธฐ

  • ๋ณ€์ˆ˜๋ฅผ ์ˆœ์„œํ˜• ํ”ผ์ฒ˜์— ๋งคํ•‘ํ•˜๋Š” ๊ฒƒ ์™ธ์—๋„ ๊ธฐ์กด ๋ฐ์ดํ„ฐ์—์„œ ์™„์ „ํžˆ ์ƒˆ๋กœ์šด ํ”ผ์ฒ˜๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜๋„ ์žˆ์Œ

    • ์˜ˆ์‹œ> ์ด์ „์˜ ์„ธ ๊ฐ€์ง€ ํŠน์ง•(wall, roof, floor)์„ ํ•ฉ์‚ฐํ•˜์—ฌ ์ง‘ ๊ตฌ์กฐ์˜ ์ „๋ฐ˜์ ์ธ ํ’ˆ์งˆ์„ ์ธก์ •ํ•  ์ˆ˜ ์žˆ์Œ
  • walls + roof + floor

# ์ƒˆ๋กœ์šด feature ์ƒ์„ฑํ•˜๊ธฐ

heads['walls+roof+floor'] = heads['walls'] + heads['roof'] + heads['floor']

plot_categoricals('walls+roof+floor', 'Target', heads, annotate = False)
  • ์ƒˆ๋กœ์šด feature๋Š” target = 4(the lowest poverty level)์—์„œ house quality ๋ณ€์ˆ˜์˜ ๊ฐ’์ด ๋” ๋†’์€ ๊ฒฝํ–ฅ์ด ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ž„
counts = pd.DataFrame(heads.groupby(['walls+roof+floor'])['Target'].value_counts(normalize = True)).rename(columns = {'Target': 'Normalized Count'}).reset_index()
counts.head()
  • warning

  • ์ง‘์˜ ์งˆ์— ๋Œ€ํ•œ ๊ฒฝ๊ณ 

  • ํ™”์žฅ์‹ค, ์ „๊ธฐ, ๋ฐ”๋‹ฅ, ์ˆ˜๋„, ์ฒœ์žฅ์ด ์—†๋Š” ๊ฒฝ์šฐ ๊ฐ๊ฐ -1์ ์˜ ๋งˆ์ด๋„ˆ์Šค ๊ฐ’

# No toilet, no electricity, no floor, no water service, no ceiling

heads['warning'] = 1 * (heads['sanitario1'] + 
                         (heads['elec'] == 0) + 
                         heads['pisonotiene'] + 
                         heads['abastaguano'] + 
                         (heads['cielorazo'] == 0))
### seaborn์˜ violin plot์œผ๋กœ ์‹œ๊ฐํ™”

plt.figure(figsize = (10, 6))
sns.violinplot(x = 'warning', y = 'Target', data = heads)
plt.title('Target vs Warning Variable');
  • ๋ฐ”์ด์˜ฌ๋ฆฐ ํ”Œ๋กฏ์€ ๋Œ€์ƒ์ด ์‹ค์ œ๋ณด๋‹ค ๋” ์ž‘๊ณ  ๋” ํฐ ๊ฐ’์„ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ด๋Š” ํšจ๊ณผ๋กœ ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜๋ฅผ ํ‰ํ™œ(smoothing)ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์—ฌ๊ธฐ์„œ๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š์€ ๊ฒƒ์œผ๋กœ ํŒ๋‹จ๋จ

    • ํ•˜์ง€๋งŒ, ๊ฒฝ๊ณ  ์‹ ํ˜ธ๊ฐ€ ์—†๋Š” ๊ฐ€๊ตฌ ๊ทธ๋ฃน์— ๋นˆ๊ณค ์ˆ˜์ค€์ด ๊ฐ€์žฅ ๋‚ฎ์€ ๊ฐ€๊ตฌ๋“ค์ด ์ง‘์ค‘๋œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ
plot_categoricals('warning', 'Target', data = heads)
  • bonus

  • ๋ƒ‰์žฅ๊ณ , ์ปดํ“จํ„ฐ, ํƒœ๋ธ”๋ฆฟ, ํ…”๋ ˆ๋น„์ „์„ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฉด ์ ์ˆ˜๊ฐ€ ์˜ฌ๋ผ๊ฐ€๋Š” ๋ณ€์ˆ˜

# Owns a refrigerator, computer, tablet, and television

heads['bonus'] = 1 * (heads['refrig'] + 
                      heads['computer'] + 
                      (heads['v18q1'] > 0) + 
                      heads['television'])

plt.figure(figsize = (10, 6))  # figsize는 violinplot의 인자가 아니므로 figure에서 지정
sns.violinplot(x = 'bonus', y = 'Target', data = heads)

plt.title('Target vs Bonus Variable');

3-3. Per Capita Features

  • ๊ฐ€๊ตฌ์› ๋ณ„๋กœ ํŠน์ • ์ธก์ •๊ฐ’์˜ ์ˆ˜๋ฅผ ๊ณ„์‚ฐ

  • ํŠน์ •๊ฐ’ / ๊ฐ€๊ตฌ ๊ตฌ์„ฑ์› ์ˆ˜

heads['phones-per-capita'] = heads['qmobilephone'] / heads['tamviv']
heads['tablets-per-capita'] = heads['v18q1'] / heads['tamviv']
heads['rooms-per-capita'] = heads['rooms'] / heads['tamviv']
heads['rent-per-capita'] = heads['v2a1'] / heads['tamviv']

3-4. Household Variables ์‚ดํŽด๋ณด๊ธฐ

  • ๊ด€๊ณ„๋ฅผ ์ •๋Ÿ‰ํ™”

a) ์ƒ๊ด€๊ด€๊ณ„ ํ™•์ธํ•˜๊ธฐ

  • ๋‘ ๋ณ€์ˆ˜ ์‚ฌ์ด์˜ ๊ด€๊ณ„๋ฅผ ์ธก์ •ํ•˜๋Š” ๋ฐ๋Š” ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์ด ์žˆ์Œ

1. Pearson Correlation

  • -1๋ถ€ํ„ฐ 1๊นŒ์ง€ ๋‘ ๋ณ€์ˆ˜ ์‚ฌ์ด์˜ ์„ ํ˜• ๊ด€๊ณ„ ์ธก์ •

  • ์ฆ๊ฐ€ ์ถ”์„ธ๊ฐ€ ์ •ํ™•ํ•˜๊ฒŒ ์„ ํ˜•์ธ ๊ฒฝ์šฐ์—๋งŒ ํ•˜๋‚˜๊ฐ€ ๋  ์ˆ˜ ์žˆ์Œ

2. Spearman Correlation

  • -1์—์„œ 1๊นŒ์ง€ ๋‘ ๋ณ€์ˆ˜์˜ ๋‹จ์กฐ๋กœ์šด ๊ด€๊ณ„๋ฅผ ์ธก์ •

  • ํ•œ ๋ณ€์ˆ˜๊ฐ€ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ๊ด€๊ณ„๊ฐ€ ์„ ํ˜•์ ์ด์ง€ ์•Š๋”๋ผ๋„ ๋‹ค๋ฅธ ๋ณ€์ˆ˜๋„ ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒฝ์šฐ ์ƒ๊ด€๊ณ„์ˆ˜๊ฐ€ 1์ž„

  • ์Šคํ”ผ์–ด๋งŒ ์ƒ๊ด€๊ณ„์ˆ˜ ๊ณ„์‚ฐ์—๋Š” ๊ด€๊ณ„์˜ ์ค‘์š”์„ฑ ์ˆ˜์ค€์„ ๋‚˜ํƒ€๋‚ด๋Š” p-value๋„ ํ•จ๊ป˜ ์‚ฐ์ถœ๋จ

    • p-value๊ฐ€ 0.05 ๋ฏธ๋งŒ์ด๋ฉด ์ผ๋ฐ˜์ ์œผ๋กœ ์œ ์˜ํ•œ ๊ฒƒ์œผ๋กœ ๊ฐ„์ฃผ๋จ

  • ์ƒ๊ด€๊ณ„์ˆ˜ ํ•ด์„

- 0.00 ~ 0.19: "๋งค์šฐ ์•ฝํ•œ ์ƒ๊ด€๊ด€๊ณ„"

- 0.20 ~ 0.39: "์•ฝํ•œ ์ƒ๊ด€๊ด€๊ณ„"

- 0.40 ~ 0.59: "์–ด๋Š ์ •๋„์˜ ์ƒ๊ด€๊ด€๊ณ„"

- 0.60 ~ 0.79: "๊ฐ•ํ•œ ์ƒ๊ด€๊ด€๊ณ„"

- 0.80 ~ 1.0: "๋งค์šฐ ๊ฐ•ํ•œ ์ƒ๊ด€๊ด€๊ณ„"

  • ๋Œ€๋ถ€๋ถ„์˜ ๊ฒฝ์šฐ ๋‘ ์ƒ๊ด€๊ด€๊ณ„๋Š” ๋น„์Šทํ•จ

  • ์Šคํ”ผ์–ด๋งŒ ์ƒ๊ด€๊ด€๊ณ„๋Š” ์ข…์ข… ์ˆœ์„œํ˜• ๋ณ€์ˆ˜์— ๋Œ€ํ•ด ๋” ๋‚˜์€ ๋ฐฉ๋ฒ•์ด๋ผ๊ณ  ํŒ๋‹จ๋จ

    • ์‹ค์ œ ์„ธ๊ณ„์—์„œ ๋Œ€๋ถ€๋ถ„์˜ ๊ด€๊ณ„๋Š” ์„ ํ˜•์ ์ด์ง€ ์•Š์Œ

    • Pearson ์ƒ๊ด€๊ด€๊ณ„๋Š” ๋‘ ๋ณ€์ˆ˜๊ฐ€ ์–ผ๋งˆ๋‚˜ ๊ด€๋ จ๋˜์–ด ์žˆ๋Š”์ง€์— ๋Œ€ํ•œ ๊ทผ์‚ฌ์น˜์ผ ์ˆœ ์žˆ์ง€๋งŒ, ์ •ํ™•ํ•˜์ง€ ์•Š๊ณ  ๊ฐ€์žฅ ์ข‹์€ ๋น„๊ต ๋ฐฉ๋ฒ• ๋˜ํ•œ ์•„๋‹˜

from scipy.stats import spearmanr
### ์ƒ๊ด€๊ณ„์ˆ˜ ์‹œ๊ฐํ™”

def plot_corrs(x, y):
    
    # ์ƒ๊ด€๊ณ„์ˆ˜ ๊ณ„์‚ฐ
    spr = spearmanr(x, y).correlation
    pcr = np.corrcoef(x, y)[0, 1]
    
    # ์‚ฐ์ ๋„ plot
    data = pd.DataFrame({'x': x, 'y': y})
    plt.figure( figsize = (6, 4))
    sns.regplot(data = data, x = 'x', y = 'y', fit_reg = False)
    plt.title(f'Spearman: {round(spr, 2)}; Pearson: {round(pcr, 2)}')

  • Example
x = np.array(range(100))
y = x ** 2

plot_corrs(x, y)
x = np.array([1, 1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 9, 9])
y = np.array([1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 3, 3, 2, 4, 2, 2, 4])

plot_corrs(x, y)
x = np.array(range(-19, 20))
y = 2 * np.sin(x)

plot_corrs(x, y)

๐Ÿ“Œ Pearson ์ƒ๊ด€ ๊ด€๊ณ„

### ํ•™์Šต(train) ๋ฐ์ดํ„ฐ๋งŒ ์‚ฌ์šฉ
train_heads = heads.loc[heads['Target'].notnull(), :].copy()

pcorrs = pd.DataFrame(train_heads.corr()['Target'].sort_values()).rename(columns = {'Target': 'pcorr'}).reset_index()
pcorrs = pcorrs.rename(columns = {'index': 'feature'})

print('Most negatively correlated variables:')
print(pcorrs.head())

print('\nMost positively correlated variables:')
print(pcorrs.dropna().tail())
  • ์Œ์˜ ์ƒ๊ด€๊ด€๊ณ„์˜ ๊ฒฝ์šฐ ๋ณ€์ˆ˜์˜ ๊ฐ’์ด ์ฆ๊ฐ€ํ• ์ˆ˜๋ก target๊ฐ’์ด ๊ฐ์†Œํ•จ์„ ์˜๋ฏธ

    • ๋นˆ๊ณค ์‹ฌ๊ฐ๋„๊ฐ€ ์ฆ๊ฐ€ํ•จ์„ ์˜๋ฏธ
  • warning์ด ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ๋นˆ๊ณค ์ˆ˜์ค€๋„ ์ฆ๊ฐ€

    • ์ด๋Š” ์ง‘์— ๋Œ€ํ•œ ์ž ์žฌ์ ์ธ ๋‚˜์œ ์ง•ํ›„๋ฅผ ๋ณด์—ฌ์ฃผ๊ธฐ ์œ„ํ•œ ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— ํƒ€๋‹นํ•จ
  • hogar_nin

    • ์ด๋Š” ๊ฐ€์กฑ ๋‚ด 0~19๋ช…์˜ ์•„์ด๋“ค์˜ ์ˆซ์ž๋กœ, ๋” ์–ด๋ฆฐ ์•„์ด๋“ค์ด ๋” ๋†’์€ ์ˆ˜์ค€์˜ ๋นˆ๊ณค์œผ๋กœ ์ด์–ด์ง€๋Š” ๊ฐ€์กฑ์—๊ฒŒ ์ŠคํŠธ๋ ˆ์Šค์˜ ์žฌ์ •์ ์ธ ์›์ธ์ด ๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธ

    • ์‚ฌํšŒ ๊ฒฝ์ œ์  ์ง€์œ„๊ฐ€ ๋‚ฎ์€ ๊ฐ€์ •๋“ค์€ ๊ทธ๋“ค ์ค‘ ํ•œ ๋ช…์ด ์„ฑ๊ณตํ•  ์ˆ˜ ์žˆ๊ธฐ๋ฅผ ๋ฐ”๋ผ๋Š” ๋งˆ์Œ์œผ๋กœ ๋” ๋งŽ์€ ์•„์ด๋“ค์„ ๊ฐ€์ง€๋Š” ๊ฒฝํ–ฅ์ด ์žˆ์Œ

    ๊ฐ€์กฑ ๊ทœ๋ชจ์™€ ๋นˆ๊ณค ์‚ฌ์ด์—๋Š” ์‹ค์งˆ์ ์ธ ์—ฐ๊ด€์„ฑ์ด ์žˆ๋‹ค.


  • ์–‘์˜ ์ƒ๊ด€๊ด€๊ณ„์˜ ๊ฒฝ์šฐ, ๋ณ€์ˆ˜์˜ ๊ฐ’์ด ์ฆ๊ฐ€ํ• ์ˆ˜๋ก Target๊ฐ’์ด ์ฆ๊ฐ€ํ•จ์„ ์˜๋ฏธ

    • ๋นˆ๊ณค ์‹ฌ๊ฐ๋„๊ฐ€ ๊ฐ์†Œํ•จ์„ ์˜๋ฏธ
  • meaneduc

    • ๊ฐ€๊ตฌ ๋‚ด ์„ฑ์ธ์˜ ํ‰๊ท  ๊ต์œก ์ˆ˜์ค€์„ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฐ€๊ตฌ ์ˆ˜์ค€์˜ ๋ณ€์ˆ˜

    • target๊ณผ ๊ฐ€์žฅ ๋†’์€ ์–‘์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๋ณด์ž„

    ๊ต์œก์˜ ๋‚ฎ์€ ์ˆ˜์ค€์€ ์ผ๋ฐ˜์ ์œผ๋กœ ๋นˆ๊ณค์˜ ๋‚ฎ์€ ์ˆ˜์ค€๊ณผ ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ ์žˆ๋‹ค.

  • ๋ณ€์ˆ˜์™€ target ๊ฐ„์— ์•ฝํ•œ ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ ์กด์žฌํ•จ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ

๐Ÿ“Œ ์Šคํ”ผ์–ด๋งŒ ์ƒ๊ด€ ๊ด€๊ณ„

import warnings
warnings.filterwarnings('ignore', category = RuntimeWarning)

feats = [] # features
scorr = [] # scores
pvalues = [] # p-value

# ๊ฐ ์ปฌ๋Ÿผ(๋ณ€์ˆ˜)๋ณ„๋กœ
for c in heads:
    # ์ˆซ์žํ˜• ๋ณ€์ˆ˜๋“ค์— ๋Œ€ํ•ด..
    if heads[c].dtype != 'object':
        feats.append(c)
        
        # ์Šคํ”ผ์–ด๋งŒ ์ƒ๊ด€๊ณ„์ˆ˜ ๊ณ„์‚ฐ
        scorr.append(spearmanr(train_heads[c], train_heads['Target']).correlation)
        pvalues.append(spearmanr(train_heads[c], train_heads['Target']).pvalue)

scorrs = pd.DataFrame({'feature': feats, 'scorr': scorr, 'pvalue': pvalues}).sort_values('scorr')
  • p-value๊ฐ€ 0.05 ๋ฏธ๋งŒ์ด๋ฉด ์ผ๋ฐ˜์ ์œผ๋กœ ์œ ์˜ํ•œ ๊ฒƒ์œผ๋กœ ๊ฐ„์ฃผ๋จ

    • ์šฐ๋ฆฌ๋Š” ๋‹ค์ค‘ ๋น„๊ต๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ธฐ ๋•Œ๋ฌธ์— p-value๋ฅผ ๋น„๊ต ํšŸ์ˆ˜๋กœ ๋‚˜๋ˆ„๊ณ ์ž ํ•จ

    Bonferroni ์ˆ˜์ •์ด๋ผ ํ•จ
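Bonferroni 보정의 개념을 보여주는 간단한 계산 예시(비교 횟수 100회는 가상의 값):

```python
# Bonferroni 보정: 유의수준을 비교 횟수로 나눠 다중 비교에 따른 위양성을 억제
alpha = 0.05
n_comparisons = 100  # 가정: 100개 변수에 대해 상관관계 검정을 반복

corrected_alpha = alpha / n_comparisons  # 보정된 유의수준

# 보정 전에는 유의했을 p-value도 보정 후에는 유의하지 않을 수 있음
pvalue = 0.001
significant_before = pvalue < alpha
significant_after = pvalue < corrected_alpha
```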

print('Most negative Spearman correlations:')
print(scorrs.head())
print()
print('\nMost positive Spearman correlations:')
print(scorrs.dropna().tail())
  • ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ์ธก์ •ํ•˜๋Š” ๋‘ ๊ณ„์ˆ˜ ๋ชจ๋‘ ๋น„์Šทํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ด๊ณ  ์žˆ๋‹ค.
corrs = pcorrs.merge(scorrs, on = 'feature')
corrs['diff'] = corrs['pcorr'] - corrs['scorr'] # 피어슨 상관계수 - 스피어만 상관계수

corrs.sort_values('diff').head()
corrs.sort_values('diff').dropna().tail()
  • dependency ๋ณ€์ˆ˜๊ฐ€ ๊ฐ€์žฅ ํฐ ์ฐจ์ด๋ฅผ ๋ณด์ž„
### ์‹œ๊ฐํ™”
# ๋‘ ๋ณ€์ˆ˜ ๋ชจ๋‘ ์ด์‚ฐํ˜• ๋ณ€์ˆ˜์ด๊ธฐ์—, plot์— ์•ฝ๊ฐ„์˜ jitter๋ฅผ ์ถ”๊ฐ€ํ•จ

sns.lmplot(x = 'dependency', y = 'Target', fit_reg = True, data = train_heads, 
           x_jitter = 0.05, y_jitter = 0.05)
plt.title('Target vs Dependency')
  • ์•ฝํ•œ ์Œ์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๋ณด์ž„

    • dependency๊ฐ€ ์ฆ๊ฐ€ํ• ์ˆ˜๋ก target ๊ฐ’์ด ๊ฐ์†Œํ•จ

    • dependency๊ฐ€ (dependent)/(non-dependent)๋ฅผ ๋‚˜ํƒ€๋ƒ„์„ ์˜๋ฏธ

    • ํ•ด๋‹น ๊ฐ’์ด ์ฆ๊ฐ€ํ•˜๋ฉด, ๋นˆ๊ณค์˜ ์‹ฌ๊ฐ์„ฑ์ด ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ์Œ

(๋ณดํ†ต ์ผ์„ ํ•˜์ง€ ์•Š๋Š”) ์˜์กด์ ์ธ ๊ฐ€์กฑ ๊ตฌ์„ฑ์›์€ ๋น„์˜์กด์ ์ธ ๊ฐ€์กฑ ๊ตฌ์„ฑ์›์˜ ์ง€์›์„ ๋ฐ›์•„์•ผ ํ•จ => ๋” ๋†’์€ ์ˆ˜์ค€์˜ ๋นˆ๊ณค์œผ๋กœ ์ด์–ด์ง

sns.lmplot(x = 'rooms-per-capita', y = 'Target', fit_reg = True, data = train_heads, 
           x_jitter = 0.05, y_jitter = 0.05)
plt.title('Target vs Rooms Per Capita')
  • ์•ฝํ•œ ์–‘์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๋ณด์ž„

๐Ÿ“Œ์ƒ๊ด€๊ณ„์ˆ˜ heatmap

  • ์ƒ๊ด€๊ณ„์ˆ˜ ์‹œ๊ฐํ™”
# Household Variables
variables = ['Target', 'dependency', 'warning', 'walls+roof+floor', 'meaneduc',
             'floor', 'r4m1', 'overcrowding']

# ์ƒ๊ด€๊ณ„์ˆ˜ ๊ณ„์‚ฐ
corr_mat = train_heads[variables].corr().round(2)

# heatmap ์ƒ์„ฑ
plt.rcParams['font.size'] = 18
plt.figure(figsize = (12, 12))

sns.heatmap(corr_mat, vmin = -0.5, vmax = 0.8, center = 0, 
            cmap = plt.cm.RdYlGn_r, annot = True);
  • target๊ณผ ์•ฝํ•œ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๊ฐ€์ง€๋Š” ๋ณ€์ˆ˜๋“ค์ด ์ƒ๋‹น์ˆ˜ ์กด์žฌํ•จ์„ ๋ณด์—ฌ์คŒ

  • ์ผ๋ถ€ ๋ณ€์ˆ˜๋“ค์˜ ๊ฒฝ์šฐ ๋ณ€์ˆ˜๋“ค ๊ฐ„์˜ ์ƒ๊ด€๋„๊ฐ€ ๋†’์Œ

    • floor vs walls+roof+floor

    • ๋‹ค์ค‘๊ณต์„ ์„ฑ(multicollinearity) ๋ฌธ์ œ

b) ๋ณ€์ˆ˜ ์‹œ๊ฐํ™”

  • ์œ„์ชฝ ์‚ผ๊ฐํ˜•์—๋Š” ์‚ฐ์ ๋„(scatterplot), ๋Œ€๊ฐ์„ ์—๋Š” ์ปค๋„ ๋ฐ€๋„ plot(KDE), ์•„๋ž˜์ชฝ ์‚ผ๊ฐํ˜•์—๋Š” 2์ฐจ์› KDE ๊ทธ๋ฆผ์„ ํ‘œ์‹œํ•ด๋ณด์ž!
import warnings
warnings.filterwarnings('ignore')

# ์‹œ๊ฐํ™” ํ•  ๋ณ€์ˆ˜ ์„ ํƒ
plot_data = train_heads[['Target', 'dependency', 'walls+roof+floor',
                         'meaneduc', 'overcrowding']]

# ์˜์—ญ ๋‚˜๋ˆ„๊ธฐ -> pairgrid
grid = sns.PairGrid(data = plot_data, diag_sharey = False,
                    hue = 'Target', hue_order = [4, 3, 2, 1], 
                    vars = [x for x in list(plot_data.columns) if x != 'Target'])

# Upper: scatter plot
grid.map_upper(plt.scatter, alpha = 0.8, s = 20)

# Diagonal: kdeplot
grid.map_diag(sns.kdeplot)

# Bottom: 2์ฐจ์› kdeplot
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r)
grid = grid.add_legend()

plt.suptitle('Feature Plots Colored By Target', size = 32, y = 1.05)
household_feats = list(heads.columns) # ๊ฐ€๊ตฌ ์ˆ˜์ค€์—์„œ์˜ ๋ณ€์ˆ˜๋“ค์„ ์ตœ์ข…์ ์œผ๋กœ ์ €์žฅ

3-5. Individual Level Variables

  • ๋‘ ๊ฐ€์ง€ ์œ ํ˜•์ด ์กด์žฌํ•จ

    • booleanํ˜• ๋ณ€์ˆ˜(1 or 0)

    • ์ˆœ์„œํ˜•(ordinal) ๋ณ€์ˆ˜(์˜๋ฏธ ์žˆ๋Š” ์ˆœ์„œ๊ฐ€ ์ง€์ •๋œ ๊ฐœ๋ณ„ ๊ฐ’)

ind = data[id_ + ind_bool + ind_ordered]
ind.shape

a) ์ค‘๋ณต ๋ณ€์ˆ˜ ์ œ๊ฑฐํ•˜๊ธฐ

  • ์ƒ๊ด€ ๊ณ„์ˆ˜์˜ ์ ˆ๋Œ“๊ฐ’์ด 0.95๋ณด๋‹ค ํฐ ๋ณ€์ˆ˜์— ์ฃผ๋ชฉํ•˜์ž.
# ์ƒ๊ด€๊ณ„์ˆ˜ ํ–‰๋ ฌ ์ƒ์„ฑ
corr_matrix = ind.corr()

# ์ƒ์‚ผ๊ฐํ–‰๋ ฌ๋งŒ ๋‚จ๊ธฐ๊ธฐ
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(np.bool))

# ์ƒ๊ด€๊ณ„์ˆ˜๊ฐ€ 0.95 ์ด์ƒ์ธ ๋ณ€์ˆ˜๋“ค
to_drop = [column for column in upper.columns if any(abs(upper[column]) > 0.95)]

to_drop
  • ์ด๋Š” ๋‹จ์ˆœํžˆ ๋‚จ์„ฑ์˜ ๋ฐ˜๋Œ€๋ฅผ ์˜๋ฏธ

    • ๋‚จ์„ฑ flag ๋ณ€์ˆ˜๋ฅผ ์ œ๊ฑฐ
ind = ind.drop(columns = 'male')
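위의 상삼각행렬 기법이 어떻게 동작하는지 보여주는 toy 예시(x3가 x1의 2배인 가상 데이터):

```python
import numpy as np
import pandas as pd

# 가상의 예시: x3 = 2 * x1 이므로 두 변수의 상관계수는 1.0
toy = pd.DataFrame({'x1': [1, 2, 3, 4, 5],
                    'x2': [5, 3, 4, 1, 2],
                    'x3': [2, 4, 6, 8, 10]})

corr = toy.corr()
# k=1로 대각선을 제외한 상삼각만 남겨 같은 쌍을 두 번 세지 않도록 함
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if any(abs(upper[c]) > 0.95)]
# (x1, x3) 쌍에서 뒤쪽 변수인 x3만 제거 대상이 됨
```

상삼각만 보기 때문에 상관이 높은 쌍마다 한 변수만 제거 목록에 오르고, 다른 한 변수는 남게 됨.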

b) ์ˆœ์„œํ˜• ๋ณ€์ˆ˜ ์ƒ์„ฑํ•˜๊ธฐ

  • ๊ธฐ์กด์˜ ๋ณ€์ˆ˜๋“ค์„ ์ˆœ์„œํ˜• ๋ณ€์ˆ˜๋กœ ๋งคํ•‘ ๊ฐ€๋Šฅ

    • ๊ฐœ์ธ์˜ ๊ต์œก ์ˆ˜์ค€์„ ๋‚˜ํƒ€๋‚ด๋Š” instlevel_ ๋ณ€์ˆ˜๋ฅผ ์ค‘์‹ฌ์œผ๋กœ 1(๊ต์œก ์ˆ˜์ค€์ด x)๋ถ€ํ„ฐ 9(๋Œ€ํ•™์›)๊นŒ์ง€ mapping
ind[[c for c in ind if c.startswith('instl')]].head()
ind['inst'] = np.argmax(np.array(ind[[c for c in ind if c.startswith('instl')]]), axis = 1)

plot_categoricals('inst', 'Target', ind, annotate = False);
  • ๋” ๋†’์€ ์ˆ˜์ค€์˜ ๊ต์œก์„ ๋ฐ›์„์ˆ˜๋ก ๋œ ๊ทน๋‹จ์ ์ธ ์ˆ˜์ค€์˜ ๋นˆ๊ณค์— ํ•ด๋‹นํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์ž„
plt.figure(figsize = (10, 8))
sns.violinplot(x = 'Target', y = 'inst', data = ind)
plt.title('Education Distribution by Target');
# Drop the education columns
# ind = ind.drop(columns = [c for c in ind if c.startswith('instlevel')])
ind.shape

c) ์ƒˆ๋กœ์šด ๋ณ€์ˆ˜ ์ƒ์„ฑํ•˜๊ธฐ

  • ๊ธฐ์กด ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ช‡ ๊ฐ€์ง€ ๋ณ€์ˆ˜๋“ค์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Œ
ind['escolari/age'] = ind['escolari'] / ind['age']

plt.figure(figsize = (10, 8))
sns.violinplot(x = 'Target', y = 'escolari/age', data = ind)
ind['inst/age'] = ind['inst'] / ind['age']
ind['tech'] = ind['v18q'] + ind['mobilephone']
ind['tech'].describe()

3-6. ๋ณ€์ˆ˜ ์ง‘๊ณ„

  • ๊ฐœ๋ณ„ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€๊ตฌ ๋ฐ์ดํ„ฐ์— ํ†ตํ•ฉํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๊ฐ€๊ตฌ๋ณ„ ์ง‘๊ณ„๊ฐ€ ํ•„์š”

    • ๊ฐ€์กฑ ID์ธ idhogar๋กœ ๊ทธ๋ฃนํ™” ํ›„ ๋ฐ์ดํ„ฐ๋ฅผ agg๋กœ ๊ทธ๋ฃนํ™”
### ์ง‘๊ณ„(aggregation)๋ฅผ ์œ„ํ•œ ์‚ฌ์šฉ์ž ์ •์˜ ํ•จ์ˆ˜
range_ = lambda x: x.max() - x.min() # ํ•œ ์ค„์งœ๋ฆฌ ํ•จ์ˆ˜๋Š” ์ฃผ๋กœ lambda ์‹์œผ๋กœ ๊ตฌํ˜„
range_.__name__ = 'range_'

# ๊ทธ๋ฃนํ™” & ์ง‘๊ณ„
ind_agg = ind.drop(columns = ['Target','Id']).groupby('idhogar').agg(['min', 'max', 'sum', 'count', 'std', range_])
ind_agg.head()
  • ๋ณ€์ˆ˜๊ฐ€ 30๊ฐœ์—์„œ 180๊ฐœ๊ฐ€ ๋˜์—ˆ๋‹ค..
### ๋ณ€์ˆ˜ ์ด๋ฆ„ ์žฌ์ •์˜

new_col = []
for c in ind_agg.columns.levels[0]:
    for stat in ind_agg.columns.levels[1]:
        new_col.append(f'{c}-{stat}')
        
ind_agg.columns = new_col
ind_agg.head()
ind_agg.iloc[:, [0, 1, 2, 3, 6, 7, 8, 9]].head()

3-7. ๋ณ€์ˆ˜ ์„ ํƒ

  • ์ƒ๊ด€ ๊ด€๊ณ„๊ฐ€ 0.95๋ณด๋‹ค ํฐ ๋ณ€์ˆ˜๋“ค์˜ ์Œ ์ค‘ ํ•˜๋‚˜๋ฅผ ์ œ๊ฑฐ
# Create correlation matrix
corr_matrix = ind_agg.corr()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(abs(upper[column]) > 0.95)]

print(f'There are {len(to_drop)} correlated columns to remove.')
  • ๋ณ€์ˆ˜๋ฅผ ์ œ๊ฑฐํ•˜๊ณ , head ๋ฐ์ดํ„ฐ์™€ ๋ณ‘ํ•ฉํ•˜์—ฌ ์ตœ์ข… ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ์ƒ์„ฑ
ind_agg = ind_agg.drop(columns = to_drop)
ind_feats = list(ind_agg.columns)

# Merge on the household id
final = heads.merge(ind_agg, on = 'idhogar', how = 'left')

print('Final features shape: ', final.shape)
final.head()

3-8. ์ตœ์ข…์ ์ธ ๋ฐ์ดํ„ฐ ํƒ์ƒ‰

a) ์ƒ๊ด€๊ณ„์ˆ˜ ํ™•์ธ

corrs = final.corr()['Target']
corrs.sort_values().head()
corrs.sort_values().dropna().tail()
  • ์ƒ์„ฑ๋œ ๋ณ€์ˆ˜ ์ค‘ ์ผ๋ถ€๊ฐ€ target ๋ณ€์ˆ˜์™€ ๋†’์€ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ

  • ํ•ด๋‹น ๋ณ€์ˆ˜๊ฐ€ ์‹ค์ œ๋กœ ์œ ์šฉํ•œ์ง€๋Š” ๋ชจ๋ธ๋ง ๋‹จ๊ณ„์—์„œ ํŒ๋‹จํ•  ์˜ˆ์ •

b) escolari ๋ณ€์ˆ˜

plot_categoricals('escolari-max', 'Target', final, annotate = False)
plt.figure(figsize = (10, 6))
sns.violinplot(x = 'Target', y = 'escolari-max', data = final)
plt.title('Max Schooling by Target')
plt.figure(figsize = (10, 6))
sns.boxplot(x = 'Target', y = 'escolari-max', data = final)
plt.title('Max Schooling by Target')

c) meaneduc ๋ณ€์ˆ˜

plt.figure(figsize = (10, 6))
sns.boxplot(x = 'Target', y = 'meaneduc', data = final)
plt.xticks([0, 1, 2, 3], poverty_mapping.values())
plt.title('Average Schooling by Target')

d) Overcrowding 변수

plt.figure(figsize = (10, 6))
sns.boxplot(x = 'Target', y = 'overcrowding', data = final)
plt.xticks([0, 1, 2, 3], poverty_mapping.values())
plt.title('Overcrowding by Target')

e) ๊ฐ€์žฅ์˜ ์„ฑ๋ณ„

head_gender = ind.loc[ind['parentesco1'] == 1, ['idhogar', 'female']]
final = final.merge(head_gender, on = 'idhogar', how = 'left').rename(columns = {'female': 'female-head'})
final.groupby('female-head')['Target'].value_counts(normalize=True)
  • ๊ฐ€์žฅ์ด ์—ฌ์„ฑ์ธ ๊ฐ€๊ตฌ๋Š” ๋นˆ๊ณค ์ˆ˜์ค€์•„ ์‹ฌ๊ฐํ•  ๊ฐ€๋Šฅ์„ฑ์ด ์•ฝ๊ฐ„ ๋” ๋†’์€ ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค.
sns.violinplot(x = 'female-head', y = 'Target', data = final)
plt.title('Target by Female Head of Household');

๊ฐ€์žฅ์˜ ์„ฑ๋ณ„์— ๋”ฐ๋ฅธ ํ‰๊ท  ๊ต์œก ์ˆ˜์ค€ ์ฐจ์ด

plt.figure(figsize = (8, 8))
sns.boxplot(x = 'Target', y = 'meaneduc', hue = 'female-head', data = final)
plt.title('Average Education by Target and Female Head of Household', size = 16)
  • ์—ฌ์„ฑ ๊ฐ€์žฅ์„ ๋‘” ๊ฐ€๊ตฌ์ผ์ˆ˜๋ก ๊ต์œก ์ˆ˜์ค€์ด ๋†’์€ ๊ฒƒ์œผ๋กœ ๋ณด์ž„

  • ๊ทธ๋Ÿฌ๋‚˜ ์ „์ฒด์ ์œผ๋กœ ์—ฌ์„ฑ์ด ๊ฐ€์žฅ์ธ ๊ฐ€๊ตฌ๋Š” ์‹ฌ๊ฐํ•œ ๋นˆ๊ณค์„ ๊ฒช์„ ๊ฐ€๋Šฅ์„ฑ์ด ๋” ๋†’๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Œ

final.groupby('female-head')['meaneduc'].agg(['mean', 'count'])
  • ์ „์ฒด์ ์œผ๋กœ ์—ฌ์„ฑ ๊ฐ€์žฅ์ด ์žˆ๋Š” ๊ฐ€๊ตฌ์˜ ํ‰๊ท  ๊ต์œก ์ˆ˜์ค€์€ ๋‚จ์„ฑ ๊ฐ€์žฅ์ด ์žˆ๋Š” ๊ฐ€๊ตฌ๋ณด๋‹ค ์•ฝ๊ฐ„ ๋†’์Œ

4. Baseline Model

  • ๋ชจ๋“  ๋ฐ์ดํ„ฐ(train/test)๋Š” ๊ฐ ๊ฐ€๊ตฌ์— ๋Œ€ํ•ด ์ง‘๊ณ„๋˜๋ฏ€๋กœ ๋ชจ๋ธ์— ์ง์ ‘ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Œ

  • sklearn์˜ RandomForest๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋ธ๋ง ๊ธฐ์ค€ ์ฐพ๊ธฐ

  • ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด train ๋ฐ์ดํ„ฐ์— 10-fold ๊ต์ฐจ ๊ฒ€์ฆ์„ ํ™œ์šฉ

    • ๋ฐ์ดํ„ฐ๋ฅผ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์œผ๋กœ ๋ถ„ํ• ํ•˜์—ฌ ๋ชจ๋ธ์„ ์ด 10๋ฒˆ ํ›ˆ๋ จ
  • ๊ต์ฐจ ๊ฒ€์ฆ์˜ ํ‰๊ท  ์„ฑ๋Šฅ๊ณผ ํ‘œ์ค€ ํŽธ์ฐจ๋ฅผ ์กฐ์‚ฌํ•˜์—ฌ fold ๊ฐ„์— ์ ์ˆ˜๊ฐ€ ์–ผ๋งˆ๋‚˜ ๋ณ€ํ•˜๋Š”์ง€ ํ™•์ธ

    • F1 Macro Score를 사용하여 성능 평가
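평가 지표인 Macro F1은 클래스별 F1을 단순 평균하므로 소수 클래스(극심한 빈곤)도 다수 클래스와 동일한 가중치를 가짐. 가상의 레이블로 동작을 보여주는 작은 예시:

```python
from sklearn.metrics import f1_score

# 4개 빈곤 수준에 대한 가상의 정답/예측 예시
y_true = [1, 1, 2, 3, 4, 4]
y_pred = [1, 2, 2, 3, 4, 1]

# macro: 클래스별 F1(0.5, 2/3, 1.0, 2/3)을 계산한 뒤 단순 평균
macro = f1_score(y_true, y_pred, average='macro')
```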
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

# ๊ต์ฐจ ๊ฒ€์ฆ์„ ์œ„ํ•ด ์‚ฌ์šฉ์ž ์ •์˜ ํ•จ์ˆ˜ ๋งŒ๋“ค๊ธฐ
scorer = make_scorer(f1_score, greater_is_better=True, average = 'macro')
# ํ•™์Šต์„ ์œ„ํ•œ label(target) ๊ฐ’
train_labels = np.array(list(final[final['Target'].notnull()]['Target'].astype(np.uint8)))

# train/ test set ์ค€๋น„
train_set = final[final['Target'].notnull()].drop(columns = ['Id', 'idhogar', 'Target'])
test_set = final[final['Target'].isnull()].drop(columns = ['Id', 'idhogar', 'Target'])

# ์ œ์ถœ์šฉ ์–‘์‹ ๋งŒ๋“ค๊ธฐ
submission_base = test[['Id', 'idhogar']].copy()

4-1. Pipelining

  • ์—ฌ๋Ÿฌ ๋ชจํ˜•์„ ๋น„๊ตํ•˜๊ธฐ ์œ„ํ•ด feature๋“ค ๊ฐ„์˜ ์Šค์ผ€์ผ์„ ์กฐ์ •

    • ๊ฐ ์ปฌ๋Ÿผ(๋ณ€์ˆ˜)์˜ ๋ฒ”์œ„๋ฅผ 0๊ณผ 1 ์‚ฌ์ด๋กœ ์ œํ•œ

    • ๋งŽ์€ ์•™์ƒ๋ธ” ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ๋ถˆํ•„์š”ํ•˜์ง€๋งŒ, KNearest Neighbors ๋˜๋Š” Support Vector Machine๊ณผ ๊ฐ™์ด ๊ฑฐ๋ฆฌ ๋ฉ”ํŠธ๋ฆญ์— ์˜์กดํ•˜๋Š” ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•  ๊ฒฝ์šฐ feature scaling์ด์ ˆ๋Œ€์ ์œผ๋กœ ํ•„์š”

    • ๊ฒฐ์ธก์น˜์˜ ๊ฒฝ์šฐ feature์˜ ์ค‘์•™๊ฐ’์œผ๋กœ ๋Œ€์ฒด

  • ๊ฒฐ์ธก์น˜๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ  feature๋“ค์„ ํ•œ ๋ฒˆ์— scaling ํ•˜๊ธฐ ์œ„ํ•ด Pipeline์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Œ

    • train ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ๋ธ์— ์ ํ•ฉํ•˜๊ณ  train ๋ฐ test ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ™˜ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ
features = list(train_set.columns)

pipeline = Pipeline([('imputer', SimpleImputer(strategy = 'median')), 
                      ('scaler', MinMaxScaler())])

# ๋ฐ์ดํ„ธ๋ฅผ ์•Œ๋งž๊ฒŒ ๋ณ€ํ™˜(์ „์ฒ˜๋ฆฌ)
train_set = pipeline.fit_transform(train_set)
test_set = pipeline.transform(test_set)
  • ๋ฐ์ดํ„ฐ์—๋Š” ๊ฒฐ์ธก๊ฐ’์ด ์—†์œผ๋ฉฐ, 0๊ณผ 1 ์‚ฌ์ด ๋ฒ”์œ„๋กœ scaling ๋จ

    • scikit-Learn ๋ชจ๋ธ์—์„œ ์ง์ ‘ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Œ
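pipeline이 train 데이터에만 fit되고 test 데이터에는 transform만 적용되는 것을 보여주는 toy 예시(가상의 작은 배열 사용):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# 가상의 작은 데이터로 pipeline 동작 확인
X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, 40.0]])
X_test = np.array([[2.0, np.nan]])

pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                 ('scaler', MinMaxScaler())])

X_train_t = pipe.fit_transform(X_train)  # train에만 fit + transform
X_test_t = pipe.transform(X_test)        # test에는 transform만 적용(데이터 누수 방지)
```

test 데이터의 결측치도 train에서 학습한 중앙값으로 대체되고, scaling 범위 역시 train 기준으로 적용됨.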
### ๋ชจ๋ธ ํ•™์Šต

model = RandomForestClassifier(n_estimators = 100, random_state = 10, n_jobs = -1)
# 10 fold cross validation
cv_score = cross_val_score(model, train_set, train_labels, cv = 10, scoring = scorer)

print(f'10 Fold Cross Validation F1 Score = {round(cv_score.mean(), 4)} with std = {round(cv_score.std(), 4)}')
  • ํ˜„์žฌ๋Š” ์„ฑ๋Šฅ์ด ๊ทธ๋‹ค์ง€ ์ข‹์ง€๋Š” ์•Š์Œ

4-2. ํ”ผ์ณ ์ค‘์š”๋„ ํ™•์ธ

  • ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋ฉด ๋ชจ๋ธ์˜ ๊ธฐ๋Šฅ ์œ ์šฉ์„ฑ์— ๋Œ€ํ•œ ์ƒ๋Œ€์  ์ˆœ์œ„๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๊ธฐ๋Šฅ ์ค‘์š”๋„(feature importances)๋ฅผ ์‚ดํŽด๋ณผ ์ˆ˜ ์žˆ์Œ

    • ๋ถ„ํ• ์„ ์œ„ํ•ด ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ ๋…ธ๋“œ์˜ ๋ถˆ์ˆœ๋ฌผ ๊ฐ์†Œ์˜ ํ•ฉ์„ ์˜๋ฏธ

    • ์ƒ๋Œ€ ์ ์ˆ˜์— ์ดˆ์ 

  • feature importances๋ฅผ ๋ณด๊ธฐ ์œ„ํ•ด์„  ์ „์ฒด train ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ๋ชจ๋ธ์„ train์‹œ์ผœ์•ผ ํ•จ

  • ๊ต์ฐจ ๊ฒ€์ฆ์˜ ๊ฒฝ์šฐ feature importance๋ฅผ ๋ฐ˜ํ™˜ํ•˜์ง€ x

model.fit(train_set, train_labels) # ์ „์ฒด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ํ•™์Šต

# Feature importances ํ™•์ธ
feature_importances = pd.DataFrame({'feature': features, 'importance': model.feature_importances_})
feature_importances.head()

๐Ÿ“Œ ๊ธฐ๋Šฅ ์ค‘์š”๋„ ์‹œ๊ฐํ™” ํ•จ์ˆ˜

  • ์ค‘์š”ํ•œ ์ˆœ์„œ๋Œ€๋กœ n๊ฐœ์˜ ํ”ผ์ณ ์‹œ๊ฐํ™”

  • ์ž„๊ณ„๊ฐ’(threshold)์ด ์ง€์ •๋œ ๊ฒฝ์šฐ ๋ˆ„์  ์ค‘์š”๋„๋ฅผ ํ‘œ์‹œํ•˜๊ณ , ๋ˆ„์  ์ค‘์š”๋„๊ฐ€ ์ž„๊ณ„๊ฐ’์— ๋„๋‹ฌํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ ํ”ผ์ณ ์ˆ˜๋ฅผ print

  • ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๊ธฐ๋Šฅ ์ค‘์š”๋„์™€ ํ•จ๊ป˜ ์‚ฌ์šฉํ•˜๋„๋ก ์„ค๊ณ„๋จ


  • Arguments>

    • df(dataframe)

      • ํ”ผ์ฒ˜ ์ค‘์š”๋„์˜ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„

      • ์—ด์€ ๊ธฐ๋Šฅ ๋ฐ ์ค‘์š”๋„์—ฌ์•ผ ํ•จ

    • n(int)

      • ์ค‘์š”๋„ ์ˆœ์œผ๋กœ ํ‘œ์‹œํ•  ํ”ผ์ณ ์ˆ˜

      • default = 10

    • threshold(float)

      • ๋ˆ„์  ์ค‘์š”๋„ plot์˜ ์ž„๊ณ„๊ฐ’

      • ์ž„๊ณ„๊ฐ’์ด ์ œ๊ณต๋˜์ง€ ์•Š์œผ๋ฉด plot์ด ์ƒ์„ฑ๋˜์ง€ x

      • default: None

  • Returns>

    • df(dataframe)

      • ์ •๊ทœํ™”๋œ ์ปฌ๋Ÿผ(ํ•ฉ๊ณ„ = 1)๊ณผ ๋ˆ„์  ์ค‘์š”๋„ ์ปฌ๋Ÿผ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ”ผ์ณ ์ค‘์š”๋„ ์ˆœ์œผ๋กœ ์ •๋ ฌ๋œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„

(Note)

  • ์ด ๊ฒฝ์šฐ ์ •๊ทœํ™”๋Š” ํ•ฉ์ด 1์ด๋ผ๋Š” ๊ฒƒ์„ ์˜๋ฏธ

  • ๋ˆ„์  ์ค‘์š”๋„๋Š” ๊ฐ€์žฅ ์ค‘์š”ํ•˜์ง€ ์•Š์€ feature๋ถ€ํ„ฐ ๊ฐ€์žฅ ์ค‘์š”ํ•˜์ง€ ์•Š์€ feature๋ฅผ ํ•ฉ์‚ฐํ•˜์—ฌ ๊ณ„์‚ฐ

  • threshold = 0.9: ๋ˆ„์  ์ค‘์š”๋„์˜ 90%์— ๋„๋‹ฌํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ ๊ฐ€์žฅ ์ค‘์š”ํ•œ feature๋ฅผ ํ‘œ์‹œ
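누적 중요도와 임계값 계산을 가상의 중요도 값으로 확인하는 예시(아래 feature 이름과 중요도는 모두 가상의 값):

```python
import numpy as np
import pandas as pd

# 가상의 중요도 값으로 누적 중요도 계산 과정을 확인
fi = pd.DataFrame({'feature': ['a', 'b', 'c', 'd'],
                   'importance': [0.5, 0.3, 0.15, 0.05]})

fi = fi.sort_values('importance', ascending=False).reset_index(drop=True)
fi['importance_normalized'] = fi['importance'] / fi['importance'].sum()
fi['cumulative_importance'] = np.cumsum(fi['importance_normalized'])
# 누적 중요도: 0.5, 0.8, 0.95, 1.0

# 누적 중요도 0.9를 넘기 위해 필요한 feature 수(인덱스 + 1)
n_needed = int(np.min(np.where(fi['cumulative_importance'] > 0.9))) + 1
```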

def plot_feature_importances(df, n = 10, threshold = None):

    plt.style.use('fivethirtyeight')
    
    # ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ธฐ๋Šฅ ๊ธฐ์ค€ ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ
    df = df.sort_values('importance', ascending = False).reset_index(drop = True)
    
    # ํ”ผ์ณ ์ค‘์š”๋„๋ฅผ ์ •๊ทœํ™”(0 ~ 1 ์‚ฌ์ด)ํ•˜์—ฌ ๋ˆ„์  ์ค‘์š”๋„ ๊ณ„์‚ฐ
    df['importance_normalized'] = df['importance'] / df['importance'].sum()
    df['cumulative_importance'] = np.cumsum(df['importance_normalized'])
    
    plt.rcParams['font.size'] = 12
    
    # ์ค‘์š”ํ•œ ์ˆœ์„œ๋Œ€๋กœ n๊ฐœ feature์— ๋Œ€ํ•œ barhplot
    df.loc[:n, :].plot.barh(y = 'importance_normalized', 
                            x = 'feature', color = 'darkgreen', 
                            edgecolor = 'k', figsize = (12, 8),
                            legend = False, linewidth = 2)

    plt.xlabel('Normalized Importance', size = 18); plt.ylabel(''); 
    plt.title(f'{n} Most Important Features', size = 18)
    plt.gca().invert_yaxis()
    
    ### ------------------------------------------------------------------------
    
    # ์ž„๊ณ„๊ฐ’์— ๋‹ค๋‹ค๋ž๋‹ค๋ฉด
    if threshold:
        # ๋ˆ„์  ์ค‘์š”๋„ plot
        plt.figure(figsize = (8, 6))
        plt.plot(list(range(len(df))), df['cumulative_importance'], 'b-')
        plt.xlabel('Number of Features', size = 16); plt.ylabel('Cumulative Importance', size = 16)
        plt.title('Cumulative Feature Importance', size = 18)
        
        # ๋ˆ„์  ์ค‘์š”๋„์— ๋„๋‹ฌํ•˜๊ธฐ ์œ„ํ•œ feature ์ˆ˜
        # ์ธ๋ฑ์Šค(-> ์‹ค์ œ ์ˆซ์ž์— ๋Œ€ํ•ด 1์„ ์ถ”๊ฐ€ํ•ด์•ผ ํ•จ)
        importance_index = np.min(np.where(df['cumulative_importance'] > threshold))
        
        # ์ˆ˜์ง ์„  ์ถ”๊ฐ€
        plt.vlines(importance_index + 1, ymin = 0, ymax = 1.05, linestyles = '--', colors = 'red')
        plt.show()
        
        print('{} features required for {:.0f}% of cumulative importance.'.format(importance_index + 1, 
                                                                                  100 * threshold))
    
    return df
norm_fi = plot_feature_importances(feature_importances, threshold=0.95)
  • ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๋ณ€์ˆ˜๋Š” meaneduc(๊ฐ€๊ตฌ ๋‚ด ํ‰๊ท  ๊ต์œก๋Ÿ‰)์ด๊ณ , ๋‹ค์Œ์ด age-max(๊ฐ€๊ตฌ ๋‚ด ํ•œ ๊ฐœ์ธ์˜ ์ตœ๋Œ€ ๊ต์œก๋Ÿ‰)์ž„

    • ์ด ๋‘ ๋ณ€์ˆ˜๋Š” ์ƒ๊ด€์„ฑ์ด ๋†’์€ ๋ณ€์ˆ˜๋“ค์ž„

    • ๋”ฐ๋ผ์„œ, ๋ฐ์ดํ„ฐ์—์„œ ๋ณ€์ˆ˜ ์ค‘ ํ•˜๋‚˜๋ฅผ ์ œ๊ฑฐํ•ด์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธ

  • ๋‹ค๋ฅธ ์ค‘์š”ํ•œ feature๋“ค์€ ์šฐ๋ฆฌ๊ฐ€ ๋งŒ๋“  ๋ณ€์ˆ˜์™€ ๋ฐ์ดํ„ฐ์— ์ด๋ฏธ ์กด์žฌํ–ˆ๋˜ ๋ณ€์ˆ˜๋“ค๊ณผ์˜ ์กฐํ•ฉ์œผ๋กœ ์ƒ์„ฑ๋œ ๋ณ€์ˆ˜๋“ค์ž„

  • 90%์˜ ์ค‘์š”๋„๋ฅผ ์œ„ํ•ด์„œ๋Š” ๋Œ€๋žต 180๊ฐœ ์ •๋„์˜ feature๋“ค๋งŒ ์กด์žฌํ•ด๋„ ok

    • ์ผ๋ถ€ feature๋“ค์„ ์ œ๊ฑฐํ•ด๋„ ๋ฌด๋ฐฉํ•จ
  • ํ”ผ์ฒ˜ ์ค‘์š”๋„์€ ํ”ผ์ฒ˜๊ฐ€ ์–ด๋Š ๋ฐฉํ–ฅ์œผ๋กœ ์ค‘์š”ํ•œ์ง€๋ฅผ ์•Œ๋ ค์ฃผ์ง€๋Š” ์•Š์Œ

    • ์˜ˆ๋ฅผ ๋“ค์–ด, ๊ต์œก์„ ๋งŽ์ด ๋ฐ›์„์ˆ˜๋ก ๋œ ์‹ฌ๊ฐํ•œ ๋นˆ๊ณค์œผ๋กœ ์ด์–ด์ง€๋Š”์ง€๋ฅผ ์•Œ๋ ค์ฃผ์ง€๋Š” ๋ชปํ•จ

    • ๊ด€๋ จ์ด ์žˆ์„ ๊ฒƒ์œผ๋กœ ๊ฐ„์ฃผ๋˜๋Š” ๋ชจ๋ธ๋งŒ ์•Œ๋ ค์คŒ

### ์ปค๋„ ๋ฐ€๋„ ํ•จ์ˆ˜ ์‹œ๊ฐํ™”
# "variable" ๋ณ„๋กœ "target" ๊ฐ’์˜ ๋ถ„ํฌ๋ฅผ ํ‘œ์‹œ

def kde_target(df, variable):
    
    colors = {1: 'red', 2: 'orange', 3: 'blue', 4: 'green'}

    plt.figure(figsize = (12, 8))
    
    df = df[df['Target'].notnull()]
    
    for level in df['Target'].unique():
        subset = df[df['Target'] == level].copy()
        sns.kdeplot(subset[variable].dropna(), 
                    label = f'Poverty Level: {level}', 
                    color = colors[int(level)])  # 현재 반복 중인 level의 색상 사용

    plt.xlabel(variable); plt.ylabel('Density');
    plt.title('{} Distribution'.format(variable.capitalize()));
kde_target(final, 'meaneduc')
kde_target(final, 'escolari/age-range_')

4-2. ๋ชจ๋ธ ์„ ํƒ

  • ์ด๋ฏธ RandomForestClassifier๋Š” ์‹œ๋„

    • Macro F1-score: 0.35
  • ๊ธฐ๊ณ„ ํ•™์Šต์—์„œ๋Š” ์–ด๋–ค ๋ชจ๋ธ์ด ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๊ฐ€์žฅ ์ž˜ ์ž‘๋™ํ•˜๋Š”์ง€ ๋ฏธ๋ฆฌ ์•Œ ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์ด ์—†์Œ

    • ๊ฐ ์ƒํ™ฉ๋งˆ๋‹ค ๋‹ค๋ฆ„..

    • ์–ด๋–ค ๊ฒƒ์ด ์ตœ์ ์ธ์ง€ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๋ชจ๋ธ์„ ์‹œ๋„ํ•ด ๋ณด์•„์•ผ ํ•จ

(그림: algorithm_comparison)

# Model imports

from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegressionCV, RidgeClassifierCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier
import warnings 
from sklearn.exceptions import ConvergenceWarning

# Filter out warnings from models
warnings.filterwarnings('ignore', category = ConvergenceWarning)
warnings.filterwarnings('ignore', category = DeprecationWarning)
warnings.filterwarnings('ignore', category = UserWarning)

# ๊ฒฐ๊ณผ ์ €์žฅ์„ ์œ„ํ•œ data frame
model_results = pd.DataFrame(columns = ['model', 'cv_mean', 'cv_std'])

๐Ÿ“Œ ๋ชจ๋ธ ํ‰๊ฐ€๋ฅผ ์œ„ํ•œ ํ•จ์ˆ˜

  • RandomForestClassifier 외에 8개의 다른 scikit-learn 모델 사용

  • ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅํ•  ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ๋งŒ๋“ค๊ณ , ํ•จ์ˆ˜๋Š” ๊ฐ ๋ชจ๋ธ์˜ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์— ํ–‰์„ ์ถ”๊ฐ€ํ•˜๋Š” ์ž‘์—…์„ ์ˆ˜ํ–‰

  • ๊ฐ ๋ชจ๋ธ์— ๋Œ€ํ•ด 10-fold ๊ต์ฐจ ๊ฒ€์ฆ ์ˆ˜ํ–‰

def cv_model(train, train_labels, model, name, model_results = None):
    
    cv_scores = cross_val_score(model, train, train_labels, cv = 10, scoring=scorer, n_jobs = -1)
    print(f'10 Fold CV Score: {round(cv_scores.mean(), 5)} with std: {round(cv_scores.std(), 5)}')
    
    if model_results is not None:
        # 결과 저장 (DataFrame.append는 최신 pandas에서 제거되었으므로 pd.concat 사용)
        new_row = pd.DataFrame({'model': name, 
                                'cv_mean': cv_scores.mean(), 
                                'cv_std': cv_scores.std()}, index = [0])
        model_results = pd.concat([model_results, new_row], ignore_index = True)

        return model_results
### 1. Linear SVC

model_results = cv_model(train_set, train_labels, 
                         LinearSVC(), 'LSVC', model_results)
  • ์„ฑ๋Šฅ์ด ๊ทธ๋‹ฅ ์ข‹์ง€ ์•Š์Œ

    • ๋ชฉ๋ก์—์„œ ์‚ญ์ œํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ ์ค‘ ํ•˜๋‚˜
### 2. Gaussian Naive Bayes

model_results = cv_model(train_set, train_labels, 
                         GaussianNB(), 'GNB', model_results)
  • ๋งค์šฐ ๋‚˜์œ ์„ฑ๋Šฅ..
### 3. Multi-Layer Perceptron

model_results = cv_model(train_set, train_labels, 
                         MLPClassifier(hidden_layer_sizes=(32, 64, 128, 64, 32)),
                         'MLP', model_results)
  • ๊ดœ์ฐฎ์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๊ณ  ์žˆ์Œ

    • ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋„คํŠธ์›Œํฌ๋ฅผ ์กฐ์ •ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ ํ•ด๋‹น ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Œ

    • ํ•˜์ง€๋งŒ ์ œํ•œ๋œ ์–‘์˜ ๋ฐ์ดํ„ฐ๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ํšจ๊ณผ์ ์ธ ํ•™์Šต์„ ์œ„ํ•ด์„  ์ˆ˜์‹ญ๋งŒ ๊ฐœ์˜ ์˜ˆ์ œ๋ฅผ ํ•„์š”๋กœ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์‹ ๊ฒฝ๋ง์— ๋ฌธ์ œ๊ฐ€ ๋  ์ˆ˜ ์žˆ์Œ

### 4. LinearDiscriminantAnalysis

model_results = cv_model(train_set, train_labels, 
                          LinearDiscriminantAnalysis(), 
                          'LDA', model_results)
  • UserWarning์„ filteringํ•˜์ง€ ์•Š๊ณ  LDA๋ฅผ ์ง„ํ–‰ ์‹œ ์—๋Ÿฌ ๋ฐœ์ƒ

    • Variables are collinear. : ๊ณต์„ ์„ฑ ๋ฌธ์ œ
  • ์„ ํ˜• ๋ณ€์ˆ˜๋ฅผ ์ œ๊ฑฐํ•œ ํ›„ ํ•ด๋‹น ๋ชจ๋ธ์„ ๋‹ค์‹œ ์‹œ๋„ํ•ด ๋ณผ ์ˆ˜ ์žˆ์Œ

### 5. Ridge

model_results = cv_model(train_set, train_labels, 
                         RidgeClassifierCV(), 'RIDGE', model_results)
  • ์„ ํ˜• ๋ชจํ˜•(ridge ๋ชจ๋ธ ํฌํ•จ)์€ ๋†€๋ผ์šธ ์ •๋„๋กœ ์ž˜ ์ž‘๋™ํ•จ

    • ๋‹จ์ˆœํ•œ ๋ชจ๋ธ์ด ์ด ๋ฌธ์ œ์—์„œ๋Š” ๋” ๋„์›€์ด ๋  ์ˆ˜๋„ ์žˆ์Œ์„ ์˜๋ฏธ
### 6. K-Neighbors

for n in [5, 10, 20]:
    print(f'\nKNN with {n} neighbors\n')
    model_results = cv_model(train_set, train_labels, 
                             KNeighborsClassifier(n_neighbors = n),
                             f'knn-{n}', model_results)
### 7. ExtraTreeClassifier

model_results = cv_model(train_set, train_labels, 
                         ExtraTreesClassifier(n_estimators = 100, random_state = 10),
                         'EXT', model_results)

4-3. ๋ชจ๋ธ ์„ฑ๋Šฅ ๋น„๊ต

  • ๋ชจ๋ธ๋ง ๊ฒฐ๊ณผ๊ฐ€ ์ €์žฅ๋œ df๋ฅผ ํ™œ์šฉํ•ด ์–ด๋–ค ๋ชจ๋ธ์ด ๊ฐ€์žฅ ํšจ๊ณผ์ ์ธ์ง€ ์‹œ๊ฐํ™”ํ•˜๋Š” ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ฆด ์ˆ˜ ์žˆ์Œ
# ์ฒ˜์Œ์œผ๋กœ ์ˆ˜ํ–‰ํ•œ RandomForestClassifier์˜ ๊ฒฐ๊ณผ๋„ ์ €์žฅํ•ด์ฃผ๊ธฐ

model_results = cv_model(train_set, train_labels,
                          RandomForestClassifier(n_estimators = 100, random_state = 10),
                              'RF', model_results)
### ๋ชจ๋ธ๋ง ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”

model_results.set_index('model', inplace = True)
model_results['cv_mean'].plot.bar(color = 'orange', figsize = (8, 6),
                                  yerr = list(model_results['cv_std']),
                                  edgecolor = 'k', linewidth = 2)
plt.title('Model F1 Score Results')
plt.ylabel('Mean F1 Score (with error bar)')

model_results.reset_index(inplace = True)
  • RandomForestClassifier๋Š” ๊ฐ€์žฅ ๊ฐ„๋‹จํ•˜๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ธฐ์—, ๊ฐ€์žฅ ๋จผ์ € ์‹œ๋„ํ•ด ๋ณผ ์ˆ˜ ์žˆ๋Š” ๋ชจํ˜•์ž„

  • ์•„์ง baseline model๋“ค์˜ ๊ฒฝ์šฐ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์กฐ์ •ํ•˜์ง€ ์•Š์•„ ๋ชจ๋ธ ๊ฐ„ ๋น„๊ต๊ฐ€ ์™„๋ฒฝํ•˜์ง€๋Š” ์•Š์ง€๋งŒ, ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ์•™์ƒ๋ธ” ๋ฐฉ๋ฒ•(Gradient Boosting Machine ํฌํ•จ)์ด ๊ตฌ์กฐํ™”๋œ ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ ๋งค์šฐ ์ž˜ ์ˆ˜ํ–‰๋œ๋‹ค๋Š” ๋‹ค๋ฅธ ๋งŽ์€ Kaggle competition์˜ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ˜์˜ํ•˜๊ณ  ์žˆ์Œ

ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ์กฐ์ • ํšจ๊ณผ

(๊ทธ๋ฆผ: hyperparameter_improvement — ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ ์ „ํ›„ ์„ฑ๋Šฅ ๋น„๊ต ๊ทธ๋ž˜ํ”„)

  • ๋Œ€๋ถ€๋ถ„์˜ ๊ฒฝ์šฐ ์ •ํ™•๋„ ํ–ฅ์ƒ์€ 10% ๋ฏธ๋งŒ

    • ์ตœ์•…์˜ ๋ชจ๋ธ์€ ํŠœ๋‹์„ ํ†ตํ•ด ๊ฐ‘์ž๊ธฐ ์ตœ๊ณ ์˜ ๋ชจ๋ธ์ด ๋˜์ง€๋Š” ์•Š์„ ๊ฒƒ์ž„
  • ์ผ๋‹จ์€ ๊ทธ๋ƒฅ RandomForestClassifier๋กœ ์˜ˆ์ธก ์ˆ˜ํ–‰

4-4. ์ œ์ถœ ํŒŒ์ผ ์ƒ์„ฑ

  • ์ œ์ถœํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” test ๋ฐ์ดํ„ฐ๊ฐ€ ํ•„์š”ํ•จ

    • ํ˜„์žฌ test data๊ฐ€ train data์™€ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ format ๋˜์–ด ์žˆ์Œ
  • ๊ฐ€๊ตฌ๋ณ„๋กœ ์˜ˆ์ธก์„ ํ•˜๊ณ  ์žˆ์ง€๋งŒ ์‹ค์ œ๋กœ๋Š” ๊ฐœ์ธ๋‹น ํ•œ ์ค„(Id๋กœ ์‹๋ณ„)๋งŒ ํ•„์š”

    • ๊ฐ€์žฅ(head of household)์— ๋Œ€ํ•œ ์˜ˆ์ธก๋งŒ ์ ์ˆ˜๊ฐ€ ๋งค๊ฒจ์ง
  • ํ…Œ์ŠคํŠธ ์ œ์ถœ ํ˜•์‹


Id,Target

ID_2f6873615,1

ID_1c78846d2,2

ID_e5442cf6a,3

ID_a8db26a79,4

ID_a62966799,4 

  • submission_base๋Š” ๊ฐ ๊ฐœ์ธ์— ๋Œ€ํ•œ ์˜ˆ์ธก์ด ์žˆ์–ด์•ผ ํ•จ

    • submission_base๋Š” test set์˜ ๋ชจ๋“  ๊ฐœ์ธ์„ ํฌํ•จ
  • test_ids: ๊ฐ€์žฅ์˜ idhogar๋งŒ ํฌํ•จ

  • ์˜ˆ์ธก ์‹œ์—๋Š” ๊ฐ ๊ฐ€๊ตฌ์— ๋Œ€ํ•ด์„œ๋งŒ ์˜ˆ์ธกํ•œ ๋‹ค์Œ ์˜ˆ์ธก ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ๊ฐ€๊ตฌ ID(idhogar)์˜ ๋ชจ๋“  ๊ฐœ์ธ๊ณผ ๋ณ‘ํ•ฉ

    • target์ด ๊ฐ€๊ตฌ์› ๋ชจ๋‘์—๊ฒŒ ๋™์ผํ•œ ๊ฐ’์œผ๋กœ ์„ค์ •๋จ

    • ๊ฐ€์žฅ์ด ์—†๋Š” ๊ฐ€๊ตฌ์˜ ๊ฒฝ์šฐ ์ ์ˆ˜๊ฐ€ ๋งค๊ฒจ์ง€์ง€ ์•Š์œผ๋ฏ€๋กœ ์ด๋Ÿฌํ•œ ๊ฒฝ์šฐ์—๋Š” ์˜ˆ์ธก์น˜๋ฅผ 4(non-vulnerable)๋กœ ์„ค์ •

test_ids = list(final.loc[final['Target'].isnull(), 'idhogar'])

๐Ÿ“Œ ์˜ˆ์ธก์„ ์œ„ํ•œ ํ•จ์ˆ˜

  • ๋ชจ๋ธ, train ์„ธํŠธ, train label ๋ฐ test ์„ธํŠธ๋ฅผ ๊ฐ€์ ธ์™€ ๋‹ค์Œ ์ž‘์—…๋“ค์„ ์ˆ˜ํ–‰

    • fit()์„ ํ™œ์šฉํ•˜์—ฌ train ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ๋ชจ๋ธ์„ ํ•™์Šต

    • predict()๋ฅผ ํ™œ์šฉํ•˜์—ฌ test ๋ฐ์ดํ„ฐ๋กœ ์˜ˆ์ธก

    • ์ด๋ฅผ ์ €์žฅํ•˜์—ฌ ์ œ์ถœ ํŒŒ์ผ๋กœ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋„๋ก submission ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ์ƒ์„ฑ

def submit(model, train, train_labels, test, test_ids):
    
    model.fit(train, train_labels) # ํ•™์Šต
    predictions = model.predict(test) # ์˜ˆ์ธก
    predictions = pd.DataFrame({'idhogar': test_ids,
                               'Target': predictions})

    # ์ œ์ถœ์šฉ df ๋งŒ๋“ค๊ธฐ
    submission = submission_base.merge(predictions, 
                                       on = 'idhogar',
                                       how = 'left').drop(columns = ['idhogar'])
    
    # ๊ฐ€์žฅ x -> 4๋กœ ์ฑ„์šฐ๊ธฐ
    submission['Target'] = submission['Target'].fillna(4).astype(np.int8)

    return submission 
### RandomForest๋กœ ์˜ˆ์ธก

rf_submission = submit(RandomForestClassifier(n_estimators = 100, 
                                              random_state=10, n_jobs = -1), 
                         train_set, train_labels, test_set, test_ids)

rf_submission.to_csv('rf_submission.csv', index = False)
  • ์˜ˆ์ธก ์„ฑ๋Šฅ: 0.370

4-5. ๋ณ€์ˆ˜ ์„ ํƒ

  • ๋ชจ๋ธ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ํ•œ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•

    • ๋ชจ๋ธ์— ๊ฐ€์žฅ ์œ ์šฉํ•œ ๊ธฐ๋Šฅ๋งŒ ์œ ์ง€ํ•˜๋ ค๊ณ  ํ•˜๋Š” ํ”„๋กœ์„ธ์Šค
  • ํ•ด๋‹น ๋…ธํŠธ๋ถ์—์„œ๋Š” ๋จผ์ € ์ƒ๊ด€ ๊ณ„์ˆ˜๊ฐ€ 0.95๋ณด๋‹ค ํฐ ์—ด์„ ์ œ๊ฑฐํ•œ ๋‹ค์Œ, scikit-learn์˜ RFECV๋ฅผ ํ™œ์šฉํ•œ ์žฌ๊ท€์ (recursive) feature ์ œ๊ฑฐ๋ฅผ ์ ์šฉ

a) ์ƒ๊ด€๋„๊ฐ€ ๋†’์€ ๋ณ€์ˆ˜ ์ œ๊ฑฐ

  • ์ƒ๊ด€๊ณ„์ˆ˜๊ฐ€ 0.95 ์ด์ƒ์ธ ๋ณ€์ˆ˜๋“ค์„ ์ œ๊ฑฐ
### ์ƒ๊ด€๊ณ„์ˆ˜๊ฐ€ 0.95 ์ด์ƒ์ธ ์ปฌ๋Ÿผ ํ™•์ธ

train_set = pd.DataFrame(train_set, columns = features)

# ์ƒ๊ด€๊ณ„์ˆ˜ ํ–‰๋ ฌ ์ƒ์„ฑ
corr_matrix = train_set.corr()

# ์ƒ์‚ผ๊ฐํ–‰๋ ฌ๋งŒ ์„ ํƒ
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))  # np.bool์€ ์ตœ์‹  numpy์—์„œ ์ œ๊ฑฐ๋จ

# ์ƒ๊ด€๊ณ„์ˆ˜๊ฐ€ 0.95 ์ด์ƒ์ธ ์ปฌ๋Ÿผ๋งŒ ์„ ํƒ
to_drop = [column for column in upper.columns if any(abs(upper[column]) > 0.95)]

to_drop
# ํ•ด๋‹น ์ปฌ๋Ÿผ drop

train_set = train_set.drop(columns = to_drop)
train_set.shape
# test data์—์„œ๋„ ํ•ด๋‹น feature๋“ค์„ ์ œ๊ฑฐ

test_set = pd.DataFrame(test_set, columns = features)
train_set, test_set = train_set.align(test_set, axis = 1, join = 'inner')

features = list(train_set.columns)

b) RandomForest๋ฅผ ํ™œ์šฉํ•œ ์žฌ๊ท€์  ๋ณ€์ˆ˜ ์ œ๊ฑฐ(RFE)

  • sklearn.RFECV

    • ๊ต์ฐจ ๊ฒ€์ฆ์„ ํ†ตํ•œ ์ค‘๋ณต์ ์ธ feature ์ œ๊ฑฐ๋ฅผ ์˜๋ฏธ

    • ๋ฐ˜๋ณต์ ์ธ ๋ฐฉ์‹์œผ๋กœ ํ”ผ์ฒ˜ ์ค‘์š”๋„๊ฐ€ ์žˆ๋Š” ๋ชจ๋ธ์„ ์‚ฌ์šฉ

    • ๊ฐ ๋ฐ˜๋ณต๋งˆ๋‹ค ํ”ผ์ฒ˜์˜ ์ผ๋ถ€ ๋˜๋Š” ์„ค์ •๋œ ๊ฐœ์ˆ˜์˜ ํ”ผ์ฒ˜๋ฅผ ์ œ๊ฑฐ

    • ๊ต์ฐจ ๊ฒ€์ฆ ์ ์ˆ˜๊ฐ€ ๋” ์ด์ƒ ํ–ฅ์ƒ๋˜์ง€ ์•Š์„ ๋•Œ๊นŒ์ง€ ์ž‘์—…์„ ๊ณ„์† ๋ฐ˜๋ณตํ•จ

  • selector ๊ฐ์ฒด๋ฅผ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ๋ชจ๋ธ, ๊ฐ ๋ฐ˜๋ณต๋งˆ๋‹ค ์ œ๊ฑฐํ•  feature์˜ ์ˆ˜, ๊ต์ฐจ ๊ฒ€์ฆ ์‹œ์˜ fold ์ˆ˜, ์‚ฌ์šฉ์ž ์ง€์ • ์ ์ˆ˜ ๊ณ„์‚ฐ๊ธฐ ๋ฐ ์„ ํƒ์„ ์•ˆ๋‚ดํ•˜๋Š” ๊ธฐํƒ€ parameter๋“ค์„ ์„ค์ •

from sklearn.feature_selection import RFECV

# ๋ณ€์ˆ˜ ์„ ํƒ์„ ์œ„ํ•œ ๋ชจ๋ธ ๊ฐ์ฒด ์ƒ์„ฑ
estimator = RandomForestClassifier(random_state = 10, n_estimators = 100, n_jobs = -1)

# selector ๊ฐ์ฒด ์ƒ์„ฑ
selector = RFECV(estimator, step = 1, cv = 3, scoring = scorer, n_jobs = -1)
### ํ•™์Šต

selector.fit(train_set, train_labels)
### ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”

plt.plot(range(1, len(selector.cv_results_['mean_test_score']) + 1),
         selector.cv_results_['mean_test_score'])

plt.xlabel('Number of Features'); plt.ylabel('Macro F1 Score'); plt.title('Feature Selection Scores');
selector.n_features_
  • ์ตœ๋Œ€ 96๊ฐœ์˜ ๋ณ€์ˆ˜๋ฅผ ์ถ”๊ฐ€ํ•˜๋ฉด ์ ์ˆ˜๊ฐ€ ํ–ฅ์ƒ๋œ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ

    • selector์— ๋”ฐ๋ฅด๋ฉด ์ด๊ฒƒ์ด ์ตœ์ ์˜ feature ๊ฐœ์ˆ˜์ž„
  • ๊ฐ feature์˜ ์ˆœ์œ„๋Š” ํ›ˆ๋ จ๋œ selector ๊ฐ์ฒด๋ฅผ ํ†ตํ•ด ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ

    • ์—ฌ๋Ÿฌ ๋ฒˆ์˜ ๋ฐ˜๋ณต์— ๊ฑธ์ฒ˜ ํ‰๊ท ํ™”๋œ ๊ธฐ๋Šฅ ์ค‘์š”๋„๋ฅผ ํ‘œ์‹œ

    • ์ˆœ์œ„๊ฐ€ ๋™์ผํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ˆœ์œ„๊ฐ€ 1์ธ feature๋งŒ ์œ ์ง€๋จ

rankings = pd.DataFrame({'feature': list(train_set.columns), 'rank': list(selector.ranking_)}).sort_values('rank')
rankings.head(10)

์ตœ์ข… ๋ณ€์ˆ˜ ์„ ํƒ & ๊ต์ฐจ ๊ฒ€์ฆ ์ˆ˜ํ–‰

train_selected = selector.transform(train_set)
test_selected = selector.transform(test_set)
# df๋กœ ์žฌ๊ฐ€๊ณต

selected_features = train_set.columns[np.where(selector.ranking_==1)]
train_selected = pd.DataFrame(train_selected, columns = selected_features)
test_selected = pd.DataFrame(test_selected, columns = selected_features)
# ์•ž์„œ ์‚ฌ์šฉํ•œ ๊ฒƒ๊ณผ ๋™์ผํ•œ RandomForest ๋ชจ๋ธ ๊ฐ์ฒด(์•„์ง ์ •์˜๋˜์–ด ์žˆ์ง€ ์•Š๋‹ค๋ฉด ์ƒ์„ฑ)
model = RandomForestClassifier(n_estimators = 100, random_state = 10, n_jobs = -1)
model_results = cv_model(train_selected, train_labels, model, 'RF-SEL', model_results)
# ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”

model_results.set_index('model', inplace = True)
model_results['cv_mean'].plot.bar(color = 'orange', figsize = (8, 6),
                                  yerr = list(model_results['cv_std']),
                                 edgecolor = 'k', linewidth = 2)
plt.title('Model F1 Score Results');
plt.ylabel('Mean F1 Score (with error bar)');
model_results.reset_index(inplace = True)
  • feature selection ํ•œ ๋ชจ๋ธ์ด ๊ต์ฐจ ๊ฒ€์ฆ์—์„œ ์•ฝ๊ฐ„ ๋” ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ž„

5. ๋ชจ๋ธ ์—…๋ฐ์ดํŠธ

5-1. Light Gradient Boosting Machine

  • Kaggle ๋Œ€ํšŒ์—์„œ ๋ฐ์ดํ„ฐ๊ฐ€ ๊ตฌ์กฐํ™”๋˜์–ด ์žˆ๊ณ (ํ…Œ์ด๋ธ” ํ˜•ํƒœ), ๋ฐ์ดํ„ฐ ์…‹์ด ๊ทธ๋ฆฌ ํฌ์ง€ ์•Š์€ ๊ฒฝ์šฐ(๊ด€์ธก์น˜๊ฐ€ ๋ฐฑ๋งŒ ๊ฐœ ๋ฏธ๋งŒ) GBM(Gradient Boosting Machine)์ด ๋†’์€ ๋น„์œจ๋กœ ์šฐ์Šนํ•จ

  • Gradient Boosting Machine์„ ์œ„ํ•œ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ์ตœ์ ํ™”๋Š” ์ฃผ๋กœ ๋ชจ๋ธ ์ตœ์ ํ™”๋ฅผ ํ†ตํ•ด ์ˆ˜ํ–‰๋จ

    • ์ด์ „์— ์ž˜ ์ž‘๋™ํ•˜์˜€๋˜ ๊ฐ’๋“ค์„ ๊ธฐ์ค€์œผ๋กœ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์„ค์ •ํ•จ
  • n_estimators๋ฅผ ์ผ๋‹จ 10000์œผ๋กœ ์„ค์ •ํ•˜์˜€์ง€๋งŒ, ํ•ด๋‹น ์ˆซ์ž์— ๋„๋‹ฌํ•˜์ง€๋Š” ๋ชปํ•จ

    • ์šฐ๋ฆฌ๋Š” early_stopping_rounds๋ฅผ ์„ค์ •ํ•จ

      • ๊ต์ฐจ ๊ฒ€์ฆ metric์ด ๊ฐœ์„ ๋˜์ง€ ์•Š์„ ๋•Œ train estimator๋ฅผ ์ข…๋ฃŒ์‹œํ‚ด

      • display๋Š” %%capture์™€ ๊ฒฐํ•ฉํ•˜์—ฌ train ์ค‘ ์‚ฌ์šฉ์ž ์ง€์ • ์ •๋ณด๋ฅผ ๋ณด์—ฌ์ฃผ๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ๋จ

๐Ÿ“Œ ์กฐ๊ธฐ ์ข…๋ฃŒ(Early Stopping)๋ฅผ ํ†ตํ•œ estimator์˜ ๊ฐœ์ˆ˜ ์ •ํ•˜๊ธฐ

  • estimator์˜ ์ˆ˜(n_estimators ๋˜๋Š” num_boost_rounds๋ผ๊ณ  ํ•˜๋Š” ์•™์ƒ๋ธ”์˜ ์˜์‚ฌ ๊ฒฐ์ • ํŠธ๋ฆฌ ์ˆ˜)๋ฅผ ์„ ํƒํ•˜๊ธฐ ์œ„ํ•ด 5-fold ๊ต์ฐจ ๊ฒ€์ฆ์„ ํ™œ์šฉํ•˜์—ฌ ์กฐ๊ธฐ ์ค‘์ง€ ์ˆ˜ํ–‰

    • Macro F1-score๋กœ ์ธก์ •ํ•œ ์„ฑ๋Šฅ์ด 100ํšŒ์˜ train ๋ผ์šด๋“œ ๋™์•ˆ ๊ฐœ์„ ๋˜์ง€ ์•Š์„ ๋•Œ๊นŒ์ง€ estimator๋ฅผ ๊ณ„์† ์ถ”๊ฐ€ํ•  ์ˆ˜ ์žˆ์Œ

    • ํ•ด๋‹น metric์„ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ์‚ฌ์šฉ์ž ์ง€์ • metric์„ ์ •์˜ํ•ด์•ผ ํ•จ

def macro_f1_score(labels, predictions):
    predictions = predictions.reshape(len(np.unique(labels)), -1).argmax(axis = 0)
    
    metric_value = f1_score(labels, predictions, average = 'macro')
    
    return 'macro_f1', metric_value, True
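
์œ„ macro_f1_score ์•ˆ์˜ reshape ๋กœ์ง์€, (๊ตฌ ๋ฒ„์ „) LightGBM sklearn API๊ฐ€ multiclass ์˜ˆ์ธก ํ™•๋ฅ ์„ "ํด๋ž˜์Šค ์šฐ์„ (class-major)" 1์ฐจ์› ๋ฐฐ์—ด๋กœ ํ‰ํƒ„ํ™”ํ•ด ์ „๋‹ฌํ•œ๋‹ค๋Š” ๊ฐ€์ • ํ•˜์—, (ํด๋ž˜์Šค ์ˆ˜, ์ƒ˜ํ”Œ ์ˆ˜) ํ˜•ํƒœ๋กœ ๋ณต์›ํ•œ ๋’ค argmax๋ฅผ ์ทจํ•˜๋Š” ๊ฒƒ. ์•„๋ž˜๋Š” 4๊ฐœ ํด๋ž˜์Šค, 3๊ฐœ ์ƒ˜ํ”Œ์ธ ๊ฐ€์ƒ์˜ ํ™•๋ฅ  ํ–‰๋ ฌ๋กœ ์ด ๋ณต์›์ด ํ†ต์ƒ์˜ ํ–‰ ๋ฐฉํ–ฅ argmax์™€ ์ผ์น˜ํ•จ์„ ํ™•์ธํ•˜๋Š” numpy ์Šค์ผ€์น˜.

```python
import numpy as np

# ๊ฐ€์ƒ์˜ ์˜ˆ์ธก ํ™•๋ฅ : ํ–‰ = ์ƒ˜ํ”Œ, ์—ด = ํด๋ž˜์Šค(4๊ฐœ)
probs = np.array([[0.1, 0.2, 0.3, 0.4],   # ์ƒ˜ํ”Œ 0 -> ํด๋ž˜์Šค 3
                  [0.7, 0.1, 0.1, 0.1],   # ์ƒ˜ํ”Œ 1 -> ํด๋ž˜์Šค 0
                  [0.2, 0.5, 0.2, 0.1]])  # ์ƒ˜ํ”Œ 2 -> ํด๋ž˜์Šค 1

# class-major ํ‰ํƒ„ํ™”: [ํด๋ž˜์Šค0์˜ ์ƒ˜ํ”Œ 3๊ฐœ, ํด๋ž˜์Šค1์˜ ์ƒ˜ํ”Œ 3๊ฐœ, ...]
flat = probs.T.ravel()

# macro_f1_score ์•ˆ์˜ ๋ณต์› ๋กœ์ง๊ณผ ๋™์ผ: (ํด๋ž˜์Šค ์ˆ˜, -1)๋กœ reshape ํ›„ ํด๋ž˜์Šค ์ถ• argmax
recovered = flat.reshape(4, -1).argmax(axis = 0)

print(recovered)  # [3 0 1]
print(np.array_equal(recovered, probs.argmax(axis = 1)))  # True
```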

๐Ÿ“Œ Stratified K-Fold๋ฅผ ์œ„ํ•œ ํ•จ์ˆ˜

  • Stratified K-fold ๊ต์ฐจ ๊ฒ€์ฆ ๋ฐ ์กฐ๊ธฐ ์ค‘์ง€๋ฅผ ํ†ตํ•ด Gradient Boosting Machine์„ ํ•™์Šต

  • train ๋ฐ์ดํ„ฐ์— ๊ณผ์ ํ•ฉ๋˜๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€

  • ๊ต์ฐจ ๊ฒ€์ฆ์„ ํ†ตํ•ด ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ฐ fold์— ๋Œ€ํ•œ ํ™•๋ฅ ๋กœ ์˜ˆ์ธก์„ ๊ธฐ๋ก

  • ๊ฐ fold์˜ ์˜ˆ์ธก๊ฐ’์„ ๋ฐ˜ํ™˜ํ•œ ๋‹ค์Œ ์ œ์ถœ๋ฌผ์„ ๋ฐ˜ํ™˜ํ•˜์—ฌ ๊ฒฐ๊ณผ ํ™•์ธ

from sklearn.model_selection import StratifiedKFold
import lightgbm as lgb
from IPython.display import display
def model_gbm(features, labels, test_features, test_ids, 
              nfolds = 5, return_preds = False, hyp = None):
    
    feature_names = list(features.columns) # ๋ณ€์ˆ˜๋“ค์„ ์ €์žฅ

    ### ์‚ฌ์šฉ์ž ์ง€์ • hyper parameter์— ๋Œ€ํ•œ ์˜ต์…˜
    # ์‚ฌ์šฉ์ž ์ง€์ • hyper parameter๊ฐ€ ์žˆ๋Š” ๊ฒฝ์šฐ
    if hyp is not None:
        # early stopping์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ estimator ์ˆ˜๊ฐ€ ํ•„์š”ํ•˜์ง€ ์•Š์Œ
        if 'n_estimators' in hyp:
            del hyp['n_estimators']
        params = hyp
    else:
        # ๊ธฐ๋ณธ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •
        params = {'boosting_type': 'dart', 
                  'colsample_bytree': 0.88, 
                  'learning_rate': 0.028, 
                   'min_child_samples': 10, 
                   'num_leaves': 36, 'reg_alpha': 0.76, 
                   'reg_lambda': 0.43, 
                   'subsample_for_bin': 40000, 
                   'subsample': 0.54, 
                   'class_weight': 'balanced'}
    
    # ๋ชจ๋ธ ๊ฐ์ฒด ์ƒ์„ฑ
    model = lgb.LGBMClassifier(**params, objective = 'multiclass', 
                               n_jobs = -1, n_estimators = 10000,
                               random_state = 10)
  
    # Stratified k-Fold ๊ต์ฐจ ๊ฒ€์ฆ
    strkfold = StratifiedKFold(n_splits = nfolds, shuffle = True)
    
    # ๊ฐ fold๋งˆ๋‹ค ์˜ˆ์ธกํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅ
    predictions = pd.DataFrame()
    importances = np.zeros(len(feature_names))
    
    # ์ธ๋ฑ์‹ฑ์„ ์œ„ํ•ด array๋กœ ๋ณ€ํ™˜
    features = np.array(features)
    test_features = np.array(test_features)
    labels = np.array(labels).reshape((-1 ))
    
    valid_scores = [] # ๊ฒ€์ฆ ์ ์ˆ˜
    
    ### ๊ฐ fold ๋งˆ๋‹ค
    for i, (train_indices, valid_indices) in enumerate(strkfold.split(features, labels)):
        # ๊ฐ fold๋งˆ๋‹ค ์˜ˆ์ธก ์ˆ˜ํ–‰ -> ๊ฒฐ๊ณผ ์ €์žฅ
        fold_predictions = pd.DataFrame()
        
        # train data / valid data
        X_train = features[train_indices]
        y_train = labels[train_indices]
        X_valid = features[valid_indices]
        y_valid = labels[valid_indices]
        
        # ํ•™์Šต(early stopping ์ ์šฉ)
        model.fit(X_train, y_train, early_stopping_rounds = 100, 
                  eval_metric = macro_f1_score,
                  eval_set = [(X_train, y_train), (X_valid, y_valid)],
                  eval_names = ['train', 'valid'],
                  verbose = 200)
        
        # ๊ฒ€์ฆ ์ ์ˆ˜ ์ €์žฅ
        valid_scores.append(model.best_score_['valid']['macro_f1'])
        
        # "ํ™•๋ฅ "์„ ํ†ตํ•œ ์˜ˆ์ธก ์ˆ˜ํ–‰
        fold_probabilitites = model.predict_proba(test_features)
        
        # ๊ฐœ๋ณ„ ์ปฌ๋Ÿผ์œผ๋กœ ์˜ˆ์ธก๊ฐ’ ์ €์žฅ
        for j in range(4):
            fold_predictions[(j + 1)] = fold_probabilitites[:, j]
            
        # ์˜ˆ์ธก์„ ์œ„ํ•ด ํ•„์š”ํ•œ ์ •๋ณด ์ถ”๊ฐ€
        fold_predictions['idhogar'] = test_ids
        fold_predictions['fold'] = (i+1)
        
        # ์˜ˆ์ธก์„ ๊ธฐ์กด ์˜ˆ์ธก์— ์ƒˆ ํ–‰์œผ๋กœ ์ถ”๊ฐ€
        predictions = predictions.append(fold_predictions)
        
        # ๊ฐ fold ๋ณ„ ํ”ผ์ฒ˜ ์ค‘์š”๋„
        importances += model.feature_importances_ / nfolds   
        
        # fold์— ๋Œ€ํ•œ ์ •๋ณด ํ‘œ์‹œ
        display(f'Fold {i + 1}, Validation Score: {round(valid_scores[i], 5)}, Estimators Trained: {model.best_iteration_}')

    # ํ”ผ์ณ ์ค‘์š”๋„๋ฅผ df๋กœ ์ €์žฅ
    feature_importances = pd.DataFrame({'feature': feature_names,
                                        'importance': importances})
    
    # ๊ฒ€์ฆ ์ ์ˆ˜์— ๋Œ€ํ•œ ์ •๋ณด ํ‘œ์‹œ
    valid_scores = np.array(valid_scores)
    display(f'{nfolds} cross validation score: {round(valid_scores.mean(), 5)} with std: {round(valid_scores.std(), 5)}.')
    
    # ์˜ˆ์ธก์ด ํ‰๊ท ์„ ์ดˆ๊ณผํ•˜์ง€ ์•Š๋Š”์ง€ ํ™•์ธํ•˜๋ ค๋ฉด
    if return_preds:
        predictions['Target'] = predictions[[1, 2, 3, 4]].idxmax(axis = 1)
        predictions['confidence'] = predictions[[1, 2, 3, 4]].max(axis = 1)
        
        return predictions, feature_importances
    
    # fold ๋ณ„ ์˜ˆ์ธก์„ ํ‰๊ท 
    predictions = predictions.groupby('idhogar', as_index = False).mean()
    
    # ํด๋ž˜์Šค ๋ฐ ๊ด€๋ จ ํ™•๋ฅ  ์ฐพ๊ธฐ
    predictions['Target'] = predictions[[1, 2, 3, 4]].idxmax(axis = 1)
    predictions['confidence'] = predictions[[1, 2, 3, 4]].max(axis = 1)
    predictions = predictions.drop(columns = ['fold'])
    
    # ๊ฐ ๊ฐœ์ฒด์— ๋Œ€ํ•ด "ํ•˜๋‚˜"์˜ ์˜ˆ์ธก๊ฐ’์„ ๊ฐ–๋„๋ก ๊ธฐ์กด์˜ ๊ฐ’๊ณผ ๋ณ‘ํ•ฉ
    submission = submission_base.merge(predictions[['idhogar', 'Target']], 
                                       on = 'idhogar', how = 'left').drop(columns = ['idhogar'])
        
    # ๊ฒฐ์ธก์น˜ -> class = 4๋กœ ์ฑ„์šฐ๊ธฐ
    # (๊ฐ€์žฅ์ด ์—†๋Š” ๊ฐ€๊ตฌ๋Š” ์ ์ˆ˜ ๊ณ„์‚ฐ์— ํฌํ•จ๋˜์ง€ ์•Š์Œ)
    submission['Target'] = submission['Target'].fillna(4).astype(np.int8)
    
    return submission, feature_importances, valid_scores

a) ์กฐ๊ธฐ ์ค‘์ง€(early stopping)๋ฅผ ์‚ฌ์šฉํ•œ ๊ต์ฐจ ๊ฒ€์ฆ

  • train ์„ธํŠธ์—์„œ ๊ณผ์ ํ•ฉ์„ ๋ฐฉ์ง€ํ•˜๋Š” ๊ฐ€์žฅ ํšจ๊ณผ์ ์ธ ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜

    • ๊ฒ€์ฆ ์ ์ˆ˜๊ฐ€ ๊ฐœ์„ ๋˜์ง€ ์•Š๋Š” ๊ฒƒ์ด ๋ถ„๋ช…ํ•ด์ง€๋ฉด ๋ชจ๋ธ ๋ณต์žก์„ฑ์„ ๋Š˜๋ฆด ์ˆ˜ ์—†๊ธฐ ๋•Œ๋ฌธ
  • ํ•ด๋‹น ๊ณผ์ •์„ ์—ฌ๋Ÿฌ fold์—์„œ ๋ฐ˜๋ณตํ•˜๋ฉด ๋‹จ์ผ fold๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ ๋ฐœ์ƒ๋  ์ˆ˜ ์žˆ๋Š” ํŽธํ–ฅ(bias)์„ ์ค„์ด๋Š” ๋ฐ ๋„์›€์ด ๋จ

  • ๋˜ํ•œ, ๋น ๋ฅธ ํ•™์Šต ๊ฐ€๋Šฅ

  • Gradient Boosting Machine์—์„œ eatimator์˜ ์ˆ˜๋ฅผ ์„ ํƒํ•˜๋Š” ๊ฐ€์žฅ ์ข‹์€ ๋ฐฉ๋ฒ•

%%capture --no-display

predictions, gbm_fi = model_gbm(train_set, train_labels, test_set, test_ids, return_preds=True)
  • LGBM์„ ํ™œ์šฉํ•˜์—ฌ ๊ต์ฐจ ๊ฒ€์ฆ์„ ์ˆ˜ํ–‰ํ•œ ๊ฒฐ๊ณผ ์„ฑ๋Šฅ์ด ๋งŽ์ด ๋†’์•„์ง
### ์˜ˆ์ธก๊ฐ’ ํ™•์ธ

predictions.head()
  • ๊ฐ fold์— ๋Œ€ํ•ด 1, 2, 3, 4์—ด์€ ๊ฐ target์— ๋Œ€ํ•œ ํ™•๋ฅ ์„ ๋‚˜ํƒ€๋ƒ„

    • ํ™•๋ฅ ์ด ๊ฐ€์žฅ ๋†’์€ ํด๋ž˜์Šค๊ฐ€ Target์œผ๋กœ ์„ ํƒ๋˜๊ณ , ๊ทธ ์ตœ๋Œ€ ํ™•๋ฅ ์ด confidence๊ฐ€ ๋จ
  • 5๊ฐœ์˜ fold ๋ชจ๋‘์— ๋Œ€ํ•œ ์˜ˆ์ธก์„ ๊ฐ€์ง€๊ณ  ์žˆ์Œ

    • ๋‹ค๋ฅธ fold์— ๋Œ€ํ•œ ๊ฐ target์— ๋Œ€ํ•œ ์‹ ๋ขฐ๋„ ํ‘œ์‹œ ๊ฐ€๋Šฅ
plt.rcParams['font.size'] = 18

### Kdeplot
g = sns.FacetGrid(predictions, row = 'fold', hue = 'Target', aspect = 4)
g.map(sns.kdeplot, 'confidence');
g.add_legend();

plt.suptitle('Distribution of Confidence by Fold and Target', y = 1.05);
  • ๊ฐ ํด๋ž˜์Šค์— ๋Œ€ํ•œ ์‹ ๋ขฐ๋„๊ฐ€ ์ƒ๋Œ€์ ์œผ๋กœ ๋‚ฎ์Œ

    • ํด๋ž˜์Šค ๋ถˆ๊ท ํ˜•๊ณผ ๋†’์€ ์ถœํ˜„ ๋น„์œจ๋กœ ์ธํ•ด Target = 4์— ๋Œ€ํ•œ ์‹ ๋ขฐ๋„๊ฐ€ ๋†’์€ ๊ฒƒ์œผ๋กœ ๋ณด์ž„
### violinplot

plt.figure(figsize = (24, 12))
sns.violinplot(x = 'Target', y = 'confidence', hue = 'fold', data = predictions)
  • ๋ถˆ๊ท ํ˜• ํด๋ž˜์Šค์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ

    • ๋ชจ๋ธ์ด ๊ฐ๊ฐ์˜ ํด๋ž˜์Šค๋ฅผ ๊ตฌ๋ถ„ํ•˜๋Š” ๋ฐ ์–ด๋ ค์›€์„ ๊ฒช์„ ์ˆ˜ ์žˆ์Œ

    • ์ดํ›„ ์˜ˆ์ธก์น˜๋ฅผ ๋ณด๊ณ  ํ˜ผ๋ž€์„ ์•ผ๊ธฐํ•˜๋Š” ์œ„์น˜๋ฅผ ํƒ์ƒ‰ํ•  ์ˆ˜ ์žˆ์Œ


  • ๊ฐ ๊ฐ€๊ตฌ๋ณ„ ์˜ˆ์ธก ์ˆ˜ํ–‰ ์‹œ ๊ฐ fold์— ๋Œ€ํ•œ ์˜ˆ์ธก๊ฐ’์„ ํ‰๊ท ํ•จ

    • ๊ฐ๊ฐ์˜ ๋ชจ๋ธ์€ ์•ฝ๊ฐ„์”ฉ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ fold์— ๋Œ€ํ•ด ํ•™์Šต๋จ -> ์—ฌ๋Ÿฌ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•จ
  • Gradient Boosting Machine์€ ์•™์ƒ๋ธ” ๋ชจ๋ธ์ด๋ฉฐ, ์—ฌ๋Ÿฌ gbm ๋ชจ๋ธ์„ ํ™œ์˜ํ•˜์—ฌ meta-ensemble๋กœ ํ™œ์šฉ

# fold ๋ณ„ ์˜ˆ์ธก์„ ํ‰๊ท 
predictions = predictions.groupby('idhogar', as_index = False).mean()

# ํด๋ž˜์Šค ๋ฐ ๊ด€๋ จ ํ™•๋ฅ  ์ฐพ๊ธฐ
predictions['Target'] = predictions[[1, 2, 3, 4]].idxmax(axis = 1)
predictions['confidence'] = predictions[[1, 2, 3, 4]].max(axis = 1)
predictions = predictions.drop(columns = ['fold'])

# ๊ฐ target์— ๋Œ€ํ•œ ์‹ ๋ขฐ๋„ plotting
plt.figure(figsize = (10, 6))
sns.boxplot(x = 'Target', y = 'confidence', data = predictions)
plt.title('Confidence by Target')

plt.figure(figsize = (10, 6))
sns.violinplot(x = 'Target', y = 'confidence', data = predictions)
plt.title('Confidence by Target')
  • 5๊ฐœ์˜ fold์— ๋Œ€ํ•œ ํ‰๊ท  ์˜ˆ์ธก์„ ์ทจํ•˜๋ฉฐ, ์‚ฌ์‹ค์ƒ 5๊ฐœ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ๋ชจ๋ธ์„ ๊ฒฐํ•ฉํ•˜๋Š” ๊ฒƒ๊ณผ ๋™์ผ

  • ๊ฐ ๋ชจ๋ธ์€ ์•ฝ๊ฐ„์”ฉ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ๋ถ€๋ถ„ ์ง‘ํ•ฉ์— ๋Œ€ํ•ด์„œ ํ•™์Šต๋จ

%%capture
submission, gbm_fi, valid_scores = model_gbm(train_set, train_labels, 
                                             test_set, test_ids, return_preds=False)

submission.to_csv('gbm_baseline.csv', index = False)
### feature ์ค‘์š”๋„ ํ™•์ธ

_ = plot_feature_importances(gbm_fi, threshold = 0.95)
  • gbm์—์„œ ์ค‘์š”ํ•˜๊ฒŒ ์ž‘์šฉํ•˜๋Š” feature๋“ค์€ ์ฃผ๋กœ age(๋‚˜์ด)์—์„œ ํŒŒ์ƒ๋œ feature๋“ค์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ

  • education ๋ณ€์ˆ˜ ๋˜ํ•œ ์ค‘์š”ํ•œ ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚จ

b) ์„ ํƒ๋œ ๋ณ€์ˆ˜๋“ค๋งŒ ์ ์šฉํ•˜๊ธฐ

  • ์ค‘๋ณต feature ์ œ๊ฑฐ ์ž‘์—…์„ ํ†ตํ•ด ์„ ํƒ๋œ feature๋“ค์„ ํ™œ์šฉ
%%capture --no-display

### ์„ ํƒ๋œ ๋ณ€์ˆ˜๋“ค๋งŒ ๊ฐ€์ง€๊ณ  ๊ต์ฐจ ๊ฒ€์ฆ ์ˆ˜ํ–‰
submission, gbm_fi_selected, valid_scores_selected = model_gbm(train_selected, train_labels, 
                                                               test_selected, test_ids)
### ๊ฒฐ๊ณผ ์ €์žฅ

model_results = pd.concat([model_results,
                           pd.DataFrame({'model': ["GBM", "GBM_SEL"],
                                         'cv_mean': [valid_scores.mean(), valid_scores_selected.mean()],
                                         'cv_std':  [valid_scores.std(), valid_scores_selected.std()]})],
                          sort = True)  # DataFrame.append๋Š” pandas 2.0์—์„œ ์ œ๊ฑฐ๋จ
### ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”

model_results.set_index('model', inplace = True)
model_results['cv_mean'].plot.bar(color = 'orange', figsize = (8, 6),
                                  yerr = list(model_results['cv_std']),
                                 edgecolor = 'k', linewidth = 2)
plt.title('Model F1 Score Results')
plt.ylabel('Mean F1 Score (with error bar)')
model_results.reset_index(inplace = True)
  • 10-fold ๊ต์ฐจ ๊ฒ€์ฆ์„ ์ ์šฉํ•ด๋ณด์ž.
%%capture

### 1. ์ „์ฒด feature์— ๋Œ€ํ•ด..
# 5-folds -> 10-folds
submission, gbm_fi, valid_scores = model_gbm(train_set, train_labels, test_set, test_ids, 
                                             nfolds = 10, return_preds = False)
submission.to_csv('gbm_10fold.csv', index = False)
%%capture

### 2. ์„ ํƒ๋œ feature์— ๋Œ€ํ•ด์„œ๋งŒ
submission, gbm_fi_selected, valid_scores_selected = model_gbm(train_selected, train_labels, 
                                                               test_selected, test_ids, nfolds = 10)
submission.to_csv('gbm_10fold_selected.csv', index = False)
### ๊ฒฐ๊ณผ ์ €์žฅ

model_results = pd.concat([model_results,
                           pd.DataFrame({'model': ["GBM_10Fold", "GBM_10Fold_SEL"],
                                         'cv_mean': [valid_scores.mean(), valid_scores_selected.mean()],
                                         'cv_std':  [valid_scores.std(), valid_scores_selected.std()]})],
                          sort = True)
### ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”

model_results.set_index('model', inplace = True)
model_results['cv_mean'].plot.bar(color = 'orange', figsize = (8, 6), 
                                  edgecolor = 'k', linewidth = 2,
                                  yerr = list(model_results['cv_std']))
plt.title('Model F1 Score Results')
plt.ylabel('Mean F1 Score (with error bar)')
model_results.reset_index(inplace = True)
  • ๊ฐ€์žฅ ์ข‹์€ ๋ชจ๋ธ์€ ์„ ํƒ๋œ feature๋“ค๋กœ 10-fold ๊ต์ฐจ ๊ฒ€์ฆ์„ ์ˆ˜ํ–‰ํ•œ ๋ชจ๋ธ์ž„

  • ์ตœ์ ํ™”๋ฅผ ํ†ตํ•ด ์„ฑ๋Šฅ์„ ๋” ๊ฐœ์„ ์‹œํ‚ฌ ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋ผ ๊ธฐ๋Œ€๋จ

print(f"There are {gbm_fi_selected[gbm_fi_selected['importance'] == 0].shape[0]} features with no importance.")
  • ์šฐ๋ฆฌ๊ฐ€ ์‚ฌ์šฉํ•˜๋Š” ๋ชจ๋“  feature๋“ค์€ Gradient Boosting Machine์—์„œ ์–ด๋Š ์ •๋„ ์ค‘์š”ํ•œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

6. ๋ชจ๋ธ ์ตœ์ ํ™”(Model Optimization)

  • ๊ต์ฐจ ๊ฒ€์ฆ์„ ํ†ตํ•ด ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์กฐ์ •ํ•˜์—ฌ ๋จธ์‹  ๋Ÿฌ๋‹ ๋ชจ๋ธ์—์„œ ์ตœ๊ณ ์˜ ์„ฑ๋Šฅ์„ ์ด๋Œ์–ด ๋‚ด๋Š” ํ”„๋กœ์„ธ์Šค

  • ๋ชจ๋ธ ์ตœ์ ํ™” Options


1. ์ˆ˜๋™(Manual)

2. GridSearch

3. RandomSearch

4. ์ž๋™ํ™” ๊ธฐ๋ฒ•

  • 4์˜ ๊ฒฝ์šฐ Tree Parzen Estimator์™€ ํ•จ๊ป˜ Bayesian Optimization์˜ ์ˆ˜์ • ๋ฒ„์ „์„ ์‚ฌ์šฉํ•˜๋Š” Hyperopt๋ฅผ ํฌํ•จํ•œ ๋‹ค์ˆ˜์˜ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ ์‰ฝ๊ฒŒ ๊ตฌํ˜„ ๊ฐ€๋Šฅ -> ์ด๋ฅผ ํ™œ์šฉ

6-1. Hyperopt์„ ํ†ตํ•œ ๋ชจ๋ธ ํŠœ๋‹

  • ๋ฒ ์ด์ง€์•ˆ ์ตœ์ ํ™”์—๋Š” 4๊ฐ€์ง€ ๋ถ€๋ถ„์œผ๋กœ ๊ตฌ์„ฑ๋จ

1. ๋ชฉ์  ํ•จ์ˆ˜: ์ตœ๋Œ€ํ™”(๋˜๋Š” ์ตœ์†Œํ™”)ํ•˜๊ณ  ์‹ถ์€ ๊ฒƒ

2. ๋„๋ฉ”์ธ ์˜์—ญ: ๊ฒ€์ƒ‰ํ•  ์˜์—ญ

3. ๋‹ค์Œ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ์„ ํƒ์„ ์œ„ํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๊ณผ๊ฑฐ ๊ฒฐ๊ณผ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค์Œ ๊ฐ’์„ ์ œ์•ˆ

4. ๊ฒฐ๊ณผ ์ €์žฅ

from hyperopt import hp, tpe, Trials, fmin, STATUS_OK
from hyperopt.pyll.stochastic import sample
import csv
import ast
from timeit import default_timer as timer

a) ๋ชฉ์  ํ•จ์ˆ˜(Objective Function)

  • ๋ชจ๋ธ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์‚ฌ์šฉ๋˜๊ณ  ๊ด€๋ จ validation ์ ์ˆ˜๊ฐ€ ๋ฐ˜ํ™˜๋จ

  • Hyperopt์€ ์ตœ์†Œํ™” ํ•  ์ ์ˆ˜๋ฅผ ์š”๊ตฌํ•จ

    • 1 - Macro F1 score๋ฅผ ๋ฆฌํ„ด
  • hyper parameter์— ๋Œ€ํ•œ ์—ฌ๋Ÿฌ ์„ธ๋ถ€ ์‚ฌํ•ญ๋“ค์„ ์„ค์ •ํ•˜๋Š” ๋‹จ๊ณ„

def objective(hyperparameters, nfolds=5):

    global ITERATION
    ITERATION += 1
    
    # ํ•˜์œ„ ์ƒ˜ํ”Œ 
    subsample = hyperparameters['boosting_type'].get('subsample', 1.0)
    subsample_freq = hyperparameters['boosting_type'].get('subsample_freq', 0)
    
    boosting_type = hyperparameters['boosting_type']['boosting_type']
    
    if boosting_type == 'dart':
        hyperparameters['drop_rate'] = hyperparameters['boosting_type']['drop_rate']
    
    # Subsample and subsample frequency to top level keys
    hyperparameters['subsample'] = subsample
    hyperparameters['subsample_freq'] = subsample_freq
    hyperparameters['boosting_type'] = boosting_type
    
    # Whether or not to use limit maximum depth
    if not hyperparameters['limit_max_depth']:
        hyperparameters['max_depth'] = -1
    
    # Make sure parameters that need to be integers are integers
    for parameter_name in ['max_depth', 'num_leaves', 'subsample_for_bin', 
                           'min_child_samples', 'subsample_freq']:
        hyperparameters[parameter_name] = int(hyperparameters[parameter_name])

    if 'n_estimators' in hyperparameters:
        del hyperparameters['n_estimators']
    
    # Using stratified kfold cross validation
    strkfold = StratifiedKFold(n_splits = nfolds, shuffle = True)
    
    # Convert to arrays for indexing
    features = np.array(train_selected)
    labels = np.array(train_labels).reshape((-1 ))
    
    valid_scores = []
    best_estimators = []
    run_times = []
    
    model = lgb.LGBMClassifier(**hyperparameters, class_weight = 'balanced',
                               n_jobs=-1, metric = 'None',
                               n_estimators=10000)
    
    # Iterate through the folds
    for i, (train_indices, valid_indices) in enumerate(strkfold.split(features, labels)):
        
        # Training and validation data
        X_train = features[train_indices]
        X_valid = features[valid_indices]
        y_train = labels[train_indices]
        y_valid = labels[valid_indices]
        
        start = timer()
        # Train with early stopping
        model.fit(X_train, y_train, early_stopping_rounds = 100, 
                  eval_metric = macro_f1_score, 
                  eval_set = [(X_train, y_train), (X_valid, y_valid)],
                  eval_names = ['train', 'valid'],
                  verbose = 400)
        end = timer()
        
        # Record the validation fold score
        valid_scores.append(model.best_score_['valid']['macro_f1'])
        best_estimators.append(model.best_iteration_)
        
        run_times.append(end - start)
    
    score = np.mean(valid_scores)
    score_std = np.std(valid_scores)
    loss = 1 - score
    
    run_time = np.mean(run_times)
    run_time_std = np.std(run_times)
    
    estimators = int(np.mean(best_estimators))
    hyperparameters['n_estimators'] = estimators
    
    # Write to the csv file ('a' means append)
    of_connection = open(OUT_FILE, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([loss, hyperparameters, ITERATION, run_time, score, score_std])
    of_connection.close()
    
    # Display progress
    if ITERATION % PROGRESS == 0:
        display(f'Iteration: {ITERATION}, Current Score: {round(score, 4)}.')
    
    return {'loss': loss, 'hyperparameters': hyperparameters, 'iteration': ITERATION,
            'time': run_time, 'time_std': run_time_std, 'status': STATUS_OK, 
            'score': score, 'score_std': score_std}

b) ๋„๋ฉ”์ธ ์˜์—ญ(Search Space)

  • domain: ๊ฒ€์ƒ‰ํ•  ์ „์ฒด ๊ฐ’์˜ ๋ฒ”์œ„

  • boosting_type์ด goss์ธ ๊ฒฝ์šฐ subsample ๋น„์œจ์„ ๋ฐ˜๋“œ์‹œ 1.0์œผ๋กœ ์„ค์ •ํ•ด์•ผ ํ•จ

space = {
    'boosting_type': hp.choice('boosting_type', 
                              [{'boosting_type': 'gbdt', 
                                'subsample': hp.uniform('gdbt_subsample', 0.5, 1),
                                'subsample_freq': hp.quniform('gbdt_subsample_freq', 1, 10, 1)}, 
                               
                               {'boosting_type': 'dart', 
                                 'subsample': hp.uniform('dart_subsample', 0.5, 1),
                                 'subsample_freq': hp.quniform('dart_subsample_freq', 1, 10, 1),
                                 'drop_rate': hp.uniform('dart_drop_rate', 0.1, 0.5)},
                               
                                {'boosting_type': 'goss',
                                 'subsample': 1.0,
                                 'subsample_freq': 0}]),
         
    'limit_max_depth': hp.choice('limit_max_depth', [True, False]),
    'max_depth': hp.quniform('max_depth', 1, 40, 1),
    'num_leaves': hp.quniform('num_leaves', 3, 50, 1),
    'learning_rate': hp.loguniform('learning_rate', 
                                   np.log(0.025), 
                                   np.log(0.25)),
    'subsample_for_bin': hp.quniform('subsample_for_bin', 2000, 100000, 2000),
    'min_child_samples': hp.quniform('min_child_samples', 5, 80, 5),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_by_tree', 0.5, 1.0)
}
sample(space)

c) ์•Œ๊ณ ๋ฆฌ์ฆ˜

  • ๋‹ค์Œ ๊ฐ’์„ ์„ ํƒํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ Tree Parzen Estimator๋กœ, Bayes rule์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ชฉ์  ํ•จ์ˆ˜์˜ ๋Œ€์ฒด ๋ชจ๋ธ์„ ๊ตฌ์„ฑํ•จ

  • objective function์„ ์ตœ๋Œ€ํ™” ํ•˜๋Š” ๋Œ€์‹  ๋Œ€์ฒด ๋ชจ๋ธ์˜ Expected Improvement (EI)๋ฅผ ์ตœ๋Œ€ํ™”

algo = tpe.suggest

d) ๊ฒฐ๊ณผ ์ €์žฅ

  • ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅํ•˜๊ธฐ ์œ„ํ•ด ๋‘ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์„ ํ™œ์šฉ

1. Trials object: ๋ชฉ์  ํ•จ์ˆ˜์—์„œ ๋ฐ˜ํ™˜๋œ ๋ชจ๋“  ๊ฒƒ์„ ์ €์žฅ

2. ๋ฐ˜๋ณตํ•  ๋•Œ๋งˆ๋‹ค CSV ํŒŒ์ผ์— ์“ฐ๊ธฐ

# ๊ฒฐ๊ณผ ์ €์žฅํ•˜๊ธฐ
trials = Trials()

# ํŒŒ์ผ ์—ด๊ธฐ, ์—ฐ๊ฒฐํ•˜๊ธฐ
OUT_FILE = 'optimization.csv'
of_connection = open(OUT_FILE, 'w')
writer = csv.writer(of_connection)

MAX_EVALS = 100
PROGRESS = 10
N_FOLDS = 5
ITERATION = 0

# ์ปฌ๋Ÿผ๋ช… ์ž‘์„ฑ
headers = ['loss', 'hyperparameters', 'iteration', 'runtime', 'score', 'std']
writer.writerow(headers)

of_connection.close()
%%capture --no-display
display("Running Optimization for {} Trials.".format(MAX_EVALS))

# ์ตœ์ ํ™” ์ˆ˜ํ–‰
best = fmin(fn = objective, space = space, algo = tpe.suggest, trials = trials,
            max_evals = MAX_EVALS)
  • ํ•™์Šต์„ ์žฌ๊ฐœํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋™์ผํ•œ trials ๊ฐ์ฒด๋ฅผ ์ „๋‹ฌํ•˜๊ณ  ์ตœ๋Œ€ ๋ฐ˜๋ณต ํšŸ์ˆ˜๋ฅผ ๋Š˜๋ฆฌ๋ฉด ๋จ

  • ๋‚˜์ค‘์— ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด trials๋ฅผ json์œผ๋กœ ์ €์žฅํ•  ์ˆ˜ ์žˆ์Œ

import json

# trial ๊ฒฐ๊ณผ ์ €์žฅ
with open('trials.json', 'w') as f:
    f.write(json.dumps(str(trials)))

6-2. ์ตœ์ ํ™”๋œ ๋ชจ๋ธ ์‚ฌ์šฉํ•˜๊ธฐ

### ์ตœ์ ํ™” ๋œ ๋ชจ๋ธ ๊ฒฐ๊ณผ ๊ฐ€์ ธ์˜ค๊ธฐ

results = pd.read_csv(OUT_FILE).sort_values('loss', ascending = True).reset_index()
results.head()
### ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”

plt.figure(figsize = (8, 6))
sns.regplot(x = 'iteration', y = 'score', data = results)
plt.title("Optimization Scores")
plt.xticks(list(range(1, results['iteration'].max() + 1, 3)))
### ์ตœ์  parameter ์„ ํƒ

best_hyp = ast.literal_eval(results.loc[0, 'hyperparameters'])
best_hyp
%%capture

### ์„ ํƒ๋œ feature๋“ค๋กœ๋งŒ ๋ชจ๋ธ๋ง
submission, gbm_fi, valid_scores = model_gbm(train_selected, train_labels, 
                                             test_selected, test_ids, 
                                             nfolds = 10, return_preds=False)

model_results = pd.concat([model_results,
                           pd.DataFrame({'model': ["GBM_OPT_10Fold_SEL"],
                                         'cv_mean': [valid_scores.mean()],
                                         'cv_std':  [valid_scores.std()]})],
                          sort = True).sort_values('cv_mean', ascending = False)
%%capture

### ์ „์ฒด feature๋กœ ๋ชจ๋ธ๋ง
submission, gbm_fi, valid_scores = model_gbm(train_set, train_labels, 
                                             test_set, test_ids, 
                                             nfolds = 10, return_preds=False)

model_results = pd.concat([model_results,
                           pd.DataFrame({'model': ["GBM_OPT_10Fold"],
                                         'cv_mean': [valid_scores.mean()],
                                         'cv_std':  [valid_scores.std()]})],
                          sort = True).sort_values('cv_mean', ascending = False)
model_results.head()
### ์ œ์ผ ์„ฑ๋Šฅ์ด ์ข‹์€ ๋ชจ๋ธ๋กœ ์˜ˆ์ธก ํ›„ ๊ฒฐ๊ณผ ์ €์žฅ

submission.to_csv('gbm_opt_10fold_selected.csv', index = False)
  • ์ด ์‹œ์ ์—์„œ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•ด ์ตœ์ ํ™”๋ฅผ ๊ณ„์†ํ•˜๊ฑฐ๋‚˜, ๋” ๋งŽ์€ ๊ธฐ๋Šฅ ์—”์ง€๋‹ˆ์–ด๋ง/ ์ถ”๊ฐ€์ ์ธ ๋ชจ๋ธ ์Šคํƒ ๋˜๋Š” ์•™์ƒ๋ธ”์„ ์‹œ๋„ํ•˜๊ฑฐ๋‚˜, ์ฐจ์› ์ถ•์†Œ ๋˜๋Š” ์˜ค๋ฒ„์ƒ˜ํ”Œ๋ง๊ณผ ๊ฐ™์€ ๋” ์‹คํ—˜์ ์ธ ๋ฐฉ๋ฒ•์„ ๊ฒ€ํ† ํ•  ์ˆ˜ ์žˆ์Œ

    • ์˜ˆ์ธก๊ฐ’์„ ๊ฒ€ํ† ํ•˜์—ฌ ๋ชจ๋ธ์ด ์–ด๋””์„œ ์˜ค๋ฅ˜๋ฅผ ๋ฒ”ํ•˜๊ณ  ์žˆ๋Š”์ง€๋ฅผ ํ™•์ธํ•  ์˜ˆ์ •
_ = plot_feature_importances(gbm_fi)

7. ์˜ˆ์ธก ํ™•์ธ

  • test ๋ฐ์ดํ„ฐ์—์„œ ์˜ˆ์ธก๋œ label์˜ ๋ถ„ํฌ๋ฅผ ์‹œ๊ฐํ™”ํ•˜์—ฌ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ

    • train ๋ฐ์ดํ„ฐ์™€ ๋™์ผํ•œ ๋ถ„ํฌ๋ฅผ ๋ณด์ผ ๊ฒƒ์œผ๋กœ ์˜ˆ์ƒ๋จ

    • ๊ฐ€๊ตฌ๋ณ„ ์˜ˆ์ธก์— ๊ด€์‹ฌ์ด ์žˆ๊ธฐ์—, ๊ฐ ๊ฐ€๊ตฌ์— ๋Œ€ํ•œ ์˜ˆ์ธก๋งŒ ํ™•์ธํ•˜์—ฌ train ๋ฐ์ดํ„ฐ์˜ ์˜ˆ์ธก๊ณผ ๋น„๊ต


  • ๋‹ค์Œ ํžˆ์Šคํ† ๊ทธ๋žจ์€ ์ ˆ๋Œ€ ์นด์šดํŠธ ๋Œ€์‹  ์ƒ๋Œ€์ ์ธ ๋นˆ๋„๋ฅผ ํ‘œ์‹œํ•˜๋Š” ์ •๊ทœํ™” ๋œ ๊ฐ’

    • ์›๋ณธ์˜ ๋ฐ์ดํ„ฐ ์ˆ˜์™€ validation ๋ฐ์ดํ„ฐ์—์„œ์˜ ๋ฐ์ดํ„ฐ ์ˆ˜๊ฐ€ ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ
preds = submission_base.merge(submission, on = 'Id', how = 'left')
preds = pd.DataFrame(preds.groupby('idhogar')['Target'].mean())

# train data์—์„œ์˜ label์˜ ๋ถ„ํฌ ์‹œ๊ฐํ™”
fig, axes = plt.subplots(1, 2, sharey = True, figsize = (12, 6))
heads['Target'].sort_index().plot.hist(density = True,  # normed๋Š” matplotlib์—์„œ ์ œ๊ฑฐ๋จ
                                       edgecolor = 'k',
                                       linewidth = 2,
                                       ax = axes[0])
axes[0].set_xticks([1, 2, 3, 4])
axes[0].set_xticklabels(poverty_mapping.values(), rotation = 60)
axes[0].set_title('Train Label Distribution')

# Plot the predicted labels
preds['Target'].sort_index().plot.hist(density = True, 
                                       edgecolor = 'k',
                                       linewidth = 2,
                                       ax = axes[1])
axes[1].set_xticks([1, 2, 3, 4])
axes[1].set_xticklabels(poverty_mapping.values(), rotation = 60)
plt.subplots_adjust()
plt.title('Predicted Label Distribution')
heads['Target'].value_counts()
preds['Target'].value_counts()
  • ์•ฝ๊ฐ„์˜ ์ฐจ์ด๊ฐ€ ์žˆ์ง€๋งŒ train label์˜ ๋ถ„ํฌ์— ๊ฐ€๊นŒ์›€

    • target = 4๊ฐ€ ์•„๋‹Œ target = 3์ด ๊ณผ๋„ํ•˜๊ฒŒ ํ‘œํ˜„๋จ
  • ๋ถˆ๊ท ํ˜• ๋ถ„๋ฅ˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์†Œ์ˆ˜์˜ ํด๋ž˜์Šค๋ฅผ oversamplingํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ์žˆ์Œ

    • imbalanced learn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์‰ฝ๊ฒŒ ๊ตฌํ˜„ ๊ฐ€๋Šฅ
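
imbalanced-learn์˜ RandomOverSampler๊ฐ€ ํ•˜๋Š” ์ผ, ์ฆ‰ ์†Œ์ˆ˜ ํด๋ž˜์Šค๋ฅผ ๋ณต์› ์ถ”์ถœํ•ด ์ตœ๋‹ค ํด๋ž˜์Šค์™€ ํ‘œ๋ณธ ์ˆ˜๋ฅผ ๋งž์ถ”๋Š” ๋™์ž‘์„ numpy๋กœ ํ‰๋‚ด ๋‚ธ ์Šค์ผ€์น˜. `random_oversample`์€ ์„ค๋ช…์šฉ์œผ๋กœ ์ง€์–ด๋‚ธ ํ•จ์ˆ˜์ด๋ฉฐ, ์‹ค์ œ๋กœ๋Š” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ํŽธ์ด ๋‚ซ๋‹ค.

```python
import numpy as np

def random_oversample(X, y, seed = 10):
    """์†Œ์ˆ˜ ํด๋ž˜์Šค๋ฅผ ๋ณต์› ์ถ”์ถœํ•˜์—ฌ ๋ชจ๋“  ํด๋ž˜์Šค์˜ ํ‘œ๋ณธ ์ˆ˜๋ฅผ ์ตœ๋‹ค ํด๋ž˜์Šค์— ๋งž์ถค"""
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(y, return_counts = True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.where(y == c)[0]
        # ๋ถ€์กฑํ•œ ๋งŒํผ ๋ณต์› ์ถ”์ถœํ•˜์—ฌ ์ถ”๊ฐ€
        extra = rng.choice(c_idx, size = n_max - len(c_idx), replace = True)
        idx.append(np.concatenate([c_idx, extra]))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

# ๋ถˆ๊ท ํ˜• ์˜ˆ์‹œ: ํด๋ž˜์Šค 4(non-vulnerable)๊ฐ€ ์••๋„์ ์ธ ๊ฐ€์ƒ์˜ label
y = np.array([1, 2, 2, 3, 4, 4, 4, 4, 4, 4])
X = np.arange(len(y)).reshape(-1, 1)

X_res, y_res = random_oversample(X, y)
print(np.unique(y_res, return_counts = True)[1])  # [6 6 6 6]
```

train ๋ฐ์ดํ„ฐ์—๋งŒ ์ ์šฉํ•ด์•ผ ํ•˜๋ฉฐ(validation/test์— ์ ์šฉํ•˜๋ฉด ํ‰๊ฐ€๊ฐ€ ์™œ๊ณก๋จ), SMOTE์ฒ˜๋Ÿผ ํ•ฉ์„ฑ ํ‘œ๋ณธ์„ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•๋„ ๊ฐ™์€ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์— ํฌํ•จ๋˜์–ด ์žˆ๋‹ค.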

7-1. ๊ฒ€์ฆ(Validation)

  • test์šฉ ์˜ˆ์ธก์„ ํ†ตํ•ด label์˜ ๋ถ„ํฌ๋ฅผ train ๋ฐ์ดํ„ฐ์—์„œ์˜ ๋ถ„ํฌ์™€ ๋น„๊ตํ•  ์ˆ˜ ์žˆ์Œ

  • ์˜ˆ์ธก์„ ์‹ค์ œ ๊ฐ’๊ณผ ๋น„๊ตํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” train ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ„๋„์˜ validation ์„ธํŠธ๋กœ ๋ถ„ํ• ํ•ด์•ผ ํ•จ

    • 1000๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฒ€์ฆ์— ํ™œ์šฉ

    • ์ดํ›„ confusion matrix๋ฅผ ํ†ตํ•ด ์˜ค๋ถ„๋ฅ˜ ํƒ์ง€

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Split the data
X_train, X_valid, y_train, y_valid = train_test_split(train_selected,
                                                      train_labels,
                                                      test_size = 1000,
                                                      random_state = 10)

# Build and train the model
model = lgb.LGBMClassifier(**best_hyp, 
                           class_weight = 'balanced',
                           random_state = 10)
model.fit(X_train, y_train);

# Run validation
valid_preds = model.predict_proba(X_valid)
preds_df = pd.DataFrame(valid_preds, columns = [1, 2, 3, 4])

# Convert probabilities to class predictions
preds_df['prediction'] = preds_df[[1, 2, 3, 4]].idxmax(axis = 1)
preds_df['confidence'] = preds_df[[1, 2, 3, 4]].max(axis = 1)

preds_df.head()

print('F1 score:', round(f1_score(y_valid, preds_df['prediction'], average = 'macro'), 5))
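For reference, `average = 'macro'` computes F1 for each class separately and then takes the unweighted mean, so every poverty level counts equally regardless of its frequency. A pure-Python sketch on toy labels (not the notebook's data) to make the averaging explicit:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores
    (equivalent to sklearn's f1_score with average='macro')."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); zero when the class is never hit
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

y_true = [4, 4, 4, 4, 2, 1]
y_pred = [4, 4, 4, 2, 2, 2]
print(macro_f1(y_true, y_pred))  # (6/7 + 1/2 + 0) / 3 ≈ 0.4524
```

This is why a model that only gets the majority class right still scores poorly here.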

๐Ÿ“Œ Confusion matrix

  • By showing the differences between predicted and actual values, it reveals exactly where the model gets confused
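On a toy example (assuming scikit-learn is available), rows are true classes, columns are predictions, and the diagonal counts the correct cases:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 2, 2, 3, 4]
y_pred = [1, 2, 2, 2, 4, 4]

# Row i = actual class, column j = predicted class;
# the diagonal holds the correctly classified counts.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])
print(cm)
```

Here `cm[0, 1] == 1` means one observation of class 1 was mistaken for class 2.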
from sklearn.metrics import confusion_matrix
import itertools

### Function to plot a confusion matrix
def plot_confusion_matrix(cm, classes,
                          normalize = False,
                          title = 'Confusion matrix',
                          cmap = plt.cm.Oranges):
  
    ### Normalize each row to percentages of the true label
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)

    plt.figure(figsize = (10, 10))
    plt.imshow(cm, interpolation='nearest', cmap = cmap)
    plt.title(title, size = 24)
    plt.colorbar(aspect = 4)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation = 45, size = 14)
    plt.yticks(tick_marks, classes, size = 14)

    fmt = '.2f' if normalize else 'd' # formatting
    thresh = cm.max() / 2. # threshold for the text color
    
    ### Label each cell
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), fontsize = 20,
                 horizontalalignment = "center",
                 color = "white" if cm[i, j] > thresh else "black")
        
    plt.grid(None)
    plt.tight_layout()
    plt.ylabel('True label', size = 18)
    plt.xlabel('Predicted label', size = 18)

cm = confusion_matrix(y_valid, preds_df['prediction']) # confusion matrix

plot_confusion_matrix(cm, classes = ['Extreme', 'Moderate', 'Vulnerable', 'Non-Vulnerable'],
                      title = 'Poverty Confusion Matrix')
  • Reading the confusion matrix

    • Diagonal: (predicted value) == (actual value)

    • Everything else: misclassifications

  • The current model correctly predicts 25 observations with poverty = extreme, while misclassifying another 26 of them as poverty = moderate

  • Overall, the model is only highly accurate at identifying households with poverty = non-vulnerable

  • Normalizing the confusion matrix over the true labels shows, for each class, the percentage of true labels assigned to each prediction

plot_confusion_matrix(cm, normalize = True,
                      classes = ['Extreme', 'Moderate', 'Vulnerable', 'Non-Vulnerable'],
                      title = 'Poverty Confusion Matrix')
  • poverty = non-vulnerable์™ธ์˜ ํด๋ž˜์Šค๋Š” ์ž˜ ๊ตฌ๋ถ„ํ•˜์ง€ ๋ชปํ•˜๊ณ  ์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ

  • poverty = vulnerable ์ค‘ ์•ฝ 35%๋งŒ ์ •ํ™•ํ•˜๊ฒŒ ์‹๋ณ„ํ•˜๊ณ , ๋” ๋งŽ์€ ์ˆ˜๋Š” ์ž˜๋ชป ๋ถ„๋ฅ˜ํ•˜๊ณ  ์žˆ์Œ

๋ถˆ๊ท ํ˜• ๋ฐ์ดํ„ฐ ๋ถ„๋ฅ˜ ๋ฌธ์ œ๋Š” ๋ชจ๋ธ์ด ์ •ํ™•ํžˆ ์˜ˆ์ธก์„ ํ•˜๋Š” ๋ฐ ์–ด๋ ค์›€์ด ์žˆ์Œ

  • ์ด๋Ÿฌํ•œ ๊ฒฝ์šฐ oversampling์ด๋‚˜ ์—ฌ๋Ÿฌ ์˜์—ญ์—์„œ ๋‹ค์–‘ํ•œ ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๋Š” ๋“ฑ์˜ ๋ฐฉ๋ฒ•์„ ํƒํ•  ์ˆ˜๋Š” ์žˆ์ง€๋งŒ, ๋ฐ์ดํ„ฐ๋ฅผ ์ตœ๋Œ€ํ•œ ๋งŽ์ด ๋ชจ์œผ๋Š” ๊ฒƒ์ด ๊ถŒ์žฅ๋จ
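Short of collecting more data, reweighting is the cheapest fix; the `class_weight = 'balanced'` option passed to the LGBMClassifier above weights each class by `n_samples / (n_classes * class_count)`. A minimal sketch of that formula (the helper name and toy counts are illustrative):

```python
import numpy as np

def balanced_class_weights(y):
    """Reproduce the 'balanced' heuristic:
    weight_c = n_samples / (n_classes * count_c),
    so rare classes get proportionally larger weights."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy labels mimicking the competition's 4-class imbalance
y = np.array([4]*60 + [2]*15 + [3]*12 + [1]*9)
print(balanced_class_weights(y))
# class 4 (majority) gets weight 0.4, class 1 (rarest) gets ~2.67
```

With these weights, errors on the extreme-poverty class cost the model roughly seven times as much as errors on the non-vulnerable class.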

7-2. ์ฐจ์› ์ถ•์†Œ(Dimension Reduction)

  • ์„ ํƒ๋œ ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๋ช‡ ๊ฐ€์ง€ ๋‹ค๋ฅธ ์ฐจ์› ์ถ•์†Œ ๋ฐฉ๋ฒ•์„ ์ ์šฉํ•  ์ˆ˜ ์žˆ์Œ

    • ์‹œ๊ฐํ™” ๋˜๋Š” ๊ธฐ๊ณ„ ํ•™์Šต์„ ์œ„ํ•œ ์ „์ฒ˜๋ฆฌ ๋ฐฉ๋ฒ•์œผ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Œ
  • ์ข…๋ฅ˜

  1. PCA(Principal Components Analysis): ๋ฐ์ดํ„ฐ์—์„œ ๊ฐ€์žฅ ํฐ ๋ณ€๋™์ด ์ผ์–ด๋‚˜๋Š” ์ฐจ์›์„ ์ฐพ์Œ

  2. ICA(Independent Components Analysis): ๋‹ค๋ณ€๋Ÿ‰ ์‹ ํ˜ธ๋ฅผ ๋…๋ฆฝ์ ์ธ ์‹ ํ˜ธ๋กœ ๋ถ„๋ฆฌํ•˜๋ ค๊ณ  ์‹œ๋„

  3. TSNE(T-distributed Stochastic Neighbor Embedding): ๊ณ ์ฐจ์› ๋ฐ์ดํ„ฐ๋ฅผ ์ €์ฐจ์› ๋งค๋‹ˆํด๋“œ์— ๋งคํ•‘ํ•˜์—ฌ ๋ฐ์ดํ„ฐ ๋‚ด์˜ ๋กœ์ปฌ ๊ตฌ์กฐ๋ฅผ ์œ ์ง€

  4. UMAP(Uniform Manifold Approximation and Projection): ๋ฐ์ดํ„ฐ๋ฅผ ์ €์ฐจ์› ๋งค๋‹ˆํด๋“œ์— ๋งคํ•‘ํ•˜์ง€๋งŒ TSNE๋ณด๋‹ค ๋” ๋งŽ์€ ์ „์—ญ ๊ตฌ์กฐ๋ฅผ ๋ณด์กดํ•˜๋ ค๊ณ  ํ•˜๋Š” ๋น„๊ต์  ์ƒˆ๋กœ์šด ๊ธฐ์ˆ 

  • ๋„ค ๊ฐ€์ง€ ๋ฐฉ๋ฒ• ๋ชจ๋‘ ํŒŒ์ด์ฌ์—์„œ ๊ตฌํ˜„ํ•˜๊ธฐ๊ฐ€ ๋น„๊ต์  ๊ฐ„๋‹จํ•จ

  • ์‹œ๊ฐํ™”๋ฅผ ์œ„ํ•ด ์„ ํƒํ•œ ํ˜•์ƒ์„ 3์ฐจ์›์œผ๋กœ ๋งคํ•‘ํ•œ ๋‹ค์Œ ๋ชจ๋ธ๋ง ํ˜•์ƒ์œผ๋กœ PCA, ICA ๋ฐ UMAP์„ ์‚ฌ์šฉ

    • TSNE์—๋Š” ๋ณ€ํ™˜ ๋ฐฉ๋ฒ•์ด ์—†์œผ๋ฏ€๋กœ ์ „์ฒ˜๋ฆฌ์— ์‚ฌ์šฉํ•  ์ˆ˜ ์—†์Œ
!pip install umap-learn
from umap import UMAP
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import TSNE

from timeit import default_timer as timer

n_components = 3 # 3 dimensions

### Create the dimension reduction objects
umap = UMAP(n_components = n_components)
pca = PCA(n_components = n_components)
ica = FastICA(n_components = n_components)
tsne = TSNE(n_components = n_components)

train_df = train_selected.copy()
test_df = test_selected.copy()

for method, name in zip([umap, pca, ica, tsne], 
                        ['umap', 'pca', 'ica', 'tsne']):
    
    start = timer()
    reduction = method.fit_transform(train_selected)
    end = timer()

    # TSNE has no transform method, so it cannot be applied to the test data
    if name != 'tsne':
        test_reduction = method.transform(test_selected)
    
        # Add components to test data
        test_df['%s_c1' % name] = test_reduction[:, 0]
        test_df['%s_c2' % name] = test_reduction[:, 1]
        test_df['%s_c3' % name] = test_reduction[:, 2]

    # Add components to training data for visualization and modeling
    train_df['%s_c1' % name] = reduction[:, 0]
    train_df['%s_c2' % name] = reduction[:, 1]
    train_df['%s_c3' % name] = reduction[:, 2]
    
    # Print the elapsed time
    print(f'Method: {name} {round(end - start, 2)} seconds elapsed.')
from mpl_toolkits.mplot3d import Axes3D

def discrete_cmap(N, base_cmap=None):
    """Create an N-bin discrete colormap from the specified input map
    Source: https://gist.github.com/jakevdp/91077b0cae40f8f8244a"""

    base = plt.cm.get_cmap(base_cmap)
    color_list = base(np.linspace(0, 1, N))
    cmap_name = base.name + str(N)
    
    return base.from_list(cmap_name, color_list, N)

cmap = discrete_cmap(4, base_cmap = plt.cm.RdYlBu)

train_df['label'] = train_labels
### ๊ฐ๊ฐ์˜ ๋ฐฉ๋ฒ•์„ ํ™œ์šฉํ•˜์—ฌ ์‹œ๊ฐํ™”

for method, name in zip([umap, pca, ica, tsne], 
                        ['umap', 'pca', 'ica', 'tsne']):
    
    fig = plt.figure(figsize = (8, 8))
    ax = fig.add_subplot(111, projection='3d')
    
    p = ax.scatter(train_df['%s_c1' % name], train_df['%s_c2'  % name], train_df['%s_c3'  % name], 
                   c = train_df['label'].astype(int), cmap = cmap)
    
    plt.title(f'{name.capitalize()}', size = 22)
    fig.colorbar(p, aspect = 4, ticks = [1, 2, 3, 4])
  • Not much clustering is visible in these plots

    • This suggests that separating the poverty levels is difficult given the available data

  • Finally, we can train the model with the PCA, ICA, and UMAP components added, which further reduces the number of original features needed

train_df, test_df = train_df.align(test_df, axis = 1, join = 'inner')
%%capture

submission, gbm_fi, valid_scores = model_gbm(train_df, train_labels, 
                                             test_df, test_ids, nfolds = 10,
                                             hyp = best_hyp)
submission.to_csv('gbm_opt_10fold_dr.csv', index = False)
model_results = model_results.append(pd.DataFrame({'model': ["GBM_OPT_10Fold_DR"], 
                                                   'cv_mean': [valid_scores.mean()],
                                                   'cv_std':  [valid_scores.std()]}),
                                    sort = True)
model_results = model_results.sort_values('cv_mean')
model_results.set_index('model', inplace = True)
model_results['cv_mean'].plot.bar(color = 'orange', figsize = (10, 8),
                                  edgecolor = 'k', linewidth = 2,
                                  yerr = list(model_results['cv_std']))
plt.title('Model F1 Score Results')
plt.ylabel('Mean F1 Score (with error bar)')

model_results.reset_index(inplace = True)
  • ์ฐจ์›์ด ๊ฐ์†Œ๋œ ์„ฑ๋ถ„์€ ๋ชจํ˜•์˜ ์ „์ฒด ์„ฑ๋Šฅ์— ์•ฝ๊ฐ„ ์˜ํ–ฅ์„ ๋ฏธ์นจ

    • train data์˜ ๊ณผ์ ํ•ฉ์„ ์ดˆ๋ž˜ํ•  ์ˆ˜ ์žˆ์Œ
_ = plot_feature_importances(gbm_fi)
  • ์ฐจ์› ์ถ•์†Œ๋œ feature์˜ feature importance๊ฐ€ ๋งค์šฐ ๋†’์Œ

    • overfitting์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Œ
  • ์ฐจ์› ์ถ•์†Œ๋Š” label ์ •๋ณด๋ฅผ ์‚ฌ์šฉํ•˜์ง€ x

    • ๋ชจํ˜•์˜ ์˜ˆ์ธก์— ์ค‘์š”ํ•œ ์ •๋ณด๋“ค์„ ํฌํ•จํ•˜์ง€ ๋ชปํ•  ์ˆ˜๋„ ์žˆ์Œ

7-3. ๋‹จ์ผ ํŠธ๋ฆฌ ๋ชจ๋ธ ์‹œ๊ฐํ™”

  • RandomForestClassifier์—์„œ ํ•˜๋‚˜์˜ ์˜์‚ฌ ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ์‚ดํŽด๋ณผ ์ˆ˜ ์žˆ์Œ

    • ๊ฐ€์‹œ์„ฑ์„ ์œ„ํ•ด max_depth๋ฅผ ์ œํ•œํ•œ ๋‹ค์Œ ํŠธ๋ฆฌ๋ฅผ ์ „์ฒด์ ์œผ๋กœ ํ™•์žฅํ•จ
model = RandomForestClassifier(max_depth = 3, n_estimators = 10)
model.fit(train_selected, train_labels)

### Extract a single tree
estimator_limited = model.estimators_[5]
estimator_limited
  • ํ›ˆ๋ จ๋œ ํŠธ๋ฆฌ๋ฅผ ๊ฐ€์ ธ์™€ export_graphviz๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ .dot ํŒŒ์ผ๋กœ ๋‚ด๋ณด๋ƒ„
### ๋‹จ์ผ ํŠธ๋ฆฌ ๋ชจํ˜• ๋‚ด๋ณด๋‚ด๊ธฐ

from sklearn.tree import export_graphviz

export_graphviz(estimator_limited, out_file='tree_limited.dot', feature_names = train_selected.columns,
                class_names = ['extreme', 'moderate' , 'vulnerable', 'non-vulnerable'],
                rounded = True, proportion = False, precision = 2, filled = True)
  • .dot ํŒŒ์ผ์„ .png ํŒŒ์ผ๋กœ ๋ณ€ํ™˜
!dot -Tpng tree_limited.dot -o tree_limited.png
  • Display the tree in the Jupyter Notebook using IPython.display
### Visualize the tree

from IPython.display import Image
Image(filename = 'tree_limited.png')

๐Ÿ“Œ ์ตœ๋Œ€ ๊นŠ์ด ์ œํ•œ ์—†์ด ํŠธ๋ฆฌ ์‹œ๊ฐํ™”

  • ๊นŠ์ด ์ œํ•œ์ด ์—†์œผ๋ฉด, ํŠธ๋ฆฌ๋Š” ๋ฌดํ•œ์ • ์„ฑ์žฅํ•  ์ˆ˜ ์žˆ์Œ

    • ๋”ฐ๋ผ์„œ, ์ผ๋ฐ˜์ ์œผ๋กœ๋Š” ์•ฝ๊ฐ„์˜ ์ œํ•œ์„ ๋‘๋Š” ๊ฒƒ์ด ๋„์›€๋จ
# ๊นŠ์ด ์ œํ•œ x

model = RandomForestClassifier(max_depth = None, n_estimators=10)
model.fit(train_selected, train_labels)
estimator_nonlimited = model.estimators_[5]

export_graphviz(estimator_nonlimited, out_file='tree_nonlimited.dot', feature_names = train_selected.columns,
                class_names = ['extreme', 'moderate' , 'vulnerable', 'non-vulnerable'],
                rounded = True, proportion = False, precision = 2)

!dot -Tpng tree_nonlimited.dot -o tree_nonlimited.png -Gdpi=600
Image(filename = 'tree_nonlimited.png')

~Completely illegible~

8. Conclusion

  • We carried out a step-by-step implementation of a complete data science solution to a real-world problem

  • Process summary


1. ๋ฌธ์ œ ์ดํ•ด

2. ํƒ์ƒ‰์  ๋ฐ์ดํ„ฐ ๋ถ„์„(EDA)

  - ๋ฐ์ดํ„ฐ ๋ฌธ์ œ ์ฒ˜๋ฆฌ

  - ๊ฒฐ์ธก๊ฐ’ ์ž…๋ ฅ

3. ํŠน์„ฑ ๊ณตํ•™(Feature Engineering)

  - ๋ฐ์ดํ„ฐ ์ง‘๊ณ„(aggregation)

  - ๋‹จ๊ณ„๋ณ„ ํ”ผ์ณ ์„ ํƒ

4. ๋ชจ๋ธ ์„ ํƒ

  - ๋‹ค์–‘ํ•œ ๋ชจ๋ธ์„ ์‹œ๋„ํ•˜์—ฌ ์–ด๋–ค ๋ชจ๋ธ์ด ๊ฐ€์žฅ ์œ ๋ ฅํ•œ์ง€ ํ™•์ธ

  - ๊ธฐ๋Šฅ ์„ ํƒ ๊ธฐ๋Šฅ๋„ ์ž‘๋™ ๊ฐ€๋Šฅ

5. ๋ชจ๋ธ ์ตœ์ ํ™”

  - ์ตœ๊ณ  ์„ฑ๋Šฅ ๋ชจ๋ธ์„ ์„ ํƒํ•˜๊ณ  ํŠœ๋‹

6. ๋ชจ๋ธ ์ตœ์ ํ™”

7. ์˜ˆ์ธก ๊ฒ€ํ† 

  - ๋ชจ๋ธ์˜ ๋‹จ์ ์„ ์‹๋ณ„ํ•ฉ๋‹ˆ๋‹ค

8. ์ƒˆ๋กœ์šด ๊ธฐ์ˆ ์„ ์‹œ๋„


  • ์„œ๋ก ์—์„œ ์–ธ๊ธ‰ํ•œ ๋ฐ”์™€ ๊ฐ™์ด, ์ด๋Ÿฌํ•œ ๋‹จ๊ณ„๋Š” ์ผ๋ฐ˜์ ์ธ ์ˆœ์„œ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์ง€๋งŒ, ์„ฑ๋Šฅ์— ๋งŒ์กฑ๋˜์ง€ ๋ชปํ•˜๋Š” ๊ฒฝ์šฐ ๋ชจ๋ธ๋ง ํ›„ ํ”ผ์ณ ์—”์ง€๋‹ˆ์–ด๋ง/์„ ํƒ ํ•ญ๋ชฉ์œผ๋กœ ๋˜๋Œ์•„๊ฐ€๋Š” ๊ฒฝ์šฐ๋„ ๋งŽ์Œ

    • ์˜ˆ์ธก์„ ์กฐ์‚ฌํ•œ ํ›„ ๋ชจ๋ธ๋ง ๋‹จ๊ณ„๋กœ ๋Œ์•„๊ฐ€์„œ ์ ‘๊ทผ ๋ฐฉ์‹์„ ๋‹ค์‹œ ์ƒ๊ฐํ•  ์ˆ˜๋„ ์žˆ์Œ
  • ๊ธฐ๊ณ„ ํ•™์Šต์€ ๋Œ€๋ถ€๋ถ„ ๊ฒฝํ—˜์ ์ด๋ผ๋Š” ๊ฒƒ์„ ๋ช…์‹ฌํ•ด์•ผ ํ•จ

    • ํ™•๋ฆฝ๋œ ๋ชจ๋ฒ” ์‚ฌ๋ก€๊ฐ€ ๊ฑฐ์˜ ์—†์œผ๋ฏ€๋กœ ๊ฐ€์žฅ ํšจ๊ณผ์ ์ธ ๋ฐฉ๋ฒ•์„ ๊ฒฐ์ •ํ•˜๊ธฐ ์œ„ํ•ด ์ง€์†์ ์œผ๋กœ ๊ฒฝํ—˜ํ•ด์•ผ ํ•จ
  • ์šฐ๋ฆฌ์˜ ์ตœ์ข… ๋ชจ๋ธ์€ ๋‹ค๋ฅธ ๊ฒฝ์Ÿ ๋ชจ๋ธ๋“ค์— ๋น„ํ•ด ์šฐ์ˆ˜ํ•˜์ง€๋งŒ, ์ „์ฒด์ ์œผ๋กœ ๋งค์šฐ ์ •ํ™•ํ•˜์ง€๋Š” ์•Š์Œ

    • ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์ด ์žˆ์„ ์ˆ˜ ์žˆ์ง€๋งŒ, ์ „๋ฐ˜์ ์œผ๋กœ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ถ€์กฑํ•˜์—ฌ ์˜ˆ์™ธ์ ์ธ ๋ฉ”ํŠธ๋ฆญ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์—†์Œ

    • ๊ฒฐ๊ตญ ๋ฐ์ดํ„ฐ ๊ณผํ•™ ํ”„๋กœ์ ํŠธ์˜ ์„ฑํŒจ๋Š” ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ฐ์ดํ„ฐ์˜ ํ’ˆ์งˆ๊ณผ ์–‘์— ๋‹ฌ๋ ค ์žˆ์Œ

๐Ÿ“Œ ์ถ”๊ฐ€์ ์œผ๋กœ ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ๋“ค

  1. ์ถ”๊ฐ€์ ์ธ hyper parameter tuning
  • ๋ชจ๋ธ์„ ์ตœ์ ํ™”ํ•˜๋Š” ๋ฐ ๋งŽ์€ ์‹œ๊ฐ„์„ ํˆฌ์žํ•˜์ง€ ์•Š์•˜์œผ๋ฉฐ, ์ตœ์ ํ™”๋ฅผ ์œ„ํ•ด ์‹œ๋„ํ•  ์ˆ˜ ์žˆ๋Š” ๋‹ค๋ฅธ ํŒจํ‚ค์ง€๊ฐ€ ์žˆ์Œ
  1. ์ถ”๊ฐ€์ ์ธ feature selection
  • ๋™์ผํ•œ ์„ฑ๋Šฅ์„ ์–ป๊ธฐ ์œ„ํ•ด ๋ชจ๋“  feature์„ ์œ ์ง€ํ•  ํ•„์š”๋Š” ์—†์Œ
  1. ์†Œ์ˆ˜์˜ class์— ๋Œ€ํ•œ oversampling, ๋งŽ์€ class์— ๋Œ€ํ•œ undersampling

  2. ์—ฌ๋Ÿฌ ๋ชจ๋ธ์„ ์•™์ƒ๋ธ”/์Šคํƒœํ‚น

  • ๋ฐ์ดํ„ฐ์˜ ๋‹ค๋ฅธ ๋ถ€๋ถ„์— ๋Œ€ํ•ด ๋ชจ๋ธ์„ ํ›ˆ๋ จ์‹œํ‚ค๊ณ  ๊ทธ๋“ค์˜ ์˜ˆ์ธก์„ ๊ฒฐํ•ฉํ•˜์—ฌ ํด๋ž˜์Šค๋ฅผ ๋” ์ž˜ ๋ถ„๋ฆฌํ•  ์ˆ˜ ์žˆ์Œ
