[ECC DS Week 11] Regression 3_Pycaret Introduction
# **0. Pycaret**
0-1. AutoML
- AutoML (Automated Machine Learning)
- Machine learning modeling today demands a lot of time and effort throughout the Machine Learning Process.
- AutoML is the process of automating the manual, repetitive tasks in a machine learning pipeline.
- In short, it is an AI technique that automates machine learning itself.

cf) Machine Learning Process: problem definition, data collection, preprocessing, model training and evaluation, deployment to a service
cf) Pipeline: a structure in which the output of one data-processing stage becomes the input of the next stage
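
The pipeline definition above can be sketched in a few lines of plain Python. This is only an illustration: the stage names (`drop_missing`, `to_thousands`, `average`) are hypothetical, and the three prices are the first `SalePrice` values from the dataset used later, with one `None` added by hand.

```python
from functools import reduce


def drop_missing(values):
    """Stage 1: remove missing entries."""
    return [v for v in values if v is not None]


def to_thousands(values):
    """Stage 2: rescale prices to thousands."""
    return [v / 1000 for v in values]


def average(values):
    """Stage 3: reduce the cleaned, rescaled values to one number."""
    return sum(values) / len(values)


def run_pipeline(data, stages):
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda current, stage: stage(current), stages, data)


raw_prices = [208500, None, 181500, 223500]  # the None is a hand-made gap
result = run_pipeline(raw_prices, [drop_missing, to_thousands, average])
print(result)
```

AutoML tools automate the choice and wiring of such stages instead of leaving them to be written by hand.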
0-2. Pycaret
- Low-code machine learning
- A Python library that makes AutoML possible
- Built on the scikit-learn package
- Supports a variety of model types, including classification, regression, and clustering

cf) low-code: an approach to building applications and systems that requires almost no hand-written code
- setup
- compare models
- create and store models in variable
- blend models
- stack models
# **1. Data Preparation**
- The data is the house price dataset covered in the previous class.
! pip install pycaret
```python
### Import libraries
import numpy as np
import pandas as pd
from pycaret.regression import *  # import the regression module from pycaret

### Import dataset
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/10주차/data/house_price_train.csv')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/10주차/data/house_price_test.csv')
sample = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/10주차/data/house_sample_submission.csv')

train.head()
```
Index | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
### Check the data info
train.info()
`<class 'pandas.core.frame.DataFrame'>`
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):

# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | Id | 1460 non-null | int64 |
1 | MSSubClass | 1460 non-null | int64 |
2 | MSZoning | 1460 non-null | object |
3 | LotFrontage | 1201 non-null | float64 |
4 | LotArea | 1460 non-null | int64 |
5 | Street | 1460 non-null | object |
6 | Alley | 91 non-null | object |
7 | LotShape | 1460 non-null | object |
8 | LandContour | 1460 non-null | object |
9 | Utilities | 1460 non-null | object |
10 | LotConfig | 1460 non-null | object |
11 | LandSlope | 1460 non-null | object |
12 | Neighborhood | 1460 non-null | object |
13 | Condition1 | 1460 non-null | object |
14 | Condition2 | 1460 non-null | object |
15 | BldgType | 1460 non-null | object |
16 | HouseStyle | 1460 non-null | object |
17 | OverallQual | 1460 non-null | int64 |
18 | OverallCond | 1460 non-null | int64 |
19 | YearBuilt | 1460 non-null | int64 |
20 | YearRemodAdd | 1460 non-null | int64 |
21 | RoofStyle | 1460 non-null | object |
22 | RoofMatl | 1460 non-null | object |
23 | Exterior1st | 1460 non-null | object |
24 | Exterior2nd | 1460 non-null | object |
25 | MasVnrType | 1452 non-null | object |
26 | MasVnrArea | 1452 non-null | float64 |
27 | ExterQual | 1460 non-null | object |
28 | ExterCond | 1460 non-null | object |
29 | Foundation | 1460 non-null | object |
30 | BsmtQual | 1423 non-null | object |
31 | BsmtCond | 1423 non-null | object |
32 | BsmtExposure | 1422 non-null | object |
33 | BsmtFinType1 | 1423 non-null | object |
34 | BsmtFinSF1 | 1460 non-null | int64 |
35 | BsmtFinType2 | 1422 non-null | object |
36 | BsmtFinSF2 | 1460 non-null | int64 |
37 | BsmtUnfSF | 1460 non-null | int64 |
38 | TotalBsmtSF | 1460 non-null | int64 |
39 | Heating | 1460 non-null | object |
40 | HeatingQC | 1460 non-null | object |
41 | CentralAir | 1460 non-null | object |
42 | Electrical | 1459 non-null | object |
43 | 1stFlrSF | 1460 non-null | int64 |
44 | 2ndFlrSF | 1460 non-null | int64 |
45 | LowQualFinSF | 1460 non-null | int64 |
46 | GrLivArea | 1460 non-null | int64 |
47 | BsmtFullBath | 1460 non-null | int64 |
48 | BsmtHalfBath | 1460 non-null | int64 |
49 | FullBath | 1460 non-null | int64 |
50 | HalfBath | 1460 non-null | int64 |
51 | BedroomAbvGr | 1460 non-null | int64 |
52 | KitchenAbvGr | 1460 non-null | int64 |
53 | KitchenQual | 1460 non-null | object |
54 | TotRmsAbvGrd | 1460 non-null | int64 |
55 | Functional | 1460 non-null | object |
56 | Fireplaces | 1460 non-null | int64 |
57 | FireplaceQu | 770 non-null | object |
58 | GarageType | 1379 non-null | object |
59 | GarageYrBlt | 1379 non-null | float64 |
60 | GarageFinish | 1379 non-null | object |
61 | GarageCars | 1460 non-null | int64 |
62 | GarageArea | 1460 non-null | int64 |
63 | GarageQual | 1379 non-null | object |
64 | GarageCond | 1379 non-null | object |
65 | PavedDrive | 1460 non-null | object |
66 | WoodDeckSF | 1460 non-null | int64 |
67 | OpenPorchSF | 1460 non-null | int64 |
68 | EnclosedPorch | 1460 non-null | int64 |
69 | 3SsnPorch | 1460 non-null | int64 |
70 | ScreenPorch | 1460 non-null | int64 |
71 | PoolArea | 1460 non-null | int64 |
72 | PoolQC | 7 non-null | object |
73 | Fence | 281 non-null | object |
74 | MiscFeature | 54 non-null | object |
75 | MiscVal | 1460 non-null | int64 |
76 | MoSold | 1460 non-null | int64 |
77 | YrSold | 1460 non-null | int64 |
78 | SaleType | 1460 non-null | object |
79 | SaleCondition | 1460 non-null | object |
80 | SalePrice | 1460 non-null | int64 |

dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
- The non-null counts show that some columns contain missing values.
# **2. Data Setup (Data Preprocessing)**
Preprocessing is handled through the `setup()` function.
reg = setup(data = train,
target = 'SalePrice',
numeric_imputation = 'mean',
categorical_features = ['MSZoning','Exterior1st','Exterior2nd','KitchenQual','Functional','SaleType',
'Street','LotShape','LandContour','LotConfig','LandSlope','Neighborhood',
'Condition1','Condition2','BldgType','HouseStyle','RoofStyle','RoofMatl',
'MasVnrType','ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond',
'BsmtExposure','BsmtFinType1','BsmtFinType2','Heating','HeatingQC','CentralAir',
'Electrical','GarageType','GarageFinish','GarageQual','GarageCond','PavedDrive',
'SaleCondition'] ,
ignore_features = ['Alley','PoolQC','MiscFeature','Fence','FireplaceQu','Utilities'],
normalize = True)
Index | Description | Value |
---|---|---|
0 | Session id | 679 |
1 | Target | SalePrice |
2 | Target type | Regression |
3 | Original data shape | (1460, 81) |
4 | Transformed data shape | (1460, 262) |
5 | Transformed train set shape | (1021, 262) |
6 | Transformed test set shape | (439, 262) |
7 | Ignore features | 6 |
8 | Ordinal features | 2 |
9 | Numeric features | 37 |
10 | Categorical features | 37 |
11 | Rows with missing values | 100.0% |
12 | Preprocess | True |
13 | Imputation type | simple |
14 | Numeric imputation | mean |
15 | Categorical imputation | mode |
16 | Maximum one-hot encoding | 25 |
17 | Encoding method | None |
18 | Normalize | True |
19 | Normalize method | zscore |
20 | Fold Generator | KFold |
21 | Fold Number | 10 |
22 | CPU Jobs | -1 |
23 | Use GPU | False |
24 | Log Experiment | False |
25 | Experiment Name | reg-default-name |
26 | USI | 6ade |
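
The summary above reports `Numeric imputation: mean` and `Normalize method: zscore`. As a rough illustration of what those two steps do to a single numeric column, here is a minimal sketch in plain Python. It is not PyCaret's internal code: the helper names are hypothetical, the `None` is a hand-made gap in a few `LotFrontage` values, and the z-score is assumed to use the population standard deviation.

```python
from statistics import mean, pstdev


def fit_mean_imputer(column):
    """Return the mean of the observed (non-missing) values."""
    return mean(v for v in column if v is not None)


def fit_zscore(column):
    """Return the (mean, population std) pair used for z-score scaling."""
    return mean(column), pstdev(column)


# Toy column: four LotFrontage values from train.head() plus one hand-made gap.
lot_frontage = [65.0, 80.0, None, 60.0, 84.0]

fill_value = fit_mean_imputer(lot_frontage)
imputed = [fill_value if v is None else v for v in lot_frontage]

mu, sigma = fit_zscore(imputed)
scaled = [(v - mu) / sigma for v in imputed]

print(fill_value)  # value used to fill the gap
print(scaled)      # column with zero mean and unit variance
```

In practice the fill value, mean, and standard deviation would be learned on the training split and then reused unchanged on the test split.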
# **3. Model Performance Comparison**
- `models()`
- Lists the available models. It can only be called after `setup()` has been run.
- `compare_models()`
- Takes the models provided by `models()`, or scikit-learn models declared separately, and reports their performance (MAE, MSE, RMSE, R^2, training time, etc.) as a DataFrame.
compare_models()
ID | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
---|---|---|---|---|---|---|---|---|
gbr | Gradient Boosting Regressor | 16668.8275 | 813316846.6913 | 27912.6374 | 0.8666 | 0.1348 | 0.0963 | 1.9690 |
rf | Random Forest Regressor | 18261.1246 | 998294529.0476 | 31130.8689 | 0.8383 | 0.1493 | 0.1068 | 3.5390 |
lightgbm | Light Gradient Boosting Machine | 17672.3315 | 1012800681.3536 | 31313.5997 | 0.8345 | 0.1418 | 0.0999 | 1.3330 |
xgboost | Extreme Gradient Boosting | 19010.3780 | 1029452959.8392 | 31447.4011 | 0.8309 | 0.1482 | 0.1083 | 2.6500 |
et | Extra Trees Regressor | 18237.5227 | 1037767632.2362 | 31830.1485 | 0.8297 | 0.1515 | 0.1073 | 3.0070 |
en | Elastic Net | 18364.3952 | 1344589957.1260 | 35059.9474 | 0.7834 | 0.1449 | 0.1024 | 0.8530 |
par | Passive Aggressive Regressor | 18134.1562 | 1374167341.7881 | 35373.6928 | 0.7773 | 0.1699 | 0.1033 | 2.4800 |
ada | AdaBoost Regressor | 25297.1088 | 1483109549.4956 | 38170.5192 | 0.7603 | 0.2061 | 0.1631 | 1.6810 |
br | Bayesian Ridge | 19102.7516 | 1509230751.8566 | 36789.4023 | 0.7551 | 0.1706 | 0.1097 | 0.9030 |
omp | Orthogonal Matching Pursuit | 19565.0295 | 1604341893.6740 | 37530.4782 | 0.7370 | 0.1815 | 0.1121 | 1.2170 |
llar | Lasso Least Angle Regression | 19787.4115 | 1620639261.1369 | 38028.6722 | 0.7366 | 0.1867 | 0.1157 | 0.9700 |
dt | Decision Tree Regressor | 26873.8594 | 1656701282.2708 | 40601.1973 | 0.7338 | 0.2134 | 0.1554 | 0.9180 |
ridge | Ridge Regression | 20390.6023 | 1675546296.8694 | 38612.7187 | 0.7280 | 0.2029 | 0.1197 | 1.1390 |
lasso | Lasso Regression | 20348.5520 | 1679664973.5803 | 38597.0093 | 0.7271 | 0.2045 | 0.1196 | 3.0070 |
huber | Huber Regressor | 19236.5019 | 1766235862.2651 | 38848.9916 | 0.7128 | 0.2052 | 0.1123 | 1.2150 |
knn | K Neighbors Regressor | 27602.8084 | 2038986154.7901 | 44370.3270 | 0.6870 | 0.2092 | 0.1559 | 1.1220 |
dummy | Dummy Regressor | 57764.4783 | 6442602524.5583 | 79947.6263 | -0.0024 | 0.4042 | 0.3610 | 1.1870 |
lr | Linear Regression | 489462244852579.7500 | 25087786038259630322198507421696.0000 | 3296273264514296.0000 | -3909394621073952604160.0000 | 3.5936 | 4067501731.2203 | 4.4590 |
lar | Least Angle Regression | 42521860416708132924698326635424220409273253888.0000 | 1844180973733323350448562275845971804997761547450196885667534503848224189678324096601303512252416.0000 | 429448190934811211097724065035493310127918809088.0000 | -356872355765253379694331818216569495633078728307084395869623944858328491348458530144256.0000 | 49.8038 | 310381041719748206784084963210196848476160.0000 | 1.4570 |
GradientBoostingRegressor(random_state=679)
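- In this run the Gradient Boosting Regressor (`gbr`) scores best (R2 0.8666), and it is the model `compare_models()` returns. LightGBM ranks third by R2 (0.8345) but has the shortest training time among the top five models, and it is the model used in the following steps.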
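
To make the column headers of the comparison table concrete, the sketch below computes the same six metrics by hand under their usual textbook definitions. The `y_true` values are the five `SalePrice` entries from `train.head()`; the `y_pred` values are invented for illustration and do not come from any model in this notebook.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, R2, RMSLE and MAPE for two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    mse = ss_res / n
    log_errors = [math.log1p(p) - math.log1p(t) for t, p in zip(y_true, y_pred)]
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": 1 - ss_res / ss_tot,
        "RMSLE": math.sqrt(sum(le * le for le in log_errors) / n),
        "MAPE": sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
    }


y_true = [208500, 181500, 223500, 140000, 250000]  # SalePrice from train.head()
y_pred = [200000, 190000, 220000, 150000, 240000]  # hypothetical predictions

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Predicting the mean of `y_true` for every row gives an R2 of 0, which is why the Dummy Regressor row sits near zero.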
# **4. Creating a Single Model**
- `create_model()`
- Trains one model (rather than several) according to the `setup()` configuration and shows the training results.
- Reports the performance on each fold in detail.
lgb = create_model('lightgbm')
Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
---|---|---|---|---|---|---|
0 | 15442.1645 | 500820041.6930 | 22379.0090 | 0.9257 | 0.1161 | 0.0871 |
1 | 18244.9876 | 855920787.4039 | 29256.1239 | 0.8332 | 0.1369 | 0.1017 |
2 | 20166.5540 | 1435737692.1195 | 37891.1295 | 0.7530 | 0.1675 | 0.1193 |
3 | 17853.1263 | 1729976226.4035 | 41592.9829 | 0.6375 | 0.1523 | 0.1004 |
4 | 20165.9642 | 1283772957.2518 | 35829.7775 | 0.8314 | 0.1748 | 0.1221 |
5 | 16520.0521 | 686462662.0092 | 26200.4325 | 0.9053 | 0.1295 | 0.0917 |
6 | 17028.9203 | 786230258.6556 | 28039.7978 | 0.8479 | 0.1132 | 0.0887 |
7 | 16193.2443 | 1033335960.9131 | 32145.5434 | 0.8559 | 0.1365 | 0.0908 |
8 | 16631.2041 | 685488507.1789 | 26181.8354 | 0.8899 | 0.1640 | 0.1019 |
9 | 18477.0974 | 1130261719.9075 | 33619.3653 | 0.8654 | 0.1274 | 0.0954 |
Mean | 17672.3315 | 1012800681.3536 | 31313.5997 | 0.8345 | 0.1418 | 0.0999 |
Std | 1530.8375 | 365042550.9815 | 5679.7143 | 0.0797 | 0.0206 | 0.0116 |
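
The per-fold table can be related to the setup summary (KFold, 10 folds, 1021 training rows) with a small sketch. The `kfold_indices` helper is hypothetical and assumes contiguous, unshuffled folds. The R2 values are copied from the table above, and the Std row is assumed to be a population standard deviation, which is what the numbers are consistent with.

```python
from statistics import mean, pstdev


def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # first `extra` folds get one more row
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


# 1021 training rows split into 10 folds, as in the setup summary.
folds = list(kfold_indices(1021, 10))
print([len(valid) for _, valid in folds])

# R2 per fold, copied from the create_model('lightgbm') output.
fold_r2 = [0.9257, 0.8332, 0.7530, 0.6375, 0.8314,
           0.9053, 0.8479, 0.8559, 0.8899, 0.8654]
r2_mean = mean(fold_r2)
r2_std = pstdev(fold_r2)
print(round(r2_mean, 4), round(r2_std, 4))
```

Fold 3 (R2 0.6375) pulls the mean down noticeably, which is the kind of variation a single train/test split would hide.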
- `verbose` option
- Determines whether the detailed information produced while the function runs is written to standard output.
- If True, the detailed output is printed.
# **5. Tuning the Model**
- `tune_model()`
- Performs hyperparameter tuning on the model passed in.
tuned_lgb = tune_model(lgb)
Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
---|---|---|---|---|---|---|
0 | 16220.2312 | 672187894.3403 | 25926.5866 | 0.9003 | 0.1230 | 0.0895 |
1 | 18389.4504 | 957053892.6023 | 30936.2876 | 0.8135 | 0.1396 | 0.1032 |
2 | 18155.1805 | 1124713680.5994 | 33536.7512 | 0.8065 | 0.1665 | 0.1146 |
3 | 17656.8498 | 1042311405.3105 | 32284.8479 | 0.7816 | 0.1439 | 0.1019 |
4 | 18224.1143 | 1133540184.1908 | 33668.0885 | 0.8512 | 0.1765 | 0.1181 |
5 | 16493.6110 | 687283374.5380 | 26216.0900 | 0.9052 | 0.1354 | 0.0950 |
6 | 14976.2889 | 584667369.0686 | 24179.8960 | 0.8869 | 0.1024 | 0.0797 |
7 | 19611.3610 | 1654645129.8896 | 40677.3294 | 0.7693 | 0.1606 | 0.1084 |
8 | 16973.8506 | 744827085.8664 | 27291.5204 | 0.8803 | 0.1582 | 0.1068 |
9 | 20922.3215 | 1753138613.2070 | 41870.4981 | 0.7912 | 0.1580 | 0.1162 |
Mean | 17762.3259 | 1035436862.9613 | 31658.7896 | 0.8386 | 0.1464 | 0.1033 |
Std | 1629.3522 | 382505984.9453 | 5758.2901 | 0.0494 | 0.0211 | 0.0117 |
Fitting 10 folds for each of 10 candidates, totalling 100 fits
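
The log line above (10 candidates, each fitted on 10 folds) is consistent with a randomized search over hyperparameters. The sketch below shows that idea in plain Python. It is an assumption-laden toy: the search space, the `toy_score` function, and the `random_search` helper are all hypothetical, and the score stands in for the mean cross-validated metric of a real model.

```python
import random


def random_search(score_fn, search_space, n_candidates, seed=0):
    """Sample n_candidates parameter sets at random and keep the best-scoring one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_candidates):
        params = {name: rng.choice(values) for name, values in search_space.items()}
        trials.append((score_fn(params), params))
    best_score, best_params = max(trials, key=lambda trial: trial[0])
    return best_params, best_score, trials


# Hypothetical search space; not the grid PyCaret actually uses.
search_space = {
    "num_leaves": [15, 31, 63, 127],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "n_estimators": [100, 200, 400],
}


def toy_score(params):
    """Stand-in for a mean cross-validated score; higher is better."""
    return 1.0 - abs(params["learning_rate"] - 0.05) - abs(params["num_leaves"] - 31) / 1000


best_params, best_score, trials = random_search(toy_score, search_space, n_candidates=10)
print(best_params, best_score)
```

With only 10 sampled candidates the search can miss the best region, which may explain why the tuned model above improves the mean R2 only slightly (0.8345 to 0.8386) while the mean MAE gets slightly worse.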
# **6. Interpreting the Modeling Results**
- `interpret_model()`
- Interprets the model.
- Takes the trained model object and the plot type (as a string).
- The interpretation is implemented on top of SHAP and is available only for tree-based models.

cf) Available only in the `pycaret.classification` and `pycaret.regression` modules
cf) SHAP: SHapley Additive exPlanations
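
To give a feel for what a Shapley value is, the sketch below computes exact Shapley values for a two-feature toy model by averaging each feature's marginal contribution over all feature orderings. It is not how the `shap` library works internally (that library uses much faster algorithms for tree models and a background dataset rather than a single baseline point). The model, feature values, and baseline are all hypothetical.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]                     # switch feature i from baseline to x
            value = model(current)
            contributions[i] += value - previous  # marginal contribution in this ordering
            previous = value
    return [c / len(orderings) for c in contributions]


def toy_model(features):
    """Hypothetical price model with an interaction between area and quality."""
    area, quality = features
    return 50.0 * area + 10000.0 * quality + 5.0 * area * quality


x = [1500.0, 7.0]         # house being explained
baseline = [1000.0, 5.0]  # reference house

phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The two printed totals agree: the per-feature contributions add up to the difference between the prediction and the baseline prediction, which is the "additive" part of the name.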
!pip install shap
interpret_model(tuned_lgb)
predictions = predict_model(tuned_lgb, data = test)
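label_col = 'prediction_label' if 'prediction_label' in predictions.columns else 'Label'  # PyCaret 3.x uses 'prediction_label'; 2.x used 'Label'
sample['SalePrice'] = predictions[label_col]
sample.to_csv('/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/10주차/data/house_sample_submission.csv', index = False)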
sample.head()