
[๋จธ์‹ ๋Ÿฌ๋‹] ์‚ฌ์ดํ‚ท๋Ÿฐ(scikit-learn) - train_test_split, ๊ต์ฐจ๊ฒ€์ฆ, GridSearchCV

by ์Šค๋‹ 2022. 8. 29.

Model Selection

- ํ•™์Šต ๋ฐ์ดํ„ฐ์™€ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ

1. ํ•™์Šต ๋ฐ์ดํ„ฐ ์„ธํŠธ

  • ๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ•™์Šต์„ ์œ„ํ•ด ์‚ฌ์šฉ.
  • ๋ฐ์ดํ„ฐ์˜ ์†์„ฑ๋“ค๊ณผ ๊ฒฐ์ •๊ฐ’(๋ ˆ์ด๋ธ”๊ฐ’) ๋ชจ๋‘๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Œ.
  • ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๋ฐ์ดํ„ฐ ์†์„ฑ๊ณผ ๊ฒฐ์ •๊ฐ’์˜ ํŒจํ„ด์„ ์ธ์ง€ํ•˜๊ณ  ํ•™์Šต

2. ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ

  • ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ ํ•™์Šต๋œ ๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ…Œ์ŠคํŠธ.
  • ์†์„ฑ ๋ฐ์ดํ„ฐ๋งŒ ๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ์ œ๊ณตํ•˜๋ฉฐ, ๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์ œ๊ณต๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฒฐ์ •๊ฐ’์„ ์˜ˆ์ธก.
  • ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋Š” ํ•™์Šต ๋ฐ์ดํ„ฐ์™€ ๋ณ„๋„์˜ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋กœ ์ œ๊ณต๋˜์–ด์•ผ ํ•จ.

3. ํ•™์Šต ๋ฐ์ดํ„ฐ์™€ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌ - train_test_split()

sklearn.model_selection์˜ train_test_split()ํ•จ์ˆ˜

X_train, X_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target, 
                                                    test_size=0.3, random_state=121)
  • test_size : what fraction of the whole data to sample as the test data set. The default is 0.25 (25%).
  • train_size : what fraction of the whole data to sample as the training data set. Rarely specified, since test_size is the parameter normally used.
  • shuffle : whether to shuffle the data before splitting. The default is True. Shuffling spreads the data out so that the training and test sets are built more effectively.
  • random_state : a seed value so that the same training/test data sets are produced on every call. train_test_split() splits the data randomly, so if random_state is not set, a different training/test split is generated on every run (see the sketch right after this list).
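
As a quick check of these parameters, here is a minimal sketch (test_size=0.2 and random_state=11 are arbitrary choices): splitting the 150-sample iris data this way leaves 120 rows for training and 30 for testing, and the same split is reproduced on every run.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_data = load_iris()

# shuffle=True (the default) mixes the rows before splitting;
# random_state fixes the shuffle so repeated calls give the same split
X_train, X_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target,
                                                    test_size=0.2, shuffle=True, random_state=11)
print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)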

4. Practice

# Split data into training/test sets
from sklearn.model_selection import train_test_split

ํ•™์Šต/ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์˜ ์ดํ•ด

  • ํ•™์Šต ๋ฐ์ดํ„ฐ๋กœ ์ž˜๋ชป๋œ ์˜ˆ์ธก ์ผ€์ด์Šค
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# iris ๋ฐ์ดํ„ฐ ๋กœ๋“œ
iris = load_iris()

# ํ•™์Šต ๋ฐ์ดํ„ฐ ์„ธํŒ…
train_data = iris.data
train_label = iris.target
# Create a DecisionTreeClassifier instance
dt_clf = DecisionTreeClassifier()

# ํ•™์Šต ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต
dt_clf.fit(train_data, train_label)
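
If we now predict on the same training data (a minimal sketch added here to complete the flawed case), the tree simply reproduces the labels it memorized, so the reported accuracy comes out as 1.0 and says nothing about performance on unseen data.

# Predict on the very data the model was trained on
pred = dt_clf.predict(train_data)

# The tree has memorized the training set, so this prints 1.0
print('Prediction accuracy:', accuracy_score(train_label, pred))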

  • ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋กœ predictํ•ด์•ผ ์ œ๋Œ€๋กœ ๋œ ์˜ˆ์ธก
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

dt_clf = DecisionTreeClassifier( )
iris_data = load_iris()

# train_test_split: split into training and test data
X_train, X_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target, 
                                                    test_size=0.3, random_state=121)
print(X_train.shape)
print(X_test.shape)

# ๋ชจ๋ธ ํ•™์Šต
dt_clf.fit(X_train, y_train)

# ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋กœ ์˜ˆ์ธก
pred = dt_clf.predict(X_test)
print('์˜ˆ์ธก ์ •ํ™•๋„: {0:.4f}'.format(accuracy_score(y_test,pred)))

- ๊ต์ฐจ ๊ฒ€์ฆ๊ณผ GridSearchCV

1. ๊ต์ฐจ ๊ฒ€์ฆ

  • ํ•™์Šต ๋ฐ์ดํ„ฐ ์„ธํŠธ : ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค์‹œ ๋ถ„ํ• ํ•˜์—ฌ ํ•™์Šต ๋ฐ์ดํ„ฐ์™€ ํ•™์Šต๋œ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ์ผ์ฐจ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ๋กœ ๋‚˜๋ˆ”
  • ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ : ๋ชจ๋“  ํ•™์Šต/๊ฒ€์ฆ ๊ณผ์ •์ด ์™„๋ฃŒ๋„๋‹ˆ ํ›„ ์ตœ์ข…์ ์œผ๋กœ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•œ ๋ฐ์ดํ„ฐ ์„ธํŠธ

k ํด๋“œ ๊ต์ฐจ ๊ฒ€์ฆ

  • ์ผ๋ฐ˜ K ํด๋“œ
  • Stratified K ํด๋“œ
    • ๋ถˆ๊ท ํ˜•ํ•œ(imbalanced)๋ถ„ํฌ๋„๋ฅผ ๊ฐ€์ง„ ๋ ˆ์ด๋ธ”(๊ฒฐ์ • ํด๋ž˜์Šค) ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์„ ์œ„ํ•œ k ํด๋“œ ๋ฐฉ์‹.
    • ํ•™์Šต ๋ฐ์ดํ„ฐ์™€ ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ์„ธํŠธ๊ฐ€ ๊ฐ€์ง€๋Š” ๋ ˆ์ด๋ธ” ๋ถ„ํฌ๋„๊ฐ€ ์œ ์‚ฌํ•˜๋„๋ก ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ์ถ”์ถœ.
  1. K ํด๋“œ
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
import numpy as np

# ๋ฐ์ดํ„ฐ ๋กœ๋“œ
iris = load_iris()
label = iris.target
features = iris.data

print(features.shape)
features

# ๋ชจ๋ธ ์ •์˜
dt_clf = DecisionTreeClassifier(random_state=156)
dt_clf

# 5๊ฐœ์˜ ํด๋“œ ์„ธํŠธ๋กœ ๋ถ„๋ฆฌํ•˜๋Š” KFold ๊ฐ์ฒด์™€ ํด๋“œ ์„ธํŠธ๋ณ„ ์ •ํ™•๋„๋ฅผ ๋‹ด์„ ๋ฆฌ์ŠคํŠธ ๊ฐ์ฒด ์ƒ์„ฑ.
kfold = KFold(n_splits=5)  # n=5
cv_accuracy = []           # ์ตœ์ข…์ ์œผ๋กœ๋Š” n๋ฒˆ์˜ ๊ต์ฐจ๊ฒ€์ฆ์˜ ํ‰๊ท  ์ •ํ™•๋„ ๊ณ„์‚ฐ/ ํด๋“œ ์„ธํŠธ ๋ณ„๋กœ ์ •ํ™•๋„ ๊ฐ’์„ ์ €์žฅํ•  ๋ฆฌ์ŠคํŠธ ์ƒ์„ฑ
print('๋ถ“๊ฝƒ ๋ฐ์ดํ„ฐ ์„ธํŠธ ํฌ๊ธฐ:', features.shape[0])

# for๋ฌธ์ด ๋„๋Š” ๋™์•ˆ generator๊ฐ€ kfold๋œ ๋ฐ์ดํ„ฐ์˜ ํ•™์Šต, ๊ฒ€์ฆ row ์ธ๋ฑ์Šค๋ฅผ array๋กœ ๋ฐ˜ํ™˜  
kfold.split(features)

n_iter = 0

# KFold๊ฐ์ฒด์˜ split( ) ํ˜ธ์ถœํ•˜๋ฉด ํด๋“œ ๋ณ„ ํ•™์Šต์šฉ, ๊ฒ€์ฆ์šฉ ํ…Œ์ŠคํŠธ์˜ row ์ธ๋ฑ์Šค๋ฅผ array๋กœ ๋ฐ˜ํ™˜  
for train_index, test_index  in kfold.split(features):
    # Use the indices returned by kfold.split() to extract the training and validation data
    X_train, X_test = features[train_index], features[test_index]
    y_train, y_test = label[train_index], label[test_index]

    # ํ•™์Šต ๋ฐ ์˜ˆ์ธก 
    dt_clf.fit(X_train , y_train)    
    pred = dt_clf.predict(X_test)
    n_iter += 1

    # Measure accuracy on each iteration
    accuracy = np.round(accuracy_score(y_test,pred), 4)  # accuracy rounded to 4 decimal places: np.round(value, digits)
    train_size = X_train.shape[0]
    test_size = X_test.shape[0]
    print('\n#{0} ๊ต์ฐจ ๊ฒ€์ฆ ์ •ํ™•๋„ :{1}, ํ•™์Šต ๋ฐ์ดํ„ฐ ํฌ๊ธฐ: {2}, ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ: {3}'
          .format(n_iter, accuracy, train_size, test_size))
    print('#{0} ๊ฒ€์ฆ ์„ธํŠธ ์ธ๋ฑ์Šค:{1}'.format(n_iter,test_index))

    cv_accuracy.append(accuracy)

# Average the per-iteration accuracies to compute the mean accuracy
print('\n## Mean validation accuracy:', np.mean(cv_accuracy)) 

  2. Stratified K-fold

KFOLD ๊ต์ฐจ๊ฒ€์ฆ์˜ ๋ฌธ์ œ์  : ๋ถˆ๊ท ํ˜•ํ•œ ๋ฐ์ดํ„ฐ์—๋Š” ์ ์šฉ์ด ์•ˆ๋œ๋‹ค.
์ด๋ฅผ ํ•ด๊ฒฐํ•  ๋ฐฉ๋ฒ•์ด StratifiedKFold : ๋ถˆ๊ท ํ˜•ํ•œ ๋ถ„ํฌ๋„๋ฅผ ๊ฐ€์ง„ ๋ ˆ์ด๋ธ” ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์„ ๊ท ํ˜•ํ•˜๊ฒŒ ์„ž์–ด์ฃผ๊ณ  ๊ต์ฐจ๊ฒ€์ฆ์„ ์ง„ํ–‰ํ•œ๋‹ค.

import pandas as pd

# iris ๋ฐ์ดํ„ฐ ๋กœ๋“œ
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Check the iris target values
iris_df['label'] = iris.target
iris_df['label'].value_counts()

kfold = KFold(n_splits=3)
# kfold.split(X) returns different training/validation row indices for each of the 3 folds.
n_iter =0
for train_index, test_index  in kfold.split(iris_df):
    n_iter += 1
    label_train= iris_df['label'].iloc[train_index]  # training labels
    label_test= iris_df['label'].iloc[test_index]    # validation labels

    print('## ๊ต์ฐจ ๊ฒ€์ฆ: {0}'.format(n_iter))
    print('ํ•™์Šต ๋ ˆ์ด๋ธ” ๋ฐ์ดํ„ฐ ๋ถ„ํฌ:\n', label_train.value_counts())       # ํ•™์Šต ๋ ˆ์ด๋ธ” ๋ถ„ํฌ
    print('๊ฒ€์ฆ ๋ ˆ์ด๋ธ” ๋ฐ์ดํ„ฐ ๋ถ„ํฌ:\n', label_test.value_counts(), '\n')  # ๊ฒ€์ฆ ๋ ˆ์ด๋ธ” ๋ถ„ํฌ

kfold๋ฅผ ํ–ˆ๋”๋‹ˆ ๋ถˆ๊ท ํ˜•ํ•˜๊ฒŒ ํ•™์Šต ๋ ˆ์ด๋ธ”, ๊ฒ€์ฆ ๋ ˆ์ด๋ธ”์ด ๋“ค์–ด๊ฐ€ ์žˆ์œผ๋ฏ€๋กœ ๊ฒ€์ฆ์ด ์ œ๋Œ€๋กœ ๋˜์ง€ ์•Š๋Š”๋‹ค
์ด๋ฅผ ํ•ด๊ฒฐํ•  ๋ฐฉ๋ฒ•์ด StratifiedKFold : ๋ถˆ๊ท ํ˜•ํ•œ ๋ถ„ํฌ๋„๋ฅผ ๊ฐ€์ง„ ๋ ˆ์ด๋ธ” ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์„ ๊ท ํ˜•ํ•˜๊ฒŒ ์„ž์–ด์ฃผ๊ณ  ๊ต์ฐจ๊ฒ€์ฆ์„ ์ง„ํ–‰.

from sklearn.model_selection import StratifiedKFold

# StratifiedKFold ํด๋ž˜์Šค์˜ ์ธ์Šคํ„ด์Šค ์„ ์–ธ : skf
skf = StratifiedKFold(n_splits=3)
n_iter=0

# Difference from KFold: the label values are passed to split() so that each fold keeps an even label distribution.
for train_index, test_index in skf.split(iris_df, iris_df['label']):
    n_iter += 1
    label_train= iris_df['label'].iloc[train_index]
    label_test= iris_df['label'].iloc[test_index]

    print('## ๊ต์ฐจ ๊ฒ€์ฆ: {0}'.format(n_iter))
    print('ํ•™์Šต ๋ ˆ์ด๋ธ” ๋ฐ์ดํ„ฐ ๋ถ„ํฌ:\n', label_train.value_counts())
    print('๊ฒ€์ฆ ๋ ˆ์ด๋ธ” ๋ฐ์ดํ„ฐ ๋ถ„ํฌ:\n', label_test.value_counts(), '\n')

StratifiedKFold ํ–ˆ๋”๋‹ˆ ๊ท ์ผํ•˜๊ฒŒ ํ•™์Šต ๋ ˆ์ด๋ธ”, ๊ฒ€์ฆ ๋ ˆ์ด๋ธ”์ด ๋“ค์–ด๊ฐ€ ์žˆ์œผ๋ฏ€๋กœ ๊ฒ€์ฆ์ด ์ œ๋Œ€๋กœ ๋œ๋‹ค.

์ตœ์ข…์ ์œผ๋กœ StratifiedKFold๋ฅผ ํ™œ์šฉํ•œ ๊ต์ฐจ ๊ฒ€์ฆ ์ •ํ™•๋„ ํ™•์ธ

from sklearn.model_selection import StratifiedKFold

dt_clf = DecisionTreeClassifier(random_state=156)

skfold = StratifiedKFold(n_splits=3)
n_iter=0
cv_accuracy=[]

# StratifiedKFold์˜ split( ) ํ˜ธ์ถœ์‹œ ๋ฐ˜๋“œ์‹œ ๋ ˆ์ด๋ธ” ๋ฐ์ดํ„ฐ ์…‹๋„ ์ถ”๊ฐ€ ์ž…๋ ฅ ํ•„์š”(๋ ˆ์ด๋ธ” ๋ถ„ํฌ๋„์— ๋”ฐ๋ผ ํ•™์Šต/๊ฒ€์ฆ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„ํ• ํ•˜๊ธฐ ๋•Œ๋ฌธ์—)
for train_index, test_index  in skfold.split(features, label):
    # Use the indices returned by split() to extract the training and validation data
    X_train, X_test = features[train_index], features[test_index]
    y_train, y_test = label[train_index], label[test_index]

    #ํ•™์Šต ๋ฐ ์˜ˆ์ธก 
    dt_clf.fit(X_train , y_train)    
    pred = dt_clf.predict(X_test)

    # Measure accuracy on each iteration
    n_iter += 1
    accuracy = np.round(accuracy_score(y_test,pred), 4)
    train_size = X_train.shape[0]
    test_size = X_test.shape[0]

    print('\n#{0} ๊ต์ฐจ ๊ฒ€์ฆ ์ •ํ™•๋„ :{1}, ํ•™์Šต ๋ฐ์ดํ„ฐ ํฌ๊ธฐ: {2}, ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ: {3}'
          .format(n_iter, accuracy, train_size, test_size))
    print('#{0} ๊ฒ€์ฆ ์„ธํŠธ ์ธ๋ฑ์Šค:{1}'.format(n_iter,test_index))
    cv_accuracy.append(accuracy)

# ๊ต์ฐจ ๊ฒ€์ฆ๋ณ„ ์ •ํ™•๋„ ๋ฐ ํ‰๊ท  ์ •ํ™•๋„ ๊ณ„์‚ฐ 
print('\n## ๊ต์ฐจ ๊ฒ€์ฆ๋ณ„ ์ •ํ™•๋„:', np.round(cv_accuracy, 4))
print('## ํ‰๊ท  ๊ฒ€์ฆ ์ •ํ™•๋„:', np.mean(cv_accuracy)) 

This gives a better validation accuracy than the plain K-fold run above.

๊ต์ฐจ ๊ฒ€์ฆ์„ ๋ณด๋‹ค ๊ฐ„ํŽธํ•˜๊ฒŒ - cross_val_score()

KFold ํด๋ž˜์Šค๋ฅผ ์ด์šฉํ•œ ๊ต์ฐจ ๊ฒ€์ฆ ๋ฐฉ๋ฒ•

  1. ํด๋“œ ์„ธํŠธ ์„ค์ •
  2. For๋ฃจํ”„์—์„œ ๋ฐ˜๋ณต์ ์œผ๋กœ ํ•™์Šต/๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ์ถ”์ถœ ๋ฐ ํ•™์Šต๊ณผ ์˜ˆ์ธก ์ˆ˜ํ–‰
  3. ํด๋“œ ์„ธํŠธ๋ณ„๋กœ ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ํ‰๊ท ํ•˜์—ฌ ์ตœ์ข… ์„ฑ๋Šฅ ํ‰๊ฐ€

-> cross_val_score() ํ•จ์ˆ˜๋กœ ํด๋“œ ์„ธํŠธ ์ถ”์ถœ, ํ•™์Šต/์˜ˆ์ธก, ํ‰๊ฐ€๋ฅผ ํ•œ๋ฒˆ์— ์ˆ˜ํ–‰

from sklearn.tree import DecisionTreeClassifier
# cross_val_score
from sklearn.model_selection import cross_val_score , cross_validate
from sklearn.datasets import load_iris
import numpy as np

iris_data = load_iris()
dt_clf = DecisionTreeClassifier(random_state=156)

data = iris_data.data
label = iris_data.target

# The metric is accuracy, with 3 cross-validation folds
scores = cross_val_score(dt_clf , data , label , scoring='accuracy', cv=3)
print('Per-fold accuracy:',np.round(scores, 4))
print('Mean validation accuracy:', np.round(np.mean(scores), 4))

2. GridSearchCV

GridSearchCV - ๊ต์ฐจ ๊ฒ€์ฆ๊ณผ ์ตœ์  ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹์„ ํ•œ ๋ฒˆ์—

- ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ : ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ์ตœ๋Œ€๋กœ ๋Œ์–ด์˜ฌ๋ฆฌ๋Š” ํ•™์Šต ์กฐ๊ฑด
- ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹์˜ ์ค‘์š”์„ฑ : ํ•™์Šต ์กฐ๊ฑด์„ ์ž˜ ์„ค์ •ํ•ด์•ผ ์ตœ๋Œ€์˜ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ๋จธ์‹ ๋Ÿฌ๋‹์„ ์–ป์„ ์ˆ˜ ์žˆ์Œ

์‚ฌ์ดํ‚ท๋Ÿฐ์€ GridSearchCV๋ฅผ ์ด์šฉํ•ด Classifier๋‚˜ Regressor์™€ ๊ฐ™์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ์‚ฌ์šฉ๋˜๋Š” ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ž…๋ ฅํ•˜๋ฉด์„œ ํŽธ๋ฆฌํ•˜๊ฒŒ ์ตœ์ ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋„์ถœํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ์•ˆ์„ ์ œ๊ณต.

EX.
grid_parameters = {'max_depth': [1, 2, 3], min_samples_split': [2, 3]}

CV ์„ธํŠธ๊ฐ€ 3์ด๋ผ๋ฉด ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆœ์ฐจ ์ ์šฉ ํšŸ์ˆ˜ : 6 X CV์„ธํŠธ ์ˆ˜ : 3 = ํ•™์Šต/๊ฒ€์ฆ ์ด ์ˆ˜ํ–‰ ํšŸ์ˆ˜ : 18
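
The 6 combinations can be listed explicitly with sklearn's ParameterGrid; this small check is not part of the original walkthrough, just an illustration of where the 6 x 3 = 18 count comes from.

from sklearn.model_selection import ParameterGrid

grid_parameters = {'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]}

# 3 values of max_depth x 2 values of min_samples_split = 6 combinations;
# with cv=3, each combination is trained/validated 3 times -> 18 runs in total
for params in ParameterGrid(grid_parameters):
    print(params)
print('Number of combinations:', len(ParameterGrid(grid_parameters)))   # 6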

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# iris ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋“œ
iris = load_iris()

# Split the data into training/test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, 
                                                    test_size=0.2, random_state=121)

# ๋ชจ๋ธ ์ •์˜
dtree = DecisionTreeClassifier()

### Set the hyper-parameters as a dictionary
parameters = {'max_depth':[1, 2, 3], 'min_samples_split':[2,3]}
import pandas as pd

# param_grid์˜ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์„ 3๊ฐœ์˜ train, test set fold ๋กœ ๋‚˜๋ˆ„์–ด์„œ ํ…Œ์ŠคํŠธ ์ˆ˜ํ–‰ ์„ค์ •.  
grid_dtree = GridSearchCV(dtree, param_grid=parameters, cv=3, refit=True, return_train_score=True)
### refit=True ๊ฐ€ default : ๊ฐ€์žฅ ์ข‹์€ ํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •์œผ๋กœ ์žฌ ํ•™์Šต ์‹œํ‚ด.  

# ๋ถ“๊ฝƒ Train ๋ฐ์ดํ„ฐ๋กœ param_grid์˜ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์„ ์ˆœ์ฐจ์ ์œผ๋กœ ํ•™์Šต/ํ‰๊ฐ€ .
grid_dtree.fit(X_train, y_train)

# Inspect the full GridSearchCV results
grid_dtree.cv_results_

# The GridSearchCV results are stored in a dictionary called cv_results_
# Convert it to a DataFrame to inspect it
scores_df = pd.DataFrame(grid_dtree.cv_results_)
scores_df[['params', 'mean_test_score', 'rank_test_score', 
           'split0_test_score', 'split1_test_score', 'split2_test_score']]

-> ๊ฐ€์žฅ ์ข‹์€ hyper-parameter๋Š” {'max_depth': 3, 'min_samples_split': 3}

print('GridSearchCV best parameters:', grid_dtree.best_params_)
print('GridSearchCV best accuracy: {0:.4f}'.format(grid_dtree.best_score_))

# With refit=True, the GridSearchCV object keeps the fully trained Estimator after fit(), so predict() can be called on it directly.
pred = grid_dtree.predict(X_test)
print('Test data set accuracy: {0:.4f}'.format(accuracy_score(y_test, pred)))

# ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์˜ˆ์ธก ์ •ํ™•๋„ ํ™•์ธ
accuracy_score(y_test, pred)

Types of estimators

  1. Classification : DecisionTreeClassifier, RandomForestClassifier, ...
  2. Regression : LinearRegression, ...
# GridSearchCV's refit returns an estimator that has already been trained
# dtree = DecisionTreeClassifier() was declared above and passed to GridSearchCV, so:
estimator = grid_dtree.best_estimator_
estimator

# GridSearchCV์˜ best_estimator_๋Š” ์ด๋ฏธ ์ตœ์  ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ํ•™์Šต์ด ๋จ
pred = estimator.predict(X_test)
print('ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ ์ •ํ™•๋„: {0:.4f}'.format(accuracy_score(y_test, pred)))

Using GridSearchCV, cross-validation tunes the hyper-parameters for the best model performance, producing a model with high accuracy.
