[ML] Predicting Iris Species with a DecisionTree Model / Human Activity Recognition Dataset

Decision Tree

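✔️ Advantages

Easy to understand and intuitive.
Feature preprocessing such as scaling or normalization has little effect on the result.

✔️ Disadvantages

Overfitting degrades the algorithm's performance. To counter this, the tree size has to be limited in advance through tuning.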

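Decision Tree Parameters

min_samples_split : the minimum number of samples required to split a node; used to control overfitting
min_samples_leaf : the minimum number of samples required at a terminal (leaf) node
max_features : the maximum number of features considered when looking for the best split
max_depth : the maximum depth of the tree
max_leaf_nodes : the maximum number of terminal (leaf) nodes

A node with no child nodes is a leaf node. A leaf node is where the final class (label) value is decided.
- A node becomes a leaf when its data consists of a single class value, or when it meets a hyperparameter condition that makes it a leaf.
A branch node, which has child nodes, holds the split rule used to create those child nodes.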
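The hyperparameters and the leaf/branch distinction above can be checked directly on a fitted tree. The following is a minimal sketch, assuming the iris data and the hypothetical settings max_depth=2 and min_samples_leaf=5 (chosen only for illustration). It relies on scikit-learn marking a leaf by setting children_left to -1 in the fitted tree_ structure.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Constrain the tree with two of the hyperparameters listed above
# (the values 2 and 5 are illustrative assumptions, not tuned settings)
clf = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=0)
clf.fit(iris.data, iris.target)

# In the fitted tree_ structure a leaf has no children:
# children_left is -1 for leaf nodes
is_leaf = clf.tree_.children_left == -1
n_leaves = int(is_leaf.sum())
n_branches = int((~is_leaf).sum())

print("depth:", clf.get_depth())
print("leaf nodes:", n_leaves, "branch nodes:", n_branches)
```

Because every branch node in a binary tree has exactly two children, the number of branch nodes is always one less than the number of leaf nodes.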

1. Training a model on the iris data with a Decision Tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Create the DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(random_state=156)

# Load the iris data
iris = load_iris()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size=0.2, random_state=11)

# Train
dt_clf.fit(X_train, y_train)

1-2. Using Graphviz to see what the rule tree looks like

from sklearn.tree import export_graphviz

# Calling export_graphviz writes the tree.dot file given as out_file
export_graphviz(dt_clf, out_file="tree.dot", class_names=iris.target_names,
                feature_names=iris.feature_names, impurity=True, filled=True)

import graphviz
# Read the generated tree.dot file and render it with graphviz
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

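The visualization shows at a glance how the tree's branch nodes and terminal leaf nodes are arranged.
A feature condition such as petal length (cm) <= 2.45 is the rule used to create child nodes.
(A node without such a condition is a leaf node.)
gini is the Gini impurity of the data distribution given in value=[].
samples is the number of data points that fall under the current rule.
value=[] is the number of data points per class value. The iris dataset has the class values 0, 1, and 2,
which stand for 0: Setosa, 1: Versicolor, 2: Virginica.
- value=[41,40,39] means the node holds, in class order, 41 Setosa, 40 Versicolor, and 39 Virginica samples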
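When Graphviz is not installed, the same split rules can also be inspected as plain text. The following is a minimal sketch using scikit-learn's export_text; the variable names tree_clf and rules are introduced here for illustration, and the split and random_state values simply mirror the ones used in section 1.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11)

# Same settings as the classifier trained in section 1
tree_clf = DecisionTreeClassifier(random_state=156).fit(X_train, y_train)

# Each indented line is a split condition; lines starting with "class:" are leaf nodes
rules = export_text(tree_clf, feature_names=list(iris.feature_names))
print(rules)
```

The text output lists only the rules and the predicted class per leaf, so the gini, samples, and value fields are easier to read from the Graphviz picture.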

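Node 1

samples=120 means there are 120 data points in total
The 120 samples are distributed as value=[41,40,39], so the Gini impurity is 0.667
class = setosa means that at this node, which still has child nodes, setosa is the most frequent class with 41 samples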
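The 0.667 shown for node 1 can be reproduced by hand. The following is a minimal sketch of the Gini impurity formula, 1 minus the sum of squared class proportions; the helper name gini_impurity is hypothetical and is not part of scikit-learn.

```python
def gini_impurity(class_counts):
    """Gini impurity: 1 - sum(p_k ** 2), where p_k is the share of class k."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Root node of the iris tree: value=[41, 40, 39]
root_gini = gini_impurity([41, 40, 39])
print(round(root_gini, 3))  # 0.667

# A node that holds a single class is perfectly pure
pure_gini = gini_impurity([41, 0, 0])
print(pure_gini)  # 0.0
```

The closer the class counts are to each other, the higher the impurity; a node with only one class has impurity 0 and is not split further.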

1-3. Extracting per-feature importances from the decision tree

import seaborn as sns
import numpy as np
%matplotlib inline

# Extract the feature importances
print("Feature importance:\n{0}".format(np.round(dt_clf.feature_importances_,3)))

# Map each feature name to its importance
for name, value in zip(iris.feature_names, dt_clf.feature_importances_):
    print('{0}:{1:.3f}'.format(name,value))

# Plot the feature importances per column
sns.barplot(x=dt_clf.feature_importances_, y=iris.feature_names)

[out]

Feature importance:
[0.025 0. 0.555 0.42 ]
sepal length (cm):0.025
sepal width (cm):0.000
petal length (cm):0.555
petal width (cm):0.420

1-4. Decision tree overfitting

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
%matplotlib inline

plt.title("3 Class values with 2 Features Sample data creation")

# Generate classification sample data with 2 features (for 2D plotting) and 3 classes
X_features, y_labels = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Plot the two features on 2D coordinates; each class value gets a different color
plt.scatter(X_features[:,0], X_features[:,1],marker='o', c=y_labels, s=25, edgecolor='k')

1-5. Visualization

# Function that visualizes a classifier's decision boundary
def visualize_boundary(model, X, y):
    fig,ax = plt.subplots()
    
    # Show the training data as a scatter plot
    # (vmin/vmax fix the color range so points and regions share the same colors)
    ax.scatter(X[:, 0], X[:, 1], c=y, s=25, cmap='rainbow', edgecolor='k',
               vmin=y.min(), vmax=y.max(), zorder=3)
    ax.axis('tight')
    ax.axis('off')
    xlim_start , xlim_end = ax.get_xlim()
    ylim_start , ylim_end = ax.get_ylim()
    
    # Train the model on the training data passed in as arguments
    model.fit(X, y)
    # Predict on every coordinate of the meshgrid
    xx, yy = np.meshgrid(np.linspace(xlim_start,xlim_end, num=200),np.linspace(ylim_start,ylim_end, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    
    # Draw the class boundaries with contourf()
    # (vmin/vmax are used instead of clim, which contourf may not accept in some matplotlib versions)
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap='rainbow', vmin=y.min(), vmax=y.max(),
                           zorder=1)
                                                     
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(random_state=156).fit(X_features, y_labels)
visualize_boundary(dt_clf, X_features, y_labels)

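The very thin regions correspond to outliers. Because the tree keeps splitting until even these outliers are classified, it ends up with many decision boundaries.

# Train a Decision Tree with min_samples_leaf = 6 and visualize its decision boundary
dt_clf = DecisionTreeClassifier(min_samples_leaf=6).fit(X_features, y_labels)
visualize_boundary(dt_clf, X_features, y_labels)

Compared with the earlier run using default values, the tree reacts much less to outliers and classifies with more general rules.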

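Parameter tuning to reduce Decision Tree overfitting

Lower max_depth to limit the depth of the tree
Raise min_samples_split to increase the number of samples required to split a node
Raise min_samples_leaf to increase the number of samples required at a leaf node
Lower max_features to limit the number of features considered for each split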
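The effect of these settings can be seen by comparing training and test accuracy for an unconstrained tree and a constrained one. The following is a minimal sketch under stated assumptions: the synthetic data uses n_samples=500 and a 70/30 split, and the constrained tree uses max_depth=3 and min_samples_leaf=6, all chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 2-feature, 3-class data (n_samples=500 is an illustrative assumption)
X, y = make_classification(n_samples=500, n_features=2, n_redundant=0, n_informative=2,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Default tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=156).fit(X_tr, y_tr)
# Constrained tree: limited depth and a minimum number of samples per leaf
small_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=6,
                                    random_state=156).fit(X_tr, y_tr)

for name, model in [("default", full_tree), ("constrained", small_tree)]:
    print(name,
          "leaves:", model.get_n_leaves(),
          "train acc:", round(model.score(X_tr, y_tr), 4),
          "test acc:", round(model.score(X_te, y_te), 4))
```

The default tree memorizes the training set, so a large gap between its training and test accuracy is the sign of overfitting to look for; how much the constrained tree narrows that gap depends on the data.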

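2. Decision Tree practice - Human Activity Recognition Dataset

Human Activity Recognition Dataset

Data collected from 30 people wearing a smartphone with sensors, made up of many features describing their movements
The goal is to predict the activity from the collected feature set

Dataset description

features_info.txt and README.txt : brief descriptions of the dataset and its features
features.txt : the feature names
activity_labels.txt : descriptions of the activity label values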

2-1. Loading the feature names

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

# The data files are whitespace-delimited, so pass a whitespace regex as sep to read_csv
feature_name_df = pd.read_csv('./human_activity/features.txt', sep=r'\s+',
                                header=None, names=['column_index', 'column_name'])

# Convert the feature names to a list so they can be assigned as DataFrame columns
feature_name = feature_name_df.iloc[:, 1].values.tolist()
print('First 10 feature names:', feature_name[:10])

[out]

First 10 feature names: ['tBodyAcc-mean()-X', 'tBodyAcc-mean()-Y', 'tBodyAcc-mean()-Z', 'tBodyAcc-std()-X', 'tBodyAcc-std()-Y', 'tBodyAcc-std()-Z', 'tBodyAcc-mad()-X', 'tBodyAcc-mad()-Y', 'tBodyAcc-mad()-Z', 'tBodyAcc-max()-X']

2-2. Checking for duplicated feature names

feature_dup_df = feature_name_df.groupby('column_name').count()
print(feature_dup_df[feature_dup_df['column_index']>1].count())

feature_dup_df[feature_dup_df['column_index']>1].head()

[out]

column_index 42
dtype: int64

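2-3. Renaming duplicated feature names

# Rename duplicated feature names
def get_new_feature_name_df(old_feature_name_df):
    # cumcount() numbers repeated names 0, 1, 2, ... within each column_name group
    feature_dup_df = pd.DataFrame(data=old_feature_name_df.groupby('column_name').cumcount(),
                                  columns=['dup_cnt'])
    feature_dup_df = feature_dup_df.reset_index()
    
    # Both frames share the 'index' column, which the merge uses as its key
    new_feature_name_df = pd.merge(old_feature_name_df.reset_index(), feature_dup_df, how="outer")
    # Append _1, _2, ... to the second and later occurrences of a name
    # (label-based access avoids the deprecated positional x[0] / x[1] lookup)
    new_feature_name_df['column_name'] = new_feature_name_df[['column_name', 'dup_cnt']].apply(
        lambda x : x['column_name'] + '_' + str(x['dup_cnt'])
        if x['dup_cnt'] > 0 else x['column_name'], axis=1)
    new_feature_name_df = new_feature_name_df.drop(['index'], axis=1)
    return new_feature_name_df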
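The renaming idea is easier to follow on a tiny table. The following is a minimal sketch with a hypothetical four-row DataFrame (toy_df and its values are made up for illustration) that applies the same groupby().cumcount() numbering used by the function.

```python
import pandas as pd

# Hypothetical feature table in which the name 'a' appears three times
toy_df = pd.DataFrame({'column_index': [1, 2, 3, 4],
                       'column_name': ['a', 'b', 'a', 'a']})

# cumcount() gives 0 for the first occurrence of a name, then 1, 2, ...
dup_cnt = toy_df.groupby('column_name').cumcount()

# Keep the first occurrence as-is and suffix the later ones
new_names = [name if cnt == 0 else f"{name}_{cnt}"
             for name, cnt in zip(toy_df['column_name'], dup_cnt)]
print(new_names)  # ['a', 'b', 'a_1', 'a_2']
```

After this step every name is unique, which is what allows the list to be passed as names when the feature files are read.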

2-4. Defining a function that builds the dataset

# Function that builds the dataset
def get_human_dataset():
    
    # The data files are whitespace-delimited, so pass a whitespace regex as sep to read_csv
    feature_name_df = pd.read_csv('./human_activity/features.txt', sep=r'\s+', 
                                  header=None, names=['column_index', 'column_name'])
    # Rename duplicated feature names with get_new_feature_name_df()
    new_feature_name_df = get_new_feature_name_df(feature_name_df)
    
    # Convert the feature names to a list so they can be assigned as DataFrame columns
    feature_name = new_feature_name_df.iloc[:,1].values.tolist()
    
    # Load the training and test feature data as DataFrames, using feature_name as column names
    X_train = pd.read_csv('./human_activity/train/X_train.txt', sep=r'\s+', 
                          header=None, names=feature_name)
    X_test = pd.read_csv('./human_activity/test/X_test.txt', sep=r'\s+', 
                          header=None, names=feature_name)
    
    # Load the training and test labels as DataFrames with the column name 'action'
    y_train = pd.read_csv('./human_activity/train/y_train.txt', sep=r'\s+', 
                          header=None, names=['action'])
    y_test = pd.read_csv('./human_activity/test/y_test.txt', sep=r'\s+', 
                          header=None, names=['action'])
    # Return the loaded training/test DataFrames
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = get_human_dataset()

y_train['action'].value_counts()

[out]

6    1407
5    1374
4    1286
1    1226
2    1073
3     986
Name: action, dtype: int64

2-5. Predicting activities with DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Set random_state so every run of the example gives the same prediction result
dt_clf = DecisionTreeClassifier(random_state=156)

dt_clf.fit(X_train, y_train)
pred = dt_clf.predict(X_test)
accuracy = accuracy_score(y_test, pred)

print('Decision Tree prediction accuracy: {0:.4f}'.format(accuracy))

# Print the hyperparameters of the DecisionTreeClassifier
print('\nDecisionTreeClassifier default hyperparameters:\n', dt_clf.get_params())

[out]

Decision Tree prediction accuracy: 0.8548

DecisionTreeClassifier default hyperparameters:
{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': 156, 'splitter': 'best'}

The accuracy on the test set is about 85.48%.

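2-6. Checking how tree depth affects prediction accuracy

To classify, a decision tree keeps splitting until its nodes qualify as leaf nodes (the nodes where a class is decided), and the tree grows deeper as it does so.
Use GridSearchCV to vary max_depth, the scikit-learn hyperparameter that controls tree depth, over [6,8,10,12,16,20,24] and check the prediction performance for each value.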

from sklearn.model_selection import GridSearchCV

params = {'max_depth':[6,8,10,12,16,20,24],
          'min_samples_split':[16]} # min_samples_split is fixed at 16

grid_cv = GridSearchCV(dt_clf, param_grid=params, scoring='accuracy', cv=5, verbose=1) # 5-fold cross-validation
grid_cv.fit(X_train,y_train)
print('GridSearchCV best mean accuracy: {0:.4f}'.format(grid_cv.best_score_))
print('GridSearchCV best hyperparameters:', grid_cv.best_params_)

[out]

Fitting 5 folds for each of 7 candidates, totalling 35 fits
GridSearchCV best mean accuracy: 0.8549
GridSearchCV best hyperparameters: {'max_depth': 8, 'min_samples_split': 16}

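# Build a DataFrame from the cv_results_ attribute of the GridSearchCV object
cv_results_df = pd.DataFrame(grid_cv.cv_results_)

# Extract each max_depth value and the mean validation accuracy obtained with it
cv_results_df[['param_max_depth', 'mean_test_score']]

mean_test_score is highest at max_depth 8 (0.8549, the best score reported above), and accuracy declines as max_depth grows beyond that.
To fit ever more exact rules to the training set, a decision tree keeps splitting nodes, growing deeper and becoming a more complex model. A deeper tree may predict the training set correctly, but on the validation set it tends to perform worse because of overfitting.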
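The gap between training and validation accuracy can be read directly from cv_results_ when GridSearchCV is created with return_train_score=True. The following is a minimal sketch on synthetic data; the dataset from make_classification, the depth list, and the names demo_cv and demo_results are assumptions made for illustration and are not the human activity data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustrative assumption, not the human activity dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=5,
                                     n_classes=3, random_state=0)

depths = [2, 4, 8, 16]
demo_cv = GridSearchCV(DecisionTreeClassifier(random_state=156),
                       param_grid={'max_depth': depths},
                       scoring='accuracy', cv=5,
                       return_train_score=True)  # also record accuracy on the training folds
demo_cv.fit(X_demo, y_demo)

# mean_train_score: accuracy on the folds used for fitting
# mean_test_score: accuracy on the held-out validation fold
demo_results = pd.DataFrame(demo_cv.cv_results_)[
    ['param_max_depth', 'mean_train_score', 'mean_test_score']]
print(demo_results)
```

A training score that keeps rising with depth while the validation score stalls or falls is the overfitting pattern described above.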

max_depth = [6,8,10,12,16,20,24]

# Vary max_depth and measure prediction performance on the test set for each value
for depth in max_depth:
    dt_clf = DecisionTreeClassifier(max_depth=depth, min_samples_split=16, random_state=156)
    dt_clf.fit(X_train, y_train)
    pred = dt_clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    print('max_depth = {0} accuracy: {1:.4f}'.format(depth,accuracy))

[out]

max_depth = 6 accuracy: 0.8551
max_depth = 8 accuracy: 0.8717
max_depth = 10 accuracy: 0.8599
max_depth = 12 accuracy: 0.8571
max_depth = 16 accuracy: 0.8599
max_depth = 20 accuracy: 0.8565
max_depth = 24 accuracy: 0.8565

params = {'max_depth':[8,12,16,20],
         'min_samples_split':[16,24]}

grid_cv=GridSearchCV(dt_clf, param_grid=params, scoring='accuracy', cv=5, verbose=1)
grid_cv.fit(X_train,y_train)
print('GridSearchCV best mean accuracy: {0:.4f}'.format(grid_cv.best_score_))
print('GridSearchCV best hyperparameters:', grid_cv.best_params_)

[out]

Fitting 5 folds for each of 8 candidates, totalling 40 fits
GridSearchCV best mean accuracy: 0.8549
GridSearchCV best hyperparameters: {'max_depth': 8, 'min_samples_split': 16}

The best mean cross-validation accuracy, about 85.49%, is reached with max_depth 8 and min_samples_split 16.

best_df_clf = grid_cv.best_estimator_
pred1 = best_df_clf.predict(X_test)
accuracy = accuracy_score(y_test, pred1)
print('Decision tree prediction accuracy: {0:.4f}'.format(accuracy))

[out]

Decision tree prediction accuracy: 0.8717

import seaborn as sns

ftr_importance_values = best_df_clf.feature_importances_
# Convert to a Series so the importances can be sorted and plotted as a seaborn bar chart
ftr_importance = pd.Series(ftr_importance_values, index=X_train.columns)

# Sort the Series by importance and keep the top 20
ftr_top20 = ftr_importance.sort_values(ascending=False)[:20]
plt.figure(figsize=(8,6))
plt.title('Feature importances Top 20')
sns.barplot(x=ftr_top20, y=ftr_top20.index)
plt.show()

Among these, the five features with the highest importance have a very strong influence on how the rules are built.

License: Attribution, NonCommercial, NoDerivatives


데이터 분석가 샛별
