[Python/Machine Learning] Predicting Insurance Premiums - Preprocessing

by 스닝 2022. 10. 21.

Predicting Insurance Premiums - Preprocessing

https://www.kaggle.com/datasets/mirichoi0218/insurance

# Required Python libraries
import pandas as pd
import numpy as np
import seaborn as sns
import missingno

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt 

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

[EDA checklist]

  1. What question are we trying to answer or disprove?
  2. Are there duplicate entries?
  3. What kinds of data are there, and how will we handle the different data types?
  4. Is anything missing from the data, and if so, how will we handle it?
  5. Where are the outliers? Are they data we should care about?
  6. Are the variables correlated?

  • Load the data
data = pd.read_csv("./insurance.csv")

1. What question are we trying to answer or disprove?

-> Build a model that predicts insurance premiums from the insurer's customer information

# Check the shape of the data (rows, columns)
print(data.shape)

# Look at the first 15 rows only
print(data.head(15))

2. Are there duplicate entries?

  • df.duplicated() : checks whether there are duplicated rows
# Count the duplicate entries
print("Number of duplicate entries:", data.duplicated().sum())

  • If there are duplicates, which occurrence do we keep, the first or the last? : keep = 'first', 'last', False
# Show the duplicate entries (keep=False marks every copy)
print(data[data.duplicated(keep=False)])

# Remove the duplicate entries, keeping the first occurrence
data.drop_duplicates(inplace=True, keep='first', ignore_index=True)
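
To make the three `keep` options concrete, here is a minimal sketch on a made-up frame. The `toy` DataFrame and its values are invented for illustration and are not taken from insurance.csv.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical
toy = pd.DataFrame({
    "age": [19, 18, 19],
    "region": ["southwest", "southeast", "southwest"],
})

# keep='first' flags every repeat except the first occurrence
first_mask = toy.duplicated(keep="first")
# keep='last' flags every repeat except the last occurrence
last_mask = toy.duplicated(keep="last")
# keep=False flags every row that has a duplicate somewhere
all_mask = toy.duplicated(keep=False)

print(first_mask.tolist())
print(last_mask.tolist())
print(all_mask.tolist())

# Dropping with keep='first' leaves one copy and renumbers the index
deduped = toy.drop_duplicates(keep="first", ignore_index=True)
print(deduped)
```

In short, `keep=False` is handy for inspecting duplicates, while `'first'` or `'last'` is usually what you want when actually dropping them.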

3. What kinds of data are there, and how will we handle the different data types?

# Check column names and dtype information
data.info()  # info() prints its report directly and returns None, so no print() is needed

# Count the columns per dtype
dtype_data = data.dtypes.reset_index()
dtype_data.columns = ["Count", "Column Type"]
dtype_data = dtype_data.groupby("Column Type").aggregate('count').reset_index()  # aggregate = agg

print(dtype_data)

  • nunique() : number of unique values
# Count the unique values in each categorical column
print(data.select_dtypes(include=['object','category']).nunique())
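
As a possible shortcut for the two checks above, the per-dtype column count can also be read off with `value_counts()`. This is only a sketch on an invented `toy` frame whose column names mimic the dataset; the values are made up.

```python
import pandas as pd

# Hypothetical toy data with two numeric and two text columns
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "smoker": ["yes", "no", "no"],
})

# Number of columns per dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Select every non-numeric column and count its unique values
categorical = toy.select_dtypes(exclude="number")
unique_counts = categorical.nunique()
print(unique_counts)
```

Selecting with `exclude="number"` is an assumption of this sketch: it avoids depending on whether the pandas version in use stores text columns as `object` or as a dedicated string dtype.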

sex and smoker, which each have only two categories, are encoded with LabelEncoder; region is encoded with OneHotEncoder.

Both LabelEncoder and OneHotEncoder come from sklearn.

## LabelEncoder : maps each category to a distinct integer
## Convert the sex and smoker columns to ndarrays for label encoding
sex = data.iloc[:,1:2].values
smoker = data.iloc[:,4:5].values

### Sex ###
# 1. Declare a LabelEncoder()
le = LabelEncoder()

# 2. Pass the sex column to LabelEncoder's fit_transform
sex[:,0] = le.fit_transform(sex[:,0])
sex = pd.DataFrame(sex)
sex.columns = ['sex']
print(sex)

# 3. Build a dict showing which label maps to which integer
le_sex_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print("Label Encoder result for sex:")
print(le_sex_mapping)
print(sex[:10])

The smoker column is encoded the same way as sex.

### Smoker ###
# 1. Declare a LabelEncoder()
le = LabelEncoder()

# 2. Pass the smoker column to LabelEncoder's fit_transform
smoker[:,0] = le.fit_transform(smoker[:,0])
smoker = pd.DataFrame(smoker)
smoker.columns = ['smoker']
print(smoker)

# 3. Build a dict showing which label maps to which integer
le_smoker_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print("Label Encoder result for smoker:")
print(le_smoker_mapping)
print(smoker[:10])
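
Since the two columns go through identical steps, the same idea can be sketched as a loop that passes each column to the encoder directly, without the ndarray conversion. The `toy` frame and the `encoded` and `mappings` names are invented for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the sex and smoker columns
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

encoded = toy.copy()
mappings = {}
for col in ["sex", "smoker"]:
    le = LabelEncoder()  # a fresh encoder per column
    encoded[col] = le.fit_transform(toy[col])
    # classes_ is sorted, and the position of a label is its integer code
    mappings[col] = {str(label): code for code, label in enumerate(le.classes_)}

print(mappings)
print(encoded)
```

Because LabelEncoder sorts the labels before numbering them, the codes follow alphabetical order rather than the order in which values appear in the data.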

## OneHot Encoder : maps each category to its own 0/1 column
## Convert the region column to an ndarray for one-hot encoding
region = data.iloc[:,5:6].values

### Region ###
# 1. Declare a OneHotEncoder()
ohe = OneHotEncoder()

# 2. Pass the region column to OneHotEncoder's fit_transform
region = ohe.fit_transform(region).toarray()  # the default output is sparse, so convert it to a dense array
region = pd.DataFrame(region)
region.columns = ['northeast', 'northwest', 'southeast', 'southwest']  # same order as the sorted ohe.categories_[0]
print("OneHot Encoder result for region:")
print(region[:10])
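
The column names above are typed by hand, which only works if they match the encoder's own ordering. A minimal sketch that reads the names from the fitted encoder instead is shown below; the `toy` frame and the `region_cols` and `region_df` names are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the region column
toy = pd.DataFrame({
    "region": ["southwest", "southeast", "northwest", "northeast", "southeast"],
})

ohe = OneHotEncoder()
# OneHotEncoder expects 2D input, hence the double brackets
onehot = ohe.fit_transform(toy[["region"]]).toarray()

# categories_[0] holds the sorted categories of the first (and only) input column
region_cols = [str(c) for c in ohe.categories_[0]]
region_df = pd.DataFrame(onehot, columns=region_cols, index=toy.index)
print(region_df)
```

Taking the names from `categories_` keeps the labels and the 0/1 columns aligned even if the set of regions changes.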

4. Is anything missing from the data, and if so, how will we handle it?

# Check how many NULL values each column contains
count_nan = data.isnull().sum()
print(count_nan[count_nan > 0])

# Visual check with the missingno package
missingno.matrix(data, figsize=(30,10))
plt.show()

# Visual check with a seaborn heatmap
sns.heatmap(data.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.show()

This dataset contains no NULL values, so there is nothing to replace.
If it did contain NULL values, they would typically be filled with the mean of each numeric column.
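
For reference, mean filling could look like the sketch below, using the SimpleImputer that is already imported at the top. The `toy` frame with missing values is invented, since the real dataset has none.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "bmi": [22.0, 30.0, np.nan],
})

# strategy="mean" replaces each NaN with the mean of its own column
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

The mean strategy only applies to numeric columns; for categorical columns a strategy such as `"most_frequent"` would be the usual substitute.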
