๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
study๐Ÿ“š/python

[python/ํŒŒ์ด์ฌ] ๊ทธ๋ฃนํ™” - groupby(), agg(), set_index(), reset_index()

by ์Šค๋‹ 2022. 7. 29.

๊ทธ๋ฃนํ™”

import pandas as pd

df1 = pd.DataFrame({'ID' : [1, 2, 3, 4, 5], '๊ฐ€์ž…์ผ' : ['2021-01-02', '2021-01-04', '2021-01-10', '2021-02-10', '2021-02-24'], '์„ฑ๋ณ„' : ['F', 'M', 'F', 'M', 'M']})
df2 = pd.DataFrame({'๊ตฌ๋งค์ˆœ์„œ' : [1, 2, 3, 4, 5], 'ID' : [1, 1, 2, 4, 1], '๊ตฌ๋งค์›”' : [1, 1, 2, 2, 3], '๊ธˆ์•ก' : [1000, 1500, 2000, 3000, 4000]})

-๋ฌธ์ œ : df1 ์€ ํšŒ์›์˜ ์ •๋ณด๋ฅผ ์ €์žฅํ•œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์ด๋ฉฐ, df2 ๋Š” ๊ฐ ํšŒ์›์˜ ๊ตฌ๋งค ๋‚ด์—ญ์„ ์ €์žฅํ•œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์ด๋‹ค. ๊ฐ ํšŒ์›์˜ ์ •๋ณด์™€ ๊ตฌ๋งค ๋‚ด์—ญ์„ ์ทจํ•ฉํ•˜์—ฌ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์œผ๋กœ ๋งŒ๋“ค๊ธฐ.

pd.merge(df1, df2, how = 'left', on = 'ID')

groupby()

  • groupby(๊ธฐ์ค€์ด ๋˜๋Š” ์—ด์ด๋ฆ„) : ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ๋ฅผ ๊ทธ๋ฃน ๋ณ„๋กœ ๊ตฌ๋ถ„ํ•˜์—ฌ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณด๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ๋˜๋Š” ํ•จ์ˆ˜

1) ์—ฌ๋Ÿฌ๊ฐ€์ง€ ํ•จ์ˆ˜๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์—ฐ์‚ฐ

๊ฐ๊ฐ์˜ column๋“ค์„ ๊ทธ๋ฃนํ•‘ํ•˜๊ณ  ๊ทธ๋ฃนํ•‘ํ•œ ๊ฐ์ฒด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์—ฐ์‚ฐ

1) count() : ๊ทธ๋ฃน ๋‚ด Non-NA ๊ฐฏ์ˆ˜
2) sum() : ๊ทธ๋ฃน ๋‚ด Non-NA๋“ค์˜ ํ•ฉ
3) min() : ์ตœ์†Ÿ๊ฐ’
4) max() : ์ตœ๋Œ“๊ฐ’
5) mean() : ํ‰๊ท ๊ฐ’
6) std() : ํ‘œ์ค€ํŽธ์ฐจ
7) var() : ๋ถ„์‚ฐ
8) first() : ๊ทธ๋ฃน ๋‚ด Non-NA ๊ฐ’ ์ค‘ ์ฒซ๋ฒˆ์งธ ๊ฐ’
9) last() : ๊ทธ๋ฃน ๋‚ด Non-NA ๊ฐ’ ์ค‘ ๋งˆ์ง€๋ง‰ ๊ฐ’
10) describe() : ๊ทธ๋ฃน์˜ ๊ธฐ์ˆ  ํ†ต๊ณ„๋Ÿ‰

-๋ฌธ์ œ : df1 ์€ ํšŒ์›์˜ ์ •๋ณด๋ฅผ ์ €์žฅํ•œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์ด๋ฉฐ, df2 ๋Š” ๊ฐ ํšŒ์›์˜ ๊ตฌ๋งค ๋‚ด์—ญ์„ ์ €์žฅํ•œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์ด๋‹ค. ๊ฐ ํšŒ์›์˜ ๋ˆ„์  ๊ธˆ์•ก์„ ํšŒ์› ID ๋ณ„๋กœ ๊ตฌํ•˜๊ธฐ

df2

df2.groupby(by = ['ID'])['๊ธˆ์•ก'].sum()

ID
1 6500
2 2000
4 3000
Name: ๊ธˆ์•ก, dtype: int64

type(df2.groupby(by = ['ID'])['๊ธˆ์•ก'].sum())

pandas.core.series.Series

s2 = df2.groupby(by = ['ID'])['๊ธˆ์•ก'].sum()
pd.merge(df1, s2, how = 'left', on = 'ID')

-๋ฌธ์ œ : df1 ์€ ํšŒ์›์˜ ์ •๋ณด๋ฅผ ์ €์žฅํ•œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์ด๋ฉฐ, df2 ๋Š” ๊ฐ ํšŒ์›์˜ ๊ตฌ๋งค ๋‚ด์—ญ์„ ์ €์žฅํ•œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์ด๋‹ค. ๊ฐ ํšŒ์›์˜ ์›”๋ณ„ ๋ˆ„์  ๊ธˆ์•ก์„ ํšŒ์› ID ๋ณ„๋กœ ๊ตฌํ•˜๊ธฐ.

df2

df2.groupby(by = ['ID', '๊ตฌ๋งค์›”'])['๊ธˆ์•ก'].sum()

ID ๊ตฌ๋งค์›”
1 1 2500
3 4000
2 2 2000
4 2 3000
Name: ๊ธˆ์•ก, dtype: int64

type(df2.groupby(by = ['ID', '๊ตฌ๋งค์›”'])['๊ธˆ์•ก'].sum())

pandas.core.series.Series

s2 = df2.groupby(by = ['ID', '๊ตฌ๋งค์›”'])['๊ธˆ์•ก'].sum()
s2.index

MultiIndex([(1, 1),
(1, 3),
(2, 2),
(4, 2)],
names=['ID', '๊ตฌ๋งค์›”'])

pd.merge(df1, s2, how = 'left', on = 'ID')

df3 = pd.DataFrame(S2)
df3

df3.index

MultiIndex([(1, 1),
(1, 3),
(2, 2),
(4, 2)],
names=['ID', '๊ตฌ๋งค์›”'])

pd.merge(df1, df3, how = 'left', on = 'ID')

#๊ทธ๋ฃน์„ index ๋กœ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์ง€ ์•Š์€ ๊ฒฝ์šฐ์—๋Š” as_index = False ๋กœ ์„ค์ •
df2.groupby(by = ['ID', '๊ตฌ๋งค์›”'], as_index = False)['๊ธˆ์•ก'].sum()

type(df2.groupby(by = ['ID', '๊ตฌ๋งค์›”'], as_index = False)['๊ธˆ์•ก'].sum())

pandas.core.frame.DataFrame

df3 = df2.groupby(by = ['ID', '๊ตฌ๋งค์›”'], as_index = False)['๊ธˆ์•ก'].sum()
pd.merge(df1, df3, how = 'left', on = 'ID')

2) ํ™œ์šฉ๋„๊ฐ€ ๋†’์€ ๊ณ ๊ธ‰ ๊ทธ๋ฃน์—ฐ์‚ฐ

  • agg() : ์—ฌ๋Ÿฌ๊ฐœ์˜ ํ•จ์ˆ˜๋ฅผ ์—ฌ๋Ÿฌ ์—ด์— ์ ์šฉ

๋ชจ๋“  ์—ด์— ์—ฌ๋Ÿฌ ํ•จ์ˆ˜๋ฅผ ๋งคํ•‘ : group๊ฐ์ฒด.agg([ํ•จ์ˆ˜1, ํ•จ์ˆ˜2, ํ•จ์ˆ˜3, ...])
๊ฐ ์—ด๋งˆ๋‹ค ๋‹ค๋ฅธ ํ•จ์ˆ˜๋ฅผ ๋งคํ•‘ : group๊ฐ์ฒด.agg({'์—ด1' : ํ•จ์ˆ˜1, '์—ด2' : ํ•จ์ˆ˜2, ...})

-๋ฌธ์ œ : df ๋Š” ๊ฐ ํšŒ์›์˜ ๊ตฌ๋งค ๋‚ด์—ญ์„ ์ €์žฅํ•œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์ด๋‹ค. ๊ฐ ํšŒ์›์˜ ๋ˆ„์  ๊ธˆ์•ก๊ณผ ๋ˆ„์  ๊ตฌ๋งค ํšŸ์ˆ˜๋ฅผ ํšŒ์› ID ๋ณ„๋กœ ๊ตฌํ•˜๊ธฐ

df = pd.DataFrame({'๊ตฌ๋งค์ˆœ์„œ' : [1, 2, 3, 4, 5], 'ID' : [1, 1, 2, 4, 1], '๊ตฌ๋งค์›”' : [1, 1, 2, 2, 3], '๊ธˆ์•ก' : [1000, 1500, 2000, 3000, 4000], '์ˆ˜์ˆ˜๋ฃŒ' : [100, 150, 200, 300, 400]})
df

df.groupby(by = ['ID'])['๊ธˆ์•ก'].agg([sum, len])

df.groupby(by = ['ID'], as_index = False)['๊ธˆ์•ก'].agg([sum, len])

  • index๋ฅผ ์ด์šฉํ•œ ๋ฐ์ดํ„ฐ ๋ถ„ํ• 
    • set_index() : column ๋ฐ์ดํ„ฐ๋ฅผ index๋ ˆ๋ฒจ๋กœ ๋ณ€๊ฒฝ
    • reset_index() : index ์ดˆ๊ธฐํ™” -> ์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” column์„ index๋ ˆ๋ฒจ๋กœ ๋ณ€๊ฒฝ
df2 = df.groupby(by = ['ID'])['๊ธˆ์•ก'].agg([sum, len])
df2.reset_index(inplace = True)
df2

-๋ฌธ์ œ : df ๋Š” ๊ฐ ํšŒ์›์˜ ๊ตฌ๋งค ๋‚ด์—ญ์„ ์ €์žฅํ•œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์ด๋‹ค. ๊ฐ ํšŒ์›์˜ ์ตœ๋Œ€ ์‚ฌ์šฉ ๊ธˆ์•ก / ์ตœ์†Œ ์‚ฌ์šฉ ๊ธˆ์•ก๊ณผ ์ตœ์ € ์ˆ˜์ˆ˜๋ฃŒ์˜ ๊ฐ’์„ ๊ตฌํ•˜๊ธฐ.

df

df.groupby(by = ['ID']).agg({'๊ธˆ์•ก' : [max, min], '์ˆ˜์ˆ˜๋ฃŒ' : min})

df2 = df.groupby(by = ['ID']).agg({'๊ธˆ์•ก' : [max, min], '์ˆ˜์ˆ˜๋ฃŒ' : min})
df2.reset_index()

df2.columns

MultiIndex([( '๊ธˆ์•ก', 'max'),
( '๊ธˆ์•ก', 'min'),
('์ˆ˜์ˆ˜๋ฃŒ', 'min')],
)

df2.columns.values

array([('๊ธˆ์•ก', 'max'), ('๊ธˆ์•ก', 'min'), ('์ˆ˜์ˆ˜๋ฃŒ', 'min')], dtype=object)

df2.columns = ['_'.join(col) for col in df2.columns.values]
df2

df2.reset_index

๋Œ“๊ธ€