๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
study๐Ÿ“š/python

[python/ํŒŒ์ด์ฌ] ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ๊ฒฐํ•ฉ - ์ƒํ•˜ ๊ฒฐํ•ฉ, ์ขŒ์šฐ ๊ฒฐํ•ฉ

by ์Šค๋‹ 2022. 7. 28.

๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ๊ฒฐํ•ฉ

1. ์ƒํ•˜ ๊ฒฐํ•ฉ

  • pd.concat([df1, df2], axis = 0) : ๋™์ผํ•œ column ๊ธฐ์ค€์œผ๋กœ ์œ„/์•„๋ž˜๋กœ ํ•ฉ์น˜๊ธฐ, ํ–‰ ๊ธฐ์ค€์œผ๋กœ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ๊ฒฐํ•ฉ
import pandas as pd

df1 = pd.DataFrame({'A' : [1, 2, 3], 'B' : [11, 12, 13], 'C' : [21, 22, 23]})
df2 = pd.DataFrame({'A' : [4, 5, 6], 'B' : [14, 15, 16], 'C' : [24, 25, 26]})
pd.concat([df1, df2])

pd.concat([df2, df1])

#index ์ดˆ๊ธฐํ™”๋ฅผ ์œ„ํ•ด์„œ ignore_index = True
pd.concat([df1, df2], ignore_index = True

# ํ•„๋“œ์˜ ์ˆœ์„œ๊ฐ€ ์„ž์˜€์„ ๋•Œ ๊ฒฐํ•ฉ ๊ฒฐ๊ณผ ํ™•์ธ
df1 = pd.DataFrame({'A' : [1, 2, 3], 'B' : [11, 12, 13], 'C' : [21, 22, 23]})
df2 = pd.DataFrame({'B' : [14, 15, 16], 'A' : [4, 5, 6], 'C' : [24, 25, 26]})
df1

df2

pd.concat([df1, df2])

- ํ–‰ ๊ธฐ์ค€์œผ๋กœ ํ•ฉ์ณ์งˆ ๋•Œ column์ด ์ผ์น˜ํ•˜์ง€ ์•Š๋Š” ๊ณณ์€ NaN์œผ๋กœ ์ฑ„์›Œ์ง„๋‹ค.

# ์„œ๋กœ ๋‹ค๋ฅธ ํ•„๋“œ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋Š” ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์˜ ๊ฒฐํ•ฉ
df1 = pd.DataFrame({'A' : [1, 2, 3], 'B' : [11, 12, 13], 'C' : [21, 22, 23], 'D' : [31, 32, 33]})
df2 = pd.DataFrame({'A' : [3, 4, 5], 'B' : [13, 14, 15], 'C' : [23, 24, 25], 'E' : [41, 42, 43]})
df1

df2

pd.concat([df1, df2])

- pd.concat()ํ•จ์ˆ˜๋Š” default๋กœ outer๋ฅผ ๊ฐ€์ง„๋‹ค
์ด์–ด๋ถ™์ด๋Š” ๋ฐฉ์‹์„ join = outer๋Š” ํ•ฉ์ง‘ํ•ฉ, join = inner๋Š” ๊ต์ง‘ํ•ฉ์„ ์˜๋ฏธํ•œ๋‹ค

#outer๊ฐ€ default์ด๊ธฐ ๋•Œ๋ฌธ์— ์ƒ๋žต๊ฐ€๋Šฅ
pd.concat([df1, df2], join = 'outer')

# join = inner๋Š” ๊ต์ง‘ํ•ฉ, ์—ฌ๊ธฐ์—์„œ๋Š” df1, df2๊ฐ€ ๊ณตํ†ต์ ์œผ๋กœ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” A, B, C๋งŒ ๊ฒฐํ•ฉ๋˜์–ด ๋ณด์—ฌ์ง
pd.concat([df1, df2], join = 'inner')

2. ์ขŒ์šฐ ๊ฒฐํ•ฉ

1) concat()

  • pd.concat([df1, df2], axis = 1) : ์—ด ๊ธฐ์ค€. ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์ด ์ขŒ์šฐ, ์˜†์œผ๋กœ ๊ฒฐํ•ฉ
import pandas as pd

df1 = pd.DataFrame({'A' : [1, 2, 3], 'B' : [11, 12, 13], 'C' : [21, 22, 23], 'D' : [31, 32, 33]})
df2 = pd.DataFrame({'E' : [3, 4, 5], 'F' : [13, 14, 15], 'G' : [23, 24, 25], 'H' : [41, 42, 43]})
df1

df2

pd.concat([df1, df2], axis = 1)

-๋ฌธ์ œ : ๋‹ค์Œ ๋‘ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ๊ฒฐํ•ฉํ•˜๊ธฐ

df1 = pd.DataFrame({'ID' : [1, 2, 3], '์„ฑ๋ณ„' : ['F', 'M', 'F'], '๋‚˜์ด' : [20, 30, 40]})
df2 = pd.DataFrame({'ID' : [1, 2, 3], 'ํ‚ค' : [160.5, 170.3, 180.1], '๋ชธ๋ฌด๊ฒŒ' : [45.1, 50.3, 72.1]})
df1

df2

pd.concat([df1, df2], axis = 1)

-๋ฌธ์ œ : ๋‹ค์Œ ๋‘ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ID ๊ธฐ์ค€์œผ๋กœ ๊ฒฐํ•ฉํ•˜๊ธฐ

df1 = pd.DataFrame({'ID' : [1, 2, 3, 4, 5], '์„ฑ๋ณ„' : ['F', 'M', 'F', 'M', 'F'], '๋‚˜์ด' : [20, 30, 40, 25, 42]})
df2 = pd.DataFrame({'ID' : [3, 4, 5, 6, 7], 'ํ‚ค' : [160.5, 170.3, 180.1, 142.3, 153.7], '๋ชธ๋ฌด๊ฒŒ' : [45.1, 50.3, 72.1, 38,  42]})
df1

df2

pd.concat([df1, df2], axis = 1)

2) merge()

  • pd.merge(df_left, df_right, how='inner', on=None) : ๋‘ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ๊ฐ ๋ฐ์ดํ„ฐ์— ์กด์žฌํ•˜๋Š” ๊ณ ์œ ๊ฐ’(key)์„ ๊ธฐ์ค€์œผ๋กœ ๋ณ‘ํ•ฉํ• ๋•Œ ์‚ฌ์šฉ. ์•„๋ฌด ์˜ต์…˜์„ ์ ์šฉํ•˜์ง€ ์•Š์œผ๋ฉด, on=None ์ด๋ฏ€๋กœ ๋‘ ๋ฐ์ดํ„ฐ์˜ ๊ณตํ†ต ์—ด ์ด๋ฆ„์„ ๊ธฐ์ค€์œผ๋กœ inner(๊ต์ง‘ํ•ฉ)์กฐ์ธ์„ ํ•˜๊ฒŒ ๋จ
df1 = pd.DataFrame({'ID' : [1, 2, 3, 4, 5], '์„ฑ๋ณ„' : ['F', 'M', 'F', 'M', 'F'], '๋‚˜์ด' : [20, 30, 40, 25, 42]})
df2 = pd.DataFrame({'ID' : [3, 4, 5, 6, 7], 'ํ‚ค' : [160.5, 170.3, 180.1, 142.3, 153.7], '๋ชธ๋ฌด๊ฒŒ' : [45.1, 50.3, 72.1, 38,  42]})
df1

df2

-๋ฌธ์ œ : ์„ฑ๋ณ„๊ณผ ๋‚˜์ด๊ฐ€ ํ™•์ธ๋œ ์œ ์ €๋“ค์„ ๋Œ€์ƒ์œผ๋กœ ํ‚ค์™€ ๋ชธ๋ฌด๊ฒŒ์˜ ์ •๋ณด๋ฅผ ๊ฒฐํ•ฉํ•˜๊ธฐ

# ์™ผ์ชฝ์— ์ž…๋ ฅํ•œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ๊ธฐ์ค€(how='left')
pd.merge(df1, df2, how = 'left', on = 'ID)

-๋ฌธ์ œ : ํ‚ค์™€ ๋ชธ๋ฌด๊ฒŒ๊ฐ€ ํ™•์ธ๋œ ์œ ์ €๋“ค์„ ๋Œ€์ƒ์œผ๋กœ ์„ฑ๋ณ„๊ณผ ๋‚˜์ด์˜ ์ •๋ณด๋ฅผ ๊ฒฐํ•ฉํ•˜๊ธฐ

pd.merge(df2, df1, how = 'left', on = 'ID')

๋˜๋Š”

pd.merge(df1, df2, how = 'right', on = 'ID')

-๋ฌธ์ œ : ํ‚ค, ๋ชธ๋ฌด๊ฒŒ, ์„ฑ๋ณ„, ๋‚˜์ด ์ •๋ณด๊ฐ€ ๋ชจ๋‘ ํ™•์ธ๋œ ์œ ์ €๋“ค์˜ ์ •๋ณด๋ฅผ ์ถœ๋ ฅํ•˜๊ธฐ

pd.merge(df1, df2, how = 'inner', on = 'ID')

-๋ฌธ์ œ : ๋ชจ๋“  ์œ ์ €๋“ค์˜ ์ •๋ณด๋ฅผ ์ถœ๋ ฅํ•˜๊ธฐ

# outer ์˜ต์…˜์„ ์ค˜์„œ id๋ฅผ ๊ธฐ์ค€์œผ๋กœ ํ•ฉ์น˜๋ฉด ์–ด๋А ํ•œ์ชฝ์—๋ผ๋„ ์—†๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๋Š” ๊ฒฝ์šฐ NaN๊ฐ’์ด ์ง€์ •
pd.merge(df1, df2, how = 'outer', on = 'ID')

-๋ฌธ์ œ : ๋ชจ๋“  ์œ ์ €๋“ค์˜ ์ •๋ณด๋ฅผ ์ถœ๋ ฅํ•˜๊ธฐ

df1 = pd.DataFrame({'USER_ID' : [1, 2, 3, 4, 5], '์„ฑ๋ณ„' : ['F', 'M', 'F', 'M', 'F'], '๋‚˜์ด' : [20, 30, 40, 25, 42]})
df2 = pd.DataFrame({'ID' : [3, 4, 5, 6, 7], 'ํ‚ค' : [160.5, 170.3, 180.1, 142.3, 153.7], '๋ชธ๋ฌด๊ฒŒ' : [45.1, 50.3, 72.1, 38,  42]})
pd.merge(df1, df2, how = 'outer', left_on = 'USER_ID', right_on = 'ID')

-๋ฌธ์ œ : df1 ์€ ํšŒ์›์˜ ์ •๋ณด๋ฅผ ์ €์žฅํ•œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์ด๋ฉฐ, df2 ๋Š” ๊ฐ ํšŒ์›์˜ ๊ตฌ๋งค ๋‚ด์—ญ์„ ์ €์žฅํ•œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์ด๋‹ค. ๊ฐ ํšŒ์›์˜ ์ •๋ณด์™€ ๊ตฌ๋งค ๋‚ด์—ญ์„ ์ทจํ•ฉํ•˜์—ฌ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์œผ๋กœ ๋งŒ๋“ค๊ธฐ

df1 = pd.DataFrame({'ID' : [1, 2, 3, 4, 5], '๊ฐ€์ž…์ผ' : ['2021-01-02', '2021-01-04', '2021-01-10', '2021-02-10', '2021-02-24'], '์„ฑ๋ณ„' : ['F', 'M', 'F', 'M', 'M']})
df2 = pd.DataFrame({'๊ตฌ๋งค์ˆœ์„œ' : [1, 2, 3, 4, 5], 'ID' : [1, 1, 2, 4, 1], '๊ตฌ๋งค์›”' : [1, 1, 2, 2, 3], '๊ธˆ์•ก' : [1000, 1500, 2000, 3000, 4000]})
df1

df2

pd.merge(df1, df2, how = 'left', on = 'ID')

๋Œ“๊ธ€