
ROUGE score

by ์Šค๋‹ 2022. 10. 30.

ROUGE

Recall-Oriented Understudy for Gisting Evaluation

A performance metric for text summarization models. ROUGE is used to evaluate natural language generation models such as automatic text summarization and machine translation: it scores a model-generated summary or translation by comparing it against a reference written in advance by a human.

Suppose we have the following:

  • System summary (generated by the model) : the cat was found under the bed
  • Reference summary (gold standard, usually written by a human) : the cat was under the bed

The system summary generated by the model and the human-written reference summary share 6 overlapping words in total.
However, the raw count of 6 is not suitable for direct use as a performance metric, because it is not normalized by the length of either summary.
→ To obtain a value that can serve as a quantitative metric, we need to compute recall and precision.

ROUGE์—์„œ์˜ Precision๊ณผ Recall์˜ ์˜๋ฏธ

1) Recall

์ฐธ์กฐ ์š”์•ฝ๋ณธ์„ ๊ตฌ์„ฑํ•˜๋Š” ๋‹จ์–ด ์ค‘ ๋ช‡ ๊ฐœ์˜ ๋‹จ์–ด๊ฐ€ ์‹œ์Šคํ…œ ์š”์•ฝ๋ณธ์˜ ๋‹จ์–ด๋“ค๊ณผ ๊ฒน์น˜๋Š”์ง€ ๋ณด๋Š” ์ ์ˆ˜

unigram์„ ํ•˜๋‚˜์˜ ๋‹จ์–ด๋กœ ์‚ฌ์šฉํ•œ๋‹ค๊ณ  ํ•˜๋ฉด

$$
\frac{number\ of \ overlapped\ words}{Total\ words\ in\ reference\ summary}
$$

์•ž์˜ ์‚ดํŽด๋ณธ ์˜ˆ์ œ์—์„œ์˜ Recall ์ ์ˆ˜๋Š”

$$
Recall=\frac{6}{6}=1.0
$$

→ This means that every unigram in the reference summary appeared in the system summary generated by the model.
However, a high recall alone does not make a summary good: a model that outputs every word it knows would inevitably include the words of the reference summary as well.

To address this problem, precision also needs to be computed.

2) Precision

๋ชจ๋ธ์ด ์ƒ์„ฑํ•œ ์‹œ์Šคํ…œ ์š”์•ฝ๋ณธ ์ค‘ ์ฐธ์กฐ ์š”์•ฝ๋ณธ๊ณผ ๊ฒน์น˜๋Š” ๋‹จ์–ด๋“ค์ด ์–ผ๋งˆ๋‚˜ ๋งŽ์ด ์กด์žฌํ•˜๋Š”์ง€

$$
\frac{Number\ of\ overlapped\ words}{Total\ words\ in\ system\ summary}
$$

For the example above, this gives

$$
Precision=\frac{6}{7}=0.86
$$

For a more accurate evaluation of the model, it is best to compute both precision and recall and then report the F-measure, their harmonic mean. With a as precision and b as recall:

$$
H=\frac{2ab}{a+b}
$$
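The recall, precision, and F-measure above can be reproduced with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `rouge_1`, lowercasing, and whitespace tokenization are assumptions made for illustration, and both inputs are assumed to be non-empty.

```python
from collections import Counter


def rouge_1(system: str, reference: str) -> dict:
    """Unigram overlap scores (hypothetical helper; whitespace tokenization assumed)."""
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it occurs in both texts.
    overlap = sum((sys_counts & ref_counts).values())
    recall = overlap / sum(ref_counts.values())
    precision = overlap / sum(sys_counts.values())
    # F-measure as the harmonic mean 2ab / (a + b) of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"overlap": overlap, "recall": recall, "precision": precision, "f1": f1}


scores = rouge_1("the cat was found under the bed", "the cat was under the bed")
print(scores)
```

For the cat example this yields 6 overlapping words, a recall of 6/6, and a precision of 6/7, matching the hand calculation.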

ROUGE-N : ROUGE-1 and ROUGE-2

1) ROUGE-1

A metric based on the number of unigrams that overlap between the system summary and the reference summary.

2) ROUGE-2

A metric based on the number of bigrams that overlap between the system summary and the reference summary.

  • System summary : the cat was found under the bed
  • Reference summary : the cat was under the bed
  • System summary (bigrams)
    the cat, cat was, was found, found under, under the, the bed
  • Reference summary (bigrams)
    the cat, cat was, was under, under the, the bed

$$
ROUGE2_{recall}=\frac{4}{5}=0.8
$$

$$
ROUGE2_{precision}=\frac{4}{6}=0.67
$$
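The bigram counts above generalize to any n. The following is a small sketch under the same assumptions as before (whitespace tokenization, clipped counts); `ngrams` and `rouge_n` are illustrative names, not functions from an existing library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(system: str, reference: str, n: int):
    """Return (overlap, recall, precision) for n-gram overlap."""
    sys_counts = Counter(ngrams(system.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    overlap = sum((sys_counts & ref_counts).values())
    # max(..., 1) avoids division by zero when a text is shorter than n words.
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(sys_counts.values()), 1)
    return overlap, recall, precision


overlap, recall, precision = rouge_n(
    "the cat was found under the bed", "the cat was under the bed", n=2
)
print(overlap, recall, precision)
```

With n=2 the four shared bigrams are "the cat", "cat was", "under the", and "the bed", which gives the 4/5 recall and 4/6 precision shown above.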

Other ROUGE metrics

1) ROUGE-L

ROUGE-L uses the longest common subsequence (LCS) to measure the longest sequence of words that matches between the two summaries. The advantage of LCS is that, unlike ROUGE-2, it does not require the matched words to be consecutive: it counts matches that occur anywhere in the sequence as long as their order is preserved, which allows a more flexible comparison.

  • Reference : police killed the gunman
  • System-1 : police kill the gunman
  • System-2 : the gunman kill police
  • ROUGE-N : System-1 = System-2 (both match "police" and "the gunman")
  • ROUGE-L
    • System-1 = 3/4 ("police the gunman")
    • System-2 = 2/4 ("the gunman")
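The ROUGE-L values in this example can be checked with a standard dynamic-programming LCS over word tokens. The sketch below computes only the recall-style ratio used above (LCS length divided by reference length); `lcs_length` is an illustrative helper name.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                # Extend the best subsequence ending before both tokens.
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                # Otherwise carry over the better of skipping x or skipping y.
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


reference = "police killed the gunman".split()
system_1 = "police kill the gunman".split()
system_2 = "the gunman kill police".split()

recall_1 = lcs_length(reference, system_1) / len(reference)
recall_2 = lcs_length(reference, system_2) / len(reference)
print(recall_1, recall_2)
```

System-1 keeps "police ... the gunman" in the reference order, so its LCS has length 3, while System-2 can only match "the gunman", giving length 2.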

2) ROUGE-S

Given a window size, ROUGE-S pairs up the words that fall within that window and measures how many of those word pairs appear in both summaries. It is also known as skip-gram co-occurrence.

As in the skip-gram approach, it computes the recall of word pairs (skip-bigrams) that lie within a distance of at most two positions. Because the two words of a skip-gram do not have to be adjacent, the score is comparatively less sensitive to the distance between words.

  • Reference sentence : "류현진의 포심 패스트 볼은 빠르지 않지만 매우 정교하다." (Ryu Hyun-jin's four-seam fast ball is not fast, but very precise.)
  • Generated sentence : "류현진의 투심 패스트 볼은 느리지만 매우 정확하다." (Ryu Hyun-jin's two-seam fast ball is slow, but very accurate.)

$$
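\begin{aligned}
N_{\text{reference}} &= 7 \\
N_{\{(\text{Ryu Hyun-jin},\ \text{fast}),\ (\text{Ryu Hyun-jin},\ \text{ball}),\ (\text{fast},\ \text{ball}),\ (\text{ball},\ \text{very})\}} &= 4 \\
\text{ROUGE-S} &= \frac{4}{7}
\end{aligned}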
$$
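To make the skip-bigram idea concrete, here is a rough sketch of a skip-bigram recall. It deliberately uses a short English example instead of the Korean sentences above, because the counts there depend on how the Korean words are tokenized. The helper names and the `max_skip` parameter (the number of words allowed between the two words of a pair, `None` meaning unlimited) are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens, max_skip=None):
    """Count in-order word pairs with at most max_skip words between them."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(tokens), 2):
        if max_skip is None or j - i - 1 <= max_skip:
            pairs.append((a, b))
    return Counter(pairs)


def rouge_s_recall(system, reference, max_skip=None):
    """Share of the reference skip-bigrams that also occur in the system text."""
    sys_pairs = skip_bigrams(system, max_skip)
    ref_pairs = skip_bigrams(reference, max_skip)
    overlap = sum((sys_pairs & ref_pairs).values())
    return overlap / max(sum(ref_pairs.values()), 1)


reference = "police killed the gunman".split()
in_order = "police kill the gunman".split()
shuffled = "the gunman kill police".split()

print(rouge_s_recall(in_order, reference))   # 3 of 6 reference pairs match
print(rouge_s_recall(shuffled, reference))   # only ("the", "gunman") matches
print(rouge_s_recall(in_order, reference, max_skip=0))  # plain bigrams
```

Setting `max_skip=0` reduces the pairs to ordinary bigrams, which shows how ROUGE-S relaxes ROUGE-2 as the window grows.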

3) ROUGE-W

Weighted Longest Common Subsequence

ROUGE-W extends ROUGE-L by giving extra weight to consecutive matches.

$$
\begin{aligned}
X &= [\underline{A}\ \underline{B}\ \underline{C}\ \underline{D}\ E\ F\ G] \\
Y_1 &= [\underline{A}\ \underline{B}\ \underline{C}\ \underline{D}\ H\ I\ K] \\
Y_2 &= [\underline{A}\ H\ \underline{B}\ K\ \underline{C}\ I\ \underline{D}]
\end{aligned}
$$

ROUGE-L์˜ ๊ด€์ ์—์„œ๋Š” Y_1๊ณผ Y_2์˜ ๊ฒฐ๊ณผ๊ฐ€ ๊ฐ™์ง€๋งŒ,

ROUGE-W์˜ ๊ด€์ ์—์„œ๋Š” consecutive matches๋กœ ์ด๋ฃจ์–ด์ง„ ์˜ˆ์‹œ์ธ Y_1์ด ๋” ์šฐ์ˆ˜ํ•œ ๊ฒฐ๊ณผ

4) ROUGE-SU

Extension of ROUGE-S

ROUGE-S becomes 0 when none of the co-occurring word pairs overlap.

In the example below, however, only the word order has changed and the meaning is the same, yet ROUGE-S still drops to 0.

  • Reference sentence : 류현진이 공을 던졌다. (Ryu Hyun-jin threw the ball.)
  • Generated sentence : 던졌다 공을 류현진이 (the same three words in reverse order)

ROUGE-SU corrects for this by counting unigrams together with the skip-bigrams.

  • Reference sentence : ((Ryu Hyun-jin, ball), (Ryu Hyun-jin, threw), (ball, threw), Ryu Hyun-jin, ball, threw)
  • Generated sentence : ((threw, ball), (threw, Ryu Hyun-jin), (ball, Ryu Hyun-jin), Ryu Hyun-jin, ball, threw)

 
