WHY1862/level2-mrc-nlp-16


🚀 Open Domain Question Answering (ODQA)

📕 Project Overview

This project addresses retrieval-based Question Answering (QA). The goal is to develop a model that returns the answer to a given question, using data made up of question-context-answer triples. Such a model can offer an efficient way to access a vast body of knowledge. Refining the model, for example by improving retrieval across varied phrasings and by producing flexible answers through the reader, should make it possible to build search technology that goes beyond existing search engines.

👨‍👨‍👦‍👦 Members

강경준   김재겸   원호영   유선우
KKJ      KJK      WHY      YSW

⚖️ Roles

Member   Role
강경준   EDA, modeling, management of model experiment code, experiments to improve model performance
김재겸   EDA and data inspection, data preprocessing and augmentation experiments, model performance work (model search, parameter tuning), ensembling
원호영   EDA and data inspection, research and experiments on data augmentation and management of the related code, model search and experiments
유선우   EDA and data inspection, text cleaning, project structure management, model experiments and parameter tuning

💻 Development / Collaboration Environment

  • Computing environment
    • V100 server (accessed from VS Code over SSH)
  • Collaboration tools
    • notion github WandB
  • Communication
    • zoom

📑 Data Description

  • Data composition
    • train_dataset / test_dataset
      • question: question text
      • context: passage that contains the answer to the question
      • answer (train_dataset only)
        • answer index: index in the context at which the answer starts
        • answer text: the answer as it appears in the context, given as text
    • wikipedia_documents
      • collection of Wikipedia documents

🗂️ Project Structure

. level2-mrc-nlp-16
├─ .github
├─ data
│  ├─ embedding
│  │  ├─ context_sparse_embedding.bin
│  │  └─ context_dense_embedding.bin
│  ├─ test_dataset
│  └─ train_dataset
├─ data_modules
│  ├─ data_sets.py
│  └─ data_loaders.py
├─ model
│  ├─ loss.py
│  ├─ metric.py
│  └─ model.py
├─ utils
│  ├─ __init__.py
│  ├─ add_data.py
│  ├─ embedding.py
│  ├─ augmentation.py
│  ├─ augmentation_requirements.py
│  └─ util.py
├─ .flake8
├─ .gitignore
├─ .gitmessage.txt
├─ .pre-commit-config.yaml
├─ README.md
├─ config_reader.yaml
├─ config_retrieval.yaml
├─ context_dense_embedding.yaml
├─ context_sparse_embedding.yaml
├─ inference.py
├─ requirements.txt
├─ train_reader.py
├─ train_retrieval.py
└─ test.py

📖 Project Results

  • EDA

    • Unknown token analysis
      • Valid words that the tokenizer nevertheless fails to recognize
      • Examples
        • 없앴다는
        • 보살핌으로
        • 꾸밈이
        • 옻칠, 옻나무
        • 쨍그렁거릴
        • 슬펐다
      • Proposed handling
        • Analyze the unrecognized characters and add them to the vocabulary
        • Add suitable semantic units as tokens
      • Expected effect
        • A wider range of words becomes recognizable
        • If semantic units are added as tokens, inflected forms of a word with the same meaning may end up tokenized in different ways
    • Foreign names written in Hangul that contain characters rarely used in Korean
      • Examples
        • 벵골
        • 먀스코프스키
        • 듄
        • 베이욘
      • Proposed handling
        • Analyze the unrecognized characters and add them to the vocabulary
      • Expected effect
        • Many questions mention a specific person's name directly, so this is expected to help performance
    • Annotation Bias
      • Annotation bias was measured on tokenized text as the ratio of question tokens that also appear in the context
      • Summary for covering ratio
        Statistic            Value
        Mean                 0.70
        Standard deviation   0.11
        Minimum              0.10
        Maximum              1.00
        • The covering ratio is fairly high
        • Sparse embedding is therefore expected to perform well
      • Proposed handling
        • Diversify the wording through augmentation such as synonym replacement
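
The covering ratio described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: whitespace splitting stands in for the real tokenizer, the ratio is taken over distinct question tokens, and the population standard deviation is used. All three choices, and the function names, are assumptions.

```python
from statistics import mean, pstdev


def covering_ratio(question_tokens, context_tokens):
    """Fraction of distinct question tokens that also occur in the context."""
    question_vocab = set(question_tokens)
    if not question_vocab:
        return 0.0
    context_vocab = set(context_tokens)
    return len(question_vocab & context_vocab) / len(question_vocab)


def summarize(ratios):
    """Return the statistics that appear in the summary table."""
    return {
        "mean": mean(ratios),
        "std": pstdev(ratios),  # population std (assumption)
        "min": min(ratios),
        "max": max(ratios),
    }


# Toy pairs; str.split() is a stand-in for the real tokenizer (assumption).
pairs = [
    ("who wrote the book", "the book was written by an unknown author"),
    ("when was the city founded", "the city was founded in 1200"),
]
ratios = [covering_ratio(q.split(), c.split()) for q, c in pairs]
print(summarize(ratios))
```

A high ratio means questions reuse the wording of their passage, which is the property a lexical (sparse) retriever exploits.
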
  • Augmentation

    • Word order change (EDA, Easy Data Augmentation)
      • Description
        • Randomly change the order of words in a sentence to diversify the data
      • Examples
        • 대통령을 포함한 미국의 행정부 견제권을 갖는 국가 기관은? → 대통령을 포함한 미국의 행정부 견제권을 갖는 기관은? 국가 (source: "Which state institution has the power to check the US executive branch, including the president?")
        • 현대적 인사조직관리의 시발점이 된 책은? → 현대적 인사조직관리의 시발점이 책은? 된 (source: "Which book became the starting point of modern personnel and organizational management?")
        • 강희제가 1717년에 쓴 글은 누구를 위해 쓰여졌는가? → 강희제가 글은 쓴 1717년에 누구를 위해 쓰여졌는가? (source: "For whom was the text that the Kangxi Emperor wrote in 1717 written?")
      • Results
        • The augmented sentences differ little in meaning from the originals
        • Performance dropped when this data was used
        • It also adds nothing in terms of diversity of expression
        • Method      EM       F1
          Base        0.5708   0.6629
          Augmented   0.5541   0.6385
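
The word order change above resembles the random swap operation of EDA. A minimal sketch, assuming whitespace-separated words and a hypothetical `random_swap` helper (the repository's own implementation may differ):

```python
import random


def random_swap(sentence, n_swaps=1, seed=None):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    rng = random.Random(seed)
    words = sentence.split()
    if len(words) < 2:
        return sentence  # nothing to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


question = "현대적 인사조직관리의 시발점이 된 책은?"
print(random_swap(question, seed=0))
```

The output contains exactly the same words as the input, which is consistent with the observation that this method adds little diversity of expression.
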
    • Special symbol insertion (AEDA)
      • Description
        • Randomly insert punctuation marks ('.', ',', '!', '?', ';') into the text to diversify the data
      • Examples
        • 대통령을 포함한 미국의 행정부 견제권을 갖는 국가 기관은? → 대통령을 포함한 미국의 행정부 견제권을 . 갖는 국가 기관은? ; (source: "Which state institution has the power to check the US executive branch, including the president?")
        • 현대적 인사조직관리의 시발점이 된 책은? → "현대적 인사조직관리의 시발점이 된 , 책은? ; (source: "Which book became the starting point of modern personnel and organizational management?")
        • 강희제가 1717년에 쓴 글은 누구를 위해 쓰여졌는가? → "강희제가 , 1717년에 쓴 글은 , 누구를 위해 쓰여졌는가? ? (source: "For whom was the text that the Kangxi Emperor wrote in 1717 written?")
      • Results
        • EM rose slightly while F1 fell slightly
        • No meaningful difference was found
        • Method      EM       F1
          Base        0.5708   0.6629
          Augmented   0.5875   0.6599
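
Punctuation insertion of this kind can be sketched as below. The punctuation set is the one listed in the description; the `ratio` parameter (at most roughly that fraction of the word count is inserted) and the helper name `aeda` are assumptions made for illustration.

```python
import random

PUNCTUATION = [".", ",", "!", "?", ";"]  # the set listed in the description


def aeda(sentence, ratio=0.3, seed=None):
    """Insert random punctuation marks between the words of a sentence."""
    rng = random.Random(seed)
    words = sentence.split()
    if not words:
        return sentence
    # Insert between 1 and ratio * len(words) marks (assumed policy).
    n_insert = rng.randint(1, max(1, int(ratio * len(words))))
    for _ in range(n_insert):
        position = rng.randint(0, len(words))  # len(words) appends at the end
        words.insert(position, rng.choice(PUNCTUATION))
    return " ".join(words)


question = "강희제가 1717년에 쓴 글은 누구를 위해 쓰여졌는가?"
print(aeda(question, seed=1))
```

Unlike word swapping, the original word order is preserved; only standalone punctuation tokens are added.
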
    • Question generation
      • Description
        • Augment the data by generating text that continues a question from the given dataset
        • Generated with a language model (skt/kogpt2-base-v2)
      • Examples
        • 대통령을 포함한 미국의 행정부 견제권을 갖는 국가 기관은? → 대통령을 포함한 미국의 행정부 견제권을 갖는 국가 기관은?? 이런 문제를 해결해달라는 게 아니라 미국의 경제력을 어떻게 키워야 할 것인가? 이런 문제의 (continuation, roughly: "The point is not to ask for this problem to be solved, but how should the US build up its economic power? Of such problems")
        • 현대적 인사조직관리의 시발점이 된 책은? → 현대적 인사조직관리의 시발점이 된 책은?이다. 이런 책은 '외부에 의한 조직관리가 아니라 내부의 자발적 조직관리가 이루어져야 한다'는 것을 (continuation, roughly: "Such a book [holds] that 'organizations should be managed voluntarily from within, not from outside'")
        • 강희제가 1717년에 쓴 글은 누구를 위해 쓰여졌는가? → 강희제가 1717년에 쓴 글은 누구를 위해 쓰여졌는가?라는 질문으로 시작되었다. 아니면 누구를 위해 쓰여졌는가? 누가 누구에게 (continuation, roughly: "it began with that question. Or for whom was it written? Who, to whom")
      • Results
        • The generated sentences are of poor quality
        • Method      EM       F1
          Base        0.5708   0.6629
          Augmented   0.5875   0.6772
    • Back-translation
      • Description
        • Translate the Korean text into English and then back into Korean to diversify the text
      • Examples
        • 대통령을 포함한 미국의 행정부 견제권을 갖는 국가 기관은? → 어떤 국가기관이 대통령을 포함한 미국 행정부를 견제할 권리가 있는가? (result, roughly: "Which state institution has the right to check the US administration, including the president?")
        • 현대적 인사조직관리의 시발점이 된 책은? → 현대 인사 운영의 출발점이 어떤 책이 됐을까? (result, roughly: "Which book would have become the starting point of modern personnel operations?")
        • 강희제가 1717년에 쓴 글은 누구를 위해 쓰여졌는가? → 강희제가 누구를 위해 1717년에 썼나요? (result, roughly: "For whom did the Kangxi Emperor write in 1717?")
      • Results
        • The generated sentences are of good quality
        • It helps diversify the wording
        • Method      EM       F1
          Base        0.5708   0.6629
          Augmented   0.5678   0.6682
  • Modeling

    • chunking
      • Task description
        • Some contexts are so long that they exceed the model's input size limit
        • Long text is split into chunks, and results are derived per chunk
      • Chunking approaches
        • fixed length
          • Chunk the text into units of a fixed length
          • Set a stride so that consecutive chunks share an overlapping part of fixed length
        • truncation
          • Cut the text at a fixed length and run retrieval on it (only the first chunk is used)
          • Showed good performance
          • One explanation offered: many documents state their main point first, so the first chunk may be enough
        • summary
          • Rather than counting on documents that lead with the main point, a summary can be generated directly and used instead
          • Summaries were tried, but no immediate performance gain appeared
          • Further analysis was needed to resolve this, but generating summaries for the entire context dataset takes far too long, so the method was hard to apply
        • Aggregating per-chunk results
          • mean
            • Average the per-chunk embedding vectors of each text
          • max
            • Use the maximum similarity between each chunk embedding and the question embedding
            • This requires storing every chunk embedding for search → memory problems
        • Issues
          • When the number of chunks is very large
            • Problem
              • With a token length of 512, some documents produce up to about 80 chunks → OOM
              • OOM occurred even after splitting into chunks and setting the batch size to 1
              • The reader model has to use the whole context to find the answer, so documents cannot be cut at a fixed length
            • Solution
              • Fix the batch size at 1 and feed each chunk to the model individually, which avoids the OOM
              • This is the same principle as using mini-batches instead of computing the whole training set at once to avoid OOM
    • hybrid retrieval
      • Task description
        • Because the covering ratio between question and context wording is high, sparse embedding is expected to perform well
        • With suitable augmentation and additional data, dense embedding could improve generalization
      • Applying sparse embedding
        • Context lengths vary widely, so BM25 was used to account for this
        • parameter test
          K1    top-k match ratio
          0.5   0.8833
          0.8   0.9
          1     0.8958
          2     0.8833
          3     0.8708
          • K1 : bm25 parameter
          • top-k match ratio : ratio of cases in which the real context is among the k selected contexts
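
To make the role of K1 and the top-k match ratio concrete, here is a compact BM25 sketch. It is not the project's implementation: the IDF variant (the non-negative form with `+ 1` inside the logarithm), the default `b=0.75`, the whitespace tokenization, and the class and function names are all assumptions.

```python
import math
from collections import Counter


class BM25:
    def __init__(self, tokenized_docs, k1=0.8, b=0.75):
        self.k1, self.b = k1, b
        self.doc_freqs = [Counter(doc) for doc in tokenized_docs]
        self.doc_lens = [len(doc) for doc in tokenized_docs]
        self.avg_len = sum(self.doc_lens) / len(tokenized_docs)
        n_docs = len(tokenized_docs)
        # Number of documents containing each term.
        containing = Counter(term for doc in tokenized_docs for term in set(doc))
        self.idf = {
            term: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
            for term, n in containing.items()
        }

    def score(self, query_tokens, doc_index):
        freqs = self.doc_freqs[doc_index]
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[doc_index] / self.avg_len)
        total = 0.0
        for term in query_tokens:
            tf = freqs.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def top_k(self, query_tokens, k):
        scores = [self.score(query_tokens, i) for i in range(len(self.doc_lens))]
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def top_k_match_ratio(bm25, queries, gold_indices, k):
    """Share of queries whose gold document is among the k retrieved ones."""
    hits = sum(gold in bm25.top_k(q, k) for q, gold in zip(queries, gold_indices))
    return hits / len(queries)


docs = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "bm25 ranks documents by term overlap".split(),
]
bm25 = BM25(docs, k1=0.8)
queries = [["cat", "mat"], ["bm25", "documents"]]
print(top_k_match_ratio(bm25, queries, gold_indices=[0, 2], k=1))
```

A smaller K1 makes the term-frequency contribution saturate sooner, so repeated occurrences of a query term add less to the score.
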
    • concat retrieval
      • Task description
        • Concatenate the two texts and have the model return their similarity as its output
        • Attention between question and context becomes available, so better performance is expected
      • Issues
        • When the two texts are embedded separately, the context embeddings can be computed in advance and then used for search
        • The concat method, however, must be run against every context each time a new question arrives → memory and computation time problems
      • How it was used
        • Use it for reranking: select k1 documents with sparse embedding, then apply concat retrieval only to those documents to keep computation to a minimum
      • Results
        Method       top-k match ratio
        Not Concat   0.7625
        Concat       0.8875
        • top-k match ratio : ratio of cases in which the real context is among the k selected contexts
        • Not Concat uses the previously adopted weighted mean with the sparse embedding, so the sparse embedding influences even the final selection, which is expected to yield a higher score
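
The two-stage use of concat retrieval can be sketched as follows. Both scorers here are toy stand-ins chosen only so the example runs: `overlap_score` plays the role of the sparse first stage and `jaccard_score` the role of the concat model. These names, and `rerank` itself, are hypothetical.

```python
def rerank(question, contexts, first_stage_score, pair_score, k1=20, k=5):
    """Pick k1 candidates with a cheap scorer, then reorder them with a pair scorer."""
    candidates = sorted(
        range(len(contexts)),
        key=lambda i: first_stage_score(question, contexts[i]),
        reverse=True,
    )[:k1]
    # The expensive scorer only sees the k1 candidates, not the whole corpus.
    reranked = sorted(
        candidates,
        key=lambda i: pair_score(question, contexts[i]),
        reverse=True,
    )
    return reranked[:k]


def overlap_score(question, context):
    """Toy first stage: number of shared distinct words."""
    return len(set(question.split()) & set(context.split()))


pair_calls = []


def jaccard_score(question, context):
    """Toy second stage standing in for a model that scores the concatenated pair."""
    pair_calls.append(context)
    q, c = set(question.split()), set(context.split())
    return len(q & c) / len(q | c)


contexts = [
    "the weather is nice today",
    "the book was a long book about the sea",
    "an author wrote the book",
    "cats sleep all day",
]
best = rerank("who wrote the book", contexts, overlap_score, jaccard_score, k1=2, k=1)
print(best, len(pair_calls))
```

The pair scorer runs k1 times per question instead of once per context in the corpus, which is where the saving in computation comes from.
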
  • ๋ชจ๋ธ์„œ์น˜

    • ๊ณ ๋ ค ์‚ฌํ•ญ
      • retrieval์˜ ๊ฒฝ์šฐ embedding์„ ์ƒ์„ฑํ•˜๋Š” ๋ฌธ์ œ์ด๊ธฐ ๋•Œ๋ฌธ์—, encoder model ์œ„์ฃผ๋กœ search
      • reader์˜ ๊ฒฝ์šฐ extraction based MRC๋ฅผ ์ง„ํ–‰ํ•  ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— reader model ๋˜ํ•œ encoder model ์œ„์ฃผ๋กœ search
      • ๊ธด text ์ฒ˜๋ฆฌํ•ด์•ผ ํ•˜๋Š” ์ƒํ™ฉ์„ ๊ณ ๋ คํ•˜์—ฌ RoBERTa ๊ณ„์—ด ๋ชจ๋ธ ์œ„์ฃผ๋กœ ํ™œ์šฉ
        • BERT๋Š” ์•„๋ฌด ๋‘ ๋ฌธ์žฅ์„ ๋ถ™์—ฌ์„œ ํ•™์Šตํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ์•„์ฃผ ์งง์€ ์ผ€์ด์Šค๋„ ์กด์žฌ
        • RoBERTa๋Š” ํ† ํฐํ™”๋œ ๋ฌธ์žฅ ๊ธธ์ด๊ฐ€ 512๊ฐ€ ๋„˜์ง€ ์•Š๋Š” ์„ ์—์„œ ์ตœ๋Œ€ํ•œ ๋ฌธ์žฅ์„ ์ด์–ด ๋ถ™์—ฌ์„œ ํ•™์Šต
        • ๋”ฐ๋ผ์„œ, ์—ฌ๋Ÿฌ ๋ฌธ์žฅ์œผ๋กœ ์ด๋ฃจ์–ด์ง„ ๊ธด context๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” task์—์„œ ๋” ์ข‹์€ ์„ฑ๋Šฅ ์˜ˆ์ƒ

Running the code

train_retrieval.py / train_reader.py

wandb sweep config_retrieval.yaml  ## train the retrieval model
wandb sweep config_reader.yaml  ## train the reader model
wandb agent SWEEP_ID --count 5 ## SWEEP_ID: the sweep id returned by the command above   ## --count: number of runs to repeat

context_sparse_embedding.py

# -m : model name (the model name passed to AutoModel.from_pretrained())
# -k, -b, -e : bm25 parameter (optional, float)
python context_sparse_embedding.py -m jhgan/ko-sroberta-multitask

context_dense_embedding.py

# -mp : model path (model path in the artifact, underlined in red in the first image below)
# -mn : model name (model name in the artifact, underlined in red in the second image below)
# -b : batch size (optional, int)
python3 context_dense_embedding.py -mp [model path] -mn [model name]

test.py

# -rtmp : retrieval model path (model path in the artifact)
# -rtmn : retrieval model name (model name in the artifact)
# -rdmp : reader model path (model path in the artifact)
# -rdmn : reader model name (model name in the artifact)
# -k : number of selected contexts (optional, int)
# -w : weight for dense embedding in hybrid model (optional, float, 0~1)
python3 test.py -rtmp [retrieval model path] -rtmn [retrieval model name] -rdmp [reader model path] -rdmn [reader model name]

inference.py

python3 inference.py -rtmp [retrieval model path] -rtmn [retrieval model name] -rdmp [reader model path] -rdmn [reader model name]

About

level2-mrc-nlp-16 created by GitHub Classroom
