🚀 Data-Centric Topic Classification

📕 Project Overview

This project tackles a topic classification problem through a data-centric approach.

In industry, data is critically important, yet research on data receives far less attention than research on ML/DL models. Going against that trend, this project aims to maximize model performance purely through data-centric techniques, without modifying the model at all, and to explore a variety of ways to improve data quality along the way.

Such an approach can offer practical solutions for handling the messy, poorly curated data encountered in real-world settings.

📕 Project Summary

  • LM-based text denoising, driven by prompts that spell out detailed instructions for each denoising step
  • Correcting labeling errors with various ML models on top of text embeddings
  • Trying a range of augmentation methods, such as back translation, Mix-up, C-BERT, and random word shuffle, to compensate for the small training set

👨‍👨‍👦‍👦 Members

강경준 (KKJ) · 김재겸 (KJK) · 원호영 (WHY) · 유선우 (YSW)

⚖️ Roles

  • 강경준 : noisy text detection, text denoising, C-BERT, code management
  • 김재겸 : text denoising, noisy text detection, data relabeling, mix-up
  • 원호영 : augmentation, text denoising, back translation, random word shuffle
  • 유선우 : EDA, noise pattern analysis, text denoising

💻 Development & Collaboration Environment

  • Computing
    • V100 server (accessed over SSH from VS Code)
  • Collaboration
    • Notion, GitHub, WandB
  • Communication
    • Zoom

📑 Data Description

  • Dataset
    • Train / Test : 2,800 / 30,000 samples
    • Fields
      • ID : unique identifier for each sample
      • text : news headline
      • target : article topic, integer-encoded
        • One of seven topics: life & culture, sports, world, politics, economy, IT/science, society
    • Characteristics
      • Noise is present in both text and target
      • In text, some characters are replaced with other ASCII characters
      • Some target values are changed arbitrarily

🗂️ Project Structure

. level2-datacentric-nlp-16
├─ .github
├─ data
│  ├─ train.csv
│  └─ test.csv
├─ dataloader
│  └─ datasets.py
├─ augmentation
│  ├─ C-BERT.py
│  ├─ PororoBT.py
│  ├─ shuffle.py
│  └─ synonym_replacement.py
├─ prompts
│  ├─ prompt_gemma.py
│  └─ prompt_llama.py
├─ utils
│  ├─ clean_text.py
│  └─ util.py
├─ .flake8
├─ .gitignore
├─ .gitmessage.txt
├─ .pre-commit-config.yaml
├─ README.md
├─ requirements.txt
├─ baseline_code.ipynb
├─ clean.py
└─ label_corrector.py

🎥 How to Run

clean.py

# text denoising
## -s : seed number setting
## -m : huggingface model id
## -ku : korean ratio upper bound
## -kl : korean ratio lower bound
python3 clean.py -s 456 -m aifeifei798/Meta-Llama-3.1-8B-Instruct -ku 0.75 -kl 0.5

label_corrector.py

# label denoising
## -s : seed number setting
## -m : huggingface model id
## -mi : max iteration for logistic regression
## -k : number of folds for cross validation
python3 label_corrector.py -s 456 -m klue/roberta-base -mi 400 -k 5

C-BERT.py

# C-BERT augmentation
## -s : random seed number
## -m : huggingface model id
## -n : the number of labels to predict
## -k : the number of candidates for synonym replacements
## -e : epoch size
## -b : batch size
## -lr : learning rate for training C-BERT
## -w : weight decay for learning rate scheduler
python3 C-BERT.py -s 456 -m FacebookAI/xlm-roberta-large -n 7 -k 3 -e 10 -b 16 -lr 0.001 -w 0.0001

synonym_replacement.py

# synonym replacement augmentation
## -s : random seed number
## -m : huggingface model id
## -n : the number of labels to predict
## -k : the number of candidates for synonym replacements
python3 synonym_replacement.py -s 456 -m FacebookAI/xlm-roberta-large -n 7 -k 3

PororoBT.py

# back translation augmentation
## -i : input csv file name
## -o : output csv file name
python3 PororoBT.py -i train.csv -o output.csv

shuffle.py

# random word shuffle augmentation
## -i : input csv file name
## -o : output csv file name
python3 shuffle.py -i train.csv -o output.csv

📖 Project Results

  • Text Noise Detection

    • Goal
      • The data contains several kinds of text: text clean enough to leave untouched, text that can be repaired to some degree, and text too damaged to recover
      • We therefore need to define the degree of corruption and select which texts to repair
    • Korean-character ratio
      • The noise replaces Hangul characters with English letters, special characters, whitespace, digits, etc.
      • Heavily corrupted text is expected to have a low proportion of Hangul characters
      • We therefore split the data into normal, cleanable, and not-cleanable buckets based on the Hangul ratio (a minimal sketch follows the examples below)
      • normal data
        • Not corrupted at all, or mild enough that no repair is needed
        • Examples (Hangul + whitespace ratio ≥ 0.8)
      • 페이스북 인터넷 드론 아퀼라 실물 첫 시험비행 성공
      • 해외로밍 m금폭탄 n동차단 더 빨$진다
      • 땅 파= 코l나 격리시설 탈출한 외국인 청_서 VS
      • cleanable data
        • Somewhat corrupted, but the original can still be largely inferred
        • Examples (Hangul + whitespace ratio between 0.6 and 0.8)
      • m 김정) 자주통일 새,?r열1나가야1보
      • 코로나 r대^등교)모습
      • 문대통령 김정*m트/프7 YTD 조속히H끝내고 A다고!,p2합
      • not-cleanable data
        • So heavily corrupted that the original cannot be inferred
        • Examples (Hangul + whitespace ratio < 0.6)
      • E달A](j상ZwQ선 일*77아-는데… nfD편
      • .달 CES %굴#N바@은^새a|더o폰I중저o폰O rb
      • 여행^식e한$8수&mT30,_Y기! 사진# 이마진 프<스
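
The bucketing above is straightforward to script. Below is a minimal sketch, assuming the 0.8 / 0.6 cut-offs used in the examples (clean.py exposes the bounds as -ku / -kl) and the text column from the data description; the repository's own implementation may differ.

# Minimal sketch of Hangul-ratio noise bucketing (assumed cut-offs: 0.8 / 0.6)
import re
import pandas as pd

HANGUL_OR_SPACE = re.compile(r"[가-힣\s]")

def hangul_ratio(text: str) -> float:
    # fraction of characters that are Hangul syllables or whitespace
    if not text:
        return 0.0
    return sum(bool(HANGUL_OR_SPACE.match(ch)) for ch in text) / len(text)

def bucket(text: str, upper: float = 0.8, lower: float = 0.6) -> str:
    r = hangul_ratio(text)
    if r >= upper:
        return "normal"         # clean enough to leave untouched
    if r >= lower:
        return "cleanable"      # worth sending to the LLM denoiser
    return "not-cleanable"      # too damaged to restore

df = pd.read_csv("data/train.csv")
df["noise_bucket"] = df["text"].apply(bucket)
print(df["noise_bucket"].value_counts())
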
  • Text Denoising

    • Goal
      • Analyze the noise rules applied to the text and restore the original form based on them
      • Since no clean reference text is available for training, we rely on prompt engineering with LLMs
    • Model selection
      • Criteria
        • Restoring noise-corrupted sentences demands a high level of reasoning ability
        • However, the reasoning ability we can expect is limited by our restricted computing resources
        • We therefore chose models that deliver the most capability at a small size
      • Llama
        • Following scaling laws, trained on a dataset size matched to the model size, achieving high efficiency within a limited model size
        • Used the Llama 8B model
      • Gemma
        • In addition to a scaling-law-appropriate dataset size, knowledge distillation transfers a larger model's knowledge into a small model, yielding strong performance at a small size
        • Used the Gemma 9B model
    • Prompt Engineering
      • Few-shot
        • Used the Llama 8B model
        • The prompt consists only of example question-answer pairs for sentence restoration
        • Results
          • Llama model
          • Accuracy : 0.7205 / F1-score : 0.7041
          • Weaker performance than the CoT-based prompt
          • Mostly made arbitrary edits
          • Sometimes even introduced new damage into lightly corrupted text
      • Chain of Thought
        • The instructions break sentence restoration into detailed steps and rules
        • Llama model base
          • Used the Llama 8B model
          • Prompt built in four parts: noise-pattern description, restoration rules, examples, and the restoration request
          • Results
            • Restored sentences were noticeably more coherent
            • Somewhat arbitrary edits still occurred occasionally
            • Accuracy : 0.8277 / F1-score : 0.8247
        • Gemma model base
          • Used the Gemma 9B model
          • Prompt built in four parts: restoration conditions, restoration steps, restoration examples, and the restoration request
          • Results
            • Adding an explicit restoration-steps part improved how well the restoration conditions were respected, compared with a three-part prompt of conditions, examples, and request
            • Giving several examples instead of a single one helped the model restore diverse cases more consistently
            • Accuracy : 0.8293 / F1-score : 0.8257
      • Conclusion
        • Explaining the process step by step helps the model more than simply stringing examples together
        • However, even explicit rules are not followed perfectly, so it is best to also provide suitable examples of each rule (a prompt-construction sketch follows below)
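
As a rough illustration of the four-part CoT prompt, the sketch below assembles one prompt and runs it through a chat-style text-generation pipeline (assuming a recent transformers version that accepts chat messages directly). The model id matches the clean.py example, but the rule wording and the example pair are invented for illustration; the real prompts live in prompts/prompt_llama.py and prompts/prompt_gemma.py.

# Simplified sketch of the four-part CoT denoising prompt (pattern, rules, example, request).
# The rule wording and the example pair below are illustrative, not copied from the repo.
from transformers import pipeline

MODEL_ID = "aifeifei798/Meta-Llama-3.1-8B-Instruct"   # same id as the clean.py example

SYSTEM = (
    "1. Noise pattern: some Hangul characters in a Korean news headline were replaced "
    "with ASCII letters, digits, or symbols.\n"
    "2. Restoration rules: restore only the corrupted characters, keep intact words "
    "unchanged, and answer with the restored headline only.\n"
    "3. Example: '코l나 r대^등교)모습' -> '코로나 시대 등교 모습'\n"
    "4. Now restore the headline provided by the user."
)

generator = pipeline("text-generation", model=MODEL_ID, device_map="auto")

def denoise(noisy: str) -> str:
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": noisy},
    ]
    out = generator(messages, max_new_tokens=64, do_sample=False)
    return out[0]["generated_text"][-1]["content"]

print(denoise("해외로밍 m금폭탄 n동차단 더 빨$진다"))
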
  • Label Denoising

    • Denoising Tool
      • CleanLab
        • Perform classification by adding a linear layer on top of a DL-based text-embedding model and training that layer
          • For simpler training code and better compatibility with CleanLab, the linear layer was replaced with logistic regression
          • Extended beyond logistic regression to SVM, Random Forest, and other ML models, plus an ensemble
        • Set a threshold on each label's predicted probability; a sample falling below the threshold is judged to be a labeling error
          • The threshold is typically the mean predicted probability over the samples predicted as that label
          • To prevent over-correcting, we also tried adjusting the threshold to match the estimated number of mislabeled samples (a sketch follows the results below)
      • Results
        • logistic regression
          • Accuracy : 0.8237 / F1-score : 0.8218
        • ensemble
          • Models : logistic regression, random forest, SVM
          • Accuracy : 0.8266 / F1-score : 0.8241
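
A condensed sketch of this pipeline (sentence embeddings, out-of-fold probabilities from logistic regression, CleanLab filtering, and the per-label mean-probability threshold) is shown below. The sentence-embedding model is an assumption; label_corrector.py is the actual implementation and is pointed at klue/roberta-base in the run example.

# Sketch of label denoising: embeddings -> out-of-fold predicted probabilities ->
# CleanLab filtering and a per-label mean-probability threshold.
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

df = pd.read_csv("data/train.csv")
texts, labels = df["text"].tolist(), df["target"].to_numpy()

embedder = SentenceTransformer("jhgan/ko-sroberta-multitask")   # assumed Korean encoder
X = embedder.encode(texts)

clf = LogisticRegression(max_iter=400)
pred_probs = cross_val_predict(clf, X, labels, cv=5, method="predict_proba")  # out-of-fold

# CleanLab flags likely label errors from the given labels and predicted probabilities
issues = find_label_issues(labels=labels, pred_probs=pred_probs,
                           return_indices_ranked_by="self_confidence")
print(f"{len(issues)} suspected label errors (CleanLab)")

# Threshold variant: per class, take the mean probability over samples predicted as that
# class; a sample whose probability for its own label falls below that mean is suspect.
pred = pred_probs.argmax(axis=1)
class_mean = np.array([pred_probs[pred == c, c].mean() for c in range(pred_probs.shape[1])])
suspect = pred_probs[np.arange(len(labels)), labels] < class_mean[labels]
df.loc[suspect, "target"] = pred[suspect]      # relabel suspected samples
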
  • Augmentation

    • C-BERT
      • Synonym-replacement augmentation using a BERT-based masked LM
      • A label embedding is added to the BERT input so the augmentation reflects label information (see the sketch below)
      • Results
        • Noise in the training data made it hard to learn useful label embeddings
        • The model sometimes predicted special characters for the masked tokens
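
For reference, the sketch below shows the core C-BERT trick of turning BERT's segment (token_type) embedding table into a label embedding table. It assumes klue/bert-base for simplicity (the run example above uses FacebookAI/xlm-roberta-large, whose embedding layout differs) and omits the MLM fine-tuning step.

# Sketch of the C-BERT idea: reuse the segment (token_type) embeddings as label
# embeddings, then predict masked tokens conditioned on the label.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

NUM_LABELS = 7
tok = BertTokenizerFast.from_pretrained("klue/bert-base")      # assumed base model
model = BertForMaskedLM.from_pretrained("klue/bert-base")

# enlarge the 2-row segment embedding table to one row per class label
old = model.bert.embeddings.token_type_embeddings
new = torch.nn.Embedding(NUM_LABELS, old.embedding_dim)
new.weight.data[:2] = old.weight.data          # warm-start from the pretrained rows
model.bert.embeddings.token_type_embeddings = new
model.config.type_vocab_size = NUM_LABELS

# label-conditioned masked prediction for one hypothetical sample (fine-tuning omitted)
text, label = "정부 내년 예산안 국회 제출", 4
enc = tok(text, return_tensors="pt")
enc["token_type_ids"] = torch.full_like(enc["input_ids"], label)
enc["input_ids"][0, 2] = tok.mask_token_id     # mask one token for illustration
out = model(**enc)
print(tok.decode([out.logits[0, 2].argmax(-1).item()]))
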
    • Synonym Replacement
      • Synonym-replacement augmentation using a masked LM (see the sketch below)
      • Results
          Accuracy / F1-score
          Before augmentation : 0.8275 / 0.8246
          After augmentation  : 0.8297 / 0.8269
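
A minimal sketch of the masked-LM replacement step, reusing the model id and k from the synonym_replacement.py run example; the script's own selection logic may differ.

# Sketch of masked-LM synonym replacement: mask one word and take the top-k
# fill-mask candidates as augmented variants.
import random
from transformers import pipeline

fill = pipeline("fill-mask", model="FacebookAI/xlm-roberta-large")

def synonym_replace(text: str, k: int = 3) -> list[str]:
    words = text.split()
    i = random.randrange(len(words))       # pick one word to replace
    masked = " ".join(words[:i] + [fill.tokenizer.mask_token] + words[i + 1:])
    return [cand["sequence"] for cand in fill(masked, top_k=k)]

print(synonym_replace("정부 내년 예산안 국회 제출", k=3))   # hypothetical headline
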
    • Back Translation
      • Translate Korean to English and back to Korean, exploiting the variation in expression that translation introduces (see the sketch below)
      • When using a translation model, the quality of the translation itself matters a great deal
      • We used the translation model of Kakao Brain's Pororo NLP framework, which is specialized for Korean and performs well on Korean translation
      • Results
          Accuracy / F1-score
          Before augmentation : 0.7584 / 0.7546
          After augmentation  : 0.7917 / 0.7836
        • In translation, the text drifts away from the terse style typical of news headlines
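
A minimal sketch of the ko → en → ko round trip; the call signature follows the Pororo README, and PororoBT.py itself operates on whole CSV files as shown in the run example.

# Sketch of Korean -> English -> Korean back translation with Pororo
from pororo import Pororo

mt = Pororo(task="translation", lang="multi")

def back_translate(text: str) -> str:
    en = mt(text, src="ko", tgt="en")      # Korean -> English
    return mt(en, src="en", tgt="ko")      # English -> Korean, with new phrasing

print(back_translate("정부 내년 예산안 국회 제출"))   # hypothetical headline
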
    • Random Word Shuffle
      • Augment by randomly reordering the words in each text, excluding particles (see the sketch below)
      • Results
          Accuracy / F1-score
          Before augmentation : 0.8110 / 0.8050
          After augmentation  : 0.8084 / 0.8021
        • Random reordering limits how well sentence coherence is preserved
        • Compared with the original phrasing, it did not add much diversity
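
A simplified sketch of the shuffle: it reorders whitespace-separated words only and does not attempt the particle handling that shuffle.py is described as doing.

# Simplified sketch of random word shuffling (particle handling omitted)
import random

def shuffle_words(text: str, seed: int | None = None) -> str:
    rng = random.Random(seed)
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)

print(shuffle_words("정부 내년 예산안 국회 제출", seed=456))   # hypothetical headline
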
    • Mix-Up
      • Augment by generating linearly interpolated samples in text-embedding space (see the sketch below)
      • Mixing samples across different labels makes label assignment problematic, so mix-up is applied within each label
      • Results
          Accuracy / F1-score
          Before augmentation : 0.8266 / 0.8241
          After augmentation  : 0.8185 / 0.8173
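
A minimal sketch of within-label mix-up on sentence embeddings; the embedding model is an assumption, and the interpolated vectors would feed a classifier trained on embeddings rather than on raw text.

# Sketch of within-label mix-up: linearly interpolate two sentence embeddings that
# share the same label.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("jhgan/ko-sroberta-multitask")   # assumed Korean encoder

def mixup_same_label(texts: list[str], alpha: float = 0.2) -> np.ndarray:
    # texts: samples that all share one label; returns one interpolated embedding
    emb = embedder.encode(texts)
    i, j = np.random.choice(len(texts), size=2, replace=False)
    lam = np.random.beta(alpha, alpha)
    return lam * emb[i] + (1.0 - lam) * emb[j]

vec = mixup_same_label(["정부 내년 예산안 국회 제출", "추경 예산 국회 통과"])   # same-label pair
print(vec.shape)
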
  • Final Submission

    • To guard against originals being damaged during text denoising, the denoised text is added to the original data as augmentation rather than replacing it
    • Since the Llama and Gemma models produce different denoising results, both outputs were used
    • Back translation and synonym replacement, which had shown additional gains, were also applied
    • Results
      • Accuracy : 0.8401 / F1-score : 0.8365
