[Paper] Attention Is All You Need _ Understanding the Transformer (Overview + Architecture) _ Part 1

HotSky92 2025. 3. 24. 13:05

🧠 Transformer Overview

The Transformer is a model first introduced in the 2017 paper *Attention Is All You Need* by Vaswani et al.

It was proposed to overcome the limitations of the RNN- and CNN-based sequence-to-sequence models that had been mainstream in natural language processing (NLP).

 

๊ธฐ์กด์˜ RNN ๊ธฐ๋ฐ˜ ๋ชจ๋ธ(์˜ˆ: LSTM, GRU ๋“ฑ)์€ ๋ฌธ์žฅ์„ ๊ตฌ์„ฑํ•˜๋Š” ๋‹จ์–ด๋“ค์„

์ˆœ์„œ๋Œ€๋กœ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๊ฐ€ ์–ด๋ ต๊ณ , ๋ฌธ์žฅ์˜ ๊ธธ์ด๊ฐ€ ๊ธธ์–ด์งˆ์ˆ˜๋ก

์•ž ๋‹จ์–ด์˜ ์ •๋ณด๊ฐ€ ๋’ค๋กœ ๊ฐˆ์ˆ˜๋ก ํฌ๋ฏธํ•ด์ง€๋Š” ์žฅ๊ธฐ ์˜์กด์„ฑ(long-range dependency) ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ–ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด "๋‚˜๋Š” ์–ด๋ฆด ์  ๊ฟˆ์ด ๋ญ์˜€๋ƒ๋ฉด..." ๊ฐ™์€ ๊ธด ๋ฌธ์žฅ์—์„œ,
๋งˆ์ง€๋ง‰ ๋‹จ์–ด๋ฅผ ์ƒ์„ฑํ•  ๋•Œ ์ฒ˜์Œ ๋งํ•œ "๋‚˜๋Š”"์ด ์ž˜ ๊ธฐ์–ต๋˜์ง€ ์•Š๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค.

 

CNN-based models, on the other hand, can be parallelized, but the fixed kernel size limits the range of words they can see at once (the receptive field). This makes it hard to learn relationships between words that are far apart.

For example, when "나는" ("I") and "갔다" ("went") are far apart, a CNN has to stack several layers before it can relate the two words. The Transformer was proposed to resolve exactly these limitations.


Transformer๋Š” ๊ธฐ์กด์˜ recurrence(๋ฐ˜๋ณต ๊ตฌ์กฐ)์™€ convolution(ํ•ฉ์„ฑ๊ณฑ ์—ฐ์‚ฐ)์„ ๋ชจ๋‘ ์ œ๊ฑฐํ•˜๊ณ ,

๋Œ€์‹  Self-Attention ๋ฉ”์ปค๋‹ˆ์ฆ˜๋งŒ์œผ๋กœ ๋ฌธ์žฅ์˜ ๋ชจ๋“  ๋‹จ์–ด ๊ฐ„ ๊ด€๊ณ„๋ฅผ ํ•œ ๋ฒˆ์— ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜์˜€๋‹ค.

์ด ๋•๋ถ„์— ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๊ฐ€ ๊ฐ€๋Šฅํ•ด์ง€๊ณ , ๋ฉ€๋ฆฌ ๋–จ์–ด์ง„ ๋‹จ์–ด ๊ฐ„ ์˜์กด์„ฑ๋„ ๋‹จ ํ•œ ๋ฒˆ์˜ ์—ฐ์‚ฐ์œผ๋กœ

๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์–ด ํ•™์Šต ์†๋„์™€ ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ํ–ฅ์ƒ๋˜์—ˆ๋‹ค.

 


Source: Attention Is All You Need (Transformer architecture figure)



📦 Architecture Walkthrough (Example: English → Korean Translation)

Transformer๋Š” ํฌ๊ฒŒ Encoder์™€ Decoder๋กœ ๊ตฌ์„ฑ๋œ๋‹ค.

  • ์˜ˆ๋ฅผ ๋“ค์–ด, ์ž…๋ ฅ ๋ฌธ์žฅ์ด "I am a student"์ด๊ณ  ์ถœ๋ ฅ(์ •๋‹ต)์ด "๋‚˜๋Š” ํ•™์ƒ์ด๋‹ค"์ธ ๊ฒฝ์šฐ๋ฅผ ์ƒ๊ฐํ•ด๋ณด์ž.

🔷 1. Input Embedding + Positional Encoding

์ž…๋ ฅ ๋ฌธ์žฅ์€ ๋‹จ์–ด๋“ค์„ ์ˆซ์ž ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•œ ์ž„๋ฒ ๋”ฉ์„ ๊ฑฐ์น˜๊ณ ,

์ˆœ์„œ๋ฅผ ์•Œ ์ˆ˜ ์žˆ๋„๋ก ์œ„์น˜ ์ •๋ณด(Positional Encoding)**๊ฐ€ ์ถ”๊ฐ€๋œ๋‹ค.

์ด๊ฑธ ํ†ตํ•ด ๋ชจ๋ธ์€ "I"๊ฐ€ ์ฒซ ๋ฒˆ์งธ ๋‹จ์–ด๊ณ  "student"๊ฐ€ ๋„ค ๋ฒˆ์งธ๋ผ๋Š” ์ˆœ์„œ๋ฅผ ์ธ์‹ํ•  ์ˆ˜ ์žˆ๋‹ค.

🔶 2. Encoder: Understanding the Full Context of the Input

The Encoder is a stack of identical blocks repeated several times, and each block contains the following two components:

  • Multi-Head Self-Attention: the words in "I am a student" attend to one another and learn which relationships matter.
    For example, the model can learn that "student" is linked to "I".

    🧠 Setup: a sentence and its word embeddings
    "The cat sat on the mat"

    This sentence consists of 6 words, and each word is embedded into a d_model-dimensional vector.
    (e.g. d_model = 512)
    ✅ Summary
    • Q, K, V creation: transform each word from three different perspectives
    • Dot product (Q·K): compute the relevance (similarity) between words
    • Softmax: convert the scores into importance weights
    • Weighted sum: gather the information of the important words into a new representation of "me"
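To make those four steps concrete, here is a minimal NumPy sketch of (single-head) scaled dot-product self-attention. The toy dimensions and the random W_q, W_k, W_v matrices are placeholders for illustration, not trained weights from the paper.

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # 1) project each word vector into Query, Key, Value
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # 2) dot products Q.K measure how relevant each pair of words is
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # 3) softmax turns the scores into importance weights that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # 4) weighted sum: gather the important words' information into a new representation
    return weights @ V

# "The cat sat on the mat" -> 6 tokens; d_model kept tiny (8) so the arrays stay readable
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                            # toy embeddings, not trained values
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)          # (6, 8): one context-mixed vector per word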

 

์ธ์ฝ”๋” 1์ธต์˜ ๊ตฌ์„ฑ (ํ•œ ์ธต ๊ธฐ์ค€):

  1. Input Embedding + Positional Encoding
    → convert words into vectors and add position information
  2. Multi-Head Self-Attention
    → the words in the sentence attend to one another and learn their relationships (as summarized above)
  3. Add & LayerNorm
    → residual connection and layer normalization for stable training

Example of residual connection and normalization (figure)

Summary: why residual connections and normalization are used

In the Transformer, multi-head attention produces a new, context-aware representation, but in the process the structure and information of the original input vector can be distorted or lost. To compensate, a residual connection adds the original input back onto the attention output, so the model keeps both the contextual information and the input's own information. Because the distribution of the summed values can be unstable, layer normalization is then applied to bring them to roughly mean 0 and variance 1, which improves training stability and convergence speed. Together, these two components are a key mechanism that lets the Transformer learn meaningful representations even in deep stacks without losing information.
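A minimal sketch of that Add & LayerNorm step in NumPy (the learnable scale and shift parameters gamma and beta are omitted to keep it short, and eps is a small constant chosen here for numerical stability):

import numpy as np

def add_and_layer_norm(x, sublayer_out, eps=1e-6):
    # Add: keep the original input's information alongside the sublayer's output
    y = x + sublayer_out
    # LayerNorm: normalize each word vector to roughly mean 0 and variance 1
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)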

 

4. Feed Forward Network (FFN)
→ expands each word's representation non-linearly (a 2-layer MLP)

🔧 A Closer Look at the Feed Forward Network (FFN)

ํŠธ๋žœ์Šคํฌ๋จธ์˜ FFN์€ ์‚ฌ์‹ค์ƒ ์•„์ฃผ ๊ฐ„๋‹จํ•œ 2์ธต์งœ๋ฆฌ ์‹ ๊ฒฝ๋ง์ด์•ผ.
๋”ฅ๋Ÿฌ๋‹์—์„œ ํ”ํžˆ ์“ฐ๋Š” MLP (Multi-Layer Perceptron) ๊ตฌ์กฐ๊ณ ,
๊ฐ ๋‹จ์–ด(๋ฒกํ„ฐ)์— ๋Œ€ํ•ด ๊ฐœ๋ณ„์ ์œผ๋กœ ๋™์ผํ•œ FFN์„ ์ ์šฉํ•ด.

 

 

 

 

 

 

 

📌 Summary of the flow

[input vector x] → linear expansion (W1) → ReLU → linear reduction (W2) → final output
 

Each word goes through this process one by one, and the result is a representation that, after Self-Attention has been applied, is transformed one step further in meaning.
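A minimal NumPy sketch of that flow. The 512 → 2048 → 512 sizes follow the paper's defaults (d_ff = 4 × d_model), but the weights below are random placeholders rather than trained parameters.

import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # applied to every word vector independently: expand -> ReLU -> reduce
    h = np.maximum(0, x @ W1 + b1)       # linear expansion (d_model -> d_ff) + ReLU
    return h @ W2 + b2                   # linear reduction back to d_model

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

x = rng.normal(size=(6, d_model))        # 6 words, each a d_model-dimensional vector
print(feed_forward(x, W1, b1, W2, b2).shape)   # (6, 512): same shape out as in

Note that the output has the same shape as the input, which is exactly what allows identical layers to be stacked (see the note below).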

 


โ€ป ํŠธ๋žœ์Šคํฌ๋จธ ์ธ์ฝ”๋”๋Š” ๋™์ผํ•œ ๊ตฌ์กฐ์˜ ์ธต์„ ์—ฌ๋Ÿฌ ๊ฐœ ๋ฐ˜๋ณตํ•˜๋„๋ก ์„ค๊ณ„๋˜์–ด์žˆ๋‹ค. ์ด๋ ‡๊ฒŒ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋ ค๋ฉด

     ๊ฐ ์ธต์˜ ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ ์ฐจ์›์ด ๊ฐ™์•„์•ผ ๋‹ค์Œ ์ธต์— ๊ทธ๋Œ€๋กœ ๋„ฃ์„ ์ˆ˜ ์žˆ๋‹ค.


5. Add & LayerNorm (once more)
→ another residual connection and normalization

Repeat this N times (typically 6 layers) → the full encoder is complete!

 

Summarizing the encoder once more


๐Ÿง  Transformer ์ธ์ฝ”๋” ์ •๋ฆฌ

Transformer ์ธ์ฝ”๋”๋Š” ์ž…๋ ฅ ๋ฌธ์žฅ์„ ๋ฌธ๋งฅ์„ ๋ฐ˜์˜ํ•œ ํ’๋ถ€ํ•œ ํ‘œํ˜„ ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ตฌ์กฐ๋กœ,
๋™์ผํ•œ ํ˜•ํƒœ์˜ ๋ธ”๋ก(layer)์„ ์—ฌ๋Ÿฌ ์ธต(N์ธต) ๋ฐ˜๋ณตํ•ด์„œ ๊ตฌ์„ฑ๋œ๋‹ค.

๊ฐ ์ธ์ฝ”๋” ๋ธ”๋ก์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ˆœ์„œ๋กœ ๋™์ž‘ํ•œ๋‹ค:


① Input Embedding + Positional Encoding

Each word of the input sentence is first embedded into a fixed-size vector. Because the Transformer has no built-in notion of order, a Positional Encoding that adds position information to each word is applied. The resulting input vectors then enter the first encoder block.
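As a sketch, this is how the paper's fixed sinusoidal positional encoding can be computed and added to the embeddings (the embedding values themselves are random placeholders here):

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])    # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])    # odd dimensions use cosine
    return pe

# "I am a student" -> 4 tokens, each embedded into a 512-dimensional vector
emb = np.random.default_rng(0).normal(size=(4, 512))   # placeholder embeddings
x = emb + positional_encoding(4, 512)                   # this sum enters the first encoder block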


② Multi-Head Self-Attention

The input word vectors "look at" one another and exchange contextual information. Each word is transformed into Query, Key, and Value vectors, its similarity (importance) with every other word is computed, and the corresponding information is combined as a weighted sum.

Running this process in parallel across several heads lets the model learn the relationships between words from different perspectives, and the results are concatenated to form a context-aware representation.
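A rough sketch of that multi-head step, reusing the single-head attention idea from earlier. The heads run in a simple loop here rather than truly in parallel; the 8 heads of 64 dimensions follow the paper's defaults, but all weights are random placeholders.

import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, heads, W_o):
    outputs = []
    for W_q, W_k, W_v in heads:                        # each head has its own Q/K/V projections
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # this head's word-to-word weights
        outputs.append(A @ V)
    # concatenate the heads' outputs and project back to d_model
    return np.concatenate(outputs, axis=-1) @ W_o

d_model, n_heads = 512, 8
d_k = d_model // n_heads                               # 64 dimensions per head
rng = np.random.default_rng(0)
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_k, d_model))
X = rng.normal(size=(4, d_model))                      # "I am a student" -> 4 placeholder vectors
print(multi_head_self_attention(X, heads, W_o).shape)  # (4, 512): context-aware vector per word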


③ Residual Connection + Layer Normalization

The output of Self-Attention is added to the input vector (residual connection) so that the original information is preserved rather than lost. Normalization is then applied to stabilize the distribution of the values, which makes training faster and more stable.


④ Feed Forward Network (FFN)

Each context-enriched word vector is passed individually through a two-layer fully connected network. The dimension is first expanded and a non-linear activation (ReLU) is applied, then it is reduced back to the original dimension, producing a more abstract representation vector.

This step is applied to each word independently.


⑤ Residual Connection + Layer Normalization, again

The input is also added to the FFN's output, and layer normalization is applied to stabilize the values. After this step, each word vector carries both its contextual and its own information and is passed on to the next block.


โœ… ์ธ์ฝ”๋” ์š”์•ฝ ๊ตฌ์กฐ (ํ•œ ์ธต ๊ธฐ์ค€):

Input → [Multi-Head Self-Attention]
      → Add & LayerNorm
      → Feed Forward Network
      → Add & LayerNorm
      → Output → ๋‹ค์Œ ์ธต์œผ๋กœ

์ด ๊ตฌ์กฐ๊ฐ€ N๋ฒˆ ๋ฐ˜๋ณต๋˜๋ฉด ์ธ์ฝ”๋” ์ „์ฒด๊ฐ€ ์™„์„ฑ๋˜๊ณ ,
์ตœ์ข… ์ถœ๋ ฅ์€ ๊ฐ ๋‹จ์–ด์— ๋Œ€ํ•œ ๋ฌธ๋งฅ-aware ๋ฒกํ„ฐ ํ‘œํ˜„์ด ๋œ๋‹ค.
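Putting it together, a toy end-to-end sketch of this per-layer structure stacked N times (single-head attention, no biases, random weights: a simplification of the structure above, not a faithful reimplementation of the paper):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(y, eps=1e-6):
    return (y - y.mean(axis=-1, keepdims=True)) / (y.std(axis=-1, keepdims=True) + eps)

def encoder_layer(x, p):
    # Multi-Head Self-Attention (reduced to one head here to stay short)
    Q, K, V = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    x = layer_norm(x + attn)                           # Add & LayerNorm
    ffn = np.maximum(0, x @ p["W1"]) @ p["W2"]         # Feed Forward Network
    return layer_norm(x + ffn)                         # Add & LayerNorm again

d_model, d_ff, N = 64, 256, 6                          # toy sizes; the paper uses 512 / 2048, N = 6
rng = np.random.default_rng(0)
shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model), "Wv": (d_model, d_model),
          "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
layers = [{k: rng.normal(size=s) * 0.05 for k, s in shapes.items()} for _ in range(N)]

x = rng.normal(size=(4, d_model))                      # embeddings + positional encoding go here
for p in layers:                                       # the same block structure, repeated N times
    x = encoder_layer(x, p)
print(x.shape)                                         # (4, 64): context-aware vector per word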

 

🔸 3. Decoder: Generating the Translation (covered in detail in the next part)

The Decoder has a structure similar to the Encoder, with the following differences:

  • Masked Multi-Head Attention: the decoder generates the output sentence one word at a time.
    During training the full target sentence "나는 학생이다" is available, but a mask is applied so that, for example, when "나는" is the input, the model can only look at the preceding words when predicting the next one.
    In other words, the answer "학생" is hidden, and the model must predict "학생" from "나는" alone. (See the small mask sketch after this list.)
  • Encoder-Decoder Attention: the decoder consults the input sentence when generating each output word.
    When producing "나는", it also takes into account the meaning of "I am a student" as analyzed by the Encoder.
  • Feed-Forward Network: performs the final transformation of the predicted information.
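The decoder itself is for the next part, but as a quick sketch, this is what the mask in Masked Multi-Head Attention looks like: future positions are set to -inf before the softmax, so their attention weights become exactly zero. The three-token split of "나는 / 학생 / 이다" is just for illustration.

import numpy as np

seq_len = 3                                                   # toy target: "나는 / 학생 / 이다"
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)     # -inf strictly above the diagonal
print(mask)
# [[  0. -inf -inf]    when predicting from token 1, tokens 2 and 3 are hidden
#  [  0.   0. -inf]    position 2 may look at positions 1-2
#  [  0.   0.   0.]]   position 3 may look at everything before it
# In masked self-attention this matrix is added to Q @ K.T / sqrt(d_k) before the softmax,
# so exp(-inf) = 0 and no weight ever falls on future words.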

🎯 4. Linear + Softmax → Predicting Probabilities

The decoder's output is passed through a Linear layer that maps it to a vector the size of the vocabulary, and Softmax then computes how likely each word is to come next. For example, it might produce a probability distribution such as 0.82 for "학생" ("student"), 0.05 for "선생" ("teacher"), and 0.01 for "의사" ("doctor"), and the most probable word is selected as the output.
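A tiny sketch of that last step: a linear projection to vocabulary size, softmax, and a greedy pick of the most probable word. The five-word vocabulary and all the numbers are invented for illustration.

import numpy as np

vocab = ["학생", "선생", "의사", "이다", "나는"]         # toy vocabulary (invented)
rng = np.random.default_rng(0)
W_out = rng.normal(size=(512, len(vocab)))               # Linear layer: d_model -> vocab size

h = rng.normal(size=512)                                 # decoder output for the current time step
logits = h @ W_out                                       # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                     # softmax: scores -> probability distribution
print(vocab[int(np.argmax(probs))])                      # pick the most probable next word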


📚 How It Is Trained

Because the Transformer is trained with supervised learning, it requires a dataset in which inputs are paired with their correct target outputs.

Examples:

Input                          | Output (Target)
"I am a student"               | "나는 학생이다"
🖼️ an image (photo of a dog)   | "강아지가 풀밭에서 뛴다" ("a dog runs in the field")

During training, the target sentence is fed into the decoder shifted one position to the right, and the model learns to predict the next word at each time step.


✅ Summary

  • The Transformer is a fully attention-based model that overcomes the limitations of RNNs/CNNs.
  • It processes all the words of a sequence in parallel rather than one at a time, so training is fast and it remains effective on long sentences.
  • The architecture is split into an Encoder and a Decoder, and each layer consists of Multi-Head Attention + Feed Forward + LayerNorm + Residual.
  • Self-Attention captures the relationships between words, and Multi-Head attention strengthens this from multiple viewpoints.
  • Training requires input-output pairs, and the output is produced via Softmax as a probability distribution over the entire vocabulary.