  • GDG on campus Ewha Tech Blog
Cohort 3-2 Study / MLOps

[4주차] Transformers

by 서 진 2022. 5. 17.

Full Stack Deep Learning

1. Transfer Learning in Computer Vision

  • Classifying birds with only 10,000 images → risk of overfitting
  • → fine-tuning
  • → a fine-tuned ResNet-50 should perform well
    • a large model trained on a big dataset (= pretrained model)
    • take the already-trained model, add or replace layers, and train on the new task
    • enables fast, accurate training with less data (this is what transfer learning means)

  • Model zoo
    • collections of pretrained models
    • available for both TensorFlow and PyTorch
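
To make the fine-tuning idea concrete, here is a minimal numpy sketch (not real ResNet-50 code): a frozen random projection stands in for the pretrained backbone, and only a new classification head is trained on the small dataset. All sizes and names here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pretrained backbone (e.g. ResNet-50 minus its final
# layer): a fixed projection whose weights are NOT updated during training.
W_backbone = rng.normal(size=(64, 32))

def features(x):
    return np.maximum(x @ W_backbone, 0.0)  # frozen feature extractor (ReLU)

# Toy labeled data for the "new" task (label depends on the first input dim).
X = rng.normal(size=(200, 64))
y = (X[:, 0] > 0).astype(float)

# The new head is the only part we train: logistic regression on frozen features.
w_head = np.zeros(32)
for _ in range(300):
    f = features(X)
    p = 1.0 / (1.0 + np.exp(-(f @ w_head)))
    w_head -= 0.1 * f.T @ (p - y) / len(y)

p = 1.0 / (1.0 + np.exp(-(features(X) @ w_head)))
acc = ((p > 0.5) == y).mean()
```

Because the backbone is frozen, only 32 head parameters are learned, which is why far less data suffices than training the whole network from scratch.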

2. Embeddings and Language Models

  • In NLP the raw input is words, but deep learning models operate on vectors
  • How do we turn a word into a vector?
    • One-hot encoding
    • Problem: it works, but it does not scale with vocabulary size → violates what we know about word similarity
    • → neural networks do not work well on very high-dimensional sparse vectors
    • Dense vectors. Problem: how do we find the values of the embedding matrix?
      • Learn as part of the task
      • Learn a language model → skip-grams (look on both sides of the target word)
      • → N-grams
      • How do we speed up training? → Word2Vec
      • Binary classification instead of multi-class
    • → embedding matrix
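
The ideas above (learn a language model with skip-grams, and make it binary instead of multi-class via negative sampling) can be sketched as a tiny Word2Vec-style trainer in numpy. This is an illustrative toy with a miniature corpus and one negative sample per positive pair, not a faithful Word2Vec implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny corpus; in practice this would be a large text corpus.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8  # vocabulary size, embedding dimension

# Two matrices to learn: target ("input") and context ("output") embeddings.
E_in = rng.normal(scale=0.1, size=(V, D))
E_out = rng.normal(scale=0.1, size=(V, D))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.05
for _ in range(200):
    for i, w in enumerate(corpus):
        # Skip-gram: look on both sides of the target word.
        for j in (i - 1, i + 1):
            if 0 <= j < len(corpus):
                t, c = idx[w], idx[corpus[j]]
                # Positive pair: push its score toward 1 (binary, not multi-class).
                g = sigmoid(E_in[t] @ E_out[c]) - 1.0
                E_in[t] -= lr * g * E_out[c]
                E_out[c] -= lr * g * E_in[t]
                # One negative sample: push a random word's score toward 0
                # (collisions with the true context are ignored for simplicity).
                n = rng.integers(V)
                g = sigmoid(E_in[t] @ E_out[n])
                E_in[t] -= lr * g * E_out[n]
                E_out[n] -= lr * g * E_in[t]

# E_in is the learned embedding matrix: one dense vector per word.
```

The binary objective (is this word/context pair real or sampled?) avoids the softmax over the whole vocabulary, which is what makes training fast.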

3. "NLP's ImageNet Moment": ELMo/ULMFiT

  • around 2017

  • ELMo
    • SQuAD
    • SNLI
    • GLUE
  • ULMFiT
    • similar to ELMo

4. Transformers

  • Paper
    • encoder-decoder with only attention and fully-connected layers
    • the actual mechanism
    • focus just on the encoder
  • → "Attention Is All You Need" (2017): to read before the next paper-study session
  • (Masked) Self-attention
  • Positional encoding
  • Layer normalization

4.1 Attention in detail

Basic self-attention

  • No learned weights
  • Order of the sequence does not affect result of computations
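A minimal numpy sketch of basic self-attention with no learned weights; the final check illustrates the second bullet (reordering the inputs just reorders the outputs, so sequence order does not affect the computation):

```python
import numpy as np

def basic_self_attention(X):
    """Self-attention with no learned weights: each output y_i is a
    softmax(x_i . x_j)-weighted sum of all input vectors x_j."""
    scores = X @ X.T                                          # raw dot products
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)             # row-wise softmax
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Y = basic_self_attention(X)

# Permutation equivariance: shuffling the inputs shuffles the outputs the
# same way; each y_i itself is unchanged.
perm = [2, 0, 4, 1, 3]
assert np.allclose(basic_self_attention(X[perm]), Y[perm])
```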

Let's learn some weights!

  • Think about how each x_i will be used (three ways)
    • Query
    • → Compared to every other vector to compute attention weights for its own output y_i
    • Key
    • → Compared to every other vector to compute attention weight w_ij for output y_j
    • Value
    • → Summed with other vectors to form the result of the attention weighted sum
  • Transformer
    • Learned query, key, value weights
    • Multiple heads
    • Order of the sequence does not affect result of computations
    • → encode each vector with position
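
Putting the bullets above together, here is a numpy sketch of multi-head self-attention with learned query/key/value projections. Random, untrained matrices stand in for learned weights, and the sizes are arbitrary; this is an illustration, not a production layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 16, 4
d_head = d_model // n_heads

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Per-head query/key/value projections plus an output projection
# (random here; learned by gradient descent in a real Transformer).
W_q = rng.normal(scale=0.1, size=(n_heads, d_model, d_head))
W_k = rng.normal(scale=0.1, size=(n_heads, d_model, d_head))
W_v = rng.normal(scale=0.1, size=(n_heads, d_model, d_head))
W_o = rng.normal(scale=0.1, size=(d_model, d_model))

def multi_head_self_attention(X):
    heads = []
    for h in range(n_heads):
        Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]
        A = softmax(Q @ K.T / np.sqrt(d_head))  # scaled dot-product attention
        heads.append(A @ V)
    # Concatenate the heads and project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_o

X = rng.normal(size=(6, d_model))
Y = multi_head_self_attention(X)
# Still order-independent, which is why positional encodings are added to X.
```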

4.2 BERT, GPT-2, DistilBERT, T5

  • GPT / GPT-2
  • → Generative Pre-trained Transformer
  • BERT
  • → Bidirectional Encoder Representations from Transformers
  • Transformer

  • T5: Text-to-Text Transfer Transformer
  • GPT-3
  • DistilBERT
  • → a smaller model is trained to reproduce the output of a larger model (knowledge distillation)
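
The distillation idea (a student trained to reproduce a teacher's outputs) is commonly implemented as a cross-entropy against temperature-softened teacher probabilities. A minimal numpy sketch of that loss, under the usual soft-target formulation:

```python
import numpy as np

def softmax_T(logits, T):
    """Softmax with temperature T; T > 1 softens the distribution so the
    teacher's relative preferences among wrong classes are visible."""
    z = logits / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student against the teacher's soft targets:
    minimized when the student reproduces the teacher's distribution."""
    p_teacher = softmax_T(teacher_logits, T)
    log_p_student = np.log(softmax_T(student_logits, T))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

teacher = np.array([[2.0, 0.5, -1.0]])
student = np.array([[0.1, 0.1, 0.1]])
loss = distillation_loss(student, teacher)
```

In practice this term is combined with the ordinary hard-label loss; a matching student gets a strictly lower distillation loss than a mismatched one.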

 

Lab

Reading
