
Model optimizations for deployment

Generative AI with Large Language Models

by Taeyoon.Kim.DS 2023. 9. 19. 21:18


https://www.coursera.org/learn/generative-ai-with-llms/lecture/qojKp/model-optimizations-for-deployment

 


Before deploying a large language model (LLM) for applications, consider how fast you need inference to be, what compute resources are available, and the trade-offs among model performance, inference speed, and storage. Also think about whether your model needs to interact with external data or applications, and how it will connect to those resources. Finally, determine the intended application or API interface through which the model will be consumed.
Three important model optimization techniques for deployment:

Model Distillation: Train a smaller student model to mimic the behavior of a larger teacher model, reducing storage and compute requirements while maintaining performance.
Quantization: Transform model weights to lower-precision representations to reduce memory footprint and compute requirements.
Pruning: Eliminate redundant model parameters, such as weights close to zero, to reduce model size without sacrificing accuracy.

 


 

Distillation, Quantization, and Pruning.

Fine-tune an LLM as the teacher model, then distill it into a smaller student LLM.

Use the teacher model's predictions as the labels, and the student model's predictions as the outputs to train against.

Compare the loss between the teacher's labels and the student's predictions.

Add a temperature parameter T to soften the output distributions.

 

The teacher model's outputs serve as soft labels, and the student model produces corresponding soft predictions. The student also generates hard predictions, which are compared against the hard labels (the ground truth).
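
As a concrete illustration, here is a minimal PyTorch sketch of the distillation loss just described: a KL-divergence term between the teacher's soft labels and the student's soft predictions (both softened by T), combined with a cross-entropy term between the student's hard predictions and the ground-truth hard labels. The weighting `alpha` and the value of `T` are illustrative choices, not values from the lecture.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
    """Combine a soft-label loss (teacher vs. student) with a hard-label loss.

    T softens both distributions; alpha balances the two terms (illustrative values).
    """
    # Soft labels from the teacher and soft predictions from the student at temperature T.
    soft_labels = F.softmax(teacher_logits / T, dim=-1)
    soft_predictions = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between soft labels and soft predictions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(soft_predictions, soft_labels, reduction="batchmean") * (T ** 2)
    # Hard predictions vs. ground-truth hard labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy usage: a batch of 8 examples over a 100-token vocabulary.
student_logits = torch.randn(8, 100)
teacher_logits = torch.randn(8, 100)
hard_labels = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, hard_labels)
```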

Post-Training Quantization (PTQ): transform the weights (and optionally the activations) of an already-trained model to a lower-precision representation, such as 16-bit floating point or 8-bit integer.
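
As a rough sketch of the arithmetic behind PTQ (not the lecture's implementation), the snippet below quantizes a weight tensor to 8-bit integers with a single symmetric per-tensor scale and then dequantizes it. Production PTQ toolchains add calibration data, per-channel scales, and activation quantization.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor post-training quantization of weights to int8."""
    # Map the largest weight magnitude to 127; clamp avoids division by zero.
    scale = (w.abs().max() / 127.0).clamp(min=1e-8)
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float32 tensor from the int8 weights."""
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)
q, scale = quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())  # small quantization error
```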

 

Pruning

Remove model weights with values close or equal to zero.

LLM → pruned LLM

Some pruning methods require full model re-training; others use PEFT/LoRA or are applied post-training.

In theory, pruning reduces model size and improves performance.

In practice, only a small percentage of weights in LLMs are zero, so the actual reduction may be limited.
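
A minimal sketch of unstructured magnitude pruning, assuming a simple per-tensor threshold derived from a target sparsity; this illustrates the idea, not the lecture's specific method.

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.2) -> torch.Tensor:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    # Threshold at the sparsity-quantile of absolute weight values.
    threshold = torch.quantile(w.abs().flatten(), sparsity)
    mask = w.abs() > threshold
    return w * mask

w = torch.randn(512, 512)
pruned = magnitude_prune(w, sparsity=0.2)
print(f"zero fraction: {(pruned == 0).float().mean():.2f}")  # ~0.20
```

PyTorch also ships torch.nn.utils.prune.l1_unstructured, which applies the same idea to a module's parameters in place.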

 

 
