RLHF: Reward hacking

Generative AI with Large Language Models

by Taeyoon.Kim.DS 2023. 8. 28. 21:43

https://www.coursera.org/learn/generative-ai-with-llms/lecture/eJVnL/scaling-human-feedback

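Reinforcement Learning from Human Feedback (RLHF) aligns LLMs with human preferences using a reward model and a reinforcement learning algorithm such as PPO. However, it can lead to reward hacking, where the model learns to maximize the reward with outputs that score well but are low quality or drift away from the original task. To mitigate this, a reference model with frozen weights is kept alongside the model being updated. KL divergence measures how far the updated model's output distribution has moved from the reference model's, and a penalty proportional to that divergence is subtracted from the reward, so the updated model is penalized if it diverges too much. Combining RLHF with PEFT adapters reduces memory usage, because only the adapter weights are trained and the same base LLM can serve as both the reference model and the updated model. To assess performance, measure the reduction in toxicity by comparing a toxicity score before and after RLHF on a dataset such as summarization or dialogue.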
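The KL penalty described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function names `kl_divergence` and `penalized_reward`, the coefficient `beta`, and the three-token probability lists are all assumptions made up for the example. In practice the penalty is computed per token over the full vocabulary.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens.

    Assumes every entry of q is positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalized_reward(reward, policy_probs, reference_probs, beta=0.1):
    """Reward-model score minus a KL penalty (beta is a hypothetical weight)."""
    kl = kl_divergence(policy_probs, reference_probs)
    return reward - beta * kl


# Toy next-token distributions over a three-token vocabulary (illustrative only).
reference = [0.5, 0.3, 0.2]        # frozen reference model
close_policy = [0.45, 0.35, 0.2]   # updated model that stayed near the reference
drifted_policy = [0.05, 0.05, 0.9] # updated model that drifted far away

# Both completions receive the same raw reward of 1.0 from the reward model.
r_close = penalized_reward(1.0, close_policy, reference)
r_drift = penalized_reward(1.0, drifted_policy, reference)

print(f"close policy:   {r_close:.4f}")
print(f"drifted policy: {r_drift:.4f}")
```

With the same raw reward, the drifted policy ends up with the lower penalized reward, which is what discourages the updated model from wandering far from the reference just to exploit the reward model.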

License: Attribution-NonCommercial-NoDerivs (CC BY-NC-ND)
