RLHF: Fine-tuning with reinforcement learning

Generative AI with Large Language Models

by Taeyoon.Kim.DS 2023. 8. 28. 21:33

https://www.coursera.org/learn/generative-ai-with-llms/lecture/sAKto/rlhf-fine-tuning-with-reinforcement-learning

Source: Coursera video "RLHF: Fine-tuning with reinforcement learning" (Week 3, Reinforcement learning and LLM-powered applications), created by deeplearning.ai and Amazon Web Services for the course "Generative AI with Large Language Models".

In RLHF (reinforcement learning from human feedback), a reward model is used to align an instruction fine-tuned LLM (large language model) with human preferences. A prompt is passed to the LLM, which generates a completion. The prompt and the completion together form a prompt-completion pair, which is sent to the reward model. The reward model, which was trained on human feedback, scores the pair and returns a reward value; a higher value indicates a more aligned response. The reinforcement learning algorithm then uses this reward to update the LLM's weights, producing an RL-updated LLM. These steps make up one iteration, and the process repeats for a set number of epochs so that alignment with human preferences improves over time. The model that results is called the human-aligned LLM. Proximal policy optimization (PPO) is a common choice for the reinforcement learning algorithm in this process.
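The loop described above (generate a completion, score the prompt-completion pair, update the weights, repeat) can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the "LLM" is a softmax policy over three canned completions, `toy_reward_model` is a hypothetical keyword rule standing in for a trained reward model, and the update is a plain REINFORCE policy-gradient step with a running baseline rather than PPO. None of these names come from the course.

```python
import math
import random

# Assumption: the "LLM" can only choose among three canned completions.
COMPLETIONS = [
    "I won't answer that, you fool.",
    "Sure, here is a clear and polite answer.",
    "asdf qwer zxcv",
]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward_model(prompt, completion):
    """Hypothetical stand-in for a reward model trained on human feedback.

    It scores a prompt-completion pair; higher means more aligned.
    The prompt is ignored in this toy rule.
    """
    reward = 0.0
    if "polite" in completion:
        reward += 1.0
    if "fool" in completion:
        reward -= 1.0
    return reward


def rlhf_loop(prompt, steps=500, lr=0.1, seed=0):
    """Run a toy RLHF loop and return the final completion probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(COMPLETIONS)  # the "weights" of the toy policy
    baseline = 0.0                     # running average of observed rewards
    for _ in range(steps):
        probs = softmax(logits)
        # 1. The policy generates a completion for the prompt.
        idx = rng.choices(range(len(COMPLETIONS)), weights=probs)[0]
        # 2. The reward model scores the prompt-completion pair.
        reward = toy_reward_model(prompt, COMPLETIONS[idx])
        # 3. REINFORCE update (not PPO): the gradient of log pi(idx)
        #    with respect to logit j is (one_hot[j] - probs[j]).
        advantage = reward - baseline
        for j in range(len(logits)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)
```

Calling `rlhf_loop("How do I reset my password?")` returns the policy's final probabilities over the three completions. In this toy setup the polite completion should end up the most likely one, which mirrors how repeated reward-driven updates shift the model toward preferred responses. PPO additionally constrains how far each update can move the policy, which this sketch leaves out.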


License: Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND)
