
RLHF: Obtaining feedback from humans

Generative AI with Large Language Models

by Taeyoon.Kim.DS 2023. 8. 28. 21:21


https://www.coursera.org/learn/generative-ai-with-llms/lecture/lQBGW/rlhf-obtaining-feedback-from-humans


In this video, the process of fine-tuning a large language model (LLM) with Reinforcement Learning from Human Feedback (RLHF) is explained. The first step is to select an LLM suited to the desired task, such as text summarization or question answering; an instruct model that has already been fine-tuned across a variety of tasks is often a good starting point. The selected LLM is then used with a prompt dataset to generate multiple completions for each prompt. Human labelers evaluate these completions against predefined criteria, such as helpfulness or toxicity, and rank the completions for each prompt. Having several labelers rank the same prompts establishes consensus, and labelers are given clear instructions so that they all understand the task in the same way.

After the ranking data has been collected, it is converted into pairwise comparisons of completions, with rewards assigned accordingly. Within each pair, the preferred completion is placed first, which is the format needed to train the reward model effectively. Ranked feedback is explained to be more advantageous than simple thumbs-up/thumbs-down feedback, since each ranking of several completions yields multiple pairwise training examples.
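To make the ranking-to-pairs step concrete, below is a minimal Python sketch (not code from the course) of how a single labeler's ranking could be expanded into pairwise comparisons for reward-model training. The helper `rankings_to_pairs` and the `chosen`/`rejected` field names are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch: turn one labeler's ranking of N completions into
# N*(N-1)/2 pairwise comparisons, with the preferred completion first.
from itertools import combinations

def rankings_to_pairs(prompt, completions, ranks):
    """Convert a ranking (1 = best) into pairwise training examples.

    The preferred ("chosen") completion is always listed first,
    matching the ordering described in the lecture.
    """
    pairs = []
    for i, j in combinations(range(len(completions)), 2):
        if ranks[i] == ranks[j]:
            continue  # ties carry no preference signal, so skip them
        chosen, rejected = (i, j) if ranks[i] < ranks[j] else (j, i)
        pairs.append({
            "prompt": prompt,
            "chosen": completions[chosen],     # preferred completion
            "rejected": completions[rejected]  # less-preferred completion
        })
    return pairs

# Example: three completions ranked 2, 1, 3 yield three pairwise examples.
examples = rankings_to_pairs(
    "Summarize the following article ...",
    ["completion A", "completion B", "completion C"],
    ranks=[2, 1, 3],
)
for ex in examples:
    print(ex["chosen"], ">", ex["rejected"])
```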

