Model evaluation

Generative AI with Large Language Models

by Taeyoon.Kim.DS 2023. 8. 22. 20:21

https://www.coursera.org/learn/generative-ai-with-llms/lecture/8Wvg3/model-evaluation

Video created by deeplearning.ai, Amazon Web Services for the course "Generative AI with Large Language Models". Fine-tuning and evaluating large language models

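In this section, you'll learn about metrics used to assess the performance of large language models. Traditional machine learning metrics like accuracy work when a model's output is deterministic and each prediction is simply right or wrong. They don't carry over well to language models: the output is non-deterministic, two differently worded answers can mean the same thing, and a one-word change can reverse the meaning. ROUGE and BLEU are two metrics commonly used to evaluate language models in this setting.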

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the quality of automatically generated summaries by comparing them to human-written reference summaries. Its main variants are ROUGE-1 (unigram matches), ROUGE-2 (bigram matches), and ROUGE-L (longest common subsequence).
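To make the n-gram variants concrete, here is a minimal sketch of ROUGE-N in plain Python. It assumes whitespace tokenization, lowercasing, and a single reference; the function names and the example sentences are illustrative and do not come from any particular library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Recall, precision and F1 over n-gram matches (single reference)."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram,
    # so a repeated n-gram is never matched more often than it occurs.
    overlap = sum((ref & cand).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}


reference = "It is cold outside"
generated = "It is very cold outside"
print(rouge_n(reference, generated, n=1))  # ROUGE-1
print(rouge_n(reference, generated, n=2))  # ROUGE-2
```

For this pair, ROUGE-1 recall is 4/4 and precision is 4/5, while ROUGE-2 recall is 2/3 and precision is 2/4, since only "it is" and "cold outside" match as bigrams.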

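BLEU (Bilingual Evaluation Understudy) evaluates the quality of machine-translated text by comparing it to human-written translations. BLEU calculates precision for a range of n-gram sizes (typically 1 through 4) and averages the results into a single score; the standard formulation uses a geometric mean and adds a penalty for output that is shorter than the reference.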
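The averaging step can be sketched as follows. This is a simplified sentence-level BLEU, assuming whitespace tokenization, one reference, n-gram sizes 1 to 4, and no smoothing, so any n-gram size with zero matches makes the score 0. The function names and example sentences are illustrative; library implementations differ in tokenization and smoothing.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        ref = ngram_counts(ref_tokens, n)
        cand = ngram_counts(cand_tokens, n)
        total = sum(cand.values())
        matched = sum((ref & cand).values())  # clipped match count
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(matched / total)
    geometric_mean = math.exp(log_precision_sum / max_n)
    # Brevity penalty: output shorter than the reference is scaled down.
    if len(cand_tokens) > len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / len(cand_tokens))
    return brevity_penalty * geometric_mean


reference = "I am very happy to say that I am drinking a warm cup of tea"
farther = "I am very happy that I am drinking a cup of tea"
closer = "I am very happy that I am drinking a warm cup of tea"
print(simple_bleu(reference, farther))
print(simple_bleu(reference, closer))    # higher than the line above
print(simple_bleu(reference, reference))  # identical text scores 1.0
```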

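ROUGE-1 counts unigram matches between the reference and the generated output, so it ignores word order: adding or dropping a single word such as "not" barely changes the score even though the meaning flips. ROUGE-2 counts bigram matches, which captures some local word order, but its scores are usually lower than ROUGE-1 because fewer bigrams match. ROUGE-L calculates recall, precision, and F1 score from the length of the longest common subsequence shared by the two texts. An output that simply repeats a word from the reference can still score a high precision; a clipping function prevents this by capping the matches for each word at the number of times it appears in the reference.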
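A minimal sketch of ROUGE-L and of clipping, under the same assumptions as before: whitespace tokenization, lowercasing, a single reference. The helper names and example sentences are illustrative only.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Recall, precision and F1 based on the LCS length."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    recall = lcs / max(len(ref), 1)
    precision = lcs / max(len(cand), 1)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}


def unigram_precision(reference, candidate, clip=True):
    """Unigram precision, with or without clipping of repeated words."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    if clip:
        # Each word matches at most as often as it occurs in the reference.
        matched = sum(min(count, ref[word]) for word, count in cand.items())
    else:
        matched = sum(count for word, count in cand.items() if word in ref)
    return matched / max(sum(cand.values()), 1)


reference = "It is cold outside"
print(rouge_l(reference, "It is very cold outside"))
print(unigram_precision(reference, "cold cold cold cold", clip=False))  # 1.0
print(unigram_precision(reference, "cold cold cold cold", clip=True))   # 0.25
```

Without clipping, the degenerate output "cold cold cold cold" gets a perfect unigram precision; with clipping, only one of its four words counts as a match.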

Language model libraries often include ROUGE score implementations, so you rarely need to write your own. The BLEU score assesses translation quality and increases as the generated output gets closer to the reference. ROUGE and BLEU are cheap to compute and useful for quick checks while you iterate on a model, but they are not sole indicators of its performance. For the final assessment, use the more comprehensive evaluation benchmarks developed by researchers.


License: Attribution, NonCommercial, NoDerivatives (CC BY-NC-ND)
