
Scaling laws and compute-optimal models

Generative AI with Large Language Models

by Taeyoon.Kim.DS 2023. 8. 21. 21:17


https://www.coursera.org/learn/generative-ai-with-llms/lecture/SmRNp/scaling-laws-and-compute-optimal-models

 


In this video, you'll learn about the relationship between model size, training data, configuration, and performance in training large language models (LLMs). The primary goal during pre-training is to minimize the token prediction loss, which maximizes the model's performance. In theory, you can improve performance by increasing the size of the training dataset or the number of parameters in your model. In practice, however, your compute budget, including factors like GPU availability and training time, limits how far you can scale either one.

A petaFLOP per second day (petaFLOP/s-day) is the number of floating point operations performed when running at one petaFLOP, i.e. one quadrillion (10^15) floating point operations, per second for a full day. It's a convenient unit for quantifying compute budgets: one petaFLOP/s-day is roughly equivalent to eight NVIDIA V100 GPUs operating at full efficiency for a day. The chart compares the petaFLOP/s-days needed to pre-train different variants of models, including BERT, RoBERTa, T5, and GPT-3, which vary in parameter count from hundreds of millions to 175 billion. Larger models require significantly more compute resources.
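To make the unit concrete, here is a back-of-the-envelope estimate of training compute in petaFLOP/s-days. It relies on the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token (the 6·N·D rule); that rule and the GPT-3-scale example numbers are assumptions for illustration, not figures taken from the video.

```python
# Estimate pre-training compute in petaFLOP/s-days.
# Assumes total training FLOPs ~= 6 * N * D (6 FLOPs per parameter per token),
# a common rule of thumb, not a figure stated in the video.

PFLOP = 1e15                     # one petaFLOP = 1e15 floating point operations
SECONDS_PER_DAY = 24 * 60 * 60

def petaflop_s_days(n_params: float, n_tokens: float) -> float:
    """Rough training compute for a dense LLM, in petaFLOP/s-days."""
    total_flops = 6 * n_params * n_tokens
    return total_flops / (PFLOP * SECONDS_PER_DAY)

# GPT-3 scale: 175B parameters trained on ~300B tokens
print(round(petaflop_s_days(175e9, 300e9)))  # prints 3646
```

The result, on the order of a few thousand petaFLOP/s-days, matches the scale shown for GPT-3 at the top of the chart.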

Researchers have explored the trade-offs between training dataset size, model size, and compute budget. There is a clear relationship between compute budget and model performance, which can be approximated by a power law; similar power-law relationships hold between performance and dataset size, and between performance and model size, when the other factors are not the bottleneck. In other words, if compute is not the limiting factor, you can improve model performance by increasing the training dataset size or the number of model parameters.
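The power-law relationship can be sketched as a simple function of compute. The functional form L(C) = (C_c / C)^alpha follows the scaling-law literature, but the constants below are hypothetical, chosen only to show the shape of the curve, not fitted values from any paper.

```python
# Illustrative power-law scaling of test loss with training compute.
# Functional form: L(C) = (C_c / C) ** alpha.
# c_c and alpha below are hypothetical constants for illustration only.

def loss(compute_pf_days: float, c_c: float = 2.3e8, alpha: float = 0.05) -> float:
    """Test loss as a power law in training compute (petaFLOP/s-days)."""
    return (c_c / compute_pf_days) ** alpha

# On a log-log plot this is a straight line: every 10x increase in compute
# multiplies the loss by the same constant factor, 10 ** (-alpha).
for c in (1e0, 1e1, 1e2, 1e3):
    print(f"{c:8.0e} PF-days -> loss {loss(c):.3f}")
```

The key property is the constant ratio per decade of compute: gains keep coming as you scale, but each additional decade of compute buys the same multiplicative improvement, which is why the compute budget dominates planning.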

The Chinchilla paper (Hoffmann et al., 2022) investigated the optimal balance between these factors. It suggested that many large models may be over-parameterized and under-trained, and would benefit from training on larger datasets: the compute-optimal training dataset size is roughly 20 times the number of model parameters. Compute-optimal models outperform non-optimal ones on a range of tasks; Chinchilla itself, with 70 billion parameters trained on 1.4 trillion tokens, outperforms the much larger GPT-3. Smaller models optimized in this way are an emerging trend, challenging the notion that bigger models are always better.
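The "about 20 tokens per parameter" heuristic can be turned into a sizing rule. Combining it with the 6·N·D FLOPs approximation used above (an assumption, not stated in the video): with D = 20·N and C = 6·N·D, the budget becomes C = 120·N², so N = sqrt(C / 120).

```python
import math

# Compute-optimal model sizing, assuming the Chinchilla "~20 tokens per
# parameter" heuristic and the C ~= 6 * N * D FLOPs approximation.
# With D = 20 * N:  C = 120 * N**2  =>  N = sqrt(C / 120).

def chinchilla_optimal(total_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) for a given training FLOPs budget."""
    n_params = math.sqrt(total_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's own budget: 6 * 70e9 params * 1.4e12 tokens ~= 5.9e23 FLOPs
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")  # prints "70B params, 1.4T tokens"
```

Plugging in GPT-3's larger training budget instead shows why the paper calls it under-trained: the same FLOPs spent on a smaller model and far more tokens would have been compute-optimal.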

 


 
