The tokenization pipeline

Hugging Face Course

by Taeyoon.Kim.DS 2023. 9. 14. 19:08


https://www.youtube.com/watch?v=Yffk5aydLzg&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=17 

A tokenizer takes texts as inputs and outputs numbers the associated model can make sense of.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Let's try to tokenize!")
print(inputs["input_ids"])

[101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]
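The tokenizer's output is more than the input IDs; for this checkpoint it also contains token_type_ids and an attention_mask (the mask is covered in the batching section below):

print(inputs)

{'input_ids': [101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}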

The tokenization pipeline: from input text to a list of numbers.

The first step of the pipeline is to split the text into tokens.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Let's try to tokenize!")
print(tokens)

['let', "'", 's', 'try', 'to', 'token', '##ize', '!']
The "##" prefix marks a subword that continues the previous token. The second step maps the tokens to their IDs in the tokenizer's vocabulary:

input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

[2292, 1005, 1055, 3046, 2000, 19204, 4697, 999]

The final step adds the special tokens the model expects, which depend on the checkpoint:

[CLS] let's try to tokenize! [SEP] - bert-base-uncased
<s>Let's try to tokenize!</s> - roberta-base

 

Batching inputs together (PyTorch)

!pip install datasets transformers[sentencepiece]

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this.",
]
tokens = [tokenizer.tokenize(sentence) for sentence in sentences]
ids = [tokenizer.convert_tokens_to_ids(token) for token in tokens]
print(ids[0])
print(ids[1])
import torch

# The IDs printed above, hard-coded as a nested list:
ids = [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
       [1045, 5223, 2023, 1012]]

input_ids = torch.tensor(ids)

This generates an error: the two lists have different lengths, so they can't be turned into a rectangular tensor.

import torch

ids = [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
       [1045, 5223, 2023, 1012,    0,    0,    0,     0,     0,    0,    0,    0,    0,    0]]

input_ids = torch.tensor(ids)

Padding can't use an arbitrary value; the tokenizer knows which ID the model was trained to treat as padding:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
print(tokenizer.pad_token_id)

0
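To build the padded batch programmatically rather than hard-coding zeros (a small sketch; assumes ids holds the two unpadded ID lists from the first batching example):

# Pad every sequence to the length of the longest one in the batch:
max_len = max(len(seq) for seq in ids)
padded = [seq + [tokenizer.pad_token_id] * (max_len - len(seq)) for seq in ids]
input_ids = torch.tensor(padded)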
from transformers import AutoModelForSequenceClassification

ids1 = torch.tensor(
    [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]]
)
ids2 = torch.tensor([[1045, 5223, 2023, 1012]])
all_ids = torch.tensor(
    [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
     [1045, 5223, 2023, 1012,    0,    0,    0,     0,     0,    0,    0,    0,    0,    0]]
)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
print(model(ids1).logits)
print(model(ids2).logits)
print(model(all_ids).logits)
The logits for the second sentence differ between the batched run and the standalone run: attention layers look at every token in the sequence, padding included, so the padding tokens change the result. To tell the attention layers to ignore the padding tokens, we pass an attention mask, with 1s for real tokens and 0s for padding:

attention_mask = torch.tensor(
    [[   1,    1,    1,    1,    1,    1,    1,     1,     1,    1,    1,    1,    1,    1],
     [   1,    1,    1,    1,    0,    0,    0,     0,     0,    0,    0,    0,    0,    0]]
)

For comparison, here are the standalone predictions again:

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
output1 = model(ids1)
output2 = model(ids2)
print(output1.logits)
print(output2.logits)

With the proper attention mask, the predictions for a given sentence are the same with or without padding:

output = model(all_ids, attention_mask=attention_mask)
print(output.logits)
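As a quick sanity check (a sketch; the tolerance is an arbitrary choice), the second row of the batched logits should now match the standalone prediction for the second sentence:

# Expected to print True, per the statement above.
print(torch.allclose(output.logits[1], output2.logits[0], atol=1e-4))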

 

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this.",
]
print(tokenizer(sentences, padding=True))

Called with padding=True, the tokenizer directly prepares the inputs with padding and the proper attention mask.
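From there, the tokenizer can also return PyTorch tensors ready to feed into the model (a minimal sketch reusing the checkpoint, tokenizer, and sentences defined above; truncation=True is a common addition, not strictly needed here):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# padding=True pads to the longest sentence in the batch;
# return_tensors="pt" returns PyTorch tensors instead of Python lists.
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
output = model(**inputs)
print(output.logits)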
