The tokenization pipeline

Hugging Face Course

by Taeyoon.Kim.DS 2023. 9. 14. 19:08


https://www.youtube.com/watch?v=Yffk5aydLzg&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=17 

A tokenizer takes texts as inputs and outputs numbers the associated model can make sense of.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Let's try to tokenize!")
print(inputs["input_ids"])

[101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]
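The tokenizer's output is more than the input IDs; for this checkpoint it also contains token_type_ids and an attention_mask (the mask is covered in the batching section below):

print(inputs)

{'input_ids': [101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}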

The tokenization pipeline: from input text to a list of numbers.

The first step of the pipeline is to split the text into tokens.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Let's try to tokenize!")
print(tokens)

['let', "'", 's', 'try', 'to', 'token', '##ize', '!']
The "##" prefix marks a subword that continues the previous token. The second step maps the tokens to their IDs in the tokenizer's vocabulary:

input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

[2292, 1005, 1055, 3046, 2000, 19204, 4697, 999]

The final step adds the special tokens the model expects, which depend on the checkpoint:

[CLS] let's try to tokenize! [SEP] - bert-base-uncased
<s>Let's try to tokenize!</s> - roberta-base

 

Batching inputs together (PyTorch)

!pip install datasets transformers[sentencepiece]

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this.",
]
tokens = [tokenizer.tokenize(sentence) for sentence in sentences]
ids = [tokenizer.convert_tokens_to_ids(token) for token in tokens]
print(ids[0])
print(ids[1])
import torch

# The IDs printed above, hard-coded as a nested list:
ids = [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
       [1045, 5223, 2023, 1012]]

input_ids = torch.tensor(ids)

This generates an error: the two lists have different lengths, so they can't be turned into a rectangular tensor.

import torch

ids = [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
       [1045, 5223, 2023, 1012,    0,    0,    0,     0,     0,    0,    0,    0,    0,    0]]

input_ids = torch.tensor(ids)

Padding can't use an arbitrary value; the tokenizer knows which ID the model was trained to treat as padding:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
print(tokenizer.pad_token_id)

0
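To build the padded batch programmatically rather than hard-coding zeros (a small sketch; assumes ids holds the two unpadded ID lists from the first batching example):

# Pad every sequence to the length of the longest one in the batch:
max_len = max(len(seq) for seq in ids)
padded = [seq + [tokenizer.pad_token_id] * (max_len - len(seq)) for seq in ids]
input_ids = torch.tensor(padded)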
from transformers import AutoModelForSequenceClassification

ids1 = torch.tensor(
    [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]]
)
ids2 = torch.tensor([[1045, 5223, 2023, 1012]])
all_ids = torch.tensor(
    [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
     [1045, 5223, 2023, 1012,    0,    0,    0,     0,     0,    0,    0,    0,    0,    0]]
)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
print(model(ids1).logits)
print(model(ids2).logits)
print(model(all_ids).logits)
The logits for the second sentence differ between the batched run and the standalone run: attention layers look at every token in the sequence, padding included, so the padding tokens change the result. To tell the attention layers to ignore the padding tokens, we pass an attention mask, with 1s for real tokens and 0s for padding:

attention_mask = torch.tensor(
    [[   1,    1,    1,    1,    1,    1,    1,     1,     1,    1,    1,    1,    1,    1],
     [   1,    1,    1,    1,    0,    0,    0,     0,     0,    0,    0,    0,    0,    0]]
)

For comparison, here are the standalone predictions again:

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
output1 = model(ids1)
output2 = model(ids2)
print(output1.logits)
print(output2.logits)

With the proper attention mask, the predictions for a given sentence are the same with or without padding:

output = model(all_ids, attention_mask=attention_mask)
print(output.logits)
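As a quick sanity check (a sketch; the tolerance is an arbitrary choice), the second row of the batched logits should now match the standalone prediction for the second sentence:

# Expected to print True, per the statement above.
print(torch.allclose(output.logits[1], output2.logits[0], atol=1e-4))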

 

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this.",
]
print(tokenizer(sentences, padding=True))

Called with padding=True, the tokenizer directly prepares the inputs with padding and the proper attention mask.
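From there, the tokenizer can also return PyTorch tensors ready to feed into the model (a minimal sketch reusing the checkpoint, tokenizer, and sentences defined above; truncation=True is a common addition, not strictly needed here):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# padding=True pads to the longest sentence in the batch;
# return_tensors="pt" returns PyTorch tensors instead of Python lists.
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
output = model(**inputs)
print(output.logits)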
