https://www.youtube.com/watch?v=Yffk5aydLzg&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=17
A tokenizer takes text as input and outputs numbers that the associated model can make sense of.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Let's try to tokenize!")
print(inputs["input-ids"]
[101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]
The tokenization pipeline: from input text to a list of numbers.
The first step of the pipeline is to split the text into tokens.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Let's try to tokenize!")
print(tokens)
['let', "'", 's', 'try', 'to', 'token', '##ize', '!']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
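Printing these IDs gives the same numbers as in the first example, minus the two special tokens (101 and 102) that the tokenizer adds automatically:
print(input_ids)
[2292, 1005, 1055, 3046, 2000, 19204, 4697, 999]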
The last step is to add the special tokens the model expects; different checkpoints use different ones:
"[CLS] let's try to tokenize! [SEP]" - bert-base-uncased
"<s>Let's try to tokenize!</s>" - roberta-base
! pip install datasets transformers[sentencepiece]
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sentences = [
"I've been waiting for a HuggingFace course my whole life.",
"I hate this.",
]
tokens = [tokenizer.tokenize(sentence) for sentence in sentences]
ids = [tokenizer.convert_tokens_to_ids(token) for token in tokens]
print(ids[0])
print(ids[1])
import torch
ids = [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
[1045, 5223, 2023, 1012]]
input_ids = torch.tensor(ids)
This generates an error: the two lists have different lengths, so they cannot be converted into a rectangular tensor.
import torch
ids = [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
[1045, 5223, 2023, 1012, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
input_ids = torch.tensor(ids)
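The value used for padding is not arbitrary: it is the ID of the tokenizer's padding token, which can be looked up rather than hardcoded (a small sketch; for this checkpoint the padding token is [PAD] with ID 0):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
print(tokenizer.pad_token)     # [PAD]
print(tokenizer.pad_token_id)  # 0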
Attention layers take the padding tokens into account, because they attend to every token in the sequence when building the context for each token. To tell the attention layers to ignore the padding tokens, we need to pass them an attention mask.
With the proper attention mask, predictions are the same for a given sentence, with or without padding.
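A sketch of that comparison, reusing the manually padded IDs from above and passing an attention mask that zeroes out the padding positions (model class and checkpoint as in the course; exact logits depend on the checkpoint):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

ids = [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
       [1045, 5223, 2023, 1012, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
# 1 for real tokens, 0 for padding positions
attention_mask = [[1] * 14,
                  [1] * 4 + [0] * 10]

input_ids = torch.tensor(ids)
attention_mask = torch.tensor(attention_mask)
output = model(input_ids, attention_mask=attention_mask)
print(output.logits)
# With the mask, the logits for the second sentence match what the model
# produces for "I hate this." on its own, without any padding.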
Called with padding=True, the tokenizer directly prepares the inputs with padding and the proper attention mask.
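With padding=True (and return_tensors="pt" to get PyTorch tensors), a single call produces both fields, as in this short sketch:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this.",
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
print(inputs["input_ids"])       # both sequences padded to the same length
print(inputs["attention_mask"])  # 0s where padding was added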