【Thoughts】Implementing offset mapping for a Byte-Pair-Encoding tokenizer
HuggingFace's `tokenizers` library implements offset mapping only in its Rust backend; there is no pure-Python version. So I tried to work out a way to implement this method myself.

Setup: install fastcore first:

```shell
pip install fastcore
```
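fastcore's `patch_to` decorator simply attaches the decorated function as a method on an already-defined class (a monkey patch). A minimal stdlib-only sketch of the same idea, using a toy `Tokenizer` class of my own (not PaddleNLP's):

```python
# A toy class standing in for an existing tokenizer class.
class Tokenizer:
    def tokenize(self, text):
        return text.split()

# Stdlib-only equivalent of fastcore's patch_to: attach the decorated
# function to the given class under its own name.
def patch_to(cls):
    def decorator(fn):
        setattr(cls, fn.__name__, fn)
        return fn
    return decorator

@patch_to(Tokenizer)
def get_offset_mapping(self, text):
    # Toy whitespace implementation: one (start, end) pair per token.
    mapping, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        mapping.append((start, start + len(tok)))
        pos = start + len(tok)
    return mapping

print(Tokenizer().get_offset_mapping("hello world"))  # [(0, 5), (6, 11)]
```

This is why the patched `get_offset_mapping` below can call `self.bpe`, `self.pat`, and so on: once patched, it behaves like a normal method of the tokenizer class.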
The monkey patch that adds `get_offset_mapping` to PaddleNLP's `GPTTokenizer`:

```python
# Install fastcore first:
# !pip install fastcore
from paddlenlp.transformers import XLNetTokenizer, GPTTokenizer
from fastcore.all import patch_to
from paddle.utils import try_import
from transformers import XLNetTokenizerFast, GPT2TokenizerFast


@patch_to(GPTTokenizer)
def get_offset_mapping(self, text):
    token_mapping = []
    global_offsets = 0
    re = try_import("regex")
    for token in re.findall(self.pat, text):
        # Spell the token out byte by byte, and record for every
        # byte-level character which source character produced it.
        newtokens = ""
        char2bpe = []
        for char_index, each_element in enumerate(token):
            for b in each_element.encode("utf-8"):
                newtokens += self.byte_encoder[b]
                char2bpe.append(char_index)
        cum_bpe_offset = 0
        for bpe_token in self.bpe(newtokens).split(" "):
            start = newtokens.index(bpe_token) + cum_bpe_offset
            end = start + len(bpe_token)
            # Map byte-level positions back to character positions.
            new_start = char2bpe[start] + global_offsets
            new_end = char2bpe[end - 1] + global_offsets + 1
            if bpe_token[0] == "Ġ":
                # "Ġ" is the encoded space byte; exclude it from the span.
                new_start += 1
            token_mapping.append((new_start, new_end))
            cum_bpe_offset += len(bpe_token)
            newtokens = newtokens[len(bpe_token):]
        global_offsets += len(token)
    return token_mapping
```

A comparison against HuggingFace on a SQuAD 1.1 example. The original snippet used `gpt2_hgtokenizer` and `gpt2_pdtokenizer` without constructing them; the `from_pretrained` calls below are my assumption:

```python
# Data from SQuAD 1.1.
data = {'id': '56f7c651aef2371900625bf5', 'title': 'Martin_Luther', 'context': "Martin Luther (/ˈluːθər/ or /ˈluːðər/; German: [ˈmaɐ̯tiːn ˈlʊtɐ] ( listen); 10 November 1483 – 18 February 1546) was a German professor of theology, composer, priest, former monk and a seminal figure in the Protestant Reformation. Luther came to reject several teachings and practices of the Late Medieval Catholic Church. He strongly disputed the claim that freedom from God's punishment for sin could be purchased with money. He proposed an academic discussion of the power and usefulness of indulgences in his Ninety-Five Theses of 1517. His refusal to retract all of his writings at the demand of Pope Leo X in 1520 and the Holy Roman Emperor Charles V at the Diet of Worms in 1521 resulted in his excommunication by the Pope and condemnation as an outlaw by the Emperor.", 'question': 'Of what nationality was Martin Luther?', 'answers': ['German', 'German', 'German'], 'answer_starts': [39, 119, 119], 'is_impossible': False}
text = data["context"]

# Tokenizer instances (assumed; not constructed in the original snippet).
gpt2_hgtokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2_pdtokenizer = GPTTokenizer.from_pretrained("gpt2-en")

# Compare the GPT-2 offset mappings.
# HuggingFace version
for a, b in zip(gpt2_hgtokenizer(text, return_offsets_mapping=True)["offset_mapping"],
                gpt2_hgtokenizer.tokenize(text)):
    print(text[a[0]:a[1]], "======", b)
# Paddle version
for a, b in zip(gpt2_pdtokenizer.get_offset_mapping(text),
                gpt2_pdtokenizer.tokenize(text)):
    print(text[a[0]:a[1]], "======", b)
```
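The char-to-byte bookkeeping at the heart of the method can be sketched standalone, with no paddlenlp dependency. `bytes_to_unicode` below is the well-known reversible byte-to-character table from the GPT-2 codebase (the source of `self.byte_encoder` above); `char2bpe_index` is an illustrative helper name of mine:

```python
def bytes_to_unicode():
    # GPT-2's reversible byte -> unicode table: printable bytes map to
    # themselves, the rest are shifted into unused code points >= 256.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()

def char2bpe_index(token):
    # For each source character, emit one byte-level character per UTF-8
    # byte and remember which character index produced it.
    newtokens, char2bpe = "", []
    for char_index, ch in enumerate(token):
        for b in ch.encode("utf-8"):
            newtokens += byte_encoder[b]
            char2bpe.append(char_index)
    return newtokens, char2bpe

# "é" is 2 bytes in UTF-8, so it contributes 2 byte-level characters,
# both pointing back to source character index 1.
newtokens, char2bpe = char2bpe_index("cé")
print(len(newtokens), char2bpe)  # 3 [0, 1, 1]
```

This is why `char2bpe[start]` and `char2bpe[end - 1]` recover correct character offsets even for multi-byte characters: every byte-level position knows which source character it belongs to.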
Notes:
- For offsets around spaces, this method's GPT-2 output differs slightly from HuggingFace's.
- Other issues may exist; for reference only~