[Thoughts] Implementing offset mapping for a SentencePiece tokenizer
Hugging Face's tokenizers library implements offset mapping only in its Rust-backed "fast" tokenizers; the pure-Python ("slow") tokenizers do not expose it. So I tried to find a way to implement the method myself. Preparation: install fastcore:

```shell
pip install fastcore
```
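fastcore's `patch_to` decorator, used below, attaches a free function to an existing class as a method, so we can add `get_offset_mapping` without editing paddlenlp's source. Roughly, it amounts to a `setattr`; the `Toy` class in this sketch is made up purely for illustration and is not fastcore's code:

```python
# Rough sketch of what @patch_to(SomeClass) does: bind a free function onto an
# existing class as a method, without touching the class's source code.
class Toy:
    def tokenize(self, text):
        return text.split()

def token_count(self, text):
    # the method we want to "patch in" after the fact
    return len(self.tokenize(text))

setattr(Toy, "token_count", token_count)  # roughly what @patch_to(Toy) does

print(Toy().token_count("a b c"))  # → 3
```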
```python
# !pip install fastcore
import unicodedata

from fastcore.all import patch_to
from paddlenlp.transformers import XLNetTokenizer
from transformers import XLNetTokenizerFast

SENTENCEPIECE_UNDERLINE = "▁"


@patch_to(XLNetTokenizer)
def get_offset_mapping(self, text):
    # Mirror the tokenizer's own preprocessing so tokens can be located
    # in the normalized text, while char_mapping tracks, for every char of
    # the normalized text, the index of the original char it came from.
    text = text.replace("``", '"').replace("''", '"')
    normalized_text, char_mapping = "", []
    for i, ch in enumerate(text):
        if not self.keep_accents:
            # NFKD may expand one char into several (base + combining marks);
            # drop the combining marks and map each surviving char back to i.
            ch = unicodedata.normalize("NFKD", ch)
            ch = "".join(c for c in ch if not unicodedata.combining(c))
        normalized_text += ch
        char_mapping.extend([i] * len(ch))
    if self.do_lower_case:
        normalized_text = normalized_text.lower()

    text, token_mapping, offset = normalized_text, [], 0
    split_tokens = self.tokenize(text)
    # A leading bare "▁" token carries no characters of its own.
    if split_tokens[0] == SENTENCEPIECE_UNDERLINE:
        split_tokens = split_tokens[1:]
        token_mapping.append((0, 0))
    for token in split_tokens:
        if token[0] == SENTENCEPIECE_UNDERLINE:
            token = token[1:]
        length = len(token) if len(token) > 0 else 1
        start = text[offset:].index(token) + offset
        end = start + length
        token_mapping.append((char_mapping[start], char_mapping[end - 1] + 1))
        offset = end
    return token_mapping


xlnet_pdtokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
xlnet_hgtokenizer = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")

# Example taken from SQuAD 1.1.
data = {
    "id": "56f7c651aef2371900625bf5",
    "title": "Martin_Luther",
    "context": "Martin Luther (/ˈluːθər/ or /ˈluːðər/; German: [ˈmaɐ̯tiːn ˈlʊtɐ] ( listen); 10 November 1483 – 18 February 1546) was a German professor of theology, composer, priest, former monk and a seminal figure in the Protestant Reformation. Luther came to reject several teachings and practices of the Late Medieval Catholic Church. He strongly disputed the claim that freedom from God's punishment for sin could be purchased with money. He proposed an academic discussion of the power and usefulness of indulgences in his Ninety-Five Theses of 1517. His refusal to retract all of his writings at the demand of Pope Leo X in 1520 and the Holy Roman Emperor Charles V at the Diet of Worms in 1521 resulted in his excommunication by the Pope and condemnation as an outlaw by the Emperor.",
    "question": "Of what nationality was Martin Luther?",
    "answers": ["German", "German", "German"],
    "answer_starts": [39, 119, 119],
    "is_impossible": False,
}
text = data["context"]

# Compare the two XLNet offset mappings.
# Hugging Face (fast, Rust-backed) version:
for a, b in zip(xlnet_hgtokenizer(text, return_offsets_mapping=True)["offset_mapping"],
                xlnet_hgtokenizer.tokenize(text)):
    print(text[a[0]:a[1]], "======", b)
# Patched PaddleNLP version:
for a, b in zip(xlnet_pdtokenizer.get_offset_mapping(text),
                xlnet_pdtokenizer.tokenize(text)):
    print(text[a[0]:a[1]], "======", b)
```
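The accent-handling step is easy to check in isolation. A minimal sketch of the same idea (the function name `strip_accents` is mine, not from either library):

```python
import unicodedata

# NFKD decomposes a precomposed char like "é" into "e" + a combining accent;
# dropping combining marks keeps only base chars, and each surviving char is
# mapped back to the index of the original char it came from.
def strip_accents(text):
    out, mapping = "", []
    for i, ch in enumerate(text):
        for c in unicodedata.normalize("NFKD", ch):
            if not unicodedata.combining(c):
                out += c
                mapping.append(i)
    return out, mapping

print(strip_accents("caf\u00e9"))  # → ('cafe', [0, 1, 2, 3])
```

Because the mapping records a source index for every surviving character, token spans computed in the normalized text can always be translated back to spans in the original text.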
Notes:
- For whitespace, the offset mapping this method produces for XLNet differs slightly from Hugging Face's.
- There may be other issues; treat this as a reference implementation only.
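To see what the main loop is doing without any tokenizer dependencies, here is a stripped-down sketch of the core idea: remove the "▁" word-boundary marker from each token, then locate each piece by scanning forward through the text. The token list is hand-written for illustration, not the output of a real tokenizer:

```python
SPIECE_UNDERLINE = "▁"

def offsets_from_tokens(text, tokens):
    mapping, offset = [], 0
    for token in tokens:
        piece = token.lstrip(SPIECE_UNDERLINE)
        if not piece:  # a bare "▁" token gets a zero-width span
            mapping.append((offset, offset))
            continue
        start = text.index(piece, offset)  # scan forward from the last match
        end = start + len(piece)
        mapping.append((start, end))
        offset = end
    return mapping

text = "Martin Luther was German"
tokens = ["▁Martin", "▁Luther", "▁was", "▁German"]
print(offsets_from_tokens(text, tokens))  # → [(0, 6), (7, 13), (14, 17), (18, 24)]
```

Scanning forward from the end of the previous match (rather than from the start of the text) is what keeps repeated words from matching the same earlier occurrence twice.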