    [Thoughts] Implementing offset mapping for a Byte-Pair-Encoding tokenizer

    • 183****0229 last edited by 183****0229

      HuggingFace's tokenizers library only provides get offset mapping in its Rust-backed (fast) version; there is no pure-Python implementation, so I wanted to try to find a way to implement this method myself.
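
      For context, an offset mapping pairs each token with the (start, end) character span it occupies in the original string, so that text[start:end] recovers the token's surface form. A minimal illustration with the Rust-backed fast tokenizer (the behavior we want to reproduce in Python):

      from transformers import GPT2TokenizerFast

      tok = GPT2TokenizerFast.from_pretrained("gpt2")
      s = "Martin Luther"
      # Each (start, end) pair indexes back into the input string.
      for start, end in tok(s, return_offsets_mapping=True)["offset_mapping"]:
          print((start, end), repr(s[start:end]))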

      Setup

      Install: pip install fastcore
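
      fastcore's patch_to decorator is what lets us attach a new method to an existing class after the fact (plain monkey-patching); a minimal sketch:

      from fastcore.all import patch_to

      class A:
          pass

      @patch_to(A)
      def hello(self):
          # After the decorator runs, every A instance has .hello().
          return "hello"

      print(A().hello())  # -> hello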

      # Install fastcore
      # !pip install fastcore

      from paddlenlp.transformers import GPTTokenizer
      from fastcore.all import patch_to
      from paddle.utils import try_import
      from transformers import GPT2TokenizerFast

      @patch_to(GPTTokenizer)
      def get_offset_mapping(self, text):
          token_mapping = []
          global_offsets = 0  # char offset of the current regex word in `text`
          re = try_import("regex")
          # GPT-2 first splits the text into "words" with its regex pattern.
          for token in re.findall(self.pat, text):
              # Re-encode the word byte by byte, recording which original
              # character each byte-level symbol came from.
              newtokens = ""
              char2bpe = []
              for char_index, each_element in enumerate(token):
                  for b in each_element.encode("utf-8"):
                      newtokens += self.byte_encoder[b]
                      char2bpe.append(char_index)

              # Map every BPE piece back to a character span via char2bpe.
              cum_bpe_offset = 0
              for bpe_token in self.bpe(newtokens).split(" "):
                  start = newtokens.index(bpe_token) + cum_bpe_offset
                  end = start + len(bpe_token)
                  new_start = char2bpe[start] + global_offsets
                  new_end = char2bpe[end - 1] + global_offsets + 1
                  if bpe_token[0] == "Ġ":  # "Ġ" marks a leading space; skip it
                      new_start += 1
                  token_mapping.append((new_start, new_end))
                  cum_bpe_offset += len(bpe_token)
                  newtokens = newtokens[len(bpe_token):]

              global_offsets += len(token)

          return token_mapping
      
      # Sample taken from SQuAD 1.1
      data = {'id': '56f7c651aef2371900625bf5',
       'title': 'Martin_Luther',
       'context': "Martin Luther (/ˈluːθər/ or /ˈluːðər/; German: [ˈmaɐ̯tiːn ˈlʊtɐ] ( listen); 10 November 1483 – 18 February 1546) was a German professor of theology, composer, priest, former monk and a seminal figure in the Protestant Reformation. Luther came to reject several teachings and practices of the Late Medieval Catholic Church. He strongly disputed the claim that freedom from God's punishment for sin could be purchased with money. He proposed an academic discussion of the power and usefulness of indulgences in his Ninety-Five Theses of 1517. His refusal to retract all of his writings at the demand of Pope Leo X in 1520 and the Holy Roman Emperor Charles V at the Diet of Worms in 1521 resulted in his excommunication by the Pope and condemnation as an outlaw by the Emperor.",
       'question': 'Of what nationality was Martin Luther?',
       'answers': ['German', 'German', 'German'],
       'answer_starts': [39, 119, 119],
       'is_impossible': False}
      text = data["context"]
      
      # Compare GPT-2 offset mappings.
      # Note: the constructors below are not shown in the original post;
      # "gpt2" and "gpt2-en" are the usual pretrained names in each library.
      gpt2_hgtokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
      gpt2_pdtokenizer = GPTTokenizer.from_pretrained("gpt2-en")

      # HuggingFace version
      hf_offsets = gpt2_hgtokenizer(text, return_offsets_mapping=True)["offset_mapping"]
      for a, b in zip(hf_offsets, gpt2_hgtokenizer.tokenize(text)):
          print(text[a[0]:a[1]], "======", b)
      # Paddle version
      for a, b in zip(gpt2_pdtokenizer.get_offset_mapping(text), gpt2_pdtokenizer.tokenize(text)):
          print(text[a[0]:a[1]], "======", b)
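
      As a usage sketch (not part of the original post): offset mappings are exactly what you need to convert SQuAD's character-level answer_starts into token indices. Reusing the variables defined above:

      # Find which tokens cover the first answer span ("German" at char 39).
      answer_start = data["answer_starts"][0]
      answer_end = answer_start + len(data["answers"][0])
      offsets = gpt2_pdtokenizer.get_offset_mapping(text)
      token_ids = [i for i, (s, e) in enumerate(offsets)
                   if s < answer_end and e > answer_start]
      print(token_ids, [text[offsets[i][0]:offsets[i][1]] for i in token_ids])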
      

      Notes:

      • For offset mappings around whitespace, this GPT-2 method differs slightly from HuggingFace's.
      • Other issues may exist; for reference only~
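
      A quick way to see that whitespace difference (a sketch reusing the tokenizers built above):

      sample = "Martin Luther"
      print(gpt2_hgtokenizer(sample, return_offsets_mapping=True)["offset_mapping"])
      print(gpt2_pdtokenizer.get_offset_mapping(sample))
      # Per the note above, the spans around the leading space of a
      # "Ġ"-prefixed token may differ between the two methods.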