The gpushare platform has now been running for a full year. Many people ask me why I built it in the first place. Here is a story everyone has heard in some form: a company hires a PhD, the PhD spends ages just setting up the environment, and everyone starts questioning the PhD's ability.
Most PhDs do fundamental research, where the big-data and heterogeneous-computing environments are very mature. Many small and mid-sized companies, or new business lines, cannot offer a PhD anything comparable, and an enormous amount of time gets wasted.
The other issue is that China today has very few venues for exchanging core research in information science. With this platform I hope its incentives will encourage more research output. A teacher once told me that if everyone in a country works in sales, the money earned is restless money; only scientific research builds a country's real strength.
Judging by how the volume of information in the Internet+ era has shifted every three years, we are still far from fine-grained management of information science, and many disciplines are only just getting started. I would rather the platform become an education center, a genuine platform joining industry, academia, and research. Among domestic cloud GPU computing platforms, gpushare has built out the following centers: cloud marketplace, datasets, partners, solutions, community, help and support, and popular courses. It is a GPU platform that draws on the strengths of many, and seeing my original intention gradually become reality makes me very happy. Here are the problems it can currently solve:
1. A compute platform that is ready as soon as you pick an environment.
2. Solutions for managing compute resources.
3. Solutions for putting the methods of cutting-edge papers into practice.
4. A community for even the smallest piece of research.
On these fronts I am also a heavy user of domestic GPU service providers.
1. The task mode is excellent: it schedules GPU resources dynamically, managing and allocating them with something like an AIOps strategy.
Best posts made by 152****5202
- A short piece imitating the platform CEO: Happy Spring Festival, everyone
- RE: [Prize Topic No. 7] The legendary all-purpose Transformer, have you used it?
Suppose we want to classify web pages by type. How would we do it?
Step one: collect the high-frequency features in the pages.

import json
from collections import Counter

html_data = json.load(open("html_data.json", "r"))
out_message_data = {"0": [], "1": [], "2": []}
is_chinese_out = []
for i, floor_list in html_data.items():
    for floor in floor_list:
        words = floor.split("\n")
        is_chinese_out.extend(words)
is_chinese_out = list(set(is_chinese_out))
json.dump(is_chinese_out, open("html_chinese_data_sub_one.json", "w"), ensure_ascii=False)
Step two: convert each page into a sequence of ids.
vocab_dict = json.load(open("html_chinese_data_sub_one.json", "r"))
content_id = {}
for i, content in enumerate(vocab_dict):
    content_id[content] = i
Next, we change the original character-level BERT task into a text-classification task whose features are the per-line ids.

class data_generator(DataGenerator):
    """Data generator."""
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, (text, label) in self.sample(random):
            token_ids = []
            segment_ids = []
            for text_one in text.split("\n"):
                token_ids.append(content_id[text_one])
                segment_ids.append(0)
            batch_token_ids.append(token_ids)
            batch_segment_ids.append(segment_ids)
            batch_labels.append([label])
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []
Then we load the BERT configuration; here we choose a small one.
{ "attention_probs_dropout_prob": 0.1, "directionality": "bidi", "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 3, "pooler_fc_size": 768, "pooler_num_attention_heads": 12, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "type_vocab_size": 2, "vocab_size": 21128 }
We also prepare a larger configuration.
{ "attention_probs_dropout_prob": 0.1, "directionality": "bidi", "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 12, "pooler_fc_size": 768, "pooler_num_attention_heads": 12, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "type_vocab_size": 2, "vocab_size": 389137 }
If we need a smaller vocab_size, we can keep only the top vocab_size most frequent page features as model features, or use text clustering to cut down the number of features.
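As a rough sketch of the frequency-based option (my own illustration, not from the original post; it assumes the same html_data.json layout as the snippets above), collections.Counter can rank the lines and keep only the most frequent ones:

import json
from collections import Counter

VOCAB_SIZE = 21128  # assumed target size, here matching the small config above

html_data = json.load(open("html_data.json", "r"))
line_counter = Counter()
for label, floor_list in html_data.items():
    for floor in floor_list:
        line_counter.update(floor.split("\n"))

# Keep only the VOCAB_SIZE most frequent lines as vocabulary entries.
top_lines = [line for line, _ in line_counter.most_common(VOCAB_SIZE)]
json.dump(top_lines, open("html_chinese_data_sub_one.json", "w"), ensure_ascii=False)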
Next comes the moment when the magic happens.

#! -*- coding:utf-8 -*-
import os

os.environ['TF_KERAS'] = '1'
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

import json
from bert4keras.backend import keras, set_gelu
from bert4keras.tokenizers import Tokenizer
from bert4keras.models import build_transformer_model
from bert4keras.optimizers import Adam, extend_with_piecewise_linear_lr
from bert4keras.snippets import sequence_padding, DataGenerator
from bert4keras.snippets import open
from keras.layers import Lambda
from keras.layers import Dense
import tensorflow as tf

set_gelu('tanh')  # switch gelu version

maxlen = 128
batch_size = 32
config_path = 'bert/bert_config_rbt3.json'
checkpoint_path = None
dict_path = 'bert/vocab.txt'

vocab_dict = json.load(open("html_chinese_data_sub_one.json", "r"))
content_id = {}
for i, content in enumerate(vocab_dict):
    content_id[content] = i

data_items = json.load(open("datasets/html_data.json", "r"))
label_id = {}
# import json
# json.dump(label_id, open("label_id.json", "w"), ensure_ascii=False)


def load_data(all_dataset, start_count, end_count):
    """Load data. Single-sample format: (text, label id)."""
    D = []
    for label, j in all_dataset.items():
        count_sum = len(j)
        print(int(count_sum * start_count), int(count_sum * end_count))
        for item in j[int(count_sum * start_count):int(count_sum * end_count)]:
            D.append((item, int(label)))
    return D


# Load the datasets.
count_sum = 0
train_data = load_data(data_items, 0, 0.6)
valid_data = load_data(data_items, 0.6, 0.8)
test_data = load_data(data_items, 0.8, 1)
print(len(train_data))
print(len(valid_data))
print(len(test_data))

# Build the tokenizer.
tokenizer = Tokenizer(dict_path, do_lower_case=True)


class data_generator(DataGenerator):
    """Data generator."""
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, (text, label) in self.sample(random):
            token_ids = []
            segment_ids = []
            for text_one in text.split("\n"):
                token_ids.append(content_id[text_one])
                segment_ids.append(0)
            batch_token_ids.append(token_ids)
            batch_segment_ids.append(segment_ids)
            batch_labels.append([label])
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []


# Load the pre-trained model.
bert = build_transformer_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    return_keras_model=False,
)

output = Lambda(lambda x: x[:, 0], name='CLS-token')(bert.model.output)
output = Dense(
    units=len(data_items.keys()),
    activation='softmax',
    kernel_initializer=bert.initializer
)(output)

model = keras.models.Model(bert.model.input, output)
model.summary()

# Derive an optimizer with a piecewise-linear learning rate schedule.
# The name argument is optional, but filling it in helps distinguish derived optimizers.
AdamLR = extend_with_piecewise_linear_lr(Adam, name='AdamLR')

model.compile(
    loss='sparse_categorical_crossentropy',
    # optimizer=Adam(1e-5),  # use a sufficiently small learning rate
    optimizer=AdamLR(learning_rate=1e-4, lr_schedule={
        1000: 1,
        2000: 0.1
    }),
    metrics=['accuracy'],
)

# Wrap the datasets in generators.
train_generator = data_generator(train_data, batch_size)
valid_generator = data_generator(valid_data, batch_size)
test_generator = data_generator(test_data, batch_size)


def evaluate(data):
    total, right = 0., 0.
    for x_true, y_true in data:
        y_pred = model.predict(x_true).argmax(axis=1)
        y_true = y_true[:, 0]
        total += len(y_true)
        right += (y_true == y_pred).sum()
    return right / total


class Evaluator(keras.callbacks.Callback):
    """Evaluate and save."""
    def __init__(self):
        self.best_val_acc = 0.

    def on_epoch_end(self, epoch, logs=None):
        val_acc = evaluate(valid_generator)
        if val_acc > self.best_val_acc:
            self.best_val_acc = val_acc
            model.save_weights('html_best_model.weights')
            tf.keras.models.save_model(model, save_format="tf", filepath="best_model/2")
        test_acc = evaluate(test_generator)
        print(
            u'val_acc: %.5f, best_val_acc: %.5f, test_acc: %.5f\n' %
            (val_acc, self.best_val_acc, test_acc)
        )


if __name__ == '__main__':
    evaluator = Evaluator()
    model.fit(
        train_generator.forfit(),
        steps_per_epoch=len(train_generator),
        epochs=10,
        callbacks=[evaluator]
    )
    print(u'final test acc: %05f\n' % (evaluate(test_generator)))
Ah, you could call this a kind of art.
- Medical named-entity recognition: accuracy with different model sizes
Experimental setup
An NVIDIA RTX 3090, with PyTorch as the underlying framework.
Several models can run on one card at the same time; here the experiment uses four configurations of the same model family.
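One practical way to let several runs share a single card (my own hedged sketch, not part of the original experiment) is to cap each process's slice of GPU memory before it allocates anything:

import torch

def claim_gpu_slice(fraction: float = 0.25, device: int = 0) -> None:
    """Limit this process to a fraction of one GPU's memory (PyTorch >= 1.8),
    so that, for example, four fine-tuning runs can coexist on one RTX 3090."""
    torch.cuda.set_per_process_memory_fraction(fraction, device)
    torch.cuda.empty_cache()

claim_gpu_slice(0.25)  # each of the four runs would call this before building its model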
The tiny BERT configuration:
{
  "emb_size": 128,
  "feedforward_size": 512,
  "hidden_size": 128,
  "hidden_act": "gelu",
  "heads_num": 2,
  "layers_num": 2,
  "max_seq_length": 512,
  "dropout": 0.1,
  "embedding": "word_pos_seg",
  "encoder": "transformer",
  "mask": "fully_visible",
  "target": "bert"
}
The small configuration:
{ "emb_size": 512, "feedforward_size": 2048, "hidden_size": 512, "hidden_act": "gelu", "heads_num": 8, "layers_num": 4, "max_seq_length": 512, "dropout": 0.1, "embedding": "word_pos_seg", "encoder": "transformer", "mask": "fully_visible", "target": "bert" }
The middle configuration:
{ "emb_size": 512, "feedforward_size": 2048, "hidden_size": 512, "hidden_act": "gelu", "heads_num": 8, "layers_num": 8, "max_seq_length": 512, "dropout": 0.1, "embedding": "word_pos_seg", "encoder": "transformer", "mask": "fully_visible", "target": "bert" }
The large configuration:
{ "emb_size": 1024, "feedforward_size": 4096, "hidden_size": 1024, "hidden_act": "gelu", "heads_num": 16, "layers_num": 24, "max_seq_length": 512, "dropout": 0.1, "embedding": "word_pos_seg", "encoder": "transformer", "mask": "fully_visible", "target": "bert" }
The entities fall into the following categories:
{'TargetedTreatB': 0, 'BodyPartsB': 1, 'AbnormalType': 2, 'IOPB': 3, 'PathogenB': 4, 'Symptom': 5, 'DEPB': 6, 'Disease': 7, 'ImageTB': 8, 'MedEquipB': 9, 'BodyFunction': 10, 'Surgery': 11, 'MedEquip': 12, 'SignsB': 13, 'DiseaseB': 14, 'SurgeryB': 15, 'SymptomB': 16, 'TargetedTreat': 17, 'EnMedOrder': 18, 'EnMedOrderB': 19, 'DEP': 20, 'LabTB': 21, 'O': 22, 'AbnormalTypeB': 23, 'Drug': 24, 'BodyParts': 25, 'BodyFunctionB': 26, 'IOP': 27, 'BodySubstanceB': 28, 'GeneralTest': 29, 'Pathogen': 30, 'ImageT': 31, 'BodySubstance': 32, 'GeneralTestB': 33, 'Signs': 34, 'LabT': 35, 'DrugB': 36}
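In this label scheme, a name ending in "B" marks the first token of an entity. Here is a small sketch of how the training script further down turns this mapping into the begin ids it evaluates against (it assumes the mapping is saved as data/medical_label2id.json, as in the launch commands below):

import json

l2i = json.load(open("data/medical_label2id.json", "r"))
l2i["[PAD]"] = len(l2i)  # extra padding label appended by the training script

# Labels whose name ends with "B" mark the beginning of an entity span.
begin_ids = [idx for label, idx in l2i.items() if label.endswith("B")]
print("labels:", len(l2i), "entity-begin labels:", len(begin_ids))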
Core code
import argparse
import json
import os
import sys

import torch
import torch.nn as nn
import torch.nn.functional as F

uer_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
sys.path.append(uer_dir)

from uer.layers import *
from uer.encoders import *
from uer.utils.config import load_hyperparam
from uer.utils.optimizers import *
from uer.utils.seed import set_seed
from uer.utils.tokenizers import *
from uer.opts import finetune_opts
import pandas as pd
from finetune.run_classifier import build_optimizer, load_or_initialize_parameters


class NerTagger(nn.Module):
    def __init__(self, args):
        super(NerTagger, self).__init__()
        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
        self.encoder = str2encoder[args.encoder](args)
        self.labels_num = args.labels_num
        self.output_layer = nn.Linear(args.hidden_size, self.labels_num)
        self.crf_target = args.crf_target
        if args.crf_target:
            from torchcrf import CRF
            self.crf = CRF(self.labels_num, batch_first=True)
            self.seq_length = args.seq_length

    def forward(self, src, tgt, seg):
        """
        Forward pass.
        Args:
            src: [batch_size x seq_length]
            tgt: [batch_size x seq_length]
            seg: [batch_size x seq_length]
        Returns:
            loss: Sequence labeling loss.
            pred: Predicted label ids.
        """
        # Embedding.
        emb = self.embedding(src, seg)
        # Encoder.
        output = self.encoder(emb, seg)
        # Target.
        logits = self.output_layer(output)
        if self.crf_target:
            tgt_mask = seg.type(torch.uint8)
            pred = self.crf.decode(logits, mask=tgt_mask)
            for j in range(len(pred)):
                while len(pred[j]) < self.seq_length:
                    pred[j].append(self.labels_num - 1)
            pred = torch.tensor(pred).contiguous().view(-1)
            if tgt is not None:
                loss = -self.crf(F.log_softmax(logits, 2), tgt, mask=tgt_mask, reduction='mean')
                return loss, pred
            else:
                return None, pred
        else:
            tgt_mask = seg.contiguous().view(-1).float()
            logits = logits.contiguous().view(-1, self.labels_num)
            pred = logits.argmax(dim=-1)
            if tgt is not None:
                tgt = tgt.contiguous().view(-1, 1)
                one_hot = torch.zeros(tgt.size(0), self.labels_num). \
                    to(torch.device(tgt.device)). \
                    scatter_(1, tgt, 1.0)
                numerator = -torch.sum(nn.LogSoftmax(dim=-1)(logits) * one_hot, 1)
                numerator = torch.sum(tgt_mask * numerator)
                denominator = torch.sum(tgt_mask) + 1e-6
                loss = numerator / denominator
                return loss, pred
            else:
                return None, pred


def read_dataset(args, path):
    dataset, columns = [], {}
    with open(path, mode="r", encoding="utf-8") as f:
        for line_id, line in enumerate(f):
            if line_id == 0:
                for i, column_name in enumerate(line.strip().split("\t")):
                    columns[column_name] = i
                continue
            line = line.strip().split("\t")
            labels = line[columns["label"]]
            tgt = [args.l2i[l] for l in labels.split(" ")]

            text_a = line[columns["text_a"]]
            src = args.tokenizer.convert_tokens_to_ids(args.tokenizer.tokenize(text_a))
            seg = [1] * len(src)

            if len(src) > args.seq_length:
                src = src[: args.seq_length]
                tgt = tgt[: args.seq_length]
                seg = seg[: args.seq_length]
            PAD_ID = args.tokenizer.convert_tokens_to_ids([PAD_TOKEN])[0]
            while len(src) < args.seq_length:
                src.append(PAD_ID)
                tgt.append(args.labels_num - 1)
                seg.append(0)
            dataset.append([src, tgt, seg])

    return dataset


def read_bank_dataset(args, path):
    """Read the bank-review named entity recognition dataset."""
    dataset, columns = [], {}
    train_data = pd.read_csv(path)
    for line, labels in zip(train_data.text, train_data.BIO_anno):
        src = args.tokenizer.convert_tokens_to_ids(args.tokenizer.tokenize(" ".join(list(line))))
        tgt = [args.l2i[l] for l in labels.split(" ")]
        seg = [1] * len(src)
        if len(src) > args.seq_length:
            src = src[: args.seq_length]
            tgt = tgt[: args.seq_length]
            seg = seg[: args.seq_length]
        PAD_ID = args.tokenizer.convert_tokens_to_ids([PAD_TOKEN])[0]
        while len(src) < args.seq_length:
            src.append(PAD_ID)
            seg.append(0)
        while len(tgt) < args.seq_length:
            tgt.append(4)
        if len(src) == len(tgt):
            dataset.append([src, tgt, seg])
    return dataset


def read_medical_dataset(args, path):
    """Read the medical named entity recognition dataset."""
    dataset, columns = [], {}
    all_samples = json.load(open(path, "r"))
    for sample in all_samples:
        labels = sample["ner_tags"]
        line = sample["tokens"]
        src = args.tokenizer.convert_tokens_to_ids(args.tokenizer.tokenize(" ".join(line)))
        tgt = [args.l2i[l] for l in labels]
        seg = [1] * len(src)
        if len(src) > args.seq_length:
            src = src[: args.seq_length]
            tgt = tgt[: args.seq_length]
            seg = seg[: args.seq_length]
        PAD_ID = args.tokenizer.convert_tokens_to_ids([PAD_TOKEN])[0]
        while len(src) < args.seq_length:
            src.append(PAD_ID)
            seg.append(0)
        while len(tgt) < args.seq_length:
            tgt.append(args.l2i["O"])
        if len(src) == len(tgt):
            dataset.append([src, tgt, seg])
    return dataset


def batch_loader(batch_size, src, tgt, seg):
    instances_num = src.size()[0]
    for i in range(instances_num // batch_size):
        src_batch = src[i * batch_size: (i + 1) * batch_size, :]
        tgt_batch = tgt[i * batch_size: (i + 1) * batch_size, :]
        seg_batch = seg[i * batch_size: (i + 1) * batch_size, :]
        yield src_batch, tgt_batch, seg_batch
    if instances_num > instances_num // batch_size * batch_size:
        src_batch = src[instances_num // batch_size * batch_size:, :]
        tgt_batch = tgt[instances_num // batch_size * batch_size:, :]
        seg_batch = seg[instances_num // batch_size * batch_size:, :]
        yield src_batch, tgt_batch, seg_batch


def train(args, model, optimizer, scheduler, src_batch, tgt_batch, seg_batch):
    model.zero_grad()

    src_batch = src_batch.to(args.device)
    tgt_batch = tgt_batch.to(args.device)
    seg_batch = seg_batch.to(args.device)

    loss, _ = model(src_batch, tgt_batch, seg_batch)
    if torch.cuda.device_count() > 1:
        loss = torch.mean(loss)

    if args.fp16:
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
    else:
        loss.backward()

    optimizer.step()
    scheduler.step()

    return loss


def evaluate(args, dataset):
    src = torch.LongTensor([sample[0] for sample in dataset])
    tgt = torch.LongTensor([sample[1] for sample in dataset])
    seg = torch.LongTensor([sample[2] for sample in dataset])

    batch_size = args.batch_size
    correct, gold_entities_num, pred_entities_num = 0, 0, 0

    args.model.eval()

    for i, (src_batch, tgt_batch, seg_batch) in enumerate(batch_loader(batch_size, src, tgt, seg)):
        src_batch = src_batch.to(args.device)
        tgt_batch = tgt_batch.to(args.device)
        seg_batch = seg_batch.to(args.device)
        loss, pred = args.model(src_batch, tgt_batch, seg_batch)

        gold = tgt_batch.contiguous().view(-1, 1)

        for j in range(gold.size()[0]):
            if gold[j].item() in args.begin_ids:
                gold_entities_num += 1

        for j in range(pred.size()[0]):
            if pred[j].item() in args.begin_ids and gold[j].item() != args.l2i["[PAD]"]:
                pred_entities_num += 1

        pred_entities_pos, gold_entities_pos = set(), set()

        for j in range(gold.size()[0]):
            if gold[j].item() in args.begin_ids:
                start = j
                for k in range(j + 1, gold.size()[0]):
                    if gold[k].item() == args.l2i["[PAD]"] or gold[k].item() == args.l2i["O"] or gold[k].item() in args.begin_ids:
                        end = k - 1
                        break
                else:
                    end = gold.size()[0] - 1
                gold_entities_pos.add((start, end))

        for j in range(pred.size()[0]):
            if pred[j].item() in args.begin_ids and gold[j].item() != args.l2i["[PAD]"]:
                start = j
                for k in range(j + 1, pred.size()[0]):
                    if pred[k].item() == args.l2i["[PAD]"] or pred[k].item() == args.l2i["O"] or pred[k].item() in args.begin_ids:
                        end = k - 1
                        break
                else:
                    end = pred.size()[0] - 1
                pred_entities_pos.add((start, end))

        for entity in pred_entities_pos:
            if entity not in gold_entities_pos:
                continue
            for j in range(entity[0], entity[1] + 1):
                if gold[j].item() != pred[j].item():
                    break
            else:
                correct += 1

    print("Report precision, recall, and f1:")
    eps = 1e-9
    p = correct / (pred_entities_num + eps)
    r = correct / (gold_entities_num + eps)
    f1 = 2 * p * r / (p + r + eps)
    print("{:.3f}, {:.3f}, {:.3f}".format(p, r, f1))

    return f1


def main():
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    finetune_opts(parser)

    parser.add_argument("--vocab_path", default=None, type=str,
                        help="Path of the vocabulary file.")
    parser.add_argument("--model_label", default=None, type=str,
                        help="Label used to name the exported model directory.")
    parser.add_argument("--spm_model_path", default=None, type=str,
                        help="Path of the sentence piece model.")
    parser.add_argument("--label2id_path", type=str, required=True,
                        help="Path of the label2id file.")
    parser.add_argument("--crf_target", action="store_true",
                        help="Use CRF loss as the target function or not, default False.")

    args = parser.parse_args()

    # Load the hyperparameters from the config file.
    args = load_hyperparam(args)

    set_seed(args.seed)

    args.begin_ids = []

    with open(args.label2id_path, mode="r", encoding="utf-8") as f:
        l2i = json.load(f)
        print("Labels: ", l2i)
        l2i["[PAD]"] = len(l2i)
        for label in l2i:
            if label[-1] == "B":
                args.begin_ids.append(l2i[label])

    args.l2i = l2i
    args.labels_num = len(l2i)

    # Tokenizer that maps text to ids.
    args.tokenizer = SpaceTokenizer(args)

    # Build the sequence labeling model.
    print(args)
    model = NerTagger(args)

    # Load or initialize parameters.
    load_or_initialize_parameters(args, model)

    args.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(args.device)

    # Training phase.
    instances = read_medical_dataset(args, args.train_path)

    src = torch.LongTensor([ins[0] for ins in instances])
    tgt = torch.LongTensor([ins[1] for ins in instances])
    seg = torch.LongTensor([ins[2] for ins in instances])

    instances_num = src.size(0)
    batch_size = args.batch_size
    args.train_steps = int(instances_num * args.epochs_num / batch_size) + 1

    print("Batch size: ", batch_size)
    print("The number of training instances:", instances_num)

    optimizer, scheduler = build_optimizer(args, model)

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    if torch.cuda.device_count() > 1:
        print("{} GPUs are available. Let's use them.".format(torch.cuda.device_count()))
        model = torch.nn.DataParallel(model)
    args.model = model

    total_loss, f1, best_f1 = 0.0, 0.0, 0.0

    print("Start training.")

    for epoch in range(1, args.epochs_num + 1):
        model.train()
        src_batch, tgt_batch, seg_batch = [], [], []
        for i, (src_batch, tgt_batch, seg_batch) in enumerate(batch_loader(batch_size, src, tgt, seg)):
            loss = train(args, model, optimizer, scheduler, src_batch, tgt_batch, seg_batch)
            total_loss += loss.item()
            if (i + 1) % args.report_steps == 0:
                print("Epoch id: {}, Training steps: {}, Avg loss: {:.3f}".format(epoch, i + 1, total_loss / args.report_steps))
                total_loss = 0.0

        f1 = evaluate(args, read_medical_dataset(args, args.dev_path))
        if f1 > best_f1:
            best_f1 = f1
            torch.save(model.state_dict(), args.output_model_path)
            if not os.path.exists(args.model_label):
                os.mkdir(args.model_label)
            # Export the best model to ONNX as well.
            torch.onnx.export(model,
                              (src_batch.to(args.device), tgt_batch.to(args.device), seg_batch.to(args.device)),
                              os.path.join(args.model_label, "uer_py_ner_part" + str(best_f1) + ".onnx"),
                              verbose=True)
        else:
            continue

    # Evaluation phase.
    if args.test_path is not None:
        print("Test set evaluation.")
        if torch.cuda.device_count() > 1:
            args.model.module.load_state_dict(torch.load(args.output_model_path))
        else:
            args.model.load_state_dict(torch.load(args.output_model_path))
        evaluate(args, read_medical_dataset(args, args.test_path))


if __name__ == "__main__":
    main()
Launch parameters
Training NER with the tiny config:
python finetune/run_ner_medical.py --train_path data/medical_ner.json --dev_path data/medical_ner.json --output_model_path models/part.bin --label2id_path data/medical_label2id.json --vocab_path models/google_zh_vocab.txt --batch_size 64 --epochs_num 300 --config_path models/bert/tiny_config.json --model_label onnx_tiny
Training NER with the mini config:
python finetune/run_ner_medical.py --train_path data/medical_ner.json --dev_path data/medical_ner.json --output_model_path models/part.bin --label2id_path data/medical_label2id.json --vocab_path models/google_zh_vocab.txt --batch_size 16 --epochs_num 300 --config_path models/bert/mini_config.json --model_label onnx_mini
Training NER with the small config:
python finetune/run_ner_medical.py --train_path data/medical_ner.json --dev_path data/medical_ner.json --output_model_path models/part.bin --label2id_path data/medical_label2id.json --vocab_path models/google_zh_vocab.txt --batch_size 16 --epochs_num 300 --config_path models/bert/small_config.json --model_label onnx_small
- 恒源云 (gpushare) is key infrastructure for the metaverse
The metaverse concept, riding the recent Zuckerberg wave, has pushed today's internet applications into an era we can no longer avoid. From the informatization that began around 2000, to the mobile internet that took off in 2012, to the AI boom that started in 2018, we have gradually realized that a single, isolated ecosystem cannot satisfy all of a user's needs. A portable virtual world with an effective system of equivalent exchange, interoperable across ecosystems, has become the demand of the times. Why do we say that 恒源云 is infrastructure for the metaverse? It comes down to model fusion: in the metaverse we need to preserve the independence of each meta-ecosystem while also finding the intermediate states in which they can be fused. As a third-party hub for heterogeneous high-performance computing, an independent compute platform that provides both the compute channel and the fusion channel for these scenarios becomes extremely important.
In the future, 恒源云 will also introduce autodl automatic network-architecture search and automl automatic hyperparameter search, tackling the information bias that arises in metaverse interaction scenarios from a more automated and intelligent angle.
Take corporate announcement generation as a scenario for the relationship between text continuation and the metaverse. On the data side, we first train a general model on all corporate announcements, and afterwards we fine-tune a vertical text-generation model. The remaining parts hardly differ from the traditional approach. This is one kind of metaverse application.
Combined with VR scenes: Disney has many IPs, and Universal Studios has many IPs as well. How can these IPs converse with end customers, keep the dialogue varied, and still preserve each IP's own character? That can be built on exactly this kind of metaverse design pattern.
Latest posts made by 152****5202
- RE: StellarGraph, a TensorFlow-based graph neural network framework
As shown in the figure above, there are 2,016 speed observations (time steps) from 207 sensors. Speed is recorded every 5 minutes, so one hour gives 12 observations and one day gives 288 (12 x 24). In total the data cover speeds recorded every 5 minutes over 7 days (12 x 24 x 7 = 2,016 observations per sensor) across the 207 sensors.
Spatio-temporal forecasting is a supervised learning problem
A time-series forecasting problem can be cast as supervised learning: we use the previous time steps as input features and the next time step as the output to predict. A spatio-temporal forecasting problem can then be modeled as predicting an entity's future feature values given the historical values of its own features and the feature values of the entities "connected" to it. In the speed-forecasting example, a sensor's historical speeds form the time series, and the distances between sensors indicate their connectivity or proximity.
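As a rough sketch of that framing (my own illustration, assuming the 2,016 x 207 speed matrix described above), we can slide a window over the time axis and use each window as the input and the step right after it as the target:

import numpy as np

def make_supervised(speeds, window=12):
    """Turn a (num_timesteps, num_sensors) matrix into (window, next-step) pairs."""
    X, y = [], []
    for t in range(len(speeds) - window):
        X.append(speeds[t:t + window])   # previous `window` steps as input features
        y.append(speeds[t + window])     # the following step as the prediction target
    return np.stack(X), np.stack(y)

speeds = np.random.rand(2016, 207)       # 2016 five-minute steps, 207 sensors
X, y = make_supervised(speeds, window=12)
print(X.shape, y.shape)                  # (2004, 12, 207) (2004, 207)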