Machine Reading Comprehension with PaddlePaddle: A Competition-Winning Solution



This post is the open-sourced first-place solution from the recently concluded "Baidu PaddlePaddle AI Competition: Intelligent Question Answering". Thanks to iioiio for contributing this knowledge to the community and to readers on Zhihu; we hope the following content helps your learning.

Note: Due to the article length limit, and for readability, the preparatory work and some of the code and output have been omitted; this post focuses on the author's approach and the model-building part. The original project is very detailed. Click the original link to view the full project; after forking you can reproduce the code on K-Lab.

Original post: Machine Reading Comprehension with PaddlePaddle (DuReader dataset)

Author: iioiio

Preface

This is not the final submitted version; the documentation differs a little, but the code is essentially the same.

The old parameters in the train section were accidentally overwritten by new ones, so they do not line up with predict (you can see one is final_1 and the other final_2), but this does not affect reading.

If the data is inaccessible, you can obtain the BROAD dataset from ai.baidu.com/broad to experiment with.

Some practical (read: basic) techniques:

1. Model preparation

    • Pre-trained word embeddings
    • Deriving the training labels

2. Model construction

    • Stacked BiLSTM
    • P-to-Q attention
    • Parameter attention

3. Model training

    • Word embeddings kept on the CPU
    • Xavier initialization
    • Decaying learning rate
    • Gradient clipping

4. Model testing

    • A NumPy upper-triangular matrix to locate the position of the maximum probability product
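The testing-stage trick above (finding the span that maximizes the product of start and end probabilities) can be sketched in plain NumPy: the outer product of the start and end probability vectors scores every (start, end) pair, and upper-triangular masking discards pairs where the end precedes the start or the span exceeds a length cap. This is a minimal illustration rather than the project's exact code, which uses an additive mask instead:

```python
import numpy as np

def best_span(start_probs, end_probs, max_a_len):
    """Return the (start, end) pair maximizing P(start) * P(end) with end >= start."""
    size = len(start_probs)
    # Outer product: scores[i, j] = start_probs[i] * end_probs[j]
    scores = np.outer(start_probs, end_probs)
    # Keep only spans with start <= end <= start + max_a_len - 1
    scores = np.triu(scores) - np.triu(scores, max_a_len)
    best_start, best_end = np.unravel_index(scores.argmax(), scores.shape)
    return (best_start, best_end), scores[best_start, best_end]

start = np.array([0.1, 0.6, 0.2, 0.1])
end = np.array([0.1, 0.1, 0.7, 0.1])
span, prob = best_span(start, end, max_a_len=3)  # span (1, 2), prob 0.42
```

Subtracting the triangle shifted by max_a_len zeroes out over-long spans, so the argmax lands on the best valid span and the returned value is its true probability product.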

1. File Structure

The data-processing flow and file structure of this code follow the TensorFlow version of the DuReader baseline [1].

  • utils: DuReader result-evaluation module
    • __init__.py
    • dureader_eval.py
    • preprocess.py
  • vocab.py: vocabulary
  • get_train_label.py: locates the answers in the training set to produce training labels
  • dataset.py: dataset
  • xavier_initialization.py: model parameter initialization
  • model.py: the Paddle model (data => start_probs, end_probs, loss)
  • rc_model.py: train, infer, and other operations on the model
  • run.py: main program
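get_train_label.py has to derive span labels because DuReader provides free-text answers rather than start/end positions. A common recipe (the DuReader baseline does this; this project's exact implementation may differ) is to label the passage span with maximal token-level F1 overlap against the reference answer. A minimal sketch, with the max_a_len cap as an illustrative assumption:

```python
from collections import Counter

def token_f1(span_tokens, answer_tokens):
    """Token-level F1 overlap between a candidate span and the reference answer."""
    common = Counter(span_tokens) & Counter(answer_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / float(len(span_tokens))
    recall = overlap / float(len(answer_tokens))
    return 2 * precision * recall / (precision + recall)

def label_span(passage_tokens, answer_tokens, max_a_len=10):
    """Find the (start, end) span in the passage best matching the answer."""
    best, best_f1 = (0, 0), 0.0
    for s in range(len(passage_tokens)):
        for e in range(s, min(s + max_a_len, len(passage_tokens))):
            f1 = token_f1(passage_tokens[s:e + 1], answer_tokens)
            if f1 > best_f1:
                best, best_f1 = (s, e), f1
    return best, best_f1

passage = ['the', 'capital', 'of', 'france', 'is', 'paris', '.']
answer = ['paris']
span, f1 = label_span(passage, answer)  # span (5, 5), f1 1.0
```

The resulting (start, end) pair becomes the supervision target for the two softmax outputs of the model.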

Preparation:

If you do not yet have a vocabulary, first run the GloVe [2] step below to generate vectors.txt (about 2.5 hours).

Then run run.py --prepare (with args) to build the vocabulary via vocab.py.

This only needs to be done once; the vocabulary is automatically saved as a local pkl file, after which this step can be skipped.

Running:

First, run run.py --train (with args): data is read through dataset.py and passed to the train operation wrapped in rc_model.py, which runs the model in model.py.

Then, run run.py --predict (with args): same as above, but with the test data and the infer operation.

2. Model Overview

The system architecture diagram is shown in the original post.

The data flow is as follows:

The raw text is segmented and mapped to word embeddings, then encoded by a stacked bidirectional LSTM. Next, parameter attention is applied to the question and document-to-question attention is computed, and the attention results are concatenated with the document encoding. This fused representation is modeled by another LSTM, and finally two independent linear layers produce unnormalized start and end scores for the answer, which are normalized into probabilities with Softmax.

The role of each part:

Input layer: recurrent-network encoding

Fusion layer: attention aligns the document with the question

Modeling layer: models the fused features

Output layer: produces the start/end probabilities from the modeled features
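Stripped of Paddle's sequence machinery, the fusion layer's document-to-question attention amounts to: for each document position, score every question position by dot product, normalize with softmax, and take the weighted sum of question encodings. A minimal NumPy sketch (function and variable names are illustrative, not from model.py):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def p2q_attention(p_enc, q_enc):
    """For each document position, an attention-weighted sum of question encodings.

    p_enc: (p_len, d) document encoding; q_enc: (q_len, d) question encoding.
    Returns (p_len, d): question context aligned to each document token.
    """
    out = np.zeros_like(p_enc)
    for i, p_vec in enumerate(p_enc):
        weights = softmax(q_enc @ p_vec)  # dot-product scores over the question
        out[i] = weights @ q_enc          # weighted sum of question vectors
    return out

p_enc = np.random.randn(6, 8)
q_enc = np.random.randn(4, 8)
ctx = p2q_attention(p_enc, q_enc)  # one aligned question vector per document token
```

In the actual model this aligned context is concatenated with the document encoding (together with the expanded parameter-attention summary of the question) before the modeling LSTM.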

References

[1] EastonWang. DuReader [OL]. github.com/baidu/DuRead, 2017.

[2] Pennington J, Socher R, Manning C. Glove: Global Vectors for Word Representation[C]// Conference on Empirical Methods in Natural Language Processing. 2014:1532-1543.

Written by iioiio, E-Mail: xghuaxiang@163.com

Prepare Env

%load_ext klab-autotime
!pip2 install pathlib
!pip2 install joblib
!mkdir dureader_final

cd dureader_final/

GloVe word embeddings

This ONLY needs to be run once. After preparation, the vocabulary is saved in ~/work/dureader/duread_files/vocab/<vocabname>.pkl

Code omitted here; click through to the original post for the full project.

Download Evaluation Scripts

Official evaluation script from the DuReader baseline (github.com/baidu/DuRead)

Code omitted here; click through to the original post for the full project.

Python Files

1. Vocab and Data

(Code omitted due to the length limit; see the original post for the full project.)

2. Model and RC Model

%%writefile xavier_initialization.py
# -*- coding:utf8 -*-
# modified from the torch implementation
import math
import numpy as np


def _calculate_fan_in_and_fan_out(array):
    """Compute the input and output dimensions."""
    dimensions = array.ndim
    if dimensions < 2:
        raise ValueError("Fan in and fan out can not be computed for tensor with less than 2 dimensions")
    if dimensions == 2:  # Linear
        fan_in = array.shape[1]
        fan_out = array.shape[0]
    else:
        num_input_fmaps = array.shape[1]
        num_output_fmaps = array.shape[0]
        receptive_field_size = 1
        if array.ndim > 2:
            receptive_field_size = array[0][0].size
        fan_in = num_input_fmaps * receptive_field_size
        fan_out = num_output_fmaps * receptive_field_size
    return fan_in, fan_out


def xavier_normal(*shape):
    """Sample from the Xavier normal distribution."""
    array = np.zeros(shape)
    fan_in, fan_out = _calculate_fan_in_and_fan_out(array)
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return np.random.normal(0, std, array.shape)


def xavier_init_params(params, exclude=[]):
    """Xavier-initialize all parameters except those in exclude."""
    for p in params.names():
        shape = params.get_shape(p)
        if p not in exclude:
            params.set(p, xavier_normal(*shape))

%%writefile model.py
# -*- coding:utf8 -*-
import paddle.v2 as paddle
import paddle.v2.activation as Act
import paddle.v2.attr as Attr
import paddle.v2.layer as layer
import paddle.v2.pooling as Pool


class Model:
    def __init__(self, emb_size, emb_dim, enc_dim):
        self.emb_size = emb_size
        self.emb_dim = emb_dim
        self.enc_dim = enc_dim
        self.emb_param = Attr.Param('emb', is_static=True)

    def emb(self, x, drop_rate=0.2):
        """Convert token ids to word embeddings."""
        # device=-1 keeps the embedding lookup on the CPU
        x_emb = layer.embedding(input=x, size=self.emb_dim,
                                param_attr=self.emb_param,
                                layer_attr=Attr.ExtraLayerAttribute(device=-1))
        x_emb = layer.dropout(x_emb, drop_rate)
        return x_emb

    def fc(self, x, size, prefix='', act=Act.Identity(), drop_rate=0.):
        """Fully connected layer."""
        proj = layer.full_matrix_projection(input=x, size=size,
                                            param_attr=Attr.Param(name=prefix + '_fc_w'))
        x_fc = paddle.layer.mixed(
            size=size, input=proj, act=act,
            bias_attr=Attr.Param(name=prefix + '_fc_b')
        )
        return layer.dropout(x_fc, drop_rate)

    def enc(self, x, prefix='', nlayer=2, drop_rate=0.2):
        """Stacked BiLSTM encoder."""
        lstm_last = []
        for direct in ['_fwd_', '_bwd_']:
            h = x
            for i in range(nlayer):
                full_prefix = prefix + '_lstm' + direct + str(i)
                h = paddle.layer.lstmemory(
                    input=self.fc(h, size=4 * self.enc_dim, prefix=full_prefix),
                    bias_attr=Attr.Param(name=full_prefix + '_b'),
                    param_attr=Attr.Param(name=full_prefix + '_w'),
                    reverse=(direct == '_bwd_'))
            lstm_last.append(h)
        x_enc = paddle.layer.concat(input=lstm_last)
        return layer.dropout(x_enc, drop_rate)

    def att(self, a, b):
        """Attention of b over a (each element of b attends to a)."""
        def step(i, j):
            expand = layer.expand(input=i, expand_as=j)
            dot_prod = layer.dot_prod(expand, j)
            att_weight = layer.mixed(size=1, bias_attr=False,
                                     act=Act.SequenceSoftmax(),
                                     input=layer.identity_projection(input=dot_prod))
            scaled = layer.scaling(input=j, weight=att_weight)
            return layer.pooling(input=scaled, pooling_type=Pool.Sum())
        return layer.recurrent_group(input=[b, layer.StaticInput(a)], step=step)

    def param_att(self, x, prefix=''):
        """Parameter attention over x (a learned parameter attends to x)."""
        att_weight = layer.fc(input=x, size=1, act=Act.SequenceSoftmax(),
                              param_attr=Attr.Param(prefix + '_att_weight'),
                              bias_attr=False)
        scaled = layer.scaling(input=x, weight=att_weight)
        return layer.pooling(input=scaled, pooling_type=Pool.Sum())

    def loss(self, start_prob, end_prob, start_label, end_label):
        """Compute the loss."""
        probs = layer.seq_concat(a=start_prob, b=end_prob)
        labels = layer.seq_concat(a=start_label, b=end_label)
        log_probs = layer.mixed(size=probs.size, act=Act.Log(), bias_attr=False,
                                input=paddle.layer.identity_projection(probs))
        neg_log_probs = layer.slope_intercept(input=log_probs, slope=-1, intercept=0)
        loss = paddle.layer.mixed(size=1,
                                  input=paddle.layer.dotmul_operator(a=neg_log_probs, b=labels))
        sum_val = paddle.layer.pooling(input=loss, pooling_type=paddle.pooling.Sum())
        cost = paddle.layer.sum_cost(input=sum_val)
        return cost

    def forward(self):
        """Forward pass."""
        #
        # data layers
        #
        ps = []
        for i in range(5):
            ps.append(paddle.layer.data(name='p' + str(i),
                                        type=paddle.data_type.integer_value_sequence(self.emb_size)))
        q = paddle.layer.data(name='q',
                              type=paddle.data_type.integer_value_sequence(self.emb_size))
        start_labels = paddle.layer.data(name='s',
                                         type=paddle.data_type.dense_vector_sequence(1))
        end_labels = paddle.layer.data(name='e',
                                       type=paddle.data_type.dense_vector_sequence(1))
        #
        # network
        #
        q_emb = self.emb(q)
        q_enc = self.enc(q_emb, prefix='q', nlayer=2)
        q_att_i = self.param_att(q_enc, prefix='q')
        fs = []
        for i, p in enumerate(ps):
            p_emb = self.emb(p)
            p_enc = self.enc(p_emb, prefix='p', nlayer=2)
            p_att = self.att(q_enc, p_enc)
            q_att = layer.expand(q_att_i, p_enc)
            context = layer.concat(input=[p_enc, p_att, q_att])
            f = self.enc(context, prefix='m', nlayer=1)
            fs.append(f)
        f = reduce(lambda x, y: layer.seq_concat(a=x, b=y), fs)
        start_probs = self.fc(f, size=1, prefix='s', act=Act.SequenceSoftmax())
        end_probs = self.fc(f, size=1, prefix='e', act=Act.SequenceSoftmax())
        cost = self.loss(start_probs, end_probs, start_labels, end_labels)
        return start_probs, end_probs, cost

%%writefile rc_model.py
# -*- coding:utf8 -*-
import gzip
import json
import logging
import os
import numpy as np
import paddle.v2 as paddle
import paddle.v2.optimizer as opt
from pathlib import Path
from model import Model
from utils import compute_bleu_rouge
from utils import normalize
from xavier_initialization import xavier_init_params


class RCModel(object):
    def __init__(self, args, vocab):
        # logging
        self.logger = logging.getLogger("brc")
        # basic config
        self.hidden_size = args.hidden_size
        self.optim_type = args.optim
        self.learning_rate = args.learning_rate
        self.weight_decay = args.weight_decay
        # length limits
        self.max_p_num = args.max_p_num
        self.max_p_len = args.max_p_len
        self.max_q_len = args.max_q_len
        self.max_a_len = args.max_a_len
        # global variables (decide whether to save the model!)
        self.max_score = 0
        self.max_bleu_score = 0
        self.max_rouge_score = 0
        # paddle
        paddle.init(use_gpu=args.gpu, trainer_count=1)
        self.model = Model(vocab.embeddings.shape[0], args.embed_size, args.hidden_size)
        self.start_probs, self.end_probs, self.cost = self.model.forward()
        self.parameters = paddle.parameters.create(self.cost)
        # initialize parameters
        xavier_init_params(self.parameters, exclude=['emb'])
        self.parameters.set('emb', vocab.embeddings)
        # log parameter info
        for p in self.parameters.names():
            shape = self.parameters.get_shape(p)
            self.logger.info('parameters {}, shape {}'.format(p, shape))
        num_params = sum([np.prod(self.parameters.get_shape(_))
                          for _ in self.parameters.names()])
        self.logger.info('Num parameters: {}, ie. {:.2f} M'.format(
            num_params, num_params / 1024. / 1024.))

    def train(self, data, args, evaluate=True):
        """Train the model."""
        def event_handler(event):
            log_interval = 20
            if isinstance(event, paddle.event.EndIteration):
                if event.batch_id % log_interval == log_interval - 1:
                    self.logger.info('[{} / {}] Average loss from batch {} is {}'.format(
                        event.batch_id / log_interval + 1,
                        len(data.train_set) / log_interval / args.train_batch_size,
                        event.batch_id + 1, event.cost))
            if isinstance(event, paddle.event.EndPass):
                if evaluate:
                    self.logger.info('Evaluating the model after epoch {}'.format(event.pass_id + 1))
                    result = self.evaluate(data.get_valid, data.feeding)
                    self.logger.info('Dev eval result: {}'.format(result))
                    bleu_score = result['Bleu-4']
                    if bleu_score > self.max_bleu_score:
                        self.max_bleu_score = bleu_score
                        self.save(args.model_dir, args.model_name + '_best_bleu_score'
                                  + '_epoch{}'.format(event.pass_id + 1))
                    rouge_score = result['Rouge-L']
                    if rouge_score > self.max_rouge_score:
                        self.max_rouge_score = rouge_score
                        self.save(args.model_dir, args.model_name + '_best_rouge_score'
                                  + '_epoch{}'.format(event.pass_id + 1))
                    score = (bleu_score + rouge_score) / 2.
                    if score > self.max_score:
                        self.max_score = score
                        self.save(args.model_dir,
                                  args.model_name + '_epoch{}'.format(event.pass_id + 1))
                else:
                    # default model, so it has no suffix
                    self.save(args.model_dir, args.model_name)

        optimizer = paddle.optimizer.Adam(
            learning_rate=args.learning_rate,
            regularization=opt.L2Regularization(rate=args.weight_decay),
            gradient_clipping_threshold=args.gradient_clipping_threshold,
            learning_rate_decay_a=args.lr_decay,
            learning_rate_decay_b=len(data.train_set),
            learning_rate_schedule="discexp")
        trainer = paddle.trainer.SGD(cost=self.cost,
                                     parameters=self.parameters,
                                     update_equation=optimizer)
        trainer.train(reader=data.get_train, feeding=data.feeding,
                      event_handler=event_handler, num_passes=args.epochs)

    def evaluate(self, reader, feeding, result_dir=None, result_prefix=None):
        """Run inference; if a result path is given, save the results locally."""
        pred_answers, ref_answers = [], []
        self.inferer = paddle.inference.Inference(
            output_layer=(self.start_probs, self.end_probs),
            parameters=self.parameters)
        count = 0
        for batch in reader():
            if count and count % 50 == 0:
                self.logger.info('have processed {} batches'.format(count))
            count += 1
            y_pred = self.inferer.infer(input=batch, flatten_result=False, feeding=feeding)
            start_probs = y_pred[0]
            end_probs = y_pred[1]
            offset = 0
            for sample in batch:
                meta = sample[8]  # must specify !!!
                passage_tokens = meta['passage_list_tokens']
                num_tokens = sum([len(_) for _ in passage_tokens])
                best_answer = self._find_best_answer(
                    passage_tokens,
                    start_probs[offset: offset + num_tokens],
                    end_probs[offset: offset + num_tokens])
                offset += num_tokens
                pred_answers.append({'question_id': meta['question_id'],
                                     'question_type': 'DESCRIPTION',
                                     'answers': [best_answer],
                                     'entity_answers': [[]],
                                     'yesno_answers': []})
                if 'answers' in meta:
                    ref_answers.append({'question_id': meta['question_id'],
                                        'question_type': 'DESCRIPTION',
                                        'answers': meta['answers'],
                                        'entity_answers': [[]],
                                        'yesno_answers': []})
        if result_dir is not None and result_prefix is not None:
            result_file = os.path.join(result_dir, result_prefix + '.json')
            with open(result_file, 'w') as f:
                for pred_answer in pred_answers:
                    f.write(json.dumps(pred_answer, ensure_ascii=False) + '\n')
            self.logger.info('Saving {} results to {}'.format(result_prefix, result_file))
        if len(ref_answers) > 0:
            pred_dict, ref_dict = {}, {}
            for pred, ref in zip(pred_answers, ref_answers):
                question_id = ref['question_id']
                if len(ref['answers']) > 0:
                    pred_dict[question_id] = normalize(pred['answers'])
                    ref_dict[question_id] = normalize(ref['answers'])
            bleu_rouge = compute_bleu_rouge(pred_dict, ref_dict)
        else:
            bleu_rouge = None
        return bleu_rouge

    def _find_best_answer(self, passage_tokens, start_probs, end_probs):
        """Find the best answer across all passages."""
        best_p_idx, best_span, best_score = None, None, 0
        offset = 0
        for p_idx, passage in enumerate(passage_tokens):
            if p_idx >= self.max_p_num:
                continue
            if offset == len(start_probs):
                break
            answer_span, score = self._find_best_answer_for_passage(
                start_probs[offset: offset + len(passage)],
                end_probs[offset: offset + len(passage)]
            )
            offset += len(passage)
            if score > best_score:
                best_score = score
                best_p_idx = p_idx
                best_span = answer_span
        if best_p_idx is None or best_span is None:
            best_answer = ''
        else:
            best_answer = ''.join(
                passage_tokens[best_p_idx][best_span[0]: best_span[1] + 1])
        return best_answer

    def _find_best_answer_for_passage(self, start_probs, end_probs):
        """Find the best answer span within a single passage."""
        size = len(start_probs)
        assert size > 0
        a = start_probs.repeat(size).reshape(size, size)
        b = end_probs.repeat(size).reshape(size, size).T
        # Additive mask: valid spans (end >= start, length <= max_a_len) score
        # 1 + prob while invalid entries stay below 1, so argmax picks the best
        # valid span; the uniform +1 offset does not change the ranking.
        x = np.triu(a * b) + (1 - np.triu(np.ones((size, size)), self.max_a_len))
        best_start, best_end = np.unravel_index(x.argmax(), x.shape)
        max_prob = x.max()
        return (best_start, best_end), max_prob

    def save(self, model_dir, model_prefix):
        """Save the model."""
        with gzip.open(str(Path(model_dir, model_prefix + '.tar')), 'w+') as f:
            self.parameters.to_tar(f)
        self.logger.info('Model saved in {}, with prefix {}.'.format(model_dir, model_prefix))

    def restore(self, model_dir, model_prefix):
        """Restore the model."""
        self.parameters = paddle.parameters.Parameters.from_tar(
            gzip.open(str(Path(model_dir, model_prefix + '.tar')), 'r'))
        self.logger.info('Model restored from {}, with prefix {}'.format(model_dir, model_prefix))

Due to the length limit, the subsequent model training, testing, and submission sections are omitted from this post. The original code is very detailed; click the original link to view the full project:

Original post: Machine Reading Comprehension with PaddlePaddle (DuReader dataset)

