Multi-Label Classification
From the column: Natural Language Processing Paper Reading Notes
Title (2018):
SGM: Sequence Generation Model for Multi-Label Classification
Abstract:
- Existing methods tend to ignore the correlations between labels.
- Besides, different parts of the text can contribute differently to predicting different labels.
- The paper proposes to view the multi-label classification (MLC) task as a sequence generation problem.
Introduction:
Binary relevance (BR) transforms the MLC task into multiple single-label classification problems.
This paper is inspired by the tremendous success of the sequence-to-sequence (Seq2Seq) model.
Contributions:
- The decoder uses an LSTM to generate labels sequentially, and predicts the next label based on its previously predicted labels.
- Furthermore, the attention mechanism considers the contributions of different parts of the text when the model predicts different labels.
- In addition, a novel decoder structure with global embedding is proposed.
Model:
Overview:
Given a label space L = {l_1, l_2, ..., l_L} with L labels and a text sequence x containing m words, the task is to assign to x a subset y containing n labels from the label space.
From the sequence generation perspective, this amounts to finding an optimal label sequence y* that maximizes the conditional probability p(y|x) = ∏_{i=1}^{n} p(y_i | y_1, y_2, ..., y_{i-1}, x).
First, we sort the label sequence of each sample according to the frequency of the labels in the training set.
In addition, the bos and eos symbols are added to the head and tail of the label sequence.
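A minimal sketch of this preprocessing step, assuming the training labels are available as a list of label sets; the helper names and the bos/eos strings below are illustrative, not from the paper's code:

```python
# Sort each sample's labels by training-set frequency and add bos/eos symbols.
from collections import Counter

BOS, EOS = "<bos>", "<eos>"

def build_label_frequency(train_label_sets):
    """Count how often each label appears in the training set."""
    freq = Counter()
    for labels in train_label_sets:
        freq.update(labels)
    return freq

def to_label_sequence(label_set, freq):
    """Order a sample's labels by descending frequency and wrap with bos/eos."""
    ordered = sorted(label_set, key=lambda l: -freq[l])
    return [BOS] + ordered + [EOS]

# Example usage (toy data)
train_label_sets = [{"sports", "news"}, {"news"}, {"news", "politics"}]
freq = build_label_frequency(train_label_sets)
print(to_label_sequence({"politics", "news"}, freq))  # ['<bos>', 'news', 'politics', '<eos>']
```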
Sequence Generation:
The whole sequence generation model consists of an encoder and a decoder with the attention mechanism.
Encoder:
Let (w_1, w_2, ..., w_m) be a sentence with m words, where w_i is the one-hot representation of the i-th word.
We first embed w_i into a dense embedding vector x_i.
A bidirectional LSTM reads the text sequence from both directions, and we obtain the final hidden representation h_i of the i-th word by concatenating the hidden states from the two directions.
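A minimal PyTorch sketch of such a bidirectional LSTM encoder; the layer sizes and class name are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # w_i -> dense vector x_i
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids):             # word_ids: (batch, m) token indices
        x = self.embed(word_ids)             # (batch, m, embed_dim)
        h, _ = self.bilstm(x)                # (batch, m, 2 * hidden_dim)
        # h[:, i] is the concatenation of the forward and backward states of word i
        return h
```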
Attention:
The attention mechanism produces a context vector by focusing on different portions of the text sequence and aggregating the hidden representations of those informative words.
The final context vector c_t passed to the decoder at time-step t is calculated as follows: e_{ti} = v_a^T tanh(W_a s_t + U_a h_i), α_{ti} = exp(e_{ti}) / Σ_{j=1}^{m} exp(e_{tj}), and c_t = Σ_{i=1}^{m} α_{ti} h_i, where s_t is the decoder hidden state at time-step t.
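A minimal PyTorch sketch of this additive attention step; the module name and the weight names (W_a, U_a, v_a) follow the notation above and are assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=256):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, attn_dim, bias=False)
        self.U_a = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_t, h):   # s_t: (batch, dec_dim), h: (batch, m, enc_dim)
        # e_{ti} = v_a^T tanh(W_a s_t + U_a h_i)
        scores = self.v_a(torch.tanh(self.W_a(s_t).unsqueeze(1) + self.U_a(h)))  # (batch, m, 1)
        alpha = torch.softmax(scores, dim=1)   # attention weights over the m words
        c_t = (alpha * h).sum(dim=1)           # weighted sum of the encoder states
        return c_t, alpha.squeeze(-1)
```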
Decoder:
The hidden state s_t of the decoder at time-step t is computed as s_t = LSTM(s_{t-1}, [g(y_{t-1}); c_{t-1}]), where [g(y_{t-1}); c_{t-1}] denotes the concatenation of g(y_{t-1}) and the context vector c_{t-1}.
g(y_{t-1}) is the embedding of the label which has the highest probability under the distribution y_{t-1} predicted at the previous time-step.
y_t is the probability distribution over the label space at time-step t, computed as o_t = W_o f(W_d s_t + V_d c_t) and y_t = softmax(o_t + I_t).
I_t is the mask vector used to prevent the decoder from predicting repeated labels: its i-th entry is -∞ if label l_i has already been predicted and 0 otherwise. f is a nonlinear activation function.
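A minimal PyTorch sketch of one decoder step under the formulas above; the use of LSTMCell, the layer names, and the boolean-mask representation of I_t are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, num_labels, label_embed_dim, ctx_dim, hidden_dim):
        super().__init__()
        self.cell = nn.LSTMCell(label_embed_dim + ctx_dim, hidden_dim)
        self.W_d = nn.Linear(hidden_dim, hidden_dim)
        self.V_d = nn.Linear(ctx_dim, hidden_dim)
        self.W_o = nn.Linear(hidden_dim, num_labels)

    def forward(self, g_prev, c_prev, c_t, state, predicted_mask):
        # g_prev: embedding of the previous label; c_prev, c_t: context vectors;
        # state: (s_{t-1}, cell_{t-1}); predicted_mask: bool tensor of already-predicted labels.
        s_t, cell_t = self.cell(torch.cat([g_prev, c_prev], dim=-1), state)
        # o_t = W_o f(W_d s_t + V_d c_t), with tanh assumed as the nonlinearity f
        o_t = self.W_o(torch.tanh(self.W_d(s_t) + self.V_d(c_t)))
        # Adding I_t (entries of -inf for predicted labels) is emulated with masked_fill
        o_t = o_t.masked_fill(predicted_mask, float('-inf'))
        y_t = torch.softmax(o_t, dim=-1)       # probability distribution over labels
        return y_t, (s_t, cell_t)
```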
At the training stage, the loss function is the cross-entropy loss function.
We employ the beam search algorithm to find the top-ranked prediction path at inference time.
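A simplified, generic beam-search sketch; the step_logprobs callback (mapping a label prefix to log-probabilities over the label space) is an assumed interface standing in for the decoder, not the paper's implementation:

```python
def beam_search(step_logprobs, num_labels, eos_id, beam_size=5, max_len=10):
    """Keep the beam_size highest-scoring label prefixes at each step."""
    beams = [([], 0.0)]                      # (label prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            logp = step_logprobs(prefix)     # sequence of length num_labels
            for label in range(num_labels):
                candidates.append((prefix + [label], score + logp[label]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:                        # every surviving path has emitted eos
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0] if finished else []
```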
Global Embedding:
The proposed sequence generation model generates labels sequentially and predicts the next label conditioned on its previously predicted labels.
Therefore, it is likely that we would get a succession of wrong label predictions: if the prediction is wrong at time-step t, the erroneous label is fed back into the decoder and misleads the following time-steps. This problem is also called exposure bias.
To a certain extent, the beam search algorithm alleviates this problem.
The exposure bias problem ought to be relieved by considering all informative signals contained in the predicted distribution y_{t-1}, rather than only its most probable label.
Motivated by this, we propose a new decoder structure, where the embedding vector g(y_{t-1}) at time-step t is capable of representing the overall information of the (t-1)-th time step.
Let e denote the embedding of the label with the highest probability under y_{t-1}, and let ē = Σ_{i=1}^{L} y_{t-1}^{(i)} e_i be the weighted average embedding of all labels at time-step t, where y_{t-1}^{(i)} is the i-th element of y_{t-1} and e_i is the embedding of the i-th label.
The proposed global embedding is g(y_{t-1}) = (1 - H) ⊙ e + H ⊙ ē, where H = W_1 e + W_2 ē is the transform gate controlling the proportion of the weighted average embedding (W_1 and W_2 are weight matrices).
In other words, the global embedding is the optimized combination of the original embedding e and the weighted average embedding ē obtained through the transform gate H.
By considering the probability of every label, the model is capable of reducing damage caused by mis-predictions made in the previous time steps.
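A minimal PyTorch sketch of the global embedding computation described above; the module and parameter names are assumptions, and H is kept in the linear form given in the notes:

```python
import torch
import torch.nn as nn

class GlobalEmbedding(nn.Module):
    def __init__(self, num_labels, embed_dim):
        super().__init__()
        self.label_embed = nn.Embedding(num_labels, embed_dim)
        self.W_1 = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_2 = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, y_prev):                        # y_prev: (batch, num_labels) distribution
        e = self.label_embed(y_prev.argmax(dim=-1))   # embedding of the most probable label
        e_bar = y_prev @ self.label_embed.weight      # weighted average embedding over all labels
        H = self.W_1(e) + self.W_2(e_bar)             # transform gate, linear as in the notes
        return (1 - H) * e + H * e_bar                # g(y_{t-1}) = (1 - H) ⊙ e + H ⊙ ē
```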
Experiments: