Lecture 7: Introduction to Natural Language Processing (Part 2)

Language Model

Model the probability of a sentence

Using Chain Rule of probabilities
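As a reminder of what the chain rule gives here, assuming the sentence is written as a word sequence $w_1, \dots, w_n$ (this notation is an assumption, not taken from the lecture), the joint probability factorizes as:

```latex
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
```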

Bi-gram Approximation:
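In the usual formulation, with words written as $w_1, \dots, w_n$ and $w_0$ standing for a start-of-sentence symbol (both are notational assumptions), each word is conditioned only on the word immediately before it:

```latex
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-1})
\quad\Longrightarrow\quad
P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```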

N-gram conditional probabilities can be estimated from raw co-occurrence counts in the observed sequence.
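A minimal sketch of count-based estimation for the bigram case, using the maximum-likelihood estimate P(b | a) = count(a, b) / count(a). The function name `bigram_probs` and the toy sentence are illustrative assumptions, and no smoothing is applied.

```python
from collections import Counter


def bigram_probs(tokens):
    """Estimate P(next | prev) from raw counts in a token sequence."""
    # Count each token as a context (the last token has no successor).
    context_counts = Counter(tokens[:-1])
    # Count each adjacent pair (prev, next).
    pair_counts = Counter(zip(tokens[:-1], tokens[1:]))
    return {
        (prev, nxt): n / context_counts[prev]
        for (prev, nxt), n in pair_counts.items()
    }


probs = bigram_probs("the cat sat on the mat".split())
print(probs[("the", "cat")])  # "the" is followed once by "cat", once by "mat"
```

Unseen pairs get no entry at all, which is why practical language models add smoothing or back-off on top of raw counts.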

💡

The same function and the same set of parameters are used at every time step.
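Parameter sharing across time steps can be sketched with a vanilla recurrent step. This assumes the common tanh formulation; the names `rnn_step`, `W_xh`, `W_hh`, and `b_h` are illustrative and do not come from the lecture.

```python
import numpy as np


def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One recurrent update: new hidden state from input and old state."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)


def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Apply the same step function with the same weights at every time step."""
    h = h0
    hidden_states = []
    for x in xs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)  # identical parameters each step
        hidden_states.append(h)
    return hidden_states


rng = np.random.default_rng(0)
D, H, T = 3, 4, 5  # input size, hidden size, sequence length (arbitrary)
xs = [rng.normal(size=D) for _ in range(T)]
hs = rnn_forward(xs, np.zeros(H), rng.normal(size=(H, D)),
                 rng.normal(size=(H, H)), np.zeros(H))
print(len(hs), hs[-1].shape)
```

Because the weights do not depend on the time index, the same model can process sequences of any length.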

Sequence generation strategies

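During training, the model feeds the ground-truth target at each time step as the input to the next time step, instead of the output it generated itself at the previous time step (teacher forcing).

This can speed up model convergence.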

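At each time step, dynamically choose whether to use the ground-truth value or the model's own predicted value (scheduled sampling).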
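The two input strategies above can be sketched as a single routine controlled by one probability. Everything named here (`choose_next_input`, `decoder_inputs`, `teacher_forcing_prob`, the `<s>` start token, and the stand-in `predict_fn`) is a hypothetical illustration, not the lecture's implementation.

```python
import random


def choose_next_input(ground_truth, prediction, teacher_forcing_prob, rng=random):
    """Pick the ground truth with the given probability, else the prediction."""
    use_truth = rng.random() < teacher_forcing_prob
    return ground_truth if use_truth else prediction


def decoder_inputs(targets, predict_fn, teacher_forcing_prob,
                   start_token="<s>", rng=random):
    """Build the input fed to the model at each time step.

    teacher_forcing_prob = 1.0 always feeds the ground truth;
    teacher_forcing_prob = 0.0 always feeds the model's own output;
    values in between mix the two per time step.
    """
    inputs = [start_token]
    for target in targets[:-1]:
        prediction = predict_fn(inputs[-1])  # stand-in for the model's output
        inputs.append(
            choose_next_input(target, prediction, teacher_forcing_prob, rng)
        )
    return inputs


# Toy "model" that just upper-cases its input token.
print(decoder_inputs(["a", "b", "c"], str.upper, teacher_forcing_prob=1.0))
```

In practice the probability is often decayed during training, so the model starts with ground-truth inputs and gradually sees more of its own predictions.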

Run forward and backward through chunks of the sequence instead of the whole sequence.

Carry hidden states forward in time forever, only backpropagate for some smaller number of steps.
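A minimal sketch of this chunked scheme on a scalar toy RNN, h_t = tanh(w * h_{t-1} + u * x_t), with loss 0.5 * h_t^2 summed over time. The model, the loss, and the names `tbptt_grad` and `sequence_loss` are assumptions chosen only to keep the example small; the gradient is derived by hand rather than with an autograd library.

```python
import math


def sequence_loss(xs, w, u):
    """Sum over time of 0.5 * h_t**2 for the scalar toy RNN."""
    h, total = 0.0, 0.0
    for x in xs:
        h = math.tanh(w * h + u * x)
        total += 0.5 * h * h
    return total


def tbptt_grad(xs, w, u, chunk_len):
    """Gradient of the loss w.r.t. w, backpropagating only within each chunk."""
    h = 0.0
    grad_w = 0.0
    for start in range(0, len(xs), chunk_len):
        chunk = xs[start:start + chunk_len]
        # Forward through the chunk, starting from the carried hidden state.
        hs = [h]
        for x in chunk:
            hs.append(math.tanh(w * hs[-1] + u * x))
        # Backward through this chunk only; hs[0] is treated as a constant.
        dh = 0.0
        for t in range(len(chunk), 0, -1):
            dh += hs[t]                      # local loss term d(0.5*h^2)/dh
            da = dh * (1.0 - hs[t] ** 2)     # back through tanh
            grad_w += da * hs[t - 1]
            dh = da * w                      # pass gradient to h_{t-1}
        h = hs[-1]  # carry the hidden state forward, but not its gradient
    return grad_w, h


xs = [0.5, -1.0, 0.3, 0.8, -0.2]
print(tbptt_grad(xs, 0.7, 0.4, chunk_len=2))
print(tbptt_grad(xs, 0.7, 0.4, chunk_len=len(xs)))  # full backpropagation
```

The final hidden state is identical for any chunk length, since the forward pass is unchanged; only the gradient is approximated, because dependencies that cross a chunk boundary are dropped.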

Exploding Gradients → Gradient Clipping

Vanishing Gradients → Change RNN architecture
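The clipping remedy for exploding gradients can be sketched as rescaling by the gradient norm. This is one common variant (clipping by global L2 norm); the function name `clip_by_norm` and the threshold are illustrative assumptions.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]  # same direction, smaller length
    return list(grads)                      # small gradients pass unchanged


print(clip_by_norm([3.0, 4.0], max_norm=1.0))   # norm 5 is scaled down to 1
print(clip_by_norm([0.3, 0.4], max_norm=1.0))   # norm 0.5 is left as is
```

Clipping keeps the update direction and only limits its size, which is why it helps with exploding gradients but does nothing for vanishing ones.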

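An LSTM unit consists of four main parts:

Each element of the forget gate gives the fraction of the old cell state to keep: a value close to 1 means retain, a value close to 0 means forget.

Forget old information and add new information.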

💡

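The cell is a long-term memory unit that stores the long-term information the model needs to retain. Note that the update here is additive, which effectively mitigates the vanishing-gradient problem.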

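Determines the output hidden state: the cell state is combined with the output gate, so that only selected parts of the cell state are exposed.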
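The four parts can be put together in a single step function. This sketch assumes the standard LSTM formulation with one stacked weight matrix; the names `lstm_step`, `W`, `b` and the gate ordering inside `W` are illustrative choices, not the lecture's notation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, H+D), b has shape (4*H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:H])          # forget gate: fraction of old cell state kept
    i = sigmoid(z[H:2 * H])      # input gate: how much new information to add
    g = np.tanh(z[2 * H:3 * H])  # candidate values for the cell state
    o = sigmoid(z[3 * H:4 * H])  # output gate: what the hidden state exposes
    c = f * c_prev + i * g       # additive cell update: forget old, add new
    h = o * np.tanh(c)           # hidden state read out from the cell
    return h, c


rng = np.random.default_rng(0)
D, H = 3, 4  # input and hidden sizes (arbitrary)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H),
                 rng.normal(size=(4 * H, H + D)), np.zeros(4 * H))
print(h.shape, c.shape)
```

The line computing `c` is the additive update mentioned above: when the forget gate is close to 1, the gradient flows from `c` back to `c_prev` almost unchanged.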