Vivek Narayanan's blog

“The sparsely gated mixture of experts layer” is an interesting paper by Geoffrey Hinton and others at Google.

When you have a large number of training examples and you want to extract the most out of your data, one way of doing it would be increasing the capacity of the machine learning models. But this also means that for small incremental gains in accuracy, you have to increase the number of parameters by a few orders of magnitude. This is computationally very expensive, if the entire network is activated for each example. But only certain parts of the network might be needed for different types of inputs. This could be done deterministically by training multiple models and selecting a particular model based on certain heuristics.

But it might be hard to come up with such heuristics and decide what to include in each model. This paper introduces the concept of a trainable gating layer. The model is a collection of experts, each individual expert can be a feedforward neural network, the gating layer selects a small number of experts for each input.

$y = \sum_{i=1}^{n} G(x)_i E_i(x)$ $G(x) = Softmax(PreserveTopK(W_gx))$

The gating function is a modified softmax layer, which sets the values not in the top $k$ to $-\infty$ . It can be trained jointly with the expert networks using gradient descent. This sparsity ensures that both training and prediction can be really fast as the zeroed out networks don’t need to be evaluated or backpropagated through. The final prediction is a sparse linear combination of the experts.

There are some issues with this approach though, during training the weights on some of the experts can get larger and due to reinforcement only these experts would be trained. The authors suggest two ways to solve this.

Adding Gaussian noise to the outputs before applying the softmax.
Introduce a loss penalty for unbalanced distribution of examples to experts, penalizing high variance in the distribution of examples.

$\text{Importance}_i = \sum_{x \in X} G_i(x)$ $\text{loss}_{\text{imp}} = w_i CV(\text{Importance}_i)$

The coefficient of variation ( $CV$ ) is the ratio of standard deviation to the mean of a random variable.

If there are hundreds of experts and mini-batch training is used, each expert network will only receive a small number of examples per batch, which would be computationally inefficient. So, the batch size should be as large as possible. But for a large batch size, all the activations and weight updates should be stored in memory, which is also infeasible for a large number of neural networks. They suggest using a very large batch size for the gating network and then sharding the individual models across multiple machines and training with model parallelism. This way individual networks will still have reasonable batch sizes and it also helps with memory usage during evaluation time.

This is an interesting way of increasing the capacity of the model while keeping computational costs reasonable. Different experts can act on different properties of the data and all of them need not be active simultaneously.

A common theme with several machine learning tasks that involve sequences is that you need to generate a sequence one term at a time. Examples are any kind of text generation, image captioning, time series predictions, machine translation, or a task where you have to make a sequence of decisions and you get reward signals only once in a while. Ideally we would want to maximize a score that takes the entire sequence into account. But it is hard to backpropagate these scores, as they usually tend to be non differentiable. So we have to resort to something like using a cross entropy loss for each term in the sequence. This isn’t great because it allows for errors to accumulate at each step during prediction time, as we are using the previous prediction as an input.

One way of approaching this problem is to use the non differentiable rewards/scores that take into account the whole sequence by applying the REINFORCE algorithm. For this we need to treat the neural network as a reinforcement learning agent. For example, the hidden state of an RNN can be the state of the agent and an action can be predicting the next term in the sequence. A score associated with generating a sequence according to the goal of the task is a reward.

The REINFORCE algorithm

The setting is as follows: we have an agent that can take actions from a set $A$. Let $a_t$ denote the action it takes at time $t$. It samples an action according to a probability distribution $p_\theta(a_t | a_{1:t-1})$. It might get a reward $r(a_{1:t})$ at each time step. And the episode terminates at some variable length $T$, if the reward takes the whole sequence into account, $r(a_{1:T})$ might be the only reward in the episode. So, the goal here is to maximize the expected future reward $J(\theta)$. Let $\mathcal{A}$ be the set of all sequences of actions that terminate.

$\begin{align} \text{Let, } R(a_{1:T}) &= \sum_{t=1}^{T} r(a_{1:T})\\ J(\theta) &= \sum_{a_{1:T} \in \mathcal{A}} p_\theta(a_{1:T}) R(a_{1:T}) \end{align}$

Since we want to maximize $J(\theta)$, we are interested in its gradient, which can be written as an expectation.

$\begin{align} \frac{\partial J(\theta)}{ \partial \theta} &= \sum_{a_{1:T} \in \mathcal{A}} p_\theta(a_{1:T}) \frac{1}{p_\theta(a_{1:T})}\frac{\partial p_\theta(a_{1:T}) }{ \partial \theta}R(a_{1:T}) \\ &= \sum_{a_{1:T} \in \mathcal{A}} p_\theta(a_{1:T}) \frac{\partial \log p_\theta(a_{1:T}) }{ \partial \theta}R(a_{1:T}) \\ &= \mathbb{E}_{a_{1:T} \sim p_{\theta}} \frac{\partial \log p_\theta(a_{1:T}) }{ \partial \theta}R(a_{1:T}) \\ &= \mathbb{E}_{a_1 \sim p_{\theta}(a_1)} \mathbb{E}_{a_2 \sim p_{\theta}(a_2|a_1)} ... \mathbb{E}_{a_n \sim p_{\theta}(a_T | a_{1:T-1})} \sum_{t=1}^{T} \frac{\partial \log p_\theta(a_t | a_{1:t-1}) }{ \partial \theta} R(a_{1:T}) \\ \end{align}$

We can approximate this gradient computation as follows: start by sampling an action $a_1$ from the policy and observe the rewards. Continue sampling $a_{t} | a_{1:t-1}$ according to the policy and finally backpropagate the sum of rewards to the network. We can have the network output the log probabilities $\log p_\theta(a_t | a_{1:t-1})$ at each time step for each action, which can be used to sample the next action. And since we already know how to compute the gradients for the network, we can use this model to train a neural net which to optimize for a score function that is not necessarily differentiable.

But this algorithm can be slow to converge for very large action spaces like text generation, so it is better to pretrain the model with a cross entropy loss and then slowly introduce the REINFORCE step with an annealing schedule. That is, you alternate between REINFORCE and cross entropy training, slowly increasing the proportion of REINFORCE iterations.

References

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992
Zaremba, Wojciech, and Ilya Sutskever. “Reinforcement Learning Neural Turing Machines-Revised.” arXiv preprint arXiv:1505.00521 (2015).

Vivek Narayanan's blog

Sparsely Gated Mixture of Experts - Paper Notes

Using Reinforcement Learning to Train Neural Networks.

The REINFORCE algorithm

References