TF-dropout — dropout for sequences

Piotr Niewiński
5 min read · Jan 25, 2022

In this post I will describe quite a simple concept: TF-dropout (Time-Features dropout). It is a dropout layer design in which you pay more attention to the shape of the dropout mask.

By mask, I mean a vector of zeros and ones that decides which values of a given input should be ‘turned off’ (masked) by the dropout layer. TF-dropout is intended for higher-dimensional inputs such as sequences of embeddings, so I address this solution mainly to the NLP area and its tasks.

Sequence Representation in Neural Networks

The two most common scenarios when working with neural networks and NLP sequences are sequence tagging, where you use a per-token classifier, and sequence classification, where a single classifier handles the entire sequence.

It does not matter whether you have a sequence of words, subwords, or characters: in all cases, each token is usually represented by a dense vector such as an embedding. Your network or layer takes a two-dimensional tensor as input, or a three-dimensional one if you have batches of data in mind.

B is the batch size (number of sequences),
S is the sequence length (number of embeddings per sequence),
F is the token width (number of features in a single embedding vector).
You may also look at the batch as a 3D tensor of floats with shape (B, S, F).
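
To make the shapes concrete, here is a tiny PyTorch snippet (the framework choice and the numbers are only illustrative):

```python
import torch

# A toy batch of token embeddings: B sequences, S tokens each, F features per token.
B, S, F = 32, 128, 300
batch = torch.randn(B, S, F)   # random values, purely illustrative
print(batch.shape)             # torch.Size([32, 128, 300])
```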

A neural network for sequences may be built in many different ways, using various architectural concepts such as dense layers, convolutions, RNN layers (e.g. LSTM), or Transformer blocks. What they all have in common is that it would be hard to build and train such networks without a dropout layer. You will usually find dropout at the network’s input, at its output, or between layers.

Dropout

Let’s look a bit closer at some facts about dropout. Dropout on an input vector is defined by a single dropout probability p.

It means that you get a dropped input vector in which every single value is masked with a zero with the given probability p, while the remaining values are simply copied*. On average, a fraction 1-p of all values is copied and a fraction p is set to 0. The masking probability is equal for all values.

Each vector from the batch is masked with the same probability, but independently of the others, so each vector in a batch may be multiplied by a different mask.

*The notion that the remaining values are only being copied is not 100% accurate. In fact, each masked vector is multiplied by a scalar 1/(1-p) to keep the expected scale of the vector. If you are interested in the details, you may find them in the original paper by Hinton et al.
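
To see both behaviours at once, here is a short PyTorch illustration (a sketch, not code from the original paper): values are zeroed with probability p, the survivors are rescaled by 1/(1-p), and at inference time the layer does nothing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.5
dropout = nn.Dropout(p)        # a freshly built module is in training mode

x = torch.ones(8)
print(dropout(x))              # roughly half zeros, the rest scaled to 1 / (1 - p) = 2.0

dropout.eval()
print(dropout(x))              # at inference time dropout is a no-op: all ones
```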

TF-dropout

While working with many different sequential neural networks in the NLP area, I have observed that there is a dropout architecture that usually works a bit better than the one presented above. I call it TF-dropout. Its name comes from time and features, and it is defined by two probabilities: pt and pf.

To clarify, pt is the probability of masking along the time axis (-2), and pf is the probability of masking along the feature axis (-1). Importantly, the masking patterns for both axes are constant across the entire batch.

Below you can see a piece of code that implements TF-dropout using normal dropout:
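
It is a minimal PyTorch sketch of the idea: two ordinary dropout layers are applied to broadcastable masks of ones and the input is multiplied by both masks; the class name TFDropout and the argument names p_t and p_f are only illustrative.

```python
import torch
import torch.nn as nn


class TFDropout(nn.Module):
    """Time-Features dropout for inputs of shape (B, S, F)."""

    def __init__(self, p_t: float = 0.1, p_f: float = 0.1):
        super().__init__()
        # Reuse normal dropout layers; they also take care of the 1/(1-p) scaling
        # and become no-ops in eval mode.
        self.time_dropout = nn.Dropout(p_t)
        self.feat_dropout = nn.Dropout(p_f)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, seq_len, n_feats = x.shape
        # One mask value per time step, shared across the batch and all features.
        time_mask = self.time_dropout(torch.ones(1, seq_len, 1, device=x.device, dtype=x.dtype))
        # One mask value per feature, shared across the batch and all time steps.
        feat_mask = self.feat_dropout(torch.ones(1, 1, n_feats, device=x.device, dtype=x.dtype))
        return x * time_mask * feat_mask
```

In other words, zeroing a single entry of time_mask drops the entire token vector at that position in every sequence of the batch, and zeroing a single entry of feat_mask drops the same feature in every token of every sequence.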

There is some intuition behind this concept: given those two independent probabilities, you can decide whether to put more dropout on ‘time-dependent’ information or on dependencies between features.

Let’s look at two examples. First, assume that pt = 0.5 and pf = 0.0. In each sequence of every batch, half of the token vectors will be fully masked. Now assume the opposite scenario: for pt = 0.0 and pf = 0.5, half of the features will be completely masked in each vector of each sequence of the entire batch. The selection of masked vectors or features is fixed within the batch: in both scenarios, the same vector indexes or feature indexes are masked everywhere.
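
These two extreme settings are easy to visualise with the same masks used in the sketch above (the shapes and the seed are purely illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.ones(2, 4, 6)                        # a tiny batch: B=2, S=4, F=6

# pt = 0.5, pf = 0.0: whole time steps are dropped, the same ones in every sequence.
time_mask = nn.Dropout(0.5)(torch.ones(1, 4, 1))
print(x * time_mask)                           # some rows are all zeros, the rest scaled to 2.0

# pt = 0.0, pf = 0.5: whole features are dropped, the same ones in every token.
feat_mask = nn.Dropout(0.5)(torch.ones(1, 1, 6))
print(x * feat_mask)                           # some columns are all zeros, the rest scaled to 2.0
```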

In real scenarios, both probabilities will be higher than zero, and I advise you to run a hyperparameter search to find the best values for both. In my experiments, the two values always differ, sometimes significantly (which may actually be a sign that the concept is useful for the model). I haven’t found a model that gives better results with normal dropout; TF-dropout has always improved performance. In my experiments, I always put TF-dropout at the very beginning of the neural graph, right at the input. I believe TF-dropout may also be used anywhere between layers, but I haven’t tested this approach.

If you are interested in testing the TF-dropout layer, please feel free to copy or edit the code and use it in your own task. You may also contact me in case you have any questions.
