**[Update 2022 Dec. 4]** Added contrastive learning / decoding methods.

## Intro

The task of generating content from deep learning (statistical) models is pretty different from other common machine learning tasks.

The underlying objective is to train a model so it can generate realistic content during inference. For continuous domains, state of the art models are mostly based on adversarial (Generative Adversarial Networks, GANs) and denoising diffusion training objectives. These models are capable to create impressive and highly realistic results in a wide variety of styles. For discrete domains however, generative systems are mostly based on sequential models, that are trained with teacher forcing and used autoregressively during inference (generate elements one after another). This discrepancy between training objective and inference procedure often make results to degenerate. This problem is known as **exposure bias**.

There attempts to apply adversarial training objective to discrete domains are often very specific toward certain tasks, and often lead to complicated training procedure, including mode collapse and gradients with high variance.

We will take a look to all these solutions, which goes beyond teacher forcing and autoregressive inference. We will see how we can build a GAN for discrete data, and the drawbacks it brings. Given these, we will introduce how the adversarial feedback of a discriminator can be used in cooperation with a sequential model to get more realistic results.

## Problem formulation

Let’s formulate the problem before going further.

Symbol | Meaning |
---|---|

$g_\theta$ | Generator network with parameters $\theta$ |

$d_\delta$ | Discriminator network with parameters $\delta$ |

$p_d$ | Data distribution over real samples |

$\mathbf{x} \in \mathbb{R}^L$ | A real sample $\sim p_d(\mathbf{x})$ |

$p_\theta$ | Model distribution over generated samples |

$p_z$ | Data distribution over Gaussian noise / latent space |

$\mathbf{z} \in \mathbb{R}^L$ | Latent variable sampled from the latent space $\sim p_z(\mathbf{z})$ |

$\mathcal{V}$ | The vocabulary set, linking each token (or integer index) to its word / subword / meaning |

$V$ | Size of the vocabulary, with $\lvert \mathcal{V} \lvert = V$ |

The objective of a training model is to generate content that is realistic, thus similar (not identical) to the data it was trained on. This objective can be expressed as $\underset{\theta}{\mathrm{min}}\: \mathrm{KL}(p_\theta \lvert\lvert p_d)$ where we want the model distribution to be the closest possible to the data distribution.

A generative model would ideally be trained (i.e. updated with gradients computed) to reflect this objective.

## The exposure bias

Sequential models are Maximum Likelihood Estimation (MLE) based, and have a training objective expressed as $\mathrm{max}\: \mathbb{E}_{\mathbf{x} \sim p_d(\mathbf{x})} [log \left( g_\theta (\mathbf{x}) \right)]$. Yet, during inference they are used autoregressively to generate content, as $g_{\theta} \left( \mathbf{x} \right) = \prod_{t=1}^{T} g_{\theta} \left( x_t \lvert \mathbf{x}_{<t} \right)$. As the inference step $t$ progresses, the model is increasingly likely to predict incoherent or repetitive tokens which can deviate from the original context. This text degeneration issue is commonly called **exposure bias**.

## Autoregressive decoding methods

Greedy decoding, which consist in picking the token with the highest probability score given by the model, is known to be very prone to repetitions and text degeneration. Stochastic decodings help to reduce these effects. The two most used methods are top-k sampling and nucleus sampling. Here is a very good post which goes into details on these techniques. We will see here more complex decoding methods.

Some NLG tasks require to impose some constraints on the generated sequence. For instance the generation of a recipe from a list of ingredients. On top one could also desire to not predict specific output words / tokens. The commonly used technique is to fine-tune a large pre-trained model on task-specific datasets. Doing so, the model however do not learn to effectively satisfy the constraints and often struggles with new constraints and few or zero-shot tasks. The underlying task is then to find an optimal output sequence which satisfies the constraints while having high likelihoods for each element within. NeuroLogic addresses this issue by converting the constraints into penalties during the decoding process. It uses beam-search to generate multiple sequences and steer them toward predictions with low penalties.

The authors enumerate four states that each constraint can take during the decoding: reversible unsatisfaction, irreversible unsatisfaction, reversible satisfaction and irreversible satisfaction. At each decoding step, the model distribution of each beam ($\in \mathbb{R}^{V}$) are modified to: 1) discard tokens leading to irreversible unsatisfaction; 2) keep only tokens with the top-k likelihoods and top-$\beta$ in term of satisfied constraints; 3) group the remaining tokens by constraint satisfied. Within each group, each token is assigned a score $s = p_\theta(y_t \lvert \mathbf{y}_{<t}) + \lambda \underset{D(c_i, \mathbf{y}_{0:t})}{max} \frac{\hat{c_i}}{c_i}$ where $\lambda$ is an adjustable parameter, $c_i$ is the number of successive tokens the constraint $i$ contains, $\hat{c_i}$ is the number of tokens already predicted of the $i$, $D(c_i, \mathbf{y})$ means the constraint $c_i$ is satisfied by $\mathbf{y}$. The token with the highest score is selected for each beam. The NeuroLogic technique achieved SOTA ROUGE and BLEU results on conditional text generation tasks such as CommonGen.

NeuroLogic A*esque relies on NeuroLogic and adds an A* search heuristic on top to estimate the score of the tokens at each decoding step. The score of the token $v$ at decoding step $t$ is calculated as $f(y_{t,v}) = s(\mathbf{y}_{<t}) + h(y_{t,v})$ where $s(\mathbf{y}_{<t})$ is the decoded sequence so far and $h(y_{t,v})$ is a estimation of the score of the future sequence. In practice, $h(y_{t,v})$ is computed by generating a beam search continuation of $l$ tokens in the future. Each continuation is then evaluated by the Neurologic score formula, which consider user defined constraints.

COLD Decoding also tackles controllable generation by sampling the decoding steps conditioned on a continuous relaxation of the text guided by diffusion denoising steps and constraints.

In the figure above, the soft sequence $\tilde{\mathbf{Y}}$ is initially gaussian noise which is iteratively refined toward the target constrained distribution after $N$ diffusion steps, given an energy function $E(\tilde{\mathbf{Y}}) = \sum_{i}^{} \lambda_i f_i(\tilde{\mathbf{Y}})$.

An appealing advantage of COLD decoding is it can be associated to any sequential model, and allow to dynamically filter the output tokens according to constraints, even high level or structural one which are known to be hard to satisfy.

## GAN and adversarial training

GANs are a *model architecture* that present a training objective very interesting for all generative purposes. We will see how they work, why they work well for continuous modalities such as images, but not no for discrete ones like natural language. Despite this, several attempts have been made and successfully generate coherent text results in certain conditions.

### GAN working

A GAN is made of two networks:

- A generator $g_\theta$ which takes random Gaussian noise $\mathbf{z}$ as input, and produces an output $\mathbf{y}$;
- A discriminator $d_\delta$ which tells if its input is wether fake or real.

The role of the generator is to create content close to a real data distribution $p_d$, while the discriminator must tell apart real samples from fake ones created by the generator.

During training, when the discriminator fails, its parameters $\delta$ are updated. It is trained to maximize it’s result over real data $\mathrm{max}: \mathbb{E}_{\mathbf{x} \sim p_x(\mathbf{x})} [log \left( d_\delta (\mathbf{x}) \right)]$, while minimizing the prediction of generated samples $g_\theta (\mathbf{z})$, $\mathrm{min}: \mathbb{E}_{\mathbf{z} \sim p_z(\mathbf{z})} [log \left( 1 - d_\delta (g_\theta (\mathbf{z})) \right)]$.

When the discriminator succeed, the generator’s parameters $\theta$ are updated. It is trained to generate samples which maximize their result from the discriminator (fools it), hence minimizing $\mathbb{E}_{\mathbf{z} \sim p_z(\mathbf{z})} [log \left( 1 - d_\delta (g_\theta (\mathbf{z})) \right)]$.

This can in turn be written as: $$\underset{\theta}{\mathrm{min}}\: \underset{\delta}{\mathrm{max}}\: \mathcal{L}(g_\theta, d_\delta) = \mathbb{E}_{x \sim p_x(\mathbf{x})} [\mathrm{log}: d_\delta(x)] + \mathbb{E}_{z \sim p_z(\mathbf{z})} [\mathrm{log} (1 - d_\delta(g_\theta(\mathbf{z})))]$$

The generator will progressively produce content closer to $p_d$, and the discriminator will distinguish real from fake data with smaller differences. Both models are jointly trained, improving each other.

You see that the generator is directly trained to minimize the distance between its distribution and the data distribution: $\underset{\theta}{\mathrm{min}}\: \mathrm{KL}(p_\theta \lvert\lvert p_d)$. This allows to generate original content which contains the underlying features of the data distribution **implicitly**.

GANs have largely been applied to image generation, yielding impressive results, but also audio, such as StyleGAN and GANSynth.

### Further improvements

GANs are, however, not except of problems and drawbacks. Their training procedure is less stable than those of counterpart models, and they do not provide an easy direct control over the generation.

#### Wasserstein loss

GANs are known to be hard to train, easily falling into **mode collapse**: during training, as the generator learns to produce samples close to $p_x$ which tricks the discriminator most times, it will likely learn to predict only this specific sample. This is undesirable as we should expect the generator to predict plausible and **various** results.

GANs can also suffer from **vanishing gradients**. If the discriminator begins to converge and to be optimal, the gradients computed during the backpropagation will naturally have low values before reaching the generator’s parameters, resulting in weak updates. Learning then starts to slow, and can stop.

These issues can be alleviated **using Wasserstein distance as loss function**.

#### Control techniques

Since then techniques have been developed to control GANs from high level attributes, with a feature classifier or even from text captions and contrastive learning with VQGAN + CLIP. For the latter, I invite you to read this great blogpost for more details, and my blogpost on text-to-image models for a wider overview.

#### Efficient sampling

GAN models were originally presented as sampling from the latent space of the generator to produce content, leaving the discriminator after training. The discriminator however contains useful information on the data distribution that it would be wasteful not to take advantage on, as the common generation practice is based on rejecting unsatisfying samples. Discriminator Driven Latent Sampling (DDLS) proposes to use the discriminator at inference to efficiently sample from the latent space, by defining the generator and discriminator as an energy-based model and running Markov Chain Monte Carlo, yielding results of higher quality and more efficiently.

Ansari et al. proposed a improved and more general framework of sample refinement strategy they called Discriminator Gradient *f*low (DG*f*low). The main idea is to iteratively refine the latent sample $\mathbf{z}$ by computing the gradient of the f-divergence loss of $d_\delta (g_\theta (\mathbf{z}))$. In other words, we use the generator to generate from $\mathbf{z}$, assess the result with the discriminator, compute gradients from the score and backpropagate them to $\mathbf{z}$ to refine it. This can be iteratively done until the $g_\theta (\mathbf{z})$ is satisfying enough.
DG*f*low can be applied to any latent sampling model like GANs, VAE, Normalizing Flows, and its main principle is similar to the how Diffusion models work.

### Why not for discrete data ?

GANs have essentially been applied to continuous domains so far. The reason is mainly because the output of the generator $\mathbf{y}$ is the input of the discriminator, thus every operation in between must be differentiable. Sequential models however return sequences of probability distributions, from which we would have to sample from in order to estimate the result and pass it to the discriminator. Here is the bottleneck.

Indeed, each discrete element has to be represented in the form of a vector, commonly called embedding when fed to a model, and latent for each intermediate state between its layers. A discrete sequential model hence needs to receive a two-dimensional input $\mathbf{X} \in \mathbb{R}^{L \times E}$, for $L$ embeddings of size $E$, and will usually give an output of the same dimensions. The last hidden states are usually converted into logits $\in \mathbb{R}^{L \times V}$ with a final layer, from which we can sample from to determine the output token, over the vocabulary, of each element in the sequence.

If we want to use the generator of a discrete GAN to create new content after it has been trained, it would necessarily have to output a sequence of logits ($\mathbf{Y} \in \mathbb{R}^{L \times C}$, for $C$ classes). This means that during training we would have to sample $\mathbf{y} \sim g_\theta \left( \mathbf{x} \right)$, and feed $\mathbf{y}$ to the discriminator (which would further convert it to embeddings). The issue is that **sampling from a distribution is not a deterministic operation, hence not differentiable**. There is not derivative from sampling, so the gradients from the discriminator cannot be backpropagated to the generator.

## GANs for natural language generation

If we want to create a GAN for discrete data, we then need to find a way to bypass the sampling step and find a way to 1) make sure the generator outputs logits from which we will sample after training and 2) pass the gradients of the discriminator to the generator during training.

Some tricks have been presented and used to alleviate this bottleneck.

### Gumbel-Softmax

A first solution would be make the sampling differentiable, by making it deterministic. This is the goal of the Gumbel-softmax continuous relaxation trick.

Thz Gumbel-Softmax trick takes inspiration from the Gumbel-max trick, a method to sample from discrete distributions parametrized by unnormalized log-probabilities with the Gumbel distribution. The Gumbel-max trick returns a one-hot vector $\mathbf{h}$ from a distribution $\mathbf{y}$ as $\mathbf{h} = \mathtt{one\_hot} \left( \mathrm{arg \: max} \left(\mathbf{g} + \mathrm{log} \: \mathbf{y}\right)\right)$, where $\mathbf{g}$ is the Gumbel distribution. The $\mathrm{arg \: max}$ operation is however not differentiable.

The Gumbel-softmax alternative approximates the sampling from a categorical distribution using the $\mathrm{softmax}$ function as a continuous and differentiable approximation of $\mathrm{arg \: max}$: $\mathbf{h} = \mathrm{softmax} \left( (\mathbf{y} + \mathbf{g})\: / \:\tau \right)$. As $\tau \rightarrow 0$, the distribution becomes close to a one-hot vector. The operation is differentiable, allowing the backpropagation from the discriminator to the generator and so to train GAN on discrete data. In practice the training begins with large $\tau$, progressively annealed to 0.

We can note that Gumbel-softmax is similar to the reparameterization trick used in Variational Autoencoders, in the sense that it shifts the stochastic part which is used in combination with the deterministic nodes to approximate the sample. The reparameterization trick cannot be applied here as the parameters are not adjusted for discrete distributions, but rather for a fixed set of distributions. Hence adjusting the parameters over multiple discrete distributions would not produce convergence neither coherent samples.

Kusner and Hernández-Lobato first employed it to create a language GAN POC with LSTM layers. Later came RelGAN, which also uses Gumbel-Softmax, and is built of a attention-based generator and a CNN-based discriminator. Its particularity is that it maps the input of the discriminator into several embedding sequences, which are passed independently to $d_\delta$. The several losses are averaged to get the final loss value. The authors expect each embedding sequence to represent different features of the input. Their results showed better BLEU metrics compared with other GAN baselines using RL to estimate gradients.

Finally, Chen et al. proposed an other approach with optimal transport. They replace the binary function of the discriminator by the Earth-Mover’s Distance (EMD) between the sentence features of real and synthetic data. However as computing the EDM is intractable in the case of neural language processing, they created an adaptation they called Feature-Mover’s Distance (FMD). On text generation and style transfer, their baseline show competitive results compared with other GAN-based sequential generative models such as SeqGAN (presented a few lines below).

### Reinforcement learning policy

In reinforcement learning, a policy is a strategy that an agent uses to realize its goals. The policy dictate its actions depending on its state and the environment. The policy is learned following a reward which is calculated by a cost function of its actions.

REINFORCE, also known as the score function estimator, transforms the integral into an expectation. It uses the log-derivative differentiation rule $\nabla_\theta\: p_\theta (x) = p_\theta (x) \nabla_\theta\: \mathrm{log}\: p_\theta (x)$.

We can then turn the gradient into an expectation:

$$ \begin{align} \begin{aligned} \nabla_\theta \mathbb{E}_{x \sim p_\theta (x)} &= \nabla_\theta \int f(x) p_\theta (x) dx \\ &= \int f(x) \nabla_\theta p_\theta (x) dx \\ &= \int f(x) p_\theta (x) \nabla_\theta \mathrm{log}\: p_\theta (x) dx \\ &= \mathbb{E}_{x \sim p_\theta (x)} [f(x) \nabla_\theta \mathrm{log}\: p_\theta (x)] \end{aligned} \end{align} $$

Now that the distribution under the expectation is known, we can estimate the expectation with Monte Carlo sampling. The key advantage of REINFORCE is that places no restriction on the nature of $f$, which does not have to be differentiable so that we can estimate the gradients of its expected value.

Due to the sampling, the gradients can easily suffer from high variance. Two solutions to reduces variances which can be applied here to stabilize training: control variates and importance sampling.

REINFORCE is used by SeqGAN. The generator is a reinforcement learning policy $G(y_t \lvert \mathbf{y}_{1:t-1})$ of generating a sequence. The discriminator provides the reward (the probability of being true data) $d_\delta (\mathbf{y})$ over the whole sequence. At step $t$, the state of $g_\theta$ is defined as the sequence of already predicted tokens $\mathbf{y}_{1:t-1}$. The discriminator $d_\delta$ estimates if the sequence $\mathbf{y}_{1:T}$ is real or generated, and its result is used as the reward to update the generator’s policy. The generator is made of LSTM layers, and the discriminator of convolutional layers. They experimented on text and symbolic generation. The reported results shows SeqGAN outperforming an MLE baseline, the same generator $g_\theta$ trained with teacher forcing, on BLEU and human metrics.

The same strategy is use with MolGAN and ScratchGAN, which generate respectively molecule graphs and text with results comparable to MLE baselines in term of quality.

RankGAN goes further by proposing a new training objective for the discriminator (called Ranker here). Given a collection of human-written and one generated sentences, it learns to rank them from the most human-like to the most fake. The learning objective of the generator is then to produce a sentence that is ranked higher by the ranker. These combined objective can be expressed as $\underset{\theta}{\mathrm{min}}\: \underset{\delta}{\mathrm{max}}\: \mathcal{L}(g_\theta, d_\delta) = \mathbb{E}_{x \sim p_d(\mathbf{x})} [\mathrm{log}\: d_\delta(x \lvert U, C^-)] + \mathbb{E}_{x \sim p_\theta(\mathbf{x})} [\mathrm{log} (1 - d_\delta(x \lvert U, C^+))]$ where $U$ is the reference ranking set and $C^-$ and $C^+$ are the comparison sets drawn respectively from fake sentences generated from $g_\delta$ and real data from $\mathcal{X}$. The score of a sentence $\mathbf{x}$ (either fake or real) coupled with a comparison set $\mathcal{C}$ is defined by:

$$d_\delta (\mathbf{x} \lvert \mathcal{U}, \mathcal{C}) = \mathbb{E}_{\mathbf{u}\ \sim\ \mathcal{U}} \left[\frac{\mathrm{exp} (\lambda \alpha(\mathbf{x} \lvert \mathbf{u}))}{\sum_{i=1}^{\lvert \mathcal{C} \lvert} \mathrm{exp}(\lambda \alpha(\mathcal{C}_i \lvert \mathbf{u}))}\right]$$

where $\alpha$ is the cosine similarity function. The ranking training objective learns the discriminator to not only distinguish fake for real data, but emphasizes on the learning of the underlying features comprised within the real data distribution in a kind of contrastive way.

Note that some other low variance gradient estimator exist and can also be used in the case of GAN, such as Rebar or GO Gradient / GO Hessian.

### Chunked auto-regressive generation GAN

CARGAN employs a kind of hybrid generation method, mixing autoregressive and non autoregressive, for conditional waveform synthesis. The generation can be formulated as follow: $g_\theta (\mathbf{y}, \mathbf{x}) = \prod_{t=i}^{T} g_\theta(\mathbf{x}_{t-k:t} \lvert \mathbf{x}_{1:t-k})$ where $i \in {k,2k,…,T }$. The model thus has predicts chunks of sequence of $k$ length, one after an other conditioned on the previous ones. This method is notably used by WaveRNN or WaveFlow, and is particularly well suited for the audio modality due its continuous but discretized nature, with a time dimension along which one can desire to extend the content. I am not sure that this technique would give results comparable with MLE models if applied to text, I have not seen something similar yet.

### Issues with discrete GANs

A sequence of discrete elements is built of elements all dependant from each other. Hence sampling independently each element from a multivariate distribution is a very hazardous task. A token $y_i$ sampled first (with a high probability) could have no sense with another token $y_j$ is sampled (with a potentially high probability too). Hence we can doubt the capacity of a neural network to predict in a single step a sequence of discrete elements with coherent meaning. The nature of discrete data induces that a wide variety of plausible solutions close to the real data distribution $\mathcal{X}$ exist for most language modeling tasks. Hence even with the presented solution, discrete GANs are actually handling a hard and unpredictable task.

The conclusion is that language GANs still fall behind pure MLE-based models. Language GANs Falling Short empirically shows that discrete GANs tends to be more prone to mode collapse, and produce less diverse output. This is kind of expected, as the generator is trained to produce a multivariate distribution to fool a discriminator. We can very likely expect the model to learn a set of plausible examples and easily falling into mode collapse.

The paper also criticizes the fact that most language GANs introduced so far mostly focused on the quality of the results only. They claim that the difficulties to train a discrete GAN are not worth, and appear to be a larger problem than the exposure bias that this kind of model is trying to solve in the first place.

## Alternative training objective

In 2020, Welleck et al. introduced unlikelihood training. They aim to tackle the exposure bias by altering the traditional teacher forcing training objective to reduce the probability of previous tokens.

Given a set of candidate tokens $\mathcal{C}$, the “unlikelihood” loss is defined as:

$$\mathcal{L}_{UL}(p_\theta (\cdot\: \lvert\: \mathbf{x}_{<t}),\: \mathcal{C}) = - \sum_{c\: \in\: \mathcal{C}} \mathrm{log}(1 - p_\theta (c\: \lvert\: \mathbf{x}_{<t}))$$

And during training the global loss is:

$$\mathcal{L}(p_\theta(\cdot\: \lvert\: \mathbf{x}_{<t})) = -\alpha\: \mathcal{L}_{UL}(p_\theta (\cdot\: \lvert\: \mathbf{x}_{<t}), \mathcal{C}) - \mathrm{log}(x_t \lvert \mathbf{x}_{<t})$$

The model will intuitively updates its parameters towards lower probabilities for tokens within $\mathcal{C}$. And the $\alpha$ constant controls how the unlikelihood loss contributes to the model’s training. By setting $\mathcal{C}$ as the previous tokens (${ x_1, …, x_{t-1} }$), the authors show that this technique helps the model to predict less repeated tokens when used autoregressively. Moreover, frequent tokens are intuitively less likely to be predicted as they are likely to be present in the previous one already.

While this token-level penalty helps a language model to predict better results by itself, it is limited to the prefixes of sequences from the training set, and might be unsuited or useless for zero-shot tasks. The mismatch between training data and generated data remains if no training objective that operates on the overall generated results is used. This motivated the authors to also include a sequence-level unlikelihood objective.

Given a prefix $\mathbf{x}_{1:k}$, they propose to decode a continuation $\mathbf{x}_{k+1:N} \tilde p_\theta (\cdot \lvert \mathbf{x}_{1:k})$, construct per-step candidate sets ${ \mathcal{C}^{k+1}, …, \mathcal{C}^{N} }$, and use a per-step sequence-level loss as:

$$\mathcal{L}_{ULS}^t (p_\theta (\cdot\: \lvert\: \mathbf{x}_{<t}),\: \mathcal{C}^t) = - \sum_{c\: \in\: \mathcal{C}^t} \mathrm{log} (1 - p_\theta (c\: \lvert\: \mathbf{x}_{<t}))$$

Authors choose to penalize repeating n-grams in the continuation.

This sequence-level loss is used to fine-tune a pre-trained baseline. The authors reported that only 1500 updates substantially reduced degeneration.

## Adversarial methods

Given the issues previously introduced, researchers worked on methods to go beyond the maximum likelihood training objective (and teacher forcing) of sequential models, while keeping low (or acceptable) variances. Those are not GANs in the sense that they do not train a generator depending on a loss computed with a discriminator, but rather trained both independently and use them in cooperation afterwards.

One of the first effort is the Reward Augmented Maximum Likelihood (**RML**) framework. The authors proposed here to add an additional term $r(\mathbf{y} \lvert \mathbf{y}^*)$, to the training objective, where $\mathbf{y}$ is a true expected result and $\mathbf{y}^*$ a result generated from the model, to guide it toward more realistic results during inference. The objective as an expectation and the gradient are expressed as:

$$ \begin{align} \begin{aligned} \mathcal{L}_{RML}(\theta, \tau, \mathcal{D}) = \sum_{(\mathbf{x}, \mathbf{y} \in \mathcal{D})} \left\{ - \sum_{\mathbf{y} \in \mathcal{Y}} q(\mathcal{y} \lvert \mathcal{y}^* ; \tau) \mathrm{log} p_\theta (\mathbf{y} \lvert \mathbf{x}) \right\} \\ \nabla_\theta \mathcal{L}_{RML}(\theta, \tau) = \mathbb{E}_{p_\theta (\mathbf{y} \lvert \mathbf{x})} \left[ -\nabla_\theta \mathrm{log} p_\theta (\mathbf{y} \lvert \mathbf{x}) r(\mathbf{y} \lvert \mathbf{y}^*) \right] \end{aligned} \end{align} $$

Where $\tau$ controls the degree of regularization. This term can be based on any reward or metric such as BLEU. The training objective is equivalent to a RL one, but here we do not need to sample successively from the model and thus get a potentially very sparse gradient (given the high dimension of the associated space) with high variance. The results show naturally better BLEU metrics than their pure MLE counterparts on machine translation tasks, with a RNN model. The authors did not use this framework adversarially, but this could be done with $r$ as a feedback from a discriminator, as it will be achieve in future works.

Su et al. proposed to couple a generator language model with a discriminator, and Gibbs sampling, in order to iteratively refine an input sentence until it satisfies some user-defined constraints. Gibbs sampling is a Monte Carlo Markov Chain (MCMC) sampling strategy that iteratively sample from a multivariate distribution $\mathbf{Y} \in \mathbb{R}^{L \times c}$ which would be to complex to sample from in one step. For $T$ steps, each element of the sequence is resampled $x_i \sim p_\theta (x_i \lvert \mathbf{x}_{\backslash i})$. As $\mathbf{x}$ is resampled, it becomes more natural following $p_\theta$.

The authors actually trained $K$ discriminators on $K$ constraints which return $d_k (c_k \lvert \mathbf{x})$, and $d(\mathbf{c} \lvert \mathbf{x}) = \prod_{k=1}^{K} d_k (c_k \lvert \mathbf{x})$. An input sentence is corrected multiple times in parallel by keeping the most likely tokens at each correction step, removing candidates for which $d(\mathbf{c} \lvert \mathbf{x}) < threshold$, and returning the candidate maximizing $d(\mathbf{c} \lvert \mathbf{x})$.

Holtzman et al. used learned discriminators, each specializing in a different aspect of human conversational communication, to train a RNN model to generate text.

DAS (Discriminative Adversarial Search) also inspired by adversarial training, trains a discriminator to tell apart generated text from human written text. It’s main goal is to reduce the exposure bias of traditional NLG methods. The discriminator predicts a label for each token instead of for the entire sequence. The discriminator logprob is added to the score to guide sampling toward the human-written style. This adversarial training allows to train several times the discriminator until it fails to detect machine learning results.

Scialom et al. further increases the link between generator and discriminator by introducing a discrete adversarial training framework: SelfGAN; and a cooperative generation decoding adapted from Monte Carlo Tree Seach (MTCS): Coopt-MCTS.

The SelfGAN framework can be expressed as following (note that we omitted the *context* from the original paper which corresponds to a seq2seq / encoder-decoder model architecture):

```
Input: Generator g, Discriminator d, cooperative decoding method c, training set X
for n epochs do
for x in X do
s_coop <-- c(x, g, d)
g.train(input=x, target=s_coop)
d.train(real=x, fake=s_coop)
```

The cooperative decoding uses the discriminator to steer the generator toward humans results. Used during the training stage as the expected target for the generator’s training, it allows to progressively improve both models at the same time.

Coopt-MCTS employs MCTS in order to progressively generate the results while maintaining its quality according to the discriminator. The root is $\mathbf{x}$, i.e. the input token sequence to extend, to which is associated a score $s_0 = d_\delta (\mathbf{x})$. The child of a node is the next token to be selected. The objective is to find the path (sequence $\mathbf{y}$ which extends $\mathbf{x}$) for which $d_\delta (\mathbf{y})$ (or other part or subparts, including $\mathbf{x}$) is minimized, without exploring the whole tree, for optimization and complexity reasons.

- Selection: children nodes (tokens) are iteratively selected. The probability $g_\theta (\mathbf{x}_t )$ for the next token $x_{t+1}$ are used to this end, associated with the PUCT algorithm. Nucleus sampling is applied to reduce the set of possible tokens and discard those with low probabilities: $P(x_i) = s_i + c_{puct} g_\theta (x_i \lvert \mathbf{x}_{1:i-1}) \frac{\sqrt{\sum_{b}N(\mathbf{x}_{1:i-1}, b)}}{1 + N(\mathbf{x}_{1:i-1}, x_i)}$ where $c_{puct}$ is a constant, $N(\mathbf{x}, w)$ is the number of times $w$ was selected for the sequence $\mathbf{x}$, and $s$ is the score of the node given by the discriminator as $s_i = d_\delta (\mathbf{x}_{1:i})$.
- Expansion: if the children is not in the final state ($T$), extend it with $g_\theta$ and nucleus sampling.
- Simulation: when a child reaches the final state ($T$), we use the discriminator to evaluate it $d_\delta (\mathbf{y})$.
- Backpropagation: when a node becomes terminal (End of Sequence token or maximum length), the scores of its parent nodes back to the root are updated, as $s_i = \mathrm{max}(s_i, s_0)$

These steps are executed for a maximum number of steps, after which the next token of the root is selected by taking the one with the most visits. All is repeated to predict token by token the continuation of the sequence, until reaching a special token End Of Sentence, or the maximum length.

Note that the discriminator becomes more accurate as the sequence length it judges grows. If a long sequence is overall judged human, all its subsequences could also be considered human. And a sequence could be judged fake although its beginning could be judged human. The main advantage of Coopt-MCTS is that 1) the discriminator is used on sequence of relatively large sizes, which is how it should be used; 2) it allows to accurately explore nodes and chose the one considered humans, resulting in an overall good quality result.

Lamprier et al. further improve the cooperative generator-discriminator framework by introducing a partition function which regularizes the calculation of gradients while assuring convergence of both the generator and discriminator networks.

One of the main issues of cooperative frameworks so far was that their convergence was not proved and assured. For most of these works, the model distribution is updated by maximizing the likelihood between the probabilities $g_\theta (\mathbf{x})$ and a target distribution which is the result of some cooperation decoding $q = \mathrm{c}(p_\theta, d_\delta)$. We can express this is minimizing $\mathrm{KL}(q \lvert\lvert p_\theta)$.

But if we consider $q \triangleq \mathrm{c}(p_\theta, d_\delta) \propto exp(d_\delta)$ where this distribution is only proportional to the outputs of the discriminator, and a step where the generator is optimal (i.e. $p_\theta \approx p_d$). In the next step, the optimal discriminator yields $d_\delta(\mathbf{x})=0.5\; \forall\; \mathbf{x} \in \mathcal{X}$, thus optimizing the generator with $q \propto exp(d_\delta)$ will make it diverge from the optimal $p_d$. While being extreme, this examples illustrate the instabilities of convergence this recent family of model can suffer from.

Instead, the authors proposed to set $q \propto p_\theta ,d_\delta$, which theoretically guaranties convergence: if the generator and discriminator have enough parameters, as we successively train them, $p_\theta \rightarrow p_d$ and $d_\delta(\mathbf{x}) \rightarrow 0.5\; \forall\; \mathbf{x} \in \mathcal{X}$.

Moreover, if both parts of the discriminator are sufficiently trained (its mean is superior to 0.5), we have $\Delta \triangleq \mathrm{KL}(p_d \lvert\lvert p_\theta) - \mathrm{KL}(p_d \lvert\lvert p_\theta) \leq \mathrm{log}(\eta^{-1} - 1) > 0,; \eta \in \left[ 0.5; 1 \right]$, meaning $q \propto p_\theta ,d_\delta$ is a useful target to move toward.

The gradients of our previous objective $\mathrm{KL}(q \lvert\lvert p_\theta)$ can be rewritten with importance sampling as:

$$ \begin{align} \begin{aligned} \nabla_{p_\theta}\: \mathrm{KL}(q \lvert\lvert p_\theta) &= -\mathbb{E}_{\mathbf{x} \sim q(\mathbf{x})} \left[ \nabla\: \mathrm{log}\: p_\theta(\mathbf{y}) \right] \\ &= -\mathbb{E}_{\mathbf{x} \sim p_\theta(\mathbf{x})} \left[ \frac{q(\mathbf{x})}{p_\theta(\mathbf{x})} \nabla\: \mathrm{log}\: p_\theta(\mathbf{x}) \right] \\ &= - \frac{1}{Z}\: \mathbb{E}_{\mathbf{x} \sim p_\theta} \left[ d_\delta(\mathbf{x}) \nabla\: \mathrm{log}\: p_\theta(\mathbf{x}) \right]\\ Z &= \sum_{x \in \mathcal{X}} p_\theta(\mathbf{x}) d_\delta(\mathbf{x}) \end{aligned} \end{align} $$

$Z$ is the partition function of the discriminator and acts as a regularization of the gradient calculation. It can be written as an expectation $\mathbb{E}_{\mathbf{y} \sim p_\theta(\mathbf{y})} \left[ d_\delta(\mathbf{y}) \right]$, where it is clear that $Z$ is maximized when samples from $p_\theta$ are likely to be close to $p_d$. This term naturally avoid the explosion of gradient and the use of complex and tedious learning rate scheduler, which are often required with other discrete GAN settings.

Chaffin et al. empirically showed that an unidirectional transformer (causal attention mask) achieves results comparable to a bidirectional one (no attention mask), while being much more efficient. At inference, the previous hidden states can be reused as they do not need to be recomputed considering the tokens lastly sampled.

Donahue et al. proposed to use a GAN and a VAE in cooperation. The GAN generator is trained to generate latent spaces for the VAE decoder. The GAN discriminator is trained with latent spaces generated from the VAE encoder (real) and from the GAN generator (fake). We cannot really classify the architecture as a GAN, as the goal of the generator is not directly to generate content, but rather offer to explore the latent space that the VAE decoder will turn into content.

## Contrastive learning for text generation

CoNT, for Contrastive Neural Text, addresses the exposure bias with contrastive learning. It is a framework that can be used to train and infer from any language model.

Instead of naively applying contrastive loss terms as done with other modalities, CoNT:

- Uses generated samples to set the contrastive loss;
- Uses a N-pair loss which includes sequence-level scores of all pairs;
- Directly incorporates the learned sequence similarity score from the distance function into the inference stage.

Common contrastive loss terms are built from samples within a batch. Whereas it satisfies very well the overall training objective, it does not address the exposure bias issue. Using generated samples to build the contrastive loss allows to expose the model to its mistakes. The authors use beam search to generate them, and also note that a warm-up stage where the model is only trained with a pure NLL loss is recommended so that the generated samples have a sufficient quality.

Common contrastive loss also treat all negative samples equally, meaning that the relative difference between real and generated samples is ignore. To mitigate this, CoNT employs a pair-wise margin loss: 1) all generated samples are ranked with an oracle function $o(\cdot, \textbf{y})$ which computes a sequence-level score according to the real sample $\mathbf{y}$. The overall contrastive is formulated as:

$$\mathcal{L}_{N-pairs} = \sum_{(\mathbf{y}^-, \mathbf{y}^+) \in \mathcal{P}} \mathcal{L}(\mathbf{y}^-, \mathbf{y}^+) = \sum_{(\mathbf{y}^-, \mathbf{y}^+) \in \mathcal{P}} \mathrm{max} \left( 0, \mathrm{cos}(\mathbf{z}_{\mathbf{x}}, \mathbf{z}_{\mathbf{y}^-}) - \mathrm{cos}(\mathbf{z}_{\mathbf{x}}, \mathbf{z}_{\mathbf{y}^+}) + \gamma \right)$$

$\mathcal{P}$ is a set of pairs of samples, made of real and generated samples. The latter are classified $\mathbf{y}^-$ or $\mathbf{y}^+$ depending on their ranks. $\mathrm{cos}(\cdot, \cdot)$ is the cosine similarity function, $\mathbf{z}_{\mathbf{i}}$ is the hidden representation of the input sequence $\mathbf{i}$, and $\mathbf{x}$ and $\mathbf{y}$ is the input sequence. During inference, this learned similarity score is included to steer the predictions towards results closer to the data distribution.

SimCTG introduces a contrastive approach based on the similarity of the embeddings of tokens. The authors show that, on a model trained with a pure MLE training objective, the cosine similarities of the embeddings are relatively high. This is counterintuitive for generative tasks, where we expect a model to yield high high probabilities to tokens that are both coherent considering the past context, and informative (as opposite to degenerating, topic drifting or repetition). They thus introduced SimCTG, as a contrastive loss term, defined as:

$$\mathcal{L}_{CL} = \frac{1}{\lvert \mathrm{x} \lvert \times (\lvert \mathrm{x} \lvert - 1)} \sum_{i=1}^{\lvert \mathrm{x} \lvert} \sum_{j=1, j \neq i}^{\lvert \mathrm{x} \lvert} \mathrm{max} { 0, \phi - s(\mathbf{h}_{x_{i}}, \mathbf{h}_{x_{i}}) + s(\mathbf{h}_{x_{i}}, \mathbf{h}_{x_{j}}) }$$

In the equation above, $\phi \in [-1, 1]$ is a margin, $s(\cdot)$ is the cosine similarity function, $\mathbf{h}_{x_i}$ is the embedding of $x_i$, from the input sequence $\mathbf{x}$.

During training, the overall loss is set as $\mathcal{L} = \mathcal{L}_{MLE} + \mathcal{L}_{CL}$. This insure that the model still learns with teacher forcing to predict the next tokens according to the past ones, while making sure that it produces discriminative and isotropic embedding representations.

The authors also introduced a contrastive decoding method, defined as:

$$x_t = \underset{v \in \mathcal{V}^{(k)}}{\mathrm{arg max}} { (1-\alpha) \times p_\theta(v \lvert \mathrm{x}_{<t}) - \alpha \times (\mathrm{max} { s(\mathbf{h}_v, \mathbf{h}_{x_j}) : 1 \leq j \leq t-1 })}$$

$\mathcal{V}^{(k)}$ is the set of top-k predictions from $p_\theta(\cdot \lvert x<t)$, $\alpha$ balances the two components. The first is the model’s probability distribution and represents its confidence. The second measures how discriminative of candidate $v$ with respect to the previous context $\mathbf{x}_{<t}$, represents the degeneration penalty. A larger degeneration penalty of $v$ means it is more similar to the context, therefore more likely leading to model degeneration.

SimCTL associated with the contrastive decoding was shown to outperforms most common decoding methods such as beam search, top-k and nucleus.

Contrastive Decoding (CD) is another decoding strategy, based on two LM: a small “amateur” LM, and bigger “expert” LM. Smaller LM tend to be more prone to exposure bias failures, and that their probability distributions can be used as a signal to be sent to a larger one to increase the quality of generated results. CD produces higher quality text than decoding from the larger LM alone, and can be used with any LM.

## Conclusion

Pure MLE and autoregressive generation are still the way to go for NLG tasks. Some decoding methods, as MCTS guidance or contrastive learning, help to reduce the exposure bias, and are probably the ones giving the best results (at the price of time of execution).

It is possible to build a GAN to generate discrete data, using relaxation tricks or reinforcement learning. But these techniques usually come with high variance, and discrete GANs are more likely to fall in mode collapse. Many would consider that they comes with too much constraints, and struggle to produce results with both quality and diversity.

I believe this promising area of research will move towards the cooperation between pure MLE models.

## References

[1] I. Goodfellow et al., “Generative Adversarial Nets” in Advances in Neural Information Processing Systems, 2014, vol. 27.

[2] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The Curious Case of Neural Text Degeneration” in ICLR, 2020.

[3] A. Fan, M. Lewis, and Y. Dauphin, “Hierarchical Neural Story Generation,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul. 2018, pp. 889–898. doi: 10.18653/v1/P18-1082.

[4] X. Lu, P. West, R. Zellers, R. Le Bras, C. Bhagavatula, and Y. Choi, “NeuroLogic Decoding: (Un)supervised Neural Text Generation with Predicate Logic Constraints” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jun. 2021, pp. 4288–4299.

[5] X. Lu et al., “NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jul. 2022, pp. 780–799.

[6] P. E. Hart, N. J. Nilsson, and B. Raphael, “A Formal Basis for the Heuristic Determination of Minimum Cost Paths” IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 2, pp. 100–107, 1968.

[7] L. Qin, S. Welleck, D. Khashabi, and Y. Choi, “COLD decoding: Energy-based constrained text generation with langevin dynamics” arXiv preprint arXiv:2202.11705, 2022.

[8] T. Karras, S. Laine, and T. Aila, “A Style-Based Generator Architecture for Generative Adversarial Networks” in CVPR, Jul. 2019.

[9] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, “GANSynth: Adversarial Neural Audio Synthesis” arXiv, 2019.

[10] A. Shoshan, N. Bhonker, I. Kviatkovsky, and G. Medioni, “GAN-Control: Explicitly Controllable GANs” in ICCV, Oct. 2021, pp. 14083–14093.

[11] M. Lee and J. Seok, “Controllable Generative Adversarial Network” IEEE Access, vol. 7, pp. 28158–28169, 2019, doi: 10.1109/ACCESS.2019.2899108.

[12] T. Che et al., “Your GAN is Secretly an Energy-based Model and You Should Use Discriminator Driven Latent Sampling” in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 12275–12287.

[13] A. F. Ansari, M. L. Ang, and H. Soh, “Refining Deep Generative Models via Discriminator Gradient Flow” in ICLR, 2021.

[14] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN” arXiv, 2017.

[15] E. Jang, S. Gu, and B. Poole, “Categorical Reparameterization with Gumbel-Softmax” in ICLR, 2017.

[16] C. J. Maddison, D. Tarlow, and T. Minka, “A* Sampling” in Advances in Neural Information Processing Systems, 2014, vol. 27.

[17] D. P. Kingma, T. Salimans, and M. Welling, “Variational Dropout and the Local Reparameterization Trick” in Advances in Neural Information Processing Systems, 2015, vol. 28.

[18] M. J. Kusner and J. M. Hernández-Lobato, “GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution” arXiv, 2016.

[19] W. Nie, N. Narodytska, and A. Patel, “RelGAN: Relational Generative Adversarial Networks for Text Generation” in ICLR, 2019.

[20] L. Chen et al., “Adversarial Text Generation via Feature-Mover’s Distance” in Advances in Neural Information Processing Systems, 2018, vol. 31.

[21] L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017, pp. 2852–2858.

[22] N. De Cao and T. Kipf, “MolGAN: An implicit generative model for small molecular graphs”, arXiv, 2018.

[23] C. de Masson d’Autume, S. Mohamed, M. Rosca, and J. Rae, “Training Language GANs from Scratch” in Advances in Neural Information Processing Systems, 2019, vol. 32.

[24] K. Lin, D. Li, X. He, Z. Zhang, and M. Sun, “Adversarial Ranking for Language Generation” in Advances in Neural Information Processing Systems, 2017, vol. 30.

[25] G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein, “REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models” in Advances in Neural Information Processing Systems, 2017, vol. 30.

[26] Y. Cong, M. Zhao, K. Bai, and L. Carin, “GO Gradient for Expectation-Based Objectives” in ICLR, 2019.

[27] Y. Cong, M. Zhao, J. Li, J. Chen, and L. Carin, “GO Hessian for Expectation-Based Objectives” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 13, pp. 12060–12068, May 2021.

[28] M. Morrison, R. Kumar, K. Kumar, P. Seetharaman, A. Courville, and Y. Bengio, “Chunked Autoregressive GAN for Conditional Waveform Synthesis” in ICLR 2022.

[29] N. Kalchbrenner et al., “Efficient Neural Audio Synthesis” in Proceedings of the 35th International Conference on Machine Learning, Jul. 2018, vol. 80, pp. 2410–2419.

[30]W. Ping, K. Peng, K. Zhao, and Z. Song, “WaveFlow: A Compact Flow-based Model for Raw Audio” in Proceedings of the 37th International Conference on Machine Learning, Jul. 2020, vol. 119, pp. 7706–7716.

[31] M. Caccia, L. Caccia, W. Fedus, H. Larochelle, J. Pineau, and L. Charlin, “Language GANs Falling Short” in ICLR, 2020.

[32] S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston, “Neural Text Generation With Unlikelihood Training” 2020.

[33] M. Norouzi et al., “Reward Augmented Maximum Likelihood for Neural Structured Prediction,” in Advances in Neural Information Processing Systems, 2016, vol. 29.

[34] J. Su, J. Xu, X. Qiu, and X. Huang, “Incorporating Discriminator in Sentence Generation: a Gibbs Sampling Method,” in AAAI Conference, 2018, doi: 10.1609/aaai.v32i1.11990.

[35] A. Holtzman, J. Buys, M. Forbes, A. Bosselut, D. Golub, and Y. Choi, “Learning to Write with Cooperative Discriminators” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul. 2018, pp. 1638–1649. doi: 10.18653/v1/P18-1152.

[36] T. Scialom, P.-A. Dray, S. Lamprier, B. Piwowarski, and J. Staiano, “Discriminative Adversarial Search for Abstractive Summarization” in Proceedings of the 37th International Conference on Machine Learning, Jul. 2020, vol. 119, pp. 8555–8564.

[37] T. Scialom, P.-A. Dray, J. Staiano, S. Lamprier, and B. Piwowarski, “To Beam Or Not To Beam: That is a Question of Cooperation for Language GANs” in Advances in Neural Information Processing Systems, 2021, vol. 34, pp. 26585–26597.

[38] Lamprier, S., Scialom, T., Chaffin, A., Claveau, V., Kijak, E., Staiano, J., & Piwowarski, B. (2022). “Generative Cooperative Networks for Natural Language Generation” In Proceedings of the 39th International Conference on Machine Learning (Vol. 162, pp. 11891–11905).

[39] A. Chaffin et al., “Which Discriminator for Cooperative Text Generation?” arXiv preprint arXiv:2204.11586, 2022.

[40] D. Donahue and A. Rumshisky, “Adversarial Text Generation Without Reinforcement Learning”, arXiv, 2018. doi: 10.48550/ARXIV.1810.06640.

[41] Y. Su, T. Lan, Y. Wang, D. Yogatama, L. Kong, and N. Collier, “CoNT: Contrastive Neural Text Generation” in Advances in Neural Information Processing Systems, 2022, vol. 35.

[43] Y. Su, T. Lan, Y. Wang, D. Yogatama, L. Kong, and N. Collier, “A Contrastive Framework for Neural Text Generation,” in Advances in Neural Information Processing Systems, 2022. [Online]. Available: https://openreview.net/forum?id=V88BafmH9Pj

[44] X. L. Li et al., Contrastive Decoding: Open-ended Text Generation as Optimization. arXiv, 2022. doi: 10.48550/ARXIV.2210.15097.