
Text to image generation with Deep Learning
[Update 2022 Oct. 30] Added the recently introduced text-to-video models: Imagen Video and Phenaki.

Notation

Let’s formulate the problem before going further.

| Symbol | Meaning |
| --- | --- |
| $g_\theta$ | Generator network with parameters $\theta$ |
| $\mathbf{c}$ | A caption, represented as a sequence of tokens |
| $x$ | An input image, optionally fed to $g_\theta$ to be modified |
| $y$ | The output image, sampled from $g_\theta(\mathbf{c})$ or $g_\theta(\mathbf{c}, x)$ |
| $\mathbf{z}$ | A latent vector |
| $\mathbf{h}$ | Hidden states, an intermediate representation of the input data |

Intro and problem formulation

We refer to text-to-image generation as the task of generating visual content conditioned on some text description....