The Evolution of AI-Generated Content Systems: From Markov Models to ChatGPT
AI has grown from a mere statistical data summarizer into a creator of original content. The last few years have been an incredible journey: AI is now used to generate images, music, text, and other digital content, and this is the best time to be alive to enjoy and contribute to the growth of AI-generated content (AIGC).
AIGC has recently gained broad attention from the larger community outside of AI practitioners. Today, everyone is interested in content generated by AI and in the quality of the output of models such as DALL-E 2 and ChatGPT. Fundamentally, AIGC refers to content created automatically, as if it had been created by a human. For example, ChatGPT, developed by OpenAI, generates content in response to the questions asked, while DALL-E 2 can generate high-resolution, immersive images from nothing more than a text description, for example “A human civilization on Mars”.
Fundamentally, generative AI (GAI) works by first extracting the intent of the query and then using that intent to generate content for that category. The framework itself, however, is not entirely novel to researchers. Recent AIGC has made significant improvements over earlier research, primarily as a result of training more complex generative models on larger datasets, leveraging larger foundation-model architectures, and having access to abundant computational resources. For example, the base framework of GPT-3 is the same as GPT-2, but the performance improvement came from scaling the training data from the 38GB WebText corpus to the 570GB CommonCrawl-based corpus, and the model size from 1.5B to 175B parameters, which makes GPT-3 generalize to unseen tasks far more easily than GPT-2. With all the benefits the AI field has reaped from the increase in computing power and data, researchers are now experimenting with more techniques to make AI feel closer to human and improve it over time. For example, ChatGPT uses Reinforcement Learning from Human Feedback (RLHF).
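To give a flavor of how RLHF trains its reward model, here is a minimal sketch of the pairwise preference loss described in the InstructGPT paper: the reward model is pushed to score the human-preferred response above the rejected one. The reward values below are made-up stand-ins for a real reward model's outputs.

```python
# Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).
# The reward values are hypothetical; in practice they come from a
# language-model head scoring full (prompt, response) pairs.
import torch
import torch.nn.functional as F

r_chosen = torch.tensor([1.3, 0.7], requires_grad=True)    # preferred responses
r_rejected = torch.tensor([0.2, 0.9], requires_grad=True)  # rejected responses

loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()   # in real training, gradients update the reward model
print(loss.item())
```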
AIGC can be unimodal, where the model receives input in a single modality, such as text or an image, and is asked to generate output in that same modality, or cross-modal, where a model such as DALL-E 2 takes text as input and produces an image.
In the late 1950s, the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM) were developed to learn from sequential data such as text and time series. At that time, however, computing resources were too limited for such models to shine; it was not until deep learning arrived that generative models saw significant improvement. Natural Language Processing (NLP) started with n-gram models, which estimate a word distribution from counts of N-token sequences and search for the most likely sentence. However, this type of model does not adapt well to long sequences of text. This led to the advent of Recurrent Neural Networks (RNNs), allowing for longer dependencies. RNNs still struggled with long dependencies, which led to the Long Short-Term Memory (LSTM) network in 1997 and, later, the Gated Recurrent Unit (GRU). These brought significant improvement on the benchmarks and quickly became the mainstream, go-to deep learning frameworks for sequences.

Meanwhile, computer vision started picking up steam on image data. Before deep learning methods for image understanding, practitioners relied on image synthesis and mapping techniques together with hand-engineered features; for example, manufacturing practitioners would construct features such as volume, dimensions, number of spheres, and number of holes. Yann LeCun developed the Convolutional Neural Network (CNN) in the 1980s. CNNs provided a fast, automated way of extracting features from images and brought significant improvement in detection and segmentation accuracy while increasing prediction speed. Still, the problem of going beyond detection to generative AI persisted. In 2014, Generative Adversarial Networks (GANs) were invented, alongside the Variational Autoencoder (VAE). These methods opened the door to generating high-quality, immersive images.
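To make the n-gram idea concrete, here is a tiny bigram (N=2) model in Python; the corpus and the greedy, smoothing-free prediction are toy assumptions.

```python
# Toy bigram language model: count adjacent word pairs, then predict the
# most frequent next word. The corpus is a made-up stand-in for real text.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word(prev):
    """Greedy prediction: the most frequent word that follows `prev`."""
    return counts[prev].most_common(1)[0][0]

print(next_word("the"))   # -> 'cat', the most frequent continuation
```

A real n-gram system would use higher-order counts with smoothing and a beam search over whole sentences, but the core word-distribution idea is the same.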
Transformers became a huge improvement over LSTMs and GRUs, both in latency and in accuracy on ultra-long sequences. In one of my blog posts, I conducted a study on how long an LSTM can really remember, and it could not even surpass a sequence of 20 tokens. The attention mechanism in the transformer became foundational to how sequential deep learning works and powered many of the latest innovations, such as GPT-3, DALL-E 2, and Codex. A transformer is made up of an encoder and a decoder: the encoder receives the input sequence and creates hidden representations, while the decoder receives those hidden representations and produces the output sequence. Transformers thus became fundamental to pre-training, where a transformer generates encodings that can be used as features for downstream tasks or fine-tuned on individual tasks. With the availability of large open-source data, pre-trained open-source models such as BERT and RoBERTa naturally followed. These models were trained on masked data, where the task was to predict the probability of the masked token given the contextual information. This helped with next-sentence classification and other long-range sequential tasks. RoBERTa improved on BERT by reusing the same architecture while training on a larger corpus with more challenging pre-training objectives. For autoregressive tasks, where the model's job is to predict the next token in the sequence given prior information, GPT-style models were created, and these became essential for generative AI.
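To make the attention mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QK^T/sqrt(d))V; the random matrices stand in for learned projections of real token embeddings.

```python
# Minimal scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
# Random Q, K, V stand in for learned projections of token embeddings.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 8
Q, K, V = (rng.normal(size=(seq_len, d)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d)                    # pairwise token affinities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V                             # each token mixes all others

print(output.shape)   # (5, 8): every position attends to the whole sequence
```

Because every position attends to every other position in a single matrix multiply, there is no recurrence to unroll, which is why transformers parallelize so well.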
Here is a list of the main families of generative AI models:
Decoder models: GPT models are great examples of decoder models. They work in an autoregressive manner: given a sequence of tokens, they predict the next one, relying on the self-attention mechanism to process all tokens in the sequence at once. GPT-2 and GPT-3 fundamentally maintain the same paradigm while scaling up the parameter count on larger datasets. Gopher made small modifications, adding a residual connection and replacing layer normalization with RMSNorm. BLOOM shares the same architecture except that it uses a full attention network instead of sparse attention to fully exploit long-range dependencies.
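The autoregressive loop itself is simple: feed the sequence in, take the most likely next token, append it, and repeat. Below is a toy sketch where a hypothetical hand-coded scoring function stands in for a trained transformer's next-token logits.

```python
# Toy greedy autoregressive decoding in the GPT spirit. The scoring function
# is a hypothetical stand-in for a trained model's next-token logits.
import numpy as np

vocab = ["<eos>", "the", "cat", "sat"]

def next_token_logits(tokens):
    """Hypothetical 'model': favors the sequence 'the cat sat', then stops."""
    canned = {(): 1, (1,): 2, (1, 2): 3}       # context -> favored token id
    favored = canned.get(tuple(tokens), 0)     # default to <eos>
    logits = np.full(len(vocab), -5.0)
    logits[favored] = 5.0
    return logits

tokens = []
while True:
    tok = int(np.argmax(next_token_logits(tokens)))   # greedy decoding
    if tok == 0:                                      # stop at <eos>
        break
    tokens.append(tok)

print(" ".join(vocab[t] for t in tokens))   # -> "the cat sat"
```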
Encoder-Decoder models:
These models are most widely used for machine translation in the text domain. The benefit of such a text-to-text encoder-decoder architecture is that these models can learn temporal, sequential, and spatio-temporal information; a plain dense neural network would essentially lack information about position and order. Sequential models like LSTMs, GRUs, and RNNs are widely used for embedding sequential data into a dense vector, but, as mentioned earlier, RNNs cannot retain long-sequence information. The Text-to-Text Transfer Transformer (T5) instead uses attention blocks and represents both input and output in a standard text format, which allows T5 to train for use cases like machine translation, text summarization, and question answering. Transformers are blazingly fast compared to RNNs thanks to their ability to parallelize rather than compute state sequentially. Google later published ExT5 in 2021, which further scales up performance.
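As a quick illustration of the text-to-text interface, here is a minimal sketch using the Hugging Face transformers library; the "t5-small" checkpoint and the task prefix are assumptions about one common setup, not something specific to this article.

```python
# Minimal T5 text-to-text usage via Hugging Face `transformers`; assumes the
# "t5-small" checkpoint and an internet connection for the first download.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5 frames every task as text-to-text by prepending a task prefix.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same model handles summarization or question answering simply by swapping the prefix, which is the core appeal of the text-to-text framing.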
A more advanced SOTA model is BERT. The main limitation of standard language models is that they assume dependencies run in only one direction. BERT employs bidirectional dependencies using the Masked Language Model (MLM) pre-training objective: the model randomly masks some of the tokens in the input, and the objective is to predict the original vocabulary id of each masked word based only on its context. Unlike left-to-right language-model pre-training, the MLM objective lets the representation fuse the left and the right context, which allows pre-training a deep bidirectional transformer. In addition to the masked language model, BERT also uses a “next sentence prediction” task that jointly pre-trains text-pair representations. Next, BART was published in 2019, which blends BERT's bidirectional nature with GPT's autoregressive paradigm for general tasks. BART was trained on large amounts of text data using a denoising autoencoder approach, where the model learns to reconstruct clean sentences from noisy versions. This approach helps the model generate fluent and coherent text even in the presence of noise or errors. HTML-BART applies the same denoising objective to model HTML tags and element information. Finally, DQ-BART was published, which quantized BART to reduce the model size while maintaining state-of-the-art performance on the downstream tasks.
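Here is a toy illustration of the MLM objective: mask roughly 15% of the tokens and train the model to recover the originals at only those positions. The sentence is made up, and the "model" itself is omitted; this only shows how inputs and targets are formed.

```python
# Toy masked-language-model data preparation: mask ~15% of tokens; the loss
# is computed only at the masked positions. Sentence and seed are arbitrary.
import random

random.seed(1)
tokens = "the cat sat on the mat and watched the dog".split()

masked, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < 0.15:      # mask ~15% of positions
        masked.append("[MASK]")
        targets[i] = tok            # the model must predict these
    else:
        masked.append(tok)

print("input  :", " ".join(masked))
print("targets:", targets)          # position -> original token
```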
GANs:
GANs first became popular on image data. A GAN pairs two networks. The generator network is trained to generate synthetic data that is similar to the training data; for example, a GAN can be trained on a dataset of images and then generate new images that resemble the training data. The discriminator network is trained to distinguish between the real data and the synthetic data produced by the generator. During training, the two are trained in an adversarial way: the generator tries to produce data that can fool the discriminator, while the discriminator tries to correctly classify real versus synthetic data. This process continues until the generator produces synthetic data that is indistinguishable from the real data (a minimal training-loop sketch follows the list of GAN variants below). GANs have been used to generate realistic images, video, audio, and text. They have also been used for tasks such as image-to-image translation, where the generator is trained to translate images from one domain to another, such as from a daytime scene to a nighttime scene. Here are the different types of GANs:
- LAPGAN — LAPGAN stands for Laplacian Pyramid Generative Adversarial Network; it generates high-resolution images from low-resolution ones. The LAPGAN model consists of a series of GANs, where each GAN is responsible for generating images at a specific resolution level. The first GAN generates a low-resolution image, which is then upsampled and passed to the next GAN to generate a higher-resolution image. This process continues until the final GAN generates a high-resolution image. LAPGAN has been used to generate high-quality images, such as faces and landscapes, from low-resolution inputs. It has also been used for other tasks, such as image super-resolution and inpainting, where missing parts of an image are filled in based on the surrounding context.
- DCGANs — DCGANs were introduced in a research paper titled “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” by Radford et al. in 2015. The paper demonstrated that DCGANs can generate high-quality images that resemble the training data. DCGANs have several advantages over traditional GANs, such as better stability during training and the ability to generate high-resolution images with more details and sharper edges. They have been used to generate high-quality images of faces, animals, and other objects, as well as for image-to-image translation tasks, such as converting a grayscale image to a color image.
- SAGAN — Self-Attention Generative Adversarial Network was introduced in a research paper titled “Self-Attention Generative Adversarial Networks” by Zhang et al. in 2018. The paper demonstrated that SAGAN can generate high-quality images with better visual coherence and fewer artifacts than traditional GANs. The self-attention mechanism used in SAGAN allows the generator to focus on different parts of the image when generating each pixel. This enables the generator to better capture the global structure and dependencies of the image, resulting in more realistic and coherent images. In addition to the self-attention mechanism, SAGAN also uses spectral normalization to stabilize the training process and prevent mode collapse, which is a common problem in GANs where the generator produces a limited set of outputs.
- StyleGAN — The Style-Based Generative Adversarial Network was introduced in a research paper titled “A Style-Based Generator Architecture for Generative Adversarial Networks” by Karras et al. in 2019. The paper demonstrated that StyleGAN can generate high-quality images with a high degree of control over their appearance. StyleGAN's key innovation is a style-based generator architecture, which allows the generator to control the style and structure of the generated images at different scales. This enables the generator to produce images with fine-grained details and textures while also controlling their overall appearance and style.
- D2GAN and GMAN — The Discrete Distribution Generative Adversarial Network (D2GAN) was introduced in a research paper titled “Discrete Generative Adversarial Networks” by Jang et al. in 2016, demonstrating that D2GAN can generate high-quality discrete data, such as text and images with discrete color palettes. Gradient-Based Mutual Adaptation (GMAN) networks demonstrated that high-quality image-to-image translations can be generated without the need for paired training data. Overall, D2GAN and GMAN are both powerful techniques for generative modeling that have shown impressive results in generating high-quality discrete data and image-to-image translations, respectively.
- CoGAN — Coupled Generative Adversarial Networks (CoGAN) were introduced by Liu and Tuzel in 2016. CoGAN learns joint distributions over multiple domains by coupling a pair of GANs, one per domain. The key innovation of CoGAN is weight sharing across the coupled networks, which lets each GAN learn its own domain while enforcing shared high-level features between the two. This shared architecture helps to ensure that the generated images from both domains have consistent and meaningful relationships.
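To make the adversarial game concrete, here is a minimal PyTorch sketch of one possible GAN training loop on toy 1-D data; the tiny architectures, learning rates, and the N(3, 0.5) "real" distribution are all illustrative assumptions.

```python
# Minimal GAN training loop on toy 1-D data: G maps 8-D noise to a sample,
# D scores how "real" a sample looks (as a logit).
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(500):
    real = torch.randn(64, 1) * 0.5 + 3.0        # "real" data: N(3, 0.5)
    fake = G(torch.randn(64, 8))                 # generator's synthetic batch

    # Discriminator step: label real as 1, fake as 0.
    d_loss = (bce(D(real), torch.ones(64, 1))
              + bce(D(fake.detach()), torch.zeros(64, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make D label fakes as real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(G(torch.randn(5, 8)).detach().squeeze())   # samples should drift toward 3
```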
Variational Autoencoder (VAE)
Variational Autoencoders (VAEs) are generative models that map data to a probabilistic distribution and learn a reconstruction that is close to the original input. Unlike traditional autoencoders, however, VAEs also model the distribution of the latent space using probabilistic methods. This is achieved by enforcing a prior distribution, typically a normal distribution, on the latent space, and using the encoder network to model the mean and variance of the latent distribution. The latent representation of an input is therefore not a single point in the space but a probability distribution over it. During training, the VAE tries to minimize the reconstruction loss while also keeping the modeled latent distribution close to the prior. This is done by adding a regularization term to the loss function, the Kullback-Leibler (KL) divergence, which measures the difference between the modeled distribution and the prior. VAEs have been shown to be effective in a variety of applications, such as image generation, data compression, and anomaly detection. They have also been used in combination with other deep learning techniques, such as GANs, to generate high-quality images with more diverse and complex structures.
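Here is a minimal PyTorch sketch of the VAE objective on toy data: a reconstruction term plus the closed-form KL divergence between the encoder's Gaussian and a standard-normal prior. The single-layer encoder/decoder and the dimensions are illustrative assumptions.

```python
# VAE objective sketch: reconstruction loss + KL(q(z|x) || N(0, I)).
import torch
import torch.nn as nn

enc = nn.Linear(4, 2 * 2)        # outputs mean and log-variance of a 2-D latent
dec = nn.Linear(2, 4)            # decodes the latent back to the input space

x = torch.randn(16, 4)           # toy batch standing in for real data
mu, logvar = enc(x).chunk(2, dim=-1)

# Reparameterization trick: z = mu + sigma * eps keeps gradients flowing.
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
x_hat = dec(z)

recon = ((x_hat - x) ** 2).sum(dim=-1).mean()                        # reconstruction
kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(-1)).mean()   # KL term
loss = recon + kl                # the negative ELBO that training minimizes
loss.backward()                  # would jointly train encoder and decoder
print(loss.item())
```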
Flow
Flow-based models learn an invertible transformation between a simple base distribution, such as a Gaussian, and a complex target distribution, such as a distribution over images or text. The learned transformation is used to generate new samples from the target distribution. The key idea behind flows is to learn a sequence of invertible transformations that map the base distribution to the target distribution. Each transformation is typically composed of simple functions, such as affine transformations or permutations, applied in a fixed order. The composition of these transformations can be efficiently computed and inverted, allowing for the generation of new samples from the target distribution. During training, the flow model tries to minimize the difference between the target distribution and the distribution of the generated samples, typically measured using the negative log-likelihood of the data, by optimizing the parameters of the transformation functions with gradient-based methods. Flow models have been applied to a variety of tasks, such as image generation, text generation, and density estimation. They have been shown to generate high-quality samples that are often more diverse and visually appealing than those from other generative models, such as VAEs and GANs.
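The exact likelihood comes from the change-of-variables rule: log p_x(x) = log p_z(f^{-1}(x)) + log |det J|, where J is the Jacobian of the inverse transform. Here is a minimal NumPy sketch with a single fixed affine transform x = a*z + b standing in for a learned stack of invertible layers.

```python
# Change-of-variables rule behind flow models, with one invertible affine
# transform x = a*z + b on 1-D data. Parameters a, b are assumed fixed here;
# real flows learn many such invertible layers.
import numpy as np

a, b = 2.0, 1.0

def log_prob_x(x):
    z = (x - b) / a                                  # invert the flow: x -> z
    log_pz = -0.5 * (z ** 2 + np.log(2 * np.pi))     # standard-normal log-density
    log_det = -np.log(abs(a))                        # log |dz/dx|, the Jacobian term
    return log_pz + log_det                          # log p_x(x)

# Training a flow maximizes this log-likelihood over the data.
print(log_prob_x(np.array([0.0, 1.0, 3.0])))
```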
Diffusion
Denoising diffusion probabilistic models (DDPMs) are a type of generative model used in machine learning to generate high-quality images and other types of data. Diffusion models build on the idea of a diffusion process, where noise is added to an image or other data and then gradually removed over time through a series of steps.
In a diffusion model, an image is transformed into a latent representation that can be easily manipulated and then transformed back into an image. The transformation process involves gradually adding noise to this representation and then gradually removing it. The diffusion process can be modeled as a Markov chain, where each step represents a conditional distribution that depends on the previous step. During training, the diffusion model learns the parameters of the Markov chain that describe the reverse (denoising) process. The model is trained to maximize the likelihood of the data; in practice this is done by maximizing a variational lower bound, which for DDPMs reduces to a simple mean-squared-error objective between the true noise added at a step and the noise the model predicts. Diffusion models have been shown to be effective in generating high-quality images with rich and complex structures. They have also been used for image inpainting, super-resolution, and style transfer, and have demonstrated impressive results compared to other generative models, such as GANs, particularly in terms of image quality and stability during training.
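The forward (noising) process has a convenient closed form, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, which lets training jump straight to any step t. Below is a minimal NumPy sketch; the linear schedule, step count, and 8-pixel "image" are illustrative assumptions.

```python
# DDPM forward (noising) process in closed form. The reverse model, omitted
# here, is trained to predict eps from (x_t, t) and undo this corruption.
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal-retention factor

x0 = rng.normal(size=(8,))               # stand-in for an image's pixels

def q_sample(x0, t):
    """Sample x_t from q(x_t | x_0) in one shot."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

print(q_sample(x0, t=99))                # near-pure noise at the last step
```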
These generative models can also be categorized by the modalities they bridge:
- Text-to-Audio Generation
- Text-to-Music Generation
- Text-to-Graph Generation
- Text-to-Knowledge-Graph Generation
- Knowledge-Graph-to-Text Generation
- Semantic Parsing
- Text-to-Code Generation
Here is the list of all AIGC use cases and the corresponding models.
While there is strong hype, and also organic use cases, for AIGC, there are still several fundamentals to consider for trustworthy and responsible AIGC. One of the most important factors is the accuracy of the generated response. Tools like ChatGPT generate text that may sound accurate but can be completely fabricated, since NLP and ML models really do not have intelligence built into them. Even today they are mere data-compression tools, essentially collections of functions. Functions cannot think, and fitting large functions to data only guarantees reproducing that data when fit perfectly, whereas real-world examples usually come from unseen data and the confounding effects that generate it. Humans have the thinking capability and cognitive capacity to understand, interpret, reason, and take action; that is one of the primary reasons these so-called “SOTA” models fail at high-level math exams. As long as the output is grammatically correct, these models will hallucinate with confidence even when the content is inaccurate, which can pose a serious threat to society. Online platforms such as StackOverflow have banned AIGC to mitigate the risk of spreading misinformation, a threat not only to StackOverflow but also to its users.

Another problem with AIGC is biased and toxic responses. Since AIGC responses are based on the data the models were trained on, the biases inherent in the real world persist. For example, if you search for images of a coin toss, most of them show a male hand holding the coin, because historically males made up a higher proportion of earners than females. InstructGPT from OpenAI is based on human feedback, and this feedback guides future responses; toxic and stereotyped Q&A can result in the AI learning such data and using it to target other users. To combat inaccurate information, Google proposed LaMDA and Bard, and everyone knows how that went!
As this is still an active area of research, the progress made in the past decade has been immense, which can be attributed to the significant reduction in AI training costs and the growth of AI talent over time. Despite some issues with current solutions, including training on sensitive data, poor generalization beyond the training datasets, and a lack of cognitive intelligence, there are vast opportunities for research to advance this field. I am eagerly anticipating future model innovations that can achieve human-level intelligence, which would not only assist us but also enhance our abilities in all aspects of life. This is the best time to be alive!
Cite:
@article{dkatariya2023ChatGPT,
title={The Evolution of AI-Generated Content Systems: From Markov Models to ChatGPT},
author={Katariya, Dwipam},
journal={Medium, Analytics Vidhya},
volume={2},
year={2023}
}
References:
Jacob Devlin, Ming-Wei Chang, Kenton Lee, & Kristina Toutanova. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
M. Stefanini, M. Cornia, L. Baraldi, S. Cascianelli, G. Fiameni, and R. Cucchiara, “From Show to Tell: A Survey on Deep Learning-based Image Captioning,” Nov. 2021. arXiv:2107.06912 [cs].
P. P. Liang, A. Zadeh, and L.-P. Morency, “Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions,” Sept. 2022. arXiv:2209.03430 [cs].
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” Mar. 2022. arXiv:2203.02155 [cs].
P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep Reinforcement Learning from Human Preferences,” in Advances in Neural Information Processing Systems, vol. 30, Curran Associates, Inc., 2017.
M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” arXiv preprint arXiv:1910.13461, 2019.
Katariya, D. (2021). What length of dependencies can LSTM & T-CNN really remember? Medium, Towards Data Science, 3.
X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, “Pre-trained models for natural language processing: A survey,” Science China Technological Sciences, vol. 63, no. 10, pp. 1872–1897, 2020.
N. De Cao, W. Aziz, and I. Titov, “Block neural autoregressive flow,” in Uncertainty in Artificial Intelligence, pp. 1263–1273, PMLR, 2020.
Xu, J. (2020). Flow-based Deep Generative Models.
Cao, Y., Li, S., Liu, Y., Yan, Z., Dai, Y., Yu, P. S., & Sun, L. (2023). A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT. ArXiv [Cs.AI].