One key aspect of intelligence is the ability to quickly learn how to perform a new task when given a brief instruction. For instance, a child may recognise real animals at the zoo after seeing a few pictures of the animals in a book, despite differences between the two. But for a typical visual model to learn a new task, it must be trained on tens of thousands of examples specifically labelled for that task. If the goal is to count and identify animals in an image, as in “three zebras”, one would have to collect thousands of images and annotate each image with the animals’ quantity and species. This process is inefficient, expensive, and resource-intensive: it requires large amounts of annotated data, and a new model must be trained each time the system is confronted with a new task. As part of DeepMind’s mission to solve intelligence, we’ve explored whether an alternative model could make this process easier and more efficient, given only limited task-specific information.

Today, in the preprint of our paper, we introduce Flamingo, a single visual language model (VLM) that sets a new state of the art in few-shot learning on a wide range of open-ended multimodal tasks. This means Flamingo can tackle a number of difficult problems with just a handful of task-specific examples (in a “few shots”), without any additional training required. Flamingo’s simple interface makes this possible: it takes as input a prompt consisting of interleaved images, videos, and text, and then outputs associated language.

Similar to the behaviour of large language models (LLMs), which can address a language task by processing examples of the task in their text prompt, Flamingo’s visual and text interface can steer the model towards solving a multimodal task. Given a few example pairs of visual inputs and expected text responses composed in Flamingo’s prompt, the model can be asked a question with a new image or video and then generate an answer; the first sketch at the end of this post illustrates what such an interleaved prompt looks like.

Figure 2. Left: few-shot performance of Flamingo across 16 different multimodal tasks against task-specific state-of-the-art performance. Right: examples of expected inputs and outputs for three of our 16 benchmarks.

In practice, Flamingo fuses large language models with powerful visual representations – each separately pre-trained and frozen – by adding novel architectural components in between; the second sketch below illustrates this frozen-backbone pattern. It is then trained on a mixture of complementary large-scale multimodal data coming only from the web, without using any data annotated for machine learning purposes. Following this method, we start from Chinchilla, our recently introduced compute-optimal 70B parameter language model, to train our final Flamingo model, an 80B parameter VLM. Once this training is done, Flamingo can be directly adapted to vision tasks via simple few-shot learning, without any additional task-specific tuning.

We also tested the model’s qualitative capabilities beyond our current benchmarks. As part of this process, we compared our model’s performance when captioning images related to gender and skin colour, and ran our model’s generated captions through Google’s Perspective API, which evaluates the toxicity of text; the final sketch below shows how such a toxicity check can be scripted.
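To make the interleaved prompting above concrete, here is a minimal sketch of how such a few-shot prompt could be assembled. Every name in it (`ImageSegment`, `build_few_shot_prompt`, the commented-out `model.generate` call) is a hypothetical stand-in, since the Flamingo model and its API are not publicly released; only the structure of the prompt reflects what the post describes.

```python
# Hypothetical sketch of a Flamingo-style few-shot prompt: support examples
# of (image, expected text) pairs interleaved with a final query image.
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class ImageSegment:
    path: str  # placeholder; a real system would carry pixel data


Segment = Union[str, ImageSegment]


def build_few_shot_prompt(
    examples: List[Tuple[str, str]], query_image: str
) -> List[Segment]:
    """Interleave (image, caption) support pairs, then append the query image."""
    prompt: List[Segment] = []
    for image_path, caption in examples:
        prompt.append(ImageSegment(image_path))
        prompt.append(f"Output: {caption}")
    # The model is expected to continue the text after the query image.
    prompt.append(ImageSegment(query_image))
    prompt.append("Output:")
    return prompt


# Two support examples steer the model towards "count and name the animals".
prompt = build_few_shot_prompt(
    examples=[("zebras.jpg", "three zebras"), ("pandas.jpg", "two pandas")],
    query_image="giraffes.jpg",
)
# completion = model.generate(prompt)  # hypothetical call, e.g. "four giraffes"
```

Note that no task-specific training happens here: the support pairs alone tell the model what kind of answer is wanted.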
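The frozen-backbone idea in the architecture paragraph can also be sketched. The following PyTorch snippet is an illustration under stated assumptions, not the actual Flamingo implementation: the dimensions are arbitrary, stand-in layers replace the real pretrained Chinchilla and vision encoder, and a real model would interleave many such cross-attention blocks with the frozen language-model layers.

```python
# Sketch of the frozen-backbone pattern: pretrained components stay frozen,
# and only newly inserted cross-attention layers receive gradients.
import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    """Trainable block letting text states attend to visual features.

    The tanh gate is initialised at zero, so at the start of training the
    block acts as an identity and the frozen LM's behaviour is preserved.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_states, visual_features):
        attended, _ = self.attn(text_states, visual_features, visual_features)
        return text_states + torch.tanh(self.gate) * attended


def freeze(module: nn.Module) -> nn.Module:
    for param in module.parameters():
        param.requires_grad = False
    return module


dim = 512
# Stand-ins for the separately pre-trained, frozen backbones.
language_layer = freeze(nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True))
vision_encoder = freeze(nn.Linear(768, dim))  # placeholder for a ViT-style encoder

xattn = GatedCrossAttentionBlock(dim)  # the only trainable component here

text = torch.randn(1, 16, dim)                    # token states
visual = vision_encoder(torch.randn(1, 64, 768))  # image patch features
out = language_layer(xattn(text, visual))
print("trainable params:", sum(p.numel() for p in xattn.parameters()))
```

Freezing both backbones means training only has to fit the new connective components, which is what lets the separately pre-trained language and vision capabilities carry over intact.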
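Finally, a rough illustration of the toxicity check mentioned in the last paragraph. The endpoint and request shape below follow the publicly documented Perspective API; the captions and API key are placeholders.

```python
# Score model-generated captions with the Perspective API's TOXICITY attribute.
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def toxicity_score(text: str, api_key: str) -> float:
    """Return the Perspective API TOXICITY summary score in [0, 1]."""
    body = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=body)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


# Placeholder captions; a real evaluation would run every generated caption
# through the scorer and compare score distributions across subgroups.
captions = ["a person walking a dog on the beach", "two children playing football"]
# scores = [toxicity_score(c, api_key="YOUR_API_KEY") for c in captions]
```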