Gemini: A Family of Highly Capable Multimodal Models
Abstract
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks — notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Introduction
We present Gemini, a family of highly capable multimodal models developed at Google. We trained Gemini models jointly across image, audio, video, and text data for the purpose of building a model with both strong generalist capabilities across modalities alongside cutting-edge understanding and reasoning performance in each respective domain.
Gemini 1.0, our first version, comes in three sizes: Ultra for highly-complex tasks, Pro for enhanced performance and deployability at scale, and Nano for on-device applications. Each size is specifically tailored to address different computational limitations and application requirements.
After large-scale pre-training, we post-train our models to improve overall quality, enhance target capabilities, and ensure alignment and safety criteria are met. Due to the varied requirements of our downstream applications, we have produced two post-trained Gemini model family variants. Chat-focused variants, referred to as Gemini Apps models, are optimized for Gemini and Gemini Advanced, our conversational AI service formerly known as Bard. Developer-focused variants, referred to as Gemini API models, are optimized for a range of products and are accessible through Google AI Studio and Cloud Vertex AI.
We evaluate the performance of pre- and post-trained Gemini models on a comprehensive suite of internal and external benchmarks covering a wide range of language, coding, reasoning, and multimodal tasks.
The Gemini family advances the state of the art in large-scale language modeling, image understanding, audio processing, and video understanding. It also builds on work in sequence models, a long history of deep learning based on neural networks, and distributed machine-learning systems that enable large-scale training.
Our most capable model, Gemini Ultra, achieves new state-of-the-art results in 30 of 32 benchmarks we report on, including 10 of 12 popular text and reasoning benchmarks, 9 of 9 image understanding benchmarks, 6 of 6 video understanding benchmarks, and 5 of 5 speech recognition and speech translation benchmarks. Gemini Ultra is the first model to achieve human-expert performance on MMLU — a prominent benchmark testing knowledge and reasoning via a suite of exams — with a score above 90%. Beyond text, Gemini Ultra makes notable advances on challenging multimodal reasoning tasks. For example, on the recent MMMU benchmark, which comprises questions about images on multi-discipline tasks requiring college-level subject knowledge and deliberate reasoning, Gemini Ultra achieves a new state-of-the-art score of 62.4%, outperforming the previous best model by more than 5 percentage points. It also provides a uniform performance lift across the video question answering and audio understanding benchmarks.
Qualitative evaluation showcases impressive cross-modal reasoning capabilities, enabling the model to understand and reason across an input sequence of audio, images, and text natively (see Figure 5 and Table 13). Consider the educational setting depicted in Figure 1 as an example. A teacher has drawn a physics problem of a skier going down a slope, and a student has worked through a solution to it. Using Gemini models' multimodal reasoning capabilities, the model is able to read the messy handwriting, correctly understand the problem formulation, convert both the problem and solution to mathematical typesetting, identify the specific step of reasoning where the student went wrong in solving the problem, and then give a worked-through correct solution to the problem. This opens up exciting educational possibilities, and we believe the new multimodal and reasoning capabilities of Gemini models have promising applications across many fields.
The reasoning capabilities of large language models show promise toward building generalist agents that can tackle more complex multi-step problems. The AlphaCode team built AlphaCode 2, a new Gemini-model-powered agent that combines Gemini models' reasoning capabilities with search and tool-use to excel at solving competitive programming problems. AlphaCode 2 ranks within the top 15% of entrants on the Codeforces competitive programming platform, a large improvement over its state-of-the-art predecessor, which ranked in the top 50%.
In tandem, we advance the frontier of efficiency with Gemini Nano, a series of small models targeting on-device deployment. These models excel in on-device tasks such as summarization, reading comprehension, and text completion, and exhibit impressive capabilities in reasoning, STEM, coding, multimodal, and multilingual tasks relative to their sizes.
In the following sections, we first provide an overview of the model architecture, training infrastructure, and pre-training dataset. We then present detailed evaluations of the pre- and post-trained Gemini model family, covering well-studied benchmarks across text, code, image, audio and video — which include both English performance and multilingual capabilities. Next we discuss our approach to post-training, highlight common and distinct aspects of the Gemini Apps and Gemini API model variants, and benchmark their performance on key capabilities. Responsible deployment is critical: we explain our process for impact assessments, developing model policies, evaluations, and mitigations of harm before deployment decisions. Finally, we discuss the broader implications of Gemini models, their limitations alongside their potential applications — paving the way for a new era of research and innovation in AI.
Model Architecture
Gemini models build on top of Transformer decoders that are enhanced with improvements in architecture and model optimization to enable stable training at scale and optimized inference on Google's Tensor Processing Units. They are trained to support 32k context length, employing efficient attention mechanisms. Our first version, Gemini 1.0, comprises three main sizes to support a wide range of applications as discussed in Table 1.
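One family of efficient attention mechanisms used to make long-context decoding tractable is multi-query attention, where all query heads share a single key/value head, shrinking the key/value cache that must be kept in accelerator memory at inference time. The report does not specify Gemini's exact attention variant, so the sketch below is purely illustrative: it compares KV-cache sizes for standard multi-head versus multi-query attention at a 32k context length, using hypothetical model dimensions.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_param=2):
    # Per token, each layer stores one key and one value vector per KV head.
    return seq_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_param

# Hypothetical dimensions (not Gemini's actual configuration).
SEQ, LAYERS, HEADS, HEAD_DIM = 32_768, 48, 32, 128

mha = kv_cache_bytes(SEQ, LAYERS, n_kv_heads=HEADS, head_dim=HEAD_DIM)  # multi-head
mqa = kv_cache_bytes(SEQ, LAYERS, n_kv_heads=1, head_dim=HEAD_DIM)      # multi-query

print(f"multi-head KV cache:  {mha / 2**30:.2f} GiB")   # 24.00 GiB
print(f"multi-query KV cache: {mqa / 2**30:.2f} GiB")   # 0.75 GiB
print(f"reduction factor: {mha // mqa}x")               # 32x
```

The 32x cache reduction here equals the number of query heads, which is what makes serving long contexts at scale practical.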
Gemini models are trained to accommodate textual input interleaved with a wide variety of audio and visual inputs, such as natural images, charts, and video, and they can produce text and image outputs (see Figure 2). The visual encoding of Gemini models is inspired by our own foundational work on Flamingo, CoCa, and PaLI, with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens.
Video understanding is accomplished by encoding the video as a sequence of frames in the large context window. Video frames or images can be interleaved naturally with text or audio as part of the model input. The models can handle variable input resolution in order to spend more compute on tasks that require fine-grained understanding. In addition, Gemini models can directly ingest audio signals at 16kHz from Universal Speech Model (USM) features. This enables the model to capture nuances that are typically lost when the audio is naively mapped to a text input.
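Encoding video as a sequence of frames in the context window implies a budget trade-off between clip length, sampling rate, and tokens spent per frame. The figures below (tokens per frame, frame rate, reserved prompt tokens) are hypothetical and only illustrate the accounting, not Gemini's actual video encoding.

```python
def max_video_seconds(context_len, tokens_per_frame, fps, reserved_text_tokens):
    """How many seconds of video fit alongside a text prompt in the context."""
    budget = context_len - reserved_text_tokens
    frames = budget // tokens_per_frame
    return frames / fps

# Hypothetical: 256 tokens per frame, sampled at 1 frame/second,
# reserving 2k tokens for the text prompt and the response.
secs = max_video_seconds(32_768, tokens_per_frame=256, fps=1, reserved_text_tokens=2_048)
print(f"{secs:.0f} seconds of video fit in the context window")  # 120 seconds
```

Halving tokens per frame (coarser visual detail) doubles the clip length that fits, which is one reason variable input resolution is useful: compute can be concentrated where fine-grained understanding is needed.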
Training the Gemini family of models required innovations in training algorithms, dataset, and infrastructure. For the Pro model, the inherent scalability of our infrastructure and learning algorithms enable us to complete pre-training in a matter of weeks, leveraging a fraction of the Ultra's resources. The Nano series of models leverage additional advancements in distillation and training algorithms to produce the best-in-class small language models for a wide variety of tasks, such as summarization and reading comprehension, which power our next generation on-device experiences.
Training Infrastructure
We trained Gemini models using TPUv5e and TPUv4, depending on their sizes and configuration. Training Gemini Ultra used a large fleet of TPUv4 accelerators owned by Google across multiple datacenters. This represents a significant increase in scale over our prior flagship model PaLM-2, and it presented new infrastructure challenges. Scaling up the number of accelerators results in a proportionate decrease in the mean time between hardware failures in the overall system. We minimized the rate of planned reschedules and preemptions, but genuine machine failures are commonplace across all hardware accelerators at such large scales.
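The proportionate decrease in mean time between failures follows from independent failure rates adding up: with N machines each failing at rate 1/MTBF, the fleet as a whole fails at rate N/MTBF. The per-machine MTBF below is a hypothetical round number, used only to show how quickly fleet-level MTBF shrinks.

```python
def system_mtbf_hours(machine_mtbf_hours, n_machines):
    # Failure rates of independent machines add, so fleet MTBF shrinks as 1/N.
    return machine_mtbf_hours / n_machines

# Hypothetical: each machine has a 10-year MTBF on its own.
machine_mtbf = 10 * 365 * 24  # 87,600 hours
for n in (256, 4_096, 16_384):
    print(f"{n:>6} machines -> fleet MTBF ~ {system_mtbf_hours(machine_mtbf, n):.1f} h")
```

At tens of thousands of accelerators, even very reliable individual machines yield fleet-level failures every few hours, which is why fast recovery dominates the goodput discussion below.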
TPUv4 accelerators are deployed in "SuperPods" of 4096 chips, each connected to a dedicated optical switch, which can dynamically reconfigure 4x4x4 chip cubes into arbitrary 3D torus topologies in around 10 seconds. For Gemini Ultra, we decided to retain a small number of cubes per superpod to allow for hot standbys and rolling maintenance.
TPU accelerators primarily communicate over the high speed inter-chip-interconnect, but at Gemini Ultra scale, we combine SuperPods in multiple datacenters using Google's intra-cluster and inter-cluster network. Google's network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within SuperPods and data-parallelism across SuperPods.
The 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow. The GSPMD partitioner in the XLA compiler partitions the training step computation, and the MegaScale XLA compiler pass statically schedules appropriate collectives so that they maximally overlap with the computation, with very little variation in step time.
Maintaining a high goodput at this scale would have been impossible using the conventional approach of periodic checkpointing of weights to persistent cluster storage. For Gemini models, we instead made use of redundant in-memory copies of the model state, and on any unplanned hardware failures, we rapidly recover directly from an intact model replica. Compared to both PaLM and PaLM-2, this provided a substantial speedup in recovery time, despite the significantly larger training resources being used. As a result, the overall goodput for the largest-scale training job increased from 85% to 97%.
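Goodput here is the fraction of wall-clock time spent making useful training progress, and faster recovery after a failure raises it directly. The toy model below, with hypothetical failure intervals and recovery costs, shows how cutting the time lost per failure (recovery latency plus recomputation since the last good state) moves goodput from the ~85% to the ~97% regime; the specific hour figures are illustrative, not measured values.

```python
def goodput(mean_hours_between_failures, lost_hours_per_failure):
    """Fraction of time spent on useful work, given time lost per failure
    (recovery latency plus recomputation since the last good state)."""
    cycle = mean_hours_between_failures + lost_hours_per_failure
    return mean_hours_between_failures / cycle

# Hypothetical numbers for illustration only.
disk_ckpt = goodput(24.0, lost_hours_per_failure=4.2)    # reload + recompute from persistent storage
in_memory = goodput(24.0, lost_hours_per_failure=0.75)   # recover from an intact in-memory replica
print(f"persistent-storage checkpoints: {disk_ckpt:.0%}")  # 85%
print(f"in-memory model-state replicas: {in_memory:.0%}")  # 97%
```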
Training at unprecedented scale invariably surfaces new and interesting systems failure modes, and in this instance one of the problems that we needed to address was that of "Silent Data Corruption (SDC)". Although these events are extremely rare, the scale of Gemini models means that we can expect SDC events to impact training every week or two. Rapidly detecting and removing faulty hardware required several new techniques that exploit deterministic replay to isolate incorrect computations, combined with proactive SDC scanners on idle machines and hot standbys. Our fully deterministic infrastructure allowed us to quickly identify root causes during the development leading up to the Ultra model, and this was a crucial ingredient towards stable training.
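With fully deterministic computation, a suspect result can be checked by replaying the same step and comparing outputs: any mismatch implicates the hardware rather than the program. The sketch below mimics that idea with a toy deterministic step function and a fault injected into one output; the function names are illustrative, not Gemini's actual tooling.

```python
import hashlib

def train_step(state, seed):
    """Stand-in for one deterministic training step: same inputs -> same output."""
    h = hashlib.sha256(f"{state}:{seed}".encode()).hexdigest()
    return int(h, 16) % 10**9

def detect_sdc(step_output, state, seed, replays=2):
    """Replay the step; if any replay disagrees, the original output was corrupted."""
    return any(train_step(state, seed) != step_output for _ in range(replays))

good = train_step(state=42, seed=7)
corrupted = good ^ 1  # simulate a bit flip introduced by faulty hardware
print(detect_sdc(good, 42, 7))       # False: replays match, output is trustworthy
print(detect_sdc(corrupted, 42, 7))  # True: replay exposes the corruption
```

The same replay idea localizes the fault: replaying on a different machine and getting the "good" answer singles out the original machine for removal.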
Pre-Training Dataset
Gemini models are trained on a dataset that is both multimodal and multilingual. Our pre-training dataset uses data from web documents, books, and code, and includes image, audio, and video data.
We use the SentencePiece tokenizer and find that training the tokenizer on a large sample of the entire training corpus improves the inferred vocabulary and subsequently improves model performance. For example, we find Gemini models can efficiently tokenize non-Latin scripts which can, in turn, benefit model quality as well as training and inference speed.
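Efficient tokenization of non-Latin scripts matters because a naive byte-level fallback multiplies sequence length: most non-Latin characters occupy 3 bytes in UTF-8, so a vocabulary learned on a representative multilingual corpus can assign single tokens to characters or whole words instead. The comparison below simply counts UTF-8 bytes versus characters to show the fallback penalty; it does not use Gemini's actual tokenizer.

```python
def utf8_bytes(text):
    return len(text.encode("utf-8"))

samples = {"English": "hello world", "Hindi": "नमस्ते दुनिया", "Chinese": "你好世界"}
for lang, text in samples.items():
    chars = len(text)
    # A byte-level fallback pays ~3 tokens per character for these scripts.
    print(f"{lang:>8}: {chars} chars, {utf8_bytes(text)} UTF-8 bytes "
          f"({utf8_bytes(text) / chars:.1f} bytes/char)")
```

Shorter token sequences for the same text translate directly into faster training and inference and more content per fixed-size context window.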
The number of tokens used to train the largest models was determined following the approach in Hoffmann et al. The smaller models are trained on significantly more tokens to improve performance for a given inference budget, similar to the approach advocated by Touvron et al.
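Hoffmann et al. found that compute-optimal training uses roughly 20 tokens per parameter, while the Touvron et al. approach deliberately trains smaller models far past that point to cheapen inference. The arithmetic sketch below contrasts the two regimes; the parameter counts and token budget are hypothetical, not Gemini's actual figures.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal token count (Hoffmann et al.'s ~20:1 rule of thumb)."""
    return n_params * tokens_per_param

# Hypothetical model sizes.
big = 500e9   # a large model: train near compute-optimal
small = 3e9   # a small model: overtrain for inference efficiency

print(f"large model, compute-optimal: {chinchilla_optimal_tokens(big) / 1e12:.0f}T tokens")
print(f"small model, compute-optimal: {chinchilla_optimal_tokens(small) / 1e9:.0f}B tokens")
# Training the small model on, say, 1T tokens is ~17x past compute-optimal,
# trading extra training compute for better quality at a fixed inference cost.
print(f"overtraining factor at 1T tokens: {1e12 / chinchilla_optimal_tokens(small):.1f}x")
```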
We apply quality filters to all datasets, using both heuristic rules and model-based classifiers. We also perform safety filtering to remove harmful content based on our policies. To maintain the integrity of evaluations, we search for and remove any evaluation data that may have been in our training corpus before using data for training. The final data mixtures and weights were determined through ablations on smaller models. We stage training to alter the mixture composition over the course of training, increasing the weight of domain-relevant data towards the end. We find that data quality is an important factor for highly-performing models, and believe that many interesting questions remain around finding the optimal dataset distribution for pre-training.
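A common way to search for evaluation data in a training corpus is n-gram overlap: flag a training document if it shares a long enough word n-gram with any benchmark example. The sketch below implements that idea in miniature; the choice of n and the exact-match criterion are illustrative, not the filter actually used for Gemini.

```python
def ngrams(text, n):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc, eval_examples, n=8):
    """Flag a training document that shares any n-gram with an evaluation example."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(example, n) for example in eval_examples)

eval_set = ["the quick brown fox jumps over the lazy dog near the river bank"]
clean = "a slow red fox walked under the energetic cat beside a quiet stream today"
leaked = "as seen online: the quick brown fox jumps over the lazy dog near a fence"

print(is_contaminated(clean, eval_set))   # False: no shared 8-gram
print(is_contaminated(leaked, eval_set))  # True: benchmark text leaked in
```

In practice such filters trade precision against recall: a smaller n catches paraphrased leaks but also flags benign common phrases, so hashing and normalization choices matter at corpus scale.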
Discussion and Conclusion
We have presented Gemini, a new family of models that advance multimodal model capabilities in text, code, image, audio, and video. Our most capable pre-trained model Gemini Ultra, alongside the post-trained Gemini Apps and Gemini API variants, make significant advances across the board. In the natural language domain, the performance gains from careful developments in data and model training at scale continue to deliver quality improvements, setting a new state of the art in several benchmarks. In particular, Gemini Ultra surpasses human-expert performance on the exam benchmark MMLU, scoring 90.0%; MMLU has been a de facto measure of progress for LLMs ever since it was first released in 2020. In the multimodal domain, Gemini Ultra sets a new state of the art on most of the image understanding, video understanding, and audio understanding benchmarks without task-specific modifications or tuning. In particular, Gemini Ultra's multimodal reasoning capabilities are evident from its state-of-the-art performance on the recent MMMU benchmark, which comprises questions about images requiring college-level subject knowledge and deliberate reasoning.
Beyond the state-of-the-art results on benchmarks, what we are most excited about is the new use cases enabled by Gemini models. The new capabilities of Gemini models to parse complex images, such as charts or infographics, reason over interleaved sequences of images, audio, and text, and generate interleaved text and images as responses open a wide variety of new applications. As shown in figures throughout the report and appendix, Gemini models can enable new approaches in areas like education, everyday problem solving, multilingual communication, information summarization, extraction, and creativity. We expect that the users of these models will find all kinds of beneficial new uses that we have only scratched the surface of in our own investigations.
Despite their impressive capabilities, we should note that there are limitations to the use of LLMs. There is a continued need for ongoing research and development on "hallucinations" generated by LLMs to ensure that model outputs are more reliable and verifiable. LLMs also struggle with tasks requiring high-level reasoning abilities like causal understanding, logical deduction, and counterfactual reasoning even though they achieve impressive performance on exam benchmarks. This underscores the need for more challenging and robust evaluations to measure their true understanding as the current state-of-the-art LLMs saturate many benchmarks.
The Gemini family is a further step towards our mission to solve intelligence, advance science and benefit humanity, and we are enthusiastic to see how these models are used by our colleagues at Google and beyond. We build on many innovations in machine learning, data, infrastructure, and responsible development – areas that we have been pursuing at Google for over a decade. The models we present in this report provide a strong foundation towards our broader future goal to develop a large-scale, modularized system that will have broad generalization capabilities across many modalities.