Differential Transformer
Microsoft
DIFF Transformer
Transformer tends to overallocate attention to irrelevant context. In this work, we introduce DIFF Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that DIFF Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, DIFF Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, DIFF Transformer not only enhances accuracy but is also more robust to order permutation, which has long been considered a chronic robustness issue. The results position DIFF Transformer as a highly effective and promising architecture for advancing large language models.
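The core idea — attention scores computed as the difference of two softmax maps, with the second map scaled by a weight λ — can be sketched as a single-head NumPy function. This is a minimal illustration, not the paper's full formulation: the tensor names, the plain scalar `lam`, and the omission of multi-head projections and the learnable reparameterization of λ are simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam):
    """Single-head differential attention sketch.

    q1, k1, q2, k2: (seq_len, d) query/key pairs for the two maps.
    v: (seq_len, d_v) values. lam: scalar weight on the second map.
    Returns (seq_len, d_v).
    """
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))  # first softmax attention map
    a2 = softmax(q2 @ k2.T / np.sqrt(d))  # second softmax attention map
    # Subtracting the maps cancels common-mode noise in the scores.
    return (a1 - lam * a2) @ v
```

With `lam = 0` this reduces to ordinary softmax attention over the first query/key pair, which makes the mechanism easy to sanity-check.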
Pixtral 12B
We introduce Pixtral 12B, a 12-billion-parameter multimodal language model. Pixtral 12B is trained to understand both natural images and documents, achieving leading performance on various multimodal benchmarks, surpassing a number of larger models. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size, and does not compromise on natural language performance to excel in multimodal tasks. Pixtral uses a new vision encoder trained from scratch, which allows it to ingest images at their natural resolution and aspect ratio. This gives users flexibility on the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Pixtral 12B substantially outperforms other open models of similar sizes (Llama-3.2 11B & Qwen-2-VL 7B). It also outperforms much larger open models like Llama-3.2 90B while being 7x smaller. We further contribute an open-source benchmark, MM-MT-Bench, for evaluating vision-language models in practical scenarios, and provide detailed analysis and code for standardized evaluation protocols for multimodal LLMs. Pixtral 12B is released under Apache 2.0 license.
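As a rough illustration of how native-resolution input lets the token budget scale with image size, the helper below estimates a patch-token count from image dimensions. It is a hypothetical sketch: the function name, the 16-pixel patch size, and the omission of any special separator or end-of-image tokens are assumptions, not Pixtral's actual tokenizer.

```python
def image_token_count(height: int, width: int, patch: int = 16) -> int:
    """Estimate patch tokens for an image kept at its natural
    resolution: one token per patch, rows and columns rounded up.

    Illustrative only -- the 16px patch size is an assumption.
    """
    rows = -(-height // patch)  # ceil(height / patch)
    cols = -(-width // patch)   # ceil(width / patch)
    return rows * cols
```

Under this sketch a 512x512 image costs 1024 tokens while a small thumbnail costs far fewer, which is the flexibility the abstract refers to: users trade resolution against token budget rather than resizing every image to a fixed shape.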