
Qwen2.5 Technical Report

https://huggingface.co/papers/2412.15115

https://arxiv.org/abs/2412.15115

Alibaba


In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen2.5 has been significantly improved during both the pre-training and post-training stages. For pre-training, we scaled the high-quality pre-training data from the previous 7 trillion tokens to 18 trillion tokens, providing a strong foundation for common sense, expert knowledge, and reasoning capabilities. For post-training, we implement intricate supervised fine-tuning with over 1 million samples, as well as multi-stage reinforcement learning combining offline DPO and online GRPO. These post-training techniques significantly enhance alignment with human preferences and notably improve long-text generation, structured data analysis, and instruction following.
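The abstract names DPO as the offline preference-learning stage. As a minimal sketch of the standard DPO objective for a single preference pair (not Qwen's internal implementation, which operates over batches of tokenized responses):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: the loss shrinks as the policy
    # increasingly prefers the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference assign identical likelihoods, the margin is zero and the loss is log 2; pushing probability mass toward the chosen response drives it down.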


To handle diverse and varied use cases effectively, we present the Qwen2.5 LLM series in rich configurations. The open-weight offerings include base and instruction-tuned models at 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters, along with quantized versions of the instruction-tuned models. Over 100 models can be accessed from the Hugging Face Hub, ModelScope, and Kaggle. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants, Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio.
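The quantized instruction-tuned variants rest on weight quantization; as an illustrative sketch, here is a generic symmetric int8 round-trip (not the specific scheme used for the released checkpoints, which employ more elaborate per-group methods):

```python
def int8_quantize(weights):
    """Symmetric per-tensor int8 quantization of a list of floats."""
    # Scale maps the largest-magnitude weight to the int8 limit 127;
    # fall back to 1.0 for an all-zero tensor.
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def int8_dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [v * scale for v in q]
```

Round-trip error is bounded by half the scale per weight, which is why larger models tolerate low-bit storage with little quality loss.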


Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, and more. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and is competitive with the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around five times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o mini and GPT-4o, respectively. Additionally, as foundations, the Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.



Structured 3D Latents for Scalable and Versatile 3D Generation

https://huggingface.co/papers/2412.01506

https://arxiv.org/abs/2412.01506

https://github.com/Microsoft/TRELLIS

Microsoft


We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLat) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding.
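The SLat representation pairs active cells of a sparse 3D grid with per-cell feature vectors. One way to sketch such a structure (a deliberate simplification of the paper's representation, which attaches features distilled from a vision foundation model):

```python
class SparseLatentGrid:
    """Toy structured latent: active voxel coords -> latent feature vectors."""

    def __init__(self, resolution):
        self.resolution = resolution
        self.cells = {}  # (i, j, k) -> list[float] feature vector

    def set_cell(self, coord, feature):
        # Only voxels intersecting the object surface are populated,
        # which keeps memory proportional to surface area, not volume.
        if not all(0 <= c < self.resolution for c in coord):
            raise ValueError("coordinate outside grid")
        self.cells[coord] = feature

    def occupancy(self):
        """Fraction of grid cells carrying a latent (a sparsity measure)."""
        return len(self.cells) / self.resolution ** 3
```

Different decoders can then read the same populated cells out into radiance fields, Gaussians, or meshes, which is what makes the representation format-agnostic.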


We employ rectified flow transformers tailored for SLat as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models.
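Rectified-flow training regresses a constant velocity field along straight paths between noise and data. A minimal sketch of how one training pair is formed (generic rectified flow, not the paper's exact transformer conditioning):

```python
def rectified_flow_pair(x0, x1, t):
    """Interpolation point and velocity target for rectified-flow training.

    x0: noise sample, x1: data sample (here, flat lists of floats), t in [0, 1].
    The model is trained to predict v_target given x_t and t:
        x_t = (1 - t) * x0 + t * x1,   v_target = x1 - x0.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target
```

Because the target velocity is constant along each path, sampling can follow nearly straight trajectories, which is part of what makes these models efficient at scale.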



Evaluating and Aligning CodeLLMs on Human Preference

https://huggingface.co/papers/2412.05210

https://arxiv.org/abs/2412.05210

https://codearenaeval.github.io/

Alibaba


Code large language models (code LLMs) have made significant strides in code generation. Most previous code-related benchmarks, which consist of various programming exercises along with corresponding test cases, serve as a common measure of the performance and capabilities of code LLMs. However, current code LLMs focus on synthesizing correct code snippets while ignoring alignment with human preferences: queries should be sampled from practical application scenarios, and model-generated responses should satisfy human preference. To bridge the gap between model-generated responses and human preference, we present CodeArena, a rigorous human-curated benchmark that emulates the complexity and diversity of real-world coding tasks, comprising 397 high-quality samples spanning 40 categories and 44 programming languages, carefully curated from user queries. Furthermore, to verify the effectiveness of large-scale synthetic instruction fine-tuning, we propose SynCodeInstruct, a diverse synthetic instruction corpus of nearly 20B tokens built by scaling instructions from the web: Qwen2.5-SynCoder, trained entirely on synthetic instruction data, achieves top-tier performance among open-source code LLMs. The results reveal performance differences between execution-based benchmarks and CodeArena. Our systematic experiments with CodeArena on 40+ LLMs show a notable performance gap between open SOTA code LLMs and proprietary LLMs, underscoring the importance of human preference alignment.
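Arena-style preference evaluation ultimately reduces to aggregating pairwise judge verdicts. A minimal win-rate tally over such verdicts (the benchmark's actual protocol may weight ties differently or use Elo-style rating):

```python
def win_rate(verdicts):
    """Win rate of model A vs model B from pairwise judge verdicts.

    verdicts: list of 'win' | 'tie' | 'loss', from A's perspective.
    Ties count as half a win, a common convention in pairwise evals.
    """
    if not verdicts:
        return 0.0
    score = sum(1.0 if v == "win" else 0.5 if v == "tie" else 0.0
                for v in verdicts)
    return score / len(verdicts)
```

Unlike pass-rate metrics from execution-based benchmarks, this score reflects which response a judge would prefer, which is exactly the axis on which the abstract reports the open/proprietary gap.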
