TinyLlama: An Open-Source Small Language Model

https://huggingface.co/papers/2401.02385

https://arxiv.org/abs/2401.02385

https://github.com/jzhang38/tinyllama


We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for up to 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community, e.g., FlashAttention and Lit-GPT, achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes.

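As a rough sanity check on the stated 1.1B size, the parameter count can be reproduced from the configuration published in the TinyLlama repository (22 layers, hidden size 2048, 32 query heads with 4 grouped key/value heads, SwiGLU FFN of size 5632, 32k vocabulary, untied embeddings). These values come from the repo's config, not this abstract; the sketch below is a back-of-the-envelope estimate, not the model code:

```python
def llama_param_count(vocab, d_model, n_layers, n_heads, n_kv_heads, d_ff):
    """Approximate parameter count of a Llama-2-style decoder."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    # Attention: Q and O are d_model x d_model; K and V shrink under GQA.
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim
    # SwiGLU MLP: gate and up (d_model x d_ff) plus down (d_ff x d_model).
    mlp = 3 * d_model * d_ff
    # Two RMSNorm weight vectors per block.
    norms = 2 * d_model
    block = attn + mlp + norms
    # Untied input embedding and output head, plus the final RMSNorm.
    return vocab * d_model * 2 + n_layers * block + d_model

total = llama_param_count(32000, 2048, 22, 32, 4, 5632)
print(f"{total / 1e9:.2f}B parameters")  # → 1.10B parameters
```

The estimate lands almost exactly on the advertised 1.1B, with the grouped key/value heads accounting for the gap versus a naive 4·d² attention estimate.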


DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

https://huggingface.co/papers/2401.14196

https://arxiv.org/abs/2401.14196

https://github.com/deepseek-ai/DeepSeek-Coder

DeepSeek


The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.

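The infilling objective described above surfaces at inference time as fill-in-the-middle prompting: the model is given the code before and after a hole and generates what belongs in between. A minimal sketch of the prompt layout, using the sentinel strings shown in the DeepSeek-Coder README (verify them against the released tokenizer before relying on them):

```python
# Sentinel strings as listed in the DeepSeek-Coder README.
FIM_BEGIN = "<｜fim▁begin｜>"
FIM_HOLE = "<｜fim▁hole｜>"
FIM_END = "<｜fim▁end｜>"

def build_infill_prompt(prefix: str, suffix: str) -> str:
    """Prefix-suffix-middle layout: the model generates the code that
    belongs at the hole, conditioned on both surrounding sides."""
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"

prompt = build_infill_prompt(
    "def quick_sort(arr):\n    if len(arr) <= 1:\n        return arr\n",
    "    return quick_sort(left) + [pivot] + quick_sort(right)\n",
)
print(prompt)
```

Generation is then stopped at the end-of-sequence token and the completion is spliced back between prefix and suffix.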


Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

https://huggingface.co/papers/2401.10891

https://arxiv.org/abs/2401.10891

https://github.com/LiheYoung/Depth-Anything



This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Rather than pursuing novel technical modules, we aim to build a simple yet powerful foundation model that handles any image under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M images), which significantly enlarges data coverage and thus reduces generalization error. We investigate two simple yet effective strategies that make this data scaling promising. First, a more challenging optimization target is created with data augmentation tools, compelling the model to actively seek extra visual knowledge and acquire robust representations. Second, auxiliary supervision compels the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively on six public datasets and randomly captured photos, where it demonstrates impressive generalization ability (Figure 1). Further, fine-tuning with metric depth information from NYUv2 and KITTI sets new SOTAs. Our better depth model also yields a better depth-conditioned ControlNet. Our models are released here.

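Depth Anything trains with an affine-invariant loss (in the MiDaS tradition) so that depth labels from heterogeneous sources, including the ~62M auto-annotated images, are comparable up to an unknown per-image scale and shift. A minimal sketch of that alignment on flattened depth values, as an illustration of the idea rather than the authors' training code:

```python
from statistics import median

def align(depth):
    # Remove the unknown per-image scale and shift: shift to zero median,
    # scale to unit mean absolute deviation.
    t = median(depth)
    s = sum(abs(d - t) for d in depth) / len(depth)
    return [(d - t) / s for d in depth]

def affine_invariant_loss(pred, target):
    # Mean absolute error between the two aligned (flattened) depth maps.
    return sum(abs(p - g) for p, g in zip(align(pred), align(target))) / len(pred)

gt = [1.0, 2.0, 4.0, 8.0]
# A prediction that is correct up to scale and shift incurs (near-)zero loss.
print(affine_invariant_loss([3 * d + 2 for d in gt], gt))
```

Because any positive affine transform of the target is a perfect prediction under this loss, depth maps annotated by different sensors or by the data engine's teacher model can be mixed freely during training.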


InstantID: Zero-shot Identity-Preserving Generation in Seconds

https://huggingface.co/papers/2401.07519

https://arxiv.org/abs/2401.07519

https://instantid.github.io/

https://github.com/instantX-research/InstantID



There has been significant progress in personalized image synthesis with methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world applicability is hindered by high storage demands, lengthy fine-tuning processes, and the need for multiple reference images. Conversely, existing ID embedding-based methods, while requiring only a single forward inference, face challenges: they either necessitate extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. Addressing these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet by imposing strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving as an adaptable plugin.

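The "strong semantic, weak spatial" split can be pictured at the tensor level: the face-recognition embedding is projected into a few extra context tokens that the UNet's cross-attention reads alongside the text tokens, while the landmark image enters through the ControlNet-like IdentityNet branch. The shapes and the projection below are illustrative assumptions for the strong-semantic path, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

face_embed = rng.normal(size=(512,))      # one ArcFace-style identity vector
W_proj = rng.normal(size=(512, 4 * 768))  # hypothetical learned projection

# Strong semantic condition: project the ID vector into a handful of
# "identity tokens" and append them to the text context that every
# cross-attention layer attends over.
id_tokens = (face_embed @ W_proj).reshape(4, 768)
text_tokens = rng.normal(size=(77, 768))  # CLIP-style text token sequence
context = np.concatenate([text_tokens, id_tokens], axis=0)
print(context.shape)  # → (81, 768)
```

The weak spatial condition (the landmark map) would instead be injected additively into the UNet's intermediate features by the IdentityNet branch, which is what keeps the spatial constraint loose while the identity constraint stays strong.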


DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

https://huggingface.co/papers/2401.02954

https://arxiv.org/abs/2401.02954

https://github.com/deepseek-ai/deepseek-llm

DeepSeek


The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling laws described in previous literature present varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate the scaling of large-scale models in two widely used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and direct preference optimization (DPO) on the DeepSeek LLM Base models, resulting in the DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B across a range of benchmarks, especially in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that our DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

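The DPO step mentioned above optimizes the policy directly on preference pairs, with no separate reward model. A minimal sketch of the standard per-pair DPO loss (the general formulation, not DeepSeek-specific code):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss. Each argument is the summed log-probability of
    the chosen / rejected response under the policy (pi_*) or under the
    frozen reference model (ref_*); beta controls the KL-like penalty."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)): low when the policy prefers the chosen
    # response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference exactly, the loss sits at log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

Minimizing this over a preference dataset pushes the policy's log-ratio between chosen and rejected responses above the reference model's, which is the mechanism behind the Chat models' open-ended gains.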


VMamba: Visual State Space Model

https://huggingface.co/papers/2401.10166

https://arxiv.org/abs/2401.10166

https://github.com/mzeromiko/vmamba



Designing computationally efficient network architectures remains an ongoing necessity in computer vision. In this paper, we adapt Mamba, a state-space language model, into VMamba, a vision backbone with linear time complexity. At the core of VMamba is a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D bridges the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the collection of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments demonstrate VMamba's promising performance across diverse visual perception tasks, highlighting its superior input scaling efficiency compared to existing benchmark models.

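The four scanning routes at the heart of SS2D are just four ways of unfolding a 2D grid into the 1D sequences that a selective scan consumes. A small sketch of that unfolding (the route ordering here is an assumption for illustration; the repo's implementation fuses this with the scan itself):

```python
import numpy as np

def four_way_scans(x):
    """Unfold an (H, W) feature map along four scanning routes:
    row-major, column-major, and the reverses of both. Each 1-D
    sequence is what a selective-scan (S6) block would consume."""
    row = x.reshape(-1)    # left-to-right, top-to-bottom
    col = x.T.reshape(-1)  # top-to-bottom, left-to-right
    return [row, row[::-1], col, col[::-1]]

x = np.arange(6).reshape(2, 3)
scans = four_way_scans(x)
for seq in scans:
    print(seq.tolist())
```

After each route is processed by its own causal selective scan, the four output sequences are folded back onto the 2D grid and merged, so every position aggregates context from all four directions despite each individual scan being one-directional.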