Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving such generalization. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAE) to extract features from Gemma-2-2b's residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.
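The feature-extraction step can be illustrated with a toy sparse autoencoder applied to a residual-stream vector. The encoder form ReLU(xW_enc + b_enc) is a common SAE parameterization, but the dimensions, initialization, and variable names below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: d_model for the residual stream, d_sae for the
# overcomplete SAE dictionary. Real SAEs for Gemma-2-2b are far larger.
d_model, d_sae = 16, 64
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

def sae_features(x):
    """ReLU(x @ W_enc + b_enc): sparse feature activations for a
    residual-stream activation vector x."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_reconstruct(f):
    """Map sparse features back into residual-stream space."""
    return f @ W_dec + b_dec

x = rng.normal(size=d_model)      # stand-in for a residual-stream activation
f = sae_features(x)               # sparse feature vector to inspect/steer
x_hat = sae_reconstruct(f)
```

Detection-relevant analysis would then look at which entries of `f` fire differently on human vs. model text.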
Qwen2.5-Omni Technical Report
Alibaba
In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable streaming of multimodal inputs, both the audio and visual encoders use a block-wise processing approach. This strategy effectively decouples the handling of long multimodal sequences, assigning perception to the multimodal encoders and entrusting the modeling of extended sequences to a large language model. This division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly uses the hidden representations from the Thinker to produce audio tokens. Both the Thinker and the Talker are designed to be trained and run in an end-to-end manner. To decode audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, reducing the initial packet delay. Qwen2.5-Omni is comparable to the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks such as OmniBench. Notably, Qwen2.5-Omni's performance on end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
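The time-aligned interleaving that TMRoPE builds on can be sketched as a toy scheduler that merges timestamped audio and video chunks into one time-ordered sequence, along which positions can then be assigned. Chunk granularity and names here are hypothetical, not the paper's exact scheme:

```python
def interleave_by_time(audio_chunks, video_chunks):
    """Merge two streams of (start_time, payload) chunks into a single
    time-ordered sequence of payloads, so both modalities share one
    temporal axis for position embeddings. Illustrative only."""
    merged = sorted(audio_chunks + video_chunks, key=lambda c: c[0])
    return [payload for _, payload in merged]

audio = [(0.0, "a0"), (2.0, "a1"), (4.0, "a2")]
video = [(1.0, "v0"), (3.0, "v1")]
print(interleave_by_time(audio, video))  # ['a0', 'v0', 'a1', 'v1', 'a2']
```

The block-wise encoders described above would produce such timestamped chunks incrementally, which is what makes the streaming interleaving possible.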
Transformers without Normalization
Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation DyT(x) = tanh(αx), as a drop-in replacement for normalization layers in Transformers.
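A minimal sketch of the DyT operation: a learnable scalar α scales the input before an element-wise tanh, followed by the usual per-channel affine parameters γ and β found in normalization layers. Treat the exact parameterization and defaults below as assumptions for illustration:

```python
import math

def dyt(x, alpha=1.0, gamma=None, beta=None):
    """Dynamic Tanh: y_i = gamma_i * tanh(alpha * x_i) + beta_i.

    alpha is a learnable scalar; gamma and beta are per-channel affine
    parameters, analogous to those in LayerNorm. Unlike normalization,
    no per-token statistics (mean/variance) are computed.
    """
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]

print(dyt([0.0, 100.0, -100.0]))  # extremes are squashed toward ±1
```

The tanh saturates extreme activations, which is one intuition for why it can stand in for the stabilizing effect of normalization.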
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
https://huggingface.co/docling-project/SmolDocling-256M-preview
We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements, together with their locations, in full context. Unlike existing approaches that rely on large foundation models, or ensemble solutions built on handcrafted pipelines of multiple specialized models, SmolDocling offers end-to-end conversion that accurately captures the content, structure, and spatial location of document elements in a 256M-parameter vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, and lists across a diverse range of document types, including business documents, academic papers, technical reports, patents, and forms, extending well beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for chart, table, equation, and code recognition. Experimental results demonstrate that SmolDocling competes with vision-language models up to 27 times larger while substantially reducing computational requirements. The model is available now; the datasets will be publicly released soon.
Wan: Open and Advanced Large-Scale Video Generative Models
This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built on the mainstream diffusion-transformer paradigm, Wan achieves significant advances in generative capability through a series of innovations, including a novel spatio-temporal variational autoencoder (VAE), scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features. Leading performance: the 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, with a clear and significant performance margin. Comprehensiveness: Wan offers two capable models, with 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personalized video generation, encompassing up to eight tasks. Meanwhile, Wan is the first model that can generate visual text in both Chinese and English, significantly enhancing its practical value. Consumer-grade efficiency: the 1.3B model is exceptionally resource-efficient, requiring only 8.19 GB of VRAM, making it compatible with a wide range of consumer-grade GPUs; it also outperforms larger open-source models on text-to-video, showcasing remarkable efficiency. Openness: we open-source the entire Wan series, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in industry and to provide academia with high-quality video foundation models. In addition, we conduct extensive experimental analyses covering various aspects of Wan, presenting detailed results and insights. We believe these findings and conclusions will significantly advance video generation technology.
Open Deep Search: Democratizing Search with Open-source Reasoning Agents
We introduce Open Deep Search (ODS) to close the growing gap between proprietary search AI solutions, such as Perplexity's Sonar Reasoning Pro and OpenAI's GPT-4o Search Preview, and their open-source counterparts. The main innovation in ODS is to augment the reasoning capabilities of the latest open-source LLMs with reasoning agents that can judiciously use web search tools to answer queries. Concretely, ODS consists of two components that work with a base LLM chosen by the user: the Open Search Tool and the Open Reasoning Agent. The Open Reasoning Agent interprets the given task and completes it by orchestrating a sequence of actions, including calls to tools, one of which is the Open Search Tool. The Open Search Tool is a novel web search tool that outperforms proprietary counterparts. Together with powerful open-source reasoning LLMs such as DeepSeek-R1, ODS nearly matches, and sometimes surpasses, the existing state-of-the-art baselines on two benchmarks: SimpleQA and FRAMES. For example, on the FRAMES evaluation benchmark, ODS improves on the best existing baseline, the recently released GPT-4o Search Preview, by 9.7% in accuracy. ODS is a general framework for seamlessly augmenting any LLM (for example, DeepSeek-R1, which achieves 82.4% on SimpleQA and 30.1% on FRAMES) with search and reasoning capabilities to reach state-of-the-art performance: 88.3% on SimpleQA and 75.3% on FRAMES.
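The described division of labor, with a reasoning agent that decides when to call the search tool, can be sketched as a minimal tool-use loop. The function names, the stub tool, and the toy decision policy below are hypothetical illustrations, not ODS's actual API:

```python
def open_search_tool(query):
    """Stub standing in for a web search tool; returns canned snippets."""
    return [f"snippet about {query}"]

def reasoning_agent(task, llm_decide, max_steps=3):
    """Loop: ask the base LLM to pick an action, then either call the
    search tool or return a final answer.

    llm_decide maps (task, observations) -> ("search", query)
    or ("answer", text), standing in for the base LLM's decision.
    """
    observations = []
    for _ in range(max_steps):
        action, arg = llm_decide(task, observations)
        if action == "search":
            observations.extend(open_search_tool(arg))
        else:
            return arg
    return "no answer within budget"

# Toy policy: search once, then answer from what was retrieved.
def toy_policy(task, obs):
    if not obs:
        return ("search", task)
    return ("answer", f"Based on: {obs[0]}")

print(reasoning_agent("capital of France", toy_policy))
```

Swapping the stub tool for a real web search and the toy policy for an LLM call is the essence of the framework's "works with any base LLM" design.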
VGGT: Visual Geometry Grounded Transformer
We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second, and still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis.