
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

https://huggingface.co/papers/2409.01704

https://arxiv.org/abs/2409.01704

https://github.com/ucas-haoranwei/got-ocr2.0



Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's needs, given the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain text, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. GOT, with 580M parameters, is a unified, elegant, end-to-end model consisting of a high-compression encoder and a long-context decoder. As an OCR-2.0 model, GOT can handle all of the above "characters" across various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in both slice and whole-page styles. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via a simple prompt. In addition, the model offers interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we adapt dynamic-resolution and multi-page OCR techniques to GOT for better practicality. In experiments, we provide extensive results demonstrating the superiority of our model.
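The abstract describes a single prompt-driven interface covering plain, formatted, and region-level OCR. The sketch below illustrates that idea only; it is not the official GOT API, and every name in it is hypothetical.

```python
# Illustrative sketch (not the official GOT interface): how an OCR-2.0
# style model might expose plain, formatted, and interactive region-level
# recognition through one prompt-driven entry point. All names are made up.

def build_ocr_prompt(mode="plain", fmt=None, box=None, color=None):
    """Compose a task prompt for a unified OCR model.

    mode : "plain" for raw text, "format" for structured output
    fmt  : target markup when mode == "format" (markdown/tikz/smiles/kern)
    box  : optional [x1, y1, x2, y2] region for coordinate-guided OCR
    color: optional color name guiding region selection
    """
    parts = ["OCR"]
    if mode == "format":
        parts.append(f"with format: {fmt or 'markdown'}")
    if box is not None:
        parts.append(f"in box {box}")
    if color is not None:
        parts.append(f"in {color} region")
    return " ".join(parts) + ":"

print(build_ocr_prompt())                       # plain-text OCR
print(build_ocr_prompt("format", fmt="tikz"))   # formatted (geometry) output
print(build_ocr_prompt(box=[10, 20, 200, 80]))  # coordinate-guided region OCR
```

The point of the sketch is the design choice the paper highlights: one decoder, with the task (output format, target region) selected entirely by the prompt rather than by separate model heads.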



Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

https://huggingface.co/papers/2409.12191

https://arxiv.org/abs/2409.12191

Alibaba


We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models.
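Naive Dynamic Resolution means the visual token count is a function of the input resolution rather than a fixed number. The back-of-envelope sketch below assumes a 14x14 ViT patch size with 2x2 token merging, consistent with the Qwen2-VL report; the model's exact resizing rules (e.g., pixel budgets) may differ.

```python
# Back-of-envelope sketch of dynamic-resolution token accounting.
# Assumptions (hedged): 14-px ViT patches merged 2x2, so each merged
# visual token covers a 28x28 pixel unit; sides are rounded up to a
# multiple of that unit. Exact Qwen2-VL preprocessing may differ.

def visual_token_count(height, width, patch=14, merge=2):
    """Count merged visual tokens for an image of the given resolution."""
    unit = patch * merge              # 28 px per merged-token side
    h = -(-height // unit) * unit     # ceil each side to a multiple of 28
    w = -(-width // unit) * unit
    return (h // unit) * (w // unit)

# Different resolutions yield different sequence lengths:
print(visual_token_count(224, 224))   # 8 * 8  = 64 tokens
print(visual_token_count(448, 896))   # 16 * 32 = 512 tokens
```

This is what distinguishes the approach from the fixed-resolution OCR-1.0-style pipelines: a small icon costs few tokens while a dense document page costs many, so compute tracks image content.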
