
Transformer Explainer: Interactive Learning of Text-Generative Models

https://huggingface.co/papers/2408.04619

https://arxiv.org/abs/2408.04619

https://github.com/poloclub/transformer-explainer

https://poloclub.github.io/transformer-explainer/


Transformers have revolutionized machine learning, yet their inner workings remain opaque to many. We present TRANSFORMER EXPLAINER, an interactive visualization tool designed for non-experts to learn about Transformers through the GPT-2 model. Our tool helps users understand complex Transformer concepts by integrating a model overview and enabling smooth transitions across abstraction levels of mathematical operations and model structures. It runs a live GPT-2 instance locally in the user's browser, empowering users to experiment with their own input and observe in real time how the internal components and parameters of the Transformer work together to predict the next tokens. Our tool requires no installation or special hardware, broadening the public's educational access to modern generative AI techniques.
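The probability view at the end of the pipeline the tool visualizes is a temperature-scaled softmax over the model's next-token logits. A minimal sketch of that step (the token strings and logit values below are illustrative, not taken from GPT-2):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw next-token logits into probabilities.

    Lower temperature sharpens the distribution toward the top token;
    higher temperature flattens it.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens.
tokens = ["explore", "understand", "create", "analyze"]
logits = [5.1, 4.8, 3.2, 2.9]

for temp in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

Sliding the tool's temperature control corresponds to changing `temperature` here and watching the bar chart of probabilities sharpen or flatten.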



The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

https://huggingface.co/papers/2408.06292

https://arxiv.org/abs/2408.06292

https://github.com/sakanaai/ai-scientist


One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models (LLMs) to perform research independently and communicate their findings. We introduce THE AI SCIENTIST, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion and add them to a growing archive of knowledge, acting like the human scientific community. We demonstrate the versatility of this approach by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a meager cost of less than $15 per paper, illustrating the potential for our framework to democratize research and significantly accelerate scientific progress. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. THE AI SCIENTIST can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems.
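The loop the abstract describes (idea → code → experiment → paper → review, with accepted work feeding back into a growing archive) can be sketched as a pipeline of stubbed stages. Every function name below is hypothetical and not part of the released codebase; the stubs stand in for LLM prompting, code execution, and the automated reviewer:

```python
from dataclasses import dataclass, field

@dataclass
class PaperDraft:
    idea: str
    results: dict = field(default_factory=dict)
    manuscript: str = ""
    review_score: float = 0.0

def generate_idea(archive):
    # Stub: a real system would prompt an LLM with the archive of prior ideas
    # to propose something novel.
    return f"idea-{len(archive) + 1}"

def run_experiment(idea):
    # Stub: a real system would write experiment code, execute it,
    # and parse metrics out of the logs.
    return {"metric": 0.9}

def write_paper(idea, results):
    # Stub: a real system would draft a full manuscript with an LLM.
    return f"Paper on {idea}: metric={results['metric']}"

def review(manuscript):
    # Stub reviewer returning a score on a conference-style 1-10 scale.
    return 6.0

def discovery_loop(iterations=3, accept_threshold=6.0):
    archive = []
    for _ in range(iterations):
        draft = PaperDraft(idea=generate_idea(archive))
        draft.results = run_experiment(draft.idea)
        draft.manuscript = write_paper(draft.idea, draft.results)
        draft.review_score = review(draft.manuscript)
        if draft.review_score >= accept_threshold:
            archive.append(draft)  # accepted work seeds future ideas
    return archive

archive = discovery_loop()
print(len(archive))  # → 3
```

The key structural point is the feedback edge: `generate_idea` consumes the archive that accepted drafts are appended to, which is what makes the process open-ended rather than a one-shot pipeline.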



CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

https://huggingface.co/papers/2408.06072

https://arxiv.org/abs/2408.06072

https://yzy-thu.github.io/CogVideoX-demo/

https://github.com/zai-org/CogVideo


We present CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer, which can generate 10-second continuous videos that align seamlessly with text prompts, at a frame rate of 16 fps and a resolution of 768×1360 pixels. Previous video generation models often struggled with limited motion and short durations, and generating videos with coherent narratives from text is especially difficult. We propose several designs to address these issues. First, we introduce a 3D Variational Autoencoder (VAE) to compress videos across spatial and temporal dimensions, enhancing both the compression rate and video fidelity. Second, to improve text-video alignment, we propose an expert transformer with expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. Third, by employing progressive training and multi-resolution frame packing, CogVideoX excels at generating coherent, long-duration videos with diverse shapes and dynamic movements. In addition, we develop an effective pipeline that includes various pre-processing strategies for text and video data. Our innovative video captioning model significantly improves generation quality and semantic alignment. Results show that CogVideoX achieves state-of-the-art performance in both automated benchmarks and human evaluation.
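The 3D VAE's role can be illustrated with simple shape arithmetic: the model diffuses over a latent grid that is downsampled along both the temporal and the spatial axes. The compression factors and latent channel count below are illustrative assumptions, not figures from the paper:

```python
def latent_shape(frames, height, width, t_factor=4, s_factor=8, latent_channels=16):
    """Latent grid shape for a video tensor of shape (frames, height, width, 3)
    after a 3D VAE that compresses time by t_factor and space by s_factor.

    All factors here are illustrative assumptions for the sketch.
    """
    return (frames // t_factor, height // s_factor, width // s_factor, latent_channels)

# A 10-second clip at 16 fps, 768x1360, as described in the abstract.
frames = 10 * 16  # 160 frames
print(latent_shape(frames, 768, 1360))  # → (40, 96, 170, 16)
```

Compressing temporally as well as spatially is what keeps the diffusion transformer's sequence length tractable for 160-frame clips, compared with an image VAE applied frame by frame.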
