Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Microsoft
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with 7B and 14B models trained on 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and is on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image-plus-text and multi-image-plus-text prompts.
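The gap between phi-3.5-MoE's total size (16 × 3.8B experts) and its 6.6 billion active parameters comes from sparse routing: per token, a router scores all experts but only a few expert feed-forward networks actually execute. The sketch below illustrates this mechanism with toy values; the expert count, the choice of top-2 routing, and all numbers are illustrative assumptions, not the phi-3.5-MoE configuration.

```python
# Minimal sketch of sparse MoE routing (illustrative, not the phi-3.5-MoE
# design): each token runs through only its top-k experts, so active
# parameters per token are a small fraction of the total parameter count.

def top_k_experts(router_logits, k=2):
    """Return indices of the k highest-scoring experts."""
    return sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:k]

def moe_forward(token, router_logits, experts, k=2):
    """Route one token through its top-k experts and average their outputs."""
    chosen = top_k_experts(router_logits, k)
    outputs = [experts[i](token) for i in chosen]
    return sum(outputs) / len(outputs), chosen

# 16 toy "experts": each just scales its input by a different factor.
experts = [lambda x, s=i: x * s for i in range(16)]
logits = [0.1 * i for i in range(16)]   # experts 15 and 14 score highest
out, chosen = moe_forward(2.0, logits, experts, k=2)
# Only 2 of the 16 experts executed for this token; the other 14 stayed idle.
```

With top-2 routing over 16 experts, only 2/16 of the expert weights participate in any one forward pass, which is how a model with dozens of billions of total parameters can run with only a few billion active.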