GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
arXiv:2604.26752 (cs)
[Submitted on 29 Apr 2026]
Authors: GLM-V Team: Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, Xijun Liu, Wenmeng Yu, Weihan Wang, Wei Li, Shuaiqi Duan, Sheng Yang, Ruiliang Lv, Mingdao Liu, Lihang Pan, Ke Ning, Junhui Ji, Jinjiang Wang, Jing Chen, Jiazheng Xu, Jiale Zhu, Jiale Cheng, Ji Qi, Guobing Gan, Guo Wang, Cong Yao, Zijun Dou, Zihao Zhou, Zihan Wang, Zhiqi Ge, Zhijie Li, Zhenyu Hou, Zhao Xue, Zehui Wang, Zehai He, Yusen Liu, Yukuo Cen, Yuchen Li, Yuan Wang, Yijian Lu, Yanzi Wang, Yadong Xue, Xinyu Zhang, Xinyu Liu, Wenkai Li, Tianyu Tong, Tianshu Zhang, Shengdong Yan, Qinkai Zheng, Mingde Xu, Licheng Bao, Jiaxing Xu, Jiaxin Fan, Jiawen Qian, Jiali Chen, Jiahui Lin, Haozhi Zheng, Haoran Wang, Haochen Li, Fan Yang, Dan Zhang, Chuangxin Zhao, Chengcheng Wu, Boyan Shi, Bowei Jia, Baoxu Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang
Abstract: We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2604.26752 [cs.CV] (or arXiv:2604.26752v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2604.26752 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Wenyi Hong
[v1] Wed, 29 Apr 2026 14:49:37 UTC (18,650 KB)