
WIP: Add support for CogAgent #12679

Draft · wants to merge 26 commits into master

Conversation

Tianyue-Zhao

Overview

This PR adds support for CogAgent.
CogAgent is a visual model specializing in GUI recognition and visual grounding (providing accurate pixel coordinates for GUI actions).
It is based on the earlier CogVLM vision model, but improves high-resolution performance and visual grounding for use as a GUI agent.

Completes #4387

Architecture

  • CogAgent uses two CLIP encoders
  • The first is largely similar to the existing vision encoders in llama.cpp (e.g. LLaVA)
  • The second encoder handles high-res images (1120 × 1120)
    • Instead of producing visual tokens, it passes information to the main LLM through a cross-attention mechanism
    • I'm not yet sure where to put this in the final implementation, since the vision refactor doesn't really provide facilities for dual-encoder vision models
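To make the cross-attention point concrete, here is a minimal NumPy sketch of the general mechanism: queries come from the LLM's token stream, while keys and values come from the high-res encoder's patch features, so no extra visual tokens are appended to the sequence. All shapes, names, and the random projection weights are illustrative assumptions, not CogAgent's actual weights or dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(hidden, img_feats, d_head, rng):
    """Single-head cross-attention sketch.

    hidden:    (n_tokens, d_model)  LLM hidden states -> queries
    img_feats: (n_patches, d_img)   high-res encoder output -> keys/values
    """
    d_model = hidden.shape[-1]
    d_img = img_feats.shape[-1]
    # Illustrative random projections; a real model would load trained weights.
    Wq = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
    Wk = rng.standard_normal((d_img, d_head)) / np.sqrt(d_img)
    Wv = rng.standard_normal((d_img, d_head)) / np.sqrt(d_img)
    Q = hidden @ Wq        # (n_tokens, d_head)
    K = img_feats @ Wk     # (n_patches, d_head)
    V = img_feats @ Wv     # (n_patches, d_head)
    # Each LLM token attends over image patches, not over other tokens.
    scores = softmax(Q @ K.T / np.sqrt(d_head))  # (n_tokens, n_patches)
    return scores @ V      # (n_tokens, d_head)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 64))       # 8 LLM tokens, d_model = 64
img_feats = rng.standard_normal((100, 32))  # 100 high-res patches, d_img = 32
out = cross_attention(hidden, img_feats, d_head=16, rng=rng)
print(out.shape)  # (8, 16)
```

The key property for the refactor discussion is that the sequence length of the LLM never changes: the image information enters only through the attention output, which is why this encoder doesn't map cleanly onto a tokens-in, tokens-out vision interface.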

Current status

The current implementation is locally tested, but it is temporary and not ready for use.
This PR is waiting on the new vision infrastructure introduced in #11292.
Once the vision infrastructure in that PR is finalized, I will update the CogAgent implementation to take advantage of the new features.

@github-actions github-actions bot added examples python python script changes server labels Mar 31, 2025
@Tianyue-Zhao Tianyue-Zhao changed the title Add support for CogAgent WIP: Add support for CogAgent Mar 31, 2025