
WIP: Add support for CogAgent #12679

Draft · wants to merge 26 commits into master

Conversation

Tianyue-Zhao

Overview

This PR adds support for CogAgent.
CogAgent is a visual model specializing in GUI recognition and visual grounding (providing accurate pixel coordinates for GUI actions).
It is based on the earlier CogVLM vision model, but improves high-resolution performance and visual grounding for use as a GUI agent.

Completes #4387

Architecture

  • CogAgent uses two CLIP encoders
  • The first is largely similar to the existing vision encoders in llama.cpp (e.g. LLaVA)
  • The second encoder handles high-res images (1120 × 1120)
    • Instead of producing visual tokens, it passes information to the main LLM through a cross-attention mechanism
    • I'm not yet sure where to put this in the final implementation, since the vision refactor doesn't really provide facilities for dual-encoder vision models
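To make the cross-attention point concrete, here is a minimal NumPy sketch of the general mechanism: queries come from the LLM's token stream, while keys and values come from the high-res encoder's patch features, so no extra visual tokens are appended to the sequence. All shapes, names, and the random projection weights are illustrative assumptions, not CogAgent's actual weights or dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(hidden, img_feats, d_head, rng):
    """Single-head cross-attention sketch.

    hidden:    (n_tokens, d_model)  LLM hidden states -> queries
    img_feats: (n_patches, d_img)   high-res encoder output -> keys/values
    """
    d_model = hidden.shape[-1]
    d_img = img_feats.shape[-1]
    # Illustrative random projections; a real model would load trained weights.
    Wq = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
    Wk = rng.standard_normal((d_img, d_head)) / np.sqrt(d_img)
    Wv = rng.standard_normal((d_img, d_head)) / np.sqrt(d_img)
    Q = hidden @ Wq        # (n_tokens, d_head)
    K = img_feats @ Wk     # (n_patches, d_head)
    V = img_feats @ Wv     # (n_patches, d_head)
    # Each LLM token attends over image patches, not over other tokens.
    scores = softmax(Q @ K.T / np.sqrt(d_head))  # (n_tokens, n_patches)
    return scores @ V      # (n_tokens, d_head)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 64))       # 8 LLM tokens, d_model = 64
img_feats = rng.standard_normal((100, 32))  # 100 high-res patches, d_img = 32
out = cross_attention(hidden, img_feats, d_head=16, rng=rng)
print(out.shape)  # (8, 16)
```

The key property for the refactor discussion is that the sequence length of the LLM never changes: the image information enters only through the attention output, which is why this encoder doesn't map cleanly onto a tokens-in, tokens-out vision interface.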

Current status

The current implementation is locally tested, but it is temporary and not ready for use.
This PR is waiting on the new vision infrastructure introduced in #11292.
Once the vision infrastructure in that PR is finalized, I will update the CogAgent implementation to take advantage of the new features.

@github-actions github-actions bot added examples python python script changes server labels Mar 31, 2025
@Tianyue-Zhao Tianyue-Zhao changed the title Add support for CogAgent WIP: Add support for CogAgent Mar 31, 2025