Fine-tuned Qwen2.5-VL-3B for UI Element Localization
This model is a fine-tuned version of Qwen/Qwen2.5-VL-3B-Instruct trained on the SeeClick dataset for predicting UI element coordinates.
Training Details
- Base Model: Qwen/Qwen2.5-VL-3B-Instruct
- Training Checkpoint: Step 2450, Epoch 0
- Task: Given a UI screenshot and element description, predict the center coordinates (x, y) of the element.
Usage
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

model_name = "BLR2/qwen2.5-vl-3b-ui-grounding"

# Load the processor and the model (bfloat16 weights, auto-placed on available devices)
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Build a chat message containing the screenshot and the element description
image = Image.open("screenshot.png").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Given this UI screenshot, predict the center of: 'Submit button'."},
        ],
    },
]

# Apply the chat template and prepare the model inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
img, vid = process_vision_info(messages)
inputs = processor(text=[text], images=img, videos=None, padding=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Greedy decoding; strip the prompt tokens before decoding the prediction
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
generated_ids_trimmed = generated_ids[0][len(inputs["input_ids"][0]):]
response = processor.decode(generated_ids_trimmed, skip_special_tokens=True)
print(response)  # e.g. "0.7532 0.8921" (normalized x, y coordinates)
```
Output Format
The model outputs normalized coordinates in the format `x y`, where both values are in the range [0, 1]:
- x: horizontal position (0 = left, 1 = right)
- y: vertical position (0 = top, 1 = bottom)
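To act on a prediction (for example, to click the element), the normalized output can be scaled back to pixel coordinates of the original screenshot. The following is a minimal sketch, assuming `response` holds exactly two whitespace-separated floats as in the example above and `image` is the PIL screenshot from the Usage section:

```python
# Parse the model output, e.g. "0.7532 0.8921" (assumed format: two floats)
x_norm, y_norm = map(float, response.split())

# Scale normalized [0, 1] coordinates to pixels; PIL's .size is (width, height)
width, height = image.size
x_px, y_px = round(x_norm * width), round(y_norm * height)
print(f"Predicted click point: ({x_px}, {y_px}) px")
```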