Fine-tuned Qwen2.5-VL-3B for UI Element Localization
This model is a fine-tuned version of Qwen/Qwen2.5-VL-3B-Instruct trained on the SeeClick dataset for predicting UI element coordinates.
Training Details
- Base Model: Qwen/Qwen2.5-VL-3B-Instruct
- Training Checkpoint: Step 2450, Epoch 0
- Task: Given a UI screenshot and element description, predict the center coordinates (x, y) of the element.
Usage
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

model_name = "BLR2/qwen2.5-vl-3b-ui-grounding"

# Load the processor and the model (bfloat16 weights, auto-placed on available devices)
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Build a chat message containing the screenshot and the element description
image = Image.open("screenshot.png").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Given this UI screenshot, predict the center of: 'Submit button'."},
        ],
    },
]

# Apply the chat template and prepare the model inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
img, vid = process_vision_info(messages)
inputs = processor(text=[text], images=img, videos=None, padding=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Greedy decoding; strip the prompt tokens before decoding the prediction
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
generated_ids_trimmed = generated_ids[0][len(inputs["input_ids"][0]):]
response = processor.decode(generated_ids_trimmed, skip_special_tokens=True)
print(response)  # e.g. "0.7532 0.8921" (normalized x, y coordinates)
```
Output Format
The model outputs normalized coordinates in the format `x y`, where both values are in the range [0, 1]:
- x: horizontal position (0 = left, 1 = right)
- y: vertical position (0 = top, 1 = bottom)
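To act on a prediction (for example, to click the element), the normalized output can be scaled back to pixel coordinates of the original screenshot. The following is a minimal sketch, assuming `response` holds exactly two whitespace-separated floats as in the example above and `image` is the PIL screenshot from the Usage section:

```python
# Parse the model output, e.g. "0.7532 0.8921" (assumed format: two floats)
x_norm, y_norm = map(float, response.split())

# Scale normalized [0, 1] coordinates to pixels; PIL's .size is (width, height)
width, height = image.size
x_px, y_px = round(x_norm * width), round(y_norm * height)
print(f"Predicted click point: ({x_px}, {y_px}) px")
```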