Model Card for siglip-ft-enpedia
SigLIP-ft-enpedia is a fine-tuned variant of SigLIP, a vision-language model that aligns image and text embeddings and is well suited to efficient document retrieval. Building on the original SigLIP architecture, we apply LoRA-based parameter-efficient fine-tuning on our custom children's encyclopedia dataset, which consists of 8,484 Wikipedia page screenshots paired with broad topical queries. This adaptation enables the model to capture domain-specific semantic associations between visual encyclopedia content and user queries, improving retrieval accuracy and robustness for educational applications.

Model Details
Model Description
- Developed by: Department of Media & Communication, Kangwon National University / School of Information Science and Technology, Hangzhou Normal University
- Model type: LoRA fine-tuned SigLIP for multimodal document retrieval
- Language(s) (NLP): English (dataset queries, Wikipedia content)
- License: inherited from the original SigLIP model
- Finetuned from model: Google’s SigLIP (google/siglip-so400m-patch14-384)
Intended uses & limitations
You can use the fine-tuned model for tasks such as zero-shot image-text retrieval. See the model hub to look for other versions fine-tuned on a task that interests you.
How to use
Here is how to use this model to perform zero-shot image-text retrieval over the encyclopedia pages:
import torch
from peft import PeftModel
from transformers import SiglipModel, SiglipProcessor
from datasets import load_dataset, Features, Image, Value
features = Features({
    "image": Image(decode=True),
    "image_filename": Value("string"),
    "keyword": Value("string"),
    "broad_topical_query": Value("string"),
    "broad_topical_explanation": Value("string"),
    "specific_detail_query": Value("string"),
    "specific_detail_explanation": Value("string"),
    "visual_element_query": Value("string"),
    "visual_element_explanation": Value("string"),
})
ds = load_dataset(
    "parquet",
    data_files={
        "train": ["wiki_dataset-train.parquet"],
        "test": ["wiki_dataset-test.parquet"],
    },
    features=features,
)
train_ds = ds["train"]
test_ds = ds["test"]
base_model_id = "google/siglip-so400m-patch14-384"
ft_model_id = "dj86/siglip-ft-enpedia"

# Load the SigLIP backbone, then attach the fine-tuned LoRA adapter on top.
model = SiglipModel.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(model, ft_model_id)
processor = SiglipProcessor.from_pretrained(base_model_id)
# Encode a few example pages and their keyword-based queries.
images = [train_ds[i]["image"] for i in range(3)]
texts = ["an image of " + train_ds[i]["keyword"] for i in range(3)]

inputs = processor(images=images, return_tensors="pt")
text_inputs = processor(text=texts, return_tensors="pt", padding=True)

with torch.no_grad():
    image_embeds = model.get_image_features(**inputs)
    text_embeds = model.get_text_features(**text_inputs)

# L2-normalize the embeddings and compute the text-to-image similarity matrix.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = torch.matmul(text_embeds, image_embeds.T)
print("Similarity:", similarity)
Training Details
Training Data
- Source: Wikipedia pages corresponding to children's encyclopedia topics drawn from DK Children's Encyclopedia: The Book That Explains Everything.
- Train set: 8,484 Wikipedia page screenshot–query pairs (wiki_dataset)
- Annotation schema: each page was paired with broad topical, specific detail, and visual element queries, each accompanied by a corresponding explanation.
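The pairing code is not part of this card; a rough sketch, assuming each page screenshot is matched with its broad topical query as the text side of the contrastive pair:

# Hypothetical pairing sketch: keep the page screenshot, add its query as "text".
def example_to_pair(example):
    return {"text": example["broad_topical_query"]}

pair_ds = train_ds.map(example_to_pair)  # keeps "image", adds a "text" column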
Training Procedure
- Optimization: LoRA fine-tuning on attention layers (q_proj, v_proj)
- Hyperparameters: learning rate = 5e-5, LoRA rank = 8, LoRA alpha = 16, dropout = 0.05, epochs = 5.
- Batch size: 8
- Frameworks: Hugging Face Transformers + PEFT
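The training script itself is not included in the card; below is a minimal PEFT setup sketch assuming the hyperparameters listed above (the optimizer and loss mentioned in the trailing comment are assumptions, not taken from the card):

from peft import LoraConfig, get_peft_model
from transformers import SiglipModel

# LoRA on the attention query/value projections, matching the settings above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
base = SiglipModel.from_pretrained("google/siglip-so400m-patch14-384")
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()

# Training then runs for 5 epochs with batch size 8, e.g. with
# torch.optim.AdamW(peft_model.parameters(), lr=5e-5) and a SigLIP-style
# sigmoid contrastive loss over the image-query pairs.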
Evaluation
Metric definitions:
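The card leaves the metric definitions blank; purely as an illustration (not the authors' protocol), a common page-retrieval metric such as Recall@K can be computed from a query-to-image similarity matrix like the one in the usage example:

import torch

def recall_at_k(similarity, k=1):
    # similarity: [num_queries, num_images]; query i's relevant page is image i.
    topk = similarity.topk(k, dim=-1).indices
    correct = torch.arange(similarity.size(0)).unsqueeze(-1)
    return (topk == correct).any(dim=-1).float().mean().item()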

Testing Details
Testing Data
- Test set: 1,040 Wikipedia page screenshot–query pairs (wiki_dataset)
Results

Hardware
NVIDIA L40 (48GB) GPU
BibTeX entry and citation info