Model Card for siglip-ft-enpedia
SigLIP-ft-enpedia is a fine-tuned variant of SigLIP, a vision-language model that aligns image and text embeddings and is well suited to efficient document retrieval. Building on the original SigLIP architecture, we apply LoRA-based parameter-efficient fine-tuning on our custom children's encyclopedia dataset, which consists of 8,484 Wikipedia page screenshots paired with broad topical queries. This adaptation enables the model to capture domain-specific semantic associations between visual encyclopedia content and user queries, improving retrieval accuracy and robustness for educational applications.

Model Details
Model Description
- Developed by: Department of Media & Communication, Kangwon National University / School of Information Science and Technology, Hangzhou Normal University
- Model type: LoRA fine-tuned SigLIP for multimodal document retrieval
- Language(s) (NLP): English (dataset queries, Wikipedia content)
- License: inherited from the original SigLIP model
- Finetuned from model: Google’s SigLIP (google/siglip-so400m-patch14-384)
Intended uses & limitations
You can use the fine-tuned model for tasks such as zero-shot image-text retrieval. See the model hub to look for other versions fine-tuned on a task that interests you.
How to use
Here is how to use this model to perform zero-shot image-text retrieval over the encyclopedia pages:
import torch
from peft import PeftModel
from transformers import SiglipModel, SiglipProcessor
from datasets import load_dataset, Features, Image, Value
features = Features({
    "image": Image(decode=True),
    "image_filename": Value("string"),
    "keyword": Value("string"),
    "broad_topical_query": Value("string"),
    "broad_topical_explanation": Value("string"),
    "specific_detail_query": Value("string"),
    "specific_detail_explanation": Value("string"),
    "visual_element_query": Value("string"),
    "visual_element_explanation": Value("string"),
})
ds = load_dataset(
    "parquet",
    data_files={
        "train": ["wiki_dataset-train.parquet"],
        "test": ["wiki_dataset-test.parquet"],
    },
    features=features,
)
train_ds = ds["train"]
test_ds = ds["test"]
base_model_id = "google/siglip-so400m-patch14-384"
ft_model_id = "dj86/siglip-ft-enpedia"

# Load the SigLIP backbone, then attach the fine-tuned LoRA adapter on top.
model = SiglipModel.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(model, ft_model_id)
processor = SiglipProcessor.from_pretrained(base_model_id)
# Encode a few example pages and their keyword-based queries.
images = [train_ds[i]["image"] for i in range(3)]
texts = ["an image of " + train_ds[i]["keyword"] for i in range(3)]

inputs = processor(images=images, return_tensors="pt")
text_inputs = processor(text=texts, return_tensors="pt", padding=True)

with torch.no_grad():
    image_embeds = model.get_image_features(**inputs)
    text_embeds = model.get_text_features(**text_inputs)

# L2-normalize the embeddings and compute the text-to-image similarity matrix.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = torch.matmul(text_embeds, image_embeds.T)
print("Similarity:", similarity)
Training Details
Training Data
- Source: Wikipedia pages corresponding to children's encyclopedia topics drawn from DK Children's Encyclopedia: The Book That Explains Everything.
- Train set: 8,484 Wikipedia page screenshot–query pairs (wiki_dataset)
- Annotation schema: each page was paired with broad topical, specific detail, and visual element queries, each accompanied by a corresponding explanation.
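The pairing code is not part of this card; a rough sketch, assuming each page screenshot is matched with its broad topical query as the text side of the contrastive pair:

# Hypothetical pairing sketch: keep the page screenshot, add its query as "text".
def example_to_pair(example):
    return {"text": example["broad_topical_query"]}

pair_ds = train_ds.map(example_to_pair)  # keeps "image", adds a "text" column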
Training Procedure
- Optimization: LoRA fine-tuning on attention layers (q_proj, v_proj)
- Hyperparameters: learning rate = 5e-5, LoRA rank = 8, LoRA alpha = 16, dropout = 0.05, epochs = 5.
- Batch size: 8
- Frameworks: Hugging Face Transformers + PEFT
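The training script itself is not included in the card; below is a minimal PEFT setup sketch assuming the hyperparameters listed above (the optimizer and loss mentioned in the trailing comment are assumptions, not taken from the card):

from peft import LoraConfig, get_peft_model
from transformers import SiglipModel

# LoRA on the attention query/value projections, matching the settings above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
base = SiglipModel.from_pretrained("google/siglip-so400m-patch14-384")
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()

# Training then runs for 5 epochs with batch size 8, e.g. with
# torch.optim.AdamW(peft_model.parameters(), lr=5e-5) and a SigLIP-style
# sigmoid contrastive loss over the image-query pairs.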
Evaluation
Metric definitions:
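The card leaves the metric definitions blank; purely as an illustration (not the authors' protocol), a common page-retrieval metric such as Recall@K can be computed from a query-to-image similarity matrix like the one in the usage example:

import torch

def recall_at_k(similarity, k=1):
    # similarity: [num_queries, num_images]; query i's relevant page is image i.
    topk = similarity.topk(k, dim=-1).indices
    correct = torch.arange(similarity.size(0)).unsqueeze(-1)
    return (topk == correct).any(dim=-1).float().mean().item()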

Testing Details
Testing Data
- Test set: 1,040 Wikipedia page screenshot–query pairs (wiki_dataset)
Results

Hardware
NVIDIA L40 (48GB) GPU
BibTeX entry and citation info