Jac Vision

A platform for fine-tuning VLMs, with a complete pipeline from dataset preparation to inference.

2025-05-04
Tags: VQA, Finetuning, Image Captioning

Jac Vision is a web-based platform that simplifies fine-tuning Vision-Language Models (VLMs). It gives researchers and developers a comprehensive suite of tools for model management, dataset preparation, fine-tuning, inference, and real-time monitoring, through an intuitive React-based interface backed by a FastAPI backend.

Developed in collaboration with Jaseci Labs.

Walkthrough

Complete Workflow Demonstration - From Dataset Preparation to Model Inference

Key Features

  • Model Management: Search, download, and manage Vision-Language Models from Hugging Face with support for access-controlled models using authentication tokens.
  • Advanced Fine-tuning: Fine-tune models using custom datasets with configurable hyperparameters, real-time progress tracking, and comprehensive training metrics.
  • Image Captioning: Generate, edit, and export image captions using various VLMs with support for custom prompts and batch processing.
  • Vision Question Answering (VQA): Interactive VQA with both pre-trained and fine-tuned models, supporting Gemini and OpenAI models with persistent history management.
  • System Monitoring: Real-time CPU, memory, and disk usage monitoring for optimal resource management during training operations.
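The system-monitoring feature reports CPU, memory, and disk usage during training. A minimal sketch of such a resource check, using only the Python standard library (the actual backend may rely on a dedicated library such as psutil for live CPU and memory percentages; the function name `resource_snapshot` is illustrative):

```python
import os
import shutil


def resource_snapshot(path: str = "/") -> dict:
    """Return a coarse snapshot of local resources (illustrative only).

    Disk figures come from shutil.disk_usage and the CPU count from
    os.cpu_count; live CPU/memory percentages would need a library
    such as psutil, which the standard library does not provide.
    """
    disk = shutil.disk_usage(path)
    return {
        "cpu_count": os.cpu_count(),
        "disk_total_gb": round(disk.total / 1e9, 2),
        "disk_used_gb": round(disk.used / 1e9, 2),
        "disk_free_gb": round(disk.free / 1e9, 2),
        "disk_used_pct": round(100 * disk.used / disk.total, 1),
    }


if __name__ == "__main__":
    print(resource_snapshot())
```

A backend would typically expose a snapshot like this from a polling endpoint so the frontend can refresh usage gauges during a training run.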

How It Works

Model Fine-tuning Workflow

  1. Dataset Preparation:
    • Upload image folders via ZIP file or create datasets using the image captioning tool.
    • Generate or edit captions manually or automatically using various VLMs.
    • Export datasets in training-ready JSON format.
  2. Model Selection:
    • Search and download pre-trained VLMs from Hugging Face.
    • Select an appropriate model based on task complexity and hardware constraints.
  3. Configuration:
    • Set hyperparameters (learning rate, batch size, epochs) or use adaptive mode.
    • Configure LoRA parameters for efficient fine-tuning.
    • Set training goals for goal-based training.
  4. Training:
    • Monitor real-time progress with live loss curves and metrics.
    • View TensorBoard logs for detailed analysis.
    • Receive notifications on training completion or errors.
  5. Evaluation & Deployment:
    • Test fine-tuned models using the inference interface.
    • Compare outputs with base models and other fine-tuned versions.
    • Export models in desired format for deployment.
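The dataset-preparation step above ends with exporting captions as training-ready JSON. A minimal sketch of such an export, assuming a flat list of image/caption records (the actual schema Jac Vision emits may differ, and `export_dataset` is an illustrative name, not the platform's API):

```python
import json
from pathlib import Path


def export_dataset(records: list[dict], out_path: str) -> int:
    """Write image/caption pairs to a training-ready JSON file.

    Each record is assumed to hold an image path and its caption;
    a conversational fine-tuning format would wrap these entries
    further (e.g. into user/assistant turns).
    """
    dataset = [
        {"image": r["image"], "caption": r["caption"].strip()}
        for r in records
        if r.get("caption")  # skip images that were never captioned
    ]
    Path(out_path).write_text(json.dumps(dataset, indent=2, ensure_ascii=False))
    return len(dataset)


if __name__ == "__main__":
    count = export_dataset(
        [
            {"image": "images/cat.jpg", "caption": "A cat on a sofa. "},
            {"image": "images/dog.jpg", "caption": ""},
        ],
        "dataset.json",
    )
    print(count)  # the uncaptioned image is skipped
```

Filtering out empty captions at export time keeps the training set clean, so a partially captioned folder can still be exported without manual cleanup.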

Use Cases

  • Custom VQA Systems: Train models for domain-specific visual question answering (medical imaging, autonomous vehicles, retail).
  • Image Captioning: Generate captions for product catalogs, accessibility features, or content moderation.
  • Educational Tool: Learn about vision-language models, fine-tuning techniques, and model evaluation.
  • Rapid Prototyping: Quickly test and iterate on vision-AI applications.

Installation

For detailed setup instructions, please refer to the Getting Started Guide.

Conclusion

Jac Vision represents a complete solution for working with Vision-Language Models. By combining powerful backend capabilities with an intuitive frontend interface, it democratizes access to advanced AI technologies.

Contributors

For more details and contributions, visit the Jac Vision GitHub Repository.