Distilled Image Captioning: 10x Smaller

Knowledge distillation in action: this is a ~15.8M-parameter student model distilled from a CLIP-based teacher (157.8M parameters), running entirely in your browser.

What is knowledge distillation? The student model is trained to mimic the outputs of a much larger CLIP-based teacher model. At roughly 10x compression (157.8M → 15.8M parameters), this strikes a good balance between size and caption quality.
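A minimal sketch of the core distillation objective, in plain JavaScript since the demo itself is pure JS. The function names, the temperature value, and the use of a temperature-softened KL divergence are illustrative assumptions, not this model's exact training code.

```javascript
// Softmax with a temperature: higher T flattens the distribution,
// exposing the teacher's "dark knowledge" about near-miss classes.
function softmax(logits, temperature = 1.0) {
  const scaled = logits.map(z => z / temperature);
  const max = Math.max(...scaled);              // subtract max for stability
  const exps = scaled.map(z => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Distillation loss: KL(teacher || student) on softened distributions,
// sum_i p_t[i] * log(p_t[i] / p_s[i]). Zero when the student matches exactly.
function distillationLoss(teacherLogits, studentLogits, temperature = 2.0) {
  const p = softmax(teacherLogits, temperature);
  const q = softmax(studentLogits, temperature);
  return p.reduce((loss, pi, i) => loss + pi * Math.log(pi / q[i]), 0);
}
```

Minimizing this loss over many images pushes the small student toward the teacher's full output distribution, which carries more signal than hard labels alone.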
Model                      Parameters   Size      Encoder
Teacher (CLIP ViT-B/32)    157.8M       ~600 MB   Vision Transformer
Student (this demo)        15.8M        ~60 MB    ResNet-style CNN

The student uses a ResNet-style CNN encoder with residual blocks, trained to mimic the teacher's outputs.
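The residual blocks in the student's encoder can be sketched as follows. This is a simplified illustration on a 1-D feature vector, with hypothetical function names; the real encoder convolves 2-D feature maps.

```javascript
// Element-wise ReLU on a feature vector.
function relu(v) { return v.map(x => Math.max(0, x)); }

// A residual block: compute F(x) with some layer, add the input back
// (the skip connection), then apply the nonlinearity.
// `layer` is any function mapping a vector to a same-sized vector.
function residualBlock(x, layer) {
  const fx = layer(x);                      // transformed features F(x)
  const sum = fx.map((v, i) => v + x[i]);   // skip connection: F(x) + x
  return relu(sum);
}
```

The skip connection lets gradients flow directly through the identity path, which is what makes deeper CNN encoders like this one trainable.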

Compare with the original CLIP-based model to see the quality difference.


Student Model: ~15.8M parameters | Image: 224x224 | ResNet-style CNN Encoder | 4-layer Decoder | Pure JavaScript