Distilled Image Captioning: 10x Smaller

Knowledge distillation in action: this is a ~15.8M-parameter student model distilled from a CLIP-based teacher (157.8M parameters), running entirely in your browser.

What is knowledge distillation? The student model is trained to mimic the outputs of a much larger CLIP-based teacher model. At roughly 10x compression (157.8M → 15.8M parameters), this strikes a good balance between size and caption quality.
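A minimal sketch of the core distillation objective, in plain JavaScript since the demo itself is pure JS. The function names, the temperature value, and the use of a temperature-softened KL divergence are illustrative assumptions, not this model's exact training code.

```javascript
// Softmax with a temperature: higher T flattens the distribution,
// exposing the teacher's "dark knowledge" about near-miss classes.
function softmax(logits, temperature = 1.0) {
  const scaled = logits.map(z => z / temperature);
  const max = Math.max(...scaled);              // subtract max for stability
  const exps = scaled.map(z => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Distillation loss: KL(teacher || student) on softened distributions,
// sum_i p_t[i] * log(p_t[i] / p_s[i]). Zero when the student matches exactly.
function distillationLoss(teacherLogits, studentLogits, temperature = 2.0) {
  const p = softmax(teacherLogits, temperature);
  const q = softmax(studentLogits, temperature);
  return p.reduce((loss, pi, i) => loss + pi * Math.log(pi / q[i]), 0);
}
```

Minimizing this loss over many images pushes the small student toward the teacher's full output distribution, which carries more signal than hard labels alone.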
Model                      Parameters   Size      Encoder
Teacher (CLIP ViT-B/32)    157.8M       ~600 MB   Vision Transformer
Student (this demo)        15.8M        ~60 MB    ResNet-style CNN

The student uses a ResNet-style CNN encoder with residual blocks, trained to mimic the teacher's outputs.
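The residual blocks in the student's encoder can be sketched as follows. This is a simplified illustration on a 1-D feature vector, with hypothetical function names; the real encoder convolves 2-D feature maps.

```javascript
// Element-wise ReLU on a feature vector.
function relu(v) { return v.map(x => Math.max(0, x)); }

// A residual block: compute F(x) with some layer, add the input back
// (the skip connection), then apply the nonlinearity.
// `layer` is any function mapping a vector to a same-sized vector.
function residualBlock(x, layer) {
  const fx = layer(x);                      // transformed features F(x)
  const sum = fx.map((v, i) => v + x[i]);   // skip connection: F(x) + x
  return relu(sum);
}
```

The skip connection lets gradients flow directly through the identity path, which is what makes deeper CNN encoders like this one trainable.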

Compare with the original CLIP-based model to see the quality difference.


Student Model: ~15.8M parameters | Image: 224x224 | ResNet-style CNN Encoder | 4-layer Decoder | Pure JavaScript