Teaching Machines to See: Vision-Language Models Made Efficient with SmolVLM
Andrés Marafioti | Friday, April 25, 2025 | Gurten Pavillon
Description
Abstract: Large Language Models (LLMs) have transformed how machines understand and generate text. But what happens when we teach them to see?
Vision-Language Models (VLMs) combine visual and textual understanding, enabling machines to interpret and reason about the world in a multimodal way. In this talk, we’ll explore how VLMs work, demystify the mechanics behind their vision capabilities, and discuss why making them efficient matters. Along the way, I’ll introduce SmolVLM, our state-of-the-art compact VLM, and share insights into how we optimized it for on-device applications without compromising performance.
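To make the "compact VLM on device" idea concrete, here is a minimal inference sketch using the Transformers library. It assumes the publicly released HuggingFaceTB/SmolVLM-Instruct checkpoint; the image path and generation settings are illustrative placeholders, not details from the talk:

```python
# Minimal SmolVLM inference sketch.
# Assumes: transformers, torch, pillow installed, and the public
# HuggingFaceTB/SmolVLM-Instruct checkpoint (not specified in the abstract).
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the memory footprint small
)

image = Image.open("example.jpg")  # hypothetical local image

# Build a chat-style prompt with an image placeholder, then encode text + pixels.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image briefly."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Generate a short caption; a compact VLM like this fits on modest hardware.
generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```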
Whether you’re new to multimodal AI or a seasoned expert, you’ll leave with a deeper understanding of how machines see, and how they can do it smarter.