Google DeepMind announced Gemini Embedding 2, the first embedding model to natively embed text, images, video, audio, and documents into a unified semantic space across 100+ languages.
The model generates 3,072-dimensional vectors by default, with the option to request smaller output dimensions. It accepts text up to 8,192 tokens, video up to 128 seconds, audio up to 80 seconds, and PDFs up to 6 pages.
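As a rough illustration of those input limits, the following Python sketch checks an input against the figures quoted above before it is sent for embedding. The `MediaInput` type, the `within_limits` helper, and the constant names are hypothetical and are not part of any Google SDK. Images are left out because the text states no limit for them.

```python
from dataclasses import dataclass

# Limits as quoted in the text above; the names are illustrative assumptions.
LIMITS = {
    "text": 8192,   # tokens
    "video": 128,   # seconds
    "audio": 80,    # seconds
    "pdf": 6,       # pages
}


@dataclass
class MediaInput:
    kind: str    # one of the keys in LIMITS
    size: float  # tokens, seconds, or pages, depending on kind


def within_limits(item: MediaInput) -> bool:
    """Return True if the input fits the quoted limit for its kind."""
    if item.kind not in LIMITS:
        raise ValueError(f"no limit known for kind: {item.kind!r}")
    return 0 < item.size <= LIMITS[item.kind]
```

A caller could use such a check to split a long video or a long PDF into chunks that each fit before requesting embeddings.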
On video retrieval benchmarks, it scores 68.8 versus 60.3 for Amazon Nova 2. Early adopters report latency reductions of 70% compared to multi-model pipelines.
No other API endpoint natively handles video and audio embeddings, positioning Google uniquely in multimodal search and RAG applications.
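To make the idea of a unified semantic space concrete, here is a minimal Python sketch of cross-modal retrieval: a text query vector is compared with vectors for a video, an audio file, and a PDF using cosine similarity. The three-dimensional vectors, the file names, and the `top_k` helper are made up for illustration. Real vectors would come from the embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the ids of the k items most similar to the query."""
    scored = [(cosine_similarity(query, vec), item_id)
              for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Toy vectors standing in for embeddings of items of different modalities.
index = {
    "clip.mp4": [0.9, 0.1, 0.0],
    "podcast.mp3": [0.1, 0.9, 0.1],
    "report.pdf": [0.7, 0.2, 0.6],
}
query = [1.0, 0.0, 0.1]  # stands in for an embedded text query

results = top_k(query, index, k=2)
```

Because every item lives in the same vector space, a single similarity function and a single index can serve all modalities, which is the property that removes the need for separate per-modality pipelines.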
“Multimodal embeddings will transform how we build search and retrieval systems.”
— Jeff Dean, Chief Scientist, Google DeepMind
5
Modalities supported
98.2%
MTEB benchmark
50%
Cost reduction vs v1