Google announces open API access to Gemini 1.5 Pro

Now available in more than 180 countries

Adds native audio (speech) understanding, the File API, system instructions, JSON mode, and more

The Gemini model can now process audio input directly without having to convert the audio to text first.

New use cases unlocked: audio and video modalities

Gemini 1.5 Pro expands input modalities to include understanding audio (speech) in the Gemini API and Google AI Studio.

In addition, Gemini 1.5 Pro can reason simultaneously over the images (frames) and the audio (speech) of videos uploaded to Google AI Studio. The model therefore understands not only a video's visual content (such as image frames) but also its audio track (such as dialogue and background music).
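At the REST level, an audio prompt is just another part of a generateContent request. The sketch below builds such a request body in Python; the field names follow the public Gemini REST API, while the model name in the comment and the audio bytes are illustrative placeholders (larger files would be uploaded via the File API and referenced by URI instead).

```python
import base64
import json

# Placeholder bytes; in practice you would read a real recording from disk,
# or upload large files via the File API and reference them by URI.
audio_bytes = b"\x00\x01\x02"

# Build a generateContent request body with an inline audio part.
request_body = {
    "contents": [{
        "parts": [
            {"text": "Summarize this recording."},
            {"inline_data": {
                "mime_type": "audio/mp3",
                "data": base64.b64encode(audio_bytes).decode("ascii"),
            }},
        ]
    }]
}

# This payload would be POSTed to (model name is an assumption):
# https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-pro-latest:generateContent?key=YOUR_API_KEY
print(json.dumps(request_body, indent=2))
```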

Application potential includes:

1. Multimodal understanding: Gemini 1.5 Pro combines the visual and audio information in a video for more comprehensive content understanding. For example, it can identify and interpret video content more accurately by analyzing the scenes and objects in video frames while also listening to the dialogue and sounds.
2. Content indexing and search: With a deep understanding of both a video's images and its audio, Gemini 1.5 Pro can help build more detailed content indexes, letting users search video content by visual and auditory information alike.
3. Enhanced interactive experiences: A comprehensive understanding of video enables richer interactive applications, such as automatically generated video summaries, content-based recommendation systems, or interactive learning and entertainment experiences.
4. Video content analysis: Gemini 1.5 Pro can be applied to video surveillance, content moderation, sentiment analysis, and similar scenarios. By understanding video and audio together, it can automatically identify key events, emotional tendencies, or specific content tags in videos.
5. Creative content generation: The same combined understanding lets Gemini 1.5 Pro assist in content creation, such as automatically generating subtitles or dubbing for a video, or creating animated videos from a given script.

Gemini API improvements

1. System instructions: Guide model responses with system instructions, now available in Google AI Studio and the Gemini API. Define roles, formats, goals, and rules to steer the model's behavior for specific use cases.
2. JSON mode: Instructs the model to output only JSON objects. This mode supports extracting structured data from text or images. You can start with cURL; Python SDK support is coming soon.
3. Function calling improvements: You can now select a mode to constrain the model's output and improve reliability. Choose between plain text, a function call, or just the function itself.
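The three improvements above can be sketched as fields of a single generateContent request body. The field names follow the public Gemini REST API, but the payload is purely illustrative: the `get_weather` function is hypothetical, and a real request would typically use either JSON mode or function calling rather than both at once.

```python
import json

request_body = {
    # 1. System instructions: define role, format, goals, and rules.
    "system_instruction": {
        "parts": [{"text": "You are a concise assistant. Answer in one sentence."}]
    },
    "contents": [{
        "role": "user",
        "parts": [{"text": "What's the weather in Paris?"}],
    }],
    # 2. JSON mode: ask the model to respond with JSON only.
    "generationConfig": {"response_mime_type": "application/json"},
    # 3. Function calling: declare a (hypothetical) tool the model may call.
    "tools": [{
        "function_declarations": [{
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }]
    }],
    # Function calling mode: AUTO (text or call), ANY (force a call), NONE.
    "tool_config": {"function_calling_config": {"mode": "ANY"}},
}

print(json.dumps(request_body, indent=2))
```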

New embedding model improves performance

Starting today, developers can access Gemini’s next-generation text embedding model through the Gemini API. The new model, text-embedding-004 (text-embedding-preview-0409 in Vertex AI), achieves stronger retrieval performance on the MTEB benchmarks than all existing models with comparable dimensions.
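As a sketch, a request to the embedContent endpoint carries the text to embed, and retrieval then ranks documents by cosine similarity between embedding vectors. The request shape follows the public REST API; the toy 3-dimensional vectors below stand in for real embeddings.

```python
import json
import math

# Request body for models/text-embedding-004:embedContent;
# the response would carry an "embedding" vector for this text.
request_body = {
    "content": {"parts": [{"text": "What is the Gemini File API?"}]}
}

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real query/document embeddings.
query_vec = [0.1, 0.9, 0.2]
doc_vecs = {
    "doc_about_file_api": [0.1, 0.8, 0.3],
    "doc_about_cooking": [0.9, 0.1, 0.0],
}

# Retrieval: pick the document closest to the query in embedding space.
best = max(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]))
print(best)  # → doc_about_file_api
```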

Details: https://goo.gle/3xxaUH1
Audio understanding: https://github.com/google-gemini/cookbook/blob/main/quickstarts/Audio.ipynb
