Google's annual I/O developer conference took place on May 14th. Here are all the significant takeaways from the event, which featured a *lot* of AI.
Whilst some of Google's thunder might have been stolen by OpenAI's GPT-4o announcement, the company had some important AI-based announcements of its own during its I/O event.
The most noteworthy update to the Gemini series of multimodal AI models is Gemini 1.5 Flash. The new model has been designed for high-volume, high-frequency tasks at scale, making it more cost-effective for developers to run.
Gemini 1.5 Flash is a lighter-weight AI model than 1.5 Pro, but still capable of multimodal reasoning across high volumes of data. According to Google, Gemini Flash excels at summarisation, chat applications, video and image captioning, and the extraction of data from long documents, amongst other things.
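For developers, the model is accessed through the Gemini API. As a rough illustration, a minimal sketch using Google's `google-generativeai` Python SDK might look like the following; the summarisation prompt and input file are illustrative placeholders, not anything from the announcement:

```python
# Minimal sketch: summarising a long document with Gemini 1.5 Flash
# via the google-generativeai Python SDK. The file name and prompt
# are hypothetical placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # key obtained from Google AI Studio

model = genai.GenerativeModel("gemini-1.5-flash")

long_document = open("report.txt").read()  # e.g. a lengthy meeting transcript
response = model.generate_content(
    f"Summarise the key points of the following document:\n\n{long_document}"
)
print(response.text)
```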
According to Google, 1.5 Flash was trained by 1.5 Pro via a process called distillation, which transfers the most essential knowledge and skills from the larger model to the smaller one.
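Google hasn't published the details of its recipe, but distillation in general is a well-established technique: the smaller "student" model is trained to match the softened output distribution of the larger "teacher". A generic PyTorch sketch, where `teacher` and `student` are hypothetical stand-ins for a large and a small model, looks something like this:

```python
# Generic knowledge-distillation sketch (not Google's actual process):
# the student is penalised for diverging from the teacher's softened
# output distribution via a KL-divergence loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then measure how far
    # the student's predictions drift from the teacher's.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature**2

# Inside a training loop (teacher frozen, student learning):
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits)
```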
Gemini 1.5 Pro has also undergone improvements, extending its context window to 2 million tokens, a significant step that allows the model to take in far more text, code, audio and video in a single prompt. Logical reasoning and planning, code generation, multi-turn conversation abilities, and audio and visual understanding have all undergone data and algorithmic advances.
1.5 Pro also now offers more control over responses for individual use cases, such as defining the response style and personality for a chatbot, or automating workflows through multiple function calls. 1.5 Pro can also now reason across images and audio uploaded via Google AI Studio.
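To give a feel for what that looks like in practice, here is a hedged sketch of function calling plus a system persona using the `google-generativeai` SDK. The `create_calendar_event` function is a hypothetical workflow step of our own invention, not a Google example; only its name, parameters and docstring are exposed to the model:

```python
# Sketch: Gemini 1.5 Pro calling a developer-supplied function.
# create_calendar_event is a hypothetical example; a real workflow
# would talk to an actual calendar service.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

def create_calendar_event(title: str, date: str, time: str) -> str:
    """Creates a calendar event and returns a confirmation string."""
    return f"Booked '{title}' on {date} at {time}"

model = genai.GenerativeModel(
    "gemini-1.5-pro",
    tools=[create_calendar_event],
    system_instruction="You are a brisk, friendly scheduling assistant.",
)
chat = model.start_chat(enable_automatic_function_calling=True)

reply = chat.send_message("Book 'Dentist' on 2024-05-21 at 15:00")
print(reply.text)  # the model invokes the function, then reports the result
```

With automatic function calling enabled, the SDK runs the function the model requests and feeds the result back, so multi-step workflows can be chained from a single conversational request.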
Google will be rolling out Gemini 1.5 Pro within Google Workspace. As an example, Google showed how 1.5 Pro can keep you abreast of an ongoing subject in your inbox, such as goings-on at your kid's school. Gemini will be able to summarise emails from the school, analyse attachments such as PDFs, and pull out the key takeaways. It will also be able to analyse Google Meet recordings, summarising the important information should you be unable to attend a meeting, and even draft email replies for you.
Lastly, Gemini Nano, the on-device AI system, is undergoing improvements to provide much more useful capabilities on your phone. For example, features such as being able to "ask this video" will be implemented, allowing users to quickly find relevant points within a video. The "ask this..." function will work across different media types, including PDF files.
Beginning with Pixel, Gemini Nano will now be able to process more information in context, such as sights, sounds and spoken language. Later in the year, TalkBack will be able to better describe the content of images for people who are blind or have low vision. Google is keen to point out that because these capabilities run on-device, they don't require a network connection to function. Gemini Nano will also be able to identify scam callers and alert you to unusual requests during a phone call.
The demonstration of GPT-4o's near-instant response times and its ability to recognise and describe the world around it was impressive. But Google has been hard at work creating an AI assistant that is every bit as good. While it's a work in progress, Project Astra has produced an impressive demonstration, which you can view below.
The way the system identifies its surroundings and the objects within them is a feat in and of itself, but it's the speed at which it does so that truly impresses. For example, Astra is asked, via a phone camera, to identify an object that makes sound. It notices that there's a speaker on the desk and calls it out. The user then draws an arrow to a part of the speaker and asks Astra what it does. It quickly and correctly identifies it as the tweeter, and describes what it does.
There's no word yet on when an AI assistant like Project Astra will make it into a smartphone, but given the speed at which things are developing, it's unlikely to be a long wait.
Right, now for the image-based stuff. Earlier this year, OpenAI's Sora gave us the first glimpse of what AI could truly do when it came to video generation. Now, Google has announced its own tool, called Veo.
Not only can Veo generate video at 1080p resolution, it can also create content longer than a minute in duration. The results are impressive. Not only are the demonstrations incredibly realistic, the system also offers users a lot of control over the output. For example, if you create a drone shot over a tropical coastline, you can ask Veo to modify the scene by adding in some kayaks. The system will simply add them to the sequence without changing the look of the overall scene. Users can also mask specific areas of the video and ask for changes to the selected portion of the image. It's this level of control that has been lacking in previous AI video generation software. Additionally, users can feed Veo an image and ask it to base the results on it.
Even more usefully, Veo can be given a series of prompts to create a sequence of events. For example, the video below was created from the following prompts:
"A fast-tracking shot through a bustling dystopian sprawl with bright neon signs, flying cars and mist, night, lens flare, volumetric lighting.
A fast-tracking shot through a futuristic dystopian sprawl with bright neon signs, starships in the sky, night, volumetric lighting.
A neon hologram of a car driving at top speed, speed of light, cinematic, incredible details, volumetric lighting.
The cars leave the tunnel, back into the real world city Hong Kong."
Google says it has developed "cutting-edge latent diffusion transformers", which it claims prevent the odd jumps, morphing, and flicker that sometimes affect characters and objects in previous AI models, thus helping to ensure much more consistency.
All content created with Veo is watermarked via the SynthID system to help mitigate privacy, copyright and bias risks.
Imagen 3 is Google's latest AI image generation system, which will come in three different variations optimised for different types of visuals.
The system's ability to understand prompts has undergone lots of improvements, enabling it not only to parse natural language more easily, but also to pick up on small details in longer prompts. This allows Imagen 3 to better understand descriptions of composition, camera angle or image style.
Google claims that Imagen 3 can better render very fine detail in textures, such as the wrinkles on a hand, or knitted wool. A more significant development is that Imagen 3 can accurately render text, something that has eluded most AI image generators until now.
Imagen 2 has received updates that users can begin using right now. The most important of these is the ability to edit an image: by brushing over the area you want to change and describing the modification you want, you can have the AI alter the existing image to match your request.
Just like Veo's output, all images created with Imagen 3 are digitally watermarked with SynthID, making them identifiable as AI-generated. Unfortunately, ImageFX, which hosts Imagen 3 (and currently Imagen 2), and VideoFX, where Veo will be available for testing, are only available in a limited number of countries outside the US. Imagen 3 will also be rolled out to Vertex AI.