Google is using Gemini AI to help its robots navigate and complete tasks. In a research paper, the DeepMind robotics team explained that Gemini 1.5 Pro’s long context window lets users interact with its RT-2 robots more easily through natural language commands.
The method works like this: researchers record a video tour of a designated area, such as a home or office, then use Gemini 1.5 Pro to make the robot “watch” the video and learn about the environment. Having seen the space, the robot can then follow commands based on its visual perception. For instance, if a user shows the robot their phone and asks where they can charge it, the robot can guide them to a power outlet. DeepMind says its Gemini-powered robot successfully completed over 90 percent of more than 50 user instructions given across a large operating area.
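DeepMind hasn’t published its robot stack, but the “watch a tour video, then answer questions about the space” step maps onto the long-context multimodal prompting that Gemini 1.5 Pro exposes through Google’s public API. The sketch below shows that general pattern with the google-generativeai Python SDK; the file name, the prompt wording, and the step of turning the model’s answer into actual robot navigation are assumptions for illustration, not DeepMind’s implementation.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder, not a real key

# Upload the recorded tour video via the File API, then wait for the
# service to finish processing it before prompting.
video_file = genai.upload_file(path="office_tour.mp4")  # hypothetical file
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

# Gemini 1.5 Pro's long context window lets the whole tour fit in one prompt.
model = genai.GenerativeModel(model_name="gemini-1.5-pro")
response = model.generate_content([
    video_file,
    "Based on this tour, where could someone charge a phone? "
    "Describe the location so a robot could navigate there.",
])
print(response.text)  # a real system would still have to ground this in robot actions
```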
The researchers also found that Gemini 1.5 Pro helps the robots plan how to carry out instructions, not just navigate. For example, when a user asks the robot whether their favorite drink, Coke, is available, Gemini knows the robot should check the refrigerator for Cokes and then report back to the user. DeepMind plans to investigate these results further.
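Continuing the same hypothetical sketch (reusing `model` and `video_file` from above), a planning-style query might look like the following; the prompt wording and plan format are assumptions, since the paper’s actual planning interface isn’t shown here.

```python
# Hypothetical planning prompt: ask the model to decompose a user request
# into the steps the robot should take, rather than just a destination.
plan = model.generate_content([
    video_file,
    "A user asks: 'Is there any Coke left?' "
    "List, step by step, where the robot should go and what it should "
    "check before answering the user.",
])
print(plan.text)  # e.g. "1. Go to the kitchen. 2. Open the fridge. 3. ..."
```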
Google’s video demos are impressive, though visible cuts after the robot acknowledges each request conceal the 10 to 30 seconds it takes to process the instructions. It may be a while before we have more sophisticated robots in our homes, but at least these ones might be able to help us find our misplaced keys or wallets.