Eye, Robot: Researchers Give Machines a New Perspective with Active Vision
When Iman Soltani worked in the automotive industry, he studied assembly floors and noticed that each automated task had its specific robotic design.
He also noticed that while a robot would need multiple cameras affixed to several locations for the best possible sightline, an operator on the assembly floor would move their head and neck to get the best view as they manipulated components.
When he became an assistant professor of mechanical and aerospace engineering at the University of California, Davis, in 2020, these inefficiencies got Soltani thinking. What if, in addition to being trained on how to manipulate objects to achieve a task, the robot could also be trained on how to adaptively adjust its line of sight while performing the task?
Active Vision and Imitation Learning
In a new paper, Soltani and his team of researchers explore using active vision, or AV, in which robots can adapt and control their own cameras or sensors to gather the most relevant information to complete a task successfully. In this case, the robot learns from a human (i.e., imitation learning) how to change its line of sight to perform tasks like inserting a peg into a socket or pouring a sphere from one test tube into a second test tube.
These tasks are simple for humans, who adjust their viewpoints automatically and instinctively. This skill, however, can be notoriously difficult to program into a robot.
"Some of the things that we take for granted as humans are so simple that we don't even pay attention to how we achieve them," Soltani said. "We have evolved over millions of years to achieve certain capabilities. When you start to think in terms of robotics, you realize that they're quite complex."
The researchers set up a robot system comprising two arms controlled by the user's hands and a third robot arm with an AV camera controlled through a virtual reality, or VR, headset. The user, or trainer, receives a live feed from the AV camera through the VR headset, and the movements of the headset control the AV arm.
In this study, the team ran five simulation tasks and one real-world task with varying levels of difficulty. Some scenarios were specifically designed to require the robot to actively seek the best possible perspectives to be able to complete the task.
To generate data and train the robot, the researchers ran through each scenario adjusting perspective through the AV camera and controlling the manipulation robotic arms to complete the tasks, which included peg insertion, slot insertion, hook package, pour test tube, thread needle and occluded, or obstructed, insertion. The robots then would attempt to imitate how the user changed their perspective to complete the tasks independently.
Robot See, Robot Do
The researchers found that for some of the tasks, which were designed to not need varying perspectives, although AV did not improve things, it did not hurt the performance either, indicating that conventionally used extra camera feeds were unnecessary. In other words, it is possible to achieve the same performance with a single camera when it is adaptively positioned by the robot.
For the more difficult scenarios (thread needle and pour test tube) and the real-world task (occluded insertion), AV performed better than the setup that included multiple fixed cameras. For the thread needle test, for instance, the hole is not clearly visible from the static cameras and the AV camera is able to find a good perspective of the hole to insert the needle through.
The findings led to new questions about whether the change of perspective happens simultaneously with the manipulation or if the perspective is determined first and then the manipulation happens. This question, says Soltani, will inform the next step in this research: making the robot choose between moving the hands or head or moving both in tandem.
"We think that in more complex scenarios, most likely, we don't want to keep moving our heads while trying to complete a sensitive task with our hands. In the next phase of this research, we plan to penalize the robot and say, 'Okay, if you're moving your head and neck too much, you're not supposed to manipulate things,' because, intuitively speaking, when the head and neck are constantly moving, it's difficult to have fine control over what's happening in the environment."
Soltani and his team are also exploring the granular movements of the operator's gaze, attempting to understand where the operator is focusing at each instant. This information, Soltani posits, could teach robots to move to achieve the best perspectives, as well as which part of the camera images it should focus on to successfully complete the task.
Pushing Robotics Forward
Soltani will continue to explore imitation learning, as well as other machine learning methods to help the robot reason why it might need to change perspective, including large language models, reinforcement learning and more classic information theoretic-based methods in which the objective of camera positioning is to maximize information gain.
Soltani believes that active vision is an aspect of robotics that can help push the field forward and closer to a future in which robots are helping on the assembly floor or at home — household robots could be learning to load a dishwasher and take out the trash not too long from now.
"We're super excited about this research because there's really not much that has been done in this area," said Soltani. "We have so many questions that we can address through all the different tools that are at our disposal these days, and very soon, robotics is going to unlock its full potential."