Google’s Robotics Research team has developed a new robotics and language model, PaLM-E, that combines the power of large language models with data from robot sensors. PaLM-E is a generalist model, capable of performing both vision and language tasks, as well as controlling robots and learning efficiently.
How PaLM-E works
PaLM-E works by injecting observations into a pre-trained language model, converting sensor data, such as images, into representations that are processed the same way as natural language words. Language models rely on a mechanism to represent text mathematically in a way that neural networks can process. This is achieved by dividing the text into “tokens” that encode the words and associating each token with a high-dimensional vector. The language model can then apply mathematical operations on the resulting sequence of vectors to predict the next most likely word token. By feeding the newly predicted word back into the input, the language model can iteratively generate longer text.
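As a rough illustration of this loop, the toy sketch below (PyTorch; the model, sizes, and names are illustrative stand-ins, not PaLM-E's actual architecture or API) embeds token IDs into vectors, predicts the next token from the vector sequence, and feeds each prediction back into the input.

```python
import torch
import torch.nn as nn

# Toy illustration of the autoregressive loop described above.
# All sizes and names are illustrative; they are not PaLM-E's real configuration.
VOCAB_SIZE, EMBED_DIM = 1000, 64

class ToyLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)      # token id -> vector
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(EMBED_DIM, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(EMBED_DIM, VOCAB_SIZE)       # vector -> next-token logits

    def forward(self, token_ids):
        x = self.embed(token_ids)        # (batch, seq, dim): the sequence of vectors
        x = self.backbone(x)
        return self.lm_head(x[:, -1])    # logits for the next token only

def generate(model, prompt_ids, steps=10):
    """Greedy autoregressive decoding: predict a token, append it, repeat."""
    ids = prompt_ids.clone()
    for _ in range(steps):
        next_id = model(ids).argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)   # feed the prediction back into the input
    return ids

model = ToyLanguageModel()
print(generate(model, torch.tensor([[1, 2, 3]])))
```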
The inputs to PaLM-E are text and other modalities, such as images, robot states, and scene representations, interleaved in arbitrary order into what we call “multimodal sentences”. The output is text generated auto-regressively by PaLM-E, which can be an answer to a question or a sequence of decisions in the form of text.
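A “multimodal sentence” can be pictured as an ordered mix of text segments and sensor observations. The minimal sketch below shows one such interleaved input; the dataclasses and placeholder markers are assumptions made for illustration, not PaLM-E's actual input format.

```python
from dataclasses import dataclass
from typing import List, Union
import numpy as np

@dataclass
class ImageObservation:
    pixels: np.ndarray           # e.g. an RGB camera frame from the robot

@dataclass
class RobotState:
    joint_positions: np.ndarray  # e.g. the robot's current joint angles

# A "multimodal sentence": text and observations in arbitrary, interleaved order.
multimodal_sentence: List[Union[str, ImageObservation, RobotState]] = [
    "Given the scene",
    ImageObservation(pixels=np.zeros((224, 224, 3), dtype=np.uint8)),
    "and the current robot state",
    RobotState(joint_positions=np.zeros(7)),
    ", bring me the bag of chips from the drawer.",
]
```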
An embodied and multimodal model
PaLM-E is based on the PaLM language model and the ViT-22B architecture for vision. The idea behind PaLM-E is to train encoders that convert a variety of inputs into the same embedding space as natural language words. These continuous inputs are transformed into something that resembles “words”, although they do not necessarily form discrete sets. Since word and image embeddings have the same dimensionality, they can be fed into the language model together.
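The sketch below illustrates the projection idea: a stand-in image encoder (here a tiny MLP rather than ViT-22B) maps an image to a handful of vectors with the same dimensionality as the word embeddings, so both can be concatenated into one sequence for the language model. Sizes and names are illustrative assumptions, not PaLM-E's real configuration.

```python
import torch
import torch.nn as nn

EMBED_DIM = 64  # illustrative; must match the language model's token embedding size

class ImageEncoder(nn.Module):
    """Stand-in for a vision encoder: maps an image to vectors in the word-embedding space."""
    def __init__(self, num_image_tokens=4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
        self.project = nn.Linear(256, num_image_tokens * EMBED_DIM)
        self.num_image_tokens = num_image_tokens

    def forward(self, image):
        feats = self.backbone(image)                             # (batch, 256)
        out = self.project(feats)                                # (batch, n * dim)
        return out.view(-1, self.num_image_tokens, EMBED_DIM)    # continuous "word-like" vectors

# Interleave image embeddings with word embeddings before the language model.
word_embed = nn.Embedding(1000, EMBED_DIM)
text_ids = torch.tensor([[5, 17, 42]])                # hypothetical token ids for a short phrase
image = torch.zeros(1, 3, 32, 32)                     # dummy image

text_vecs = word_embed(text_ids)                      # (1, 3, EMBED_DIM)
image_vecs = ImageEncoder()(image)                    # (1, 4, EMBED_DIM)
sequence = torch.cat([image_vecs, text_vecs], dim=1)  # one sequence, fed to the language model
```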
PaLM-E is a generalist model, designed for robotics but also capable of vision and language tasks. PaLM-E can describe images, detect objects, classify scenes, quote poetry, solve mathematical equations, and generate code. It combines the power of language-based learning with the ability to control robots and learn efficiently.
Knowledge transfer
PaLM-E offers a new way of training generalist models, which is achieved by combining robotics and vision and language tasks through a common representation: taking images and text as input and producing text as output. A key result is that PaLM-E achieves positive knowledge transfer from both the vision and language domains, improving the robot’s learning efficiency.
The positive knowledge transfer from vision and language tasks to robotics allows PaLM-E to address a large set of robotics, vision and language tasks simultaneously, without performance degradation compared to training individual models on single tasks. Vision and language data significantly improve robot task performance. This transfer allows PaLM-E to learn robotics tasks efficiently in terms of the number of examples needed to solve a task.
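Concretely, this kind of co-training can be pictured as sampling every training example, whether a visual question, a caption, or a robot plan, in the same (multimodal input, text output) format from a weighted mixture. The sketch below uses made-up dataset names, examples, and weights purely for illustration; it is not PaLM-E's actual training mixture.

```python
import random

# Illustrative co-training mixture: every task is cast as (multimodal input -> text output),
# so vision-language data and robot data can be trained in one model.
datasets = {
    "vqa":        [("<image> What is on the counter?", "a bag of chips")],
    "captioning": [("<image> Describe the scene.", "a kitchen counter with a drawer")],
    "robot":      [("<image> <state> Bring me the chips.", "1. go to drawer 2. open drawer 3. pick chips")],
}
weights = {"vqa": 0.4, "captioning": 0.4, "robot": 0.2}

def sample_batch(batch_size=4):
    """Sample training examples across domains in proportion to the mixture weights."""
    names = list(datasets)
    return [
        random.choice(datasets[random.choices(names, weights=[weights[n] for n in names])[0]])
        for _ in range(batch_size)
    ]

print(sample_batch())
```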
The results show that PaLM-E can effectively address a diverse set of robotics and vision-language tasks. When PaLM-E is tasked with making decisions for a robot, it is paired with a low-level language-conditioned policy that translates its text output into low-level robot actions.
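A hypothetical control loop along these lines is sketched below: the high-level model proposes the next step as text from the current observation, and a separate low-level, language-conditioned policy turns that step into motor commands. The function and method names (`plan_next_step`, `get_camera_image`, `apply_action`) are placeholders, not real APIs.

```python
def control_loop(planner, low_level_policy, robot, instruction, max_steps=20):
    """Hypothetical sketch of pairing a high-level planner with a low-level policy."""
    for _ in range(max_steps):
        observation = robot.get_camera_image()
        # High level: multimodal input -> the next step as text,
        # e.g. "pick up the green block" or "done".
        step = planner.plan_next_step(image=observation, instruction=instruction)
        if step == "done":
            break
        # Low level: language-conditioned policy -> joint and gripper commands.
        action = low_level_policy(observation, step)
        robot.apply_action(action)
```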
Some examples
In one example, PaLM-E controls a mobile robot in a kitchen to pick up a bag of chips. In another example, the robot is tasked with grabbing a green block. Although the robot has not seen that block before, PaLM-E generates a step-by-step plan that generalizes beyond the robot’s training data.
In another environment, the PaLM-E model solves high-level, long-horizon tasks, such as “sort the blocks by colors into corners”, directly from images, producing a textually represented sequence of actions. PaLM-E also demonstrates the ability to generalize to new tasks not seen during training, such as pushing red blocks toward the coffee cup.
Undoubtedly, PaLM-E represents a significant advance in the field of robotics, combining the capacity of language models with the transfer of knowledge from vision and language tasks to address a wide range of robotic tasks. This multimodal and embodied approach also has the potential to unify tasks that previously seemed separate. The ability of PaLM-E to perform language, vision, and robotics tasks efficiently, and to generalize to new unseen tasks, has important implications for the future of robotics and multimodal learning.