Physical Intelligence Unveils Breakthrough AI Model That Surprised Its Own Researchers
By admin | Apr 16, 2026 | 7 min read
A San Francisco robotics startup called Physical Intelligence, which has operated quietly for two years while garnering significant attention in the Bay Area's AI scene, released new research on Thursday. The findings demonstrate that its latest model can guide robots to complete tasks they were never specifically trained for—a result that even the company's own researchers admit was unexpected.
The new model, named π0.7, is described by the company as an early but significant step toward the long-term objective of a general-purpose robot brain: a system that could be pointed at an unfamiliar job, given instructions in plain language, and execute it successfully. If these results withstand scrutiny, they could indicate that robotic AI is nearing a turning point similar to the one large language models reached, where abilities begin to improve in ways that exceed what the training data alone would suggest.
The central claim of the paper is compositional generalization: the capacity to blend skills learned in different situations to tackle entirely new problems. Traditionally, robot training has relied on a method akin to rote memorization—gathering data for a specific task, training a specialized model on that data, and repeating the process for each new task. Physical Intelligence asserts that π0.7 breaks this pattern.
“Once it crosses that threshold where it goes from only doing exactly the stuff that you collect the data for to actually remixing things in new ways,” explains Sergey Levine, a co-founder of Physical Intelligence and a UC Berkeley professor specializing in AI for robotics, “the capabilities are going up more than linearly with the amount of data. That much more favorable scaling property is something we’ve seen in other domains, like language and vision.”
One of the paper's most compelling demonstrations involves an air fryer that the model had virtually no exposure to during training. Upon investigation, the research team discovered only two relevant instances in the entire training dataset: one where a different robot simply pushed the air fryer closed, and another from an open-source dataset where a separate robot placed a plastic bottle inside one on command. Somehow, the model integrated these fragments with broader web-based pretraining data to develop a functional understanding of how the appliance operates.
“It’s very hard to track down where the knowledge is coming from, or where it will succeed or fail,” notes Ashwin Balakrishna, a research scientist at Physical Intelligence and a Stanford computer science PhD student. Nevertheless, without any coaching, the model made a credible attempt at using the appliance to cook a sweet potato. When provided with step-by-step verbal instructions—essentially, a human guiding the robot through the task as one might train a new employee—it succeeded.
This coaching capability is significant because it implies robots could be deployed in new settings and enhanced in real time without needing additional data collection or model retraining.
The researchers are upfront about the model's limitations and cautious not to overstate their progress. In at least one instance, they attribute a problem to their own team. “Sometimes the failure mode is not on the robot or on the model,” Balakrishna says. “It’s on us. Not being good at prompt engineering.” He recounts an early air fryer experiment that had a 5% success rate. After spending about half an hour refining how the task was explained to the model, the success rate soared to 95%.

The model also cannot yet autonomously execute complex, multi-step tasks from a single high-level command. “You can’t tell it, ‘Hey, go make me some toast’,” Levine states. “But if you walk it through—‘for the toaster, open this part, push that button, do this’—then it actually tends to work pretty well.”
The team also acknowledged the lack of standardized benchmarks for robotics, which complicates external validation of their claims. Instead, the company compared π0.7 against its own previous specialist models—purpose-built systems trained on individual tasks—and found that the generalist model matched their performance across a range of complex activities, including making coffee, folding laundry, and assembling boxes.
Perhaps the most notable aspect of the research—if the researchers are to be believed—is not any single demonstration but how much the results astonished the team members themselves, whose job involves knowing precisely what is in the training data and therefore what the model should and shouldn't be capable of.
“My experience has always been that when I deeply know what’s in the data, I can kind of just guess what the model will be able to do,” Balakrishna remarks. “I’m rarely surprised. But the last few months have been the first time where I’m genuinely surprised. I just bought a gear set randomly and asked the robot, ‘Hey, can you rotate this gear?’ And it just worked.”
Levine recalled the moment researchers first encountered GPT-2 generating a story about unicorns in the Andes. “Where the heck did it learn about unicorns in Peru?” he says. “That’s such a weird combination. And I think that seeing that in robotics is really special.”
Critics will likely point out an uncomfortable asymmetry: language models had the entire internet to learn from, while robots do not, and no amount of clever prompting can fully bridge that gap. However, when asked where he anticipates skepticism, Levine points elsewhere entirely.
“The criticism that can always be leveled at any robotic generalization demo is that the tasks are kind of boring,” he observes. “The robot is not doing a backflip.” He challenges that perspective, arguing that the difference between an impressive robot demo and a robotic system that genuinely generalizes is precisely the point. Generalization, he suggests, will always appear less dramatic than a carefully choreographed stunt—but it is far more practical.
The paper itself uses cautious language throughout, describing π0.7 as showing “early signs” of generalization and “initial demonstrations” of new capabilities. These are research findings, not a deployed product, and Physical Intelligence has consistently been reserved about commercial timelines. When asked directly when a system based on these findings might be ready for real-world deployment, Levine declined to speculate.
“I think there’s good reason to be optimistic, and certainly it’s progressing faster than I expected a couple of years ago,” he says. “But it’s very hard for me to answer that question.”
Physical Intelligence has raised over $1 billion to date and was most recently valued at $5.6 billion. A significant part of the investor enthusiasm surrounding the company is linked to Lachy Groom, a co-founder who spent years as one of Silicon Valley's most respected angel investors—backing companies like Figma, Notion, and Ramp—before deciding that Physical Intelligence was the venture he had been seeking. This pedigree has helped the startup attract substantial institutional funding, even as it has declined to provide investors with a commercialization timeline. The company is now reportedly in discussions for a new funding round that would nearly double its valuation to $11 billion. The team declined to comment.