Article | New Mou, Author | Lu Yao
Recently, a term has been buzzing in certain circles: "Physical AI".
Jensen Huang actually used the term more than ten times in his keynote at CES in Las Vegas early last year, but it wasn't until this year that "Physical AI" truly took off.
So, what exactly is "Physical AI"?
A couple of days ago, I saw a video of a robot watering flowers. The robot first walked to the faucet, turned on the valve, filled the watering can, then turned around, walked to the flower pot, adjusted its angle, and poured the water in evenly. The spout didn't hit the edge of the pot, and no water spilled out.
For a machine to understand "carrying a cup of water," it needs to know the cup is cylindrical, calculate the precise force needed to grip it without slipping or crushing it, understand that water is a liquid and will spill if shaken, and constantly adjust its arm angle while walking to compensate for body movement.
A three-year-old can do all of this intuitively. But for AI, it is a huge leap. Over the past decade, AI learned to see, hear, speak, and draw, but it remained trapped within screens. What Physical AI aims to do is put this smart brain into a body that can run, jump, grasp, and manipulate objects in the real world.
Simply put, Physical AI is about making AI understand and act upon the physical world. It's no longer just processing text and images; it's about performing correct actions in an environment governed by gravity, friction, and inertia.
A fact seldom discussed domestically is that the term "Physical AI" didn't originate from some chip giant's PR department. This concept first appeared in a 2020 paper published in *Nature Machine Intelligence*. The paper systematically defined Physical AI for the first time:
A class of embodied systems capable of performing tasks typically associated with intelligent organisms. The core lies in deeply integrating physical laws into the AI system, so machines are no longer "physically blind" and can complete the perception-to-action loop.
From the academic world's opening shot in 2020 to the industry's full embrace in 2026, there was a gap of six whole years. In these six years, sensor costs dropped by several orders of magnitude, edge AI computing power moved from theory to engineering, and the reliability and mass production capability of robot bodies quietly reached an inflection point — these were the hidden forces pushing Physical AI from papers to production lines.
From Demonstration to Real Work
If the large language models of 2023 taught AI to chat, then the keyword for Physical AI in 2026 is just one thing: work.
The change is visible to the naked eye.
This time last year, robot companies still flexed their muscles with demo videos: set up the scene, rehearse repeatedly, shoot in one take. Impressive to watch, but you never knew how many takes it took.
This year, the playbook is completely different. Zhi Yuan Robotics put a robot on a real 3C production line in Nanchang and had it work continuously for several hours, live-streaming the entire process. There was no preset script and no restricted scene, just the same production line that workers face every day. Hundreds of thousands of people watched online.
A month later, Zhi Yuan announced in Hong Kong the mass production of 10,000 humanoid robots. The leap from one prototype in the lab to 10,000 on a production line is a milestone that changes the game.
Zhi Yuan's approach is interesting. Most robotics startups focus on a specific segment — some only on the body, some only on the large model, some only on dexterous hands. Zhi Yuan chose another path: doing the full stack, simultaneously developing the body manufacturing, AI model, dexterous manipulation, and data collection, while also investing in over 60 upstream and downstream companies in the industry chain.
The cost of this approach is clear: the parent company has over a thousand employees, expected to grow further by the end of this year, with an annual salary expenditure alone reaching billions. This path burns cash, but once proven, its moat is also the deepest.
Zhi Yuan's founder Deng Taihua proposed an analytical framework called the "XYZ Curve." He said embodied intelligence development has three stages: X is the development and experimentation phase, where people are still playing with demos; Y is the deployment and growth phase, where robots actually start working on production lines; Z is the ultimate intelligent emergence phase.
He characterized 2026 as: "the first year of deployment phase, officially moving from 'can move' to 'can work'." The difference between "can move" and "can work" is just one word, but it marks the entire industry's coming of age.
Across the Pacific, the pace is just as intense.
American humanoid robot company Figure AI is an unavoidable name on this track. In September last year, they completed a funding round of over $1 billion, raising their valuation to $39 billion, making them the world's highest-valued humanoid robot company at the time.
A month later, they released a new generation product, Figure 03, standing 1.68 meters tall and weighing about 60 kilograms, demonstrating household chores like watering plants, serving dishes, and folding clothes. Founder Brett Adcock specifically added on social media: all actions were autonomously completed by the robot, with no human remote control.
Technologically, it's noteworthy that Figure made a major strategic pivot, terminating its cooperation with OpenAI and fully transitioning to its self-developed neural network system, Helix.
This system mimics human cognition with a three-layer structure: the bottom layer handles balance and instinctive reactions, the middle layer translates the brain's commands into motor control commands 200 times per second, and the top layer is the logical brain, responsible for understanding scenes and making decisions. This "instinct-reflex-thought" three-tier architecture is quite clever, essentially giving the robot a nervous system that does not crash.
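To make the layered idea concrete, here is a minimal sketch of three control layers running at different rates. It is not Figure's Helix code. Only the 200-per-second middle layer comes from the description above; the 10 Hz top layer, the 1000 Hz bottom layer, the class and function names, and the toy arithmetic inside each layer are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

MIDDLE_HZ = 200    # from the article: motor commands 200 times per second
TOP_HZ = 10        # assumption: the article gives no rate for the top layer
BOTTOM_HZ = 1000   # assumption: the article gives no rate for the bottom layer


@dataclass
class LayeredController:
    """Toy three-rate scheduler that counts how often each layer runs."""
    goal: float = 0.0      # latest target chosen by the top layer
    command: float = 0.0   # latest motor command from the middle layer
    calls: dict = field(
        default_factory=lambda: {"top": 0, "middle": 0, "bottom": 0}
    )

    def top(self, position: float) -> None:
        # Slow "thought" layer: pick a target slightly ahead of the current state.
        self.goal = position + 1.0
        self.calls["top"] += 1

    def middle(self, position: float) -> None:
        # 200 Hz layer: turn the goal into a motor command (proportional step).
        self.command = 0.5 * (self.goal - position)
        self.calls["middle"] += 1

    def bottom(self, position: float) -> float:
        # Fast "instinct" layer: apply the command, clamped to a safe step size.
        step = max(-0.01, min(0.01, self.command))
        self.calls["bottom"] += 1
        return position + step


def run(seconds: int) -> LayeredController:
    """Simulate the loop in discrete ticks of the fastest layer (no real clock)."""
    ctrl = LayeredController()
    position = 0.0
    for tick in range(seconds * BOTTOM_HZ):
        if tick % (BOTTOM_HZ // TOP_HZ) == 0:
            ctrl.top(position)
        if tick % (BOTTOM_HZ // MIDDLE_HZ) == 0:
            ctrl.middle(position)
        position = ctrl.bottom(position)
    return ctrl
```

Under these assumed rates, `run(1).calls` reports 10 top-layer calls, 200 middle-layer calls, and 1000 bottom-layer calls for one simulated second. The point of the design is that the fast layer never waits for the slow one: it always acts on the most recent command it was given.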
One more development is worth mentioning. At this year's GTC conference, NVIDIA announced deep cooperation with the world's four industrial robotics giants: ABB, KUKA, Yaskawa, and Fanuc. Over 2 million industrial robots already installed on production lines worldwide can now use NVIDIA's simulation platform for virtual commissioning and AI training.
These four companies combined account for over half of the global industrial robot market share. In the next decade, these robots will undergo an upgrade from "traditional programming" to "AI-driven." Whichever software platform can embed itself into this process will essentially secure the "operating system" layer for the next generation of industrial automation. NVIDIA clearly doesn't want to miss this boat.
The Supply Chain's Crossover Sprint
Another interesting phenomenon: automotive supply chain companies are entering the Physical AI track en masse.
At this year's Beijing Auto Show, traditional automotive suppliers like Aptiv, Valeo, Horizon Robotics, and Qianxun SI showcased robotics-related solutions side by side. Many industry insiders realized then that perception for embodied intelligence is essentially the same as perception for intelligent driving, so automotive solutions can be carried over to humanoid robots almost directly.
Thinking about it carefully, it makes sense. The automotive intelligent driving system is essentially a perception-decision-execution loop for a "mobile robot." Its three core modules — visual perception, path planning, and real-time control — are highly homologous in technical architecture with traditional industrial robots and humanoid robots.
Automotive suppliers' cameras, radars, steer-by-wire chassis, and real-time operating systems can be migrated to the robotics field with slight adaptation. In this sense, the hundreds of billions in R&D spending the automotive industry burned over the past decade on intelligence are now flowing into the Physical AI track as "technology spillover."
This might explain why Chinese robotics companies can so quickly enter the mass production stage. Manufacturing capabilities and supply chain management aren't built from scratch; many are readily available. Those component suppliers already honed on automotive production lines for over a decade are now applying their skills on a new battlefield.
There are ready-made cases abroad as well. Take Tesla: its first-generation humanoid robot Optimus is also accelerating its entry into the field. On its Q1 2026 earnings call, Tesla stated plainly that the company would transition to "a future centered on AI, autonomous taxis, and humanoid robots," with the first-generation robot production line having a capacity of 1 million units and replacing the current Model S and Model X production lines.
The number 1 million might seem exaggerated in today's context, but Tesla's logic is clear: it wants to directly replicate the large-scale production capabilities and supply chain management experience accumulated in automobile manufacturing into the humanoid robotics field.
What Musk wants is not a "robot that can move," but a "mass-produced tool" that can work alongside humans in factories. Once this path is proven, its impact on the manufacturing automation landscape will be no less than that of the Model 3 on the fuel vehicle market.
World Models: Why They Became Usable This Year
Having covered the major players' moves at the industry level, let's zoom in one layer deeper: what's the technological foundation of this Physical AI race?
To sum it up in one sentence: the engineering breakthrough of world models. I think this is also the most critical point for understanding this wave.
The concept of "world model" isn't new; it was proposed back in 2018. The core idea is simple: let AI develop an internal understanding of how the physical world operates, so it can predict "what will happen if I push this cup." But previously, this mostly existed only in papers — too computationally expensive, unstable generation quality, unsuitable for real-time interaction.
The turning point happened in the last year. NVIDIA launched a series of models called Cosmos, whose core capability is generating action data conforming to physical laws from text or images.
For example: if you want to train a robot to move boxes in various weather conditions, you don't need to actually film videos in factories during rain, snow, or at night. Set the parameters in a simulation environment, and Cosmos can directly generate massive amounts of highly realistic training data covering various extreme scenarios.
Early this year, the Ant Lingbo team open-sourced a framework called LingBot-World, specifically for interactive world models. It can achieve nearly 10 minutes of continuous, stable video generation, with end-to-end interaction latency controlled within seconds. Users can control virtual characters in real-time with a keyboard and mouse like playing a game, with the model providing instant feedback on scene changes. The significance is that world models moved from "offline rendering" to "online interaction," boosting training efficiency by an order of magnitude.
Another startup, Jijia Vision, released the GigaWorld-1 platform, positioned as a "digital sandbox" for the physical world. A month later, Alibaba's ABot-PhysWorld surpassed it on a benchmark called WorldArena, topping the comprehensive rankings. Competition is advancing month by month.
The importance of these open-source projects lies not in how large their parameter counts are, but in turning a game "only giants could play" into a tool "small teams can also use." When enough people are building the wheels, more cars will truly start running.
The reason world models have become a core component in the Physical AI era is that they answer a long-unresolved question: how to enable robots to learn the complex laws of the physical world in a low-cost, high-efficiency way?
Training data from the real world is extremely costly to obtain and inherently carries distribution bias. It's hard to gather all edge scenarios in reality, like factory night shifts during a blizzard, emergency situations during a logistics warehouse blackout, or sudden human intervention on a production line. Synthetic data can cover them. By manipulating scene parameters with prompts in a simulation environment, researchers can generate large-scale training videos covering extreme conditions within hours, a job that would take months or even years under the traditional real-data collection route.
The leverage effect of this breakthrough might exceed any single algorithm improvement.
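The parameter-driven generation described in this section can be sketched roughly as follows. This is a minimal illustration, not the API of Cosmos or any other real platform: the parameter names (`weather`, `lighting`, `event`, `box_mass_kg`), the value lists, and the `to_prompt` helper are all hypothetical. The sketch only shows how edge cases that are rare in the real world can be sampled as often as ordinary ones.

```python
import random

# Hypothetical scene parameters; the article names scenarios, not a schema.
WEATHER = ["clear", "rain", "snow", "blizzard"]
LIGHTING = ["day", "night", "blackout"]
EVENTS = ["none", "human_intervention"]


def sample_scenes(n: int, seed: int = 0) -> list[dict]:
    """Draw n scene configurations uniformly, so rare conditions are not rare."""
    rng = random.Random(seed)  # seeded, so a dataset can be regenerated exactly
    return [
        {
            "weather": rng.choice(WEATHER),
            "lighting": rng.choice(LIGHTING),
            "event": rng.choice(EVENTS),
            "box_mass_kg": round(rng.uniform(0.5, 20.0), 2),
        }
        for _ in range(n)
    ]


def to_prompt(scene: dict) -> str:
    """Render one configuration as a text prompt for a (hypothetical) generator."""
    return (
        f"A robot moves a {scene['box_mass_kg']} kg box in a factory, "
        f"weather: {scene['weather']}, lighting: {scene['lighting']}, "
        f"event: {scene['event']}."
    )
```

Each sampled dictionary would then be passed, as a prompt or as simulator settings, to whatever generator is in use. Uniform sampling is itself a design choice: a real pipeline might deliberately oversample the failure-prone conditions instead.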
The Paradigm Has Changed
The breakthrough in world models is actually just one part of the evolution of the Physical AI tech stack. Changes in underlying technology are driving a fundamental architectural rebuild of the entire robotics industry.
Traditional robots use a "sense, plan, act" three-stage approach. First, sensors perceive the environment, then engineers write rules telling the machine how to plan its path, and finally, it executes the action. This works fine in structured environments like factory assembly lines, but once the scenario gets complex, its shortcomings are exposed. The machine only follows the preset script and gets stuck when encountering unseen situations.
Physical AI takes a different path: "perception, reasoning, execution." After perception, it doesn't go through human-written rules but uses a trained neural network to reason what to do and then execute. The essential difference is that the former is "the engineer thinks for the machine," while the latter is "the machine understands the physical world itself."
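The contrast between the two approaches can be shown with a toy example. Everything here is an assumption made for illustration: the situation names, the two features (object distance and height), and the handful of demonstrations. A nearest-neighbour lookup stands in for the trained neural network, since what matters in this sketch is only that the behaviour comes from examples rather than from hand-written rules.

```python
# "The engineer thinks for the machine": one rule per anticipated situation.
RULES = {"box_on_belt": "grasp", "belt_empty": "wait"}


def rule_based(situation: str) -> str:
    # A situation nobody wrote a rule for leaves the machine stuck.
    return RULES.get(situation, "halt")


# "The machine generalizes from data": a nearest-neighbour stand-in for a
# trained network. Features are hypothetical: (distance_m, height_m).
DEMOS = [
    ((0.2, 0.10), "grasp"),
    ((0.3, 0.12), "grasp"),
    ((2.0, 0.00), "wait"),
    ((1.8, 0.00), "wait"),
]


def learned(features: tuple[float, float]) -> str:
    def dist2(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Act as in the most similar demonstration.
    return min(DEMOS, key=lambda demo: dist2(demo[0], features))[1]
```

Here `rule_based("box_tilted_on_belt")` falls through to `"halt"` because nobody anticipated a tilted box, while `learned((0.25, 0.15))` still returns `"grasp"` because the unseen input sits close to known demonstrations. A real learned policy can of course also fail on inputs far from its training data, which is one reason broad training data matters.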
The International Federation of Robotics released a technology roadmap this year, predicting that within the next three years, 80% of new robot models will adopt this new architecture, with the traditional three-stage approach gradually exiting the mainstream. This isn't a minor tweak; it's a full paradigm shift.
As an industry expert aptly summarized: Physical AI is the ultimate mode of AI development because it needs to understand not only human instructions but also all the laws of the physical world.
Jensen Huang said the "ChatGPT moment" for robotics development has arrived. In my view, the nature of Physical AI's "moment" is completely different from that of language models. For language models, the moment was when ordinary people worldwide first got their hands on AI. For Physical AI, the moment is when AI truly starts working for the first time.
Currently, this track is at a very special stage: the direction is locked in, the concept is validated, but the landscape isn't settled.
On one hand, making demos and achieving mass production are two completely different capability systems. Getting one prototype to work is one thing; having ten thousand products perform consistently in real-world scenarios tests manufacturing consistency, supply chain resilience, scenario generalization ability, and operational systems. These have little to do with AI algorithms, but each is enough to knock out a batch of players. On the other hand, real-world data collection is expensive, time-consuming, and has limited coverage, which all but guarantees that large-scale training for Physical AI will rely heavily on synthetic data.
At the same time, from automotive supply chains and traditional industrial automation to consumer electronics manufacturing, industries that seem unrelated to "AI" are accelerating their entry into Physical AI through technology spillover. Their manufacturing capabilities, supply chain management experience, and scenario resources might be the key variables determining the speed of Physical AI's practical application.
Here is an intuitive judgment: in the AI wave that ChatGPT ignited in early 2023, the ones who captured the most value weren't the model makers, but the infrastructure providers. Will this wave of Physical AI replay the same script?
NVIDIA's moves suggest it's betting on this direction, but the story isn't finished. 2026 is the first year of the deployment phase; industrial competition has just begun. Looking back three years from now, which names are still at the table and which have been eliminated might surprise most people.








