Becoming the Android of China's AR world, Rokid sets off a spatial computing frenzy

Text | Liu Yuqi

It may be hard to imagine that, in a space with no display and no mouse, you could write a 5,000-word article with nothing but a pair of AR glasses and a pocket-sized host.

Yet on August 26, at the 2023 Rokid Jungle new product launch, exactly that scene played out. At the event, Rokid released Rokid AR Studio, a consumer-grade OST (optical see-through) personal spatial computing platform comprising Rokid Max Pro (priced at 4,999 yuan) and Rokid Station Pro (priced at 3,999 yuan).

Rokid founder and CEO Zhu Mingming said at the press conference: "Let spatial computing be more naturally integrated into daily life and work, and let Rokid AR Studio become your first spatial computer."

This is very different from how people have understood AR glasses. Until now, AR glasses have been "locked" into entertainment, surviving on two pillar industries: film and television, and games. Rokid AR Studio is positioned as a true personal productivity tool: work scenarios such as instant messaging, writing articles, writing code and searching for information can all be handled on the latest hardware.

Broader usage scenarios let AR devices move from marginal use cases toward practical value. Only when consumers are willing to pay will the entire AR industry chain enter a positive cycle in the consumer market.

Zhu Mingming, a boss who describes himself as "socially anxious", is a thorough control freak when it comes to products and technology. He once killed two product design drafts internally, nearly driving the product department crazy. But when the product department quietly brought out the product it had designed anyway, Zhu Mingming immediately ordered all resources to be put behind it. "I only care about one metric, which is user usage time. At present, our real user usage time is close to one and a half hours, and the weekly retention rate exceeds 20%. If we achieve this, users will grow naturally."

Reaching a user base on the order of one million also means the AR industry has entered its second stage: building software systems and the ecosystem. In recent years, more and more system vendors, application developers and content producers have joined in building the AR ecosystem.

"A bunch of lunatics, a dream, ten years."

As Zhu Mingming said, it took Rokid 10 years to go from entertainment to productivity tools. Behind this is not only a leap in thinking but also a big step from hardware technology to software technology, and across the entire industry chain. Apple and Rokid have opened the second phase of the AR race, and industry competition is accelerating.

01 Monocular SLAM: how does it redefine interaction?

Throughout the launch, the biggest surprise was not the 76 g weight of the Rokid Max Pro but the fact that it has only one camera, yet can handle SLAM (simultaneous localization and mapping, a spatial positioning technology), micro-gesture interaction, first-person perspective sharing, visual positioning (VPS) and other combined interaction methods.

Having gone through physical interaction (controllers), voice interaction and gesture interaction, AR/VR devices are now moving toward eye tracking and multi-sensory fusion interaction.

However, multi-sensory fusion interaction places higher demands on hardware. Beyond meeting basic needs, the device must capture the user's movements and gestures from all directions and multiple angles in order to complete the interaction accurately.

How difficult is it to complete SLAM interaction with a single camera?

Visual SLAM consists of two modules. One is Tracking, which uses known 3D point positions to estimate the camera's position; the other is Mapping, which updates the positions of the 3D points. Whichever stage or method is used, monocular means relying on a single camera with a fixed position and a fixed angle, which poses great challenges for recognition range, tracking speed and accuracy.
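
To make the Tracking/Mapping split concrete, here is a minimal toy sketch. It is not Rokid's algorithm: it assumes a 2D world (x, depth z), a one-dimensional pinhole "image", a camera that never rotates, and noise-free observations. All function names and the focal length are illustrative assumptions.

```python
# Toy 2D "visual SLAM" split into tracking and mapping (illustrative only).
# Pinhole model without rotation: u = f * (X - cx) / (Z - cz).

F = 500.0  # hypothetical focal length in pixels


def project(point, cam, f=F):
    """Project a 2D world point (X, Z) into the 1D image of a camera at (cx, cz)."""
    X, Z = point
    cx, cz = cam
    return f * (X - cx) / (Z - cz)


def solve_two_unknowns(rows, rhs):
    """Least-squares solve of rows * [a, b] = rhs via the 2x2 normal equations."""
    a11 = sum(r0 * r0 for r0, _ in rows)
    a12 = sum(r0 * r1 for r0, r1 in rows)
    a22 = sum(r1 * r1 for _, r1 in rows)
    b1 = sum(r0 * b for (r0, _), b in zip(rows, rhs))
    b2 = sum(r1 * b for (_, r1), b in zip(rows, rhs))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("degenerate geometry: observations do not constrain the solution")
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


def track(known_points, observations, f=F):
    """Tracking: estimate the camera position from known points.

    Rearranging the projection gives f*cx - u*cz = f*X - u*Z for each point.
    """
    rows = [(f, -u) for u in observations]
    rhs = [f * X - u * Z for (X, Z), u in zip(known_points, observations)]
    return solve_two_unknowns(rows, rhs)


def map_point(cams, observations, f=F):
    """Mapping: triangulate a point from views with known camera positions.

    The same equation, now solved for (X, Z): f*X - u*Z = f*cx - u*cz.
    """
    rows = [(f, -u) for u in observations]
    rhs = [f * cx - u * cz for (cx, cz), u in zip(cams, observations)]
    return solve_two_unknowns(rows, rhs)


# Demo: track the camera against three known landmarks, then map a new point.
landmarks = [(-1.0, 4.0), (0.5, 5.0), (2.0, 6.0)]
true_cam = (0.3, 0.2)
estimated_cam = track(landmarks, [project(p, true_cam) for p in landmarks])

new_point = (1.0, 3.0)
first_cam = (0.0, 0.0)
views = [project(new_point, first_cam), project(new_point, estimated_cam)]
estimated_point = map_point([first_cam, estimated_cam], views)
print(estimated_cam, estimated_point)
```

Note that the new point can only be triangulated after the single camera has moved between two positions, which hints at why a one-camera setup is harder than one with several cameras observing the scene at once.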

"The industry thinks that monocular SLAM is unbelievable and difficult to achieve," Zhu Mingming joked, "This may also be an affirmation of Rokid."

Currently, the few AR glasses on the market that offer spatial interaction carry at least three cameras to support their algorithms. The differing visual approaches have also split into two camps: VST (video see-through), represented by Apple, and OST (optical see-through), represented by Rokid.

Take Apple Vision Pro as an example: it "stacks" 12 cameras to achieve fast positioning and capture, high-precision panoramic perception and precise tracking, and it uses VST, in which the cameras film the outside world in real time and show it on the device's screens.

However, stacking hardware for interaction drives up cost and multiplies the price. It also causes two major practical problems: device weight and difficulty of mass production. This is the fundamental reason why Apple’s Vision Pro is priced at US$3,499 and will not be mass-produced until 2024.

The OST approach that Rokid insists on has technical barriers of its own. Because of the complex pipeline design, the limited viewing angle of the display and the high cost of optical components, Rokid can only offset these added costs through technological breakthroughs if it wants to avoid a significant price increase.

So how is the monocular SLAM that the industry considers "incredible" achieved? After the event, Light Cone Intelligence spoke at length with Zhu Mingming and found that Rokid's "trick" lies in using AI algorithms to break through hardware barriers.

Zhu Mingming explained that although monocular SLAM has existed for a long time, it had never been used in AR glasses. The front camera of a mobile phone uses this technology too; the only difference is the algorithm.

The road from AI to AR looks like a leap but is in essence continuous. It is precisely Rokid's accumulation in AI over the past few years, through multi-dimensional visual algorithm models covering visual positioning and enhancement, digital human technology, 2D/3D gesture recognition, OCR and other technologies, that allows AI to be deployed in specific scenarios.

For example, the AR visual positioning and enhancement function is meant to break through the monocular limitation. By building a centimeter-level visual map, virtual information can be accurately overlaid on and blended into the real world, achieving high-precision 3D reconstruction of objects and scenes.

Wang Junjie, vice president of Rokid and head of the XR center, said: "Spatial positioning is based on SLAM technology, on top of which stable, natural interaction can be carried out in space. It takes 1 to 2 seconds for the algorithm to initialize and build the mapped space."

Most devices on the market still use binocular solutions, but binocular fusion has its own problems. Besides the cost of an extra camera, algorithms must continuously fuse the data from the two cameras in real time, which makes things more complex.

From this point of view, if the monocular solution works out, Rokid will be the first to catch a technical trend. Rokid was also the maker of the industry's first Station host, and separating the glasses from the host has proven to be the best solution in the industry in terms of experience.

In gesture recognition, Rokid uses micro-gesture interaction. A pinch of the fingers clicks and selects; a flick left or right switches the interface or the content being browsed. Simple gesture definitions such as pinching and swiping feel natural and are easy to pick up.
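
As a rough illustration of how pinch and flick gestures can be defined once fingertip positions are available, here is a small sketch. It is not Rokid's implementation: the thresholds, units (metres) and function names are assumptions chosen for the example.

```python
import math

# Hypothetical thresholds, in metres; real systems tune these empirically.
PINCH_THRESHOLD_M = 0.02
FLICK_MIN_TRAVEL_M = 0.08


def is_pinching(thumb_tip, index_tip, threshold=PINCH_THRESHOLD_M):
    """Treat the hand as pinching when thumb and index fingertips nearly touch."""
    return math.dist(thumb_tip, index_tip) < threshold


def classify_flick(start_x, end_x, min_travel=FLICK_MIN_TRAVEL_M):
    """Classify a horizontal hand movement as a left or right flick, or neither."""
    dx = end_x - start_x
    if dx >= min_travel:
        return "right"
    if dx <= -min_travel:
        return "left"
    return None


# Fingertips 1 cm apart count as a pinch (click/select); 5 cm apart do not.
print(is_pinching((0.00, 0.0, 0.3), (0.01, 0.0, 0.3)))
print(is_pinching((0.00, 0.0, 0.3), (0.05, 0.0, 0.3)))
# A 10 cm movement to the right is a flick; a 3 cm drift is ignored.
print(classify_flick(0.0, 0.10), classify_flick(0.0, 0.03))
```

Keeping the rules this simple is what makes such gestures easy to learn; the hard part in practice is estimating the fingertip positions reliably from the camera image.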

Judging from our on-site tests, Rokid currently supports bare-hand spatial interaction with both hands. Its gesture recognition algorithm handles complex conditions such as horizontal/spatial axis rotation and bright or dim lighting, and it recognizes a rich range of gesture types. The algorithm is accurate: the overall recognition rate is about 90% or higher, with millisecond-level recognition response and a 99% reliability guarantee.

Rokid said that, based on deep learning algorithms and a large amount of experimental data, its monocular 3D gesture algorithm can reconstruct hand posture parameters in real time on mobile hardware, including hand 6DoF, hand joint point 6DoF and Hand Mesh information, providing a good algorithmic foundation for AR gesture interaction.

At present, Rokid's gesture recognition can realize a variety of operations in 3D space, including point, pinch, grab, hold, drag, pull, etc., which can fully meet the needs of AR interactive applications. For example, put on the Rokid Max Pro, reach out and open your palm in front of your eyes to bring up the menu.

Ultimately, the hero behind such a complex algorithm stack is not only the camera but also the "brain": the computing power and performance of Rokid Station Pro.

02 A spatial computer in your pocket

The VR/AR industry has always faced an impossible triangle of "computing power, comfort, and price". Devices with higher computing power are often heavier and more expensive, while lightweight, comfortable devices cannot deliver the performance users need.

In reality, there is currently no "perfect" solution, and mainstream manufacturers are trying to strike a balance. Two approaches dominate the market: one, represented by Apple, integrates display and computing and uses an external battery; the other, represented by Rokid, splits display and computing into separate units.

Apple's integrated design combines two micro-OLED screens, multiple cameras, sensors, speakers and other components, which makes display and computing more efficient but also adds weight to the headset itself, leaving no option but to connect the battery externally.

Rokid insists on a split design that maximizes wearability. Compared with Vision Pro's 454 g, the 76 g glasses weigh almost the same as ordinary glasses. At the same time, the host's computing power is less constrained by space, and discomfort caused by heat is avoided to a certain extent.

In general, the split route lets the lightness of the glasses and the computing power of the host each be pushed to the extreme. It is also more flexible: the host's computing power and the glasses' technology can iterate asynchronously.

Building on the split design, Rokid Station Pro upgrades computing power to create an all-in-one terminal that integrates computing, imaging, communication and other functions, a super terminal that can truly be called a "productivity tool".

According to Light Cone Intelligence, Rokid Station Pro is equipped with the Qualcomm Snapdragon XR2+, 12 GB RAM + 128 GB ROM, and supports Wi-Fi 6/6E and Bluetooth 5.1. Station Pro's battery life will be more than twice that of the mobile phone solution, and with better heat dissipation and higher performance it can achieve centimeter-level 6DoF tracking accuracy and extremely low MTP (Motion to Photon) rendering latency.

Public information shows that Snapdragon XR2+ is the latest flagship XR platform launched by Qualcomm. It is said to bring a 50% improvement in battery life and a 30% improvement in heat dissipation performance, allowing smaller and thinner devices to deliver a richer, more immersive experience. At the same time, the Snapdragon XR2+ platform introduces a new image processing pipeline, which can achieve a latency of less than 10 milliseconds and enable a full-color video see-through MR experience.

Judging from Light Cone Intelligence's on-site experience, whether watching movies, playing games or working with a keyboard, and especially during high-frequency interaction and combat in games, the picture was smooth and responsive.

It is worth noting that the mainstream algorithm on the market is still 3DoF (three-degrees-of-freedom tracking), which means the device can detect head rotation about three axes but cannot detect the head's displacement in space: forward and backward, left and right, or up and down.

The upgraded Station Pro adopts a 6DoF algorithm, which in addition to detecting changes in viewing angle caused by head rotation can also detect displacement up and down, forward and backward, and left and right caused by body movement.
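
The difference between 3DoF and 6DoF can be sketched in a few lines. This is a simplified illustration rather than any vendor's tracking code: the `HeadPose` class and `to_head_frame` function are hypothetical names, and only yaw is modelled for brevity (pitch and roll would be handled the same way with a full rotation).

```python
import math
from dataclasses import dataclass


@dataclass
class HeadPose:
    yaw: float = 0.0  # rotation about the vertical axis, in radians
    x: float = 0.0    # metres to the right
    y: float = 0.0    # metres up
    z: float = 0.0    # metres forward


def to_head_frame(point, pose, track_position):
    """Express a world-anchored point relative to the head.

    track_position=False mimics 3DoF: only rotation is applied.
    track_position=True mimics 6DoF: translation is subtracted first.
    """
    px, py, pz = point
    if track_position:
        px, py, pz = px - pose.x, py - pose.y, pz - pose.z
    c, s = math.cos(pose.yaw), math.sin(pose.yaw)
    # Inverse of a rotation by `yaw` about the vertical (y) axis.
    return (c * px - s * pz, py, s * px + c * pz)


zombie = (0.0, 0.0, 2.0)          # anchored 2 m in front of the start position
stepped_forward = HeadPose(z=1.0)  # the user walks 1 m toward it

# 3DoF ignores the step, so the zombie still appears 2 m away.
print(to_head_frame(zombie, stepped_forward, track_position=False))
# 6DoF accounts for the step, so the zombie is now 1 m away.
print(to_head_frame(zombie, stepped_forward, track_position=True))
```

With rotation-only tracking, virtual objects follow the user as they walk; with position tracking they stay anchored in the room, which is what lets a user approach or walk around them.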

This algorithm upgrade matters most for the player's freedom of movement. For example, when fighting zombies under 3DoF, the action is confined to a certain angle in front of you; after the upgrade, zombies come from all 360 degrees, and the physical sensation of a zombie in your face when you turn around is beyond anything the former can offer.

In other words, it is not just that computing power is higher and the experience smoother; the expanded computing headroom also brings a huge difference in physical experience.

Said Bakadir, Senior Director, [...] "[...] its own unique AR application ecosystem."

03 Building the iOS of the AR industry

Of course, the reason Apple's phones have dominated the mobile phone market year after year lies not only in the hardware but also in the system and ecosystem. The barriers built by cultivating user habits through software are often stronger than the hardware itself.

This is part of the reason why Rokid developed its own AR spatial operating system, YodaOS-Master, but it is not the whole reason.

At Rokid Open Day in March this year, Rokid officially launched YodaOS-Master and released the "Lingjing" AR space creation platform, which lets anyone create AR content in 3D space, completely removing the threshold for AR creation and letting the ecosystem's potential explode.

If monocular SLAM, 3D gesture recognition, Snapdragon XR2+ and the Lingjing platform are all sharp swords, then YodaOS-Master is the self-developed system through which these unique skills are unleashed.

To put it simply, Rokid is taking a road that no one has traveled before, and its philosophy is "software defines everything". All software must be carried and served by the system in order to deliver its value.

Focusing on perception, understanding, interaction, display, collaboration and digital creation, YodaOS-Master has made major upgrades in chip optimization, hardware design, software architecture, AR algorithms and creation tools. It is currently the most complete spatial operating system for the AR era.

At the press conference, Rokid also demonstrated the openness and convenience brought by its self-developed system. To give an obvious example, based on its self-developed system and the Snapdragon XR2+ platform, Rokid has developed a multi-tasking parallel mode that removes the previous limit of one task at a time. Chatting on DingTalk, writing code and reading documents can all happen simultaneously, making full use of the large spatial screen and maximizing productivity.

Another highly innovative case is that Rokid has redefined spatial search on top of its self-developed system. Zhu Mingming said this breaks with the way search information used to be displayed: search results are no longer presented on a two-dimensional plane but exist in three-dimensional space. "The results that are most relevant to the question will be closest to you, and the results that are somewhat relevant will be on the secondary page. The further away, the less relevant. Of course, you can also swipe away the previous results and dynamically select the results you want."
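
The layout rule Zhu describes, where more relevant results sit closer to the user, can be sketched as a simple mapping from relevance score to depth. This is only an illustration of the idea, not Rokid's implementation: the function name, the 0-to-1 relevance scale, the linear mapping and the near/far distances are all assumptions.

```python
def layout_by_relevance(results, near=0.5, far=3.0):
    """Place search results in depth so that higher relevance means closer.

    results: list of (title, relevance) pairs with relevance in [0, 1].
    Returns (title, depth_in_metres) pairs, nearest first.
    """
    placed = []
    for title, relevance in sorted(results, key=lambda r: r[1], reverse=True):
        clamped = min(max(relevance, 0.0), 1.0)
        # Linear mapping: relevance 1.0 -> near plane, relevance 0.0 -> far plane.
        depth = near + (1.0 - clamped) * (far - near)
        placed.append((title, depth))
    return placed


hits = [("forum thread", 0.2), ("official docs", 1.0), ("tutorial", 0.6)]
for title, depth in layout_by_relevance(hits):
    print(f"{title}: {depth:.2f} m")
```

A real system would presumably also spread results sideways and handle dismissing or reselecting them, but the core idea of trading a ranked list for distance is already visible here.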

This instantly conveys a sense of the future, and it shows the essential difference from first-stage AR devices.

It can be seen that the open ecosystem of the AR industry has begun to enter its second stage. Apple and Rokid are heading in different directions not only in hardware but also in system software and ecosystem development. Through co-creation among hardware, algorithms, the software ecosystem, developers, users and platforms, AR will move more quickly into a second stage of rapid development within a fully open ecosystem.

Shi Wenfeng, chief engineer of Rokid system R&D, said: "The YodaOS-Master operating system turns Rokid's core technologies, including speech recognition, gesture recognition and SLAM, into system services through a service-oriented approach, and provides a variety of client SDKs so that developers can work efficiently. The SDK for Unity, for example, lets Unity developers (developer application channel: open platform website, ar.rokid.com) quickly build on Rokid's core technology."

From hardware to software, from system to ecology, Rokid's development path is a bit like Apple's during the Steve Jobs era.

“The AR industry is just before dawn,” Zhu Mingming said.

Origin blog.csdn.net/GZZN2019/article/details/132526675