Photorealistic volumetric video communications  

By Sergi Fernandez Langa – i2CAT

One of the main objectives of PRESENCE is to provide intuitive and realistic user experiences that bring real humans into interactive virtual worlds, enabling real-time, hyper-realistic volumetric multi-user communications between remotely connected people. There are multiple volumetric capture systems on the market today, but most do not work in real time, and those that do offer a quality that is still far from what users expect. This is an issue that big corporations are also addressing. In addition, current systems still struggle to scale the number of users per session. The PRESENCE project will take a step forward towards generating hyper-realistic volumetric video at scale by integrating a combination of technologies and methods. The strategy can be divided into the segments that make up the volumetric video reconstruction and transmission pipelines.

Capture sensors.

Most real-time volumetric capture systems use RGB-D cameras with time-of-flight (ToF) technology to measure the distance from the sensor to each point in the scene. From this data, a few mathematical operations yield a point cloud of the image being captured by the sensor, and by applying calibration and reconstruction methods we can obtain an approximate body representation of the user. These methods concentrate their effort on eliminating the noise introduced by the depth data, since ToF is inherently noisy (i.e., the depth values tend to "vibrate" in 3D space). The PRESENCE approach is to use lightfield technology and depth estimation, as incorporated in Raytrix cameras, to reduce the level of noise when generating a depth map of a camera view. This noise reduction is expected to drastically reduce the need for depth-correction methods, freeing up some extra milliseconds in the volumetric video pipeline for other equally relevant tasks (e.g., colour correction, surface completion, canonical body estimation, compression).
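The "mathematical operations" that turn a depth map into a point cloud boil down to back-projecting each pixel through the pinhole camera model. A minimal sketch, using NumPy and illustrative intrinsic parameters (fx, fy, cx, cy are placeholders, not values from any PRESENCE camera):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (in metres) into an (N, 3) point cloud
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop invalid pixels, which ToF sensors report as zero depth
    return points[points[:, 2] > 0]

# Toy 4x4 depth map: a flat surface 2 m from the sensor
depth = np.full((4, 4), 2.0)
pc = depth_to_point_cloud(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(pc.shape)  # → (16, 3)
```

In a real multi-camera rig, each camera's cloud would then be transformed by its calibrated extrinsics into a shared world frame before reconstruction.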

Fig. 1: A Raytrix camera installed at i2CAT labs (left) and a detail of a volumetric frame captured with a Raytrix camera (right).

Digital human reconstruction from multi-camera systems.

As indicated above, part of the effort (and associated computing time) of real-time volumetric capture systems goes into correcting the noise generated by the sensors and performing reconstruction operations that improve the visual quality of the resulting representation. This is a time-consuming task, and it is precisely what prevents current systems from working at acceptable frame rates; eventually, a trade-off must be made between quality and frame rate. In PRESENCE, our initial approach is to use traditional computer vision techniques, which deliver relatively good and realistic 3D reconstructions. To achieve real-time reconstruction we rely on signal processing algorithms (e.g., the FFT) and filtering techniques, which are fast. Although these are proven algorithms in the computer vision domain, they are constrained by the voxel grid we use, i.e., the 3D resolution of the scene: the finer the voxel grid, the lower the frame rate. To overcome this restriction, non-conventional methods must be researched and employed that are both independent of the voxel grid and capable of real-time processing. One such method we are currently researching is Gaussian Splatting rasterization, which renders photorealistic 3D scenes at high frame rates.
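The voxel-grid constraint can be made concrete with a simple downsampling sketch: every point is binned into a voxel and each occupied voxel is replaced by its centroid. Halving the voxel size multiplies the number of voxels to process by up to eight, which is why refining the grid costs frame rate. This is an illustrative NumPy example, not the PRESENCE implementation:

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Collapse all points that fall in the same voxel to their centroid.
    Smaller voxel_size means finer 3D resolution but more voxels to
    process per frame."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    # Group points by voxel key and average each group
    _, inverse, counts = np.unique(
        keys, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.ravel()
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)
    return sums / counts[:, None]

rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, size=(10_000, 3))        # synthetic cloud
coarse = voxel_downsample(pts, 0.25)             # at most 4^3 = 64 voxels
fine = voxel_downsample(pts, 0.05)               # up to 20^3 = 8000 voxels
print(len(coarse), len(fine))
```

Gaussian Splatting sidesteps this grid entirely by representing the scene as a set of 3D Gaussians rasterized directly to the screen, so its cost scales with the number of primitives rather than a fixed volumetric resolution.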

Fig. 2: i2CAT basic reconstruction from multiple K4A sensors (top) and CERTH reconstruction from multiple Kinect sensors (bottom).

Volumetric video compression.

Once a volumetric reconstruction is complete and ready to be sent over the network, data compression algorithms need to be applied; otherwise the amount of data to be transferred is unmanageable for existing commercial communication networks, as with 2D video but in an even more pronounced manner. PRESENCE will explore and compare different compression methods, focusing mainly on two approaches. One compresses the depth data itself and performs the necessary reconstruction operations at the destination (where the volume is expected to be rendered); the other compresses the full geometry of the output volume, reducing the computational effort on the rendering side. Different projection algorithms will be explored to map 3D data onto the 2D plane as optimally as possible, enabling the use of existing 2D video codecs while preserving the geometry of the targeted object.
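The core idea behind the second approach, projecting 3D data onto a 2D plane so that a standard video codec can compress it, can be sketched as a single orthographic projection that quantises depth into a 16-bit image. This is a toy stand-in for the real projection step (production pipelines typically cut the geometry into many patches and pack colour as well):

```python
import numpy as np

def project_to_depth_image(points, resolution=256):
    """Orthographically project an (N, 3) point cloud onto the XY plane,
    quantising Z into a 16-bit depth image that a 2D video codec could
    then compress."""
    mins = points.min(axis=0)
    span = points.max(axis=0) - mins + 1e-9
    norm = (points - mins) / span               # normalise to unit cube
    uv = (norm[:, :2] * (resolution - 1)).astype(int)
    z16 = (norm[:, 2] * 65535).astype(np.uint16)
    img = np.zeros((resolution, resolution), dtype=np.uint16)
    # Z-buffer: write farthest points first so the nearest one wins per pixel
    order = np.argsort(z16)[::-1]
    img[uv[order, 1], uv[order, 0]] = z16[order]
    return img

rng = np.random.default_rng(1)
cloud = rng.uniform(0, 1, size=(50_000, 3))     # synthetic cloud
depth_img = project_to_depth_image(cloud)
print(depth_img.shape, depth_img.dtype)
```

The receiver would invert the quantisation and projection to recover an approximate point cloud; the loss introduced by the 2D codec and the quantisation is exactly the quality/bitrate trade-off the project is evaluating.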

Fig. 3: Matrix with original and resulting point cloud after the application of different compression pipelines at various compression levels by i2CAT.

Scalability in holo-conferencing.

A further challenge is the scalability of multi-user sessions as the number of users represented as volumetric video grows. Scalability in the volumetric domain is not that different from the 2D video domain: the more users participating in one session, the greater the challenges. Sessions with two users are simple; a point-to-point connection is enough. As the number of users in a session increases, other architectures may be needed. Forwarding units, for instance, can serve a higher number of users per session, although beyond a certain number of users both client- and server-side components reach a threshold in terms of bandwidth or available computational resources. Mixing units can introduce smarter processing of the transferred data on the server side, so that clients receive only the portion of content that will actually be visualized. Alternatively, transcoding units can adapt the transferred data to the capabilities of the potential receivers. PRESENCE will introduce and explore different solutions to enable scalability, considering future network features (i.e., deploying computation across the edge-cloud continuum), in order to increase the number of users per session and the range of devices that can participate.
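A back-of-envelope stream count shows why forwarding units help and where they hit their own ceiling. In a full point-to-point mesh every client uploads to every peer, while with a forwarding unit each client uploads once and the server fans the streams out. The function names below are illustrative, not part of any PRESENCE component:

```python
def uplink_streams(n_users, topology):
    """Volumetric streams each client must SEND in a session.
    'mesh': every client uploads directly to every other peer.
    'sfu' : every client uploads a single stream to a forwarding unit."""
    if topology == "mesh":
        return n_users - 1
    if topology == "sfu":
        return 1
    raise ValueError(f"unknown topology: {topology}")

def sfu_egress(n_users):
    """Streams the forwarding unit must send out in total:
    each of the n users receives the other n - 1 streams."""
    return n_users * (n_users - 1)

for n in (2, 5, 10):
    print(n, uplink_streams(n, "mesh"), uplink_streams(n, "sfu"), sfu_egress(n))
```

Client uplink stays constant with a forwarding unit, but server egress still grows quadratically with the session size, which is what motivates mixing and transcoding units that reduce per-receiver data instead of merely forwarding it.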

Fig. 4: Screen capture of a scalability test by i2CAT. Thirteen prerecorded volumetric user representations of 50K points per frame at 15 fps, compressed with the CWI codec, are sent, received, decompressed and rendered in a single Unity client.