Testing User Representation and Avatar Realism in VR Navigation Tasks

Authors:
(1) Rafael Kuffner dos Anjos;
(2) Joao Madeiras Pereira.
3 TEST DESIGN
We further study the effects of realism and perspective on natural tasks. In this section we describe the main aspects of the test design with respect to user representation and the tasks themselves. The following subsections present the setup, the user representations, and the task concept.
3.1 Setup
A wide-baseline setup was used for two main reasons. First, the Kinect sensor has a limited effective range (0.4 m to 4.5 m, with skeleton tracking losing reliability beyond 2.5 m), while properly evaluating a navigation task requires a larger space. When the user is at the limits of a sensor's operating range, the quality of the experience would be compromised, so a wide-baseline setup guarantees that the whole body of the user is always visible to at least one camera. Second, since a third-person perspective is presented as one of the interaction paradigms, the participant's whole body must be visible at all times to avoid holes in the representation. A narrow-baseline or single-sensor setup would capture only half of the participant's body, greatly compromising the experience.
The five sensors are fixed to the walls of the laboratory where the study was held, covering an area of approximately 4 x 4 meters. Since the proposed navigation tasks were mainly performed along a line between the two goals, we arranged the environment so that, during the execution of the tasks, the participant was always facing a sensor: their hands were always visible in the first-person perspective, and their back in the third-person perspective. The physical setup chosen for our study can be seen in Figure 1.
3.2 User Representations
We chose three different user representations, spanning the well-known Uncanny Valley effect. Camera positioning in the third-person perspective is based on previous work by Kosch et al. [17], where the camera is placed above the user's head for improved spatial awareness.
In all of the representations, the Kinect's joint positions and rotations are mapped directly onto the avatars using direct kinematics.
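As an illustration, the sketch below shows one way such a direct mapping could be implemented: each tracked Kinect joint pose is copied as-is onto the corresponding avatar bone, with no inverse-kinematics solve. The joint-to-bone table, type names, and data layout are assumptions made for the example, not details taken from the original system.

```python
# Minimal sketch of driving an avatar rig directly from Kinect joint data.
# KINECT_TO_RIG, JointSample, and AvatarBone are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class JointSample:
    position: tuple[float, float, float]            # sensor-space metres
    rotation: tuple[float, float, float, float]     # quaternion (x, y, z, w)

@dataclass
class AvatarBone:
    position: tuple[float, float, float] = (0.0, 0.0, 0.0)
    rotation: tuple[float, float, float, float] = (0.0, 0.0, 0.0, 1.0)

# Hypothetical mapping from Kinect joint names to rig bone names.
KINECT_TO_RIG = {
    "SpineBase": "hips",
    "ShoulderLeft": "upper_arm_l",
    "ElbowLeft": "lower_arm_l",
    "HandLeft": "hand_l",
    # ... remaining tracked joints
}

def apply_skeleton(frame: dict[str, JointSample],
                   rig: dict[str, AvatarBone]) -> None:
    """Copy tracked joint poses onto the avatar bones (direct kinematics:
    the sensor pose is used as-is, with no IK solve)."""
    for kinect_name, bone_name in KINECT_TO_RIG.items():
        sample = frame.get(kinect_name)
        if sample is None:          # joint not tracked this frame
            continue
        bone = rig[bone_name]
        bone.position = sample.position
        bone.rotation = sample.rotation
```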
3.2.1 Abstract
The first avatar is a simplified representation composed of abstract components: spheres for each joint and the head, and cylinders for each bone connecting two joints. Only the joints tracked by the Microsoft Kinect were represented. Figures 2a and 2b show this representation in the First and Third Person Perspectives (1PP and 3PP), respectively.
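A minimal sketch of how such an avatar could be assembled from primitives follows; the Primitive type, the bone-pair list, and the default radius are illustrative assumptions rather than the values used in the study.

```python
# Sketch of assembling the abstract avatar: one sphere per tracked joint
# and one cylinder per bone connecting a joint to its parent.
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]

@dataclass
class Primitive:
    kind: str                      # "sphere" or "cylinder"
    center: Vec3
    size: float                    # sphere radius or cylinder length
    axis: Vec3 = (0.0, 1.0, 0.0)   # cylinder direction (unit vector)

# Illustrative subset of the parent-child joint pairs tracked by the Kinect.
BONE_PAIRS = [("ShoulderLeft", "ElbowLeft"), ("ElbowLeft", "HandLeft")]

def build_abstract_avatar(joints: dict[str, Vec3],
                          joint_radius: float = 0.05) -> list[Primitive]:
    prims = [Primitive("sphere", p, joint_radius) for p in joints.values()]
    for parent, child in BONE_PAIRS:
        a, b = joints[parent], joints[child]
        mid = tuple((x + y) / 2 for x, y in zip(a, b))
        length = math.dist(a, b)
        delta = tuple(y - x for x, y in zip(a, b))
        axis = tuple(d / length for d in delta) if length > 0 else (0.0, 1.0, 0.0)
        prims.append(Primitive("cylinder", mid, length, axis))
    return prims
```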
3.2.2 Mesh
The second representation is a realistic mesh avatar resembling a human being. This representation does not include animation for individual fingers, since they are not tracked by the Kinect sensor. Figures 2c and 2d show this representation in the First and Third Person Perspectives (1PP and 3PP), respectively.
3.2.3 Point Cloud
This body representation is based on the combination of separate point-cloud streams from the Microsoft Kinect sensors. Each sensor first captures the skeletal information for each person in its field of view. Next, a point cloud is created from the depth and color values seen by the camera, and the points belonging to the users are segmented from the background.
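The sketch below illustrates one way this per-sensor segmentation step could look, assuming a Kinect-style body-index mask is available alongside the depth and color frames; the camera intrinsics are placeholder values, not taken from the paper.

```python
# Rough sketch of per-sensor user segmentation: depth pixels are
# back-projected to 3D and only those labelled as belonging to a tracked
# body are kept. Intrinsics (fx, fy, cx, cy) are assumed placeholders.
import numpy as np

def segment_user_points(depth_mm: np.ndarray,       # (H, W) uint16, millimetres
                        color: np.ndarray,           # (H, W, 3) uint8
                        body_index: np.ndarray,      # (H, W) uint8, 255 = background
                        fx=365.0, fy=365.0, cx=256.0, cy=212.0):
    h, w = depth_mm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_mm.astype(np.float32) / 1000.0         # metres
    user_mask = (body_index != 255) & (z > 0)

    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)[user_mask]  # (N, 3) user points
    colors = color[user_mask]                         # (N, 3) matching colours
    return points, colors
```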
At several points of the interaction, different cameras will be sending very similar information; due to the time-constrained nature of our problem, integration of the different streams or redundancy resolution is not performed. Instead, we implemented a simple decimation technique that takes into account which body parts are more relevant to the task at hand. Using the captured skeleton information, we assign different priorities to each joint according to user-defined parameters. For the virtual reality scenario, information about the hands was found to be more valuable, while head information was discarded.
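A possible implementation of this priority-driven decimation is sketched below: each point is assigned to its nearest skeleton joint, and a per-joint keep ratio decides how aggressively that region is thinned. The ratios and the nearest-joint assignment rule are assumptions made for the example, not the parameters used in the study.

```python
# Illustrative priority-based decimation of a user point cloud.
import numpy as np

KEEP_RATIO = {
    "HandLeft": 1.0, "HandRight": 1.0,   # hands: keep everything (assumed)
    "Head": 0.0,                         # head: discard
    "default": 0.3,                      # other regions: heavy thinning (assumed)
}

def decimate(points: np.ndarray,                 # (N, 3) user points
             joints: dict[str, np.ndarray],      # joint name -> (3,) position
             rng=np.random.default_rng(0)) -> np.ndarray:
    names = list(joints)
    centers = np.stack([joints[n] for n in names])              # (J, 3)
    nearest = np.argmin(
        np.linalg.norm(points[:, None, :] - centers[None], axis=-1), axis=1)
    ratios = np.array([KEEP_RATIO.get(n, KEEP_RATIO["default"]) for n in names])
    keep = rng.random(len(points)) < ratios[nearest]
    return points[keep]
```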
Since each sensor is connected to its own computer, we transmit both the skeleton and the point-cloud data through the network to the host PC where the application is running. For each point we transmit (x, y, z, r, g, b, q): the point's 3D coordinates, its color, and one bit indicating high or low quality. This last bit is needed to adjust the rendering parameters in the host application: higher-quality points are more tightly grouped and require smaller splat sizes, while under-sampled regions must use bigger splats to create closed surfaces.
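One possible wire layout for this per-point tuple is sketched below; the actual packing used in the original system is not specified, and the quality bit is stored in a full byte here purely for alignment simplicity.

```python
# Sketch of a wire format for (x, y, z, r, g, b, q): three 32-bit floats,
# three colour bytes, and one byte carrying the quality flag.
import struct

POINT_FMT = "<fff3BB"                       # x, y, z, r, g, b, quality
POINT_SIZE = struct.calcsize(POINT_FMT)     # 16 bytes per point

def pack_point(x, y, z, r, g, b, high_quality: bool) -> bytes:
    return struct.pack(POINT_FMT, x, y, z, r, g, b, 1 if high_quality else 0)

def unpack_points(payload: bytes):
    """Yield ((x, y, z), (r, g, b), high_quality) for each point in a packet."""
    for off in range(0, len(payload), POINT_SIZE):
        x, y, z, r, g, b, q = struct.unpack_from(POINT_FMT, payload, off)
        yield (x, y, z), (r, g, b), bool(q)
```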
The position of each sensor is configured on the host computer beforehand, after a calibration step. Data received over the network is then parsed and rendered in the environment using surface-aligned splats. For interaction purposes, the transmitted skeleton information is used. We are able to parse and render the transmitted avatars at 30 frames per second, allowing for smooth interaction on the user's side. Figures 2e and 2f show this representation in the first- and third-person perspectives.
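The sketch below illustrates the host-side preparation this paragraph describes, under assumed calibration matrices and splat radii: incoming points are transformed from each sensor's frame into the shared world frame, and the transmitted quality flag selects the splat radius handed to the renderer.

```python
# Host-side sketch: per-sensor calibration transform and quality-dependent
# splat sizing. Matrix values and radii are assumed, not from the paper.
import numpy as np

SPLAT_RADIUS = {True: 0.004, False: 0.012}   # metres: dense vs. under-sampled

def to_world(points: np.ndarray,             # (N, 3) sensor-space points
             sensor_pose: np.ndarray          # (4, 4) calibrated sensor pose
             ) -> np.ndarray:
    """Move a sensor's points into the shared world coordinate frame."""
    homo = np.hstack([points, np.ones((len(points), 1))])   # (N, 4)
    return (homo @ sensor_pose.T)[:, :3]

def splat_radii(quality_flags: np.ndarray) -> np.ndarray:
    """Per-point splat radius: small for tightly grouped points, large for sparse ones."""
    return np.where(quality_flags, SPLAT_RADIUS[True], SPLAT_RADIUS[False])
```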