UX Testing Agents

User experience (UX) is a broad, complex, and multi-faceted concept. It encompasses many different components of human experience, along with characteristics of the system itself and of the surrounding environment. Examples include the emotional state of users interacting with a system, their cognitive load, and the difficulty of using the system (e.g., how difficult a game level is).

Testing UX is of paramount importance for developing systems that users want to use and enjoy using. However, with traditional approaches based on user testing, it is an expensive and time-consuming endeavour. To mitigate this issue, the iv4XR framework offers Socio-Emotional Testing Agents (SETAs) capable of testing different components of UX.

The Modular Socio-Emotional Agent

Given the complex and multi-faceted nature of UX, it is infeasible to have a single, one-dimensional measure of it. Instead, one can measure or predict the individual components of UX that are most relevant to testers and designers. With this in mind, we designed the SETAs to be modular. They can be run with different modules, each giving them the ability to predict a different component of UX, and they can easily be extended with new modules as new predictors are developed or trained.


For example, for the designer of an airplane piloting simulation, the most relevant components of UX to predict might be cognitive load and level of arousal, while the user's happiness or enjoyment might matter little. For the designer of a video game, those might be the most relevant components. It is thus important that the SETAs can adapt to the system under test and to the goals of its designers.
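To make the modular design concrete, here is a minimal Python sketch of an agent whose UX predictions come from plug-in modules. The class `ModularUXAgent`, its methods, and the two toy predictors (`cognitive_load`, `arousal`) are hypothetical illustrations, not the iv4XR API; the formulas inside the predictors are placeholders for trained models.

```python
from typing import Callable, Dict, List, Mapping

# Hypothetical observation format: named numeric features of the current state.
Observation = Mapping[str, float]
# A UX module maps an observation to a prediction for one UX component.
UXModule = Callable[[Observation], float]


class ModularUXAgent:
    """Agent whose UX predictions are provided by pluggable modules."""

    def __init__(self) -> None:
        self._modules: Dict[str, UXModule] = {}

    def add_module(self, component: str, module: UXModule) -> None:
        # Registering a module extends the agent with a new UX component.
        self._modules[component] = module

    def components(self) -> List[str]:
        return sorted(self._modules)

    def predict(self, observation: Observation) -> Dict[str, float]:
        # Every registered module sees the same observation.
        return {name: module(observation) for name, module in self._modules.items()}


# Placeholder predictors; a real module would wrap a trained model.
def cognitive_load(obs: Observation) -> float:
    return min(1.0, obs.get("instruments_changing", 0.0) / 10.0)


def arousal(obs: Observation) -> float:
    return min(1.0, obs.get("speed", 0.0) / 300.0)


# A piloting-simulation tester enables only the components it cares about.
pilot_agent = ModularUXAgent()
pilot_agent.add_module("cognitive_load", cognitive_load)
pilot_agent.add_module("arousal", arousal)
print(pilot_agent.predict({"instruments_changing": 5, "speed": 150}))
# prints {'cognitive_load': 0.5, 'arousal': 0.5}
```

Adding a module for a new UX component only requires registering another callable; the agent itself does not change.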

The same modular concept applies to the behaviour of the SETAs. Different systems under test require agents that behave differently to test their UX, and even the same system might require different behaviours for different scenarios and UX components. For some systems, it is feasible and sensible to test UX with an agent that tries to cover all possible actions and behaviours. For others, such as a real-life simulation or a complex video game, this might be infeasible given the size of the action space, as well as irrelevant for UX testing.
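The same plug-in idea can be sketched for behaviour. Below, two hypothetical behaviour policies share one interface: an exhaustive one that enumerates every action sequence (viable only for small action spaces) and a sampled one that draws a fixed budget of sequences from weighted action preferences, closer in spirit to a persona. All class names, the interface, and the action set are assumptions for illustration.

```python
import itertools
import random
from typing import Iterator, Protocol, Sequence, Tuple

Action = str


class BehaviourPolicy(Protocol):
    """Hypothetical interface shared by all behaviour modules."""

    def sequences(self, actions: Sequence[Action], length: int) -> Iterator[Tuple[Action, ...]]:
        ...


class ExhaustivePolicy:
    """Enumerates every action sequence; only viable for small action spaces."""

    def sequences(self, actions: Sequence[Action], length: int) -> Iterator[Tuple[Action, ...]]:
        return itertools.product(actions, repeat=length)


class SampledPersonaPolicy:
    """Draws a fixed budget of sequences from weighted action preferences."""

    def __init__(self, weights: Sequence[float], samples: int, seed: int = 0) -> None:
        self.weights = list(weights)
        self.samples = samples
        self.rng = random.Random(seed)

    def sequences(self, actions: Sequence[Action], length: int) -> Iterator[Tuple[Action, ...]]:
        for _ in range(self.samples):
            yield tuple(self.rng.choices(actions, weights=self.weights, k=length))


def run_tests(policy: BehaviourPolicy, actions: Sequence[Action], length: int) -> int:
    # Stand-in for executing each sequence on the system under test:
    # here we only count how many runs the policy would require.
    return sum(1 for _ in policy.sequences(actions, length))


actions = ["up", "down", "left", "right", "attack"]
print(run_tests(ExhaustivePolicy(), actions, 6))                                  # 15625
print(run_tests(SampledPersonaPolicy([4, 1, 2, 2, 1], samples=100), actions, 6))  # 100
```

With five actions and sequences of length six, exhaustive coverage already needs 15,625 runs, while the sampled policy stays within whatever budget is chosen.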

How to Use It

PAD Model of Emotion (Machine Learning Approach)

One component of UX is the emotional state of the user, which changes as the user interacts with the system under test. One of the modules developed for the iv4XR framework is an emotion prediction module based on the PAD model of emotion, which describes human emotions along three dimensions: Pleasure, Arousal, and Dominance. For this module, we used machine learning to train a predictive model for each PAD dimension from data collected on the system under test.

To train the model, we developed a game named “Flower Hunter”, inspired by old-school top-down 2D games such as The Legend of Zelda and Pac-Man. It was designed to be easily modifiable, fast-running, compatible with Python machine-learning libraries, and entertaining enough to motivate users to play it.

The game and all of the required code and data for training and using the predictive PAD model can be found in the following GitHub repository: https://github.com/iv4xr-project/PAD_emotion_game

Training the Model

To train the predictive model on the traces already collected in a user study with 91 participants, download the code from the GitHub repository linked above and run the following in a terminal from the repository's root folder:

python3 predictor.py train_model

Once this completes successfully, three new files appear: “trained_forest_Pleasure.pkl”, “trained_forest_Arousal.pkl”, and “trained_forest_Dominance.pkl”. Each contains a trained predictive model for one of the three PAD dimensions.
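After training, the three model files can be loaded from Python for inspection or reuse. The sketch below assumes the files are standard pickles, as the `.pkl` extension suggests, and that they sit in the folder passed in; the helper name `load_pad_models` is hypothetical. Unpickling a model requires the library that created it to be installed, and pickle files should only be loaded from trusted sources.

```python
import pickle
from pathlib import Path
from typing import Any, Dict

# Model files written by `python3 predictor.py train_model`.
DIMENSIONS = ("Pleasure", "Arousal", "Dominance")


def load_pad_models(folder: str = ".") -> Dict[str, Any]:
    """Load the three per-dimension models, keyed by lower-case dimension name."""
    models: Dict[str, Any] = {}
    for dim in DIMENSIONS:
        path = Path(folder) / f"trained_forest_{dim}.pkl"
        if not path.exists():
            raise FileNotFoundError(f"{path} not found; run `python3 predictor.py train_model` first")
        with path.open("rb") as fh:
            # Assumes a standard pickle; the library that created the model must be importable.
            models[dim.lower()] = pickle.load(fh)
    return models
```

For example, `load_pad_models(".")["arousal"]` would return the arousal model when run from the repository root after training.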

Predicting Using the Model

To predict one of the emotional dimensions using its trained model, run the following in a terminal:

python3 predictor.py predict_using_trained_model DIMENSION

where DIMENSION is “pleasure”, “arousal”, or “dominance”, depending on which emotional dimension one wishes to predict. This reads the traces in the “/Traces” folder and prints the predicted emotional class (Increasing or Decreasing/Steady) for each 4-second slice of the traces. For example, to predict the classes for the arousal dimension, run:

python3 predictor.py predict_using_trained_model arousal
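The idea of per-slice classes can be illustrated with a short sketch that groups a timestamped trace into consecutive 4-second slices and assigns each slice a binary label. The labelling rule here (Increasing when an annotated value ends higher than it started, Decreasing/Steady otherwise), the trace format, and the function names are assumptions made for illustration; the repository's `predictor.py` may derive its features and labels differently.

```python
from typing import List, Sequence, Tuple

SLICE_SECONDS = 4.0


def slice_trace(timestamps: Sequence[float], window: float = SLICE_SECONDS) -> List[Tuple[int, int]]:
    """Group sorted sample timestamps into consecutive windows.

    Returns half-open (start, end) index ranges; windows with no samples are skipped.
    """
    if not timestamps:
        return []
    slices: List[Tuple[int, int]] = []
    t0, start, current = timestamps[0], 0, 0
    for i, t in enumerate(timestamps):
        k = int((t - t0) // window)  # index of the window this sample falls in
        if k != current:
            slices.append((start, i))
            start, current = i, k
    slices.append((start, len(timestamps)))
    return slices


def label_slice(values: Sequence[float]) -> str:
    """Assumed binary label: did the annotated value rise over the slice?"""
    return "Increasing" if values[-1] > values[0] else "Decreasing/Steady"


timestamps = [0.0, 1.0, 2.5, 3.9, 4.0, 5.5, 7.9, 8.2]
arousal_annotation = [0.1, 0.2, 0.2, 0.4, 0.4, 0.3, 0.3, 0.5]
for start, end in slice_trace(timestamps):
    print(timestamps[start], label_slice(arousal_annotation[start:end]))
# 0.0 Increasing
# 4.0 Decreasing/Steady
# 8.2 Decreasing/Steady
```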

Persona Agents

How does a user behave? For most games and simulations, there is no single answer: in exactly the same scenario, two users can behave in completely different ways. This makes simulating user behaviour a complex task. In most games, it is infeasible to simulate every possible sequence of actions a player might take. Even when it is possible, many of those sequences might be highly unlikely to be chosen by a player, whereas others might be extremely common. Knowing which sequences of actions best reflect the behaviour of real users can help developers and testers make better decisions.

With this in mind, the iv4XR framework implements Persona Agents, that is, agents that behave like a specific user or subset of users. To build them, we cluster user traces according to a chosen behavioural distance metric and then use a genetic algorithm to evolve agent parameters so that the agents behave like the representatives of the clusters found.
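As an example of a behavioural distance metric, traces can be treated as sequences of actions and compared with the Levenshtein (edit) distance. This is one possible choice, assumed here for illustration; the metric actually used in the repository may differ, and the action names are hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i] + [0] * len(b)
        for j, y in enumerate(b, start=1):
            cur[j] = min(prev[j] + 1,              # delete x
                         cur[j - 1] + 1,           # insert y
                         prev[j - 1] + (x != y))   # substitute (free if equal)
        prev = cur
    return prev[-1]


def distance_matrix(traces: Sequence[Sequence[str]]) -> List[List[int]]:
    """Pairwise behavioural distances between all traces."""
    return [[edit_distance(s, t) for t in traces] for s in traces]


explorer = ["up", "up", "left", "pick", "up"]
rusher = ["up", "up", "up", "up"]
print(edit_distance(explorer, rusher))  # 2: substitute "left", delete "pick"
```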


Clustering the Traces

To cluster the user-study traces included in the GitHub repository mentioned above, run the following in a terminal:

python3 trace_processor.py

This will cluster the traces of a specific level into folders inside the “/Clusters” folder, which can then be used to evolve the Persona Agents.
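Given pairwise distances between traces, k-medoids is a clustering method whose representatives are actual traces. The following sketch, a simple alternating k-medoids with farthest-first initialisation, is an illustrative assumption rather than the algorithm implemented in `trace_processor.py`; each returned medoid plays the role of a cluster exemplar.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def _assign(dist: Matrix, medoids: List[int]) -> Dict[int, List[int]]:
    """Assign every item to its nearest medoid (medoids keep themselves)."""
    clusters: Dict[int, List[int]] = {m: [] for m in medoids}
    for i in range(len(dist)):
        nearest = i if i in clusters else min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters


def k_medoids(dist: Matrix, k: int, max_iter: int = 100) -> Tuple[List[int], Dict[int, List[int]]]:
    """Cluster items from a pairwise distance matrix; medoids act as exemplars."""
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of items")
    # Farthest-first initialisation, starting from item 0.
    medoids = [0]
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(max(candidates, key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        clusters = _assign(dist, medoids)
        # Update: the member with the smallest total distance to its cluster.
        new_medoids = [min(members, key=lambda c: sum(dist[c][j] for j in members))
                       for members in clusters.values()]
        if sorted(new_medoids) == sorted(medoids):
            return medoids, clusters
        medoids = new_medoids
    return medoids, _assign(dist, medoids)


points = [0, 1, 2, 10, 11, 12]  # toy 1-D "behaviour summaries"
dist = [[abs(a - b) for b in points] for a in points]
medoids, clusters = k_medoids(dist, k=2)
print(medoids, clusters)  # [1, 4] {1: [0, 1, 2], 4: [3, 4, 5]}
```

In practice the distance matrix would come from a behavioural metric computed over the user traces.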

Evolving the Agents

To evolve a set of agents, run the following in a terminal:

python3 evolution.py

This uses the clusters saved in the “/Saved_Clusters” folder and evolves agents to behave like the exemplar of each cluster. The parameter evolution is saved in the “/Saved_Personas” folder. The parameters of the fittest agent in the last generation can then be used when deploying the agents.
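The evolution step can be sketched as a plain genetic algorithm with elitism, one-point crossover, and Gaussian mutation. In this hypothetical setup, an agent is a vector of parameters in [0, 1], its simulated behaviour is summarised as normalised action frequencies, and fitness is the negative L1 distance to an exemplar's frequencies. The function names and encoding are assumptions; `evolution.py` may encode agents and fitness differently.

```python
import random
from typing import Callable, List, Sequence, Tuple

Params = List[float]
Fitness = Callable[[Params], float]


def make_fitness(exemplar_freqs: Sequence[float]) -> Fitness:
    """Fitness = -L1 distance between the agent's and the exemplar's action frequencies."""
    def fitness(params: Params) -> float:
        total = sum(params) or 1.0
        freqs = [p / total for p in params]  # toy stand-in for simulating the agent
        return -sum(abs(f - e) for f, e in zip(freqs, exemplar_freqs))
    return fitness


def evolve(fitness: Fitness, n_params: int, pop_size: int = 30, generations: int = 60,
           mutation_sd: float = 0.1, seed: int = 0) -> Tuple[Params, float, List[float]]:
    """Return the fittest final individual, its fitness, and the best fitness per generation."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_params)] for _ in range(pop_size)]
    history: List[float] = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        elite = ranked[: max(1, pop_size // 5)]    # top 20% survive unchanged
        parents = ranked[: max(2, pop_size // 2)]  # top half may reproduce
        children: List[Params] = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_params) if n_params > 1 else 1
            child = a[:cut] + b[cut:]  # one-point crossover
            # Gaussian mutation, clipped to the valid range [0, 1].
            children.append([min(1.0, max(0.0, g + rng.gauss(0.0, mutation_sd))) for g in child])
        population = elite + children
    best = max(population, key=fitness)
    return best, fitness(best), history


best, best_fit, history = evolve(make_fitness([0.5, 0.3, 0.2]), n_params=3)
print([round(p / sum(best), 2) for p in best], round(best_fit, 3))
```

Keeping the elite unchanged guarantees that the best fitness never decreases from one generation to the next, so the fittest agent of the last generation is also the fittest agent seen during the run.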

Deploying the Agents

To deploy the agents, run the following in a terminal:

python3 flower_hunter.py

This runs eight different agents, representing the eight clusters found when clustering the user traces collected over the three levels used for data collection.

Example Persona-Agent Behaviour


OCC Model of Emotion (Model-Based Approach)

A different approach to predicting emotions is to develop a formal model of appraisal for event-based emotions. In the iv4XR framework, we implement an event-based transition system that formalizes relevant emotions using the Ortony, Clore, and Collins (OCC) theory of emotions. The model is built on top of iv4XR’s tactical agent programming library to create intelligent UX test agents capable of appraising emotions in our first game case study, Lab Recruits. The results can be shown graphically as heat maps, as in the following figure.


Visualizing the test agent’s emotions ultimately helps game designers produce content that evokes a certain experience in players. The code for running this module can be found in the following GitHub repository: https://github.com/iv4xr-project/eplaytesting-pipeline
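To give a flavour of event-based appraisal, the sketch below implements a toy transition system for a single goal. Each event decays the current emotions and then appraises the new situation: hope and fear follow the believed likelihood of reaching the goal, and joy or distress is triggered when the goal is achieved or failed. Emotion intensities are then averaged per grid cell to form a heat map. The event names, parameters, and update rules are simplified assumptions, not the model in the eplaytesting-pipeline repository.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Position = Tuple[int, int]


def _neutral() -> Dict[str, float]:
    return {"hope": 0.0, "fear": 0.0, "joy": 0.0, "distress": 0.0}


@dataclass
class GoalAppraiser:
    """Toy OCC-style appraisal of one goal (hypothetical parameters)."""
    desirability: float = 1.0  # how much the agent wants the goal
    decay: float = 0.9         # per-event decay of all intensities
    emotions: Dict[str, float] = field(default_factory=_neutral)

    def step(self, event: str, likelihood: float) -> Dict[str, float]:
        """One transition: decay the current state, then appraise the event."""
        for name in self.emotions:
            self.emotions[name] *= self.decay
        if event == "goal_achieved":
            self.emotions["joy"] = max(self.emotions["joy"], self.desirability)
            self.emotions["hope"] = self.emotions["fear"] = 0.0
        elif event == "goal_failed":
            self.emotions["distress"] = max(self.emotions["distress"], self.desirability)
            self.emotions["hope"] = self.emotions["fear"] = 0.0
        else:
            # Prospect-based emotions: hope follows the believed likelihood of
            # success, fear the likelihood of failure.
            self.emotions["hope"] = max(self.emotions["hope"], self.desirability * likelihood)
            self.emotions["fear"] = max(self.emotions["fear"], self.desirability * (1.0 - likelihood))
        return dict(self.emotions)


def emotion_heatmap(trace: List[Tuple[Position, Dict[str, float]]], emotion: str) -> Dict[Position, float]:
    """Average intensity of one emotion in every visited grid cell."""
    totals: Dict[Position, float] = defaultdict(float)
    counts: Dict[Position, int] = defaultdict(int)
    for pos, emotions in trace:
        totals[pos] += emotions[emotion]
        counts[pos] += 1
    return {pos: totals[pos] / counts[pos] for pos in totals}


walk = [((0, 0), "move", 0.5), ((1, 0), "door_found", 0.8), ((2, 0), "goal_achieved", 1.0)]
agent = GoalAppraiser()
trace = [(pos, agent.step(event, p)) for pos, event, p in walk]
print(emotion_heatmap(trace, "joy"))  # {(0, 0): 0.0, (1, 0): 0.0, (2, 0): 1.0}
```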

How to Run

To run the code and replicate the results, follow this video tutorial:



Defining Desired Experience Outcomes

At the basis of UX testing is the experience (or set of experiences) that the designer wants users to have. Defining these experiences is not trivial. A desired (or undesired) experience can be a complex, spatiotemporal phenomenon that cannot easily be verified with the assertions commonly used in traditional functional testing. A designer might have a very specific emotional experience in mind for users and needs a way of expressing it that can be verified automatically.

Within the iv4XR framework, we have developed a language for defining Desired Experience Outcomes based on temporal logic, which allows designers to directly encode the emotional experience they want users to have over time and across the different locations of a simulation.
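The temporal-logic idea can be sketched with a few combinators over a finite trace of agent states, where each state records a zone and predicted emotion intensities. The operators below (`always`, `eventually`, `until`, and a location filter `within`) follow standard finite-trace semantics; their names, the state format, and the example DEO are assumptions, not the syntax of the iv4XR DEO language.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Trace = List[State]
Predicate = Callable[[State], bool]
Formula = Callable[[Trace], bool]


def always(p: Predicate) -> Formula:
    return lambda trace: all(p(s) for s in trace)


def eventually(p: Predicate) -> Formula:
    return lambda trace: any(p(s) for s in trace)


def until(p: Predicate, q: Predicate) -> Formula:
    """p holds at every step before the first step where q holds; q must occur."""
    def check(trace: Trace) -> bool:
        for s in trace:
            if q(s):
                return True
            if not p(s):
                return False
        return False
    return check


def within(zone: str, formula: Formula) -> Formula:
    """Evaluate a formula only over the states located in `zone`."""
    return lambda trace: formula([s for s in trace if s["zone"] == zone])


# Hypothetical DEO: calm until the boss room, scared inside it, never too distressed.
deo = [
    until(lambda s: s["fear"] < 0.3, lambda s: s["zone"] == "boss_room"),
    within("boss_room", eventually(lambda s: s["fear"] > 0.6)),
    always(lambda s: s["distress"] < 0.8),
]

trace = [
    {"zone": "hall", "fear": 0.1, "distress": 0.0},
    {"zone": "hall", "fear": 0.2, "distress": 0.1},
    {"zone": "boss_room", "fear": 0.7, "distress": 0.3},
]
print(all(formula(trace) for formula in deo))  # True
```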


Once a Desired Experience Outcome (DEO) has been defined, it can be used as the basis for testing. An example scenario is shown in the figure below. Having defined the DEO, an XR game designer can train a set of Persona Agents to interact with the system and use one of the emotion prediction models to obtain an experience traversal for each agent-level pair. These traversals can then be compared automatically to the DEO, telling the designer which agent-level pairs satisfy the intended experience. The designer can use this information to decide whether the levels require changes. If so, the designer can alter the levels and quickly repeat the test routine to see how the changes affected the experience of the different types of users.
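The comparison step can be sketched as a loop over agent-level pairs. Here `traverse` stands in for running a persona agent with an emotion predictor on a level, and the DEO is any function that accepts or rejects a traversal; both, along with the agent and level names and the helper functions, are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Traversal = List[Dict[str, float]]
Report = Dict[Tuple[str, str], bool]


def deo_report(agents: Iterable[str], levels: Iterable[str],
               traverse: Callable[[str, str], Traversal],
               deo: Callable[[Traversal], bool]) -> Report:
    """Check the DEO for every (agent, level) pair."""
    level_list = list(levels)
    return {(agent, level): deo(traverse(agent, level))
            for agent in agents for level in level_list}


def levels_to_revisit(report: Report) -> List[str]:
    """Levels where at least one persona's experience violates the DEO."""
    return sorted({level for (_, level), ok in report.items() if not ok})


# Toy stand-ins for persona runs and for a DEO ("joy peaks above 0.5").
fake_runs = {
    ("explorer", "level1"): [{"joy": 0.2}, {"joy": 0.7}],
    ("explorer", "level2"): [{"joy": 0.1}, {"joy": 0.3}],
    ("rusher", "level1"): [{"joy": 0.6}],
    ("rusher", "level2"): [{"joy": 0.9}],
}
report = deo_report(["explorer", "rusher"], ["level1", "level2"],
                    traverse=lambda agent, lvl: fake_runs[(agent, lvl)],
                    deo=lambda tr: max(s["joy"] for s in tr) > 0.5)
print(levels_to_revisit(report))  # ['level2']
```

After a level is altered, rerunning the same report shows directly which personas' experiences changed.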


UX Modules Being Developed and Researched

Besides our two main UX modules based on emotional models, we have conducted research on other components of UX, which are currently being developed and further integrated into the framework.

Difficulty Estimation

The difficulty of interacting with a system, such as a game, can have a very direct impact on UX. We are therefore developing a method to estimate difficulty using machine learning agents with added noise. Agents first learn to solve a game level through reinforcement learning, and we then add different types of noise to the learnt solution. The more the noise impairs the agent’s ability to solve the level, the more difficult the level is.

GitHub Repository: https://github.com/iv4xr-project/difficultysch
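The noise-based idea can be illustrated on a toy grid world. In this sketch, a fixed action plan stands in for the solution learnt through reinforcement learning, noise replaces each action with a random one with probability p, and difficulty is one minus the average success rate across noise levels. The grids, plan, noise levels, function names, and scoring are assumptions for illustration and are not taken from the difficultysch repository.

```python
import random
from typing import List, Sequence

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def run_plan(grid: Sequence[str], plan: Sequence[str]) -> bool:
    """Execute a plan ('S' start, 'G' goal, '#' pit, '.' floor); True if the goal is reached."""
    r, c = next((i, row.index("S")) for i, row in enumerate(grid) if "S" in row)
    for move in plan:
        dr, dc = MOVES[move]
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[nr]):  # moves off the grid are ignored
            r, c = nr, nc
        if grid[r][c] == "#":
            return False  # fell into a pit
        if grid[r][c] == "G":
            return True
    return False


def add_noise(plan: Sequence[str], p: float, rng: random.Random) -> List[str]:
    """Replace each action with a uniformly random one with probability p."""
    return [rng.choice(list(MOVES)) if rng.random() < p else move for move in plan]


def estimate_difficulty(grid: Sequence[str], plan: Sequence[str],
                        noise_levels: Sequence[float] = (0.1, 0.2, 0.3),
                        trials: int = 2000, seed: int = 0) -> float:
    """1 - mean success rate of the noise-free solution under increasing noise."""
    rng = random.Random(seed)
    successes = 0
    for p in noise_levels:
        for _ in range(trials):
            successes += run_plan(grid, add_noise(plan, p, rng))
    return 1.0 - successes / (trials * len(noise_levels))


OPEN = ["S...", "....", "...G"]
PITS = ["S...", "###.", "###G"]
PLAN = "RRRDD"  # stand-in for a policy learnt with reinforcement learning
print(estimate_difficulty(OPEN, PLAN), estimate_difficulty(PITS, PLAN))
```

Because the noisy plans depend only on the random seed, both levels are compared on identical perturbations, and the level with pits next to the path scores as harder.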

Motion Sickness in VR

Motion sickness when using VR systems, also called cybersickness, can negatively impact UX. We are therefore exploring methods of automatically detecting motion sickness in VR simulations in order to prevent users from experiencing it. We have conducted a user study and trained a predictive model to detect motion sickness based only on pixel information.

GitHub Repository: https://github.com/iv4xr-project/CSPredictionInVR
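As a hedged illustration of working from pixels alone, the sketch below computes a simple frame-change feature, the mean absolute change between consecutive grayscale frames, and flags windows whose average change exceeds a threshold. Both the feature and the threshold are hypothetical stand-ins; the model trained in the CSPredictionInVR study is not reproduced here.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # grayscale frame, pixel values 0..255


def frame_motion(prev: Frame, cur: Frame) -> float:
    """Mean absolute per-pixel change between two equally sized frames."""
    total = sum(abs(a - b) for row_p, row_c in zip(prev, cur) for a, b in zip(row_p, row_c))
    return total / (len(cur) * len(cur[0]))


def motion_features(frames: Sequence[Frame]) -> List[float]:
    """One change value per pair of consecutive frames."""
    return [frame_motion(frames[i - 1], frames[i]) for i in range(1, len(frames))]


def flag_windows(features: Sequence[float], window: int, threshold: float) -> List[bool]:
    """Flag each full window whose mean change exceeds a (hypothetical) threshold."""
    return [sum(features[i:i + window]) / window > threshold
            for i in range(len(features) - window + 1)]


dark = [[0, 0], [0, 0]]
bright = [[255, 255], [255, 255]]
frames = [dark, dark, dark, bright, dark]  # a static scene, then large rapid pixel changes
features = motion_features(frames)
print(features)                                           # [0.0, 0.0, 255.0, 255.0]
print(flag_windows(features, window=2, threshold=100.0))  # [False, True, True]
```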

Cognitive Load

Cognitive load is a measure of how overwhelmed a user is by the information being presented. We have conducted a study to test whether a model of cognitive load could automatically and accurately predict the cognitive load of players interacting with a puzzle game.

GitHub Repository: https://github.com/albertoramos1997/WayOut

Testing of Interactive Storytelling

In simulations with a narrative that users can influence through their decisions, it becomes relevant to know how those decisions influence the state of the system and which states are reachable. We therefore developed a tool that lets developers and testers explore how users’ decisions influence the outcomes and final states of the system, which also helps find decision loops and unreachable or hard-to-reach states.

GitHub Repository: https://github.com/iv4xr-project/in-story-validator
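The reachability analysis can be sketched on a story graph whose nodes are narrative states and whose edges are decisions. Breadth-first search gives each reachable state's minimum number of decisions (one simple proxy for how hard it is to reach), the complement gives unreachable states, and back edges found by depth-first search reveal decision loops. The graph format, function names, and example story are assumptions, not the in-story-validator's interface; for very large graphs, the recursive search would need an iterative version.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

StoryGraph = Dict[str, List[str]]  # state -> states reachable by one decision


def decision_depth(graph: StoryGraph, start: str) -> Dict[str, int]:
    """Breadth-first search: minimum number of decisions to reach each state."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in graph.get(state, []):
            if nxt not in depth:
                depth[nxt] = depth[state] + 1
                queue.append(nxt)
    return depth


def unreachable(graph: StoryGraph, start: str) -> Set[str]:
    """States mentioned in the graph that no sequence of decisions reaches."""
    states = set(graph) | {s for successors in graph.values() for s in successors}
    return states - set(decision_depth(graph, start))


def decision_loops(graph: StoryGraph, start: str) -> List[Tuple[str, str]]:
    """Back edges found by depth-first search; each one closes a decision loop."""
    on_stack: Set[str] = set()
    done: Set[str] = set()
    loops: List[Tuple[str, str]] = []

    def visit(state: str) -> None:
        on_stack.add(state)
        for nxt in graph.get(state, []):
            if nxt in on_stack:
                loops.append((state, nxt))
            elif nxt not in done:
                visit(nxt)
        on_stack.discard(state)
        done.add(state)

    visit(start)
    return loops


story = {
    "intro": ["forest", "village"],
    "forest": ["cave", "intro"],
    "village": ["market"],
    "market": ["village", "castle"],
    "cave": ["ending_good"],
    "castle": ["ending_bad"],
    "secret_room": ["ending_secret"],
}
print(decision_depth(story, "intro")["ending_bad"])  # 4
print(sorted(unreachable(story, "intro")))           # ['ending_secret', 'secret_room']
print(decision_loops(story, "intro"))                # [('forest', 'intro'), ('market', 'village')]
```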