Analyzing Online Video Games As Distributed Rendering Systems

TL;DR Online video games like CSGO are distributed renderers: many clients and a single server communicate over a network, each rendering its own perspective of the game state. Analyzing these renderers' different perspectives can be critical for applications like player behavior analysis. In this blog post, I'll explain the requirements of a logging system that supports analyzing the different perspectives.

Understanding Player Behavior Requires Analyzing Different Renderings Of the Same Game

Analytics for online games like CSGO often boil down to visibility: when did a player see an in-game event, and how did they respond? Computing visibility requires analyzing the output of the GPU renderer. See my prior blog post on CSGO demo files for the problems with approximate, CPU-based visibility computations for analytics, like CSGO's spottedBy field.

So we need to analyze the GPU renderer's output, but which output should we look at? If there were one true rendering of the game, then we'd be done. We could use that one true rendering to answer our analytics questions and call it a day. But a CSGO match has 10 clients (one per player) and one server. Each of these 11 computers has a different version of the game state, so rendering based on those different game states produces different perspectives. The images below demonstrate the differences between renderings. The image on the left shows a client's rendering. The player's crosshair is on an enemy. The image on the right shows the server's rendering. The enemy is behind a wall and to the right of the crosshair. Both screenshots were taken at the exact moment the enemy was killed.

The left image shows the client's rendering. The player's crosshair is on the enemy. The right image shows the server's rendering. The enemy is behind a wall and to the right of the crosshair. Note: it may be helpful to open the images in new tabs and zoom in on the crosshair.

The discrepancy between renderings results from CSGO's distributed nature. The server is the single source of truth. It decides each player's location on each tick. Clients receive the server's state updates 5-20 ticks later due to network latency. This latency means that clients only learn the true state of the world well after it occurred. Different games have different approaches for handling the latency. For FPS games like CSGO, each client predicts its local player's position in the present and renders other players' positions in the past. This per-client combination of time shifts (local player in present, other players in past) means that the server and each client render a different version of the world. For more details on these shifts, please see Valve's documentation. I will write a more in-depth blog post on this topic in the future.
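To make the time shifts concrete, here is a minimal sketch of how one client's view could differ from the server's view on the same tick. It is an illustration under simplifying assumptions, not CSGO's actual netcode: the names Position, server_view, and client_view are hypothetical, latency is a fixed number of ticks, and the client's prediction of its local player is assumed to be perfect.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Position:
    x: float
    y: float


# Hypothetical layout: tick -> player name -> position decided by the server.
History = Dict[int, Dict[str, Position]]


def server_view(history: History, tick: int) -> Dict[str, Position]:
    """The server renders every player at the current tick."""
    return dict(history[tick])


def client_view(history: History, local_player: str, tick: int,
                latency_ticks: int) -> Dict[str, Position]:
    """One client's view at `tick`.

    Other players are shown where the server placed them `latency_ticks`
    ago (the newest update the client has received). The local player is
    shown in the present, assuming the client's prediction is perfect.
    """
    past_tick = max(tick - latency_ticks, 0)
    view = dict(history[past_tick])                  # other players: past
    view[local_player] = history[tick][local_player]  # local player: present
    return view
```

With a latency of 10 ticks, a client at tick 20 would see an enemy where the server placed it on tick 10, while the server itself renders that enemy at its tick 20 position. That gap is the kind of discrepancy shown in the screenshots.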

We Can Analyze The Different Renderings With Standard Distributed Tracing Techniques

As we saw above, online video games are a distributed system where each computer in the system renders a different perspective of the game state. If we want to analyze this distributed rendering system, we'll need a tracing system that can record all the different perspectives. Let's take a step back from the game-specific world and think about the general properties of an abstract distributed tracing system. Then, we can apply these properties to our specific video game use case. The key properties of a distributed tracing system are:

Temporal State - This property tracks the state of each computer in the distributed system at each time step. This is standard for distributed tracers like Dapper. In addition to this basic temporal state, we'll also want to track some key events. Like Git [1], we always have a single source of truth. So, we'll also want to track key events where clients' states adjust relative to the single source of truth. You can think of the following as the branch and merge events in Git.

Desync Events - Events when a client's state no longer matches the server's state.
Merge Events - Events when a client's state recovers from a desync event.

Multiple Temporal Granularities - It may be too expensive to record every event for every computer in a distributed system. A tracing system should be able to record basic events for many computers while also recording more detailed events on a subset of computers. This is standard for distributed tracers like DataDog's.
Spatial State - This property tracks the locations of the different computers. Some computers may sit near each other, such as servers on the same datacenter rack, while others may be moving with an unreliable connection, like cellphones. Storing the locations of the different computers will enable developers reviewing the traces to identify shared properties of buggy computers and suggest reasons for issues. This is standard for tracing events in distributed systems like Google's datacenters. Google extended traceroute to be aware of their datacenters' spatial layouts.
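The desync and merge events described above could be derived from the temporal state with a small state machine. The sketch below is one possible approach, not an existing tracer's API: detect_events is a hypothetical name, each state is reduced to a single number per tick (for example, one coordinate of a player), and the tolerance threshold for "no longer matches" is an assumption that a real system would need to tune.

```python
from typing import List, Sequence, Tuple


def detect_events(server: Sequence[float], client: Sequence[float],
                  tolerance: float) -> List[Tuple[int, str]]:
    """Return (tick, event) pairs, where event is "desync" or "merge".

    A desync is emitted on the first tick where the client's value differs
    from the server's by more than `tolerance`. A merge is emitted on the
    first tick where the two agree again.
    """
    events: List[Tuple[int, str]] = []
    desynced = False
    for tick, (s, c) in enumerate(zip(server, client)):
        diverged = abs(s - c) > tolerance
        if diverged and not desynced:
            events.append((tick, "desync"))
        elif not diverged and desynced:
            events.append((tick, "merge"))
        desynced = diverged
    return events
```

Recording only these transitions, rather than every divergent tick, keeps the trace small while still marking the spans where a client and the server disagreed.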

Now that we've established these properties, let's see how they apply to video games.

Temporal State For Player Behavior Analysis - A client's rendering of the game is their temporal state. We must track this state in order to analyze their behavior. For a CSGO game with 10 players, we need the trace to record 10 different temporal states in order to study each client's behavior.
Desync And Merge Events For Network Debugging - Not all network lag is problematic. Lag is only an issue if it impacts gameplay. For example, the death shown in the images above is an impossible event from the server's perspective: a player died behind a wall. The client desynced from the server, showing the player in a different location, and then merged by creating an impossible event, a death behind a wall. This is a situation where lag impacts gameplay. Tracking when lag causes desync events between the temporal states will enable us to focus on network issues that impact gameplay.
Multiple Temporal Granularities For Weapon Design - Weapons are designed for specific types of players, such as CSGO's shotguns with low accuracy and random damage. The shotguns are easy to use because the player's aim has little impact on the result of a shot. We would like to analyze whether these weapons are correctly designed by correlating low-skilled players' crosshair placement with weapon success. This analysis requires granular rendering data for low-skilled players and little data for high-skilled players. Multiple temporal granularities allow us to adjust the rendering data recorded per player.
Spatial State For Organizing Infrastructure - Servers need to run at a constant tick rate to provide consistent gameplay. Running too many servers on the same physical machine can increase tick rate variance and degrade gameplay. But running too few servers on the same machine will increase costs. Tracking spatial state enables developers to precisely balance server layout by diagnosing when desync events result from servers that are too tightly packed.
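As a rough illustration of how multiple temporal granularities and spatial state might appear in a trace, the sketch below keeps every tick for some computers and only occasional ticks for the rest. Everything here is hypothetical: the TraceRecord fields, the host label used for spatial state, the sample_trace function, and the default interval of 64 ticks are assumptions for the example, not part of any existing tool.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class TraceRecord:
    tick: int
    computer_id: str  # a client or the server
    host: str         # spatial state: hypothetical physical machine label
    state: dict       # whatever per-tick state the tracer captures


def sample_trace(records: Iterable[TraceRecord],
                 interval_by_computer: Dict[str, int],
                 default_interval: int = 64) -> List[TraceRecord]:
    """Keep one record every `interval` ticks for each computer.

    Computers listed in `interval_by_computer` get their own interval
    (1 keeps every tick). All others fall back to `default_interval`.
    """
    kept: List[TraceRecord] = []
    for record in records:
        interval = interval_by_computer.get(record.computer_id,
                                            default_interval)
        if record.tick % interval == 0:
            kept.append(record)
    return kept
```

For the weapon design example, a low-skilled player's client could be given an interval of 1 while high-skilled players keep the coarse default. The host field lets a later analysis group desync events by physical machine.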

If you have questions, comments, or ways to improve this proposal, please email me at durst@stanford.edu.

Footnotes

[1] I know Git doesn't require a single source of truth. However, the standard way to use Git is with GitHub as a central server and single source of truth.

David Durst's Blog