Why I Made CSKnow
I want to explore spatio-temporal databases. There are already application-specialized spatial databases such as PostGIS for queries like "find all the mountains in the US" and time series databases such as KDB and TimescaleDB for queries like "find me all the times when CPU utilization was greater than 80%." These spatial or temporal databases enable high-throughput, low-latency OLAP by specializing (1) data storage, (2) indexes, and (3) query languages to take advantage of the spatial and temporal locality present in their target data sets. I want to design a spatio-temporal database specialized for queries like "find me all the times when a player's crosshairs tracked another player through the wall for at least 3 seconds" (indicating they're cheating).
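To make the shape of that last query concrete, here is a minimal Python sketch of its temporal half: finding every stretch where a per-tick condition holds for at least 3 seconds. It assumes 128 ticks per second and a hypothetical list `flags` with one boolean per tick (true when the crosshair is on an enemy who is hidden behind a wall). Computing that flag is the spatial half of the query and is not shown. The function name `sustained_runs` is made up for illustration and is not part of CSKnow.

```python
from itertools import groupby

TICKS_PER_SECOND = 128  # assumed tick rate


def sustained_runs(flags, min_seconds, ticks_per_second=TICKS_PER_SECOND):
    """Return inclusive (start_tick, end_tick) pairs for every run of
    consecutive True values that lasts at least min_seconds."""
    min_ticks = int(min_seconds * ticks_per_second)
    runs = []
    tick = 0
    for value, group in groupby(flags):
        length = sum(1 for _ in group)
        if value and length >= min_ticks:
            runs.append((tick, tick + length - 1))
        tick += length
    return runs


# Hypothetical per-tick flags: one 384-tick (3 second) run and one short run.
flags = [False] * 10 + [True] * 384 + [False] * 5 + [True] * 100
print(sustained_runs(flags, min_seconds=3))  # [(10, 393)]
```

A linear scan like this is only a baseline. The point of a specialized database would be to answer the same question without touching every tick of every game.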
The CSKnow data set will serve as the motivating example for this spatio-temporal database. CSKnow contains the locations of players every tick (every 1/128 of a second) along with key events like weapons firing, frags, and players seeing other players. CSGO data is a good example because:
- Iteration Speed - The data set enables fast iteration on spatio-temporal database design since:
- Data Is Easy to Access - The data can be simulated in large quantities using CSGO's bots and open-source libraries. For some similar spatio-temporal data sets, like the NBA's player position data sets from SportVU and SecondSpectrum, only small subsets with limited data dictionaries are publicly available.
- Good Queries Can Be Generated Easily - Since I've played a lot of CSGO, I'm knowledgeable in both the system design and the application. As both the developer and a CSGO player, I (and CS researchers like me) can quickly perform the following iteration loop:
- Come up with a new query that both stresses the system and answers an interesting question about CSGO.
- Fix the problems in the system that the example query identifies.
- Ground Truth Is Readily Available - We can determine the recall and accuracy of queries by watching the demo files, which provide an exact replay of all games in the data set using the CSGO engine. Similar data sets, like SportVU, aren't linked to complete replays of every game from every camera angle.
- CSGO Queries Generalize to Other Domains - The queries on CSGO will generalize to sports and other esports as all have similar setups: two teams of a fixed number of players in a well-defined court, field, or map with labels identifying key interactions between players and regions of the map. Additionally, the trajectory queries on the CSGO data set are of interest to the moving objects database community that is already building spatio-temporal databases.
- Large Scale of Real Data - CSGO data logged from human players can be orders of magnitude larger than other non-simulated data sets. A whole NBA season is roughly 50 GB, a couple of years of ship tracking data is 1.8 TB, and a single day of CSGO is roughly 100 TB (assuming data is collected at 128 ticks per second and there are 500,000 concurrent players).
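The 100 TB estimate can be sanity-checked with a few lines of arithmetic. The sketch below reads the assumption as one record per player per tick at 128 ticks per second (matching the 1/128-second tick described earlier) with 500,000 players online around the clock. The bytes-per-record figure is derived from those numbers and is not stated in the text.

```python
TICKS_PER_SECOND = 128
CONCURRENT_PLAYERS = 500_000
SECONDS_PER_DAY = 24 * 60 * 60

# Sizes quoted in the text, in decimal units (1 TB = 1e12 bytes).
CSGO_DAY_BYTES = 100e12
NBA_SEASON_BYTES = 50e9
SHIP_TRACKING_BYTES = 1.8e12

# One record per player per tick, all day.
records_per_day = TICKS_PER_SECOND * CONCURRENT_PLAYERS * SECONDS_PER_DAY
bytes_per_record = CSGO_DAY_BYTES / records_per_day

print(f"records per day:   {records_per_day:,}")
print(f"implied bytes/rec: {bytes_per_record:.1f}")
print(f"vs. NBA season:    {CSGO_DAY_BYTES / NBA_SEASON_BYTES:,.0f}x")
print(f"vs. ship tracking: {CSGO_DAY_BYTES / SHIP_TRACKING_BYTES:,.1f}x")
```

Under these assumptions a day holds about 5.5 trillion player-tick records, so 100 TB corresponds to roughly 18 bytes per record. That is a plausible size for a compactly encoded position, which suggests the estimate is in the right range.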
How I Made CSKnow
There are two components to the data generation process:
- CSGO Demo Generation - CSGO servers can emit demo files logging the actions on every tick of the game. I created a demo generator Docker image by configuring the existing CSGO server Docker image to run 5v5 games of de_dust2 with only bots. The key technical trick here was figuring out the right settings in server.cfg, gamemode_competitive.cfg, and gamemode_competitive_server.cfg, as CSGO servers have multiple pre-match stages, including hibernation and warmup, that need to be disabled for bot-only play. This Docker image automatically uploads the demo files to S3.
- CSGO Demo Parsing - CSGO demo files are just collections of Protobuf messages. I created a CSGO demo parser Docker image to download the demos from S3, parse them using an existing Go library, and upload CSVs to S3 with player location data every tick as well as data from other events like frags.
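As a rough illustration of what consuming a per-tick position CSV might look like, here is a minimal sketch using only the Python standard library. The column names (`tick`, `player`, `x`, `y`, `z`) and the sample rows are hypothetical; the real CSKnow files may be laid out differently, so treat this as a starting shape rather than a description of the actual schema.

```python
import csv
import io
from collections import defaultdict

TICKS_PER_SECOND = 128  # one tick is 1/128 of a second

# Hypothetical sample data; real CSKnow CSVs may use different columns.
SAMPLE_CSV = """tick,player,x,y,z
0,bot_a,100.0,200.0,0.0
0,bot_b,-50.0,75.0,0.0
1,bot_a,101.5,200.0,0.0
1,bot_b,-50.0,76.0,0.0
"""


def load_trajectories(csv_file):
    """Group per-tick rows into {player: [(seconds, x, y, z), ...]}."""
    trajectories = defaultdict(list)
    for row in csv.DictReader(csv_file):
        seconds = int(row["tick"]) / TICKS_PER_SECOND
        position = (float(row["x"]), float(row["y"]), float(row["z"]))
        trajectories[row["player"]].append((seconds, *position))
    return dict(trajectories)


trajectories = load_trajectories(io.StringIO(SAMPLE_CSV))
print(trajectories["bot_a"])
```

Grouping rows by player turns the flat per-tick table into trajectories, which is the form most moving-object queries start from.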
How You Can Use CSKnow
The data set is available in the S3 bucket "csknow". The folder "csknow/demos/processed" contains the demo files. The folder "csknow/demos/csv2" contains the CSV files. There are multiple CSV files for each demo, each tracking different events. An example parser for the data can be found here. If you have comments or questions about the data set, please email me at durst@stanford.edu.