Wednesday, April 21, 2021

Guide - Replay Analysis

In this post, I'll go over the scripts I use to analyze replays, for anyone who wants to try some analysis of their own. Naturally, you'll need to know some programming. We'll also look at a silly analysis I did of the average composition of discards based on the hand's yaku.

First, you'll need the actual replays. You can use the phoenix-logs repo to do this. However, I had trouble getting it to work, so ApplySci sent me the database files themselves. I've put five years of replays in this Google Drive folder you can download. It should be a good enough sample size for most things.

The database has a table named "logs" which has seven columns. "log_id" is the id of the log (the one you'll see in replay links). "is_tonpusen" marks if the game is East-only. "is_hirosima" marks if the game is sanma. "log_content" is the XML replay data. The others aren't important.
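If you want to check that layout yourself, you can list a table's columns. This is a minimal sketch that assumes the database files are SQLite, and it builds a small in-memory stand-in so it runs without the real data. Only the four column names come from the description above; the column types and the list_columns helper are made up for illustration.

```python
import sqlite3

def list_columns(conn, table="logs"):
    """Return the column names of a table, in declaration order."""
    # PRAGMA table_info yields one row per column; index 1 is the name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

# Stand-in database with only the four columns named in the post.
# The column types here are guesses, not taken from the real files.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs ("
    "log_id TEXT PRIMARY KEY, is_tonpusen INTEGER, "
    "is_hirosima INTEGER, log_content BLOB)"
)

columns = list_columns(conn)
print(columns)
```

To run it against a real file, pass that file's path to sqlite3.connect instead of ":memory:" and skip the CREATE TABLE step.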

That repo, and that folder, have a mergedbs.py script. You'll need Python to run it, which you can download here. I use 3.7. Make sure to check the option to add Python to your PATH while installing it.

This script will extract the games from all the various year databases you have and put them into a single database sorted by game type. The script I included in the folder will separate it into es4p (East-South 4 Player), e4p (East-Only 4 Player), and es3p (East-South 3 Player) databases. The other game modes aren't played enough to matter. If you have a different set of year databases, you'll have to change the line that reads "for year in range(2016, 2021):" to start and end at the years you have. The second number is exclusive, so it should be one more than the last year you have (if 2020 is the last year you have, it should say 2021). The script will take a while to run.
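The reason for the extra one is that Python's range stops just before its second argument. A quick check, using hypothetical first_year and last_year variables rather than anything from the actual script:

```python
# range(start, stop) includes start but excludes stop.
first_year = 2016
last_year = 2020

years = list(range(first_year, last_year + 1))
print(years)  # [2016, 2017, 2018, 2019, 2020]
```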

The databases output by this script will have a table named "logs" with the columns "log_id", "year", and "log_content". Year is the year the replay was played in, while the other two are the same as the base databases.

Now, to actually analyze the replays, you'll have to read from the database. The items in the database are also compressed to reduce their size, so you'll need to decompress them first. After that, you'll have to parse the XML in some way. For the replay XML format, you can use this as a reference.
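Those three steps (read, decompress, parse) can be sketched roughly like this. It rests on a few assumptions: that the databases are SQLite, that log_content is compressed with bz2 (swap in whichever module matches the real data), and that the standard library's ElementTree is an acceptable stand-in for a dedicated XML library so the example runs without extra installs. The sample XML, the log id, and the iter_logs helper are made up for illustration.

```python
import bz2
import sqlite3
import xml.etree.ElementTree as ET

# Made-up stand-in for a replay; real replay XML is far longer.
SAMPLE_XML = b"<mjloggm><GO type='169'/><INIT/><AGARI/></mjloggm>"

# In-memory stand-in for one of the merged databases.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (log_id TEXT, year INTEGER, log_content BLOB)")
conn.execute(
    "INSERT INTO logs VALUES (?, ?, ?)",
    ("example-id", 2020, bz2.compress(SAMPLE_XML)),
)

def iter_logs(conn):
    """Yield (log_id, parsed XML root) for every row in the logs table."""
    for log_id, blob in conn.execute("SELECT log_id, log_content FROM logs"):
        xml_bytes = bz2.decompress(blob)  # assumption: bz2 compression
        yield log_id, ET.fromstring(xml_bytes)

# Collect the tag of each top-level element in each replay.
parsed = [(log_id, [child.tag for child in root]) for log_id, root in iter_logs(conn)]
print(parsed)
```

Using a generator keeps only one decompressed replay in memory at a time, which matters once you're going through years of them.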

The script I use for this part can be found here. It uses lxml for parsing the XML, so you will need to run "pip install lxml" in the console to get that. It also uses tqdm to provide a progress bar for the analysis, so run "pip install tqdm" to get that as well. It will output the results every 100,000 replays.

You can change the SQL statements to pick replays from a certain year, or a certain number of replays. I'll often add "LIMIT 10000" when first running an analysis to make sure the data looks correct and there are no errors.
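As a rough illustration of that kind of filtering, here is a sketch that builds the query from optional year and limit arguments. The build_query helper is hypothetical and the in-memory table is a stand-in with made-up rows; it assumes SQLite and the "logs" columns described above.

```python
import sqlite3

def build_query(year=None, limit=None):
    """Build a SELECT over the logs table with optional year and row limits."""
    query = "SELECT log_id, log_content FROM logs"
    params = []
    if year is not None:
        query += " WHERE year = ?"
        params.append(year)
    if limit is not None:
        query += " LIMIT ?"
        params.append(limit)
    return query, params

# Stand-in table: ten made-up rows alternating between 2019 and 2020.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (log_id TEXT, year INTEGER, log_content BLOB)")
rows = [(f"log-{i}", 2019 + i % 2, b"") for i in range(10)]
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)

query, params = build_query(year=2020, limit=3)
sample = conn.execute(query, params).fetchall()
print(query)
print(len(sample))
```

Passing the values as parameters instead of pasting them into the string avoids quoting mistakes when you change them between runs.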

My analysis scripts all extend from this simple "LogAnalyzer" class. It defines the two functions that the batch_analysis script will call. For a more fleshed-out class, look at the "LogHandAnalyzer" class. This takes care of all the annoying work of gathering hands, discards, and calls. For an example analysis script, the "PondTraits" analysis uses the LogHandAnalyzer. There are also like a hundred other scripts in the "analysis" folder of that repo you can look at, though some are broken in some way or really old.
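To give a feel for the shape of that setup, here is a minimal sketch of a two-function base class and a driver loop. It is not the actual code from the repo: the ParseLog name and its arguments, the TagCounter subclass, and the run_batch function are all assumptions made for illustration, and only the LogAnalyzer and PrintResults names come from this post.

```python
from abc import ABC, abstractmethod
from collections import Counter
import xml.etree.ElementTree as ET

class LogAnalyzer(ABC):
    """Base class: one method per replay, one method to report results."""

    @abstractmethod
    def ParseLog(self, log, log_id):  # hypothetical name and signature
        ...

    @abstractmethod
    def PrintResults(self):
        ...

class TagCounter(LogAnalyzer):
    """Hypothetical analysis: count top-level XML tags across replays."""

    def __init__(self):
        self.tag_counts = Counter()

    def ParseLog(self, log, log_id):
        for element in log:
            self.tag_counts[element.tag] += 1

    def PrintResults(self):
        for tag, count in self.tag_counts.most_common():
            print(f"{tag},{count}")

def run_batch(analyzer, logs, report_every=100_000):
    """Feed every (log_id, parsed XML) pair to the analyzer, reporting periodically."""
    for index, (log_id, log) in enumerate(logs, start=1):
        analyzer.ParseLog(log, log_id)
        if index % report_every == 0:
            analyzer.PrintResults()
    analyzer.PrintResults()

# Two made-up replays, far shorter than real ones.
logs = [
    ("a", ET.fromstring("<mjloggm><INIT/><AGARI/></mjloggm>")),
    ("b", ET.fromstring("<mjloggm><INIT/><RYUUKYOKU/></mjloggm>")),
]
analyzer = TagCounter()
run_batch(analyzer, logs)
```

Keeping the per-replay work and the reporting in separate methods means the driver doesn't need to know anything about what a given analysis is counting.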

My typical process for writing an analysis starts with initializing two dictionaries of counters. "self.counts = defaultdict(Counter)" is frequently seen in these scripts. The name of the other depends on the analysis. When you write to a key the defaultdict doesn't have yet (such as "self.counts["Turn 3"]["Riichi"] += 1"), it will create a Counter for that key. Counters are good at storing numbers, and any key that hasn't been set yet reads as 0. This way you can add to any combination of keys without initializing them first.
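Here is that behavior in isolation, using a plain counts variable instead of self.counts and made-up keys:

```python
from collections import Counter, defaultdict

counts = defaultdict(Counter)

# Neither "Turn 3" nor "Riichi" exists yet; both are created on first use.
counts["Turn 3"]["Riichi"] += 1
counts["Turn 3"]["Riichi"] += 1
counts["Turn 5"]["Call"] += 1

print(counts["Turn 3"]["Riichi"])  # 2
print(counts["Turn 3"]["Call"])    # 0, a missing Counter key reads as 0
print("Turn 9" in counts)          # False
print(counts["Turn 9"]["Riichi"])  # 0
print("Turn 9" in counts)          # True, the lookup created an empty Counter
```

One thing to keep in mind is the last line: just reading a missing key from the defaultdict adds an empty Counter under it, so a stray lookup can leave an empty row in your results.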

In the PrintResults function, I'll write the data to files in a CSV format. You can then import these into Google Sheets to create tables or charts. For example, here's a chart for the PondTraits analysis script I linked earlier. It shows the distribution of tiles in players' discards before they riichi, based on their hand's yaku.
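A minimal sketch of turning a dictionary of counters into CSV might look like this. The write_counts_csv helper, the row and column names, and the numbers are all made up for illustration, and it writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io
from collections import Counter, defaultdict

def write_counts_csv(counts, out_file):
    """Write one row per outer key and one column per inner key."""
    # Collect every inner key so all rows share the same columns.
    columns = sorted({key for counter in counts.values() for key in counter})
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["Row"] + columns)
    for row_name, counter in counts.items():
        # Counters return 0 for keys they never saw, so gaps fill themselves.
        writer.writerow([row_name] + [counter[column] for column in columns])

# Made-up numbers, not real results.
counts = defaultdict(Counter)
counts["Riichi"]["Terminals"] += 3
counts["Riichi"]["Honors"] += 5
counts["Tanyao"]["Terminals"] += 7

buffer = io.StringIO()
write_counts_csv(counts, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, replace the buffer with something like open("results.csv", "w", newline="") in a with block; the filename here is only an example.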

Amazing.
