---
title: Chess Data Analysis
parent: Examples
layout: default
nav_order: 1
permalink: /docs/Examples/ChessDataAnalysis
---

# Chess Data Analysis
{: .no_toc }

* TOC
{:toc}

## Problem Statement

Let's look at a very simple example of collecting some data and doing something with it. We will:

* Build a pipeline to download a player's game data for the past few months from the [chess.com API](https://www.chess.com/news/view/published-data-api)
* Use the `python-chess` package to parse the PGN game data
* Use `pandas` to do some basic opening win-rate analysis

## Setup

This is a standalone script. Python package requirements are specified in `requirements.txt`.

**See the [source code](https://github.com/pyper-dev/pyper/tree/main/examples/ChessDataAnalysis) for this example** _(always review code before running it on your own machine)_
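For reference, the snippets below rely on imports along these lines (collected here as a sketch; the full script in the repository is the authoritative source):

```python
# Imports assumed by the code snippets in this example
import datetime
import io
import typing

import chess.pgn
import pandas as pd
import requests
from dateutil.relativedelta import relativedelta

from pyper import task
```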

## Implementation

To collect the data we need, we will use the chess.com API's monthly multigame PGN download endpoint, which has the URL format:

```
https://api.chess.com/pub/player/player-name/games/YYYY/MM/pgn
```

Firstly, we define a helper function to generate these URLs for the most recent months:

```python
def generate_urls_by_month(player: str, num_months: int):
    """Define a series of pgn game resource urls for a player, for num_months recent months."""
    today = datetime.date.today()
    for i in range(num_months):
        d = today - relativedelta(months=i)
        yield f"https://api.chess.com/pub/player/{player}/games/{d.year}/{d.month:02}/pgn"
```
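
For example, calling the generator with `num_months=3` yields the three most recent monthly endpoints. The exact URLs depend on today's date; the ones shown in the comments assume a hypothetical run in December 2024:

```python
for url in generate_urls_by_month("hikaru", 3):
    print(url)
# https://api.chess.com/pub/player/hikaru/games/2024/12/pgn
# https://api.chess.com/pub/player/hikaru/games/2024/11/pgn
# https://api.chess.com/pub/player/hikaru/games/2024/10/pgn
```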

We also need a function to fetch the raw data from each URL.

```python
def fetch_text_data(url: str, session: requests.Session):
    """Fetch text data from a url."""
    r = session.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
    return r.text
```
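
Used on its own, this might look like the following sketch, combining it with the URL generator above (not part of the pipeline, just an illustration):

```python
with requests.Session() as session:
    url = next(generate_urls_by_month("hikaru", 1))  # most recent month's endpoint
    pgn_text = fetch_text_data(url, session=session)
    print(pgn_text[:200])  # first 200 characters of the month's PGN dump
```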

Each PGN dataset consists of data for multiple games. We'll create a function called `read_game_data` to extract individual game details as dictionaries.

```python
def _clean_opening_name(eco_url: str):
    """Get a rough opening name from the chess.com ECO url."""
    name = eco_url.removeprefix("https://www.chess.com/openings/")
    return " ".join(name.split("-")[:2])


def read_game_data(pgn_text: str, player: str):
    """Read PGN data and generate game details (each PGN contains details for multiple games)."""
    pgn = io.StringIO(pgn_text)
    while (headers := chess.pgn.read_headers(pgn)) is not None:
        color = 'W' if headers["White"].lower() == player else 'B'

        if headers["Result"] == "1/2-1/2":
            score = 0.5
        elif (color == 'W' and headers["Result"] == "1-0") or (color == 'B' and headers["Result"] == "0-1"):
            score = 1
        else:
            score = 0

        yield {
            "color": color,
            "score": score,
            "opening": _clean_opening_name(headers["ECOUrl"])
        }
```
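
To see the shape of the dictionaries this yields, here is a quick sanity check on a made-up, single-game PGN fragment (the header values are hypothetical and trimmed to the fields the function actually reads):

```python
sample_pgn = """[Event "Live Chess"]
[White "hikaru"]
[Black "opponent"]
[Result "1-0"]
[ECOUrl "https://www.chess.com/openings/Vienna-Game-Max-Lange-Defense"]

1. e4 e5 2. Nc3 Nc6 1-0
"""

print(list(read_game_data(sample_pgn, player="hikaru")))
# [{'color': 'W', 'score': 1, 'opening': 'Vienna Game'}]
```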

Finally, we need some logic to handle the data analysis (which we're keeping very barebones).
Let's dump the data into a pandas DataFrame and print a table showing:

* the average score, grouped by chess opening
* for games where the player plays the white pieces
* ordered by total games

```python
def build_df(data: typing.Iterable[dict]) -> pd.DataFrame:
    df = pd.DataFrame(data)
    df = df[df["color"] == 'W']
    df = df.groupby("opening").agg(total_games=("score", "count"), average_score=("score", "mean"))
    df = df.sort_values(by="total_games", ascending=False)
    return df
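
For instance, feeding it a handful of hand-written records (made-up data, purely illustrative) produces a table shaped roughly like this:

```python
records = [
    {"color": "W", "score": 1, "opening": "Vienna Game"},
    {"color": "W", "score": 0.5, "opening": "Vienna Game"},
    {"color": "B", "score": 1, "opening": "Caro Kann"},  # dropped: played as black
]
print(build_df(records))
#              total_games  average_score
# opening
# Vienna Game            2           0.75
```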

All that's left is to piece everything together.

Note that the Pyper framework hasn't placed any particular restrictions on the way our 'business logic' is implemented. We simply use Pyper to compose these logical functions into a concurrent pipeline, with minimal code coupling.

In the pipeline, we will:

1. Set `branch=True` for `generate_urls_by_month`, to allow this task to generate multiple outputs
2. Create 3 workers for `fetch_text_data`, so that we can wait on requests concurrently
3. Set `branch=True` for `read_game_data` also, as this generates multiple dictionaries
4. Let the `build_df` function consume all output generated by this pipeline

```python
def main():
    player = "hikaru"
    num_months = 6  # Keep this number low, or add sleeps for etiquette

    with requests.Session() as session:
        run = (
            task(generate_urls_by_month, branch=True)
            | task(
                fetch_text_data,
                workers=3,
                bind=task.bind(session=session))
            | task(
                read_game_data,
                branch=True,
                bind=task.bind(player=player))
            > build_df
        )
        df = run(player, num_months)
        print(df.head(10))
```
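
To run this as a standalone script, all that remains is the usual entry point (assuming the script layout in the example repository):

```python
if __name__ == "__main__":
    main()
```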

With no more lines of code than it would have taken to define a series of sequential for-loops, we've defined a concurrently executable data flow!

We can now run everything to see the result of our analysis:

```
opening                total_games  average_score
Nimzowitsch Larsen             244       0.879098
Closed Sicilian                205       0.924390
Caro Kann                      157       0.882166
Bishops Opening                156       0.900641
French Defense                 140       0.846429
Sicilian Defense               127       0.877953
Reti Opening                    97       0.819588
Vienna Game                     71       0.929577
English Opening                 61       0.868852
Scandinavian Defense            51       0.862745
```