---
title: Chess Data Analysis
parent: Examples
layout: default
nav_order: 1
permalink: /docs/Examples/ChessDataAnalysis
---

# Chess Data Analysis
{: .no_toc }

* TOC
{:toc}

## Problem Statement

Let's look at a very simple example of collecting some data and doing something with it. We will:

* Build a pipeline to download a player's game data for the past few months from the [chess.com API](https://www.chess.com/news/view/published-data-api)
* Use the `python-chess` package to parse the PGN game data
* Use `pandas` to do some basic opening win-rate analysis

## Setup

This is a standalone script. Python package requirements are specified in `requirements.txt`.

**See the [source code](https://github.com/pyper-dev/pyper/tree/main/examples/ChessDataAnalysis) for this example** _(always review code before running it on your own machine)_
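For reference, the snippets below rely on imports along these lines (collected here as a sketch; the full script in the repository is the authoritative source):

```python
# Imports assumed by the code snippets in this example
import datetime
import io
import typing

import chess.pgn
import pandas as pd
import requests
from dateutil.relativedelta import relativedelta

from pyper import task
```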

## Implementation

To collect the data we need, we will use the chess.com API's monthly multigame PGN download endpoint, which has the URL format:

```
https://api.chess.com/pub/player/player-name/games/YYYY/MM/pgn
```

Firstly, we define a helper function to generate these URLs for the most recent months:

```python
def generate_urls_by_month(player: str, num_months: int):
    """Define a series of pgn game resource urls for a player, for num_months recent months."""
    today = datetime.date.today()
    for i in range(num_months):
        d = today - relativedelta(months=i)
        yield f"https://api.chess.com/pub/player/{player}/games/{d.year}/{d.month:02}/pgn"
```
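
For example, calling the generator with `num_months=3` yields the three most recent monthly endpoints. The exact URLs depend on today's date; the ones shown in the comments assume a hypothetical run in December 2024:

```python
for url in generate_urls_by_month("hikaru", 3):
    print(url)
# https://api.chess.com/pub/player/hikaru/games/2024/12/pgn
# https://api.chess.com/pub/player/hikaru/games/2024/11/pgn
# https://api.chess.com/pub/player/hikaru/games/2024/10/pgn
```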

We also need a function to fetch the raw data from each URL.

```python
def fetch_text_data(url: str, session: requests.Session):
    """Fetch text data from a url."""
    r = session.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
    return r.text
```
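
Used on its own, this might look like the following sketch, combining it with the URL generator above (not part of the pipeline, just an illustration):

```python
with requests.Session() as session:
    url = next(generate_urls_by_month("hikaru", 1))  # most recent month's endpoint
    pgn_text = fetch_text_data(url, session=session)
    print(pgn_text[:200])  # first 200 characters of the month's PGN dump
```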

Each PGN dataset consists of data for multiple games. We'll create a function called `read_game_data` to extract individual game details as dictionaries.

```python
def _clean_opening_name(eco_url: str):
    """Get a rough opening name from the chess.com ECO url."""
    name = eco_url.removeprefix("https://www.chess.com/openings/")
    return " ".join(name.split("-")[:2])


def read_game_data(pgn_text: str, player: str):
    """Read PGN data and generate game details (each PGN contains details for multiple games)."""
    pgn = io.StringIO(pgn_text)
    while (headers := chess.pgn.read_headers(pgn)) is not None:
        color = 'W' if headers["White"].lower() == player else 'B'

        if headers["Result"] == "1/2-1/2":
            score = 0.5
        elif (color == 'W' and headers["Result"] == "1-0") or (color == 'B' and headers["Result"] == "0-1"):
            score = 1
        else:
            score = 0

        yield {
            "color": color,
            "score": score,
            "opening": _clean_opening_name(headers["ECOUrl"])
        }
```
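
To see the shape of the dictionaries this yields, here is a quick sanity check on a made-up, single-game PGN fragment (the header values are hypothetical and trimmed to the fields the function actually reads):

```python
sample_pgn = """[Event "Live Chess"]
[White "hikaru"]
[Black "opponent"]
[Result "1-0"]
[ECOUrl "https://www.chess.com/openings/Vienna-Game-Max-Lange-Defense"]

1. e4 e5 2. Nc3 Nc6 1-0
"""

print(list(read_game_data(sample_pgn, player="hikaru")))
# [{'color': 'W', 'score': 1, 'opening': 'Vienna Game'}]
```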

Finally, we need some logic to handle the data analysis (which we're keeping very barebones).
Let's dump the data into a pandas DataFrame and print a table showing:

* the average score, grouped by chess opening
* for games where the player plays the white pieces
* ordered by total games

```python
def build_df(data: typing.Iterable[dict]) -> pd.DataFrame:
    df = pd.DataFrame(data)
    df = df[df["color"] == 'W']
    df = df.groupby("opening").agg(total_games=("score", "count"), average_score=("score", "mean"))
    df = df.sort_values(by="total_games", ascending=False)
    return df
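
For instance, feeding it a handful of hand-written records (made-up data, purely illustrative) produces a table shaped roughly like this:

```python
records = [
    {"color": "W", "score": 1, "opening": "Vienna Game"},
    {"color": "W", "score": 0.5, "opening": "Vienna Game"},
    {"color": "B", "score": 1, "opening": "Caro Kann"},  # dropped: played as black
]
print(build_df(records))
#              total_games  average_score
# opening
# Vienna Game            2           0.75
```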

All that's left is to piece everything together.

Note that the Pyper framework hasn't placed any particular restrictions on the way our 'business logic' is implemented. We simply use Pyper to compose these logical functions into a concurrent pipeline, with minimal code coupling.

In the pipeline, we will:

1. Set `branch=True` for `generate_urls_by_month`, to allow this task to generate multiple outputs
2. Create 3 workers for `fetch_text_data`, so that we can wait on requests concurrently
3. Set `branch=True` for `read_game_data` also, as this generates multiple dictionaries
4. Let the `build_df` function consume all output generated by this pipeline

```python
def main():
    player = "hikaru"
    num_months = 6  # Keep this number low, or add sleeps for etiquette

    with requests.Session() as session:
        run = (
            task(generate_urls_by_month, branch=True)
            | task(
                fetch_text_data,
                workers=3,
                bind=task.bind(session=session))
            | task(
                read_game_data,
                branch=True,
                bind=task.bind(player=player))
            > build_df
        )
        df = run(player, num_months)
        print(df.head(10))
```
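
To run this as a standalone script, all that remains is the usual entry point (assuming the script layout in the example repository):

```python
if __name__ == "__main__":
    main()
```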

With no more lines of code than it would have taken to define a series of sequential for-loops, we've defined a concurrently executable data flow!

We can now run everything to see the result of our analysis:

```
opening                total_games  average_score
Nimzowitsch Larsen             244       0.879098
Closed Sicilian                205       0.924390
Caro Kann                      157       0.882166
Bishops Opening                156       0.900641
French Defense                 140       0.846429
Sicilian Defense               127       0.877953
Reti Opening                    97       0.819588
Vienna Game                     71       0.929577
English Opening                 61       0.868852
Scandinavian Defense            51       0.862745
```