vault backup: 2025-01-11 17:00:37
abhiaagarwal committed Jan 11, 2025
1 parent cc990ea commit 3ebe74a
Showing 8 changed files with 308 additions and 2 deletions.
Empty file added content/().md
Empty file.
12 changes: 12 additions & 0 deletions content/education/omscs/iyse-6420.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
---
tags:
- thoughts
aliases:
- bayes
---

I took ISYE-6420, Bayesian Statistics, in the fall of 2024.

I liked it! I don't really have much to say about it. Having a degree in economics, I've always had an appreciation for Bayesian methods, and it was nice to see them used in an actual programming language and _not_ Stata.

Breezy, fun, and insanely informative. Being able to model a problem by simply describing what you have, and letting a sampler explore the probability space, is awesome, and I can see myself applying it in my career.
@@ -24,7 +24,7 @@ public:
std::cout << "I have been destroyed :(" << std::endl;
}
};

int main() {
auto verycoolobject = MyVeryCoolObject();
{
@@ -0,0 +1,91 @@
---
tags:
- observations
title: Running a bunch of python futures safely
---
Hey, did you know that unless you know exactly what you're doing, you should _never_ use `asyncio.gather` after manually creating async tasks via `asyncio.create_task()`?

This code,

```python
import asyncio


async def my_function(val: int) -> int:
    return val * 2

async def main() -> None:
    task_1 = asyncio.create_task(my_function(1))
    task_2 = asyncio.create_task(my_function(2))
    task_3 = asyncio.create_task(my_function(3))

    results = await asyncio.gather(task_1, task_2, task_3)
    print(results) # expect [2, 4, 6]


if __name__ == "__main__":
    asyncio.run(main())
```

is actually _deeply_ unsafe. Let me cite the (in)famous article that alerted me to this problem, [the heisenbug lurking in your async code](https://textual.textualize.io/blog/2023/02/11/the-heisenbug-lurking-in-your-async-code/). Here's also an excellent [stack overflow](https://stackoverflow.com/a/76823668/21551208) answer that goes a bit more in depth. In short, the event loop only keeps weak references to the tasks it runs. Python's garbage collector doesn't know that a `task_*` object still has work to do, so if nothing else holds a strong reference to it, it may be garbage collected at any time, even before it has finished running.

Why does python do this? ¯\\\_(ツ)\_/¯. I would like to have a cordial conversation with whoever designed it this way.

In fact, the [asyncio docs for `create_task`](https://docs.python.org/3/library/asyncio-task.html#asyncio.create_task) have a warning for this:

![[running-a-bunch-of-futures-asyncio-docs.png]]

Alright, fair, but I, like hundreds of millions of developers, will skip text that's in a grey box. It needs to have red scary text, maybe outlined in red, and it should also have a popup in the browser. Guido van Rossum should mail each IP address that has ever downloaded python a hand-written letter warning them of this. That's how serious this problem is.
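For reference, the pattern that warning describes is to keep your own strong reference to every task until it is done. A minimal sketch of that idea, where `background_tasks` and the `sleep(0)` inside `my_function` are illustrative choices rather than anything from the original code:

```python
import asyncio

# Strong references live here until each task completes.
background_tasks: set[asyncio.Task[int]] = set()


async def my_function(val: int) -> int:
    await asyncio.sleep(0)  # yield to the event loop once
    return val * 2


async def main() -> list[int]:
    for i in range(1, 4):
        task = asyncio.create_task(my_function(i))
        background_tasks.add(task)
        # Drop the reference only once the task has finished.
        task.add_done_callback(background_tasks.discard)

    # Snapshot the set, since the done callbacks mutate it while we wait.
    pending = list(background_tasks)
    results = await asyncio.gather(*pending)
    # Sets are unordered, so sort to get a stable result.
    return sorted(results)


results = asyncio.run(main())
print(results)
```

It works, but it is bookkeeping you have to remember to write every single time.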

The alternative solution, as the docs mention, is to use `asyncio.TaskGroup`. I actually _love_ the `asyncio.TaskGroup()` abstraction, and it serves its purpose well.

```python
import asyncio


async def my_function(val: int) -> int:
    return val * 2

async def main() -> None:
    async with asyncio.TaskGroup() as tg:
        task_1 = tg.create_task(my_function(1))
        task_2 = tg.create_task(my_function(2))
        task_3 = tg.create_task(my_function(3))

    results = [task_1.result(), task_2.result(), task_3.result()]
    print(results) # expect [2, 4, 6]


if __name__ == "__main__":
    asyncio.run(main())
```

Pretty good! After the `tg` scope context manager has ended, we are guaranteed that each task has finished (or errored). Even though I have strong feelings about python not really having true scoping, we are able to consume the results of those tasks. But it still is a bit un-ergonomic. What if we want to create 1 million tasks, all with the same function, and get the results of all of them simultaneously?

```python
import asyncio


async def my_function(val: int) -> int:
    return val * 2

async def run_a_bunch_of_tasks(n: int) -> list[int]:
    async with asyncio.TaskGroup() as tg:
        tasks = [tg.create_task(my_function(i)) for i in range(n)]
    return [task.result() for task in tasks]

async def main() -> None:
    results = await run_a_bunch_of_tasks(1000000)
    print(results) # expect [0, 2, 4, ..., 1999996, 1999998]


if __name__ == "__main__":
    asyncio.run(main())
```

Nicer, but now let's _really_ abstract it.

```python
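import asyncio
from collections.abc import Coroutine, Iterable
from typing import Any, TypeVar

T = TypeVar("T")


async def run_futures(coros: Iterable[Coroutine[Any, Any, T]]) -> list[T]:
    # create_task() only accepts coroutines, hence Coroutine rather than Awaitable.
    async with asyncio.TaskGroup() as tg:
        tasks: list[asyncio.Task[T]] = [tg.create_task(coro) for coro in coros]
    # Leaving the TaskGroup block guarantees every task has finished.
    return [t.result() for t in tasks]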
```

Now, we can use it.
@@ -0,0 +1,197 @@
---
title: The only python interview question you will ever need
tags:
- thoughts
---
I've been thinking recently about Python API design (as one does, in their mid 20s). I'm someone who cares deeply about writing performant code, so I often turn to [[threading|`threading, multiprocessing`]] or `asyncio` when dealing with IO-bound work (which is the majority of python applications).

I was looking at the API design for `concurrent.futures.ProcessPoolExecutor/ThreadPoolExecutor.submit()`, which looks something like:

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_bound_function(val, *, keyword_arg):
    ...

# or ThreadPoolExecutor()
with ProcessPoolExecutor() as executor:
    task = executor.submit(cpu_bound_function, 5, keyword_arg=10)
```

Versus `asyncio.TaskGroup.create_task()`, which looks something like:

```python
from asyncio import TaskGroup

async def io_bound_function(val, *, keyword_arg):
    ...

# (inside a running coroutine)
async with TaskGroup() as tg:
    task = tg.create_task(io_bound_function(5, keyword_arg=10))
```

The question is: **Why are these APIs different?** Or, **why can't `ProcessPoolExecutor/ThreadPoolExecutor` use the nicer, ergonomic API of `asyncio.TaskGroup.create_task()`**?

I like the question because if you understand how async code works (not even necessarily in python; the same logic applies to JavaScript promises or Rust Futures), the answer is immediately obvious. Likewise, if you have a minimal understanding of threading/multiprocessing as a concept, it's obvious.

If it isn't obvious to you: time to study :)

# Appendix: What LLMs say

This isn't a particularly esoteric question or anything, but I wanted to see what LLMs think about it. Here's what a few of them say.

> [!example]- Claude 3.5 Sonnet
>
> This is a great question about API design! Let me break down why these APIs are different and why the ProcessPoolExecutor/ThreadPoolExecutor can't use the same approach as TaskGroup.
> The key difference lies in how these executors handle function execution:
>
> 1. **TaskGroup (asyncio)** operates within the same process and thread. When you write:
> ```python
> tg.create_task(io_bound_function(5, keyword_arg=10))
> ```
> The function call `io_bound_function(5, keyword_arg=10)` is evaluated in the current process, creating a coroutine object. This coroutine object is then scheduled for execution by the event loop.
>
> 2. **ProcessPoolExecutor/ThreadPoolExecutor** needs to execute the function in a different process/thread. If they used the same API:
> ```python
> # If this was possible (but it's not):
> executor.submit(cpu_bound_function(5, keyword_arg=10))
> ```
> The function call would be evaluated in the current process before being submitted to the executor. This would mean:
> - For ProcessPoolExecutor: The function would actually run in the main process, defeating the purpose of process pooling
> - Arguments would be evaluated in the wrong process/thread
> - The function and arguments would need to be pickled/serialized after execution rather than before
>
> Instead, by using:
> ```python
> executor.submit(cpu_bound_function, 5, keyword_arg=10)
> ```
> The executor can:
> 1. Pickle/serialize the function reference and arguments separately
> 2. Send them to the worker process/thread
> 3. Only then evaluate the function call in the correct process/thread
>
> So while the TaskGroup API is more ergonomic, the ProcessPoolExecutor/ThreadPoolExecutor API is designed this way out of necessity - it needs to delay the function call evaluation until it reaches the worker process/thread. The distinction becomes even clearer with something like:
>
> ```python
> # With TaskGroup - evaluates get_expensive_value() immediately
> tg.create_task(cpu_bound_function(get_expensive_value(), keyword_arg=10))
>
> # With ProcessPoolExecutor - only passes the function reference
> executor.submit(cpu_bound_function, get_expensive_value(), keyword_arg=10)
> ```
>
> The second case evaluates `get_expensive_value()` in the main process before submission, which might not be what you want if it's a CPU-intensive operation.
>
> > [!info] Verdict
> > As with most things, Claude gives a great, concise answer. I especially like the example.

> [!example]- ChatGPT o1
> A convenient way to think about it is that **the `asyncio` world and the `concurrent.futures` world solve “concurrency” in very different ways**, with very different constraints. They _look_ somewhat similar (both submit “tasks” that run concurrently), but under the hood they’re operating in different paradigms:
>
> 1. **`asyncio` uses an event loop and coroutines**
>
> - `create_task(coro)` receives a _coroutine object_, which is a first-class object in Python.
> - That coroutine object is then scheduled to run on the event loop.
> - When you write `tg.create_task(myfunc(5, x=10))`, you are _already calling_ `myfunc(...)` in order to _build the coroutine object_, but you are _not_ actually running the function’s body yet (because coroutines don’t run until the event loop schedules them).
> 2. **`concurrent.futures` (ProcessPoolExecutor/ThreadPoolExecutor) uses callable + arguments**
>
> - `executor.submit(func, *args, **kwargs)` sends the callable _plus_ the arguments into a worker thread or process.
> - In the case of a `ProcessPoolExecutor`, `func` and its arguments must be _pickleable_.
> - The actual function call happens in a completely separate thread or process, so it needs your function object and arguments in a form that can be passed across processes.
> - If you wrote something like `executor.submit(myfunc(5, x=10))`, that would _immediately call_ `myfunc(5, x=10)` in the current process—returning its result (or a coroutine object, if `myfunc` was async)—and then try to submit that _value_ (rather than the function) to the executor. That’s not what we want.
>
> ---
>
> ## Why they can’t just share the same API
>
> 1. **Different concurrency models**
> `asyncio` is _single-threaded_ concurrency driven by an event loop (cooperative multitasking), whereas `concurrent.futures` is about _multi_-threading or multi-_processing_ concurrency (preemptive multitasking). They each require different underlying mechanisms.
>
> 2. **Pickle vs. coroutine objects**
>
> - With `ProcessPoolExecutor`, any function you submit must be importable at the top level and must be pickleable.
> - A _coroutine object_ (as in `create_task(coro)`) _isn’t_ pickleable in a meaningful way for shipping off to another process.
> - Even in a thread pool, the executor expects a function pointer plus arguments; it doesn’t integrate with an event loop or coroutines directly.
> 3. **Historical reasons / different eras**
>
> - `concurrent.futures` predates `asyncio` and was heavily influenced by Java’s executor pattern (where you submit a callable and arguments).
> - `asyncio` came later with a different design: coroutines, an event loop, `await`, etc.
> - By the time `asyncio` matured, `concurrent.futures` already had its established `submit(func, *args, **kwargs)` pattern, so there wasn’t a simple or backward-compatible way to unify them.
> 4. **They solve different primary use cases**
>
> - `ThreadPoolExecutor` / `ProcessPoolExecutor`: CPU-bound or blocking tasks, run in separate threads/processes.
> - `asyncio`: IO-bound tasks that _cooperatively_ yield.
> - It’s not just syntax sugar—`asyncio` is built around the concept that a function returns immediately as a “cold” coroutine object. In `concurrent.futures`, you must supply a “plain old Python function” that hasn’t been invoked yet (because it will be invoked _in another thread or process_).
>
> ---
>
> ### Putting it all together
>
> When you do:
>
> ```python
> # asyncio
> async with TaskGroup() as tg:
>     task = tg.create_task(myasyncfunc(5, keyword_arg=10))
> ```
>
> - You are constructing a coroutine object `myasyncfunc(5, keyword_arg=10)` in the current thread, and telling the event loop: “Whenever you get a chance, run this coroutine.”
>
> But when you do:
>
> ```python
> # concurrent.futures
> with ProcessPoolExecutor() as executor:
>     fut = executor.submit(myfunc, 5, keyword_arg=10)
> ```
>
> - You are handing _callable plus arguments_ to the executor, which will ship them to a pool worker process. The worker process will then call `myfunc(5, keyword_arg=10)` over there.
>
> If the `concurrent.futures` executors tried to accept the same style—e.g., `executor.submit(myfunc(5, x=10))`—they’d be forced to evaluate `myfunc(...)` _right now_ in your current process. That _defeats the purpose_, because the result of `myfunc(...)` would be computed before even hitting the executor, and there’d be nothing left to run on another thread/process.
>
> Hence the short answer:
>
> > **They have different concurrency models and different technical constraints, so they necessarily end up with different-looking APIs.**
>
> > [!info] Verdict
> > A bit wordy, but overall, the right answer. o1 in general is a bit wordy (not sure if this is intentional behavior), but it could be cut in half and still be right.

> [!example]- Gemini 1206 Exp
> You've hit upon a subtle but important distinction between synchronous and asynchronous programming models, and how that impacts API design. Here's a breakdown of why `concurrent.futures` and `asyncio` differ in their task submission approaches:
>
> **1. Synchronous vs. Asynchronous Execution**
>
> * **`concurrent.futures` (ProcessPoolExecutor/ThreadPoolExecutor):** Designed for **synchronous** code. When you `submit()` a task, the calling thread *blocks* (waits) or continues execution (depending on further interactions with the future object), but conceptually, it expects the task to run independently, either in a separate thread or process.
> * **`asyncio` (TaskGroup):** Designed for **asynchronous** code using coroutines (`async` and `await`). `create_task()` schedules a coroutine to run concurrently within the event loop. The execution flow doesn't inherently block; it relies on the event loop to manage the switching between coroutines.
>
> **2. Function Call vs. Coroutine Object**
>
> * **`submit()`:** Takes a **callable** (function or method) and its arguments. The executor is responsible for invoking this callable in a separate thread/process. The function is executed directly.
> * **`create_task()`:** Takes a **coroutine object**. The `io_bound_function(5, keyword_arg=10)` in your example *doesn't actually execute the function*. Instead, it returns a coroutine object that represents the suspended execution of that function. The event loop is then responsible for resuming this coroutine at the appropriate times.
>
> **3. The Fundamental Difference**
>
> The core reason for the API difference comes down to this:
>
> * **Synchronous executors need to *call* your function to make it run.** Therefore, they need the function and arguments separately.
> * **Asynchronous frameworks need to *schedule* your coroutine within an event loop.** They work with the coroutine object itself, not by directly calling the underlying function. The function has already been called to produce the coroutine object.
>
> **Why Can't `concurrent.futures` Adopt `asyncio`'s Style?**
>
> If `ProcessPoolExecutor/ThreadPoolExecutor` were to accept a coroutine object, they would have no built-in mechanism to execute it. They don't have an event loop. They expect a regular, callable function that they can invoke directly.
>
> **Analogy**
>
> Think of it like this:
>
> * **`concurrent.futures`:** You hand a chef (executor) a recipe (function) and ingredients (arguments) and say, "Make this dish." The chef follows the recipe step-by-step in their own kitchen (thread/process).
> * **`asyncio`:** You hand a head chef (event loop) a set of instructions for a dish (coroutine object). The head chef doesn't cook it immediately but adds it to their list of tasks. They then delegate different parts of preparing the dish (resuming the coroutine) to sous chefs (other parts of the program) as they become available, all while managing other dishes concurrently.
>
> **In Summary**
>
> The difference in API design reflects the underlying models:
>
> * `concurrent.futures`: Call a function to execute it.
> * `asyncio`: Schedule a coroutine object for execution within an event loop.
>
> The API for synchronous execution needs the function and arguments separately because it will invoke the function directly. The asynchronous API uses the coroutine object because it's not responsible for direct execution but rather scheduling within a larger system (the event loop).
> > [!info] Verdict
> > I really like Gemini's tone, and the example is pretty good too.

Verdict: the LLMs know, and they all give what I would classify as _great_ answers.
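The point all three answers share can be checked in a few lines. This is only a sketch: `blocking_function` and the `calls` list are made up for the demonstration, and it uses a `ThreadPoolExecutor` so that pickling doesn't get in the way.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

calls: list[str] = []


async def io_bound_function(val: int, *, keyword_arg: int) -> int:
    calls.append("coroutine body ran")
    return val + keyword_arg


def blocking_function(val: int, *, keyword_arg: int) -> tuple[int, int]:
    # Return the result together with the id of the thread that did the work.
    return val + keyword_arg, threading.get_ident()


caller_thread = threading.get_ident()

# Calling an async function only builds a coroutine object; the body has not run.
coro = io_bound_function(5, keyword_arg=10)
body_ran_on_call = bool(calls)
async_result = asyncio.run(coro)  # the event loop runs the body here

# submit() receives the callable and its arguments, so the call happens in the worker.
with ThreadPoolExecutor(max_workers=1) as executor:
    pool_result, pool_thread = executor.submit(
        blocking_function, 5, keyword_arg=10
    ).result()

# Calling a plain function yourself runs it right here, in the calling thread,
# which is all that submit(blocking_function(...)) would have done.
eager_result, eager_thread = blocking_function(5, keyword_arg=10)

print(body_ran_on_call, async_result, pool_result)
print(pool_thread != caller_thread, eager_thread == caller_thread)
```

A coroutine call is lazy, so handing its result to `create_task()` loses nothing. A regular call is eager, so the executor has to be given the pieces and make the call itself.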
7 changes: 6 additions & 1 deletion content/programming/tools/git/multiple-git-accounts.md
@@ -1,3 +1,9 @@
---
title: Handling multiple git accounts
tags:
- thoughts
---

Imagine the scenario: you have multiple GitHub accounts, one for personal stuff, one for school, one for work stuff. Managing this becomes a nightmare, and I'm sure y'all can relate to pushing with the wrong email, frantically amending, then running `git config user.email`.

If you keep your separate "identities" separated via subfolder, there's a nice solution.
@@ -20,4 +26,3 @@ Now, in `/my/path`, create a `.gitconfig`:
Everything nested under `/my/path` will use those git configs.

Note this doesn't solve the issue of "how do I authenticate to github with multiple accounts", which... is its own can of worms. Maybe someday!

1 change: 1 addition & 0 deletions content/the-art-of-writing-fast.md
@@ -3,6 +3,7 @@ title: The art of writing fast
tags:
- thoughts
---

I really want to become a better writer. My whole life, I've been paralyzed by my own poor prose, jealous of people who can communicate, and desperate to be one of them. I started this mini-blog as a way to develop those writing skills.

Yet I don't write much. Well, I do. I write something, then I leave it in a half-finished state, then I scrap it. I get halfway through an idea, then discover something new about it, then leave it trapped in the prison of my mind, only to be forgotten and rediscovered a few months later.