# agent-app
Minimalist example of exposing an LLM agent using FastAPI.

# Considerations

## The LangChain project
- This repository was built on an early release of LangChain. The library has since expanded into a much more versatile and complete tool.
- [LangGraph](https://langchain-ai.github.io/langgraph/tutorials/introduction/) now allows us to orchestrate agents and natively implements human-in-the-loop, long-term memory, and other features that had to be implemented manually in this repository (a minimal graph sketch follows this list).
- [LangSmith](https://docs.smith.langchain.com) now allows us to observe and evaluate our agents in a production environment. It works with both LangChain and LangGraph and also serves as a deployment platform, though deployment is not relevant for this architecture.
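
As a taste of the orchestration layer, here is a minimal LangGraph sketch. Nothing in it comes from this repository: the `AgentState` fields and node names are illustrative assumptions only.

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class AgentState(TypedDict):
    question: str
    answer: str


def answer_node(state: AgentState) -> dict:
    # Stand-in for a real LLM call; returns a partial state update.
    return {"answer": f"echo: {state['question']}"}


graph = StateGraph(AgentState)
graph.add_node("answer", answer_node)
graph.add_edge(START, "answer")
graph.add_edge("answer", END)
app = graph.compile()  # passing a checkpointer here enables persistence and human-in-the-loop

print(app.invoke({"question": "hello"}))
```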

## Changes and general guidelines for devs
- LangChain itself has changed significantly since this repository was created, and the syntax now differs accordingly. Tools were streamlined, although many of the old Chain schemas are still available.
- Some of the custom logic here has now been abstracted away into new features built into v0.3 of LangChain. Here is a [tutorial](https://python.langchain.com/docs/versions/v0_3/#how-to-update-your-code) on how to update your code.
- Any updates should start with the consideration that [Pydantic has to be updated to v2](https://docs.pydantic.dev/latest/migration/).
- More importantly, LangChain provides a comprehensive tutorial on how to [migrate](https://python.langchain.com/docs/versions/migrating_chains/) from the early implementation to the current one. The syntax now centers on LCEL (LangChain Expression Language). Every LCEL object implements the Runnable interface, which makes streaming intermediate steps, batching, and other features easier, rendering some of the logic here obsolete after a refactor (see the LCEL sketch after this list).
- Cohere ReRank coupled with vector-store semantic search is still considered among the most reliable ways to retrieve relevant information from a vector store, so while some of the syntax can be refactored and streamlined, the core idea remains the same as of Q2 2025 (a retrieval sketch follows this list).
- Qdrant is still a widely supported vector store, and its use in memory retrieval can be maintained. But given the ubiquity of PGVector and its even greater support across applications and pipelines, further development could benefit from a full migration to PGVector.
- The embedding model can be cheaply upgraded to the latest OpenAI model available, though benchmarking is recommended, as increased dimensionality can lead to unpredictable results. Generally, more is better, but a higher-dimensional model requires a new, higher-dimension table, which could be problematic for user-data persistence if the migration is not done well. 1536 dimensions is still a good balance between performance and cost, though that could change in the future (see the embeddings sketch after this list).
- With the recent explosion of performant open-source models, it is quickly becoming feasible to build this application without relying on the big three LLM providers, or at least to phase them out partially (say, keeping the embeddings or low-token calls on OpenAI).
- On multi-language support: almost every underlying model supports a wide range of languages. However, custom prompts are used extensively under the hood (in our custom logic and in LangChain itself), and these can confuse the model and lead to unpredictable results in other languages. Either the "head" model has to be prompted to translate all of its downstream calls into English, or the prompts themselves have to be translated or autogenerated by an intermediate step. This is a complex problem that is not yet solved, and it is not yet a priority for this project. It is recommended to benchmark reliability in different languages against English with an internal evaluation script, especially when web search is involved, as mistranslations propagate quickly through the chain (much like round-tripping text through machine translation compounds synonym and context errors, just to a lesser extent here).
- The aforementioned evaluation script can be built by importing the chains and models into a new .py file and dumping results into a dataframe, to be scored manually or by a strong model, with scores assigned to each permutation of model and language (see the evaluation sketch after this list).
- The local cache is fine to keep, though other solutions are now available at the framework level, and possibly even at the vector-store level.
- The memory schema works well, but many other options are now available that might yield better recall accuracy and personalization.
- The web-scraping features are custom built for our backend, but with web search now integrated natively into many models, plus a vast array of third-party solutions, an incoming dev should look at options that work better than an HTML/BeautifulSoup text dump.
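
To make the LCEL point concrete, here is a minimal sketch of a modern chain. The prompt text and model name are illustrative assumptions, not values from this repository; the `prompt | model | parser` composition and the Runnable methods are the point.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

chain.invoke({"text": "..."})                 # single call
chain.batch([{"text": "a"}, {"text": "b"}])   # batching, often cheaper per request
for chunk in chain.stream({"text": "..."}):   # streaming intermediate output
    print(chunk, end="")
```

Every Runnable also exposes async variants (`ainvoke`, `abatch`, `astream`), which matters under FastAPI.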
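
The reranked retrieval described above composes naturally in current LangChain. A sketch, assuming an existing `vector_store` (Qdrant or PGVector) and the `langchain-cohere` package; the rerank model name and the over-fetch `k` are illustrative:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# `vector_store` is whatever store the app already builds (Qdrant, PGVector, ...).
retriever = ContextualCompressionRetriever(
    base_compressor=CohereRerank(model="rerank-english-v3.0"),
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 20}),
)
docs = retriever.invoke("user question")  # over-fetch 20 candidates, let Cohere keep the best few
```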
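
On the dimensionality point: OpenAI's text-embedding-3 family accepts an explicit `dimensions` parameter, so an upgraded model can be pinned to the existing 1536-dimension table and only migrated later if benchmarks justify it. A sketch:

```python
from langchain_openai import OpenAIEmbeddings

# Pin the newer model to the current table's dimensionality to avoid a schema migration.
embeddings = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1536)
```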
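
A sketch of the evaluation script described above. `build_chain` is a hypothetical helper standing in for however this repository assembles its chains, and the model names are placeholders:

```python
import pandas as pd

from app.chains import build_chain  # hypothetical helper; adapt to the repo's actual modules

# The same question in several languages, to compare reliability against English.
questions = {
    "en": "What is the capital of France?",
    "fr": "Quelle est la capitale de la France ?",
}

rows = []
for model_name in ["gpt-4o-mini", "gpt-4o"]:     # permutations of models...
    chain = build_chain(model=model_name)
    for lang, question in questions.items():     # ...and languages
        rows.append({
            "model": model_name,
            "lang": lang,
            "question": question,
            "answer": chain.invoke(question),
        })

df = pd.DataFrame(rows)
df.to_csv("eval_results.csv", index=False)  # score manually, or feed to a judge model
```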

## Cost optimization
- The blazing pace of development in this space has lowered the bar for which models are smart enough to handle the complexity involved in this application, meaning that bigger is not always better if 95% of the job can be done at 1/10th or 1/100th of the cost. The functions manager is the module that requires the most deterministic and reliable output, so a low-temperature, high-context model is of most use there. For parsing web searches and conversation recall, any modern model will do. The "head" model that talks to the user somewhat bottlenecks the perceived quality of the responses, but it is still of lower priority than the functions manager. Avoid anything too simple, as loss of information throughout the chain of thought can lead to unpredictable results: "good enough" when possible, "best in class" when the result has to be deterministic. An output parser and an evaluation model can come in handy here too (see the sketch after this list).
- PGVector is more than enough, and so is Qdrant, so unless there is an explicit benefit to using a paid vector store, it is recommended to use a free one.
- LangSmith, [MCP](https://modelcontextprotocol.io/docs/concepts/prompts) and other solutions within the LangChain ecosystem can be used to evaluate output before a lot of tokens are spent. Coupled with the Runnable and natively async nature of LCEL, tokens can be saved by rejecting bad responses early.
- Batched requests are often cheaper on many APIs, so it is recommended to use them when possible (also now more seamless with LCEL).
- Output length can be truncated to save tokens. Make sure this is only used when appropriate, as it can lead to loss of information.
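
A sketch of the tiering and early-rejection ideas above. The model names, temperatures, and the sanity check are illustrative assumptions, not this repository's actual wiring:

```python
from langchain_openai import ChatOpenAI

# Deterministic tier for the functions manager; cheap tier for parsing and recall.
functions_manager_llm = ChatOpenAI(model="gpt-4o", temperature=0)
parser_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)


def guarded_invoke(chain, payload: dict) -> str:
    """Reject obviously bad intermediate output before the expensive head model runs."""
    draft = chain.invoke(payload)
    if not draft.strip():  # trivially cheap check; a judge model is the heavier option
        raise ValueError("empty intermediate output, aborting before the head-model call")
    return draft
```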

## Final Considerations
- An incoming dev should prioritize updating the syntax to the latest version of LangChain, and then proceed to refactor the code to take advantage of the new features. Only then should they start implementing new features.
- Directly building on top of this repository "as is", without updating the syntax, is likely to lead to more complexity and problems that have long been abstracted away and solved by the community.
- FastAPI remains a viable and performant API solution in Python, unlikely to be replaced by anything else in the near future (a minimal endpoint sketch follows).
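
For reference, the shape of a minimal FastAPI endpoint around a chain. The `RunnableLambda` stands in for whatever agent chain this repository actually assembles, and the route name is arbitrary:

```python
from fastapi import FastAPI
from langchain_core.runnables import RunnableLambda
from pydantic import BaseModel

app = FastAPI()
agent_chain = RunnableLambda(lambda x: f"echo: {x['question']}")  # stand-in for the real agent


class Query(BaseModel):
    question: str


@app.post("/agent")
async def ask_agent(query: Query) -> dict:
    # LCEL runnables expose `ainvoke`, so the event loop is never blocked.
    answer = await agent_chain.ainvoke({"question": query.question})
    return {"answer": answer}
```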