Chat with PDF using Chainlit #178

Merged: 10 commits merged into main from issue167 on Apr 11, 2024
Conversation

@neelasha23 (Contributor) commented Apr 5, 2024

Closes #167


📚 Documentation preview 📚: https://ploomber-doc--178.org.readthedocs.build/en/178/

@edublancas (Contributor)

does langchain provide any significant benefits? you could also do it with the openAI package but if you feel like using langchain is better, go ahead

@neelasha23 (Contributor, Author) commented Apr 5, 2024

does langchain provide any significant benefits? you could also do it with the openAI package but if you feel like using langchain is better, go ahead

The embeddings are from OpenAI. langchain has features for making the conversation contextual: the chat memory appends the current user query to the previous response, and if there's any additional info in a previous message it takes that into account. I referred to this tutorial: https://docs.chainlit.io/examples/qa, and saw the same approach in the Panel example.

Is it preferable to try without langchain completely? I'll update both methods.
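
For reference, this is roughly what the langchain-based setup looks like (a minimal sketch following the Chainlit QA tutorial pattern; `vectorstore` is assumed to already contain the embedded PDF chunks, and the model settings are illustrative rather than the exact ones in this PR):

```python
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI

# the memory object stores the running chat history, so each new question is
# answered in the context of the previous turns
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(temperature=0),
    retriever=vectorstore.as_retriever(),  # retriever over the embedded PDF chunks (assumed built elsewhere)
    memory=memory,
)

# the chain retrieves relevant chunks and answers using both the chunks and
# the accumulated chat history
result = chain.invoke({"question": "What is this document about?"})
print(result["answer"])
```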

@edublancas

@edublancas (Contributor)

that's ok, let's keep langchain

Commits: "chainlit readme", "fix readme"
neelasha23 marked this pull request as ready for review April 6, 2024 16:17
@bryannho (Contributor) left a comment

Code looks good. Will wait for deploy to test functionality

@neelasha23 (Contributor, Author) commented Apr 8, 2024

@bryannho (Contributor) commented Apr 8, 2024

@neelasha23 I got this error when I uploaded a PDF (didn't send any messages yet):

[screenshot of the error]

@neelasha23 (Contributor, Author)

@neelasha23 I got this error when I uploaded a PDF (didn't send any messages yet):

Please share the PDF @bryannho

@bryannho (Contributor) commented Apr 8, 2024

@edublancas It was failing because I used a scanned PDF.

@neelasha23 will add a catch and use pdf_scanned_to_text for this case
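
A possible shape for that fallback (a sketch only; the helper names come from this thread, but their exact signatures and the error raised for scanned PDFs are assumptions):

```python
def extract_pdf_text(path: str) -> str:
    """Extract text from a PDF, falling back to OCR-based extraction for scanned files."""
    try:
        text = pdf_to_text(path)
    except IndexError:
        # scanned PDFs have no extractable text layer, so fall back to the scanned-PDF helper
        return pdf_scanned_to_text(path)
    if not text.strip():
        # some scanned PDFs return an empty string instead of raising
        return pdf_scanned_to_text(path)
    return text
```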

@bryannho (Contributor) commented Apr 8, 2024

@neelasha23 two more small notes:

  1. I tried again with this native placeholder.pdf and it returned the same error list index out of range, so unsure if it's just a scanned pdf issue.
  2. The "Processing placeholder.pdf..." message isn't sent until the document has fully finished processing which is a bit confusing. I see the progress bar for the file upload, but could you try to send that message after the file upload has completed but before the app starts parsing the document? See the video below:
[screen recording: Screen.Recording.2024-04-08.at.1.14.01.PM.mov]

@neelasha23 (Contributor, Author)

@edublancas It was failing because I used a scanned PDF.

@neelasha23 will add a catch and use pdf_scanned_to_text

Added call to pdf_scanned_to_text

@bryannho

@neelasha23 (Contributor, Author) commented Apr 9, 2024

  1. I tried again with this native placeholder.pdf and it returned the same error list index out of range

It happened because there was just one line in the PDF. Fixed

  2. could you try to send that message after the file upload has completed but before the app starts parsing the document?

It was already sent after the file upload and before the document parsing. But I think the message was getting stuck because of the asynchronous nature of calling pdf_to_text. I have converted the PDF conversion call to async, as sketched below. There might still be a lag in the message display depending on the size of the PDF.
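
Roughly, the change looks like this (a sketch using Chainlit's AskFileMessage and make_async; the handler structure and the pdf_to_text call are illustrative, not the exact code in the commit):

```python
import chainlit as cl

@cl.on_chat_start
async def on_chat_start():
    files = await cl.AskFileMessage(
        content="Please upload a PDF to begin.", accept=["application/pdf"]
    ).send()
    pdf = files[0]

    # send the status message as soon as the upload completes, before parsing starts
    msg = cl.Message(content=f"Processing {pdf.name}...")
    await msg.send()

    # run the blocking PDF-to-text conversion in a worker thread so the event
    # loop stays free and the message above is actually displayed
    text = await cl.make_async(pdf_to_text)(pdf.path)

    msg.content = f"{pdf.name} processed. You can now ask questions!"
    await msg.update()
```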

Deployed updated app here: https://polished-night-8566.ploomberapp.io/

@bryannho @edublancas

@edublancas (Contributor)

the answers look off (I tried this paper: https://arxiv.org/abs/2402.00838):

[screenshot of the incorrect answer]

@neelasha23 (Contributor, Author) commented Apr 9, 2024

the answers look off (I tried this paper: https://arxiv.org/abs/2402.00838):

I think it's confusing the title with the ones in the References section.

[Screenshots: 2024-04-09 10:11 PM and 10:16 PM]

It is answering specific questions about the content though:

[Screenshot: 2024-04-09 10:13 PM]

@edublancas

@edublancas (Contributor)

I think we should fix it, those basic questions should be answered correctly.

check how it's done here: https://github.com/ploomber/doc/tree/main/examples/panel/chat-with-pdf. I remember testing that example and it worked fine when asking about the abstract and title

two things might happen:

  • the issue might be in how we pass the info to langchain; in that case, check if you want to change the settings
  • maybe the problem is the content returned by our parsing functions; in that case, let us know so we can see how we can improve it

@neelasha23 (Contributor, Author)

The Panel one is also mistaking the title:

[Screenshot: 2024-04-09 10:24 PM]

But the 2nd response is better:

[Screenshot: 2024-04-09 10:26 PM]

I'll still check what can be improved

@edublancas

@edublancas (Contributor)

cool. yeah, compare the answers from both examples, and try to see if you can improve it a bit.

spend ~2 hours on this; no need to spend more. let me know your conclusions so we know if we should keep digging or publish as is

@neelasha23 (Contributor, Author) commented Apr 10, 2024

I ended up spending a bit more time on this because, even after adding the same settings as the Panel app, the results did not match the outputs I saw in the Panel app yesterday. But when I tested the Panel app again, I realized it's not performing any better: the results are inconsistent, and it might give the correct answer once in a while by chance. Also, this issue is mostly with the OLMo paper; I found better results with other papers. Below are the detailed observations and next steps:

Changes made to the Chainlit app (as per the Panel one)
Here's the commit. A rough sketch of the combined setup follows this list.

  • Changed RecursiveCharacterTextSplitter to CharacterTextSplitter
  • Converted the PDF extracted using the Ploomber API to a list of Document objects. Each object looks like this:
    Document(page_content="text", metadata={"source": page_path, "page": page_number})
  • Since we now have Document objects, converted LanceDB.from_texts to LanceDB.from_documents.
  • Converted ConversationalRetrievalChain to RetrievalQA
  • Added search_type="similarity"
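
Putting those changes together, the retrieval setup looks roughly like this (a sketch under assumptions: `pages` stands in for the output of the PDF extraction step, and chunk sizes and model settings are illustrative rather than the exact values in the commit):

```python
from langchain.chains import RetrievalQA
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import LanceDB
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# `pages` is an assumption: a list of (page_path, page_number, text) tuples
# produced by the PDF extraction step
docs = [
    Document(page_content=text, metadata={"source": page_path, "page": page_number})
    for page_path, page_number, text in pages
]

# CharacterTextSplitter instead of RecursiveCharacterTextSplitter, as in the Panel app
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# from_documents keeps the per-page metadata alongside the embedded text
vectorstore = LanceDB.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_type="similarity")

# RetrievalQA instead of ConversationalRetrievalChain
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=0), retriever=retriever)
answer = qa.invoke({"query": "What is the title of this paper?"})
```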

Panel app observations (inconsistent results on the OLMo paper)
[Screenshot: 2024-04-10 4:53 PM]

It has picked up from the references:
[Screenshot: 2024-04-10 4:54 PM]

Another attempt:

[Screenshot: 2024-04-10 4:59 PM]

Chainlit output

Paper: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

[Screenshot: 2024-04-10 5:20 PM]

Conclusions

  1. It looks like in both cases the vector search is biased towards the References section, which might be because it's towards the end of the document.
  2. We can research more on how to utilize the metadata in the Document object to improve the results.
  3. We can explore a RAG-based approach and index different parts of the document separately so the retriever doesn't get confused.
  4. The pdf_to_text and scanned_pdf_to_text functions could take a parameter for returning results in the Document format mentioned above. That would make them compatible with langchain (in case it's relevant for users); a possible shape is sketched below.
  5. I think the vector DB should not make too much of a difference, but we can check the differences between LanceDB and Chroma (that's the only remaining difference between the Chainlit app and the Panel app). I did try with Chroma as well, and it didn't make any difference.
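
For point 4, a hypothetical sketch of what that Document-returning variant could look like (the per-page return shape of pdf_to_text and the wrapper name are assumptions, not the current API):

```python
from langchain_core.documents import Document

def pdf_to_documents(path: str) -> list[Document]:
    """Hypothetical wrapper: extract a PDF and return langchain Document objects
    with per-page metadata instead of plain text."""
    pages = pdf_to_text(path)  # assumed here to return one text string per page
    return [
        Document(page_content=text, metadata={"source": path, "page": number})
        for number, text in enumerate(pages, start=1)
    ]
```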

Deployed updated app: https://www.platform.ploomber.io/applications/steep-salad-8357/0f7566de

@edublancas

neelasha23 requested a review from bryannho April 10, 2024 12:56
edublancas merged commit 5d53e9d into main Apr 11, 2024
1 check passed
edublancas deleted the issue167 branch April 11, 2024 03:03
@edublancas (Contributor)

@neelasha23 please prepare social media posts and include the link to the app, let's keep the example running for a week
