Chat with PDF using Chainlit #178

Merged: 10 commits merged into main from issue167 on Apr 11, 2024
Conversation

@neelasha23 (Contributor) commented Apr 5, 2024

Closes #167


📚 Documentation preview 📚: https://ploomber-doc--178.org.readthedocs.build/en/178/

@edublancas (Contributor)

does langchain provide any significant benefits? you could also do it with the openAI package but if you feel like using langchain is better, go ahead

@neelasha23 (Contributor, Author) commented Apr 5, 2024

does langchain provide any significant benefits? you could also do it with the openAI package but if you feel like using langchain is better, go ahead

The embeddings are from OpenAI. langchain has features for making the conversation contextual: the chat memory appends the current user query to the previous response, and if there's any additional info in a previous message it takes that into account. I referred to this tutorial: https://docs.chainlit.io/examples/qa, and saw the same approach in the Panel example.

Is it preferable to try without langchain completely? I'll update both methods.
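
For reference, this is roughly what the langchain-based setup looks like (a minimal sketch following the Chainlit QA tutorial pattern; `vectorstore` is assumed to already contain the embedded PDF chunks, and the model settings are illustrative rather than the exact ones in this PR):

```python
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI

# the memory object stores the running chat history, so each new question is
# answered in the context of the previous turns
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(temperature=0),
    retriever=vectorstore.as_retriever(),  # retriever over the embedded PDF chunks (assumed built elsewhere)
    memory=memory,
)

# the chain retrieves relevant chunks and answers using both the chunks and
# the accumulated chat history
result = chain.invoke({"question": "What is this document about?"})
print(result["answer"])
```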

@edublancas

@edublancas (Contributor)

that's ok, let's keep langchain

Commits: "chainlit readme", "fix readme"
neelasha23 marked this pull request as ready for review April 6, 2024 16:17
@bryannho (Contributor) left a comment

Code looks good. Will wait for deploy to test functionality

@neelasha23 (Contributor, Author) commented Apr 8, 2024

@bryannho (Contributor) commented Apr 8, 2024

@neelasha23 I got this error when I uploaded a PDF (didn't send any messages yet):

[screenshot of the error]

@neelasha23 (Contributor, Author)

@neelasha23 I got this error when I uploaded a PDF (didn't send any messages yet):

Please share the PDF @bryannho

@bryannho (Contributor) commented Apr 8, 2024

@edublancas It was failing because I used a scanned PDF.

@neelasha23 will add a catch and use pdf_scanned_to_text for this case
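
A possible shape for that fallback (a sketch only; the helper names come from this thread, but their exact signatures and the error raised for scanned PDFs are assumptions):

```python
def extract_pdf_text(path: str) -> str:
    """Extract text from a PDF, falling back to OCR-based extraction for scanned files."""
    try:
        text = pdf_to_text(path)
    except IndexError:
        # scanned PDFs have no extractable text layer, so fall back to the scanned-PDF helper
        return pdf_scanned_to_text(path)
    if not text.strip():
        # some scanned PDFs return an empty string instead of raising
        return pdf_scanned_to_text(path)
    return text
```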

@bryannho (Contributor) commented Apr 8, 2024

@neelasha23 two more small notes:

  1. I tried again with this native placeholder.pdf and it returned the same error list index out of range, so unsure if it's just a scanned pdf issue.
  2. The "Processing placeholder.pdf..." message isn't sent until the document has fully finished processing which is a bit confusing. I see the progress bar for the file upload, but could you try to send that message after the file upload has completed but before the app starts parsing the document? See the video below:
[screen recording: Screen.Recording.2024-04-08.at.1.14.01.PM.mov]

@neelasha23 (Contributor, Author)

@edublancas It was failing because I used a scanned PDF.

@neelasha23 will add a catch and use pdf_scanned_to_text

Added call to pdf_scanned_to_text

@bryannho

@neelasha23 (Contributor, Author) commented Apr 9, 2024

  1. I tried again with this native placeholder.pdf and it returned the same error list index out of range

It happened because there was just one line in the PDF. Fixed

  2. could you try to send that message after the file upload has completed but before the app starts parsing the document?

It was already sent after the file upload and before the document parsing. But I think the message was getting stuck because of the asynchronous nature of calling pdf_to_text. I have converted the PDF conversion call to async, as sketched below. There might still be a lag in the message display depending on the size of the PDF.
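
Roughly, the change looks like this (a sketch using Chainlit's AskFileMessage and make_async; the handler structure and the pdf_to_text call are illustrative, not the exact code in the commit):

```python
import chainlit as cl

@cl.on_chat_start
async def on_chat_start():
    files = await cl.AskFileMessage(
        content="Please upload a PDF to begin.", accept=["application/pdf"]
    ).send()
    pdf = files[0]

    # send the status message as soon as the upload completes, before parsing starts
    msg = cl.Message(content=f"Processing {pdf.name}...")
    await msg.send()

    # run the blocking PDF-to-text conversion in a worker thread so the event
    # loop stays free and the message above is actually displayed
    text = await cl.make_async(pdf_to_text)(pdf.path)

    msg.content = f"{pdf.name} processed. You can now ask questions!"
    await msg.update()
```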

Deployed updated app here: https://polished-night-8566.ploomberapp.io/

@bryannho @edublancas

@edublancas (Contributor)

the answers look off (I tried this paper: https://arxiv.org/abs/2402.00838):

[screenshot of the incorrect answer]

@neelasha23 (Contributor, Author) commented Apr 9, 2024

the answers look off (I tried this paper: https://arxiv.org/abs/2402.00838):

I think it's confusing the title with the ones in the References section.

[Screenshots: 2024-04-09 10:11 PM and 10:16 PM]

It is answering specific questions about the content though:

[Screenshot: 2024-04-09 10:13 PM]

@edublancas

@edublancas (Contributor)

I think we should fix it, those basic questions should be answered correctly.

check how it's done here: https://github.com/ploomber/doc/tree/main/examples/panel/chat-with-pdf. I remember testing that example and it worked fine when asking about the abstract and title

two things might happen:

  • the issue might be in how we pass the info to langchain; in that case, check if you want to change the settings
  • maybe the problem is the content returned by our parsing functions; in that case, let us know so we can see how we can improve it

@neelasha23 (Contributor, Author)

The Panel one is also mistaking the title:

[Screenshot: 2024-04-09 10:24 PM]

But the 2nd response is better:

[Screenshot: 2024-04-09 10:26 PM]

I'll still check what can be improved

@edublancas

@edublancas (Contributor)

cool. yeah, compare the answers from both examples, and try to see if you can improve it a bit.

spend ~2 hours on this; no need to spend more. let me know your conclusions so we know if we should keep digging or publish as is

@neelasha23 (Contributor, Author) commented Apr 10, 2024

I ended up spending a bit more time on this because, even after adding the same settings as the Panel app, the results did not match the outputs I saw in the Panel app yesterday. But when I tested the Panel app again, I realized it's not performing any better: the results are inconsistent, and it might give the correct answer once in a while by chance. Also, this issue is mostly with the OLMo paper; I found better results with other papers. Below are the detailed observations and next steps:

Changes made to the Chainlit app (as per the Panel one)
Here's the commit. A rough sketch of the combined setup follows this list.

  • Changed RecursiveCharacterTextSplitter to CharacterTextSplitter
  • Converted the PDF extracted using the Ploomber API to a list of Document objects. Each object looks like this:
    Document(page_content="text", metadata={"source": page_path, "page": page_number})
  • Since we now have Document objects, converted LanceDB.from_texts to LanceDB.from_documents.
  • Converted ConversationalRetrievalChain to RetrievalQA
  • Added search_type="similarity"
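
Putting those changes together, the retrieval setup looks roughly like this (a sketch under assumptions: `pages` stands in for the output of the PDF extraction step, and chunk sizes and model settings are illustrative rather than the exact values in the commit):

```python
from langchain.chains import RetrievalQA
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import LanceDB
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# `pages` is an assumption: a list of (page_path, page_number, text) tuples
# produced by the PDF extraction step
docs = [
    Document(page_content=text, metadata={"source": page_path, "page": page_number})
    for page_path, page_number, text in pages
]

# CharacterTextSplitter instead of RecursiveCharacterTextSplitter, as in the Panel app
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# from_documents keeps the per-page metadata alongside the embedded text
vectorstore = LanceDB.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_type="similarity")

# RetrievalQA instead of ConversationalRetrievalChain
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=0), retriever=retriever)
answer = qa.invoke({"query": "What is the title of this paper?"})
```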

Panel app observations (inconsistent results on the OLMo paper)
[Screenshot: 2024-04-10 4:53 PM]

It has picked up from the references:
[Screenshot: 2024-04-10 4:54 PM]

Another attempt:

[Screenshot: 2024-04-10 4:59 PM]

Chainlit output

Paper: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

[Screenshot: 2024-04-10 5:20 PM]

Conclusions

  1. It looks like in both cases the vector search is biased towards the References section, which might be because it's towards the end of the document.
  2. We can research more on how to utilize the metadata in the Document object to improve the results.
  3. We can explore a RAG-based approach and index different parts of the document separately so the retriever doesn't get confused.
  4. The pdf_to_text and scanned_pdf_to_text functions could take a parameter for returning results in the Document format mentioned above. That would make them compatible with langchain (in case it's relevant for users); a possible shape is sketched below.
  5. I think the vector DB should not make too much of a difference, but we can check the differences between LanceDB and Chroma (that's the only remaining difference between the Chainlit app and the Panel app). I did try with Chroma as well, and it didn't make any difference.
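
For point 4, a hypothetical sketch of what that Document-returning variant could look like (the per-page return shape of pdf_to_text and the wrapper name are assumptions, not the current API):

```python
from langchain_core.documents import Document

def pdf_to_documents(path: str) -> list[Document]:
    """Hypothetical wrapper: extract a PDF and return langchain Document objects
    with per-page metadata instead of plain text."""
    pages = pdf_to_text(path)  # assumed here to return one text string per page
    return [
        Document(page_content=text, metadata={"source": path, "page": number})
        for number, text in enumerate(pages, start=1)
    ]
```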

Deployed updated app: https://www.platform.ploomber.io/applications/steep-salad-8357/0f7566de

@edublancas

neelasha23 requested a review from bryannho April 10, 2024 12:56
edublancas merged commit 5d53e9d into main Apr 11, 2024
1 check passed
edublancas deleted the issue167 branch April 11, 2024 03:03
@edublancas (Contributor)

@neelasha23 please prepare social media posts and include the link to the app, let's keep the example running for a week
