Replies: 6 comments
-
Why do you need to inspect the docstore? We don't do any special encoding when saving to disk.
-
I have the same question: in my docstore, Norwegian diacritic characters are also Unicode-escaped. I didn't set any parameter that could cause this. The docstore file is UTF-8 encoded, so I don't understand why the characters are escaped.
-
Unicode escaping can be avoided by passing the argument ensure_ascii=False to json.dump(). However, I believe the escaped output has no negative impact as it is.
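To illustrate the point above, here is a minimal sketch using only Python's standard json module. The sample string is the decoded form of the escaped text in the question; the dictionary shape is an assumption made for illustration and is not the actual docstore layout.

```python
import json

# Sample text: the decoded form of the escaped string shown in the question.
text = "民事裁判において"

# Default behaviour (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps({"text": text})

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps({"text": text}, ensure_ascii=False)

print(escaped)   # {"text": "\u6c11\u4e8b\u88c1\u5224\u306b\u304a\u3044\u3066"}
print(readable)  # {"text": "民事裁判において"}

# Both forms decode to the same Python string, so loading the data is unaffected.
assert json.loads(escaped) == json.loads(readable) == {"text": text}
```

The escaped form is plain ASCII, which is also valid UTF-8. That is why a file can be UTF-8 encoded and still contain \uXXXX sequences.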
-
Hi, @tmforai60! I'm Dosu, and I'm helping the LlamaIndex team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you raised is about Japanese characters being saved as escaped Unicode instead of UTF-8. There have been some discussions in the comments about inspecting the docstore and the impact of Unicode escapes on data retrieval. One user suggested using the argument ensure_ascii=False with json.dump().

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LlamaIndex repository. If it is, please let the LlamaIndex team know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to LlamaIndex!
-
🤖 Hello,

Thank you for reaching out with your question. It seems you're experiencing an issue with Japanese characters being saved as escaped Unicode in your document store.

Based on the information you've provided and the context from similar issues, the issue might be related to how text encoding is handled during the data loading process. Here is the relevant code from one of the LlamaIndex reader classes:

    with open(file, "r", encoding="utf-8") as html_file:
        soup = BeautifulSoup(html_file, "html.parser")

And from another:

    def load_data(
        self, file: Path, extra_info: Optional[Dict] = None
    ) -> List[Document]:
        """Parse file into string."""
        with open(file, encoding="utf-8") as f:
            content = f.read()
        metadata = {"filename": file.name, "extension": file.suffix}
        if extra_info:
            metadata = {**metadata, **extra_info}
        return [Document(text=content, metadata=metadata)]

In both cases, the encoding is set to "utf-8" when opening a file for reading. If the file's content is not compatible with that encoding, a decoding error can occur.

However, it's important to note that these classes are used for reading data from files, not for saving data to the document store. The issue you're experiencing might be related to how the data is saved to the document store, not how it's read from the files.

Unfortunately, without more information about how you're saving the data to the document store, it's difficult to provide a more specific solution. Could you please provide more details about this part of your code? Specifically, it would be helpful to know how you're creating the documents that you're saving to the document store, and how you're saving them.

Thank you for your cooperation.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
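As a side note on the reading path described in the reply above, here is a minimal sketch of what happens in plain Python when a file's bytes are not valid UTF-8. It uses only the standard library; the file name and the choice of Shift JIS as the mismatched encoding are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"  # hypothetical file, created only for this demo
    # Write Japanese text as Shift JIS bytes, which are not valid UTF-8.
    path.write_bytes("民事裁判".encode("shift_jis"))

    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        decoded_ok = True
    except UnicodeDecodeError:
        # Reading with the wrong encoding fails instead of producing escapes.
        decoded_ok = False

print(decoded_ok)  # False
```

A mismatch on the reading side shows up as an error like this one, not as \uXXXX escapes in the saved docstore, which is consistent with the escaping coming from the JSON serialization step.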
-
Hello, I am facing the same issue in a method of the vector store utils (llama_index/core/vector_stores/utils.py).

Besides, I would also suggest adding the metadata keys depending on the vector store object created. Currently, the following 3 keys are created regardless of the vector store used, although they are clearly not shared by all vector stores, as commented. The similar but different naming is causing confusion.

Modified method:
-
Question
Hello,
I created and saved a docstore as shown below.
Its text is stored as escaped Unicode rather than as Japanese characters (UTF-8), which is what I want.
How can I solve the problem?
Thank you for your help.
docstore
{"docstore/metadata":...}, "docstore/data": {"...": {"data": {"text": "\u6c11\u4e8b\u88c1\u5224\u306b\u304a\u3044\u3066...", "doc_id": ...
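If a human-readable copy of an already saved file is wanted, one option is to re-save the JSON outside LlamaIndex. The following is a minimal sketch that uses only the standard library; the helper name rewrite_json_readable and the file names are hypothetical, and it assumes the persisted docstore is an ordinary JSON file like the excerpt above.

```python
import json
import tempfile
from pathlib import Path


def rewrite_json_readable(src: Path, dst: Path) -> None:
    """Re-save a JSON file so non-ASCII text is stored as UTF-8 characters."""
    data = json.loads(src.read_text(encoding="utf-8"))
    dst.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")


# Demo on a throwaway file; a real docstore path would be passed instead.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "escaped.json"
    dst = Path(tmp) / "readable.json"
    # The file content is the literal text {"text": "\u6c11\u4e8b"}.
    src.write_text('{"text": "\\u6c11\\u4e8b"}', encoding="utf-8")
    rewrite_json_readable(src, dst)
    result = dst.read_text(encoding="utf-8")

print(result)  # {"text": "民事"}
```

The rewritten file holds the same data as the original, so either copy loads to identical Python objects with json.load.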