This repository was archived by the owner on Aug 1, 2024. It is now read-only.
pages/rag-guide/data-loaders.mdx
+49 -3
@@ -51,10 +51,56 @@ You can specify data loading options through the retriever configurations object

 ## Loading Custom Data

-When you want to load data from a file that is not supported by QvikChat out of the box, you can use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property when configuring the retriever.
+When you want to load data from a file that is not supported by QvikChat out of the box, you can use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property when configuring the retriever.

 ### Example - Loading Data from Webpage

-Here is an example of how you can load data from a webpage using the `htmlLoader` from LangChain:
+Below is an example of how you can load data from a webpage.

-```typescript
+QvikChat by default doesn't provide a data loader for web pages. So, in this example, we are going to use a custom web loader from LangChain to load data from a webpage.
+
+In this example, we're going to use the [Cheerio](https://js.langchain.com/v0.2/docs/integrations/document_loaders/web_loaders/web_cheerio) web loader. Cheerio is a fast and lightweight library that can help you extract data from web pages, without the need for a full browser environment.
+```typescript
+// Method to define all endpoints of the project and run the server
+const defineEndpointsRunServer = async () => {
+  // configure and instantiate data loader
+  const loader = new CheerioWebBaseLoader(
+    "https://qvikchat.pkural.ca/rag-guide"
+  );
+
+  // load data and get docs
+  const docs = await loader.load();
+
+  // define RAG chat endpoint with docs
+  defineChatEndpoint({
+    endpoint: "chat",
+    enableRAG: true,
+    topic: "QvikChat - RAG chat endpoint",
+    retrieverConfig: {
+      docs: docs,
+      dataType: "text",
+      generateEmbeddings: true,
+    },
+  });
+
+  // Run server
+  runServer();
+};
+
+// execute method to define endpoints and run server
+defineEndpointsRunServer();
+```
+
+For the above example, the full source code and the instructions to run it can be found [here](https://github.com/oconva/qvikchat-examples/tree/main/examples/rag-chat-webpage).
+
+**GIGO (Garbage In, Garbage Out):**
+
+For best performance, it is highly recommended that you spend some extra time preparing a strategy for data collection for Retrieval Augmented Generation (RAG). Without proper data cleaning and preprocessing, the data may contain a lot of noise, which can also lead to poor response quality. For example, when crawling data from a webpage, if done without proper planning, the data may contain a lot of irrelevant information such as ads, navigation links, etc. This can lead to poor performance of the chat endpoints. Moreover, data comes in all shapes and sizes. A single webpage or a word document can contain information of various types, like code, text, tables, etc. It is important to handle and process these different types of information using an appropriate strategy.
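
The cleaning step recommended in the GIGO note could look roughly like the sketch below. It is only an illustration and not part of QvikChat or LangChain: `SimpleDoc` is a stand-in that mirrors just the `pageContent` and `metadata` fields of a LangChain `Document`, and `cleanDocs`, `NOISE_PATTERNS`, and the `minLength` threshold are hypothetical names and values chosen for this example.

```typescript
// Minimal stand-in for a LangChain Document (assumption: only these two fields are needed).
interface SimpleDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical patterns for whole lines that are usually noise on crawled web pages.
const NOISE_PATTERNS: RegExp[] = [
  /^(home|about|contact|sign in|log in|subscribe)$/i, // navigation links
  /^advertisement$/i, // ad placeholders
  /^(accept|reject) (all )?cookies$/i, // cookie banners
];

// Drop noisy lines, collapse whitespace, and discard docs left with too little text.
function cleanDocs(docs: SimpleDoc[], minLength = 20): SimpleDoc[] {
  return docs
    .map((doc) => {
      const lines = doc.pageContent
        .split(/\r?\n/)
        .map((line) => line.replace(/\s+/g, " ").trim())
        .filter((line) => line.length > 0)
        .filter((line) => !NOISE_PATTERNS.some((p) => p.test(line)));
      return { ...doc, pageContent: lines.join("\n") };
    })
    .filter((doc) => doc.pageContent.length >= minLength);
}

// Usage: the navigation and ad lines are removed, the real sentence is kept.
const cleaned = cleanDocs([
  {
    pageContent: "Home\nAdvertisement\nQvikChat   supports RAG chat endpoints.\n",
    metadata: { source: "example" },
  },
]);
console.log(cleaned[0].pageContent);
```

In a real project, a function like this might be applied to the output of `loader.load()` before the result is passed as `docs`. The pattern list would need tuning for each data source, and structured content such as code or tables would likely need its own handling.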
pages/rag-guide/data-retrieval.mdx
+3 -3
@@ -27,9 +27,9 @@ The `getDataRetriever` method and the `retrieverConfig` property in the chat end

 **Optional properties**

-- `dataType`: The type of data to load. This helps ascertain the best splitting strategy. If not specified, the data type is inferred from the file extension.
-- `docs`: An array of `Document` objects containing the data to load. This is useful when you want to load data from a source not supported by QvikChat by default. You can use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property.
-- `splitDocs`: An array containing documents that have been processed through a data splitter. This is useful when you want to load data from a source not supported by QvikChat by default. You can use any [LangChain-supported text splitter](https://js.langchain.com/v0.2/docs/how_to/#text-splitters) to split the data and provide the split documents as the `splitDocs` property. If providing `splitDocs`, you do not need to provide `docs`.
+- `dataType`: The type of data to load. This helps ascertain the best splitting strategy. When providing `filePath`, specifying `dataType` is optional. If not specified, the data type is inferred from the file extension. If not providing `filePath`, i.e., you are specifying `docs` or `splitDocs`, you must provide the `dataType`, since it cannot be inferred from the file extension.
+- `docs`: An array of `Document` objects containing the data to load. This is useful when you want to load data from a source not supported by QvikChat by default. You can use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property. If providing `docs`, you do not need to provide `filePath` but you must provide the `dataType`.
+- `splitDocs`: An array containing documents that have been processed through a data splitter. This is useful when you want to load data from a source not supported by QvikChat by default. You can use any [LangChain-supported text splitter](https://js.langchain.com/v0.2/docs/how_to/#text-splitters) to split the data and provide the split documents as the `splitDocs` property. If providing `splitDocs`, you do not need to provide `docs`. You must provide the `dataType`.
 - `jsonLoaderKeysToInclude`: An object containing the keys to include when loading JSON data. This is useful when you want to load only specific keys from the JSON data.
 - `csvLoaderOptions`: An object containing options to specify when loading CSV data. This is useful when you want to specify the delimiter and other options when loading CSV data.
 - `pdfLoaderOptions`: An object containing options to specify when loading PDF data. This is useful when you want to specify additional options when loading PDF data.
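
The `dataType` rules added in this change can be summarized in a small sketch. It assumes a simplified, hypothetical options shape (`RetrieverSourceOptions`) and a hypothetical helper (`resolveDataType`); neither is QvikChat's actual type or implementation. As a further simplification, the helper returns the raw file extension, since the mapping from extensions to QvikChat's data type values is not shown in this diff.

```typescript
// Hypothetical, simplified shape of the retriever options discussed above.
// This is NOT QvikChat's actual type definition.
interface RetrieverSourceOptions {
  filePath?: string;
  dataType?: string;
  docs?: unknown[];
  splitDocs?: unknown[];
}

// Resolve the data type following the documented rules:
// 1. an explicitly provided `dataType` is used as-is (assumed here);
// 2. with `filePath` and no `dataType`, fall back to the file extension;
// 3. with only `docs` or `splitDocs`, `dataType` is required.
function resolveDataType(options: RetrieverSourceOptions): string {
  if (options.dataType) return options.dataType;
  if (options.filePath) {
    const match = /\.([a-z0-9]+)$/i.exec(options.filePath);
    if (match) return match[1].toLowerCase();
    throw new Error(`Cannot infer data type from "${options.filePath}"`);
  }
  throw new Error("dataType is required when providing docs or splitDocs");
}

// Usage: inferred from the extension when only a file path is given.
const inferred = resolveDataType({ filePath: "data/faq.csv" });
// Usage: must be stated explicitly when documents are passed directly.
const explicit = resolveDataType({ docs: [], dataType: "text" });
console.log(inferred, explicit);
```

Calling `resolveDataType({ docs: [] })` throws in this sketch, which reflects the new requirement that `dataType` accompany `docs` or `splitDocs`.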