This repository was archived by the owner on Aug 1, 2024. It is now read-only.

Commit 90c276b

Added info on custom data loader
1 parent 14dd1a6 commit 90c276b

3 files changed

+53
-6
lines changed

pages/rag-guide/data-loaders.mdx

+49 -3
@@ -51,10 +51,56 @@ You can specify data loading options through the retriever configurations object
 
 ## Loading Custom Data
 
-When you want to load data from a file that is not supported by QvikChat out of the box, you can use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property when configuring the retriever.
+When you want to load data from a file that is not supported by QvikChat out of the box, you can use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property when configuring the retriever.
 
 ### Example - Loading Data from Webpage
 
-Here is an example of how you can load data from a webpage using the `htmlLoader` from LangChain:
+Below is an example of how you can load data from a webpage.
 
-```typescript
+QvikChat by default doesn't provide a data loader for web pages. So, in this example, we are going to use a custom web loader from LangChain to load data from a webpage.
+
+In this example, we're going to use the [Cheerio](https://js.langchain.com/v0.2/docs/integrations/document_loaders/web_loaders/web_cheerio) web loader. Cheerio is a fast and lightweight library that can help you extract data from web pages, without the need for a full browser environment.
+
+```typescript filename="src/index.ts"
+import { setupGenkit, runServer } from "@oconva/qvikchat/genkit";
+import { defineChatEndpoint } from "@oconva/qvikchat/endpoints";
+import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
+
+// Setup Genkit
+setupGenkit();
+
+// Method to define all endpoints of the project and run the server
+const defineEndpointsRunServer = async () => {
+  // configure and instantiate data loader
+  const loader = new CheerioWebBaseLoader(
+    "https://qvikchat.pkural.ca/rag-guide"
+  );
+
+  // load data and get docs
+  const docs = await loader.load();
+
+  // define RAG chat endpoint with docs
+  defineChatEndpoint({
+    endpoint: "chat",
+    enableRAG: true,
+    topic: "QvikChat - RAG chat endpoint",
+    retrieverConfig: {
+      docs: docs,
+      dataType: "text",
+      generateEmbeddings: true,
+    },
+  });
+
+  // Run server
+  runServer();
+};
+
+// execute method to define endpoints and run server
+defineEndpointsRunServer();
+```
+
+For the above example, the full source code and the instructions to run it can be found [here](https://github.com/oconva/qvikchat-examples/tree/main/examples/rag-chat-webpage).
+
+**GIGO (Garbage In, Garbage Out):**
+
+For best performance, it is highly recommended that you spend some extra time preparing a strategy for data collection for Retrieval Augmented Generation (RAG). Without proper data cleaning and preprocessing, the data may contain a lot of noise, which can lead to poor response quality. For example, when crawling data from a webpage, if done without proper planning, the data may contain a lot of irrelevant information such as ads, navigation links, etc. This can lead to poor performance of the chat endpoints. Moreover, data comes in all shapes and sizes. A single webpage or a Word document can contain information of various types, like code, text, tables, etc. It is important to handle and process these different types of information using an appropriate strategy.
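
The cleanup step recommended in the GIGO note can be sketched roughly as follows. This is a minimal illustration and not part of QvikChat or LangChain: the `Doc` interface is a stand-in for the shape of a LangChain `Document`, and `NOISE_PATTERNS` and `cleanDocs` are hypothetical names whose patterns would need tuning for the pages you actually crawl.

```typescript
// Minimal stand-in for the shape of a LangChain `Document` (assumption:
// only `pageContent` and `metadata` matter for this sketch).
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical noise markers; adjust these for your own sources.
const NOISE_PATTERNS: RegExp[] = [
  /^skip to content$/i,
  /^(home|menu|search|sign in|log in)$/i,
  /^(advertisement|sponsored)\b/i,
  /^(©|copyright\b)/i,
];

// Collapse whitespace, remove noisy lines, and drop documents left empty.
function cleanDocs(docs: Doc[]): Doc[] {
  return docs
    .map((doc) => {
      const lines = doc.pageContent
        .split(/\r?\n/)
        .map((line) => line.replace(/\s+/g, " ").trim())
        .filter((line) => line.length > 0)
        .filter((line) => !NOISE_PATTERNS.some((pattern) => pattern.test(line)));
      return { ...doc, pageContent: lines.join("\n") };
    })
    .filter((doc) => doc.pageContent.length > 0);
}
```

Under these assumptions, the documents returned by a loader could be passed through `cleanDocs` before being handed to the retriever as `docs`. Line-based filtering is only one possible strategy; content such as tables or code may need its own handling.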

pages/rag-guide/data-retrieval.mdx

+3 -3
@@ -27,9 +27,9 @@ The `getDataRetriever` method and the `retrieverConfig` property in the chat end
 
 **Optional properties**
 
-- `dataType`: The type of data to load. This helps ascertain the best splitting strategy. If not specified, the data type is inferred from the file extension.
-- `docs`: An array of `Document` objects containing the data to load. This is useful when you want to load data from a source not supported by QvikChat by default. You can use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property.
-- `splitDocs`: An array containing documents that have been processed through a data splitter. This is useful when you want to load data from a source not supported by QvikChat by default. You can use any [LangChain-supported text splitter](https://js.langchain.com/v0.2/docs/how_to/#text-splitters) to split the data and provide the split documents as the `splitDocs` property. If providing `splitDocs`, you do not need to provide `docs`.
+- `dataType`: The type of data to load. This helps ascertain the best splitting strategy. When providing `filePath`, specifying `dataType` is optional. If not specified, the data type is inferred from the file extension. If not providing `filePath`, i.e., you are specifying `docs` or `splitDocs`, you must provide the `dataType`, since it cannot be inferred from the file extension.
+- `docs`: An array of `Document` objects containing the data to load. This is useful when you want to load data from a source not supported by QvikChat by default. You can use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property. If providing `docs`, you do not need to provide `filePath` but you must provide the `dataType`.
+- `splitDocs`: An array containing documents that have been processed through a data splitter. This is useful when you want to load data from a source not supported by QvikChat by default. You can use any [LangChain-supported text splitter](https://js.langchain.com/v0.2/docs/how_to/#text-splitters) to split the data and provide the split documents as the `splitDocs` property. If providing `splitDocs`, you do not need to provide `docs`. You must provide the `dataType`.
 - `jsonLoaderKeysToInclude`: An object containing the keys to include when loading JSON data. This is useful when you want to load only specific keys from the JSON data.
 - `csvLoaderOptions`: An object containing options to specify when loading CSV data. This is useful when you want to specify the delimiter and other options when loading CSV data.
 - `pdfLoaderOptions`: An object containing options to specify when loading PDF data. This is useful when you want to specify additional options when loading PDF data.
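
The `dataType` rule in the list above (optional with `filePath`, required with `docs` or `splitDocs`) can be illustrated with a small sketch. This is not QvikChat's implementation: `RetrieverConfigSketch`, `EXTENSION_TO_TYPE`, and `resolveDataType` are hypothetical names, and the set of data types and the extension mapping are assumptions made for the example.

```typescript
// Hypothetical subset of data types, for illustration only.
type DataType = "text" | "json" | "csv" | "pdf";

// Hypothetical, simplified view of the retriever options described above.
interface RetrieverConfigSketch {
  filePath?: string;
  docs?: unknown[];
  splitDocs?: unknown[];
  dataType?: DataType;
}

// Assumed extension-to-type mapping; the library's real mapping may differ.
const EXTENSION_TO_TYPE = new Map<string, DataType>([
  ["txt", "text"],
  ["json", "json"],
  ["csv", "csv"],
  ["pdf", "pdf"],
]);

function resolveDataType(config: RetrieverConfigSketch): DataType {
  // An explicitly provided dataType always takes precedence.
  if (config.dataType) return config.dataType;

  // Without a file path there is no extension to infer the type from.
  if (!config.filePath) {
    throw new Error(
      "dataType is required when providing docs or splitDocs without filePath"
    );
  }

  // Infer the type from the (case-insensitive) file extension.
  const ext = config.filePath.split(".").pop()?.toLowerCase() ?? "";
  const inferred = EXTENSION_TO_TYPE.get(ext);
  if (!inferred) {
    throw new Error(`Cannot infer dataType from file extension ".${ext}"`);
  }
  return inferred;
}
```

In this sketch, `resolveDataType({ filePath: "data/faq.csv" })` infers the type from the extension, while a config that only carries `docs` must name its `dataType` explicitly or the call throws.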

public/robots.txt

+1
@@ -0,0 +1 @@
+Sitemap: https://qvikchat.pkural.ca/sitemap.xml
