Added info on custom data loader

pranav-kural · pranav-kural · commit 90c276b06597 · 2024-07-15T13:18:47.000-04:00
diff --git a/pages/rag-guide/data-loaders.mdx b/pages/rag-guide/data-loaders.mdx
@@ -51,10 +51,56 @@ You can specify data loading options through the retriever configurations object
 
 ## Loading Custom Data
 
-When you want to load data from a file that is not supported by QvikChat out of the box, you can use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property when configuring the retriever. 
+When you want to load data from a file that is not supported by QvikChat out of the box, you can use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property when configuring the retriever.
 
 ### Example - Loading Data from Webpage
 
-Here is an example of how you can load data from a webpage using the `htmlLoader` from LangChain:
+Below is an example of how you can load data from a webpage.
 
-```typescript
+By default, QvikChat doesn't provide a data loader for web pages. So, in this example, we are going to use a custom web loader from LangChain to load data from a webpage.
+
+In this example, we're going to use the [Cheerio](https://js.langchain.com/v0.2/docs/integrations/document_loaders/web_loaders/web_cheerio) web loader. Cheerio is a fast and lightweight library that can help you extract data from web pages, without the need for a full browser environment.
+
+```typescript filename="src/index.ts"
+import { setupGenkit, runServer } from "@oconva/qvikchat/genkit";
+import { defineChatEndpoint } from "@oconva/qvikchat/endpoints";
+import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
+
+// Setup Genkit
+setupGenkit();
+
+// Method to define all endpoints of the project and run the server
+const defineEndpointsRunServer = async () => {
+  // configure and instantiate data loader
+  const loader = new CheerioWebBaseLoader(
+    "https://qvikchat.pkural.ca/rag-guide"
+  );
+
+  // load data and get docs
+  const docs = await loader.load();
+
+  // define RAG chat endpoint with docs
+  defineChatEndpoint({
+    endpoint: "chat",
+    enableRAG: true,
+    topic: "QvikChat - RAG chat endpoint",
+    retrieverConfig: {
+      docs: docs,
+      dataType: "text",
+      generateEmbeddings: true,
+    },
+  });
+
+  // Run server
+  runServer();
+};
+
+// execute method to define endpoints and run server
+defineEndpointsRunServer();
+```
+
+For the above example, the full source code and the instructions to run it can be found [here](https://github.com/oconva/qvikchat-examples/tree/main/examples/rag-chat-webpage).
+
+**GIGO (Garbage In, Garbage Out):**
+
+For best results, spend some extra time planning how you collect data for Retrieval Augmented Generation (RAG). Without proper cleaning and preprocessing, the data may contain a lot of noise, which leads to poor response quality. For example, data crawled from a webpage without proper planning may contain a lot of irrelevant information, such as ads and navigation links, which degrades the performance of the chat endpoints. Moreover, data comes in all shapes and sizes: a single webpage or Word document can contain several types of information, such as code, text, and tables. It is important to process each of these types using an appropriate strategy.
diff --git a/pages/rag-guide/data-retrieval.mdx b/pages/rag-guide/data-retrieval.mdx
@@ -27,9 +27,9 @@ The `getDataRetriever` method and the `retrieverConfig` property in the chat end
 
 **Optional properties**
 
-- `dataType`: The type of data to load. This helps ascertain the best splitting strategy. If not specified, the data type is inferred from the file extension.
-- `docs`: An array of `Document` objects containing the data to load. This is useful when you want to load data from a source not supported by QvikChat by default. You can use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property.
-- `splitDocs`: An array containing documents that have been processed through a data splitter. This is useful when you want to load data from a source not supported by QvikChat by default. You can use any [LangChain-supported text splitter](https://js.langchain.com/v0.2/docs/how_to/#text-splitters) to split the data and provide the split documents as the `splitDocs` property. If providing `splitDocs`, you do not need to provide `docs`.
+- `dataType`: The type of data to load. This helps ascertain the best splitting strategy. When you provide `filePath`, specifying `dataType` is optional; if it is not specified, the data type is inferred from the file extension. When you do not provide `filePath` (i.e., you are specifying `docs` or `splitDocs`), you must provide `dataType`, since there is no file extension to infer it from.
+- `docs`: An array of `Document` objects containing the data to load. This is useful when you want to load data from a source not supported by QvikChat by default. You can use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property. If providing `docs`, you do not need to provide `filePath`, but you must provide the `dataType`.
+- `splitDocs`: An array containing documents that have been processed through a data splitter. This is useful when you want to load data from a source not supported by QvikChat by default. You can use any [LangChain-supported text splitter](https://js.langchain.com/v0.2/docs/how_to/#text-splitters) to split the data and provide the split documents as the `splitDocs` property. If providing `splitDocs`, you do not need to provide `docs`, but you must provide the `dataType`.
 - `jsonLoaderKeysToInclude`: An object containing the keys to include when loading JSON data. This is useful when you want to load only specific keys from the JSON data.
 - `csvLoaderOptions`: An object containing options to specify when loading CSV data. This is useful when you want to specify the delimiter and other options when loading CSV data.
 - `pdfLoaderOptions`: An object containing options to specify when loading PDF data. This is useful when you want to specify additional options when loading PDF data.
diff --git a/public/robots.txt b/public/robots.txt
@@ -0,0 +1 @@
+Sitemap: https://qvikchat.pkural.ca/sitemap.xml
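
Note on the `dataType` rule added in `data-retrieval.mdx`: the sketch below illustrates that rule only and is not QvikChat code. The `RetrieverConfigSketch` interface and the `checkRetrieverConfig` helper are hypothetical names introduced here for illustration. They model just the four properties discussed in the diff and assume nothing about how the library itself validates its configuration.

```typescript
// Hypothetical, simplified shape of the retriever options discussed in the diff.
// The real QvikChat configuration type has more fields; this is an illustration only.
interface RetrieverConfigSketch {
  filePath?: string;
  docs?: unknown[];
  splitDocs?: unknown[];
  dataType?: string;
}

// Returns a list of problems; an empty list means the documented rule is satisfied.
function checkRetrieverConfig(config: RetrieverConfigSketch): string[] {
  const problems: string[] = [];
  const hasInlineDocs =
    config.docs !== undefined || config.splitDocs !== undefined;

  // Without filePath there is no file extension to infer the data type from,
  // so dataType has to be given explicitly alongside docs or splitDocs.
  if (
    config.filePath === undefined &&
    hasInlineDocs &&
    config.dataType === undefined
  ) {
    problems.push(
      "dataType is required when passing docs or splitDocs without filePath"
    );
  }
  return problems;
}
```

Under these assumptions, a configuration like the one in the Cheerio example (`docs` together with `dataType: "text"`) yields an empty list, while passing `docs` alone yields one problem.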
