This project scrapes a given website and uses a vector database (Weaviate) to answer client queries for the closest word present in the scraped dataset. It is currently built as a proof of concept. As a future enhancement, it should be hosted on a server with its functionality exposed through APIs.
- Installation
- Configurations
Dependencies:
- Node version: v18.10.0
- Docker Desktop
- puppeteer - Project Dependency
- weaviate-ts-client - Project Dependency
Installation steps:
- Clone the project into your system.
- Install Node on your system (using Homebrew):
- brew install node@18
- Install all the dependencies of the project:
- npm install
- Make sure Docker is installed and running:
- Go to the project folder and run "docker compose up -d" to start a container running the Weaviate server.
- Set the two inputs in the index.ts file: websiteName (string) and keyword (string).
- Run the project using "npm start"
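
The input step above can be sketched as follows. This is only an illustration: the `CrawlInputs` interface and the `validateInputs` helper are hypothetical names that do not come from the project, and the sketch assumes `websiteName` holds a full http(s) URL.

```typescript
// Hypothetical shape for the two values set in index.ts.
interface CrawlInputs {
  websiteName: string; // assumed to be a full URL, e.g. "https://example.com"
  keyword: string; // word to look up in the crawled dataset
}

// Hypothetical helper: fail early on bad inputs before crawling starts.
function validateInputs(inputs: CrawlInputs): CrawlInputs {
  // The URL constructor throws a TypeError if the string is not a valid URL.
  const url = new URL(inputs.websiteName);
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`Unsupported protocol: ${url.protocol}`);
  }
  const keyword = inputs.keyword.trim();
  if (keyword.length === 0) {
    throw new Error("keyword must not be empty");
  }
  // url.href is the normalized form of the address.
  return { websiteName: url.href, keyword };
}

const inputs = validateInputs({
  websiteName: "https://example.com",
  keyword: " vector ",
});
console.log(inputs.websiteName, inputs.keyword);
```

Checking the inputs before the crawl starts means a typo in the URL surfaces immediately instead of partway through scraping or indexing.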
- Two third-party APIs are used:
- Hugging Face (inference model that computes the vector weights for a given text or query)
- Weaviate client, weaviate-ts-client (for indexing data in the Weaviate vector database with the vector weights obtained from Hugging Face, and for querying it)
- Because the free Hugging Face model is used, vectorizing a large amount of data will cause rate-limit errors.
- In addition, the Cohere LLM could be used to generate answers to user queries from the crawled dataset.
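
One common way to soften the rate-limit issue noted above is to vectorize text in small batches and retry with exponential backoff. The following is a minimal sketch under stated assumptions, not the project's implementation: `chunk`, `backoffDelayMs`, `isRateLimited`, `withBackoff`, `embedBatch`, and `embedAll` are hypothetical names, the rate-limit error is assumed to carry a numeric `status` of 429, and no real Hugging Face call is shown.

```typescript
// Split items into batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** attempt;
}

// Assumption: a rate-limited call throws an object with `status === 429`.
function isRateLimited(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { status?: unknown }).status === 429
  );
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run `task`, retrying only on rate-limit errors, up to `maxRetries` retries.
async function withBackoff<T>(
  task: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (!isRateLimited(err) || attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt, baseDelayMs));
    }
  }
}

// Stand-in for the real embedding call (hypothetical; returns placeholder vectors).
async function embedBatch(texts: string[]): Promise<number[][]> {
  return texts.map((text) => [text.length]);
}

// Embed all texts two at a time, retrying each batch on rate-limit errors.
async function embedAll(texts: string[]): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, 2)) {
    vectors.push(...(await withBackoff(() => embedBatch(batch))));
  }
  return vectors;
}
```

Smaller batches reduce the chance of hitting the limit at all, and doubling the wait after each failure gives the service time to recover. The batch size and delays here are placeholders to tune against the actual limits.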