Project Title

This project scrapes a given website and uses a vector database (Weaviate) to answer client queries for the closest matching word in the crawled data set. It is currently built as a proof of concept. As a future enhancement, it should be hosted on a server with APIs exposed.

Installation

Dependencies:
1. Node version: v18.10.0
2. Docker Desktop
3. puppeteer - Project Dependency
4. weaviate-ts-client - Project Dependency
Installation steps:
0. Clone the project onto your system.
1. Install Node on your system:
  - brew install [email protected]
2. Install all the dependencies of the project:
  - npm install
3. Make sure you have Docker installed:
  - https://www.docker.com/products/docker-desktop/
4. Go to the project folder and run "docker compose up -d" to create a container running the Weaviate server.
5. Set the two inputs in the index.ts file: websiteName (string) and keyword (string).
6. Run the project using "npm start"
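
As an illustration of step 5, the two inputs could be checked before the crawl starts. This is only a sketch: the `CrawlInput` interface and `validateInput` function are hypothetical names that are not taken from index.ts, and it assumes that websiteName holds a full http(s) URL.

```typescript
// Hypothetical shape for the two inputs named in step 5.
// The real declarations in index.ts may look different.
interface CrawlInput {
  websiteName: string; // assumed to be a full http(s) URL of the site to crawl
  keyword: string; // word to look up in the indexed data set
}

// Trims both values and rejects obviously unusable input early,
// before any crawling or vectorization work is started.
function validateInput(input: CrawlInput): CrawlInput {
  const websiteName = input.websiteName.trim();
  const keyword = input.keyword.trim();
  if (!/^https?:\/\/\S+$/.test(websiteName)) {
    throw new Error(`websiteName must be an http(s) URL, got "${websiteName}"`);
  }
  if (keyword.length === 0) {
    throw new Error("keyword must not be empty");
  }
  return { websiteName, keyword };
}

// Example values only; replace them with the site and keyword you want.
const crawlInput = validateInput({
  websiteName: "https://example.com",
  keyword: "pricing",
});
console.log(crawlInput);
```

Failing fast here avoids starting the crawl and spending Hugging Face requests on input that cannot produce a useful result.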

High Level Diagram

See HighLevelDesign.png in the repository root.

Configurations

Two third-party APIs are used:
- Hugging Face (inference model that computes the vector weights for a given text or query)
- weaviateClient (queries the Weaviate vector DB and indexes data with the vector weights obtained from Hugging Face)

Notes:
- Because the free Hugging Face model is used, vectorizing a large amount of data will cause a rate limit error.
- Cohere LLM could also be used to generate answers to user queries over the crawled data set.
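
One common way to soften the rate limit error mentioned above is to retry a failed call with exponential backoff. The sketch below is not part of this project: `backoffDelayMs` and `withBackoff` are hypothetical helpers, the default attempt count and delays are assumptions, and it retries on any error rather than only on rate limit responses.

```typescript
// Hypothetical helper: delay before the next retry, doubling per attempt
// (attempt is zero-based) and capped at maxDelayMs.
function backoffDelayMs(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
}

// Hypothetical helper: runs `task` and retries it when it throws.
// Assumption: every error is treated as retryable; a real implementation
// would likely inspect the error and retry only on rate limit failures.
async function withBackoff<T>(
  task: () => Promise<T>,
  maxAttempts: number = 5,
  baseDelayMs: number = 500,
  maxDelayMs: number = 8000,
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Wait only if another attempt will follow.
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelayMs(attempt, baseDelayMs, maxDelayMs);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

A call that indexes one chunk of crawled text could then be wrapped as `withBackoff(() => indexChunk(text))`, where `indexChunk` stands for whatever function sends the data for vectorization. Indexing in smaller batches with a pause between them would reduce how often the retry path is needed at all.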
