This is a simple application where a user can add the URL of a web page; the application scrapes that page and extracts a list of all the links it contains.
- As a user, I should be able to see a list of all the pages that I have scraped, together with the number of links the scraper found on each.
- As a user, I should be able to see the details of all the links on a particular page, that is, each link's URL and "name".
- As a user, I should be able to add a URL, and the system should find all the links on that page and add them to the database. A link has the following format:
<a href="https://www.w3schools.com"> Visit W3Schools.com! </a>
The href attribute is the link's URL and the body is the link's name.
Keep in mind that the body of a link is sometimes not just text and may contain other HTML elements; in those cases you can save only a portion of the HTML. The title of the web page becomes the page name. Also keep in mind that some pages take longer to scrape than others.
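The user stories above imply a simple two-model schema: a scraped page and the links found on it (the link count per page can be derived from the relation). Here is a minimal Django sketch; the model and field names are illustrative assumptions, not this project's actual code:

```python
# Illustrative data model implied by the user stories (names are assumptions).
from django.db import models


class Page(models.Model):
    url = models.URLField(max_length=2000)
    name = models.CharField(max_length=255, blank=True)  # the page's <title>
    created_at = models.DateTimeField(auto_now_add=True)


class Link(models.Model):
    page = models.ForeignKey(Page, related_name="links", on_delete=models.CASCADE)
    href = models.URLField(max_length=2000)  # the link's URL
    name = models.CharField(max_length=255)  # the link's body (possibly truncated HTML)
```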
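And here is a sketch of the scraping step itself with Beautiful Soup, following the rules above (the page title becomes the page name, the href is the link URL, and a link body with nested HTML is truncated); again, names and limits are assumptions:

```python
# Sketch of the scraping step with requests + Beautiful Soup (illustrative).
import requests
from bs4 import BeautifulSoup

MAX_NAME_LENGTH = 255  # assumed cap for link names that contain nested HTML


def scrape(url: str) -> tuple[str, list[dict]]:
    # Some pages take much longer than others; bound the wait with a timeout.
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")

    # The page's <title> becomes the page name.
    page_name = soup.title.get_text(strip=True) if soup.title else url

    links = []
    for anchor in soup.find_all("a", href=True):
        # Prefer the anchor's plain text; if the body contains other HTML
        # elements, fall back to a truncated slice of its inner HTML.
        name = anchor.get_text(strip=True) or anchor.decode_contents()
        links.append({"href": anchor["href"], "name": name[:MAX_NAME_LENGTH]})
    return page_name, links
```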
- Python
- Django
- Django REST Framework
- Celery
- RabbitMQ
- Beautiful Soup
- SQLite
- React
- Next.js
- MUI (Material UI)
git clone <repository>
docker-compose build
docker-compose up
That's it!
- Dependencies (RabbitMQ)
cd webscrapper
docker-compose up
- Backend setup
- Create a Python virtual environment (https://docs.python.org/3/library/venv.html#creating-virtual-environments) and activate it
pip install -r requirements.txt
cd backend
python3 manage.py migrate
gunicorn --bind 0.0.0.0:5000 --timeout=300 -k gevent webscrapper.wsgi
- Celery setup: in another terminal execute:
python3 -m celery -A webscrapper worker -E -l info
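The worker started above consumes scraping jobs from RabbitMQ, so slow pages don't block the API. A hypothetical task showing how such a job might look, reusing the sketches from the overview (the actual task in webscrapper may differ):

```python
# Hypothetical Celery task; the project's real task and models may differ.
from celery import shared_task


@shared_task
def scrape_page(page_id: int) -> int:
    page = Page.objects.get(pk=page_id)  # Page/Link as sketched above
    page_name, links = scrape(page.url)
    page.name = page_name
    page.save()
    Link.objects.bulk_create(
        [Link(page=page, href=item["href"], name=item["name"]) for item in links]
    )
    return len(links)  # the count shown in the pages list
```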
- Frontend setup: in another terminal execute:
cd ..
cd frontend
npm install
npm run dev
Go to http://localhost:3000 to see the application in action.