This is a simple application where a user can add the URL of a web page; the application scrapes that page and extracts a list of all the links it contains.
- As a user, I should be able to see a list of all the pages that I have scraped, together with the number of links the scraper found on each.
- As a user, I should be able to see the details of all the links on a particular page, that is, each link's URL and "name".
- As a user, I should be able to add a URL, and the system should find all the links on that page and add them to the database. A link has the following format:
<a href="https://www.w3schools.com"> Visit W3Schools.com! </a>
The href attribute is the link's URL and the body is the link's name.
Keep in mind that the body of a link is sometimes not just text and may contain other HTML elements; in those cases you can save only a portion of the HTML. The title of the web page becomes the page name. Also keep in mind that some pages take longer to scrape than others.
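The user stories above imply a simple two-model schema: a scraped page and the links found on it (the link count per page can be derived from the relation). Here is a minimal Django sketch; the model and field names are illustrative assumptions, not this project's actual code:

```python
# Illustrative data model implied by the user stories (names are assumptions).
from django.db import models


class Page(models.Model):
    url = models.URLField(max_length=2000)
    name = models.CharField(max_length=255, blank=True)  # the page's <title>
    created_at = models.DateTimeField(auto_now_add=True)


class Link(models.Model):
    page = models.ForeignKey(Page, related_name="links", on_delete=models.CASCADE)
    href = models.URLField(max_length=2000)  # the link's URL
    name = models.CharField(max_length=255)  # the link's body (possibly truncated HTML)
```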
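And here is a sketch of the scraping step itself with Beautiful Soup, following the rules above (the page title becomes the page name, the href is the link URL, and a link body with nested HTML is truncated); again, names and limits are assumptions:

```python
# Sketch of the scraping step with requests + Beautiful Soup (illustrative).
import requests
from bs4 import BeautifulSoup

MAX_NAME_LENGTH = 255  # assumed cap for link names that contain nested HTML


def scrape(url: str) -> tuple[str, list[dict]]:
    # Some pages take much longer than others; bound the wait with a timeout.
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")

    # The page's <title> becomes the page name.
    page_name = soup.title.get_text(strip=True) if soup.title else url

    links = []
    for anchor in soup.find_all("a", href=True):
        # Prefer the anchor's plain text; if the body contains other HTML
        # elements, fall back to a truncated slice of its inner HTML.
        name = anchor.get_text(strip=True) or anchor.decode_contents()
        links.append({"href": anchor["href"], "name": name[:MAX_NAME_LENGTH]})
    return page_name, links
```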
- Python
- Django
- Django REST Framework
- Celery
- RabbitMQ
- Beautiful Soup
- SQLite
- React
- Next.js
- MUI (Material UI)
git clone <repository>
docker-compose build
docker-compose up
That's it!
- Dependencies (RabbitMQ)
cd webscrapper
docker-compose up
- Backend setup
- Create a Python virtual environment (https://docs.python.org/3/library/venv.html#creating-virtual-environments) and activate it
pip install -r requirements.txt
cd backend
python3 manage.py migrate
gunicorn --bind 0.0.0.0:5000 --timeout=300 -k gevent webscrapper.wsgi
- Celery setup: in another terminal execute:
python3 -m celery -A webscrapper worker -E -l info
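The worker started above consumes scraping jobs from RabbitMQ, so slow pages don't block the API. A hypothetical task showing how such a job might look, reusing the sketches from the overview (the actual task in webscrapper may differ):

```python
# Hypothetical Celery task; the project's real task and models may differ.
from celery import shared_task


@shared_task
def scrape_page(page_id: int) -> int:
    page = Page.objects.get(pk=page_id)  # Page/Link as sketched above
    page_name, links = scrape(page.url)
    page.name = page_name
    page.save()
    Link.objects.bulk_create(
        [Link(page=page, href=item["href"], name=item["name"]) for item in links]
    )
    return len(links)  # the count shown in the pages list
```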
- Frontend setup: in another terminal execute:
cd ..
cd frontend
npm install
npm run dev
Go to http://localhost:3000 to see the application in action.