InteliLink (about 95% complete, still in testing)

New release: version 2.0 (2024)

InteliLink is a web scraper designed to check publicly accessible websites from a list of domains, extract imprint and contact information, and match this information with an existing CSV database. If the contact information is not in the database, it will be added.

Project Goal

The goal of this project is to create a web scraper named InteliLink that:

Checks publicly accessible websites from a list of domains.
Extracts imprint and contact information using regex and pattern matching techniques.
Matches the extracted data with an existing CSV database.
Adds new contact information to the database if not already present.

Data Sources

Domain List: A text file containing the domains to be checked.
Proxy List: A text file containing proxies in the format ip:port.
CSV Database: A CSV file with existing contact data (columns: Name, Address, Phone, Fax, Mobile, Email, Website, Social Network Accounts).
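As a minimal sketch of how these three inputs could be read, the helpers below use only the standard library. The function names `load_lines` and `load_contacts` are hypothetical, and the exact header spelling in the real CSV file is an assumption based on the column list above.

```python
import csv
from pathlib import Path

# Column names as listed in this README; the real file's header row
# is assumed to use exactly these spellings.
COLUMNS = [
    "Name", "Address", "Phone", "Fax", "Mobile",
    "Email", "Website", "Social Network Accounts",
]


def load_lines(path):
    """Return the non-empty lines of a text file, stripped of whitespace.

    Works for both the domain list and the proxy list (one entry per line).
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_contacts(path):
    """Read the CSV database into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))
```

With the file layout from the Usage section, this would be called as `load_lines("data/domains.txt")`, `load_lines("data/proxies.txt")`, and `load_contacts("data/contacts.csv")`.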

Features

Load Proxies: Load proxies from a file.
Load Domains: Load domains from a file.
Load and Save CSV Database: Read and update the CSV file.
Fetch Website: Retrieve a website using a randomly selected proxy.
Extract Data: Extract imprint and contact information from the HTML content using regex and pattern matching techniques.
Match Data: Compare new data with existing data in the CSV database.
Save New Data: Insert new data into the CSV database.
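The Extract Data step could look roughly like the sketch below. The patterns and the `extract_contact` function are illustrative assumptions, not the project's actual rules; real imprint pages vary widely and would need more patterns (fax, mobile, address, social accounts).

```python
import re

# Illustrative patterns only (assumptions, not the project's real rules).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A phone number introduced by a label such as "Tel:", "Telefon:" or "Phone:".
PHONE_RE = re.compile(
    r"(?:Tel(?:efon)?|Phone)\.?:?\s*(\+?\d[\d\s()/.-]{5,}\d)",
    re.IGNORECASE,
)


def extract_contact(html):
    """Return the first email address and phone number found in the page text."""
    # Crude tag stripping; an HTML parser would be more robust.
    text = re.sub(r"<[^>]+>", " ", html)
    emails = EMAIL_RE.findall(text)
    phones = PHONE_RE.findall(text)
    return {
        # Lowercased so later comparisons are case-insensitive.
        "Email": emails[0].lower() if emails else "",
        # Collapse runs of whitespace inside the number.
        "Phone": re.sub(r"\s+", " ", phones[0]).strip() if phones else "",
    }
```

The returned keys match the `Email` and `Phone` columns of the CSV database, so a record can be compared against existing rows directly.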

Workflow

Initialization:
- Load proxies and domains from their respective files.
- Load existing contact data from the CSV database.
Website Checking:
- For each domain:
  - Select a random proxy.
  - Fetch the website.
  - Extract imprint and contact information using regex and pattern matching techniques.
Data Matching:
- Compare the extracted data with existing data in the CSV database.
- Add new data if it is not already present.
Saving:
- Save the updated CSV database.
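The workflow above can be sketched as a single loop. Everything here is an assumption for illustration: the function names, the use of `urllib` for fetching, the plain `http://` URL scheme, and the matching rule (a record counts as known if its email or website already appears in the database). The README does not specify how matching is done.

```python
import csv
import random
import urllib.request


def fetch(url, proxy, timeout=10):
    """Fetch a URL through an HTTP proxy given as 'ip:port'."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def is_known(record, contacts):
    """Assumed matching rule: same email or same website, case-insensitive."""
    email = record.get("Email", "").strip().lower()
    site = record.get("Website", "").strip().lower()
    for row in contacts:
        if email and email == (row.get("Email") or "").strip().lower():
            return True
        if site and site == (row.get("Website") or "").strip().lower():
            return True
    return False


def run(domains, proxies, contacts, extract, fetch_page=fetch):
    """Check each domain and append unknown records to `contacts`."""
    added = []
    for domain in domains:
        proxy = random.choice(proxies)  # a random proxy per request
        try:
            html = fetch_page(f"http://{domain}", proxy)
        except OSError:  # covers URLError, timeouts, connection errors
            continue
        record = extract(html)
        record["Website"] = domain
        if not is_known(record, contacts):
            contacts.append(record)
            added.append(record)
    return added


def save_contacts(path, contacts, columns):
    """Write all records back to the CSV file, leaving missing fields empty."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=columns, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(contacts)
```

Passing `extract` and `fetch_page` as parameters keeps the loop testable without network access: a test can supply a fake fetcher that returns canned HTML.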

Setup

Prerequisites

Python 3.x
pip (Python package installer)

Installation

Clone the repository:
```
git clone https://github.com/yourusername/InteliLink.git
cd InteliLink
```
Create a virtual environment:
```
python -m venv venv
source venv/bin/activate   # On Windows, use `venv\Scripts\activate`
```

Install the required dependencies:
```
pip install -r requirements.txt
```
Run the setup script to create the necessary directory structure and files:
```
python setup_project.py
```

Usage

Place your domain list in data/domains.txt.
Place your proxy list in data/proxies.txt.
Ensure your CSV database is available in data/contacts.csv.
Run the main script:
```
python src/main.py
```

Logging

Logging information is stored in logs/scraping.log. The log file contains detailed information about the scraping process, including errors and successful operations.
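A file logger of this kind could be configured as in the sketch below. The logger name `intelilink`, the function name `setup_logging`, and the message format are assumptions; only the `logs/scraping.log` path comes from this README.

```python
import logging
from pathlib import Path


def setup_logging(log_path="logs/scraping.log"):
    """Return a logger that appends timestamped records to the log file."""
    # Create the logs/ directory if it does not exist yet.
    Path(log_path).parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("intelilink")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A scraping loop would then call `logger.info(...)` after a successful fetch and `logger.error(...)` when a request fails.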

Testing

To run the tests, use the following command:
```
python -m unittest discover -s tests
```
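By default, `unittest` discovery picks up files in `tests` whose names match `test*.py`. As a rough illustration of what such a file might contain, here is a test case for a hypothetical `parse_proxy` helper that splits the `ip:port` proxy format. The helper is defined inline so the sketch is self-contained; it is not taken from the project's code.

```python
import unittest


def parse_proxy(entry):
    """Split an 'ip:port' entry into (host, port). Hypothetical helper."""
    host, sep, port = entry.strip().rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected ip:port, got {entry!r}")
    return host, int(port)


class ParseProxyTest(unittest.TestCase):
    def test_valid_entry(self):
        self.assertEqual(parse_proxy("203.0.113.5:8080\n"), ("203.0.113.5", 8080))

    def test_missing_port(self):
        with self.assertRaises(ValueError):
            parse_proxy("203.0.113.5")
```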

Contributing

Contributions are welcome! Please feel free to submit a pull request.

Your Support

If you find this project useful and want to support it, there are several ways to do so:

Star the project: If you find this project helpful, please ⭐ it on GitHub. This makes the project more visible and helps it reach more people.
Become a follower: If you're interested in updates and future improvements, follow my GitHub account to stay up to date.
Learn more about my work: Check out my other projects on GitHub and visit my developer site https://volkansah.github.io, where you will find more information about me and my projects.
Share the project: If you know someone who could benefit from this project, please share it. The more people who can use it, the better.
Sponsor: If you appreciate my work and would like to support it, please visit my GitHub Sponsors page. Any kind of support is welcome and helps me improve and expand my work.

Thank you for your support! ❤️

Copyright S. Volkan Kücükbudak

License: Private for now. For private use only.
