- Programmers can use our API to access websites. The target website never sees the programmer's IP; it only sees our IP making the request, so it cannot block the programmer's own address.
- Expand the IP pool – by the Free IP Collector. About 1,000 high-anonymity IPs at a time.
- Maintain IPs dynamically – by the IP Validator.
- Different IP usage strategies – via API parameters.
- Optional “Session” function.
- HTTP and HTTPS.
- Load balancing and failover – by Nginx.
- IP pool condition monitoring.
- Python 3.6
- cd Proxy_Platform
- python -m pip install -r requirements.txt
- cd src
- python Listener.py
- http://xxx/api/?[arguments] (for requesting a website)
- http://xxx/status (for condition monitoring)
- url : The website you want to request.
- headers : The headers you want to add to the request.
- cookies : The cookies you want to add to the request.
- method : ‘get’ or ‘post’.
- data : The data (body) you want to add to the request.
- time_max : The maximum number of requests for one IP.
- time_delay : The delay between two requests.
- request_con : Task concurrency.
- timeout : The timeout of one request.
- session : The ‘Session’ function (see below).
- http://localhost/api/?url=http://www.baidu.com&&time_max=2&&session=xoisuf02
- http://localhost/status
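For example, a minimal sketch of calling the API from Python with the requests library; the host and argument values below are placeholders based on the examples above:

```python
# A sketch of calling the API with the requests library; the host and the
# query arguments below are placeholders based on the examples above.
import requests

resp = requests.get(
    "http://localhost/api/",
    params={"url": "http://www.baidu.com", "time_max": 2},
    timeout=30,
)
result = resp.json()
if "response" in result:
    print(result["response"][:200])   # beginning of the fetched page source
else:
    print(result["error"])            # argument error message
```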
Introduction
- Some websites have a strategy that binds cookies to an IP address. So that a user can keep requesting such a website continuously, we provide a session function that lets the user reuse the same IP.
How to use?
The first time you use the session function, set ‘session=True’.
The response will contain a session string such as ‘UvQk7OIG’.
The next time you want the same IP, just set ‘session’ to that string.
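As a rough sketch (the host is a placeholder), the two-step flow looks like this:

```python
# A sketch of the two-step session flow: the first request opens a session,
# later requests reuse the returned session string to keep the same IP.
import requests

API = "http://localhost/api/"   # placeholder host

# Step 1: open a session and remember the session string from the response.
first = requests.get(
    API, params={"url": "http://www.baidu.com", "session": "True"}, timeout=30
).json()
session_key = first["session"]   # e.g. 'UvQk7OIG'

# Step 2: pass the session string back to reuse the same proxy IP.
again = requests.get(
    API, params={"url": "http://www.baidu.com", "session": session_key}, timeout=30
).json()
print(again["response"][:200])
```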
- No session. (e.g. http://localhost/api?url=http://www.baidu.com)
- {"response":[source code]}
- With session. (e.g. http://localhost/api?url=http://www.baidu.com&&session=True)
- {"response":[source code], "session":[session string], "cookies":[cookies]}
- Condition monitoring. (e.g. http://localhost/status)
Total number of IPs: xxx
IP score distribution (higher is better):
1 ~ 19: xx
20 ~ 30: xx
31 ~ 40: xx
41 ~ 50: xx
51 ~ 60: xx
61 ~ 70: xx
71 ~ 80: xx
81 ~ 90: xx
91 ~ 100: xx
Only IP addresses with a score above xx will be used.
- Argument error. (e.g. http://localhost/api/?url="w.baidu.com")
- {"error":[error message]}
- Path: /src/Schdule
- Function:
Acquire IPs from the database.
Schedule IPs to visit the target website according to the API arguments.
Use higher-scoring IPs first.
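A minimal sketch of this scheduling idea, assuming the Redis zset described in the database section below and a hypothetical minimum score; it is not the project's actual code:

```python
# A sketch of the scheduling idea: pick a high-scoring proxy from the Redis
# zset (see the Redis description below) and use it for the request.
# The minimum score and connection settings are assumptions.
import random

import redis
import requests

r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

def pick_proxy(min_score=50):
    """Randomly pick one proxy among those scoring at least min_score."""
    candidates = r.zrangebyscore("ProxyPlatform", min_score, 100)
    return random.choice(candidates) if candidates else None

def fetch(url, timeout=10):
    proxy = pick_proxy()
    if proxy is None:
        raise RuntimeError("no usable proxy at the moment")
    address, port, scheme = proxy.split(":")   # stored as address:port:type
    proxies = {scheme.lower(): "http://%s:%s" % (address, port)}
    return requests.get(url, proxies=proxies, timeout=timeout)

print(fetch("http://www.baidu.com").status_code)
```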
This component is based on an open source project on Github[1].
The file structure of this component is shown in Fig. 1.
The proxy spider uses XPath to extract proxy IP information from free proxy websites.
The XPath expressions, the URLs,
and other settings are all stored in config.py, the configuration file for this part.
While running, it creates one thread to crawl the target websites and fifty threads to test and verify the usability of each proxy.
All usable proxies are added to the database.
The architecture of this part is shown in Fig. 2.
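As a rough illustration (not the project's actual code), the crawl-and-verify flow could look like the following; the page URL and XPath expressions are placeholders of the kind kept in config.py:

```python
# A rough illustration of the crawl-and-verify flow; the page URL and xpath
# expressions are placeholders of the kind kept in config.py.
import requests
from lxml import html

PAGE_URL = "http://free-proxy-site.example.com/list"   # placeholder
IP_XPATH = "//table//tr/td[1]/text()"                   # placeholder
PORT_XPATH = "//table//tr/td[2]/text()"                 # placeholder

def crawl_one_page():
    """Pull the IP/port columns from one free proxy listing page."""
    page = requests.get(PAGE_URL, timeout=10)
    tree = html.fromstring(page.text)
    ips = tree.xpath(IP_XPATH)
    ports = tree.xpath(PORT_XPATH)
    return ["%s:%s" % (ip.strip(), port.strip()) for ip, port in zip(ips, ports)]

def is_usable(proxy, test_url="http://www.baidu.com"):
    """Verify a proxy by fetching a test page through it."""
    try:
        resp = requests.get(test_url, proxies={"http": "http://" + proxy}, timeout=10)
        return resp.status_code == 200
    except requests.RequestException:
        return False

for proxy in crawl_one_page():
    if is_usable(proxy):
        print("usable proxy:", proxy)   # in the real spider these go to the database
```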
Because of network fluctuations and the access limits on some proxy websites,
the spider has to resend an HTTP request to the target website whenever no response arrives within TIMEOUT, which makes it somewhat slow.
Some proxy websites use anti-crawling mechanisms, so running the spider frequently yields fewer and fewer proxies.
Since these are all free proxies, the number of usable ones remains limited.
The console output while running is shown in Fig. 3.
The proxy check is designed to ensure that there are always several accessible, high-speed IPs.
Each IP has a score. When an IP is added to the database, its score starts at 20. We then check each IP by visiting a test website and adjust its score according to whether the visit succeeds, the response time, and its current score. Once a response is received, the response time is divided into different levels.
The picture below shows our strategy: if there is a response, the score changes according to the response time; otherwise, it changes according to the current score.
In addition, we set two other timeouts to control speed: one for connecting to the server's socket and one for the response after the connection succeeds. Exceeding either timeout counts as a failed visit.
If an IP's score drops to 0, it cannot be used during the current period, which usually means its socket is already closed. The IP is then moved to another database, and a separate thread keeps checking its socket state; once the socket is open again, the IP is moved back.
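Since the exact thresholds are only given in the picture, the following is just a sketch of the scoring idea with made-up levels and increments:

```python
# A sketch of the score-update rule; the levels and increments are made-up
# placeholders, since the real thresholds are only given in the picture.
def update_score(score, got_response, response_time=None):
    """Return the new score for one check of a proxy IP (clamped to 0..100)."""
    if got_response:
        # Response received: reward by response-time level (faster = bigger bonus).
        if response_time < 1.0:
            return min(score + 10, 100)
        if response_time < 3.0:
            return min(score + 5, 100)
        return min(score + 1, 100)
    # No response (or timeout): punish according to the current score.
    if score > 50:
        return score - 5
    return max(score - 10, 0)
```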
Waiting for a response wastes time, so we use the gevent library to make better use of it. Gevent is an asynchronous, non-blocking framework based on the libev event loop and greenlets. When a function reaches an I/O operation, a callback is registered and the CPU is freed for another function. The I/O controller takes charge of the operation, and when it finishes the function is pushed onto a wait queue. After all functions currently on the stack finish, the queued functions run one after another from the head of the queue.
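A minimal sketch of overlapping the waiting time with gevent; the test URL and proxy addresses are placeholders:

```python
# A minimal sketch of overlapping the waiting time with gevent: each check is
# spawned as a greenlet and yields the CPU while it waits for its response.
from gevent import monkey
monkey.patch_all()   # make socket I/O (and therefore requests) cooperative

import gevent
import requests

def check(proxy):
    """Visit a test site through the proxy and report whether it responded."""
    try:
        requests.get("http://www.baidu.com",
                     proxies={"http": "http://" + proxy}, timeout=10)
        return proxy, True
    except requests.RequestException:
        return proxy, False

proxies = ["11.22.33.44:80", "55.66.77.88:8080"]   # placeholder addresses
jobs = [gevent.spawn(check, p) for p in proxies]
gevent.joinall(jobs)
for job in jobs:
    print(job.value)
```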
We use Nginx for load balancing and failover.
Nginx acts as a reverse proxy that distributes requests across our servers.
For example, the pictures below show our Nginx settings.
The first picture shows that we configure 4 servers for Nginx. The first three servers are active and the fourth is a backup, so if all three active servers fail, the backup server receives the requests. After the server entries comes the word ‘fair’, which makes Nginx distribute requests according to server response time.
The second picture shows that Nginx listens on 0.0.0.0:80, so when someone sends a request to 0.0.0.0:80, Nginx dispatches it to one of the servers.
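Since the pictures are not reproduced here, the following is only a sketch of what such a configuration could look like; the upstream name, addresses, and ports are placeholders, and ‘fair’ requires the third-party nginx-upstream-fair module:

```nginx
# A sketch of such a configuration; the upstream name, addresses and ports are
# placeholders, and 'fair' needs the third-party nginx-upstream-fair module.
upstream proxy_platform {
    server 192.168.0.11:8000;
    server 192.168.0.12:8000;
    server 192.168.0.13:8000;
    server 192.168.0.14:8000 backup;   # used only when the three above fail
    fair;                              # balance by server response time
}

server {
    listen 0.0.0.0:80;
    location / {
        proxy_pass http://proxy_platform;
    }
}
```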
- Redis
To handle large amounts of IP information, the proxy platform uses Redis as its proxy database. Each IP has a score that reflects its quality and is stored in the format address:port:type (e.g. 11.22.33.44:80:HTTPS). All proxy IPs are kept in a zset called ProxyPlatform, sorted by score, which makes the IPs easy to use and maintain. The scheduler can randomly get a proxy from the database, and the tester can check IP scores sequentially.
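A sketch of how the zset might be maintained; the key name ProxyPlatform comes from the text, the member value is a placeholder, and the calls follow the redis-py 3.x argument order:

```python
# A sketch of maintaining the zset; the key name ProxyPlatform comes from the
# text, the member value is a placeholder, and the calls follow redis-py 3.x.
import redis

r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

# A newly collected proxy starts with a score of 20.
r.zadd("ProxyPlatform", {"11.22.33.44:80:HTTPS": 20})

# The tester raises or lowers the score after each check.
r.zincrby("ProxyPlatform", 10, "11.22.33.44:80:HTTPS")

# The scheduler reads proxies ordered by score, highest first.
best = r.zrevrangebyscore("ProxyPlatform", 100, 0, start=0, num=5, withscores=True)
print(best)
```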
- MySQL
The proxy platform uses MySQL to store other information about the operations made by users.
- Table Saves stores the request method, response status code, response header, and response time of each request. Users can query the information of requests they made before.
- Table Session stores the information for the Session function. It generates a unique 8-character string and maps it to a specific IP, so users can get the same IP they used before via that string.
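As an illustration only (the real schema is not shown in this document), the Session mapping could be created and used roughly like this with PyMySQL; the table and column names are assumptions:

```python
# An illustration only: a possible shape for the Session mapping with PyMySQL.
# The table and column names here are assumptions, not the project's schema.
import random
import string

import pymysql

conn = pymysql.connect(host="localhost", user="root", password="", db="proxy_platform")

with conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS Session ("
        "  session_key CHAR(8) PRIMARY KEY,"
        "  proxy VARCHAR(64) NOT NULL)"
    )

    # Generate the unique 8-character session string and bind it to an IP.
    key = "".join(random.choices(string.ascii_letters + string.digits, k=8))
    cur.execute("INSERT INTO Session (session_key, proxy) VALUES (%s, %s)",
                (key, "11.22.33.44:80:HTTPS"))

    # Later requests with the same session string get the same IP back.
    cur.execute("SELECT proxy FROM Session WHERE session_key = %s", (key,))
    print(cur.fetchone())

conn.commit()
conn.close()
```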
To simulate a high-load environment, we used 4 computers to create 500 threads that send requests to our program, and 4 servers to receive the requests.
Because this test did not fully load our servers, the measured throughput was only a little over 100 requests per second.
If you are interested, you are welcome to test our program and tell us the results.
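For reference, a minimal sketch of such a threaded load-test client; the endpoint URL, thread count, and duration below are placeholders rather than the setup used in the test above:

```python
# A minimal sketch of a threaded load-test client; the endpoint, thread count
# and duration are placeholders rather than the setup used in the test above.
import threading
import time

import requests

API = "http://localhost/api/?url=http://www.baidu.com"   # placeholder endpoint
THREADS = 50
DURATION = 30   # seconds

counter = 0
lock = threading.Lock()

def worker(stop_at):
    global counter
    while time.time() < stop_at:
        try:
            requests.get(API, timeout=10)
            with lock:
                counter += 1      # count only requests that returned
        except requests.RequestException:
            pass

stop_at = time.time() + DURATION
threads = [threading.Thread(target=worker, args=(stop_at,)) for _ in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("requests per second: %.1f" % (counter / DURATION))
```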