From 54f7105517976fc5af72da40d0a3cc5b1642727d Mon Sep 17 00:00:00 2001
From: gunjianpan
Date: Tue, 9 Apr 2019 10:50:00 +0800
Subject: [PATCH] =?UTF-8?q?=F0=9F=A6=83add=20some=20proxy=20explain?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 43 ++++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/README.md b/README.md
index 873877a..c19aa45 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
 Spider logo

Spider Man

-[![GitHub](https://img.shields.io/github/license/iofu728/spider.svg?style=popout-square)](https://github.com/iofu728/spider/master/LICENSE)
+[![GitHub](https://img.shields.io/github/license/iofu728/spider.svg?style=popout-square)](https://github.com/iofu728/spider/blob/master/LICENSE)
 [![GitHub tag](https://img.shields.io/github/tag/iofu728/spider.svg?style=popout-square)](https://github.com/iofu728/spider/releases)
 [![GitHub code size in bytes](https://img.shields.io/github/languages/code-size/iofu728/spider.svg?style=popout-square)](https://github.com/iofu728/spider)
@@ -46,7 +46,7 @@
 - Big data store
 - High concurrency requests
-- Support Websocket
+- Support WebSocket
 - method for font cheat
 - method for js compile
 - Some Application
@@ -57,19 +57,24 @@
 - `Highly Available Proxy IP Pool`
   - By obtaining data from `Gatherproxy`, `Goubanjia`, `xici` etc. Free Proxy WebSite
-  - Analysis the Goubanjia port data
+  - Analysis of the Goubanjia port data
   - Quickly verify IP availability
   - Cooperate with Requests to automatically assign proxy Ip, with Retry mechanism, fail to write DB mechanism
-  - two model for proxy shell
-    - model 1: load gather proxy list && update proxy list file
-    - model 2: update proxy pool db && test available
+  - two models for proxy shell
+    - model 1: load gather proxy list && update proxy list file(need over the GFW, your personality passwd in http://gatherproxy.com to `proxy/data/passage` one line by username, one line by passwd)
+    - model 0: update proxy pool db && test available
   - one common proxy api
     - `from proxy.getproxy import GetFreeProxy`
     - `get_request_proxy = GetFreeProxy().get_request_proxy`
     - `get_request_proxy(url: str, types: int, data=None, test_func=None, header=None)`
-  - also one comon basic req api
-    - `from util import baisc_req`
+  - also one common basic req api
+    - `from util import basic_req`
     - `basic_req(url: str, types: int, proxies=None, data=None, header=None)`
+  - if you want spider by using proxy
+    - because access proxy web need over the GFW, so maybe you can't use `model 1` to download proxy file.
+    - 1. download proxy txt from http://gatherproxy.com
+    - 2. cp download_file proxy/data/gatherproxy
+    - 3. python proxy/getproxy.py --model==0
 ## Application
@@ -124,9 +129,9 @@
 - when you want to download lots of show like Season 22, Season 21.
 - If click one by one, It is very boring, so zimuzu.py is all you need.
-- The thing you only need do is to wait the program run.
-- And you copy the Thunder url for one to download the movies.
-- Now The Winter will coming, I think you need it to review ``.
+- The thing you only need do is to wait for the program run.
+- And you copy the Thunder URL for one to download the movies.
+- Now The Winter will come, I think you need it to review ``.
 ### `Bilibili`
@@ -137,9 +142,9 @@
 9. `Get av data by websocket`
 - `bilibili/bsocket.py`
-- base on websocket
+- base on WebSocket
 - byte analysis
-- heart beat
+- heartbeat
 10. `Get comment data by http`
 - `bilibili/bilibili.py`
@@ -163,9 +168,9 @@
 ## Development
-**All model base on `proxy.getproxy`, so it is very !import.**
+**All model base on `proxy.getproxy`, so it is a very !import.**
-`docker` is in the road.
+`docker` is on the road.
 ```bash
 $ git clone https://github.com/iofu728/spider.git
@@ -174,7 +179,7 @@ $ pip3 install -r requirement.txt
 # using proxy pool
 $ python proxy/getproxy.py --model=1 # model = 1: load gather proxy (now need have qualification to request google)
-$ python proxy/getproxy.py --model=1 # model = 0: test proxy
+$ python proxy/getproxy.py --model=0 # model = 0: test proxy
 $ from proxy.getproxy import GetFreeProxy
 $ get_request_proxy = GetFreeProxy().get_request_proxy
@@ -281,9 +286,9 @@ parent_tree.find_all(re.compile('''))
 #### Idea
-1. get data from html -> json
+1. get data from HTML -> json
 2. get font map -> transform num
-3. or load font analysis font(contrast withe base)
+3. or load font analysis font(contrast with base)
 #### Trouble Shooting
@@ -313,4 +318,4 @@ parent_tree.find_all(re.compile('''))
 ##### `bilibili`
 some url return 404 like `http://api.bilibili.com/x/relation/stat?jsonp=jsonp&callback=__jp11&vmid=`
-basic_req auto add `host` to headers, but this url can't request in ‘Host’
+basic_req auto add `host` to headers, but this URL can't request in ‘Host’