From 54f7105517976fc5af72da40d0a3cc5b1642727d Mon Sep 17 00:00:00 2001
From: gunjianpan
Date: Tue, 9 Apr 2019 10:50:00 +0800
Subject: [PATCH] =?UTF-8?q?=F0=9F=A6=83add=20some=20proxy=20explain?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
---
README.md | 43 ++++++++++++++++++++++++-------------------
1 file changed, 24 insertions(+), 19 deletions(-)
diff --git a/README.md b/README.md
index 873877a..c19aa45 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
Spider Man
-[![GitHub](https://img.shields.io/github/license/iofu728/spider.svg?style=popout-square)](https://github.com/iofu728/spider/master/LICENSE)
+[![GitHub](https://img.shields.io/github/license/iofu728/spider.svg?style=popout-square)](https://github.com/iofu728/spider/blob/master/LICENSE)
[![GitHub tag](https://img.shields.io/github/tag/iofu728/spider.svg?style=popout-square)](https://github.com/iofu728/spider/releases)
[![GitHub code size in bytes](https://img.shields.io/github/languages/code-size/iofu728/spider.svg?style=popout-square)](https://github.com/iofu728/spider)
@@ -46,7 +46,7 @@
- Big data store
- High concurrency requests
-- Support Websocket
+- Support WebSocket
- method for font cheat
- method for js compile
- Some Application
@@ -57,19 +57,24 @@
- `Highly Available Proxy IP Pool`
- By obtaining data from `Gatherproxy`, `Goubanjia`, `xici` etc. Free Proxy WebSite
- - Analysis the Goubanjia port data
+ - Analysis of the Goubanjia port data
- Quickly verify IP availability
- Cooperate with Requests to automatically assign proxy Ip, with Retry mechanism, fail to write DB mechanism
- - two model for proxy shell
- - model 1: load gather proxy list && update proxy list file
- - model 2: update proxy pool db && test available
+ - two models for proxy shell
+ - model 1: load gather proxy list && update proxy list file (needs access beyond the GFW; put your personal http://gatherproxy.com credentials in `proxy/data/passage`, one line for the username and one line for the password)
+ - model 0: update proxy pool db && test availability
- one common proxy api
- `from proxy.getproxy import GetFreeProxy`
- `get_request_proxy = GetFreeProxy().get_request_proxy`
- `get_request_proxy(url: str, types: int, data=None, test_func=None, header=None)`
- - also one comon basic req api
- - `from util import baisc_req`
+ - also one common basic req api
+ - `from util import basic_req`
- `basic_req(url: str, types: int, proxies=None, data=None, header=None)`
+ - if you want to run the spider through a proxy
+ - accessing the proxy website requires getting past the GFW, so you may not be able to use `model 1` to download the proxy file.
+ - 1. download the proxy txt file from http://gatherproxy.com
+ - 2. cp download_file proxy/data/gatherproxy
+ - 3. python proxy/getproxy.py --model=0
## Application
@@ -124,9 +129,9 @@
- when you want to download lots of show like Season 22, Season 21.
- If click one by one, It is very boring, so zimuzu.py is all you need.
-- The thing you only need do is to wait the program run.
-- And you copy the Thunder url for one to download the movies.
-- Now The Winter will coming, I think you need it to review ``.
+- All you need to do is wait for the program to run.
+- Then copy the Thunder URLs to download the movies.
+- Winter is coming, so I think you will need it to review ``.
### `Bilibili`
@@ -137,9 +142,9 @@
9. `Get av data by websocket` - `bilibili/bsocket.py`
-- base on websocket
+- based on WebSocket
- byte analysis
-- heart beat
+- heartbeat
10. `Get comment data by http` - `bilibili/bilibili.py`
@@ -163,9 +168,9 @@
## Development
-**All model base on `proxy.getproxy`, so it is very !import.**
+**All models are based on `proxy.getproxy`, so it is very !important.**
-`docker` is in the road.
+`docker` support is on the way.
```bash
$ git clone https://github.com/iofu728/spider.git
@@ -174,7 +179,7 @@ $ pip3 install -r requirement.txt
# using proxy pool
$ python proxy/getproxy.py --model=1 # model = 1: load gather proxy (now need have qualification to request google)
-$ python proxy/getproxy.py --model=1 # model = 0: test proxy
+$ python proxy/getproxy.py --model=0 # model = 0: test proxy
$ from proxy.getproxy import GetFreeProxy
$ get_request_proxy = GetFreeProxy().get_request_proxy
@@ -281,9 +286,9 @@ parent_tree.find_all(re.compile('''))
#### Idea
-1. get data from html -> json
+1. get data from HTML -> json
2. get font map -> transform num
-3. or load font analysis font(contrast withe base)
+3. or load the font and analyze it (contrast with the base font)
#### Trouble Shooting
@@ -313,4 +318,4 @@ parent_tree.find_all(re.compile('''))
##### `bilibili` some url return 404 like `http://api.bilibili.com/x/relation/stat?jsonp=jsonp&callback=__jp11&vmid=`
-basic_req auto add `host` to headers, but this url can't request in ‘Host’
+`basic_req` automatically adds `host` to the headers, but this URL can't be requested with a ‘Host’ header