Commit 54f7105: 🦃add some proxy explain
iofu728 committed Apr 9, 2019 (1 parent: 253676c)
Showing 1 changed file (README.md) with 24 additions and 19 deletions.

<img width="100" src="https://cdn.nlark.com/yuque/0/2018/jpeg/104214/1540358574166-46cbbfd2-69fa-4406-aba9-784bf65efdf9.jpeg" alt="Spider logo"></a></p>
<h1 align="center">Spider Man</h1>

[![GitHub](https://img.shields.io/github/license/iofu728/spider.svg?style=popout-square)](https://github.com/iofu728/spider/blob/master/LICENSE)
[![GitHub tag](https://img.shields.io/github/tag/iofu728/spider.svg?style=popout-square)](https://github.com/iofu728/spider/releases)
[![GitHub code size in bytes](https://img.shields.io/github/languages/code-size/iofu728/spider.svg?style=popout-square)](https://github.com/iofu728/spider)


- Big data store
- High concurrency requests
- Support WebSocket
- Method for font cheat
- Method for JS compile
- Some applications
- <u>`Highly Available Proxy IP Pool`</u>
- By obtaining data from free proxy websites such as `Gatherproxy`, `Goubanjia`, `xici`, etc.
- Analysis of the Goubanjia port data
- Quickly verify IP availability
- Cooperates with Requests to automatically assign a proxy IP, with a retry mechanism and a mechanism that writes failures back to the DB
- Two models for the proxy shell
- model 1: load the gather proxy list && update the proxy list file (needs access over the GFW; put your http://gatherproxy.com username and password in `proxy/data/passage`, one line for the username, one line for the password)
- model 0: update the proxy pool DB && test availability
- One common proxy API
- `from proxy.getproxy import GetFreeProxy`
- `get_request_proxy = GetFreeProxy().get_request_proxy`
- `get_request_proxy(url: str, types: int, data=None, test_func=None, header=None)`
- Also one common basic req API
- `from util import basic_req`
- `basic_req(url: str, types: int, proxies=None, data=None, header=None)`
- If you want to spider via a proxy
- Because accessing the proxy website requires getting over the GFW, you may not be able to use `model 1` to download the proxy file; in that case:
- 1. Download the proxy txt from http://gatherproxy.com
- 2. `cp download_file proxy/data/gatherproxy`
- 3. `python proxy/getproxy.py --model=0`
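The assign-a-proxy-with-retry idea described above can be sketched with only the standard library. Here `do_request` and the returned `failed` list are illustrative stand-ins for the project's real request call and its write-failures-to-DB step, not the actual `proxy.getproxy` API:

```python
import random

def fetch_with_proxy(url, proxies, do_request, max_retry=3):
    """Fetch url through a random proxy, dropping proxies that fail.

    do_request(url, proxy) is a stand-in for the real request call;
    the failed list stands in for the fail-to-DB write. Illustrative only.
    """
    failed = []
    for _ in range(max_retry):
        if not proxies:
            break
        proxy = random.choice(proxies)
        try:
            return do_request(url, proxy), failed
        except Exception:
            proxies.remove(proxy)   # retry with another proxy
            failed.append(proxy)    # a real pool would record this in the DB
    return None, failed
```

Bad proxies are removed from the pool on the spot, so repeated calls converge on working IPs.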

## Application


- When you want to download lots of shows, like Season 22 and Season 21.
- Clicking them one by one is very boring, so zimuzu.py is all you need.
- The only thing you need to do is wait for the program to run.
- Then copy the Thunder URL of each one to download the episodes.
- Now Winter is coming; I think you need it to rewatch `<Game of Thrones>`.

### `Bilibili`


9. <u>`Get av data by websocket`</u> - <u>`bilibili/bsocket.py`</u>

- based on WebSocket
- byte analysis
- heartbeat
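The byte analysis and heartbeat revolve around a length-prefixed binary frame. The layout sketched here (total length, header length, version, operation, sequence id) is the commonly described bilibili live framing and is an assumption, not read from `bsocket.py`:

```python
import struct

# Assumed frame layout (big-endian): total length (4 bytes),
# header length (2), version (2), operation (4), sequence id (4).
HEADER_FMT = ">IHHII"
HEADER_LEN = struct.calcsize(HEADER_FMT)  # 16 bytes

OP_HEARTBEAT = 2  # assumed heartbeat operation code

def encode_frame(op, body=b"", version=1, seq=1):
    """Prepend the 16-byte header to the payload."""
    header = struct.pack(HEADER_FMT, HEADER_LEN + len(body),
                         HEADER_LEN, version, op, seq)
    return header + body

def decode_frame(frame):
    """Split a received frame back into (operation, body bytes)."""
    total, hlen, version, op, seq = struct.unpack_from(HEADER_FMT, frame)
    return op, frame[hlen:total]
```

A client would send `encode_frame(OP_HEARTBEAT)` on a timer to keep the connection alive.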

10. <u>`Get comment data by http`</u> - <u>`bilibili/bilibili.py`</u>


## Development

**All models are based on `proxy.getproxy`, so it is very important.**

`docker` support is on the way.

```bash
$ git clone https://github.com/iofu728/spider.git
$ pip3 install -r requirement.txt

# using proxy pool
$ python proxy/getproxy.py --model=1 # model = 1: load the gather proxy list (needs access over the GFW)
$ python proxy/getproxy.py --model=0 # model = 0: test proxies

$ from proxy.getproxy import GetFreeProxy
$ get_request_proxy = GetFreeProxy().get_request_proxy
#### Idea
1. get data from HTML -> json
2. get font map -> transform num
3. or load the font and analyze it (contrast with the base font)
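The font-map idea above boils down to a two-step lookup: codepoint -> glyph name (from the downloaded web font, e.g. dumped with fontTools) and glyph name -> digit (from the contrast with a base font). Everything below is a hypothetical illustration, not the repo's actual code:

```python
def decode_font_cheat(text, cmap, glyph_to_digit):
    """Map obfuscated private-use codepoints back to real characters.

    cmap maps a codepoint to a glyph name (as dumped from the site's
    web font), and glyph_to_digit is the contrast table built against
    a base font. Both names are illustrative assumptions.
    """
    out = []
    for ch in text:
        name = cmap.get(ord(ch))
        # keep characters that are not obfuscated as-is
        out.append(glyph_to_digit.get(name, ch) if name else ch)
    return "".join(out)
```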
#### Trouble Shooting
##### `bilibili`: some URLs return 404, like `http://api.bilibili.com/x/relation/stat?jsonp=jsonp&callback=__jp11&vmid=`
basic_req automatically adds `Host` to the headers, but this URL can't be requested with a `Host` header
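One way to sketch the workaround: make the `Host` header opt-out instead of always auto-added. `build_headers` is a hypothetical helper, not the real `basic_req` internals:

```python
from urllib.parse import urlparse

def build_headers(url, base_headers=None, with_host=True):
    """Return request headers, optionally adding the URL's netloc as Host.

    Some endpoints (like the relation/stat API above) reject requests
    that carry an explicit Host header, so callers can turn it off.
    Hypothetical helper, illustrative only.
    """
    headers = dict(base_headers or {})
    if with_host:
        headers["Host"] = urlparse(url).netloc
    else:
        headers.pop("Host", None)
    return headers
```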
