Winniekun / spider Public

Small Python 3.x web-scraping projects

Data I scraped while doing day-to-day data analysis, kept here as scraping practice 😸

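Scrape Chinese animation listings from Douban ---- 2017/10
Scrape all Qzone posts (说说) of QQ friends ---- 2017/11
Scrape information from Saikr (赛氪网) (unfinished) ---- 2017/11
Scrape Zhihu user information (based on 轮子哥, using Scrapy) ---- 2017/11
Scrape WeChat (using itchat) ---- 2017/12
Bypass machine verification (unfinished) ---- 2017/12
Scrape Starbucks information ---- 2018/1
Scrape NetEase Cloud Music comments (continuously updated) ---- 2018/1
Scrape reviews of specific products on JD.com ---- 2018/1
Scrape Douban short reviews of "Secret Superstar" (神秘巨星) ---- 2018/2
Scrape GitHub ---- 2018/2
VIP video parsing helper ---- 2018/2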

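Scrape and download videos from the Douyin app (using Fiddler) ---- 2018/2
Learning Scrapy (following the official documentation) ---- 2018/3
Learning XPath ---- 2018/3
File downloader (browser downloads were too slow and I had not found a good download tool on Ubuntu, so I wrote a simple one myself) ---- 2018/3
Scrape the transcripts of TED videos, in preparation for later analysis
Wi-Fi brute-force cracking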

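Added scraping of Baidu Wenku (I have been using Baidu Wenku lately and it keeps reporting that the paste quota has been exceeded, so I wrote this script)
Concurrent scraping of IMDB data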

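Environment setup and walkthrough

1. Scraping Qzone posts (说说)

Steps:

Obtain the cookies through a simulated login, because the parameters required by the post request URLs are taken from the cookies (the cookies can of course be obtained in other ways as well). The g_qzonetoken value is extracted from the page source.
Analyze the post request URL, construct the parameters, and pass them in.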
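
The two steps above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the helper names (`cookies_to_dict`, `extract_qzonetoken`, `build_url`), the regular expression, and the example URL are all assumptions, and the real page may embed g_qzonetoken in a different form.

```python
import re
from urllib.parse import urlencode


def cookies_to_dict(selenium_cookies):
    """Flatten Selenium's list of cookie dicts into a {name: value} mapping."""
    return {c["name"]: c["value"] for c in selenium_cookies}


# Assumed pattern: the token is the first quoted string that follows the
# identifier g_qzonetoken. Adjust it to whatever the real page contains.
TOKEN_RE = re.compile(r'g_qzonetoken[^"]*"([^"]+)"')


def extract_qzonetoken(page_source):
    """Return the g_qzonetoken value found in the page source, or None."""
    match = TOKEN_RE.search(page_source)
    return match.group(1) if match else None


def build_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    return base_url + "?" + urlencode(params)
```

With a logged-in Selenium driver, `cookies_to_dict(driver.get_cookies())` and `extract_qzonetoken(driver.page_source)` would supply the inputs, and the resulting cookie dict can be passed to `requests.get(url, cookies=...)`. Which query parameters the post endpoint expects has to be read from the request observed in the browser.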

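Environment:

selenium
requests

Notes

If you use Chrome, make sure the chromedriver version matches your installed Chrome version.
When using simulated login, set a suitable sleep time so that the rest of the program does not run before the login has actually completed (a check could be added for this; not done yet).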
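
The check mentioned in the last note could take the form of polling instead of a fixed sleep. The sketch below is one possible approach and not what the project currently does; `wait_until` is a hypothetical helper, and Selenium's own `WebDriverWait` offers comparable behavior.

```python
import time


def wait_until(condition, timeout=30.0, interval=0.5):
    """Poll condition() until it returns a truthy value or the timeout elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_until(lambda: any(c["name"] == "p_skey" for c in driver.get_cookies()))` would block until a login cookie appears. The cookie name `p_skey` is an assumption here; use whichever cookie or page element reliably signals a finished login.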

TODO

Concurrent scraping
Support resumable scraping (continue from where the last run stopped)
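
As a starting point for the resumable-scraping item, one simple option is to record finished item IDs in a checkpoint file and skip them on the next run. This is only a sketch under assumptions: the function names, the JSON file format, and the `fetch` callback are hypothetical and not part of the project.

```python
import json
import os


def load_done(path):
    """Load the set of already-scraped item IDs, or an empty set if none."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))


def save_done(path, done):
    """Write the checkpoint via a temp file so a crash cannot truncate it."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)


def scrape_all(item_ids, fetch, path="checkpoint.json"):
    """Call fetch(item_id) for every ID not yet recorded in the checkpoint."""
    done = load_done(path)
    for item_id in item_ids:
        if item_id in done:
            continue  # already scraped in an earlier run
        fetch(item_id)
        done.add(item_id)
        save_done(path, done)
    return done
```

Saving after every item keeps the logic simple at the cost of extra disk writes; for large crawls the checkpoint could be written every N items instead.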
