WHAT SKILL LEVELS DO WE HAVE
SPLIT INTO PAIRS
INSTALL STUFF
Why write a screen scraper?
To get data that is available, but not in a structured format.
What can I scrape?
With patience, almost anything. But the more tabular the data, the more straightforward it will be.
When doesn't this work?
When you can't be certain you've found all the data, for example when a site is search-only and has no predictable URLs.
What is PANDA?
http://pandaproject.net/
Why put data in PANDA?
To share with your colleagues. To search it.
Tools and technologies:
Python, Node, Ruby, ScraperWiki, Mechanize
What are we going to produce today?
A script you can run to extract structured data from an unstructured website.
What we aren't going to cover:
Sessions/cookies, regular expressions, POST URLs/search params, broken HTML.
Question:
Does the percentage of runners who finish the race vary with wind speed?
Step 1:
Explain boilerplate
How to fetch a webpage
Scraping the year
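A minimal sketch of Step 1 in Python, using only the standard library. The page layout is an assumption: it supposes the year appears in the first <h1> heading, and SAMPLE is a made-up page, not the real race site. The names fetch, TitleParser and scrape_year are illustrative.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleParser(HTMLParser):
    """Collect the text inside the first <h1> element."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.done = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h1:
            self.text += data


def scrape_year(html):
    """Return the first four-digit number in the heading, or None."""
    parser = TitleParser()
    parser.feed(html)
    for word in parser.text.split():
        if len(word) == 4 and word.isdigit():
            return int(word)
    return None


# Hypothetical page; a real race page will need its own rules.
SAMPLE = "<html><body><h1>City Marathon 2011 Results</h1></body></html>"

if __name__ == "__main__":
    print(scrape_year(SAMPLE))
```

Against a live site you would call scrape_year(fetch(url)) instead of passing SAMPLE.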
Step 2:
Scraping the registered and finished runners
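One way Step 2 might look, assuming the counts sit in a two-column table with "Registered" and "Finished" labels. That layout, the SAMPLE_TABLE markup and the function names are assumptions for illustration. The sketch also computes the percentage the question asks about.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.row = None
        self.cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.cell = ""

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.row.append(self.cell.strip())
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell += data


def scrape_runners(html):
    """Return (registered, finished) as integers."""
    parser = TableParser()
    parser.feed(html)
    counts = {}
    for row in parser.rows:
        # Only look at label/value rows we recognise.
        if len(row) == 2 and row[0].lower() in ("registered", "finished"):
            counts[row[0].lower()] = int(row[1].replace(",", ""))
    return counts["registered"], counts["finished"]


def percent_finished(registered, finished):
    """Share of registered runners who finished, as a percentage."""
    return 100.0 * finished / registered


# Hypothetical markup; the real page may be laid out differently.
SAMPLE_TABLE = """
<table>
  <tr><th>Registered</th><td>1,200</td></tr>
  <tr><th>Finished</th><td>1,050</td></tr>
</table>
"""

if __name__ == "__main__":
    registered, finished = scrape_runners(SAMPLE_TABLE)
    print(registered, finished, percent_finished(registered, finished))
```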
Step 3:
Scraping the wind speed
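A sketch of Step 3 that picks out one element by its id attribute. It assumes the wind speed sits in an element with id="wind" and reads like "12 mph". Both the id and the unit are invented for this example.

```python
from html.parser import HTMLParser


class IdTextParser(HTMLParser):
    """Collect the text of the first element whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.capturing_tag = None
        self.found = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if not self.found and dict(attrs).get("id") == self.target_id:
            self.capturing_tag = tag
            self.found = True

    def handle_endtag(self, tag):
        # Assumes the target element does not contain a nested tag of the same name.
        if tag == self.capturing_tag:
            self.capturing_tag = None

    def handle_data(self, data):
        if self.capturing_tag is not None:
            self.text += data


def scrape_wind_speed(html):
    """Return the leading number of the wind element as a float, or None."""
    parser = IdTextParser("wind")
    parser.feed(html)
    words = parser.text.split()
    if not words:
        return None
    try:
        return float(words[0])
    except ValueError:
        return None


# Hypothetical markup; check the real page source for the actual element.
SAMPLE_WIND = '<p>Wind: <span id="wind">12 mph</span></p>'

if __name__ == "__main__":
    print(scrape_wind_speed(SAMPLE_WIND))
```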
Step 4:
Scraping all the URLs
Writing to a CSV
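A sketch of Step 4, assuming each year's results live at a predictable URL. The example.com pattern, the column names and the sample row are placeholders, not the real site or real results.

```python
import csv
import io

# Hypothetical column names for the output file.
FIELDS = ["year", "registered", "finished", "wind_mph"]


def build_urls(pattern, years):
    """Fill a URL pattern with each year to get one URL per race."""
    return [pattern.format(year=year) for year in years]


def write_rows(rows, out):
    """Write a header line and one CSV line per scraped race."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Placeholder pattern; substitute the real site's URL scheme.
    for url in build_urls("http://example.com/results/{year}.html", range(2009, 2012)):
        print(url)

    # Made-up row standing in for the scraped values.
    buffer = io.StringIO()
    write_rows([{"year": 2010, "registered": 1200, "finished": 1050, "wind_mph": 12.0}], buffer)
    print(buffer.getvalue())
```

To write a real file, pass open("races.csv", "w", newline="") as out; newline="" stops the csv module from producing blank lines on Windows.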
Step 5:
Finished script that scrapes everything