progress_log.txt

* Start with mkdir crawl; git init; mkvirtualenv crawl
* Go to the scrapy homepage -> pip install scrapy, copy the example BlogSpider to a
local file
* Run scrapy runspider my_file.py and try to understand the output
What is the difference between crawled and scraped?
Crawling means walking through the pages, scraping means extracting data from
a page. Ok.
At the end there are some stats.
* The scrapy example passes a css selector to the css method of a
response object (class 'scrapy.http.response.html.HtmlResponse').
The css method returns an object of class
'scrapy.selector.unified.SelectorList', which inherits from list and has
methods like "re" and "extract".
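For reference, a spider in roughly that shape (URL and selectors here are
made up, not the real example):

    import scrapy

    class DemoSpider(scrapy.Spider):
        name = 'demo'
        start_urls = ['http://example.com']  # placeholder URL

        def parse(self, response):
            # response.css() returns a SelectorList
            titles = response.css('h1')
            self.log(titles.extract())              # list of matching HTML snippets
            self.log(titles.re(r'<h1>(.*?)</h1>'))  # regex applied to each match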
* Let's look at quoka.
There are filters on the left. If I click one, the URL does not show any
query parameters (like "...?city=Berlin"), so scrapy will have to help me here.
* First do the crawling, scraping the data is a separate task.
* The filter links look like:
<a rel="nofollow" href="javascript:qsn.set('comm','1',1);">nur Gewerbliche</a>
<a title="'Büros, Gewerbeflächen' Kleinanzeigen aus Berlin" onclick="qsn.changeCity('111756',25, this);
return false;" href="/immobilien/bueros-gewerbeflaechen/berlin/cat_27_2710_ct_111756.html">Berlin</a>
qsn seems to be a "QSN Javascript Websocket Client for Node.js and HTML5 Browsers."
Not important.
Pagination links look like:
<a class="t-pgntn-blue" data-qng-page="5" href="/kleinanzeigen/cat_27_2710_ct_0_page_5.html">5</a>
ct obviously means city; 111756 is the id of Berlin:
http://www.quoka.de/immobilien/bueros-gewerbeflaechen/berlin/cat_27_2710_ct_111756.html
* Cities and pages seem to be straightforward to find.
Find the city ids, get the number of pages, and go through the pages with links like this:
"<a class="t-pgntn-blue" data-qng-page="2" href="/kleinanzeigen/cat_27_2710_ct_111756_page_2.html">2</a>"
* The filter "nur Angebote" seems to be more tricky still.
When clicking on it, firebug show me a POST-request to
http://www.quoka.de/immobilien/bueros-gewerbeflaechen/
or when there is city selected
http://www.quoka.de/immobilien/bueros-gewerbeflaechen/berlin/cat_27_2710_ct_111756.html
with a lot of interesting POST-parameters. Maybe I can use them.
* In order to get started, let's concentrate on crawling the cities/pages first and deal with the "nur
Angebote" filter later.
* Even before that, get the total number of properties.
The TotalSpider does the job. Still have to filter out the "Gesuche" (wanted ads).
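TotalSpider, roughly (selector and regex are placeholders, the real
result-count element still has to be located in the markup):

    import scrapy

    class TotalSpider(scrapy.Spider):
        name = 'total'
        start_urls = ['http://www.quoka.de/immobilien/bueros-gewerbeflaechen/']

        def parse(self, response):
            # Placeholder selector/regex for the "n results" element.
            counts = response.css('.result-count::text').re(r'\d[\d.]*')
            if counts:
                self.log('total: %s' % counts[0])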
* Back to the crawler. Set a css filter and regex for the cities, create a callback that loops over the pages.
The first page in the pagination has a different URL pattern than the following ones.
Use a css selector to distinguish between "Partner-Anzeigen" and regular ads.
And then another callback for the ad pages themselves (sketched below).
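The chain, as a sketch (the pagination selector comes from the links
above; the city and ad selectors are placeholders):

    import scrapy

    class QuokaSpider(scrapy.Spider):
        name = 'quoka'
        start_urls = ['http://www.quoka.de/immobilien/bueros-gewerbeflaechen/']

        def parse(self, response):
            # One request per city (placeholder selector for the city links).
            for href in response.css('.city-filter a::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse_city)

        def parse_city(self, response):
            # Queue pages 2..n via the pagination links, then scrape page 1 itself.
            for href in response.css('a.t-pgntn-blue::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse_page)
            for request in self.parse_page(response):
                yield request

        def parse_page(self, response):
            # Placeholder selector; "Partner-Anzeigen" get skipped here.
            for href in response.css('.ad-link::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse_property)

        def parse_property(self, response):
            pass  # extraction happens here, see Items/ItemLoader below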
* Now create a MySQL table with all the fields given in the pdf. Do it with
MySQL-python -> storage.py
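storage.py, roughly (connection details are placeholders; Stadt is one of
the real columns, the full list comes from the pdf):

    import MySQLdb  # the MySQL-python package

    def create_table():
        conn = MySQLdb.connect(host='localhost', user='crawl', passwd='...',
                               db='crawl', charset='utf8')
        # Only a fragment -- the remaining columns come from the pdf.
        conn.cursor().execute("""
            CREATE TABLE IF NOT EXISTS quoka (
                id INT AUTO_INCREMENT PRIMARY KEY,
                Stadt VARCHAR(255)
            )""")
        conn.close()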
* Use Items and an ItemLoader for the data. First do the regular ads.
ItemLoader returns lists; the docs point to default_output_processor
for getting scalars back.
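E.g., a loader with that set (module paths as in scrapy 1.x):

    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import TakeFirst

    class PropertyLoader(ItemLoader):
        # Without this, load_item() returns a list for every field.
        default_output_processor = TakeFirst()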
* Need to create a MysqlStoragePipeline
And yes, there are some rows in the database :)
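The pipeline, as a minimal sketch (credentials and column list are
placeholders):

    import MySQLdb

    class MysqlStoragePipeline(object):

        def open_spider(self, spider):
            self.conn = MySQLdb.connect(host='localhost', user='crawl',
                                        passwd='...', db='crawl', charset='utf8')

        def process_item(self, item, spider):
            cursor = self.conn.cursor()
            cursor.execute('INSERT INTO quoka (Stadt) VALUES (%s)',
                           (item.get('Stadt'),))
            self.conn.commit()
            return item

        def close_spider(self, spider):
            self.conn.close()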
* Make use of scrapy.Field(input_processor=...) in order to clean the data
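Like this, roughly (Preis and its cleaner are made-up examples; the
ItemLoader picks the processors up from the field metadata):

    from scrapy import Field, Item
    from scrapy.loader.processors import MapCompose

    def strip_price(value):
        # Made-up cleaner: "1.234 EUR" -> "1.234"
        return value.replace('EUR', '').strip()

    class PropertyItem(Item):
        Stadt = Field(input_processor=MapCompose(lambda v: v.strip()))
        Preis = Field(input_processor=MapCompose(strip_price))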
* Back to the filters. A POST request to http://www.quoka.de/immobilien/bueros-gewerbeflaechen/
with the parameter "classtype" = "of" sets the "nur Angebote" filter.
Additionally, "comm" = 0 or 1 sets "nur Private" (private only) or "nur Gewerbe" (commercial only).
* Now I have to deal with scrapy's duplicate filter, because the URLs for comm =
0 and comm = 1 are the same. -> set dont_filter=True on the requests for all parse
functions except parse_property (those have to be unique).
Send the comm value down the chain in the meta dict (sketched below).
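Sketch of both points together, FormRequest plus the meta chain
(parse_page as in the crawler sketch above):

    import scrapy
    from scrapy.http import FormRequest

    class FilterSpider(scrapy.Spider):
        name = 'filters'

        def start_requests(self):
            url = 'http://www.quoka.de/immobilien/bueros-gewerbeflaechen/'
            for comm in ('0', '1'):
                # Same URL for both comm values -> dont_filter=True,
                # the comm value travels along in the meta dict.
                yield FormRequest(url, formdata={'classtype': 'of', 'comm': comm},
                                  meta={'comm': comm}, dont_filter=True,
                                  callback=self.parse_city)

        def parse_city(self, response):
            comm = response.meta['comm']  # read it back, pass it on again
            for href in response.css('a.t-pgntn-blue::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), meta={'comm': comm},
                                     dont_filter=True, callback=self.parse_page)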
* Running into connection timeouts; maybe I crawl too aggressively. Perhaps I need
to throttle the crawl or raise the timeout somewhere (see the settings sketch below).
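Candidate knobs in settings.py (all built-in scrapy settings; the values
are guesses):

    DOWNLOAD_DELAY = 0.5               # seconds between requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 4
    DOWNLOAD_TIMEOUT = 60              # give slow responses more time
    RETRY_TIMES = 3                    # retry timed-out requests
    AUTOTHROTTLE_ENABLED = True        # adapt the delay to server latency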
* The big stuff is complete, now polish the result
- The css selector for "Partner-Anzeigen" doesn't seem right
- I'm missing some results; I only have 2829:
> select Stadt, count(*) from quoka group by Stadt;
gives 875 cities
> select count(*) from quoka where Stadt like "Berlin%";
227
Maybe it's the timeouts, or there is a bug, or 2829 is the real number...
Next run: 3298.
* How can there be duplicates like this?!
DEBUG: Filtered duplicate request: <GET http://www.quoka.de/qpi/immobilien/bueros-gewerbeflaechen/c2710a157428953/klein-fein-mein.html>
* Todolist:
- Find "Partner-Anzeigen" correctly; then add visible data like header
- Get number of objects right
- Phone number and property type
* added property type "Büros, Gewerbeflächen"
* Tried the POST parameter showallresults='true'; still a third of the ~5k properties
are missing. Where should they be?! (stats created by StatsPipeline)
MariaDB [crawl]> select city_category, count(*) from stats group by city_category;
+------------------+----------+
| city_category | count(*) |
+------------------+----------+
| berlin | 383 |
| bonn | 132 |
| bremen | 67 |
| dortmund | 191 |
| dresden | 77 |
| duesseldorf | 126 |
| frankfurt | 247 |
| hamburg | 136 |
| hannover | 121 |
| karlsruhe | 161 |
| koeln | 111 |
| leipzig | 71 |
| ludwigshafen | 260 |
| mannheim | 94 |
| moenchengladbach | 184 |
| muenchen | 217 |
| nuernberg | 125 |
| stuttgart | 209 |
| wiesbaden | 180 |
| wuppertal | 217 |
+------------------+----------+
And too many "Gewerbliche" (commercial) ads:
MariaDB [crawl]> select commercial, count(*) from stats group by commercial;
+------------+----------+
| commercial | count(*) |
+------------+----------+
| 0 | 2020 |
| 1 | 1785 |
+------------+----------+