-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathwebScraping
387 lines (238 loc) · 11.9 KB
/
webScraping
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
WebScraping
***********
WebScraping is the term for using program codes to download and process content/data from web.
modules in Webscraping:
-->webbrowser: fetches webpages.
-->Requests: downloads files and webpages.
-->Beautiful Soup:Parses html
-->Selenium:it launches and controls webbrowser.it can fill in forms and simulate mouse clicks in the browser.
WEBBROWSER MODULE
-----------------
webbrowser's open() method can take a url as arguement and opens that url in the browser.
eg:
import webbrowser
webbrowser.open('https//www.google.com')
result: google is opened on the web browser.
Mapit program
*-----------*
mapit.py---the program to get the address from commandline and open google maps with the retrieved address.
Todo:
--> get the address from command line.
--> or read the clipboard content if there is no command line arguement for mapit
--> use the webbrowser module's open() method to open google maps with the retrieved address from the user.
REQUESTS MODULE
---------------
requests module can be used to download files and webpages.
Downloading a webpage with requests module's get() method
Syntax: requests.get('url')
Note: the return value of the above get() method will be a response object.
eg:
import requests
result=requests.get('www.abc.com/abc.txt')
the result variable in the above code stores the response object.The response object result consists of many details about the website downloaded and the download status of the file/webpage.
-> To check whether the download is successful, we use the response_code variable of requests module
Syntax: responseobject.status_code=requests.codes.ok
Note: the status code for 'ok' is 200 and for file not found is 404
the status code will be 200 if download successful and 404 error if no file is found at that url.
eg:
import requests
result=requests.get('www.abc.com/abc.txt')
if result.status_code==requests.codes.ok:
print(len(result.txt))
print(result.txt)
Controlling what amount of text is printed
------------------------------------------
print(result.text[:N])
where N is the index number in the list to which the characers in the text file must be printed.
eg:
import requests
result=requests.get('www.abc.com/abc.txt')
if result.status_code==requests.codes.ok:
print(len(result.txt))
print(result.txt[:250])
#prints 250 characters of the text file including whitespaces.
RAISE FOR STATUS METHOD
-----------------------
The raise_for_status() method of the request module helps us to raise a status request which returns 'ok' or 200 if download successful and returns 404 when download unsuccessful.
using raise request method inside a try except block which makes the program run smoothly (avoid unnecessary execution breaks) even when there is some problem with downloads.
eg:
try:
result.raise_for_status()
except Exception as exp:
print('exception is %s'%(exp))
it catches the exception and displays it so that the programmer can correct the code.
SAVING DOWNLOADED FILES TO HARDDRIVE
------------------------------------
The downloaded content can be written to a file using the write method of the file in binary mode --'wb'--
eg:
file=open('result.txt','wb')
for chunks in result.iter_content(100000):
file.write(chunks)
file.close()
where iter_content(100000) method of the request module iterates on the content in the result response object that is retrieved using get() method of request module.
iter_content()---returns data present on the content file result as chunks.
chunk: each chunk has the same size in bytes.
NOTE:
---> Chunk is byte datatype.
---> we can specify the size of the chunk that the iter_content returns.this can be done by specifying the size in bytes as the iter_content() method parameter.
Whole program:
--------------
import requests
result=requests.get('https://drive.google.com/file/d/1OTio2-hSqdMM-TjwnVbdJ0dPgUVCv1Pu/view?usp=sharing') #a text file in my google drive
try:
result.raise_for_status()
except Exception as exp:
print('exception is %s'%(exp))
file=open('result.txt','wb')
for chunks in result.iter_content(100000):
file.write(chunks)
file.close()
Result: A file named result.text is created on the same directory where this code is saved before running it.
the result.txt will contain the contents of the file in my google drive.
HTML-Hyper Text Markup Language
-------------------------------
--> HTML is used as a formatting language for webpages in the internet.
--> HTML consists of tags that are words enclosed between angle brackets < and > such tags ie the opening and closing tags enclose text which forms an element also called inner HTML. eg: <h>,<p>,<span>,<a> etc..are tags
<p><strong>Hello</strong>World!!</p>
--> Anchor tags <a>
Anchor tags have attributes such as href which helps to add a hyperlink.
eg:
<a href="url">Python</a>
Result: here the Python acts as a hyperlink and points to the url.
--> We can find html of any part of a webpage using inspect element feature of the webbrowser.
NOTE: It is advised not to parse html using regexes because it atracts more errors and the process is more complex.
Module like Beautiful Soup can be used instead.
Beautiful Soup Module
---------------------
The Beautiful Soup module is used to parse HTML data which are retrieved in the form of strings using the request module.
--> Beautiful Soup module is imported as 'bs4'--Beautiful Soup version 4.
Creating BS4 object
-------------------
--> import bs4 and request modules,
--> create a response object and get the website content from the url using request.get(url)
--> check response_object.raise_for_status
--> create bs4object=bs4.BeautifulSoup(responseobject.txt)------passing the contents retrieved / the text attribute from request.
--> here the bs4object is the variable in which the beautiful soup object is stored and is used to retrieve data by parsing the HTML
--> we can also use a file instead of a webpage by using file operations.
Syntax:
file=open('myfile.html')
bs4object=bs4.BeautifulSoup(file)
Finding an element with SELECT() method
---------------------------------------
The select() method of beautiful soup module takes parameter strings of CSS selectors inorder to retrieve desired elements from the html content.
Some of the CSS selectors are:
--> soup.select('div')------------------------returns the list of div elements
--> soup.select('div span')-------------------returns list of spam element inside div element
--> soup.select('#author')--------------------returns the id attribute of author (#=id)
--> soup.select('.name')----------------------returns the name class (.=class)
--> soup.select('input[name]')----------------returns input element having name attribute with any value.
--> soup.select('div>span')-------------------returns all spam elements inside div elements with no other element in between them.
--> soup.select('input[type=button]')---------returns all input elements having attribute named type and its value- 'button'.
NOTE: The select() method of bs4 module returns a list of tags which then fed into str() method produces strings containing the elements and inner html.
the list generated for the select() method can return attributes in the retrieved data by using 'attrs' which actually returns a dictionary of id's and their values.
getText() method
----------------
The getText() method returns the string to which it is called upon.
NOTE: here T in Text is upper case character
BS4 code:
import bs4,requests
file=open('myfile.html')
bs4object=bs4.BeautifulSoup(file)
elem=bs4object.select('#author')
print(len(elem))
print(str(elem))
print(elem[0].getText())
print(elem[0].attrs)
file.close()
To find the paragraph element <p>
---------------------------------
Code:
import bs4,requests
#requests module is only needed if we are downloading the data from internet
file=open('myfile.html')
bs4obj=bs4.BeautifulSoup(file)
pelement=bs4obj.select('p')
c=len(pelement)
print(c)
i=0
while i<c:
print(str(pelement[i]))
print(pelement[i].getText())
print(pelement[i].attrs)
i=i+1
file.close()
To get data from the element's attributes
-----------------------------------------
Data can be retrieved from element's attributes using the get() method. Passing thr required attribute name to the get() method returns the value assosiated with the attribute name.
code:
import bs4,requests
file=open('myfile.html')
bs4object=bs4.BeautifulSoup(file)
elem=bs4object.select('#author')
print(len(elem))
print(str(elem))
print(elem[0].getText())
print(elem[0].get('id'))
print(elem[0].attrs)
file.close()
PROJECTS on BS4:
================
--------------------------------------------------------------fill later-----------------------------------------------------------------------
CONTROLLING THE BROWSER WITH SELENIUM MODULE (AUTOMATION)
---------------------------------------------------------
Selenium Module is not imported directly as we do for other modules instead we do the following:
Syntax:
from selenium import webdriver #from selenium module, webdriver sub module is loaded
browser=webdriver.Firefox(executable_path=r'/home/killbox/Downloads/geckodriver.exe') #opens up firefox browser
browser.get('https://www.google.com')
FINDING ELEMENTS ON PAGE USING SELENIUM
---------------------------------------
2 types of methods
-> find_element_*
-> find_elements_*
Some of the selenium webdriver methods:
browser.find_element_by_class_name(name)
browser.find_elements_by_class_name(name) # Elements that use the CSS class name
browser.find_element_by_css_selector(selector)
browser.find_elements_by_css_selector(selector) # Elements that match the CSS selector
browser.find_element_by_id(id)
browser.find_elements_by_id(id) # Elements with a matching id attribute value
browser.find_element_by_link_text(text)
browser.find_elements_by_link_text(text) # <a> elements that completely match the text provided
browser.find_element_by_partial_link_text(text)
browser.find_elements_by_partial_link_text(text) # <a> elements that contain the text provided
browser.find_element_by_name(name)
browser.find_elements_by_name(name) # Elements with a matching name attribute value
browser.find_element_by_tag_name(name)
browser.find_elements_by_tag_name(name) # Elements with a matching tag name (case insensitive; an <a> element is matched by 'a' and 'A')
once elements have been found out using web driver's methods, called web element object,
we can get more details on the web element by reading its attributes and calling its methods.
Some of web element's attributes and methods are:
tag_name : The tag name, such as 'a' for an <a> element
get_attribute(name) : The value for the element’s name attribute
text : The text within the element, such as 'hello' in <span>hello</span>
clear() : For text field or text area elements, clears the text typed into it
is_displayed() : Returns True if the element is visible; otherwise returns False
is_enabled() : For input elements, returns True if the element is enabled; otherwise returns False
is_selected() : For checkbox or radio button elements, returns True if the element is selected; otherwise returns False
location : A dictionary with keys 'x' and 'y' for the position of the element in the page
Selenium find tag name program code:
------------------------------------
from selenium import webdriver
browser=webdriver.Firefox(executable_path=r'/home/killbox/Downloads/geckodriver')
browser.get('http://inventwithpython.com')
try:
elemobj=browser.find_element_by_class_name('card')
print('tag found is <%s>' % (elemobj.tag_name))
except:
print('no such line')
CLICKING LINKS ON PAGE USING SELENIUM
-------------------------------------
from selenium import webdriver
browser=webdriver.firefox(executable_path=r'/home/killbox/Downloads/geckodriver')
browser.get('http://inventwithpython.com')
try:
elemobj=browser.find_element_by_link_text('Read It Online')
elemobj.click()
except:
print('no such line')