webScraping

WebScraping
***********

WebScraping is the term for using program codes to download and process content/data from web.
modules in Webscraping:

-->webbrowser: fetches webpages.

-->Requests: downloads files and webpages.

-->Beautiful Soup:Parses html 

-->Selenium:it launches and controls webbrowser.it can fill in forms and simulate mouse clicks in the browser.


WEBBROWSER MODULE
-----------------

webbrowser's open() method can take a url as arguement and opens that url in the browser.

eg:
import webbrowser
webbrowser.open('https//www.google.com')

result: google is opened on the web browser.


Mapit program
*-----------*

mapit.py---the program to get the address from commandline and open google maps with the retrieved address.
Todo:

--> get the address from command line.

--> or read the clipboard content if there is no command line arguement for mapit

--> use the webbrowser module's open() method to open google maps with the retrieved address from the user.


REQUESTS MODULE
---------------

requests module can be used to download files and webpages.

Downloading a webpage with requests module's get() method

Syntax: requests.get('url')

Note: the return value of the above get() method will be a response object.

eg:
import requests
result=requests.get('www.abc.com/abc.txt')

the result variable in the above code stores the response object.The response object result consists of many details about the website downloaded and the download status of the file/webpage.

-> To check whether the download is successful, we use the response_code variable of requests module

Syntax: responseobject.status_code=requests.codes.ok

Note: the status code for 'ok' is 200 and for file not found is 404 
      the status code will be 200 if download successful and 404 error if no file is found at that url.

eg:
import requests
result=requests.get('www.abc.com/abc.txt')
if result.status_code==requests.codes.ok:
   print(len(result.txt))
   print(result.txt)


Controlling what amount of text is printed
------------------------------------------

print(result.text[:N])
where N is the index number in the list to which the characers in the text file must be printed.

eg:
import requests
result=requests.get('www.abc.com/abc.txt')
if result.status_code==requests.codes.ok:
   print(len(result.txt))
   print(result.txt[:250])
#prints 250 characters of the text file including whitespaces.


RAISE FOR STATUS METHOD
-----------------------

The raise_for_status() method of the request module helps us to raise a status request which returns 'ok' or 200 if download successful and returns 404 when download unsuccessful.

using raise request method inside a try except block which makes the program run smoothly (avoid unnecessary execution breaks) even when there is some problem with downloads.
eg:
try:
   result.raise_for_status()
except Exception as exp:
       print('exception is %s'%(exp))
it catches the exception and displays it so that the programmer can correct the code.


SAVING DOWNLOADED FILES TO HARDDRIVE
------------------------------------

The downloaded content can be written to a file using the write method of the file in binary mode --'wb'--

eg:
file=open('result.txt','wb')
for chunks in result.iter_content(100000):
    file.write(chunks)
file.close()

where iter_content(100000) method of the request module iterates on the content in the result response object that is retrieved using get() method of request module.

iter_content()---returns data present on the content file result as chunks.
chunk: each chunk has the same size in bytes.

NOTE:
---> Chunk is byte datatype.
---> we can specify the size of the chunk that the iter_content returns.this can be done by specifying the size in bytes as the iter_content() method parameter.


Whole program:
--------------

import requests
result=requests.get('https://drive.google.com/file/d/1OTio2-hSqdMM-TjwnVbdJ0dPgUVCv1Pu/view?usp=sharing') #a text file in my google drive
try:
   result.raise_for_status()
except Exception as exp:
       print('exception is %s'%(exp))
file=open('result.txt','wb')
for chunks in result.iter_content(100000):
    file.write(chunks)
file.close()

Result: A file named result.text is created on the same directory where this code is saved before running it.
        the result.txt will contain the contents of the file in my google drive.


HTML-Hyper Text Markup Language
-------------------------------

--> HTML is used as a formatting language for webpages in the internet.


--> HTML consists of tags that are words enclosed between angle brackets < and > such tags ie the opening and closing tags enclose text which forms an element also called inner HTML. eg: <h>,<p>,<span>,<a> etc..are tags
<p><strong>Hello</strong>World!!</p>


--> Anchor tags <a>
    Anchor tags have attributes such as href which helps to add a hyperlink.
    eg:
    <a href="url">Python</a>
Result: here the Python acts as a hyperlink and points to the url.


--> We can find html of any part of a webpage using inspect element feature of the webbrowser.


NOTE: It is advised not to parse html using regexes because it atracts more errors and the process is more complex.
      Module like Beautiful Soup can be used instead. 


Beautiful Soup Module
---------------------

The Beautiful Soup module is used to parse HTML data which are retrieved in the form of strings using the request module.

--> Beautiful Soup module is imported as 'bs4'--Beautiful Soup version 4.


Creating BS4 object
-------------------

--> import bs4 and request modules, 


--> create a response object and get the website content from the url using request.get(url)


--> check response_object.raise_for_status


--> create bs4object=bs4.BeautifulSoup(responseobject.txt)------passing the contents retrieved / the text attribute from request.


--> here the bs4object is the variable in which the beautiful soup object is stored and is used to retrieve data by parsing the HTML


--> we can also use a file instead of a webpage by using file operations.

Syntax:
    file=open('myfile.html')
    bs4object=bs4.BeautifulSoup(file)


Finding an element with SELECT() method
---------------------------------------

The select() method of beautiful soup module takes parameter strings of CSS selectors inorder to retrieve desired elements from the html content.

Some of the CSS selectors are:

--> soup.select('div')------------------------returns the list of div elements

--> soup.select('div span')-------------------returns list of spam element inside div element

--> soup.select('#author')--------------------returns the id attribute of author (#=id)

--> soup.select('.name')----------------------returns the name class (.=class)

--> soup.select('input[name]')----------------returns input element having name attribute with any value.

--> soup.select('div>span')-------------------returns all spam elements inside div elements with no other element in between them.

--> soup.select('input[type=button]')---------returns all input elements having attribute named type and its value- 'button'.


NOTE: The select() method of bs4 module returns a list of tags which then fed into str() method produces strings containing the elements and inner html.

the list generated for the select() method can return attributes in the retrieved data by using 'attrs' which actually returns a dictionary of id's and their values.

getText() method
----------------

The getText() method returns the string to which it is called upon.
NOTE: here T in Text is upper case character

BS4 code:

import bs4,requests
file=open('myfile.html')
bs4object=bs4.BeautifulSoup(file)
elem=bs4object.select('#author')
print(len(elem))
print(str(elem))
print(elem[0].getText())
print(elem[0].attrs)
file.close()

To find the paragraph element <p>
---------------------------------

Code:

import bs4,requests
#requests module is only needed if we are downloading the data from internet
file=open('myfile.html')
bs4obj=bs4.BeautifulSoup(file)
pelement=bs4obj.select('p')
c=len(pelement)
print(c)
i=0
while i<c:
   print(str(pelement[i]))
   print(pelement[i].getText())
   print(pelement[i].attrs)
   i=i+1
file.close()


To get data from the element's attributes
-----------------------------------------

Data can be retrieved from element's attributes using the get() method. Passing thr required attribute name to the get() method returns the value assosiated with the attribute name.


code:
import bs4,requests
file=open('myfile.html')
bs4object=bs4.BeautifulSoup(file)
elem=bs4object.select('#author')
print(len(elem))
print(str(elem))
print(elem[0].getText())
print(elem[0].get('id'))
print(elem[0].attrs)
file.close()

PROJECTS on BS4:
================

--------------------------------------------------------------fill later-----------------------------------------------------------------------


CONTROLLING THE BROWSER WITH SELENIUM MODULE (AUTOMATION)
---------------------------------------------------------

Selenium Module is not imported directly as we do for other modules instead we do the following:

Syntax:
from selenium import webdriver   #from selenium module, webdriver sub module is loaded
browser=webdriver.Firefox(executable_path=r'/home/killbox/Downloads/geckodriver.exe')   #opens up firefox browser 
browser.get('https://www.google.com')


FINDING ELEMENTS ON PAGE USING SELENIUM
---------------------------------------

2 types of methods 

-> find_element_*

-> find_elements_*


Some of the selenium webdriver methods:


browser.find_element_by_class_name(name)
browser.find_elements_by_class_name(name)     # Elements that use the CSS class name

browser.find_element_by_css_selector(selector)
browser.find_elements_by_css_selector(selector)  # Elements that match the CSS selector

browser.find_element_by_id(id)
browser.find_elements_by_id(id)     # Elements with a matching id attribute value

browser.find_element_by_link_text(text)
browser.find_elements_by_link_text(text)    # <a> elements that completely match the text provided

browser.find_element_by_partial_link_text(text)
browser.find_elements_by_partial_link_text(text)    # <a> elements that contain the text provided

browser.find_element_by_name(name)
browser.find_elements_by_name(name)    # Elements with a matching name attribute value

browser.find_element_by_tag_name(name)
browser.find_elements_by_tag_name(name)     # Elements with a matching tag name (case insensitive; an <a> element is matched by 'a' and 'A')


once elements have been found out using web driver's methods, called web element object, 

we can get more details on the web element by reading its attributes and calling its methods. 


Some of web element's attributes and methods are:


tag_name  :  The tag name, such as 'a' for an <a> element

get_attribute(name)  :  The value for the element’s name attribute

text  :  The text within the element, such as 'hello' in <span>hello</span>

clear()  :  For text field or text area elements, clears the text typed into it

is_displayed()  :  Returns True if the element is visible; otherwise returns False

is_enabled()  :  For input elements, returns True if the element is enabled; otherwise returns False

is_selected()  :  For checkbox or radio button elements, returns True if the element is selected; otherwise returns False

location  :  A dictionary with keys 'x' and 'y' for the position of the element in the page


Selenium find tag name program code:
------------------------------------

from selenium import webdriver
browser=webdriver.Firefox(executable_path=r'/home/killbox/Downloads/geckodriver')
browser.get('http://inventwithpython.com')
try:
   elemobj=browser.find_element_by_class_name('card')
   print('tag found is <%s>' % (elemobj.tag_name))
except:
   print('no such line')


CLICKING LINKS ON PAGE USING SELENIUM
-------------------------------------

from selenium import webdriver
browser=webdriver.firefox(executable_path=r'/home/killbox/Downloads/geckodriver')
browser.get('http://inventwithpython.com')
try:
   elemobj=browser.find_element_by_link_text('Read It Online')
   elemobj.click()
except:
   print('no such line')