- Identify situations where data scraping would be beneficial
- Understand the methods and legality of data scraping
- Use modules such as Cheerio to scrape data from the web
Scraping (Screen Scraping, Web Data Extraction, Web Harvesting, etc) refers to the process of requesting an HTML page and picking out relevant data from the document string. In other words, you can scrape content off of web pages by parsing the html.
- no API available
- API is unreliable/unkept/etc.
- no fee or call limit (unless a rate-limit is set up)
- more anonymous than getting data through dev resources
- need to log into site in order to access desired data
- organization of data makes it hard/laborious to access
- web page structure frequently changes (program that uses scraped data would need constant updates)
- copywrite and other legal issues
NOTE: The legality of data scraping and using a site's data may depend on a site's terms of use. Scraping a site and using the data for profit may violate a site's terms of use, so be careful before scraping a site. This is not legal advice, and we are not lawyers, but we recommend that you contact a lawyer if you want to scrape data for a for-profit application.
Let's try creating a program that will scrape neighborhood data from this site:
https://visitseattle.org/partners/?frm=partners&ptype=visitors-guide&s=&neighborhood=Capitol+Hill
To get started, create a new folder, and initialize npm. We'll also want to install two modules:
axios
- for accessing external resources via HTTPcheerio
- essentially, this is server side jQuery. We will be using this to traverse the data we get back from our request.
There are multiple ways to get an HTML document, but we'll use the axios
module in this example. To scrape data from the site, we need to request the webpage. In a getbusinesses.js
file, import axios
, then make a request to the Seattle Neigbhborhoods website.
const axios = require('axios');
const URL = 'https://visitseattle.org/partners/?frm=partners&ptype=visitors-guide&s=&neighborhood=Capitol+Hill';
axios.get(URL).then(res => {
console.log(res.data)
})
Run the program and take a look at your output. What did the request return?
Look over the Cheerio Documentation - for more info about our next steps.
The request to the seattle neighborhoods url gave us the entire HTML document string - now we need to parse it in order to pick out the specific data we're looking for. This is where Cheerio comes in! Import Cheerio to your getbusinesses.js
:
const cheerio = require('cheerio');
Inside the callback function of request, we'll pass the html we got back into the cheerio.load()
function. We store the result, which is a cheerio object, in the dollar sign variable because cheerio is designed to mimic jQuery selectors (though technically, we could store it in any variable we'd like).
axios.get(URL).then(res => {
let $ = cheerio.load(res.data);
console.log($);
});
Run the program and take a look at the cheerio
object. How might we find the html again? Does the cheerio
object contain a method for this?
- First you have to identify what content you're looking to scrape and how to access it. Let's aim to scrape the names of all the busineses listed on this page of Capitol Hill businesses. Open the dev tools and inspect the page to see if you can pinpoint the elements that have the relvant information for each result.
Upon some inspection, we can see that the results live inside of a search-results
div
, which contains a search-results-partner
section
element. Inside that there is a search-result-container
div
that has a search-result
div
for each result.
Let's try to target the name of the first business. Inspect the first search-result
div
and identify exactly where the business name lives.
The business name can be found inside the search-result-preview
div
as both the title
of the nested a
tag, as well as the h3
child of that anchor.
First let's grab the first search-result-preview
element. Cheerio uses jQuery selectors to identify elements.
axios.get(URL).then(res => {
let $ = cheerio.load(res.data);
let result = $('.search-result-preview').html();
console.log(result);
});
Now let's target the title
attribute:
axios.get(URL).then(res => {
let $ = cheerio.load(res.data);
let result = $('.search-result-preview').find('a').attr('title');
console.log(result);
});
Great! Now we know how to find the title of one business, but how do we get all of them?
Cheerio actually gives us the option of selecting the first or all of the elements that match the selector. Let's take a closer look at ('.search-result-preview')
by getting it's length
:
axios.get(URL).then(res => {
let $ = cheerio.load(res.data);
let result = $('.search-result-preview');
console.log(result.length);
});
It looks like that result object actually contains all of the results on the page! Cheerio has iterators for traversing cheerio objects like this. Let's use the each
iterator, which functions similarly to the javascript Array.forEach()
:
axios.get(URL).then(res => {
let $ = cheerio.load(res.data);
let results = $('.search-result-preview');
results.each((index, element) => {
console.log($(element).find('a').attr('title'));
});
});
Logging to the console is great, but in practice, we'll likely want to store all of these titles in an an array-like cheerio object. We can use the built in .map()
iterator to pull out just the titles and store them in their own cheerio object:
axios.get(URL).then(res => {
let $ = cheerio.load(res.data);
let results = $('.search-result-preview');
let resultTitles = results.map((index, element) => {
return $(element).find('a').attr('title');
});
console.log(resultTitles);
});
This still gives us a lot of gobbledy-gook we didn't ask for. Use the .get()
function after .map()
to see an array of exactly what we asked for (see docs -> traversing -> get):
axios.get(URL).then(res => {
let $ = cheerio.load(res.data);
let results = $('.search-result-preview');
let resultTitles = results.map((index, element) => {
return $(element).find('a').attr('title');
});
console.log(resultTitles.get());
});
What if we want more than just the name of the businesses? Let's modify our code to also store the URL for the image associated with the result:
axios.get(URL).then(res => {
let $ = cheerio.load(res.data);
let results = $('.search-result-preview');
let filteredResults = results.map((index, element) => {
return {
title: $(element).find('a').attr('title'),
img: [ INSERT YOUR CODE HERE ]
}
});
console.log(filteredResults.get());
});
$(element).find('.image-container').attr('style');
Now you need to modify the string to isolate the url!
axios.get(URL).then(res => {
let $ = cheerio.load(res.data);
let results = $('.search-result-preview');
let filteredResults = results.map((index, element) => {
let imgurl = $(element).find('.image-container').attr('style');
imgurl = imgurl.substring(22, imgurl.length-15);
return {
title: $(element).find('a').attr('title'),
img: imgurl
}
});
console.log(filteredResults.get());
});
Now you need to modify the string to isolate the url!