
How to Crawl a Website Using Web Crawler?


Web Crawler is a built-in feature of our Scraper APIs. It’s a tool used to discover target URLs, select the relevant content, and have it delivered in bulk. It crawls websites in real-time and at scale to quickly deliver all content or only the data you need based on your chosen criteria.

What can Web Crawler do?

There are three main tasks Web Crawler can do:

  • Perform URL discovery;

  • Crawl all pages on a site;

  • Index all URLs on a domain.

Use it when you need to crawl a site and receive parsed data in bulk, or when you need to collect a list of URLs in a specific category or from an entire website.

There are three data output types you can receive when using Web Crawler: a list of URLs, parsed results, and HTML files. If needed, you can set Web Crawler to upload the results to your cloud storage.

Web Crawler settings overview

You can easily control the crawling scope by adjusting its width and depth with filters. Web Crawler can also use various scraping parameters, such as geo-location and user agent, to increase the success rate of crawling jobs. Most of these scraping parameters depend on the Scraper API you use.

Endpoints

To control your crawling jobs, you'll use a handful of endpoints: you can initiate, stop, and resume a job, get job info, get the list of result chunks, and get the results. Below are the endpoints we'll use in this tutorial. For more information and output examples, visit our documentation.

Create a new job

  • Endpoint: https://ect.oxylabs.io/v1/jobs

  • Method: POST

  • Authentication: Basic

  • Request headers: Content-Type: application/json

Get sitemap

This endpoint will deliver the list of URLs found while processing the job.

  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/sitemap

  • Method: GET

  • Authentication: Basic

Get the list of aggregate result chunks

  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate

  • Method: GET

  • Authentication: Basic

The aggregate results can consist of a lot of data, so we split them into multiple chunks based on the chunk size you specify. Use this endpoint to get a list of chunk files available.
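
As a quick sketch in Python, requesting the chunk list could look like this (USERNAME, PASSWORD, and the job ID are placeholders; full walkthroughs follow later in this tutorial):

import requests
from pprint import pprint

# Get the list of result chunks available for a finished job.
chunk_list = requests.request(
    'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}/aggregate',  # Replace {id} with the job ID.
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
)

# Print the decoded JSON response.
pprint(chunk_list.json())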

Get a chunk of the aggregate result

  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk}

  • Method: GET

  • Authentication: Basic

With this endpoint, you can download a particular chunk of the aggregate result. The contents of the response body depend on the output type you choose.

The result can be one of the following:

  • An index (a list of URLs)

  • An aggregate JSON file with all parsed results

  • An aggregate JSON file with all HTML results

Query parameters

For your convenience, we’ve put all the available parameters you can use in the table below. It can also be found in our documentation.

| Parameter | Description | Default Value |
|---|---|---|
| url | The URL of the starting point. | - |
| filters | These parameters are used to configure the breadth and depth of the crawling job, as well as to determine which URLs should be included in the end result. See this section for more information. | - |
| filters:crawl | Specifies which URLs Web Crawler will include in the end result. See this section for more information. | - |
| filters:process | Specifies which URLs Web Crawler will scrape. See this section for more information. | - |
| filters:max_depth | Determines the maximum length of URL chains Web Crawler will follow. See this section for more information. | 1 |
| scrape_params | These parameters are used to fine-tune the way we perform the scraping jobs. For instance, you may want us to execute JavaScript while crawling a site, or you may prefer us to use proxies from a particular location. | - |
| scrape_params:source | See this section for more information. | - |
| scrape_params:geo_location | The geographical location that the result should be adapted for. See this section for more information. | - |
| scrape_params:user_agent_type | Device type and browser. See this section for more information. | desktop |
| scrape_params:render | Enables JavaScript rendering. Use when the target requires JavaScript to load content. To use this feature, set the parameter value to html. See this section for more information. | - |
| output:type_ | The output type. We can return a sitemap (a list of URLs found while crawling) or an aggregate file containing HTML results or parsed data. See this section for more information. | - |
| upload | These parameters are used to describe the cloud storage location where you would like us to put the result once we're done. See this section for more information. | - |
| upload:storage_type | The cloud storage type. The only valid value is s3 (for AWS S3). gcs (for Google Cloud Storage) is coming soon. | - |
| upload:storage_url | The storage bucket URL. | - |

Using these parameters is straightforward, as you can pass them with the request payload. Below, you'll find a sample payload, followed by complete code examples in Python.
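
As a rough illustration (not an exhaustive configuration), the payload below combines the filters, scrape_params, output, and optional upload parameter groups; the starting URL and the storage bucket URL are placeholders you'd replace with your own:

payload = {
    "url": "https://www.example.com/",  # Placeholder starting point.
    "filters": {
        "crawl": [".*"],    # Follow every URL found.
        "process": [".*"],  # Include every URL in the end result.
        "max_depth": 1
    },
    "scrape_params": {
        "source": "universal",
        "user_agent_type": "desktop"
    },
    "output": {
        "type_": "sitemap"
    },
    "upload": {  # Optional: deliver results to your cloud storage.
        "storage_type": "s3",
        "storage_url": "YOUR_BUCKET_URL"  # Placeholder bucket URL.
    }
}

The upload block is only needed if you want the results delivered to your cloud storage; otherwise, you can omit it.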

Using Web Crawler in Postman

For simplicity, you can use Postman to make crawling requests. Download this Postman collection to try out all the endpoints of Web Crawler. Here’s a step-by-step video tutorial you can follow:

How to Crawl a Website: Step-by-step Guide

Using Web Crawler in Python

To make HTTP requests in Python, we’ll use the Requests library. Install it by entering the following in your terminal:

pip install requests

To deal with HTML results, we'll use the BeautifulSoup4 library to parse them and make them more readable. This step is optional; if you want to follow along, install the library with:

pip install beautifulsoup4

Getting a list of URLs

In the following example, we set the output type_ to sitemap to create a job that crawls the Amazon homepage and gets a list of URLs found within the starting page. With the crawl and process parameters set to ".*", Web Crawler will follow and return any Amazon URL. These two parameters use regular expressions (regex) to determine which URLs should be crawled and processed. Be sure to visit our documentation for more details and useful resources.

We don’t need to include the source parameter because we aren’t scraping content from the URLs yet. Using the json module, we write the data into a .json file, and then, with the pprint module, we print the structured content. Let’s see the example:

import requests, json
from pprint import pprint

# Set the content type to JSON.
headers = {"Content-Type": "application/json"}

# Crawl all URLs inside the target URL.
payload = {
    "url": "https://www.amazon.com/",
    "filters": {
        "crawl": [".*"],
        "process": [".*"],
        "max_depth": 1
    },
    "scrape_params": {
        "user_agent_type": "desktop",
    },
    "output": {
        "type_": "sitemap"
    }
}

# Create a job and store the JSON response.
response = requests.request(
    'POST',
    'https://ect.oxylabs.io/v1/jobs',
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
    headers=headers,
    json=payload,
)

# Write the decoded JSON response to a .json file.
with open('job_sitemap.json', 'w') as f:
    json.dump(response.json(), f)

# Print the decoded JSON response.
pprint(response.json())

Depending on the request size, the process might take some time. You can make sure the job is finished by checking the job information, as sketched below.
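
A rough sketch of checking job information is shown here; it assumes the job-info endpoint follows the https://ect.oxylabs.io/v1/jobs/{id} pattern, so check our documentation for the exact path and response fields:

import requests
from pprint import pprint

# Check job information to see whether the crawl has finished.
job_info = requests.request(
    'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}',  # Replace {id} with the job ID; assumed endpoint pattern.
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
)

# Print the decoded JSON response.
pprint(job_info.json())

When the job is done, send another request to the sitemap endpoint https://ect.oxylabs.io/v1/jobs/{id}/sitemap to return a list of URLs. For example: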

import requests, json
from pprint import pprint

# Store the JSON response containing URLs (sitemap).
sitemap = requests.request(
    'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}/sitemap',  # Replace {id} with the job ID.
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
)

# Write the decoded JSON response to a .json file.
with open('sitemap.json', 'w') as f:
    json.dump(sitemap.json(), f)

# Print the decoded JSON response.
pprint(sitemap.json())

Getting parsed results

To get parsed content, set the output type_ parameter to parsed. In the example below, we crawl all URLs found on this Amazon page and then parse the content of each URL. This time, we're using the amazon source since we're scraping content from the specified Amazon page. Let's see all of this put together in Python:

import requests, json
from pprint import pprint

# Set the content type to JSON.
headers = {"Content-Type": "application/json"}

# Parse content from the URLs found in the target URL.
payload = {
    "url": "https://www.amazon.com/s?i=electronics-intl-ship&bbn=16225009011&rh=n%3A502394%2Cn%3A281052&dc&qid"
           "=1679564333&rnid=502394&ref=sr_pg_1",
    "filters": {
        "crawl": [".*"],
        "process": [".*"],
        "max_depth": 1
    },
    "scrape_params": {
        "source": "amazon",
        "user_agent_type": "desktop"
    },
    "output": {
        "type_": "parsed"
    }
}

# Create a job and store the JSON response.
response = requests.request(
    'POST',
    'https://ect.oxylabs.io/v1/jobs',
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
    headers=headers,
    json=payload,
)

# Write the decoded JSON response to a .json file.
with open('job_parsed.json', 'w') as f:
    json.dump(response.json(), f)

# Print the decoded JSON response.
pprint(response.json())

Note that if you want to use the geo_location parameter when scraping Amazon pages, you must set its value to the preferred location's zip/postal code. For more information, visit this page in our documentation.
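
For instance, adding geo_location to the scrape_params of the previous payload could look like the snippet below; 10001 is just an example zip code, not a recommendation:

# Adapt the results for a specific location, specified as a zip/postal code (example value).
payload["scrape_params"]["geo_location"] = "10001"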

Once the job is complete, you can check how many chunks your request has generated and then download the content from each chunk with this endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk}. For instance, with the following code snippet, we’re printing the first chunk:

import requests, json
from pprint import pprint

# Store the JSON response containing parsed results.
parsed_results = requests.request(
    'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}/aggregate/1',  # Replace {id} with the job ID.
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
)

# Write the decoded JSON response to a .json file.
with open('parsed_results_1.json', 'w') as f:
    json.dump(parsed_results.json(), f)

# Print the decoded JSON response.
pprint(parsed_results.json())
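
The example above downloads only the first chunk. If a job produces several chunks, you can loop over them; here's a minimal sketch, assuming you've already checked the number of chunks via the aggregate endpoint:

import requests, json

job_id = 'YOUR_JOB_ID'   # The job ID returned when the job was created.
number_of_chunks = 3     # Set this after checking the chunk list via the aggregate endpoint.

# Download and save every chunk of the aggregate result.
for chunk in range(1, number_of_chunks + 1):
    chunk_response = requests.request(
        'GET',
        f'https://ect.oxylabs.io/v1/jobs/{job_id}/aggregate/{chunk}',
        auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
    )
    with open(f'parsed_results_{chunk}.json', 'w') as f:
        json.dump(chunk_response.json(), f)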

Getting HTML results

The code to get HTML results doesn't differ much from the code in the previous section. The differences are that we've set the type_ parameter to html and switched the source to universal. Let's see the code sample:

import requests, json
from pprint import pprint

# Set the content type to JSON.
headers = {"Content-Type": "application/json"}

# Index HTML results of URLs found in the target URL. 
payload = {
    "url": "https://www.amazon.com/s?i=electronics-intl-ship&bbn=16225009011&rh=n%3A502394%2Cn%3A281052&dc&qid"
           "=1679564333&rnid=502394&ref=sr_pg_1",
    "filters": {
        "crawl": [".*"],
        "process": [".*"],
        "max_depth": 1
    },
    "scrape_params": {
        "source": "universal",
        "user_agent_type": "desktop"
    },
    "output": {
        "type_": "html"
    }
}

# Create a job and store the JSON response.
response = requests.request(
    'POST',
    'https://ect.oxylabs.io/v1/jobs',
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here
    headers=headers,
    json=payload,
)

# Write the decoded JSON response to a .json file.
with open('job_html.json', 'w') as f:
    json.dump(response.json(), f)

# Print the decoded JSON response.
pprint(response.json())

Again, you'll need to make a request to retrieve each chunk of the result. We'll use the BeautifulSoup4 library to parse and prettify the HTML, though this step is optional, and then write the content to an .html file. The code example below downloads content from the first chunk:

import requests
from bs4 import BeautifulSoup

# Store the JSON response containing HTML results.
html_response = requests.request(
    'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}/aggregate/1',  # Replace {id} with the job ID.
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
)

# Parse the HTML content.
soup = BeautifulSoup(html_response.content, 'html.parser')
html_results = soup.prettify()

# Write the HTML results to an .html file.
with open('html_results.html', 'w') as f:
    f.write(html_results)

# Print the HTML results.
print(html_results)

You can modify these code samples as needed to fit your requirements.

This tutorial covered the fundamental aspects of using Web Crawler. We recommend looking at our documentation for more information on using the endpoints and query parameters. In case you have any questions, you can always contact us at [email protected] or via live chat on our website.