Merge branch 'master' into new-api-docs
B4nan authored Nov 29, 2024
2 parents 6b965e2 + 868b576 commit e9088e6
Showing 12 changed files with 793 additions and 28 deletions.
.github/styles/Apify/Apify.yml (2 changes: 1 addition & 1 deletion)
@@ -5,6 +5,6 @@ level: warning
swap:
  Apify Dashboard: Apify Console
  apify freelancers: Apify freelancers
-  Apify Platfrom: Apify platform
+  Apify Platform: Apify platform
  '(?:[Tt]he\s)?[Aa]pify\sproxy': Apify Proxy
  circa: approx.
@@ -140,12 +140,12 @@ Letting our program visibly crash on error is enough for our purposes. Now, let'

<Exercises />

-### Scrape Amazon
+### Scrape AliExpress

-Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with Amazon search results:
+Download HTML of a product listing page, but this time from a real-world e-commerce website. For example, this page with AliExpress search results:

```text
-https://www.amazon.com/s?k=darth+vader
+https://www.aliexpress.com/w/wholesale-darth-vader.html
```

<details>
@@ -154,13 +154,12 @@
```py
import httpx

url = "https://www.amazon.com/s?k=darth+vader"
url = "https://www.aliexpress.com/w/wholesale-darth-vader.html"
response = httpx.get(url)
response.raise_for_status()
print(response.text)
```

-If you get `Server error '503 Service Unavailable'`, that's just Amazon's anti-scraping protections. You can learn about how to overcome those in our [Anti-scraping protections](../anti_scraping/index.md) course.
</details>

### Save downloaded HTML as a file
@@ -122,6 +122,14 @@ for product in soup.select(".product-item"):

This program does the same as the one we already had, but its code is more concise.
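
That concise program sits outside this hunk. A hedged reconstruction of the pattern it uses, with hypothetical markup and class names standing in for the lesson's real listing page:

```py
from bs4 import BeautifulSoup

# Hypothetical markup in the spirit of the lesson's product listing
html = """
<div class="product-item"><a class="product-item__title">Sofa</a><span class="price">$74.00</span></div>
<div class="product-item"><a class="product-item__title">Lamp</a><span class="price">$29.00</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

for product in soup.select(".product-item"):
    # select_one() returns the first matching element, or None
    title = product.select_one(".product-item__title").text
    price = product.select_one(".price").text
    print(title, price)
```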

+:::note Fragile code
+
+We assume that the selectors we pass to the `select()` or `select_one()` methods return at least one element. If they don't, calling `[0]` on an empty list or `.text` on `None` would crash the program. If you run a type checker on your Python program, the code examples above may even trigger warnings about this.
+
+Not handling these cases allows us to keep the code examples more succinct. Additionally, if we expect the selectors to return elements but they suddenly don't, it usually means the website has changed since we wrote our scraper. Letting the program crash in such cases is a valid way to notify ourselves that we need to fix it.
+
+:::
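
For contrast, a defensive variant of the same lookup is sketched below; the guard trades brevity for an explicit error, and the class names are again hypothetical:

```py
from bs4 import BeautifulSoup

html = '<div class="product-item"><a class="product-item__title">Sofa</a></div>'
soup = BeautifulSoup(html, "html.parser")

title = soup.select_one(".product-item__title")
# select_one() returns None when nothing matches, so check before
# touching .text instead of letting the scraper crash on AttributeError
if title is None:
    raise ValueError("unexpected markup: no product title found")
print(title.text)
```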

## Precisely locating price

In the output we can see that the price isn't located precisely. For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks like this:
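
The snippet itself is collapsed in this hunk, but judging by the paragraph above, the price element mixes a "Sale price" label with the amount. A hedged sketch of markup in that spirit, and one way to pick out only the text node carrying the amount:

```py
from bs4 import BeautifulSoup

# Hypothetical markup: a hidden label followed by the bare amount
html = """
<span class="price">
  <span class="visually-hidden">Sale price</span>
  $74.00
</span>
"""
soup = BeautifulSoup(html, "html.parser")

price = soup.select_one(".price")
# .contents lists direct children; the last one here is the text node
# with the amount, so the "Sale price" label is skipped
print(price.contents[-1].strip())  # $74.00
```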
@@ -199,8 +199,12 @@ def export_json(file, data):
    json.dump(data, file, default=serialize, indent=2)

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-data = [parse_product(product) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)
+
+data = []
+for product in listing_soup.select(".product-item"):
+    item = parse_product(product)
+    data.append(item)

with open("products.csv", "w") as file:
    export_csv(file, data)
@@ -209,7 +213,7 @@ with open("products.json", "w") as file:
    export_json(file, data)
```
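
The `default=serialize` argument above implies a custom serializer for values the `json` module can't encode natively; the lesson's prices are presumably `Decimal`s, so a sketch under that assumption:

```py
import json
from decimal import Decimal

# Hypothetical serializer: json.dump() calls it for any value
# it doesn't know how to encode natively
def serialize(obj):
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError(f"{type(obj)} is not JSON serializable")

print(json.dumps({"price": Decimal("74.00")}, default=serialize, indent=2))
```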

-The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).
+The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with one that only takes up four lines of code.
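
Both phrasings describe the same refactoring: the four-line loop and the list comprehension are interchangeable, as this self-contained sketch with a stand-in `parse_product()` shows:

```py
def parse_product(product):
    # Stand-in for the lesson's real parsing function
    return {"name": product.upper()}

products = ["sofa", "lamp", "chair"]

# The explicit four-line loop...
data = []
for product in products:
    item = parse_product(product)
    data.append(item)

# ...collapses to a single list comprehension:
data = [parse_product(product) for product in products]
print(data)
```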

:::tip Refactoring

@@ -300,9 +304,13 @@ Now we'll pass the base URL to the function in the main body of our program:

```py
listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-# highlight-next-line
-data = [parse_product(product, listing_url) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)
+
+data = []
+for product in listing_soup.select(".product-item"):
+    # highlight-next-line
+    item = parse_product(product, listing_url)
+    data.append(item)
```
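
Presumably `parse_product()` uses the passed-in `listing_url` as a base for resolving relative product links; the standard-library `urljoin()` is the usual tool, sketched here with a hypothetical href:

```py
from urllib.parse import urljoin

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
href = "/products/sony-sacs9-active-subwoofer"  # hypothetical relative link

# urljoin() resolves the relative href against the page it came from
print(urljoin(listing_url, href))
# https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-active-subwoofer
```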

When we run the scraper now, we should see full URLs in our exports:
