Merge branch 'master' into new-api-docs
B4nan authored Nov 29, 2024
2 parents 6b965e2 + 868b576 commit e9088e6
Showing 12 changed files with 793 additions and 28 deletions.
.github/styles/Apify/Apify.yml (2 changes: 1 addition & 1 deletion)
@@ -5,6 +5,6 @@ level: warning
swap:
  Apify Dashboard: Apify Console
  apify freelancers: Apify freelancers
-  Apify Platfrom: Apify platform
+  Apify Platform: Apify platform
  '(?:[Tt]he\s)?[Aa]pify\sproxy': Apify Proxy
  circa: approx.
@@ -140,12 +140,12 @@ Letting our program visibly crash on error is enough for our purposes. Now, let'

<Exercises />

-### Scrape Amazon
+### Scrape AliExpress

-Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with Amazon search results:
+Download HTML of a product listing page, but this time from a real-world e-commerce website. For example, this page with AliExpress search results:

```text
-https://www.amazon.com/s?k=darth+vader
+https://www.aliexpress.com/w/wholesale-darth-vader.html
```

<details>
@@ -154,13 +154,12 @@
```py
import httpx

url = "https://www.amazon.com/s?k=darth+vader"
url = "https://www.aliexpress.com/w/wholesale-darth-vader.html"
response = httpx.get(url)
response.raise_for_status()
print(response.text)
```

-If you get `Server error '503 Service Unavailable'`, that's just Amazon's anti-scraping protections. You can learn about how to overcome those in our [Anti-scraping protections](../anti_scraping/index.md) course.
</details>

### Save downloaded HTML as a file
@@ -122,6 +122,14 @@ for product in soup.select(".product-item"):

This program does the same as the one we already had, but its code is more concise.
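
That concise program sits outside this hunk. A hedged reconstruction of the pattern it uses, with hypothetical markup and class names standing in for the lesson's real listing page:

```py
from bs4 import BeautifulSoup

# Hypothetical markup in the spirit of the lesson's product listing
html = """
<div class="product-item"><a class="product-item__title">Sofa</a><span class="price">$74.00</span></div>
<div class="product-item"><a class="product-item__title">Lamp</a><span class="price">$29.00</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

for product in soup.select(".product-item"):
    # select_one() returns the first matching element, or None
    title = product.select_one(".product-item__title").text
    price = product.select_one(".price").text
    print(title, price)
```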

+:::note Fragile code
+
+We assume that the selectors we pass to the `select()` or `select_one()` methods return at least one element. If they don't, calling `[0]` on an empty list or `.text` on `None` would crash the program. If you run a type checker on your Python program, the code examples above may even trigger warnings about this.
+
+Not handling these cases allows us to keep the code examples more succinct. Additionally, if we expect the selectors to return elements but they suddenly don't, it usually means the website has changed since we wrote our scraper. Letting the program crash in such cases is a valid way to notify ourselves that we need to fix it.
+
+:::
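
For contrast, a defensive variant of the same lookup is sketched below; the guard trades brevity for an explicit error, and the class names are again hypothetical:

```py
from bs4 import BeautifulSoup

html = '<div class="product-item"><a class="product-item__title">Sofa</a></div>'
soup = BeautifulSoup(html, "html.parser")

title = soup.select_one(".product-item__title")
# select_one() returns None when nothing matches, so check before
# touching .text instead of letting the scraper crash on AttributeError
if title is None:
    raise ValueError("unexpected markup: no product title found")
print(title.text)
```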

## Precisely locating price

In the output we can see that the price isn't located precisely. For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks like this:
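
The snippet itself is collapsed in this hunk, but judging by the paragraph above, the price element mixes a "Sale price" label with the amount. A hedged sketch of markup in that spirit, and one way to pick out only the text node carrying the amount:

```py
from bs4 import BeautifulSoup

# Hypothetical markup: a hidden label followed by the bare amount
html = """
<span class="price">
  <span class="visually-hidden">Sale price</span>
  $74.00
</span>
"""
soup = BeautifulSoup(html, "html.parser")

price = soup.select_one(".price")
# .contents lists direct children; the last one here is the text node
# with the amount, so the "Sale price" label is skipped
print(price.contents[-1].strip())  # $74.00
```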
@@ -199,8 +199,12 @@ def export_json(file, data):
    json.dump(data, file, default=serialize, indent=2)

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-data = [parse_product(product) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)
+
+data = []
+for product in listing_soup.select(".product-item"):
+    item = parse_product(product)
+    data.append(item)

with open("products.csv", "w") as file:
    export_csv(file, data)
@@ -209,7 +213,7 @@ with open("products.json", "w") as file:
    export_json(file, data)
```
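
The `default=serialize` argument above implies a custom serializer for values the `json` module can't encode natively; the lesson's prices are presumably `Decimal`s, so a sketch under that assumption:

```py
import json
from decimal import Decimal

# Hypothetical serializer: json.dump() calls it for any value
# it doesn't know how to encode natively
def serialize(obj):
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError(f"{type(obj)} is not JSON serializable")

print(json.dumps({"price": Decimal("74.00")}, default=serialize, indent=2))
```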

-The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).
+The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with one that only takes up four lines of code.
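
Both phrasings describe the same refactoring: the four-line loop and the list comprehension are interchangeable, as this self-contained sketch with a stand-in `parse_product()` shows:

```py
def parse_product(product):
    # Stand-in for the lesson's real parsing function
    return {"name": product.upper()}

products = ["sofa", "lamp", "chair"]

# The explicit four-line loop...
data = []
for product in products:
    item = parse_product(product)
    data.append(item)

# ...collapses to a single list comprehension:
data = [parse_product(product) for product in products]
print(data)
```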

:::tip Refactoring

@@ -300,9 +304,13 @@ Now we'll pass the base URL to the function in the main body of our program:

```py
listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-# highlight-next-line
-data = [parse_product(product, listing_url) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)
+
+data = []
+for product in listing_soup.select(".product-item"):
+    # highlight-next-line
+    item = parse_product(product, listing_url)
+    data.append(item)
```
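
Presumably `parse_product()` uses the passed-in `listing_url` as a base for resolving relative product links; the standard-library `urljoin()` is the usual tool, sketched here with a hypothetical href:

```py
from urllib.parse import urljoin

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
href = "/products/sony-sacs9-active-subwoofer"  # hypothetical relative link

# urljoin() resolves the relative href against the page it came from
print(urljoin(listing_url, href))
# https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-active-subwoofer
```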

When we run the scraper now, we should see full URLs in our exports:
