 | 1 | +.. _`advanced-requests`:  | 
 | 2 | + | 
 | 3 | +===================  | 
 | 4 | +Additional Requests  | 
 | 5 | +===================  | 
 | 6 | + | 
 | 7 | +Websites nowadays need a lot of page interactions to display or load some key  | 
 | 8 | +information. In most cases, these are done via AJAX requests. Some examples are:  | 
 | 9 | + | 
 | 10 | +    * Clicking a button on a page to reveal other similar products.  | 
 | 11 | +    * Clicking the `"Load More"` button to retrieve more images of a given item.  | 
 | 12 | +    * Scrolling to the bottom of the page to load more items `(i.e. infinite scrolling)`.  | 
 | 13 | +    * Hovering over an element to reveal a tooltip containing additional page info.  | 
 | 14 | + | 
 | 15 | +As such, performing additional requests inside Page Objects is inevitable to  | 
 | 16 | +properly extract data from some websites.  | 
 | 17 | + | 
 | 18 | +.. warning::  | 
 | 19 | + | 
 | 20 | +    Additional requests made inside a Page Object aren't meant to represent  | 
 | 21 | +    the **Crawling Logic** at all. They are simply a low-level way to interact  | 
 | 22 | +    with today's websites, which rely on a lot of page interactions to display  | 
 | 23 | +    their contents.  | 
 | 24 | + | 
 | 25 | + | 
 | 26 | +HttpClient  | 
 | 27 | +==========  | 
 | 28 | + | 
 | 29 | +The main interface for executing additional requests is :class:`~.HttpClient`.  | 
 | 30 | +It also has full support for :mod:`asyncio`, enabling developers to perform  | 
 | 31 | +the additional requests asynchronously.  | 
 | 32 | + | 
 | 33 | +Let's look at a few quick examples to see how it's used in action.  | 
 | 34 | + | 
 | 35 | +A simple ``GET`` request  | 
 | 36 | +------------------------  | 
 | 37 | + | 
 | 38 | +.. code-block:: python  | 
 | 39 | +
  | 
 | 40 | +    import attr  | 
 | 41 | +    import web_poet  | 
 | 42 | +
  | 
 | 43 | +
  | 
 | 44 | +    @attr.define  | 
 | 45 | +    class ProductPage(web_poet.ItemWebPage):  | 
 | 46 | +        http_client: web_poet.HttpClient  | 
 | 47 | +
  | 
 | 48 | +        async def to_item(self):  | 
 | 49 | +            item = {  | 
 | 50 | +                "url": self.url,  | 
 | 51 | +                "name": self.css("#main h3.name ::text").get(),  | 
 | 52 | +                "product_id": self.css("#product ::attr(product-id)").get(),  | 
 | 53 | +            }  | 
 | 54 | +
  | 
 | 55 | +            # Simulates clicking on a button that says "View All Images"  | 
 | 56 | +            response: web_poet.ResponseData = await self.http_client.get(  | 
 | 57 | +                f"https://api.example.com/v2/images?id={item['product_id']}"  | 
 | 58 | +            )  | 
 | 59 | +            page = web_poet.WebPage(response)  | 
 | 60 | +
  | 
 | 61 | +            item["images"] = page.css(".product-images img::attr(src)").getall()  | 
 | 62 | +            return item  | 
 | 63 | +
  | 
 | 64 | +There are a few things to take note of in this example:  | 
 | 65 | + | 
 | 66 | +    * A ``GET`` request can be done via :class:`~.HttpClient`'s  | 
 | 67 | +      :meth:`~.HttpClient.get` method.  | 
 | 68 | +    * We're now using the ``async/await`` syntax.  | 
 | 69 | +    * The response is of type :class:`~.ResponseData`.  | 
 | 70 | + | 
 | 71 | +        * Though in order to use :meth:`~.ResponseShortcutsMixin.css`  | 
 | 72 | +          `(and other shortcut methods)` we'll need to feed it into  | 
 | 73 | +          :class:`~.WebPage`.  | 
 | 74 | + | 
 | 75 | +As the example suggests, we're performing an additional request that lets us  | 
 | 76 | +extract more images from a product page than the initial response contains.  | 
 | 77 | +This is because an additional button needs to be clicked, which fetches the  | 
 | 78 | +complete set of product images via AJAX.  | 
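The ``GET`` example above interpolates the product ID directly into the query string. A minimal sketch of building that URL with the standard library's ``urlencode`` instead, assuming product IDs may contain characters that need escaping; ``build_images_url`` is a hypothetical helper, not part of **web-poet**, and the endpoint is the illustrative one from the example:

```python
from urllib.parse import urlencode


def build_images_url(product_id: str) -> str:
    # Hypothetical helper: escapes the product ID before adding it to the
    # query string of the illustrative images endpoint.
    return "https://api.example.com/v2/images?" + urlencode({"id": product_id})


plain_url = build_images_url("12345")
escaped_url = build_images_url("A&B 1")
```

The resulting string could then be passed to ``self.http_client.get(...)`` in place of the f-string.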
 | 79 | + | 
 | 80 | +A ``POST`` request with ``headers`` and ``body``  | 
 | 81 | +------------------------------------------------  | 
 | 82 | + | 
 | 83 | +Let's see another example which needs ``headers`` and ``body`` data to process  | 
 | 84 | +additional requests.  | 
 | 85 | + | 
 | 86 | +In this example, we'll paginate related items in a carousel. These are  | 
 | 87 | +usually lazily loaded by the website to reduce the amount of information  | 
 | 88 | +rendered in the DOM that might not otherwise be viewed by all users anyway.  | 
 | 89 | + | 
 | 90 | +Thus, additional requests inside the Page Object are typically needed to retrieve them:  | 
 | 91 | + | 
 | 92 | +.. code-block:: python  | 
 | 93 | +
  | 
 |  | +    import json  | 
 |  | +    from typing import List  | 
 |  | +
  | 
 | 94 | +    import attr  | 
 | 95 | +    import web_poet  | 
 | 96 | +
  | 
 | 97 | +
  | 
 | 98 | +    @attr.define  | 
 | 99 | +    class ProductPage(web_poet.ItemWebPage):  | 
 | 100 | +        http_client: web_poet.HttpClient  | 
 | 101 | +
  | 
 | 102 | +        async def to_item(self):  | 
 | 103 | +            item = {  | 
 | 104 | +                "url": self.url,  | 
 | 105 | +                "name": self.css("#main h3.name ::text").get(),  | 
 | 106 | +                "product_id": self.css("#product ::attr(product-id)").get(),  | 
 | 107 | +                "related_product_ids": self.parse_related_product_ids(self),  | 
 | 108 | +            }  | 
 | 109 | +
  | 
 | 110 | +            # Simulates "scrolling" through a carousel that loads related product items  | 
 | 111 | +            response: web_poet.ResponseData = await self.http_client.post(  | 
 | 112 | +                url="https://www.api.example.com/related-products/",  | 
 | 113 | +                headers={  | 
 | 114 | +                    'Host': 'www.example.com',  | 
 | 115 | +                    'Content-Type': 'application/json; charset=UTF-8',  | 
 | 116 | +                },  | 
 | 117 | +                body=json.dumps(  | 
 | 118 | +                    {  | 
 | 119 | +                        "Page": 2,  | 
 | 120 | +                        "ProductID": item["product_id"],  | 
 | 121 | +                    }  | 
 | 122 | +                ),  | 
 | 123 | +            )  | 
 | 124 | +            second_page = web_poet.WebPage(response)  | 
 | 125 | +
  | 
 | 126 | +            related_product_ids = self.parse_related_product_ids(second_page)  | 
 | 127 | +            item["related_product_ids"].extend(related_product_ids)  | 
 | 128 | +            return item  | 
 | 129 | +
  | 
 | 130 | +        @staticmethod  | 
 | 131 | +        def parse_related_product_ids(page: web_poet.WebPage) -> List[str]:  | 
 | 132 | +            return page.css("#main .related-products ::attr(product-id)").getall()  | 
 | 133 | +
  | 
 | 134 | +Here's the key takeaway from this example:  | 
 | 135 | + | 
 | 136 | +    * Similar to :class:`~.HttpClient`'s :meth:`~.HttpClient.get` method,  | 
 | 137 | +      a :meth:`~.HttpClient.post` method is also available that's  | 
 | 138 | +      typically used to submit forms.  | 
 | 139 | + | 
 | 140 | +Batch requests  | 
 | 141 | +--------------  | 
 | 142 | + | 
 | 143 | +We can also choose to process requests in **batches** instead of sequentially.  | 
 | 144 | +Let's modify the example in the previous section to see how it can be done:  | 
 | 145 | + | 
 | 146 | +.. code-block:: python  | 
 | 147 | +
  | 
 |  | +    import json  | 
 | 148 | +    from typing import List  | 
 | 149 | +
  | 
 | 150 | +    import attr  | 
 | 151 | +    import web_poet  | 
 | 152 | +
  | 
 | 153 | +
  | 
 | 154 | +    @attr.define  | 
 | 155 | +    class ProductPage(web_poet.ItemWebPage):  | 
 | 156 | +        http_client: web_poet.HttpClient  | 
 | 157 | +
  | 
 | 158 | +        default_pagination_limit = 10  | 
 | 159 | +
  | 
 | 160 | +        async def to_item(self):  | 
 | 161 | +            item = {  | 
 | 162 | +                "url": self.url,  | 
 | 163 | +                "name": self.css("#main h3.name ::text").get(),  | 
 | 164 | +                "product_id": self.css("#product ::attr(product-id)").get(),  | 
 | 165 | +                "related_product_ids": self.parse_related_product_ids(self),  | 
 | 166 | +            }  | 
 | 167 | +
  | 
 | 168 | +            requests: List[web_poet.Request] = [  | 
 | 169 | +                self.create_request(page_num=page_num)  | 
 | 170 | +                for page_num in range(2, self.default_pagination_limit)  | 
 | 171 | +            ]  | 
 | 172 | +            responses: List[web_poet.ResponseData] = await self.http_client.batch_requests(*requests)  | 
 | 173 | +            pages = map(web_poet.WebPage, responses)  | 
 | 174 | +            related_product_ids = [  | 
 | 175 | +                product_id  | 
 | 176 | +                for page in pages  | 
 | 177 | +                for product_id in self.parse_related_product_ids(page)  | 
 | 178 | +            ]  | 
 | 179 | +
  | 
 | 180 | +            item["related_product_ids"].extend(related_product_ids)  | 
 | 181 | +            return item  | 
 | 182 | +
  | 
 | 183 | +        def create_request(self, page_num=2):  | 
 | 184 | +            # Simulates "scrolling" through a carousel that loads related product items  | 
 | 185 | +            return web_poet.Request(  | 
 | 186 | +                url="https://www.api.example.com/product-pagination/",  | 
 | 187 | +                method="POST",  | 
 | 188 | +                headers={  | 
 | 189 | +                    'Host': 'www.example.com',  | 
 | 190 | +                    'Content-Type': 'application/json; charset=UTF-8',  | 
 | 191 | +                },  | 
 | 192 | +                body=json.dumps(  | 
 | 193 | +                    {  | 
 | 194 | +                        "Page": page_num,  | 
 | 195 | +                        "ProductID": self.css("#product ::attr(product-id)").get(),  | 
 | 196 | +                    }  | 
 | 197 | +                ),  | 
 | 198 | +            )  | 
 | 199 | +
  | 
 | 200 | +        @staticmethod  | 
 | 201 | +        def parse_related_product_ids(page: web_poet.WebPage) -> List[str]:  | 
 | 202 | +            return page.css("#main .related-products ::attr(product-id)").getall()  | 
 | 203 | +
  | 
 | 204 | +The key takeaways for this example are:  | 
 | 205 | + | 
 | 206 | +    * A :class:`~.Request` can be instantiated to represent a generic HTTP request.  | 
 | 207 | +      It only contains the HTTP request information and isn't executed yet.  | 
 | 208 | +      This is useful for writing factory methods that create requests without  | 
 | 209 | +      triggering any download.  | 
 | 210 | +    * :class:`~.HttpClient` has a :meth:`~.HttpClient.batch_requests` method that  | 
 | 211 | +      can process a series of :class:`~.Request` instances.  | 
 | 212 | + | 
 | 213 | +        * Note that it can accept different types of :class:`~.Request` that might  | 
 | 214 | +          not be related *(e.g. a mixture of* ``GET`` *and* ``POST`` *requests)*.  | 
 | 215 | +          This is useful to process them in batch to take advantage of async  | 
 | 216 | +          execution.  | 
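To illustrate why batching can benefit from async execution, here is a minimal sketch using only :mod:`asyncio`. It does not show how ``batch_requests`` is implemented; ``fake_download`` is a hypothetical stand-in for a real downloader, and the use of ``asyncio.gather`` is an assumption made for illustration:

```python
import asyncio
import time
from typing import Dict, List


async def fake_download(request: Dict[str, int]) -> str:
    # Hypothetical stand-in for a real downloader: waits briefly to mimic
    # network latency, then echoes the requested page number.
    await asyncio.sleep(0.1)
    return f"response for page {request['page']}"


async def fetch_sequentially(requests: List[Dict[str, int]]) -> List[str]:
    # Each request starts only after the previous one has finished.
    return [await fake_download(r) for r in requests]


async def fetch_in_batch(requests: List[Dict[str, int]]) -> List[str]:
    # All requests wait concurrently; gather keeps the input order.
    return list(await asyncio.gather(*(fake_download(r) for r in requests)))


def timed(coro):
    # Runs a coroutine to completion and reports how long it took.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


requests = [{"page": n} for n in range(2, 7)]
sequential, sequential_seconds = timed(fetch_sequentially(requests))
batched, batched_seconds = timed(fetch_in_batch(requests))
```

Both approaches return the same responses in the same order, but the batched version waits for all five simulated downloads at once rather than one after another.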
 | 217 | + | 
 | 218 | +.. _advanced-downloader-impl:  | 
 | 219 | + | 
 | 220 | +Downloader Implementation  | 
 | 221 | +=========================  | 
 | 222 | + | 
 | 223 | +Note that :class:`~.HttpClient` doesn't do anything on its own, since it doesn't  | 
 | 224 | +know how to execute a request. Frameworks or projects that want to use  | 
 | 225 | +additional requests in Page Objects therefore need to provide the  | 
 | 226 | +implementation that downloads a :class:`~.Request`.  | 
 | 227 | + | 
 | 228 | +For more info on this, see the API specification for :class:`~.HttpClient`.  | 
 | 229 | + | 
 | 230 | +In any case, frameworks that wish to support **web-poet** could provide the  | 
 | 231 | +HTTP downloader implementation in two ways:  | 
 | 232 | + | 
 | 233 | +.. _setup-contextvars:  | 
 | 234 | + | 
 | 235 | +1. Context Variable  | 
 | 236 | +-------------------  | 
 | 237 | + | 
 | 238 | +:mod:`contextvars` is natively supported in :mod:`asyncio` in order to set and  | 
 | 239 | +access context-aware values. This means that the framework using **web-poet**  | 
 | 240 | +can easily assign the implementation using the readily available  | 
 | 241 | +:class:`contextvars.ContextVar` instance named ``web_poet.request_backend_var``.  | 
 | 242 | + | 
 | 243 | +This can be set using:  | 
 | 244 | + | 
 | 245 | +.. code-block:: python  | 
 | 246 | +
  | 
 |  | +    import web_poet  | 
 |  | +
  | 
 | 247 | +    def request_implementation(r: web_poet.Request) -> web_poet.ResponseData:  | 
 | 248 | +        ...  | 
 | 249 | +
  | 
 | 250 | +    from web_poet import request_backend_var  | 
 | 251 | +    request_backend_var.set(request_implementation)  | 
 | 252 | +
  | 
 | 253 | +Once this is set, any :class:`~.HttpClient` instance uses this request  | 
 | 254 | +implementation by default.  | 
 | 255 | + | 
 | 256 | +.. warning::  | 
 | 257 | + | 
 | 258 | +    If no value for ``web_poet.request_backend_var`` is set, a  | 
 | 259 | +    :class:`~.RequestBackendError` is raised. However, no exception is  | 
 | 260 | +    raised if **option 2** below is used.  | 
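The behaviour of a context variable inside :mod:`asyncio` can be sketched with the standard library alone. In this sketch, ``backend_var`` is a hypothetical stand-in for ``web_poet.request_backend_var``, and the returned strings merely mark which branch ran:

```python
import asyncio
import contextvars

# Hypothetical stand-in for web_poet.request_backend_var (no default value).
backend_var = contextvars.ContextVar("backend")


async def use_backend(url: str) -> str:
    try:
        backend = backend_var.get()
    except LookupError:
        # A ContextVar without a default raises LookupError when unset.
        return "no backend configured"
    return backend(url)


async def main():
    before = await use_backend("https://example.com")
    backend_var.set(lambda url: f"downloaded {url}")
    # A task copies the current context when it is created, so the value
    # set above is visible inside the task.
    after = await asyncio.create_task(use_backend("https://example.com"))
    return before, after


before, after = asyncio.run(main())
```

The first lookup fails because nothing has been set yet; the second succeeds because the task inherits the context in which the backend was assigned.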
 | 261 | + | 
 | 262 | + | 
 | 263 | +2. Dependency Injection  | 
 | 264 | +-----------------------  | 
 | 265 | + | 
 | 266 | +The framework using **web-poet** might be using other libraries which don't  | 
 | 267 | +fully support :mod:`contextvars` `(e.g. Twisted)`. In that case, an  | 
 | 268 | +alternative approach is to supply the request implementation when creating  | 
 | 269 | +an :class:`~.HttpClient` instance:  | 
 | 270 | + | 
 | 271 | + | 
 | 272 | +.. code-block:: python  | 
 | 273 | +
  | 
 |  | +    import web_poet  | 
 |  | +
  | 
 | 274 | +    def request_implementation(r: web_poet.Request) -> web_poet.ResponseData:  | 
 | 275 | +        ...  | 
 | 276 | +
  | 
 | 277 | +    from web_poet import HttpClient  | 
 | 278 | +    http_client = HttpClient(request_downloader=request_implementation)  | 
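How the two options might interact can be sketched as follows. This is not the actual :class:`~.HttpClient` code: ``MiniClient``, ``MissingBackendError``, and ``backend_var`` are hypothetical stand-ins, and treating the downloader as an awaitable callable is an assumption made for illustration:

```python
import asyncio
import contextvars
from typing import Awaitable, Callable, Optional

Downloader = Callable[[str], Awaitable[str]]

# Hypothetical stand-in for web_poet.request_backend_var.
backend_var: contextvars.ContextVar = contextvars.ContextVar("backend")


class MissingBackendError(Exception):
    """Hypothetical stand-in for RequestBackendError."""


class MiniClient:
    def __init__(self, request_downloader: Optional[Downloader] = None):
        self._request_downloader = request_downloader

    async def get(self, url: str) -> str:
        # An injected downloader takes precedence over the context variable.
        downloader = self._request_downloader
        if downloader is None:
            try:
                downloader = backend_var.get()
            except LookupError:
                raise MissingBackendError("no request backend configured") from None
        return await downloader(url)


async def injected_downloader(url: str) -> str:
    return f"injected: {url}"


async def try_get(client: MiniClient, url: str) -> str:
    try:
        return await client.get(url)
    except MissingBackendError as exc:
        return f"error: {exc}"


injected_result = asyncio.run(
    try_get(MiniClient(request_downloader=injected_downloader), "https://example.com")
)
fallback_result = asyncio.run(try_get(MiniClient(), "https://example.com"))
```

With an injected downloader, the context variable is never consulted; without one, and with nothing set in the context variable, the stand-in error is raised.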