Implement concurrent requests when crawling #762
Replies: 7 comments 1 reply
-
thought: performance-wise, would fetching static files first speed things up on a 10,000+ post site? i.e., static files don't tax the server as much and aren't as prone to creeping memory usage as dynamic pages...
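For illustration, that static-first idea could be as simple as reordering the crawl queue by file extension before fetching. This is a hypothetical sketch - the helper name and the extension list are invented here, not anything WP2Static does today:

```php
<?php
// Hypothetical sketch: order crawlable paths so cheap static assets
// are fetched before dynamic pages. The extension list is an
// illustrative assumption, not WP2Static's actual behaviour.
function sortStaticFirst( array $paths ) : array {
    $static_extensions = [ 'css', 'js', 'png', 'jpg', 'gif', 'svg', 'woff2' ];
    usort(
        $paths,
        function ( $a, $b ) use ( $static_extensions ) {
            $a_static = in_array(
                strtolower( pathinfo( $a, PATHINFO_EXTENSION ) ),
                $static_extensions,
                true
            );
            $b_static = in_array(
                strtolower( pathinfo( $b, PATHINFO_EXTENSION ) ),
                $static_extensions,
                true
            );
            // static assets (true) sort before dynamic pages (false)
            return $b_static <=> $a_static;
        }
    );
    return $paths;
}
```

Whether this actually reduces peak memory on the server side would still need measuring against a real 10,000+ post site.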
-
I decided to play a bit with this. As you mentioned, Guzzle can help us here thanks to its `Pool`. This is a bit of a proof of concept with some things not working:
I did some tests on my very poor hosting, with concurrency set to:
Still, the difference between 2:00 and 1:30 is quite a lot. So finding the sweet spot for every hosting can bring quite a boost.
<?php
/*
Crawler
Crawls URLs in WordPressSite, saving them to StaticSite
*/
namespace WP2Static;
use WP2StaticGuzzleHttp\Client;
use WP2StaticGuzzleHttp\Psr7\Request;
use WP2StaticGuzzleHttp\Psr7\Response;
use Psr\Http\Message\ResponseInterface;
use WP2StaticGuzzleHttp\Exception\RequestException;
use WP2StaticGuzzleHttp\Pool;
define( 'WP2STATIC_REDIRECT_CODES', [ 301, 302, 303, 307, 308 ] );
class Crawler {
/**
* @var Client
*/
private $client;
/**
* @var string
*/
private $site_path;
/**
* Crawler constructor
*/
public function __construct() {
$this->site_path = rtrim( SiteInfo::getURL( 'site' ), '/' );
$port_override = apply_filters(
'wp2static_curl_port',
null
);
$base_uri = $this->site_path;
if ( $port_override ) {
$base_uri = "{$base_uri}:{$port_override}";
}
$this->client = new Client(
[
'base_uri' => $base_uri,
'verify' => false,
'http_errors' => false,
'allow_redirects' => [
'max' => 1,
// required to get effective_url
'track_redirects' => true,
],
'connect_timeout' => 0,
'timeout' => 600,
'headers' => [
'User-Agent' => apply_filters(
'wp2static_curl_user_agent',
'WP2Static.com',
),
],
]
);
}
public static function wp2staticCrawl( string $static_site_path, string $crawler_slug ) : void {
if ( 'wp2static' === $crawler_slug ) {
$crawler = new Crawler();
$crawler->crawlSite( $static_site_path );
}
}
/**
* Crawls URLs in WordPressSite, saving them to StaticSite
*/
public function crawlSite( string $static_site_path ) : void {
$crawled = 0;
$cache_hits = 0;
WsLog::l( 'Starting to crawl detected URLs.' );
$site_host = parse_url( $this->site_path, PHP_URL_HOST );
$site_port = parse_url( $this->site_path, PHP_URL_PORT );
$site_host = $site_port ? $site_host . ":$site_port" : $site_host;
$site_urls = [ "http://$site_host", "https://$site_host" ];
$use_crawl_cache = apply_filters(
'wp2static_use_crawl_cache',
CoreOptions::getValue( 'useCrawlCaching' )
);
WsLog::l( ( $use_crawl_cache ? 'Using' : 'Not using' ) . ' CrawlCache.' );
// TODO: use some Iterable or other performance optimisation here
// to help reduce resources for large URL sites
/**
* When you call method that executes database query in for loop
* you are calling method and querying database for every loop iteration.
* To avoid that you need to assign the result to a variable.
*/
$crawlable_paths = CrawlQueue::getCrawlablePaths();
$urls = [];
foreach ( $crawlable_paths as $root_relative_path ) {
$absolute_uri = new URL( $this->site_path . $root_relative_path );
$urls[] = [
'url' => $absolute_uri->get(),
'path' => $root_relative_path,
];
}
$requests = function ( $urls ) {
foreach ( $urls as $url ) {
yield new Request( 'GET', $url['url'] );
}
};
// reuse the client configured in the constructor so base_uri,
// timeouts and redirect tracking apply to pooled requests too
$pool = new Pool( $this->client, $requests( $urls ), [
'concurrency' => 3,
'fulfilled' => function ( Response $response, $index ) use ( $urls, $site_urls, $use_crawl_cache, &$crawled, &$cache_hits ) {
if ( ! $response ) {
return;
}
$root_relative_path = $urls[$index]['path'];
$crawled_contents = (string) $response->getBody();
$status_code = $response->getStatusCode();
if ( $status_code === 404 ) {
WsLog::l( '404 for URL ' . $root_relative_path );
CrawlCache::rmUrl( $root_relative_path );
$crawled_contents = null;
} elseif ( in_array( $status_code, WP2STATIC_REDIRECT_CODES ) ) {
$crawled_contents = null;
}
$redirect_to = null;
if ( in_array( $status_code, WP2STATIC_REDIRECT_CODES ) ) {
$effective_url = $urls[ $index ]['url'];
// returns as string
$redirect_history =
$response->getHeaderLine( 'X-Guzzle-Redirect-History' );
if ( $redirect_history ) {
$redirects = explode( ', ', $redirect_history );
$effective_url = end( $redirects );
}
$redirect_to =
(string) str_replace( $site_urls, '', $effective_url );
$page_hash = md5( $status_code . $redirect_to );
} elseif ( ! is_null( $crawled_contents ) ) {
$page_hash = md5( $crawled_contents );
} else {
$page_hash = md5( (string) $status_code );
}
// TODO: as John mentioned, we're only skipping the saving,
// not crawling here. Let's look at improving that... or speeding
// up with async requests, at least
if ( $use_crawl_cache ) {
// already crawled with an identical hash: count a cache hit and skip saving
if ( CrawlCache::getUrl( $root_relative_path, $page_hash ) ) {
$cache_hits++;
return;
}
}
$crawled++;
if ( $crawled_contents ) {
// do some magic here - naive: if URL ends in /, save to /index.html
// TODO: will need love for example, XML files
// check content type, serve .xml/rss, etc instead
if ( mb_substr( $root_relative_path, -1 ) === '/' ) {
StaticSite::add( $root_relative_path . 'index.html', $crawled_contents );
} else {
StaticSite::add( $root_relative_path, $crawled_contents );
}
}
CrawlCache::addUrl(
$root_relative_path,
$page_hash,
$status_code,
$redirect_to
);
// incrementally log crawl progress
if ( $crawled % 300 === 0 ) {
$notice = "Crawling progress: $crawled crawled, $cache_hits skipped (cached).";
WsLog::l( $notice );
}
},
'rejected' => function ( RequestException $reason, $index ) use ( $urls ) {
$root_relative_path = $urls[ $index ]['path'];
WsLog::l( 'Failed ' . $root_relative_path . ': ' . $reason->getMessage() );
},
]);
// Initiate the transfers and create a promise
$promise = $pool->promise();
// Force the pool of requests to complete.
$promise->wait();
WsLog::l(
"Crawling complete. $crawled crawled, $cache_hits skipped (cached)."
);
$args = [
'staticSitePath' => $static_site_path,
'crawled' => $crawled,
'cache_hits' => $cache_hits,
];
do_action( 'wp2static_crawling_complete', $args );
}
/**
* Crawls a string of full URL within WordPressSite
*
* @return ResponseInterface|null response object
*/
public function crawlURL( string $url ) : ?ResponseInterface {
$headers = [];
$auth_user = CoreOptions::getValue( 'basicAuthUser' );
if ( $auth_user ) {
$auth_password = CoreOptions::getValue( 'basicAuthPassword' );
if ( $auth_password ) {
$headers['auth'] = [ $auth_user, $auth_password ];
}
}
$request = new Request( 'GET', $url, $headers );
$response = $this->client->send( $request );
return $response;
}
}
-
Well, @palmiak, this is super exciting/humbling, when someone comes and PoC's something that's been on my todo list for like 4 years :D I guess it's hard to know how many paths we can do concurrently, as some pages may tax the server more, while static assets may respond instantly. For that, I'd lean towards a user-configurable concurrency number, unless y'all got ideas on how to do it based on PHP memory usage detection or such, but I don't think we could reliably detect load on CPU/DB (especially if remote)/disk... but we could do it based on average response times and adjust dynamically?
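That "adjust dynamically from average response times" idea could look roughly like the sketch below. Everything here - the class name, the 20-sample rolling window, the 0.5s/2.0s thresholds - is an invented illustration, not a tested tuning:

```php
<?php
// Hypothetical sketch: track a rolling average of response times and
// nudge concurrency up when the server responds fast, down when slow.
// Window size and thresholds are illustrative assumptions only.
class AdaptiveConcurrency {
    private $concurrency;
    private $min;
    private $max;
    private $samples = [];

    public function __construct( int $initial = 2, int $min = 1, int $max = 10 ) {
        $this->concurrency = $initial;
        $this->min = $min;
        $this->max = $max;
    }

    public function recordResponseTime( float $seconds ) : void {
        $this->samples[] = $seconds;
        if ( count( $this->samples ) > 20 ) {
            // keep a rolling window of the 20 most recent samples
            array_shift( $this->samples );
        }
    }

    public function current() : int {
        if ( count( $this->samples ) < 5 ) {
            // not enough data yet; keep the current setting
            return $this->concurrency;
        }
        $avg = array_sum( $this->samples ) / count( $this->samples );
        if ( $avg < 0.5 ) {
            $this->concurrency = min( $this->max, $this->concurrency + 1 );
        } elseif ( $avg > 2.0 ) {
            $this->concurrency = max( $this->min, $this->concurrency - 1 );
        }
        return $this->concurrency;
    }
}
```

One caveat: Guzzle's Pool fixes its concurrency when constructed, so something like this could only take effect between batches of requests, not mid-pool.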
-
I was thinking that at first, we should not over-engineer this: just set 1 as the default and make it filterable, or add the possibility to set it in the UI. Of course, first I will try to add all the missing things I mentioned before. It would be great if you test it too, so we could see if you also experience some speed gains.
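A minimal sketch of that "default 1, filterable" approach - note the filter name `wp2static_crawl_concurrency` is a placeholder I've invented here; the actual name would be settled in the PR:

```php
<?php
// Hypothetical sketch: concurrency defaults to 1 and can be overridden
// via a WordPress filter. 'wp2static_crawl_concurrency' is a placeholder
// name, not an existing WP2Static filter.
function getCrawlConcurrency() : int {
    $concurrency = 1;
    if ( function_exists( 'apply_filters' ) ) {
        // inside WordPress, let site owners tune it to their hosting
        $concurrency = (int) apply_filters( 'wp2static_crawl_concurrency', $concurrency );
    }
    // never drop below a single request at a time
    return max( 1, $concurrency );
}
```

The Pool construction would then read `'concurrency' => getCrawlConcurrency()` instead of a hard-coded value.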
-
Sounds good! I'm a bit behind on my testing - not sure when I'll get to it, but it would be Lokl (Docker) on my M1 MBPro, so it will be fun to see how hard/fast I can push it! Do you need some test servers in DO/Vultr to play with? I can spin some up...
-
I did some testing on my local machine with an i7. The results make it quite obvious that Guzzle's concurrency works nicely. Having 51 posts, with concurrency set to: Having 551 posts
-
In case you missed it - there is a PR: #834. I would need some help, especially with the line-length warnings. But overall it looks solid-ish ;)
-
Can revisit this after previously failed attempts with curl_multi on the old codebase. Should be much cleaner now with Guzzle and async promises.
Still curious about concurrent request limits on certain hosts, but if there's a user option to scale it down, at least, it should be fine.