Implement concurrent requests when crawling #762
Replies: 7 comments 1 reply
-
thought: performance-wise, would fetching static files first speed things up on a 10,000+ post site? i.e., static files don't tax the server as much and aren't as prone to creeping memory usage as dynamic pages...
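For illustration, that static-first idea could be as simple as reordering the crawl queue by file extension before fetching. This is a hypothetical sketch - the helper name and the extension list are invented here, not anything WP2Static does today:

```php
<?php
// Hypothetical sketch: order crawlable paths so cheap static assets
// are fetched before dynamic pages. The extension list is an
// illustrative assumption, not WP2Static's actual behaviour.
function sortStaticFirst( array $paths ) : array {
    $static_extensions = [ 'css', 'js', 'png', 'jpg', 'gif', 'svg', 'woff2' ];
    usort(
        $paths,
        function ( $a, $b ) use ( $static_extensions ) {
            $a_static = in_array(
                strtolower( pathinfo( $a, PATHINFO_EXTENSION ) ),
                $static_extensions,
                true
            );
            $b_static = in_array(
                strtolower( pathinfo( $b, PATHINFO_EXTENSION ) ),
                $static_extensions,
                true
            );
            // static assets (true) sort before dynamic pages (false)
            return $b_static <=> $a_static;
        }
    );
    return $paths;
}
```

Whether this actually reduces peak memory on the server side would still need measuring against a real 10,000+ post site.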
-
I decided to play a bit with this. As you mentioned, Guzzle can help us here thanks to its `Pool`. This is a bit of a proof of concept with some things not working:
I did some tests on my very poor hosting, with concurrency set to:
Still, the difference between 2:00 and 1:30 is quite a lot. So finding the sweet spot for every hosting can bring quite a boost.
<?php
/*
Crawler
Crawls URLs in WordPressSite, saving them to StaticSite
*/
namespace WP2Static;
use WP2StaticGuzzleHttp\Client;
use WP2StaticGuzzleHttp\Psr7\Request;
use WP2StaticGuzzleHttp\Psr7\Response;
use Psr\Http\Message\ResponseInterface;
use WP2StaticGuzzleHttp\Exception\RequestException;
use WP2StaticGuzzleHttp\Pool;
define( 'WP2STATIC_REDIRECT_CODES', [ 301, 302, 303, 307, 308 ] );
class Crawler {
/**
* @var Client
*/
private $client;
/**
* @var string
*/
private $site_path;
/**
* Crawler constructor
*/
public function __construct() {
$this->site_path = rtrim( SiteInfo::getURL( 'site' ), '/' );
$port_override = apply_filters(
'wp2static_curl_port',
null
);
$base_uri = $this->site_path;
if ( $port_override ) {
$base_uri = "{$base_uri}:{$port_override}";
}
$this->client = new Client(
[
'base_uri' => $base_uri,
'verify' => false,
'http_errors' => false,
'allow_redirects' => [
'max' => 1,
// required to get effective_url
'track_redirects' => true,
],
'connect_timeout' => 0,
'timeout' => 600,
'headers' => [
'User-Agent' => apply_filters(
'wp2static_curl_user_agent',
'WP2Static.com',
),
],
]
);
}
public static function wp2staticCrawl( string $static_site_path, string $crawler_slug ) : void {
if ( 'wp2static' === $crawler_slug ) {
$crawler = new Crawler();
$crawler->crawlSite( $static_site_path );
}
}
/**
* Crawls URLs in WordPressSite, saving them to StaticSite
*/
public function crawlSite( string $static_site_path ) : void {
$crawled = 0;
$cache_hits = 0;
WsLog::l( 'Starting to crawl detected URLs.' );
$site_host = parse_url( $this->site_path, PHP_URL_HOST );
$site_port = parse_url( $this->site_path, PHP_URL_PORT );
$site_host = $site_port ? $site_host . ":$site_port" : $site_host;
$site_urls = [ "http://$site_host", "https://$site_host" ];
$use_crawl_cache = apply_filters(
'wp2static_use_crawl_cache',
CoreOptions::getValue( 'useCrawlCaching' )
);
WsLog::l( ( $use_crawl_cache ? 'Using' : 'Not using' ) . ' CrawlCache.' );
// TODO: use some Iterable or other performance optimisation here
// to help reduce resources for large URL sites
/**
* When you call method that executes database query in for loop
* you are calling method and querying database for every loop iteration.
* To avoid that you need to assign the result to a variable.
*/
$crawlable_paths = CrawlQueue::getCrawlablePaths();
$urls = [];
foreach ( $crawlable_paths as $root_relative_path ) {
$absolute_uri = new URL( $this->site_path . $root_relative_path );
$urls[] = [
'url' => $absolute_uri->get(),
'path' => $root_relative_path,
];
}
$requests = function ( $urls ) {
foreach ( $urls as $url ) {
yield new Request( 'GET', $url['url'] );
}
};
// reuse the client configured in the constructor so base_uri,
// timeouts and redirect tracking apply to pooled requests too
$pool = new Pool( $this->client, $requests( $urls ), [
'concurrency' => 3,
'fulfilled' => function ( Response $response, $index ) use ( $urls, $site_urls, $use_crawl_cache, &$crawled, &$cache_hits ) {
if ( ! $response ) {
return;
}
$root_relative_path = $urls[$index]['path'];
$crawled_contents = (string) $response->getBody();
$status_code = $response->getStatusCode();
if ( $status_code === 404 ) {
WsLog::l( '404 for URL ' . $root_relative_path );
CrawlCache::rmUrl( $root_relative_path );
$crawled_contents = null;
} elseif ( in_array( $status_code, WP2STATIC_REDIRECT_CODES ) ) {
$crawled_contents = null;
}
$redirect_to = null;
if ( in_array( $status_code, WP2STATIC_REDIRECT_CODES ) ) {
$effective_url = $urls[ $index ]['url'];
// returns as string
$redirect_history =
$response->getHeaderLine( 'X-Guzzle-Redirect-History' );
if ( $redirect_history ) {
$redirects = explode( ', ', $redirect_history );
$effective_url = end( $redirects );
}
$redirect_to =
(string) str_replace( $site_urls, '', $effective_url );
$page_hash = md5( $status_code . $redirect_to );
} elseif ( ! is_null( $crawled_contents ) ) {
$page_hash = md5( $crawled_contents );
} else {
$page_hash = md5( (string) $status_code );
}
// TODO: as John mentioned, we're only skipping the saving,
// not crawling here. Let's look at improving that... or speeding
// up with async requests, at least
if ( $use_crawl_cache ) {
// already crawled with an identical hash: count a cache hit and skip saving
if ( CrawlCache::getUrl( $root_relative_path, $page_hash ) ) {
$cache_hits++;
return;
}
}
$crawled++;
if ( $crawled_contents ) {
// do some magic here - naive: if URL ends in /, save to /index.html
// TODO: will need love for example, XML files
// check content type, serve .xml/rss, etc instead
if ( mb_substr( $root_relative_path, -1 ) === '/' ) {
StaticSite::add( $root_relative_path . 'index.html', $crawled_contents );
} else {
StaticSite::add( $root_relative_path, $crawled_contents );
}
}
CrawlCache::addUrl(
$root_relative_path,
$page_hash,
$status_code,
$redirect_to
);
// incrementally log crawl progress
if ( $crawled % 300 === 0 ) {
$notice = "Crawling progress: $crawled crawled, $cache_hits skipped (cached).";
WsLog::l( $notice );
}
},
'rejected' => function ( RequestException $reason, $index ) use ( $urls ) {
$root_relative_path = $urls[ $index ]['path'];
WsLog::l( 'Failed ' . $root_relative_path . ': ' . $reason->getMessage() );
},
]);
// Initiate the transfers and create a promise
$promise = $pool->promise();
// Force the pool of requests to complete.
$promise->wait();
WsLog::l(
"Crawling complete. $crawled crawled, $cache_hits skipped (cached)."
);
$args = [
'staticSitePath' => $static_site_path,
'crawled' => $crawled,
'cache_hits' => $cache_hits,
];
do_action( 'wp2static_crawling_complete', $args );
}
/**
* Crawls a string of full URL within WordPressSite
*
* @return ResponseInterface|null response object
*/
public function crawlURL( string $url ) : ?ResponseInterface {
$headers = [];
$auth_user = CoreOptions::getValue( 'basicAuthUser' );
if ( $auth_user ) {
$auth_password = CoreOptions::getValue( 'basicAuthPassword' );
if ( $auth_password ) {
$headers['auth'] = [ $auth_user, $auth_password ];
}
}
$request = new Request( 'GET', $url, $headers );
$response = $this->client->send( $request );
return $response;
}
}
-
Well, @palmiak, this is super exciting/humbling, when someone comes and PoC's something that's been on my todo list for like 4 years :D I guess it's hard to know how many paths we can do concurrently, as some pages may tax the server more, while static assets may respond instantly. For that, I'd lean towards a user-configurable concurrency number, unless y'all got ideas on how to do it based on PHP memory usage detection or such, but I don't think we could reliably detect load on CPU/DB (especially if remote)/disk... but we could do it based on average response times and adjust dynamically?
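That "adjust dynamically from average response times" idea could look roughly like the sketch below. Everything here - the class name, the 20-sample rolling window, the 0.5s/2.0s thresholds - is an invented illustration, not a tested tuning:

```php
<?php
// Hypothetical sketch: track a rolling average of response times and
// nudge concurrency up when the server responds fast, down when slow.
// Window size and thresholds are illustrative assumptions only.
class AdaptiveConcurrency {
    private $concurrency;
    private $min;
    private $max;
    private $samples = [];

    public function __construct( int $initial = 2, int $min = 1, int $max = 10 ) {
        $this->concurrency = $initial;
        $this->min = $min;
        $this->max = $max;
    }

    public function recordResponseTime( float $seconds ) : void {
        $this->samples[] = $seconds;
        if ( count( $this->samples ) > 20 ) {
            // keep a rolling window of the 20 most recent samples
            array_shift( $this->samples );
        }
    }

    public function current() : int {
        if ( count( $this->samples ) < 5 ) {
            // not enough data yet; keep the current setting
            return $this->concurrency;
        }
        $avg = array_sum( $this->samples ) / count( $this->samples );
        if ( $avg < 0.5 ) {
            $this->concurrency = min( $this->max, $this->concurrency + 1 );
        } elseif ( $avg > 2.0 ) {
            $this->concurrency = max( $this->min, $this->concurrency - 1 );
        }
        return $this->concurrency;
    }
}
```

One caveat: Guzzle's Pool fixes its concurrency when constructed, so something like this could only take effect between batches of requests, not mid-pool.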
-
I was thinking that at first, we should not over-engineer this: just set 1 as the default and make it filterable, or add the possibility to set it in the UI. Of course, first I will try to add all the missing things I mentioned before. It would be great if you test it too, so we could see if you also experience some speed gains.
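A minimal sketch of that "default 1, filterable" approach - note the filter name `wp2static_crawl_concurrency` is a placeholder I've invented here; the actual name would be settled in the PR:

```php
<?php
// Hypothetical sketch: concurrency defaults to 1 and can be overridden
// via a WordPress filter. 'wp2static_crawl_concurrency' is a placeholder
// name, not an existing WP2Static filter.
function getCrawlConcurrency() : int {
    $concurrency = 1;
    if ( function_exists( 'apply_filters' ) ) {
        // inside WordPress, let site owners tune it to their hosting
        $concurrency = (int) apply_filters( 'wp2static_crawl_concurrency', $concurrency );
    }
    // never drop below a single request at a time
    return max( 1, $concurrency );
}
```

The Pool construction would then read `'concurrency' => getCrawlConcurrency()` instead of a hard-coded value.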
-
Sounds good! I'm a bit behind on my testing - not sure when I'll get to it, but it would be Lokl (Docker) on my M1 MBPro, so it will be fun to see how hard/fast I can push it! Do you need some test servers in DO/Vultr to play with? I can spin some up...
-
I did some testing on my local machine with an i7. The results make it quite obvious that Guzzle's concurrency works nicely. Having 51 posts, with concurrency set to: Having 551 posts
-
In case you missed it - there is a PR: #834. I would need some help, especially with the line-length warnings. But overall it looks solid-ish ;)
-
Can revisit this after previously failed attempts with curl_multi on the old codebase. Should be much cleaner now with Guzzle and async promises.
Still curious about concurrent request limits on certain hosts, but if there's a user option to scale it down, at least, it should be fine.