Crabler - Web crawler for Crabs

Asynchronous web scraper engine written in rust.

Features:

fully based on async-std
derive macro based api
struct based api
stateful scraper (structs can hold state)
ability to download files
ability to schedule navigation jobs in an async manner

Example

extern crate crabler;

use std::path::Path;

use crabler::*;
use surf::Url;

#[derive(WebScraper)]
#[on_response(response_handler)]
#[on_html("a[href]", walk_handler)]
struct Scraper {}

impl Scraper {
    async fn response_handler(&self, response: Response) -> Result<()> {
        if response.url.ends_with(".png") && response.status == 200 {
            println!("Finished downloading {} -> {:?}", response.url, response.download_destination);
        }
        Ok(())
    }

    async fn walk_handler(&self, mut response: Response, a: Element) -> Result<()> {
        if let Some(href) = a.attr("href") {
            // Create absolute URL
            let url = Url::parse(&href)
                .unwrap_or_else(|_| Url::parse(&response.url).unwrap().join(&href).unwrap());

            // Attempt to download an image
            if href.ends_with(".png") {
                let image_name = url.path_segments().unwrap().last().unwrap();
                let p = Path::new("/tmp").join(image_name);
                let destination = p.to_string_lossy().to_string();

                if !p.exists() {
                    println!("Downloading {}", destination);
                    // Schedule crawler to download file to some destination
                    // downloading will happen in the background, await here is just to wait for job queue
                    response.download_file(url.to_string(), destination).await?;
                } else {
                    println!("Skipping existing file {}", destination);
                }
            } else {
              // Or schedule crawler to navigate to a given url
              response.navigate(url.to_string()).await?;
            };
        }

        Ok(())
    }
}

#[async_std::main]
async fn main() -> Result<()> {
    let scraper = Scraper {};

    // Run scraper starting from given url and using 20 worker threads
    scraper.run(Opts::new().with_urls(vec!["https://www.rust-lang.org/"]).with_threads(20)).await
}

Sample project

Gonzih/apod-nasa-scraper-rs

Name		Name	Last commit message	Last commit date
Latest commit History 121 Commits
.github		.github
crabler_derive		crabler_derive
src		src
tests		tests
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
build.rs		build.rs
shell.nix		shell.nix

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Crabler - Web crawler for Crabs

Example

Sample project

About

Releases 2

Packages

Contributors 4

Languages

License

Gonzih/crabler

Folders and files

Latest commit

History

Repository files navigation

Crabler - Web crawler for Crabs

Example

Sample project

About

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 4

Languages

Packages