CocoCrawler 🥥

NuGet Build and Publish

CocoCrawler is an easy-to-use web crawler, scraper, and parser in C#. By combining PuppeteerSharp and AngleSharp, it merges the best of both into a single, simple API.

It provides a simple API to get started:

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://old.reddit.com/r/csharp", pageOptions => pageOptions
        .ExtractList(containersSelector: "div.thing.link.self", [
            new("Title","a.title"),
            new("Upvotes", "div.score.unvoted"),
            new("Datetime", "time", "datetime"),
            new("Total Comments","a.comments"),
            new("Url","a.title", "href")
        ])
        .AddPagination("span.next-button > a")
        .ConfigurePageActions(options => // Only to show what's possible; not needed to run this sample
        {
            options.ScrollToEnd();
            options.Wait(2000);
            // options.Click("span.next-button > a");
        })
        .AddOutputToConsole()
        .AddOutputToCsvFile("results.csv")
    )
    .ConfigureEngine(options =>
    {
        options.UseHeadlessMode(false);
        options.PersistVisitedUrls();
        options.WithLoggerFactory(loggerFactory);
        options.WithCookies([
            new("auth-cookie", "l;alqpekcoizmdfugnvkjgvsaaprufc", "thedomain.com")
        ]);
    })
    .BuildAsync(cancellationToken);

await crawlerEngine.RunAsync(cancellationToken);

This example starts at https://old.reddit.com/r/csharp, scrapes all the posts, then follows the pagination to the next page and repeats. Everything scraped is written to the console and to a CSV file.

With this library it's easy to:

  • Scrape Single Page Apps
  • Scrape Listings
  • Add pagination
  • Open each listed link, scrape the linked page, and continue with pagination (as an alternative to scraping the list itself)
  • Scrape multiple pages in parallel
  • Add custom outputs
  • Customize Everything

Scraping pages

Each Page added (a Page is a single URL job) can be given a task. For each Page it's possible to call:

.ExtractObject(...)

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        .ExtractObject([
            new(Name: "Title", Selector: "div.title > a > span"),
            new(Name: "Description", Selector: "div.title > a > span"),
        ]))
    .BuildAsync(cancellationToken);

This scrapes the title and description of the page and sends them to the output.

.ExtractList(...)

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        .ExtractList(containersSelector: "div > div.repos", [
            new(Name: "Title", Selector: "div.title > a > span"),
            new(Name: "Description", Selector: "div.title > a > span"),
        ]))
    .BuildAsync(cancellationToken);

ExtractList scrapes a list of objects. The containersSelector selects the container that holds each object, and all selectors after it are relative to that container. Each object in the list is individually sent to the output.

.OpenLinks(...)

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        .OpenLinks(linksSelector: "div.example-link-to-repose", subPage => subPage
            .ExtractObject([
                new("Title","div.sitetable.linklisting a.title"),
            ])))
    .BuildAsync(cancellationToken);

OpenLinks opens each link matched by linksSelector and scrapes the resulting page. It's usually combined with .ExtractObject(...) and .AddPagination(...). linksSelector should match a set of anchor (a) tags. It's also possible to chain multiple .OpenLinks(...) calls, as shown in the sketch below.
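
For illustration, a nested OpenLinks sketch. The selectors here are hypothetical, and it assumes the sub-page options expose the same .OpenLinks(...) method as the top-level page options:

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        .OpenLinks("a.first-level-link", firstLevel => firstLevel      // follow links on the start page
            .OpenLinks("a.second-level-link", secondLevel => secondLevel // then follow links one level deeper
                .ExtractObject([
                    new("Title", "h1"),
                ]))))
    .BuildAsync(cancellationToken);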

.AddPagination(...)

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        .ExtractList(containersSelector: "div > div.repos", [
            new(Name: "Title", Selector: "div.title > a > span"),
            new(Name: "Description", Selector: "div.title > a > span"),
        ])
        .AddPagination("span.next-button > a"))
    .BuildAsync(cancellationToken);

AddPagination adds pagination to the page. It expects a selector that points to the next page, usually the Next button.

Multiple Pages

It's possible to add multiple pages to scrape with the same Tasks.

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPages(["https://old.reddit.com/r/csharp", "https://old.reddit.com/r/dotnet"], pageOptions => pageOptions
        .OpenLinks("div.thing.link.self a.bylink.comments", subPageOptions =>
        {
            subPageOptions.ExtractObject([
                new("Title", "div.sitetable.linklisting a.title"),
                new("Url", "div.sitetable.linklisting a.title", "href"),
                new("Upvotes", "div.sitetable.linklisting div.score.unvoted"),
                new("Top comment", "div.commentarea div.entry.unvoted div.md"),
            ]);
            subPageOptions.ConfigurePageActions(ops =>
            {
                ops.ScrollToEnd();
                ops.Wait(4000);
            });
        })
        .AddPagination("span.next-button > a"))
    .BuildAsync(cancellationToken);

await crawlerEngine.RunAsync(cancellationToken);

This example starts at https://old.reddit.com/r/csharp and https://old.reddit.com/r/dotnet, opens each post, and scrapes the title, URL, upvotes, and top comment. On each post it scrolls to the end of the page and waits 4 seconds before scraping. It then continues with the next pagination page.

PageActions - A way to interact with the browser

Page actions are a way to interact with the browser. They can be added to each page, for example to click away popups or scroll to the bottom. The following actions are available:

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        .ExtractList(containersSelector: "div > div.repos", [
            new(Name: "Title", Selector: "div.title > a > span"),
            new(Name: "Description", Selector: "div.title > a > span"),
        ])
        .ConfigurePageActions(ops =>
        {
            ops.ScrollToEnd();
            ops.Click("button#load-more");
            ops.Wait(4000);
        }))
    .BuildAsync(cancellationToken);

Outputs

It's possible to add multiple outputs to the engine. The following outputs are available:

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        .OpenLinks(linksSelector: "div.example-link-to-repose", subPage => subPage
            .ExtractObject([
                new("Title", "div.sitetable.linklisting a.title"),
            ]))
        .AddOutputToConsole()
        .AddOutputToCsvFile("results.csv"))
    .BuildAsync(cancellationToken);

You can add your own output by implementing the ICrawlOutput interface.

public interface ICrawlOutput
{
    Task Initialize(CancellationToken cancellationToken);
    Task WriteAsync(JObject jObject, CancellationToken cancellationToken);
}

Initialize is called once before the engine starts. WriteAsync is called for each object that is scraped.
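
For illustration, here is a minimal sketch of a custom output that appends every scraped object to a JSON-lines file. The class name, file path, and format are arbitrary choices; only the ICrawlOutput members come from the interface above:

using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;
// plus a using for the CocoCrawler namespace that declares ICrawlOutput

public class MyCustomOutput : ICrawlOutput
{
    private readonly string _path;

    public MyCustomOutput(string path = "results.jsonl")
    {
        _path = path;
    }

    public Task Initialize(CancellationToken cancellationToken)
    {
        // Called once before the engine starts; start from a clean file.
        if (File.Exists(_path))
        {
            File.Delete(_path);
        }

        return Task.CompletedTask;
    }

    public async Task WriteAsync(JObject jObject, CancellationToken cancellationToken)
    {
        // Called for every scraped object; write it as a single JSON line.
        var line = jObject.ToString(Newtonsoft.Json.Formatting.None) + Environment.NewLine;
        await File.AppendAllTextAsync(_path, line, cancellationToken);
    }
}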

At the Page level it's also possible to add custom outputs:

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("", p => p.AddOutput(new MyCustomOutput()))
    .BuildAsync(cancellationToken);

Configuring the Engine

Cookies

It's possible to add cookies to all requests:

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage(...)
    .ConfigureEngine(options =>
    {
        options.WithCookies([
            new("auth-cookie", "l;alqpekcoizmdfugnvkjgvsaaprufc", "thedomain.com"),
            new("Cookie2", "def", "localhost")
        ]);
    })
    .BuildAsync(cancellationToken);

Setting the User Agent

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage(...)
    .ConfigureEngine(options =>
    {
        options.WithUserAgent("linux browser - example user agent");
    })
    .BuildAsync(cancellationToken);

The default user agent is that of a Chrome browser.

Ignoring URLs

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage(...)
    .ConfigureEngine(options =>
    {
        options.WithIgnoreUrls(["https://example.com", "https://example2.com"]);
    })    
    .BuildAsync(cancellationToken);

Stopping the engine

The engine stops when either:

  • The total number of pages to crawl is reached, or
  • 2 minutes have passed since the last job was added.

Persisting visited pages

It's possible to persist visited pages to a file. Once persisted, the engine will skip those pages on subsequent runs.

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage(...)
    .ConfigureEngine(options =>
    {
        options.PersistVisitedUrls();
    })
    .BuildAsync(cancellationToken);

Other notable options

The engine can be configured with the following options:

  • UseHeadlessMode(bool headless): Whether the browser should run headless.
  • WithLoggerFactory(ILoggerFactory loggerFactory): The logger factory to use, to enable logging.
  • TotalPagesToCrawl(int total): The total number of pages to crawl.
  • WithParallelismDegree(int parallelismDegree): The number of browser tabs that can be open in parallel (see the combined example below).
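
For example, combining a few of these options in one ConfigureEngine call. The values are only illustrative, and loggerFactory and cancellationToken are assumed to be in scope, as in the earlier samples:

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        .ExtractObject([
            new("Title", "title"),
        ]))
    .ConfigureEngine(options =>
    {
        options.UseHeadlessMode(true);
        options.TotalPagesToCrawl(500);      // stop after 500 crawled pages
        options.WithParallelismDegree(5);    // up to 5 browser tabs in parallel
        options.WithLoggerFactory(loggerFactory);
    })
    .BuildAsync(cancellationToken);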

Extensibility

The library is designed to be extensible. It's possible to add custom IParser, IScheduler, IVisitedUrlTracker and ICrawler implementations.

Using the engine builder, it's possible to add custom implementations:

.ConfigureEngine(options =>
{
    options.WithCrawler(new MyCustomCrawler());
    options.WithScheduler(new MyCustomScheduler());
    options.WithParser(new MyCustomParser());
    options.WithVisitedUrlTracker(new MyCustomVisitedUrlTracker());
})

The interfaces:

  • IParser: Uses AngleSharp (CSS selectors) by default. Override this if you want to use something other than CSS selectors.
  • IVisitedUrlTracker: By default an in-memory tracker is used; a file-persisted tracker is also available in the library.
  • IScheduler: Holds the current jobs.