Enhancement - Ability to Parse a list #28

Marcel0024 · 2024-06-23T07:23:26Z

Hi, I've been looking at this library and it's really promising. It saves a lot of time writing boilerplate.
But I'm missing one feature before I can use it for my use case.

Is your feature request related to a problem? Please describe.

The issue I'm running into is that I don't need to open each link to scrape it.
My first page is the page with the listings, and it has pagination.

For example:

Page 1

[Listing 1]
    [Name]
    [Amount]
    [Rating]
    [Link]

[Listing 2]
    [Name]
    [Amount]
    [Rating]
    [Link]

[Listing 3]
    [Name]
    [Amount]
    [Rating]
    [Link]

 pages [1] 2 3 4 5 6 ... 234

Page 2

[Listing 1]
    [Name]
    [Amount]
    [Rating]
    [Link]

[Listing 2]
    [Name]
    [Amount]
    [Rating]
    [Link]

[Listing 3]
    [Name]
    [Amount]
    [Rating]
    [Link]

 pages 1 [2] 3 4 5 6 ... 234

The way the library is set up, I have to .Follow(...) each link and .Parse(...) each opened page. But in my case I don't need to: the data I need is already on the listing page.

Describe the solution you'd like

Ability to parse a list, maybe using a JArray for the object returned in the entity.

Describe alternatives you've considered

I didn't find a workaround. I did try something like this:

.Parse([..Enumerable.Range(0, 10).Select(x =>
    {
        return new Schema($"Listing{x}")
        {
            new SchemaElement("Name", "div.min-w-0 > a > h2"),
            new SchemaElement("Amount", "div.min-w-0 > p.font-semibold")
        };
    })])

But all listings come out the same, since the query selector just grabs the first match: https://github.com/pavlovtech/WebReaper/blob/master/WebReaper/Core/Parser/Concrete/AngleSharpContentParser.cs#L85
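
To illustrate the behaviour described above, here is a minimal standalone sketch using AngleSharp directly rather than WebReaper. The markup and the `div.listing > h2` selector are made up for the example; only `QuerySelector` and `QuerySelectorAll` are real AngleSharp calls.

```csharp
using System;
using System.Linq;
using AngleSharp.Html.Parser;

// Hypothetical markup: two listings that share the same structure.
const string html =
    "<div class='listing'><h2>First</h2></div>" +
    "<div class='listing'><h2>Second</h2></div>";

var document = new HtmlParser().ParseDocument(html);

// QuerySelector returns only the first match, no matter how many
// schemas with the same selector are evaluated against the page.
var first = document.QuerySelector("div.listing > h2")?.TextContent;

// QuerySelectorAll returns every match in document order.
var all = document.QuerySelectorAll("div.listing > h2")
    .Select(e => e.TextContent)
    .ToList();

Console.WriteLine(first);                  // prints: First
Console.WriteLine(string.Join(", ", all)); // prints: First, Second
```

This is why generating ten schemas with identical selectors yields ten copies of the first listing.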

Additional context

To keep backwards compatibility, I think this needs to be implemented on SchemaElement with a new property, maybe IsList or IsArray.

In FillOutput()
https://github.com/pavlovtech/WebReaper/blob/master/WebReaper/Core/Parser/Concrete/AngleSharpContentParser.cs#L43
in the try block, we can differentiate whether the element is a list or not; if it is, a GetListData() method returns a list of values to add as a JArray.
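
A rough sketch of that branch, written against AngleSharp and Newtonsoft.Json only. The `Extract` helper, its `isList` parameter, and the sample markup are hypothetical stand-ins for the proposed IsList property and GetListData(); none of this is existing WebReaper code.

```csharp
using System;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;
using Newtonsoft.Json.Linq;

// Hypothetical helper: with isList it collects every match into a JArray,
// otherwise it keeps the single-value behaviour (first match only).
static JToken Extract(IParentNode root, string selector, bool isList)
{
    if (isList)
    {
        var array = new JArray();
        foreach (var element in root.QuerySelectorAll(selector))
        {
            array.Add(element.TextContent.Trim());
        }
        return array;
    }

    // Null when nothing matches, which serializes as a JSON null.
    return new JValue(root.QuerySelector(selector)?.TextContent.Trim());
}

// Hypothetical markup standing in for a listings page.
const string html =
    "<div class='listing'><h2>First</h2></div>" +
    "<div class='listing'><h2>Second</h2></div>";

var document = new HtmlParser().ParseDocument(html);

var output = new JObject
{
    ["Name"] = Extract(document, "div.listing > h2", isList: false),
    ["Names"] = Extract(document, "div.listing > h2", isList: true),
};

Console.WriteLine(output.ToString(Newtonsoft.Json.Formatting.None));
// prints: {"Name":"First","Names":["First","Second"]}
```

Existing schemas would keep taking the non-list path, so their output would not change.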

I'm willing to work on a PR with some guidance/approval.


Marcel0024 · 2024-06-23T10:35:55Z

Just realized you would have to change the Job implementation as well:

WebReaper/WebReaper/Domain/Job.cs

Line 17 in 988ea8c

(0, _) => PageCategory.TargetPage,

Because every page would have to become a TargetPage.

Damn, there's no way to override this. I thought a custom IContentParser would do the trick, but I ran into this.
