Hybrid Mode: ask scrapeghost to write selectors #7
Revisiting this after a few failed attempts that didn't go anywhere. Generating the XPath/etc. just isn't nearly as robust as going straight to the data; many variations of this introduce a whole host of problems.

In the current request to GPT, the full HTML (mod cleaning) is sent along with a JSON schema. This is usually enough to get the data itself out of the HTML without examples, because the field names are good enough (first name, last name, etc. have semantic meaning that GPT can use to make the extraction call).

For list pages, it's common to wind up with something like this:

```python
for link in tree.xpath(DIV_OR_TR_XPATH):
    item = {
        "field": FIELD_SUB_XPATH,
        "field2": FIELD2_SUB_XPATH,
    }
```

To make this work on list pages:

- We need to send a representative sample of the page (the entire page is often too long).
- We don't want each field's XPath to include the parent (e.g. we don't want `//tr[3]/td[@id=xyz]`, we just want `./td[@id=xyz]`).
- The prompt needs to be quite different.

Ideas:
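To make the shape concrete, here is a minimal sketch of what applying such GPT-generated selectors client-side could look like. The `selectors` dict, field names, and `apply_selectors` helper are all hypothetical, assumed for illustration; this is not scrapeghost's actual API or output format.

```python
# Hypothetical sketch: GPT returns one container XPath plus *relative*
# sub-XPaths per field, and the client applies them locally with lxml.
from lxml import html

# Assumed shape of the generated selectors (illustrative, not real output).
selectors = {
    "container": "//table//tr[td]",
    "fields": {
        "first_name": "./td[1]/text()",
        "last_name": "./td[2]/text()",
    },
}

PAGE = """
<table>
  <tr><th>First</th><th>Last</th></tr>
  <tr><td>Ada</td><td>Lovelace</td></tr>
  <tr><td>Alan</td><td>Turing</td></tr>
</table>
"""

def apply_selectors(page: str, selectors: dict) -> list[dict]:
    tree = html.fromstring(page)
    rows = []
    for node in tree.xpath(selectors["container"]):
        # Relative XPaths (./...) are evaluated against each container node,
        # which is why absolute prefixes like //tr[3]/td[...] must be avoided.
        rows.append({
            field: (node.xpath(xp) or [None])[0]
            for field, xp in selectors["fields"].items()
        })
    return rows

print(apply_selectors(PAGE, selectors))
```

Once the selectors are generated from a representative sample, every further page can be scraped with zero API calls, which is where the potential cost savings comes from.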
See FAQ: https://jamesturk.github.io/scrapeghost/faq/#why-not-ask-the-scraper-to-write-css-xpath-selectors
There's an alternate version of the long-page scraper that could generate extraction selectors and then apply them client-side. Would be a huge cost savings for simple list pages. I'm exploring ideas related to this and will start posting updates on it soon.