Hybrid Mode: ask scrapeghost to write selectors #7
Revisiting this after a few failed attempts that didn't go anywhere. Generating the XPath/etc. just isn't nearly as robust as going straight to the data; many variations of this introduce a whole host of problems.

In the current request to GPT, the full HTML (mod cleaning) is sent along with a JSON schema. This is usually enough to get the data itself out of the HTML without examples, because the field names are good enough (first name, last name, etc. have semantic meaning that GPT can use to make the extraction call).

For list pages, it's common to wind up with something like this:

```python
for link in tree.xpath(DIV_OR_TR_XPATH):
    item = {
        "field": FIELD_SUB_XPATH,
        "field2": FIELD2_SUB_XPATH,
    }
```

To make this work on list pages:

- We need to send a representative sample of the page (the entire page is often too long).
- We don't want each field's XPath to include the parent (e.g. we don't want `//tr[3]/td[@id=xyz]`, we just want `./td[@id=xyz]`).
- The prompt needs to be quite different.

Ideas:
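To make the shape concrete, here is a minimal sketch of what applying such GPT-generated selectors client-side could look like. The `selectors` dict, field names, and `apply_selectors` helper are all hypothetical, assumed for illustration; this is not scrapeghost's actual API or output format.

```python
# Hypothetical sketch: GPT returns one container XPath plus *relative*
# sub-XPaths per field, and the client applies them locally with lxml.
from lxml import html

# Assumed shape of the generated selectors (illustrative, not real output).
selectors = {
    "container": "//table//tr[td]",
    "fields": {
        "first_name": "./td[1]/text()",
        "last_name": "./td[2]/text()",
    },
}

PAGE = """
<table>
  <tr><th>First</th><th>Last</th></tr>
  <tr><td>Ada</td><td>Lovelace</td></tr>
  <tr><td>Alan</td><td>Turing</td></tr>
</table>
"""

def apply_selectors(page: str, selectors: dict) -> list[dict]:
    tree = html.fromstring(page)
    rows = []
    for node in tree.xpath(selectors["container"]):
        # Relative XPaths (./...) are evaluated against each container node,
        # which is why absolute prefixes like //tr[3]/td[...] must be avoided.
        rows.append({
            field: (node.xpath(xp) or [None])[0]
            for field, xp in selectors["fields"].items()
        })
    return rows

print(apply_selectors(PAGE, selectors))
```

Once the selectors are generated from a representative sample, every further page can be scraped with zero API calls, which is where the potential cost savings comes from.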
See FAQ: https://jamesturk.github.io/scrapeghost/faq/#why-not-ask-the-scraper-to-write-css-xpath-selectors
There's an alternate version of the long-page scraper that could generate extraction selectors and then apply them client-side. Would be a huge cost savings for simple list pages. I'm exploring ideas related to this and will start posting updates on it soon.