Export in CDRv2 format #3

Merged
merged 3 commits into from
Mar 16, 2016

@lopuhin lopuhin commented Mar 16, 2016

Also remove export of found forms, and do not save pages from other domains (we can get there after following redirects).

I've included only the required fields and left extracted_metadata empty. I also did not include the required _timestamp field, since the docs say it should be autogenerated by Elasticsearch.
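
A minimal sketch of the kind of item described above, assuming a plain dict. Only `url`, `crawler`, `extracted_text` and `extracted_metadata` appear in this PR's diff; the helper name `make_cdr_item` and the example values are hypothetical, and the full set of required CDRv2 fields is not reproduced here.

```python
def make_cdr_item(url, text, crawler_name):
    """Build a minimal CDR-style item (hypothetical helper, not the PR's code)."""
    return {
        'url': url,
        'crawler': crawler_name,
        'extracted_text': text,
        # Left empty at this stage of the PR.
        'extracted_metadata': {},
        # No '_timestamp' key: the PR leaves it out on the understanding
        # that Elasticsearch generates it.
    }


item = make_cdr_item('http://example.com/', 'Hello', 'example-crawler')
```

Omitting the key entirely, rather than setting it to `None`, keeps the exported document from carrying a placeholder value for a field the store is expected to fill in.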

Also remove export of found forms, and do not save pages
from other domains.
crawler=self.settings.get('CDR_CRAWLER'),
extracted_metadata={},
extracted_text='\n'.join(
    response.xpath('//body//text()').extract()),
Paul suggested the string() XPath function here: scrapy/parsel#34. I've tried it, and the output is a bit cleaner.

Yes, this is nicer, and there is less string joining, thanks!
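
To illustrate why the output gets cleaner, here is a rough sketch using only the standard library. It uses `ElementTree.itertext()` as a stand-in for the text nodes that `//body//text()` selects; this is an assumption made so the example runs without Scrapy. XPath `string()` applied to an element returns its descendant text nodes concatenated in document order, with no separator added. In Scrapy the equivalent call would be along the lines of `response.xpath('string(//body)').extract_first()`.

```python
import xml.etree.ElementTree as ET

# A well-formed snippet so the stdlib XML parser can read it; real pages
# would go through an HTML parser instead.
html = "<html><body><p>Hello, <b>world</b>!</p>\n<p>Bye</p></body></html>"
body = ET.fromstring(html).find('body')

# Every descendant text node of <body>, in document order.
text_nodes = list(body.itertext())

# Joining the nodes with '\n' splits inline markup across lines.
joined = '\n'.join(text_nodes)

# Plain concatenation, which is how XPath string() is defined for an element.
concatenated = ''.join(text_nodes)

print(repr(joined))
print(repr(concatenated))
```

The joined version breaks "Hello, world!" across three lines because `<b>` introduces extra text nodes, while the concatenated version keeps only the newlines that were in the markup.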

Following @kmike's suggestion. This gives cleaner output with fewer
extra newlines.

kmike commented Mar 16, 2016

What do you think about making CDR export non-optional and putting all extra info in extracted_metadata instead of using a separate PageItem? We could output the forms and form fields returned by formasaurus there; this would let us find out, e.g., how common captchas are.


lopuhin commented Mar 16, 2016

Yeah, I like the idea - I'll update the PR. There is also an optional crawl_data field, which should contain "Structured data included by crawler"; I'm not sure which of the two is the better fit.

What was previously stored in PageItem and FormItem
is now stored in extracted_metadata: is_page, depth, forms.
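
A sketch of what that layout could look like, assuming plain dicts. The keys `is_page`, `depth` and `forms` come from the commit message above; the helper name and the shape of each form entry are hypothetical, since the formasaurus output format is not shown in this PR.

```python
def build_extracted_metadata(is_page, depth, forms):
    """Collect per-page crawl info under extracted_metadata (hypothetical helper)."""
    return {
        'is_page': is_page,
        'depth': depth,
        'forms': list(forms),
    }


# The shape of each form entry is an assumption made for illustration.
example_forms = [
    {'form': 'login', 'fields': {'user': 'username', 'code': 'captcha'}},
]
metadata = build_extracted_metadata(is_page=True, depth=2, forms=example_forms)

# With forms stored per page, a question like "how common are captchas?"
# becomes a simple scan over the exported items.
has_captcha = any(
    'captcha' in form['fields'].values() for form in metadata['forms'])
```

Keeping everything in one field means a consumer only has to read the exported CDR items, with no separate PageItem or FormItem stream to join against.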

kmike commented Mar 16, 2016

Yeah, it is not clear what crawl_data is; I don't understand it. Let's just put the data somewhere and let others check; it is easy to change if we've chosen the wrong field.

url=url,
text=response.text,
if not self.link_extractor.matches(url):
    return

👍
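
The guard above can be sketched without Scrapy as a simple domain check on the final URL. This is a simplified stand-in: what `link_extractor.matches(url)` actually accepts depends on how the link extractor is configured, which this excerpt does not show. The names `same_domain` and `export_page` are hypothetical.

```python
from urllib.parse import urlsplit


def same_domain(url, allowed_domain):
    """Return True if url is on allowed_domain or one of its subdomains."""
    host = (urlsplit(url).hostname or '').lower()
    allowed = allowed_domain.lower()
    return host == allowed or host.endswith('.' + allowed)


def export_page(final_url, allowed_domain):
    # final_url is the URL after redirects, which may differ from the one
    # that was originally requested.
    if not same_domain(final_url, allowed_domain):
        return None  # skip pages that ended up on another domain
    return {'url': final_url}
```

For example, `export_page('http://other.org/', 'example.com')` returns `None`, so a redirect that lands on another site produces no exported item.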


lopuhin commented Mar 16, 2016

This might be for structured data from the page, like the price of an item, etc. So extracted_metadata looks better for now.

kmike added a commit that referenced this pull request Mar 16, 2016
@kmike kmike merged commit a952b66 into master Mar 16, 2016
@lopuhin lopuhin deleted the cdr branch March 16, 2016 12:59