Export in CDRv2 format #3

lopuhin · 2016-03-16T11:05:34Z

Also remove export of found forms, and do not save pages from other domains (we can get there after following redirects).

I've included only the required fields and left extracted_metadata empty. Also, I did not include the required _timestamp field, since the docs say it should be autogenerated by Elasticsearch.

kmike · 2016-03-16T11:32:29Z

undercrawler/spiders/base_spider.py

+            crawler=self.settings.get('CDR_CRAWLER'),
+            extracted_metadata={},
+            extracted_text='\n'.join(
+                response.xpath('//body//text()').extract()),


Paul suggested string() xpath function here: scrapy/parsel#34; I've tried it, and output is a bit cleaner.

Yes, this is nicer, and less string joining happening, thanks!

Following @kmike's suggestion. This gives cleaner output with fewer extra newlines.
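To illustrate why the string() approach gives cleaner output, here is a minimal sketch using only the standard library as a stand-in for the Scrapy selector. It assumes a tiny well-formed sample document; the helper names are hypothetical. Joining every text node with '\n' inserts a newline at each inline tag boundary, while the XPath string value (what something like `response.xpath('string(//body)').extract_first()` would return in the spider) is a plain concatenation of the descendant text nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, small enough to parse as XML.
HTML = "<html><body><p>Hello <b>big</b> world</p><p>Bye</p></body></html>"


def text_joined_with_newlines(body):
    # Mimics '\n'.join(response.xpath('//body//text()').extract()):
    # every text node becomes its own line, even inside one sentence.
    return "\n".join(body.itertext())


def text_string_value(body):
    # Mimics XPath string(//body): descendant text nodes concatenated
    # in document order, with nothing inserted between them.
    return "".join(body.itertext())


body = ET.fromstring(HTML).find("body")
joined = text_joined_with_newlines(body)
string_value = text_string_value(body)

print(repr(joined))
print(repr(string_value))
```

In this sample the joined variant breaks "Hello big world" across three lines, while the string value keeps it intact.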

kmike · 2016-03-16T12:31:20Z

What do you think about making CDR export non-optional and putting all extra info in extracted_metadata instead of using a separate PageItem? We could output the forms and form fields returned by formasaurus there; this would let us find out, e.g., how common captchas are.

lopuhin · 2016-03-16T12:34:04Z

Yeah, I like the idea - I'll update the PR. There is also an optional crawl_data field that should contain "Structured data included by crawler" - not sure which of the two is the best fit.

What was previously stored in PageItem and FormItem is now stored in extracted_metadata: is_page, depth, forms.
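A rough sketch of what such an item could look like, assuming only the fields mentioned in this thread (url, crawler, extracted_text, extracted_metadata). The function names and the shape of the `forms` entries are hypothetical and not taken from the PR or from formasaurus; _timestamp is left out on purpose, as discussed above.

```python
def make_cdr_item(url, extracted_text, crawler, depth, is_page, forms):
    """Build a CDR-style dict; the field set is illustrative, not the full schema."""
    return {
        "url": url,
        "crawler": crawler,
        "extracted_text": extracted_text,
        # Data formerly kept in PageItem / FormItem now lives here.
        "extracted_metadata": {
            "is_page": is_page,
            "depth": depth,
            "forms": list(forms),
        },
    }


def count_pages_with_captcha(items):
    """Count items where any form has a field typed 'captcha' (assumed layout)."""
    def has_captcha(item):
        forms = item["extracted_metadata"]["forms"]
        return any("captcha" in form.get("fields", {}).values() for form in forms)

    return sum(1 for item in items if has_captcha(item))


# Hypothetical form description: form type plus a field-name -> field-type map.
sample_forms = [{"form": "registration", "fields": {"code": "captcha"}}]
item = make_cdr_item(
    url="http://example.com/signup",
    extracted_text="Sign up",
    crawler="example-crawler",
    depth=2,
    is_page=True,
    forms=sample_forms,
)
print(count_pages_with_captcha([item]))
```

Keeping the forms inside extracted_metadata means a query like the captcha count needs only the exported items, with no separate form export.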

kmike · 2016-03-16T12:53:36Z

Yeah, it is not clear what crawl_data is; I don't understand it. Let's just put the data somewhere and let others check; it is easy to change if we've chosen the wrong field.

kmike · 2016-03-16T12:54:12Z

undercrawler/spiders/base_spider.py

-            url=url,
-            text=response.text,
+        if not self.link_extractor.matches(url):
+            return
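The check above skips pages whose final URL no longer matches the link extractor, which can happen when a redirect leads to another domain. As a standalone illustration of the same idea, here is a minimal sketch using only the standard library; `same_domain` and `should_export` are hypothetical helpers, not part of the spider, and a plain domain suffix test is assumed in place of the link extractor's rules.

```python
from urllib.parse import urlparse


def same_domain(url, allowed_domain):
    """True if the URL's host is allowed_domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    allowed = allowed_domain.lower()
    return host == allowed or host.endswith("." + allowed)


def should_export(final_url, allowed_domain):
    # Decide based on the URL we ended up at after redirects,
    # not the URL that was originally requested.
    return same_domain(final_url, allowed_domain)


print(should_export("http://example.com/page", "example.com"))
print(should_export("http://other.org/login", "example.com"))
```

Matching on the dot-prefixed suffix avoids treating a host such as evil-example.com as part of example.com.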


lopuhin · 2016-03-16T12:55:25Z

This might be for some structured data from the page, like price of an item, etc. So extracted_metadata looks better for now.

Export in CDRv2 format

06ce8f3

lopuhin assigned kmike Mar 16, 2016

kmike reviewed Mar 16, 2016

Extract text using string() xpath selector

cb37620

Always use CDR format, add extracted_metadata

5533d16

kmike reviewed Mar 16, 2016

👍

kmike added a commit that referenced this pull request Mar 16, 2016

Merge pull request #3 from TeamHG-Memex/cdr

a952b66

Export in CDRv2 format

kmike merged commit a952b66 into master Mar 16, 2016

lopuhin deleted the cdr branch March 16, 2016 12:59
