
How to store urls and html content to json format? #83

@AlexPapas

Description


Hi,
First, I have to say this is an amazing tool.

I am struggling to understand how I can store the results in a JSON file for each start URL. Currently I am getting binary files for each URL within the domain, which makes it difficult to retrieve the information I am looking for (domain URL, sub-URL, status code, HTML content or plain text).

I am running the following command:

scrapy crawl undercrawler -a url=https://www.bvrgroep.nl -s CLOSESPIDER_PAGECOUNT=5
with
FILES_STORE = "\output_data"
This creates several files without extensions in that path, so it is hard for me to get my head around the 'UndercrawlerMediaPipeline' and how I can adjust it to store files in a readable format.
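For what it's worth, one way to get readable output is a small custom item pipeline that appends each scraped item as one JSON object per line, instead of relying on the media pipeline's binary file storage. This is only a sketch under assumptions: the class name is hypothetical, and the item keys ('url', 'status', 'text') may not match undercrawler's actual item fields — check what the spider yields.

```python
import json


class JsonLinesExportPipeline:
    """Sketch: write each scraped item as one JSON object per line.

    The field names below ('url', 'status', 'text') are assumptions;
    adjust them to whatever keys undercrawler's items actually carry.
    """

    def open_spider(self, spider):
        # One .jl (JSON Lines) file per crawl run.
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        record = {k: item.get(k) for k in ("url", "status", "text")}
        self.file.write(json.dumps(record, ensure_ascii=False) + "\n")
        return item
```

It would be enabled via the standard `ITEM_PIPELINES` setting. Alternatively, Scrapy's built-in feed exports (`-o items.json`, or `-O items.json` on Scrapy ≥ 2.0) can dump items to JSON with no custom code at all, provided the items serialize cleanly.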

Also, I cannot find the IMAGES_ENABLED setting in the settings file to stop downloading images.
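In stock Scrapy, media downloads are controlled by which pipelines are registered in `ITEM_PIPELINES`, so one blunt workaround is to override that setting and leave the media pipeline out. A sketch, assuming the dotted path below — check undercrawler's own settings.py for the exact path it registers:

```python
# settings.py sketch — the commented-out dotted path is an assumption,
# not confirmed against undercrawler's source.
ITEM_PIPELINES = {
    # Omitting (or never listing) the media pipeline entry means no
    # file/image downloads are performed:
    # 'undercrawler.media_pipelines.UndercrawlerMediaPipeline': 300,
}
```

The same override can be passed on the command line with `-s` if editing settings.py is inconvenient.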

PS: I have not activated Splash, as I do not have access to Docker on my laptop.

Could you please shed some light on this?
