
How to store urls and html content to json format? #83

@AlexPapas

Description


Hi,
First, I have to say this is an amazing tool.

I am struggling to understand how I can store the results in a JSON file for each start URL. Currently I am getting binary files for each URL within the domain, which makes it difficult to retrieve the information I am looking for (domain URL, sub-URL, status code, HTML content or plain text).

I am running the following command:

scrapy crawl undercrawler -a url=https://www.bvrgroep.nl -s CLOSESPIDER_PAGECOUNT=5
with
FILES_STORE = "\output_data"
This creates several files without extensions in that path, so it is hard for me to get my head around the 'UndercrawlerMediaPipeline' and how I can adjust it to store files in a readable format.
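For what it's worth, one way to get readable output is a small custom item pipeline that appends each scraped item as one JSON object per line, instead of relying on the media pipeline's binary file storage. This is only a sketch under assumptions: the class name is hypothetical, and the item keys ('url', 'status', 'text') may not match undercrawler's actual item fields — check what the spider yields.

```python
import json


class JsonLinesExportPipeline:
    """Sketch: write each scraped item as one JSON object per line.

    The field names below ('url', 'status', 'text') are assumptions;
    adjust them to whatever keys undercrawler's items actually carry.
    """

    def open_spider(self, spider):
        # One .jl (JSON Lines) file per crawl run.
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        record = {k: item.get(k) for k in ("url", "status", "text")}
        self.file.write(json.dumps(record, ensure_ascii=False) + "\n")
        return item
```

It would be enabled via the standard `ITEM_PIPELINES` setting. Alternatively, Scrapy's built-in feed exports (`-o items.json`, or `-O items.json` on Scrapy ≥ 2.0) can dump items to JSON with no custom code at all, provided the items serialize cleanly.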

Also, I cannot find the IMAGES_ENABLED setting in the settings file to stop downloading images.
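In stock Scrapy, media downloads are controlled by which pipelines are registered in `ITEM_PIPELINES`, so one blunt workaround is to override that setting and leave the media pipeline out. A sketch, assuming the dotted path below — check undercrawler's own settings.py for the exact path it registers:

```python
# settings.py sketch — the commented-out dotted path is an assumption,
# not confirmed against undercrawler's source.
ITEM_PIPELINES = {
    # Omitting (or never listing) the media pipeline entry means no
    # file/image downloads are performed:
    # 'undercrawler.media_pipelines.UndercrawlerMediaPipeline': 300,
}
```

The same override can be passed on the command line with `-s` if editing settings.py is inconvenient.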

PS: I have not activated Splash, as I do not have access to Docker on my laptop.

Could you please shed some light on this?
