
How to store URLs and HTML content in JSON format? #83

Open

AlexPapas opened this issue Jul 15, 2021 · 1 comment

AlexPapas commented Jul 15, 2021

Hi,
I have to say, amazing tool.

I am struggling to understand how I can store the results in a JSON file for each start URL. Currently I am getting binary files for each URL within the domain, and I have difficulties retrieving the information that I am seeking (domain URL, sub-URL, status code, HTML content or plain text).
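With a plain Scrapy spider I would normally just use the built-in feed exports to get JSON lines, e.g.:

scrapy crawl undercrawler -a url=https://www.bvrgroep.nl -s CLOSESPIDER_PAGECOUNT=5 -o pages.jl

but I am not sure whether undercrawler's items carry the fields I listed above, so that may not apply here.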

I am running the following command:

scrapy crawl undercrawler -a url=https://www.bvrgroep.nl -s CLOSESPIDER_PAGECOUNT=5
with
FILES_STORE = "\output_data"
This creates several files without an extension in that path, so it is hard for me to get my head around UndercrawlerMediaPipeline and how I can adjust it to store files in a readable format.
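To show what I am after, here is a minimal sketch of an item pipeline that would write one JSON line per crawled page. The field names 'url' and 'raw_content' are my guesses; I do not know which keys undercrawler's items actually use:

import json
import os

class JsonLinesExportPipeline:
    """Sketch only: write one JSON object per crawled page."""

    def open_spider(self, spider):
        # Collect everything in a single JSON-lines file.
        os.makedirs('output_data', exist_ok=True)
        self.file = open('output_data/pages.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # 'url' and 'raw_content' are assumed field names, not
        # necessarily what undercrawler's items provide.
        record = {
            'url': item.get('url'),
            'raw_content': item.get('raw_content'),
        }
        self.file.write(json.dumps(record, ensure_ascii=False) + '\n')
        return item

# Enabled in settings.py (the module path 'myproject.pipelines' is
# hypothetical):
# ITEM_PIPELINES = {'myproject.pipelines.JsonLinesExportPipeline': 300}

Is something along these lines the intended way, or is there a setting I am missing?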

Also, I cannot find the IMAGES_ENABLED setting in the settings file to stop downloading images.
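As a possible workaround, would disabling the media pipeline in settings.py work? Scrapy generally lets you disable a component by mapping it to None, but the dotted path below is only my guess at where UndercrawlerMediaPipeline lives:

ITEM_PIPELINES = {
    # None disables a pipeline that a base setting enabled;
    # the module path here is my assumption.
    'undercrawler.media_pipeline.UndercrawlerMediaPipeline': None,
}

Is that the right way to stop image downloads here?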

PS: I have not activated Splash, as I do not have access to Docker on my laptop.

Could you please shed some light on this?

AlexPapas (Author) commented:

[image attachment]
