Output.txt doesn't contain anything. #1

Open
ChiragSoni95 opened this issue Jul 13, 2018 · 8 comments

Comments

@ChiragSoni95

ChiragSoni95 commented Jul 13, 2018

What exactly do you store in output.txt apart from the PDF links? The script runs successfully, but nothing is written to output.txt (even after uncommenting the lines).

@ChiragSoni95 ChiragSoni95 changed the title Beautiful Soup takes a lot of time DownloadSamplePaper.py takes a lot of time Jul 13, 2018
@ChiragSoni95 ChiragSoni95 changed the title DownloadSamplePaper.py takes a lot of time Output.txt doesn't contain anything. Jul 13, 2018
@laxmanverma
Owner

Which Python version are you using?
Use 2.7.
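
A note on why the version matters: the traceback further down this thread shows the script calling `urllib.urlretrieve`, which is where that function lives in Python 2; in Python 3 it moved to `urllib.request`. A minimal sketch of an import that works on either version, assuming `urlretrieve` is the only relocated call the script depends on (the rest of the script is not shown in this thread and may need other changes):

```python
# Assumption: only urlretrieve needs to be shared between Python 2 and 3.
try:
    # Python 3: urlretrieve is part of urllib.request
    from urllib.request import urlretrieve
except ImportError:
    # Python 2.7: urlretrieve is a top-level function of urllib
    from urllib import urlretrieve
```

With this in place the script would call `urlretrieve(pdfLink, pdfName)` instead of `urllib.urlretrieve(pdfLink, pdfName)`.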

@ChiragSoni95
Author

@laxmanverma, I am using Python 3.6.
I tried uncommenting lines 46-48 and 65-68. It just shows the hyperlink I entered and then keeps running without ever stopping.
I get the following output:
[Screenshot: screen shot 2018-07-15 at 7 06 44 pm]

@ChiragSoni95
Author

@laxmanverma Okay, I will try with Python 2.7 and let you know.
Thanks.

@ChiragSoni95
Author

I tried running it on Python 2.7.
It gives the error below. I also tried printing pdfName and pdfLink; the output follows the stack trace.

Error Stack Trace:
Traceback (most recent call last):
  File "/Users/chirag/PycharmProjects/LayoutLearning/scrape_pdfs.py", line 83, in <module>
    lookUp ();
  File "/Users/chirag/PycharmProjects/LayoutLearning/scrape_pdfs.py", line 80, in lookUp
    crawlPage ( htmlSourceCode )
  File "/Users/chirag/PycharmProjects/LayoutLearning/scrape_pdfs.py", line 66, in crawlPage
    urllib.urlretrieve ( pdfLink, pdfName )
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 245, in retrieve
    fp = self.open(url, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 469, in open_file
    return self.open_local_file(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 483, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: ''

Print output:
physics-All The Best For You Board Exams-TYPE html>

<title>CBSE 12th Science Previous Year Question Papers All Subjects</title>
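
The trace goes through `open_file` and `open_local_file`, which is the path Python 2's `urllib` takes when the string it is given has no scheme such as `http://`. Together with the printed value, this suggests that `pdfLink` held a fragment of the page's HTML rather than a URL. One way to guard against that is to validate the link before downloading. The sketch below is written for Python 3 (on 2.7 the import would be `from urlparse import urljoin, urlparse`); `resolve_pdf_link` is a hypothetical helper, not part of the script, and the `.pdf` suffix check is an assumption that will miss PDFs served from URLs without that extension.

```python
from urllib.parse import urljoin, urlparse


def resolve_pdf_link(page_url, href):
    """Return an absolute http(s) URL ending in .pdf, or None if href is unusable.

    Hypothetical helper: page_url is the page being crawled, href is the
    candidate link extracted from it.
    """
    if not href or not href.strip():
        return None  # an empty link is what triggers the local-file lookup
    absolute = urljoin(page_url, href.strip())
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None  # not a web URL, so urlretrieve would not fetch it over HTTP
    if not parsed.path.lower().endswith(".pdf"):
        return None  # assumption: only links whose path ends in .pdf are wanted
    return absolute
```

A caller would skip the download whenever the helper returns `None`, instead of passing the raw string to `urlretrieve`.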

@laxmanverma
Owner

laxmanverma commented Jul 16, 2018 via email

@ChiragSoni95
Author

Yes.
So what can I do to make this generic? Every site will have a different DOM structure, and I want to retrieve as many PDFs from the web as I can.
What can I do for that?
Can you help?
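
One site-independent approach to the question above is to ignore the page layout entirely and collect every anchor whose target looks like a PDF. The sketch below uses only the Python 3 standard library; `PdfLinkParser` and `find_pdf_links` are hypothetical names, and it assumes the PDFs are reachable through plain `<a href="...">` links ending in `.pdf`, so links generated by JavaScript or served without that extension would be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href="..."> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page, used to resolve relative links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return  # anchor without a usable href
        absolute = urljoin(self.base_url, href.strip())
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.links:
            self.links.append(absolute)  # keep first occurrence, skip duplicates


def find_pdf_links(html, base_url):
    """Return the PDF links found in an HTML string, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Each returned URL could then be handed to `urlretrieve`; because nothing here depends on a particular site's markup, the same function can be reused across pages.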

@laxmanverma
Owner

laxmanverma commented Jul 16, 2018 via email

@ChiragSoni95
Author

Okay thanks!!
I will try!
