Pop from empty list, crash in /webscrape #108

Open
KastanDay opened this issue Oct 9, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@KastanDay

I got it by scraping this: https://ncsa-delta-doc.readthedocs-hosted.com/en/latest/index.html

Heads up, a new bug:
File "/app/ai_ta_backend/web_scrape.py", line 450, in breadth_crawler
url = self.queue[depth].pop(0)
IndexError: pop from empty list
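
For context, here is a minimal sketch of how a per-depth queue can be drained without ever calling `pop(0)` on an empty list. It is not the actual `breadth_crawler` code, whose internals are not shown in this issue. It assumes `queue[depth]` is a list of URLs for that depth, and `crawl_breadth_first` and `get_links` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List, Set


def crawl_breadth_first(start_url: str,
                        get_links: Callable[[str], List[str]],
                        max_depth: int = 2) -> List[str]:
    """Visit URLs level by level; stop cleanly when a level has no URLs."""
    # queue[depth] holds the URLs discovered at that depth (assumed layout).
    queue: Dict[int, List[str]] = {0: [start_url]}
    seen: Set[str] = {start_url}
    visited: List[str] = []

    for depth in range(max_depth + 1):
        level = queue.get(depth, [])
        # Guard: looping on the list's truthiness means pop(0) is never
        # called on an empty list, which is what raises IndexError.
        while level:
            url = level.pop(0)
            visited.append(url)
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.setdefault(depth + 1, []).append(link)
    return visited


# A start page with no outgoing links: the crawl visits it and then stops.
no_links = crawl_breadth_first("https://example.com/", lambda url: [])
```

With this shape, a depth level that never received any URLs is simply skipped instead of raising.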

Full error:

2023-10-09 22:29:57,249:ERROR - Exception on /web-scrape [GET]
Traceback (most recent call last):
  File "/opt/venv/lib/python3.8/site-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/venv/lib/python3.8/site-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/venv/lib/python3.8/site-packages/flask_cors/extension.py", line 176, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/opt/venv/lib/python3.8/site-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/venv/lib/python3.8/site-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/app/ai_ta_backend/main.py", line 349, in scrape
    success_fail_dict = scraper.main_crawler(url, course_name, max_urls, max_depth, timeout, stay_on_baseurl, depth_or_breadth)
  File "/app/ai_ta_backend/web_scrape.py", line 532, in main_crawler
    self.breadth_crawler(url=url, course_name=course_name, timeout=timeout, base_url_on=base_url_str, max_depth=max_depth)
  File "/app/ai_ta_backend/web_scrape.py", line 450, in breadth_crawler
    url = self.queue[depth].pop(0)
IndexError: pop from empty list

KastanDay added the bug label Oct 9, 2023
@jkmin3

jkmin3 commented Oct 11, 2023

Ahh, I see. I have a catch for this error now, but should we maybe add a base URL input for cases like this? For example, for this site the user might want to enter https://ncsa as the base URL.
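
A rough sketch of what a user-supplied base URL filter could look like, using only the standard library. This is an assumption about how such an input might work, not the scraper's existing behavior. `filter_by_base_url` is a hypothetical helper and the `example.com` URLs are placeholders.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def filter_by_base_url(links: Iterable[str], base_url: str) -> List[str]:
    """Keep only links on the same host as base_url and under its path."""
    base = urlparse(base_url)
    base_path = base.path.rstrip("/")
    kept = []
    for link in links:
        parsed = urlparse(link)
        same_host = parsed.netloc.lower() == base.netloc.lower()
        # Match the base path itself or anything nested below it.
        under_path = (parsed.path == base_path
                      or parsed.path.startswith(base_path + "/"))
        if same_host and under_path:
            kept.append(link)
    return kept


# Placeholder links, not taken from the site in this issue.
links = [
    "https://docs.example.com/en/latest/index.html",
    "https://docs.example.com/fr/latest/",
    "https://other.example.org/en/latest/",
]
kept = filter_by_base_url(links, "https://docs.example.com/en/latest/")
```

A broader base URL (host only) keeps every link on that host, while a narrower one can filter out everything and leave the next depth level empty, so the choice of base matters for this crash.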

@KastanDay

Interesting, I'm not sure I follow.

I thought this error occurred when the BaseURL didn't have any links on the page. So the input page has 0 additional links. Is that right?
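
One way to check that would be to count the links a parser actually finds in the page's HTML. Below is a small standard-library sketch. `LinkCollector` and `extract_links` are hypothetical names, not the project's helpers, and the HTML strings are made-up samples rather than the real page.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.page_url = page_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative hrefs against the page's own URL.
                self.links.append(urljoin(self.page_url, value))


def extract_links(html: str, page_url: str) -> List[str]:
    collector = LinkCollector(page_url)
    collector.feed(html)
    return collector.links


page_url = "https://docs.example.com/en/latest/index.html"
empty = extract_links("<p>No links here</p>", page_url)
found = extract_links(
    '<a href="/en/latest/a.html">A</a><a href="https://x.example.org/">X</a>',
    page_url,
)
```

If the count is 0 for the input page, that supports the "no links on the page" explanation. If it is nonzero, the links are probably being dropped by a later filter.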
