bs4 implementation #9
Conversation
@GopikrishnanSasikumar check and merge
dataset/datasetGen.py (Outdated)

    yield tag_url

    def Link_To_JSON(urls: list):
        url = urls[0]

    def Url_To_Data(url: str):
Wouldn't it be great if we could split url_to_data() into smaller methods? This violates SLAP (the Single Level of Abstraction Principle).
Actually, that sounds better than having all the operations be done under one hood. I'll look into what can be moved around.
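For reference, a rough sketch of what that split could look like, assuming a typical requests + bs4 flow; the helper names (fetch_page, extract_question, extract_accepted_answer) and the selectors are placeholders, not the actual code in datasetGen.py:

```python
# Hypothetical sketch of splitting Url_To_Data into single-purpose helpers.
# Helper names and selectors are assumptions; only Url_To_Data and bs4 come
# from this PR.
import requests
from bs4 import BeautifulSoup


def fetch_page(url: str) -> BeautifulSoup:
    """Download the page and return the parsed soup."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


def extract_question(soup: BeautifulSoup) -> str:
    """Pull the question text out of the page (selector is a guess)."""
    title = soup.find("title")
    return title.get_text(strip=True) if title else ""


def extract_accepted_answer(soup: BeautifulSoup) -> str:
    """Join the commands found in the accepted answer (CSS class is a guess)."""
    accepted = soup.find(class_="accepted-answer")
    if accepted is None:
        return ""
    return "\n".join(code.get_text() for code in accepted.find_all("code"))


def Url_To_Data(url: str) -> dict:
    """Thin orchestrator: each step now lives at its own level of abstraction."""
    soup = fetch_page(url)
    return {
        "question": extract_question(soup),
        "url": url,
        "answer": extract_accepted_answer(soup),
    }
```

The point is that Url_To_Data only orchestrates, while each helper sits at a single level of abstraction.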
Looks good
Change Overview:
Uses bs4 to scrape the requirements mentioned in the issue.

Current format:
question, url and source_verified were acquired relatively easily.
answer is the combination of the commands mentioned in the accepted answer. Due to the nature of the filter being applied (ranked by votes), only the accepted answer is taken as the answer.
manual_verified and final_answer default to False and None since they need to be verified manually.
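As a rough illustration of that format (only the field names come from this description; the function and its arguments are hypothetical), each scraped url would map onto a record like this:

```python
# Minimal sketch of the record format described above. Only the field names
# come from the PR description; make_record and its arguments are hypothetical.
def make_record(question: str, url: str, answer: str, source_verified: bool) -> dict:
    return {
        "question": question,               # scraped question text
        "url": url,                         # the page that was scraped
        "source_verified": source_verified,
        "answer": answer,                   # commands from the accepted answer only
        "manual_verified": False,           # defaults to False until reviewed by hand
        "final_answer": None,               # filled in after manual verification
    }
```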
Testing Overview:
Creating a dict for each url and appending it to a list is memory intensive and takes a while to finish. This can be replaced with writing to a CSV file to remove that overhead.
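Something along these lines (a sketch only; the file name and csv usage are assumptions, not code from this PR) would stream each record to disk instead of holding the whole list in memory:

```python
# Hedged sketch of the CSV alternative: write rows incrementally with
# csv.DictWriter so per-url dicts never accumulate in a list.
import csv

FIELDS = ["question", "url", "source_verified", "answer", "manual_verified", "final_answer"]


def write_records(records, path="dataset.csv"):
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for record in records:       # records can be a generator, so only one
            writer.writerow(record)  # row is held in memory at a time
```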
Related Issues / PRs:
Further testing will have to be done remotely with a larger PAGE_MAX.