bs4 implementation #9

Merged
merged 5 commits into from
Oct 3, 2020

blacksmithop
Contributor

Change Overview:

Uses bs4 to scrape the requirements mentioned in the issue.
Current format:
image
question, url and source_verified were acquired relatively easily. answer is the combination of the commands mentioned in the accepted answer. Because of the filter being applied (ranked by votes), only the accepted answer is taken as the answer. manual_verified and final_answer default to False and None respectively, since they need to be verified manually.
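
For illustration, the record described above could be modelled roughly as follows. This is a sketch, not the code in this PR: the `Record` dataclass, the `build_record` helper and the choice to join commands with newlines are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    """One scraped item; field names follow the PR description."""
    question: str
    url: str
    source_verified: bool
    answer: str
    # These two need a human check, so they start out unset.
    manual_verified: bool = False
    final_answer: Optional[str] = None


def build_record(question: str, url: str, source_verified: bool,
                 commands: List[str]) -> Record:
    """Hypothetical helper: combine the accepted answer's commands into one string.

    Joining with newlines is an assumption; the PR does not say how the
    commands are combined.
    """
    return Record(
        question=question,
        url=url,
        source_verified=source_verified,
        answer="\n".join(commands),
    )
```

Keeping the manually verified fields as defaults means the scraper never has to set them.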

Testing Overview:

Creating a dict for each url and appending it to a list is memory-intensive and takes a while to finish.
This could be replaced with writing each row to a CSV file as it is scraped, which removes that overhead.
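
A minimal sketch of that CSV idea, assuming the rows come from a generator so that only one dict is alive at a time. `write_rows`, `FIELDS` and `fake_rows` are hypothetical names used only for this example.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, TextIO

# Column order is an assumption based on the fields listed in this PR.
FIELDS = ["question", "url", "source_verified", "answer",
          "manual_verified", "final_answer"]


def write_rows(rows: Iterable[Dict], out: TextIO) -> int:
    """Write each row as soon as it is produced; return the number written."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)  # the row can be garbage-collected after this
        count += 1
    return count


def fake_rows(n: int) -> Iterator[Dict]:
    """Stand-in for the scraper: yields rows lazily instead of building a list."""
    for i in range(n):
        yield {
            "question": f"question {i}",
            "url": f"https://example.com/q/{i}",
            "source_verified": False,
            "answer": "ls",
            "manual_verified": False,
            "final_answer": None,  # written as an empty cell
        }


if __name__ == "__main__":
    buffer = io.StringIO()  # swap for open("data.csv", "w", newline="") on disk
    print(write_rows(fake_rows(3), buffer), "rows written")
```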

Related Issues / PRs:

Further testing will have to be done remotely with larger PAGE_MAX.

@navan0
Collaborator

navan0 commented Oct 2, 2020

@GopikrishnanSasikumar check and merge

yield tag_url


def Link_To_JSON(urls: list):
url = urls[0]
def Url_To_Data(url: str):

Collaborator

Wouldn't it be great if we could split url_to_data() into smaller methods? As written, it violates SLAP (the Single Level of Abstraction Principle).

Contributor Author

Actually, that sounds better than having all the operations be done under one hood. I'll look into what can be moved around.

@blacksmithop
Contributor Author

image
Rewrote the scraping logic to fit a class structure.
DataSetGen now has the class variables PAGE_MAX and platform_to_tag. Class variables are used to share current data among methods.
All operations, including the collection of urls and the operations previously under url_to_data, are now class methods,
namely get_question, is_verified (returns a Boolean as well as the url, if applicable) and get_answer.
This was run on a remote Linux environment with a PAGE_MAX of 1 and one tag for askubuntu (screenshot).
Only accepted answers are considered by data_from_url (responsible for building the collection of data items).
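
A rough sketch of how such a split might look, not the merged code. The CSS selectors, the sample class-variable values and `data_from_html` (which takes HTML directly instead of fetching a URL) are assumptions for illustration. `is_verified` is assumed here to mean "the accepted answer links to a source", which the PR does not confirm.

```python
from typing import Optional, Tuple

from bs4 import BeautifulSoup


class DataSetGen:
    # Class variables shared by all methods (values are illustrative).
    PAGE_MAX = 1
    platform_to_tag = {"askubuntu": ["command-line"]}

    # Hypothetical selectors; the real page markup may differ.
    QUESTION_SEL = "#question-header h1"
    ACCEPTED_SEL = "div.accepted-answer"

    @classmethod
    def get_question(cls, soup: BeautifulSoup) -> Optional[str]:
        node = soup.select_one(cls.QUESTION_SEL)
        return node.get_text(strip=True) if node else None

    @classmethod
    def is_verified(cls, soup: BeautifulSoup) -> Tuple[bool, Optional[str]]:
        """Return (True, link) if the accepted answer contains a link."""
        accepted = soup.select_one(cls.ACCEPTED_SEL)
        link = accepted.select_one("a[href]") if accepted else None
        return (True, link["href"]) if link else (False, None)

    @classmethod
    def get_answer(cls, soup: BeautifulSoup) -> Optional[str]:
        """Join the code snippets of the accepted answer, if there is one."""
        accepted = soup.select_one(cls.ACCEPTED_SEL)
        if accepted is None:
            return None
        commands = [code.get_text() for code in accepted.select("code")]
        return "\n".join(commands) if commands else None

    @classmethod
    def data_from_html(cls, url: str, html: str) -> Optional[dict]:
        """Build one data item; pages without an accepted answer are skipped."""
        soup = BeautifulSoup(html, "html.parser")
        answer = cls.get_answer(soup)
        if answer is None:
            return None
        verified, _source_url = cls.is_verified(soup)
        return {
            "question": cls.get_question(soup),
            "url": url,
            "source_verified": verified,
            "answer": answer,
            "manual_verified": False,
            "final_answer": None,
        }
```

Each method stays at one level of abstraction: the small getters deal with markup, and the builder only composes their results.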

Collaborator

@navan0 left a comment

Looks good

@navan0 navan0 merged commit 31c8fcf into Nysa-clan:master Oct 3, 2020