bs4 implementation #9

blacksmithop · 2020-10-02T13:12:33Z

Change Overview:

Uses bs4 to scrape the fields requested in the issue.
Current format:

question, url and source_verified were acquired relatively easily. answer is the combination of the commands mentioned in the accepted answer. Because the filter being applied ranks answers by votes, only the accepted answer is taken as the answer. manual_verified and final_answer default to False and None respectively, since they need to be verified manually.
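As a rough illustration of the record described above, the following sketch builds one data item with the two manual fields left at their defaults. The `DataItem` dataclass, the `build_item` helper, the newline separator used to combine commands, and the example URL are all assumptions made for illustration; they are not taken from the PR's code.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class DataItem:
    """One scraped record (hypothetical container, field names from the PR text)."""
    question: str
    url: str
    source_verified: bool
    answer: str
    manual_verified: bool = False        # needs a human check, so defaults to False
    final_answer: Optional[str] = None   # filled in only after manual verification


def build_item(question: str, url: str, source_verified: bool,
               commands: List[str]) -> DataItem:
    """Combine the commands found in the accepted answer into a single answer string.

    Joining with newlines is an assumption; the PR only says the commands are combined.
    """
    answer = "\n".join(cmd.strip() for cmd in commands if cmd.strip())
    return DataItem(question=question, url=url,
                    source_verified=source_verified, answer=answer)


# Example with made-up values; blank commands are skipped.
item = build_item("How do I list files?", "https://example.com/q/1",
                  True, ["ls -la ", "", "ls -lh"])
record = asdict(item)
```

Keeping the defaults on the dataclass means the scraper never has to set `manual_verified` or `final_answer` itself.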

Testing Overview:

Creating a dict for each url and appending it to a list is memory-intensive and takes a while to finish.
This could be replaced with writing each row to a CSV file to remove that overhead.
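The CSV idea could look roughly like the sketch below, which writes each row as soon as it is produced instead of collecting every dict in a list. The `write_rows` and `fake_rows` helpers are hypothetical, and `fake_rows` stands in for the real scraper; an in-memory buffer is used here only so the example runs without touching disk.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, TextIO

# Column names taken from the fields listed in the PR description.
FIELDS = ["question", "url", "source_verified", "answer",
          "manual_verified", "final_answer"]


def write_rows(rows: Iterable[Dict[str, object]], out: TextIO) -> int:
    """Stream rows to `out` one at a time and return how many were written."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for row in rows:          # only one row is held in memory at a time
        writer.writerow(row)
        count += 1
    return count


def fake_rows(n: int) -> Iterator[Dict[str, object]]:
    """Placeholder generator standing in for the scraper (made-up data)."""
    for i in range(n):
        yield {
            "question": f"question {i}",
            "url": f"https://example.com/q/{i}",
            "source_verified": True,
            "answer": f"cmd {i}",
            "manual_verified": False,
            "final_answer": None,   # csv writes None as an empty cell
        }


buffer = io.StringIO()
written = write_rows(fake_rows(3), buffer)
lines = buffer.getvalue().splitlines()
```

With a real file, `open(path, "w", newline="")` would replace the `StringIO` buffer, as the `csv` module documentation recommends.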

Related Issues / PRs:

Further testing will have to be done remotely with larger PAGE_MAX.

navan0 · 2020-10-02T13:17:05Z

@GopikrishnanSasikumar check and merge

gksoriginals · 2020-10-02T16:40:35Z

dataset/datasetGen.py

    yield tag_url


-def Link_To_JSON(urls: list):
-    url = urls[0]
+def Url_To_Data(url: str):


Wouldn't it be great if we could split url_to_data() into smaller methods? This violates SLAP (the Single Level of Abstraction Principle).

Actually, that sounds better than having all the operations done under one hood. I'll look into what can be moved around.

blacksmithop · 2020-10-03T07:52:09Z

Rewrote the scraping logic to fit a class structure.
DataSetGen now has class variables PAGE_MAX and platform_to_tag. Class variables are used to share current data among methods.
All operations, including the collection of urls and the operations under url_to_data, are now class methods, namely get_question, is_verified (returns a Boolean as well as the url, if applicable) and get_answer.
This was run on a remote Linux environment with a PAGE_MAX of 1 and one tag for askubuntu. (screenshot)
Only accepted answers are considered by data_from_url (responsible for building the collection of data items).
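For readers without the diff at hand, here is a minimal sketch of how such a class might be laid out. It is not the merged code: the CSS selectors, the toy HTML, the tag value in `platform_to_tag`, the constructor signature, and the choice to return an anchor link from `is_verified` are all assumptions, and the sketch keeps per-page state on the instance rather than in class variables. It parses an inline string so it runs without network access.

```python
from typing import Dict, Optional, Tuple

from bs4 import BeautifulSoup


class DataSetGen:
    # Class variables named in the PR; the values here are placeholders.
    PAGE_MAX = 1
    platform_to_tag = {"askubuntu": ["command-line"]}

    def __init__(self, html: str, url: str):
        # Assumption: the page HTML has already been fetched elsewhere.
        self.url = url
        self.soup = BeautifulSoup(html, "html.parser")

    def get_question(self) -> Optional[str]:
        # Hypothetical selector for the question title.
        title = self.soup.select_one("h1.question-title")
        return title.get_text(strip=True) if title else None

    def is_verified(self) -> Tuple[bool, Optional[str]]:
        # Returns (True, link) when an accepted answer exists, else (False, None).
        accepted = self.soup.select_one("div.accepted-answer")
        if accepted is None:
            return False, None
        anchor = accepted.get("id")
        return True, f"{self.url}#{anchor}" if anchor else self.url

    def get_answer(self) -> Optional[str]:
        # Combine every <code> snippet inside the accepted answer.
        accepted = self.soup.select_one("div.accepted-answer")
        if accepted is None:
            return None
        return "\n".join(c.get_text(strip=True) for c in accepted.find_all("code"))

    def data_from_url(self) -> Optional[Dict[str, object]]:
        # Only pages with an accepted answer produce a data item.
        verified, _ = self.is_verified()
        if not verified:
            return None
        return {
            "question": self.get_question(),
            "url": self.url,
            "source_verified": verified,
            "answer": self.get_answer(),
            "manual_verified": False,
            "final_answer": None,
        }


TOY_HTML = """
<h1 class="question-title">How do I list hidden files?</h1>
<div class="answer" id="a1"><code>dir</code></div>
<div class="answer accepted-answer" id="a2">
  <p>Run <code>ls -a</code> or <code>ls -la</code></p>
</div>
"""

item = DataSetGen(TOY_HTML, "https://example.com/q/1").data_from_url()
no_item = DataSetGen("<h1 class='question-title'>Unanswered</h1>",
                     "https://example.com/q/2").data_from_url()
```

Each method stays at one level of abstraction, which is the point raised in the review: `data_from_url` only assembles the record, while the parsing details live in the smaller helpers.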

navan0

Looks good

blacksmithop added 3 commits October 2, 2020 17:02

scraping data with bs4

ec28b9d

scraping data with bs4

cfffd93

scraping data with bs4

75407d2

navan0 approved these changes Oct 2, 2020

gksoriginals requested changes Oct 2, 2020

blacksmithop added 2 commits October 3, 2020 13:12

abstraction and class

0ebf6df

scraping data with bs4

67d42aa

navan0 approved these changes Oct 3, 2020

navan0 merged commit 31c8fcf into Nysa-clan:master Oct 3, 2020
