added nlp tokenizer to python gsheet pull script #691
Conversation
Looks good, although like I said, I'm by no means a Python expert.
A couple of quick comments/questions, but nothing definitely requiring a change 👍
```python
for parentKey, pgrp in language_df.groupby(['parentKey']):
    childKeyDict = {}
    for childKey, cgrp in pgrp.groupby(['childKey']):
        fieldKeyDict = {}
        for fieldKey, fgrp in cgrp.groupby(['fieldKey']):
            fgrp_sub = fgrp.filter(regex='[Vv]alue')
            valueDict = {"array": []}
            for value, translatedValue in fgrp_sub.values:
                valueDict[value] = translatedValue
                valueDict["array"].append(translatedValue)
            fieldKeyDict[fieldKey] = valueDict
        childKeyDict[childKey] = fieldKeyDict
    parentKeyDict[parentKey] = childKeyDict
wkDict.update(parentKeyDict)
```
Is there no python/pandas/numpy function that would allow us to avoid this structure? If not, it's probably fine for the time being, but this many nested for loops can lead to nasty runtime complexity, like O(n^4) if I'm not mistaken, so it just stood out to me.
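For what it's worth, a single `groupby` over all three key columns can flatten the nesting. A rough sketch only (the column names are guessed from the snippet above, and the sample frame is made up; the real script may differ):

```python
import pandas as pd

# Hypothetical frame mirroring the columns used in the snippet above.
language_df = pd.DataFrame({
    "parentKey": ["app", "app"],
    "childKey": ["home", "home"],
    "fieldKey": ["title", "subtitle"],
    "value": ["Hello", "World"],
    "translatedValue": ["Hola", "Mundo"],
})

wkDict = {}
# One pass: group by all three keys at once instead of three nested loops.
for (parent, child, field), grp in language_df.groupby(
        ["parentKey", "childKey", "fieldKey"]):
    valueDict = {"array": list(grp["translatedValue"])}
    valueDict.update(dict(zip(grp["value"], grp["translatedValue"])))
    wkDict.setdefault(parent, {}).setdefault(child, {})[field] = valueDict
```

Note that since each row lands in exactly one (parentKey, childKey, fieldKey) group, both versions do work linear in the number of rows rather than O(n^4); the flattened form mostly just reads better.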
Thanks for the reference!
I’m going to look into the style for imports in a script and then do another commit.
Right now the nested loops are needed. I was using pandas `to_json`, but it isn't specific enough to produce both the array and value keys. I think we could write more effective list comprehensions, but right now complexity shouldn't be a concern since there are so few rows in the gsheets. In the future, though, we definitely need better checks for a potential runtime problem.
Also is there a requirements.txt or something to require the python library dependencies?
Fair enough - I figured this may be unavoidable for now, and even so it won't make much of a difference, since O(n^2) vs O(n^4) is likely a matter of milliseconds when it's not thousands or hundreds of thousands of rows lol
And if you're asking whether we can put in a requirements.txt, I think we can just put it right in the src/python directory, right? I honestly have very limited experience with Python and have never used it in a professional setting. Feel free to add a requirements.txt to this PR, or if we want to do that in a separate PR, at least add the required libs to the README for contributors to get up and running - good catch
No, great point - for scalability we need to reduce the complexity. But I'm not the best at doing so, so I'll let a software engineer handle that :)
I'm used to `pip install -r requirements.txt` for getting libraries installed locally from modules or repos, so I'll include a requirements.txt in the PR soon. But again, I don't know the most efficient way to mix JS and Python in one repo, so this should be revisited in the future. Maybe we can open a feature enhancement issue for it?
Yeah, I'd say for now just adding the requirements.txt in src/python should be fine, with no overlap 👍
Added it to the top of the directory - I don't ever see it in src/ folders
I think it's better in src/python for organization. I think pipenv has replaced requirements.txt, but we can handle that in a future issue/PR.
A basic readme on a general overview of what this does and how to run it
Not sure why tests are failing, but this is an issue with Travis or somewhere else - not with this PR. So I'm personally fine merging this if we get another approval.
All of my suggestions are probably unnecessary for this PR. If the script works, we can merge this. It would do well with a big ole refactor in the future, though.
I was never able to run this locally because I struggled to figure out the credentials. Maybe I didn't read that Medium article closely enough, idk. Some more documentation on getting those credentials would be nice, though.
```python
#####################
# PULLING CORONARTACKER CONTENT TRANSLATIONS FROM GOOGLE SHEETS TO GITHUB REPOSITORY
```
Suggested change:
```diff
- # PULLING CORONARTACKER CONTENT TRANSLATIONS FROM GOOGLE SHEETS TO GITHUB REPOSITORY
+ # PULLING CORONATRACKER CONTENT TRANSLATIONS FROM GOOGLE SHEETS TO GITHUB REPOSITORY
```
```python
# converts value to camelCase
camelCase = value.split()[0].lower() + " ".join(value.split()[1:]).title().replace(" ", "")
# removes punctuation
return camelCase.translate(str.maketrans('', '', punctuation))
```
For consistency, let's pick one depunctuation method, either `translate(str.maketrans('', '', punctuation))` or `depunctuate`, and use it everywhere depunctuation is done.
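One way to unify them would be a single `depunctuate` helper that `convert_to_camelCase` calls internally. A sketch only (the helper body and the reconstructed `convert_to_camelCase` are assumptions, not the script's actual code):

```python
from string import punctuation

def depunctuate(text: str) -> str:
    # Strip all ASCII punctuation characters in one pass.
    return text.translate(str.maketrans('', '', punctuation))

def convert_to_camelCase(value: str) -> str:
    # camelCase the words, then remove punctuation via the shared helper.
    words = value.split()
    camel = words[0].lower() + "".join(w.title() for w in words[1:])
    return depunctuate(camel)
```

Then any other spot that depunctuates can call `depunctuate` instead of repeating the `translate` incantation.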
```python
if len(verbs) == 0 and len(nouns) == 2:
    lst_new.append(nouns[0].lower() + nouns[1].title())
elif len(verbs) == 1 and len(nouns) == 1:
    lst_new.append(nouns[0].lower() + verbs[0].title())
else:
    lst_new.append(nouns[0].lower() + nouns[1].title() + verbs[0].title())
```
This is a bit fragile, but if it works with the current sheets it should be fine. We can fix it later
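If it ever needs hardening, one option is to stop assuming exact noun/verb counts and fall back to joining whatever tokens exist. A hypothetical sketch (`nouns` and `verbs` are the lists from the snippet above; the helper itself is made up):

```python
def build_key(nouns, verbs):
    # Combine whatever tokens we got instead of assuming exact counts,
    # so unexpected tagger output can't raise an IndexError.
    tokens = nouns + verbs
    if not tokens:
        return ""
    return tokens[0].lower() + "".join(t.title() for t in tokens[1:])
```

This covers the zero-token and many-token cases the current if/elif/else would crash on.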
```python
for j, _ in enumerate(arr):
    row_new = row.copy()
    row_new[v_col] = arr[j].replace('\n', '')
    row_new[v_col] = convert_to_camelCase(row_new[v_col]).replace(" ", "")
```
Suggested change:
```diff
- row_new[v_col] = convert_to_camelCase(row_new[v_col]).replace(" ","")
+ row_new[v_col] = convert_to_camelCase(row_new[v_col])
```
`convert_to_camelCase` does the replace, I believe.
```python
    df.value = education_value_cleaner(df)
elif 'Survey' in wk.title:
    osLanguage_df = df
    df = survey_value_cleaner(df)
```
It would be more consistent if `education_value_cleaner` and `survey_value_cleaner` returned the same type.
```python
    parentKeyDict[parentKey] = childKeyDict
wkDict.update(parentKeyDict)

save_to_JSON(OUT_DIR, locale, wkDict)
```
Can we call `save_to_JSON` once per language instead of once per `wk`?
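One possible shape for that, sketched with assumed names (`save_to_JSON`'s signature is taken from the call above; the driver function and its file layout are hypothetical): accumulate each worksheet's dict into a per-locale dict and write once after the loop:

```python
import json
import os

def save_to_JSON(out_dir, locale, data):
    # Write the accumulated translations for one locale to a single file.
    os.makedirs(os.path.join(out_dir, locale), exist_ok=True)
    path = os.path.join(out_dir, locale, "translation.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

def export_locale(out_dir, locale, worksheet_dicts):
    # Merge every worksheet's dict first, then save once per locale.
    localeDict = {}
    for wkDict in worksheet_dicts:
        localeDict.update(wkDict)
    save_to_JSON(out_dir, locale, localeDict)
```

That would cut the file writes from one per worksheet to one per language.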
You're totally right @acthelemann, these are great suggestions and should be made. I can make another issue for this as a feature enhancement. I think we need to move to the next step of working with these translations so the front-end devs can try to use them. Getting feedback should be prioritized, since we'll most likely need to change functions/methods based on it as well.
All changes need to be discussed before proceeding. Failure to do so may result in the pull request being rejected.
Before submitting a pull request, please be sure to review:
Please include the issue number the pull request fixes by replacing YOUR-ISSUE-HERE in the text below.
Fixes #YOUR-ISSUE-HERE
Summary
This is in reference to #665
This PR submits a Python script for the MVP which pulls the gsheet content for translations and formats it accurately.
The translation.json files are added to the client/[language]/ folders.
Details
Test Plan (required)
Final Checklist
Did you bump the version in package.json? 1.0.9 => 1.0.10
Did you use yarn, not npm? (important!)
Did you run yarn lint on the code?