
added nlp tokenizer to python gsheet pull script #691

Merged
merged 9 commits into master on May 11, 2020

Conversation

Contributor

@ngiangre ngiangre commented May 6, 2020

⚠️ IMPORTANT: Please do not create a Pull Request without creating an Issue first.

All changes need to be discussed before proceeding. Failure to do so may result in the pull request being rejected.

Before submitting a pull request, please be sure to review:


Please include the issue number the pull request fixes by replacing YOUR-ISSUE-HERE in the text below.

Fixes #YOUR-ISSUE-HERE

Summary

This is in reference to #665

This PR adds a Python script for the MVP which pulls the gsheet content for translations and formats it appropriately.

The translation.json files are added to the client/[language]/ folders.
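For orientation, the general flow is: authorize against Google Sheets with service-account credentials, pull each worksheet's rows, reshape them into nested dictionaries, and write one translation.json per language folder. Below is a minimal, hypothetical sketch of that pattern (the constant names, helper names, and paths are assumptions, not the script's actual code):

import json
import os

import gspread
from oauth2client.service_account import ServiceAccountCredentials

# Hypothetical constants - the real script configures its own scopes and paths.
SCOPE = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
CREDS_FILE = 'credentials.json'   # service-account key, kept out of the repo
OUT_DIR = 'client'                # translation.json files land in client/[language]/

def open_translation_spreadsheet(title):
    """Authorize with a service account and open the translations spreadsheet."""
    creds = ServiceAccountCredentials.from_json_keyfile_name(CREDS_FILE, SCOPE)
    return gspread.authorize(creds).open(title)

def save_to_json(out_dir, locale, translations):
    """Write one translation.json into the folder for the given language."""
    folder = os.path.join(out_dir, locale)
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, 'translation.json'), 'w', encoding='utf-8') as fh:
        json.dump(translations, fh, ensure_ascii=False, indent=2)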

Details

Test Plan (required)

Final Checklist

  • For CoronaTracker, did you bump the version in package.json?
    • Update the second decimal for a major change
    • Update the third decimal for a minor change
    • Numbers can go past 9, e.g. 1.0.9 => 1.0.10
    • For more info, read about Semantic Versioning
  • Did you add any new tests as necessary?
  • Is your PR rebased off the most current master?
  • Have you squashed all commits? (can be done at merge)
  • Did you use yarn not npm? (important!)
  • Did you use Material-UI wherever possible?
  • Did you run yarn lint on the code?
  • Did you run all of your most recent changes locally to make sure everything is working?

Member
@SomeMoosery SomeMoosery left a comment

Looks good, although like I said I'm by no means a Python expert.

A couple of quick comments/questions, but nothing that definitely requires a change 👍

src/python/pull_gsheets_translations_mvp.py (outdated; resolved)
Comment on lines 119 to 132
for parentKey, pgrp in language_df.groupby(['parentKey']):
    childKeyDict = {}
    for childKey, cgrp in pgrp.groupby(['childKey']):
        fieldKeyDict = {}
        for fieldKey, fgrp in cgrp.groupby(['fieldKey']):
            fgrp_sub = fgrp.filter(regex='[Vv]alue')
            valueDict = {"array": []}
            for value, translatedValue in fgrp_sub.values:
                valueDict[value] = translatedValue
                valueDict["array"].append(translatedValue)
            fieldKeyDict[fieldKey] = valueDict
        childKeyDict[childKey] = fieldKeyDict
    parentKeyDict[parentKey] = childKeyDict
wkDict.update(parentKeyDict)
Member
@SomeMoosery SomeMoosery May 6, 2020

Is there no python/pandas/numpy function that would allow us to avoid this structure? If not, it's probably fine for the time being, but this many nested for loops can lead to a nasty runtime complexity like O(n^4) if I'm not mistaken, so it just stood out to me.

Contributor Author

Thanks for the reference!

I’m going to look into the style for imports in a script and then do another commit.

Right now the nested loops are needed. I was using pandas' to_json, but it isn't specific enough to produce both the array and value keys. I think we could write more effective list comprehensions, but right now complexity shouldn't be a concern since there are so few rows in the gsheets. In the future, though, we definitely need better checks against potential runtime blow-ups.
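For illustration, here is a rough, untested sketch of what a comprehension-based version of the snippet above could look like (same column names; the output should match apart from key ordering):

def build_value_dict(fgrp):
    # same "value -> translatedValue" mapping plus the "array" key as in the loop version
    fgrp_sub = fgrp.filter(regex='[Vv]alue')
    value_dict = {value: translated for value, translated in fgrp_sub.values}
    value_dict["array"] = [translated for _, translated in fgrp_sub.values]
    return value_dict

parentKeyDict = {
    parentKey: {
        childKey: {
            fieldKey: build_value_dict(fgrp)
            for fieldKey, fgrp in cgrp.groupby('fieldKey')
        }
        for childKey, cgrp in pgrp.groupby('childKey')
    }
    for parentKey, pgrp in language_df.groupby('parentKey')
}
wkDict.update(parentKeyDict)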

Also, is there a requirements.txt or something for declaring the Python library dependencies?

Member

Fair enough - I figured this might be unavoidable for now, and even so it won't make much of a difference, since O(n^2) vs O(n^4) is likely a matter of milliseconds when we're not dealing with thousands or hundreds of thousands of rows lol

And if you're asking whether we can add a requirements.txt, I think we can just put it right in the src/python directory, right? I honestly have very limited experience with Python and have never used it in a professional setting. Feel free to add a requirements.txt to this PR, or if we want to do that in a separate PR, at least add the required libs to the README so contributors can get up and running - good catch

Contributor Author

No, great point - for scalability we need to reduce the complexity. But I'm not the best at doing that, so I'll let a software engineer handle it :)

I'm used to 'pip install -r requirements.txt' for getting libraries installed locally from modules or repos, so I'll include a requirements.txt in the PR soon. But again, I don't know the best way to mix JS and Python in one repo, so this should be revisited in the future. Maybe we can make that a feature enhancement issue?

Member

Yeah I'd say for now just adding the requirements.txt in src/python should be fine with no overlap 👍

Contributor Author

added it to the top of the directory - I don't ever see it in src/ folders

Contributor

I think it's better in src/python for organization

I think pipenv has replaced requirements.txt, but we can handle that in a future issue/PR
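For reference, a hypothetical sketch of what src/python/requirements.txt might list - the actual entries should come from the script's real imports (these four are guesses based on the discussion, not confirmed):

gspread
oauth2client
pandas
nltk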

SomeMoosery
SomeMoosery previously approved these changes May 7, 2020
pavel-ilin
pavel-ilin previously approved these changes May 7, 2020
@acthelemann
Contributor

A basic README in src/python giving a general overview of what this does and how to run it would be helpful

@ngiangre ngiangre dismissed stale reviews from pavel-ilin and SomeMoosery via c70199d May 10, 2020 20:07
Member
@SomeMoosery SomeMoosery left a comment

Not sure why tests are failing, but this is an issue with Travis or somewhere else - not with this PR. So I'm personally fine merging this if we get another approval.

Contributor
@acthelemann acthelemann left a comment

All of my suggestions are probably unnecessary for this PR. If the script works, we can merge this. This would do well with a big ole refactor in the future.

I was never able to run this locally because I struggled with figuring out the credentials. Maybe I didn't read that Medium article closely enough, idk. Some more documentation on getting those credentials would be nice though.

@@ -0,0 +1,299 @@
#####################
# PULLING CORONARTACKER CONTENT TRANSLATIONS FROM GOOGLE SHEETS TO GITHUB REPOSITORY
Contributor

Suggested change
# PULLING CORONARTACKER CONTENT TRANSLATIONS FROM GOOGLE SHEETS TO GITHUB REPOSITORY
# PULLING CORONATRACKER CONTENT TRANSLATIONS FROM GOOGLE SHEETS TO GITHUB REPOSITORY

# converts value to camelCase
camelCase = value.split()[0].lower() + " ".join(value.split()[1:]).title().replace(" ", "")
# removes punctuation
return camelCase.translate(str.maketrans('', '', punctuation))
Contributor

For consistency, let's pick one depunctuation method, translate(str.maketrans('', '', punctuation)) or depunctuate, and use it everywhere depunctuation is done
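For example, the consolidation could be a single shared helper that every call site uses (a sketch; the name and module placement are assumptions):

from string import punctuation

def depunctuate(text):
    """Strip all punctuation characters from text."""
    return text.translate(str.maketrans('', '', punctuation))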

Comment on lines +102 to +107
if len(verbs) == 0 and len(nouns) == 2:
    lst_new.append(nouns[0].lower() + nouns[1].title())
elif len(verbs) == 1 and len(nouns) == 1:
    lst_new.append(nouns[0].lower() + verbs[0].title())
else:
    lst_new.append(nouns[0].lower() + nouns[1].title() + verbs[0].title())
Contributor

This is a bit fragile, but if it works with the current sheets it should be fine. We can fix it later
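As a possible future fix, an untested sketch of a less position-dependent fallback, assuming nouns and verbs are lists of strings as in the snippet above:

tokens = nouns + verbs
if tokens:
    # lowercase the first token and TitleCase the rest, however many the tagger found
    lst_new.append(tokens[0].lower() + "".join(t.title() for t in tokens[1:]))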

for j, _ in enumerate(arr):
    row_new = row.copy()
    row_new[v_col] = arr[j].replace('\n', '')
    row_new[v_col] = convert_to_camelCase(row_new[v_col]).replace(" ", "")
Contributor

Suggested change
row_new[v_col] = convert_to_camelCase(row_new[v_col]).replace(" ","")
row_new[v_col] = convert_to_camelCase(row_new[v_col])

convert_to_camelCase does the replace I believe

Comment on lines +167 to +170
    df.value = education_value_cleaner(df)
elif 'Survey' in wk.title:
    osLanguage_df = df
    df = survey_value_cleaner(df)
Contributor

It would be more consistent if education_value_cleaner and survey_value_cleaner returned the same type

    parentKeyDict[parentKey] = childKeyDict
wkDict.update(parentKeyDict)

save_to_JSON(OUT_DIR, locale, wkDict)
Contributor

Can we call save_to_JSON once per language instead of once per wk?
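A rough, untested sketch of that idea (build_worksheet_dict is a hypothetical stand-in for the per-worksheet logic above; save_to_JSON and OUT_DIR are the script's own names):

def write_once_per_language(worksheets, out_dir):
    # accumulate every worksheet's keys into one dict per locale
    locale_dicts = {}
    for wk in worksheets:
        locale, wk_dict = build_worksheet_dict(wk)   # hypothetical helper
        locale_dicts.setdefault(locale, {}).update(wk_dict)
    # then write each translation.json exactly once per language
    for locale, translations in locale_dicts.items():
        save_to_JSON(out_dir, locale, translations)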

@ngiangre
Contributor Author

You're totally right @acthelemann - these are great suggestions and should be made. I can make another issue for this as a feature enhancement. I think we need to move on to the next step of working with these translations so the front-end devs can try to use them. Getting feedback should be prioritized, since we'll most likely need to change functions/methods based on that as well.

@ngiangre ngiangre merged commit 60b1e2e into COVID-19-electronic-health-system:master May 11, 2020