
added nlp tokenizer to python gsheet pull script #691

Merged
merged 9 commits into master on May 11, 2020

Conversation

Contributor

@ngiangre ngiangre commented May 6, 2020

⚠️ IMPORTANT: Please do not create a Pull Request without creating an Issue first.

All changes need to be discussed before proceeding. Failure to do so may result in the pull request being rejected.

Before submitting a pull request, please be sure to review:


Please include the issue number the pull request fixes by replacing YOUR-ISSUE-HERE in the text below.

Fixes #YOUR-ISSUE-HERE

Summary

This is in reference to #665

This PR adds a Python script for the MVP which pulls the gsheet content for translations and formats it appropriately.

The translation.json files are added to the client/[language]/ folders.
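For orientation, the general flow is: authorize against Google Sheets with service-account credentials, pull each worksheet's rows, reshape them into nested dictionaries, and write one translation.json per language folder. Below is a minimal, hypothetical sketch of that pattern (the constant names, helper names, and paths are assumptions, not the script's actual code):

import json
import os

import gspread
from oauth2client.service_account import ServiceAccountCredentials

# Hypothetical constants - the real script configures its own scopes and paths.
SCOPE = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
CREDS_FILE = 'credentials.json'   # service-account key, kept out of the repo
OUT_DIR = 'client'                # translation.json files land in client/[language]/

def open_translation_spreadsheet(title):
    """Authorize with a service account and open the translations spreadsheet."""
    creds = ServiceAccountCredentials.from_json_keyfile_name(CREDS_FILE, SCOPE)
    return gspread.authorize(creds).open(title)

def save_to_json(out_dir, locale, translations):
    """Write one translation.json into the folder for the given language."""
    folder = os.path.join(out_dir, locale)
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, 'translation.json'), 'w', encoding='utf-8') as fh:
        json.dump(translations, fh, ensure_ascii=False, indent=2)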

Details

Test Plan (required)

Final Checklist

  • For CoronaTracker, did you bump the version in package.json?
    • Update the second decimal for a major change
    • Update the third decimal for a minor change
    • Numbers can go past 9, e.g. 1.0.9 => 1.0.10
    • For more info, read about Semantic Versioning
  • Did you add any new tests as necessary?
  • Is your PR rebased off the most current master?
  • Have you squashed all commits? (can be done at merge)
  • Did you use yarn not npm? (important!)
  • Did you use Material-UI wherever possible?
  • Did you run yarn lint on the code?
  • Did you run all of your most recent changes locally to make sure everything is working?

Member
@SomeMoosery SomeMoosery left a comment

Looks good, although like I said I'm by no means a Python expert.

A couple of quick comments/questions, but nothing that definitely requires a change 👍

src/python/pull_gsheets_translations_mvp.py (outdated; resolved)
Comment on lines 119 to 132
for parentKey, pgrp in language_df.groupby(['parentKey']):
    childKeyDict = {}
    for childKey, cgrp in pgrp.groupby(['childKey']):
        fieldKeyDict = {}
        for fieldKey, fgrp in cgrp.groupby(['fieldKey']):
            fgrp_sub = fgrp.filter(regex='[Vv]alue')
            valueDict = {"array": []}
            for value, translatedValue in fgrp_sub.values:
                valueDict[value] = translatedValue
                valueDict["array"].append(translatedValue)
            fieldKeyDict[fieldKey] = valueDict
        childKeyDict[childKey] = fieldKeyDict
    parentKeyDict[parentKey] = childKeyDict
wkDict.update(parentKeyDict)
Member
@SomeMoosery SomeMoosery May 6, 2020

Is there no python/pandas/numpy function that would allow us to avoid this structure? If not, it's probably fine for the time being, but this many nested for loops can lead to a nasty runtime complexity like O(n^4) if I'm not mistaken, so it just stood out to me.

Contributor Author

Thanks for the reference!

I’m going to look into the style for imports in a script and then do another commit.

Right now the nested loops are needed. I was using pandas' to_json, but it isn't specific enough to produce both the array and value keys. I think we could write more effective list comprehensions, but right now complexity shouldn't be a concern since there are so few rows in the gsheets. In the future, though, we definitely need better checks against potential runtime blow-ups.
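For illustration, here is a rough, untested sketch of what a comprehension-based version of the snippet above could look like (same column names; the output should match apart from key ordering):

def build_value_dict(fgrp):
    # same "value -> translatedValue" mapping plus the "array" key as in the loop version
    fgrp_sub = fgrp.filter(regex='[Vv]alue')
    value_dict = {value: translated for value, translated in fgrp_sub.values}
    value_dict["array"] = [translated for _, translated in fgrp_sub.values]
    return value_dict

parentKeyDict = {
    parentKey: {
        childKey: {
            fieldKey: build_value_dict(fgrp)
            for fieldKey, fgrp in cgrp.groupby('fieldKey')
        }
        for childKey, cgrp in pgrp.groupby('childKey')
    }
    for parentKey, pgrp in language_df.groupby('parentKey')
}
wkDict.update(parentKeyDict)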

Also, is there a requirements.txt or something for declaring the Python library dependencies?

Member

Fair enough - I figured this might be unavoidable for now, and even so it won't make much of a difference, since O(n^2) vs O(n^4) is likely a matter of milliseconds when we're not dealing with thousands or hundreds of thousands of rows lol

And if you're asking whether we can add a requirements.txt, I think we can just put it right in the src/python directory, right? I honestly have very limited experience with Python and have never used it in a professional setting. Feel free to add a requirements.txt to this PR, or if we want to do that in a separate PR, at least add the required libs to the README so contributors can get up and running - good catch

Contributor Author

No, great point - for scalability we need to reduce the complexity. But I'm not the best at doing that, so I'll let a software engineer handle it :)

I'm used to 'pip install -r requirements.txt' for getting libraries installed locally from modules or repos, so I'll include a requirements.txt in the PR soon. But again, I don't know the best way to mix JS and Python in one repo, so this should be revisited in the future. Maybe we can make that a feature enhancement issue?

Member

Yeah I'd say for now just adding the requirements.txt in src/python should be fine with no overlap 👍

Contributor Author

added it to the top of the directory - I don't ever see it in src/ folders

Contributor

I think it's better in src/python for organization

I think pipenv has replaced requirements.txt, but we can handle that in a future issue/PR
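For reference, a hypothetical sketch of what src/python/requirements.txt might list - the actual entries should come from the script's real imports (these four are guesses based on the discussion, not confirmed):

gspread
oauth2client
pandas
nltk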

SomeMoosery
SomeMoosery previously approved these changes May 7, 2020
pavel-ilin
pavel-ilin previously approved these changes May 7, 2020
@acthelemann
Contributor

A basic README in src/python giving a general overview of what this does and how to run it would be helpful

@ngiangre ngiangre dismissed stale reviews from pavel-ilin and SomeMoosery via c70199d May 10, 2020 20:07
Member
@SomeMoosery SomeMoosery left a comment

Not sure why tests are failing, but this is an issue with Travis or somewhere else - not with this PR. So I'm personally fine merging this if we get another approval.

Contributor
@acthelemann acthelemann left a comment

All of my suggestions are probably unnecessary for this PR. If the script works, we can merge this. This would do well with a big ole refactor in the future.

I was never able to run this locally because I struggled with figuring out the credentials. Maybe I didn't read that Medium article closely enough, idk. Some more documentation on getting those credentials would be nice though.

@@ -0,0 +1,299 @@
#####################
# PULLING CORONARTACKER CONTENT TRANSLATIONS FROM GOOGLE SHEETS TO GITHUB REPOSITORY
Contributor

Suggested change
# PULLING CORONARTACKER CONTENT TRANSLATIONS FROM GOOGLE SHEETS TO GITHUB REPOSITORY
# PULLING CORONATRACKER CONTENT TRANSLATIONS FROM GOOGLE SHEETS TO GITHUB REPOSITORY

# converts value to camelCase
camelCase = value.split()[0].lower() + " ".join(value.split()[1:]).title().replace(" ", "")
# removes punctuation
return camelCase.translate(str.maketrans('', '', punctuation))
Contributor

For consistency, let's pick one depunctuation method, translate(str.maketrans('', '', punctuation)) or depunctuate, and use it everywhere depunctuation is done
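For example, the consolidation could be a single shared helper that every call site uses (a sketch; the name and module placement are assumptions):

from string import punctuation

def depunctuate(text):
    """Strip all punctuation characters from text."""
    return text.translate(str.maketrans('', '', punctuation))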

Comment on lines +102 to +107
if len(verbs) == 0 and len(nouns) == 2:
    lst_new.append(nouns[0].lower() + nouns[1].title())
elif len(verbs) == 1 and len(nouns) == 1:
    lst_new.append(nouns[0].lower() + verbs[0].title())
else:
    lst_new.append(nouns[0].lower() + nouns[1].title() + verbs[0].title())
Contributor

This is a bit fragile, but if it works with the current sheets it should be fine. We can fix it later
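As a possible future fix, an untested sketch of a less position-dependent fallback, assuming nouns and verbs are lists of strings as in the snippet above:

tokens = nouns + verbs
if tokens:
    # lowercase the first token and TitleCase the rest, however many the tagger found
    lst_new.append(tokens[0].lower() + "".join(t.title() for t in tokens[1:]))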

for j, _ in enumerate(arr):
    row_new = row.copy()
    row_new[v_col] = arr[j].replace('\n', '')
    row_new[v_col] = convert_to_camelCase(row_new[v_col]).replace(" ", "")
Contributor

Suggested change
row_new[v_col] = convert_to_camelCase(row_new[v_col]).replace(" ","")
row_new[v_col] = convert_to_camelCase(row_new[v_col])

convert_to_camelCase does the replace I believe

Comment on lines +167 to +170
    df.value = education_value_cleaner(df)
elif 'Survey' in wk.title:
    osLanguage_df = df
    df = survey_value_cleaner(df)
Contributor

It would be more consistent if education_value_cleaner and survey_value_cleaner returned the same type

    parentKeyDict[parentKey] = childKeyDict
wkDict.update(parentKeyDict)

save_to_JSON(OUT_DIR, locale, wkDict)
Contributor

Can we call save_to_JSON once per language instead of once per wk?
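A rough, untested sketch of that idea (build_worksheet_dict is a hypothetical stand-in for the per-worksheet logic above; save_to_JSON and OUT_DIR are the script's own names):

def write_once_per_language(worksheets, out_dir):
    # accumulate every worksheet's keys into one dict per locale
    locale_dicts = {}
    for wk in worksheets:
        locale, wk_dict = build_worksheet_dict(wk)   # hypothetical helper
        locale_dicts.setdefault(locale, {}).update(wk_dict)
    # then write each translation.json exactly once per language
    for locale, translations in locale_dicts.items():
        save_to_JSON(out_dir, locale, translations)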

@ngiangre
Contributor Author

You're totally right @acthelemann - these are great suggestions and should be made. I can make another issue for this as a feature enhancement. I think we need to move on to the next step of working with these translations so the front-end devs can try to use them. Getting feedback should be prioritized, since we'll most likely need to change functions/methods based on that as well.

@ngiangre ngiangre merged commit 60b1e2e into COVID-19-electronic-health-system:master May 11, 2020