feat: Audio transcript review #11

frco9 · 2020-01-15T11:12:17Z

This is the work we have done to be able to perform the review of audio files and their transcripts.
There are many things that need to be fixed and refactored before being able to push a PR to the official repo.
Let me know what you think!

Co-Authored-By: Julien Chomarat <[email protected]>

c-w

Overall this is a very competent pull request, thanks for all the changes and work! I left some review comments about the code as it was submitted in this pull request.

Now taking a step back. From an open source sustainability point of view, I'm a bit concerned about the scope of the changes: introducing a parallel model to Document (Conversation), introducing a second annotation canvas (conversation.pug), etc. This will increase the maintenance cost for doccano long-term. I wonder if your requirements can instead of achieved by building more on-top of existing abstractions, e.g. see doccano#495 as an example for a simple speech-to-text annotation task with fewer architectural changes. I wonder if your additional requirements (e.g. pretty wavesurfer player) could be integrated to that pull request so that the changes could be merged upstream with less long-term maintenance cost. If there's a desire: I'm happy to work with you on that.

c-w · 2020-01-15T16:41:02Z

.gitignore

@@ -4,6 +4,9 @@
 .DS_Store
 .AppleDouble
 .LSOverride
+env/
+*.wav
+.env


I believe the env/ and .env lines can be removed since they should already be git-ignored:

doccano/.gitignore

Lines 114 to 121 in 5942bbf

# Environments

.env

.venv

env/

venv/

ENV/

env.bak/

venv.bak/

c-w · 2020-01-15T16:41:38Z

.gitignore

@@ -200,3 +203,4 @@ bundle/
 webpack-stats.json

 .vscode
+local.env


Consider moving this to the # Environments block in the git-ignore file to maintain a nice organization of the file.

c-w · 2020-01-15T16:43:16Z

app/api/migrations/0002_auto_20191218_0134.py

@@ -0,0 +1,47 @@
+# Generated by Django 2.1.11 on 2019-12-18 01:34


Consider squashing these migrations (see manage.py squashmigrations) so that we have a single/atomic migration for the new functionality.

c-w · 2020-01-15T16:44:13Z

app/api/models.py

@@ -5,6 +5,7 @@
 from django.db.models.signals import post_save, pre_delete
 from django.urls import reverse
 from django.conf import settings
+from django.contrib.postgres.fields import JSONField


Note that Doccano also supports SQLite, MySQL and SQL Server, so I'd discourage using Postgres-specific features since this will require a work-around for the other databases.

c-w · 2020-01-15T16:48:44Z

app/api/serializers.py

+        model = ConversationsProject
+        fields = ('id', 'name', 'description', 'guideline', 'users', 'current_users_role', 'project_type', 'image',
+                  'updated_at', 'randomize_document_order')
+        read_only_fields = ('image', 'updated_at', 'users', 'current_users_role')


Consider using a pattern like this to prevent drift in the declared fields which can lead to subtle bugs (see doccano/519 as an example bug):

doccano/app/api/serializers.py

Lines 116 to 117 in 5c1a5e0

fields = ProjectSerializer.Meta.fields

read_only_fields = ProjectSerializer.Meta.read_only_fields

c-w · 2020-01-15T16:57:48Z

app/api/views.py

+        corrected = self.request.data.get('corrected')
+        document = get_object_or_404(Document, pk=self.kwargs['doc_id'])
+        document.text_validated = corrected
+        if (corrected):


Nit: remove parentheses around if statement. It's not part of PEP8 but for a single statement the parthenteses are redundant so most style guides tend to discourage them.

c-w · 2020-01-15T16:59:02Z

app/api/views.py

@@ -125,6 +143,15 @@ def perform_create(self, serializer):
        project = get_object_or_404(Project, pk=self.kwargs['project_id'])
        serializer.save(project=project)

+class ConversationList(generics.ListCreateAPIView):


Nit: add two newlines between classes as per PEP8.

c-w · 2020-01-15T16:59:17Z

app/api/views.py

    filter_backends = (DjangoFilterBackend, filters.SearchFilter, filters.OrderingFilter)
    search_fields = ('text', )
    ordering_fields = ('created_at', 'updated_at', 'doc_annotations__updated_at',
                       'seq_annotations__updated_at', 'seq2seq_annotations__updated_at')
    filter_class = DocumentFilter
    permission_classes = [IsAuthenticated & IsInProjectReadOnlyOrAdmin]

-    def get_queryset(self):
+    def get_queryset(self):        


Nit: remove trailing whitespace.

c-w · 2020-01-15T17:01:05Z

app/app/settings.py

@@ -155,6 +162,7 @@
 SOCIAL_AUTH_AZUREAD_TENANT_OAUTH2_KEY = env('OAUTH_AAD_KEY', None)
 SOCIAL_AUTH_AZUREAD_TENANT_OAUTH2_SECRET = env('OAUTH_AAD_SECRET', None)
 SOCIAL_AUTH_AZUREAD_TENANT_OAUTH2_TENANT_ID = env('OAUTH_AAD_TENANT', None)
+SOCIAL_AUTH_AZUREAD_TENANT_OAUTH2_WHITELISTED_DOMAINS = env('OAUTH_AAD_WHITELISTED_DOMAINS', '').split(',')


Use can simplify this using env.list.

c-w · 2020-01-15T17:05:00Z

app/app/settings.py

@@ -123,6 +123,13 @@
    if path.isdir(static_path)
 ]

+DEFAULT_FILE_STORAGE = 'storages.backends.azure_storage.AzureStorage'


Django Storages is really neat. I've been meaning to add this for a while but haven't gotten around to it yet, so thanks for doing this!

Currently doccano uses Apache libcloud as a cloud storage abstraction layer to enable data import from Azure Storage, AWS S3, etc. via Django Cloud Browser:

doccano/app/app/settings.py

Lines 65 to 67 in 5942bbf

CLOUD_BROWSER_APACHE_LIBCLOUD_PROVIDER = env('CLOUD_BROWSER_LIBCLOUD_PROVIDER', None)

CLOUD_BROWSER_APACHE_LIBCLOUD_ACCOUNT = env('CLOUD_BROWSER_LIBCLOUD_ACCOUNT', None)

CLOUD_BROWSER_APACHE_LIBCLOUD_SECRET_KEY = env('CLOUD_BROWSER_LIBCLOUD_KEY', None)

As such, I would recommend to have this change mirror the same capability to support many cloud storage providers and use the libcloud plugin for Django Storages instead of the direct Azure Storage one.

frco9 and others added 6 commits January 15, 2020 12:02

chore: Add volume and local secret to docker-compose

6bfe422

feat: Add azure storage to be able to push uploaded files there

ee5c4dd

feat: Update data model to handle

c1a7226

Co-Authored-By: Julien Chomarat <[email protected]>

feat: Add routes and serializer to manage conversation data

a3cca19

Co-Authored-By: Julien Chomarat <[email protected]>

feat: Conversation import logic

ab78fbc

Co-Authored-By: Julien Chomarat <[email protected]>

feat: Update front components to listen to audio and review transcripts

30b0ffc

Co-Authored-By: Julien Chomarat <[email protected]>

frco9 requested a review from c-w January 15, 2020 11:12

c-w suggested changes Jan 15, 2020

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Audio transcript review #11

feat: Audio transcript review #11

frco9 commented Jan 15, 2020

c-w left a comment

c-w Jan 15, 2020

c-w Jan 15, 2020

c-w Jan 15, 2020

c-w Jan 15, 2020

c-w Jan 15, 2020

c-w Jan 15, 2020

c-w Jan 15, 2020

c-w Jan 15, 2020

c-w Jan 15, 2020

c-w Jan 15, 2020

		@@ -0,0 +1,47 @@
		# Generated by Django 2.1.11 on 2019-12-18 01:34

	fields = ProjectSerializer.Meta.fields
	read_only_fields = ProjectSerializer.Meta.read_only_fields

	CLOUD_BROWSER_APACHE_LIBCLOUD_PROVIDER = env('CLOUD_BROWSER_LIBCLOUD_PROVIDER', None)
	CLOUD_BROWSER_APACHE_LIBCLOUD_ACCOUNT = env('CLOUD_BROWSER_LIBCLOUD_ACCOUNT', None)
	CLOUD_BROWSER_APACHE_LIBCLOUD_SECRET_KEY = env('CLOUD_BROWSER_LIBCLOUD_KEY', None)

feat: Audio transcript review #11

Are you sure you want to change the base?

feat: Audio transcript review #11

Conversation

frco9 commented Jan 15, 2020

c-w left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment