Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Audio transcript review #11

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

frco9
Copy link

@frco9 frco9 commented Jan 15, 2020

This is the work we have done to be able to perform the review of audio files and their transcripts.
There are many things that need to be fixed and refactored before being able to push a PR to the official repo.
Let me know what you think!

@frco9 frco9 requested a review from c-w January 15, 2020 11:12
Copy link

@c-w c-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is a very competent pull request, thanks for all the changes and work! I left some review comments about the code as it was submitted in this pull request.

Now taking a step back. From an open source sustainability point of view, I'm a bit concerned about the scope of the changes: introducing a parallel model to Document (Conversation), introducing a second annotation canvas (conversation.pug), etc. This will increase the maintenance cost for doccano long-term. I wonder if your requirements can instead of achieved by building more on-top of existing abstractions, e.g. see doccano#495 as an example for a simple speech-to-text annotation task with fewer architectural changes. I wonder if your additional requirements (e.g. pretty wavesurfer player) could be integrated to that pull request so that the changes could be merged upstream with less long-term maintenance cost. If there's a desire: I'm happy to work with you on that.

@@ -4,6 +4,9 @@
.DS_Store
.AppleDouble
.LSOverride
env/
*.wav
.env
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the env/ and .env lines can be removed since they should already be git-ignored:

doccano/.gitignore

Lines 114 to 121 in 5942bbf

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

@@ -200,3 +203,4 @@ bundle/
webpack-stats.json

.vscode
local.env
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider moving this to the # Environments block in the git-ignore file to maintain a nice organization of the file.

@@ -0,0 +1,47 @@
# Generated by Django 2.1.11 on 2019-12-18 01:34
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider squashing these migrations (see manage.py squashmigrations) so that we have a single/atomic migration for the new functionality.

@@ -5,6 +5,7 @@
from django.db.models.signals import post_save, pre_delete
from django.urls import reverse
from django.conf import settings
from django.contrib.postgres.fields import JSONField
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that Doccano also supports SQLite, MySQL and SQL Server, so I'd discourage using Postgres-specific features since this will require a work-around for the other databases.

model = ConversationsProject
fields = ('id', 'name', 'description', 'guideline', 'users', 'current_users_role', 'project_type', 'image',
'updated_at', 'randomize_document_order')
read_only_fields = ('image', 'updated_at', 'users', 'current_users_role')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using a pattern like this to prevent drift in the declared fields which can lead to subtle bugs (see doccano/519 as an example bug):

fields = ProjectSerializer.Meta.fields
read_only_fields = ProjectSerializer.Meta.read_only_fields

corrected = self.request.data.get('corrected')
document = get_object_or_404(Document, pk=self.kwargs['doc_id'])
document.text_validated = corrected
if (corrected):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: remove parentheses around if statement. It's not part of PEP8 but for a single statement the parthenteses are redundant so most style guides tend to discourage them.

@@ -125,6 +143,15 @@ def perform_create(self, serializer):
project = get_object_or_404(Project, pk=self.kwargs['project_id'])
serializer.save(project=project)

class ConversationList(generics.ListCreateAPIView):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: add two newlines between classes as per PEP8.

filter_backends = (DjangoFilterBackend, filters.SearchFilter, filters.OrderingFilter)
search_fields = ('text', )
ordering_fields = ('created_at', 'updated_at', 'doc_annotations__updated_at',
'seq_annotations__updated_at', 'seq2seq_annotations__updated_at')
filter_class = DocumentFilter
permission_classes = [IsAuthenticated & IsInProjectReadOnlyOrAdmin]

def get_queryset(self):
def get_queryset(self):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: remove trailing whitespace.

@@ -155,6 +162,7 @@
SOCIAL_AUTH_AZUREAD_TENANT_OAUTH2_KEY = env('OAUTH_AAD_KEY', None)
SOCIAL_AUTH_AZUREAD_TENANT_OAUTH2_SECRET = env('OAUTH_AAD_SECRET', None)
SOCIAL_AUTH_AZUREAD_TENANT_OAUTH2_TENANT_ID = env('OAUTH_AAD_TENANT', None)
SOCIAL_AUTH_AZUREAD_TENANT_OAUTH2_WHITELISTED_DOMAINS = env('OAUTH_AAD_WHITELISTED_DOMAINS', '').split(',')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use can simplify this using env.list.

@@ -123,6 +123,13 @@
if path.isdir(static_path)
]

DEFAULT_FILE_STORAGE = 'storages.backends.azure_storage.AzureStorage'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Django Storages is really neat. I've been meaning to add this for a while but haven't gotten around to it yet, so thanks for doing this!

Currently doccano uses Apache libcloud as a cloud storage abstraction layer to enable data import from Azure Storage, AWS S3, etc. via Django Cloud Browser:

CLOUD_BROWSER_APACHE_LIBCLOUD_PROVIDER = env('CLOUD_BROWSER_LIBCLOUD_PROVIDER', None)
CLOUD_BROWSER_APACHE_LIBCLOUD_ACCOUNT = env('CLOUD_BROWSER_LIBCLOUD_ACCOUNT', None)
CLOUD_BROWSER_APACHE_LIBCLOUD_SECRET_KEY = env('CLOUD_BROWSER_LIBCLOUD_KEY', None)

As such, I would recommend to have this change mirror the same capability to support many cloud storage providers and use the libcloud plugin for Django Storages instead of the direct Azure Storage one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants