Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New, exciting stuff! #5

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

New, exciting stuff! #5

wants to merge 4 commits into from

Conversation

eepp
Copy link
Contributor

@eepp eepp commented Apr 3, 2019

See commit messages for more details.

As a reference, once you have an UDA directory PDF file, this is how I extracted the fields and converted them to JSON:

import sys
import fitz
import pprint
import collections
import json


def _round(val):
    return round(val, 3)


def _go(path):
    Artisse = collections.namedtuple('Artisse',
                                     ['name', 'surname', 'cat'])

    # open PDF
    doc = fitz.open(path)

    # find column positions
    extract = doc[0].getText('dict')
    lines = extract['blocks'][2]['lines']
    surname_x = _round(lines[0]['bbox'][0])
    name_x = _round(lines[1]['bbox'][0])
    cat_x = _round(lines[3]['bbox'][0])

    # find artisses
    infos = {}

    for page in doc:
        page_infos = {}
        infos[page.number] = page_infos
        extract = page.getText('dict')

        for block in extract['blocks']:
            if 'lines' not in block:
                continue

            name = None
            surname = None
            cat = None
            lines = block['lines']

            for line in lines:
                bbox = line['bbox']
                x = _round(bbox[0])
                y = _round(bbox[1])
                text = line['spans'][0]['text'].strip()

                if x == name_x:
                    name = text
                elif x == surname_x:
                    surname = text
                elif x == cat_x:
                    cat = text
                else:
                    continue

            if not name or not surname or not cat:
                continue

            page_infos[y] = Artisse(name, surname, cat)

    # keep unique artisses which are true members
    s = set()

    for page_infos in infos.values():
        for entry in page_infos.values():
            if entry.cat.lower().startswith('memb'):
                s.add(entry)

    # convert to array and sort by surname
    a = []

    for entry in s:
        a.append({
            'name': entry.name,
            'surname': entry.surname,
        })


    a.sort(key=lambda e: (e['surname'], e['name']))

    # output as JSON
    print(json.dumps(a, ensure_ascii=False, indent=2))


if __name__ == '__main__':
    _go(sys.argv[1])

eepp added 3 commits April 3, 2019 04:54
This patch adds a new command, `qng-uda`, which is the same as `qng`,
but its sources are names extracted from the UDA member directory.

You can choose the gender with --gender just like with `qng`.

You can choose the membership category with --type (actor/actress, host,
or singer).

Signed-off-by: Philippe Proulx <[email protected]>
Before this patch, `qng` with --snake-case can generate names such as
`marie-chantale_tremblay`. This is not a valid C identifier and I
believe the --snake-case option's purpose is to generate variable names
for programming languages.

Replace anything not matching `[^a-zA-Z0-9_]` with `_`.

Signed-off-by: Philippe Proulx <[email protected]>
This patch adds support for kebab case output with the --kebab-case
option.

Signed-off-by: Philippe Proulx <[email protected]>
@abusque
Copy link
Owner

abusque commented Apr 3, 2019

Nice, thanks! I'll do a proper review later and merge/release shortly.

Copy link
Owner

@abusque abusque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, thanks! I made a few suggestions/comments, nothing major though.

I'd also like to mention qng-uda in the docs if possible.

qng/cli_qng.py Outdated
import os
import random
import sys
import unicodedata
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused import.

@@ -0,0 +1,75 @@
import argparse
import json
import operator
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused import.

import os
import random
import sys
import unicodedata
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused import.

import random
import sys
import unicodedata
from typing import Optional
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused import.

qng/cli_qng.py Outdated
from typing import Optional

import qng
from qng import common
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general rule of thumb I prefer doing e.g. import qng.common over from qng import common to preserve namespacing. Not a huge deal though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you do from typing import Optional just above 😉.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I loosely follow the Google Python style guide which has an explicit exemption for the typing module. I think the rationale is that otherwise it would make function signatures much too verbose.

qng/common.py Outdated
@@ -0,0 +1,69 @@
import argparse
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused, and same with json, operator, random, typing.Optional, and qng below.

qng/common.py Outdated

def _validate_snake_kebab_args(args):
if args.kebab_case and args.snake_case:
print('Cannot specify --snake-case and --kebab-case.', file=sys.stderr)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for clarity 'Cannot specify both --snake-case and --kebab-case.'

qng/common.py Outdated
)


def _snakebabify_name(name: str, sep: str) -> str:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10/10 néologisme.

qng/common.py Outdated
def _snakebabify_name(name: str, sep: str) -> str:
name = _strip_diacritics(name)
name = name.lower()
name = re.sub(r'[^a-zA-Z0-9_]', sep, name)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for the record, regarding your commit message on this, the original intent behind the --snake-case wasn't directly to produce valid C identifiers, since my intended use case was for Docker container names, where something like marie-chantale_tremblay is perfectly valid, and perhaps even more legible than marie_chantale_tremblay.

I think I'd your snake case changes as-is so there's the option of producing valid C idents, but then reintroduce the old format as --docker-case or some other name perhaps for my own use case.

qng/common.py Outdated
_DATA_DIR = os.path.join(_BASE_DIR, 'data')


def _validate_snake_kebab_args(args):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure where else to put this in the file, but personally to denote a module that's meant to be used across the package but isn't part of the public API, I'd call the file _common.py and lose the underscore prefix on the function names, unless they're only used in the _common module itself.

Just a stylistic preference though, but that way I can distinguish what's internal to _common vs. what's internal to the package but accessible outside _common.

Signed-off-by: Philippe Proulx <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants