Guesswork: File naming for attributes #686

FrankBrandel · 2020-07-04T22:17:47Z

The consumer seems to have an issue detecting values from the file name when a certain pattern occurs.

Using the example from the documentation:
1. 20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf
   Works great, as expected.
2. Now try this:
   20150314Z - Some Company Name - Invoice.pdf
   In my instance it detects the date correctly, but takes "Some Company Name" as the title (should be the correspondent) and "Invoice" as a tag (should be the title).
3. The following pattern works, though, filling in the correspondent and the title ("Invoice whenever") correctly, with no tags:
   20150314Z - Some Company Name - Invoice whenever.pdf
4. Without the date there is also no problem:
   Some Company Name - Invoice.pdf

My setup does not use PAPERLESS_FILENAME_DATE_ORDER or PAPERLESS_FILENAME_PARSE_TRANSFORMS.

Based on the documentation and my own reasoning, I don't see why the second file name (no. 2) should not work.

Can somebody replicate this?

Tooa · 2020-07-05T07:51:25Z

Thanks @FrankBrandel for reporting.

Analysis of the problem

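The filename guesswork is defined by the following regular expressions, which are tried in order:

import re
from collections import OrderedDict

formats = "pdf|jpe?g|png|gif|tiff?|te?xt|md|csv"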
REGEXES = OrderedDict([
    ("created-correspondent-title-tags", re.compile(
        r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
        r"(?P<correspondent>.*) - "
        r"(?P<title>.*) - "
        r"(?P<tags>[a-z0-9\-,]*)"
        r"\.(?P<extension>{})$".format(formats),
        flags=re.IGNORECASE
    )),
    ("created-title-tags", re.compile(
        r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
        r"(?P<title>.*) - "
        r"(?P<tags>[a-z0-9\-,]*)"
        r"\.(?P<extension>{})$".format(formats),
        flags=re.IGNORECASE
    )),
    ("created-correspondent-title", re.compile(
        r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
        r"(?P<correspondent>.*) - "
        r"(?P<title>.*)"
        r"\.(?P<extension>{})$".format(formats),
        flags=re.IGNORECASE
    )),
    ("created-title", re.compile(
        r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
        r"(?P<title>.*)"
        r"\.(?P<extension>{})$".format(formats),
        flags=re.IGNORECASE
    )),
    ("correspondent-title-tags", re.compile(
        r"(?P<correspondent>.*) - "
        r"(?P<title>.*) - "
        r"(?P<tags>[a-z0-9\-,]*)"
        r"\.(?P<extension>{})$".format(formats),
        flags=re.IGNORECASE
    )),
    ("correspondent-title", re.compile(
        r"(?P<correspondent>.*) - "
        r"(?P<title>.*)?"
        r"\.(?P<extension>{})$".format(formats),
        flags=re.IGNORECASE
    )),
    ("title", re.compile(
        r"(?P<title>.*)"
        r"\.(?P<extension>{})$".format(formats),
        flags=re.IGNORECASE
    ))
])


# Parse filename components. Every matching pattern is printed in order,
# so the first key printed is the first pattern that matches.
for key, regex in REGEXES.items():
    # First match here: created-title-tags
    m = regex.match("20150314Z - Some Company Name - Invoice.pdf")
    # First match here: created-correspondent-title
    # m = regex.match("20150314Z - Some Company Name - Invoice whenever.pdf")
    if m:
        properties = m.groupdict()
        print(key)
        print(properties)

The first pattern that matches 20150314Z - Some Company Name - Invoice.pdf is created-title-tags: because case is ignored, Invoice matches the tags pattern r"(?P<tags>[a-z0-9\-,]*)" and is therefore recognized as a tag.
Spaces are not allowed in the tags pattern, so "20150314Z - Some Company Name - Invoice whenever.pdf" falls through to created-correspondent-title, as you pointed out.
According to the documentation, Date - Correspondent - Title.pdf should work (ref "The tags are optional, so the format Date - Correspondent - Title.pdf works as well.").
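To make the cause easier to see, here is a minimal sketch that tests the tags character class on its own, outside the full patterns. The helper name is_valid_tags is made up for this illustration and is not part of the project code.

```python
import re

# The tags character class from the patterns above, tested in isolation.
TAGS = r"[a-z0-9\-,]*"


def is_valid_tags(text, flags=0):
    """Return True if the whole text is accepted by the tags character class."""
    return re.fullmatch(TAGS, text, flags) is not None


# With IGNORECASE, a capitalized single word is accepted as a tag list.
print(is_valid_tags("Invoice", re.IGNORECASE))
# Without the flag, the uppercase "I" is rejected.
print(is_valid_tags("Invoice"))
# A space is never part of the class, so this cannot be a tag list.
print(is_valid_tags("Invoice whenever", re.IGNORECASE))
# The documented lowercase, comma-separated form is accepted either way.
print(is_valid_tags("money,invoices"))
```

So any single-word title made only of letters, digits, hyphens, or commas is indistinguishable from a tag list as long as case is ignored.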

Possible Solutions

I'm not sure how the mechanism is intended to work. Can we assume that tags are always lowercase? If so, we simply need to remove the re.IGNORECASE flag from the tag-related regex patterns. We cannot just reorder the patterns, because then "20150314Z - title - tag1,tag2.pdf" would match created-correspondent-title.
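One way this could look, as a rough sketch rather than a tested patch: keep re.IGNORECASE on the pattern as a whole (so that an extension like .PDF still matches) and switch it off for the tags group only, using a scoped inline flag (?-i:...), which needs Python 3.6 or later. Only the two patterns relevant to this report are reproduced; the names PATTERNS and first_match are made up for the example.

```python
import re
from collections import OrderedDict

FORMATS = "pdf|jpe?g|png|gif|tiff?|te?xt|md|csv"
CREATED = r"^(?P<created>\d{8}(\d{6})?Z) - "
# (?-i:...) turns IGNORECASE off inside the tags group only (Python 3.6+).
TAGS = r"(?P<tags>(?-i:[a-z0-9\-,]*))"
EXTENSION = r"\.(?P<extension>{})$".format(FORMATS)

# Same order as in the original list: the tags pattern is tried first.
PATTERNS = OrderedDict([
    ("created-title-tags", re.compile(
        CREATED + r"(?P<title>.*) - " + TAGS + EXTENSION,
        flags=re.IGNORECASE
    )),
    ("created-correspondent-title", re.compile(
        CREATED + r"(?P<correspondent>.*) - (?P<title>.*)" + EXTENSION,
        flags=re.IGNORECASE
    )),
])


def first_match(filename):
    """Return (key, groupdict) for the first pattern that matches."""
    for key, regex in PATTERNS.items():
        m = regex.match(filename)
        if m:
            return key, m.groupdict()
    return None, {}


print(first_match("20150314Z - Some Company Name - Invoice.pdf"))
print(first_match("20150314Z - title - tag1,tag2.pdf"))
```

With this change the first file name should fall through to created-correspondent-title, while the lowercase tag list is still picked up by created-title-tags. The trade-off is that an all-lowercase, single-word title such as "invoice" would still be read as a tag, so this only helps if tags are lowercase and titles usually are not.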

What do you guys think @pitkley @bauerj @MasterofJOKers?
