YYYY-MM-DD interpreted as YYYY-DD-MM #1199

Open
lopuhin opened this issue Nov 15, 2023 · 7 comments

@lopuhin
Member

lopuhin commented Nov 15, 2023

YYYY-MM-DD is interpreted as YYYY-DD-MM for Arabic, and apparently also for other languages that prefer DMY order. This looks strange: if the year comes first, shouldn't we ignore DMY / MDY and just use YMD for all locales?

Examples:

>>> dateparser.parse('2023-11-08', languages=['ar'])
datetime.datetime(2023, 8, 11, 0, 0)

>>> dateparser.parse('2023-11-08', languages=['en'], region='GB')
datetime.datetime(2023, 8, 11, 0, 0)

>>> dateparser.parse('2023-11-08', languages=['en'], region='US')
datetime.datetime(2023, 11, 8, 0, 0)

>>> dateparser.parse('2023-11-08', languages=['en'])
datetime.datetime(2023, 11, 8, 0, 0)

Side note: in reality the US uses MDY date order, so if we interpreted en as en-US and it had MDY set, we'd parse a lot more dates incorrectly.

@lopuhin
Member Author

lopuhin commented Nov 15, 2023

#790 by @Gallaecio might be related, but I'm not sure it's enough, because the date we actually get is formatted more unusually, as 2023 - 11 - 08 (with extra spaces).

@lopuhin
Member Author

lopuhin commented Nov 15, 2023

According to Wikipedia, YDM is used in just a few countries: https://en.wikipedia.org/wiki/Calendar_date#Gregorian,_year–day–month_(YDM), but it looks like we're inferring that DMY (very popular) implies YDM (very rare).
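
To make the "year first means YMD" idea concrete, here is a minimal sketch of a pre-check that could run before calling dateparser. The helper name `parse_year_first` and the set of accepted separators are assumptions for illustration, not part of dateparser. It also tolerates the extra spaces mentioned above (2023 - 11 - 08).

```python
import re
from datetime import datetime

# Four-digit year, then month, then day, with optional spaces around the separator.
# The accepted separators (-, /, .) are an assumption for this sketch.
_YEAR_FIRST = re.compile(
    r"^\s*(\d{4})\s*[-/.]\s*(\d{1,2})\s*[-/.]\s*(\d{1,2})\s*$"
)


def parse_year_first(text):
    """Return a datetime if text is a year-first numeric date read as YMD, else None."""
    match = _YEAR_FIRST.match(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:
        # Not a valid YMD date (e.g. month 23), so let the caller decide what to do.
        return None
```

A caller could try this helper first and only fall back to dateparser.parse(...) with the locale settings when it returns None.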

@keikoro

keikoro commented Feb 9, 2024

Came here to confirm this affects German as well, another language which uses DMY for local date formats, but also just found issue #765, which already reported this problem back in 2020...

@keikoro

keikoro commented Feb 9, 2024

Interestingly, when it comes to ISO 8601 dates, DMY-related settings seem to also partially disable the built-in mechanism which swaps date components on impossible combinations... when that mechanism could theoretically "save" 2/3 of dates being misinterpreted (based on numbers > 12).

Examples parsing correctly formatted ISO date "1960-12-23":

>>> dateparser.parse("1960-12-23")  # default
datetime.datetime(1960, 12, 23, 0, 0)

>>> dateparser.parse("1960-12-23", languages=["en"])  # languages set to MDY language
datetime.datetime(1960, 12, 23, 0, 0)

>>> dateparser.parse("1960-12-23", languages=["de"])  # languages set to DMY language
# None (implicit)

>>> dateparser.parse("1960-12-23", languages=["en"], settings={"DATE_ORDER": "DMY"})  # languages set to MDY language, DATE_ORDER set to DMY
# None (implicit)

>>> dateparser.parse("1960-12-23", languages=["de"], settings={"DATE_ORDER": "MDY"})  # languages set to DMY language, DATE_ORDER set to MDY
datetime.datetime(1960, 12, 23, 0, 0)

... The last two examples have the same result even with PREFER_LOCALE_DATE_ORDER set (whether True or False).

Examples parsing jumbled ISO date "1960-23-12":

>>> dateparser.parse("1960-23-12")  # default
datetime.datetime(1960, 12, 23, 0, 0)

>>> dateparser.parse("1960-23-12", languages=["en"])  # languages set to MDY language
datetime.datetime(1960, 12, 23, 0, 0)

>>> dateparser.parse("1960-23-12", languages=["de"])  # languages set to DMY language
datetime.datetime(1960, 12, 23, 0, 0)

>>> dateparser.parse("1960-23-12", settings={"DATE_ORDER": "DMY"})  # DATE_ORDER set to DMY
datetime.datetime(1960, 12, 23, 0, 0)

... The jumbled date is parsed correctly for all possible combinations of the above languages + DATE_ORDER settings, also with and without PREFER_LOCALE_DATE_ORDER set (whether True or False).
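
As a rough check of the "2/3" estimate above: a swap can only rescue a date when the day number is greater than 12, because only then is one of the two readings impossible. The sketch below counts those days for one calendar year; the function name `swap_recoverable_days` is made up for this example.

```python
from datetime import date, timedelta


def swap_recoverable_days(year):
    """Count days in `year` whose day-of-month is > 12, plus the total number of days."""
    current = date(year, 1, 1)
    total = 0
    recoverable = 0
    while current.year == year:
        total += 1
        if current.day > 12:
            # A day number above 12 cannot be a month, so swapping is unambiguous.
            recoverable += 1
        current += timedelta(days=1)
    return recoverable, total


recoverable, total = swap_recoverable_days(2023)
print(recoverable, total, round(recoverable / total, 2))  # 221 365 0.61
```

So the share is closer to 61% than to 2/3, but the point stands: most misread ISO dates could be caught this way.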

@keikoro

keikoro commented Feb 9, 2024

@lopuhin It looks like the problem can be worked around by including the format codes for YYYY-(M)M-(D)D in date_formats in addition to setting your DMY language:

>>> dateparser.parse("2023-11-08", languages=["ar"], date_formats=["%Y-%m-%d"])
datetime.datetime(2023, 11, 8, 0, 0)

Other dates will continue to be interpreted as DMY:

>>> dateparser.parse("8.11.23", languages=["ar"], date_formats=["%Y-%m-%d"])
datetime.datetime(2023, 11, 8, 0, 0)

>>> dateparser.parse("8/11/23", languages=["ar"], date_formats=["%Y-%m-%d"])
datetime.datetime(2023, 11, 8, 0, 0)

Same for other hyphenated dates, e.g. 08-11-23, though it'd probably be wise not to use hyphens at all with explicitly set languages, just to avoid confusion. Alternatively, always require the full year everywhere and/or don't also allow %y (YY) in date_formats.
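
To illustrate why %Y is safer than %y here, the sketch below uses only the standard library's datetime.strptime. It assumes that entries in date_formats are matched the way strptime matches them, which has not been verified against dateparser's internals in this thread; `matches_format` is a hypothetical helper.

```python
from datetime import datetime


def matches_format(text, fmt):
    """Return the parsed datetime if text matches fmt exactly, else None."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None


# %Y requires a four-digit year, so short hyphenated dates do not match it
# and would be left to the locale-based parsing.
print(matches_format("2023-11-08", "%Y-%m-%d"))  # 2023-11-08 00:00:00
print(matches_format("08-11-23", "%Y-%m-%d"))    # None

# %y accepts a two-digit year, so the same string is read as 2008-11-23.
print(matches_format("08-11-23", "%y-%m-%d"))    # 2008-11-23 00:00:00
```

Under that assumption, "%Y-%m-%d" only claims strings that start with a four-digit year, while "%y-%m-%d" would also capture DMY-looking input such as 08-11-23.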

@lopuhin
Member Author

lopuhin commented Feb 19, 2024

Interesting, thanks for the suggestion, @keikoro. I wonder whether passing date_formats=["%Y-%m-%d"] can lead to any unwanted changes in date parsing for this or other languages?

@jchillerup

jchillerup commented Sep 27, 2024

I have a variation of this problem. I'm trying to parse invoice lines from a lot of different subcontractors. Much of it is in some variation of DMY or YMD, but never ever YDM, which I'm currently fighting. Refer to the code example below for some tests.

I was wondering whether it would be easy to amend the DATE_ORDER handling to accept a list of allowed values, or to add another setting such as FORBIDDEN_DATE_ORDER to specifically forbid American-style dates.

The code below requires rich, sorry about that.

import datetime
import dateparser.search
import dateparser.conf
import rich

def extract_dates(sample, debug=True):

    return dateparser.search.search_dates(
        sample,
        languages=['da', 'en'],
        settings={
            'PREFER_LOCALE_DATE_ORDER': True,
            'DATE_ORDER': 'DMY',
            'PREFER_DATES_FROM': 'past',
            'STRICT_PARSING': True,                  # There must be a day, month, year
            'PARSERS': ['absolute-time']
        },
        add_detected_language=True,
    )

tests = [
    ("sdds 07/09/2024  1. kons", datetime.datetime(2024, 9, 7, 0, 0)),
    ("sdds 30. september 2024 første kons", datetime.datetime(2024, 9, 30, 0, 0)),
    ("sdds 30. september 2024  1. kons", datetime.datetime(2024, 9, 30, 0, 0)),
    ("sdds 30. september 2024 sdw 1. kons", datetime.datetime(2024, 9, 30, 0, 0)),
    ("sdds 2024-11-02  1. kons", datetime.datetime(2024, 11, 2, 0, 0)),
    ("sdds 4. kons 3. februar 2023", datetime.datetime(2023, 2, 3, 0, 0)),
    ("sdds  4. kons 2. marts 2023", datetime.datetime(2023, 3, 2, 0, 0)),
    
]

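for sample, correct in tests:
    # search_dates returns None when nothing is found, so guard before indexing.
    matches = extract_dates(sample)
    if not matches:
        rich.print(f"[red]{sample}[/red] -- no date found")
        continue

    # With add_detected_language=True each match is (substring, datetime, language).
    date_part_of_string, result, language = matches[0]
    color = "green" if result == correct else "red"
    rich.print(f"{date_part_of_string} -- [{color}]{sample}[/{color}] -- {result}")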
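
For the "list of allowed DATE_ORDER values" idea, a minimal standalone sketch of the intended behaviour might look like this. It is not how dateparser works today; `parse_with_allowed_orders` and its `allowed_orders` parameter are hypothetical, it only handles purely numeric dates, and it assumes the year is always written with four digits.

```python
import re
from datetime import datetime


def parse_with_allowed_orders(text, allowed_orders=("DMY", "YMD")):
    """Return the distinct dates obtained by reading three numbers in each allowed order."""
    tokens = re.findall(r"\d+", text)
    if len(tokens) != 3:
        return []
    found = []
    for order in allowed_orders:
        fields = dict(zip(order, tokens))  # e.g. {"D": "07", "M": "09", "Y": "2024"}
        if len(fields["Y"]) != 4:
            # Assumption for this sketch: only four-digit years are accepted.
            continue
        try:
            candidate = datetime(int(fields["Y"]), int(fields["M"]), int(fields["D"]))
        except ValueError:
            continue
        if candidate not in found:
            found.append(candidate)
    return found


print(parse_with_allowed_orders("07/09/2024"))  # [datetime.datetime(2024, 9, 7, 0, 0)]
print(parse_with_allowed_orders("2024-11-02"))  # [datetime.datetime(2024, 11, 2, 0, 0)]
```

Because YDM and MDY are simply absent from the allowed list, 2024-11-02 can never come back as 11 February, which is the behaviour the tests above expect.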
