Handling Preceding Zeroes #204

Open
firebladed opened this issue Jul 12, 2021 · 5 comments
Labels
bug Something isn't working

Comments

@firebladed

firebladed commented Jul 12, 2021

Describe the bug
Zeroes preceding a non-zero digit are ignored, either at the start of the input or following a pause.

The problem is partly related to the unpredictability of pauses when number sequences are read aloud, as
"0 1 4 6 0 6" is correctly interpreted as [0.0, 1.0, 4.0, 6.0, 0.0, 6.0]
but "01 46 06" incorrectly goes to [1.0, 46.0, 6.0]

To Reproduce
Steps to reproduce the behavior:

  • install lingua_franca
  • open python3
>>> from lingua_franca import load_languages, set_default_lang, parse
>>> from lingua_franca.parse import extract_numbers
>>> load_languages(['en'])

>>> extract_numbers("010 101")
[10.0, 101.0]
>>> extract_numbers("01 010 101")
[1.0, 10.0, 101.0]
>>> extract_numbers("51 21 05")
[51.0, 21.0, 5.0]
>>> extract_numbers("01 46 06")
[1.0, 46.0, 6.0]

Expected behavior

Zeros should be added to the output as separate numbers.

I think a zero preceding a single non-zero digit should be treated as a separate number, either by default or as an option.

e.g.
"0 1" (zero one) -> [0, 1]
"01 46 06" (zero one four six zero six) -> [0, 1, 46, 0, 6]

Additional context
This is problematic when reading code numbers, e.g. TOTP codes,
which could have a zero in any position and can be read in multiple ways:

e.g. 0 1 4 6 0 6 (zero one four six zero six)
34 45 65 (three four four five six five, or thirty four forty five sixty five)
234 567 (two hundred and thirty four five hundred and sixty seven)

One aspect I'm not sure of: should 46 read as "four six" be interpreted as [46] or [4, 6] when it precedes a decimal point (or when there is no decimal point)? After a decimal point it is different, as the "normal" reading is digit by digit, e.g. 0.01475 (zero point zero one four seven five).

However, "46" (forty six) can always be converted to "4 6", whereas missing zeroes cannot be recovered.

@firebladed firebladed added the bug Something isn't working label Jul 12, 2021
@JarbasAl
Collaborator

JarbasAl commented Jul 12, 2021

you want to keep an eye on #150

EDIT: nvm, it's the reverse problem....

@ChanceNCounter
Contributor

Partially misplaced, I think. The format.pronounce_digits() planned in #150 would apparently be a more appropriate function call for the suggested behavior.

However, I'm not sure if it retains leading zeroes at the moment, either, because it uses extract_number() along the way.

The fundamental challenge here is continuing to treat the input as a string while parsing.


Relating this back to the code side, the English number extractors "chunk" numbers as they go based on powers of 10. While parsing a base-10 number left-to-right, whenever you encounter a power of 10, you scan the remainder of the number for larger powers of 10. If you do not find any, you have identified the end of a "place."

"1,075,018" -> 1000000 | 75,000 | 18 -> sum() -> 1075018
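
The chunking idea can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not lingua_franca's actual extractor: the names `SMALL`, `MAGNITUDES` and `chunk_places` are hypothetical, only a handful of English number words are handled, and unknown words such as "and" are silently skipped.

```python
# Simplified illustration only; this is NOT lingua_franca's implementation.
# SMALL, MAGNITUDES and chunk_places are hypothetical names.
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
MAGNITUDES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}


def chunk_places(words):
    """Return the value of each "place" in a spoken English number."""
    chunks = []
    current = 0
    for i, word in enumerate(words):
        if word in SMALL:
            current += SMALL[word]
        elif word == "hundred":
            current = (current or 1) * 100
        elif word in MAGNITUDES:
            magnitude = MAGNITUDES[word]
            # Scan the remainder of the number for a larger power of 10.
            larger_follows = any(
                MAGNITUDES.get(w, 0) > magnitude for w in words[i + 1:]
            )
            current = (current or 1) * magnitude
            if not larger_follows:
                # Nothing larger remains, so this "place" is complete.
                chunks.append(current)
                current = 0
    # Keep the trailing remainder (or a lone zero).
    if current or not chunks:
        chunks.append(current)
    return chunks


places = chunk_places("one million seventy five thousand eighteen".split())
print(places, sum(places))  # [1000000, 75000, 18] 1075018
```

Because each chunk is summed as a value, a leading zero contributes nothing to the total, which is why it disappears from the output.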

@ChanceNCounter
Contributor

I stand corrected. In the current version of the PR, format.pronounce_digits() does indeed preserve leading zeroes:

>>> format.pronounce_digits("014606")
'zero one four six zero six'

@ChanceNCounter
Contributor

On reflection, the "fail" case above is OOS. If the input appears to mean something specific - "46" == 46.0 - LF can't account for whether the program calling its parsers meant to feed it "46".

I vote one of two things:

  1. Add a sugar parameter extract_numbers(..., max_digits=0), where falsy values retain the current behavior
    • Pros: sugar, function signature isn't very long
    • Cons: edge case, needs localization and some non-English extractors already need work
  2. wontfix
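
The thread does not spell out what max_digits would mean. One possible reading, assumed here purely for illustration, is "split each run of digits into groups of at most max_digits characters before converting". The sketch below is a standalone stand-in that only handles numerals already written as digits; `extract_numbers_max_digits` is a hypothetical name and does not exist in lingua_franca.

```python
import re


def extract_numbers_max_digits(text, max_digits=0):
    """Hypothetical sketch of the proposed option; not a lingua_franca API.

    Falsy max_digits keeps the current behavior (each run of digits becomes
    one number). Otherwise each run is cut into groups of at most
    max_digits characters, and each group becomes its own number.
    """
    numbers = []
    for token in re.findall(r"[0-9]+", text):
        if not max_digits:
            numbers.append(float(token))
            continue
        for start in range(0, len(token), max_digits):
            numbers.append(float(token[start:start + max_digits]))
    return numbers


print(extract_numbers_max_digits("01 46 06"))                # [1.0, 46.0, 6.0]
print(extract_numbers_max_digits("01 46 06", max_digits=1))  # [0.0, 1.0, 4.0, 6.0, 0.0, 6.0]
```

Under this reading, only max_digits=1 fully preserves leading zeroes; any larger group size can still swallow a zero at the start of a group.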

@krisgesling
Contributor

Hey @firebladed,

If we're looking at STT output, another option might be something like an extract_digits() method that intentionally pulls out all the digits in a string as individual numbers. I think this will be more straightforward than trying to determine when people meant to have digits expressed together or not.
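
As a rough sketch of what such a method might look like (extract_digits() does not exist in lingua_franca as far as this thread shows; the function and the `DIGIT_WORDS` table below are assumptions for illustration, and only the English words "zero" to "nine" are recognised):

```python
import re

# Hypothetical lookup table; a real implementation would need localization.
DIGIT_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
}


def extract_digits(text):
    """Return every digit in text as its own int, keeping leading zeroes."""
    digits = []
    # Each numeral character is its own token; words are matched whole.
    for token in re.findall(r"[0-9]|[a-z]+", text.lower()):
        if token.isdigit():
            digits.append(int(token))
        elif token in DIGIT_WORDS:
            digits.append(DIGIT_WORDS[token])
        # Anything else (e.g. "forty") is ignored in this sketch.
    return digits


print(extract_digits("01 46 06"))                    # [0, 1, 4, 6, 0, 6]
print(extract_digits("zero one four six zero six"))  # [0, 1, 4, 6, 0, 6]
```

Note that this deliberately ignores compound words such as "forty six", so it only covers the case where the speaker reads the code digit by digit.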

Can anyone think of cases other than codes or phone numbers, where this would come up?

If it won't be supported in the extract_number(s) methods, we probably need to add a note to the docstring that leading zeros will be ignored.

Probably not what you're referring to, but just in case...
If it's something that you know is a number like a TOTP or PIN returned from another system, then I'd suggest that extract_numbers() is probably overkill. For example, if you typecast the string to a list you get your list of digits:

>>> totp = "012345"
>>> list(totp)
['0', '1', '2', '3', '4', '5']

If there might be spaces in the source:

>>> totp = "01 2 3 45"
>>> list(totp.replace(" ",""))
['0', '1', '2', '3', '4', '5']

or if the source may be an int you would need to do something slightly more verbose:

>>> totp = 123456   # note an int cannot have a leading zero
>>> [digit for digit in str(totp)]
['1', '2', '3', '4', '5', '6']

This could possibly act as a workaround for the STT case:

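# Assumes totp is a str; compare as strings so a leading zero is preserved.
extracted_codes = [
    utterance.replace(" ", ""),               # digits exactly as transcribed
    str(int(extract_numbers(utterance)[0])),  # first number parsed by LF
]
if totp in extracted_codes:
    authenticated = True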
