Numerous False Negatives #12

Open
GrayEye opened this issue Jun 20, 2018 · 4 comments
GrayEye commented Jun 20, 2018

Hello Elyase, I'm very glad you created and maintain this very useful Python library. I'm currently using it to help parse a large amount of data from the USPTO. I noticed quite a few cases where the library didn't capture the city and/or country from the string. Here are some examples of strings from the source data I ran the library against where the city and/or country was not picked out. Hopefully these cases can help you improve the library.

INDIANAPOLIS INDIANA.
BARDSLEY, ENGLAND
ST. LOUIS, MO.
WHITING, INDIANA, AND CHICAGO, ILLINOIS.
PHILADELPHIA PA.
LEROY, N.Y.
LYNDONVILLE, VT.
AMENIA, N. Y.
COPPERHILL, TENN.
DETROIT AND JOSEPH CAMPAU AT THE RIVER,MICH.
IVORYTON, CONN.
ST. LOUIS, MO. CORPORATION OF MISSOURI.
OGDENSBURG, N.Y.
NEAR SHEFFIELD, ENGLAND
INDIANAPOLIS IND.
BASLE,
ST. LOUIS, MO. REPUBLISHED BY MONSANTO COMPANY,/ST. LOUIS, MO.
LABORATORY PARK DECATUR, ILL.
1006 OAZA KADOMA, KADOMA-CHO KITAKAWACHI-GUN, OSAKA,
3501 W. 48TH PLACE CHICAGO 32, ILL.
700 BROADWAY NEW YORK, N.Y.
811 WYANDOTTE KANSAS CITY, MO.
835 S. 8TH ST. ST. LOUIS 2, MO.
47/51 EXMOUTH MARKET, ROSEBERRY AVE. LONDON E.C.1, ENGLAND
1407 CUMMINGS DRIVE RICHMOND 20, VA.

iwpnd commented Jun 26, 2018

For it to work, the input text must use capitalization: the underlying regular expression, and the idea behind this library, is to catch city names as capitalized named entities. Otherwise it would only be a lookup.

GrayEye commented Jun 26, 2018

OK, that makes sense. I can attempt to title-case the data before I process it. However, I also have to point out that no matter what I do, certain cities like St. Louis are never recognized, even when input as just "St. Louis" or "Saint Louis".
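
A minimal sketch of the title-casing step mentioned here, assuming the input is entirely upper case like the examples earlier in this thread. `normalize_case` is a hypothetical helper name, not part of the library; it relies only on Python's built-in `str.isupper()` and `str.title()`.

```python
def normalize_case(text):
    """Title-case the text only when it is entirely upper case."""
    # str.isupper() is True when every cased character is upper case,
    # so mixed-case input is passed through untouched.
    if text.isupper():
        return text.title()
    return text


samples = [
    "ST. LOUIS, MO.",
    "INDIANAPOLIS INDIANA.",
    "3501 W. 48TH PLACE CHICAGO 32, ILL.",
]
for sample in samples:
    print(normalize_case(sample))
```

Note that `str.title()` treats every run of letters as a word, so "48TH" becomes "48Th" and "N.Y." stays "N.Y."; abbreviations and ordinals may need separate handling.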

iwpnd commented Jun 26, 2018

This is right. You have to understand that there are two things at work here: a regular expression that tries to catch all named entities in a text and store them in a list, and a lookup of those named entities in a table of city names. In cases like St. Louis, I would guess that the regular expression does not catch the "St." in St. Louis, which is why it is not recognized.
You can, however, take the regular expression and adapt it to your needs, or you can create multiple regular expressions, concatenate their matches into one list, and do the lookup from that.
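
The two-stage approach described here (regex candidates, then a table lookup) could be sketched roughly as follows. This is an illustration under stated assumptions, not the library's actual code: `KNOWN_CITIES`, both patterns, and `find_cities` are hypothetical, and the real city table is far larger.

```python
import re

# Hypothetical stand-in for the library's city table (assumption).
KNOWN_CITIES = {"St. Louis", "Saint Louis", "Chicago", "New York", "Kansas City"}

# Pattern 1: runs of capitalized words, e.g. "New York".
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")
# Pattern 2: an abbreviated prefix such as "St." followed by capitalized words.
ABBREVIATED = re.compile(r"\b(?:St|Ft|Mt)\.\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")


def find_cities(text, patterns=(ABBREVIATED, CAPITALIZED)):
    """Collect candidates from every pattern, then keep those in the table."""
    candidates = []
    for pattern in patterns:
        candidates.extend(pattern.findall(text))
    found = []
    for candidate in candidates:
        # The lookup step: only exact table entries count as cities.
        if candidate in KNOWN_CITIES and candidate not in found:
            found.append(candidate)
    return found


print(find_cities("He moved from St. Louis to Chicago."))
```

With only the first pattern, "St. Louis" is split into "St" and "Louis" and neither is in the table; the second pattern supplies the full candidate. Greedy runs of capitalized words can still swallow a neighbouring word (for example at the start of a sentence), so the patterns would need tuning for real data.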

cusco commented Jan 28, 2020

Hello, I've been using str.title() to capitalise strings. However, 'Malasya' is not identified as a country even though it comes up in the origin data: http://www.geonames.org/search.html?q=malasya


3 participants