Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for fuzzy Soundex #12

Open
howardjp opened this issue Dec 9, 2016 · 8 comments
Open

Add support for fuzzy Soundex #12

howardjp opened this issue Dec 9, 2016 · 8 comments

Comments

@howardjp
Copy link
Collaborator

howardjp commented Dec 9, 2016

More information on fuzzy Soundex available from http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf

@KyleHaynes
Copy link
Contributor

KyleHaynes commented Jul 15, 2019

Hi @howardjp,

I have an implementation of this fuzzy soundex I'd be happy to go into this package if you want it. I've translated it from the following: https://yomguithereal.github.io/talisman/phonetics/

Currently uses the stringi (with a slightly time benefit), could easily be written in pure base R (base R code included)

I've also read the paper and unless I'm missing something, it aligns with it.


On another note.
I've also translated some keyers as well (skeleton and omission): https://yomguithereal.github.io/talisman/keyers

Not sure if you'd see scope for them to be in phonics?

#' @aliases fuzzy_soundex
#'
#'
#' @title Implementation of the "Fuzzy Soundex" algorithm.
#'
#'
#' @description Implementation of the "Fuzzy Soundex" algorithm. Function taking a a vector of names and computing its Fuzzy Soundex code.
#'
#'
#' @param word String or vector of strings to encode.
#'
#'
#' @return Returns a vector of encoded strings.
#'
#'
#' @references Article: Holmes, David and M. Catherine McCabe. "Improving Precision and Recall for Soundex Retrieval." 
#'      \url{http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf} \cr 
#'      Code based on: \url{https://github.com/Yomguithereal/talisman/blob/master/src/phonetics/fuzzy-soundex.js}
#'      Licence: \url{https://github.com/Yomguithereal/talisman/blob/master/LICENSE.txt}.
#'
#'
#' @author Kyle Haynes, \email{kyle@@kylehaynes.com.au}.
#'
#'
#' @examples
#' fuzzy_soundex(c("Holmes", "Hollmes", "David", "Daved", "Catherine", "Kathryn"))
#' 
#' 
#' @export
fuzzy_soundex <-  function(word){
    
    # ---- Define constants ----
    # The code has been structured in such a way as to follow the following implementation of the algorithm:
        # https://github.com/Yomguithereal/talisman/blob/master/src/phonetics/fuzzy-soundex.js
 
   
    # Define 'translation'.
    translation <- c("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 
                     "0193017~07745501769301~7~9")
    
    # Define sets.
    set1 <- c("CS", "CZ", "TS", "TZ")
    set2 <- c("HR", "WR")
    set3 <- c("KN", "NG")
    set4 <- c("^H|^W|^Y")
    
    # Define 'rules'.
    rules_from <- c("CA", "CC", "CK", "CE", "CHL", "CL", "CHR", "CR", "CI", "CO", "CU", "CY", "DG", "GH", "MAC", "MC", "NST", "PF", "PH", "SCH", "TIO", "TIA", "TCH")
    rules_to <-   c("KA", "KK", "KK", "SE", "KL",  "KL", "KR",  "KR", "SI", "KO", "KU", "SY", "GG", "HH", "MK", "MK",  "NSS", "FF", "FF", "SSS", "SIO", "SIO", "CHH")


    # ---- Checks ----
    if(!is.vector(word) || is.na(word)){
        stop("Input must be a vector.")
    }
    

    # ---- Code ----
    # Coerce NA's to "".
    if(any(is.na(word))){
        na_logical <- is.na(word)
        word[is.na(word)] <- ""
    }

    
    ##   // Deburring the string & dropping any non-alphabetical character
    ##   name = deburr(name)
    ##     .toUpperCase()
    ##     .replace(/[^A-Z]/g, '');
    word <- iconv(word, to = 'ASCII//TRANSLIT')
    word <- toupper(word)
    word <- gsub("[^A-Z]", "", word, perl = TRUE)
    ## if (SET1.has(firstTwoLetters))
    ##     name = 'SS' + rest;
    ## else if (firstTwoLetters === 'GN')
    ##     name = 'NN' + rest;
    ## else if (SET2.has(firstTwoLetters))
    ##     name = 'RR' + rest;
    ## else if (firstTwoLetters === 'HW')
    ##     name = 'WW' + rest;
    ## else if (SET3.has(firstTwoLetters))
    ##     name = 'NN' + rest;
    word <- gsub(paste0("^", paste0(set1, collapse = "|^")), "SS", word, perl = TRUE)
    word <- gsub("^GN", "NN", word, perl = TRUE)
    word <- gsub(paste0("^", paste0(set2, collapse = "|^")), "RR", word, perl = TRUE)
    word <- gsub("^HW", "WW", word, perl = TRUE)
    word <- gsub(paste0("^", paste0(set3, collapse = "|^")), "NN", word, perl = TRUE)

    ##       // Applying some substitutions for endings
    ##   const lastTwoLetters = name.slice(-2),
    ##         initial = name.slice(0, -2);
    ##   if (lastTwoLetters === 'CH')
    ##     name = initial + 'KK';
    ##   else if (lastTwoLetters === 'NT')
    ##     name = initial + 'TT';
    ##   else if (lastTwoLetters === 'RT')
    ##     name = initial + 'RR';
    ##   else if (name.slice(-3) === 'RDT')
    ##     name = name.slice(0, -3) + 'RR';
    word <- gsub("(.{1,})CH$", "\\1KK", word, perl = TRUE)
    word <- gsub("(.{1,})NT$", "\\1TT", word, perl = TRUE)
    word <- gsub("(.{1,})RT$", "\\1RR", word, perl = TRUE)
    word <- gsub("(.{1,})RDT$", "\\1RR", word, perl = TRUE)
    
    ##   // Applying the rules
    ##   for (let i = 0, l = RULES.length; i < l; i++)
    ##     name = name.replace(...RULES[i]);
    
    # As gsub is already vectorised, a simple for loop will suffice.
    # for(i in 1:length(rules)){
    #     word <- gsub(rules[[i]][1], rules[[i]][2], word, perl = TRUE)
    # }
    word <- stringi::stri_replace_all_fixed(word, rules_from, rules_to, vectorize_all = FALSE)
    
    ##   // Caching the first letter
    ##   const firstLetter = name[0];
    first_character <- substring(word, 1, 1)
    ##   // Translating
    ##   let code = '';
    ##   for (let i = 0, l = name.length; i < l; i++)
    ##     code += TRANSLATION[name[i]] || name[i];
    word <- chartr(translation[1], translation[2] , word)
    ##   // Removing hyphens
    ##   code = code.replace(/-/g, '');
    word <- gsub("~", "", word, fixed = TRUE)
    ##   // Squeezing the code
    ##   code = squeeze(code);
    # Code can be found here: https://github.com/Yomguithereal/talisman/blob/master/src/helpers/index.js
        ## /**
        ##  * Function squeezing the given sequence by dropping consecutive duplicates.
        ##  *
        ##  * Note: the name was actually chosen to mimic Ruby's naming since I did not
        ##  * find any equivalent in other standard libraries.
        ##  *
        ##  * @param  {mixed} target - The sequence to squeeze.
        ##  * @return {array}        - The resulting sequence.
        ##  */
        ## export function squeeze(target) {
        ##   const isString = typeof target === 'string',
        ##         sequence = seq(target),
        ##         squeezed = [sequence[0]];
        ## 
        ##   for (let i = 1, l = sequence.length; i < l; i++) {
        ##     if (sequence[i] !== sequence[i - 1])
        ##       squeezed.push(sequence[i]);
        ##   }
        ## 
        ##   return isString ? squeezed.join('') : squeezed;
        ## }        
    word <- gsub("([0-9])\\1+", "\\1", word, perl = TRUE)
    ##   // Dealing with some initials
    ##   if (SET4.has(code[0]))
    ##     code = firstLetter + code;
    ##   else
    ##     code = firstLetter + code.slice(1);
    word[grepl(set4, word)] <- paste0(first_character[grepl(set4, word)], word[grepl(set4, word)])
    word[!grepl(set4, word)] <- paste0(first_character[!grepl(set4, word)], substring(word[!grepl(set4, word)], 2))
    ##   // Dropping vowels
    ##   code = code.replace(/0/g, '');
    word <- gsub("0", "", word, perl = TRUE)

    # Finally, if any NA's were passed, set them back to NA.
    if(exists("na_logical")){
        word[na_logical] <- NA
    }
    
    # Return hashed word(s).
    return(word)
}

@howardjp
Copy link
Collaborator Author

@KyleHaynes, this seems pretty reasonable. The big question is, do you have test cases?

@KyleHaynes
Copy link
Contributor

@howardjp, good point, let me get back to you (just drowning with work atm).

@chrislit
Copy link

chrislit commented Aug 4, 2019

If it's any help, here's a set of testcases I use for fuzzy soundex, formatted like your CSV files.
fuzzy-soundex.csv.txt

@howardjp
Copy link
Collaborator Author

howardjp commented Aug 4, 2019 via email

@KyleHaynes
Copy link
Contributor

KyleHaynes commented Aug 5, 2019

@chrislit - thanks for this

I've run against my function and from the 27 tests strings I get two mismatches (same word)

It's an issue with my function and not the test data.

I'll look into this further

require(data.table)
require(stringi)

csv <- "name,hash
Kristen,K6935
Krissy,K6900
Christen,K6935
peter,P3600
pete,P3000
pedro,P3600
stephen,S3150
steve,S3100
smith,S5300
smythe,S5300
gail,G4000
gayle,G4000
christine,K6935
christina,K6935
kristina,K6935
Wight,W3000
Hardt,H6000
Knight,N3000
Czech,S7000
Tsech,S7000
gnomic,N5900
Wright,R3000
Hrothgar,R3760
Hwaet,W3000
Grant,G6300
Hart,H6000
Hardt,H6000"

dt <- fread(csv)

dt[, "hash_2" := fuzzy_soundex(name)]
# Currently my function doesn't pad trailing 0's. Add this:
dt[, "hash_2" := stri_pad_right(hash_2, 5, 0)]

all.equal(dt$hash, dt$hash_2)

dt[hash != hash_2]
#    name  hash hash_2
# 1: Hardt H6000  R0000
# 2: Hardt H6000  R0000

@KyleHaynes
Copy link
Contributor

Fixed code (Fixed in my original comment).

Re-running previous test now returns no differences.

Re padding out strings, could make this argument driven like the phonics::soundex argument: maxCodeLen = 4L. Could set the length, otherwise, if NULL no padding at all?

I (try) to conform to snake_case, happy to adjust the name of the function to be whatever (to fit in with the rest of the phonics library)

As for the dependency of stringi let me know if you want adjust to base R.

Also, moving forward, I'll create a fork and make changes there.

@howardjp
Copy link
Collaborator Author

howardjp commented Aug 7, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants