Add support for fuzzy Soundex #12

howardjp · 2016-12-09T01:36:18Z

More information on fuzzy Soundex available from http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf

KyleHaynes · 2019-07-15T07:04:04Z

Hi @howardjp,

I have an implementation of this fuzzy soundex I'd be happy to go into this package if you want it. I've translated it from the following: https://yomguithereal.github.io/talisman/phonetics/

Currently uses the stringi (with a slightly time benefit), could easily be written in pure base R (base R code included)

I've also read the paper and unless I'm missing something, it aligns with it.

On another note.
I've also translated some keyers as well (skeleton and omission): https://yomguithereal.github.io/talisman/keyers

Not sure if you'd see scope for them to be in phonics?

#' @aliases fuzzy_soundex
#'
#'
#' @title Implementation of the "Fuzzy Soundex" algorithm.
#'
#'
#' @description Implementation of the "Fuzzy Soundex" algorithm. Function taking a a vector of names and computing its Fuzzy Soundex code.
#'
#'
#' @param word String or vector of strings to encode.
#'
#'
#' @return Returns a vector of encoded strings.
#'
#'
#' @references Article: Holmes, David and M. Catherine McCabe. "Improving Precision and Recall for Soundex Retrieval." 
#'      \url{http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf} \cr 
#'      Code based on: \url{https://github.com/Yomguithereal/talisman/blob/master/src/phonetics/fuzzy-soundex.js}
#'      Licence: \url{https://github.com/Yomguithereal/talisman/blob/master/LICENSE.txt}.
#'
#'
#' @author Kyle Haynes, \email{kyle@@kylehaynes.com.au}.
#'
#'
#' @examples
#' fuzzy_soundex(c("Holmes", "Hollmes", "David", "Daved", "Catherine", "Kathryn"))
#' 
#' 
#' @export
fuzzy_soundex <-  function(word){
    
    # ---- Define constants ----
    # The code has been structured in such a way as to follow the following implementation of the algorithm:
        # https://github.com/Yomguithereal/talisman/blob/master/src/phonetics/fuzzy-soundex.js
 
   
    # Define 'translation'.
    translation <- c("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 
                     "0193017~07745501769301~7~9")
    
    # Define sets.
    set1 <- c("CS", "CZ", "TS", "TZ")
    set2 <- c("HR", "WR")
    set3 <- c("KN", "NG")
    set4 <- c("^H|^W|^Y")
    
    # Define 'rules'.
    rules_from <- c("CA", "CC", "CK", "CE", "CHL", "CL", "CHR", "CR", "CI", "CO", "CU", "CY", "DG", "GH", "MAC", "MC", "NST", "PF", "PH", "SCH", "TIO", "TIA", "TCH")
    rules_to <-   c("KA", "KK", "KK", "SE", "KL",  "KL", "KR",  "KR", "SI", "KO", "KU", "SY", "GG", "HH", "MK", "MK",  "NSS", "FF", "FF", "SSS", "SIO", "SIO", "CHH")


    # ---- Checks ----
    if(!is.vector(word) || is.na(word)){
        stop("Input must be a vector.")
    }
    

    # ---- Code ----
    # Coerce NA's to "".
    if(any(is.na(word))){
        na_logical <- is.na(word)
        word[is.na(word)] <- ""
    }

    
    ##   // Deburring the string & dropping any non-alphabetical character
    ##   name = deburr(name)
    ##     .toUpperCase()
    ##     .replace(/[^A-Z]/g, '');
    word <- iconv(word, to = 'ASCII//TRANSLIT')
    word <- toupper(word)
    word <- gsub("[^A-Z]", "", word, perl = TRUE)
    ## if (SET1.has(firstTwoLetters))
    ##     name = 'SS' + rest;
    ## else if (firstTwoLetters === 'GN')
    ##     name = 'NN' + rest;
    ## else if (SET2.has(firstTwoLetters))
    ##     name = 'RR' + rest;
    ## else if (firstTwoLetters === 'HW')
    ##     name = 'WW' + rest;
    ## else if (SET3.has(firstTwoLetters))
    ##     name = 'NN' + rest;
    word <- gsub(paste0("^", paste0(set1, collapse = "|^")), "SS", word, perl = TRUE)
    word <- gsub("^GN", "NN", word, perl = TRUE)
    word <- gsub(paste0("^", paste0(set2, collapse = "|^")), "RR", word, perl = TRUE)
    word <- gsub("^HW", "WW", word, perl = TRUE)
    word <- gsub(paste0("^", paste0(set3, collapse = "|^")), "NN", word, perl = TRUE)

    ##       // Applying some substitutions for endings
    ##   const lastTwoLetters = name.slice(-2),
    ##         initial = name.slice(0, -2);
    ##   if (lastTwoLetters === 'CH')
    ##     name = initial + 'KK';
    ##   else if (lastTwoLetters === 'NT')
    ##     name = initial + 'TT';
    ##   else if (lastTwoLetters === 'RT')
    ##     name = initial + 'RR';
    ##   else if (name.slice(-3) === 'RDT')
    ##     name = name.slice(0, -3) + 'RR';
    word <- gsub("(.{1,})CH$", "\\1KK", word, perl = TRUE)
    word <- gsub("(.{1,})NT$", "\\1TT", word, perl = TRUE)
    word <- gsub("(.{1,})RT$", "\\1RR", word, perl = TRUE)
    word <- gsub("(.{1,})RDT$", "\\1RR", word, perl = TRUE)
    
    ##   // Applying the rules
    ##   for (let i = 0, l = RULES.length; i < l; i++)
    ##     name = name.replace(...RULES[i]);
    
    # As gsub is already vectorised, a simple for loop will suffice.
    # for(i in 1:length(rules)){
    #     word <- gsub(rules[[i]][1], rules[[i]][2], word, perl = TRUE)
    # }
    word <- stringi::stri_replace_all_fixed(word, rules_from, rules_to, vectorize_all = FALSE)
    
    ##   // Caching the first letter
    ##   const firstLetter = name[0];
    first_character <- substring(word, 1, 1)
    ##   // Translating
    ##   let code = '';
    ##   for (let i = 0, l = name.length; i < l; i++)
    ##     code += TRANSLATION[name[i]] || name[i];
    word <- chartr(translation[1], translation[2] , word)
    ##   // Removing hyphens
    ##   code = code.replace(/-/g, '');
    word <- gsub("~", "", word, fixed = TRUE)
    ##   // Squeezing the code
    ##   code = squeeze(code);
    # Code can be found here: https://github.com/Yomguithereal/talisman/blob/master/src/helpers/index.js
        ## /**
        ##  * Function squeezing the given sequence by dropping consecutive duplicates.
        ##  *
        ##  * Note: the name was actually chosen to mimic Ruby's naming since I did not
        ##  * find any equivalent in other standard libraries.
        ##  *
        ##  * @param  {mixed} target - The sequence to squeeze.
        ##  * @return {array}        - The resulting sequence.
        ##  */
        ## export function squeeze(target) {
        ##   const isString = typeof target === 'string',
        ##         sequence = seq(target),
        ##         squeezed = [sequence[0]];
        ## 
        ##   for (let i = 1, l = sequence.length; i < l; i++) {
        ##     if (sequence[i] !== sequence[i - 1])
        ##       squeezed.push(sequence[i]);
        ##   }
        ## 
        ##   return isString ? squeezed.join('') : squeezed;
        ## }        
    word <- gsub("([0-9])\\1+", "\\1", word, perl = TRUE)
    ##   // Dealing with some initials
    ##   if (SET4.has(code[0]))
    ##     code = firstLetter + code;
    ##   else
    ##     code = firstLetter + code.slice(1);
    word[grepl(set4, word)] <- paste0(first_character[grepl(set4, word)], word[grepl(set4, word)])
    word[!grepl(set4, word)] <- paste0(first_character[!grepl(set4, word)], substring(word[!grepl(set4, word)], 2))
    ##   // Dropping vowels
    ##   code = code.replace(/0/g, '');
    word <- gsub("0", "", word, perl = TRUE)

    # Finally, if any NA's were passed, set them back to NA.
    if(exists("na_logical")){
        word[na_logical] <- NA
    }
    
    # Return hashed word(s).
    return(word)
}

howardjp · 2019-07-15T12:04:18Z

@KyleHaynes, this seems pretty reasonable. The big question is, do you have test cases?

KyleHaynes · 2019-07-15T21:15:28Z

@howardjp, good point, let me get back to you (just drowning with work atm).

chrislit · 2019-08-04T21:51:21Z

If it's any help, here's a set of testcases I use for fuzzy soundex, formatted like your CSV files.
fuzzy-soundex.csv.txt

howardjp · 2019-08-04T21:55:33Z

Awesome sauce! I am out of town right now, and will look at this later this week or early next!

…

On Sun, Aug 4, 2019 at 5:51 PM Chris Little ***@***.***> wrote: If it's any help, here's a set of testcases I use for fuzzy soundex, formatted like your CSV files. fuzzy-soundex.csv.txt <https://github.com/howardjp/phonics/files/3465362/fuzzy-soundex.csv.txt> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#12>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AABGG2T4LJDNDHRKDUM7LBTQC5FNTANCNFSM4CZB6N6A> .

KyleHaynes · 2019-08-05T23:07:03Z

@chrislit - thanks for this

I've run against my function and from the 27 tests strings I get two mismatches (same word)

It's an issue with my function and not the test data.

I'll look into this further

require(data.table)
require(stringi)

csv <- "name,hash
Kristen,K6935
Krissy,K6900
Christen,K6935
peter,P3600
pete,P3000
pedro,P3600
stephen,S3150
steve,S3100
smith,S5300
smythe,S5300
gail,G4000
gayle,G4000
christine,K6935
christina,K6935
kristina,K6935
Wight,W3000
Hardt,H6000
Knight,N3000
Czech,S7000
Tsech,S7000
gnomic,N5900
Wright,R3000
Hrothgar,R3760
Hwaet,W3000
Grant,G6300
Hart,H6000
Hardt,H6000"

dt <- fread(csv)

dt[, "hash_2" := fuzzy_soundex(name)]
# Currently my function doesn't pad trailing 0's. Add this:
dt[, "hash_2" := stri_pad_right(hash_2, 5, 0)]

all.equal(dt$hash, dt$hash_2)

dt[hash != hash_2]
#    name  hash hash_2
# 1: Hardt H6000  R0000
# 2: Hardt H6000  R0000

KyleHaynes · 2019-08-06T06:53:50Z

Fixed code (Fixed in my original comment).

Re-running previous test now returns no differences.

Re padding out strings, could make this argument driven like the phonics::soundex argument: maxCodeLen = 4L. Could set the length, otherwise, if NULL no padding at all?

I (try) to conform to snake_case, happy to adjust the name of the function to be whatever (to fit in with the rest of the phonics library)

As for the dependency of stringi let me know if you want adjust to base R.

Also, moving forward, I'll create a fork and make changes there.

howardjp · 2019-08-07T20:02:02Z

Your approach to maxCodeLen is fine... As for stringi, I am not sure. I mean, I don't mind the dependency. You gotta do what you gotta do, amirite? As a user, I can't imagine it is really objectionable.

…

On Tue, Aug 6, 2019 at 2:53 AM Kyle Haynes ***@***.***> wrote: Fixed code (Fixed in my original comment). Re-running previous test now returns no differences. Re padding out strings, could make this argument driven like the phonics::soundex argument: maxCodeLen = 4L. Could set the length, otherwise, if NULL no padding at all? I (try) to conform to snake_case, happy to adjust the name of the function to be whatever (to fit in with the rest of the phonics library) As for the dependency of stringi let me know if you want adjust to base R. Also, moving forward, I'll create a fork and make changes there. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#12>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AABGG2WRMO4GCZN5YI64QODQDENX5ANCNFSM4CZB6N6A> .

howardjp added enhancement and removed enhancement labels Dec 9, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for fuzzy Soundex #12

Add support for fuzzy Soundex #12

howardjp commented Dec 9, 2016

KyleHaynes commented Jul 15, 2019 •

edited

Loading

howardjp commented Jul 15, 2019

KyleHaynes commented Jul 15, 2019

chrislit commented Aug 4, 2019

howardjp commented Aug 4, 2019 via email

KyleHaynes commented Aug 5, 2019 •

edited

Loading

KyleHaynes commented Aug 6, 2019

howardjp commented Aug 7, 2019 via email

Add support for fuzzy Soundex #12

Add support for fuzzy Soundex #12

Comments

howardjp commented Dec 9, 2016

KyleHaynes commented Jul 15, 2019 • edited Loading

howardjp commented Jul 15, 2019

KyleHaynes commented Jul 15, 2019

chrislit commented Aug 4, 2019

howardjp commented Aug 4, 2019 via email

KyleHaynes commented Aug 5, 2019 • edited Loading

KyleHaynes commented Aug 6, 2019

howardjp commented Aug 7, 2019 via email

KyleHaynes commented Jul 15, 2019 •

edited

Loading

KyleHaynes commented Aug 5, 2019 •

edited

Loading