When you validate data against a domain, the API first checks whether the exact member already exists. If it does, the member is returned as a valid value. If the member is a synonym assigned to a master, the master is returned as the valid value. If no exact member is found, the API uses the algorithm defined for the domain to determine which existing synonym or master the value corresponds to. If it finds one, it returns that member as valid together with a threshold, which expresses the probability that the two members match. If the match is itself a synonym, the master is returned, but the threshold still refers to the match against the original synonym.
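The lookup order described above can be sketched roughly as follows. This is only an illustration, not the HEDDA.IO API: the names `validate`, `ValidationResult`, and `min_score` are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever algorithm is configured on the domain.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, Optional, Set


@dataclass
class ValidationResult:
    value: Optional[str]  # master that was resolved, or None if nothing matched
    threshold: float      # 1.0 for exact hits, similarity score for fuzzy hits


def validate(value: str, masters: Set[str], synonyms: Dict[str, str],
             min_score: float = 0.8) -> ValidationResult:
    """Hypothetical sketch: exact member, then synonym, then fuzzy match."""
    # 1. The exact member exists as a master.
    if value in masters:
        return ValidationResult(value, 1.0)
    # 2. The exact member exists as a synonym: return its master.
    if value in synonyms:
        return ValidationResult(synonyms[value], 1.0)
    # 3. Fall back to the similarity algorithm (assumption: SequenceMatcher).
    best, best_score = None, 0.0
    for candidate in sorted(masters | set(synonyms)):
        score = SequenceMatcher(None, value, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best is None or best_score < min_score:
        return ValidationResult(None, best_score)
    # If the best match is a synonym, return its master; the score still
    # describes the match against the synonym itself.
    return ValidationResult(synonyms.get(best, best), best_score)


masters = {"Germany"}
synonyms = {"Deutschland": "Germany"}
exact = validate("Deutschland", masters, synonyms)
fuzzy = validate("Deutschlnd", masters, synonyms)
```

Here `fuzzy.value` resolves to the master `Germany` even though the closest member was the synonym `Deutschland`, and `fuzzy.threshold` reflects the similarity to that synonym.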
Depending on the data in your domain, choosing the right algorithm can significantly improve the results. Phonetic algorithms such as Cologne Phonetic usually work very well for product names. If you store product numbers in your domain, a distance algorithm such as Levenshtein Distance can give better results. With telephone numbers, on the other hand, even small differences often lead to completely wrong results, so the best algorithm is the exact match.
HEDDA.IO benefits from our SQLPhonetics.NET engine (https://sqlphonetics.oh22.is) in the further development of the duplicate search. Future versions of HEDDA.IO will include additional algorithms to enable improved matches for a wider range of domains and languages.
Caverphone is a phonetic matching algorithm designed to identify English names by their sound.
Cologne Phonetic (also known as Kölner Phonetik) is a phonetic algorithm that assigns a sequence of digits, the phonetic code, to words according to their sound. The aim of this procedure is to assign the same code to words that sound the same, so that search functions can implement a similarity search.
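To illustrate how such a digit code is built, here is a minimal sketch that follows the commonly published Cologne Phonetic rule table. It is not the implementation used by HEDDA.IO or SQLPhonetics.NET; the function name `cologne_phonetic` is hypothetical, and the handling of umlauts (folding Ä, Ö, Ü to A, O, U) is an assumption.

```python
def cologne_phonetic(word: str) -> str:
    """Sketch of the Cologne Phonetic code (assumed rule table, not HEDDA.IO's)."""
    # Upper-case (this turns "ß" into "SS"), fold umlauts, keep only A-Z.
    folded = word.upper().translate(str.maketrans("ÄÖÜ", "AOU"))
    letters = [c for c in folded if "A" <= c <= "Z"]

    codes = []
    for i, c in enumerate(letters):
        prev = letters[i - 1] if i > 0 else ""
        nxt = letters[i + 1] if i + 1 < len(letters) else ""
        if c == "H":
            continue  # H carries no code of its own
        if c in set("AEIJOUY"):
            code = "0"
        elif c == "B":
            code = "1"
        elif c == "P":
            code = "3" if nxt == "H" else "1"
        elif c in set("DT"):
            code = "8" if nxt in set("CSZ") else "2"
        elif c in set("FVW"):
            code = "3"
        elif c in set("GKQ"):
            code = "4"
        elif c == "C":
            if i == 0:
                code = "4" if nxt in set("AHKLOQRUX") else "8"
            elif prev in set("SZ"):
                code = "8"
            else:
                code = "4" if nxt in set("AHKOQUX") else "8"
        elif c == "X":
            code = "8" if prev in set("CKQ") else "48"
        elif c == "L":
            code = "5"
        elif c in set("MN"):
            code = "6"
        elif c == "R":
            code = "7"
        else:  # S, Z
            code = "8"
        codes.append(code)

    # Collapse runs of identical digits, then drop every 0 except a leading one.
    collapsed = []
    for digit in "".join(codes):
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    if not collapsed:
        return ""
    return collapsed[0] + "".join(d for d in collapsed[1:] if d != "0")


code_meier = cologne_phonetic("Meier")
code_mayer = cologne_phonetic("Mayer")
```

With these rules, the differently spelled names `Meier` and `Mayer` receive the same code, which is what makes the similarity search possible.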
Double Metaphone differs from the original “Metaphone” algorithm, whose application is limited to English: this version takes into account spelling peculiarities of a number of other languages.
Exact Match – Case Sensitive
Checks whether two strings are identical, taking case into account.
Exact Match – Case Insensitive
Checks whether two strings are identical, ignoring case.
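As a small illustration of the two exact-match variants, the following sketch uses hypothetical function names. Using `str.casefold()` for the case-insensitive comparison is an assumption about how case should be ignored, not a statement about HEDDA.IO's behaviour.

```python
def exact_match_case_sensitive(a: str, b: str) -> bool:
    # Strings must be identical character for character.
    return a == b


def exact_match_case_insensitive(a: str, b: str) -> bool:
    # casefold() is a more aggressive lower() intended for caseless comparison.
    return a.casefold() == b.casefold()


sensitive = exact_match_case_sensitive("Berlin", "berlin")
insensitive = exact_match_case_insensitive("Berlin", "berlin")
```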
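Jaro, and its variant Jaro-Winkler, is a string comparison method for two strings or sequences. Unlike phonetic methods such as Cologne Phonetic, it measures how similar the characters of the two strings are and not how they sound. Jaro-Winkler additionally gives more weight to strings that share a common prefix.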
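The following is a minimal sketch of the textbook Jaro and Jaro-Winkler similarity, shown only to make the idea concrete. The function names and the default parameters (`p=0.1`, a prefix of at most four characters) are the commonly cited ones and are assumptions here, not HEDDA.IO settings.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed standard defaults)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


score_jaro = jaro("MARTHA", "MARHTA")
score_jw = jaro_winkler("MARTHA", "MARHTA")
```

For the classic pair `MARTHA` and `MARHTA`, the Jaro score is about 0.944, and the shared prefix `MAR` lifts the Jaro-Winkler score to about 0.961.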
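Levenshtein Distance is a string comparison method for two strings or sequences. Unlike phonetic methods such as Cologne Phonetic, it does not consider how the strings sound. Instead, it measures the edit distance between them: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.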
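A minimal sketch of the standard dynamic-programming computation is shown below. The function names are hypothetical, and converting the distance into a score between 0 and 1 by dividing by the longer length is one common convention, assumed here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    # Assumed normalisation: 1.0 means identical, 0.0 means nothing in common.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


distance = levenshtein("kitten", "sitting")
```

The well-known pair `kitten` and `sitting` has a distance of 3: two substitutions and one insertion.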
Longest Common Substring
The Longest Common Substring algorithm finds the longest string (or strings) that is a substring of two or more strings.
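For two strings, this can be sketched with a simple dynamic-programming table. The function name is hypothetical, and returning only the first longest substring found is a simplification made for this example.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the first longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    # previous[j] holds the length of the common suffix of a[:i-1] and b[:j].
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len, best_end = current[j], i
        previous = current
    return a[best_end - best_len:best_end]


common = longest_common_substring("xabcdy", "zabcdw")
```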
Metaphone is a phonetic algorithm for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling.
The Needleman-Wunsch algorithm calculates the optimal global alignment score of two sequences by filling a matrix with the best score for every pair of their prefixes.
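The matrix fill can be sketched as follows, keeping only two rows in memory at a time. The function name and the scoring values (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration; they are not documented HEDDA.IO parameters.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score of a and b (assumed scoring scheme)."""
    # Row 0: aligning an empty prefix of a with b[:j] costs j gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] with an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal,               # align a[i-1] with b[j-1]
                               previous[j] + gap,      # gap in b
                               current[j - 1] + gap))  # gap in a
        previous = current
    return previous[-1]


score_gap = needleman_wunsch_score("AC", "ABC")
```

In this example the best alignment matches `A` and `C` and spends one gap on `B`, so the score is 1 + 1 - 1 = 1.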
The New York State Identification and Intelligence System Phonetic Code, commonly known as NYSIIS, is a phonetic algorithm devised in 1970 as part of the New York State Identification and Intelligence System. It features an accuracy increase of 2.7% over the traditional Soundex algorithm.
The Phonem algorithm mainly targets German names. Unlike Soundex, it does not limit the length of the code, and it uses letters instead of digits to represent the phonetic sound.
Phonetex is a variation of the Soundex algorithm. It takes into account letter combinations that sound alike, particularly at the start of the word (such as ‘PN’ = ‘N’, ‘PH’ = ‘F’).
The Phonex algorithm is only defined for alphabetic input, i.e., the letters “A-Z” plus “Ä,” “Ö,” “Ü,” and “ß.” Non-alphabetical characters are removed from the string in a locale-dependent fashion.
Phonix is a phonetic retrieval technique developed for use with the URICA library system. It has been found to be particularly useful when applied to personal names, specifically author surnames in the context of a library system. Certain names such as Anton Chekov have been variously transliterated as TSJECHOF, TSJECHOW, TJEKHOW, CHEKHOV, CHEKHOW etc., in the multi‐lingual environment of libraries in Southern Africa.