abydos.fingerprint package
abydos.fingerprint.
The fingerprint package implements string fingerprints such as:
- Basic fingerprinters originating in OpenRefine <http://openrefine.org>:
  - String (String)
  - Phonetic (Phonetic)
  - Q-Gram (QGram)
- Fingerprints developed by Pollock & Zamora:
  - Skeleton key (SkeletonKey)
  - Omission key (OmissionKey)
- Fingerprints developed by Cisłak & Grabowski:
  - Occurrence (Occurrence)
  - Occurrence halved (OccurrenceHalved)
  - Count (Count)
  - Position (Position)
- The Synoname toolcode (SynonameToolcode)
- Taft's codings:
  - Consonant coding (Consonant)
  - Extract - letter list (Extract)
  - Extract - position & frequency (ExtractPositionFrequency)
  - L.A. County Sheriff's System (LACSS)
- Library of Congress Cutter table encoding (LCCutter)
- Burrows-Wheeler transform (BWTF) and run-length encoded Burrows-Wheeler transform (BWTRLEF)
Each fingerprint class has a fingerprint method that takes a string and
returns the string's fingerprint:
>>> sk = SkeletonKey()
>>> sk.fingerprint('orange')
'ORNGAE'
>>> sk.fingerprint('strange')
'STRNGAE'
- class abydos.fingerprint.BWTF(terminator: str = '\x00')[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
Burrows-Wheeler transform fingerprint.
This is a wrapper of the BWT class in abydos.compression, which provides the same interface as other descendants of _Fingerprint.
New in version 0.4.1.
Initialize BWTF instance.
- Parameters
terminator (str) -- A character added to signal the end of the string
New in version 0.4.1.
- fingerprint(word: str) str[source]
Return the Burrows-Wheeler transform of a word.
- Parameters
word (str) -- The word to fingerprint
- Returns
The Burrows-Wheeler transform of a word
- Return type
str
Examples
>>> fp = BWTF()
>>> fp.fingerprint('hat')
'th\x00a'
>>> fp.fingerprint('niall')
'linla\x00'
>>> fp.fingerprint('colin')
'n\x00loic'
>>> fp.fingerprint('atcg')
'g\x00tca'
>>> fp.fingerprint('entreatment')
'term\x00teetnan'
New in version 0.4.1.
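The transform is compact enough to illustrate without the library. The following is a minimal sketch, not the abydos implementation: `bwt_sketch` is a hypothetical name, and it uses the naive approach of sorting every rotation of the terminated string, assuming the terminator does not occur in the input.

```python
def bwt_sketch(word: str, terminator: str = "\x00") -> str:
    """Return the Burrows-Wheeler transform of word (naive illustration)."""
    if terminator in word:
        raise ValueError("terminator must not occur in the input")
    text = word + terminator
    # Build every rotation of the terminated string and sort them.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


print(repr(bwt_sketch("hat")))  # 'th\x00a'
```

Sorting all rotations costs far more than a suffix-array construction would, but it keeps the relationship between the input and the output easy to follow.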
- class abydos.fingerprint.BWTRLEF(terminator: str = '\x00')[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
Burrows-Wheeler transform plus run-length encoding fingerprint.
This is a wrapper of the BWT and RLE classes in abydos.compression, which provides the same interface as other descendants of _Fingerprint.
New in version 0.4.1.
Initialize BWTRLEF instance.
- Parameters
terminator (str) -- A character added to signal the end of the string
New in version 0.4.1.
- fingerprint(word: str) str[source]
Return the run-length encoded Burrows-Wheeler transform of a word.
- Parameters
word (str) -- The word to fingerprint
- Returns
The run-length encoded Burrows-Wheeler transform of a word
- Return type
str
Examples
>>> fp = BWTRLEF()
>>> fp.fingerprint('hat')
'th\x00a'
>>> fp.fingerprint('niall')
'linla\x00'
>>> fp.fingerprint('colin')
'n\x00loic'
>>> fp.fingerprint('atcg')
'g\x00tca'
>>> fp.fingerprint('entreatment')
'term\x00teetnan'
New in version 0.4.1.
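A hedged sketch of the combined pipeline follows. `bwt_rle_sketch` is a hypothetical name, and the run-length convention it uses (a run of three or more characters is written as its length followed by the character, while shorter runs are left unchanged) is an assumption about the RLE class, not something stated in this section. Under that assumed convention, the doctest words above, whose transforms contain no run longer than two characters, come out the same as the plain transform.

```python
from itertools import groupby


def bwt_rle_sketch(word: str, terminator: str = "\x00") -> str:
    """Burrows-Wheeler transform followed by a simple run-length encoding."""
    text = word + terminator
    # Naive BWT: last column of the sorted rotations.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    transformed = "".join(rotation[-1] for rotation in rotations)
    pieces = []
    for char, run in groupby(transformed):
        length = len(list(run))
        # Assumed convention: only runs of 3+ are replaced by "<count><char>".
        pieces.append(f"{length}{char}" if length > 2 else char * length)
    return "".join(pieces)


print(repr(bwt_rle_sketch("aaaa")))  # '4a\x00'
```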
- class abydos.fingerprint.Consonant(variant: int = 1, doubles: bool = True, vowels: Optional[Union[Iterable[str], str]] = None)[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
Consonant Coding Fingerprint.
Based on the consonant coding from [Taf70], variants 1, 2, 3, 1-D, 2-D, and 3-D.
New in version 0.4.1.
Initialize Consonant instance.
- Parameters
variant (int) -- Selects between Taft's 3 variants (1, 2, or 3), which assign to the vowel set one of:
1. A, E, I, O, & U
2. A, E, I, O, U, W, & Y
3. A, E, I, O, U, W, H, & Y
doubles (bool) -- If set to False, multiple consonants in a row are conflated to a single instance.
vowels (list, set, or str) -- Setting vowels to a non-None value overrides the variant setting and defines the set of letters to be removed from the input.
New in version 0.4.1.
- fingerprint(word: str) str[source]
Return the consonant coding.
- Parameters
word (str) -- The word to fingerprint
- Returns
The consonant coding
- Return type
str
Examples
>>> cf = Consonant()
>>> cf.fingerprint('hat')
'HT'
>>> cf.fingerprint('niall')
'NLL'
>>> cf.fingerprint('colin')
'CLN'
>>> cf.fingerprint('atcg')
'ATCG'
>>> cf.fingerprint('entreatment')
'ENTRTMNT'
New in version 0.4.1.
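The coding can be approximated in a few lines: keep the first letter, then drop every vowel from the rest. The sketch below is illustrative only. `consonant_coding_sketch` and `TAFT_VOWELS` are hypothetical names, and collapsing repeated letters before removing vowels when `doubles` is False is an assumption about the order of operations.

```python
from itertools import groupby

# Vowel sets for the three variants listed in the parameter description.
TAFT_VOWELS = {1: "AEIOU", 2: "AEIOUWY", 3: "AEIOUWHY"}


def consonant_coding_sketch(word: str, variant: int = 1, doubles: bool = True) -> str:
    """Keep the first letter, then remove vowels from the remainder."""
    vowels = set(TAFT_VOWELS[variant])
    letters = [c for c in word.upper() if c.isalpha()]
    if not doubles:
        # Assumption: consecutive repeats are collapsed before vowel removal.
        letters = [c for c, _ in groupby(letters)]
    return "".join(letters[:1] + [c for c in letters[1:] if c not in vowels])


print(consonant_coding_sketch("entreatment"))  # ENTRTMNT
print(consonant_coding_sketch("niall", doubles=False))  # NL
```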
- class abydos.fingerprint.Count(n_bits: int = 16, most_common: Tuple[str, ...] = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
Count Fingerprint.
Based on the count fingerprint from [CislakG17].
New in version 0.3.6.
Initialize Count instance.
- Parameters
n_bits (int) -- Number of bits in the fingerprint returned
most_common (list) -- The most common tokens in the target language, ordered by frequency
New in version 0.4.0.
- fingerprint(word: str) str[source]
Return the count fingerprint.
- Parameters
word (str) -- The word to fingerprint
- Returns
The count fingerprint
- Return type
str
Examples
>>> cf = Count()
>>> cf.fingerprint('hat')
'0001010000000001'
>>> cf.fingerprint('niall')
'0000010001010000'
>>> cf.fingerprint('colin')
'0000000101010000'
>>> cf.fingerprint('atcg')
'0001010000000000'
>>> cf.fingerprint('entreatment')
'1111010000100000'
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.6.0: Changed to return a str and added fingerprint_int method
- fingerprint_int(word: str) int[source]
Return the count fingerprint.
- Parameters
word (str) -- The word to fingerprint
- Returns
The count fingerprint as an int
- Return type
int
Examples
>>> cf = Count()
>>> cf.fingerprint_int('hat')
5121
>>> cf.fingerprint_int('niall')
1104
>>> cf.fingerprint_int('colin')
336
>>> cf.fingerprint_int('atcg')
5120
>>> cf.fingerprint_int('entreatment')
62496
New in version 0.6.0.
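The doctest outputs above are consistent with a layout in which each of the first `n_bits // 2` most common letters receives a two-bit counter. The following sketch reproduces that layout under stated assumptions: `count_fingerprint_sketch` is a hypothetical name, and keeping only the low two bits of each count (so a count of 4 wraps to 0) is an assumption about overflow handling.

```python
from collections import Counter

MOST_COMMON = ("e", "t", "a", "o", "i", "n", "s", "h",
               "r", "d", "l", "c", "u", "m", "w", "f")


def count_fingerprint_sketch(word: str, n_bits: int = 16,
                             most_common: tuple = MOST_COMMON) -> str:
    """Two bits per letter: how often each common letter occurs in word."""
    counts = Counter(word)
    bits = ""
    for letter in most_common[: n_bits // 2]:
        # Assumption: only the low two bits of the count are kept.
        bits += format(counts[letter] & 3, "02b")
    return bits.ljust(n_bits, "0")


fingerprint = count_fingerprint_sketch("hat")
print(fingerprint, int(fingerprint, 2))  # 0001010000000001 5121
```

Parsing the bit string as a base-2 integer gives the same values that the `fingerprint_int` doctests show.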
- class abydos.fingerprint.Extract(letter_list: Union[int, Iterable[str]] = 1)[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
Extract Letter List fingerprint.
Based on the extract letter list coding from [Taf70], for lists 1, 2, 3, & 4.
New in version 0.4.1.
Initialize Extract instance.
- Parameters
letter_list (int or iterable) -- If an integer (1-4) is supplied, Taft's specified letter lists are used. If an iterable is supplied, its values will be used as the list of letters to remove (in order).
New in version 0.4.1.
- fingerprint(word: str) str[source]
Return the extract letter list coding.
- Parameters
word (str) -- The word to fingerprint
- Returns
The extract letter list coding
- Return type
str
Examples
>>> fp = Extract()
>>> fp.fingerprint('hat')
'HAT'
>>> fp.fingerprint('niall')
'NILL'
>>> fp.fingerprint('colin')
'CLIN'
>>> fp.fingerprint('atcg')
'ATCG'
>>> fp.fingerprint('entreatment')
'NRMN'
New in version 0.4.1.
- class abydos.fingerprint.ExtractPositionFrequency[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
Extract - Position & Frequency fingerprint.
Based on the extract - position & frequency coding from [Taf70].
New in version 0.4.1.
- fingerprint(word: str) str[source]
Return the extract - position & frequency coding.
- Parameters
word (str) -- The word to fingerprint
- Returns
The extract - position & frequency coding
- Return type
str
Examples
>>> fp = ExtractPositionFrequency()
>>> fp.fingerprint('hat')
'HAT'
>>> fp.fingerprint('niall')
'NILL'
>>> fp.fingerprint('colin')
'COLN'
>>> fp.fingerprint('atcg')
'ATCG'
>>> fp.fingerprint('entreatment')
'NMNT'
New in version 0.4.1.
- class abydos.fingerprint.LACSS[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
L.A. County Sheriff's System fingerprint.
Based on the description from [Taf70].
New in version 0.4.1.
- fingerprint(word: str) str[source]
Return the LACSS coding.
- Parameters
word (str) -- The word to fingerprint
- Returns
The L.A. County Sheriff's System fingerprint
- Return type
str
Examples
>>> cf = LACSS()
>>> cf.fingerprint('hat')
'4911211'
>>> cf.fingerprint('niall')
'6488374'
>>> cf.fingerprint('colin')
'3015957'
>>> cf.fingerprint('atcg')
'1772371'
>>> cf.fingerprint('entreatment')
'3882324'
New in version 0.4.1.
Changed in version 0.6.0: Changed to return a str and added fingerprint_int method
- fingerprint_int(word: str) int[source]
Return the LACSS coding.
- Parameters
word (str) -- The word to fingerprint
- Returns
The L.A. County Sheriff's System fingerprint as an int
- Return type
int
Examples
>>> cf = LACSS()
>>> cf.fingerprint_int('hat')
4911211
>>> cf.fingerprint_int('niall')
6488374
>>> cf.fingerprint_int('colin')
3015957
>>> cf.fingerprint_int('atcg')
1772371
>>> cf.fingerprint_int('entreatment')
3882324
New in version 0.6.0.
- class abydos.fingerprint.LCCutter(max_length: int = 64)[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
Library of Congress Cutter table encoding.
This is based on the Library of Congress Cutter table encoding scheme, as described at https://www.loc.gov/aba/pcc/053/table.html [oC13]. Handling for numerals is not included.
New in version 0.4.1.
Initialize LCCutter instance.
- Parameters
max_length (int) -- The maximum length of the code returned (defaults to 64)
New in version 0.4.1.
- fingerprint(word: str) str[source]
Return the Library of Congress Cutter table encoding of a word.
- Parameters
word (str) -- The word to fingerprint
- Returns
The Library of Congress Cutter table encoding
- Return type
str
Examples
>>> cf = LCCutter()
>>> cf.fingerprint('hat')
'H38'
>>> cf.fingerprint('niall')
'N5355'
>>> cf.fingerprint('colin')
'C6556'
>>> cf.fingerprint('atcg')
'A834'
>>> cf.fingerprint('entreatment')
'E5874386468'
New in version 0.4.1.
- class abydos.fingerprint.Occurrence(n_bits: int = 16, most_common: Tuple[str, ...] = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
Occurrence Fingerprint.
Based on the occurrence fingerprint from [CislakG17].
New in version 0.3.6.
Initialize Occurrence instance.
- Parameters
n_bits (int) -- Number of bits in the fingerprint returned
most_common (list) -- The most common tokens in the target language, ordered by frequency
New in version 0.4.0.
- fingerprint(word: str) str[source]
Return the occurrence fingerprint.
- Parameters
word (str) -- The word to fingerprint
- Returns
The occurrence fingerprint
- Return type
str
Examples
>>> of = Occurrence()
>>> of.fingerprint('hat')
'0110000100000000'
>>> of.fingerprint('niall')
'0010110000100000'
>>> of.fingerprint('colin')
'0001110000110000'
>>> of.fingerprint('atcg')
'0110000000010000'
>>> of.fingerprint('entreatment')
'1110010010000100'
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.6.0: Changed to return a str and added fingerprint_int method
- fingerprint_int(word: str) int[source]
Return the occurrence fingerprint.
- Parameters
word (str) -- The word to fingerprint
- Returns
The occurrence fingerprint as an int
- Return type
int
Examples
>>> of = Occurrence()
>>> of.fingerprint_int('hat')
24832
>>> of.fingerprint_int('niall')
11296
>>> of.fingerprint_int('colin')
7216
>>> of.fingerprint_int('atcg')
24592
>>> of.fingerprint_int('entreatment')
58500
New in version 0.6.0.
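The doctest outputs above match a one-bit-per-letter layout: bit i is set when the i-th most common letter appears anywhere in the word. A minimal sketch under that reading follows; `occurrence_fingerprint_sketch` is a hypothetical name, and padding with zeros when fewer than `n_bits` letters are supplied is an assumption.

```python
MOST_COMMON = ("e", "t", "a", "o", "i", "n", "s", "h",
               "r", "d", "l", "c", "u", "m", "w", "f")


def occurrence_fingerprint_sketch(word: str, n_bits: int = 16,
                                  most_common: tuple = MOST_COMMON) -> str:
    """One bit per letter: 1 if the letter occurs in word, else 0."""
    present = set(word)
    bits = "".join("1" if letter in present else "0"
                   for letter in most_common[:n_bits])
    # Assumption: unused trailing positions are filled with zeros.
    return bits.ljust(n_bits, "0")


fingerprint = occurrence_fingerprint_sketch("hat")
print(fingerprint, int(fingerprint, 2))  # 0110000100000000 24832
```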
- class abydos.fingerprint.OccurrenceHalved(n_bits: int = 16, most_common: Tuple[str, ...] = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
Occurrence Halved Fingerprint.
Based on the occurrence halved fingerprint from [CislakG17].
New in version 0.3.6.
Initialize OccurrenceHalved instance.
- Parameters
n_bits (int) -- Number of bits in the fingerprint returned
most_common (list) -- The most common tokens in the target language, ordered by frequency
New in version 0.4.0.
- fingerprint(word: str) str[source]
Return the occurrence halved fingerprint.
Based on the occurrence halved fingerprint from [CislakG17].
- Parameters
word (str) -- The word to fingerprint
- Returns
The occurrence halved fingerprint
- Return type
str
Examples
>>> ohf = OccurrenceHalved()
>>> ohf.fingerprint('hat')
'0001010000000010'
>>> ohf.fingerprint('niall')
'0000010010100000'
>>> ohf.fingerprint('colin')
'0000001001010000'
>>> ohf.fingerprint('atcg')
'0010100000000000'
>>> ohf.fingerprint('entreatment')
'1111010000110000'
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.6.0: Changed to return a str and added fingerprint_int method
- fingerprint_int(word: str) int[source]
Return the occurrence halved fingerprint.
Based on the occurrence halved fingerprint from [CislakG17].
- Parameters
word (str) -- The word to fingerprint
- Returns
The occurrence halved fingerprint as an int
- Return type
int
Examples
>>> ohf = OccurrenceHalved()
>>> ohf.fingerprint_int('hat')
5122
>>> ohf.fingerprint_int('niall')
1184
>>> ohf.fingerprint_int('colin')
592
>>> ohf.fingerprint_int('atcg')
10240
>>> ohf.fingerprint_int('entreatment')
62512
New in version 0.6.0.
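The doctest outputs above are consistent with giving each letter a pair of bits, one for each half of the word. The sketch below reproduces them under stated assumptions: `occurrence_halved_sketch` is a hypothetical name, and splitting the word at `len(word) // 2` (so the second half takes the extra character of an odd-length word) is inferred from the examples rather than documented here.

```python
MOST_COMMON = ("e", "t", "a", "o", "i", "n", "s", "h",
               "r", "d", "l", "c", "u", "m", "w", "f")


def occurrence_halved_sketch(word: str, n_bits: int = 16,
                             most_common: tuple = MOST_COMMON) -> str:
    """Two bits per letter: presence in the first half, then in the second half."""
    half = len(word) // 2  # assumed split point
    first, second = set(word[:half]), set(word[half:])
    bits = ""
    for letter in most_common[: n_bits // 2]:
        bits += "1" if letter in first else "0"
        bits += "1" if letter in second else "0"
    return bits.ljust(n_bits, "0")


print(occurrence_halved_sketch("hat"))  # 0001010000000010
```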
- class abydos.fingerprint.OmissionKey[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
Omission Key.
The omission key of a word is defined in [PZ84].
New in version 0.3.6.
- fingerprint(word: str) str[source]
Return the omission key.
- Parameters
word (str) -- The word to transform into its omission key
- Returns
The omission key
- Return type
str
Examples
>>> ok = OmissionKey()
>>> ok.fingerprint('The quick brown fox jumped over the lazy dog.')
'JKQXZVWYBFMGPDHCLNTREUIOA'
>>> ok.fingerprint('Christopher')
'PHCTSRIOE'
>>> ok.fingerprint('Niall')
'LNIA'
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
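The doctest outputs above follow a simple rule: the consonants present in the word are emitted in a fixed order, followed by the word's vowels in order of first appearance. A minimal sketch of that rule is shown here. `omission_key_sketch` and `CONSONANT_ORDER` are hypothetical names, and restricting the input to the letters A-Z is a simplifying assumption.

```python
# Fixed consonant ordering used by the omission key.
CONSONANT_ORDER = "JKQXZVWYBFMGPDHCLNTSR"


def omission_key_sketch(word: str) -> str:
    """Consonants in the fixed order, then vowels in order of appearance."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    present = set(letters)
    key = [c for c in CONSONANT_ORDER if c in present]
    for c in letters:
        # Append each vowel once, the first time it is seen.
        if c in "AEIOU" and c not in key:
            key.append(c)
    return "".join(key)


print(omission_key_sketch("Christopher"))  # PHCTSRIOE
```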
- class abydos.fingerprint.Phonetic(phonetic_algorithm: Optional[Union[Callable[[str], str], abydos.phonetic._phonetic._Phonetic]] = None, joiner: str = ' ')[source]
Bases:
abydos.fingerprint._string.String
Phonetic Fingerprint.
A phonetic fingerprint is identical to a standard string fingerprint, as implemented in String, but performs the fingerprinting function after converting the string to its phonetic form, as determined by some phonetic algorithm. This fingerprint is described at [Ope12].
New in version 0.3.6.
Initialize Phonetic instance.
- Parameters
phonetic_algorithm (function) -- A phonetic algorithm that takes a string and returns a string (presumably a phonetic representation of the original string). By default, this function uses double_metaphone().
joiner (str) -- The string that will be placed between each word
New in version 0.4.0.
- fingerprint(phrase: str) str[source]
Return the phonetic fingerprint of a phrase.
- Parameters
phrase (str) -- The string from which to calculate the phonetic fingerprint
- Returns
The phonetic fingerprint of the phrase
- Return type
str
Examples
>>> pf = Phonetic()
>>> pf.fingerprint('The quick brown fox jumped over the lazy dog.')
'0 afr fks jmpt kk ls prn tk'
>>> from abydos.phonetic import Soundex
>>> pf = Phonetic(Soundex())
>>> pf.fingerprint('The quick brown fox jumped over the lazy dog.')
'b650 d200 f200 j513 l200 o160 q200 t000'
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.fingerprint.Position(n_bits: int = 16, most_common: Tuple[str, ...] = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'), bits_per_letter: int = 3)[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
Position Fingerprint.
Based on the position fingerprint from [CislakG17].
New in version 0.3.6.
Initialize Position instance.
- Parameters
n_bits (int) -- Number of bits in the fingerprint returned
most_common (list) -- The most common tokens in the target language, ordered by frequency
bits_per_letter (int) -- The number of bits used to encode each letter's position
New in version 0.4.0.
- fingerprint(word: str) str[source]
Return the position fingerprint.
- Parameters
word (str) -- The word to fingerprint
- Returns
The position fingerprint
- Return type
str
Examples
>>> pf = Position()
>>> pf.fingerprint('hat')
'1110100011111111'
>>> pf.fingerprint('niall')
'1111110101110010'
>>> pf.fingerprint('colin')
'1111111110010111'
>>> pf.fingerprint('atcg')
'1110010001111111'
>>> pf.fingerprint('entreatment')
'0000101011111111'
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.6.0: Changed to return a str and added fingerprint_int method
- fingerprint_int(word: str) int[source]
Return the position fingerprint.
- Parameters
word (str) -- The word to fingerprint
- Returns
The position fingerprint as an int
- Return type
int
Examples
>>> pf = Position()
>>> pf.fingerprint_int('hat')
59647
>>> pf.fingerprint_int('niall')
64882
>>> pf.fingerprint_int('colin')
65431
>>> pf.fingerprint_int('atcg')
58495
>>> pf.fingerprint_int('entreatment')
2815
New in version 0.6.0.
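The doctest outputs above can be reproduced by storing, for each common letter, the zero-based index of its first occurrence in a field of `bits_per_letter` bits, with an all-ones field meaning "absent or too far right". The sketch below follows that reading and is not the library's code: `position_fingerprint_sketch` is a hypothetical name, and the handling of the final, narrower field (with the defaults, a single leftover bit for the sixth letter) is inferred from the examples.

```python
MOST_COMMON = ("e", "t", "a", "o", "i", "n", "s", "h",
               "r", "d", "l", "c", "u", "m", "w", "f")


def position_fingerprint_sketch(word: str, n_bits: int = 16,
                                most_common: tuple = MOST_COMMON,
                                bits_per_letter: int = 3) -> str:
    """Encode the first position of each common letter in fixed-width fields."""
    first_seen = {}
    for index, letter in enumerate(word):
        first_seen.setdefault(letter, index)
    fields = []
    remaining = n_bits
    for letter in most_common:
        if remaining <= 0:
            break
        # The last field may be narrower than bits_per_letter.
        width = min(bits_per_letter, remaining)
        ceiling = 2 ** width - 1
        # Absent letters and late positions both saturate at the ceiling.
        value = min(first_seen.get(letter, ceiling), ceiling)
        fields.append(format(value, f"0{width}b"))
        remaining -= width
    return "".join(fields).ljust(n_bits, "0")


print(position_fingerprint_sketch("hat"))  # 1110100011111111
```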
- class abydos.fingerprint.QGram(qval: int = 2, start_stop: str = '', joiner: str = '', skip: int = 0)[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
Q-Gram Fingerprint.
A q-gram fingerprint is a string consisting of all of the unique q-grams in a string, alphabetized & concatenated. This fingerprint is described at [Ope12].
New in version 0.3.6.
Initialize Q-Gram fingerprinter.
- Parameters
qval (int) -- The length of each q-gram (by default 2)
start_stop (str) -- The start & stop symbol(s) to concatenate on either end of the phrase, as defined in tokenizer.QGrams
joiner (str) -- The string that will be placed between each q-gram
skip (int or Iterable) -- The number of characters to skip; can be an integer, range object, or list
New in version 0.4.0.
- fingerprint(phrase: str) str[source]
Return Q-Gram fingerprint.
- Parameters
phrase (str) -- The string from which to calculate the q-gram fingerprint
- Returns
The q-gram fingerprint of the phrase
- Return type
str
Examples
>>> qf = QGram()
>>> qf.fingerprint('The quick brown fox jumped over the lazy dog.')
'azbrckdoedeleqerfoheicjukblampnfogovowoxpequrortthuiumvewnxjydzy'
>>> qf.fingerprint('Christopher')
'cherhehrisopphristto'
>>> qf.fingerprint('Niall')
'aliallni'
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
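The description above (unique q-grams, alphabetized and concatenated) can be sketched directly. `qgram_fingerprint_sketch` is a hypothetical name; lowercasing the phrase and discarding everything except letters and digits before tokenizing is an assumption inferred from the doctest outputs, and the `start_stop` and `skip` options are left out.

```python
def qgram_fingerprint_sketch(phrase: str, qval: int = 2, joiner: str = "") -> str:
    """Sorted, de-duplicated q-grams of the normalized phrase."""
    # Assumed normalization: lowercase, keep only alphanumeric characters.
    text = "".join(c for c in phrase.lower() if c.isalnum())
    qgrams = {text[i:i + qval] for i in range(len(text) - qval + 1)}
    return joiner.join(sorted(qgrams))


print(qgram_fingerprint_sketch("Niall"))  # aliallni
```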
- class abydos.fingerprint.SkeletonKey[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
Skeleton Key.
The skeleton key of a word is defined in [PZ84].
New in version 0.3.6.
- fingerprint(word: str) str[source]
Return the skeleton key.
- Parameters
word (str) -- The word to transform into its skeleton key
- Returns
The skeleton key
- Return type
str
Examples
>>> sk = SkeletonKey()
>>> sk.fingerprint('The quick brown fox jumped over the lazy dog.')
'THQCKBRWNFXJMPDVLZYGEUIOA'
>>> sk.fingerprint('Christopher')
'CHRSTPIOE'
>>> sk.fingerprint('Niall')
'NLIA'
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
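The doctest outputs above follow the usual construction of the skeleton key: the first letter, then the remaining unique consonants in order of appearance, then the unique vowels in order of appearance. A minimal sketch of that construction follows. `skeleton_key_sketch` is a hypothetical name, and limiting the input to the letters A-Z is a simplifying assumption.

```python
def skeleton_key_sketch(word: str) -> str:
    """First letter, then unique consonants, then unique vowels."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    seen = {first}
    consonants, vowels = [], []
    for c in letters[1:]:
        if c in seen:
            continue  # each letter contributes at most once
        seen.add(c)
        (vowels if c in "AEIOU" else consonants).append(c)
    return first + "".join(consonants) + "".join(vowels)


print(skeleton_key_sketch("orange"))  # ORNGAE
```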
- class abydos.fingerprint.String(joiner: str = ' ')[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
String Fingerprint.
The fingerprint of a string is a string consisting of all of the unique words in a string, alphabetized & concatenated with intervening joiners. This fingerprint is described at [Ope12].
New in version 0.3.6.
Initialize String instance.
- Parameters
joiner (str) -- The string that will be placed between each word
New in version 0.4.0.
- fingerprint(phrase: str) str[source]
Return string fingerprint.
- Parameters
phrase (str) -- The string from which to calculate the fingerprint
- Returns
The fingerprint of the phrase
- Return type
str
Example
>>> sf = String()
>>> sf.fingerprint('The quick brown fox jumped over the lazy dog.')
'brown dog fox jumped lazy over quick the'
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
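The definition above (unique words, alphabetized, joined) can be sketched in a few lines. `string_fingerprint_sketch` is a hypothetical name, and the normalization shown (lowercasing and dropping characters that are neither alphanumeric nor whitespace) is an assumption inferred from the doctest output.

```python
def string_fingerprint_sketch(phrase: str, joiner: str = " ") -> str:
    """Sorted, de-duplicated words of the normalized phrase."""
    # Assumed normalization: lowercase and strip punctuation.
    cleaned = "".join(c for c in phrase.lower() if c.isalnum() or c.isspace())
    return joiner.join(sorted(set(cleaned.split())))


print(string_fingerprint_sketch("The quick brown fox jumped over the lazy dog."))
# brown dog fox jumped lazy over quick the
```

Because word order, case, punctuation, and repeated words are all discarded, phrases that differ only in those respects map to the same fingerprint.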
- class abydos.fingerprint.SynonameToolcode[source]
Bases:
abydos.fingerprint._fingerprint._Fingerprint
Synoname Toolcode.
Cf. [Gro91, JPGTrust91].
New in version 0.3.6.
- fingerprint(lname: str, fname: str = '', qual: str = '', normalize: int = 0) str[source]
Build the Synoname toolcode.
- Parameters
lname (str) -- Last name
fname (str) -- First name (can be blank)
qual (str) -- Qualifier
normalize (int) -- Normalization mode (0, 1, or 2)
- Returns
The transformed names and the synoname toolcode, separated by commas
- Return type
str
Examples
>>> st = SynonameToolcode()
>>> st.fingerprint('hat')
'hat,,0000000003$$h'
>>> st.fingerprint('niall')
'niall,,0000000005$$n'
>>> st.fingerprint('colin')
'colin,,0000000005$$c'
>>> st.fingerprint('atcg')
'atcg,,0000000004$$a'
>>> st.fingerprint('entreatment')
'entreatment,,0000000011$$e'
>>> st.fingerprint('Ste.-Marie', 'Count John II', normalize=2)
'ste.-marie ii,count john,0200491310$015b049a127c$smcji'
>>> st.fingerprint('Michelangelo IV', '', 'Workshop of')
'michelangelo iv,,3000550015$055b$mi'
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.6.0: Changed to return a comma-separated string instead of 3-tuple of strs
- fingerprint_tuple(lname: str, fname: str = '', qual: str = '', normalize: int = 0) Tuple[str, str, str][source]
Build the Synoname toolcode.
- Parameters
lname (str) -- Last name
fname (str) -- First name (can be blank)
qual (str) -- Qualifier
normalize (int) -- Normalization mode (0, 1, or 2)
- Returns
The transformed names and the synoname toolcode
- Return type
tuple
Examples
>>> st = SynonameToolcode()
>>> st.fingerprint_tuple('hat')
('hat', '', '0000000003$$h')
>>> st.fingerprint_tuple('niall')
('niall', '', '0000000005$$n')
>>> st.fingerprint_tuple('colin')
('colin', '', '0000000005$$c')
>>> st.fingerprint_tuple('atcg')
('atcg', '', '0000000004$$a')
>>> st.fingerprint_tuple('entreatment')
('entreatment', '', '0000000011$$e')
>>> st.fingerprint_tuple('Ste.-Marie', 'Count John II', normalize=2)
('ste.-marie ii', 'count john', '0200491310$015b049a127c$smcji')
>>> st.fingerprint_tuple('Michelangelo IV', '', 'Workshop of')
('michelangelo iv', '', '3000550015$055b$mi')
New in version 0.6.0.