fuzzychinese #
fuzzychinese._character_to_stroke #
Stroke Objects #
class Stroke(object)
A class to translate a Chinese character into strokes.
Arguments:
- dictionary_filepath: str - default=None. File path of a user-provided dictionary. The default dictionary is used if not specified. A valid dictionary is a UTF-8 encoded text file with two columns separated by a space. The first column is the character and the second column is its decomposition, where each char stands for one stroke. Note that the decomposition does not have to consist of strokes; it can be numbers, letters, or any sequence of chars you like.
An example dictionary:
Character Strokes
上 〡一一
下 一〡㇔
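The file format described above can be read with a few lines of Python. The sketch below is not the library's own loader: `load_decomposition_dictionary` is a hypothetical helper, and skipping lines that do not have exactly two columns is an assumption. It writes the two example rows to a temporary file and loads them back.

```python
import os
import tempfile


def load_decomposition_dictionary(filepath):
    """Read a two-column, space-separated UTF-8 dictionary into a dict.

    Hypothetical helper for illustration; not part of fuzzychinese.
    """
    mapping = {}
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            # Assumption: lines without exactly two columns are skipped.
            if len(parts) != 2:
                continue
            character, decomposition = parts
            mapping[character] = decomposition
    return mapping


# Write the two example rows to a temporary file and load them back.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "strokes.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("上 〡一一\n下 一〡㇔\n")
    example = load_decomposition_dictionary(path)
```

Each value is a plain string, so the number of strokes recorded for a character is simply the length of its entry.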
get_stroke #
| get_stroke(character, placeholder='', raise_error=False)
Decompose a character into strokes based on the dictionary.
When a character cannot be decomposed, the character itself is returned. If it is not Chinese, a placeholder is returned.
Arguments:
- character: str - A Chinese character to be decomposed.
- placeholder: str - default=''. Output to be used when the character is not Chinese.
- raise_error: boolean - default=False. If True, raise an error when a character cannot be decomposed. The default action is to show a warning.
Returns:
str - The decomposition result.
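The three documented outcomes (decomposition, the character itself, or the placeholder) can be mimicked on a toy dictionary. This is a sketch of the described behaviour, not the library's code: `decompose` and `is_chinese` are hypothetical names, the Unicode range used to decide what counts as Chinese is an assumption, and so is the choice of `ValueError` as the raised exception.

```python
import warnings

# Tiny illustrative dictionary using the two example rows.
STROKES = {"上": "〡一一", "下": "一〡㇔"}


def is_chinese(character):
    # Assumption: "Chinese" means the CJK Unified Ideographs block
    # (U+4E00 to U+9FFF); the library's own check may differ.
    return "\u4e00" <= character <= "\u9fff"


def decompose(character, placeholder="", raise_error=False):
    """Mimic the documented behaviour of get_stroke on a toy dictionary."""
    if not is_chinese(character):
        # Non-Chinese input yields the placeholder.
        return placeholder
    if character in STROKES:
        return STROKES[character]
    message = f"Character {character!r} can not be decomposed."
    if raise_error:
        # Assumption: the exception type is chosen for illustration only.
        raise ValueError(message)
    warnings.warn(message)
    # Chinese character missing from the dictionary: return it unchanged.
    return character
```

With this sketch, `decompose("上")` returns the stroke string from the dictionary, `decompose("a", placeholder="?")` returns `"?"`, and `decompose("中")` warns and returns `"中"` because the toy dictionary has no entry for it.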
fuzzychinese._fuzzy_chinese_match #
FuzzyChineseMatch Objects #
class FuzzyChineseMatch(object)
The main class for fuzzy matching.
Matches a collection of Chinese words against a target list of words.
Arguments:
- ngram_range: tuple - (min_n, max_n), default=(3, 3). The lower and upper bounds of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n are used.
- analyzer: string - {'char', 'radical', 'stroke'}, default='stroke'. Whether the features should be made of character, radical, or stroke n-grams.
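To make the effect of `ngram_range` concrete, the sketch below extracts every contiguous n-gram with min_n <= n <= max_n from a stroke sequence. `extract_ngrams` is a hypothetical helper; how the library itself joins or pads the strokes of a word is not stated here, so plain concatenation is assumed.

```python
def extract_ngrams(sequence, ngram_range=(3, 3)):
    """Return all contiguous n-grams with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the sequence.
        for start in range(len(sequence) - n + 1):
            ngrams.append(sequence[start:start + n])
    return ngrams


# Assumption: the strokes of 上 and 下 are simply concatenated.
stroke_sequence = "〡一一" + "一〡㇔"
trigrams = extract_ngrams(stroke_sequence, ngram_range=(3, 3))
```

A six-stroke sequence yields four trigrams. Widening the range, for example to (2, 3), adds the shorter n-grams to the same feature list.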
fit #
| fit(X)
Learn the words in X.
Arguments:
- X: list, pd.Series, 1d np.array or 1d pd.DataFrame - An iterable that yields Chinese str in UTF-8.
Returns:
FuzzyChineseMatch object
fit_transform #
| fit_transform(X, Y=None, n=3)
Learn the words in X and transform.
If Y is not passed, find similar words within X itself. If Y is passed, find the similar words in X for each word in Y.
Arguments:
- X: list, pd.Series, 1d np.array or 1d pd.DataFrame - An iterable that yields Chinese str in UTF-8.
- Y: list, pd.Series, 1d np.array or 1d pd.DataFrame - An iterable that yields Chinese str in UTF-8.
- n: int - default=3. Number of top matches to return.
Returns:
- res: A numpy matrix - [n_samples, n_matches]. Each row holds the top n matches for the corresponding input row. Matches are sorted in descending order of similarity.
transform #
| transform(Y, n=3)
For each word in Y, find the most similar words among the words learned by fit.
Arguments:
- Y: list, pd.Series, 1d np.array or 1d pd.DataFrame - An iterable that yields Chinese str in UTF-8.
- n: int - default=3. Number of top matches to return.
Returns:
- res: A numpy matrix - [n_samples, n_matches]. Each row holds the top n matches for the corresponding input row. Matches are sorted in descending order of similarity.
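The fit and transform steps, together with the [n_samples, n_matches] shape of the result, can be illustrated without the library. The sketch below is a stand-in under stated assumptions: it uses character bigrams and cosine similarity over raw counts, which may not be how the library scores matches, and `top_n_matches`, `char_ngrams` and `cosine` are hypothetical names. The word lists are illustrative data.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    # Character bigram counts; a stand-in for a 'char' style analyzer.
    return Counter(word[i:i + n] for i in range(len(word) - n + 1))


def cosine(a, b):
    # Cosine similarity of two count vectors (assumed metric).
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n_matches(targets, queries, n=3):
    """Return the n best targets per query, with scores and indices."""
    # "fit": build one feature vector per target word.
    target_vectors = [char_ngrams(w) for w in targets]
    matches, scores, indices = [], [], []
    # "transform": rank all targets for each query word.
    for query in queries:
        q = char_ngrams(query)
        ranked = sorted(
            ((cosine(q, t), i) for i, t in enumerate(target_vectors)),
            key=lambda pair: -pair[0],
        )[:n]
        matches.append([targets[i] for _, i in ranked])
        scores.append([s for s, _ in ranked])
        indices.append([i for _, i in ranked])
    return matches, scores, indices


targets = ["长白朝鲜族自治县", "长阳土家族自治县", "城步苗族自治县"]
queries = ["长白朝鲜族"]
matches, scores, indices = top_n_matches(targets, queries, n=2)
```

Each of the three results has one row per query and n columns, sorted by descending similarity, which mirrors the relationship between the match matrix, the similarity scores and the indices described in this reference.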
get_similarity_score #
| get_similarity_score()
Return the similarity scores for the last transform call.
Returns:
- res: A numpy matrix - [n_samples, n_matches]. Each row holds the similarity scores of the top n matches.
get_index #
| get_index()
Return the original indices of the matched words.
Returns:
- res: A numpy matrix - [n_samples, n_matches]. Each row holds the indices of the top n matches. The original index is returned if it exists.
compare_two_columns #
| compare_two_columns(X, Y)
Compare two columns and calculate a similarity score for the pair on each row.
Arguments:
- X: list, pd.Series, 1d np.array or 1d pd.DataFrame - An iterable that yields Chinese str in UTF-8.
- Y: list, pd.Series, 1d np.array or 1d pd.DataFrame - Must have the same length as X. An iterable that yields Chinese str in UTF-8.
Returns:
- res: A numpy matrix - The two original columns and a new column with the similarity score.
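Row-wise comparison differs from top-n matching: each word in X is scored only against the word at the same position in Y. A minimal sketch, assuming cosine similarity over character-bigram counts (the library's actual scoring may differ); `compare_columns` and `bigram_cosine` are hypothetical names and the inputs are illustrative.

```python
import math
from collections import Counter


def bigram_cosine(a, b):
    """Cosine similarity of character-bigram counts (an assumed metric)."""
    va = Counter(a[i:i + 2] for i in range(len(a) - 1))
    vb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm = math.sqrt(
        sum(v * v for v in va.values()) * sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def compare_columns(X, Y):
    """Return (x, y, score) for each aligned pair of X and Y."""
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same length.")
    return [(x, y, bigram_cosine(x, y)) for x, y in zip(X, Y)]


rows = compare_columns(["北京市", "上海市"], ["北京市", "上海县"])
```

Each output row carries the two original values followed by their score, matching the three-column result described above.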
fuzzychinese._character_to_radical #
Radical Objects #
class Radical(object)
Translate a Chinese character into radicals.
Arguments:
- dictionary_filepath: str - default=None. File path of a user-provided dictionary. The default dictionary is used if not specified. A valid dictionary is a UTF-8 encoded text file with two columns separated by a space. The first column is the character and the second column is its decomposition, where each char stands for one radical. Note that the decomposition does not have to consist of radicals; it can be numbers, letters, or any sequence of chars you like.
An example dictionary:
Character Radicals
思 田心
疆 弓土畺
get_radical #
| get_radical(character, placeholder='', raise_error=False)
Decompose a character into radicals based on the dictionary.
When a character cannot be decomposed, the character itself is returned. If it is not Chinese, a placeholder is returned.
Arguments:
- character: str - A Chinese character to be decomposed.
- placeholder: str - default=''. Output to be used when the character is not Chinese.
- raise_error: boolean - default=False. If True, raise an error when a character cannot be decomposed. The default action is to show a warning.
Returns:
str - The decomposition result.
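Applied character by character, the same rules turn a whole word into a radical sequence. The sketch below uses the two example entries as a toy dictionary; `word_to_radicals` is a hypothetical helper, and the Unicode range used to decide what counts as Chinese is an assumption rather than the library's own check.

```python
# Toy dictionary built from the two example entries.
RADICALS = {"思": "田心", "疆": "弓土畺"}


def word_to_radicals(word, placeholder=""):
    """Concatenate the radical decomposition of every character in a word."""
    parts = []
    for ch in word:
        if ch in RADICALS:
            parts.append(RADICALS[ch])
        elif "\u4e00" <= ch <= "\u9fff":
            # Assumed check for Chinese; a character without an entry
            # is kept as itself.
            parts.append(ch)
        else:
            # Non-Chinese characters are replaced by the placeholder.
            parts.append(placeholder)
    return "".join(parts)


radicals = word_to_radicals("思疆")
```

Here 新 in a word such as "新疆" would be kept unchanged because the toy dictionary has no entry for it, while a Latin letter would be replaced by the placeholder.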