fuzzychinese #

fuzzychinese._character_to_stroke #

Stroke Objects #

class Stroke(object)

A class to translate a Chinese character into strokes.

Arguments:

dictionary_filepath str - default=None. File path for a user-provided dictionary. The default dictionary is used if not specified.

A valid dictionary is a UTF-8 encoded text file with two columns separated by a space. The first column is the character and the second column is its decomposition, in which each character stands for one stroke. Note that the decomposition does not have to consist of strokes; it can be numbers, letters, or any other sequence of characters.

An example dictionary:

Character	Strokes
上	〡一一
下	一〡㇔
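
As a rough illustration of the file format described above, the following sketch parses two-column lines into a plain mapping. The function name `load_decomposition_dictionary` is hypothetical and is not part of fuzzychinese; the sketch also assumes that the "Character / Strokes" header is only a table heading and does not appear in the file itself.

```python
import io


def load_decomposition_dictionary(lines):
    """Parse 'character<whitespace>decomposition' lines into a dict.

    Hypothetical helper for illustration; not the library's own loader.
    """
    mapping = {}
    for line in lines:
        parts = line.split()  # splits on spaces as well as tabs
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        character, decomposition = parts
        mapping[character] = decomposition
    return mapping


# In-memory stand-in for a UTF-8 dictionary file holding the two example rows.
sample_file = io.StringIO("上\t〡一一\n下\t一〡㇔\n")
stroke_dictionary = load_decomposition_dictionary(sample_file)
print(stroke_dictionary["下"])
```

A real file would be opened with `open(path, encoding="utf-8")` and passed to the same function in place of the in-memory sample.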

get_stroke #

 | get_stroke(character, placeholder='', raise_error=False)

Decompose a character into strokes based on the dictionary.

If a Chinese character cannot be decomposed, the character itself is returned. If the character is not Chinese, the placeholder is returned.

Arguments:

character str - A Chinese character to be decomposed.
placeholder str - default=''. Output to use when the character is not Chinese.
raise_error boolean - default=False. If True, raise an error when a character cannot be decomposed. The default action is to show a warning.

Returns:

str - decomposition results.
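
The three outcomes described above (decomposition found, Chinese character not in the dictionary, non-Chinese character) can be sketched as follows. This is not the library's implementation: the function `decompose`, the `is_chinese` range check (CJK Unified Ideographs only) and the choice of `KeyError` are all assumptions made for illustration.

```python
import warnings


def is_chinese(character):
    # Assumption: "Chinese" means a single code point in U+4E00..U+9FFF.
    return len(character) == 1 and "\u4e00" <= character <= "\u9fff"


def decompose(character, dictionary, placeholder="", raise_error=False):
    """Hypothetical lookup mirroring the behaviour documented for get_stroke."""
    if not is_chinese(character):
        return placeholder  # non-Chinese input: return the placeholder
    if character in dictionary:
        return dictionary[character]  # normal case: return the decomposition
    message = f"Unable to decompose {character!r}"
    if raise_error:
        raise KeyError(message)  # exception type is an assumption
    warnings.warn(message)
    return character  # fall back to the character itself


toy_dictionary = {"上": "〡一一"}
print(decompose("上", toy_dictionary))
print(decompose("A", toy_dictionary, placeholder="?"))
```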

fuzzychinese._fuzzy_chinese_match #

FuzzyChineseMatch Objects #

class FuzzyChineseMatch(object)

The main class for fuzzy matching.

Matches a collection of Chinese words against a target list of words.

Arguments:

ngram_range tuple - (min_n, max_n), default=(3, 3). The lower and upper bounds of the n-gram sizes to extract. All values of n such that min_n <= n <= max_n are used.
analyzer string - {'char', 'radical', 'stroke'}, default='stroke'. Whether the features should be made of character, radical, or stroke n-grams.
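
To make the effect of ngram_range concrete, here is a minimal sketch of n-gram extraction over a sequence of units (strokes, radicals, or characters, depending on the analyzer). The helper `extract_ngrams` is hypothetical, and the input is a made-up stroke sequence; how the library builds the sequence for a multi-character word is not covered here.

```python
def extract_ngrams(sequence, ngram_range=(3, 3)):
    """Return every n-gram of `sequence` with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the sequence.
        for start in range(len(sequence) - n + 1):
            ngrams.append(sequence[start:start + n])
    return ngrams


stroke_sequence = "〡一一一〡㇔"  # made-up sequence of six stroke symbols
print(extract_ngrams(stroke_sequence, (3, 3)))
```

With the default (3, 3), a sequence of six units yields four overlapping trigrams; widening the range to (2, 3) would add the five bigrams as well.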

fit #

 | fit(X)

Learn the words in X.

Arguments:

X list, pd.Series, 1d np.array or 1d pd.DataFrame - An iterable that yields Chinese strings (UTF-8).

Returns:

The fitted FuzzyChineseMatch object.

fit_transform #

 | fit_transform(X, Y=None, n=3)

Learn the words in X and transform.

If Y is not passed, find similar words within X itself. If Y is passed, find the similar words in X for each word in Y.

Arguments:

X list, pd.Series, 1d np.array or 1d pd.DataFrame - An iterable that yields Chinese strings (UTF-8).
Y list, pd.Series, 1d np.array or 1d pd.DataFrame - default=None. An iterable that yields Chinese strings (UTF-8).
n int - default=3. Number of top matches to return.

Returns:

res A numpy matrix - [n_samples, n_matches]. Each row contains the top n matches for the corresponding input row, sorted by descending similarity.
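
The shape of this result can be illustrated with a small, self-contained sketch. It is not the library's algorithm: it uses character bigrams and cosine similarity over raw n-gram counts as a stand-in scoring method, and the names `ngram_counts`, `cosine` and `top_matches` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(word, n=2):
    """Count the overlapping n-grams of a word (character bigrams by default)."""
    return Counter(word[i:i + n] for i in range(len(word) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[gram] * b[gram] for gram in a if gram in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_matches(X, Y=None, n=3):
    """For each word in Y (or in X when Y is None), list the n most similar words in X."""
    queries = X if Y is None else Y
    target_vectors = [ngram_counts(word) for word in X]
    result = []
    for query in queries:
        query_vector = ngram_counts(query)
        scored = [
            (cosine(query_vector, vector), position)
            for position, vector in enumerate(target_vectors)
        ]
        # Highest score first; ties keep the original order of X.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result.append([X[position] for _, position in scored[:n]])
    return result


targets = ["北京市", "北京大学", "上海市"]
print(top_matches(targets, ["北京"], n=2))
```

Each inner list corresponds to one input word, so the result has one row per sample and n columns, matching the [n_samples, n_matches] layout described above.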

transform #

 | transform(Y, n=3)

For each word in Y, find the most similar words among the words learned by fit.

Arguments:

Y list, pd.Series, 1d np.array or 1d pd.DataFrame - An iterable that yields Chinese strings (UTF-8).
n int - default=3. Number of top matches to return.

Returns:

res A numpy matrix - [n_samples, n_matches]. Each row contains the top n matches for the corresponding input row, sorted by descending similarity.

get_similarity_score #

 | get_similarity_score()

Return the similarity scores from the last transform call.

Returns:

res A numpy matrix - [n_samples, n_matches]. Each row contains the similarity scores of the top n matches.

get_index #

 | get_index()

Return the original indices of the matched words.

Returns:

res A numpy matrix - [n_samples, n_matches]. Each row contains the indices of the top n matches. The original index is returned if one exists.
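
The score and index results are parallel: position (i, j) in each refers to the j-th best match for input row i. The sketch below shows one way such parallel results could be derived from a full similarity table. The function `top_n_per_row`, the score values and the labels are all made up for illustration and say nothing about how the library stores its results.

```python
def top_n_per_row(similarity, n=3, labels=None):
    """Return (indices, scores) of the n highest-scoring columns in each row.

    `labels` plays the role of an original index: when given, positions
    are translated to labels; otherwise plain positions are returned.
    """
    indices, scores = [], []
    for row in similarity:
        # Column positions ordered by descending score, ties by position.
        order = sorted(range(len(row)), key=lambda j: (-row[j], j))[:n]
        scores.append([row[j] for j in order])
        indices.append([labels[j] if labels is not None else j for j in order])
    return indices, scores


# Made-up scores: two input words (rows) against three target words (columns).
similarity = [
    [0.2, 0.9, 0.5],
    [0.7, 0.1, 0.4],
]
labels = ["id_17", "id_02", "id_33"]  # hypothetical original index of the targets
indices, scores = top_n_per_row(similarity, n=2, labels=labels)
print(indices)
print(scores)
```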

compare_two_columns #

 | compare_two_columns(X, Y)

Compare two columns and calculate the similarity score for each pair, row by row.

Arguments:

X list, pd.Series, 1d np.array or 1d pd.DataFrame - An iterable that yields Chinese strings (UTF-8).
Y list, pd.Series, 1d np.array or 1d pd.DataFrame - Must have the same length as X. An iterable that yields Chinese strings (UTF-8).

Returns:

res A numpy matrix - The two original columns plus a new column holding the similarity score.
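
Unlike transform, this comparison is strictly row by row: the i-th item of X is scored only against the i-th item of Y. A minimal sketch of that pairing follows. It swaps in the Dice coefficient over character bigrams as a stand-in similarity, and the names `bigrams`, `dice` and `compare_rows` are hypothetical rather than library functions.

```python
def bigrams(word):
    """Set of overlapping character bigrams in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice coefficient over character bigrams (stand-in similarity)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a and not grams_b:
        return 1.0 if a == b else 0.0  # words too short to have bigrams
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def compare_rows(X, Y, similarity=dice):
    """Return (x, y, score) for each aligned pair of X and Y."""
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same length")
    return [(x, y, similarity(x, y)) for x, y in zip(X, Y)]


rows = compare_rows(["北京市", "上海"], ["北京市", "广州"])
for row in rows:
    print(row)
```

The output keeps both inputs next to their score, which mirrors the three-column layout of the documented return value.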

fuzzychinese._character_to_radical #

Radical Objects #

class Radical(object)

Translate a Chinese character into radicals.

Arguments:

dictionary_filepath str - default=None. File path for a user-provided dictionary. The default dictionary is used if not specified.

A valid dictionary is a UTF-8 encoded text file with two columns separated by a space. The first column is the character and the second column is its decomposition, in which each character stands for one radical. Note that the decomposition does not have to consist of radicals; it can be numbers, letters, or any other sequence of characters.

An example dictionary:

Character	Radicals
思	田心
疆	弓土畺
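
A custom dictionary in this format can be produced with a few lines of standard-library code. The sketch below writes the two example rows to a temporary UTF-8 file and reads them back; the helper `write_dictionary` and the file name are hypothetical.

```python
import os
import tempfile


def write_dictionary(path, rows):
    """Write (character, decomposition) pairs, one space-separated pair per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for character, decomposition in rows:
            handle.write(f"{character} {decomposition}\n")


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "my_radicals.txt")  # hypothetical file name
    write_dictionary(path, [("思", "田心"), ("疆", "弓土畺")])
    # Read the file back to check what was stored.
    with open(path, encoding="utf-8") as handle:
        written = handle.read()

print(written)
```

The path of a file written this way is what would be supplied as dictionary_filepath; in practice the file would live somewhere more permanent than a temporary directory.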

get_radical #

 | get_radical(character, placeholder='', raise_error=False)

Decompose a character into radicals based on the dictionary.

If a Chinese character cannot be decomposed, the character itself is returned. If the character is not Chinese, the placeholder is returned.

Arguments:

character str - A Chinese character to be decomposed.
placeholder str - default=''. Output to use when the character is not Chinese.
raise_error boolean - default=False. If True, raise an error when a character cannot be decomposed. The default action is to show a warning.

Returns:

str - decomposition results.

fuzzychinese._utils #
