Unicode

unicode nomalization

Canonical and Compatibility Equivalence

http://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence

Examples of Canonical Equivalence

Subtype Examples
Combining sequence Ç ↔ C+◌̧
Ordering of combining marks q+◌̇+◌̣ ↔ q+◌̣+◌̇
Hangul & conjoining jamo 가 ↔ ᄀ +ᅡ
Singleton equivalence Ω ↔ Ω

Examples of Compatibility Equivalence

Subtype Examples
Font variants ℌ → H
Linebreaking differences [NBSP] → [SPACE]
Positional variant forms ﻉ → ‌ع‌
Circled variants ① → 1
Width variants カ → カ
Rotated variants ︷ → {
Superscripts/subscripts i⁹ → i9
Squared characters ㌀ → アパート
Fractions ¼ → 14
Other dž → dž

Normalization Forms

Form | Description – | – Normalization Form D (NFD) |Canonical Decomposition Normalization Form C (NFC) |Canonical Decomposition,followed by Canonical Composition Normalization Form KD (NFKD) |Compatibility Decomposition Normalization Form KC (NFKC) |Compatibility Decomposition,followed by Canonical Composition

unicode charater category

https://en.wikipedia.org/wiki/Unicode_character_property#General_Category http://www.fileformat.info/info/unicode/category/index.htm

Code Description [Cc] Other, Control [Cf] Other, Format [Cn] Other, Not Assigned (no characters in the file have this property) [Co] Other, Private Use [Cs] Other, Surrogate [LC] Letter, Cased [Ll] Letter, Lowercase [Lm] Letter, Modifier [Lo] Letter, Other [Lt] Letter, Titlecase [Lu] Letter, Uppercase [Mc] Mark, Spacing Combining [Me] Mark, Enclosing [Mn] Mark, Nonspacing [Nd] Number, Decimal Digit [Nl] Number, Letter [No] Number, Other [Pc] Punctuation, Connector [Pd] Punctuation, Dash [Pe] Punctuation, Close [Pf] Punctuation, Final quote (may behave like Ps or Pe depending on usage) [Pi] Punctuation, Initial quote (may behave like Ps or Pe depending on usage) [Po] Punctuation, Other [Ps] Punctuation, Open [Sc] Symbol, Currency [Sk] Symbol, Modifier [Sm] Symbol, Math [So] Symbol, Other [Zl] Separator, Line [Zp] Separator, Paragraph [Zs] Separator, Space

python code

import unicodedata
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)
# Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn' and c in all_letters)

# show case
str = 'Abelló'
print([c for c in str])
print([c for c in unicodedata.normalize('NFD', str)])
print([c for c in unicodedata.normalize('NFD', str) if unicodedata.category(c)!= 'Mn'])
print([c for c in unicodedata.normalize('NFD', str) if unicodedata.category(c)!= 'Mn'  and c in all_letters])

# result 
['A', 'b', 'e', 'l', 'l', 'ó']
['A', 'b', 'e', 'l', 'l', 'o', '́']
['A', 'b', 'e', 'l', 'l', 'o']
['A', 'b', 'e', 'l', 'l', 'o']

Avatar
zhoumingjun

My work interests include devops, gaming and programmable matter.