unicode nomalization
Canonical and Compatibility Equivalence
http://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence
Examples of Canonical Equivalence
| Subtype | Examples | 
|---|---|
| Combining sequence | Ç ↔ C+◌̧ | 
| Ordering of combining marks | q+◌̇+◌̣ ↔ q+◌̣+◌̇ | 
| Hangul & conjoining jamo | 가 ↔ ᄀ +ᅡ | 
| Singleton equivalence | Ω ↔ Ω | 
Examples of Compatibility Equivalence
| Subtype | Examples | 
|---|---|
| Font variants | ℌ → H | 
| Linebreaking differences | [NBSP] → [SPACE] | 
| Positional variant forms | ﻉ → ع | 
| Circled variants | ① → 1 | 
| Width variants | カ → カ | 
| Rotated variants | ︷ → { | 
| Superscripts/subscripts | i⁹ → i9 | 
| Squared characters | ㌀ → アパート | 
| Fractions | ¼ → 1⁄4 | 
| Other | dž → dž | 
Normalization Forms
Form | Description – | – Normalization Form D (NFD) |Canonical Decomposition Normalization Form C (NFC) |Canonical Decomposition,followed by Canonical Composition Normalization Form KD (NFKD) |Compatibility Decomposition Normalization Form KC (NFKC) |Compatibility Decomposition,followed by Canonical Composition
unicode charater category
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category http://www.fileformat.info/info/unicode/category/index.htm
Code Description [Cc] Other, Control [Cf] Other, Format [Cn] Other, Not Assigned (no characters in the file have this property) [Co] Other, Private Use [Cs] Other, Surrogate [LC] Letter, Cased [Ll] Letter, Lowercase [Lm] Letter, Modifier [Lo] Letter, Other [Lt] Letter, Titlecase [Lu] Letter, Uppercase [Mc] Mark, Spacing Combining [Me] Mark, Enclosing [Mn] Mark, Nonspacing [Nd] Number, Decimal Digit [Nl] Number, Letter [No] Number, Other [Pc] Punctuation, Connector [Pd] Punctuation, Dash [Pe] Punctuation, Close [Pf] Punctuation, Final quote (may behave like Ps or Pe depending on usage) [Pi] Punctuation, Initial quote (may behave like Ps or Pe depending on usage) [Po] Punctuation, Other [Ps] Punctuation, Open [Sc] Symbol, Currency [Sk] Symbol, Modifier [Sm] Symbol, Math [So] Symbol, Other [Zl] Separator, Line [Zp] Separator, Paragraph [Zs] Separator, Space
python code
import unicodedata
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)
# Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn' and c in all_letters)
# show case
str = 'Abelló'
print([c for c in str])
print([c for c in unicodedata.normalize('NFD', str)])
print([c for c in unicodedata.normalize('NFD', str) if unicodedata.category(c)!= 'Mn'])
print([c for c in unicodedata.normalize('NFD', str) if unicodedata.category(c)!= 'Mn'  and c in all_letters])
# result 
['A', 'b', 'e', 'l', 'l', 'ó']
['A', 'b', 'e', 'l', 'l', 'o', '́']
['A', 'b', 'e', 'l', 'l', 'o']
['A', 'b', 'e', 'l', 'l', 'o']