Unicode block of a character in Python

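I couldn't find one either. Strange!

Fortunately, the number of Unicode blocks is quite small.

This implementation accepts a one-character Unicode string, just like the functions in unicodedata. If your input is mostly ASCII, this linear search may even be faster than a binary search using bisect or something similar. If I were submitting this for inclusion in the Python standard library, I would probably write it as a binary search through an array of statically initialized structs in C.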

# -*- coding: utf-8 -*-
# Python 2 code: in Python 3, use str and chr in place of unicode and unichr.
import re


def block(ch):
    '''
    Return the Unicode block name for ch, or None if ch has no block.

    >>> block(u'a')
    'Basic Latin'
    >>> block(unichr(0x0b80))
    'Tamil'
    >>> block(unichr(0xe0080))

    '''
    assert isinstance(ch, unicode) and len(ch) == 1, repr(ch)
    cp = ord(ch)
    # Linear scan over the (start, end, name) tuples built by _initBlocks.
    for start, end, name in _blocks:
        if start <= cp <= end:
            return name


def _initBlocks(text):
    # Parse the Blocks.txt contents into the module-level _blocks list.
    global _blocks
    _blocks = []
    # Comment lines start with '#', so they never match this pattern.
    pattern = re.compile(r'([0-9A-F]+)\.\.([0-9A-F]+); (\S.*\S)')
    for line in text.splitlines():
        m = pattern.match(line)
        if m:
            start, end, name = m.groups()
            _blocks.append((int(start, 16), int(end, 16), name))


# retrieved from http://unicode.org/Public/UNIDATA/Blocks.txt
_initBlocks('''
# Blocks-12.0.0.txt
# Date: 2018-07-30, 19:40:00 GMT [KW]
# © 2018 Unicode®, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html
#
# Unicode Character Database
# For documentation, see http://www.unicode.org/reports/tr44/
#
# Format:
# Start Code..End Code; Block Name

# ================================================

# Note:   When comparing block names, casing, whitespace, hyphens,
#         and underbars are ignored.
#         For example, "Latin Extended-A" and "latin extended a" are equivalent.
#         For more information on the comparison of property values,
#            see UAX #44: http://www.unicode.org/reports/tr44/
#
#  All block ranges start with a value where (cp MOD 16) = 0,
#  and end with a value where (cp MOD 16) = 15. In other words,
#  the last hexadecimal digit of the start of range is ...0
#  and the last hexadecimal digit of the end of range is ...F.
#  This constraint on block ranges guarantees that allocations
#  are done in terms of whole columns, and that code chart display
#  never involves splitting columns in the charts.
#
#  All code points not explicitly listed for Block
#  have the value No_Block.

# Property: Block
#
# @missing: 0000..10FFFF; No_Block

0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
0300..036F; Combining Diacritical Marks
0370..03FF; Greek and Coptic
0400..04FF; Cyrillic
0500..052F; Cyrillic Supplement
0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0750..077F; Arabic Supplement
0780..07BF; Thaana
07C0..07FF; NKo
0800..083F; Samaritan
0840..085F; Mandaic
0860..086F; Syriac Supplement
08A0..08FF; Arabic Extended-A
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0C80..0CFF; Kannada
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1200..137F; Ethiopic
1380..139F; Ethiopic Supplement
13A0..13FF; Cherokee
1400..167F; Unified Canadian Aboriginal Syllabics
1680..169F; Ogham
16A0..16FF; Runic
1700..171F; Tagalog
1720..173F; Hanunoo
1740..175F; Buhid
1760..177F; Tagbanwa
1780..17FF; Khmer
1800..18AF; Mongolian
18B0..18FF; Unified Canadian Aboriginal Syllabics Extended
1900..194F; Limbu
1950..197F; Tai Le
1980..19DF; New Tai Lue
19E0..19FF; Khmer Symbols
1A00..1A1F; Buginese
1A20..1AAF; Tai Tham
1AB0..1AFF; Combining Diacritical Marks Extended
1B00..1B7F; Balinese
1B80..1BBF; Sundanese
1BC0..1BFF; Batak
1C00..1C4F; Lepcha
1C50..1C7F; Ol Chiki
1C80..1C8F; Cyrillic Extended-C
1C90..1CBF; Georgian Extended
1CC0..1CCF; Sundanese Supplement
1CD0..1CFF; Vedic Extensions
1D00..1D7F; Phonetic Extensions
1D80..1DBF; Phonetic Extensions Supplement
1DC0..1DFF; Combining Diacritical Marks Supplement
1E00..1EFF; Latin Extended Additional
1F00..1FFF; Greek Extended
2000..206F; General Punctuation
2070..209F; Superscripts and Subscripts
20A0..20CF; Currency Symbols
20D0..20FF; Combining Diacritical Marks for Symbols
2100..214F; Letterlike Symbols
2150..218F; Number Forms
2190..21FF; Arrows
2200..22FF; Mathematical Operators
2300..23FF; Miscellaneous Technical
2400..243F; Control Pictures
2440..245F; Optical Character Recognition
2460..24FF; Enclosed Alphanumerics
2500..257F; Box Drawing
2580..259F; Block Elements
25A0..25FF; Geometric Shapes
2600..26FF; Miscellaneous Symbols
2700..27BF; Dingbats
27C0..27EF; Miscellaneous Mathematical Symbols-A
27F0..27FF; Supplemental Arrows-A
2800..28FF; Braille Patterns
2900..297F; Supplemental Arrows-B
2980..29FF; Miscellaneous Mathematical Symbols-B
2A00..2AFF; Supplemental Mathematical Operators
2B00..2BFF; Miscellaneous Symbols and Arrows
2C00..2C5F; Glagolitic
2C60..2C7F; Latin Extended-C
2C80..2CFF; Coptic
2D00..2D2F; Georgian Supplement
2D30..2D7F; Tifinagh
2D80..2DDF; Ethiopic Extended
2DE0..2DFF; Cyrillic Extended-A
2E00..2E7F; Supplemental Punctuation
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
31C0..31EF; CJK Strokes
31F0..31FF; Katakana Phonetic Extensions
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DBF; CJK Unified Ideographs Extension A
4DC0..4DFF; Yijing Hexagram Symbols
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
A4D0..A4FF; Lisu
A500..A63F; Vai
A640..A69F; Cyrillic Extended-B
A6A0..A6FF; Bamum
A700..A71F; Modifier Tone Letters
A720..A7FF; Latin Extended-D
A800..A82F; Syloti Nagri
A830..A83F; Common Indic Number Forms
A840..A87F; Phags-pa
A880..A8DF; Saurashtra
A8E0..A8FF; Devanagari Extended
A900..A92F; Kayah Li
A930..A95F; Rejang
A960..A97F; Hangul Jamo Extended-A
A980..A9DF; Javanese
A9E0..A9FF; Myanmar Extended-B
AA00..AA5F; Cham
AA60..AA7F; Myanmar Extended-A
AA80..AADF; Tai Viet
AAE0..AAFF; Meetei Mayek Extensions
AB00..AB2F; Ethiopic Extended-A
AB30..AB6F; Latin Extended-E
AB70..ABBF; Cherokee Supplement
ABC0..ABFF; Meetei Mayek
AC00..D7AF; Hangul Syllables
D7B0..D7FF; Hangul Jamo Extended-B
D800..DB7F; High Surrogates
DB80..DBFF; High Private Use Surrogates
DC00..DFFF; Low Surrogates
E000..F8FF; Private Use Area
F900..FAFF; CJK Compatibility Ideographs
FB00..FB4F; Alphabetic Presentation Forms
FB50..FDFF; Arabic Presentation Forms-A
FE00..FE0F; Variation Selectors
FE10..FE1F; Vertical Forms
FE20..FE2F; Combining Half Marks
FE30..FE4F; CJK Compatibility Forms
FE50..FE6F; Small Form Variants
FE70..FEFF; Arabic Presentation Forms-B
FF00..FFEF; Halfwidth and Fullwidth Forms
FFF0..FFFF; Specials
10000..1007F; Linear B Syllabary
10080..100FF; Linear B Ideograms
10100..1013F; Aegean Numbers
10140..1018F; Ancient Greek Numbers
10190..101CF; Ancient Symbols
101D0..101FF; Phaistos Disc
10280..1029F; Lycian
102A0..102DF; Carian
102E0..102FF; Coptic Epact Numbers
10300..1032F; Old Italic
10330..1034F; Gothic
10350..1037F; Old Permic
10380..1039F; Ugaritic
103A0..103DF; Old Persian
10400..1044F; Deseret
10450..1047F; Shavian
10480..104AF; Osmanya
104B0..104FF; Osage
10500..1052F; Elbasan
10530..1056F; Caucasian Albanian
10600..1077F; Linear A
10800..1083F; Cypriot Syllabary
10840..1085F; Imperial Aramaic
10860..1087F; Palmyrene
10880..108AF; Nabataean
108E0..108FF; Hatran
10900..1091F; Phoenician
10920..1093F; Lydian
10980..1099F; Meroitic Hieroglyphs
109A0..109FF; Meroitic Cursive
10A00..10A5F; Kharoshthi
10A60..10A7F; Old South Arabian
10A80..10A9F; Old North Arabian
10AC0..10AFF; Manichaean
10B00..10B3F; Avestan
10B40..10B5F; Inscriptional Parthian
10B60..10B7F; Inscriptional Pahlavi
10B80..10BAF; Psalter Pahlavi
10C00..10C4F; Old Turkic
10C80..10CFF; Old Hungarian
10D00..10D3F; Hanifi Rohingya
10E60..10E7F; Rumi Numeral Symbols
10F00..10F2F; Old Sogdian
10F30..10F6F; Sogdian
10FE0..10FFF; Elymaic
11000..1107F; Brahmi
11080..110CF; Kaithi
110D0..110FF; Sora Sompeng
11100..1114F; Chakma
11150..1117F; Mahajani
11180..111DF; Sharada
111E0..111FF; Sinhala Archaic Numbers
11200..1124F; Khojki
11280..112AF; Multani
112B0..112FF; Khudawadi
11300..1137F; Grantha
11400..1147F; Newa
11480..114DF; Tirhuta
11580..115FF; Siddham
11600..1165F; Modi
11660..1167F; Mongolian Supplement
11680..116CF; Takri
11700..1173F; Ahom
11800..1184F; Dogra
118A0..118FF; Warang Citi
119A0..119FF; Nandinagari
11A00..11A4F; Zanabazar Square
11A50..11AAF; Soyombo
11AC0..11AFF; Pau Cin Hau
11C00..11C6F; Bhaiksuki
11C70..11CBF; Marchen
11D00..11D5F; Masaram Gondi
11D60..11DAF; Gunjala Gondi
11EE0..11EFF; Makasar
11FC0..11FFF; Tamil Supplement
12000..123FF; Cuneiform
12400..1247F; Cuneiform Numbers and Punctuation
12480..1254F; Early Dynastic Cuneiform
13000..1342F; Egyptian Hieroglyphs
13430..1343F; Egyptian Hieroglyph Format Controls
14400..1467F; Anatolian Hieroglyphs
16800..16A3F; Bamum Supplement
16A40..16A6F; Mro
16AD0..16AFF; Bassa Vah
16B00..16B8F; Pahawh Hmong
16E40..16E9F; Medefaidrin
16F00..16F9F; Miao
16FE0..16FFF; Ideographic Symbols and Punctuation
17000..187FF; Tangut
18800..18AFF; Tangut Components
1B000..1B0FF; Kana Supplement
1B100..1B12F; Kana Extended-A
1B130..1B16F; Small Kana Extension
1B170..1B2FF; Nushu
1BC00..1BC9F; Duployan
1BCA0..1BCAF; Shorthand Format Controls
1D000..1D0FF; Byzantine Musical Symbols
1D100..1D1FF; Musical Symbols
1D200..1D24F; Ancient Greek Musical Notation
1D2E0..1D2FF; Mayan Numerals
1D300..1D35F; Tai Xuan Jing Symbols
1D360..1D37F; Counting Rod Numerals
1D400..1D7FF; Mathematical Alphanumeric Symbols
1D800..1DAAF; Sutton SignWriting
1E000..1E02F; Glagolitic Supplement
1E100..1E14F; Nyiakeng Puachue Hmong
1E2C0..1E2FF; Wancho
1E800..1E8DF; Mende Kikakui
1E900..1E95F; Adlam
1EC70..1ECBF; Indic Siyaq Numbers
1ED00..1ED4F; Ottoman Siyaq Numbers
1EE00..1EEFF; Arabic Mathematical Alphabetic Symbols
1F000..1F02F; Mahjong Tiles
1F030..1F09F; Domino Tiles
1F0A0..1F0FF; Playing Cards
1F100..1F1FF; Enclosed Alphanumeric Supplement
1F200..1F2FF; Enclosed Ideographic Supplement
1F300..1F5FF; Miscellaneous Symbols and Pictographs
1F600..1F64F; Emoticons
1F650..1F67F; Ornamental Dingbats
1F680..1F6FF; Transport and Map Symbols
1F700..1F77F; Alchemical Symbols
1F780..1F7FF; Geometric Shapes Extended
1F800..1F8FF; Supplemental Arrows-C
1F900..1F9FF; Supplemental Symbols and Pictographs
1FA00..1FA6F; Chess Symbols
1FA70..1FAFF; Symbols and Pictographs Extended-A
20000..2A6DF; CJK Unified Ideographs Extension B
2A700..2B73F; CJK Unified Ideographs Extension C
2B740..2B81F; CJK Unified Ideographs Extension D
2B820..2CEAF; CJK Unified Ideographs Extension E
2CEB0..2EBEF; CJK Unified Ideographs Extension F
2F800..2FA1F; CJK Compatibility Ideographs Supplement
E0000..E007F; Tags
E0100..E01EF; Variation Selectors Supplement
F0000..FFFFF; Supplementary Private Use Area-A
100000..10FFFF; Supplementary Private Use Area-B

# EOF
''')
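
The text mentions binary search with bisect as an alternative to the linear scan. A minimal Python 3 sketch of that idea follows. It is not the author's implementation: the names SAMPLE_BLOCKS, SAMPLE_STARTS and block_bisect are hypothetical, and the table is a four-row excerpt chosen for illustration, where a real table would hold every range parsed from Blocks.txt. The approach assumes the ranges are sorted by start and do not overlap, which holds for Blocks.txt.

```python
import bisect

# Hypothetical excerpt of the block table, sorted by start code point.
# A real table would contain every range parsed from Blocks.txt.
SAMPLE_BLOCKS = [
    (0x0000, 0x007F, 'Basic Latin'),
    (0x0080, 0x00FF, 'Latin-1 Supplement'),
    (0x0B80, 0x0BFF, 'Tamil'),
    (0xE0000, 0xE007F, 'Tags'),
]
# Parallel list of range starts, which is what bisect searches.
SAMPLE_STARTS = [start for start, _end, _name in SAMPLE_BLOCKS]


def block_bisect(ch, blocks=SAMPLE_BLOCKS, starts=SAMPLE_STARTS):
    """Return the block name for ch, or None if ch is in no listed block."""
    if not (isinstance(ch, str) and len(ch) == 1):
        raise TypeError('expected a single character, got %r' % (ch,))
    cp = ord(ch)
    # Index of the last range whose start is <= cp (or -1 if there is none).
    i = bisect.bisect_right(starts, cp) - 1
    if i < 0:
        return None
    _start, end, name = blocks[i]
    # cp may fall in the gap after this range, so check the upper bound too.
    return name if cp <= end else None
```

With the full table this takes O(log n) comparisons per lookup instead of O(n), although for mostly-ASCII input the linear scan hits the first entry immediately, which is the trade-off the text describes.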


Source: 内存溢出, http://outofmemory.cn/zaji/5640223.html
