File.listFiles() handling of Unicode names with JDK 6 (Unicode normalization issue)


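In Unicode, the same letter can be represented in more than one valid way. The characters you are using in your "tricky name" are "latin small letter i with circumflex" and "latin small letter a with ring above".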

You say "note the %CC versus %C3 character representation", but looking closer, what you are seeing are the sequences

i 0xCC 0x82 vs. 0xC3 0xAE

a 0xCC 0x8A vs. 0xC3 0xA5

That is, the first is the letter i followed by 0xCC 0x82, which is the UTF-8 encoding of the Unicode \u0302 "combining circumflex accent" character, while the second is the UTF-8 encoding of \u00EE "latin small letter i with circumflex". Similarly for the other pair: the first is the letter a followed by 0xCC 0x8A, the "combining ring above" character, and the second is "latin small letter a with ring above". Both of these are valid UTF-8 encodings of valid Unicode character strings, but one is in "composed" and the other in "decomposed" format.
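The byte sequences above can be reproduced from Java. The following is a minimal sketch; the class name Utf8Forms and the helper utf8Hex are made up for illustration and are not part of the original question.

```java
import java.nio.charset.Charset;

class Utf8Forms {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Hypothetical helper: renders the UTF-8 bytes of s as space-separated hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(UTF8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Decomposed: base letter followed by a combining mark.
        System.out.println(utf8Hex("i\u0302")); // 0x69 0xCC 0x82
        System.out.println(utf8Hex("a\u030A")); // 0x61 0xCC 0x8A
        // Composed: a single precomposed code point.
        System.out.println(utf8Hex("\u00EE"));  // 0xC3 0xAE
        System.out.println(utf8Hex("\u00E5"));  // 0xC3 0xA5
    }
}
```

Both spellings render identically on screen, which is why the difference only shows up in the byte dump or in a URL-encoded name.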

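OS X HFS Plus volumes store strings (such as file names) as "fully decomposed". A Unix file system really stores whatever the file system driver chooses to store. You cannot make any blanket statements across different types of file systems.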
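To get a feel for what "fully decomposed" means, the sketch below decomposes a composed name with java.text.Normalizer (available since Java 6). It assumes that standard NFD is a close enough stand-in for illustration; the exact decomposition rules a given file system applies may differ from NFD. The class name DecomposeDemo and its helpers are hypothetical.

```java
import java.text.Normalizer;

class DecomposeDemo {
    // Decomposes s using standard Unicode NFD (an approximation of what
    // a decomposing file system might store, not its exact rules).
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Hypothetical helper: lists the code points of s as "U+XXXX" tokens.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(codePoints("\u00EE"));      // U+00EE
        System.out.println(codePoints(nfd("\u00EE"))); // U+0069 U+0302
        System.out.println(codePoints(nfd("\u00E5"))); // U+0061 U+030A
    }
}
```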

For a general discussion of composed versus decomposed forms, see the Wikipedia article on Unicode equivalence, which specifically mentions OS X.

For information on converting between the forms, see Apple's Technical Q&A QA1235 (unfortunately in Objective-C).

A recent e-mail thread on Apple's java-dev mailing list may be of some help to you.

Basically, you need to normalize the decomposed form into the composed form before you can compare the strings.
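In Java 6 and later this normalization can be done with java.text.Normalizer. The following is a rough sketch rather than a definitive fix; the class NormalizedNames and the methods nfc, sameName and findByName are hypothetical names introduced here for illustration.

```java
import java.io.File;
import java.text.Normalizer;

class NormalizedNames {
    // Converts a name to composed form (NFC).
    static String nfc(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    // Compares two names after bringing both to the same normalization form.
    static boolean sameName(String a, String b) {
        return nfc(a).equals(nfc(b));
    }

    // Returns the entry of dir whose name equals wanted after normalization,
    // or null if there is no such entry or dir cannot be listed.
    static File findByName(File dir, String wanted) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return null; // not a directory, or an I/O error occurred
        }
        for (File f : entries) {
            if (sameName(f.getName(), wanted)) {
                return f;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String decomposed = "i\u0302";
        String composed = "\u00EE";
        System.out.println(decomposed.equals(composed));    // false
        System.out.println(sameName(decomposed, composed)); // true
    }
}
```

Normalizing both sides, rather than only the names returned by listFiles(), avoids having to know which form a given file system hands back.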


Source: http://outofmemory.cn/zaji/5478324.html
