Reading non-UTF-8 text files in Go


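Previously (as mentioned in an older answer), the "easy" way to do this was to use a third-party package that requires cgo and wraps the iconv library. That is undesirable for many reasons. Thankfully, for quite a while now there has been a superior all-Go way of doing this using only packages provided by the Go Authors (not in the main set of packages, but in the Go sub-repositories).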

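The golang.org/x/text/encoding package defines an interface for generic character encodings that can convert to and from UTF-8. The golang.org/x/text/encoding/simplifiedchinese sub-package provides GB18030, GBK, and HZ-GB2312 encoding implementations.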

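Here is an example of reading and writing a GBK-encoded file. Note that the io.Reader and io.Writer returned by transform.NewReader and transform.NewWriter do the encoding "on the fly" as data is read or written.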

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"

    "golang.org/x/text/encoding/simplifiedchinese"
    "golang.org/x/text/transform"
)

// Encoding to use. Since this implements the encoding.Encoding
// interface from golang.org/x/text/encoding you can trivially
// change this out for any of the other implemented encoders,
// e.g. `traditionalchinese.Big5`, `charmap.Windows1252`,
// `korean.EUCKR`, etc.
var enc = simplifiedchinese.GBK

func main() {
    const filename = "example_GBK_file"
    exampleWriteGBK(filename)
    exampleReadGBK(filename)
}

func exampleReadGBK(filename string) {
    // Read UTF-8 from a GBK encoded file.
    f, err := os.Open(filename)
    if err != nil {
        log.Fatal(err)
    }
    r := transform.NewReader(f, enc.NewDecoder())

    // Read converted UTF-8 from `r` as needed.
    // As an example we'll read line-by-line showing what was read:
    sc := bufio.NewScanner(r)
    for sc.Scan() {
        fmt.Printf("Read line: %s\n", sc.Bytes())
    }
    if err = sc.Err(); err != nil {
        log.Fatal(err)
    }

    if err = f.Close(); err != nil {
        log.Fatal(err)
    }
}

func exampleWriteGBK(filename string) {
    // Write UTF-8 to a GBK encoded file.
    f, err := os.Create(filename)
    if err != nil {
        log.Fatal(err)
    }
    w := transform.NewWriter(f, enc.NewEncoder())

    // Write UTF-8 to `w` as desired.
    // As an example we'll write some text from the Wikipedia
    // GBK page that includes Chinese.
    _, err = fmt.Fprintln(w,
        `In 1995, China National Information Technology Standardization
Technical Committee set down the Chinese Internal Code Specification
(Chinese: 汉字内码扩展规范(GBK); pinyin: Hànzì Nèimǎ
Kuòzhǎn Guīfàn (GBK)), Version 1.0, known as GBK 1.0, which is a
slight extension of Codepage 936. The newly added 95 characters were not
found in GB 13000.1-1993, and were provisionally assigned Unicode PUA
code points.`)
    if err != nil {
        log.Fatal(err)
    }

    // Flush anything the transforming writer still holds before
    // closing the underlying file.
    if err = w.Close(); err != nil {
        log.Fatal(err)
    }
    if err = f.Close(); err != nil {
        log.Fatal(err)
    }
}




Feel free to share; when reposting, please credit the source: 内存溢出

Original URL: http://outofmemory.cn/zaji/5086687.html
