更快的ByteString构造技巧

更快的ByteString构造技巧,第1张

概述我是 Haskell的新手,我一直坚持效率问题. 任务是:从4GB文本文件构建CSV文件,其中列具有恒定大小 列大小是已知的,例如[col1:4个字符宽,col2:2个字符宽等等… 文件只能包含[A-Z0-9] ASCII字符,因此转义单元格没有意义 I have: $cat example.txt AAAABBCCCC...AAA1B1CCC1...... (72 chars per 我是 Haskell的新手,我一直坚持效率问题.

任务是:从4GB文本文件构建CSV文件,其中列具有恒定大小

列大小是已知的,例如[col1:4个字符宽,col2:2个字符宽等等…
文件只能包含[A-Z0-9] ASCII字符,因此转义单元格没有意义

I have: $cat example.txt AAAABBCCCC...AAA1B1CCC1...... (72 chars per line,usually 50 mln lines)I need: $cat done.csvAAAA,BB,CCCC,...AAA1,B1,CCC1,......

这是我在Haskell中最快的代码,
大约需要2分钟来处理整个4G​​B文件.
我需要最多30秒

import qualifIEd Data.ByteString.Lazy as BLimport qualifIEd Data.ByteString as Bimport qualifIEd Data.ByteString.Unsafe as Uimport Data.ByteString.Lazy.Builderimport Data.MonoIDimport Data.Listcol_sizes = intercalate [1] $map (`replicate` 0) cs  where    cs = [4,4,3,5,1,10,2,10]sp = char8 ',' -- column separatornl = char8 '\n'separator !cs !cl !xs !xl !ci !xi  | c  == 1   = ps  | xi == xl  = mempty -- at the end of bytestring,end recursion  | cl == ci  = pr  | otherwise = pc  where    c  = U.unsafeIndex cs ci         -- get column separation indicator    w  = word8 . U.unsafeIndex xs    -- get char from BS at position    p  = separator cs cl xs xl       -- partial recursion call    pr = nl   <> p  0       (xi + 1) -- end of row,put '\n',reset counter,recur    ps = sp   <> p (ci + 1)  xi      -- end of column,put column separator,recur    pc = w xi <> p (ci + 1) (xi + 1) -- in the mIDdle of column,copy byte,recurmain = do  contents <- B.getContents  BL.putStr . tolazyByteString $init_sep sp_after_char contentsinit_sep cs xs = separator cs (l cs) xs (l xs) 0 0  where l = fromIntegral . B.lengthsp_after_char = B.pack col_sizes

这是我在C @L_404_1@中的实现
(要把它贴在这里…)
大约需要5秒钟来处理同一个文件

所以我的Haskell代码是约.慢20倍.
因为Haskell ByteString过滤器和map比我在C中的实现更快,
(两者都需要不到2秒来处理相同的文件做一些简单的修改)
我希望我的Haskell代码有问题,我不会被迫使用C.

更新:测试数据生成器在这里可用http://pastebin.com/aJ3RW3jG

在生产时,数据从一个二进制文件传送到另一个二进制文件,因此没有硬盘驱动器IO

测试解决方案我使用SSD驱动器,但我认为Ext4缓存了该文件无论如何在RAM中

time cat test.txt > /dev/nullcat test.txt > /dev/null  0,00s user 0,35s system 99% cpu 0,353 total

香草发生器:

time ./data_builder | head -50000000 > /dev/null./data_builder  0,02s user 1,09s system 30% cpu 3,709 totalhead -50000000 > /dev/null  2,95s user 0,76s system 99% cpu 3,708 total

我的C解决方案:

time ./tocsvc < test.txt > /dev/null ./tocsvc < test.txt > /dev/null  5,35s user 0,35s system 100% cpu 5,689 total

与发电机

time ./data_builder | head -50000000 | ./tocsvc > /dev/null./data_builder  0,18s system 18% cpu 6,460 totalhead -50000000  3,15s user 1,19s system 67% cpu 6,459 total./tocsvc > /dev/null  5,81s user 0,55s system 98% cpu 6,459 total

@GabrIElGonzalez Haskell解决方案

time ./tocsvh1 < test.txt > /dev/null ./tocsv < test.txt > /dev/null  19,56s user 0,41s system 100% cpu 19,950 total

与发电机

time ./data_builder | head -50000000 | ./tocsvh1 > /dev/null ./data_builder  0,11s user 3,04s system 7% cpu 41,320 totalhead -50000000  7,29s user 3,56s system 26% cpu 41,319 total./tocsvh2 > /dev/null  33,01s user 2,42s system 85% cpu 41,327 total

我的Haskell解决方案

time ./tocsvh2 < test.txt > /dev/null ./tocsvh2 < test.txt > /dev/null  128,63s user 2,95s system 100% cpu 2:11,45 total

与发电机

time ./data_builder | head -50000000 | ./tocsvh2 > /dev/null ./data_builder  0,26s system 28% cpu 4,526 totalhead -50000000  3,17s user 1,33s system 99% cpu 4,524 total./tocsvh2 > /dev/null  129,95s user 3,33s system 98% cpu 2:14,75 total

@LukeTaylor解决方案

time ./tocsvh3 < test.txt > /dev/null ./tocsv < test.txt > /dev/null  324,38s user 4,13s system 100% cpu 5:28,18 total

与发电机

time ./data_builder | head -50000000 | ./tocsvh3 > /dev/null ./data_builder  0,43s user 4,46s system 1% cpu 5:30,34 totalhead -50000000  5,20s user 2,82s system 2% cpu 5:30,34 total./tocsv > /dev/null  329,08s user 4,21s system 100% cpu 5:32,96 total
解决方法 通过使用原始指针 *** 作,我能够在C的3倍范围内:
import Control.Monad (unless,when,voID)import Foreign.Safe hIDing (voID)import System.IOimport Foreign.C.TypesbufInSize :: IntbufInSize = n * (1024 * 1024 `div` n) where n = sum sizes0 + 1bufOutSize :: IntbufOutSize = n * (1024 * 1024 `div` n) where n = sum sizes0 + length sizes0sizes0 :: [Int]sizes0 = [4,10]-- I also trIEd using the C memset using the FFI,but got the same speedmemcpy :: Ptr Word8 -> Ptr Word8 -> Int -> IO ()memcpy dst src n = when (n > 0) $do    x <- peek src    poke dst x    memcpy (dst `plusPtr` 1) (src `plusPtr` 1) (n - 1)main = do    allocaArray bufInSize  $\bufIn0  -> do    allocaArray bufOutSize $\bufOut0 -> do    with (44 :: Word8)  $\cm -> do        let loop bufIn bufOut sizes suffixIn suffixOut = do                let (bytesIn,bytesOut,sizes',copy) = case sizes of                        []     -> (1,sizes0,memcpy bufOut bufIn 1)                        [s]    -> (s,s,[],memcpy bufOut bufIn s)                        s:izes -> (s,s + 1,izes,do                            memcpy  bufOut              bufIn s                            memcpy (bufOut `plusPtr` s) cm    1 )                if suffixIn < bytesIn                then do                    eof <- hISEOF stdin                    if eof                        then hPutBuf stdout bufOut0 (bufOut `minusPtr` bufOut0)                        else do                            suffixIn' <- hGetBuf stdin bufIn0 bufInSize                            loop bufIn0 bufOut sizes suffixIn' suffixOut                else if suffixOut < bytesOut                then do                    hPutBuf stdout bufOut0 (bufOut `minusPtr` bufOut0)                    loop bufIn bufOut0 sizes suffixIn bufOutSize                else do                    copy                    loop (bufIn  `plusPtr` bytesIn )                         (bufOut `plusPtr` bytesOut)                         sizes'                         (suffixIn  - bytesIn )                         (suffixOut - bytesOut)        loop bufIn0 bufOut0 sizes0 0 bufOutSize

以下是使用1000000行输入文件的一些基于时间的粗略测量:

$# The C Version$time ./a.out < in.dat > out.datreal    0m0.189suser    0m0.116ssys 0m0.068s$# The Haskell version$time ./csv < in.dat > out2.datreal    0m0.536suser    0m0.428ssys 0m0.104s$diff out.dat out2.dat$# No difference
总结

以上是内存溢出为你收集整理的更快的ByteString构造技巧全部内容,希望文章能够帮你解决更快的ByteString构造技巧所遇到的程序开发问题。

如果觉得内存溢出网站内容还不错,欢迎将内存溢出网站推荐给程序员好友。

欢迎分享,转载请注明来源:内存溢出

原文地址: http://outofmemory.cn/langs/1244634.html

(0)
打赏 微信扫一扫 微信扫一扫 支付宝扫一扫 支付宝扫一扫
上一篇 2022-06-07
下一篇 2022-06-07

发表评论

登录后才能评论

评论列表(0条)

保存