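The task: build a CSV file from a 4 GB text file whose columns have constant widths.
The column widths are known, e.g. [col1: 4 chars wide, col2: 2 chars wide, etc.].
The file can only contain [A-Z0-9] ASCII characters, so escaping cells makes no sense.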
I have:

    $ cat example.txt
    AAAABBCCCC...
    AAA1B1CCC1...
    ...

(72 chars per line, usually 50 million lines)

I need:

    $ cat done.csv
    AAAA,BB,CCCC,...
    AAA1,B1,CCC1,...
    ...
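To pin down the transformation itself, here is a minimal reference sketch. It only illustrates the specification and is not meant to be fast. The names `splitFixed`, `toCsvLine` and `toCsv` are hypothetical, and the widths `[4,2,4]` used in the usage note are taken from the short example above rather than from the real 72-character layout.

```haskell
import qualified Data.ByteString.Char8 as BC

-- | Split one row into fields of the given widths (hypothetical helper).
-- A row shorter than the sum of the widths yields short or empty trailing fields.
splitFixed :: [Int] -> BC.ByteString -> [BC.ByteString]
splitFixed []     _   = []
splitFixed (w:ws) row = field : splitFixed ws rest
  where (field, rest) = BC.splitAt w row

-- | Join the fields of one row with commas.
toCsvLine :: [Int] -> BC.ByteString -> BC.ByteString
toCsvLine widths = BC.intercalate (BC.pack ",") . splitFixed widths

-- | Convert a whole input, one row per line.
toCsv :: [Int] -> BC.ByteString -> BC.ByteString
toCsv widths = BC.unlines . map (toCsvLine widths) . BC.lines
```

For example, `toCsvLine [4,2,4] (BC.pack "AAAABBCCCC")` evaluates to `"AAAA,BB,CCCC"`. A version like this is useful mainly as a correctness check against the faster implementations.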
This is my fastest code in Haskell.
It takes about 2 minutes to process the whole 4 GB file.
I need 30 seconds at most.
    {-# LANGUAGE BangPatterns #-}
    import qualified Data.ByteString.Lazy as BL
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Unsafe as U
    import Data.ByteString.Lazy.Builder
    import Data.Monoid
    import Data.List

    -- 0 marks a byte to copy, 1 marks a column boundary
    col_sizes = intercalate [1] $ map (`replicate` 0) cs
      where cs = [4,4,3,5,1,10,2,10]

    sp = char8 ','   -- column separator
    nl = char8 '\n'

    separator !cs !cl !xs !xl !ci !xi
      | c == 1    = ps
      | xi == xl  = mempty  -- at the end of bytestring, end recursion
      | cl == ci  = pr
      | otherwise = pc
      where
        c  = U.unsafeIndex cs ci          -- get column separation indicator
        w  = word8 . U.unsafeIndex xs     -- get char from BS at position
        p  = separator cs cl xs xl        -- partial recursion call
        pr = nl <> p 0 (xi + 1)           -- end of row, put '\n', reset counter, recur
        ps = sp <> p (ci + 1) xi          -- end of column, put column separator, recur
        pc = w xi <> p (ci + 1) (xi + 1)  -- in the middle of column, copy byte, recur

    main = do
      contents <- B.getContents
      BL.putStr . toLazyByteString $ init_sep sp_after_char contents

    init_sep cs xs = separator cs (l cs) xs (l xs) 0 0
      where l = fromIntegral . B.length

    sp_after_char = B.pack col_sizes
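Here is my implementation in C (too much to paste here…).
It takes about 5 seconds to process the same file,
so my Haskell code is approx. 20 times slower.
Since Haskell's ByteString filter and map are faster than my implementation in C
(both take less than 2 seconds to process the same file while doing some simple modification),
I hope there is something wrong with my Haskell code and that I will not be forced to use C.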
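Update: a test data generator is available here: http://pastebin.com/aJ3RW3jG
In production, the data is piped from one binary to another, so there is no hard-drive IO.
To test the solutions I use an SSD drive, but I think Ext4 caches the file in RAM anyway: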
    $ time cat test.txt > /dev/null
    cat test.txt > /dev/null  0,00s user 0,35s system 99% cpu 0,353 total
Vanilla generator:

    $ time ./data_builder | head -50000000 > /dev/null
    ./data_builder  0,02s user 1,09s system 30% cpu 3,709 total
    head -50000000 > /dev/null  2,95s user 0,76s system 99% cpu 3,708 total
My C solution:

    $ time ./tocsvc < test.txt > /dev/null
    ./tocsvc < test.txt > /dev/null  5,35s user 0,35s system 100% cpu 5,689 total

With the generator:

    $ time ./data_builder | head -50000000 | ./tocsvc > /dev/null
    ./data_builder  0,18s system 18% cpu 6,460 total
    head -50000000  3,15s user 1,19s system 67% cpu 6,459 total
    ./tocsvc > /dev/null  5,81s user 0,55s system 98% cpu 6,459 total
@GabrielGonzalez's Haskell solution:

    $ time ./tocsvh1 < test.txt > /dev/null
    ./tocsv < test.txt > /dev/null  19,56s user 0,41s system 100% cpu 19,950 total

With the generator:

    $ time ./data_builder | head -50000000 | ./tocsvh1 > /dev/null
    ./data_builder  0,11s user 3,04s system 7% cpu 41,320 total
    head -50000000  7,29s user 3,56s system 26% cpu 41,319 total
    ./tocsvh2 > /dev/null  33,01s user 2,42s system 85% cpu 41,327 total
My Haskell solution:

    $ time ./tocsvh2 < test.txt > /dev/null
    ./tocsvh2 < test.txt > /dev/null  128,63s user 2,95s system 100% cpu 2:11,45 total

With the generator:

    $ time ./data_builder | head -50000000 | ./tocsvh2 > /dev/null
    ./data_builder  0,26s system 28% cpu 4,526 total
    head -50000000  3,17s user 1,33s system 99% cpu 4,524 total
    ./tocsvh2 > /dev/null  129,95s user 3,33s system 98% cpu 2:14,75 total
@LukeTaylor's solution:

    $ time ./tocsvh3 < test.txt > /dev/null
    ./tocsv < test.txt > /dev/null  324,38s user 4,13s system 100% cpu 5:28,18 total

With the generator:
    $ time ./data_builder | head -50000000 | ./tocsvh3 > /dev/null
    ./data_builder  0,43s user 4,46s system 1% cpu 5:30,34 total
    head -50000000  5,20s user 2,82s system 2% cpu 5:30,34 total
    ./tocsv > /dev/null  329,08s user 4,21s system 100% cpu 5:32,96 total

Solution

By using raw pointer operations, I was able to get within 3x of C:
    import Control.Monad (unless, when, void)
    import Foreign.Safe hiding (void)
    import System.IO
    import Foreign.C.Types

    -- Both buffers hold a whole number of rows (input rows end in '\n',
    -- output rows gain one separator or newline per column)
    bufInSize :: Int
    bufInSize = n * (1024 * 1024 `div` n) where n = sum sizes0 + 1

    bufOutSize :: Int
    bufOutSize = n * (1024 * 1024 `div` n) where n = sum sizes0 + length sizes0

    sizes0 :: [Int]
    sizes0 = [4,10]

    -- I also tried using the C memset using the FFI, but got the same speed
    memcpy :: Ptr Word8 -> Ptr Word8 -> Int -> IO ()
    memcpy dst src n = when (n > 0) $ do
        x <- peek src
        poke dst x
        memcpy (dst `plusPtr` 1) (src `plusPtr` 1) (n - 1)

    main = do
        allocaArray bufInSize  $ \bufIn0  -> do
        allocaArray bufOutSize $ \bufOut0 -> do
        with (44 :: Word8)     $ \cm      -> do  -- 44 is the ASCII code of ','
            let loop bufIn bufOut sizes suffixIn suffixOut = do
                    let (bytesIn, bytesOut, sizes', copy) = case sizes of
                            -- end of row: copy the newline, restart the column list
                            []     -> (1, 1, sizes0, memcpy bufOut bufIn 1)
                            -- last column: no trailing comma
                            [s]    -> (s, s, [], memcpy bufOut bufIn s)
                            -- any other column: copy it, then append a comma
                            s:izes -> (s, s + 1, izes, do
                                memcpy bufOut bufIn s
                                memcpy (bufOut `plusPtr` s) cm 1 )
                    if suffixIn < bytesIn
                    then do
                        eof <- hIsEOF stdin
                        if eof
                        then hPutBuf stdout bufOut0 (bufOut `minusPtr` bufOut0)
                        else do
                            suffixIn' <- hGetBuf stdin bufIn0 bufInSize
                            loop bufIn0 bufOut sizes suffixIn' suffixOut
                    else if suffixOut < bytesOut
                        then do
                            hPutBuf stdout bufOut0 (bufOut `minusPtr` bufOut0)
                            loop bufIn bufOut0 sizes suffixIn bufOutSize
                        else do
                            copy
                            loop (bufIn  `plusPtr` bytesIn )
                                 (bufOut `plusPtr` bytesOut)
                                 sizes'
                                 (suffixIn  - bytesIn )
                                 (suffixOut - bytesOut)
            loop bufIn0 bufOut0 sizes0 0 bufOutSize
Here are some rough timings using a 1,000,000-line input file:
    $ # The C Version
    $ time ./a.out < in.dat > out.dat

    real    0m0.189s
    user    0m0.116s
    sys     0m0.068s

    $ # The Haskell version
    $ time ./csv < in.dat > out2.dat

    real    0m0.536s
    user    0m0.428s
    sys     0m0.104s

    $ diff out.dat out2.dat
    $ # No difference
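The same raw-pointer idea can also be applied to a single row held in a strict ByteString. The following is only a sketch under stated assumptions: `rowToCsv` is a hypothetical helper, it uses `copyBytes` from `Foreign.Marshal.Utils` in place of the hand-written `memcpy` loop, and it has not been benchmarked against either program above.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.ByteString.Internal (unsafeCreate)
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import Data.Word (Word8)
import Foreign.Marshal.Utils (copyBytes)
import Foreign.Ptr (Ptr, castPtr, plusPtr)
import Foreign.Storable (poke)

-- | Hypothetical helper: turn one fixed-width row (without its newline)
-- into a comma-separated row by copying each column with copyBytes.
rowToCsv :: [Int] -> B.ByteString -> B.ByteString
rowToCsv widths row
  -- Assumption: rows that do not match the layout are returned unchanged.
  | null widths || sum widths /= B.length row = row
  | otherwise =
      unsafeCreate outLen $ \dst ->
        unsafeUseAsCStringLen row $ \(src, _) -> go widths (castPtr src) dst
  where
    -- One extra byte per comma between columns.
    outLen = B.length row + length widths - 1

    go :: [Int] -> Ptr Word8 -> Ptr Word8 -> IO ()
    go []     _   _   = return ()
    go [w]    src dst = copyBytes dst src w          -- last column: no comma
    go (w:ws) src dst = do
      copyBytes dst src w                            -- copy the column
      poke (dst `plusPtr` w) (44 :: Word8)           -- 44 is ','
      go ws (src `plusPtr` w) (dst `plusPtr` (w + 1))
```

For example, `rowToCsv [4,2,4] (BC.pack "AAAABBCCCC")` evaluates to `"AAAA,BB,CCCC"`. The length check up front is what keeps the pointer arithmetic inside both buffers.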