Speeding up Python file processing for a huge dataset

I have a large dataset stored as a 17 GB CSV file (fileData) containing a variable number of records (up to about 30,000) for each customer_id. I am trying to search for specific customers (listed in fileSelection, about 1,500 out of a total of 90,000) and copy each customer's records into a separate CSV file (fileOutput).

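I am new to Python, but I am using it because VBA and MATLAB (which I know better) cannot handle the file size. (I write the code in Aptana Studio but run Python directly from the command line for speed. The machine runs 64-bit Windows 7.)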

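The code I have written extracts some of the customers, but it has two problems:
1) It fails to find most of the customers in the large dataset. (I believe they are all in the dataset, but I cannot be completely sure.)
2) It is very slow. Any way of speeding up the code would be appreciated, including code that makes better use of a 16-core PC.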

Here is the code:

```python
def main():
    # Initialisation :
    #  - Identify columns in slection file
    #
    fS = open (fileSelection,"r")
    if fS.mode == "r":
        header = fS.readline()
        selheaderList = header.split(",")
        custkey =   selheaderList.index('CUSTOMER_KEY')
    #
    # Identify columns in dataset file
    fileData = path2+file_data
    fD = open (fileData,"r")
    if fD.mode == "r":
        header = fD.readline()
        dataheaderList = header.split(",")
        custID =   dataheaderList.index('CUSTOMER_ID')
    fD.close()
    # For each customer in the selection file
    customercount=1
    for sr in fS:
        # Find customer key and locate it in customer ID field in dataset
        selrecord = sr.split(",")
        requiredcustomer = selrecord[custkey]
        #Look for required customer in dataset
        found = 0
        fD = open (fileData,"r")
        if fD.mode == "r":
            while found == 0:
                dr = fD.readline()
                if not dr: break
                datrecord = dr.split(",")
                if datrecord[custID] == requiredcustomer:
                    found = 1
                    # Open outputfile
                    fileOutput= path3+file_out_root + str(requiredcustomer)+ ".csv"
                    fO=open(fileOutput,"w+")
                    fO.write(str(header))
                    #copy all records for required customer number
                    while datrecord[custID] == requiredcustomer:
                        fO.write(str(dr))
                        dr = fD.readline()
                        datrecord = dr.split(",")
                    #Close Output file
                    fO.close()
            if found == 1:
                print ("Customer Count "+str(customercount)+ "  Customer ID"+str(requiredcustomer)+" copied.  ")
                customercount = customercount+1
            else:
                print("Customer ID"+str(requiredcustomer)+" not found in dataset")
                fL.write (str(requiredcustomer)+","+"NOT FOUND")
            fD.close()
    fS.close()
```

It has taken several days to extract a few hundred customers, and many more are never found.


Thanks @Paul Cornelius. This is much more efficient. I have adopted your approach, and I also use the csv handling suggested by @Bernardo:

```python
# import Modules
import csv

def main():
    # Initialisation :
    fileSelection = path1+file_selection
    fileData = path2+file_data
    # Step through selection file and create dictionary with required ID's as keys, and empty objects
    with open(fileSelection,'rb') as csvfile:
        selected_IDs = csv.reader(csvfile)
        ID_dict = {}
        for row in selected_IDs:
            ID_dict.update({row[1]:[]})
    # step through data file: for selected customer ID's, append records to dictionary objects
    with open(fileData,'rb') as csvfile:
        dataset = csv.reader(csvfile)
        for row in dataset:
            if row[0] in ID_dict:
                ID_dict[row[0]].extend([row[1]+','+row[4]])
    # write all dictionary objects to csv files
    for row in ID_dict.keys():
        fileOutput = path3+file_out_root+row+'.csv'
        with open(fileOutput,'wb') as csvfile:
            output = csv.writer(csvfile,delimiter='\n')
            output.writerows([ID_dict[row]])
```

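Solution

The task is too involved for a simple answer, but your approach is very inefficient because you have too many nested loops. Try making one pass through the list of customers and, for each one, build a "customer" object holding whatever information you will need later. Put these objects in a dictionary: the keys are the distinct requiredcustomer values and the values are the customer objects. If I were you, I would get this part working first, before fooling around with the big file at all.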
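As a rough illustration of that first step, here is a minimal sketch that reads the selection file once and builds the dictionary. It assumes Python 3 and a header row containing a `CUSTOMER_KEY` column (the column name comes from the question's code); the function name `load_selected_ids` and the choice of an empty list as the "customer object" are placeholders made up for this example.

```python
import csv


def load_selected_ids(selection_path, key_column="CUSTOMER_KEY"):
    """Return a dict mapping each selected customer ID to an empty list."""
    selected = {}
    with open(selection_path, newline="") as selection_file:
        for record in csv.DictReader(selection_file):
            # strip() guards against stray spaces around the ID, which would
            # otherwise make an exact string comparison fail later on.
            key = (record.get(key_column) or "").strip()
            if key:
                selected[key] = []
    return selected
```

This can be tested on a small selection file on its own, for example by checking that `len(load_selected_ids(fileSelection))` is close to the expected 1,500, before the 17 GB file is touched.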

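Now you step through the huge customer data file once, and every time you come across a record whose datarecord[custID] field is in the dictionary, you append a line to that customer's output file. You can use the relatively efficient `in` operator to test for membership in the dictionary.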
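The single pass described above might look roughly like the following sketch. It assumes Python 3, a header row with a `CUSTOMER_ID` column (as in the question's code), and that `wanted_ids` is any dict or set of ID strings. The names `split_selected_customers`, `out_dir`, and `id_column` are invented for this example, and the output files are named after the bare customer ID rather than the question's `file_out_root` prefix.

```python
import csv
import os


def split_selected_customers(data_path, wanted_ids, out_dir, id_column="CUSTOMER_ID"):
    """Copy rows of wanted customers into one CSV per customer in a single pass.

    Returns the set of customer IDs that were actually found in the data file.
    """
    written = set()      # customers that already have an output file from this run
    current_id = None    # customer whose output file is currently open
    out_file = None
    writer = None
    with open(data_path, newline="") as data_file:
        reader = csv.reader(data_file)
        header = next(reader)
        id_index = header.index(id_column)
        try:
            for row in reader:
                if len(row) <= id_index:
                    continue                      # skip blank or short lines
                cust_id = row[id_index]
                if cust_id not in wanted_ids:     # hash lookup, no inner loop
                    continue
                if cust_id != current_id:
                    # Switch output files. Only one file is open at a time;
                    # a customer seen earlier in this run is reopened in
                    # append mode so non-contiguous records are kept.
                    if out_file is not None:
                        out_file.close()
                    mode = "a" if cust_id in written else "w"
                    path = os.path.join(out_dir, cust_id + ".csv")
                    out_file = open(path, mode, newline="")
                    writer = csv.writer(out_file)
                    if mode == "w":
                        writer.writerow(header)
                        written.add(cust_id)
                    current_id = cust_id
                writer.writerow(row)
        finally:
            if out_file is not None:
                out_file.close()
    return written
```

Because the function returns the IDs it found, the customers missing from the dataset can be listed afterwards with `set(wanted_ids) - found`, which also speaks to the first problem in the question. Keeping a single output file open at a time is a deliberate choice here, to avoid holding about 1,500 file handles open at once.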

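No nested loops are needed. Your code reopens and rescans the 17 GB file for each of the roughly 1,500 selected customers, whereas this approach reads it only once.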

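The code as you present it cannot run, because you write to some object named fL without ever opening it. Also, as Tim Pietzcker pointed out, you are not closing your files, because you never actually call the close function.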
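To make the point about close concrete, here is a small self-contained demonstration. The temporary file and the variable names are made up for this example. Writing `f.close` without parentheses only refers to the method and leaves the file open, while a `with` block closes the file automatically when the block exits.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

f = open(path, "w")
f.write("CUSTOMER_ID,VALUE\n")
f.close                        # no parentheses: the method is not called
still_open = not f.closed      # the file is still open at this point
f.close()                      # the actual call closes it

with open(path) as g:          # closed automatically when the block exits
    first_line = g.readline()
closed_after_with = g.closed
```

Using `with` for every file, as the updated code in the question already does, removes the need to remember the explicit close call.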


Original URL: http://outofmemory.cn/langs/1193681.html
