Suppose I have a text file that looks like this:

    Item,Date,Time,Location
    1,01/01/2016,13:41,[45.2344:-78.25453]
    2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
    3,01/10/2016,01:27,[51.2344:-86.24432]
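What I'd like to do is read it with pandas.read_csv, but the second data row throws an error because its bracketed Location value contains a comma. This is the code I'm currently using:

    import pandas as pd

    df = pd.read_csv("path/to/file.txt", sep=",", dtype=str)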
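I tried setting quotechar to "[", but that just consumes lines until the next opening bracket, and adding the closing bracket as well results in a "string of length 2 found" error. Any insight would be appreciated. Thanks!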
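Update

Three main solutions were offered:

1) Give the data frame a large number of column names so that all the data can be read in, then post-process it.
2) Find the values in square brackets and put quotes around them.
3) Replace the first n commas in each line with semicolons.

Overall, I don't think option 3 is viable as a general solution (although it works fine for my data), because a) what if a quoted value in an earlier column contains a comma, and b) what if the square-bracket column is not the last column? That leaves solutions 1 and 2. I think solution 2 is more readable, but solution 1 is more efficient: it ran in 1.38 seconds, versus 3.02 seconds for solution 2. The tests were run on a text file with 18 columns and more than 208,000 rows.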
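Solution 2 is described above but its code is not shown. The following is only a minimal sketch of the idea, not the code that was actually timed: it wraps each `[...]` group in double quotes so the default quotechar keeps the embedded commas together. The helper names `quote_bracketed` and `read_bracketed_csv` are hypothetical, and the sketch assumes bracketed values contain no double quotes and no nested brackets.

```python
import io
import re

import pandas as pd

# Sample data copied from the question.
RAW = """Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
"""

# Matches one non-nested [...] group (assumption: brackets are never nested).
BRACKETED = re.compile(r"\[[^\]]*\]")


def quote_bracketed(text):
    """Wrap every [...] group in double quotes (hypothetical helper)."""
    return BRACKETED.sub(lambda m: '"' + m.group(0) + '"', text)


def read_bracketed_csv(source):
    """Read a file-like object whose bracketed fields may contain commas."""
    quoted = quote_bracketed(source.read())
    # The double quotes added above are consumed by the default quotechar.
    return pd.read_csv(io.StringIO(quoted), sep=",", quotechar='"', dtype=str)


df = read_bracketed_csv(io.StringIO(RAW))
print(df)
```

For a real file, an open file handle can be passed in place of `io.StringIO(RAW)`. Unlike option 3, this does not depend on the bracketed column being the last one.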
Best answer

I think you can replace the first 3 occurrences of , in each line of the file with ; and then pass sep=";" to read_csv:

    import io

    import pandas as pd

    with open('file2.csv', 'r') as f:
        lines = f.readlines()

    # Rewrite each line into an in-memory buffer, turning only the
    # first three commas into semicolons.
    fo = io.StringIO()
    fo.writelines(line.replace(',', ';', 3) for line in lines)
    fo.seek(0)

    df = pd.read_csv(fo, sep=';')
    print(df)
    #    Item        Date   Time                            Location
    # 0     1  01/01/2016  13:41                 [45.2344:-78.25453]
    # 1     2  01/03/2016  19:11  [43.3423:-79.23423,41.2342:-81242]
    # 2     3  01/10/2016  01:27                 [51.2344:-86.24432]
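Or you can try this more complicated approach. The main problem is that the separator , between the values inside the list is the same as the separator between the other column values, so you need post-processing: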
    import io

    import pandas as pd

    temp = u"""Item,Date,Time,Location
    1,01/01/2016,13:41,[45.2344:-78.25453]
    2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
    3,01/10/2016,01:27,[51.2344:-86.24432]"""

    # After testing, replace io.StringIO(temp) with the filename.
    # names=range(10) is the estimated maximum number of columns.
    df = pd.read_csv(io.StringIO(temp), names=range(10))
    print(df)
    #       0           1      2                    3               4  \
    # 0  Item        Date   Time             Location             NaN
    # 1     1  01/01/2016  13:41  [45.2344:-78.25453]             NaN
    # 2     2  01/03/2016  19:11   [43.3423:-79.23423  41.2342:-81242
    # 3     3  01/10/2016  01:27  [51.2344:-86.24432]             NaN
    #
    #                  5   6   7   8   9
    # 0              NaN NaN NaN NaN NaN
    # 1              NaN NaN NaN NaN NaN
    # 2  41.2342:-81242] NaN NaN NaN NaN
    # 3              NaN NaN NaN NaN NaN

    # Remove columns that contain only NaN.
    df = df.dropna(how='all', axis=1)
    # Use the first row as the column names.
    df.columns = df.iloc[0, :]
    # Remove the first row.
    df = df[1:]
    # Remove the name of the columns index.
    df.columns.name = None

    # Get the position of the Location column.
    print(df.columns.get_loc('Location'))
    # 3

    # df1 holds Location and every column to its right.
    df1 = df.iloc[:, df.columns.get_loc('Location'):]
    print(df1)
    #               Location             NaN              NaN
    # 1  [45.2344:-78.25453]             NaN              NaN
    # 2   [43.3423:-79.23423  41.2342:-81242  41.2342:-81242]
    # 3  [51.2344:-86.24432]             NaN              NaN

    # Combine the string values of each row into one column.
    # (The original Python 2 code tested for basestring; use str in Python 3.)
    df['Location'] = df1.apply(
        lambda x: ','.join([e for e in x if isinstance(e, str)]), axis=1)

    # Subset of the desired columns.
    print(df[['Item', 'Date', 'Time', 'Location']])
    #   Item        Date   Time                            Location
    # 1    1  01/01/2016  13:41                 [45.2344:-78.25453]
    # 2    2  01/03/2016  19:11  [43.3423:-79.23423,41.2342:-8...
    # 3    3  01/10/2016  01:27                 [51.2344:-86.24432]
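A further option, which is not one of the three solutions discussed in this thread, is to give read_csv a regular-expression separator that only matches commas outside square brackets. The sketch below assumes brackets are balanced and never nested, and that the file contains no quoted fields (pandas documents that regex separators force the slower Python engine and tend to ignore quoting). The name `SEP_OUTSIDE_BRACKETS` is just an illustrative constant.

```python
import io

import pandas as pd

# Sample data copied from the question.
RAW = """Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
"""

# A comma that is NOT followed by a "]" before the next "[" appears,
# i.e. a comma that sits outside any bracketed group.
SEP_OUTSIDE_BRACKETS = r",(?![^\[]*\])"

# engine="python" is required for regular-expression separators.
df_regex = pd.read_csv(
    io.StringIO(RAW),
    sep=SEP_OUTSIDE_BRACKETS,
    engine="python",
    dtype=str,
)
print(df_regex)
```

This keeps the reading step to a single call and does not depend on the position of the bracketed column, at the cost of using the Python parsing engine, which is likely to be slower on a file of the size mentioned in the question.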