I compared alternatives 1 and 3 in Datalab with the following code:
from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
import time

# DataFrame to write (rows as lists so the column order is well defined)
my_data = [[1, 2, 3]]
for i in range(0, 100000):
    my_data.append([1, 2, 3])
not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])

# Alternative 1: pandas' to_gbq
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable',
                               Context.default().project_id,
                               chunksize=10000,
                               if_exists='append',
                               verbose=False)
end = time.time()
print("time alternative 1 " + str(end - start))

# Alternative 3: stage the DataFrame in GCS, then write it to a BigQuery table
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)
table.create(schema=table_schema, overwrite=True)

# Write the DataFrame to GCS (Google Cloud Storage); %storage is a Datalab cell magic
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to the BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))
Here are the results for n = {10000, 100000, 1000000}:
n         alternative_1   alternative_3
10000     30.72s          8.14s
100000    162.43s         70.64s
1000000   1473.57s        688.59s
From these results, alternative 3 is faster than alternative 1 at every size tested: roughly 3.8x faster at n = 10000 and a bit over 2x faster at the two larger sizes.
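For comparison, here is a minimal sketch of the same write done with the current google-cloud-bigquery client as a single load job instead of the (now deprecated) datalab API. The dataset and table names are the ones used above; the google-cloud-bigquery and pyarrow packages and default application credentials are assumed, and the timing behaviour will differ from the measurements reported here.

# A minimal sketch, assuming google-cloud-bigquery and pyarrow are installed
# and application default credentials are configured.
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()  # picks up the default project from the environment

# Same shape of data as in the benchmark above
df = pd.DataFrame([[1, 2, 3]] * 100001, columns=['a', 'b', 'c'])

# Load the DataFrame in one bulk load job (not row-by-row streaming inserts)
table_id = client.project + '.TestDataSet.TestTable'
job = client.load_table_from_dataframe(df, table_id)
job.result()  # wait for the load job to finish
print('loaded', client.get_table(table_id).num_rows, 'rows')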