I compared alternatives 1 and 3 in Datalab with the following code:
from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
import time

# DataFrame to write (rows as lists so the column order is well defined)
my_data = [[1, 2, 3]]
for i in range(0, 100000):
    my_data.append([1, 2, 3])
not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])

# Alternative 1: pandas' to_gbq
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable',
                               Context.default().project_id,
                               chunksize=10000,
                               if_exists='append',
                               verbose=False)
end = time.time()
print("time alternative 1 " + str(end - start))

# Alternative 3: stage the DataFrame in GCS, then write it to a BigQuery table
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)
table.create(schema=table_schema, overwrite=True)

# Write the DataFrame to GCS (Google Cloud Storage); %storage is a Datalab cell magic
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to the BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))
Here are the results for n = {10000, 100000, 1000000}:
n         alternative_1   alternative_3
10000     30.72s          8.14s
100000    162.43s         70.64s
1000000   1473.57s        688.59s
From these results, alternative 3 is faster than alternative 1 at every size tested: roughly 3.8x faster at n = 10000 and a bit over 2x faster at the two larger sizes.
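For comparison, here is a minimal sketch of the same write done with the current google-cloud-bigquery client as a single load job instead of the (now deprecated) datalab API. The dataset and table names are the ones used above; the google-cloud-bigquery and pyarrow packages and default application credentials are assumed, and the timing behaviour will differ from the measurements reported here.

# A minimal sketch, assuming google-cloud-bigquery and pyarrow are installed
# and application default credentials are configured.
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()  # picks up the default project from the environment

# Same shape of data as in the benchmark above
df = pd.DataFrame([[1, 2, 3]] * 100001, columns=['a', 'b', 'c'])

# Load the DataFrame in one bulk load job (not row-by-row streaming inserts)
table_id = client.project + '.TestDataSet.TestTable'
job = client.load_table_from_dataframe(df, table_id)
job.result()  # wait for the load job to finish
print('loaded', client.get_table(table_id).num_rows, 'rows')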