It looks like you could use the magic of pandas.

Reading the data

Creating a pandas dataframe from a CSV file is easy with the read_csv() function:

    import pandas as pd

    df = pd.read_csv(filename)

Based on your sample data, this creates the following dataframe:

        ID        timestamp   latitude  longitude
    0    3   6/9/2017 22:20  38.795333  77.008883
    1    1   5/5/2017 13:10  38.889011  77.050061
    2    2  2/10/2017 16:23  40.748249  73.984191
    3    1   5/5/2017 12:35  38.920602  77.222329
    4    3  6/10/2017 10:00  42.366211  71.020943
    5    1   5/5/2017 20:00  38.897416  77.036833
    6    2   2/10/2017 7:30  38.851426  77.042298
    7    3   6/9/2017 10:20  38.917346  77.222553
    8    2  2/10/2017 19:51  40.782869  73.967544
    9    3   6/10/2017 6:42  38.954268  77.449695
    10   1   5/5/2017 16:35  38.872875  77.007763
    11   2  2/10/2017 10:00  40.776931  73.876155

Converting the timestamp column

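Pandas (and Python in general) has extensive libraries for date and time operations. But first, you need to prepare the data by converting the timestamp column (strings) into datetime objects. I am assuming your data is in "MM/DD/YYYY" format (since you did not specify).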

    df['timestamp'] = pd.to_datetime(df['timestamp'], format='%m/%d/%Y %H:%M')

Helper functions

You will have to define some functions to compute distance and velocity. The Haversine distance function is adapted from this answer.
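
For reference, the haversine formula that this kind of function implements can be written as below. The symbols are my own notation, not taken from the original answer: φ is latitude and λ is longitude (both in radians), Δφ and Δλ are the differences between the two points, and R = 6371 km is the Earth radius used in the code.

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\Delta\varphi}{2}\right)
   + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\Delta\lambda}{2}\right) \\
c &= 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right) \\
d &= R\,c
\end{aligned}
```
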

    from math import sin, cos, sqrt, atan2, radians

    def getDistanceFromLatLonInKm(lat1, lon1, lat2, lon2):
        R = 6371  # Radius of the earth in km
        dLat = radians(lat2 - lat1)
        dLon = radians(lon2 - lon1)
        rLat1 = radians(lat1)
        rLat2 = radians(lat2)
        a = sin(dLat/2) * sin(dLat/2) + cos(rLat1) * cos(rLat2) * sin(dLon/2) * sin(dLon/2)
        c = 2 * atan2(sqrt(a), sqrt(1-a))
        d = R * c  # Distance in km
        return d

    def calc_velocity(dist_km, time_start, time_end):
        """Return 0 if time_start == time_end, avoid dividing by 0"""
        # total_seconds() covers the whole interval; .seconds would drop whole days
        elapsed = (time_end - time_start).total_seconds()
        return dist_km / elapsed if elapsed > 0 else 0

Making some intermediate variables

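We want to compute the Haversine function on every row, but we need some information from the first row of each group. Luckily, pandas makes this easy with sort_values(), groupby() and transform().

The following code creates 3 new columns, one each for the initial latitude, longitude, and time of each ID.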

    # First sort by ID and timestamp:
    df = df.sort_values(by=['ID', 'timestamp'])

    # Group the sorted dataframe by ID, and grab the initial value for lat, lon, and time.
    df['lat0'] = df.groupby('ID')['latitude'].transform(lambda x: x.iat[0])
    df['lon0'] = df.groupby('ID')['longitude'].transform(lambda x: x.iat[0])
    df['t0'] = df.groupby('ID')['timestamp'].transform(lambda x: x.iat[0])

Applying the functions


    # create a new column for distance
    df['dist_km'] = df.apply(
        lambda row: getDistanceFromLatLonInKm(
            lat1=row['latitude'],
            lon1=row['longitude'],
            lat2=row['lat0'],
            lon2=row['lon0']
        ),
        axis=1
    )

    # create a new column for velocity (km per second)
    df['velocity_kmps'] = df.apply(
        lambda row: calc_velocity(
            dist_km=row['dist_km'],
            time_start=row['t0'],
            time_end=row['timestamp']
        ),
        axis=1
    )

Results

    >>> print(df[['ID', 'timestamp', 'latitude', 'longitude', 'dist_km', 'velocity_kmps']])
        ID           timestamp   latitude  longitude     dist_km  velocity_kmps
    3    1 2017-05-05 12:35:00  38.920602  77.222329    0.000000       0.000000
    1    1 2017-05-05 13:10:00  38.889011  77.050061   15.314742       0.007293
    10   1 2017-05-05 16:35:00  38.872875  77.007763   19.312148       0.001341
    5    1 2017-05-05 20:00:00  38.897416  77.036833   16.255868       0.000609
    6    2 2017-02-10 07:30:00  38.851426  77.042298    0.000000       0.000000
    11   2 2017-02-10 10:00:00  40.776931  73.876155  344.880549       0.038320
    2    2 2017-02-10 16:23:00  40.748249  73.984191  335.727502       0.010498
    8    2 2017-02-10 19:51:00  40.782869  73.967544  339.206320       0.007629
    7    3 2017-06-09 10:20:00  38.917346  77.222553    0.000000       0.000000
    0    3 2017-06-09 22:20:00  38.795333  77.008883   22.942974       0.000531
    9    3 2017-06-10 06:42:00  38.954268  77.449695   20.070609       0.000274
    4    3 2017-06-10 10:00:00  42.366211  71.020943  648.450485       0.007611

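
As an aside, the dist_km column does not strictly need a row-wise apply. The sketch below is a NumPy variant of the same haversine calculation that works on whole columns at once; the function name `haversine_km` and its `radius_km` parameter are my own illustrative choices, not part of the original answer.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance in km; accepts scalars or array-likes given in degrees."""
    # Convert every input to a float array in radians.
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * radius_km * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Second ID 1 row from the sample data, measured against that ID's first row.
d = haversine_km(38.889011, 77.050061, 38.920602, 77.222329)
print(round(float(d), 2))
```

Assuming the lat0 and lon0 columns exist as in the answer, something like `df['dist_km'] = haversine_km(df['latitude'], df['longitude'], df['lat0'], df['lon0'])` should produce the same column without calling a Python function once per row.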
From here, I will leave it to you to figure out how to grab the last entry for each ID.
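
One possible way to do that, sketched on a small stand-in frame, is to sort by ID and timestamp and then keep the final row of each group. The values are copied from the results for IDs 1 and 2, and `last_per_id` is just an illustrative name.

```python
import pandas as pd

# Stand-in for the processed dataframe: first and last rows of IDs 1 and 2.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2017-05-05 12:35", "2017-05-05 20:00",
        "2017-02-10 07:30", "2017-02-10 19:51",
    ]),
    "dist_km": [0.000000, 16.255868, 0.000000, 339.206320],
})

# After sorting, the last row of each ID group is its latest entry.
last_per_id = df.sort_values(["ID", "timestamp"]).groupby("ID").tail(1)
print(last_per_id)
```

Using `tail(1)` keeps the original row index and all columns, which makes it easy to trace each result back to the source data.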