Globally, the use of commercial geographic data for site selection and geographic consumer segmentation is already common in developed economies. To serve China's steadily upgrading consumers more precisely, companies such as IKEA, McDonald's, and Starbucks have set up dedicated commercial geographic analysis teams to guide store site selection in China, and McKinsey's "解读中国" (Understanding China) commercial geography team likewise reports increasingly strong demand from clients.
Government agencies analyze the distribution of water and land resources in a region in order to optimize the layout of territorial development and infrastructure, and to promote resource conservation, protection, utilization, and management, further advancing the construction of an ecological civilization.
We are living through enormous changes in the business world, and the big-data transformation of commercial site selection has quietly begun. As the old saying goes, "one step can make the difference of three markets" (一步差三市): a well-chosen location is critical to how a business performs. Using commercial geographic data for site selection and geographic consumer segmentation has become a clear trend for intelligent commerce.
Products and tools widely used on the global market include ArcGIS, GeoTools, GDAL, GEOS, and JTS. In the context of China's drive for domestically developed, independently controllable software, the Kingbase (人大金仓) and PostgreSQL databases are also directions for future development.
Geographic data integration products such as FME are now widely used on the data application side, for example to distribute and share processed data.
The problems enterprises commonly face when applying geographic data are: how to make geographic data respond to business needs more quickly, approaching real-time analysis; and, in the context of smart cities, how to develop broader derived applications such as identifying undervalued areas and life-circle analysis.
Next, let me share IBM's geospatial data products and a demo:
Db2 Warehouse is an analytics data warehouse with in-memory data processing and in-database analytics capabilities.
It is client-managed and optimized for fast and flexible deployment, with automated scaling to support analytics workloads. Based on the number of worker nodes you choose, Cloud Pak for Data automatically creates the appropriate data warehouse environment: with a single node, the warehouse uses a symmetric multiprocessing (SMP) architecture for cost efficiency; with two or more nodes, it is deployed with a massively parallel processing (MPP) architecture for high availability and better performance.
Advantages
Lifecycle management: as with a cloud service, Db2 Warehouse is easy to install, upgrade, and manage, and a Db2 Warehouse database can be deployed in minutes.
Rich ecosystem: data management console, REST APIs, and graph capabilities; extended availability of Db2 Warehouse through a multi-tier recovery strategy.
Support for software-defined storage such as OCS and IBM Spectrum Scale CSI.
Supported products and development languages:
1. Esri ArcGIS
You can use Esri ArcGIS for Desktop version 10.3.1 together with your warehouse to analyze and visualize geospatial data.
2. Python
The ibmdbPy package provides methods to read data from, write data to, and sample data from a Db2 database. It also provides access methods for in-database analytic and geospatial functions (a short connection sketch follows this list).
3. R
Use the RStudio® development environment that is provided by IBM® Watson™ Studio.
Use ODBC to connect your own, locally installed R development environment to your database
4. SQL/Procedures
5. The in-database SpatialData module
Db2 Spatial Extender / Db2 Spatial Analytics
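For the Python route (item 2 above), a minimal connection sketch with ibmdbPy might look like the following; the DSN, credentials, and table name are placeholders, not values from this article:
from ibmdbpy import IdaDataBase, IdaDataFrame

# Connect to the warehouse through an ODBC DSN (all connection values are placeholders)
idadb = IdaDataBase(dsn='BLUDB', uid='db2user', pwd='secret')

# Open a lazy reference to a table and pull a small sample back as a pandas DataFrame
idadf = IdaDataFrame(idadb, 'GOSALES.CUSTOMERS')
print(idadf.head(5))

idadb.close()
SQL that calls in-database geospatial functions can be issued in the same session through idadb.ida_query().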
- Install and configure DB2WH (Cloud Pak).
- Create an operating system login account named sde on the DB2WH (Cloud Pak) server.
- You will connect to the database through the sde login account to create the geodatabase.
- Create a DB2WH (Cloud Pak) database and register it with the Spatial Extender module.
- Grant the sde user DBADM authority in the database.
- Configure the client.
- On a 64-bit operating system, install the Db2 client by running the 64-bit executable; it installs both the 32-bit and 64-bit files, so you can connect from both 32-bit and 64-bit ArcGIS clients. (IBM dataserver64-v11.5.6_ntx64_rtcl.exe)
- Create the geodatabase.
- Connect to the Db2 database through the sde login account.
- Make sure the sde user's password is saved in the database connection dialog box.
- Right-click the database connection and click Enable Geodatabase.
- The Enable Enterprise Geodatabase tool opens.
SpatialData module
1. Key concepts
Geometry types: Points, LineStrings, Polygons, and so on
Coordinate system:A geographic coordinate system uses a three-dimensional spherical surface to determine locations on the earth.
Data types: ST_Point, ST_LineString, ST_Polygon, ST_MultiPoint, ST_MultiLineString, ST_MultiPolygon, and ST_Geometry when you are not sure which of the other data types to use (a hedged table-definition sketch follows).
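As a hedged illustration of these data types: the table, its columns, and the INLINE LENGTH value below are made-up examples, executed here from Python with the ibm_db_dbi driver, though any Db2 SQL client would do.
import ibm_db_dbi

# All connection values are placeholders for your Db2 Warehouse instance
conn = ibm_db_dbi.connect(
    "DATABASE=BLUDB;HOSTNAME=db2wh.example.com;PORT=50000;PROTOCOL=TCPIP;"
    "UID=db2user;PWD=secret", "", "")
cur = conn.cursor()

# A hypothetical customer table whose location is stored as a db2gse.ST_Point value;
# INLINE LENGTH keeps small geometries inside the base row (the value is illustrative)
cur.execute("""
    CREATE TABLE CUSTOMERS (
        CUSTOMER_ID INTEGER NOT NULL PRIMARY KEY,
        NAME        VARCHAR(100),
        LATITUDE    DOUBLE,
        LONGITUDE   DOUBLE,
        LOCATION    db2gse.ST_Point INLINE LENGTH 292
    )
""")
conn.commit()
conn.close()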
2. Performance optimization
Specifying inline lengths for geospatial columns
Registering spatial columns: call st_register_spatial_column()
Filtering using a bounding box (a hedged sketch of the last two points follows)
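A hedged sketch of registering a spatial column and applying a bounding-box pre-filter; the procedure and function names are those of Db2 Spatial Extender, while the schema, table, SRS name, and coordinates are assumptions carried over from the sketch above.
import ibm_db_dbi

conn = ibm_db_dbi.connect(
    "DATABASE=BLUDB;HOSTNAME=db2wh.example.com;PORT=50000;PROTOCOL=TCPIP;"
    "UID=db2user;PWD=secret", "", "")
cur = conn.cursor()

# Register the spatial column so its spatial reference system is known to the database;
# the last two parameters are output parameters (message code and message text)
cur.callproc("db2gse.ST_register_spatial_column",
             ["DB2USER", "CUSTOMERS", "LOCATION", "NAD83_SRS_1", 0, ""])

# Bounding-box filter: EnvelopesIntersect() compares only geometry envelopes, so it acts
# as a cheap pre-filter that prunes rows before more expensive spatial predicates run
cur.execute("""
    SELECT CUSTOMER_ID, NAME
    FROM CUSTOMERS
    WHERE db2gse.EnvelopesIntersect(LOCATION, -74.05, 40.68, -73.85, 40.88, 1) = 1
""")
print(cur.fetchall())
conn.close()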
1. Db2 Spatial Extender / Db2 Spatial Analytics (its successor)
Functions provided by the Db2 Spatial Extender component can be used to analyze data stored in row-organized/column-organized tables. Spatial Extender stores geospatial data in special data types, each of which can hold up to 4 MB.
2. Enabling Db2 Spatial Analytics
CALL SYSPROC.SYSINSTALLOBJECTS('GEO', 'C', CAST (NULL AS VARCHAR(128)), CAST (NULL AS VARCHAR(128)))
3. Db2 Spatial Extender/Analytics interfaces
Db2 Spatial has a wide variety of interfaces to help you set up and create projects that use spatial data:
Db2 Spatial Extender stored procedures called from application programs.
SQL queries that you submit from application programs.
Open source projects that support Spatial Extender such as:
- GeoTools is a Java™ library for building spatial applications. For more information, see http://www.geotools.org/.
- GeoServer is a Web map server and Web feature server. For more information, see http://geoserver.org/.
- uDIG is a desktop spatial data visualization and analysis application. For more information, see http://udig.refractions.net/.
Use case
The Safe Harbor Real Estate insurance company launches a project to use geospatial data for BI decision making.
1. Define the objectives:
• Where is it best to establish new branch offices?
• How should insurance surcharges be adjusted based on customer attributes (areas with high rates of traffic accidents, areas with high rates of crime, flood zones, earthquake faults, and so on)?
2. Choose the spatial reference system:
• Based on the geographic locations involved, use the spatial reference system called NAD83_SRS_1, which is designed to be used with GCS_NORTH_AMERICAN_1983.
3. Create the related tables:
• Customer table: CUSTOMERS, with LATITUDE and LONGITUDE columns
• Branch office table: OFFICE_LOCATIONS
• Branch sales table: OFFICE_SALES
4. Register the three tables.
5. Update the LOCATION column from the latitude/longitude values:
• UPDATE CUSTOMERS SET LOCATION = db2gse.ST_Point(LONGITUDE, LATITUDE, 1) to populate the LOCATION value from LATITUDE and LONGITUDE.
• Create the HAZARD_ZONES table and import a shapefile into it.
6. Optimize:
• Create a view that joins columns from the CUSTOMERS and HAZARD_ZONES tables and register the spatial columns in this view (a hedged sketch of steps 5 and 6 follows).
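A hedged sketch of steps 5 and 6 from Python with the ibm_db_dbi driver. CUSTOMERS, HAZARD_ZONES, LATITUDE, LONGITUDE, and LOCATION come from the scenario above; CUSTOMER_ID, ZONE_TYPE, BOUNDARY, and the connection values are assumptions.
import ibm_db_dbi

conn = ibm_db_dbi.connect(
    "DATABASE=BLUDB;HOSTNAME=db2wh.example.com;PORT=50000;PROTOCOL=TCPIP;"
    "UID=db2user;PWD=secret", "", "")
cur = conn.cursor()

# Step 5: derive the spatial LOCATION value from the existing LATITUDE/LONGITUDE columns
cur.execute("UPDATE CUSTOMERS SET LOCATION = db2gse.ST_Point(LONGITUDE, LATITUDE, 1)")

# Step 6: a view joining each customer to the hazard zones that contain that customer,
# so the analysis team can query exposure directly (ST_Within returns 1 on containment)
cur.execute("""
    CREATE VIEW CUSTOMERS_IN_HAZARD_ZONES AS
    SELECT c.CUSTOMER_ID, c.LOCATION, h.ZONE_TYPE
    FROM CUSTOMERS c, HAZARD_ZONES h
    WHERE db2gse.ST_Within(c.LOCATION, h.BOUNDARY) = 1
""")
conn.commit()
conn.close()
The spatial columns of this view are then registered in the same way as the base tables, so that client tools can discover them.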
7. Analyze:
• The spatial analysis team runs queries to obtain the information it needs.
CP4D - rapid geospatial data analysis capabilities
1. The ad hoc query concept
The front-end BI layer does not need to be aware of the underlying physical design: users build custom queries in the web UI through self-service drag and drop, interact through APIs with any back-end data source in a decoupled way, and see the data in near real time.
Automatically spin up lightweight, dedicated Apache Spark clusters to run a wide range of workloads.
Included with Cloud Pak for Data: Geospatio-temporal library
Geospatio-temporal library - how to use it
You can use the geospatio-temporal library to expand your data science analysis to include location analytics by gathering, manipulating, and displaying imagery, GPS, satellite photography, and historical data. You can use the geospatio-temporal library in Cloud Pak for Data to:
1. Run Spark jobs on your Cloud Pak for Data cluster by using the Spark jobs REST APIs of Analytics Engine powered by Apache Spark. pyst supports most of the common geospatial formats, including geoJSON and WKT.
from pyst import STContext
stc = STContext(spark.sparkContext._gateway)
To work with files in geoJSON format, first create a geoJSON reader and writer:
geojson_reader = stc.geojson_reader()
geojson_writer = stc.geojson_writer()
Direct input (a short construction example follows this list):
Point: point(lat, lon)
LineSegment: line_segment(start_point, end_point)
LineString: line_string([point_1, point_2, …]) or line_string([line_segment_1, line_segment_2, …])
Ring: ring([point_1, point_2, …]) or ring([line_segment_1, line_segment_2, …])
Polygon: polygon(exterior_ring, [interior_ring_1, interior_ring_2, …])
MultiGeometry: multi_geometry(geom_1, geom_2, …)
MultiPoint: multi_point(point_1, point_2, …)
MultiLineString: multi_line_string(line_string_1, line_string_2, …)
MultiPolygon: multi_polygon(polygon_1, polygon_2, …)
Null Geometry: null_geometry()
FullEarth: full_earth()
BoundingBox: bounding_box(lower_lat, lower_lon, upper_lat, upper_lon)
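Putting a few of these constructors together, a minimal sketch: the coordinate values are arbitrary, the Spark session is assumed to exist as elsewhere in this article, and closing the ring by repeating its first point is an assumption.
from pyst import STContext

stc = STContext(spark.sparkContext._gateway)

# Build a small rectangular polygon from its corner points (lat, lon order, as above)
p1 = stc.point(41.00, -73.70)
p2 = stc.point(41.00, -73.60)
p3 = stc.point(41.10, -73.60)
p4 = stc.point(41.10, -73.70)

exterior = stc.ring([p1, p2, p3, p4, p1])   # close the ring by repeating the first point
rect = stc.polygon(exterior, [])            # empty list: no interior rings (holes)

print(rect.get_bounding_box())              # same accessor as used later in this article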
2. Run notebooks in Spark environments in Watson Studio.
In Python notebooks, the geospatio-temporal library can be used in the same way to add location analytics to your data science analysis.
The spatio-temporal library is available in all IBM Watson Studio Spark runtime environments, and also when you run your notebooks in IBM Analytics Engine.
Key aspects of the library include:
All calculated geometries are accurate without the need for projections.
The geospatial functions take advantage of the distributed processing capabilities provided by Spark.
The library includes native geohashing support for geometries used in simple aggregations and in indexing, thereby improving storage retrieval considerably.
The library supports extensions of Spark distributed joins.
The library supports the SQL/MM extensions to Spark SQL.
Reference
https://www.ibm.com/docs/en/cloud-paks/cp-data/3.5.0?topic=scripts-geospatio-temporal-library
https://www.ibm.com/docs/en/cloud-paks/cp-data/3.5.0?topic=libraries-geospatio-temporal-library#getting-started-with-the-library
Geospatio-temporal library - functions
Topological functions
With the spatio-temporal library, you can use topological relations to confine the returned results of your location data analysis.
Get the aggregated bounding box for a list of geometries.
westchester_WKT = 'POLYGON((-73.984 41.325,...,-74.017 40.698,-74.019 40.698,-74.023 40.703,-74.023 40.709))'
wkt_reader = stc.wkt_reader()
westchester = wkt_reader.read(westchester_WKT)
white_plains = wkt_reader.read(white_plains_WKT)
manhattan = wkt_reader.read(manhattan_WKT)
white_plains_bbox = white_plains.get_bounding_box()
westchester_bbox = westchester.get_bounding_box()
manhattan_bbox = manhattan.get_bounding_box()
aggregated_bbox = white_plains_bbox.get_containing_bb(westchester_bbox).get_containing_bb(manhattan_bbox)
Geohashing functions
The spatio-temporal library includes geohashing functions for proximity search (encoding latitude and longitude and grouping nearby points) in location data analysis.
Geohash coverage
test_wkt = 'POLYGON((-73.76223024988917 41.04173285255264,-73.7749331917837 41.04121496082817,-73.78197130823878 41.02748934524744,-73.76476225519923 41.023733725449326,-73.75218805933741 41.031633228865495,-73.7558787789419 41.03752486433286,-73.76223024988917 41.04173285255264))'
poly = wkt_reader.read(test_wkt)
cover = stc.geohash.geohash_cover_at_bit_depth(poly, 36)
Geospatial indexing functions
With the spatio-temporal library, you can use functions to index points within a region, on a region containing points, and points within a radius to enable fast queries on this data during location analysis.
>>> tile_size = 100000
>>> si = stc.tessellation_index(tile_size=tile_size) # we leave bbox as None to use full earth as boundingbox
>>> si.from_df(county_df, 'NAME', 'geometry', verbosity='error')
3221 entries processed, 3221 entries successfully added
Which are the counties within 20 km of White Plains Hospital? The result is sorted by their distances.
>>> counties = si.within_distance_with_info(white_plains_hospital, 20000)
>>> counties.sort(key=lambda tup: tup[2])
>>> for county in counties:
... print(county[0], county[2])
Westchester 0.0
Fairfield 7320.602641166855
Rockland 10132.182241119823
Bergen 10934.1691335908
Bronx 15683.400292349625
Nassau 17994.425235412604
Ellipsoidal metrics
You can use ellipsoidal metrics to calculate the distance between points.
Compute the azimuth between two points, in radians:
>>> p1 = stc.point(47.1, -73.5)
>>> p2 = stc.point(47.6, -72.9)
>>> stc.eg_metric.azimuth(p1, p2)
0.6802979449118038
Routing functions
The spatio-temporal library includes routing functions that list the edges that yield a path from one node to another node.
Find the best route with the minimal distance cost (the fastest route distance-wise). In the snippet below, best_distance_route is assumed to be the result already returned by the library's routing search:
# Check distance cost, in the unit of meters
>>> best_distance_route.cost
2042.4082601271236
# Check route path (only showing the first three points), which is a list of points in 3-tuple (osm_point_id, lat, lon)
>>> best_distance_route.path[:3]
[(2036943312, 33.7631862, -84.3939405),
(3523447568, 33.7632666, -84.3939315),
(2036943524, 33.7633273, -84.3939155)]
Spark engine - Analytics Engine
You can use Analytics Engine powered by Apache Spark as a compute engine to run analytical and machine learning jobs.
The Analytics Engine powered by Apache Spark service is not available by default; an administrator must install it on the IBM Cloud Pak for Data platform. If you have the Watson Studio service installed, the Analytics Engine powered by Apache Spark service automatically adds a set of default Spark environment definitions to analytics projects. You can also create custom Spark environment definitions in a project.
You can submit jobs to Spark clusters in two ways:
1. Specifying a Spark environment definition for a job in an analytics project
2. Running Spark job APIs
Each time you submit a job, a dedicated Spark cluster is created for the job. You can specify the size of the Spark driver, the size of the executor, and the number of executors for the job. This enables you to achieve predictable and consistent performance.
When a job completes, the cluster is automatically cleaned up so that the resources are available for other jobs. The service also includes interfaces that enable you to analyze the performance of your Spark applications and debug problems.
Spark APIs
You can run these types of workloads with Spark jobs APIs:
Spark applications that run Spark SQL
Data transformation jobs
Data science jobs
Machine learning jobs
Using Spark in Watson Studio
For Python:
At the beginning of the cell, add %%writefile myfile.py to save the code as a Python file in your working directory. Notebooks that use the same runtime can also import this file. The advantage of this method is that the code is available in your notebook, and you can edit it and save it as a new Python script at any time. Use pyst, which supports most of the common geospatial formats, including shapefile, GeoJSON, and Well-Known Text (WKT).
from pyst import STContext
# Register STContext, which is the main entry point
stc = STContext(spark.sparkContext._gateway)
For R:
If you want to save code in a notebook as an R script in the working directory, you can use the writeLines() function (for example, writing to myfile.R).
RStudio uses the sparklyr package to connect to Spark from R. The sparklyr package includes a dplyr interface to Spark data frames as well as an R interface to Spark's distributed machine learning pipelines. There are two methods of connecting to Spark from RStudio.
In IBM Cloud Pak for Data, you can run Spark jobs or applications on your IBM Cloud Pak for Data cluster without installing Watson Studio by using the Spark jobs REST APIs of Analytics Engine powered by Apache Spark.
1. Submitting Spark jobs:
curl -k -X POST
2. Viewing Spark job status:
curl -k -X GET
3. Deleting Spark jobs:
curl -k -X DELETE
Define the objectives of the demo:
• Flood impact extent: the imagery analysis results are loaded into the database as geometries in WKT format.
• Based on the WKT impact extent, count the number of affected policyholders, and use machine learning to propose an insurance plan.
A hedged code sketch and the reference follow.
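A minimal sketch of the counting step with pyst. It assumes a Spark session, a WKT string flood_extent_wkt produced by the image analysis, a pandas DataFrame policy_df with POLICY_ID, LAT, and LON columns, and that geometries expose a contains() topological predicate as described under "Topological functions" above; the machine learning step is not shown.
from pyst import STContext

stc = STContext(spark.sparkContext._gateway)
wkt_reader = stc.wkt_reader()

# Flood extent produced by the remote-sensing analysis, stored as WKT
flood_zone = wkt_reader.read(flood_extent_wkt)

# Count policyholders whose insured location lies inside the flood extent
affected = [row.POLICY_ID
            for row in policy_df.itertuples()
            if flood_zone.contains(stc.point(row.LAT, row.LON))]

print(len(affected), "policyholders fall inside the flood extent")
# The affected list would then feed the machine learning model that proposes insurance plans.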
Reference:
• Insurance Loss Estimation using Remote Sensing: https://dataplatform.cloud.ibm.com/exchange/public/entry/view/14ea8dfab582137c695a6630e90cdc32?context=cpdaas
This covers a lot of material; if you have any questions, please contact the author.