Location analytics on Snowflake with OpenStreetMap, US property data, and our ready made Geography Dimension
We have made available a ready made Geography Dimension on the Snowflake Data Marketplace. The marketplace is one of the components of the Snowflake Data Cloud. It covers the first to fifth order administrative divisions and cities in all major countries. This can be used for drilling down / rolling up in any type of location analytics.
In this blog post we will integrate the Geography Dimension with administrative boundary data which we derived from OpenStreetMap (OSM). This data set is also available on the Snowflake data marketplace. The OSM data contains geo coordinates as polygons and we can perform geo-lookups for latitude / longitude pairs.
Using US property data we can look up the geo-coordinates (latitude / longitude) to locate the US County, US State and other administrative divisions of the property. We can then use the Geography Dimension to drill up and down on this data.
Let us first have a look at the data sets in detail before we show you some examples.
Geography Dimension
The geography dimension on the Snowflake Marketplace contains the data from country to city level including the first to fifth order administrative divisions in between and it covers all major countries. A detailed documentation on this is available.
The core tables in the Geography Dimension Dataset are as given below: –
COUNTRY
This is the highest level of information available in the dataset. The attributes of the country such as iso codes, fips codes, capital, area, population, continent, currency, postal code format, phone dialing code, languages, neighbouring countries are all included in this table.
ADMIN TABLES
The admin tables cover the administrative divisions within a country. We have created five main admin tables namely ADMIN_1, ADMIN_2, ADMIN_3, ADMIN_4 AND ADMIN_5 which corresponds to the first to fifth order administrative divisions in a country. In addition to this tables ADMIN_3_REL1, ADMIN4_REL1, ADMIN4_REL2, ADMIN5_REL1, ADMIN5_REL2, ADMIN5_REL3 are also available which consists the administrative divisions which skipped one or more of it’s parent admin-orders.
CITY_500
The city data is the lowest level of information available in the dataset. It covers all cities around the world with a population > 500.
OSM Data
Sonra has made available OpenStreetMap (OSM) data sets on the Snowflake DataMarketplace. The data is available in each Snowflake region.
We have derived the administrative divisions of the major countries as documented in the OSM documentation on administrative boundaries.
Here is an example of what is contained in the data set.
The GEO field contains the coordinates of the admin boundaries as polygons or multipolygons. The RANK field indicates the hierarchy of the different boundary/administrative divisions such as country borders, state borders and county borders.
You may like to read: Open Street Map Documentation (PDF)
Property data
We also have scraped some US property data for this demo to show how the in-house data of a property website or agency can be enriched with geo information for advanced location analytics.
The property data contains the following data points.
Column | Description |
---|---|
ID | Unique Identifier |
NAME | Name of the property |
STATE | The state where the property is located |
COUNTY_OR_EQUIVALENT | The county or equivalent subdivision of state where the property is located |
RATE | The rate of the property in dollars |
AREA_IN_ACRES | The total area in acres |
COORDINATES | The geo coordinates of property (lat / lon) |
The data sets in action
Now let us have a look at how we can enrich our property data with admin boundary information and data from OSM.
For the demonstration, we will make use of Sonra’s OpenStreetMap United States (USA) and Geography Dimension data sets from Snowflake’s Data Marketplace.
Now let us consider the different analysis that could be done.
- Get the boundary coordinates of second order administrative divisions of the US (Counties and equivalents).
This can be done by making use of the ADMIN_2 table of Geography dimension and the OSM Reverse Geocoding table BOUNDARY.
In the same way the exact boundary for any admin order in the Geography Dimension can be obtained by joining it with the BOUNDARY table choosing the right admin level(RANK).
In this example we consider the RANK 6 in the BOUNDARY table as indicates the state counties and county equivalents for US. Refer the link for more details on choosing the appropriate RANK.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
SELECT AD2.GEO_ID, AD2.ADMIN2_CD, AD2.PARENT_ADMIN_CD, AD2.COUNTRY_CD, AD2.ASCII_NAME, BOUNDARY.COORDINATES COORDINATES FROM ( SELECT * FROM V_GEONM_ADMIN_2 WHERE COUNTRY_CD = 'US') AD2 LEFT JOIN ( SELECT ID, NAME, ST_COLLECT(GEO) COORDINATES FROM V_OSM_USA_BOUNDARY WHERE RANK = 6 GROUP BY ID, NAME) BOUNDARY ON AD2.ASCII_NAME = BOUNDARY.NAME AND ST_COVERS(BOUNDARY.COORDINATES, ST_POINT(AD2.LONGITUDE, AD2.LATITUDE)); |
2. List of second-order administrative divisions(counties) and number of general/departmental stores and malls.
This can be achieved by making use of ADMIN_2 from Geography dimension, BOUNDARY table from OSM Reverse Geocoding and the SHOP_GENERAL_STORE_DEPARTMENT_STORE_MALL table of OSM.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 |
SELECT AD2.GEO_ID, AD2.ADMIN2_CD, AD2.PARENT_ADMIN_CD, AD2.COUNTRY_CD, AD2.ASCII_NAME, COUNT(SHOP.ID) CNT_SHOP FROM ( SELECT * FROM V_GEONM_ADMIN_2 WHERE COUNTRY_CD = 'US') AD2 LEFT JOIN ( SELECT ID, NAME, ST_COLLECT(GEO) COORDINATES FROM V_OSM_USA_BOUNDARY WHERE RANK = 6 GROUP BY ID, NAME) BOUNDARY ON AD2.ASCII_NAME = BOUNDARY.NAME AND ST_COVERS(BOUNDARY.COORDINATES, ST_POINT(AD2.LONGITUDE, AD2.LATITUDE)) LEFT JOIN V_OSM_USA_SHOP_GENERAL_STORE_DEPARTMENT_STORE_MALL SHOP ON ST_COVERS(BOUNDARY.COORDINATES,ST_CENTROID(SHOP.COORDINATES)) GROUP BY AD2.GEO_ID, AD2.ADMIN2_CD, AD2.PARENT_ADMIN_CD, AD2.COUNTRY_CD, AD2.ASCII_NAME; |
3. Get the county population for the top rated properties in the US.
This can be done by making use of the property data and the ADMIN_1 and ADMIN_2 table(county in the US is the second-order administrative division) of Geography dimension.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
SELECT ID, PL.NAME PROP_NAME, COUNTY_OR_EQUIVALENT, RATE, AREA_IN_ACRES, ROUND(RATE / AREA_IN_ACRES,0) RATE_PER_ACRE, AD2.POPULATION FROM PROPERTY_LANDS PL LEFT JOIN ( SELECT * FROM V_GEONM_ADMIN_1 WHERE COUNTRY_CD = 'US') AD1 ON PL.STATE = REGEXP_SUBSTR(ADMIN1_CD,'_([a-zA-Z0-9]+)',1,1,'e',1) LEFT JOIN ( SELECT * FROM V_GEONM_ADMIN_2 WHERE COUNTRY_CD = 'US') AD2 ON UPPER(PL.COUNTY_OR_EQUIVALENT) = UPPER(AD2.ASCII_NAME) AND AD1.ADMIN1_CD = AD2.PARENT_ADMIN_CD ORDER BY RATE_PER_ACRE DESC; |
4. Number of cities with a population > 10,000 within the county where the property belongs.
This can be achieved by making use of the property data along with the ADMIN tables and CITY_500 table from Geography Dimension. We need to consider all the administrative levels from first to fifth order as the city could be listed directly under any of these divisions. We can omit the unwanted admin tables from the join if we correctly know the admin levels under which no city is listed for the country under consideration.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 |
SELECT ID,NAME,STATE,COUNTY_OR_EQUIVALENT,RATE,AREA_IN_ACRES,SUM(CASE WHEN POPULATION > 10000 THEN 1 ELSE 0 END ) CITY_CNT FROM( SELECT DISTINCT ID,PL.NAME,STATE,COUNTY_OR_EQUIVALENT,RATE,AREA_IN_ACRES,CITY.NAME CITY_NAME,CITY.POPULATION FROM PROPERTY_LANDS PL INNER JOIN ( SELECT * FROM V_GEONM_ADMIN_1 WHERE COUNTRY_CD = 'US') AD1 ON PL.STATE = REGEXP_SUBSTR(ADMIN1_CD,'_([a-zA-Z0-9]+)',1,1,'e',1) INNER JOIN ( SELECT * FROM V_GEONM_ADMIN_2 WHERE COUNTRY_CD = 'US') AD2 ON UPPER(PL.COUNTY_OR_EQUIVALENT) = UPPER(AD2.ASCII_NAME) AND AD1.ADMIN1_CD = AD2.PARENT_ADMIN_CD LEFT JOIN ( SELECT * FROM V_GEONM_ADMIN_3 WHERE COUNTRY_CD = 'US') AD3 ON AD2.ADMIN2_CD = AD3.PARENT_ADMIN_CD LEFT JOIN ( SELECT * FROM V_GEONM_ADMIN_4 WHERE COUNTRY_CD = 'US') AD4 ON AD3.ADMIN3_CD = AD4.PARENT_ADMIN_CD LEFT JOIN ( SELECT * FROM V_GEONM_ADMIN_5 WHERE COUNTRY_CD = 'US') AD5 ON AD4.ADMIN4_CD = AD5.PARENT_ADMIN_CD LEFT JOIN ( SELECT * FROM V_GEONM_CITY_500 WHERE COUNTRY_CD = 'US' AND POPULATION > 10000) CITY ON (AD1.ADMIN1_CD = CITY.ADMIN_CD OR AD2.ADMIN2_CD = CITY.ADMIN_CD OR AD3.ADMIN3_CD = CITY.ADMIN_CD OR AD4.ADMIN4_CD = CITY.ADMIN_CD OR AD5.ADMIN5_CD = CITY.ADMIN_CD) ) GROUP BY ID, NAME, STATE, COUNTY_OR_EQUIVALENT, RATE, AREA_IN_ACRES ORDER BY ID DESC; |
5. List the count of educational institutions within a distance of 2000m for all the listed properties
In order to achieve this the AMENITY_EDUCATION table from OSM is used. The table AMENITY_EDUCATION has the details of educational institutions.
Likewise we can make use of the following amenity tables to get information on respective facilities within a specified distance from the property under consideration.
AMENITY_ENTERTAINMENT_ARTS_AND_CULTURE
AMENITY_FINANCIAL
AMENITY_HEALTHCARE
AMENITY_TRANSPORTATION
AMENITY_SUSTENANCE
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
SELECT PL.ID, PL.NAME, PL.STATE, PL.COUNTY_OR_EQUIVALENT, PL.RATE, PL.AREA_IN_ACRES, COUNT(*) CNT_EDUCATIONAL FROM PROPERTY_LANDS PL INNER JOIN ( SELECT *, ST_X(ST_CENTROID(COORDINATES)) AS LON, ST_Y(ST_CENTROID(COORDINATES)) AS LAT FROM V_OSM_USA_AMENITY_EDUCATION ) AM_EDU ON ST_DISTANCE(PL.COORDINATES, AM_EDU.COORDINATES) < 2000 WHERE AM_EDU.LON BETWEEN ST_X(PL.COORDINATES) -2 AND ST_X(PL.COORDINATES) + 2 AND AM_EDU.LAT BETWEEN ST_Y(PL.COORDINATES) -2 AND ST_Y(PL.COORDINATES) + 2 GROUP BY PL.ID, PL.NAME, PL.STATE, PL.COUNTY_OR_EQUIVALENT, PL.RATE, PL.AREA_IN_ACRES; |
6. Average rate of land in each of the second-order administrative divisions of US(Counties) .
This is calculated using ADMIN_2 table of Geography Dimension and the property dataset.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
SELECT AD2.GEO_ID, AD2.ADMIN2_CD, AD2.PARENT_ADMIN_CD, AD2.COUNTRY_CD, AD2.ASCII_NAME, AD2.POPULATION, PL.AVG_RATE_OF_LAND FROM ( SELECT * FROM V_GEONM_ADMIN_2 WHERE COUNTRY_CD = 'US') AD2 LEFT JOIN ( SELECT STATE, COUNTY_OR_EQUIVALENT, ROUND(AVG(RATE / AREA_IN_ACRES),0) AVG_RATE_OF_LAND FROM PROPERTY_LANDS WHERE RATE IS NOT NULL AND AREA_IN_ACRES IS NOT NULL AND AREA_IN_ACRES > 20 GROUP BY STATE, COUNTY_OR_EQUIVALENT) PL ON AD2.ASCII_NAME = PL.COUNTY_OR_EQUIVALENT AND REGEXP_SUBSTR(AD2.ADMIN2_CD,'_([a-zA-Z0-9]+)',1,1,'e',1) = PL.STATE; |
Here we have covered a few of the scenarios where we can effectively make use of these datasets. A little modification to these SQLs would need to be done based on your dataset as well as your requirement. There are many more scenarios which we haven’t covered in this blog where our dataset can be made useful. Please feel free to reach out to Sonra if you need any assistance on this.