Location analytics on Snowflake with OpenStreetMap, US property data, and our ready made Geography Dimension

Uli Bethke Uncategorized

We have made available a ready made Geography Dimension on the Snowflake Data Marketplace. The marketplace is one of the components of the Snowflake Data Cloud. It covers the first to fifth order administrative divisions and cities in all major countries. This can be used for drilling down / rolling up in any type of location analytics.

In this blog post we will integrate the Geography Dimension with administrative boundary data which we derived from OpenStreetMap (OSM). This data set is also available on the Snowflake data marketplace. The OSM data contains geo coordinates as polygons and we can perform geo-lookups for latitude / longitude pairs.

Using US property data we can look up the geo-coordinates (latitude / longitude) to locate the US County, US State and other administrative divisions of the property. We can then use the Geography Dimension to drill up and down on this data.

Let us first have a look at the data sets in detail before we show you some examples.

Geography Dimension

The geography dimension on the Snowflake Marketplace contains the data from country to city level including the first to fifth order administrative divisions in between and it covers all major countries. A detailed documentation on this is available.

The core tables in the Geography Dimension Dataset are as given below: -

COUNTRY

This is the highest level of information available in the dataset. The attributes of the country such as iso codes, fips codes, capital, area, population, continent, currency, postal code format, phone dialing code, languages, neighbouring countries are all included in this table.

ADMIN TABLES

The admin tables cover the administrative divisions within a country. We have created five main admin tables namely ADMIN_1, ADMIN_2, ADMIN_3, ADMIN_4 AND ADMIN_5 which corresponds to the first to fifth order administrative divisions in a country. In addition to this tables ADMIN_3_REL1, ADMIN4_REL1, ADMIN4_REL2, ADMIN5_REL1, ADMIN5_REL2, ADMIN5_REL3 are also available which consists the administrative divisions which skipped one or more of it’s parent admin-orders.

CITY_500

The city data is the lowest level of information available in the dataset. It covers all cities around the world with a population > 500.

OSM Data

Sonra has made available OpenStreetMap (OSM) data sets on the Snowflake DataMarketplace. The data is available in each Snowflake region.

We have derived the administrative divisions of the major countries as documented in the OSM documentation on administrative boundaries.

Here is an example of what is contained in the data set.

The GEO field contains the coordinates of the admin boundaries as polygons or multipolygons. The RANK field indicates the hierarchy of the different boundary/administrative divisions such as country borders, state borders and county borders.

Property data

We also have scraped some US property data for this demo to show how the in-house data of a property website or agency can be enriched with geo information for advanced location analytics.

The property data contains the following data points.

Column Description
ID Unique Identifier
NAME Name of the property
STATE The state where the property is located
COUNTY_OR_EQUIVALENT The county or equivalent subdivision of state where the property is located
RATE The rate of the property in dollars
AREA_IN_ACRES The total area in acres
COORDINATES The geo coordinates of property (lat / lon)

The data sets in action

Now let us have a look at how we can enrich our property data with admin boundary information and data from OSM.

For the demonstration, we will make use of Sonra’s OpenStreetMap United States (USA) and Geography Dimension data sets from Snowflake’s Data Marketplace.

Now let us consider the different analysis that could be done.

  1. Get the boundary coordinates of second order administrative divisions of the US (Counties and equivalents).

This can be done by making use of the ADMIN_2 table of Geography dimension and the OSM Reverse Geocoding table BOUNDARY.

In the same way the exact boundary for any admin order in the Geography Dimension can be obtained by joining it with the BOUNDARY table choosing the right admin level(RANK).

In this example we consider the RANK 6 in the BOUNDARY table as indicates the state counties and county equivalents for US. Refer the link for more details on choosing the appropriate RANK.

 

2. List of second-order administrative divisions(counties) and number of general/departmental stores and malls.

This can be achieved by making use of ADMIN_2 from Geography dimension, BOUNDARY table from OSM Reverse Geocoding and the SHOP_GENERAL_STORE_DEPARTMENT_STORE_MALL table of OSM.

3. Get the county population for the top rated properties in the US.

This can be done by making use of the property data and the ADMIN_1 and ADMIN_2 table(county in the US is the second-order administrative division) of Geography dimension.

4. Number of cities with a population > 10,000 within the county where the property belongs.

This can be achieved by making use of the property data along with the ADMIN tables and CITY_500 table from Geography Dimension. We need to consider all the administrative levels from first to fifth order as the city could be listed directly under any of these divisions. We can omit the unwanted admin tables from the join if we correctly know the admin levels under which no city is listed for the country under consideration.

5. List the count of educational institutions within a distance of 2000m for all the listed properties

In order to achieve this the AMENITY_EDUCATION table from OSM is used. The table AMENITY_EDUCATION has the details of educational institutions.

Likewise we can make use of the following amenity tables to get information on respective facilities within a specified distance from the property under consideration.

AMENITY_ENTERTAINMENT_ARTS_AND_CULTURE

AMENITY_FINANCIAL

AMENITY_HEALTHCARE

AMENITY_TRANSPORTATION

AMENITY_SUSTENANCE

6. Average rate of land in each of the second-order administrative divisions of US(Counties) .

This is calculated using ADMIN_2 table of Geography Dimension and the property dataset.

Here we have covered a few of the scenarios where we can effectively make use of these datasets. A little modification to these SQLs would need to be done based on your dataset as well as your requirement. There are many more scenarios which we haven't covered in this blog where our dataset can be made useful. Please feel free to reach out to Sonra if you need any assistance on this.

 

About the author

Uli Bethke LinkedIn Profile

Uli has 18 years’ hands on experience as a consultant, architect, and manager in the data industry. He frequently speaks at conferences. Uli has architected and delivered data warehouses in Europe, North America, and South East Asia. He is a traveler between the worlds of traditional data warehousing and big data technologies.

Uli is a regular contributor to blogs and books and chairs the the Hadoop User Group Ireland. He is also a co-founder and VP of the Irish chapter of DAMA, a non for profit global data management organization. He has co-founded the Irish Oracle Big Data User Group.