Working with JSON in Redshift: Options, limitations, and alternatives

Uli Bethke

Redshift offers limited support for working with JSON documents. There are two options for loading JSON data into Redshift.

  • We can convert JSON to a relational model when loading the data to Redshift (COPY JSON functions). This requires us to pre-create the relational target data model and to manually map the JSON elements to the target table columns. There are some serious limitations to this approach. We will cover them in detail in this post.
  • We can load whole JSON documents into Redshift and transform and query them with the JSON-SQL functions offered in Redshift. This has other limitations which will also be covered in this post.

COPY from JSON Format

The COPY command loads data into Redshift tables from JSON data files in an S3 bucket or on a remote host accessed via SSH. We can automatically COPY fields from the JSON file by specifying the 'auto' option, or we can specify a JSONPaths file. This is a mapping document that COPY will use to map and parse the JSON source data into the target.

Copy From JSON using the ‘auto’ option

To load a JSON document with the ‘auto’ option, the first-level element names in the JSON file must exactly match the table column names. The order of the elements in the JSON file does not matter (matching is done by name). The following shows the contents of a file named customer.json:

{
    "name": "Major",
    "age": 19,
    "address": {"street": {"st": "5 maint st", "ci": "Dub"}, "city": "Dublin"},
    "catname": ["MLB", "GBM"]
}

For auto COPY to work we need to create the following table structure in Redshift (the column types and widths shown are indicative; nested objects and arrays have to be stored as strings):

create table customer
(
    name varchar(20),
    age int,
    address varchar(1024),
    catname varchar(256)
);
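With the table in place, the load itself is a single COPY statement. A minimal sketch, assuming the file sits in an S3 bucket named mybucket and a suitable IAM role is attached to the cluster (the bucket name and role ARN are placeholders):

copy customer
from 's3://mybucket/customer.json'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
json 'auto';

The result looks as follows: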

name age address catname
Major 19 {"street":{"st":"5 maint st","ci":"Dub"},"city":"Dublin"} ["MLB","GBM"]

Note:

The COPY command only parses first-level JSON data structures into columns of the target table by matching each name. Multi-level structures and arrays are therefore loaded as single string columns. This is a serious limitation, but more on this in a moment.

Copy From JSON using the ‘JSONPaths file’ option

If the JSON data objects don’t directly match the Redshift column names, we use a JSONPaths file to map the JSON elements to table columns. The order of the elements in the JSON file still does not matter, but the order of the JSONPath expressions in the file must match the column order of the target table. Consider the same customer.json example. This time the field names in the JSON file differ from the table column names in Redshift.

Table in Redshift (with non-matching column names)

create table customer
(
    name varchar(20),
    age int,
    address varchar(1024),
    catname varchar(256)
);

JSONPaths file jpath.json:
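A possible jpath.json, assuming the source file uses the field names c_name, c_age, c_address and c_catname (these names are illustrative; the actual file would use whatever names appear in the source JSON). Each expression maps to the table column in the same ordinal position:

{
    "jsonpaths": [
        "$['c_name']",
        "$['c_age']",
        "$['c_address']",
        "$['c_catname']"
    ]
}

The COPY command then references this file instead of the 'auto' keyword (bucket name and role ARN are again placeholders):

copy customer
from 's3://mybucket/customer.json'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
json 's3://mybucket/jpath.json';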

name age address catname
Major 19 {"street":{"st":"5 maint st","ci":"Dub"},"city":"Dublin"} ["MLB","GBM"]

Note:

In this example the JSON data objects don’t correspond directly to column names, so a JSONPaths file maps the JSON elements to the columns. Matching against the JSONPath expressions is exact and case-sensitive, so elements whose case differs from the path expression are simply ignored.

Limitations of Copy to convert JSON in Redshift

Let’s have a look at the limitations of the Redshift COPY command when converting JSON data into Redshift tables:

  • Amazon Redshift does not support any JSONPath elements, such as wildcard characters or filter expressions, that might resolve to an ambiguous path or multiple name elements.
  • By default, COPY only parses first-level JSON data structures into columns of the target table by matching each name. Multi-level structures are loaded as single column strings, and you will still need the limited set of SQL JSON functions that Redshift provides to further parse out your data.
  • The Redshift table schema is case insensitive, while the COPY command check is case sensitive. Only exact matching works.
  • Amazon Redshift can't parse complex, multi-level data structures or arrays of JSON files.
  • After parsing and loading the first level of data, Redshift offers multiple JSON-SQL functions to parse nested, lower-level data.
  • We can import JSON directly into Redshift if we have a table structure that matches. This requires a lot of analysis and manual DDL.
  • JSON fields can only be stored as string data types. The MAX setting defines the width of the column as 4096 bytes for CHAR or 65535 bytes for VARCHAR. An attempt to store a longer string into a column of these types results in an error.

SQL-JSON Functions

Redshift does not have a native JSON data type like Snowflake or other data warehouse platforms, i.e. we cannot load a JSON document into Redshift as a LOB. Each document must be stored in a single string column and can be manipulated and queried with the help of the JSON-SQL functions offered in Redshift.

JSON Functions

Redshift offers a limited number of JSON functions that can be used directly from SQL. Let’s load some sample JSON data and go through these functions in detail.

  • IS_VALID_JSON
  • IS_VALID_JSON_ARRAY
  • JSON_ARRAY_LENGTH
  • JSON_EXTRACT_ARRAY_ELEMENT_TEXT
  • JSON_EXTRACT_PATH_TEXT

Have a look at this sample data from the traffic.csv file which includes JSON data in the trafficsource field.

channelgrouping trafficsource
Organic Search {"campaign": "A", "source": "google", "medium": "organic", "keyword": "water bottle", "adwordsClickInfo": {"criteriaParameters": "demo dataset"}}
Direct {"campaign": "B", "source": "(direct)", "medium": "(none)", "adwordsClickInfo": {"criteriaParameters": "demo dataset"}, "isTrueDirect": true}
Organic Search {"campaign": "C", "source": "google", "medium": "organic", "keyword": "(not provided)", "adwordsClickInfo": {"criteriaParameters": "demo dataset"}}
Referral {"referralPath": "/offer/2145", "campaign": "A", "source": "(direct)", "medium": "(none)", "adwordsClickInfo": {"criteriaParameters": "demo dataset"}}
Referral ["a",["b",1,["c",2,3,null]]]

Prerequisite table in Redshift (the column types are indicative; trafficsource must be a VARCHAR wide enough for the longest JSON document):

create table traffic
(
    channelgrouping varchar(50),
    trafficsource varchar(5000)
);

Note: The result of select * from traffic; will look the same as the CSV file above.

IS_VALID_JSON

Validates a JSON string. The function returns Boolean true (t) if the string is properly formed JSON or false (f) if the string is malformed.

select trafficsource, is_valid_json(trafficsource) as validate_json
from traffic limit 1;

trafficsource validate_json
{"campaign": "A", "source": "google", "medium": "organic", "keyword": "water bottle", "adwordsClickInfo": {"criteriaParameters": "demo dataset"}} true

IS_VALID_JSON_ARRAY

Validates a JSON array. The function returns Boolean true (t) if the string is a properly formed JSON array or false (f) if it is not.

select trafficsource, is_valid_json_array(trafficsource) as validate_json_array

from traffic limit 2;

trafficsource validate_json_array
{"referralPath": "/offer/2145", "source": "(direct)", "medium": "(none)", "adwordsClickInfo": {"criteriaParameters": "demo dataset"}} false
["a",["b",1,["c",2,3,null]]] true

JSON_ARRAY_LENGTH

Returns the number of elements in the outer array of a JSON string. If the null_if_invalid argument is set to true and the JSON string is invalid, the function returns NULL instead of returning an error.

The following example sets null_if_invalid to true, so the statement returns NULL instead of an error for invalid JSON.

select json_array_length(trafficsource, true) from traffic;

json_array_length
NULL
NULL
NULL
NULL
2

JSON_EXTRACT_ARRAY_ELEMENT_TEXT

Returns a JSON array element in the outermost array of a JSON string, using a zero-based index. The first element in an array is at position 0. If the index is negative or out of bounds, JSON_EXTRACT_ARRAY_ELEMENT_TEXT returns an empty string. If the null_if_invalid argument is set to true and the JSON string is invalid, the function returns NULL instead of returning an error.

select json_extract_array_element_text(trafficsource, 1, true) from traffic;

json_extract_array_element_text
NULL
NULL
NULL
NULL
["b",1,["c",2,3,null]]

JSON_EXTRACT_PATH_TEXT

Returns the value for the key:value pair referenced by a series of path elements in a JSON string. The JSON path can be nested up to five levels deep. Path elements are case-sensitive. If a path element does not exist in the JSON string, JSON_EXTRACT_PATH_TEXT returns an empty string. If the null_if_invalid argument is set to true and the JSON string is invalid, the function returns NULL instead of returning an error.

select channelgrouping, json_extract_path_text(trafficsource, 'campaign', true) as campaign from traffic;

channelgrouping campaign
Organic Search A
Direct B
Organic Search C
Referral A
Referral NULL
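Path elements can also be chained to reach nested values. As a further example on the same data, the following pulls criteriaParameters out of the nested adwordsClickInfo object:

select channelgrouping,
       json_extract_path_text(trafficsource, 'adwordsClickInfo', 'criteriaParameters', true) as criteria
from traffic;

Each of the four JSON object rows returns demo dataset, while the array row returns NULL, just as in the campaign example.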

Limitations of JSON Functions :

  • Redshift checks for valid JSON every time it executes a JSON function on a JSON text field, which increases the execution time of these functions.
  • Redshift is not a good choice for storing large JSON data sets because, by storing disparate data in a single column, JSON does not leverage Amazon Redshift’s column store architecture.
  • JSON uses UTF-8 encoded text strings, so JSON strings can only be stored as CHAR or VARCHAR data types. VARCHAR is better if the strings include multi-byte characters. Amazon Redshift doesn't support VARCHAR data types larger than 64 KB, so we can't store traditional LOBs on Amazon Redshift.
  • There is no Redshift JSON function that can expand an array from one row into one row per array element. This makes it very inconvenient or impossible to convert JSON to a relational structure with tables, rows, and columns.

Redshift and the ANSI 2016 SQL JSON extensions

The ANSI 2016 SQL standard introduced 44 optional extensions, 22 of which relate to working with JSON documents inside a database.

Let’s have a look at which of these functions are supported by Redshift. Below we discuss the features described by the 2016 SQL standard that Redshift does not support.

SQL/JSON functions are partitioned into two groups: constructor functions (JSON_OBJECT, JSON_OBJECTAGG, JSON_ARRAY, and JSON_ARRAYAGG) and query functions (JSON_VALUE, JSON_TABLE, JSON_EXISTS, and JSON_QUERY).

Constructor functions use values of SQL types and produce JSON values (JSON objects or JSON arrays) represented in SQL character or binary string types. Query functions evaluate SQL/JSON path language expressions against JSON values, producing values of SQL/JSON types, which are converted to SQL types.

JSON Query function - JSON_TABLE

JSON_TABLE is a function that takes JSON data as input and generates relational data for valid input data. It also allows unnesting of (even deeply) nested JSON objects and arrays in one invocation rather than chaining several JSON_TABLE expressions in the SQL statement.
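Redshift does not implement JSON_TABLE, but to illustrate the idea, the standard syntax applied to the traffic table from earlier would look roughly like this (the column list and types are illustrative):

select t.campaign, t.source, t.medium
from traffic,
     json_table(trafficsource, '$'
         columns (
             campaign varchar(10) path '$.campaign',
             source   varchar(20) path '$.source',
             medium   varchar(20) path '$.medium'
         )
     ) as t;

Each input document is turned into one or more relational rows, with NULLs where a path does not match.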

JSON constructor functions

Redshift doesn’t support constructing JSON data from SQL tables. SQL applications working with JSON data often need to construct new JSON objects or arrays, whether for use within the application itself, for storage in the SQL database, or to return to a caller. This section describes the built-in functions the standard defines for generating JSON from explicit table data.

  • JSON_OBJECT constructs a JSON object from explicit name/value pairs.
  • JSON_OBJECTAGG constructs a JSON object as an aggregation of information from an SQL table.
  • JSON_ARRAY constructs a JSON array from an explicit list of SQL data.
  • JSON_ARRAYAGG constructs a JSON array as an aggregation of information from an SQL table.

IS JSON predicate determines whether the value of a specified string does or does not conform to the structural rules for JSON.
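To illustrate the standard syntax (none of this runs on Redshift), the constructor functions and the IS JSON predicate applied to the example tables from this post would look roughly like this:

select json_object('name' value name, 'age' value age) from customer;
select json_objectagg(name value age) from customer;
select json_array('MLB', 'GBM');
select json_arrayagg(name order by name) from customer;
select * from traffic where trafficsource is json;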

JSON Path language elements and filters

The SQL/JSON path language is a query language used by certain SQL operators (JSON_VALUE,JSON_QUERY, JSON_TABLE and JSON_EXISTS, collectively known as the SQL/JSON query operators) to query JSON text. The SQL/JSON path language is not, strictly speaking, SQL, though it is embedded in these operators within SQL.

Amazon Redshift does not support any JSONPath elements, such as wildcard characters or filter expressions, that might resolve to an ambiguous path or multiple name elements. As a result, Amazon Redshift can't parse complex, multi-level data structures.
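In practice this means a JSONPaths file for COPY may only contain simple, unambiguous paths in dot or bracket notation, while wildcard and filter expressions from the full path language are rejected. For example (the field names are taken from the customer.json example; the filter expression is purely illustrative):

"$['address']['street']['st']"   -- supported: explicit bracket notation
"$.address.street.st"            -- supported: dot notation
"$.catname[*]"                   -- not supported: wildcard
"$.catname[?(@ == 'MLB')]"       -- not supported: filter expression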

Converting Trello JSON to Redshift

Let’s have a look at how Redshift handles conversion of relatively complex JSON documents.

We will take some JSON data from Trello, a popular SaaS solution for managing projects. You can export your data as JSON files. We believe that Trello is using a NoSQL database such as MongoDB or similar to store their data.

Let’s have a look at some of the features of the exported JSON document.

  • File type: JSON with multi-level nested objects
  • Size: 7 MB
  • Number of first-level keys: 30
  • The JSON document contains business keys

Unfortunately, we won’t be able to parse this JSON file into Redshift with native functionality. First of all, it exceeds the maximum allowed VARCHAR size of 64 KB in Redshift. Secondly, it contains multi-level nested data, which makes it very hard to convert with the limited JSON support in Redshift SQL.

Alternatives

One alternative solution to convert this data into Redshift is to use the JSON features of Redshift Spectrum. Redshift Spectrum does not have the limitations of the native Redshift SQL extensions for JSON. This approach works reasonably well for simple JSON documents. However, it gets difficult and very time consuming for more complex JSON data such as the one found in the Trello JSON.
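As a sketch of the Spectrum route (the external schema spectrum is assumed to exist already, and the bucket, role and SerDe settings are placeholders), a simple JSON document can be exposed as an external table and then queried with ordinary SQL:

create external table spectrum.customer_json (
    name varchar(20),
    age int
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
location 's3://mybucket/customer/';

select name, age from spectrum.customer_json;

The SerDe maps first-level JSON keys to columns at query time, so no COPY step is needed; nested structures, however, still require careful column definitions.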

For those scenarios we have created Flexter. Flexter is a data warehouse automation tool to convert JSON (or XML) to a relational database such as Redshift, Hadoop formats (ORC, Parquet, Avro) or simple text files. No coding is required and Flexter fully automates the process.

We have written up a separate post where we automatically convert Trello JSON to Redshift.

If you have any questions please refer to the Flexter FAQ section. You can also request a demo of Flexter.

 

About the author

Uli Bethke LinkedIn Profile

Uli has 18 years’ hands on experience as a consultant, architect, and manager in the data industry. He frequently speaks at conferences. Uli has architected and delivered data warehouses in Europe, North America, and South East Asia. He is a traveler between the worlds of traditional data warehousing and big data technologies.

Uli is a regular contributor to blogs and books and chairs the Hadoop User Group Ireland. He is also a co-founder and VP of the Irish chapter of DAMA, a not-for-profit global data management organization. He co-founded the Irish Oracle Big Data User Group.