Working with JSON in Redshift. Options, limitations, and alternatives
Redshift offers limited support to work with JSON documents. We have three options to load JSON data into Redshift.
- We can convert JSON to a relational model when loading the data to Redshift (COPY JSON functions). This requires us to pre-create the relational target data model and to manually map the JSON elements to the target table columns. There are some serious limitations to this approach. We will cover them in detail in this post.
- We can load whole JSON documents into Redshift and transform and query them with the JSON-SQL functions offered in Redshift. This has other limitations which will also be covered in this post.
Another option is Flexter, our data warehouse automation tool for JSON (and XML). Flexter fully automates the conversion of JSON to a relational model on Redshift. You don’t have to write a single line of code. The whole conversion process is fully automated. You can find out more on our product page. If you have questions or would like to see Flexter in action you can request a demo.
COPY from JSON Format
The COPY command loads data into Redshift tables from JSON data files in an S3 bucket or on a remote host accessed via SSH. We can automatically COPY fields from the JSON file by specifying the ‘auto’ option, or we can specify a JSONPaths file. This is a mapping document that COPY will use to map and parse the JSON source data into the target.
Copy From JSON using the ‘auto’ option
To load a JSON document with the ‘auto’ option requires a template that exactly maps the JSON elements to the table column names. The order of the elements in the JSON file does not matter (matching is done by name). Following shows contents of a file named customer.json
1 2 3 4 5 6 7 8 9 10 11 12 |
{ "name": "Major", "age": 19, "address": { "street":{ "st":"5 maint st", "ci":"Dub" }, "city":"Dublin" }, "catname": ["MLB","GBM"] } |
For auto COPY to work we need to create the following table structure in Redshift
1 2 3 4 5 6 7 8 9 10 11 12 |
create table customer ( name varchar, age int, address varchar, catname varchar ); copy customer --redshift code from 's3://mybucket/customer.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' json 'auto'; SELECT * FROM customer ; |
name | age | address | catname |
Major | 19 | {“street”:{“st”:”5 maint st”,ci”:”Dub”},”city”:”Dublin”} | [“MLB”,”GBM”] |
The COPY function only parses the first-level JSON data structures to columns in target table by matching each name. Hence multi-level structures or array will be treated as a single column string while loading. This is a serious limitation, but more on this in a moment
Copy From JSON using the ‘JSONPath file’option:
If JSON data objects don’t directly match Redshift column names, we use a JSONPath file to map JSON elements to table columns. Again, the order does not matter, but the order of JSON path file expressions must match the column order. Consider same example of JSON file customer.json. This time the field names are different from table column names in redshift.
1 2 3 4 5 6 7 8 9 10 11 12 |
{ "Name": "Major", "Age": 19, "Add": { "street":{ "st":"5 maint st", "ci":"Dub" }, "city":"Dublin" }, "Category_Name": ["MLB","GBM"] } |
Table in Redshift (with non-matching column names)
1 2 3 4 5 6 7 |
create table customer ( name varchar, age int, address varchar, catname varchar ); |
JSON Path file jpath.json-
1 2 3 4 5 6 7 8 9 10 11 12 13 |
{ "jsonpaths": [ "$['Name']", "$['Age']", "$['Add']", "$['Category_Name']" ] } copy customer --redshift code from 's3://mybucket/customer.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' json from 's3://mybucket/jpath.json' ; -- Jsonpath file to map fields SELECT * FROM customer ; |
name | age | address | catname |
---|---|---|---|
Major | 19 | {“street”:{“st”:”5 maint st”,ci”:”Dub”},”city”:”Dublin”} | [“MLB”,”GBM”] |
In this example the JSON data objects don’t correspond directly to column names. A JSONPaths file is used to map the JSON elements to columns. The COPY command only checks for an exact match from JSON file and hence ignores columns with case mismatch between JSON file and table.
Limitations of Copy to convert JSON in Redshift
Let’s have a look at the limitations of the Redshift COPY function to convert JSON data into Redshift tables
- Amazon Redshift does not support any JSONPath elements, such as wildcard characters or filter expressions, that might resolve to an ambiguous path or multiple name elements.
- By default the COPY function only parses first-level JSON data structures to columns in target table by matching each name. Hence multi-level structures are considered as single column strings while loading. You will still need to use the limited number of SQL JSON extensions that Redshift provides to further parse out your data.
- The Redshift table schema is case insensitive, while the COPY command check is case sensitive. Only exact matching works.
- Amazon Redshift can’t parse complex, multi-level data structures or arrays of JSON files.
- After parsing and loading the first-level of data, Redshift offers multiple JSON-SQL functions to parse nested lower level data.
- We can import JSON directly into Redshift if we have a table structure that matches. This requires a lot of analysis and manual DDL.
- JSON fields can only be stored as string data types. The MAX setting defines the width of the column as 4096 bytes for CHAR or 65535 bytes for VARCHAR. An attempt to store a longer string into a column of these types results in an error.
SQL-JSON Functions
Redshift does not have a native JSON data type like Snowflake or other data warehouse platforms, e.g. we can not load a JSON document into Redshift as a LOB. Each document must be parsed into a single column and can be manipulated and queried with the help of JSON-SQL functions offered in Redshift.
JSON Functions
Redshift offers a limited number of JSON functions that can be used directly from SQL. Let’s load some sample JSON data and go through these functions in detail.
IS_VALID_JSON Function
IS_VALID_JSON_ARRAY Function
JSON_ARRAY_LENGTH Function
JSON_EXTRACT_ARRAY_ELEMENT_TEXT Function
JSON_EXTRACT_PATH_TEXT Function
Have a look at this sample data from the traffic.csv file which includes JSON data in the trafficsource field.
channelgrouping | trafficsource |
---|---|
Organic Search | {“campaign”: “A”, “source”: “google”, “medium”: “organic”, “keyword”: “water bottle”, “adwordsClickInfo”: {“criteriaParameters”: “demo dataset”}} |
Direct | {“campaign”: “B”, “source”: “(direct)”, “medium”: “(none)”, “adwordsClickInfo”: {“criteriaParameters”: “demo dataset”}, “isTrueDirect”: true} |
Organic Search | {“campaign”: “C”, “source”: “google”, “medium”: “organic”, “keyword”: “(not provided)”, “adwordsClickInfo”: {“criteriaParameters”: “demo dataset”}} |
Referral | {“referralPath”: “/offer/2145”, “campaign”: “A”, “source”: “(direct)”, “medium”: “(none)”, “adwordsClickInfo”: {“criteriaParameters”: “demo dataset”}} |
Referral | [“a”,[“b”,1,[“c”,2,3,null]]] |
Prerequisite Table in Redshift-
1 2 3 4 5 6 7 8 9 |
create table traffic ( channelgrouping varchar(20), trafficsource varchar(max) --to load JSON data ); copy traffic --redshift code from 's3://mybucket/traffic.csv iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' CSV ; |
The result of select * from.”traffic” ; will appear same as csv file.
IS_VALID_JSON
Validates a JSON string. The function returns Boolean true (t) if the string is properly formed JSON or false (f) if the string is malformed.
1 2 |
select trafficSource,is_valid_json(trafficSource) from traffic limit 1; |
trafficsource | validate_json |
---|---|
{“campaign”: “A”, “source”: “google”, “medium”: “organic”, “keyword”: “water bottle”, “adwordsClickInfo”: {“criteriaParameters”: “demo dataset”}} | true |
IS_VALID_JSON_ARRAY
Validates a JSON array. The function returns Boolean true (t) if the array is properly formed JSON or false (f) if the array is malformed.
1 2 |
select trafficsource, is_valid_json_array(trafficsource) as validate_json_array from traffic limit 2; |
trafficsource | validate_json_array |
---|---|
{“referralPath”: “/offer/2145”, “source”: “(direct)”, “medium”: “(none)”, “adwordsClickInfo”: {“criteriaParameters”: “demo dataset”}} | false |
[“a”,[“b”,1,[“c”,2,3,null]]] | true |
JSON_ARRAY_LENGTH
Returns the number of elements in the outer array of a JSON string. If the null_if_invalid argument is set to true and the JSON string is invalid, the function returns NULL instead of returning an error.
The following example sets null_if_invalid to true, so the statement the returns NULL instead of returning an error for invalid JSON.
1 |
select json_array_length(trafficsource,true) from traffic ; |
json_array_length |
---|
NULL |
NULL |
NULL |
NULL |
2 |
JSON_EXTRACT_ARRAY_ELEMENT_TEXT
Returns a JSON array element in the outermost array of a JSON string, using a zero-based index. The first element in an array is at position 0. If the index is negative or out of bound, JSON_EXTRACT_ARRAY_ELEMENT_TEXT returns empty string. If the null_if_invalid argument is set to true and the JSON string is invalid, the function returns NULL instead of returning an error.
1 |
select json_extract_array_element_text(trafficsource,1,true) from traffic ; |
json_extract_array_element_text |
---|
NULL |
NULL |
NULL |
NULL |
[“b”,1,[“c”,2,3,null]] |
JSON_EXTRACT_PATH_TEXT
Returns the value for the key:value pair referenced by a series of path elements in a JSON string. The JSON path can be nested up to five levels deep. Path elements are case-sensitive. If a path element does not exist in the JSON string, JSON_EXTRACT_PATH_TEXT returns an empty string. If the null_if_invalid argument is set to true and the JSON string is invalid, the function returns NULL instead of returning an error.
1 |
select channelgrouping, json_extract_path_text (trafficsource,'campaign',true) As campaign from traffic ; |
channelgrouping | campaign |
---|---|
Organic Search | A |
Direct | B |
Organic Search | C |
Referral | A |
Referral | NULL |
Limitations of JSON Functions :
- Redshift checks for valid JSON text every time it execute any JSON function on JSON text field. This actually increase execution time of JSON functions.
- Redshift is not a good choice for storing large JSON data sets because, by storing disparate data in a single column, JSON does not leverage Amazon Redshift’s column store architecture.
- JSON uses UTF-8 encoded text strings, so JSON strings can only be stored as CHAR or VARCHAR data types. VARCHAR is better if the strings include multi-byte characters. Amazon Redshift doesn’t support VARCHAR data types larger than 64 KB, so we can’t store traditional LOBs on Amazon Redshift.
- There is not RedShift JSON functions that can expand an array record from one row into one row for each array element. This makes it very inconvenient or impossible to convert JSON to a relational structure with tables, rows, and columns.
Redshift and the ANSI 2016 SQL JSON extensions
The ANSI 2016 SQL standard introduced 44 optional extensions. 22 of these extensions are in relation to working with JSON documents inside a database.
Let’s have a look which of these functions are supported by Redshift.
Here we are discussing features does not support by Redshift described by the 2016 SQL standard.
SQL/JSON functions are partitioned into two groups: constructor functions (JSON_OBJECT, JSON_OBJECT_AGG, JSON_ARRAY, and JSON_ARRAYAGG ) and query functions (JSON_VALUE, JSON_TABLE, JSON_EXISTS, and JSON_QUERY).
Constructor functions use values of SQL types and produce JSON values (JSON objects or JSON arrays) represented in SQL character or binary string types. Query functions evaluate SQL/JSON path language expressions against JSON values, producing values of SQL/JSON types, which are converted to SQL types. .
JSON Query function – JSON_TABLE
JSON_TABLE is a function that takes JSON data as input and generates relational data for valid input data. This function also allows unnesting of (even deeply) nested JSON objects/arrays in one invocation rather than chaining several JSON_TABLE expressions in the SQL-statement.
JSON constructor functions
Redshift doesn’t support constructing JSON data from SQL tables. SQL applications working with JSON data will often need to construct new JSON objects or arrays, either for use within the applications themselves, for storage in the SQL database, or to return to the application itself. This section describe all built-in function to generate JSON from explicit table data.
JSON_OBJECT that constructs JSON objects from explicit name/value pairs.
JSON_OBJECTAGG that constructs JSON object as an aggregation of information from SQL table.
JSON_ARRAY that constructs JSON array from an explicit list of SQL data.
JSON_ARRAYAGG that constructs JSON array as an aggregation of information from SQL table.
IS JSON predicate determines whether the value of a specified string does or does not conform to the structural rules for JSON.
JSON Path language elements and filters
The SQL/JSON path language is a query language used by certain SQL operators (JSON_VALUE,JSON_QUERY, JSON_TABLE and JSON_EXISTS, collectively known as the SQL/JSON query operators) to query JSON text. The SQL/JSON path language is not, strictly speaking, SQL, though it is embedded in these operators within SQL.
Amazon Redshift does not support any JSONPath elements, such as wildcard characters or filter expressions, that might resolve to an ambiguous path or multiple name elements. As a result, Amazon Redshift can’t parse complex, multi-level data structures.
Converting Trello JSON to Redshift
Let’s have a look at how Redshift handles conversion of relatively complex JSON documents.
We will take some JSON data from Trello, a popular SaaS solution for managing projects. You can export your data as JSON files. We believe that Trello is using a NoSQL database such as MongoDB or similar to store their data.
Let’s have a look at some of the features of the exported JSON document.
File type: JSON multi-level nested objects
Size : 7 MB
Number of First level keys : 30 Fields.
The JSON document contains business keys
Unfortunately, we won’t be able to parse this JSON file into Redshift with native functionality. First of all it exceeds the maximum allowed size of 64 KB in Redshift. Secondly, it also contains multi-level nested data, which makes it very hard to convert with the limited support of JSON features in Redshift SQL.
Alternatives
One alternative solution to convert this data into Redshift is to use the JSON features of Redshift Spectrum. Redshift Spectrum does not have the limitations of the native Redshift SQL extensions for JSON. This approach works reasonably well for simple JSON documents. However, it gets difficult and very time consuming for more complex JSON data such as the one found in the Trello JSON.
For those scenarios we have created Flexter. Flexter is a data warehouse automation tool to convert JSON (or XML) to a relational database such as Redshift, Hadoop formats (ORC, Parquet, Avro) or simple text files. No coding is required and Flexter fully automates the process.
We have written up a separate post where we automatically convert Trello JSON to Redshift.
If you have any questions please refer to the Flexter FAQ section. You can also request a demo of Flexter.