Working with JSON in Redshift: Options, limitations, and alternatives

Uli Bethke

Redshift offers limited support for working with JSON documents. There are two options for loading JSON data into Redshift.

  • We can convert JSON to a relational model when loading the data to Redshift (COPY JSON functions). This requires us to pre-create the relational target data model and to manually map the JSON elements to the target table columns. There are some serious limitations to this approach. We will cover them in detail in this post.
  • We can load whole JSON documents into Redshift and transform and query them with the JSON-SQL functions offered in Redshift. This has other limitations which will also be covered in this post.

COPY from JSON Format

The COPY command loads data into Redshift tables from JSON data files in an S3 bucket or on a remote host accessed via SSH. We can automatically COPY fields from the JSON file by specifying the 'auto' option, or we can specify a JSONPaths file. This is a mapping document that COPY will use to map and parse the JSON source data into the target.

Copy From JSON using the ‘auto’ option

To load a JSON document with the ‘auto’ option, the first-level element names in the JSON file must exactly match the table column names. The order of the elements in the JSON file does not matter (matching is done by name). The following shows the contents of a file named customer.json:

{
    "name": "Major",
    "age": 19,
    "address": {"street": {"st": "5 maint st", "ci": "Dub"}, "city": "Dublin"},
    "catname": ["MLB", "GBM"]
}

For auto COPY to work we need to create the following table structure in Redshift (the column types and widths shown are indicative; nested objects and arrays have to be stored as strings):

create table customer
(
    name varchar(20),
    age int,
    address varchar(1024),
    catname varchar(256)
);
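With the table in place, the load itself is a single COPY statement. A minimal sketch, assuming the file sits in an S3 bucket named mybucket and a suitable IAM role is attached to the cluster (the bucket name and role ARN are placeholders):

copy customer
from 's3://mybucket/customer.json'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
json 'auto';

The result looks as follows: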

name age address catname
Major 19 {"street":{"st":"5 maint st","ci":"Dub"},"city":"Dublin"} ["MLB","GBM"]

Note:

The COPY command only parses first-level JSON data structures into columns of the target table by matching each name. Multi-level structures and arrays are therefore loaded as single string columns. This is a serious limitation, but more on this in a moment.

Copy From JSON using the ‘JSONPaths file’ option

If the JSON data objects don’t directly match the Redshift column names, we use a JSONPaths file to map the JSON elements to table columns. The order of the elements in the JSON file still does not matter, but the order of the JSONPath expressions in the file must match the column order of the target table. Consider the same customer.json example. This time the field names in the JSON file differ from the table column names in Redshift.

Table in Redshift (with non-matching column names)

create table customer
(
    name varchar(20),
    age int,
    address varchar(1024),
    catname varchar(256)
);

JSONPaths file jpath.json:
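A possible jpath.json, assuming the source file uses the field names c_name, c_age, c_address and c_catname (these names are illustrative; the actual file would use whatever names appear in the source JSON). Each expression maps to the table column in the same ordinal position:

{
    "jsonpaths": [
        "$['c_name']",
        "$['c_age']",
        "$['c_address']",
        "$['c_catname']"
    ]
}

The COPY command then references this file instead of the 'auto' keyword (bucket name and role ARN are again placeholders):

copy customer
from 's3://mybucket/customer.json'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
json 's3://mybucket/jpath.json';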

name age address catname
Major 19 {"street":{"st":"5 maint st","ci":"Dub"},"city":"Dublin"} ["MLB","GBM"]

Note:

In this example the JSON data objects don’t correspond directly to column names, so a JSONPaths file maps the JSON elements to the columns. Matching against the JSONPath expressions is exact and case-sensitive, so elements whose case differs from the path expression are simply ignored.

Limitations of Copy to convert JSON in Redshift

Let’s have a look at the limitations of the Redshift COPY command when converting JSON data into Redshift tables:

  • Amazon Redshift does not support any JSONPath elements, such as wildcard characters or filter expressions, that might resolve to an ambiguous path or multiple name elements.
  • By default, COPY only parses first-level JSON data structures into columns of the target table by matching each name. Multi-level structures are loaded as single column strings, and you will still need the limited set of SQL JSON functions that Redshift provides to further parse out your data.
  • The Redshift table schema is case insensitive, while the COPY command check is case sensitive. Only exact matching works.
  • Amazon Redshift can't parse complex, multi-level data structures or arrays of JSON files.
  • After parsing and loading the first level of data, Redshift offers multiple JSON-SQL functions to parse nested, lower-level data.
  • We can import JSON directly into Redshift if we have a table structure that matches. This requires a lot of analysis and manual DDL.
  • JSON fields can only be stored as string data types. The MAX setting defines the width of the column as 4096 bytes for CHAR or 65535 bytes for VARCHAR. An attempt to store a longer string into a column of these types results in an error.

SQL-JSON Functions

Redshift does not have a native JSON data type like Snowflake or other data warehouse platforms, i.e. we cannot load a JSON document into Redshift as a LOB. Each document must be stored in a single string column and can be manipulated and queried with the help of the JSON-SQL functions offered in Redshift.

JSON Functions

Redshift offers a limited number of JSON functions that can be used directly from SQL. Let’s load some sample JSON data and go through these functions in detail.

  • IS_VALID_JSON
  • IS_VALID_JSON_ARRAY
  • JSON_ARRAY_LENGTH
  • JSON_EXTRACT_ARRAY_ELEMENT_TEXT
  • JSON_EXTRACT_PATH_TEXT

Have a look at this sample data from the traffic.csv file which includes JSON data in the trafficsource field.

channelgrouping trafficsource
Organic Search {"campaign": "A", "source": "google", "medium": "organic", "keyword": "water bottle", "adwordsClickInfo": {"criteriaParameters": "demo dataset"}}
Direct {"campaign": "B", "source": "(direct)", "medium": "(none)", "adwordsClickInfo": {"criteriaParameters": "demo dataset"}, "isTrueDirect": true}
Organic Search {"campaign": "C", "source": "google", "medium": "organic", "keyword": "(not provided)", "adwordsClickInfo": {"criteriaParameters": "demo dataset"}}
Referral {"referralPath": "/offer/2145", "campaign": "A", "source": "(direct)", "medium": "(none)", "adwordsClickInfo": {"criteriaParameters": "demo dataset"}}
Referral ["a",["b",1,["c",2,3,null]]]

Prerequisite table in Redshift (the column types are indicative; trafficsource must be a VARCHAR wide enough for the longest JSON document):

create table traffic
(
    channelgrouping varchar(50),
    trafficsource varchar(5000)
);

Note: The result of select * from traffic; will look the same as the CSV file above.

IS_VALID_JSON

Validates a JSON string. The function returns Boolean true (t) if the string is properly formed JSON or false (f) if the string is malformed.

select trafficsource, is_valid_json(trafficsource) as validate_json
from traffic limit 1;

trafficsource validate_json
{"campaign": "A", "source": "google", "medium": "organic", "keyword": "water bottle", "adwordsClickInfo": {"criteriaParameters": "demo dataset"}} true

IS_VALID_JSON_ARRAY

Validates a JSON array. The function returns Boolean true (t) if the string is a properly formed JSON array or false (f) if it is not.

select trafficsource, is_valid_json_array(trafficsource) as validate_json_array

from traffic limit 2;

trafficsource validate_json_array
{"referralPath": "/offer/2145", "source": "(direct)", "medium": "(none)", "adwordsClickInfo": {"criteriaParameters": "demo dataset"}} false
["a",["b",1,["c",2,3,null]]] true

JSON_ARRAY_LENGTH

Returns the number of elements in the outer array of a JSON string. If the null_if_invalid argument is set to true and the JSON string is invalid, the function returns NULL instead of returning an error.

The following example sets null_if_invalid to true, so the statement returns NULL instead of an error for invalid JSON.

select json_array_length(trafficsource, true) from traffic;

json_array_length
NULL
NULL
NULL
NULL
2

JSON_EXTRACT_ARRAY_ELEMENT_TEXT

Returns a JSON array element in the outermost array of a JSON string, using a zero-based index. The first element in an array is at position 0. If the index is negative or out of bounds, JSON_EXTRACT_ARRAY_ELEMENT_TEXT returns an empty string. If the null_if_invalid argument is set to true and the JSON string is invalid, the function returns NULL instead of returning an error.

select json_extract_array_element_text(trafficsource, 1, true) from traffic;

json_extract_array_element_text
NULL
NULL
NULL
NULL
["b",1,["c",2,3,null]]

JSON_EXTRACT_PATH_TEXT

Returns the value for the key:value pair referenced by a series of path elements in a JSON string. The JSON path can be nested up to five levels deep. Path elements are case-sensitive. If a path element does not exist in the JSON string, JSON_EXTRACT_PATH_TEXT returns an empty string. If the null_if_invalid argument is set to true and the JSON string is invalid, the function returns NULL instead of returning an error.

select channelgrouping, json_extract_path_text(trafficsource, 'campaign', true) as campaign from traffic;

channelgrouping campaign
Organic Search A
Direct B
Organic Search C
Referral A
Referral NULL
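Path elements can also be chained to reach nested values. As a further example on the same data, the following pulls criteriaParameters out of the nested adwordsClickInfo object:

select channelgrouping,
       json_extract_path_text(trafficsource, 'adwordsClickInfo', 'criteriaParameters', true) as criteria
from traffic;

Each of the four JSON object rows returns demo dataset, while the array row returns NULL, just as in the campaign example.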

Limitations of JSON Functions :

  • Redshift checks for valid JSON every time it executes a JSON function on a JSON text field, which increases the execution time of these functions.
  • Redshift is not a good choice for storing large JSON data sets because, by storing disparate data in a single column, JSON does not leverage Amazon Redshift’s column store architecture.
  • JSON uses UTF-8 encoded text strings, so JSON strings can only be stored as CHAR or VARCHAR data types. VARCHAR is better if the strings include multi-byte characters. Amazon Redshift doesn't support VARCHAR data types larger than 64 KB, so we can't store traditional LOBs on Amazon Redshift.
  • There is no Redshift JSON function that can expand an array from one row into one row per array element. This makes it very inconvenient or impossible to convert JSON to a relational structure with tables, rows, and columns.

Redshift and the ANSI 2016 SQL JSON extensions

The ANSI 2016 SQL standard introduced 44 optional extensions, 22 of which relate to working with JSON documents inside a database.

Let’s have a look at which of these functions are supported by Redshift. Below we discuss the features described by the 2016 SQL standard that Redshift does not support.

SQL/JSON functions are partitioned into two groups: constructor functions (JSON_OBJECT, JSON_OBJECTAGG, JSON_ARRAY, and JSON_ARRAYAGG) and query functions (JSON_VALUE, JSON_TABLE, JSON_EXISTS, and JSON_QUERY).

Constructor functions use values of SQL types and produce JSON values (JSON objects or JSON arrays) represented in SQL character or binary string types. Query functions evaluate SQL/JSON path language expressions against JSON values, producing values of SQL/JSON types, which are converted to SQL types.

JSON Query function - JSON_TABLE

JSON_TABLE is a function that takes JSON data as input and generates relational data for valid input data. It also allows unnesting of (even deeply) nested JSON objects and arrays in one invocation rather than chaining several JSON_TABLE expressions in the SQL statement.
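Redshift does not implement JSON_TABLE, but to illustrate the idea, the standard syntax applied to the traffic table from earlier would look roughly like this (the column list and types are illustrative):

select t.campaign, t.source, t.medium
from traffic,
     json_table(trafficsource, '$'
         columns (
             campaign varchar(10) path '$.campaign',
             source   varchar(20) path '$.source',
             medium   varchar(20) path '$.medium'
         )
     ) as t;

Each input document is turned into one or more relational rows, with NULLs where a path does not match.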

JSON constructor functions

Redshift doesn’t support constructing JSON data from SQL tables. SQL applications working with JSON data often need to construct new JSON objects or arrays, whether for use within the application itself, for storage in the SQL database, or to return to a caller. This section describes the built-in functions the standard defines for generating JSON from explicit table data.

  • JSON_OBJECT constructs a JSON object from explicit name/value pairs.
  • JSON_OBJECTAGG constructs a JSON object as an aggregation of information from an SQL table.
  • JSON_ARRAY constructs a JSON array from an explicit list of SQL data.
  • JSON_ARRAYAGG constructs a JSON array as an aggregation of information from an SQL table.

IS JSON predicate determines whether the value of a specified string does or does not conform to the structural rules for JSON.
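To illustrate the standard syntax (none of this runs on Redshift), the constructor functions and the IS JSON predicate applied to the example tables from this post would look roughly like this:

select json_object('name' value name, 'age' value age) from customer;
select json_objectagg(name value age) from customer;
select json_array('MLB', 'GBM');
select json_arrayagg(name order by name) from customer;
select * from traffic where trafficsource is json;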

JSON Path language elements and filters

The SQL/JSON path language is a query language used by certain SQL operators (JSON_VALUE,JSON_QUERY, JSON_TABLE and JSON_EXISTS, collectively known as the SQL/JSON query operators) to query JSON text. The SQL/JSON path language is not, strictly speaking, SQL, though it is embedded in these operators within SQL.

Amazon Redshift does not support any JSONPath elements, such as wildcard characters or filter expressions, that might resolve to an ambiguous path or multiple name elements. As a result, Amazon Redshift can't parse complex, multi-level data structures.
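In practice this means a JSONPaths file for COPY may only contain simple, unambiguous paths in dot or bracket notation, while wildcard and filter expressions from the full path language are rejected. For example (the field names are taken from the customer.json example; the filter expression is purely illustrative):

"$['address']['street']['st']"   -- supported: explicit bracket notation
"$.address.street.st"            -- supported: dot notation
"$.catname[*]"                   -- not supported: wildcard
"$.catname[?(@ == 'MLB')]"       -- not supported: filter expression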

Converting Trello JSON to Redshift

Let’s have a look at how Redshift handles conversion of relatively complex JSON documents.

We will take some JSON data from Trello, a popular SaaS solution for managing projects. You can export your data as JSON files. We believe that Trello is using a NoSQL database such as MongoDB or similar to store their data.

Let’s have a look at some of the features of the exported JSON document.

  • File type: JSON with multi-level nested objects
  • Size: 7 MB
  • Number of first-level keys: 30
  • The JSON document contains business keys

Unfortunately, we won’t be able to parse this JSON file into Redshift with native functionality. First of all, it exceeds the maximum allowed VARCHAR size of 64 KB in Redshift. Secondly, it contains multi-level nested data, which makes it very hard to convert with the limited JSON support in Redshift SQL.

Alternatives

One alternative solution to convert this data into Redshift is to use the JSON features of Redshift Spectrum. Redshift Spectrum does not have the limitations of the native Redshift SQL extensions for JSON. This approach works reasonably well for simple JSON documents. However, it gets difficult and very time consuming for more complex JSON data such as the one found in the Trello JSON.
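As a sketch of the Spectrum route (the external schema spectrum is assumed to exist already, and the bucket, role and SerDe settings are placeholders), a simple JSON document can be exposed as an external table and then queried with ordinary SQL:

create external table spectrum.customer_json (
    name varchar(20),
    age int
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
location 's3://mybucket/customer/';

select name, age from spectrum.customer_json;

The SerDe maps first-level JSON keys to columns at query time, so no COPY step is needed; nested structures, however, still require careful column definitions.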

For those scenarios we have created Flexter. Flexter is a data warehouse automation tool to convert JSON (or XML) to a relational database such as Redshift, Hadoop formats (ORC, Parquet, Avro) or simple text files. No coding is required and Flexter fully automates the process.

We have written up a separate post where we automatically convert Trello JSON to Redshift.

If you have any questions please refer to the Flexter FAQ section. You can also request a demo of Flexter.

 

About the author

Uli Bethke LinkedIn Profile

Uli has 18 years’ hands on experience as a consultant, architect, and manager in the data industry. He frequently speaks at conferences. Uli has architected and delivered data warehouses in Europe, North America, and South East Asia. He is a traveler between the worlds of traditional data warehousing and big data technologies.

Uli is a regular contributor to blogs and books and chairs the Hadoop User Group Ireland. He is also a co-founder and VP of the Irish chapter of DAMA, a not-for-profit global data management organization. He co-founded the Irish Oracle Big Data User Group.