SQL Challenge: Tackling the Island Problem in Time Series Data

Published on November 7, 2023
Updated on May 31, 2024

Jump into our SQL Challenge! Today, we’re exploring customer loyalty trends. Are you prepared to show off your SQL skills?

About the quiz

In this SQL pop quiz, we’re challenging you to demonstrate your understanding of islands in a time series data set. Whoever submits the correct and most performant query will be featured in our social media posts. They will also receive a mention and a link to their website in the blog post where we showcase the best solution.

We provide a basic solution to the problem. Try and beat it!

We are looking for a correct and optimized query for Snowflake. We will be using the TPCH_SF100 sample dataset that Snowflake provides to all accounts.

The most performant solution is based on the cost of the query. The cost is determined by the number of credits used. An example on how we calculate query cost is in the FAQ section at the end of the post.

Islands in time series explained

In the context of a time series of data, “islands” refer to contiguous sequences of data points that share a common attribute or condition. In simpler terms, Islands can be thought of as ranges comprising consecutive values without values being interrupted.

Some of the actual real life use cases for islands are:

Financial data analysis:

  • Stock Price Analysis: An investor analyzes a company’s stock price over the past year and identifies a continuous three-month period where the stock price has been steadily increasing. This “island” of growth may indicate a positive market sentiment or successful company initiatives, prompting the investor to consider buying the stock.
  • Revenue Trends: A business analyst detects an “island” of declining revenue over consecutive quarters. This pattern signals a need for the company to investigate potential causes, such as market changes or internal issues, and develop strategies to reverse the trend.

Customer Behavior Analysis:

  • Loyalty Program Engagement: A retail store analyzes their loyalty program data and identifies customers who have made purchases every month for a year. Recognizing this “island” of consistent activity, the store might offer these loyal customers special rewards or exclusive deals.
  • Subscription Service Renewal: A streaming service spots customers who have consistently renewed their subscriptions month after month without interruption. This pattern of loyalty can inform targeted marketing campaigns or loyalty perks to retain these subscribers.

Dataset Overview

We will be using the TPCH_SF100 sample dataset that Snowflake makes available for all accounts.

The TPCH_SF100 sample dataset is a rich source of transaction processing data, with an array of tables and interrelationships. Our focus for this challenge is the orders table, which records details of customer orders, including the date each was placed.

Problem

Your goal is to identify customers who have placed orders at least once every month for at least three consecutive months (3 months in a row) and count the number of such continuous ordering islands.

Example 

In the figure below we can see two order islands for the customer with o_custkey 7.

Expected Output

Solution

We have come up with this solution. Try to beat it:

We ran our query 3 times, and these are our results.

Run

Execution time

VWH size

Cost

1st

40s

XS

0,088$

2nd

40s

XS

0,088$

3rd

39s

XS

0,087$
* Formula: <query execution time as hours> × VWH size credits × <credit price>

Here is the query profile plan of our best run:

How to submit

  • Ensure you have a Snowflake account (register for a free trial, you can check this PDF to see how to register).
  • Come up with the correct solution. Try to beat the query cost of the solution we provide.
  • Before you submit your query register for FlowHigh to check for common SQL antipatterns and coding issues. Bonus: There is a performance anti pattern in the solution we provide. Can you find it with FlowHigh?

Sample screenshot of detecting SQL antipatterns with FlowHigh

  • Submit your query through our Google form. Include your name, email, the query, warehouse size you used, and a query profile of your best run as a screenshot.

FAQ

How do you calculate the cost of a query?

Suppose you have a Medium virtual warehouse that consumes 4 credits per hour. Your SQL query ran for 90 seconds, and the cost per credit is $2 (assuming this for the example. Snowflake credits vary by edition and region). For simplicity reasons we ignore the Snowflake minimum billing of 1 minute per query to calculate cost.

  • Virtual Warehouse Size: Medium (4 credits/hour)
  • Query Execution Time: 90 seconds
  • Cost per Credit: $2

First, convert the execution time to hours:

90seconds/3600seconds/hour=0.025 hours

Next, compute the credit consumption:

0.025 hours×4 credits/hour=0.1 credits

Finally, calculate the cost:

0.1 credits×$2/credit=$0.20

So, the cost of the query is $0.20.

Can I add Cluster Keys or Search Optimisation Service?

Yes, you can add cluster keys and secondary indexes (aka SOS) to the table.

You will need to create a copy of the provided sample.

Can I choose the size of the virtual warehouse?

Yes. We compare the query cost of the result and you can choose the size of the virtual warehouse you use.

Any other questions?

Contact us for any issues or questions