How to do full text search with Snowflake, Inverted Indexes and Snowpark
Inverted Index
An Inverted Index is a data structure used to support full text search over a set of documents.
An inverted index is very similar to the index at the back of a book: an alphabetically sorted list of the core keywords, each with the list of pages where the keyword occurs.
Some databases such as PostgreSQL ship with dedicated features to create an inverted index for full text search. Other databases such as Snowflake do not (yet) have such a feature (even though Snowflake recently improved string matching performance by a factor of 2x). For those databases without full text search we can build our own inverted index, and this blog post shows you how it is done in true DIY fashion.
We create our own inverted index by taking a table column with a lot of text and applying various text analytics techniques such as stop word removal, stemming etc.
The result is a new table with one entry per word/term across all the processed text documents, along with a list of pairs of document id and term frequency (how many times the term appears in the document).
In this example we will use Snowpark. Snowpark ships with Snowflake and lets you create custom UDFs in Python, Scala, or Java.
The LIKE ‘%phrase%’ anti-pattern. When should you use an inverted index?
When would you use or create an inverted index? As with most scenarios the answer is: it depends. For small data volumes and infrequent access you should use the LIKE operator, or for more complex pattern matching scenarios the REGEXP operator.
If you store large volumes of text data and require concurrent access by many users, an inverted index might be a good fit, in particular if you experience slow query performance and a general performance degradation of the database when users run full text searches with LIKE or REGEXP.
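To make the anti-pattern concrete, it typically looks like the statement below: a search term wrapped in wildcards against a large free-text column. The table and column names are invented for illustration, and the snippet assumes a Snowpark session like the one created later in this post.

```python
# Hypothetical example of the LIKE '%phrase%' anti-pattern (table/column names are made up).
# The leading wildcard prevents any pruning, so every row of the text column is scanned.
results = session.sql("""
    SELECT id, text
    FROM documents
    WHERE text LIKE '%automobile%'
       OR text REGEXP '.*environment.*'
""").collect()
```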
It is important to understand your search requirements before building an inverted index, e.g. for defining the rules for data cleansing, stemming, stop word removal etc. as part of the build process.
You also need to take into account that building and maintaining an inverted index comes at a cost, e.g. each time you add a new document you also need to update your index.
For complex requirements it is better to rely on a dedicated text search engine such as Apache Solr instead of building your own inverted index. Some database vendors also offer full text search as a feature; under the hood these RDBMSs build their own inverted index.
How is an inverted index constructed?
Let’s imagine we have two documents with different texts:
- A car is a great automobile, I like cars.
- Cars are bad for the environment.
In order to create the inverted index, each text is sliced into units, or terms. The rule is to use whitespace as the natural separator between words, though this can be changed.
For each term there is a list of pairs (document id, frequency), recording the ID of each document where the term is found and the number of times the term appears in that text.
Therefore, the Inverted Index after processing the previous two documents would be:
Term | Appearances (DocId, Frequency) |
---|---|
automobile | (1,1) |
bad | (2,1) |
car | (1,2) (2,1) |
environment | (2,1) |
great | (1,1) |
like | (1,1) |
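To make the construction concrete, here is a minimal sketch in plain Python (no Snowflake involved) that builds the structure above for the two example documents. The tokenizer is deliberately crude: lowercase, keep letters only, drop a handful of stop words and strip a trailing ‘s’; the real pipeline later in this post uses nltk for proper lemmatization and stemming.

```python
import re
from collections import defaultdict

docs = {
    1: "A car is a great automobile, I like cars.",
    2: "Cars are bad for the environment.",
}

stop_words = {"a", "is", "i", "are", "for", "the"}

def simple_tokenize(text):
    # Lowercase, keep letters only, drop stop words and crudely strip a plural 's'
    words = re.findall(r"[a-z]+", text.lower())
    return [w.rstrip("s") for w in words if w not in stop_words]

# term -> {doc_id: frequency}
inverted_index = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for term in simple_tokenize(text):
        inverted_index[term][doc_id] += 1

print(dict(inverted_index["car"]))   # {1: 2, 2: 1}
```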
How do you detect the anti-pattern?
You will need to identify SQL statements that use the LIKE operator or regular expressions to perform free text search.
Most databases log the history of executed SQL statements in an information schema. You can use our product FlowHigh to parse this SQL and identify the queries that use LIKE and REGEXP operations, together with the tables and columns they use for filtering and free text search. Columns that are queried frequently and have performance problems become candidates for an inverted index.
With FlowHigh you can parse, visualise, optimise, and format SQL. You can also detect SQL anti-patterns that violate SQL best practices, such as self joins, implicit cross joins etc.
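If you want to check manually instead, a rough first pass is to scan Snowflake’s query history for wildcard searches. The sketch below uses the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view and the Snowpark session created later in this post; the 30-day window and the result limit are arbitrary choices.

```python
# Rough scan of the query history for statements containing a leading-wildcard LIKE.
# Assumes the Snowpark `session` created later in this post.
slow_like_queries = session.sql("""
    SELECT query_text,
           total_elapsed_time / 1000 AS elapsed_seconds
    FROM snowflake.account_usage.query_history
    WHERE CONTAINS(UPPER(query_text), 'LIKE ''%')
      AND start_time > DATEADD(day, -30, CURRENT_TIMESTAMP())
    ORDER BY total_elapsed_time DESC
    LIMIT 50
""").collect()
```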
Performance of an inverted index
How does the performance of an inverted index compare to using the LIKE operator? Let’s return to our example of the book index. Using the LIKE operator is similar to reading and scanning the entire book end to end instead of looking up the index at the back of the book to find the pages for a keyword.
When you use the LIKE operator the database needs to read all of the records from the table column that contains the free text (a full table scan). Free text can be large, does not compress well, and storage indexes or sort keys do not help either (apart from some edge scenarios). So using the LIKE operator requires a lot of table scanning and network traffic. The LIKE operation is also very CPU intensive as it needs to match the search term against the full text. There are various algorithms that help with this, but it is nevertheless an expensive operation.
Nothing in life is free, and an inverted index also comes at a cost: it needs to be updated each time you add a new document. For scenarios where you don’t have a near real time requirement you can update the inverted index as part of a batch process.
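To make the difference concrete, the two access patterns are sketched below. The first statement scans and pattern-matches every row of the free-text column; the second only reads the index rows for one term and then fetches the matching documents. The table names are illustrative, and the index here is assumed to be stored as one row per term and document, a slightly flattened variant of the table built later in this post.

```python
# Full text search with LIKE: the whole free-text column is scanned and matched.
session.sql("""
    SELECT id, text
    FROM documents
    WHERE text LIKE '%car%'
""").collect()

# Full text search with an inverted index: look up one term, then join back to the documents.
# Assumes an index table with one row per (term, doc_id, frequency).
session.sql("""
    SELECT d.id, d.text, i.frequency
    FROM inverted_index i
    JOIN documents d ON d.id = i.doc_id
    WHERE i.term = 'car'
""").collect()
```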
Creating an inverted index on Snowflake
Let’s see how an inverted index can be built on Snowflake using Snowpark.
The starting point is to create a Snowpark session.
```python
from snowflake.snowpark import Session

connection_parameters = {
    "account": "******",
    "user": "******",
    "password": "******",
    "role": "******",
    "warehouse": "******",
    "database": "******",
    "schema": "******",
}

session = Session.builder.configs(connection_parameters).create()
```
Now create the functions required for building the inverted index. First, the function for tokenizing text.
```python
def tokenize(text) -> list:
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer, SnowballStemmer
    import unicodedata2
    import regex as re

    # https://godatadriven.com/blog/handling-encoding-issues-with-unicode-normalisation-in-python/
    # Normalisation Form Compatibility Decomposition
    text = unicodedata2.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    text = text.lower()

    # Substitute 's, email addresses, website addresses, html tags and new lines with a space
    cleansed1 = re.sub(r"'s\b|\w+@\w+\.\w+|\w+://\S+|<[^<]+?>|'[\n\r]'", ' ', text)
    # Substitute all unwanted special characters with a space
    cleansed2 = re.sub("[^A-Za-z' ]+", ' ', cleansed1)
    # Substitute apostrophes at the beginning and end of words, roman numerals and single characters with a space
    cleansed3 = re.sub(r"(?<=\s)\'+(?=\w*)|(?<=\S)\'+(?=\s+)|(?<=\s)[ivxlcdm]+(?=\s)|(?<=\s)\w(?=\s)|^\w(?=\s)|(?<=\s)\w$", ' ', cleansed2)
    # Remove all remaining apostrophes
    cleansed4 = re.sub(r"\'", '', cleansed3)

    # Split the text into tokens
    words = list(cleansed4.split())

    # Lemmatization using the nltk WordNetLemmatizer
    # Try each part of speech in turn and keep the first lemma that differs from the original word
    wordnet_lemmatizer = WordNetLemmatizer()
    lemmatized = [
        wordnet_lemmatizer.lemmatize(word, pos="n") if word != wordnet_lemmatizer.lemmatize(word, pos="n")
        else wordnet_lemmatizer.lemmatize(word, pos="a") if word != wordnet_lemmatizer.lemmatize(word, pos="a")
        else wordnet_lemmatizer.lemmatize(word, pos="v") if word != wordnet_lemmatizer.lemmatize(word, pos="v")
        else wordnet_lemmatizer.lemmatize(word, pos="r") if word != wordnet_lemmatizer.lemmatize(word, pos="r")
        else wordnet_lemmatizer.lemmatize(word, pos="s") if word != wordnet_lemmatizer.lemmatize(word, pos="s")
        else wordnet_lemmatizer.lemmatize(word)
        for word in words
    ]

    # Stemming using the nltk SnowballStemmer
    # Create an English-specific SnowballStemmer; ignore_stopwords leaves stop words untouched
    snowball_stemmer = SnowballStemmer('english', ignore_stopwords=True)
    stemmed = [snowball_stemmer.stem(word) for word in lemmatized]

    # Removing stop words using the nltk stop word list
    # Note: the filter below runs over the lemmatized tokens; swap in `stemmed` here if you
    # want the stemmed terms in the index instead
    stopword = stopwords.words('english')
    normalized_tokens = [word for word in lemmatized if word not in stopword]

    return normalized_tokens
```
This returns a list of the terms built from the text field; these are known as tokens. The steps involved in constructing the tokens are described below.
Normalise Data
Unicode Normalisation converts alphabetical Unicode glyphs to their nearest equivalent characters.
A few examples are given below:
Input | After normalisation
---|---
𝕔𝕠𝕞𝕡𝕝𝕖𝕥𝕖𝕝𝕪 𝕦𝕟𝕣𝕖𝕒𝕕𝕒𝕓𝕝𝕖 | completely unreadable
Ç, é, Å | C, e, A
ℌ | H
Xⱼ | x
½ | 12
```python
import unicodedata2

text = unicodedata2.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
```
Cleansing
This step cleanses the document by removing the following:
- Apostrophe s (‘s)
- Email addresses
- Website addresses
- HTML tags
- New lines
- Special characters
- Roman numerals
- Single-character words
```python
import regex as re

# Substitute 's, email addresses, website addresses, html tags and new lines with a space
cleansed1 = re.sub(r"'s\b|\w+@\w+\.\w+|\w+://\S+|<[^<]+?>|'[\n\r]'", ' ', text)

# Substitute all unwanted special characters with a space
cleansed2 = re.sub("[^A-Za-z' ]+", ' ', cleansed1)

# Substitute apostrophes at the beginning and end of words, roman numerals and single characters with a space
cleansed3 = re.sub(r"(?<=\s)\'+(?=\w*)|(?<=\S)\'+(?=\s+)|(?<=\s)[ivxlcdm]+(?=\s)|(?<=\s)\w(?=\s)|^\w(?=\s)|(?<=\s)\w$", ' ', cleansed2)

# Remove all remaining apostrophes
cleansed4 = re.sub(r"\'", '', cleansed3)
```
Splitting text into words
The document text is made into a list of words by using whitespace as the separator.
```python
# Split the text into tokens
words = list(cleansed4.split())
```
Lemmatization
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatization is similar to stemming but it brings context to the words. So it links words with similar meanings to one word.
Examples of lemmatization:
```
-> rocks : rock
-> corpora : corpus
-> better : good
```
One major difference from stemming is that the lemmatizer takes a part-of-speech parameter, “pos”. If not supplied, the default is “noun”.
```python
from nltk.stem import WordNetLemmatizer, SnowballStemmer

# Lemmatization using the nltk WordNetLemmatizer
# Create a WordNetLemmatizer instance and try each part of speech in turn,
# keeping the first lemma that differs from the original word
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized = [
    wordnet_lemmatizer.lemmatize(word, pos="n") if word != wordnet_lemmatizer.lemmatize(word, pos="n")
    else wordnet_lemmatizer.lemmatize(word, pos="a") if word != wordnet_lemmatizer.lemmatize(word, pos="a")
    else wordnet_lemmatizer.lemmatize(word, pos="v") if word != wordnet_lemmatizer.lemmatize(word, pos="v")
    else wordnet_lemmatizer.lemmatize(word, pos="r") if word != wordnet_lemmatizer.lemmatize(word, pos="r")
    else wordnet_lemmatizer.lemmatize(word, pos="s") if word != wordnet_lemmatizer.lemmatize(word, pos="s")
    else wordnet_lemmatizer.lemmatize(word)
    for word in words
]
```
Stemming
Stemming is the process of reducing a word to its word stem by stripping suffixes and prefixes, so that words of a similar kind lie under a common stem. For example, the words care, cared and caring all lie under the stem ‘care’. Stemming is important in natural language processing (NLP). Porter’s Stemmer, the Snowball Stemmer and the Lancaster Stemmer are the common stemmers available in the NLTK library. We are using the Snowball stemmer in our code. Please refer here for detailed information on the different stemmers.
A few common rules of Snowball stemming are:
```
ILY -----> ILI
LY  -----> Nill
SS  -----> SS
S   -----> Nill
ED  -----> E, Nill
```
- Nill means the suffix is replaced with nothing and is just removed.
- There may be cases where these rules vary depending on the word. Take the suffix ‘ed’: the words ‘cared’ and ‘bumped’ are stemmed to ‘care’ and ‘bump’, so for ‘cared’ only the suffix ‘d’ is removed rather than ‘ed’. Another interesting case is the word ‘stemmed’, which is reduced to ‘stem’. The suffix that gets removed therefore depends on the word.
Let’s see a few examples:
```
Word         Stem
cared        care
university   univers
fairly       fair
easily       easili
singing      sing
sings        sing
sung         sung
singer       singer
sportingly   sport
```
Drawbacks of Stemming:
- Over-stemming and under-stemming can lead to stems that are not meaningful or are inappropriate.
- Stemming does not consider how a word is being used. For example, the word ‘saw’ will be stemmed to ‘saw’ itself, without considering whether it is used as a noun or a verb in its context. Lemmatization takes this into account and returns either ‘see’ or ‘saw’ depending on whether ‘saw’ was used as a verb or a noun.
```python
from nltk.stem import WordNetLemmatizer, SnowballStemmer

# Stemming using the nltk SnowballStemmer
# Create an English-specific SnowballStemmer instance; ignore_stopwords leaves stop words untouched
snowball_stemmer = SnowballStemmer('english', ignore_stopwords=True)
stemmed = [snowball_stemmer.stem(word) for word in lemmatized]
```
Stopword Removal
One of the major forms of pre-processing is to filter out useless data. In natural language processing, such useless words are referred to as stop words.
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) which is removed from the list of words.
```python
from nltk.corpus import stopwords

# Removing stop words using the nltk stop word list
stopword = stopwords.words('english')
normalized_tokens = [word for word in lemmatized if word not in stopword]
```
Once the function for tokenizing words is done, the next step is to create the main Snowpark UDF for building the inverted index.
```python
from snowflake.snowpark.functions import udf

@udf(packages=['nltk==3.7', 'unicodedata2']
     ,imports=["/home/ryzen/nltk_data/corpora/stopwords.zip"
               ,"/home/ryzen/nltk_data/corpora/wordnet.zip"
               ,"/home/ryzen/nltk_data/corpora/omw-1.4.zip"]
     ,name='inverted_index'
     ,is_permanent=True
     ,stage_location="@T_STAGE"
     ,replace=True)
def inverted_index(id: str, text: str) -> dict:
    import sys
    import zipfile
    import nltk
    from pathlib import Path

    # Directory where the imported files are made available inside the UDF sandbox
    IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
    import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

    # Unzip the nltk corpora into a writable directory and register it with nltk
    if not Path("/tmp/nltk_data/corpora/stopwords").exists():
        with zipfile.ZipFile(f'{import_dir}/stopwords.zip', 'r') as myzip:
            myzip.extractall('/tmp/nltk_data/corpora/')
    if not Path("/tmp/nltk_data/corpora/wordnet").exists():
        with zipfile.ZipFile(f'{import_dir}/wordnet.zip', 'r') as myzip:
            myzip.extractall('/tmp/nltk_data/corpora/')
    if not Path("/tmp/nltk_data/corpora/omw-1.4").exists():
        with zipfile.ZipFile(f'{import_dir}/omw-1.4.zip', 'r') as myzip:
            myzip.extractall('/tmp/nltk_data/corpora/')
    nltk.data.path.append('/tmp/nltk_data')

    if text:
        try:
            terms = tokenize(text)
        except:
            terms = ['']

        # Dictionary with each term and the frequency with which it appears in the text
        appearances_dict = dict()
        for term in terms:
            term_frequency = appearances_dict[term]['frequency'] if term in appearances_dict else 0
            appearances_dict[term] = {'doc_id': id, 'frequency': term_frequency + 1}
        return appearances_dict
```
The packages that need to be installed on the Snowflake warehouse, such as nltk and unicodedata2, are specified when creating the UDF. There is also a dependency on the nltk data files stopwords, wordnet and omw-1.4, which are specified as imports.
The import directory from which the dependent files are read inside the UDF is obtained as shown below.
```python
IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]
```
The imported nltk files need to be unzipped to a directory with write privileges, and this path has to be added to the nltk data path (nltk.data.path).
```python
if not Path("/tmp/nltk_data/corpora/stopwords").exists():
    with zipfile.ZipFile(f'{import_dir}/stopwords.zip', 'r') as myzip:
        myzip.extractall('/tmp/nltk_data/corpora/')
if not Path("/tmp/nltk_data/corpora/wordnet").exists():
    with zipfile.ZipFile(f'{import_dir}/wordnet.zip', 'r') as myzip:
        myzip.extractall('/tmp/nltk_data/corpora/')
if not Path("/tmp/nltk_data/corpora/omw-1.4").exists():
    with zipfile.ZipFile(f'{import_dir}/omw-1.4.zip', 'r') as myzip:
        myzip.extractall('/tmp/nltk_data/corpora/')
nltk.data.path.append('/tmp/nltk_data')
```
The logic for building the inverted index dictionary is then added to the UDF.
Now a DataFrame is created that selects only the primary key and text fields from the table in question. The UDF inverted_index is then applied to these columns, and the output, i.e. the id (the primary key), the text and the inverted index for the text, is written to a Snowflake table.
```python
# Write all rows to a table: id, text and the inverted index for each text
df = session.table(table).select(column_list).to_df(["id", "text"]) \
    .filter("text is not NULL")

df.select("id", "text", inverted_index("id", "text").alias('APPEARANCE')) \
    .write.save_as_table(f"INVERTED_INDEX_INITIAL_{table.split('.')[-1]}", mode="overwrite")
```
The output table is as shown below.
Now, as the last step, a single inverted index for all the rows of text is created using the Snowflake SQL given below.
```python
session.sql(f"""
    CREATE OR REPLACE TABLE INVERTED_INDEX_FINAL_{table.split('.')[-1]} AS
    WITH SET1 AS (
        SELECT KEY, VALUE
        FROM INVERTED_INDEX_INITIAL_{table.split('.')[-1]},
        LATERAL FLATTEN("APPEARANCE")
    )
    SELECT KEY AS TERM,
           LISTAGG(VALUE, ',') WITHIN GROUP (ORDER BY VALUE) AS APPEARANCE
    FROM SET1
    GROUP BY TERM
""").collect()
```
The output of this is as given below.
The final inverted index thus has each term along with the IDs of the texts it appears in and its frequency in each of those texts.
Thus we just need to pass in the table name and the column names of the primary key and text fields in order to create an inverted index for that table.
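With the final table in place, a full text search boils down to a lookup on the term column. A minimal sketch, reusing the session and naming pattern from above (‘car’ is just an example term):

```python
# Look up a single term in the final inverted index.
# APPEARANCE holds the aggregated (doc_id, frequency) pairs for that term.
matches = session.sql(f"""
    SELECT TERM, APPEARANCE
    FROM INVERTED_INDEX_FINAL_{table.split('.')[-1]}
    WHERE TERM = 'car'
""").collect()
```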
This blog post was co-authored by Teresa Rosemary and Uli Bethke