Successfully Transitioning your Team from Data Warehousing to Big Data

November 8, 2017

You are planning to complement your traditional data warehouse architecture with big data technologies. Now what? Should you upskill your existing data warehouse team? Or do Big Data technologies require a completely different set of skills?
Let me first clarify that I am not advocating to drop your relational data warehouse and replace it with some big data technology. Quite the opposite. However, what I want to drive home with the article is that if for whatever reasons you are moving to Hadoop or Spark etc. you need to be aware of the different skillset requirements that these technologies have in comparison to traditional data warehouse technologies.
What do we mean by big data technologies anyway? For the purpose of this article, I define big data as any distributed technology that is not a relational database. According to this definition a distributed relational database (MPP) such as Redshift, Vertica, Teradata etc. is not a big data technology. I know. Other definitions for big data exist, e.g. any data volume that we can’t easily fit on a single server.

Table of Contents

Tackling big data projects with a data warehouse mind set?

Your company already has a data warehouse built on traditional data warehouse architecture. Either a Kimball collection of conformed data marts or a Corporate Information Factory built on Inmon. Great! You already have a lot of experience running data projects successfully.
Let’s look at the various roles on your team and check if they are big data ready.

Project Managers

I presume you already run your data warehouse projects in an agile fashion. Not much changes there. It’s still scrum, user stories, spikes, and daily stand-ups. However, big data projects are more code- than tool-centric. As a result the agility around automated deployments and continuous integration can be increased.

Data Architects

The data architects on your team need to upskill significantly. If they have worked with an MPP database they have had exposure to distributed computing. That’s good. While frameworks that sit on top of a distributed file system, e.g. Spark or Hadoop share certain features, there are significant differences.
Data architects also need to upskill in the area of streaming architecture and understand concepts such as the Lambda and Kappa architectures, processing semantics (exactly once etc.), the changing nature of message brokers, stateful and stateless computations, the implications of differences in event and processing times etc.
ETL is rapidly changing as well. Traditional ETL tools don’t offer rich enough abstractions through their GUI. The black box approach to ETL is dead. Pushing down the calculations to the where the data is stored is the perfect fit for big data. It brings processing to the data rather than the other way around. Declarative designs, metadata driven ETL, code templates, a minimal GUI, and server-less ETL all trump the traditional approach to ETL.
Data architects need to understand which workloads are best served by an RDBMS and which workloads are a better fit for big data technologies. While the RDBMS is very versatile at running different types of workloads it is not always the best fit. Graphs for example can be modelled in an RDBMS. However, it’s awkward to model and query the data there. Performance may also become a problem.
Understanding the different types and degrees of query offload and pushdown are an important skill for data architects.
Your data architects also need to familiarise themselves with concepts such as data lakes, raw data reservoirs, analytics workbenches and sandboxes, self-service analytics etc. They need to be able to distinguish the valuable concepts from vendor hype in this area.
I cover all of these upskill areas in my popular training course Big Data for Data Warehouse Professionals.

Data Modellers

For data modellers nothing changes with respect to creating conceptual and logical models. However, when it comes to translating a logical model into a physical model, data modellers need to take the features of the target data store and the use case into account. Nothing new there. Let me give you a few examples.

When our use case is transaction processing and our data store is a relational database, it is best to implement the physical model in third normal form.
When our use case is business intelligence and our data store is a relational database, it is typically best to de-normalise our model into a star schema and pre-join some of the reference tables.
When our use case is business intelligence and our data store is a relational database with columnar compressed storage, we can de-normalise our star schema further. Let’s assume a fact table with two very large dimension tables. On an MPP we can use the distribution key to co-locate data of the fact table with the data of the first dimension. The joins will be fast. For the second dimension we don’t have that option. As an alternative we could de-normalise that dimension into the fact table. As our MPP supports columnar compression this is not a big concern from the storage or performance point of view (columnar compression and column projection).
When our use case is business intelligence and our data store is a distributed file system, we should avoid joins of very large tables where possible. On these systems joins are evil. Unlike on an MPP we have no control over row and key placement. We also need to think about how to handle updates and deletes on immutable storage. I go into a lot of details in my post xxx

As you can see, data models can not be abstracted from the data store and the use case. It is crucial that a data modeller understands the physical features of the underlying data store. Of course, you can have a separate role for the physical data modeller, e.g. a data engineer might play that role.

Data Engineers

This brings us to the data engineers themselves. In a data warehouse project these are typically called ETL developers. Most ETL developers have a mix of the following skills: SQL, data modelling, ETL tool knowledge, and relational database knowledge. From my experience these skills are not easily transferred to a big data project. Indeed I have seen various big data projects fail where such an attempt has been made. ETL developers lack two crucial skills when it comes to big data projects. They are not coders, geeks, nerds in the classic sense. They typically lack the low level programming skills that are needed to be successful in a big data project. Often, they have not had exposure to distributed computing and the complexities it brings with it either.
At Sonra, the engineers that work on our big data XML converter (written in Scala and running on Spark) look at the Spark source code all the time to understand what is going on under the hood and to make best use of the framework. We also frequently write our own extensions. This is the level of experience that is required on these frameworks. Despite all of the hype and what the vendors tell you the technology often is immature and not well documented. For an average ETL developer this would be a very, very steep learning curve.
As a rule of thumb I would not hire or upskill an ETL engineer for a big data project. As a hiring manager, the resources that you are after are “proper” developers ideally with a knowledge of distributed computing frameworks. Scala, Java, CI, multi-threaded programming, test driven development, understanding of common algorithms all come natural to these resources.
My recommendation: Don’t approach a big data project with a data warehouse mind set. Don’t think that by using higher level APIs and SQL you will be able to shield yourself from the nitty gritty details. You can’t eliminate exposure to the lower level APIs.
We provide a service of remote data engineers that bring all of these skills to the table.

Business Intelligence Developers

Not that much changes for BI developers. They still create reports, dashboards and visualisations mainly against data marts on relational databases. Typically you can upskill those resources for scenarios where you run business analytics against a big data technology. Some queries are not a good fit to be run on a distributed technology, e.g. DISTINCT queries or aggregate queries against high cardinality columns without filters. BI developers need to know and understand those cases.

Admins

Your DBAs won’t easily upskill to a big data framework such as Hadoop. There is some overlap in the skills administering a cluster of big data technologies is challenging. You are better off to acquire external resources. If you have the option, I would highly recommend to run the administration of a distributed big data system as a managed service and/or in the cloud. This eliminates a significant overhead and cause of headaches.

What about Advanced Analytics?

Data science is a team sport. No one person will have the required skills to cover the full data science life cycle. Let me walk you through the roles you will need on your team.

The Researcher

For a lack of a better term I call this role the researcher. This is the most important member on your data science team (and the least technical). This person comes from the business and has hands on business experience for the domain you are modelling. This role is not a business analyst. It is a person with intimate knowledge and hands on experience of the business domain. They have deep industry knowledge about typical problems in the domain. This person has the ability to think outside the box and the ability to question common practices and see beyond them.

You won’t solve problems by just looking at the data. As Picasso used to say Computers are stupid they can only give you answers. This person needs to be able to identify and see problems, e.g. by looking at existing metrics/reports or by defining new metrics that expose problems. He or she needs to come up with the right questions. This is the first step. Traditional BI tells you that you have a problem, advanced analytics tells you why you have this problem. It will be the task of this person to find the problems and then come up with the right questions and hypotheses to get to the bottom of the issues.
The other team members will assist this person to drive the program, e.g. the data analyst provides reports to identify problem areas, the data scientist helps in the design of surveys/questionnaires, creates and evaluates the models, and helps in designing experiments (A/B tests).
The researcher needs to be able to work with managers but also with people on the ground.
I can only iterate. This person is not a traditional business analyst who sits behind their desk all day and takes part in the odd phone call or meeting.
This is the core investment of your program and you should try to get the best person for this job. Ideally this is an internal recruit.
If next time you are wondering why your advanced analytics project does not deliver results you now have the answer. You are lacking a researcher. It is not enough to just hire a data scientist and put them in front of Terabytes or Petabytes of data and expect actionable insights. I’m afraid, it is a lot harder than that. Sorry to burst the bubble.
The other roles you need on this team are data analysts, visualisation experts, data engineers, and data scientist (the person building the models and selecting the algorithms). I won’t cover these here in detail as they are covered in other publications ad nauseam (yawn).

I am an ETL Developer. What should I do?

Finally, a slightly off-topic area. I often get asked the question: I am an ETL developer, what should I do? An ETL developer who tries to upskill on big data technologies will have a hard time to compete with a data engineer from a coding background. Similarly, an ETL developer who tries to upskill in advanced analytics will find it difficult to compete with battle hardened data scientists.
In my opinion there are three career paths open to ETL developers:

Stay an ETL developer but upskill, e.g. cloud data warehouses such as Snowflake, understand MPP concepts, advanced SQL such as window functions, streaming technologies for real-time ETL, data preparation tools and concepts etc.
Become a data architect and upskill in big data architecture.
Upskill to become a manager

My popular course Big Data for Data Warehouse Professionals covers all of these areas. Happy learning.

Back to Blog

Cookie	Duration	Description
__cfruid	session	Cloudflare sets this cookie to identify trusted web traffic.
cookielawinfo-checkbox-marketing	1 month	This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Marketing".
cookielawinfo-checkbox-necessary	1 month	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-preferences	1 month	This cookie is set by the GDPR Cookie Consent plugin to check if the user has given consent to use cookies under the "Preferences" category.
cookielawinfo-checkbox-statistics	1 month	This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Statistics".
cookielawinfo-checkbox-unclassified	1 month	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Unclassified".
CookieLawInfoConsent	1 month	Records the default button state of the corresponding category & the status of CCPA. It works only in coordination with the primary cookie.
csrftoken	1 year	This cookie is associated with Django web development platform for python. Used to help protect the website against Cross-Site Request Forgery attacks

Cookie	Duration	Description
AnalyticsSyncHistory	1 month	Linkedin set this cookie to store information about the time a sync took place with the lms_analytics cookie.
bcookie	2 years	LinkedIn sets this cookie from LinkedIn share buttons and ad tags to recognize browser ID.
bscookie	2 years	LinkedIn sets this cookie to store performed actions on the website.
lang	session	LinkedIn sets this cookie to remember a user's language setting.
li_gc	2 years	Linkedin set this cookie for storing visitor's consent regarding using cookies for non-essential purposes.
lidc	1 day	LinkedIn sets the lidc cookie to facilitate data center selection.
mgref	1 year	This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
mgrefby	1 year	This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
UserMatchHistory	1 month	LinkedIn sets this cookie for LinkedIn Ads ID syncing.

Cookie	Duration	Description
G	1 year	Cookie used to facilitate the translation into the preferred language of the visitor.
SERVERID	session	This cookie is set by Slideshare's HAProxy load balancer to assign the visitor to a specific server.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_ga_7H38LVR4Z5	2 years	This cookie is installed by Google Analytics.
_gat_gtag_UA_44804396_1	1 minute	Set by Google to distinguish users.
_gat_UA-44804396-1	1 minute	A variation of the _gat cookie set by Google Analytics and Google Tag Manager to allow website owners to track visitor behaviour and measure site performance. The pattern element in the name contains the unique identity number of the account or website it relates to.
_gcl_au	3 months	Provided by Google Tag Manager to experiment advertisement efficiency of websites using their services.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
SIDCC	6 Months	The "SIDCC" cookie is used as security measure to protect users data from unauthorised access
test_cookie	15 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
AN	1 month
AS	session
ebEventToTrack	1 month
eblang	1 year
SNID	2 years	This cookie is set by the Google. This cookie is used by the map which helps visitors to identify and reach the facility.
SP	session
SS	session

From Code to Clarity: Visualizing SQL code for Documentation and Debugging

From Code to Clarity: Visualizing SQL code for Documentation and Debugging

From Code to Clarity: Visualizing SQL code for Documentation and Debugging

From Code to Clarity: Visualizing SQL code for Documentation and Debugging

Successfully Transitioning your Team from Data Warehousing to Big Data

Tackling big data projects with a data warehouse mind set?

Project Managers

Data Architects

Data Modellers

Data Engineers

Business Intelligence Developers

Admins

What about Advanced Analytics?

The Researcher

I am an ETL Developer. What should I do?

Successfully Transitioning your Team from Data Warehousing to Big Data

Tackling big data projects with a data warehouse mind set?

Project Managers

Data Architects

Data Modellers

Data Engineers

Business Intelligence Developers

Admins

What about Advanced Analytics?

The Researcher

I am an ETL Developer. What should I do?

Related Articles

Data Warehouse 3.0. A Reference Architecture for the Modern Data Warehouse.

Using Virtual Data Marts the right way

The Secrets to Unlocking XML Data in an Optimised Format for Downstream Users

Cookies consent