You are planning to complement your traditional data warehouse architecture with big data technologies. Now what? Should you upskill your existing data warehouse team? Or do Big Data technologies require a completely different set of skills?
What do we mean by big data technologies anyway? For the purpose of this article, I define big data as any distributed technology that is not a relational database. According to this definition a distributed relational database (MPP) such as Redshift, Vertica, Teradata etc. is not a big data technology. I know. Other definitions for big data exist, e.g. any data volume that we can’t easily fit on a single server.
Tackling big data projects with a data warehouse mind set?
Your company already has a data warehouse built on traditional data warehouse architecture. Either a Kimball collection of conformed data marts or a Corporate Information Factory built on Inmon. Great! You already have a lot of experience running data projects successfully.
Let’s look at the various roles on your team and check if they are big data ready.
I presume you already run your data warehouse projects in an agile fashion. Not much changes there. It’s still scrum, user stories, spikes, and daily stand-ups. However, big data projects are more code- than tool-centric. As a result the agility around automated deployments and continuous integration can be increased.
The data architects on your team need to upskill significantly. If they have worked with an MPP database they have had exposure to distributed computing. That’s good. While frameworks that sit on top of a distributed file system, e.g. Spark or Hadoop share certain features, there are significant differences.
Data architects also need to upskill in the area of streaming architecture and understand concepts such as the Lambda and Kappa architectures, processing semantics (exactly once etc.), the changing nature of message brokers, stateful and stateless computations, the implications of differences in event and processing times etc.
ETL is rapidly changing as well. Traditional ETL tools don’t offer rich enough abstractions through their GUI. The black box approach to ETL is dead. Pushing down the calculations to the where the data is stored is the perfect fit for big data. It brings processing to the data rather than the other way around. Declarative designs, metadata driven ETL, code templates, a minimal GUI, and server-less ETL all trump the traditional approach to ETL.
Data architects need to understand which workloads are best served by an RDBMS and which workloads are a better fit for big data technologies. While the RDBMS is very versatile at running different types of workloads it is not always the best fit. Graphs for example can be modelled in an RDBMS. However, it’s awkward to model and query the data there. Performance may also become a problem.
Understanding the different types and degrees of query offload and pushdown are an important skill for data architects.
Your data architects also need to familiarise themselves with concepts such as data lakes, raw data reservoirs, analytics workbenches and sandboxes, self-service analytics etc. They need to be able to distinguish the valuable concepts from vendor hype in this area.
I cover all of these upskill areas in my popular training course Big Data for Data Warehouse Professionals.
For data modellers nothing changes with respect to creating conceptual and logical models. However, when it comes to translating a logical model into a physical model, data modellers need to take the features of the target data store and the use case into account. Nothing new there. Let me give you a few examples.
- When our use case is transaction processing and our data store is a relational database, it is best to implement the physical model in third normal form.
- When our use case is business intelligence and our data store is a relational database, it is typically best to de-normalise our model into a star schema and pre-join some of the reference tables.
- When our use case is business intelligence and our data store is a relational database with columnar compressed storage, we can de-normalise our star schema further. Let’s assume a fact table with two very large dimension tables. On an MPP we can use the distribution key to co-locate data of the fact table with the data of the first dimension. The joins will be fast. For the second dimension we don’t have that option. As an alternative we could de-normalise that dimension into the fact table. As our MPP supports columnar compression this is not a big concern from the storage or performance point of view (columnar compression and column projection).
- When our use case is business intelligence and our data store is a distributed file system, we should avoid joins of very large tables where possible. On these systems joins are evil. Unlike on an MPP we have no control over row and key placement. We also need to think about how to handle updates and deletes on immutable storage. I go into a lot of details in my post xxx
As you can see, data models can not be abstracted from the data store and the use case. It is crucial that a data modeller understands the physical features of the underlying data store. Of course, you can have a separate role for the physical data modeller, e.g. a data engineer might play that role.
This brings us to the data engineers themselves. In a data warehouse project these are typically called ETL developers. Most ETL developers have a mix of the following skills: SQL, data modelling, ETL tool knowledge, and relational database knowledge. From my experience these skills are not easily transferred to a big data project. Indeed I have seen various big data projects fail where such an attempt has been made. ETL developers lack two crucial skills when it comes to big data projects. They are not coders, geeks, nerds in the classic sense. They typically lack the low level programming skills that are needed to be successful in a big data project. Often, they have not had exposure to distributed computing and the complexities it brings with it either.
At Sonra, the engineers that work on our big data XML converter (written in Scala and running on Spark) look at the Spark source code all the time to understand what is going on under the hood and to make best use of the framework. We also frequently write our own extensions. This is the level of experience that is required on these frameworks. Despite all of the hype and what the vendors tell you the technology often is immature and not well documented. For an average ETL developer this would be a very, very steep learning curve.
As a rule of thumb I would not hire or upskill an ETL engineer for a big data project. As a hiring manager, the resources that you are after are “proper” developers ideally with a knowledge of distributed computing frameworks. Scala, Java, CI, multi-threaded programming, test driven development, understanding of common algorithms all come natural to these resources.
My recommendation: Don’t approach a big data project with a data warehouse mind set. Don’t think that by using higher level APIs and SQL you will be able to shield yourself from the nitty gritty details. You can’t eliminate exposure to the lower level APIs.
We provide a service of remote data engineers that bring all of these skills to the table.
Business Intelligence Developers
Not that much changes for BI developers. They still create reports, dashboards and visualisations mainly against data marts on relational databases. Typically you can upskill those resources for scenarios where you run business analytics against a big data technology. Some queries are not a good fit to be run on a distributed technology, e.g. DISTINCT queries or aggregate queries against high cardinality columns without filters. BI developers need to know and understand those cases.
Your DBAs won’t easily upskill to a big data framework such as Hadoop. There is some overlap in the skills administering a cluster of big data technologies is challenging. You are better off to acquire external resources. If you have the option, I would highly recommend to run the administration of a distributed big data system as a managed service and/or in the cloud. This eliminates a significant overhead and cause of headaches.
What about Advanced Analytics?
Data science is a team sport. No one person will have the required skills to cover the full data science life cycle. Let me walk you through the roles you will need on your team.
For a lack of a better term I call this role the researcher. This is the most important member on your data science team (and the least technical). This person comes from the business and has hands on business experience for the domain you are modelling. This role is not a business analyst. It is a person with intimate knowledge and hands on experience of the business domain. They have deep industry knowledge about typical problems in the domain. This person has the ability to think outside the box and the ability to question common practices and see beyond them.
You won't solve problems by just looking at the data. As Picasso used to say Computers are stupid they can only give you answers. This person needs to be able to identify and see problems, e.g. by looking at existing metrics/reports or by defining new metrics that expose problems. He or she needs to come up with the right questions. This is the first step. Traditional BI tells you that you have a problem, advanced analytics tells you why you have this problem. It will be the task of this person to find the problems and then come up with the right questions and hypotheses to get to the bottom of the issues.
The other team members will assist this person to drive the program, e.g. the data analyst provides reports to identify problem areas, the data scientist helps in the design of surveys/questionnaires, creates and evaluates the models, and helps in designing experiments (A/B tests).
The researcher needs to be able to work with managers but also with people on the ground.
I can only iterate. This person is not a traditional business analyst who sits behind their desk all day and takes part in the odd phone call or meeting.
This is the core investment of your program and you should try to get the best person for this job. Ideally this is an internal recruit.
If next time you are wondering why your advanced analytics project does not deliver results you now have the answer. You are lacking a researcher. It is not enough to just hire a data scientist and put them in front of Terabytes or Petabytes of data and expect actionable insights. I’m afraid, it is a lot harder than that. Sorry to burst the bubble.
The other roles you need on this team are data analysts, visualisation experts, data engineers, and data scientist (the person building the models and selecting the algorithms). I won’t cover these here in detail as they are covered in other publications ad nauseam (yawn).
I am an ETL Developer. What should I do?
Finally, a slightly off-topic area. I often get asked the question: I am an ETL developer, what should I do? An ETL developer who tries to upskill on big data technologies will have a hard time to compete with a data engineer from a coding background. Similarly, an ETL developer who tries to upskill in advanced analytics will find it difficult to compete with battle hardened data scientists.
In my opinion there are three career paths open to ETL developers:
- Stay an ETL developer but upskill, e.g. cloud data warehouses such as Snowflake, understand MPP concepts, advanced SQL such as window functions, streaming technologies for real-time ETL, data preparation tools and concepts etc.
- Become a data architect and upskill in big data architecture.
- Upskill to become a manager
My popular course Big Data for Data Warehouse Professionals covers all of these areas. Happy learning.