From Big to Smart Data in 5 Simple Steps

April 5, 2016

Are you like me? Tired of hearing buzzwords such as “mining gold in your data”, “finding valuable nuggets of information” or the ever-present “actionable insights”? One of them crops up in almost every vendor pitch you are surely bombarded with. The hype around Big Data has reached fever pitch. On the other hand, we have statements such as this one from Dan Ariely: “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.” In short, some kids get Big Data and others (read: most) don’t. It doesn’t have to be that way. Data analytics is not rocket science (at least most of it isn’t).
In this blog post we cut through the fog. We’ll have a look at the real drivers behind the Big Data revolution. More importantly, I will show you how you can use data to reduce uncertainty in your decision making.
Big Data B4 Big Data
Have a guess what the following formula is.
 
N̂ = m + m/k − 1
No idea? I’ll help you. This is the formula that won the Allies World War II. OK, that is a bit of an exaggeration, but it certainly contributed to VE Day. The formula predicts the monthly German tank production. Initially the Allies used intelligence officers on the ground to gather that information, but their estimates turned out to be wildly inaccurate. Then they came across something else. They noticed that the serial number of German tanks was a sequence number that encoded the production year of the tank. By applying sampling techniques to the serial numbers of captured tanks, they were able to infer the monthly number of tanks produced: in the formula above, m is the highest serial number observed, k is the number of tanks in the sample, and N̂ is the estimated total. It turned out after the war that their predictions were pretty accurate.
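To make this concrete, here is a minimal sketch of the estimator in Python. The serial numbers are invented for illustration.

```python
# German tank problem: estimate how many tanks were produced
# from the serial numbers of a sample of captured tanks.

def estimate_total(serials):
    # Minimum-variance unbiased estimator: N_hat = m + m/k - 1,
    # where m is the largest serial observed and k is the sample size.
    m = max(serials)
    k = len(serials)
    return m + m / k - 1

captured = [19, 40, 42, 60]  # hypothetical serial numbers
print(estimate_total(captured))  # -> 74.0
```

The more serial numbers you capture, the tighter the estimate becomes: with a larger sample, m + m/k − 1 converges on the true production figure.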
As you can see, Smart Data applications existed before the age of Big Data. The analytics, the methodology, the statistics, and the algorithms were all in place. The problem was the availability of the data. As you can imagine, it is hard to collect the serial numbers of German tanks. In the past it generally took a lot of resources and effort just to collect the data.
Digitisation of Everything
Beginning in the late 1980s, data became more and more readily available in 1s and 0s. It was the dawn of the era of the personal computer. Digitisation was upon us. The whole process accelerated further with the arrival of the internet and smart devices. At the turn of the last century, more than 90% of the world’s data was still in analogue format. This figure has since been turned on its head: today more than 99% of the world’s data is digital. Data is growing at an exponential rate, and that rate itself is accelerating. The arrival of the Internet of Things, and machine-generated data in general, is unleashing yet another tidal wave of data.
Why is the process of digitisation so important? Information in digital form is easier, cheaper, and quicker to store, process, and share. The high cost of acquiring and processing data (think of the serial numbers of German tanks) has been significantly reduced. This is a great opportunity to base more and more of our decisions and actions on data. Interestingly, in Germany the phenomenon of Big Data is often referred to simply as digitisation.
Where is the catch? The mere availability of more data is no guarantee that you will derive value from it. For (big) data to become smart data, and for your organisation to become data-driven, a few more ingredients are needed. This can be summarised in the Big Data formula: Big Data = Digitisation + An Opportunity to Compete on Analytics. The emphasis is on opportunity.
Moore’s Law, Distributed Computing & Open Source
The exponential growth in data coincided with a couple of other relevant developments. Moore’s law of exponential growth in CPU processing power gives us the horsepower to crunch through this data tsunami – at least for the time being. (Interestingly, data is growing faster than processing power, and Moore’s law itself will sooner or later run into the laws of physics.)
Over the last 15 years we have also seen tremendous advances in distributed computing. Distributed computing allows us to process data in parallel on multiple machines or servers. It works particularly well for embarrassingly parallel problems – problems that can be split into chunks that are processed independently, with little or no coordination between them. A lot of the global web companies started out with traditional, non-distributed technologies, such as MySQL at Facebook. All of these companies quickly hit the ceiling with these technologies, which are hard to scale horizontally, and had to come up with their own solutions. Google popularised the MapReduce processing paradigm and the Google File System through a series of white papers, which resulted in the creation of Hadoop and other distributed compute frameworks. A lot of these technologies were open-sourced and are now within reach of small and medium-sized enterprises.
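To make the MapReduce idea concrete, here is a toy word count in plain Python. Frameworks such as Hadoop run the map and reduce phases in parallel across a cluster of machines; this sketch runs both phases locally, but the shape of the computation is the same.

```python
from collections import defaultdict
from itertools import chain

# Toy MapReduce word count. The map phase is embarrassingly parallel:
# each document can be processed independently on a different machine.
documents = [
    "smart data beats big data",
    "big data needs a business question",
]

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

pairs = chain.from_iterable(map_phase(doc) for doc in documents)
print(reduce_phase(pairs))
# {'smart': 1, 'data': 3, 'beats': 1, 'big': 2, ...}
```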
The result of all of these developments is that we now have smart data applications that would have been unimaginable at the turn of the last century. Just to give you a few examples: in 2004, DARPA ran a competition for self-driving cars that was nicknamed the “debacle in the desert”, as the race was over after only 6% of the course had been completed. Today self-driving cars are a reality. Only a few years ago, machine-generated translations caused more confusion and embarrassment than they added value for understanding a text. Using data (EU and UN translations in phase one) and text analytics, Google was able to significantly improve machine-generated translations.
Data, Data Everywhere… But No Insights in Sight
The great opportunity that the process of digitisation offers is also a great challenge.
“Big Data is Useless. It only gives you Answers.”
Picasso’s quip that computers are useless because they can only give you answers applies just as neatly to Big Data. While a lot of companies are now competing on data analytics, many companies at the other end of the spectrum find it increasingly hard to deliver on the promises of Big Data. So why is it that Big Data only lives up to the hype for some companies?
We often come across executives who expect insights to be generated at the push of a button. Just feed enough data into the sausage machine and insights come out at the other end. However, a Hadoop project that is not embedded in a business context does not deliver any value. Contrary to common belief, the data does not speak to us. We need to stop putting the cart before the horse. The starting point for a Smart Data project should never be the data. Always start with the problem and the question. Be strategic and put the business question into context. Don’t just dive into the data. Those who discovered the Titanic’s shipwreck had a plan. They did not search every corner of the Atlantic Ocean; they put the problem into the context of the available information.
This brings us to the end of the first part of our article on Smart Data. In the second part we will look at the steps by which you, as an organisation or individual, can turn data into smart data and reduce uncertainty and risk in your decision making.
If you want to become data-driven, contact us about the Sonra Smart Data workshops.