Big Data News – LinkedIn’s Ops team raises its Hadoop game with “Rewinder”

September 25, 2015

LinkedIn’s new “Rewinder” tool outshines Apache Resource Manager and Job History Server on its Hadoop clusters… SIREn Solutions announces Kibi

Another week has passed in which the team at Sonra has been impressed by developments in our big data community. LinkedIn is a big data company that also happens to do other things that pay the bills, and it has faced the same cluster challenges as every successful company in the big data space. As for so many big data companies, monitoring and managing resource allocation on its clusters in real time or near real time is a huge challenge. In LinkedIn’s case it led to a solution that the team in Mountain View calls “Rewinder”.
Rewinder is essentially LinkedIn’s upgrade of the resource allocation and reporting tools it uses to manage the Hadoop clusters behind its network. Of the existing Apache Hadoop tools, the Resource Manager only provides current resource allocation data for the cluster(s), not historic data, and the Job History Server provides historic data for MapReduce jobs only. LinkedIn wanted a single solution covering the functionality of both, and its research led to the development of Rewinder.
Rewinder works as a platform management and analysis tool with four basic features:

Extractor: A job that runs every minute, extracts data from every application master via the YARN API and aggregates it for the next step, the Reporter.
Reporter: A nightly job that pulls the day’s collated data from the Extractor via a REST API and analyses it. Its job is to provide insights into the day’s resource allocation activity.
Housekeeper: An administrative job that runs nightly to carry out housekeeping across the cluster, such as deleting old data and creating new table partitions.
Trigger: The control component that manages the other three via the Java Quartz scheduler. It maps out all tasks and their times and assigns them to the subordinate components.
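
To make the Extractor step more concrete, here is a minimal Python sketch of the kind of per-minute aggregation described above. It is an illustration under assumptions, not LinkedIn’s code: the function name `aggregate_by_queue`, the output columns and the sample records are hypothetical, and the input field names (`queue`, `allocatedMB`, `allocatedVCores`) are assumed to follow the shape of a YARN cluster applications listing.

```python
from collections import defaultdict
from datetime import datetime, timezone


def aggregate_by_queue(apps, sampled_at):
    """Roll per-application allocations up to one row per queue for one minute."""
    # Truncate the sample time to the minute so each run maps to one time bucket.
    minute = sampled_at.replace(second=0, microsecond=0)
    totals = defaultdict(lambda: {"allocated_mb": 0, "allocated_vcores": 0, "apps": 0})
    for app in apps:
        row = totals[app["queue"]]
        row["allocated_mb"] += app["allocatedMB"]
        row["allocated_vcores"] += app["allocatedVCores"]
        row["apps"] += 1
    # One output row per queue, sorted by queue name for stable output.
    return [
        {"minute": minute.isoformat(), "queue": queue, **row}
        for queue, row in sorted(totals.items())
    ]


# Hypothetical sample records (assumed field names, made-up values).
sample_apps = [
    {"queue": "etl", "user": "alice", "allocatedMB": 4096, "allocatedVCores": 4},
    {"queue": "etl", "user": "bob", "allocatedMB": 6144, "allocatedVCores": 6},
    {"queue": "ads", "user": "carol", "allocatedMB": 2048, "allocatedVCores": 2},
]

snapshot = aggregate_by_queue(
    sample_apps, datetime(2015, 9, 25, 10, 30, 42, tzinfo=timezone.utc)
)
for row in snapshot:
    print(row)
```

Storing one small row per queue per minute, rather than every raw application record, is one plausible way to keep a nightly reporting job cheap. The article does not say how Rewinder actually lays out its data.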

Interestingly, LinkedIn keeps 70 days of dev cluster data in a MySQL database, importing c. 600,000 records per day. A UI on top of this database lets the administrator or analyst, as the end user of the data, produce insights for LinkedIn’s technology group.
Some very cool features of Rewinder are as follows:

Rewinding today – You can revisit any minute of a given day to see what the allocation state was on the cluster(s).
Who’s using what – You can see which queues consume which resources, and how much, over a chosen time range.
Queues – You can see how applications are queuing for resources, which makes it possible to spot pending failures that may or may not fall within the specific fault-tolerance parameters being checked.
Reporting – It can report over a 30-day period on usage allocation, application usage, job usage frequency and more.
Visualizations – Utilization and allocation charts and more.
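
The “rewinding” and “who’s using what” features boil down to two kinds of lookup over minute-level history: the state at one minute, and usage summed over a range. The sketch below illustrates both in Python. It uses an in-memory SQLite database as a stand-in for the MySQL store, and the table `queue_usage`, its columns and the sample rows are all hypothetical, since the article does not describe Rewinder’s schema.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database; the schema is assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queue_usage (minute TEXT, queue TEXT, allocated_mb INTEGER)"
)
conn.executemany(
    "INSERT INTO queue_usage VALUES (?, ?, ?)",
    [
        ("2015-09-25 10:29", "etl", 4096),
        ("2015-09-25 10:30", "etl", 6144),
        ("2015-09-25 10:30", "ads", 2048),
        ("2015-09-25 10:31", "ads", 1024),
    ],
)


def rewind(conn, minute):
    """Return the per-queue allocation recorded for one specific minute."""
    return conn.execute(
        "SELECT queue, allocated_mb FROM queue_usage "
        "WHERE minute = ? ORDER BY queue",
        (minute,),
    ).fetchall()


def usage_by_queue(conn, start, end):
    """Sum allocation per queue over an inclusive time range (MB-minutes)."""
    return conn.execute(
        "SELECT queue, SUM(allocated_mb) FROM queue_usage "
        "WHERE minute BETWEEN ? AND ? GROUP BY queue ORDER BY queue",
        (start, end),
    ).fetchall()


print(rewind(conn, "2015-09-25 10:30"))
print(usage_by_queue(conn, "2015-09-25 10:29", "2015-09-25 10:31"))
```

The range comparison relies on the timestamps being stored in a fixed, sortable text format. A real deployment would more likely use a native datetime column.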

With an integrated tool like this, it is not hard to see why LinkedIn’s operations team built it, or how the team benefits from the value Rewinder adds to the management of the Hadoop clusters the company uses to run its business.

SIREn Solutions has announced the launch of its business intelligence tool, “Kibi”. Kibi is a cross-index (abstract relational layer) data intelligence tool that is a fork of Kibana and provides useful enhancements to Elasticsearch. One could say it is the BI analyst’s best friend, given its ability to join across existing Elasticsearch indexes through a very flexible plugin. SIREn advises that the plugin will be released as a standalone product in the coming weeks. The relational capability has some interesting use cases for analysts: once indexing is complete, related data from several indexes can be joined at runtime and returned as the result of a single query, which makes it a definite ‘must have’. Kibi is, however, more than this innovative plugin. Its analytic capability extends beyond cross-index joins and filtering to decent data visualisations, relational analysis and SQL queries on external data sources. As an open source project, Kibi can be found on GitHub, with its intuitive front end distributed under the Apache 2 licence and SIREn’s cross-indexing plugin distributed under the AGPL licence. You can also check out Kibi at http://siren.solutions/kibi.
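
For readers unfamiliar with the idea, the following Python sketch shows conceptually what joining two indexes into a single result means. It is plain Python over lists of dictionaries and does not use Kibi’s or Elasticsearch’s API. The function `join_indices`, the field names and the sample documents are all hypothetical.

```python
def join_indices(left_docs, right_docs, left_key, right_key, prefix):
    """Merge each left document with the right documents sharing its key value."""
    # Build a lookup from key value to the matching right-hand documents.
    lookup = {}
    for doc in right_docs:
        lookup.setdefault(doc[right_key], []).append(doc)
    joined = []
    for doc in left_docs:
        # Left documents without a match are dropped (inner-join behaviour).
        for match in lookup.get(doc[left_key], []):
            extra = {f"{prefix}.{field}": value for field, value in match.items()}
            joined.append({**doc, **extra})
    return joined


# Two hypothetical "indexes" holding related documents.
articles = [
    {"title": "Funding round", "company_id": "c1"},
    {"title": "New product", "company_id": "c2"},
    {"title": "Orphan", "company_id": "c9"},
]
companies = [
    {"id": "c1", "name": "Acme"},
    {"id": "c2", "name": "Globex"},
]

results = join_indices(articles, companies, "company_id", "id", "company")
for row in results:
    print(row)
```

The analyst gets one combined result set instead of querying each index separately and stitching the answers together by hand.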
As another week comes to a close, the evolution of Hadoop in a customised cluster environment and new BI innovations are testament to progress for everybody in our Kaizen big data world, which marks its successes by a “+1” of progress through people, process and technology.
About Sonra
We are a Big Data company based in Ireland. We are experts in data lake implementations, clickstream analytics, real time analytics, and data warehousing on Hadoop. We can help with your Big Data implementation. Get in touch.
We also run the Hadoop User Group Ireland. If you are interested in attending or presenting, register on the Meetup website.
