Big Data News - LinkedIn’s Ops team raises its Hadoop game with “Rewinder”

Uli Bethke Business Intelligence, Distributed Computing, Technology

LinkedIn’s new “Rewinder” tool outshines Apache Resource Manager and Job History Server on its Hadoop clusters… SIREn Solutions announces Kibi

Another week has passed in which the team at Sonra has been impressed by developments in our big data community. LinkedIn is a big data company that also happens to do other things that pay the bills, and it has faced the same cluster challenges as every successful company in the big data space. As for so many of them, monitoring and managing resource allocation on its clusters in real time or near real time is a huge challenge. In LinkedIn’s case this led to a solution that the team in Mountain View calls “Rewinder”.

Rewinder is essentially LinkedIn’s upgrade of the resource allocation and reporting tools it uses to manage the Hadoop cluster that runs its network. The existing Apache tools fall short in two ways: Resource Manager only reports current resource allocation on the cluster(s), not historic data, and Job History Server provides historic data for MapReduce jobs only. LinkedIn wanted a single solution that combines the functionality of both, and its research led to the development of Rewinder.

Rewinder works as a platform management and analysis tool with four basic features:

  1. Extractor: A job that runs every minute, extracts data from every application master via the YARN API and aggregates it for the next step, the Reporter.
  2. Reporter: A nightly job that pulls the day’s collated data from the Extractor via a REST API and analyses it. Its job is to provide insights into the day’s resource allocation activity.
  3. Housekeeper: An administrative job that runs nightly to do housekeeping across the cluster, such as deleting old data and creating new table partitions.
  4. Trigger: The control feature that manages the three features above via the Java Quartz scheduler. It maps out all tasks and their times and assigns them to the subordinate features.
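To make the Extractor step more concrete, here is a minimal sketch of one extraction pass. It is not LinkedIn’s code: it assumes the data is read from the YARN ResourceManager REST endpoint for running applications, the host name is a placeholder, and the function names `fetch_running_apps` and `aggregate_by_queue` are invented for illustration.

```python
import json
from collections import defaultdict
from urllib.request import urlopen

# Hypothetical ResourceManager address; replace with a real host.
RM_APPS_URL = (
    "http://resourcemanager.example.com:8088/ws/v1/cluster/apps?states=RUNNING"
)


def fetch_running_apps(url=RM_APPS_URL):
    """Fetch the list of running applications from the ResourceManager."""
    with urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    # "apps" can be null when nothing is running.
    apps = payload.get("apps") or {}
    return apps.get("app", [])


def aggregate_by_queue(apps):
    """Sum allocated memory and vcores per queue for one snapshot."""
    totals = defaultdict(
        lambda: {"allocatedMB": 0, "allocatedVCores": 0, "apps": 0}
    )
    for app in apps:
        entry = totals[app["queue"]]
        # Negative values mean "nothing allocated", so clamp them to zero.
        entry["allocatedMB"] += max(app.get("allocatedMB", 0), 0)
        entry["allocatedVCores"] += max(app.get("allocatedVCores", 0), 0)
        entry["apps"] += 1
    return dict(totals)
```

A scheduler would call `aggregate_by_queue(fetch_running_apps())` once a minute and store the result with a timestamp, which is what makes the later “rewind” possible.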

Interestingly, LinkedIn keeps 70 days of dev cluster data in a MySQL database and imports roughly 600,000 records per day. A UI on top of this database lets administrators and analysts explore the data and produce insights for LinkedIn’s technology group.
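A quick back-of-the-envelope sketch shows what those figures imply for the Housekeeper. The two constants come from the numbers above; the function names and the idea of a date cutoff are assumptions made for illustration, not details of LinkedIn’s implementation.

```python
from datetime import date, timedelta

RETENTION_DAYS = 70        # days of data kept, as quoted above
RECORDS_PER_DAY = 600_000  # approximate daily import, as quoted above


def retention_cutoff(today, retention_days=RETENTION_DAYS):
    """Return the oldest day that is still kept; older data can be deleted."""
    return today - timedelta(days=retention_days - 1)


def steady_state_rows(records_per_day=RECORDS_PER_DAY,
                      retention_days=RETENTION_DAYS):
    """Approximate number of rows held once the retention window is full."""
    return records_per_day * retention_days
```

With these figures the database settles at around 42 million rows, which is comfortable for MySQL as long as old partitions are dropped every night.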

Some very cool features of Rewinder are as follows:

  • Rewinding today - You can revisit any minute of a given day to see what the allocation state was on the cluster(s).
  • Who’s using what - You can see which queues consume which resources, and how much, over a given time range.
  • Queues - You can see how applications are queuing for resources, which makes it possible to spot pending failures that may or may not fall within the fault tolerance parameters being checked.
  • Reporting - It can report over a 30 day period on usage allocation, application usage, job usage frequency and more.
  • Visualizations - Utilization and allocation charts, and more.
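The first two features in the list boil down to simple queries over per-minute snapshots. The sketch below illustrates the idea with an in-memory SQLite database standing in for MySQL; the table `queue_snapshot`, its columns and the sample rows are all hypothetical and not taken from Rewinder.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL store; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queue_snapshot (ts TEXT, queue TEXT, allocated_mb INTEGER)"
)
conn.executemany(
    "INSERT INTO queue_snapshot VALUES (?, ?, ?)",
    [
        ("2015-03-10 09:00", "default", 4096),
        ("2015-03-10 09:00", "etl", 8192),
        ("2015-03-10 09:01", "default", 2048),
        ("2015-03-10 09:01", "etl", 10240),
    ],
)


def rewind(conn, minute):
    """Return the allocation per queue as it was at the given minute."""
    rows = conn.execute(
        "SELECT queue, allocated_mb FROM queue_snapshot WHERE ts = ?",
        (minute,),
    )
    return dict(rows.fetchall())


def usage_by_queue(conn, start, end):
    """Return MB-minutes consumed per queue between start and end, inclusive."""
    rows = conn.execute(
        "SELECT queue, SUM(allocated_mb) FROM queue_snapshot "
        "WHERE ts BETWEEN ? AND ? GROUP BY queue",
        (start, end),
    )
    return dict(rows.fetchall())
```

Because each snapshot covers one minute, summing the allocated memory over a range gives MB-minutes per queue, a rough measure of who used how much.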

With an integrated tool like this, it is not hard to see why the operations team built it and how it benefits from the value Rewinder adds to the management of the Hadoop clusters LinkedIn uses to run its business.

SIREn Solutions has announced the launch of its business intelligence tool, “Kibi”. It is a cross-index data intelligence tool (an abstract relational layer) that is a fork of Kibana and provides useful enhancements to Elasticsearch. One could say it is the BI analyst’s best friend, given its ability to join across existing Elasticsearch indexes through a very flexible plugin. SIREn advises that the plugin will be released as a standalone product in the coming weeks.

This cross-index relational capability has some interesting use cases for analysts: where the data is already indexed, related data from several indexes can be joined at query time and returned as a single result. That makes it a definite ‘must have’ for the analyst, and SIREn deserves the credit for making such runtime joins possible, which is a good thing in the world of big data analysis.

Kibi is more than this innovative plugin, however. Its analytic capability extends beyond cross-index joins and filtering to decent data visualisations, relational analysis and SQL queries on external data sources. As an open source project, Kibi can be found on GitHub, with its intuitive front end distributed under the Apache 2 licence and SIREn’s cross-index plugin distributed under the AGPL licence. You can also check out Kibi at http://siren.solutions/kibi.
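To show what a cross-index join means in practice, here is a purely conceptual sketch in plain Python. It does not use the SIREn plugin or the Elasticsearch API; the two “indexes” are just lists of documents, and the function name `cross_index_filter` and the sample fields are invented for illustration.

```python
def cross_index_filter(source_docs, source_field, target_docs, target_field):
    """Keep target documents whose field value also appears in the source index."""
    # Collect the join keys from the source "index".
    keys = {doc[source_field] for doc in source_docs if source_field in doc}
    # Filter the target "index" by those keys (a semi-join).
    return [doc for doc in target_docs if doc.get(target_field) in keys]


# Two toy "indexes": companies and news articles that mention a company.
companies = [
    {"id": "c1", "name": "Acme", "country": "IE"},
    {"id": "c2", "name": "Globex", "country": "US"},
]
articles = [
    {"title": "Acme raises funding", "company_id": "c1"},
    {"title": "Globex opens office", "company_id": "c2"},
    {"title": "Market roundup", "company_id": "c9"},
]

# Articles about Irish companies, returned as a single result set.
irish = [c for c in companies if c["country"] == "IE"]
irish_articles = cross_index_filter(irish, "id", articles, "company_id")
```

Without a join layer, the analyst would have to run the first query, copy the matching ids and paste them into a second query by hand. Doing this inside the search engine at query time is the convenience described above.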

As another week comes to a close, the evolution of Hadoop in customised cluster environments and new BI innovations are a testament to progress for everybody in our Kaizen big data world, which marks its successes by a “+1” of progress through people, process and technology.

About Sonra

We are a Big Data company based in Ireland. We are experts in data lake implementations, clickstream analytics, real time analytics, and data warehousing on Hadoop. We can help with your Big Data implementation. Get in touch.

We also run the Hadoop User Group Ireland. If you are interested in attending or presenting, register on the Meetup website.