A comprehensive look at the Hadoop big data framework: don't let a minute of performance go to waste

For a product manager, understanding how a product is used is one of the most important parts of the job. For a product like a Hadoop platform, however, this is surprisingly hard to pin down. There are many dimensions worth measuring, such as task status, errors, compute resources, and storage, that help users understand the health of the platform and thereby improve the user experience on Hadoop. Product managers for consumer and mobile Internet products face the same problem; for consumer products, they typically measure user activity, engagement, revenue, conversion rates, and retention rates.

This article presents some basic methods for collecting metrics about users on a Hadoop platform, analyzing their usage patterns, and doing product planning based on the results.

Which user metrics are available depends on the Hadoop distribution in use; here I use Cloudera's distribution. Cloudera provides a good tool called Cloudera Navigator. It offers event collection, viewing, and other audit functions that make it easier to understand what data exists and how it is used. The tool meets the needs of most scenarios and makes inspecting and auditing the platform much easier. Cloudera Navigator is part of Cloudera Manager, which exposes a range of robust APIs for integration with existing monitoring tools. In addition, Cloudera Manager provides a configurable dashboard on which each metric can be visualized and tracked in near real time.

Using the Cloudera Manager API

Product managers may find that Navigator does not provide everything they need: they may want to display more customized metrics, with richer visualizations, on their dashboards. In that case the Cloudera Manager API (CM API) is a great tool. It is a REST API through which many of the metrics behind your desired KPIs can be aggregated and summarized. As discussed below, the endpoints share a common set of operations, and API calls return JSON objects in a consistent format across interfaces.

In the following example, we use Cloudera Manager queries to retrieve the state of tasks that users ran on YARN inside a Hadoop cluster over a period of time. A Python script shows how to call the API, including passing parameters and a time window, and obtaining the specified number of results in JSON format.

Note: This article uses Python 2.7.

Let's look at the 'yarnApplications' endpoint: calling this interface returns the attributes of YARN applications run by users. Attributes available through the API include the application ID, application name, start time, end time, user name, and resource pool (the pool the application was submitted to). As an example, create a file named "yarnmetrics.py", open it, and enter the following:

1) Import the necessary libraries.

import json
import sys
import urllib2
from datetime import datetime, timedelta

2) Set the number of records to be returned; by default, a single call returns at most 1000 records.

3) We also need to build the time window to pass to the API. For example, the range can be taken as an offset from the current time. In this case, we want the API to return data from 8 a.m. to 5 p.m. of the previous day.

cur_time = datetime.now() - timedelta(days=1)
to_time = cur_time.replace(hour=17, minute=0, second=0, microsecond=0).isoformat()
from_time = cur_time.replace(hour=8, minute=0, second=0, microsecond=0).isoformat()

4) Pass parameters to the script. The argument tells the script which API endpoint to call.

limiter = 20
metric = sys.argv[1]

5) The API endpoint has the format: /api/v7/clusters/{clusterName}/services/{serviceName}/yarnApplications

The exact call depends on how your cluster is configured in Cloudera Manager. Refer to the Cloudera Manager documentation for more information about service and role metrics.

Here we assume a scenario in which passing in "applications" triggers the call below. from_time and to_time should be ISO-format timestamps, and together with limiter they are the three parameters that must be passed.
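The snippet for this call did not survive in the original, so here is a minimal sketch of how the URL might be assembled from the variables defined above; the host, cluster name ("Cluster1"), and service name ("yarn") are placeholders you would replace with your own:

# Hypothetical Cloudera Manager host and names; replace with your own values.
cm_host = 'http://cmhost.example.com:7180'
if metric == 'applications':
    # Pass the three required parameters: from_time, to_time and limiter.
    url = (cm_host + '/api/v7/clusters/Cluster1/services/yarn/yarnApplications'
           '?from=' + from_time + '&to=' + to_time + '&limit=' + str(limiter))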

6) Depending on how your Cloudera Manager installation is configured, you may need to authenticate with its server, so we base64-encode the credentials. Note: different security environments and configurations call for different authentication methods. For demonstration purposes, the example below is abbreviated to the body of a small helper function (the helper name is ours):

return"Basic " + (user + ":" +password).encode("base64").rstrip()

7) Now we can make the API call, including the encoded credentials in the request.

8) Submit the request and write the result to a JSON file.

req = urllib2.Request(url)
req.add_header('Accept', 'application/json')
req.add_header('Authorization', 'Basic fsfadgibberishsdfdfsfF=')  # populate with your encoded credentials
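The lines above only build the request; the code that actually submits it and saves the output did not survive, so here is a minimal sketch, assuming the url from step 5 and a file named after the metric argument:

# Submit the request and persist the JSON response to disk.
response = urllib2.urlopen(req)
data = json.loads(response.read())
with open('yarn_' + metric + '.json', 'w') as f:
    json.dump(data, f, indent=2)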

9) At this point, we get a JSON file containing information about all the tasks that ran during the specified time period.

You can run the script with the following command: python yarnmetrics.py applications

With this data in hand, you can analyze it with whichever visualization tool you prefer.
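As a toy example of such an analysis, the sketch below counts applications per user from the saved file; it assumes the file-naming convention above and that the yarnApplications response wraps its results in an 'applications' array:

import json
from collections import Counter

with open('yarn_applications.json') as f:
    apps = json.load(f).get('applications', [])

# Tally how many applications each user ran in the queried window.
by_user = Counter(app.get('user') for app in apps)
for user, count in by_user.most_common():
    print user, count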

For example, you can get a list of the users querying Impala through the impalaQueries endpoint, or track routine usage through the YARN endpoint described above.

At this point, it is not hard to use tools such as cron or Oozie to automate collecting cluster usage at whatever interval you like. If you need an automated dashboard or user-related queries, you can also write this structured JSON data back into Hadoop. A simple workflow can load these JSON files into a Hive table, and your visualization tool can then query Hive to drive the dashboard.
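As one illustration of the loading step, the sketch below pushes a day's JSON file into HDFS with the standard hadoop fs command; the target directory is a made-up convention, and a scheduler such as Oozie or cron would run this alongside the metrics script:

import subprocess

# Hypothetical HDFS landing directory backing a Hive external table.
subprocess.check_call(['hadoop', 'fs', '-put', '-f',
                       'yarn_applications.json',
                       '/user/metrics/yarn_applications/'])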

