Hadoop subsistence-allowance data platform: collection + cleaning + analysis + visualization

1. Requirements:

1.1 Background Description

With the rapid development of the Chinese economy, incomes are rising, but part of the population still needs financial help and receives a monthly urban minimum living allowance (subsistence allowance). To better achieve targeted poverty alleviation, your team is asked to use a given "city subsistence-allowance population information table" to collect information such as the income of the subsistence-allowance population, the number of unemployed people among them, and the number of people with disabilities such as visual or hearing impairments, and then analyze the size of the subsistence-allowance population in each district, per-capita income, and the overall income trend in recent years, so as to provide accurate data in support of poverty alleviation.
To accomplish this task, your team plans to use Python, a language widely adopted in industry, as the base language of the entire project, and to combine MySQL, Matplotlib, pandas, Hive, and other technologies to improve development efficiency and meet the project requirements. The project is built on a server cluster: the "city subsistence-allowance population information table" is acquired, cleaned, and analyzed, with the goal of providing accurate data support for poverty reduction.
Task one: deploy and manage Hadoop platform components (15 points)
1. Deployment environment
1) The Hadoop system is installed under "/usr/local/hadoop"; configure the hadoop.tmp.dir storage location to "/usr/local/hadoop/tmp";
2) configure dfs.namenode.name.dir to /usr/local/hadoop/tmp/dfs/name;
3) configure dfs.datanode.data.dir to /usr/local/hadoop/tmp/dfs/data;
4) format the NameNode;
5) start the NameNode and DataNode daemons.
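The directory settings above correspond to entries in Hadoop's core-site.xml and hdfs-site.xml. A minimal sketch, with property values taken from the task; the fs.defaultFS host and port are not given in the task and are only an assumption here:

```xml
<!-- core-site.xml: default filesystem and temporary-file base directory -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value> <!-- hostname and port assumed, not from the task -->
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
</configuration>

<!-- hdfs-site.xml: NameNode and DataNode storage directories -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/tmp/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/tmp/dfs/data</value>
  </property>
</configuration>
```

After editing these files, `hdfs namenode -format` and `start-dfs.sh` correspond to steps 4) and 5).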
2. Network configuration (all nodes)
1) change the hostname of the machine;
2) log out of the current session and log back in;
3) turn off the firewall;
4) change the machine's IP address;
5) configure the hosts file;
6) restart the network;
7) create a normal user (you can create the hadoop user through the graphical interface while installing the CentOS system; set the password to hadoop).
3. Passwordless SSH authentication configuration
Running Hadoop requires managing remote Hadoop daemons: after Hadoop starts, the NameNode starts and stops the daemon processes on each DataNode via SSH (Secure Shell). These commands must be executed between nodes without entering a password, so SSH must be configured to use passwordless public-key authentication. The NameNode can then log in to each DataNode over SSH without a password and start the DataNode process; by the same principle, each DataNode can also log in to the NameNode over SSH without a password.
1) install and start the SSH service; (all nodes)
2) switch to the hadoop user;
3) generate a key pair on each node; (all nodes)
4) check whether the ".ssh" folder exists under "/home/hadoop/", and whether ".ssh" contains the two files of the key pair just generated; (all nodes)
5) append id_rsa.pub to the authorized_keys file; (all nodes)
6) modify the permissions of the "authorized_keys" file; (all nodes)
7) edit the SSH configuration settings; (all nodes)
8) after changing the settings, remember to restart the SSH service so that they take effect; (all nodes)
9) switch to the hadoop user;
10) verify that passwordless login succeeds; (all nodes)
11) copy the master node's public key id_rsa.pub to each slave node;
12) on each slave node, append the public key copied from the master node to the authorized_keys file; (all slave nodes)
13) delete the file id_rsa.pub; (all slave nodes)
14) verify passwordless authentication from the master to each slave node; (master node)
15) copy each slave node's public key to the master; (note: complete steps 16 and 17 after finishing the operation on each slave node)
16) on the master node, append the public key copied from each slave node to the authorized_keys file; (master node)
17) delete the file id_rsa.pub; (master node)
18) verify passwordless authentication from each slave node to the master. (slave nodes)
4. Java installation environment (all nodes must be configured)
Upload the jdk-8u77-linux-x64.tar.gz package to the /root directory of the master node.
1) switch to root;
2) create a new java directory;
3) extract the package to the /usr/java directory;
4) configure the environment variables;
5) make the environment variables take effect;
6) verify that the installation succeeded.
5. Install Hadoop on the master node
1) extract the package to the /usr directory;
2) rename it;
3) configure the Hadoop environment variables;
4) make the Hadoop environment variables take effect;
5) configure hadoop-env.sh;
6) configure core-site.xml;
7) configure hdfs-site.xml;
8) configure yarn-site.xml;
9) configure mapred-site.xml;
10) configure the masters file;
11) configure the slaves file;
12) create the required directories;
13) modify the permissions of the /usr/local/hadoop directory;
14) synchronize the Hadoop installation files from the master to slave1 and slave2;
15) configure the Hadoop environment variables on each slave node; (all slave nodes)
16) make the Hadoop environment variable configuration take effect; (all slave nodes)
17) modify the permissions of the /usr/local/hadoop directory; (all slave nodes)
18) switch to the hadoop user. (all slave nodes)
6. Test
1) switch to the hadoop user; (master node)
2) format the NameNode; (master node)
3) start Hadoop; (master node)
4) view the Java processes;
5) use a browser to view the status of the NameNode on the master node (for Hadoop 2.x this is typically port 50070);
6) browse the DataNode data nodes;
7) use a browser on the master node to view all applications (the YARN ResourceManager UI, typically port 8088);
8) browse the nodes;
9) shut down Hadoop.
Task two: data acquisition (15 points)
According to the table-header style given below, write a crawler yourself or use an available data source to acquire the data, and save it to the corresponding "Task two" server.
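The header table itself did not survive in this copy of the task sheet, so the column names below are only assumptions inferred from the analyses requested in tasks three and four. A minimal sketch of saving acquired records to a CSV file:

```python
import csv

# Hypothetical column names inferred from tasks three and four; the real
# header row is defined by the (missing) table in the task sheet.
HEADER = ["district", "year", "income", "employment_status", "disability"]

# Placeholder records standing in for crawled / collected rows.
records = [
    {"district": "District A", "year": 2015, "income": 6200,
     "employment_status": "unemployed", "disability": "none"},
    {"district": "District B", "year": 2016, "income": 5800,
     "employment_status": "unregistered unemployed", "disability": "visual"},
]

def save_records(path, rows):
    """Write collected rows to a CSV file with the assumed header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=HEADER)
        writer.writeheader()
        writer.writerows(rows)

save_records("dibao_population.csv", records)
```

The resulting file can then be uploaded to the task server, or loaded into Hive in task three.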


Task three: data cleaning and analysis (25 points)
1) create the Hive table;
2) import the crawled data into the corresponding table;
3) read the data set;
4) purge invalid data from the data set;
5) count the missing values in the specified columns;
6) view the rows that have missing values;
7) fill in the missing values and save the result to a new table;
8) read the data set and carry out the analyses below:
a) the average income of the subsistence-allowance population in 2013-2015;
b) count the number of unemployed subsistence-allowance people in each district in 2016;
c) the average income of the "unregistered unemployed" in 2015-2016;
d) the number of people with a "visual or hearing disability" in each district in 2014;
e) standardize the specified attribute and write it to a file.
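The cleaning and analysis steps above can be sketched with pandas. This is a minimal illustration on a small in-memory stand-in for the data set; the column names and the mean-fill strategy for step 7) are assumptions, not requirements of the task:

```python
import pandas as pd
import numpy as np

# Small stand-in for the data set; column names are assumptions.
df = pd.DataFrame({
    "district": ["A", "A", "B", "B", "C"],
    "year": [2013, 2014, 2015, 2016, 2016],
    "income": [5000.0, np.nan, 6000.0, 7000.0, np.nan],
    "employment_status": ["employed", "unemployed", "unemployed",
                          "unregistered unemployed", "unemployed"],
})

# 5) count missing values in a specified column
missing_in_income = df["income"].isna().sum()

# 6) view the rows that contain missing values
rows_with_missing = df[df.isna().any(axis=1)]

# 7) fill the missing values (here: the column mean) and keep the result
df_filled = df.fillna({"income": df["income"].mean()})

# a) average income over 2013-2015
avg_2013_2015 = df_filled.loc[
    df_filled["year"].between(2013, 2015), "income"].mean()

# b) number of unemployed people per district in 2016
unemployed_2016 = (
    df_filled[(df_filled["year"] == 2016)
              & (df_filled["employment_status"] == "unemployed")]
    .groupby("district").size()
)

# e) standardize (z-score) the income column
income = df_filled["income"]
df_filled["income_std"] = (income - income.mean()) / income.std()
```

In the real task the DataFrame would be read from the Hive/crawled table rather than constructed inline, and `df_filled` would be written back out as the new table.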
Task four: data visualization (20 points)
Push the analyzed data to a MySQL database, then render the visualized content:
1) use Matplotlib to draw charts for the districts;
2) show the average income of the subsistence-allowance population over a given two-year period;
3) count the unemployed subsistence-allowance population of each district;
4) show the average income of the "unregistered unemployed" over a given two-year period;
5) show the number of people in each district with a "visual or hearing disability" in a given year.
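A minimal Matplotlib sketch of the per-district charts, using the non-interactive Agg backend and placeholder figures; in the real task the values would be read back from the MySQL database (e.g. with pandas.read_sql over a mysql+pymysql engine):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the chart renders without a display
import matplotlib.pyplot as plt

# Placeholder analysis results; in the real task these come from MySQL.
districts = ["District A", "District B", "District C"]
avg_income = [5600, 6100, 5300]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(districts, avg_income, color="steelblue")
ax.set_xlabel("District")
ax.set_ylabel("Average income of subsistence-allowance population")
ax.set_title("Average income by district (placeholder data)")
fig.tight_layout()
fig.savefig("avg_income_by_district.png")
plt.close(fig)
```

The other charts (unemployed counts, disability counts) follow the same pattern with different query results and axis labels.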
Task five: comprehensive analysis (15 points)
Answer the following questions based on the visualized charts:
1) which area needed the largest investment of subsistence-allowance funding in 2016;
2) which region had the lowest average income of the subsistence-allowance population in 2016;
3) which region had the highest average income of the subsistence-allowance population in 2016;
4) how to improve the average income of the subsistence-allowance population.

 

2. Implementation

Link: https://pan.baidu.com/s/1Olalilme_4hmpeJOakrEDg
Extraction code: htp1


Origin blog.csdn.net/weixin_40903057/article/details/90598882