Hive + Job Data Acquisition, Data Analysis, and Data Visualization

1. Requirements

Background Description
In recent years, with the rapid development of the IT industry, demand for IT professionals across the country has been growing steadily. To clarify the future direction of its IT personnel training, "XHS Group" is carrying out an analysis of IT job postings in a number of provinces. Your group will take on this simulated research and analysis task: crawl recruitment websites to obtain the company name, place of work, job name, recruitment requirements, and number of openings, then clean and analyze the data to identify the currently popular IT positions, such as big-data-related jobs, and present the results visually.
For this simulated mission, the project team plans to run Hadoop in distributed mode on a server cluster, crawl the relevant information from recruitment websites, and then clean, organize, compute, present, and analyze the data, aiming for a clear grasp of the employment landscape for IT personnel.
As a member of the project team's technical staff, you are a core member of the technical demonstration. Complete the following technical demonstration tasks step by step and submit a technical report. Good luck.
Task 1: Hadoop Platform and Component Deployment Management (15 points)
1) Extract the Hive installation package from the specified path to the specified directory;
2) Rename the extracted apache-hive-1.1.0-bin folder to hive;
3) Set the Hive environment variables so that they take effect only for the current root user;
4) In the Hive installation directory, rename hive-default.xml.template to hive-site.xml;
5) Create a temporary folder in the Hive installation directory;
6) Create and configure the hive-site.xml file so that the Hive metastore is stored in a MySQL database;
7) Initialize the Hive metastore;
8) Start Hive (a shell sketch of steps 1-8 follows the list).
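
A minimal command-line sketch of the above steps. The package path /opt/software, the target directory /usr/local, and the MySQL user/password hive/hive are assumptions; substitute the paths and credentials specified in your environment:

    # 1) extract the installation package to the target directory
    tar -zxvf /opt/software/apache-hive-1.1.0-bin.tar.gz -C /usr/local
    # 2) rename the extracted folder to hive
    mv /usr/local/apache-hive-1.1.0-bin /usr/local/hive
    # 3) environment variables for the root user only: /root/.bashrc, not /etc/profile
    echo 'export HIVE_HOME=/usr/local/hive' >> /root/.bashrc
    echo 'export PATH=$PATH:$HIVE_HOME/bin' >> /root/.bashrc
    source /root/.bashrc
    # 4) rename the configuration template
    mv /usr/local/hive/conf/hive-default.xml.template /usr/local/hive/conf/hive-site.xml
    # 5) create a temporary folder in the installation directory
    mkdir /usr/local/hive/tmp
    # 6) in hive-site.xml, point the metastore at MySQL, e.g.
    #      javax.jdo.option.ConnectionURL        jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
    #      javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver
    #      javax.jdo.option.ConnectionUserName   hive
    #      javax.jdo.option.ConnectionPassword   hive
    #    (copy the MySQL JDBC driver jar into /usr/local/hive/lib first)
    # 7) initialize the metastore
    schematool -dbType mysql -initSchema
    # 8) start Hive
    hive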

Task 2: Data Acquisition (15 points)
1) Crawl data from one or more recruitment websites. The extracted data items must include at least the following fields: "company name", "city of work", "work requirements", "number of openings", "salary" (format: 'lower limit - upper limit'), "name" (job name), and "detail" (job details); save the crawled data;
2) The crawled data must be imported into the Hadoop platform for cleaning and analysis; save the data to the HDFS file system (see the sketch below).
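
A sketch of the HDFS import in step 2), assuming the crawler has saved its records to a local file named jobdata.csv (the file name and the /rawdata target directory are assumptions):

    # create an HDFS directory for the raw data and upload the crawled file
    hdfs dfs -mkdir -p /rawdata
    hdfs dfs -put jobdata.csv /rawdata/
    # verify the upload
    hdfs dfs -ls /rawdata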
Task 3: Data Cleaning and Analysis (25 points)
1. To facilitate data analysis and visualization, the crawled data must be cleaned: remove any record whose job-information fields contain null values, and separate the fields of each cleaned record with ",". The cleaning is done with a MapReduce program.
1) Write the MapReduce data-cleaning program (the raw data contains the job-description field);
2) Upload the finished program file and the raw data to be cleaned to HDFS;
3) Load the cleaned data into the Hive data warehouse (an illustration of the cleaning rule follows the list).
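
The graded deliverable is a compiled MapReduce jar; purely as an illustration of the cleaning rule, the same filter can be written as a map-only Hadoop Streaming job with a shell mapper. The tab-separated input and the seven-field record layout (one field per Task 2 data item) are assumptions:

    #!/bin/bash
    # mapper.sh: drop any record with an empty or NULL field,
    # then re-join the surviving fields with ","
    awk -F'\t' '{
      ok = (NF == 7)                         # one field per Task 2 data item
      for (i = 1; i <= NF; i++)
        if ($i == "" || $i == "NULL") ok = 0 # reject records containing nulls
      if (ok) {
        out = $1
        for (i = 2; i <= NF; i++) out = out "," $i
        print out                            # cleaned, comma-separated record
      }
    }'

    # ship the mapper and run it map-only (the output path is illustrative)
    chmod +x mapper.sh
    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -D mapreduce.job.reduces=0 \
      -input /rawdata -output /tmp/cleantest \
      -mapper mapper.sh -file mapper.sh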
2. Package the cleaning program as a jar, upload it to the Hadoop platform, and run it in sequence; save the cleaned result to the Hive database for subsequent use.
1) Upload the jar package to the /root directory of the Hadoop platform;
2) Run the MapReduce task and record the run output;
3) After successful execution, the data is stored in the /Clean directory of the HDFS file system;
4) After cleaning, the data is stored in the cleandata table of the task database in Hive (see the sketch below).
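
A sketch of the run, assuming the jar is named clean.jar, its main class is com.example.JobClean, and the Hive database is named task (all three names are assumptions):

    # 2) run the cleaning job against the raw data; output lands in /Clean
    hadoop jar /root/clean.jar com.example.JobClean /rawdata /Clean
    # 3) confirm the cleaned output in HDFS
    hdfs dfs -ls /Clean
    # 4) load the cleaned data into the cleandata table of the task database
    hive -e "LOAD DATA INPATH '/Clean' INTO TABLE task.cleandata;"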
3. Run HQL at the command line to complete the following statistical analyses:
1) Count the number of openings for each job, and write the result to the cleantable table;
2) Query the skill requirements of "big data"-related jobs, and write the query result to the table_bigdata table;
3) Execute the keycount.sql script in Hive and view the number of occurrences of each core skill in the keycount table (a sketch of these queries follows).
Note: the core skill keywords are: C++, Scala, Flume, Flink, ETL, mathematics, data warehouse, HBase, Hadoop, Python, Java, Kafka, Storm, Linux, Hive, Spark.
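
A sketch of the three statistics, run from the shell with hive -e / hive -f. The cleandata column names (name, detail) and the '%data%' matching pattern are assumptions; adapt them to the actual table schema:

    # 1) openings per job name, written to cleantable
    hive -e "INSERT OVERWRITE TABLE cleantable
             SELECT name, COUNT(*) FROM cleandata GROUP BY name;"
    # 2) skill requirements of big-data-related jobs, written to table_bigdata
    hive -e "INSERT OVERWRITE TABLE table_bigdata
             SELECT name, detail FROM cleandata WHERE name LIKE '%data%';"
    # 3) run the provided script, then inspect the per-keyword counts
    hive -f keycount.sql
    hive -e "SELECT * FROM keycount;"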
Task 4: Data Visualization (20 points)
Present the data visually as follows:
1) Use a bar chart of the top ten jobs by number of openings to show the hot directions in current recruitment;
2) Use a line chart to show the differences in the number of openings among "big data"-related jobs;
3) Use a word cloud to display the knowledge and skills required by "big data"-related positions (the export step below prepares the chart data).
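
The inputs for the three charts can be exported from Hive as tab-separated text (hive -e writes tab-separated rows to stdout) and plotted with any charting tool; the output file names are arbitrary:

    # per-job counts from Task 3, for the bar chart and the line chart
    hive -e "SELECT * FROM cleantable;" > jobcounts.tsv
    # big-data skill requirements and keyword frequencies, for the word cloud
    hive -e "SELECT * FROM table_bigdata;" > bigdata_skills.tsv
    hive -e "SELECT * FROM keycount;" > keycount.tsv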
Task 5: Comprehensive Analysis (15 points)
1) Explain what the main skills required for big-data jobs include, and elaborate on the basis of your analysis;
2) Based on market demand, analyze which training programs the IT industry should run, and explain the reasons in detail;
3) Based on the market-demand analysis, state the personnel-training directions of the big-data industry, and elaborate on the reasons;
4) Briefly describe the future direction of IT personnel training at "XHS Group".

 

2. Implementation

Link: https://pan.baidu.com/s/1dHLhFtAVThOr5pGecO4g6w
Extraction code: zvif
