Screenshots of the original code are no longer available after graduation; the experiment documents can be obtained at the end of the article.
Big data platform and application
Table of contents
1 Summary
Through the analysis and modeling of practical problems, data model selection, and related steps, this course improves students' ability to use non-relational data models to solve practical problems. To achieve this goal, the course is built around a practical case: the analysis, design, implementation, debugging, testing, and demonstration of a comprehensive, moderately sized non-relational database system. Assessment combines process evaluation with result evaluation, examining both the students' ability to understand and solve problems during the design and the practicality and rationality of the final design. In short, the course fully implements the concept and requirements of cultivating students' ability to solve complex engineering problems across the stages of designing solutions and developing systems, so as to achieve the course's goal.
2 Topic design (describe the topic design ideas in detail):
1. Data set download and preprocessing (producing a usable, cleaned data file)
2. Import local data set into Hive for data analysis
3. Data interchange among Hive, MySQL, and HBase
4. Use Python for data visualization analysis
Upload the local dataset to the data warehouse Hive
Hive data analysis
Data interchange among Hive, MySQL, and HBase
The data set used in this report is user.zip, which contains a large-scale data set, raw_user.csv (20 million records), and a small data set, small_user.csv (only 300,000 records). The small data set small_user.csv is a subset extracted from the large-scale data set raw_user.csv. The reason for extracting a small number of records into a separate data set is that various errors and problems will be encountered the first time the entire experimental process is run; testing with a small data set first saves a great deal of running time. Once the complete experimental process runs through smoothly, the final test can be carried out with the large-scale data set.
The training report is an important part of learning the big data technology stack. It builds a global understanding of how big data technologies are applied together, so that the individual technologies learned can be effectively integrated and practical problems solved by combining them. It covers the installation and use of Linux, MySQL, Hadoop, HBase, Hive, Sqoop, Python, Eclipse, and other systems and software. The installation and use of these tools are woven into each step of the experiment, which effectively deepens the understanding of each technology.
6 Detailed project implementation process and program source code list:
1. Download and save the experimental data set
1. Download the data set: get the small data set small_user.csv (300,000 records) from the official website of the reference book
2. Create a directory bigdatacase under /usr/local for this case
3. Create a dataset directory under /usr/local/bigdatacase/ to hold the data set
4. Move small_user.csv from /home/hadoop/download/ into the dataset directory
5. View the first five records of small_user.csv
3 Dataset Preprocessing
1. Delete the field name in the first line of the file
2. Preprocess the fields: create a script file pre_deal.sh and add the preprocessing logic
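The two preprocessing steps above can be sketched as follows. The sample records are made up for illustration, and the exact field logic of pre_deal.sh is an assumption based on the common version of this tutorial: split the comma-separated fields, keep only the date part of the timestamp, and append a randomly chosen province.

```shell
# A tiny stand-in for small_user.csv (header line + two records, made up):
cat > small_user.csv <<'EOF'
user_id,item_id,behavior_type,item_category,time
10001082,285259775,1,4076,2014-12-08 18
10001082,4368907,1,5503,2014-12-12 12
EOF

# 1. Delete the field-name header on the first line
sed -i '1d' small_user.csv

# 2. pre_deal.sh: number each record, re-emit the comma-separated fields
#    tab-separated, truncate the time to its date part, and append a
#    randomly chosen province (province list here is an assumption)
cat > pre_deal.sh <<'EOF'
#!/bin/bash
infile=$1
outfile=$2
awk -F ',' 'BEGIN {
    srand();
    id = 0;
    Province[0] = "Beijing"; Province[1] = "Shanghai"; Province[2] = "Fujian";
}
{
    id = id + 1;
    value = int(rand() * 3);
    print id "\t" $1 "\t" $2 "\t" $3 "\t" $4 "\t" substr($5, 1, 10) "\t" Province[value]
}' $infile > $outfile
EOF

bash pre_deal.sh small_user.csv user_table.txt
head -5 user_table.txt
```

Each output line then has seven tab-separated fields (id, uid, item_id, behavior_type, item_category, visit_date, province), which is the layout the later Hive table assumes.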
4 Import the data into Hive
1. Start HDFS
2. Upload user_table.txt to HDFS
3. Start Hive and create a database on it
4. Create an external table
5. Query data
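Steps 3–5 above can be sketched in HiveQL as follows. This assumes user_table.txt has already been uploaded to HDFS under /bigdatacase/dataset (e.g. with `hdfs dfs -put`); the database name dblab, table name bigdata_user, and column layout follow the common version of this tutorial and should be treated as assumptions.

```sql
-- Create the database and an external table over the uploaded file;
-- the 7 columns match the tab-separated output of pre_deal.sh
CREATE DATABASE IF NOT EXISTS dblab;
USE dblab;
CREATE EXTERNAL TABLE IF NOT EXISTS bigdata_user(
    id INT, uid STRING, item_id STRING, behavior_type INT,
    item_category STRING, visit_date DATE, province STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/bigdatacase/dataset';

-- Verify that the data is visible
SELECT * FROM bigdata_user LIMIT 10;
```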
5 Hive data analysis
1. Simple query analysis
A. Query the behavior of the first 10 users on commodities
B. Query the time and category of goods purchased by the first 20 users
C. Nested query
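The three simple queries can be sketched as follows; the table name bigdata_user and the convention that behavior_type 4 means "purchase" are assumptions carried over from the reference tutorial.

```sql
-- A. Behavior of the first 10 records
SELECT behavior_type FROM bigdata_user LIMIT 10;

-- B. Visit date and item category of the first 20 purchase records
--    (behavior_type = 4 is assumed to mean "purchase")
SELECT visit_date, item_category FROM bigdata_user
WHERE behavior_type = 4 LIMIT 20;

-- C. The same kind of query written in nested form with a subquery
SELECT e.bh, e.it FROM
    (SELECT behavior_type AS bh, item_category AS it FROM bigdata_user) AS e
LIMIT 20;
```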
2. Statistical analysis of record counts
A. Use the aggregate function count() to calculate how many rows of data there are in the table
B. Add DISTINCT inside the function to find how many distinct uid values there are
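In HiveQL the two counting queries look like this (table name bigdata_user assumed as above):

```sql
-- A. Total number of rows in the table
SELECT COUNT(*) FROM bigdata_user;

-- B. Number of distinct users
SELECT COUNT(DISTINCT uid) FROM bigdata_user;
```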
3. Conditional keyword query analysis
A. Query over a value range of a keyword
B. Fix a keyword to a given value as a condition and analyze the other data
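Both query styles can be sketched as follows; the specific dates are illustrative values, not results from the report.

```sql
-- A. Range condition on a keyword: purchases within a date interval
SELECT COUNT(*) FROM bigdata_user
WHERE behavior_type = 4
  AND visit_date > '2014-12-10' AND visit_date < '2014-12-13';

-- B. Keyword fixed to one value: per-user purchase counts on a single day
SELECT uid, COUNT(behavior_type) FROM bigdata_user
WHERE behavior_type = 4 AND visit_date = '2014-12-12'
GROUP BY uid;
```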
4. Analysis by user behavior
A. Query the purchase ratio or browsing ratio of a product on a certain day
B Query the proportion of a user's click on the website on a certain day to all clicks on that day
C Given the quantity range of purchased goods, query the user id who purchased the quantity of goods on the website on a certain day
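The three behavior analyses can be sketched as below; the uid, dates, and the threshold of 5 purchases are illustrative assumptions, and the ratios in A and B are obtained by dividing the paired counts.

```sql
-- A. Purchases that day vs. all actions that day (ratio = first / second)
SELECT COUNT(*) FROM bigdata_user
WHERE visit_date = '2014-12-11' AND behavior_type = 4;
SELECT COUNT(*) FROM bigdata_user
WHERE visit_date = '2014-12-11';

-- B. One (hypothetical) user's clicks that day vs. all clicks that day
SELECT COUNT(*) FROM bigdata_user
WHERE uid = '10001082' AND visit_date = '2014-12-12';
SELECT COUNT(*) FROM bigdata_user
WHERE visit_date = '2014-12-12';

-- C. Users who bought more than a given number of items that day
SELECT uid FROM bigdata_user
WHERE behavior_type = 4 AND visit_date = '2014-12-12'
GROUP BY uid
HAVING COUNT(behavior_type) > 5;
```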
5. User real-time query analysis
Query how many times users in a given region browsed the website that day
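Since pre_deal.sh attached a province to each record, the regional query can be sketched as follows (province value and date are illustrative; behavior_type 1 is assumed to mean "browse"):

```sql
SELECT COUNT(*) FROM bigdata_user
WHERE province = 'Shanghai'
  AND visit_date = '2014-12-12'
  AND behavior_type = 1;
```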
6 Data interchange among Hive, MySQL, and HBase
1. Create a temporary table user_action
2. Insert the data in the bigdata_user table into user_action (execution time: about 10 seconds)
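Steps 1 and 2 can be sketched in HiveQL as follows; the column layout mirrors the assumed bigdata_user table above.

```sql
-- Temporary table with the same columns as bigdata_user
CREATE TABLE dblab.user_action(
    id STRING, uid STRING, item_id STRING, behavior_type STRING,
    item_category STRING, visit_date DATE, province STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Copy the analysed data into it (about 10 seconds in the original run)
INSERT OVERWRITE TABLE dblab.user_action
SELECT * FROM dblab.bigdata_user;
```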
3. Log in to MySQL (for importing from Hive to MySQL)
4. Create a database
5. Create a table
Next, create a new table user_action in the MySQL database dblab, and set its encoding to utf-8:
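A minimal sketch of that MySQL table; the column widths are assumptions chosen to fit the tab-separated fields exported from Hive.

```sql
CREATE DATABASE IF NOT EXISTS dblab;
USE dblab;
CREATE TABLE user_action (
    id VARCHAR(50), uid VARCHAR(50), item_id VARCHAR(50),
    behavior_type VARCHAR(10), item_category VARCHAR(50),
    visit_date DATE, province VARCHAR(20)
) ENGINE = InnoDB DEFAULT CHARSET = utf8;
```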
6. Import data
7. View the user_action table data in MySQL
8. Start HBase (to import from MySQL to HBase)
9. Create the table user_action
10. Import data
11. View the user_action table data in HBase
12. Data preparation (use the HBase Java API to import data from local files into HBase)
13. Write a data importer
14. Export it as a jar package
15. Empty the user_action table
16. Run the program with the hadoop jar command
17. View the user_action table data in HBase
7 Python data visualization analysis
- Analyze consumer behavior towards products
- Analyze the top 10 best-selling items and their sales
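The two analyses above can be sketched in Python as follows. The sample records are made up, and the field order (uid, item_id, behavior_type, item_category) and the meaning of behavior_type 4 as "purchase" are assumptions; the real report presumably reads the exported user_action data. Only the standard library is used here so the sketch stays self-contained; the resulting counts are what the report would feed into a plotting library.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for a few exported user_action records:
# uid, item_id, behavior_type (1=view ... 4=purchase, assumed), item_category
SAMPLE = """\
10001082,285259775,1,4076
10001082,4368907,4,5503
10011993,4368907,4,5503
10011993,1715,1,2355
"""

def behavior_counts(rows):
    """Count how often each behavior type occurs (consumer behavior analysis)."""
    return Counter(row[2] for row in rows)

def top_selling(rows, n=10):
    """Top-n items by number of purchase records (behavior_type == '4')."""
    purchases = Counter(row[1] for row in rows if row[2] == "4")
    return purchases.most_common(n)

rows = list(csv.reader(io.StringIO(SAMPLE)))
print(behavior_counts(rows))   # counts per behavior type
print(top_selling(rows))       # best-selling items with their sales counts
# A bar chart of either result could then be drawn, e.g. with
# matplotlib's plt.bar(), which is the kind of figure the report shows.
```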
8 References:
Big Data Basic Programming, Experiments, and Case Tutorials
Lin Ziyu's blog
Principles of NoSQL Databases
Principles and Applications of Big Data Technology
9 Implementation environment (system environment and development software used):
Linux: Ubuntu (in VMware Workstation Pro)
10 Summary and experience (problems encountered and personal takeaways from completing the project):
The document contains the relevant code, data, and documents.
Follow the official account Timewood and reply "Non-relational database training" to obtain them.
For more university course experiments and training materials, follow the account and reply with the related keywords.
My knowledge is limited; if there are mistakes, please point them out.