Offline and real-time consumer product transaction behavior analysis on the Hadoop ecosystem (consumption behavior analysis, purchase preference analysis)

Background of the project

This is a comprehensive practice project for a big data program. It uses a public dataset from Ali Tianchi; download link: consumer product transaction survey list.
The dataset is a sample set with more than 5,000 records in total; each record represents one consumer's product transaction survey information. The fields are described below:

Consumer name: the consumer's name.
Age: the consumer's age.
Gender: the consumer's gender.
Monthly salary: the consumer's monthly salary.
Consumption preference: the type of preference the consumer has when buying goods, e.g. cost-effectiveness, functionality, fashion trends, environmental sustainability.
Consumption area: the category of goods the consumer buys, e.g. household goods, auto parts, jewelry, beauty and skincare.
Shopping platform: the platform the consumer usually shops on, e.g. Tmall (天猫), Suning (苏宁易购), Taobao (淘宝), Pinduoduo (拼多多).
Payment method: the payment method used when shopping, e.g. WeChat Pay, cash on delivery, Alipay, credit card.
Items per purchase: the number of items the consumer buys in a single purchase.
Coupon acquisition: whether the consumer obtained coupons during shopping, e.g. discounts, free gifts.
Shopping motivation: the consumer's motivation for shopping, e.g. brand loyalty, daily use, gift giving, product recommendation.

By analyzing and visualizing this dataset, we can understand consumers' shopping preferences, consumption habits and shopping motivations, providing a reference for enterprises when formulating marketing strategies and positioning products.

1. Description of project environment

Linux Ubuntu 16.04
jdk-7u75-linux-x64
eclipse-java-juno-SR2-linux-gtk-x86_64
Flume 1.5.0-cdh5.4.5
Sqoop 1.4.5-cdh5.4.5
Hive 1.1.0-cdh5.4.5
Spark 1.6.0
Scala 2.10.5
Kafka 0.8.2
MySQL Ver 14.14 Distrib 5.7.24, for Linux (x86_64)

2. MapReduce data cleaning

1. Download the dataset and move to the directory

Open the terminal, create a directory, create a new file

mkdir /data/shiyan1
gedit /data/shiyan1/shujuji

Remove the header line from the downloaded content and write the rest into the shujuji file (this could also be done later in the MapReduce program, but here we strip the first line up front), for example as in the sketch below.
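A minimal sketch, assuming the downloaded CSV was saved as /data/shiyan1/download.csv (a hypothetical path; adjust it to wherever you saved the file):

# drop the first (header) line and keep the rest as the dataset file
tail -n +2 /data/shiyan1/download.csv > /data/shiyan1/shujuji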

2. Upload the dataset to the hadoop cluster

hadoop fs -mkdir /shiyan1/origindata/
hadoop fs -put /data/shiyan1/shujuji  /shiyan1/origindata/
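Optionally, a quick sanity check that the file landed in HDFS:

hadoop fs -ls /shiyan1/origindata/
hadoop fs -cat /shiyan1/origindata/shujuji | head -n 3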

3. Write a MapReduce job for simple data cleaning (drop a few irrelevant columns)

First create a directory where the cleaned files will be kept

hadoop fs -mkdir /shiyan1/cleandata/

Create a new project in Eclipse, create a folder named lib, and import the jar packages required by the project. Specifically: select all the jar packages (hold Shift to multi-select), right-click and choose Add to Build Path. (The project jar packages are available in the resources on my blog homepage; download them yourself.)
Create a new class named Clean and write the following code

package my.clean;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Clean {

	public static class doMapper extends Mapper<Object, Text, Text, Text> {

		@Override
		protected void map(Object key, Text value, Context context)
				throws IOException, InterruptedException {
			String[] arr = value.toString().split(",");
			// Skip the header or malformed rows (fewer than 11 comma-separated fields).
			if (arr.length < 11) {
				return;
			}
			// Keep age, sex, salary, consumption preference, consumption area,
			// coupon acquisition and shopping motivation; drop the other columns.
			StringBuilder one = new StringBuilder();
			one.append(arr[1]);
			one.append("\t");
			one.append(arr[2]);
			one.append("\t");
			one.append(arr[3]);
			one.append("\t");
			one.append(arr[4]);
			one.append("\t");
			one.append(arr[5]);
			one.append("\t");
			one.append(arr[9]);
			one.append("\t");
			one.append(arr[10]);
			context.write(new Text(one.toString()), new Text(""));
		}
	}

	public static void main(String[] args) throws IOException,
			ClassNotFoundException, InterruptedException {
		Job job = Job.getInstance();
		job.setJobName("Clean");
		job.setJarByClass(Clean.class);

		job.setMapperClass(doMapper.class);
		// No reducer class is set; the default identity reducer simply passes the cleaned lines through.
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		Path in = new Path("hdfs://localhost:9000/shiyan1/origindata");
		Path out = new Path("hdfs://localhost:9000/shiyan1/cleandata");
		FileInputFormat.addInputPath(job, in);
		FileOutputFormat.setOutputPath(job, out);
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

}

4. Run the program and check the results

Right-click the project and choose Run on Hadoop (make sure the Hadoop services are started first).
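Alternatively, a hedged sketch of a command-line submission, assuming the project has been exported from Eclipse as a runnable jar (the name clean.jar is hypothetical):

# the main class is my.clean.Clean from the code above; the input/output paths are hard-coded there
hadoop jar /data/shiyan1/clean.jar my.clean.Clean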

Open the terminal and run the following command to view the result and download the cleaned data to the local file system

hadoop fs -cat /shiyan1/cleandata/part-r-00000 >> /data/shiyan1/cleandata
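A quick sanity check on the local copy:

# count the cleaned rows and peek at the first few records
wc -l /data/shiyan1/cleandata
head -n 5 /data/shiyan1/cleandata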


3. Offline data analysis with Hive

1. Run hive to enter the interactive command line, then create the database and table (an internal/managed table by default; the default path is /user/hive/warehouse/)

create database behavior;
use behavior;
create table xiaofei (age int, sex string, salary int, consumelike string, consumearea string, coupon string, shoppurpose string) row format delimited fields terminated by '\t';

2. Load the data from the local file system

load data local inpath '/data/shiyan1/cleandata' into table xiaofei;

In Hive, run a query to verify that the data was imported successfully. If no rows come back, check whether a step above was missed.

select * from xiaofei limit 10;

3. Write SQL queries for data analysis

Requirement 1:
From the consumer goods shopping survey list, count the number and proportion of middle-aged and older shoppers (35 and above) versus young shoppers (under 35):

select age,count(*) as num from(
select case when age>=35 then 1 when age<35 then 0 end as age from xiaofei
) t
group by age

(The resulting ratio is approximately 1:2.)
Requirement 2: For each age, count the combinations of consumption preference and shopping motivation, find the most frequent combination, and see what consumers of different ages are pursuing

select age,consumelike,shoppurpose,nums from (
select * ,row_number()over(partition by age order by nums desc) as rank from (
select age,consumelike,shoppurpose,nums from (
select age,consumelike,shoppurpose ,count(*) as nums from xiaofei
group by age,consumelike,shoppurpose
) t
where nums>=2
) p
) m
where rank = 1

(The statistics show that some age groups have quite concentrated consumption pursuits while others are much broader, and the pursuits differ considerably across age groups.)

Requirement 3: Count how much attention shoppers of each gender pay to coupons when shopping (take the top three for each gender):

select sex ,coupon from (
select *, row_number()over(partition by sex order by num desc) as rank from (
select sex,coupon,count(*) as num from xiaofei
group by sex,coupon
) as t
) as p
where rank <=3

(Results: free gifts are very tempting regardless of gender. Beyond that, women tend to use coupons when shopping, while men buy when the urge strikes and are not particularly keen on using coupons.)

4. Write the query results into new Hive tables for the subsequent Sqoop export

Three new tables are created here to hold the results of the queries above

create table agecount(age int, num int) row format delimited fields terminated by '\t';
create table agelike(age int, consumelike string, shoppurpose string, num int) row format delimited fields terminated by '\t';
create table sexcoupon(sex string, coupon string) row format delimited fields terminated by '\t';

Then prefix each query with insert into table <table name>. Taking the first requirement as an example (the other two follow the same pattern; see the sketch after this example):

insert into table agecount select age,count(*) as num from(
select case when age>=35 then 1 when age<35 then 0 end as age from xiaofei
) t
group by age
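For completeness, a hedged sketch of the remaining two inserts, run here non-interactively by feeding the Hive CLI a heredoc (the statements simply prefix the requirement 2 and requirement 3 queries with insert into table):

hive <<'EOF'
use behavior;

-- requirement 2 result -> agelike
insert into table agelike
select age, consumelike, shoppurpose, nums from (
  select *, row_number() over (partition by age order by nums desc) as rank from (
    select age, consumelike, shoppurpose, count(*) as nums from xiaofei
    group by age, consumelike, shoppurpose
  ) t
  where nums >= 2
) m
where rank = 1;

-- requirement 3 result -> sexcoupon
insert into table sexcoupon
select sex, coupon from (
  select *, row_number() over (partition by sex order by num desc) as rank from (
    select sex, coupon, count(*) as num from xiaofei
    group by sex, coupon
  ) t
) p
where rank <= 3;
EOF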

4. Export data from Hive to MySQL with Sqoop

1. Install MySQL and make sure the service has been started; the start command is:

sudo service mysql start

2. Enter the MySQL client and create the corresponding database and tables

Enter your username and password here

mysql -u root -p

Create the database and the tables under it

CREATE DATABASE IF NOT EXISTS  behavior DEFAULT CHARSET utf8 COLLATE utf8_general_ci;  
use behavior;
create table agecount (age int, num int);  
create table agelike (age int, consumelike varchar(200), shoppurpose varchar(200), num int);  
create table sexcoupon (sex varchar(20), coupon varchar(200));

3. Export the data with Sqoop

sqoop export --connect jdbc:mysql://localhost:3306/behavior?characterEncoding=UTF-8 --username root --password strongs --table agecount --export-dir /user/hive/warehouse/behavior.db/agecount/000000_0 --input-fields-terminated-by '\t'

sqoop export --connect jdbc:mysql://localhost:3306/behavior?characterEncoding=UTF-8 --username root --password strongs --table agelike --export-dir /user/hive/warehouse/behavior.db/agelike/000000_0 --input-fields-terminated-by '\t'

sqoop export --connect jdbc:mysql://localhost:3306/behavior?characterEncoding=UTF-8 --username root --password strongs --table sexcoupon --export-dir /user/hive/warehouse/behavior.db/sexcoupon/000000_0 --input-fields-terminated-by '\t'

4. Run a query to check whether the data has arrived in the MySQL tables

select * from sexcoupon;  

5. Spark Streaming real-time analysis

A brief explanation: ideally this project would use a crawler to collect real-time data from a website and then analyze comment-dense time periods, comment content, and so on. However, since the dataset was downloaded directly, real-time crawling is not practical here, and hunting down similar usable data and filtering it would be a distraction from the focus of the project. So a shell script is used to simulate the generation of real-time data.

1. Create the project directories

mkdir -p /data/shiyan1/realtime/datasource
mkdir -p /data/shiyan1/realtime/datarandom
mkdir -p /data/shiyan1/realtime/shellrealtime

2. Write a shell script

First open the file for editing; if the gedit command is not recognized, try vim or vi instead

gedit /data/shiyan1/realtime/shellrealtime/time.sh

write the following

#!/bin/bash
# Every 10 seconds, copy the next 5 lines of the source file into a new
# file under datarandom, simulating the arrival of real-time data.
file_count=1
while true; do
	for i in {1..5}; do
		if read -r line; then
			echo "$line" >> /data/shiyan1/realtime/datarandom/file_${file_count}.txt
		else
			break 2   # stop when the source file is exhausted
		fi
	done
	((file_count++))
	sleep 10
done < /data/shiyan1/realtime/datasource/source
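To run the simulation, one possible setup (a sketch assuming the cleaned data from section 2 is reused as the source stream; any line-oriented file works):

# use the cleaned data as the simulated source read by time.sh
cp /data/shiyan1/cleandata /data/shiyan1/realtime/datasource/source

# make the script executable and start it in the background
chmod +x /data/shiyan1/realtime/shellrealtime/time.sh
/data/shiyan1/realtime/shellrealtime/time.sh &

# after about 10 seconds, file_1.txt, file_2.txt, ... should start appearing here
ls /data/shiyan1/realtime/datarandom/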

3. Start the Flume agent and check that it detects newly generated files

Configure the Flume conf file below to test whether Flume works normally

gedit spooldir_mem_logger.conf  

Add the following Flume configuration to the file, then save and exit. It monitors the /data/shiyan1/realtime/datarandom directory and writes the files it reads to the console.

agent1.sources=src  
agent1.channels=ch  
agent1.sinks=des  
  
agent1.sources.src.type = spooldir  
agent1.sources.src.restart = true  
agent1.sources.src.spoolDir =/data/shiyan1/realtime/datarandom
  
agent1.channels.ch.type=memory  
  
agent1.sinks.des.type = logger  
  
agent1.sources.src.channels=ch  
agent1.sinks.des.channel=ch  

After configuring spooldir_mem_logger.conf, switch to the Flume installation directory and start Flume. (Note: /data/edu6/ is the directory where my Flume conf file is kept.)

cd /apps/flume  

flume-ng agent -c /data/edu6/ -f /data/edu6/spooldir_mem_logger.conf -n agent1 -Dflume.root.logger=DEBUG,console

Run the time.sh script written above and observe how the files in the directory change.
If detection succeeds, you will see that a .COMPLETED suffix has been appended to each file name.
Then delete all files in the /data/shiyan1/realtime/datarandom directory so they do not interfere with Flume's monitoring later on.
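Note that the logger sink above only prints events to the console, which is fine for this test; for the end-to-end run in step 7, Flume has to deliver the events to Kafka instead. A hedged sketch of such a configuration, assuming your Flume build ships the Kafka sink (Apache Flume 1.6+ does, and CDH builds backport it; the property names below follow that sink) and using a file name of my own, spooldir_mem_kafka.conf:

cat > /data/edu6/spooldir_mem_kafka.conf <<'EOF'
agent1.sources=src
agent1.channels=ch
agent1.sinks=des

agent1.sources.src.type = spooldir
agent1.sources.src.spoolDir = /data/shiyan1/realtime/datarandom

agent1.channels.ch.type = memory

# Kafka sink: each line read from the spooled files becomes one message on the topic
agent1.sinks.des.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.des.brokerList = localhost:9092
agent1.sinks.des.topic = flumesendkafka

agent1.sources.src.channels=ch
agent1.sinks.des.channel=ch
EOF

Starting flume-ng with -f pointing at this file (instead of spooldir_mem_logger.conf) would then feed the simulated data into the flumesendkafka topic that the Spark Streaming job below consumes.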

4. Start the Kafka service and check producer/consumer connectivity

Preliminary steps: start the Hadoop and ZooKeeper services, then enter the Kafka installation directory and start the Kafka server.

cd /apps/kafka  
bin/kafka-server-start.sh config/server.properties 
1. After the Kafka service is started, this window blocks, so open another terminal emulator for the next operations.

Create a topic named flumesendkafka.

bin/kafka-topics.sh \  
--create \  
--zookeeper localhost:2181 \  
--replication-factor 1 \  
--topic flumesendkafka \  
--partitions 1  

Check which topics are in the current Kafka

/apps/kafka/bin/kafka-topics.sh  --list  --zookeeper  localhost:2181 
2. Call kafka-console-producer.sh in the /apps/kafka/bin directory to act as the producer and produce some messages
cd /apps/kafka  
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic flumesendkafka  

Here localhost is the Kafka host and 9092 is the broker port. Text entered on this console is handed to the producer and delivered to the consumers.

3. Open another window and call kafka-console-consumer.sh in the bin directory to start a consumer and consume the data.
cd /apps/kafka  
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic flumesendkafka --from-beginning

kafka-console-consumer.sh needs a few parameters, such as the ZooKeeper host and port, the topic name, and where to start reading.

4. Test

In the window running kafka-console-producer.sh, type a few lines of text and press Enter. The same content should appear on the consumer side.

5. Write the Spark Streaming program

Create a new Scala project (the jar packages used by the project can be downloaded from my blog resources).
The program counts how many new records arrive in each time window and performs the calculation in real time.

package my.streaming

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

import java.sql.{Connection, DriverManager, PreparedStatement}
import java.text.SimpleDateFormat
import java.util.Date

object JianKong {

  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("jiankong").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(4))
    ssc.checkpoint("checkpoint")
    val topics = Set("flumesendkafka")
    val brokers = "localhost:9092"

    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers,
      "serializer.class" -> "kafka.serializer.StringEncoder")

    // Each Kafka message is one line of the simulated data; keep only the message value.
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics).map(_._2)

    // For every 4-second batch, count the non-empty records and write the count to MySQL.
    lines.foreachRDD(rdd => {
      val strs = rdd.collect()
      println(strs.size)

      var finalNum = 0
      for (str <- strs) {
        println("finalNum : " + finalNum + "#" + str)
        if (!str.isEmpty) {
          finalNum = finalNum + 1 // one new record seen in this batch
        }
      }
      println("finalNum: " + finalNum)

      val now: Date = new Date()
      val dateFormat: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
      val creationtime = dateFormat.format(now)

      val db_host = "localhost"
      val db_name = "realtimebase"
      val db_user = "root"
      val db_passwd = "strongs"
      val db_connection_str = "jdbc:mysql://" + db_host + ":3306/" + db_name +
        "?user=" + db_user + "&password=" + db_passwd
      var conn: Connection = null
      var ps: PreparedStatement = null
      val sql = "insert into jiankong (creationtime, num) values (?, ?)"
      try {
        conn = DriverManager.getConnection(db_connection_str)
        ps = conn.prepareStatement(sql)
        ps.setString(1, creationtime)
        ps.setInt(2, finalNum)
        ps.executeUpdate()
      } catch {
        case e: Exception => println("MySQL Exception: " + e.getMessage)
      } finally {
        if (ps != null) {
          ps.close()
        }
        if (conn != null) {
          conn.close()
        }
      }

    })

    ssc.start()
    ssc.awaitTermination()
  }

}

6. Start the MySQL service and create a table

sudo service mysql start  
mysql -u root -p  
CREATE DATABASE IF NOT EXISTS realtimebase DEFAULT CHARSET utf8 COLLATE utf8_general_ci;  
use realtimebase;
create table jiankong (creationtime datetime, num int);  

In this way, the Spark Streaming program writes its results into the MySQL database, and at the end we can check how many records arrived in each time period.

7. Start the real-time processing program in order

1. Start kafka-server
cd /apps/kafka  
bin/kafka-server-start.sh config/server.properties
2. Start the JianKong.scala program of spark streaming
3. Open another terminal emulator and start flume
cd /apps/flume  
flume-ng agent -c /data/edu6/ -f /data/edu6/spooldir_mem_logger.conf -n agent1 -Dflume.root.logger=DEBUG,console  
4. Start the simulated crawler program
/data/shiyan1/realtime/shellrealtime/time.sh
5. Check MySQL and confirm that the corresponding statistics are there (for example with the one-liner below)
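For example, a quick check from the shell (enter the MySQL password when prompted):

# latest per-batch counts written by the streaming job
mysql -u root -p -e "select * from realtimebase.jiankong order by creationtime desc limit 10;"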

The corresponding records appear in the MySQL table, which completes the real-time processing.
