E-commerce data warehouse analysis project

This project consists of two parts that are built separately: an e-commerce data statistics module and a business-data collection / data warehouse module. Hive is used to compute the popular products in each region, and an offline data warehouse is built on top of the business data. The connection between the two parts will be supplemented later.

1. E-commerce popular product statistics

(1) Project introduction

Perform big data analysis on an ordinary e-commerce website, compute statistics on the popular products in each region, and use the results to support business decision-making.

Project pipeline and frameworks: Python --> Flume --> HDFS --> MapReduce/Spark ETL --> HDFS --> Hive --> Sqoop --> MySQL

(2) Demand analysis

  1. How to define popular products?

    • Simple model: measure a product's popularity directly by the number of user clicks on it.

    • Complex model: weight the different behaviors by category (to be supplemented later)

  2. How to determine the user's region?

    • Extract the access IP from the user click log, then resolve the IP to region information.

    • Get the user's geographic information through the order-related user table in the database

  3. How to filter out crawler traffic (some merchants use crawlers to visit their own stores frequently in order to boost their ranking)?

    • Analyze the number of visits per user IP over a period of time; see the sketch below.
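
A minimal Hive sketch of this idea (not part of the original project code), assuming the user_click_log table described later in section (4) and a click_time format of yyyyMMddHHmmss; the threshold of 1000 clicks per hour is an arbitrary assumption:

-- count clicks per IP per hour and flag IPs above the threshold
select user_ip,
       substr(click_time, 1, 10) as hour_bucket,   -- yyyyMMddHH
       count(*) as clicks
from user_click_log
group by user_ip, substr(click_time, 1, 10)
having count(*) > 1000;   -- threshold is an assumption; flagged IPs can be excluded downstream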

(3) Technical plan

  1. Data Collection (ETL)

    • E-commerce logs are usually stored on a log server and pulled to HDFS through Flume. In this article, the log data is simulated with a Python program.

    • The business data is read from the MySQL relational database through Sqoop and imported into HDFS.

    Note that pulling data this way puts significant load on the database, and in a real production environment you usually do not have permission to query the database directly. An alternative is to export the data as CSV files, place them on the log server, and collect them into HDFS through Flume. If you do have database access, the database should also be configured with read/write splitting to relieve the pressure.

  2. Data cleaning

    • Use MapReduce for data cleaning.

    • Use Spark Core for data cleaning.

  3. Calculation of popular products in various regions

    • Use Hive for data analysis and processing.

    • Use Spark SQL for data analysis and processing

(4) Experimental data and description

  1. product table:

    Column name  | Description       | Data type   | Constraint
    product_id   | Product ID        | varchar(18) | Not null
    product_name | Product name      | varchar(20) | Not null
    marque       | Product model     | varchar(10) | Not null
    barcode      | Warehouse barcode | varchar     | Not null
    price        | Product price     | double      | Not null
    brand_id     | Product brand     | varchar(8)  | Not null
    market_price | Market price      | double      | Not null
    stock        | Stock             | int         | Not null
    status       | Status            | int         | Not null

    Supplementary note: status: -1 = off shelf, 0 = on shelf, 1 = pre-sale

  2. area_info (area information) table

    Column name | Description | Data type   | Constraint
    area_id     | Area code   | varchar(18) | Not null
    area_name   | Area name   | varchar(20) | Not null
  3. user_click_log (user click information) table

    Column name | Description             | Data type    | Constraint
    user_id     | User ID                 | varchar(18)  | Not null
    user_ip     | User IP                 | varchar(20)  | Not null
    url         | URL clicked by the user | varchar(200) |
    click_time  | Click time              | varchar(40)  |
    action_type | Action type             | varchar(40)  |
    area_id     | Area ID                 | varchar(40)  |

    Supplementary note: action_type: 1 = add to favorites, 2 = add to cart, 3 = purchase; area_id has already been resolved from the IP address

  4. area_hot_product (regional hot products) table

    Column name  | Description  | Data type    | Constraint
    area_id      | Area ID      | varchar(18)  | Not null
    area_name    | Area name    | varchar(20)  | Not null
    product_id   | Product ID   | varchar(200) |
    product_name | Product name | varchar(40)  |
    pv           | Page views   | bigint       |

(5) Technical implementation

 Use Flume to collect user click logs

  1. Flume configuration file (flume-areahot.conf)

    • Start the Flume agent: from the Flume root directory, run bin/flume-ng agent -n a4 -f flume-areahot.conf -c conf -Dflume.root.logger=INFO,console

    • Then run python dslog.py to write the simulated user log files into the /log0208 directory (implementation shown below).

    • Flume collects the files in the /log0208 directory into hdfs://master:9000/flume/<today's date>.

  2. The Python script dslog.py generates the simulated logs in the /log0208 folder and deliberately inserts records with missing or invalid fields, which must later be cleaned with MapReduce or Spark.

#coding=utf-8
import random
import time
iplist=[26,23,47,56,108,10,33,48,66,77,101,45,61,52,88,89,108,191,65,177,98,21,34,61,19,11,112,114]

url = "http://mystore.jsp/?productid={query}"
x=[1,2,3,4]

def use_id():
    return random.randint(1,20)
def get_ip():
    return '.'.join(str(x) for x in random.sample(iplist,4))

def urllist():
    # about 20% of records get an empty URL (dirty data to be cleaned later)
    if random.uniform(0,1)>0.8:
        return ""

    query_str=random.sample(x,1)
    return url.format(query=query_str[0])

def get_time():
    return time.strftime('%Y%m%d%H%M%S',time.localtime())

#  action: 1 add to favorites, 2 add to cart, 3 purchase; area_id represents different regions
def action():
    return random.randint(1,3)

def area_id():
    return random.randint(1,21)


def get_log(count):
    while count>0:
        log='{},{},{},{},{},{}\n'.format(use_id(),get_ip(),urllist(),get_time(),action(),area_id())
        # with open('/usr/local/src/tmp/1.log','a+')as file:
        with open('/log0208/click.log','a+')as file:
            file.write(log)
        # print(log)
        # time.sleep(1)
        count=count-1
if __name__ == '__main__':
    get_log(10000)

Excerpt of the generated log output:

5,10.26.56.45,http://mystore.jsp/?productid=1,20210222005139,1,19
2,10.101.98.47,http://mystore.jsp/?productid=1,20210222005139,3,8
17,191.88.66.108,http://mystore.jsp/?productid=3,20210222005139,2,14
4,89.21.33.108,,20210222005139,2,10
4,108.23.48.114,http://mystore.jsp/?productid=4,20210222005139,1,21
8,21.48.19.65,,20210222005139,1,3
16,61.21.89.11,http://mystore.jsp/?productid=2,20210222005139,3,11
6,56.47.112.88,,20210222005139,1,3

flume-areahot.conf

#bin/flume-ng agent -n a4 -f myagent/a4.conf -c conf -Dflume.root.logger=INFO,console
#define the agent name and the names of the source, channel and sink
a4.sources = r1
a4.channels = c1
a4.sinks = k1

#configure the source
a4.sources.r1.type = spooldir
a4.sources.r1.spoolDir = /log0208

#configure the channel
a4.channels.c1.type = memory
a4.channels.c1.capacity = 10000
a4.channels.c1.transactionCapacity = 100 

#define an interceptor that adds a timestamp to each event
a4.sources.r1.interceptors = i1
a4.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

#configure the sink
a4.sinks.k1.type = hdfs
a4.sinks.k1.hdfs.path = hdfs://master:9000/flume/%Y%m%d
a4.sinks.k1.hdfs.filePrefix = events-
a4.sinks.k1.hdfs.fileType = DataStream

#do not roll files based on the number of events
a4.sinks.k1.hdfs.rollCount = 0 
#roll a new file when the HDFS file reaches 128 MB
a4.sinks.k1.hdfs.rollSize = 134217728
#roll a new file every 60 seconds
a4.sinks.k1.hdfs.rollInterval = 60

#wire the source, channel and sink together
a4.sources.r1.channels = c1
a4.sinks.k1.channel = c1

  3. Data cleaning

  • Identify the user's product clicks in the click log

  • Filter out records that do not have exactly 6 fields

  • Filter out records whose URL is empty, i.e. keep only log records whose URL starts with http

 Implementation method 1: Use a MapReduce program to clean the data

1. CleanDataMain.java and CleanDataMapper.java code implementation:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CleanDataMain {

	public static void main(String[] args) throws Exception {
		//1. create the job
		Job job = Job.getInstance(new Configuration());
		job.setJarByClass(CleanDataMain.class);
		
		//2. set the mapper class and its output types
		job.setMapperClass(CleanDataMapper.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(NullWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);
		
		//4. set the input and output paths
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		//5. run the job
		job.waitForCompletion(true);
	}

}
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
 * Filter out records that do not have exactly 6 fields.
 * Filter out records whose URL is empty, i.e. keep only records whose URL starts with http.
 */
public class CleanDataMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

	@Override
	protected void map(LongWritable key1, Text value1, Context context)
			throws IOException, InterruptedException {
		String log = value1.toString();
		
		//split the record into fields
		String[] words = log.split(",");
		
		if(words.length == 6 && words[2].startsWith("http")){
			context.write(value1, NullWritable.get());
		}
	}

}

2. Use mvn clean and mvn install to build the jar package, then submit it to YARN by running the script run.sh; the input path is the directory collected by Flume.

HADOOP_CMD="/usr/local/src/hadoop-2.6.5/bin/hadoop"

OUTPUT_PATH="/output/210219"

$HADOOP_CMD fs -rmr -skipTrash $OUTPUT_PATH

$HADOOP_CMD jar /ds/MyMapReduceProject-0.0.1-SNAPSHOT.jar mapreduce.clean.CleanDataMain /flume/20210219/events-.1613712374044 $OUTPUT_PATH

3. View the results after filtering:

[root@master ds]# hadoop fs -cat /output/210219/part-r-00000
1,201.105.101.102,http://mystore.jsp/?productid=1,2017020020,1,1
1,201.105.101.102,http://mystore.jsp/?productid=1,2017020029,2,1
1,201.105.101.102,http://mystore.jsp/?productid=4,2017020021,3,1
2,201.105.101.103,http://mystore.jsp/?productid=2,2017020022,1,1
3,201.105.101.105,http://mystore.jsp/?productid=3,2017020023,1,2
4,201.105.101.107,http://mystore.jsp/?productid=1,2017020025,1,1

  Implementation method 2: Use a Spark program to clean the data

1. CleanData code implementation:



import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object CleanData {
  def main(args: Array[String]): Unit = {
    // avoid printing too many logs during execution
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
    
    val conf = new SparkConf().setAppName("CleanData")
    val sc = new SparkContext(conf)
    
    // read the raw log data
    val fileRDD = sc.textFile(args(0))
    
    // clean the data: keep only records with exactly 6 fields and a URL starting with http,
    // then rebuild each record as a comma-separated line
    val cleanDataRDD = fileRDD.map(_.split(","))
      .filter(words => words.length == 6 && words(2).startsWith("http"))
      .map(_.mkString(","))
    
    // save the cleaned result to HDFS
    cleanDataRDD.saveAsTextFile(args(1))
    
    // stop the SparkContext
    sc.stop()
    
    println("Finished")
  }
}

    2. Likewise, package it as a jar and submit it to the Spark cluster:

bin/spark-submit \
--class clean.CleanData \
--master spark://master:7077 \
/ds/people-0.0.1-SNAPSHOT.jar \
hdfs://master:9000/flume/210219/events-.1613712374044 \
hdfs://master:9000/testOutput/
  4. Regional popular product statistics: based on Hive and Spark SQL

 Method 1: Use Hive for statistics

# create the area table
create external table area
(area_id string,area_name string)
row format delimited fields terminated by ','
location '/input/hotproject/area';
# create the product table
create external table product
(product_id string,product_name string,
marque string,barcode string, price double,
brand_id string,market_price double,stock int,status int)
row format delimited fields terminated by ','
location '/input/hotproject/product';
# create a temporary table to hold the raw user click logs
create external table clicklogTemp
(user_id string,user_ip string,url string,click_time string,action_type string,area_id string)
row format delimited fields terminated by ','
location '/input/hotproject/cleandata';
# create the user click log table (note: product_id must be parsed out of the temporary table above)
create external table clicklog
(user_id string,user_ip string,product_id string,click_time string,action_type string,area_id string)
row format delimited fields terminated by ','
location '/input/hotproject/clicklog';
# load the data; business data is normally imported from MySQL to HDFS with Sqoop
load data  inpath "/input/data/areainfo.txt" into table area;
load data  inpath "/input/data/productinfo.txt" into table product;
# the logs are collected into HDFS by Flume
load data  inpath "/output/210220/part-r-00000" into table clicklogTemp;
insert into table clicklog
select user_id,user_ip,substring(url,instr(url,"=")+1),
click_time,action_type,area_id from clicklogTemp;
## query product popularity for each region
select a.area_id,b.area_name,a.product_id,c.product_name,count(a.product_id) pv 
from clicklog a join area b on a.area_id = b.area_id join product c on a.product_id = c.product_id
group by a.area_id,b.area_name,a.product_id,c.product_name;

Note: in the example above we create a temporary table and then parse product_id out of it. Alternatively, you can use the Hive function parse_url directly, e.g. parse_url(a.url,'QUERY','productid').

# this way no temporary table is needed to hold the intermediate result; the modified Hive SQL is as follows:
select a.area_id,b.area_name,parse_url(a.url,'QUERY','productid'),
c.product_name,count(parse_url(a.url,'QUERY','productid'))
from clicklogtemp a join area b on a.area_id = b.area_id
join product c on parse_url(a.url,'QUERY','productid') = c.product_id
group by a.area_id,b.area_name,parse_url(a.url,'QUERY','productid'),c.product_name;

Output result; the last column is the PV:

a.area_id  b.area_name   a.product_id  c.product_name  pv
1          beijing       2             nike shoes1     2
1          beijing       3             nike shoes2     1
1          beijing       4             nike shoes4     1
10         heilongjiang  2             nike shoes1     3
11         tianjin       2             nike shoes1     1
11         tianjin       3             nike shoes2     1
11         tianjin       4             nike shoes4     2

The result of the query above can be inserted into a new table with insert into; the Hive analysis result can then be exported to a MySQL database with Sqoop and finally visualized on an e-commerce dashboard page.

insert into table result
select a.area_id,b.area_name,parse_url(a.url,'QUERY','productid'),c.product_name,count(parse_url(a.url,'QUERY','productid'))
from clicklogtemp a join area b on a.area_id = b.area_id
join product c on parse_url(a.url,'QUERY','productid') = c.product_id
group by a.area_id,b.area_name,parse_url(a.url,'QUERY','productid'),c.product_name;

Note on loading data: avoid physically importing data when you can (use external tables instead). The output shown will differ because the generated logs differ.
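
For reference, the MySQL target table for the Sqoop export could follow the area_hot_product layout from section (4); this DDL is only a sketch, and the Sqoop export command itself is omitted here:

CREATE TABLE area_hot_product (
  area_id      VARCHAR(18) NOT NULL,
  area_name    VARCHAR(20) NOT NULL,
  product_id   VARCHAR(200),
  product_name VARCHAR(40),
  pv           BIGINT
);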

Method 2: Use Spark SQL to perform statistics

  1. HotProduct.scala code implementation

package com.hot

import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession

// area table
case class AreaInfo(area_id:String,area_name:String)

// product table; columns that are not needed should not be loaded
case class ProductInfo(product_id:String,product_name:String,marque:String,barcode:String,price:Double,brand_id:String,market_price:Double,stock:Int,status:Int)

// cleaned user click log records
case class LogInfo(user_id:String,user_ip:String,product_id:String,click_time:String,action_type:String,area_id:String)

object HotProduct {
  def main(args:Array[String]):Unit={
    // avoid printing too many logs
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)

    val spark=SparkSession.builder().master("local").appName("").getOrCreate()
//    val spark=SparkSession.builder().appName("").getOrCreate()
    import spark.sqlContext.implicits._

    // load the area data
    val areaDF = spark.sparkContext.textFile("hdfs://master:9000/input/data/areainfo1.txt")
      .map(_.split(",")).map(x=> AreaInfo(x(0),x(1))).toDF()
    areaDF.createTempView("area")

    // load the product data
    val productDF = spark.sparkContext.textFile("hdfs://master:9000/input/data/productinfo.txt")
      .map(_.split(",")).map(x=>  ProductInfo(x(0),x(1),x(2),x(3),x(4).toDouble,x(5),x(6).toDouble,x(7).toInt,x(8).toInt))
      .toDF()
    productDF.createTempView("product")

    // load the click logs
    val clickLogDF = spark.sparkContext.textFile("hdfs://master:9000/output/210220/part-r-00000")
      .map(_.split(",")).map(x =>  LogInfo(x(0),x(1),x(2).substring(x(2).indexOf("=")+1),x(3),x(4),x(5)))
      .toDF()
    clickLogDF.createTempView("clicklog")

    // run the SQL
    // analyze product popularity per region with Spark SQL and print the result to the console
    val sql = "select a.area_id,a.area_name,p.product_id,product_name,count(c.product_id) from area a,product p,clicklog c where a.area_id=c.area_id and p.product_id=c.product_id group by a.area_id,a.area_name,p.product_id,p.product_name"

    spark.sql(sql).show()

//    var sql1 = " select concat(a.area_id,',',a.area_name,',',p.product_id,',',p.product_name,',',count(c.product_id)) "
//    sql1 = sql1 + " from area a,product p,clicklog c "
//    sql1 = sql1 + " where a.area_id=c.area_id and p.product_id=c.product_id "
//    sql1 = sql1 + " group by a.area_id,a.area_name,p.product_id,product_name "
//    spark.sql(sql1).repartition(1).write.text(args(3))

    spark.stop()

  }
}

  2. Package with Maven and submit to the Spark cluster:

spark-submit --class com.hot.HotProduct --master spark://master:7077 hotspark-1.0-SNAPSHOT.jar
#hdfs://master:9000/input/hotproject/area/areainfo.txt \
#hdfs://master:9000/input/hotproject/product/productinfo.txt \
#hdfs://master:9000/output/210219/part-r-00000 hdfs://master:9000/output/analysis

+-------+---------+----------+------------+-----------------+
|area_id|area_name|product_id|product_name|count(product_id)|
+-------+---------+----------+------------+-----------------+
|      7|    hubei|         3| nike shoes2|                1|
|     15|  guizhou|         3| nike shoes2|                2|
|     11|  tianjin|         3| nike shoes2|                1|
|      3| shanghai|         3| nike shoes2|                1|
|      8| zhejiang|         3| nike shoes2|                2|
|      5| shenzhen|         3| nike shoes2|                2|
|     17|   fujian|         3| nike shoes2|                1|
|     19|    anhui|         3| nike shoes2|                3|
|      9|     jili|         3| nike shoes2|                1|
|      1|  beijing|         3| nike shoes2|                1|
|     20|    henan|         3| nike shoes2|                4|
|      4| hangzhou|         3| nike shoes2|                1|
|     13|    hebei|         3| nike shoes2|                3|
|     15|  guizhou|         1|  nike shoes|                1|
|      3| shanghai|         1|  nike shoes|                1|
|      8| zhejiang|         1|  nike shoes|                1|
|     18|neimenggu|         1|  nike shoes|                2|
|     17|   fujian|         1|  nike shoes|                2|
|     19|    anhui|         1|  nike shoes|                2|
|      9|     jili|         1|  nike shoes|                2|
+-------+---------+----------+------------+-----------------+
 

2. E-commerce data warehouse collection

Simulated construction of the business data warehouse:

Business data is imported from the MySQL database into HDFS with Sqoop and then loaded into the Hive data warehouse. Under the hood, Sqoop runs map-only MapReduce jobs:

import: moves data from the relational database into the data warehouse, using a custom InputFormat;

export: moves data from the data warehouse back into the relational database, using a custom OutputFormat.

Sqoop is used to import eight tables from MySQL into the ODS (raw data) layer of the warehouse: full imports take all rows unconditionally, incremental imports select by creation time, and incremental + change imports select by creation time or operation time, as sketched below.
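
The three import strategies roughly correspond to the following selection queries against MySQL (a sketch only; column names such as create_time and operate_time and the date value are assumptions):

-- full import (small dimension tables such as user_info and the category tables)
select * from user_info;

-- incremental import (e.g. order_detail, payment_info): rows created on the target day
select * from order_detail
where date(create_time) = '2021-02-20';

-- incremental + change import (e.g. order_info): rows created or modified on the target day
select * from order_info
where date(create_time) = '2021-02-20'
   or date(operate_time) = '2021-02-20';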

Table Structure:

3. Offline data warehouse construction

1. origin_data (raw data)

sku_info: product table (daily full import)

user_info: user table (daily full import)

base_category1: product level-1 category table (daily full import)

base_category2: product level-2 category table (daily full import)

base_category3: product level-3 category table (daily full import)

order_detail: order detail table (daily incremental import)

payment_info: payment flow table (daily incremental import)

order_info: order table (daily incremental + change import)

2. ODS layer

(eight tables; table names and fields are exactly the same as in MySQL)

Load the data from origin_data into the ODS layer, prefixing each original table name with ods_, for example as sketched below.
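
As an illustration only, an ODS table and its daily load might look like the following (the columns, the dt partition, and the HDFS paths are assumptions, not the project's actual definitions):

create external table ods_order_info
(id string, total_amount decimal(10,2), user_id string,
 create_time string, operate_time string)
partitioned by (dt string)
row format delimited fields terminated by '\t'
location '/warehouse/ods/ods_order_info/';

load data inpath '/origin_data/db/order_info/2021-02-20'
into table ods_order_info partition (dt='2021-02-20');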

3. DWD layer

Filter out null records from the ODS layer data, and degenerate the product category dimension into the product table (dimension degeneration). The rest of the data is identical to the ODS layer. A sketch of the null filtering step follows.
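
A sketch of the null filtering step for one fact table, assuming the ODS table from the previous sketch (the column names and the dt partition are assumptions):

insert overwrite table dwd_order_info partition (dt='2021-02-20')
select id, total_amount, user_id, create_time, operate_time
from ods_order_info
where dt='2021-02-20' and id is not null;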

Fact tables

1. Order table: dwd_order_info

2. Order detail table: dwd_order_detail

3. Payment flow table: dwd_payment_info

Dimension tables

User table: dwd_user_info

Product table: dwd_sku_info

All other tables keep their fields unchanged; only the product table gains the following columns by joining the three category tables (see the sketch after the field list):

        `category3_id` string COMMENT 'level-3 category id',

        `category2_id` string COMMENT 'level-2 category id',

        `category1_id` string COMMENT 'level-1 category id',

        `category3_name` string COMMENT 'level-3 category name',

        `category2_name` string COMMENT 'level-2 category name',

        `category1_name` string COMMENT 'level-1 category name',
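
A sketch of the dimension degeneration: the three category tables are joined onto the product table so that the category ids and names above become ordinary columns of dwd_sku_info (the column names and join keys are assumptions):

insert overwrite table dwd_sku_info partition (dt='2021-02-20')
select sku.id,
       sku.sku_name,
       c3.id   as category3_id,
       c2.id   as category2_id,
       c1.id   as category1_id,
       c3.name as category3_name,
       c2.name as category2_name,
       c1.name as category1_name
from ods_sku_info sku
join ods_base_category3 c3 on sku.category3_id = c3.id
join ods_base_category2 c2 on c3.category2_id  = c2.id
join ods_base_category1 c1 on c2.category1_id  = c1.id
where sku.dt='2021-02-20';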

4. DWS layer: build wide tables by theme

The number of orders and the total order amount come from the order table dwd_order_info, and the number of payments and the total payment amount come from the payment flow table dwd_payment_info; both are then aggregated by user_id to obtain the per-user details, for example:
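
A possible shape of such a wide table (the table name dws_user_action and all column names are assumptions):

insert overwrite table dws_user_action partition (dt='2021-02-20')
select coalesce(o.user_id, p.user_id)  as user_id,
       coalesce(o.order_count, 0)      as order_count,
       coalesce(o.order_amount, 0)     as order_amount,
       coalesce(p.payment_count, 0)    as payment_count,
       coalesce(p.payment_amount, 0)   as payment_amount
from (select user_id, count(*) as order_count, sum(total_amount) as order_amount
      from dwd_order_info where dt='2021-02-20' group by user_id) o
full outer join
     (select user_id, count(*) as payment_count, sum(payment_amount) as payment_amount
      from dwd_payment_info where dt='2021-02-20' group by user_id) p
on o.user_id = p.user_id;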

5. ADS layer

Product theme table

 

 

Original article: blog.csdn.net/qq_36816848/article/details/113865910