
Real-time analysis
1. Create the topic in Kafka

2. Storm consumes data from Kafka

Import the related jar packages and, following the documentation, write code so that Storm consumes data from Kafka.

Method 1: Develop your own spout and consume data using the API provided by Kafka.

Method 2: Use the Kafka extension package provided by Storm to connect:

String topic = "flux";
BrokerHosts hosts = new ZkHosts("hadoop01,hadoop02,hadoop03:2181");
SpoutConfig spoutConfig = new SpoutConfig(hosts, topic, "/" + topic, UUID.randomUUID().toString());
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout spout = new KafkaSpout(spoutConfig);

3. Process the business logic: data cleaning
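Putting steps 2 and 3 together, the sketch below shows one minimal way to wire the KafkaSpout to a cleaning bolt, assuming the pre-1.0 package names (backtype.storm / storm.kafka; on Storm 1.x substitute org.apache.storm) and a LocalCluster for testing. CleanBolt is a hypothetical bolt, sketched after the field list below.

import java.util.UUID;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class FluxTopology {
    public static void main(String[] args) throws Exception {
        // configure the KafkaSpout as shown above
        String topic = "flux";
        BrokerHosts hosts = new ZkHosts("hadoop01,hadoop02,hadoop03:2181");
        SpoutConfig spoutConfig = new SpoutConfig(hosts, topic, "/" + topic, UUID.randomUUID().toString());
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka_spout", new KafkaSpout(spoutConfig));
        // CleanBolt (hypothetical, sketched below) splits the raw log line into fields
        builder.setBolt("clean_bolt", new CleanBolt()).shuffleGrouping("kafka_spout");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("flux", new Config(), builder.createTopology());
    }
}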
























"url","urlname","uvid","sid","scount","stime","cip" A
pv
Page views - each user visit is one pv, so every record is directly counted as 1 pv.

uv
Number of unique visitors - whether the current uvid appears for the first time in today's data; if so it is recorded as 1, otherwise 0.
- Each record should be stored in HBase as the basis for the uv calculation.
- When a record arrives, compare its uvid with today's uvids in the database. If a match is found, uv is 0; if there is no match, it is recorded as 1.


HBase table design:
Column family design:
Design one column family named cf1.
Row key design:
Record fields: "url","urlname","uvid","sid","scount","stime","cip"
Row key format: time_uvid_cip_rand
Example: 1495244059010_45189712356761262218_0:0:0:0:0:0:0:1_xxxxx(5)
Regular expressions for matching row keys (the x placeholders stand for concrete values):
^\d+_xxxxx_.*$
^\d+_\d+_xxxx_.*$
^\d+_xxxx_.*$


create 'flux','cf1';
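As an illustration of the uv lookup, the sketch below scans today's rows in the flux table and matches the uvid segment of the row key using the first regular expression above with the concrete uvid substituted for the placeholder. It assumes the classic HBase client API (HTable) and placeholder ZooKeeper hosts; stime is the current record's timestamp and startOfToday is today's midnight in milliseconds.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class UvChecker {
    /** Returns 1 if this uvid has not been seen today, otherwise 0. */
    public static int checkUv(String uvid, long stime, long startOfToday) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hadoop01,hadoop02,hadoop03");
        HTable table = new HTable(conf, "flux");
        try {
            Scan scan = new Scan();
            // restrict the scan to today's rows: the row key starts with the timestamp
            scan.setStartRow(Bytes.toBytes(startOfToday + "_"));
            scan.setStopRow(Bytes.toBytes(stime + "_"));
            // match rows whose uvid segment equals this record's uvid
            scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
                    new RegexStringComparator("^\\d+_" + uvid + "_.*$")));
            ResultScanner rs = table.getScanner(scan);
            Result first = rs.next();
            rs.close();
            return first == null ? 1 : 0;
        } finally {
            table.close();
        }
    }
}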


vv
Number of sessions - whether the current visit starts a new session; if yes, vv is 1, otherwise 0.




newcust
Whether the current visit's uvid has never appeared in history - if so, newcust is 1, otherwise 0.


-------------------
br
Bounce rate - the ratio of the number of bounced sessions to the total number of sessions within a period of time. Because you cannot immediately infer from a single log whether a session bounced, this parameter is not suitable for real-time calculation.


avgtime
Average online duration - the average online duration of all sessions over a period of time. Because the end of a session cannot be immediately inferred from a single log, this parameter is not suitable for real-time calculation.


avgdeep
Average access depth - the average access depth of all sessions over a period of time. Because the end of a session cannot be immediately inferred from a single log, this parameter is not suitable for real-time calculation.


The parameters above need data accumulated over a period of time and are computed from the data in that period, so they are better suited to offline calculation.
In practice, however, launching an offline analysis job every time you want statistics over a short period is inefficient and may even fail to finish on time. Although this is still the processing of data within a time window, the window is small and the data volume is not large, so it looks more like a real-time analysis scenario.
So how can this requirement - using real-time analysis to compute over a period of data - be met?
You can design a special spout with a built-in timer that emits a tuple downstream at a fixed interval to signal that the time is up; after receiving this message, the downstream bolts compute over the data collected during that period.


===Storm's tick mechanism - timed trigger task mechanism========================
Since version 0.8, Storm has provided a tick mechanism for implementing timed tasks.
It lets all tasks of any bolt receive a tick tuple from the __tick stream of the __system component at a fixed interval (accurate to the second; the interval is user-configurable). After receiving such a tuple, the bolt can perform the corresponding processing according to its business needs.
Method 1: Specify a timed task for a specific bolt
Override getComponentConfiguration in the bolt and set the conf property TOPOLOGY_TICK_TUPLE_FREQ_SECS to the desired interval in seconds:
@Override
public Map<String, Object> getComponentConfiguration() {
Config conf = new Config();
conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 900);
return conf;
}
After the program starts, this bolt will receive a tick tuple at the specified interval to trigger processing.
In the execute method, the following check distinguishes whether the current tuple was emitted by the timer:
if (tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID) && tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID)){
//It is triggered by a timed tuple
}else{
//It is triggered by a normal tuple
}
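Putting the two fragments together, a minimal tick-driven bolt might look like the sketch below (assuming the pre-1.0 backtype.storm package names; the actual accumulation and computation logic is left as comments):

import java.util.Map;
import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class TickBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        Config conf = new Config();
        // this bolt alone receives a tick tuple every 900 seconds
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 900);
        return conf;
    }

    @Override
    public void execute(Tuple tuple) {
        if (Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId())) {
            // tick tuple: compute over the data accumulated since the last tick
        } else {
            // normal tuple: accumulate it for the next tick
        }
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this sketch emits nothing downstream
    }
}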
Method 2: Specify a timed task for the entire topology, so that every bolt in the topology receives the tick tuple at the interval.
The code is as follows:
Config conf = new Config();
conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 7);
If a global timer is set and a bolt also configures its own timer, the bolt's own setting takes effect for that bolt.
=====================================================

4. Store the results in MySQL

create database fluxdb;
use fluxdb;
create table tongji_2(
stime DateTime,
pv int,
uv int,
vv int,
newip int,
newcust int
);















create table tongji_3(
stime DateTime,
br double,
avgtime double,
avgdeep double
);
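To complete step 4, a minimal JDBC sketch for writing one row of results into tongji_2 (the connection URL, user, and password are placeholders, and the MySQL connector jar must be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class MysqlWriter {
    public static void saveTongji2(Timestamp stime, int pv, int uv, int vv,
                                   int newip, int newcust) throws Exception {
        // placeholder connection settings
        String url = "jdbc:mysql://hadoop01:3306/fluxdb";
        try (Connection conn = DriverManager.getConnection(url, "root", "root");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO tongji_2 (stime, pv, uv, vv, newip, newcust) VALUES (?,?,?,?,?,?)")) {
            ps.setTimestamp(1, stime);
            ps.setInt(2, pv);
            ps.setInt(3, uv);
            ps.setInt(4, vv);
            ps.setInt(5, newip);
            ps.setInt(6, newcust);
            ps.executeUpdate();
        }
    }
}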
