One: Storm Overview
Apache Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Storm integrates with the queueing and database technologies you already use. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation as needed. Read more in the tutorial.
What is offline (batch) computing?
Batch data acquisition, batch data transfer, batch data storage, periodic computation over the data, data visualization.
flume for batch acquisition, sqoop for batch transfer, hdfs/hive/hbase for batch storage, mr/hive for computation, BI tools for display.
What is real-time computing?
Real-time data generation, real-time data transmission, real-time computation, real-time display.
flume for real-time acquisition, kafka for real-time buffering and storage, Storm/JStorm for real-time computation, real-time display (dataV/quickBI).
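The two pipelines above differ mainly in when computation happens: a batch job aggregates a complete, stored data set, while a stream processor updates its result as each event arrives. A minimal sketch of that difference in plain Java (the class and method names are illustrative, not part of any of the tools listed above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchVsStream {
    // Batch style: the full data set is already stored; compute one result over it.
    static long batchCount(List<String> storedEvents) {
        return storedEvents.stream().filter(e -> e.contains("error")).count();
    }

    // Streaming style: update state per event as it arrives; the consumer
    // stands in for pushing each intermediate result to a live dashboard.
    static long streamCount(List<String> arrivingEvents, Consumer<Long> onUpdate) {
        long count = 0;
        for (String e : arrivingEvents) {
            if (e.contains("error")) {
                count++;
                onUpdate.accept(count); // result is available immediately, not at the end
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> events = List.of("ok", "error: disk", "ok", "error: net");
        List<Long> updates = new ArrayList<>();
        long batch = batchCount(events);
        long stream = streamCount(events, updates::add);
        System.out.println(batch + " " + stream + " " + updates);
    }
}
```

Both compute the same final answer; the streaming version simply never waits for the data set to be complete.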
Two: Storm and Hadoop
|
hadoop |
storm |
Character |
JobTracker |
Nimbus |
TaskTracker |
Supervisor |
|
Child |
Worker |
|
Application Name |
Job |
Topology |
Programming Interface |
Mapper/Reducer |
Spout/Bolt |
Three: Storm programming model
Tuple
The basic unit of message transmission.
Spout
Storm's core abstraction and the source of streams in a topology. A spout typically reads data from an external data source and converts it into tuples inside the topology.
Main methods:
nextTuple(): emits a new tuple into the topology.
ack(): invoked when a tuple emitted by this spout has been fully processed.
fail(): invoked when a tuple emitted by this spout failed to be processed.
Bolt
A bolt is a processing node on a stream. Bolts do filtering, business logic, joins, and aggregations.
Topology
A topology is a real-time application.
It runs forever unless it is killed.
Streams connect spouts to bolts...
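To make the spout-to-bolt flow above concrete, here is a plain-Java simulation of the model. It deliberately does not use the real storm-core API (a real topology would implement IRichSpout/IRichBolt and be wired with TopologyBuilder); the class and method names here are illustrative only. A spout emits sentence tuples, a split bolt turns them into word tuples, and a counting bolt accumulates state, which is the continuous-computation pattern the section describes:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MiniTopology {
    // Spout: the source of the stream; emits one tuple per call to nextTuple(),
    // or null when the (finite, for this demo) source is exhausted.
    static class SentenceSpout {
        private final List<String> source = List.of("the cat", "the dog");
        private int cursor = 0;

        String nextTuple() {
            return cursor < source.size() ? source.get(cursor++) : null;
        }
    }

    // Bolt: a processing node; this one splits a sentence tuple into word tuples.
    static List<String> splitBolt(String sentence) {
        return List.of(sentence.split(" "));
    }

    // "Topology": wires the spout to the bolts; the counting bolt keeps
    // running per-word state, updated as each tuple flows through.
    static Map<String, Integer> runTopology() {
        SentenceSpout spout = new SentenceSpout();
        Map<String, Integer> counts = new HashMap<>();
        String tuple;
        while ((tuple = spout.nextTuple()) != null) {
            for (String word : splitBolt(tuple)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints the accumulated word counts.
        System.out.println(runTopology());
    }
}
```

In real Storm the spout would run forever and ack()/fail() would report each tuple's fate back to it; this sketch only shows the data flow.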
storm: stream computing
Compatibility between hadoop and storm
Aside: ....
spark-core
spark-sql: offline computing
spark-streaming: stream computing
Developed by a single team, so there are no compatibility problems.
The Spark team's goal: build a one-stack development platform!
Whatever big data computation is involved, Spark can handle it!
spark replaced mapreduce
spark has no storage layer of its own
it depends on hdfs
hdfs/mr............
Completing the whole ecosystem!
mapreduce ideas and programming; sqoop->mr, hive->mr, hbase->mr
dfs/mapreduce/bigtable
java/scala...
Four: Storm cluster installation and deployment
1) Preparation
zk01 zk02 zk03
storm01 storm02 storm03
2) Download the installation package
http://storm.apache.org/downloads.html
3) Upload it
4) Unpack it
5) Edit the configuration file
Set the environment variables in ~/.bash_profile
$ vi storm.yaml
# Hostnames of the Zookeeper servers
storm.zookeeper.servers:
- "bigdata11"
- "bigdata12"
- "bigdata13"
# Hostname of the master (Nimbus) node
nimbus.seeds: ["bigdata11"]
# Storm's local data directory (create it beforehand)
storm.local.dir: "/root/training/storm/data"
# Worker port numbers
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
Distribute the configuration to bigdata12 and bigdata13, along with ~/.bash_profile.
6) Start nimbus
$ storm nimbus &
7) Start the supervisor
$ storm supervisor &
8) Start the UI (port 8080)
$ storm ui
Storm command-line operations
1) Show command help
storm help
2) Show the version
storm version
3) Run a storm program
storm jar [/path/to/jar] [fully qualified class name] [topology name]
4) List running topologies and their status
storm list
5) Kill a topology
storm kill [topology name]
6) Activate a topology
storm activate [topology name]
7) Deactivate a topology
storm deactivate [topology name]