Storm Pitfalls I Have Run Into

Please credit the source when reposting: http://blackwing.iteye.com/blog/2147633

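After working with Storm for a while, my impression is that it is still not as mature as Hadoop. That said, its stream processing is impressive: the personalized recommendations I built before were all computed offline, and now the real-time part is finally in place as well.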

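Some lessons learned from using Storm:
1. Where possible, split heavy data processing across several components.
2. Storm is not good at keeping state; stateful logic is usually easier to implement with external storage such as Redis.
3. Related to point 1: do not hold large amounts of data inside a component (a spout or a bolt), because it can easily exhaust memory and get the worker killed.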

Here are some of the pitfalls I ran into:
1. Error: GC overhead limit exceeded
Slide 117 of http://www.slideshare.net/miguno/apache-storm-09-basic-training-verisign
says:

“OOM: GC overhead limit exceeded” exception, then typically your upstream spouts/bolts are outpacing your downstream bolts.

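In other words, the upstream spouts or bolts emit data faster than the downstream bolts can process it. Many of the emitted tuples therefore pile up in buffers, and once enough of them accumulate they exhaust the memory.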

PS: if you cannot open the original slides, you can download them from the attachment.
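To make the mechanism concrete, here is a minimal, Storm-free sketch of a fast producer feeding a slow consumer. It is only a simulation: the class name, the `peakBacklog` method and all the numbers are made up for illustration. It compares the peak backlog with and without a cap on in-flight items. In a real topology with acking enabled, the closest setting is probably `topology.max.spout.pending` (set via `Config.setMaxSpoutPending`), which limits the number of un-acked tuples per spout task; whether that fits a given topology is something to verify.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class PendingCapSketch {

    /**
     * Simulates a producer/consumer pair for a number of ticks and returns
     * the largest backlog observed. A maxPending of 0 or less means "no cap".
     */
    static int peakBacklog(int ticks, int emitPerTick, int processPerTick, int maxPending) {
        Deque<Integer> inFlight = new ArrayDeque<>();
        int peak = 0;
        int nextId = 0;
        for (int t = 0; t < ticks; t++) {
            // Upstream: emit until the per-tick budget is used or the cap is reached.
            for (int i = 0; i < emitPerTick; i++) {
                if (maxPending > 0 && inFlight.size() >= maxPending) {
                    break;
                }
                inFlight.addLast(nextId++);
            }
            peak = Math.max(peak, inFlight.size());
            // Downstream: process at most processPerTick buffered items.
            for (int i = 0; i < processPerTick && !inFlight.isEmpty(); i++) {
                inFlight.pollFirst();
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        // Producer emits 10 per tick, consumer handles only 2 per tick.
        System.out.println("peak backlog without cap: " + peakBacklog(1000, 10, 2, 0));
        System.out.println("peak backlog with cap 50: " + peakBacklog(1000, 10, 2, 50));
    }
}
```

Without the cap the backlog keeps growing for as long as the simulation runs, which is the memory growth described above; with the cap it never exceeds 50, at the cost of the producer being throttled.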

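2. For a class that is passed between components, if the outer class holds a reference to an inner class, the inner class must also be serializable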

import java.io.Serializable;

// Problematic: A is Serializable, but its inner class B is not.
public class A implements Serializable {
    B tmp = new B();

    class B {
        ....
    }
}

If A is going to be emitted, B must be serializable as well; otherwise the bolt that receives it reports an error because the tmp field is null.

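import java.io.Serializable;

// Fixed: both the outer class A and the inner class B are Serializable.
public class A implements Serializable {
    B tmp = new B();

    class B implements Serializable {
        ....
    }
}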
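The difference between the two versions can be checked outside Storm with a plain Java serialization round trip. This is only a sketch under assumptions: the class names are invented, and it uses `java.io` serialization directly, whereas Storm serializes tuples with Kryo and may only fall back to Java serialization, so the exact symptom inside a topology (such as the null field described above) can differ from the exception shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class InnerClassSerializationSketch {

    // Outer class is Serializable, inner class is not.
    static class BrokenA implements Serializable {
        Inner tmp = new Inner();
        class Inner { int value = 42; }
    }

    // Both outer and inner classes are Serializable.
    static class FixedA implements Serializable {
        Inner tmp = new Inner();
        class Inner implements Serializable { int value = 42; }
    }

    // Writes the object to a byte array and reads it back.
    static Object roundTrip(Object o) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    // Returns false if the object graph contains a non-serializable object.
    static boolean survives(Object o) {
        try {
            roundTrip(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("BrokenA survives: " + survives(new BrokenA()));
        System.out.println("FixedA survives:  " + survives(new FixedA()));
    }
}
```

Running a check like this in a unit test before submitting a topology is a cheap way to catch classes that will not make it across component boundaries.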

3. In spouts and bolts, do initialization in open() (for spouts) or prepare() (for bolts) whenever possible

public class ReadLogsFromFileEmitSetSpout extends BaseRichSpout {

	Configuration conf = new Configuration();
......

}

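This spout fails as soon as it is initialized, with: java.io.NotSerializableException
The cause is that the spout is instantiated first and then serialized and sent to the worker machines, and only there is its initialization method called (open() for a spout, prepare() for a bolt). So if a field is initialized where it is declared and its type is not serializable, the error above is thrown. Here is the original explanation I found through Google: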

The supervisor instantiates the bolts, sends them to the workers, and then calls prepare() on all of them. Therefore, anything that isn't serializable that is instantiated before prepare() causes this process to fail.
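The usual way around this is to leave such fields uninitialized (ideally marked `transient`) and create them in the initialization method. The sketch below shows the pattern without depending on Storm: `Resource`, `EagerComponent`, `LazyComponent` and the no-argument `prepare()` are hypothetical stand-ins, not Storm classes or signatures, and plain Java serialization stands in for shipping the component to a worker.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class LazyInitSketch {

    // Hypothetical non-serializable resource (think of a config or connection object).
    static class Resource {
        String describe() { return "ready"; }
    }

    // Cannot be serialized: the non-serializable field is created at declaration time.
    static class EagerComponent implements Serializable {
        Resource resource = new Resource();
    }

    // Can be serialized: the field is transient and only created in prepare().
    static class LazyComponent implements Serializable {
        transient Resource resource;

        // Stand-in for the component's initialization hook on the worker.
        void prepare() {
            resource = new Resource();
        }
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            serialize(new EagerComponent());
        } catch (java.io.NotSerializableException e) {
            System.out.println("eager init failed: " + e);
        }
        // "Ship" the lazy component, then initialize it on the receiving side.
        LazyComponent shipped = (LazyComponent) deserialize(serialize(new LazyComponent()));
        shipped.prepare();
        System.out.println("lazy component is " + shipped.resource.describe());
    }
}
```

Marking the field `transient` also protects against the case where the field has already been assigned by the time the component is serialized.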
