Storm Study Notes
Written by: 王宇
2016-10-20
Storm Concepts
Let us now have a closer look at the components of Apache Storm:
| Component | Description |
| --- | --- |
| Tuple | Tuple is the main data structure in Storm: an ordered list of elements. By default a tuple supports all data types. It is generally modelled as a set of comma-separated values passed through a Storm cluster. |
| Stream | A stream is an unordered sequence of tuples. |
| Spout | The source of a stream. Generally, Storm accepts input from raw data sources such as the Twitter Streaming API, an Apache Kafka queue, or a Kestrel queue, but you can also write a spout to read data from any data source. "ISpout" is the core interface for implementing spouts; some of the specific interfaces are IRichSpout, BaseRichSpout, KafkaSpout, etc. |
| Bolt | Bolts are the logical processing units. Spouts pass data to bolts, which process it and produce new output streams. Bolts can perform filtering, aggregation, joins, and interaction with data sources and databases. A bolt receives data and emits it to one or more downstream bolts. "IBolt" is the core interface for implementing bolts; common interfaces include IRichBolt, IBasicBolt, etc. |
| Topology | Spouts and bolts connected together form a topology, which holds the real-time application logic. In simple words, a topology is a directed graph where the vertices are computation and the edges are streams of data. |
| Tasks | In simple words, a task is one execution of a spout or a bolt. |
| Nimbus | Nimbus is the master node of a Storm cluster; all other nodes in the cluster are called worker nodes. The master node is responsible for distributing data among the worker nodes, assigning tasks to them, and monitoring failures. |
| Supervisor | The nodes that follow the instructions given by Nimbus are called supervisors. A supervisor has multiple worker processes and governs them to complete the tasks assigned by Nimbus. |
| Worker process | A worker process executes tasks related to one specific topology. It does not run tasks by itself; instead it creates executors and asks them to perform the tasks. A worker process can have multiple executors. |
| Executor | An executor is a single thread spawned by a worker process. An executor runs one or more tasks, but only for a specific spout or bolt. |
| Task | A task performs the actual data processing, so it is an instance of a spout or a bolt. |
| ZooKeeper framework | Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate among themselves and maintain shared data with robust synchronization techniques. Nimbus is stateless, so it depends on ZooKeeper to monitor the status of the working nodes. ZooKeeper also helps the supervisors interact with Nimbus and is responsible for maintaining the state of Nimbus and the supervisors. |
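As a small self-contained illustration of the tuple/field pairing described above (the field names and values here are examples only, not part of the call-log program below):

import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TupleSketch {
    public static void main(String[] args) {
        // The schema is declared once per component...
        Fields schema = new Fields("caller", "duration");
        // ...and every emitted tuple is an ordered list matching it positionally.
        Values tuple = new Values("1234123401", 42);
        // Storm resolves name-based access through the field index:
        System.out.println(tuple.get(schema.fieldIndex("duration")));  // prints 42
    }
}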
- Stream groupings (how tuples are routed between component tasks; a declaration sketch follows this list)
  - Shuffle grouping: tuples are distributed across the target bolt's tasks randomly and evenly.
  - Fields grouping: tuples are partitioned by the values of the named fields, so equal field values always reach the same task.
  - All grouping: broadcast; every task of the receiving bolt gets a copy of each tuple.
  - Global grouping: the entire stream goes to a single task of the receiving bolt.
  - None grouping: currently behaves the same as shuffle grouping.
  - Direct grouping: the emitting component decides which task of the consumer receives each tuple.
  - Local or shuffle grouping: prefers tasks in the same worker process; otherwise falls back to shuffle grouping.
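A minimal sketch of how these groupings are declared on a TopologyBuilder, reusing the spout and bolts defined later in this note (the class name GroupingSketch, component ids, and parallelism hints are illustrative):

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class GroupingSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new FakeCallLogReaderSpout());
        // Shuffle grouping: tuples spread randomly and evenly over two tasks.
        builder.setBolt("creator", new CallLogCreatorBolt(), 2)
               .shuffleGrouping("spout");
        // Fields grouping: tuples with the same "call" value always reach the
        // same task, which is what makes per-key counting correct.
        builder.setBolt("counter", new CallLogCounterBolt(), 2)
               .fieldsGrouping("creator", new Fields("call"));
        // All grouping: every task of a bolt would get a copy of each tuple, e.g.
        // builder.setBolt("monitor", new CallLogCounterBolt(), 2).allGrouping("creator");
        builder.createTopology();  // build the StormTopology object
    }
}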
Storm Workflow
- Local mode: the topology runs inside a single JVM via LocalCluster, which is convenient for development and debugging (see the sketch below).
- Production mode: the topology is packaged as a jar and submitted to a running cluster, where Nimbus distributes the work across the supervisors.
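A minimal sketch of the two modes, assuming the FakeCallLogReaderSpout and bolts defined later in this note; the class name ModeSketch and the command-line convention are illustrative, not part of the original example:

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class ModeSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("call-log-reader-spout", new FakeCallLogReaderSpout());
        builder.setBolt("call-log-creator-bolt", new CallLogCreatorBolt())
               .shuffleGrouping("call-log-reader-spout");
        builder.setBolt("call-log-counter-bolt", new CallLogCounterBolt())
               .fieldsGrouping("call-log-creator-bolt", new Fields("call"));
        Config config = new Config();

        if (args.length == 0) {
            // Local mode: the whole topology runs inside this JVM.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("LogAnalyserStorm", config, builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        } else {
            // Production mode: submit to a running cluster, typically launched via
            //   storm jar <jar> ModeSketch <topology-name>
            StormSubmitter.submitTopology(args[0], config, builder.createTopology());
        }
    }
}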
Storm Installation and Configuration
- Step 1: Install the JDK and set the JAVA_HOME and CLASSPATH environment variables.
- Step 2: Install ZooKeeper
Download ZooKeeper and unpack it:
$ tar xzvf zookeeper-3.5.2-alpha.tar.gz
$ mv ./zookeeper-3.5.2-alpha /opt/zookeeper
$ cd /opt/zookeeper
$ mkdir data
Create the configuration file:
$ cd /opt/zookeeper
$ vim conf/zoo.cfg
tickTime=2000
dataDir=/opt/zookeeper/data
clientPort=2181
initLimit=5
syncLimit=2
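In this file, tickTime is ZooKeeper's basic time unit in milliseconds, clientPort is the port clients connect on, and dataDir points at the data directory created above. initLimit and syncLimit are measured in ticks: how long followers may take to connect to the leader, and how far they may lag behind it, respectively.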
Start the ZooKeeper server:
$ bin/zkServer.sh start
- Step 3: Install and configure Storm
Download Storm and unpack it:
$ tar xvfz apache-storm-1.0.2.tar.gz
$ mv ./apache-storm-1.0.2 /opt/storm
$ cd /opt/storm
$ mkdir data
Edit the Storm configuration:
$ cd /opt/storm
$ vim conf/storm.yaml
storm.zookeeper.servers:
  - "localhost"
storm.local.dir: "/opt/storm/data"
nimbus.seeds: ["localhost"]
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
ui.port: 6969
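Here storm.local.dir is scratch space for the daemons (the data directory created above), and supervisor.slots.ports determines how many worker processes each supervisor may run, one JVM per listed port (four here). Note that the nimbus.host key from pre-1.0 releases was replaced by the nimbus.seeds list in Storm 1.0, which is why the configuration above uses nimbus.seeds.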
Start Nimbus:
$ cd /opt/storm
$ ./bin/storm nimbus
Start the Supervisor:
$ cd /opt/storm
$ ./bin/storm supervisor
Start the UI:
$ cd /opt/storm
$ ./bin/storm ui
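With ui.port set to 6969 above, the Storm UI is then reachable in a browser at http://localhost:6969.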
Developing a Call-Counting Task on Storm
- Scenario: count the calls made between mobile phones.
  In the spout, prepare four phone numbers and generate a random number of calls between them.
  Create separate bolts to build and to count the call records.
  Use a topology to wire the spout and bolts together. The program below compiled and ran successfully on Ubuntu 16.04 64-bit with JDK 1.8.
- Create the spout component
A spout implements the IRichSpout interface, whose methods are:
open − Provides the spout with an environment to execute; the executors run this method to initialize the spout.
nextTuple − Emits the generated data through the collector.
close − Called when the spout is going to shut down.
declareOutputFields − Declares the output schema of the tuples.
ack − Acknowledges that a specific tuple has been processed.
fail − Signals that a specific tuple failed to process, so the spout can decide whether to replay it.

import java.util.*;
//import storm tuple packages
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
//import Spout interface packages
import org.apache.storm.topology.IRichSpout;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;

//Create a class FakeCallLogReaderSpout which implements the IRichSpout interface
public class FakeCallLogReaderSpout implements IRichSpout {
    //SpoutOutputCollector passes tuples on to the bolts.
    private SpoutOutputCollector collector;
    private boolean completed = false;
    //TopologyContext contains topology data.
    private TopologyContext context;
    //Random generator used to fabricate call records.
    private Random randomGenerator = new Random();
    private Integer idx = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.context = context;
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        if (this.idx <= 1000) {
            List<String> mobileNumbers = new ArrayList<String>();
            mobileNumbers.add("1234123401");
            mobileNumbers.add("1234123402");
            mobileNumbers.add("1234123403");
            mobileNumbers.add("1234123404");
            Integer localIdx = 0;
            //Emit at most 100 calls per invocation, 1000 calls in total.
            while (localIdx++ < 100 && this.idx++ < 1000) {
                String fromMobileNumber = mobileNumbers.get(randomGenerator.nextInt(4));
                String toMobileNumber = mobileNumbers.get(randomGenerator.nextInt(4));
                //Re-draw the callee until it differs from the caller
                //(compare strings with equals, not ==).
                while (fromMobileNumber.equals(toMobileNumber)) {
                    toMobileNumber = mobileNumbers.get(randomGenerator.nextInt(4));
                }
                Integer duration = randomGenerator.nextInt(60);
                this.collector.emit(new Values(fromMobileNumber, toMobileNumber, duration));
            }
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("from", "to", "duration"));
    }

    //The remaining interface methods are no-ops in this example.
    @Override
    public void close() {}

    //Leftover from the pre-1.0 spout interface; unused in Storm 1.x.
    public boolean isDistributed() {
        return false;
    }

    @Override
    public void activate() {}

    @Override
    public void deactivate() {}

    @Override
    public void ack(Object msgId) {}

    @Override
    public void fail(Object msgId) {}

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
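Note that most of the overrides above are empty. Extending BaseRichSpout (mentioned in the concepts table earlier) instead of implementing IRichSpout directly provides no-op defaults for close, activate, deactivate, ack, and fail, which would shorten the class considerably.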
- Create the bolt components
A bolt implements the IRichBolt interface, whose methods are:
prepare − Provides the bolt with an environment to execute; the executors run this method to initialize the bolt.
execute − Processes a single tuple of input.
cleanup − Called when the bolt is going to shut down.
declareOutputFields − Declares the output schema of the tuples.

The first bolt joins the caller and callee numbers into a single "call" field:
//import util packages
import java.util.Map;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
//import Storm IRichBolt package
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Tuple;

//Create a class CallLogCreatorBolt which implements the IRichBolt interface
public class CallLogCreatorBolt implements IRichBolt {
    //OutputCollector collects processed tuples and emits them downstream.
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String from = tuple.getString(0);
        String to = tuple.getString(1);
        Integer duration = tuple.getInteger(2);
        //Join caller and callee into a single "call" key.
        collector.emit(new Values(from + " - " + to, duration));
        //Acknowledge the input tuple, as IRichBolt implementations must do themselves.
        collector.ack(tuple);
    }

    @Override
    public void cleanup() {}

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("call", "duration"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
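The bolt reads the incoming tuple positionally (getString(0), getInteger(2)). Tuple also supports name-based access such as getStringByField("from") and getIntegerByField("duration"), which survives upstream field reordering; the positional form is kept here to match the original example.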
The second bolt keeps an in-memory count per distinct call pair:

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.tuple.Fields;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Tuple;

public class CallLogCounterBolt implements IRichBolt {
    //Counter keyed by the "call" field produced by CallLogCreatorBolt.
    Map<String, Integer> counterMap;
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.counterMap = new HashMap<String, Integer>();
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String call = tuple.getString(0);
        Integer duration = tuple.getInteger(1);  //read but not used in the count
        if (!counterMap.containsKey(call)) {
            counterMap.put(call, 1);
        } else {
            Integer c = counterMap.get(call) + 1;
            counterMap.put(call, c);
        }
        collector.ack(tuple);
    }

    @Override
    public void cleanup() {
        //Print the final counts when the bolt shuts down.
        for (Map.Entry<String, Integer> entry : counterMap.entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("call"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
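Two details worth noting: the counter declares an output field "call" but never emits downstream tuples, and it only prints its totals in cleanup. Storm calls cleanup reliably when a LocalCluster shuts down, but there is no such guarantee on a real cluster where workers can be killed, so production code would emit or persist the counts instead.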
- Create the topology and a LocalCluster
import org.apache.storm.tuple.Fields;
//import storm configuration packages
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

//Create the main class LogAnalyserStorm, which builds and submits the topology.
public class LogAnalyserStorm {
    public static void main(String[] args) throws Exception {
        //Create a Config instance for cluster configuration
        Config config = new Config();
        config.setDebug(true);

        //Wire the spout and the two bolts into a topology.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("call-log-reader-spout", new FakeCallLogReaderSpout());
        builder.setBolt("call-log-creator-bolt", new CallLogCreatorBolt())
               .shuffleGrouping("call-log-reader-spout");
        builder.setBolt("call-log-counter-bolt", new CallLogCounterBolt())
               .fieldsGrouping("call-log-creator-bolt", new Fields("call"));

        //Run locally for ten seconds, then stop the topology.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("LogAnalyserStorm", config, builder.createTopology());
        Thread.sleep(10000);
        cluster.shutdown();
    }
}
- Remote mode
  http://storm.apache.org/releases/current/Distributed-RPC.html
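In remote mode, the main class calls StormSubmitter.submitTopology instead of LocalCluster.submitTopology (see the mode sketch earlier in this note); the compiled classes are packaged into a jar and handed to the cluster with the storm client, e.g. storm jar LogAnalyser.jar ModeSketch my-topology (the jar and topology names here are hypothetical).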
- Compile and run the application
$ cd /opt/storm/my-example
The Storm jars must be on the classpath when compiling and running; here they are taken from the lib directory of the /opt/storm installation above:
$ javac -cp "/opt/storm/lib/*" *.java
$ java -cp "/opt/storm/lib/*:." LogAnalyserStorm
- Output
1234123402 - 1234123401 : 78
1234123402 - 1234123404 : 88
1234123402 - 1234123403 : 105
1234123401 - 1234123404 : 74
1234123401 - 1234123403 : 81
1234123401 - 1234123402 : 81
1234123403 - 1234123404 : 86
1234123404 - 1234123401 : 63
1234123404 - 1234123402 : 82
1234123403 - 1234123402 : 83
1234123404 - 1234123403 : 86
1234123403 - 1234123401 : 93
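Since the spout draws caller, callee, and duration at random, the exact counts differ from run to run, but the twelve from-to pairs should add up to the 1000 calls emitted by the spout, as they do in this sample output.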
References
Storm website: http://storm.apache.org
Tutorial: https://www.tutorialspoint.com/apache_storm/index.htm
Storm Java docs: http://storm.apache.org/releases/current/javadocs/index.html
- PDF
《Storm Applied》
《Getting Started with Storm》
《Storm Real-time Processing Cookbook》
《Learning Storm》
《Storm Blueprints: Patterns for Distributed Real-time Computation》
《Hadoop The Definitive Guide》