<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>zyb</title>
  <subtitle>zybrc</subtitle>
  <link href="/atom.xml" rel="self"/>
  
  <link href="https://zyb.github.io/"/>
  <updated>2016-12-14T01:45:01.534Z</updated>
  <id>https://zyb.github.io/</id>
  
  <author>
    <name>zyb</name>
    
  </author>
  
  <generator uri="http://hexo.io/">Hexo</generator>
  
  <entry>
    <title>JStorm API Development: Hello World</title>
    <link href="https://zyb.github.io/2016/12/12-jstorm-hello.html"/>
    <id>https://zyb.github.io/2016/12/12-jstorm-hello.html</id>
    <published>2016-12-12T03:19:02.000Z</published>
    <updated>2016-12-14T01:45:01.534Z</updated>
    
    <content type="html"><![CDATA[<h2 id="环境"><a href="#环境" class="headerlink" title="环境"></a>Environment</h2><ul>
<li>JStorm 2.1.1</li>
<li>JDK 1.7</li>
<li>Arch Linux x64</li>
</ul>
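<h2 id="Jstorm概述"><a href="#Jstorm概述" class="headerlink" title="Jstorm概述"></a>JStorm Overview</h2><p>From the application's point of view, JStorm is a distributed application. At the system level it is a scheduling system similar to MapReduce. From the data's point of view it is a real-time processing solution for streaming data. In today's DT (data technology) era, users and enterprises are no longer satisfied with offline data alone and increasingly demand real-time results.</p>
<p>Before Storm and JStorm appeared, the industry had many competing real-time computing systems. Since their release, these two have become the dominant choices, for the following reasons:</p>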
<ul>
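<li>Easy to develop: the API is simple and quick to pick up. Following the Spout, Bolt and Topology programming model is enough to build an application that scales well, without digging into the underlying details.</li>
<li>Scalability: performance scales linearly.</li>
<li>Fault tolerance: when a Worker fails or hangs, a new Worker is assigned automatically to take over.</li>
<li>Data accuracy: the Ack mechanism guards against data loss, and the transaction mechanism improves accuracy further.</li>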
</ul>
<p>JStorm processes data as a stream, so it is typically used for tasks such as:</p>
<ul>
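<li>Log analysis: compute specific statistics from collected logs and persist the results to external storage such as a database. JStorm and Storm are currently the mainstream choices for real-time statistics.</li>
<li>Message forwarding: filter incoming messages and route them to another message middleware.</li>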
</ul>
<h2 id="基本术语"><a href="#基本术语" class="headerlink" title="基本术语"></a>Basic Terminology</h2><p>Storm achieves real-time computation through a set of basic elements: Topology, Stream, Spout, Bolt, Tuple, worker, task and slot.</p>
<h3 id="Stream"><a href="#Stream" class="headerlink" title="Stream"></a>Stream</h3><p>JStorm abstracts a Stream as an unbounded, continuous sequence of Tuples. When modeling an event stream, each event in the stream is abstracted as a Tuple.</p>
<p><img src="http://images2015.cnblogs.com/blog/666745/201509/666745-20150915135843351-643602897.png" alt=""></p>
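<h3 id="Spout和Bolt"><a href="#Spout和Bolt" class="headerlink" title="Spout和Bolt"></a>Spout and Bolt</h3><p>In JStorm every Stream has a source, the origin of its Tuples, and this source is abstracted as a Spout. A Spout may be backed by message middleware such as MQ or Kafka that continuously emits messages, or it may continuously read data from a queue.</p>
<p>With a Spout in place, the next question is how to process what it emits. In the same spirit, JStorm abstracts processing as a Bolt. A Bolt can consume any number of input streams, as long as those streams are directed to it, and it can emit new streams for other Bolts to consume. So we only need to start a particular Spout, direct the Tuples it emits to a particular Bolt, and let that Bolt process them and pass the results on to other Bolts, and so on.</p>
<p>An analogy makes this flow easier to picture. Think of each Spout as a water tap, where every tap carries a different kind of water. To consume a particular kind of water we open the corresponding tap and pipe its water into a water processor, the Bolt. When the processor is done, it pipes the result to another processor or lets it land in a storage medium.</p>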
<p><img src="http://images2015.cnblogs.com/blog/666745/201509/666745-20150915140959179-1408063851.png" alt=""></p>
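<h3 id="Topology"><a href="#Topology" class="headerlink" title="Topology"></a>Topology</h3><p>A real-time computing job must be packaged as a Topology before it is submitted. A Topology is a DAG (directed acyclic graph) of Spouts and Bolts connected by Streams, and it is the highest-level abstraction in JStorm. In other words, a Topology is a graph of stream transformations in which every node is a Spout or a Bolt. When a Spout or Bolt emits a Tuple to a stream, the Tuple is sent to every Bolt that subscribes to that stream.</p>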
<p><img src="http://images2015.cnblogs.com/blog/666745/201509/666745-20150915141401726-976011955.png" alt=""></p>
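<h3 id="Tuple"><a href="#Tuple" class="headerlink" title="Tuple"></a>Tuple</h3><p>JStorm abstracts the data in a Stream as Tuples. A Tuple is a list of values in which every value has a name; a value can be a primitive type, a string, a byte array, or any other serializable type. Every node in a Topology must declare the field names of the Tuples it emits, and other nodes only need to subscribe to those names to receive and process the corresponding content.</p>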
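<p>To make the "list of named values" idea concrete, here is a minimal plain-Java sketch. The class NamedTupleSketch and its method are hypothetical and written only for illustration; this is not the JStorm Tuple class, only a model of how declared field names map to emitted values.</p>

```java
import java.util.Arrays;

// Hypothetical stand-in for a tuple: a list of values plus the field names
// declared by the emitting node. This is NOT the JStorm Tuple class.
class NamedTupleSketch {
    private final String[] fields;
    private final Object[] values;

    NamedTupleSketch(String[] fields, Object[] values) {
        if (fields.length != values.length) {
            throw new IllegalArgumentException("fields and values must have the same length");
        }
        this.fields = fields.clone();
        this.values = values.clone();
    }

    // Look a value up by its declared field name.
    Object getValueByField(String name) {
        int index = Arrays.asList(fields).indexOf(name);
        if (index == -1) {
            throw new IllegalArgumentException("unknown field: " + name);
        }
        return values[index];
    }

    public static void main(String[] args) {
        // The emitter declares two field names and emits one value per name.
        NamedTupleSketch tuple = new NamedTupleSketch(
                new String[] {"value", "source"},
                new Object[] {42, "HelloSpout"});
        System.out.println(tuple.getValueByField("value"));
        System.out.println(tuple.getValueByField("source"));
    }
}
```

<p>A consumer that only knows the field name "value" can read the number without caring about its position in the list, which is the property the paragraph above describes.</p>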
<h3 id="Worker和Task"><a href="#Worker和Task" class="headerlink" title="Worker和Task"></a>Worker and Task</h3><p>Worker and Task are JStorm's execution units: a Worker is a process and a Task is a thread. A Worker can run multiple Tasks, and all Tasks in one Worker must belong to the same Topology.</p>
<p>The number of Workers is set with setNumWorkers(int workers), which determines how many JVMs the Topology runs in (one JVM is one process, that is, one Worker). In setSpout(String id, IRichSpout spout, Number parallelism_hint) and setBolt(String id, IRichBolt bolt,Number parallelism_hint), the parallelism_hint parameter specifies how many instances of that Spout or Bolt exist, and therefore how many threads: one instance corresponds to one thread.</p>
<h3 id="Slot"><a href="#Slot" class="headerlink" title="Slot"></a>Slot</h3><p>JStorm has four types of Slot: CPU, Memory, Disk and Port. This differs from Storm, which is limited to Port. A Supervisor can therefore offer CPU Slots, Memory Slots, Disk Slots and Port Slots.</p>
<ul>
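<li>In JStorm, a Worker consumes one Port Slot, and by default a Task consumes one CPU Slot and one Memory Slot.</li>
<li>A Task that does heavy computation can request more CPU Slots.</li>
<li>A Task that needs more memory can request more Memory Slots.</li>
<li>A Task with heavy disk I/O can request a Disk Slot.</li>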
</ul>
<h2 id="Jstorm架构"><a href="#Jstorm架构" class="headerlink" title="Jstorm架构"></a>JStorm Architecture</h2><p>From a design perspective, JStorm is a typical scheduling system. Its architecture is shown below:</p>
<p><img src="/uploads/jstorm-framework.png" alt=""></p>
<ul>
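<li>ZooKeeper: the coordinator of the system</li>
<li>Nimbus: the scheduler</li>
<li>Supervisor: the agent for Workers, responsible for starting and killing Workers</li>
<li>Worker: a JVM process, the container for Tasks</li>
<li>Task: a thread, the executor of the actual work</li>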
</ul>
<h2 id="Jstorm接口开发——Topology"><a href="#Jstorm接口开发——Topology" class="headerlink" title="Jstorm接口开发——Topology"></a>JStorm API Development: Topology</h2><p>Topology development follows a fairly standard pattern. Based on the official examples, I distilled the following Topology base class:</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div><div class="line">48</div><div class="line">49</div><div class="line">50</div><div class="line">51</div><div class="line">52</div><div class="line">53</div><div class="line">54</div><div class="line">55</div><div class="line">56</div><div class="line">57</div><div class="line">58</div><div class="line">59</div><div class="line">60</div><div class="line">61</div><div class="line">62</div><div class="line">63</div><div class="line">64</div><div class="line">65</div><div class="line">66</div><div class="line">67</div><div class="line">68</div><div class="line">69</div><div class="line">70</div><div class="line">71</div><div class="line">72</div><div class="line">73</div><div class="line">74</div><div 
class="line">75</div><div class="line">76</div><div class="line">77</div><div class="line">78</div><div class="line">79</div><div class="line">80</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">public</span> <span class="keyword">static</span> <span class="keyword">abstract</span> <span class="class"><span class="keyword">class</span> <span class="title">SelfDefTopologyImp</span> </span>&#123;</div><div class="line"></div><div class="line">  <span class="keyword">protected</span> Map conf = <span class="keyword">new</span> HashMap&lt;Object, Object&gt;();</div><div class="line"></div><div class="line">  <span class="comment">// Hook for configuring the TopologyBuilder; each concrete Topology implements it</span></div><div class="line">  <span class="function"><span class="keyword">protected</span> <span class="keyword">abstract</span> <span class="keyword">void</span> <span class="title">SetBuilder</span><span class="params">(TopologyBuilder builder, Map conf)</span></span>;</div><div class="line"></div><div class="line">  <span class="comment">// Start in local mode:</span></div><div class="line">  <span class="comment">// 1. Read the topology name from the configuration</span></div><div class="line">  <span class="comment">// 2. Configure the TopologyBuilder through the abstract SetBuilder hook</span></div><div class="line">  <span class="comment">// 3. Submit the job to a local cluster</span></div><div class="line">  <span class="comment">// 4. After the debugging period has elapsed, shut the local cluster down</span></div><div class="line">  <span class="function"><span class="keyword">protected</span> <span class="keyword">void</span> <span class="title">SetLocalTopology</span><span class="params">()</span> <span class="keyword">throws</span> Exception </span>&#123;</div><div class="line">    String tname = (String) conf.get(Config.TOPOLOGY_NAME);</div><div class="line">    <span class="keyword">if</span> (tname == <span class="keyword">null</span>) &#123;</div><div class="line">      <span class="keyword">throw</span> <span class="keyword">new</span> IllegalArgumentException(<span class="string">"Topology name is null"</span>);</div><div 
class="line">    &#125;</div><div class="line"></div><div class="line">    TopologyBuilder builder = <span class="keyword">new</span> TopologyBuilder();</div><div class="line">    SetBuilder(builder, conf);</div><div class="line"></div><div class="line">    LocalCluster cluster = <span class="keyword">new</span> LocalCluster();</div><div class="line">    cluster.submitTopology(tname, conf, builder.createTopology());</div><div class="line"></div><div class="line">    Thread.sleep(<span class="number">60000</span>);</div><div class="line">    cluster.killTopology(tname);</div><div class="line">    cluster.shutdown();</div><div class="line">  &#125;</div><div class="line"></div><div class="line">  <span class="comment">// Start in cluster mode:</span></div><div class="line">  <span class="comment">// 1. Read the topology name from the configuration</span></div><div class="line">  <span class="comment">// 2. Configure the TopologyBuilder through the abstract SetBuilder hook</span></div><div class="line">  <span class="comment">// 3. Submit the job to the cluster</span></div><div class="line">  <span class="function"><span class="keyword">protected</span> <span class="keyword">void</span> <span class="title">SetRemoteTopology</span><span class="params">()</span> <span class="keyword">throws</span> AlreadyAliveException,</span></div><div class="line">      InvalidTopologyException &#123;</div><div class="line">    String tname = (String) conf.get(Config.TOPOLOGY_NAME);</div><div class="line">    <span class="keyword">if</span> (tname == <span class="keyword">null</span>) &#123;</div><div class="line">      <span class="keyword">throw</span> <span class="keyword">new</span> IllegalArgumentException(<span class="string">"Topology name is null"</span>);</div><div class="line">    &#125;</div><div class="line"></div><div class="line">    TopologyBuilder builder = <span class="keyword">new</span> TopologyBuilder();</div><div class="line">    SetBuilder(builder, conf);</div><div class="line"></div><div class="line">    conf.put(Config.STORM_CLUSTER_MODE, <span class="string">"distributed"</span>);</div><div 
class="line"></div><div class="line">    StormSubmitter.submitTopology(tname, conf, builder.createTopology());</div><div class="line">  &#125;</div><div class="line"></div><div class="line">  <span class="function"><span class="keyword">protected</span> <span class="keyword">void</span> <span class="title">LoadConf</span><span class="params">(String arg)</span> </span>&#123;</div><div class="line">    <span class="keyword">if</span> (arg.endsWith(<span class="string">"yaml"</span>)) &#123;</div><div class="line">      conf = LoadConf.LoadYaml(arg);</div><div class="line">    &#125; <span class="keyword">else</span> &#123;</div><div class="line">      conf = LoadConf.LoadProperty(arg);</div><div class="line">    &#125;</div><div class="line">  &#125;</div><div class="line"></div><div class="line">  <span class="comment">// Decide from the configuration whether to start in local mode or cluster mode</span></div><div class="line">  <span class="function"><span class="keyword">protected</span> <span class="keyword">boolean</span> <span class="title">local_mode</span><span class="params">(Map conf)</span> </span>&#123;</div><div class="line">    String mode = (String) conf.get(Config.STORM_CLUSTER_MODE);</div><div class="line">    <span class="keyword">if</span> (mode != <span class="keyword">null</span>) &#123;</div><div class="line">      <span class="keyword">if</span> (mode.equals(<span class="string">"local"</span>)) &#123;</div><div class="line">        <span class="keyword">return</span> <span class="keyword">true</span>;</div><div class="line">      &#125;</div><div class="line">    &#125;</div><div class="line"></div><div class="line">    <span class="keyword">return</span> <span class="keyword">false</span>;</div><div class="line">  &#125;</div><div class="line"></div><div class="line">  <span class="comment">// Entry point: 1. load the configuration file; 2. start in local or cluster mode as configured</span></div><div class="line">  <span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span 
class="title">run</span><span class="params">(String cfile)</span> <span class="keyword">throws</span> Exception </span>&#123;</div><div class="line">    <span class="keyword">if</span> (StringUtils.isBlank(cfile)) <span class="keyword">throw</span> <span class="keyword">new</span> IllegalArgumentException(<span class="string">"params invalid."</span>);</div><div class="line"></div><div class="line">    LoadConf(cfile);</div><div class="line">    <span class="keyword">if</span> (local_mode(conf)) &#123;</div><div class="line">      SetLocalTopology();</div><div class="line">    &#125; <span class="keyword">else</span> &#123;</div><div class="line">      SetRemoteTopology();</div><div class="line">    &#125;</div><div class="line">  &#125;</div><div class="line">&#125;</div></pre></td></tr></table></figure>
<blockquote>
<ul>
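<li>The key piece is the SetBuilder() hook: configure the TopologyBuilder directly through it.</li>
<li>SelfDefTopologyImp mainly wraps configuration loading and startup in local and cluster mode; see the code for the remaining details.</li>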
</ul>
</blockquote>
<p>The HelloTopology class extends SelfDefTopologyImp and is implemented as follows:</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">public</span> <span class="keyword">static</span> <span class="class"><span class="keyword">class</span> <span class="title">HelloTopology</span> <span class="keyword">extends</span> <span class="title">SelfDefTopologyImp</span> </span>&#123;</div><div class="line">  <span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span class="title">SetBuilder</span><span class="params">(TopologyBuilder builder, Map conf)</span> </span>&#123;</div><div class="line"></div><div class="line">    <span class="comment">// Read the spout and bolt parallelism from the configuration (default 1)</span></div><div class="line">    <span class="keyword">int</span> spout_Parallelism_hint = JStormUtils.parseInt(conf.get(<span class="string">"topology.spout.parallel"</span>), <span class="number">1</span>);</div><div class="line">    <span class="keyword">int</span> bolt_Parallelism_hint = JStormUtils.parseInt(conf.get(<span class="string">"topology.bolt.parallel"</span>), <span class="number">1</span>);</div><div class="line"></div><div class="line">    <span class="comment">// Name the spout and the bolt after their classes</span></div><div class="line">    String spoutName = HelloSpout.class.getSimpleName();</div><div class="line">    String boltName = HelloBolt.class.getSimpleName();</div><div class="line"></div><div class="line">    <span class="comment">// Register the spout and the bolt; shuffleGrouping makes HelloBolt receive HelloSpout's data</span></div><div class="line">    <span 
class="comment">// These settings ultimately form the Topology's DAG</span></div><div class="line">    builder.setSpout(spoutName, <span class="keyword">new</span> HelloSpout(), spout_Parallelism_hint);</div><div class="line">    builder.setBolt(boltName, <span class="keyword">new</span> HelloBolt(), bolt_Parallelism_hint).shuffleGrouping(spoutName);</div><div class="line">  &#125;</div><div class="line">&#125;</div></pre></td></tr></table></figure>
<h2 id="Jstorm接口开发——Spout"><a href="#Jstorm接口开发——Spout" class="headerlink" title="Jstorm接口开发——Spout"></a>JStorm API Development: Spout</h2><p>HelloSpout generates one random number per second and passes it downstream.</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">public</span> <span class="keyword">static</span> <span class="class"><span class="keyword">class</span> <span class="title">HelloSpout</span> <span class="keyword">extends</span> <span class="title">BaseRichSpout</span> </span>&#123;</div><div class="line">  <span class="keyword">private</span> SpoutOutputCollector collector;</div><div class="line">  <span class="keyword">private</span> <span class="keyword">static</span> Random rand;</div><div class="line"></div><div class="line">  <span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span class="title">open</span><span class="params">(Map conf, TopologyContext context, SpoutOutputCollector collector)</span> </span>&#123;</div><div class="line">    <span class="keyword">this</span>.collector = collector;</div><div class="line">    <span class="keyword">this</span>.rand = <span class="keyword">new</span> Random();</div><div class="line">  &#125;</div><div class="line"></div><div class="line">  <span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span class="title">nextTuple</span><span class="params">()</span> </span>&#123;</div><div class="line">    <span class="keyword">int</span> r = rand.nextInt(<span 
class="number">9999</span>);</div><div class="line">    collector.emit(<span class="keyword">new</span> Values(r));</div><div class="line">    <span class="keyword">try</span> &#123;</div><div class="line">      Thread.sleep(<span class="number">1000</span>);</div><div class="line">    &#125; <span class="keyword">catch</span> (InterruptedException e) &#123;</div><div class="line">      e.printStackTrace();</div><div class="line">    &#125;</div><div class="line">  &#125;</div><div class="line"></div><div class="line">  <span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span class="title">declareOutputFields</span><span class="params">(OutputFieldsDeclarer declarer)</span> </span>&#123;</div><div class="line">    declarer.declare(<span class="keyword">new</span> Fields(<span class="string">"value"</span>));</div><div class="line">  &#125;</div><div class="line">&#125;</div></pre></td></tr></table></figure>
<h2 id="Jstorm接口开发——Bolt"><a href="#Jstorm接口开发——Bolt" class="headerlink" title="Jstorm接口开发——Bolt"></a>JStorm API Development: Bolt</h2><figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">public</span> <span class="keyword">static</span> <span class="class"><span class="keyword">class</span> <span class="title">HelloBolt</span> <span class="keyword">extends</span> <span class="title">BaseBasicBolt</span> </span>&#123;</div><div class="line"></div><div class="line">  <span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span class="title">prepare</span><span class="params">(Map stormConf, TopologyContext context)</span> </span>&#123;</div><div class="line">    <span class="keyword">super</span>.prepare(stormConf, context);</div><div class="line">  &#125;</div><div class="line"></div><div class="line">  <span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span class="title">execute</span><span class="params">(Tuple input, BasicOutputCollector collector)</span> </span>&#123;</div><div class="line">    <span class="keyword">int</span> n = input.getIntegerByField(<span class="string">"value"</span>);</div><div class="line">    System.out.println();</div><div class="line">    System.out.println(<span class="string">"==========================="</span>);</div><div class="line">    System.out.println(<span class="string">"value: "</span> + n);</div><div class="line">    
System.out.println(<span class="string">"==========================="</span>);</div><div class="line">    System.out.println();</div><div class="line">  &#125;</div><div class="line"></div><div class="line">  <span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span class="title">declareOutputFields</span><span class="params">(OutputFieldsDeclarer declarer)</span> </span>&#123;</div><div class="line">  &#125;</div><div class="line">&#125;</div></pre></td></tr></table></figure>
<h2 id="Jstorm运行接口开发——主函数"><a href="#Jstorm运行接口开发——主函数" class="headerlink" title="Jstorm运行接口开发——主函数"></a>JStorm API Development: Main Function</h2><figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">public</span> <span class="class"><span class="keyword">class</span> <span class="title">JstormHelloWorld</span> </span>&#123;</div><div class="line">  <span class="function"><span class="keyword">public</span> <span class="keyword">static</span> <span class="keyword">void</span> <span class="title">main</span><span class="params">(String[] args)</span> <span class="keyword">throws</span> Exception </span>&#123;</div><div class="line">    <span class="keyword">if</span> (args.length == <span class="number">0</span>) &#123;</div><div class="line">      System.err.println(<span class="string">"Please input configuration file"</span>);</div><div class="line">      System.exit(<span class="number">1</span>);</div><div class="line">    &#125;</div><div class="line"></div><div class="line">    (<span class="keyword">new</span> HelloTopology()).run(args[<span class="number">0</span>]);</div><div class="line">  &#125;</div><div class="line">&#125;</div></pre></td></tr></table></figure>
<h2 id="Jstorm任务配置"><a href="#Jstorm任务配置" class="headerlink" title="Jstorm任务配置"></a>JStorm Job Configuration</h2><p>The configuration file is as follows:</p>
<figure class="highlight yaml"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Cluster mode or local mode</span></div><div class="line">storm.cluster.mode: <span class="string">"local"</span></div><div class="line"><span class="comment">#storm.cluster.mode: "distributed"</span></div><div class="line"></div><div class="line"><span class="comment"># Topology name</span></div><div class="line">topology.name: <span class="string">"JstormHelloWorld"</span></div><div class="line"></div><div class="line"><span class="comment"># Parallelism of the spout and the bolt</span></div><div class="line">topology.spout.parallel: <span class="number">1</span></div><div class="line">topology.bolt.parallel: <span class="number">1</span></div></pre></td></tr></table></figure>
<h2 id="Jstorm任务提交"><a href="#Jstorm任务提交" class="headerlink" title="Jstorm任务提交"></a>JStorm Job Submission</h2><p>The command for a local run is shown below; conf/jstormHelloWorld.yaml is the configuration file.</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ java -cp JstormHelloWorld-1.0.0-jar-with-dependencies.jar io.github.zyb.jstorm.JstormHelloWorld conf/jstormHelloWorld.yaml</div></pre></td></tr></table></figure>
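<h2 id="注意事项"><a href="#注意事项" class="headerlink" title="注意事项"></a>Notes</h2><ul>
<li><p>The name of a Topology submitted to JStorm must not contain spaces. More precisely, the name should match the regular expression "[a-zA-Z0-9-_.]+".</p>
</li>
<li><p><strong>Important</strong>: the pom.xml of a JStorm project depends on jstorm-core. In cluster mode this dependency must have the scope provided, but in local mode it must not be provided. In other words, use the configuration below for local mode, and uncomment the scope line for cluster mode:</p>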
</li>
</ul>
<figure class="highlight xml"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line"><span class="tag">&lt;<span class="name">dependency</span>&gt;</span></div><div class="line">  <span class="tag">&lt;<span class="name">groupId</span>&gt;</span>com.alibaba.jstorm<span class="tag">&lt;/<span class="name">groupId</span>&gt;</span></div><div class="line">  <span class="tag">&lt;<span class="name">artifactId</span>&gt;</span>jstorm-core<span class="tag">&lt;/<span class="name">artifactId</span>&gt;</span></div><div class="line">  <span class="tag">&lt;<span class="name">version</span>&gt;</span>2.1.1<span class="tag">&lt;/<span class="name">version</span>&gt;</span></div><div class="line">  <span class="comment">&lt;!-- &lt;scope&gt;provided&lt;/scope&gt; --&gt;</span></div><div class="line"><span class="tag">&lt;/<span class="name">dependency</span>&gt;</span></div></pre></td></tr></table></figure>
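<p>The topology naming rule from the notes above can be checked before submission. The following is a minimal sketch in plain Java; the class TopologyNameCheck and the method isValidName are made-up names for illustration and are not part of JStorm. The pattern lists the same characters as the expression quoted above, with the hyphen moved to the end of the character class so that it cannot be misread as a range.</p>

```java
import java.util.regex.Pattern;

// Sketch of a pre-submission check for the topology naming rule.
// Class and method names are hypothetical, not a JStorm API.
class TopologyNameCheck {
    // Letters, digits, underscore, dot and hyphen; at least one character.
    private static final Pattern VALID_NAME = Pattern.compile("[a-zA-Z0-9_.-]+");

    static boolean isValidName(String name) {
        if (name == null) {
            return false;
        }
        // matches() requires the whole name to fit the pattern.
        return VALID_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidName("JstormHelloWorld"));   // true
        System.out.println(isValidName("Jstorm Hello World")); // false: contains spaces
    }
}
```

<p>Calling such a check at the start of run() would surface a bad topology.name from the configuration file before anything is submitted.</p>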
<h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>References</h2><p><a href="https://yq.aliyun.com/articles/34083" target="_blank" rel="external">JStorm: Introduction</a> <strong>Highly recommended; read this first.</strong> Many passages in this article are quoted from it.<br><a href="https://github.com/alibaba/jstorm/blob/2.1.1/example/sequence-split-merge/src/main/java/com/alipay/dw/jstorm/example/sequence" target="_blank" rel="external">Jstorm Example</a> Part of the sample code in this article is based on it.<br><a href="http://mojijs.com/2016/10/219731/index.html" target="_blank" rel="external">Storm Hello World: Word Count</a> Part of the sample code in this article is based on it.<br><a href="http://blog.csdn.net/szzhaom/article/details/41792023" target="_blank" rel="external">JStorm - Hello Word</a> Part of the sample code in this article is based on it.<br><a href="http://www.voidcn.com/blog/wwwxxdddx/article/p-4881831.html" target="_blank" rel="external">Introduction to JStorm</a> <strong>Highly recommended; read this first.</strong><br><a href="http://www.voidcn.com/blog/wwwxxdddx/article/p-4881832.html" target="_blank" rel="external">JStorm: Introduction to Nimbus</a> <strong>Read this first.</strong><br><a href="http://www.voidcn.com/blog/wwwxxdddx/article/p-4881833.html" target="_blank" rel="external">JStorm: Introduction to Supervisor</a> <strong>Read this first.</strong><br><a href="http://jstorm.io/quickstart_cn/Example.html" target="_blank" rel="external">Official JStorm Example</a> The example from the official site; I barely used it and would rather not comment on it…<br><a href="https://github.com/alibaba/jstorm/wiki/JStorm-Chinese-Documentation" target="_blank" rel="external">Jstorm Github Wiki</a><br><a href="https://github.com/alibaba/jstorm/wiki/%E5%9F%BA%E6%9C%AC%E6%A6%82%E5%BF%B5" target="_blank" rel="external">JStorm Basic Concepts</a></p>
<h2 id="拓展阅读"><a href="#拓展阅读" class="headerlink" title="拓展阅读"></a>Further Reading</h2><p><a href="https://yq.aliyun.com/articles/66098?spm=5176.100240.searchblog.22.FjBOiE" target="_blank" rel="external">The Data Technology and Products Behind the Double 11 Media Big Screen</a><br><a href="https://yq.aliyun.com/articles/62693?spm=5176.100240.searchblog.36.FjBOiE" target="_blank" rel="external">JStorm: Making Large-Scale Stream Processing Possible</a></p>
]]></content>
    
    <summary type="html">
    
      &lt;h2 id=&quot;环境&quot;&gt;&lt;a href=&quot;#环境&quot; class=&quot;headerlink&quot; title=&quot;环境&quot;&gt;&lt;/a&gt;Environment&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;JStorm 2.1.1&lt;/li&gt;
&lt;li&gt;JDK 1.7&lt;/li&gt;
&lt;li&gt;Arch Linux x64&lt;/li&gt;

    
    </summary>
    
      <category term="大数据" scheme="https://zyb.github.io/categories/%E5%A4%A7%E6%95%B0%E6%8D%AE/"/>
    
    
      <category term="jstorm" scheme="https://zyb.github.io/tags/jstorm/"/>
    
  </entry>
  
  <entry>
    <title>Apache Pig Pitfalls (and Notes on Pig Features)</title>
    <link href="https://zyb.github.io/2016/12/11-pig-trap.html"/>
    <id>https://zyb.github.io/2016/12/11-pig-trap.html</id>
    <published>2016-12-11T08:48:57.000Z</published>
    <updated>2016-12-12T03:10:44.955Z</updated>
    
    <content type="html"><![CDATA[<h2 id="环境"><a href="#环境" class="headerlink" title="环境"></a>Environment</h2><ul>
<li>Pig 0.16.0</li>
<li>JDK 1.7</li>
<li>Arch Linux x64</li>
</ul>
<h2 id="Pig坑"><a href="#Pig坑" class="headerlink" title="Pig坑"></a>Pig Pitfalls</h2><h3 id="1-在local模式同时执行多个pig脚本"><a href="#1-在local模式同时执行多个pig脚本" class="headerlink" title="1. 在local模式同时执行多个pig脚本"></a>1. Running multiple Pig scripts concurrently in local mode</h3><p><strong>Problem</strong>: when several Pig scripts run at the same time in local mode, some of them may fail with an error like the following:</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div></pre></td><td class="code"><pre><div class="line">org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find output/spill0.out in any of the configured local directories</div><div class="line">        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:<span class="number">429</span>)</div><div class="line">        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:<span class="number">160</span>)</div><div class="line">        at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:<span class="number">107</span>)</div><div class="line">        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:<span class="number">1614</span>)</div><div class="line">        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:<span class="number">1323</span>)</div><div class="line">        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:<span class="number">699</span>)</div><div class="line">        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:<span class="number">766</span>)</div><div class="line">        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:<span class="number">370</span>)</div><div class="line">        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:<span class="number">212</span>)</div></pre></td></tr></table></figure>
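<p><strong>Cause</strong>: in local mode, the job IDs of every Pig script start at 1, so the temporary directories of the MapReduce jobs launched by different scripts collide, and several jobs end up operating on the same file at once. This hardly counts as a Pig bug, though, because local mode is intended mainly for testing and debugging.</p>
<p><strong>Solution</strong>: the most direct fix follows from the cause. Since the temporary directories collide, give each Pig script its own temporary directory:</p>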
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">set hadoop.tmp.dir &apos;/unique/tmp/path/hadoop-tmp&apos;;</div></pre></td></tr></table></figure>
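<p>If the scripts are launched by a wrapper program, the unique path can be generated rather than written by hand. The sketch below is one possible approach in plain Java; the class PigTmpDirSketch, the method uniqueTmpDirStatement and the "pig-local-" prefix are assumptions for illustration only. It creates a fresh directory under the system temp location and builds the same set statement as shown above, which a wrapper could prepend to each script.</p>

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: one fresh temp directory per Pig script run.
class PigTmpDirSketch {
    // Creates a new directory and returns the matching Pig "set" statement.
    static String uniqueTmpDirStatement(String prefix) {
        try {
            Path dir = Files.createTempDirectory(prefix);
            return "set hadoop.tmp.dir '" + dir.toAbsolutePath() + "';";
        } catch (IOException e) {
            throw new IllegalStateException("could not create a temp directory", e);
        }
    }

    public static void main(String[] args) {
        // Each call yields a different directory, so concurrent scripts do not collide.
        System.out.println(uniqueTmpDirStatement("pig-local-"));
        System.out.println(uniqueTmpDirStatement("pig-local-"));
    }
}
```

<p>The directories are not cleaned up by this sketch; removing them after each script finishes is left to the caller.</p>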
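<h3 id="2-define关键字的坑"><a href="#2-define关键字的坑" class="headerlink" title="2. define关键字的坑"></a>2. The define keyword pitfall</h3><p><strong>Problem</strong>: calling this a pitfall is a bit of a stretch; the point is to understand what define really is. define lets you define Pig functions, and inside such a function every statement also names a relation. In the example below, the relation names follow a pattern like 'h_$M' rather than a fixed name such as 'flatten_foreach'. With a fixed name, calling the function several times makes the relation names overwrite one another, and the final result is not what you expect.</p>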
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">define FlattenFunc(MS) RETURNS M &#123;</div><div class="line">    h_$M = foreach $MS generate flatten($0) as (s:chararray, a:long, b:chararray, c:long, d:long);</div><div class="line">    $M = foreach h_$M generate s, ToDate(ToString(ToDate(a), &apos;yyyyMMddHH&apos;), &apos;yyyyMMddHH&apos;) as a, b..d;</div><div class="line">&#125;;</div></pre></td></tr></table></figure>
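<p><strong>Root cause</strong>: define does not define a function. Like #define in C, it defines a macro: before Pig executes the script, every define is expanded by string substitution, so the script that finally runs contains no functions at all. If a relation name inside a define is fixed and the macro is invoked several times in the same script, that name is repeated in the expanded script, and the earlier relation is overwritten by the later one with the same name.</p>
<p><strong>Solution</strong>: Understand what define really means: like #define in C, it is plain string substitution.</p>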
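<p>The collision described above can be reproduced with a toy model of string substitution. This is only a sketch under the assumption that expansion is plain text replacement, as this post describes; it is not Pig's real preprocessor, and the helper names (expand, defined_aliases, expand_twice) as well as the arguments src1/out1 are made up for illustration.</p>

```python
# Toy model of macro expansion by plain string substitution.
# Assumption: this mimics the behavior described in the post, not Pig's real preprocessor.

def expand(body, params):
    """Replace every $NAME placeholder in the macro body with its argument."""
    # Substitute longer names first so that $MS is not clobbered by $M.
    for name in sorted(params, key=len, reverse=True):
        body = body.replace("$" + name, params[name])
    return body

def defined_aliases(script):
    """Collect the alias on the left-hand side of each 'alias = ...' statement."""
    return [line.split("=", 1)[0].strip() for line in script.splitlines() if "=" in line]

# Macro body with a fixed intermediate alias (hypothetical, simplified).
FIXED = "flatten_foreach = foreach $MS generate flatten($0);\n$M = foreach flatten_foreach generate s;"
# Macro body whose intermediate alias depends on the output parameter.
PARAM = "h_$M = foreach $MS generate flatten($0);\n$M = foreach h_$M generate s;"

def expand_twice(body):
    """Expand the same macro body for two different call sites."""
    first = expand(body, {"MS": "src1", "M": "out1"})
    second = expand(body, {"MS": "src2", "M": "out2"})
    return defined_aliases(first + "\n" + second)

fixed_aliases = expand_twice(FIXED)
param_aliases = expand_twice(PARAM)
print(fixed_aliases)  # 'flatten_foreach' appears twice: the second definition shadows the first
print(param_aliases)  # every alias is distinct
```

<p>In this model the fixed alias is defined twice in the expanded script, while the 'h_$M' form yields a distinct alias per call.</p>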
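<h3 id="3-foreach嵌套的坑"><a href="#3-foreach嵌套的坑" class="headerlink" title="3. foreach嵌套的坑"></a>3. The nested foreach pitfall</h3><p><strong>Problem</strong>: A nested foreach statement can cause an OOM. Take the statements below: if the source data has, say, 1,000,000 rows and the group produces a single result, an OOM occurs when memory is limited.</p>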
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">xf = foreach src generate a, b, c, d, e;</div><div class="line">xg = group xf by (a, b, c, d);</div><div class="line">x = foreach xg &#123;</div><div class="line">	fa = filter xf by e is not null and (e==&apos;AD_SUCCESS&apos; or e==&apos;BID_FAILED&apos;);</div><div class="line">	fb = filter xf by e is not null and e==&apos;AD_SUCCESS&apos;;</div><div class="line">	generate flatten(group), COUNT(fa), COUNT(fb);</div><div class="line">&#125;</div></pre></td></tr></table></figure>
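<p><strong>Root cause</strong>: Judging from what I know of Pig and from the observed behavior, Pig handles a nested foreach specially, quite differently from top-level statements. In some cases the statements inside a nested foreach cause all of the relevant data to be loaded into memory, so a large amount of data in the nested block can lead to an OOM. I have not yet analyzed in detail how nested foreach works.</p>
<p><strong>Solution</strong>: Nested foreach is flexible, but use it with care to avoid this situation. The example above can be rewritten with Pig's case expression as follows:</p>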
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">xf = foreach src generate a, b, c, d, case when (e is not null and (e==&apos;AD_SUCCESS&apos; or e==&apos;BID_FAILED&apos;)) then 1 else 0 end as fa, case when (e is not null and e==&apos;AD_SUCCESS&apos;) then 1 else 0 end as fb;</div><div class="line">xg = group xf by (a, b, c, d);</div><div class="line">x = foreach xg generate flatten(group), SUM(xf.fa), SUM(xf.fb);</div></pre></td></tr></table></figure>
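<p>To see why the two formulations give the same numbers, here is a small Python sketch of both. It is only a model of the arithmetic, not of how Pig executes either script; the sample rows and the function names nested_style and flag_style are invented for illustration.</p>

```python
from collections import defaultdict

# Hypothetical sample data: (group key, value of e).
rows = [
    ("k1", "AD_SUCCESS"), ("k1", "BID_FAILED"), ("k1", None),
    ("k2", "AD_SUCCESS"), ("k2", "OTHER"),
]
WANTED = ("AD_SUCCESS", "BID_FAILED")

def nested_style(rows):
    """Model of the nested foreach: collect each group's bag, then filter and count."""
    groups = defaultdict(list)
    for key, e in rows:
        groups[key].append(e)  # the whole bag of a group is held at once
    return {
        key: (sum(1 for e in bag if e in WANTED),
              sum(1 for e in bag if e == "AD_SUCCESS"))
        for key, bag in groups.items()
    }

def flag_style(rows):
    """Model of the case/SUM rewrite: turn each row into 0/1 flags and add them up."""
    totals = defaultdict(lambda: [0, 0])
    for key, e in rows:
        totals[key][0] += 1 if e in WANTED else 0
        totals[key][1] += 1 if e == "AD_SUCCESS" else 0
    return {key: tuple(value) for key, value in totals.items()}

print(nested_style(rows))
print(flag_style(rows))
```

<p>In this model the flag version only keeps two running totals per group, whereas the nested version keeps every row of the group before counting.</p>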
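<h2 id="Pig有用的特性"><a href="#Pig有用的特性" class="headerlink" title="Pig有用的特性"></a>Useful Pig features</h2><h3 id="1-pig在同一个load多个store上所进行的优化"><a href="#1-pig在同一个load多个store上所进行的优化" class="headerlink" title="1. pig在同一个load多个store上所进行的优化"></a>1. Pig's optimization for one load with multiple stores</h3><p><strong>Description</strong>: Consider this scenario: read one set of raw data and produce several reports along different dimensions. In Pig, that means a single script loads from one data source, aggregates along each dimension, and finally produces several reports, that is, it executes several store statements. Because the input is the same data, a hand-written MapReduce program could probably produce all of these reports in one job. So does Pig end up using a single MapReduce job, or does each store produce its own job?</p>
<p><strong>Conclusion</strong>: Pig optimizes this case well, and it is one of the areas where it can outperform Hive: Pig ends up using a single MapReduce job for the whole analysis. This is possible because Pig first builds an execution plan before running anything, and many optimizations happen at the planning stage, one of which is merging operations that can be merged. In this scenario, since everything comes from the same data source, Pig recognizes the possible merge while building the plan, and only one MapReduce job runs. If the same work were done in Hive, producing several reports would require several SQL statements and therefore several MapReduce jobs. And this is not the only optimization of this kind that Pig performs.</p>
<h3 id="2-pig对控制语句支持的缺失"><a href="#2-pig对控制语句支持的缺失" class="headerlink" title="2. pig对控制语句支持的缺失"></a>2. Pig's lack of control-flow statements</h3><p><strong>Description</strong>: Pig has no control-flow statements like those of a high-level programming language, so you cannot write a Pig script that controls the execution flow of its own statements. Pig simply parses the statements one by one into an execution plan and then executes it.</p>
<p><strong>Conclusion</strong>: The missing control flow just reflects a different focus in Pig's design. It can be worked around by driving Pig from another scripting language, for example Pig + shell or Pig + Python.</p>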
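<p>A minimal sketch of the Pig + Python idea: the loop and the condition live in Python, and Pig is only invoked for the runs that are actually needed. The script name report.pig, the DATE parameter, and the helper names are hypothetical; the sketch only builds the command lines (using Pig's -x and -param options) and prints them instead of executing anything.</p>

```python
def build_pig_command(script, params, local=False):
    """Build the argument list for one pig invocation."""
    cmd = ["pig"]
    if local:
        cmd += ["-x", "local"]  # run in local mode
    for key in sorted(params):
        cmd += ["-param", f"{key}={params[key]}"]  # parameter substitution
    cmd.append(script)
    return cmd

def plan_runs(dates, done):
    """Control flow in Python: skip the dates that already have output."""
    return [build_pig_command("report.pig", {"DATE": d}) for d in dates if d not in done]

commands = plan_runs(["20161201", "20161202", "20161203"], done={"20161202"})
for cmd in commands:
    # To really execute a run, pass cmd to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```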
<h3 id="3-pig中有用的关键字describe"><a href="#3-pig中有用的关键字describe" class="headerlink" title="3. pig中有用的关键字describe"></a>3. The useful describe keyword</h3><p><strong>Description</strong>: Once you understand Pig's basic syntax and usage, use describe often. It not only helps solve problems in Pig scripts but also helps you understand how Pig processes data.</p>
<p>(To be continued…)</p>
<h2 id="Pig学习过程中参考的资料"><a href="#Pig学习过程中参考的资料" class="headerlink" title="Pig学习过程中参考的资料"></a>References used while learning Pig</h2><p><a href="http://pig.apache.org/docs/r0.16.0/index.html" target="_blank" rel="external">Official documentation for 0.16.0</a></p>
<p>Programming Pig<br>Hadoop: The Definitive Guide, 3rd edition<br><a href="https://www.zybuluo.com/BrandonLin/note/449340" target="_blank" rel="external">Pig完全入门</a></p>
]]></content>
    
    <summary type="html">
    
      &lt;h2 id=&quot;环境&quot;&gt;&lt;a href=&quot;#环境&quot; class=&quot;headerlink&quot; title=&quot;环境&quot;&gt;&lt;/a&gt;Environment&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Pig version 0.16.0&lt;/li&gt;
&lt;li&gt;JDK version 1.7&lt;/li&gt;
&lt;li&gt;archlinux x64 operating system&lt;/l
    
    </summary>
    
      <category term="大数据" scheme="https://zyb.github.io/categories/%E5%A4%A7%E6%95%B0%E6%8D%AE/"/>
    
    
      <category term="pig" scheme="https://zyb.github.io/tags/pig/"/>
    
  </entry>
  
  <entry>
    <title>Compiling CDH5.7.2 Hadoop from Source</title>
    <link href="https://zyb.github.io/2016/12/10-cdh-hadoop-compile.html"/>
    <id>https://zyb.github.io/2016/12/10-cdh-hadoop-compile.html</id>
    <published>2016-12-10T07:50:49.000Z</published>
    <updated>2016-12-10T10:01:16.464Z</updated>
    
    <content type="html"><![CDATA[<h2 id="一、环境"><a href="#一、环境" class="headerlink" title="一、环境"></a>1. Environment</h2><ul>
<li>The Hadoop source is Cloudera's 5.7.2 release, which is based on the official Hadoop 2.6.0 release</li>
<li>JDK version 1.7</li>
<li>archlinux x64 operating system</li>
</ul>
<h2 id="二、CDH5-Hadoop源码下载"><a href="#二、CDH5-Hadoop源码下载" class="headerlink" title="二、CDH5 Hadoop源码下载"></a>2. Downloading the CDH5 Hadoop source</h2><p>CDH is Cloudera's collection of open-source components, and Hadoop is only one of them. Cloudera does not maintain the source on GitHub or a similar service; instead it provides a source package for every CDH release, all under <a href="http://archive.cloudera.com/xxxx" target="_blank" rel="external">http://archive.cloudera.com/xxxx</a>, where 'xxxx' stands for the specific subpath, as follows:</p>
<p>CDH3 and earlier releases are under <a href="http://archive.cloudera.com/cdh" target="_blank" rel="external">http://archive.cloudera.com/cdh</a><br>CDH4 releases are under <a href="http://archive.cloudera.com/cdh4" target="_blank" rel="external">http://archive.cloudera.com/cdh4</a><br>CDH5 releases are under <a href="http://archive.cloudera.com/cdh5" target="_blank" rel="external">http://archive.cloudera.com/cdh5</a></p>
<p>Download CDH5.7.2 Hadoop:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.7.2-src.tar.gz</div></pre></td></tr></table></figure>
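<h2 id="三、CDH5-Hadoop源码编译"><a href="#三、CDH5-Hadoop源码编译" class="headerlink" title="三、CDH5 Hadoop源码编译"></a>3. Compiling the CDH5 Hadoop source</h2><p>Hadoop in CDH5.7.2 depends on protobuf 2.5.0. There are two ways to provide it:</p>
<ol>
<li>Replace the system protobuf with version 2.5.0.</li>
<li>If the development machine still needs another protobuf version and you do not want to replace the system one, install protobuf 2.5.0 outside the system directories: either build it yourself from the source on GitHub or download a prebuilt 2.5.0. In both cases, if the protobuf home directory is PROTOBUF_HOME, protoc must be located at 'PROTOBUF_HOME/bin/protoc'; this is required later when running Cloudera's build script. To build it yourself, follow the steps below:</li>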
</ol>
<h4 id="protobuf编译"><a href="#protobuf编译" class="headerlink" title="protobuf编译"></a>Building protobuf</h4><p>protobuf has to be built by hand. Replace %PROTOBUF_COMPILE_PATH% below with your own protobuf directory:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">$ <span class="built_in">cd</span> %PROTOBUF_COMPILE_PATH%</div><div class="line">$ wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz</div><div class="line">$ tar zxf protobuf-2.5.0.tar.gz</div><div class="line">$ <span class="built_in">cd</span> protobuf-2.5.0</div><div class="line">$ ./configure --prefix=%PROTOBUF_COMPILE_PATH%/protobuf-2.5.0 <span class="comment"># Setting prefix is essential; otherwise 'make install' replaces the system protoc</span></div><div class="line">$ make</div><div class="line">$ make install</div></pre></td></tr></table></figure>
<p>The resulting protoc is located at %PROTOBUF_COMPILE_PATH%/protobuf-2.5.0/bin/protoc</p>
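<h4 id="Hadoop源码编译"><a href="#Hadoop源码编译" class="headerlink" title="Hadoop源码编译"></a>Building the Hadoop source</h4><blockquote>
<ul>
<li>Besides the JDK and protoc, building Hadoop also requires maven and ant.</li>
<li>Before building, comment out the second line, 'set -xe', in both ./build.sh and ./lib.sh; it is a debugging option set in the shell scripts.</li>
<li><strong>Important</strong>: For this build I changed 'MAVEN_FLAGS' in lib.sh. The hadoop-mapreduce-client-nativetask project failed to build on my machine, so the '-Pnative' flag had to be removed. I will add notes once a native build succeeds.</li>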
</ul>
</blockquote>
<p>Copy the downloaded Hadoop source package to the build directory. In my local environment, the protobuf 2.5.0 directory sits at the same level as the 'hadoop-2.6.0-cdh5.7.2' directory. Build Hadoop with the following commands:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line">$ tar zxf hadoop-2.6.0-cdh5.7.2-src.tar.gz</div><div class="line">$ <span class="built_in">cd</span> hadoop-2.6.0-cdh5.7.2</div><div class="line">$ <span class="built_in">cd</span> cloudera</div><div class="line">$ <span class="comment"># ./build.sh --protobuf-home=../../protobuf-2.5.0 # This approach failed for me; I downgraded the system protobuf from 3.0 to 2.5.0 instead, after which the protobuf-home option is no longer needed</span></div><div class="line">$ ./build.sh</div></pre></td></tr></table></figure>
<p>(To be continued…)</p>
<h2 id="四、参考资料"><a href="#四、参考资料" class="headerlink" title="四、参考资料"></a>4. References</h2><p>None</p>
<blockquote>
<p>The hadoop-mapreduce-client-nativetask project fails to build in native mode. I found the same problem reported on GitHub at <a href="https://github.com/protegeproject/protege/issues/514" target="_blank" rel="external">https://github.com/protegeproject/protege/issues/514</a> and replied there; I am waiting to hear whether the reporter has solved it.</p>
</blockquote>
]]></content>
    
    <summary type="html">
    
      &lt;h2 id=&quot;一、环境&quot;&gt;&lt;a href=&quot;#一、环境&quot; class=&quot;headerlink&quot; title=&quot;一、环境&quot;&gt;&lt;/a&gt;1. Environment&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;The Hadoop source is Cloudera's 5.7.2 release, which is based on the official Hadoop 2.6.0 release&lt;/li&gt;
&lt;li
    
    </summary>
    
      <category term="大数据" scheme="https://zyb.github.io/categories/%E5%A4%A7%E6%95%B0%E6%8D%AE/"/>
    
    
      <category term="hadoop" scheme="https://zyb.github.io/tags/hadoop/"/>
    
  </entry>
  
  <entry>
    <title>Jstorm Web UI Installation and Deployment</title>
    <link href="https://zyb.github.io/2016/12/09-jstorm-webui-install.html"/>
    <id>https://zyb.github.io/2016/12/09-jstorm-webui-install.html</id>
    <published>2016-12-09T07:56:15.000Z</published>
    <updated>2016-12-09T09:45:41.456Z</updated>
    
    <content type="html"><![CDATA[<h2 id="一、环境"><a href="#一、环境" class="headerlink" title="一、环境"></a>1. Environment</h2><ul>
<li>Jstorm 2.1.1, the current stable release</li>
<li>JDK 1.7</li>
<li>archlinux x64 operating system</li>
<li>For installing and deploying the Jstorm service itself, see <a href="/2016/12/07-jstorm-install.html">Building, Configuring, and Deploying Jstorm from Source</a></li>
</ul>
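<h2 id="二、概述"><a href="#二、概述" class="headerlink" title="二、概述"></a>2. Overview</h2><p>The WebUI is installed and deployed completely independently of JStorm, and the WebUI machine does not have to be one of the Jstorm machines. One WebUI can manage several clusters; just add each new cluster to the WebUI configuration file.</p>
<p><strong>Note</strong>: The WebUI version must match the highest Jstorm version among the clusters.</p>
<h2 id="三、Jstorm-WebUI配置"><a href="#三、Jstorm-WebUI配置" class="headerlink" title="三、Jstorm WebUI配置"></a>3. Jstorm WebUI configuration</h2><p>Before installing the Jstorm WebUI components, it is worth explaining how the configuration works.</p>
<p>The Jstorm WebUI reads its configuration from ~/.jstorm/storm.yaml in the user's home directory. Judging from the WebUI source, this path is hard-coded and cannot be changed. (Rather frustrating…)</p>
<p>The official approach is to copy conf/storm.yaml from the nimbus server to ~/.jstorm/storm.yaml. That completes the WebUI configuration for a single Jstorm cluster.</p>
<p>For multiple clusters, append the other clusters' settings to ~/.jstorm/storm.yaml, as follows:</p>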
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div></pre></td><td class="code"><pre><div class="line"># UI MultiCluster</div><div class="line"># Following is an example of multicluster UI configuration</div><div class="line"> ui.clusters:</div><div class="line">     - &#123;</div><div class="line">         name: &quot;zjstorm&quot;,</div><div class="line">         zkRoot: &quot;/zjstorm&quot;,</div><div class="line">         zkServers:</div><div class="line">             [ &quot;127.0.0.1&quot;],</div><div class="line">         zkPort: 2181,</div><div class="line">       &#125;</div><div class="line">     - &#123;</div><div class="line">         name: &quot;jstorm.o&quot;,</div><div class="line">         zkRoot: &quot;/jstorm.other&quot;,</div><div class="line">         zkServers:</div><div class="line">             [&quot;zk.test1.com&quot;, &quot;zk.test2.com&quot;, &quot;zk.test3.com&quot;],</div><div class="line">         zkPort: 2181,</div><div class="line">       &#125;</div></pre></td></tr></table></figure>
<blockquote>
<p><strong>name</strong>: the cluster name; <strong>it must be unique</strong>.<br><strong>zkRoot</strong>: corresponds to 'storm.zookeeper.root' in $JSTORM_HOME/conf/storm.yaml.<br><strong>zkServers</strong>: the list of machines in the Zookeeper cluster.<br><strong>zkPort</strong>: the Zookeeper cluster port.</p>
</blockquote>
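<p><strong>Note</strong>: The cluster entries under 'ui.clusters' do not include the cluster already described by the rest of the configuration file; they only list the clusters added afterwards. I tried this configuration, and the result is visible in the WebUI.</p>
<p>Everything above is the officially documented approach. I also tried an alternative that works: put only the multi-cluster settings in the configuration file. For example, with a single Jstorm cluster, ~/.jstorm/storm.yaml only needs the following:</p>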
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line"># UI MultiCluster</div><div class="line"> ui.clusters:</div><div class="line">     - &#123;</div><div class="line">         name: &quot;z.jstorm&quot;,</div><div class="line">         zkRoot: &quot;/zjstorm&quot;,</div><div class="line">         zkServers:</div><div class="line">             [ &quot;127.0.0.1&quot;],</div><div class="line">         zkPort: 2181,</div><div class="line">       &#125;</div></pre></td></tr></table></figure>
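<p>This is easy to understand: these settings are all the Jstorm WebUI actually needs.</p>
<h2 id="四、Jstorm-WebUI安装"><a href="#四、Jstorm-WebUI安装" class="headerlink" title="四、Jstorm WebUI安装"></a>4. Installing the Jstorm WebUI</h2><p>The Jstorm WebUI runs on Tomcat, so install Tomcat first. The official documentation says version 7.x is required without explaining why; the current stable 7.x release is 7.0.73.</p>
<p>Download Tomcat:</p>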
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ wget http://mirrors.hust.edu.cn/apache/tomcat/tomcat-7/v7.0.73/bin/apache-tomcat-7.0.73.tar.gz</div></pre></td></tr></table></figure>
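<p>Where to get jstorm-ui-2.1.1.war, which the 2.1.1 Jstorm WebUI requires:</p>
<blockquote>
<ul>
<li>Download it from the official site; the Jstorm zip package available there contains jstorm-ui-2.1.1.war.</li>
<li>If you built Jstorm yourself following <a href="/2016/12/07-jstorm-install.html">Building, Configuring, and Deploying Jstorm from Source</a>, the build output directory and the zip package both contain jstorm-ui-2.1.1.war.</li>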
</ul>
</blockquote>
<p>To install the Jstorm WebUI, copy the downloaded Tomcat archive to the deployment directory and run the following commands:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">$ tar zxf apache-tomcat-7.0.73.tar.gz</div><div class="line">$ <span class="built_in">cd</span> apache-tomcat-7.0.73/webapps</div><div class="line">$ cp <span class="variable">$JSTORM_HOME</span>/jstorm-ui-2.1.1.war ./</div><div class="line">$ mv ROOT ROOT.old</div><div class="line">$ ln <span class="_">-s</span> jstorm-ui-2.1.1 ROOT</div><div class="line">$ <span class="built_in">cd</span> ..</div></pre></td></tr></table></figure>
<p>At this point the Jstorm WebUI is installed; configure it as described in the previous section.</p>
<h2 id="五、启动Jstorm-WebUI"><a href="#五、启动Jstorm-WebUI" class="headerlink" title="五、启动Jstorm WebUI"></a>5. Starting the Jstorm WebUI</h2><p>Starting the Jstorm WebUI just means starting Tomcat. From the Tomcat home directory, run:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ ./bin/startup.sh</div></pre></td></tr></table></figure>
<p>If Tomcat is deployed on the local machine, open <a href="http://127.0.0.1:8080" target="_blank" rel="external">http://127.0.0.1:8080</a> to see the Jstorm WebUI. Admittedly the interface is rather ugly and basic, offering only a few essential features.</p>
<p>Stopping the Jstorm WebUI and other such tasks are ordinary Tomcat operations; look them up as needed.</p>
<h2 id="六、参考资料"><a href="#六、参考资料" class="headerlink" title="六、参考资料"></a>6. References</h2><p><a href="http://jstorm.io/quickstart_cn/Deploy/WebUI.html" target="_blank" rel="external">Jstorm QuickStart Deploy WebUI</a></p>
]]></content>
    
    <summary type="html">
    
      &lt;h2 id=&quot;一、环境&quot;&gt;&lt;a href=&quot;#一、环境&quot; class=&quot;headerlink&quot; title=&quot;一、环境&quot;&gt;&lt;/a&gt;1. Environment&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Jstorm 2.1.1, the current stable release&lt;/li&gt;
&lt;li&gt;JDK 1.7&lt;/li&gt;
&lt;li&gt;archlinux x
    
    </summary>
    
      <category term="大数据" scheme="https://zyb.github.io/categories/%E5%A4%A7%E6%95%B0%E6%8D%AE/"/>
    
    
      <category term="jstorm" scheme="https://zyb.github.io/tags/jstorm/"/>
    
  </entry>
  
  <entry>
    <title>Building, Configuring, and Deploying Jstorm from Source</title>
    <link href="https://zyb.github.io/2016/12/07-jstorm-install.html"/>
    <id>https://zyb.github.io/2016/12/07-jstorm-install.html</id>
    <published>2016-12-07T12:40:57.000Z</published>
    <updated>2016-12-13T09:05:35.378Z</updated>
    
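    <content type="html"><![CDATA[<p>Although this post only covers building, configuring, and deploying Jstorm, a basic understanding of Storm is still required.</p>
<h2 id="一、环境"><a href="#一、环境" class="headerlink" title="一、环境"></a>1. Environment</h2><ul>
<li>Jstorm 2.1.1, the current stable release</li>
<li>JDK 1.7</li>
<li>archlinux x64 operating system</li>
<li>Jstorm depends on Zookeeper; see <a href="/2016/12/06-zookeeper-install.html">Building, Configuring, and Deploying Zookeeper from Source</a></li>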
</ul>
<h2 id="二、Jstorm编译"><a href="#二、Jstorm编译" class="headerlink" title="二、Jstorm编译"></a>2. Building Jstorm</h2><h3 id="源码下载"><a href="#源码下载" class="headerlink" title="源码下载"></a>Downloading the source</h3><p>Download the Jstorm source from GitHub:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ git <span class="built_in">clone</span> https://github.com/alibaba/jstorm.git</div></pre></td></tr></table></figure>
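<h3 id="源码编译"><a href="#源码编译" class="headerlink" title="源码编译"></a>Building the source</h3><p>Jstorm is written in Java and uses maven for package management, so building is straightforward: the maven command below compiles and packages everything (the current stable release is 2.1.1, so first check out the stable release with git). The resulting package is 'aloha-tgz.zip' under target; unpacking it gives you Jstorm. The directory that gets zipped is target/aloha-tgz/jstorm-{version}, which is Jstorm itself.</p>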
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">$ <span class="built_in">cd</span> jstorm</div><div class="line">$ git checkout 2.1.1</div><div class="line">$ mvn clean package assembly:assembly -Dmaven.test.skip=<span class="literal">true</span></div></pre></td></tr></table></figure>
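<h2 id="三、Jstorm安装"><a href="#三、Jstorm安装" class="headerlink" title="三、Jstorm安装"></a>3. Installing Jstorm</h2><p>Copy the zip package built in the previous step to the deployment directory and unpack it; that completes the installation. (The zip package of a stable release can also be downloaded directly from the official site.)</p>
<h2 id="四、Jstorm配置"><a href="#四、Jstorm配置" class="headerlink" title="四、Jstorm配置"></a>4. Configuring Jstorm</h2><p>According to the official documentation, the first step is to set the JSTORM_HOME environment variable, but judging from the bin/jstorm startup script this variable is not actually needed.</p>
<blockquote>
<ul>
<li>bin/jstorm is a Python script that serves as Jstorm's startup script. In 2.1.1 it derives JSTORM_HOME from its own absolute path, so the environment variable is never used by the startup script.</li>
<li>When a Jstorm service starts, bin/jstorm passes the Jstorm home directory to it as a parameter named 'jstorm.home'. When the service reads storm.yaml, it replaces every occurrence of the string %JSTORM_HOME% in the configuration with the value of 'jstorm.home'. So %JSTORM_HOME% can still be used in the configuration file even when JSTORM_HOME is not set.</li>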
</ul>
</blockquote>
<p>Next comes the main configuration file, conf/storm.yaml. The storm.yaml for my test environment is:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">storm.zookeeper.servers:</div><div class="line">    - &quot;localhost&quot;</div><div class="line">storm.zookeeper.root: &quot;/zjstorm&quot;</div><div class="line">storm.local.dir: &quot;%JSTORM_HOME%/data&quot;</div><div class="line">nimbus.host: 127.0.0.1</div><div class="line">nimbus.host.start.supervisor: true</div><div class="line">supervisor.slots.ports.base: 56800</div></pre></td></tr></table></figure>
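<p><strong>Configuration notes:</strong></p>
<blockquote>
<p><strong>storm.zookeeper.servers</strong>: the Zookeeper IP addresses, without ports; for a Zookeeper cluster, list every IP.<br><strong>storm.zookeeper.port</strong>: the Zookeeper port; defaults to 2181.<br><strong>storm.zookeeper.root</strong>: JStorm's root node in Zookeeper. Set this when several JStorm clusters share one Zookeeper; the default is “/jstorm”. (In production, configure a dedicated name to leave room for later expansion.)<br><strong>storm.local.dir</strong>: the directory for JStorm's temporary data. The Nimbus and Supervisor processes store a small amount of state there, such as jars and confs; JStorm must have write permission on this directory.<br><strong>jstorm.log.dir</strong>: the JStorm log directory; defaults to $JSTORM_HOME/logs.<br><strong>nimbus.host</strong>: the address of the Nimbus node, used for downloading the jars, confs, and other files of topologies; only an IP is supported, not a domain name. (It may be left unset; once the nimbus node has started, it is discovered through Zookeeper.)<br><strong>nimbus.host.start.supervisor</strong>: whether a supervisor service may be started on the nimbus node. (This setting appears to be used only by the bin/start.sh script, in which case it has no effect unless that script is used; whether anything else reads it remains to be confirmed.)<br><strong>supervisor.slots.ports</strong>: the ports that workers on a Supervisor node may use. Each worker has exclusive use of one port for receiving messages, so this setting also determines how many workers can run. If it is empty, the number of ports is computed from the machine's CPU count and memory size, and ports are assigned incrementally starting from the configured base port.<br><strong>supervisor.slots.ports.base</strong>: the base port for workers on a Supervisor node. If the ports are not listed explicitly, they are assigned incrementally from this port. The default is 6800.<br><strong>supervisor.slots.port.cpu.weight</strong>: the weight for deriving the worker count from the CPU count. "CPU count / this value" gives a candidate worker count; the smaller of this and the memory-based candidate is used.<br><strong>supervisor.slots.port.mem.weight</strong>: the weight for deriving the worker count from memory. "memory size / this value" gives a candidate worker count; the smaller of this and the CPU-based candidate is used.<br><strong>worker.memory.size</strong>: the memory size of each worker, in bytes.<br><strong>java.library.path</strong>: presumably the Java library path used by Jstorm at runtime; I have not looked into it closely, and it is described in the official configuration.<br>The remaining settings are not covered here because I have not studied them yet; all of them are described in the official configuration.</p>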
</blockquote>
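<p>The port calculation described for an empty supervisor.slots.ports can be sketched as follows. This is only an illustration of the rule stated above (take the smaller of the CPU-based and memory-based candidates, then count up from the base port). The rounding, the minimum of one worker, the step of 1 between ports, the memory unit, and the function name estimate_ports are all assumptions, not taken from the Jstorm source.</p>

```python
def estimate_ports(cpu_count, memory_units, cpu_weight, mem_weight, base_port=6800):
    """Return the worker ports a supervisor would use under the rule described above."""
    by_cpu = int(cpu_count / cpu_weight)     # candidate worker count from CPUs
    by_mem = int(memory_units / mem_weight)  # candidate worker count from memory
    workers = max(1, min(by_cpu, by_mem))    # assumption: always at least one worker
    # Assumption: ports increase by 1 starting from the base port.
    return [base_port + i for i in range(workers)]

# Hypothetical machine: 8 CPUs, 16 units of memory, one worker per CPU and per 4 memory units.
ports = estimate_ports(cpu_count=8, memory_units=16, cpu_weight=1.0, mem_weight=4.0, base_port=56800)
print(ports)  # [56800, 56801, 56802, 56803]
```

<p>Here memory is the limiting factor (4 candidates against 8 from the CPUs), so four ports are produced, starting from the base port 56800 used in the sample storm.yaml.</p>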
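<h2 id="五、Jstorm运行"><a href="#五、Jstorm运行" class="headerlink" title="五、Jstorm运行"></a>5. Running Jstorm</h2><p>A JStorm cluster contains two kinds of nodes: the master node (Nimbus) and worker nodes (Supervisor). Their roles are:</p>
<ul>
<li>Nimbus distributes code within the Storm cluster, assigns tasks to worker machines, and monitors the state of the cluster.</li>
<li>Each worker node runs a Supervisor, which listens for the tasks Nimbus assigns to it and starts or stops worker processes accordingly.</li>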
</ul>
<p><img src="/uploads/jstorm-framework.png" alt="Jstorm框架"></p>
<ul>
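<li>ZooKeeper: the coordinator of the system</li>
<li>Nimbus: the scheduler</li>
<li>Supervisor: the agent for Workers, responsible for starting and killing Workers</li>
<li>Worker: a JVM process, the container for Tasks</li>
<li>Task: a thread, the executor of the work</li>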
</ul>
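<p><strong>The most important setting before startup</strong>:<br>In /etc/hosts, map the current hostname to the machine's IP so that 'hostname -i' returns the real local IP rather than 127.0.0.1. Otherwise Jstorm cannot obtain the local IP or host and fails to start, because this information is required when Jstorm registers itself in Zookeeper, and Jstorm also uses it in several other places internally.</p>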
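<p>A quick way to check this before starting the services is to resolve the local hostname and make sure the result is not a loopback address. The sketch below does this with Python's standard library; it approximates what 'hostname -i' reports and is not how Jstorm itself performs the lookup. The function names are made up for illustration.</p>

```python
import ipaddress
import socket

def is_usable_address(ip):
    """True when ip is not a loopback address such as 127.0.0.1."""
    return not ipaddress.ip_address(ip).is_loopback

def check_local_host():
    """Resolve the local hostname and report whether the result looks usable."""
    hostname = socket.gethostname()
    try:
        ip = socket.gethostbyname(hostname)
    except OSError:
        # The hostname does not resolve at all.
        return hostname, None, False
    return hostname, ip, is_usable_address(ip)

hostname, ip, ok = check_local_host()
print(hostname, ip, "OK" if ok else "fix /etc/hosts before starting Jstorm")
```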
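<p>Start nimbus with the command below, then check %JSTORM_HOME%/logs/nimbus.log for errors (this is the default log path; if a separate log path is configured, look in that directory instead).</p>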
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ nohup bin/jstorm nimbus &gt; /dev/null 2&gt;&amp;1 &amp;</div></pre></td></tr></table></figure>
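<p>Start the supervisor with the command below, then check %JSTORM_HOME%/logs/supervisor.log for errors (this is the default log path; if a separate log path is configured, look in that directory instead).</p>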
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ nohup bin/jstorm supervisor &gt; /dev/null 2&gt;&amp;1 &amp;</div></pre></td></tr></table></figure>
<p>Command to stop the nimbus service:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ ps -ef | grep NimbusServer | grep -v grep | awk <span class="string">'&#123;print $2&#125;'</span> | xargs <span class="built_in">kill</span></div></pre></td></tr></table></figure>
<p>Command to stop the supervisor service:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ ps -ef | grep Supervisor | grep -v grep | awk <span class="string">'&#123;print $2&#125;'</span> | xargs <span class="built_in">kill</span></div></pre></td></tr></table></figure>
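<h4 id="Jstorm-2-1-1版本的一个bug："><a href="#Jstorm-2-1-1版本的一个bug：" class="headerlink" title="Jstorm 2.1.1版本的一个bug："></a>A bug in Jstorm 2.1.1:</h4><ul>
<li>When Jstorm fails to start (for example because the host IP resolves to 127.0.0.1), it exits without deleting the pid file it wrote under the data directory at startup. On the next start, Jstorm finds the pid file from the previous run and tries to kill the process with that pid. That Jstorm process no longer exists, though, and given how Linux allocates process IDs, the pid may have been reassigned to some other process, which then gets killed for no reason of its own. Is the pid file deletion perhaps not handled in a shutdown hook?</li>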
</ul>
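<p>One way to guard against killing an unrelated process is to check what the recorded pid currently refers to before sending a signal. The sketch below illustrates that idea only; it is not Jstorm's code, the function names are hypothetical, and the /proc lookup is Linux-specific. The marker 'NimbusServer' is the same string the stop command above greps for.</p>

```python
def read_proc_cmdline(pid):
    """Linux-only helper: command line of a live process, or None if it is gone."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            return f.read().replace(b"\0", b" ").decode(errors="replace")
    except OSError:
        return None

def should_kill(pid, expected_marker, read_cmdline=read_proc_cmdline):
    """Kill only if the process behind pid still looks like the expected service."""
    cmdline = read_cmdline(pid)
    if cmdline is None:  # no such process: the pid file is stale
        return False
    return expected_marker in cmdline

# Demonstration with a fake process table instead of the real /proc (hypothetical entries).
fake_table = {101: "java -cp jstorm.jar NimbusServer", 202: "vim notes.txt"}
print(should_kill(101, "NimbusServer", fake_table.get))  # True: still the old service
print(should_kill(202, "NimbusServer", fake_table.get))  # False: pid was reused by another program
print(should_kill(303, "NimbusServer", fake_table.get))  # False: process no longer exists
```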
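<h4 id="Jstorm-2-1-1中易用性问题："><a href="#Jstorm-2-1-1中易用性问题：" class="headerlink" title="Jstorm 2.1.1中易用性问题："></a>Usability issues in Jstorm 2.1.1:</h4><ul>
<li>Jstorm requires the local IP to be configured in the machine's hosts file so it can register with zk. That is usually fine in production, but in a single-machine test environment where the IP is assigned by DHCP rather than fixed, the hosts file has no such entry or maps the hostname to localhost, and Jstorm fails to start. Ideally this would be configurable, with reading from hosts as the default.</li>
<li>I later found that Jstorm obtains the local IP this way in many places, which is indeed a nuisance in a DHCP test environment.</li>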
</ul>
<h2 id="六、参考资料"><a href="#六、参考资料" class="headerlink" title="六、参考资料"></a>6. References</h2><p><a href="http://jstorm.io/quickstart_cn/Compile.html" target="_blank" rel="external">Jstorm QuickStart Compile</a><br><a href="http://jstorm.io/quickstart_cn/Deploy/Standalone.html" target="_blank" rel="external">Jstorm QuickStart Deploy</a><br><a href="https://github.com/alibaba/jstorm/wiki/JStorm-basics-in-5-min" target="_blank" rel="external">Jstorm Basic in 5 minutes</a><br><a href="http://jstorm.io/Maintenance/Configuration.html" target="_blank" rel="external">Jstorm Configuration</a></p>
<p>(End)</p>
]]></content>
    
    <summary type="html">
    
      &lt;p&gt;Although this post only covers building, configuring, and deploying Jstorm, a basic understanding of Storm is still required.&lt;/p&gt;
&lt;h2 id=&quot;一、环境&quot;&gt;&lt;a href=&quot;#一、环境&quot; class=&quot;headerlink&quot; title=&quot;一、环境&quot;&gt;&lt;/a&gt;1. Environment&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Jstorm使
    
    </summary>
    
      <category term="大数据" scheme="https://zyb.github.io/categories/%E5%A4%A7%E6%95%B0%E6%8D%AE/"/>
    
    
      <category term="jstorm" scheme="https://zyb.github.io/tags/jstorm/"/>
    
  </entry>
  
  <entry>
    <title>Building, Configuring, and Deploying Zookeeper from Source</title>
    <link href="https://zyb.github.io/2016/12/06-zookeeper-install.html"/>
    <id>https://zyb.github.io/2016/12/06-zookeeper-install.html</id>
    <published>2016-12-06T12:26:13.000Z</published>
    <updated>2016-12-09T06:59:16.958Z</updated>
    
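    <content type="html"><![CDATA[<p>Although this post only covers building, configuring, and deploying Zookeeper, a basic understanding of Zookeeper is still required.</p>
<h2 id="一、环境"><a href="#一、环境" class="headerlink" title="一、环境"></a>1. Environment</h2><ul>
<li>Zookeeper is built from the 3.4.9 tag in git, the current stable release</li>
<li>JDK 1.7</li>
<li>archlinux x64 operating system</li>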
</ul>
<h2 id="二、Zookeeper源码编译"><a href="#二、Zookeeper源码编译" class="headerlink" title="二、Zookeeper源码编译"></a>2. Building Zookeeper from source</h2><h3 id="源码下载"><a href="#源码下载" class="headerlink" title="源码下载"></a>Downloading the source</h3><p>Download the Zookeeper source from GitHub:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ git <span class="built_in">clone</span> https://github.com/apache/zookeeper.git</div></pre></td></tr></table></figure>
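<h3 id="源码编译"><a href="#源码编译" class="headerlink" title="源码编译"></a>Building the source</h3><p>Zookeeper uses ant to manage the project, so it is built with ant. The commands below produce 'zookeeper-{version}.tar.gz' in the build folder, which is the final build artifact. Running only 'ant package' does not create the .tar.gz; it only creates the 'zookeeper-{version}' directory, which is the built Zookeeper.</p>
<p>Enter the zookeeper directory and check out the stable release tag, currently release-3.4.9.</p>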
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">$ <span class="built_in">cd</span> zookeeper</div><div class="line">$ git checkout release-3.4.9</div><div class="line">$ ant package tar</div></pre></td></tr></table></figure>
<p>To open the project in eclipse, generate an eclipse project with the commands below, then bring Zookeeper into eclipse by importing it as an existing eclipse project:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">$ <span class="built_in">cd</span> zookeeper</div><div class="line">$ ant eclipse</div></pre></td></tr></table></figure>
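<h2 id="三、Zookeeper部署"><a href="#三、Zookeeper部署" class="headerlink" title="三、Zookeeper部署"></a>3. Deploying Zookeeper</h2><p>Copy the built zookeeper directory or the .tar.gz package to the target directory to complete the deployment. For a multi-machine (or multi-process) deployment, copy it to each target directory. (If you do not want to build it yourself, the .tar.gz package from the previous step can be downloaded directly from the official site.)</p>
<h2 id="四、Zookeeper配置"><a href="#四、Zookeeper配置" class="headerlink" title="四、Zookeeper配置"></a>4. Configuring Zookeeper</h2><p>The single-instance and multi-machine configurations are described separately below.</p>
<p>Zookeeper's main configuration file is zoo.cfg in the conf directory.</p>
<p>On Linux, the startup script is zkServer.sh in the bin directory, and the environment script is zkEnv.sh in the same directory. These two scripts are used to start the service and normally need no changes; the file to edit is zoo.cfg.</p>
<h3 id="zookeeper单实例配置"><a href="#zookeeper单实例配置" class="headerlink" title="zookeeper单实例配置"></a>Single-instance configuration</h3><p>zoo.cfg is very simple. A single-instance configuration looks like this:</p>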
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">tickTime=2000 </div><div class="line">dataDir=./zoodata</div><div class="line">clientPort=2181</div></pre></td></tr></table></figure>
<blockquote>
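<p><strong>tickTime</strong>: the heartbeat interval between Zookeeper servers and clients, in milliseconds; a heartbeat is sent every tickTime.<br><strong>dataDir</strong>: the directory where Zookeeper stores its data; by default the transaction logs are written there as well.<br><strong>clientPort</strong>: the port on which clients connect to the Zookeeper server.</p>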
</blockquote>
<h3 id="zookeeper集群配置"><a href="#zookeeper集群配置" class="headerlink" title="zookeeper集群配置"></a>Cluster configuration</h3><p>A cluster configuration only needs a few settings on top of the single-instance one:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">initLimit=10</div><div class="line">syncLimit=5</div><div class="line">server.1=192.168.0.1:2888:3888</div><div class="line">server.2=192.168.0.2:2888:3888</div></pre></td></tr></table></figure>
<blockquote>
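<p><strong>initLimit</strong>: the maximum number of heartbeat intervals Zookeeper tolerates while a client makes its initial connection (the client here is not a user client connecting to the Zookeeper server but a Follower server in the ensemble connecting to the Leader). If the Zookeeper server has not heard back from the client after 10 heartbeat intervals (that is, 10 tickTime periods), the connection is considered failed. The total time is 10x2000 ms = 20 seconds.<br><strong>syncLimit</strong>: the maximum number of tickTime periods allowed for a request and its reply between the Leader and a Follower. The total time is 5x2000 ms = 10 seconds.<br><strong>server.N=A:P1:P2</strong>: N is a number identifying the server; A is the server's IP address; P1 is the port this server uses to exchange information with the Leader; P2 is the port the servers use to talk to each other during an election, which happens when the Leader goes down and a new one has to be chosen. In a pseudo-cluster, where A is the same for every instance, the Zookeeper instances cannot share port numbers, so each must be given different ones.<br><strong>The most important point for a cluster</strong>: besides editing zoo.cfg, you must create a file named 'myid' in the dataDir directory. It contains the value N from this server's server.N entry. Zookeeper reads this file at startup and compares it with the entries in zoo.cfg to determine which server it is.</p>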
</blockquote>
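<p>The arithmetic and the server.N layout above can be checked with a small parser. This is a minimal sketch that only handles the key=value lines shown in this post; parse_zoo_cfg is a made-up helper, not part of Zookeeper.</p>

```python
def parse_zoo_cfg(text):
    """Split zoo.cfg text into plain settings and server.N entries."""
    settings, servers = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        if key.startswith("server."):
            # server.N=A:P1:P2 maps id N to (address, peer port, election port)
            host, peer_port, election_port = value.split(":")
            servers[int(key.split(".", 1)[1])] = (host, int(peer_port), int(election_port))
        else:
            settings[key] = value
    return settings, servers

# The single-instance and cluster settings from this post combined.
CFG = """tickTime=2000
dataDir=./zoodata
clientPort=2181
initLimit=10
syncLimit=5
server.1=192.168.0.1:2888:3888
server.2=192.168.0.2:2888:3888
"""

settings, servers = parse_zoo_cfg(CFG)
tick_ms = int(settings["tickTime"])
init_ms = int(settings["initLimit"]) * tick_ms
sync_ms = int(settings["syncLimit"]) * tick_ms
print(init_ms, sync_ms)  # 20000 10000
print(servers[2])        # ('192.168.0.2', 2888, 3888)
```

<p>With this configuration, the machine at 192.168.0.2 is server 2, so the myid file in its dataDir would contain just the number 2.</p>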
<h2 id="五、Zookeeper启动"><a href="#五、Zookeeper启动" class="headerlink" title="五、Zookeeper启动"></a>5. Starting Zookeeper</h2><p>Zookeeper command: bin/zkServer.sh {start|start-foreground|stop|restart|status|upgrade|print-cmd}</p>
<p>Start Zookeeper:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">bin/zkServer.sh start</div></pre></td></tr></table></figure>
<p>After starting, check whether Zookeeper came up successfully:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">bin/zkServer.sh status</div></pre></td></tr></table></figure>
<p>You can also connect to the server with the bin/zkCli.sh client to inspect it in more detail:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">bin/zkCli.sh</div></pre></td></tr></table></figure>
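<p>Once zkCli is running and connected to the server, it enters zookeeper's interactive command line. Zookeeper has its own set of interactive commands, which are described in the official documentation and not covered here. Because zookeeper stores its data in a structure that resembles a directory tree, it also provides an 'ls' command; the simplest use is 'ls /', which lists the contents of zookeeper's root node.</p>
<p>Stop zookeeper:<br><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">bin/zkServer.sh stop</div></pre></td></tr></table></figure></p>
<p><strong>Two things to note about zookeeper:</strong></p>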
<blockquote>
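<p>1. Zookeeper is fail-fast: the process exits whenever it encounters any error. It is therefore best to run Zookeeper under a supervisor program, so that it is restarted automatically after it exits. See <a href="http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_supervision" target="_blank" rel="external">here</a> for details.<br>2. While running, Zookeeper writes many log and snapshot files under the dataDir directory, and the Zookeeper process itself does not clean up or merge them periodically, so they can take up a large amount of disk space. Unneeded logs and snapshots therefore have to be removed regularly, for example with cron. See <a href="http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_maintenance" target="_blank" rel="external">here</a> for details.</p>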
</blockquote>
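<p>The first note can be approximated with a small watchdog run from cron. This is a minimal sketch, not a replacement for a real process supervisor: ensure_running is a hypothetical helper, /opt/zookeeper is an assumed install path, and the sketch assumes that "zkServer.sh status" exits with a non-zero code when the server is not running. Cleaning up logs and snapshots is better left to the PurgeTxnLog utility described on the maintenance page linked above, so it is not reimplemented here.</p>

```shell
#!/bin/sh
# ensure_running STATUS_CMD START_CMD
# Runs START_CMD only when STATUS_CMD exits non-zero, then reports what
# happened. Both arguments are split on whitespace, so paths containing
# spaces are not supported in this sketch.
ensure_running() {
    if $1 >/dev/null 2>/dev/null; then
        echo "running"
    elif $2 >/dev/null 2>/dev/null; then
        echo "restarted"
    else
        echo "restart failed"
        return 1
    fi
}

# Intended use, e.g. from a script that cron runs every minute
# (the install path is an assumption):
#   ensure_running "/opt/zookeeper/bin/zkServer.sh status" \
#                  "/opt/zookeeper/bin/zkServer.sh start"
```

<p>Passing the two commands as arguments keeps the decision logic separate from Zookeeper itself, so it can be tried out with stand-in commands such as true and false before pointing it at a real installation.</p>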
<h2 id="六、参考资料"><a href="#六、参考资料" class="headerlink" title="六、参考资料"></a>6. References</h2><p><a href="https://zookeeper.apache.org/doc/trunk/zookeeperStarted.html" target="_blank" rel="external">Apache Zookeeper Getting Started</a><br><a href="https://www.ibm.com/developerworks/cn/opensource/os-cn-zookeeper/" target="_blank" rel="external">Zookeeper, a distributed service framework (IBM developerWorks China)</a><br><a href="http://www.cnblogs.com/panfeng412/archive/2012/11/30/how-to-install-and-deploy-storm-cluster.html" target="_blank" rel="external">Storm cluster installation and deployment steps, detailed version (mainly the Zookeeper part)</a></p>
<p>(End)</p>
]]></content>
    
    <summary type="html">
    
      &lt;p&gt;Although this article only covers compiling, configuring and deploying Zookeeper, a basic understanding of Zookeeper is still required.&lt;/p&gt;
&lt;h2 id=&quot;一、环境&quot;&gt;&lt;a href=&quot;#一、环境&quot; class=&quot;headerlink&quot; title=&quot;一、环境&quot;&gt;&lt;/a&gt;1. Environment&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
    
    </summary>
    
      <category term="大数据" scheme="https://zyb.github.io/categories/%E5%A4%A7%E6%95%B0%E6%8D%AE/"/>
    
    
      <category term="zookeeper" scheme="https://zyb.github.io/tags/zookeeper/"/>
    
  </entry>
  
  <entry>
    <title>Hello World</title>
    <link href="https://zyb.github.io/2015/09/06-hello-world.html"/>
    <id>https://zyb.github.io/2015/09/06-hello-world.html</id>
    <published>2015-09-06T10:00:28.000Z</published>
    <updated>2016-12-07T03:32:28.066Z</updated>
    
    <content type="html"><![CDATA[<p>Welcome to <a href="http://hexo.io/" target="_blank" rel="external">Hexo</a>! This is your very first post. Check <a href="http://hexo.io/docs/" target="_blank" rel="external">documentation</a> for more info. If you get any problems when using Hexo, you can find the answer in <a href="http://hexo.io/docs/troubleshooting.html" target="_blank" rel="external">troubleshooting</a> or you can ask me on <a href="https://github.com/hexojs/hexo/issues" target="_blank" rel="external">GitHub</a>.</p>
<h2 id="Quick-Start"><a href="#Quick-Start" class="headerlink" title="Quick Start"></a>Quick Start</h2><h3 id="Create-a-new-post"><a href="#Create-a-new-post" class="headerlink" title="Create a new post"></a>Create a new post</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ hexo new <span class="string">"My New Post"</span></div></pre></td></tr></table></figure>
<p>More info: <a href="http://hexo.io/docs/writing.html" target="_blank" rel="external">Writing</a></p>
<h3 id="Run-server"><a href="#Run-server" class="headerlink" title="Run server"></a>Run server</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ hexo server</div></pre></td></tr></table></figure>
<p>More info: <a href="http://hexo.io/docs/server.html" target="_blank" rel="external">Server</a></p>
<h3 id="Generate-static-files"><a href="#Generate-static-files" class="headerlink" title="Generate static files"></a>Generate static files</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ hexo generate</div></pre></td></tr></table></figure>
<p>More info: <a href="http://hexo.io/docs/generating.html" target="_blank" rel="external">Generating</a></p>
<h3 id="Deploy-to-remote-sites"><a href="#Deploy-to-remote-sites" class="headerlink" title="Deploy to remote sites"></a>Deploy to remote sites</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ hexo deploy</div></pre></td></tr></table></figure>
<p>More info: <a href="http://hexo.io/docs/deployment.html" target="_blank" rel="external">Deployment</a></p>
]]></content>
    
    <summary type="html">
    
      &lt;p&gt;Welcome to &lt;a href=&quot;http://hexo.io/&quot; target=&quot;_blank&quot; rel=&quot;external&quot;&gt;Hexo&lt;/a&gt;! This is your very first post. Check &lt;a href=&quot;http://hexo.io
    
    </summary>
    
      <category term="hello" scheme="https://zyb.github.io/categories/hello/"/>
    
    
      <category term="hexo" scheme="https://zyb.github.io/tags/hexo/"/>
    
  </entry>
  
</feed>