Configuring HBase Secondary Indexes with Solr

I. Overview

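HBase is well suited to storing very large tables. Lookups by a single RowKey are fast, but complex queries, especially paging and total counts, waste compute resources when implemented directly on HBase. A secondary index over the HBase data (HBase Secondary Indexing) can serve those complex queries instead.
Solr is a high-performance full-text search server written in Java and built on Lucene. It extends Lucene with a richer query language, is configurable and extensible, is tuned for query performance, and ships with a full-featured administration UI.
Key-Value Store Indexer is the intermediary that builds Solr indexes from HBase data. In CDH5, Key-Value Store Indexer is the Lily HBase NRT Indexer service.
Lily HBase Indexer is a flexible, scalable, fault-tolerant, transactional, near-real-time distributed service for indexing HBase column data. It is part of the Lily system developed by NGDATA and is open source. Lily HBase Indexer stores the HBase index data in SolrCloud. When HBase performs a write, update, or delete, the Indexer uses HBase replication to turn those operations into a stream of events, which it uses to keep the index data in Solr consistent with HBase. The Indexer also supports user-defined extraction and transformation rules for indexing HBase column data. Solr search results include the user-defined columnfamily:qualifier fields, so applications can read HBase column data directly from the results. Indexing and searching do not affect HBase stability or write throughput, because they run separately from HBase and asynchronously. In CDH5, Lily HBase Indexer depends on the HBase, SolrCloud, and ZooKeeper services.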

II. Near-Real-Time (NRT) Query Solution

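Division of labor: HBase stores the bulk data; Solr builds the index and serves external queries; the Indexer builds the Solr index from HBase.
Index creation flow: HBase -> Lily HBase Indexer -> Solr
Data usage flow diagram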

hbase-indexer-structure

III. Creating the Secondary Index

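1. Enable replication in HBase (in Cloudera Manager, search for "replication" in the HBase configuration and check Enable Replication).

2. Enable replication on the HBase table (a REPLICATION_SCOPE of 1 enables replication, 0 disables it; the default is 0).

For an existing table: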

disable 'table'
alter 'table',{NAME => 'cf', REPLICATION_SCOPE => 1}
enable 'table'

For a new table:

create 'table',{NAME => 'cf', REPLICATION_SCOPE => 1}

3. Create the Solr instance directory. /home/data/collectionSmsDay is a local directory of your choosing; the command generates the Solr configuration files there.

solrctl instancedir --generate /home/data/collectionSmsDay

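Edit the generated schema.xml

Add each HBase column that needs to be indexed as a field in schema.xml. The name attribute of each field must match the outputField value in morphlines.conf so that the indexer can map HBase columns to Solr fields. The id field holds the HBase rowkey and is also the uniqueKey. Prefer the string type for fields. The _root_, _version_, and text fields must be kept.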

<field name="_version_" type="long" indexed="true" stored="true"/>
<!-- points to the root document of a block of nested documents. Required for nested
document support, may be removed otherwise
-->
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="timestamp" type="tdate" indexed="true" stored="true" default="NOW+8HOUR" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="send_number" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="send_parent_account" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="busi_type" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="cnts" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="day_id" type="string" indexed="true" stored="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>
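
A mismatch between a morphline outputField and the schema field names is easy to miss by eye. The following is a minimal sketch, not part of the original setup, that parses a schema.xml string with Python's standard library and reports which outputField names have no matching field. The SCHEMA_XML excerpt and the helper names schema_field_names and missing_fields are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened schema.xml used only for this illustration.
SCHEMA_XML = """
<schema name="example" version="1.5">
  <fields>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="send_number" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="day_id" type="string" indexed="true" stored="true" multiValued="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
"""


def schema_field_names(schema_xml):
    """Return the names of all <field> elements declared in a schema.xml string."""
    root = ET.fromstring(schema_xml.strip())
    return {field.get("name") for field in root.iter("field")}


def missing_fields(schema_xml, output_fields):
    """Return the outputField names that have no matching <field> in the schema."""
    declared = schema_field_names(schema_xml)
    return [name for name in output_fields if name not in declared]


# "cnts" is deliberately absent from the excerpt above, so it is reported as missing.
print(missing_fields(SCHEMA_XML, ["send_number", "day_id", "cnts"]))
```

Running a check like this before uploading the configuration catches typos in field names early, before the indexer starts dropping or rejecting documents.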

4. Register the instance directory and upload its configuration files to ZooKeeper

solrctl instancedir --create collectionSmsDay /home/data/collectionSmsDay

Log in with the ZooKeeper client to check the nodes: ls /solr/configs/collectionSmsDay shows solrconfig.xml, schema.xml, and the other configuration files, and ls /solr/configs lists collectionSmsDay.

[root@db1 ~]# cd /opt/cloudera/parcels/CDH/lib/zookeeper/bin/
[root@db1 bin]# ./zkCli.sh
...
[zk: localhost:2181(CONNECTED) 3] ls /solr/configs/collectionSmsDay
[admin-extra.menu-top.html, currency.xml, protwords.txt, mapping-FoldToASCII.txt, solrconfig.xml.secure, _schema_analysis_synonyms_english.json, _rest_managed.json, solrconfig.xml, _schema_analysis_stopwords_english.json, stopwords.txt, lang, spellings.txt, mapping-ISOLatin1Accent.txt, admin-extra.html, schema_bak.xml, xslt, synonyms.txt, scripts.conf, update-script.js, velocity, elevate.xml, admin-extra.menu-bottom.html, schema.xml, clustering]
[zk: localhost:2181(CONNECTED) 4]
[zk: localhost:2181(CONNECTED) 3] ls /solr/configs
[collectionSmsDay]

solrctl usage

[hadoop@db1 lib]$ solrctl --help
usage: /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/bin/../lib/solr/bin/solrctl.sh [options] command [command-arg] [command [command-arg]]
solrctl [options] command [command-arg] [command [command-arg]] ...
Options:
  --solr: the SolrCloud web API URL; required when running the command from a node outside the SolrCloud cluster.
  --zk: the ZooKeeper ensemble and Solr znode.
  --help: print help.
  --quiet: run quietly.
Commands:
  init [--force]: initialize the configuration.
  instancedir: manage instance directories. Options:
    --generate path
    --create name path
    --update name path
    --get name path
    --delete name
    --list
  collection: manage collections. Options:
    [--create name -s <numShards>
      [-a Create collection with autoAddReplicas=true]
      [-c <collection.configName>]
      [-r <replicationFactor>]
      [-m <maxShardsPerNode>]
      [-n <createNodeSet>]]
    --delete name: Deletes a collection.
    --reload name: Reloads a collection.
    --stat name: Outputs SolrCloud specific run-time information for a collection.
    --list: Lists all collections registered in SolrCloud.
    --deletedocs name: Purges all indexed documents from a collection.
  core: manage cores. Options:
    --create name [-p name=value]
    --reload name: Reloads a core.
    --unload name: Unloads a core.
    --status name: Prints status of a core.
  cluster: manage cluster configuration. Options:
    --get-solrxml file
    --put-solrxml file

5. Create the collection collectionSmsDay in Solr

solrctl collection --create collectionSmsDay -s 6 -m 15 -r 2 -c collectionSmsDay -a
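Here -s 6 creates 6 shards (our SolrCloud cluster has 6 machines), -r 2 creates 2 replicas of each shard, and -c names the configuration to use under /solr/configs in ZooKeeper. -a creates the collection with autoAddReplicas=true (in our setup the replicas were not created without it). -m is maxShardsPerNode and defaults to 1. Three values matter here: numShards, replicationFactor, and the number of live Solr nodes (liveSolrNode). A healthy SolrCloud cluster does not place multiple replicas of the same shard on one live node, so with maxShardsPerNode=1 the request fails whenever numShards*replicationFactor > liveSolrNode. In general the following condition must hold:
numShards * replicationFactor <= liveSolrNode * maxShardsPerNode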
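
The capacity rule can be checked before calling solrctl. The sketch below treats the limit as inclusive, which is what the "This allows a maximum of 6 to be created" wording of the Solr error message suggests. The function names can_create_collection and min_max_shards_per_node are hypothetical helpers for this illustration, not part of solrctl or Solr.

```python
def can_create_collection(num_shards, replication_factor, live_nodes, max_shards_per_node=1):
    """Return True when the requested replicas fit on the live nodes.

    Requested cores: num_shards * replication_factor
    Allowed cores:   live_nodes * max_shards_per_node
    """
    requested = num_shards * replication_factor
    allowed = live_nodes * max_shards_per_node
    return requested <= allowed


def min_max_shards_per_node(num_shards, replication_factor, live_nodes):
    """Return the smallest maxShardsPerNode that makes the request fit (ceiling division)."""
    requested = num_shards * replication_factor
    return -(-requested // live_nodes)


# The collection from this step: 6 shards x 2 replicas on 6 nodes with -m 15.
print(can_create_collection(6, 2, 6, 15))
# A request like the failing one reported by Solr: 7 shards x 1 replica, 6 nodes, -m 1.
print(can_create_collection(7, 1, 6, 1))
```

For 6 shards with 2 replicas on 6 nodes, min_max_shards_per_node returns 2, so a value of -m well above that (such as 15) simply leaves headroom.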

When creating the collection, choose the shard count, replication factor, and maxShardsPerNode to fit the actual cluster; otherwise the request fails:

[root@db1 conf]# solrctl collection --create solrtest -s 7 –r 2 -m 20
Error: A call to SolrCloud WEB APIs failed: HTTP/1.1 400 Bad Request
Server: Apache-Coyote/1.1
Content-Type: application/xml;charset=UTF-8
Transfer-Encoding: chunked
Date: Tue, 11 Oct 2016 01:11:36 GMT
Connection: close
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">
400</int>
<int name="QTime">
73</int>
</lst>
<str name="Operation createcollection caused exception:">
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Cannot create collection solrtest. Value of maxShardsPerNode is 1, and the number of live nodes is 6. This allows a maximum of 6 to be created. Value of numShards is 7 and value of replicationFactor is 1. This requires 7 shards to be created (higher than the allowed number)</str>
<lst name="exception">
<str name="msg">
Cannot create collection solrtest. Value of maxShardsPerNode is 1, and the number of live nodes is 6. This allows a maximum of 6 to be created. Value of numShards is 7 and value of replicationFactor is 1. This requires 7 shards to be created (higher than the allowed number)</str>
<int name="rspCode">
400</int>
</lst>
<lst name="error">
<str name="msg">
Cannot create collection solrtest. Value of maxShardsPerNode is 1, and the number of live nodes is 6. This allows a maximum of 6 to be created. Value of numShards is 7 and value of replicationFactor is 1. This requires 7 shards to be created (higher than the allowed number)</str>
<int name="code">
400</int>
</lst>
</response>
[root@db1 conf]#

If you change schema.xml under /home/data/collectionSmsDay/conf, upload the configuration again and reload the collection:

solrctl instancedir --update collectionSmsDay /home/data/collectionSmsDay
solrctl collection --reload collectionSmsDay

The newly created collection can then be viewed in the Solr web UI.

6. Create morphline-hbase-mapper-smsday.xml in the hbase-solr directory

The value of morphlineId must match the id of a morphline defined in the Key-Value Store Indexer's morphlines.conf. Do not give morphlineId the same name as the HBase table.

[hadoop@db1 hbase-solr]$ pwd
/opt/cloudera/parcels/CDH/lib/hbase-solr
[hadoop@db1 hbase-solr]$ vi morphline-hbase-mapper-smsday.xml
<?xml version="1.0" encoding="UTF-8"?>
<indexer table="tb_sms_day" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">
  <param name="morphlineFile" value="morphlines.conf"></param>
  <param name="morphlineId" value="smsdayMap"></param>
</indexer>

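7. Edit the Morphlines configuration. In CM, open the Key-Value Store Indexer panel -> Configuration -> Category -> Morphlines -> Morphlines File. To add several morphlines, separate the entries with commas.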

{
  id : smsdayMap
  importCommands : ["org.kitesdk.**", "com.ngdata.**"]
  commands : [
    {
      extractHBaseCells {
        mappings : [
          {
            inputColumn : "cf:send_number"
            outputField : "send_number"
            type : string
            source : value
          }
          {
            inputColumn : "cf:send_parent_account"
            outputField : "send_parent_account"
            type : string
            source : value
          }
          {
            inputColumn : "cf:busi_type"
            outputField : "busi_type"
            type : string
            source : value
          }
          {
            inputColumn : "cf:cnts"
            outputField : "cnts"
            type : string
            source : value
          }
          {
            inputColumn : "cf:day_id"
            outputField : "day_id"
            type : string
            source : value
          }
        ]
      }
    }
    { logDebug { format : "output record: {}", args : ["@{}"] } }
  ]
}
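
The mapping entries are repetitive, and each outputField has to match a schema field name exactly. As a rough sketch, the entries can be generated from a list of qualifiers instead of typed by hand. The helper render_mappings is hypothetical; it assumes every column lives in one column family, is mapped as a string from the cell value, and uses the qualifier as the Solr field name, as in the configuration above.

```python
def render_mappings(family, qualifiers, indent="  "):
    """Render extractHBaseCells mapping entries for the given column qualifiers.

    Each qualifier becomes one entry whose outputField equals the qualifier name.
    """
    blocks = []
    for qualifier in qualifiers:
        blocks.append(
            "{\n"
            f'{indent}inputColumn : "{family}:{qualifier}"\n'
            f'{indent}outputField : "{qualifier}"\n'
            f"{indent}type : string\n"
            f"{indent}source : value\n"
            "}"
        )
    return "\n".join(blocks)


# Columns indexed in this article.
print(render_mappings("cf", ["send_number", "send_parent_account", "busi_type", "cnts", "day_id"]))
```

The printed text can be pasted inside the mappings array; the surrounding id, importCommands, and commands structure still has to be written as shown above.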

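8. Register morphline-hbase-mapper-smsday.xml with the Lily HBase Indexer service

-n is the indexer name, -c is the indexer configuration file to load, --connection-param gives the ZooKeeper address of Solr and the target collection name, and --zookeeper is the ZooKeeper address the indexer definition is uploaded to.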

hbase-indexer add-indexer \
  -n smsdayIndexer \
  -c /opt/cloudera/parcels/CDH/lib/hbase-solr/morphline-hbase-mapper-smsday.xml \
  --connection-param solr.zk=nn1.hadoop:2181,nn2.hadoop:2181,dn7.hadoop:2181,dn5.hadoop:2181,dn3.hadoop:2181/solr \
  --connection-param solr.collection=collectionSmsDay \
  --zookeeper nn1.hadoop:2181,nn2.hadoop:2181,dn7.hadoop:2181,dn5.hadoop:2181,dn3.hadoop:2181
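
The ZooKeeper quorum appears twice in this command, once with the /solr suffix and once without, which makes copy-and-paste mistakes likely. The sketch below assembles the same argument list from a single host list. The function add_indexer_command is a hypothetical helper, and the /solr chroot default is an assumption taken from the command above; only the flags already shown in this step are used.

```python
import shlex


def add_indexer_command(name, config_path, zk_hosts, collection, solr_chroot="/solr"):
    """Build the hbase-indexer add-indexer argument list from one ZooKeeper host list."""
    quorum = ",".join(zk_hosts)
    return [
        "hbase-indexer", "add-indexer",
        "-n", name,
        "-c", config_path,
        "--connection-param", f"solr.zk={quorum}{solr_chroot}",
        "--connection-param", f"solr.collection={collection}",
        "--zookeeper", quorum,
    ]


cmd = add_indexer_command(
    "smsdayIndexer",
    "/opt/cloudera/parcels/CDH/lib/hbase-solr/morphline-hbase-mapper-smsday.xml",
    ["nn1.hadoop:2181", "nn2.hadoop:2181"],
    "collectionSmsDay",
)
# Print a shell-safe, single-line form of the command for review before running it.
print(" ".join(shlex.quote(part) for part in cmd))
```

The list form can also be passed directly to subprocess.run on a node where hbase-indexer is installed.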

Check that the indexer was created successfully. The key point is that the Processes section lists only running processes and no failed ones.

[hadoop@db1 hbase-solr]$ hbase-indexer list-indexers --zookeeper nn1.hadoop:2181
Number of indexes: 1
smsdayIndexer
+ Lifecycle state: ACTIVE
+ Incremental indexing state: SUBSCRIBE_AND_CONSUME
+ Batch indexing state: INACTIVE
+ SEP subscription ID: Indexer_smsdayIndexer
+ SEP subscription timestamp: 2016-10-22T11:13:48.888+08:00
+ Connection type: solr
+ Connection params:
+ solr.collection = collectionSmsDay
+ solr.zk = nn1.hadoop:2181,nn2.hadoop:2181,dn7.hadoop:2181,dn5.hadoop:2181,dn3.hadoop:2181/solr
+ Indexer config:
268 bytes, use -dump to see content
+ Indexer component factory: com.ngdata.hbaseindexer.conf.DefaultIndexerComponentFactory
+ Additional batch index CLI arguments:
(none)
+ Default additional batch index CLI arguments:
(none)
+ Processes
+ 2 running processes
+ 0 failed processes

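To rebuild an indexer, first delete it with the command below. If the deletion never completes and the indexer stays stuck in a deleting state, remove its node manually in ZooKeeper under /ngdata/hbaseindexer.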

hbase-indexer delete-indexer -n smsdayIndexer --zookeeper nn1.hadoop:2181

9. Insert test rows into HBase and confirm that they can be queried in Solr

put 'tb_sms_day', '20161012000000001', 'cf:send_number', '15173751522'
put 'tb_sms_day', '20161012000000001', 'cf:day_id', '2016-10-21'
put 'tb_sms_day', '20161012000000001', 'cf:cnts', '1'
put 'tb_sms_day', '20161012000000001', 'cf:busi_type', 'SMS'
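
Once the rows are indexed, they can be looked up through Solr's select handler, including the paging and total-count queries that motivated the secondary index. The sketch below only builds the request URL. The host name solr-host and port 8983 are placeholders, not values from this setup, and solr_select_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, collection, query, start=0, rows=10):
    """Build a Solr select URL; start and rows control paging."""
    params = urlencode({"q": query, "start": start, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/{collection}/select?{params}"


# Look up the test row by its rowkey, which is stored in the id field.
# "solr-host:8983" is a placeholder for a real Solr node.
print(solr_select_url("http://solr-host:8983/solr", "collectionSmsDay", "id:20161012000000001"))
```

In the JSON response, response.numFound carries the total number of matches and response.docs holds the current page, so a total count does not require scanning HBase.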
