Building an Index with SolrJ

macrochen

Blog categories: Search engine, solr

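The code is simple and largely self-explanatory, and it can serve as a reference for real-world work; the original article is here. The example demonstrates two ways of building a full index:
One builds the full index from a database via SQL
One builds the full index by parsing files with Tika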

package SolrJExample;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.sql.*;
import java.util.ArrayList;
import java.util.Collection;

/* Example class showing the skeleton of using Tika and
   Sql on the client to index documents from
   both structured documents and a SQL database.

   NOTE: The SQL example and the Tika example are entirely orthogonal.
   Both are included here to make a
   more interesting example, but you can omit either of them.

 */
public class SqlTikaExample {
  private StreamingUpdateSolrServer _server;
  private long _start = System.currentTimeMillis();
  private AutoDetectParser _autoParser;
  private int _totalTika = 0;
  private int _totalSql = 0;

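  // Typed collection: avoids raw-type warnings when passed to _server.add().
  private Collection<SolrInputDocument> _docs = new ArrayList<SolrInputDocument>();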

  public static void main(String[] args) {
    try {
      SqlTikaExample idxer = new SqlTikaExample("http://localhost:8983/solr");

      idxer.doTikaDocuments(new File("/Users/Erick/testdocs"));
      idxer.doSqlDocuments();

      idxer.endIndexing();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  private SqlTikaExample(String url) throws IOException, SolrServerException {
      // Create a multi-threaded communications channel to the Solr server.
      // Could be CommonsHttpSolrServer as well.
      //
    _server = new StreamingUpdateSolrServer(url, 10, 4);

    _server.setSoTimeout(1000);  // socket read timeout
    _server.setConnectionTimeout(1000);
    _server.setMaxRetries(1); // defaults to 0.  > 1 not recommended.
         // binary parser is used by default for responses
    _server.setParser(new XMLResponseParser()); 

      // One of the ways Tika can be used to attempt to parse arbitrary files.
    _autoParser = new AutoDetectParser();
  }

    // Just a convenient place to wrap things up.
  private void endIndexing() throws IOException, SolrServerException {
    if (_docs.size() > 0) { // Are there any documents left over?
      _server.add(_docs, 300000); // Commit within 5 minutes
    }
    _server.commit(); // Only needs to be done at the end,
                      // commitWithin should do the rest.
                      // Could even be omitted
                      // assuming commitWithin was specified.
    long endTime = System.currentTimeMillis();
    log("Total Time Taken: " + (endTime - _start) +
         " milliseconds to index " + _totalSql +
        " SQL rows and " + _totalTika + " documents");
  }

  // I hate writing System.out.println() everyplace,
  // besides this gives a central place to convert to true logging
  // in a production system.
  private static void log(String msg) {
    System.out.println(msg);
  }

  /**
   * ***************************Tika processing here
   */
  // Recursively traverse the filesystem, parsing everything found.
  private void doTikaDocuments(File root) throws IOException, SolrServerException {

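    // Simple loop for recursively indexing all the files
    // in the root directory passed in.
    File[] files = root.listFiles();
    if (files == null) { // Not a directory, or an I/O error occurred.
      log("Cannot list " + root.getPath());
      return;
    }
    for (File file : files) {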
      if (file.isDirectory()) {
        doTikaDocuments(file);
        continue;
      }
        // Get ready to parse the file.
      ContentHandler textHandler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext context = new ParseContext();

      InputStream input = new FileInputStream(file);

        // Try parsing the file. Note we haven't checked at all to
        // see whether this file is a good candidate.
      try {
        _autoParser.parse(input, textHandler, metadata, context);
      } catch (Exception e) {
          // Needs better logging of what went wrong in order to
          // track down "bad" documents.
        log(String.format("File %s failed", file.getCanonicalPath()));
        e.printStackTrace();
        continue;
      } finally {
          // Always release the file handle, even when parsing fails.
        input.close();
      }
      // Just to show how much meta-data and what form it's in.
      dumpMetadata(file.getCanonicalPath(), metadata);

      // Index just a couple of the meta-data fields.
      SolrInputDocument doc = new SolrInputDocument();

      doc.addField("id", file.getCanonicalPath());

      // Crude way to get known meta-data fields.
      // Also possible to write a simple loop to examine all the
      // metadata returned and selectively index it and/or
      // just get a list of them.
      // One can also use the LucidWorks field mapping to
      // accomplish much the same thing.
      String author = metadata.get("Author");

      if (author != null) {
        doc.addField("author", author);
      }

      doc.addField("text", textHandler.toString());

      _docs.add(doc);
      ++_totalTika;

      // Completely arbitrary, just batch up more than one document
      // for throughput!
      if (_docs.size() >= 1000) {
          // Commit within 5 minutes.
        UpdateResponse resp = _server.add(_docs, 300000);
        if (resp.getStatus() != 0) {
          log("Some horrible error has occurred, status is: " +
                  resp.getStatus());
        }
        _docs.clear();
      }
    }
  }

    // Just to show all the metadata that's available.
  private void dumpMetadata(String fileName, Metadata metadata) {
    log("Dumping metadata for file: " + fileName);
    for (String name : metadata.names()) {
      log(name + ":" + metadata.get(name));
    }
    log("\n\n");
  }

  /**
   * ***************************SQL processing here
   */
  private void doSqlDocuments() throws SQLException {
    Connection con = null;
    try {
      Class.forName("com.mysql.jdbc.Driver").newInstance();
      log("Driver Loaded......");

      con = DriverManager.getConnection("jdbc:mysql://192.168.1.103:3306/test?"
                + "user=testuser&password=test123");

      Statement st = con.createStatement();
      ResultSet rs = st.executeQuery("select id,title,text from test");

      while (rs.next()) {
        // DO NOT move this outside the while loop
        // or be sure to call doc.clear()
        SolrInputDocument doc = new SolrInputDocument();
        String id = rs.getString("id");
        String title = rs.getString("title");
        String text = rs.getString("text");

        doc.addField("id", id);
        doc.addField("title", title);
        doc.addField("text", text);

        _docs.add(doc);
        ++_totalSql;

        // Completely arbitrary, just batch up more than one
        // document for throughput!
        if (_docs.size() >= 1000) {
             // Commit within 5 minutes.
          UpdateResponse resp = _server.add(_docs, 300000);
          if (resp.getStatus() != 0) {
            log("Some horrible error has occurred, status is: " +
                  resp.getStatus());
          }
          _docs.clear();
        }
      }
    } catch (Exception ex) {
      ex.printStackTrace();
    } finally {
      if (con != null) {
        con.close();
      }
    }
  }
}
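
Both doTikaDocuments and doSqlDocuments above repeat the same pattern: collect documents in _docs, send them once the buffer reaches about 1000 entries, then clear the buffer, with endIndexing sending whatever is left over. A minimal sketch of that pattern pulled into one place is shown below. BatchBuffer and its methods are hypothetical names and are not part of SolrJ; the sink is only a stand-in for a call such as _server.add(docs, 300000).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical helper (not part of SolrJ): buffers items and hands them
 * to a sink in fixed-size batches, mirroring the _docs buffer above.
 */
final class BatchBuffer<T> {
  private final int batchSize;
  // Stand-in for the real send step, e.g. _server.add(docs, commitWithinMs).
  private final Consumer<List<T>> sink;
  private final List<T> pending = new ArrayList<>();
  private int flushed = 0;

  BatchBuffer(int batchSize, Consumer<List<T>> sink) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    this.batchSize = batchSize;
    this.sink = sink;
  }

  /** Buffers one item and sends a batch as soon as the buffer is full. */
  void add(T item) {
    pending.add(item);
    if (pending.size() >= batchSize) {
      flush();
    }
  }

  /** Sends whatever is buffered; call once at the end for the leftovers. */
  void flush() {
    if (pending.isEmpty()) {
      return;
    }
    // Pass a copy so the sink may keep the list after the buffer is cleared.
    sink.accept(new ArrayList<>(pending));
    flushed += pending.size();
    pending.clear();
  }

  int flushedCount() {
    return flushed;
  }

  int pendingCount() {
    return pending.size();
  }
}
```

In the example above, the sink would wrap the call to _server.add, and endIndexing would call flush() before _server.commit(). Because Consumer cannot throw checked exceptions, a real sink would have to catch or wrap IOException and SolrServerException; that part is left out of this sketch.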

2012-02-23 20:54
Comments (2)
Category: Enterprise architecture

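#2 macrochen 2012-02-27

huangfoxAgain wrote:

How do you solve the "real-time search" problem in Solr 3.x?
What is used here appears to be a hard commit.

This example targets full indexing.

For real-time search, take a look at Sensei,

and then update the index on a schedule or on demand.

#1 huangfoxAgain 2012-02-25

How do you solve the "real-time search" problem in Solr 3.x?
What is used here appears to be a hard commit.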
