`

Spatial search with Lucene

阅读更多

At the Hadoop Hackathon this weekend at Sybit, we've worked on a getting tiled images into HDFS and HBase. A side-story was how to search for these tiles based on GPS coordinates. I took the task to see how you can perform queries based on spatial coordinates. That is, searching for locations which are within a specified radius from an origin:


 



With the Lucene-Spatial extension, it's quite easy to get this working and I'd like to share my results with you.

First, the usual setup in pom.xml:

<dependency>
 <groupId>org.apache.lucene</groupId>
 <artifactId>lucene-spatial</artifactId>
 <version>3.0.2</version>
</dependency>
<dependency>
 <groupId>org.apache.lucene</groupId>
 <artifactId>lucene-misc</artifactId>
 <version>3.0.2</version>
</dependency>


This fragment will add Lucene-Core, Lucene-Spatial and the Lucene-Misc modules to the classpath. Don't forget to update the project configuration via Maven > Update Project Configuration when you're using M2Eclipse.

Indexing

Then, let's begin the test case by setting up an example database:

@Test
public void testSpatialSearch() throws Exception {
	Directory dir = new RAMDirectory();
	IndexWriter writer = new IndexWriter(dir,
	  new WhitespaceAnalyzer(),MaxFieldLength.UNLIMITED);
	Examples.createExampleLocations(writer);
	writer.commit();
	writer.close(true);
	...


The Examples class contains a few named GPS locations with their longitude and latitude positions (taken from Wikipedia):

public class Examples {
  public static final Coordinate SYBIT = new Coordinate(47.740688, 8.972429);

  public static void createExampleLocations(IndexWriter w) throws IOException {
	Spatial s = new Spatial();
	s.addLoc(w, "Sybit-Radolfzell", SYBIT);
	s.addLoc(w, "Innovations-Immenstaad", new Coordinate(47.666667, 9.366667));
	s.addLoc(w, "Fino-Radolfzell", new Coordinate(47.736944, 8.969722));
	s.addLoc(w, "Zentrum-Berlin", new Coordinate(52.518611, 13.408056));
}
}



The Spatial class is a wrapper for a Lucene Document. It contains a lot of code, so i'm going to explain it in more detail:

public class Spatial {
  public static final String LAT_FIELD = "lat";
  public static final String LON_FIELD = "lon";
  public static final String TIER_PREFIX_FIELD = "_localTier";
  ...



These constants are used as field names in the Lucene index. Latitude and longitude are obviously for the position of the location. The TIER_PREFIX_FIELD constant is used as a prefix for multiple Document fields. For each "tier", a field is added to the Lucene Document, describing in which "box" the coordinates are located. Now, this sentence may cause some questions to emerge, so I'll explain ..

What is a tier and what is a box?

Spatial search has the problem of checking potentially millions of locations. As this would take a long time to process, the search space is optimized by using a Cartesian Grid. A Cartesian Grid splits the map (of Germany, in this case) into multiple rectangular regions.

What is a Cartesian Grid?

Assume the following map of Germany:



Now, the map is cut into four cells on the first tier. Then each of the cells on this tier are again cut into multiple cells - that's the next tier. This is continued until we reach a level of detail where one cell is approx. 1 mile. The cells are called boxes. The higher the tier, the more boxes exist.

tier = a grid, box = a cell of the grid



Now that we understand how the search space is narrowed down, let's go on with the Spatial.java class file:

public void addLoc(IndexWriter writer, String name, Coordinate coord) {
   Document doc = new Document();
   doc.add(new Field("name", name, Field.Store.YES, Index.ANALYZED));
   doc.add(new Field("metafile", "doc", Store.YES, Index.ANALYZED));
   addSpatialLcnFields(coord, doc);
   writer.addDocument(doc);
}


This code will create a new Lucene Document and add the name of the location, such as "Innovations-Immenstaad". The addSpatialLcnFields() method will add the exact coordinate of the location. In this case, that'd be 47.666667, 9.366667:

private void addSpatialLcnFields(Coordinate coord, Document document) {
  document.add(new Field(LAT_FIELD,
      NumericUtils.doubleToPrefixCoded(coord.getLat()), Field.Store.YES,
      Field.Index.NOT_ANALYZED));
   document.add(new Field(LON_FIELD,
      NumericUtils.doubleToPrefixCoded(coord.getLon()), Field.Store.YES,
      Field.Index.NOT_ANALYZED));
   addCartesianTiers(coord, document);
}


Storing the box for each tier

The interesting part, from a search algorithm point of view, is adding the box of our current coordinate per tier to the search index. That's going to be more complex, but not much:

private double maxMiles = 250, minMiles = 1;
private IProjector projector = new SinusoidalProjector();
private CartesianTierPlotter ctp = new CartesianTierPlotter(0, projector, TIER_PREFIX_FIELD);
// startTier is 14 for 25 miles, 15 for 1 miles in lucene 3.0
private int startTier = ctp.bestFit(maxMiles), endTier = ctp.bestFit(minMiles);

private void addCartesianTiers(Coordinate coord, Document document) {
   for (int tier = startTier; tier <= endTier; tier++) {
      ctp = new CartesianTierPlotter(tier, projector, TIER_PREFIX_FIELD);
      double boxId = ctp.getTierBoxId(coord.getLat(), coord.getLon());
      document.add(new Field(ctp.getTierFieldName(), NumericUtils
         .doubleToPrefixCoded(boxId), Field.Store.YES,
         Field.Index.NOT_ANALYZED_NO_NORMS));
   }
}



This code will iterate through multiple tiers and for each tier, it will generate an ID for the box. In simple terms, this would look like:





Now, as our location "Immenstaad" is near the Lake of Constance, the field "_localTier1" would contain the value "C" (the lower left cell of the yellow grid) and the field "_localTier2" would contain the value "N", which is the second cell at the bottom of the red grid. It will of course go on until it reaches the highest tier with the highest number of total boxes. But for each tier, it will only add the box where the location is, of course.

In reality, the boxId is not a single letter, but a specially encoded double value in the form "00013.00032", which means it's the 13th cell in the horizontal axis and the 32nd cell in the vertical axis. This is due to the nature how Lucene is storing values and saves space. (more about Cartesian Grid in Lucene)

Searching

Now, let's perform an actual query against the database:

   ...
   IndexSearcher searcher = new IndexSearcher(dir);
   {
      List locations = find(searcher, Examples.SYBIT, 5.0d);
      assertEquals(locations.toString(), 2, locations.size());
      assertTrue(locations.contains("Sybit-Radolfzell"));
      assertTrue(locations.contains("Fino-Radolfzell"));
   }
}

private List find(IndexSearcher searcher, Coordinate start,
     double miles)  {

      List result = new ArrayList();
      DistanceQueryBuilder dq = new DistanceQueryBuilder(start.getLat(),
         start.getLon(), miles, Spatial.LAT_FIELD, Spatial.LON_FIELD,
         Spatial.TIER_PREFIX_FIELD, true);

      Query query = new TermQuery(new Term("metafile", "doc"));
      TopDocs hits = searcher.search(dq.getQuery(query), 10);
      for (int i = 0; i < hits.totalHits; i++) {
         Document doc = searcher.doc(hits.scoreDocs[i].doc);
         result.add(doc.get("name"));
      }
      return result;
}



The test case starts the search at Examples.SYBIT, which is a pre-defined Coordinate object from the Examples class. The radius for the distance search is set to 5.0d miles. The expected result of the search are two locations: the Sybit location and the nearby restaurant called Fino (We actually didn't go there for lunch, but that's another story).

The DistanceQueryBuild is part of the Lucene-Spatial contribution and performs queries based on the distance of the start coordinate to all the coordinates in the search space. The search space however is not the whole Lucene index. It is pre-filtered based on a dervied best-fit value. The builder calculates the best matching tier for the search of 5-mile radiuses. When it has identified the tier, it will simply use all boxes within the 5 mile radius of the origin:



This narrows down the search to only a few possible locations - compared to all the locations of Germany or world-wide, this decreases the amount of work to a near minimum.

The last step is to actually calculate the distance between the search origin and all the locations in the boxes:



Finally, after the query has been executed by the IndexSearcher, the results are gathered into a list and returned.

 

http://www.java-community.de/archives/156-Spatial-search-with-Lucene.html

分享到:
评论

相关推荐

    Apache.Solr.Search.Patterns.1783981849

    We then dive deep into the spatial features such as indexing strategies and search/filtering strategies for a spatial search. We also do an in-depth analysis of problems faced in an ad serving ...

    Apache Solr Search Patterns(PACKT,2015)

    Apache Solr is an open source search platform built on a Java library called Lucene. It serves as a search platform for many websites, as it has the capability of indexing and searching multiple ...

    apache-solr-ref-guide-7.1.pdf

    “Spatial Search”部分提供了地理空间搜索功能的详细说明。 最后,“The Terms Component”、“The TermVector Component”、“The Stats Component”、“The Query Elevation Component”等章节,介绍了Solr的...

    win7修复本地系统工具

    win7修复本地系统工具

    《自动化专业英语》04-Automatic-Detection-Block(自动检测模块).ppt

    《自动化专业英语》04-Automatic-Detection-Block(自动检测模块).ppt

    《计算机专业英语》chapter12-Intelligent-Transportation.ppt

    《计算机专业英语》chapter12-Intelligent-Transportation.ppt

    西门子S7-1200博图平台下3轴伺服螺丝机程序解析与应用

    内容概要:本文详细介绍了基于西门子S7-1200博图平台的3轴伺服螺丝机程序。该程序使用SCL语言编写,结合KTP700组态和TIA V14及以上版本,实现了对X、Y、Z三个轴的精密控制。文章首先概述了程序的整体架构,强调了其在自动化控制领域的高参考价值。接着深入探讨了关键代码片段,如轴初始化、运动控制以及主程序的设计思路。此外,还展示了如何通过KTP700组态实现人机交互,并分享了一些实用的操作技巧和技术细节,如状态机设计、HMI交互、异常处理等。 适用人群:从事自动化控制系统开发的技术人员,尤其是对西门子PLC编程感兴趣的工程师。 使用场景及目标:适用于希望深入了解西门子S7-1200博图平台及其SCL语言编程特点的学习者;旨在帮助读者掌握3轴伺服系统的具体实现方法,提高实际项目中的编程能力。 其他说明:文中提供的代码示例和设计理念不仅有助于理解和学习,还能直接应用于类似的实际工程项目中。

    MATLAB仿真:非线性滤波器在水下长基线定位(LBL)系统的应用与比较

    内容概要:本文详细探讨了五种非线性滤波器(卡尔曼滤波(KF)、扩展卡尔曼滤波(EKF)、无迹卡尔曼滤波(UKF)、粒子滤波(PF)和变维卡尔曼滤波(VDKF))在水下长基线定位(LBL)系统中的应用。通过对每种滤波器的具体实现进行MATLAB代码展示,分析了它们在不同条件下的优缺点。例如,KF适用于线性系统但在非线性环境中失效;EKF通过雅可比矩阵线性化处理非线性问题,但在剧烈机动时表现不佳;UKF利用sigma点处理非线性,精度较高但计算量大;PF采用蒙特卡罗方法,鲁棒性强但计算耗时;VDKF能够动态调整状态维度,适合信标数量变化的场景。 适合人群:从事水下机器人(AUV)导航研究的技术人员、研究生以及对非线性滤波感兴趣的科研工作者。 使用场景及目标:①理解各种非线性滤波器的工作原理及其在水下定位中的具体应用;②评估不同滤波器在特定条件下的性能,以便为实际项目选择合适的滤波器;③掌握MATLAB实现非线性滤波器的方法和技术。 其他说明:文中提供了详细的MATLAB代码片段,帮助读者更好地理解和实现这些滤波器。此外,还讨论了数值稳定性问题和一些实用技巧,如Cholesky分解失败的处理方法。

    VMware-workstation-full-14.1.3-9474260

    VMware-workstation-full-14.1.3-9474260

    DeepSeek系列-提示词工程和落地场景.pdf

    DeepSeek系列-提示词工程和落地场景.pdf

    javaSE阶段面试题

    javaSE阶段面试题

    《综合布线施工技术》第5章-综合布线工程测试.ppt

    《综合布线施工技术》第5章-综合布线工程测试.ppt

    安川机器人NX100使用说明书.pdf

    安川机器人NX100使用说明书.pdf

    S7-1200 PLC改造M7120平面磨床电气控制系统:IO分配、梯形图设计及组态画面实现

    内容概要:本文详细介绍了将M7120型平面磨床的传统继电器控制系统升级为基于西门子S7-1200 PLC的自动化控制系统的过程。主要内容涵盖IO分配、梯形图设计和组态画面实现。通过合理的IO分配,确保了系统的可靠性和可维护性;梯形图设计实现了主控制逻辑、砂轮升降控制和报警逻辑等功能;组态画面则提供了友好的人机交互界面,便于操作和监控。此次改造显著提高了设备的自动化水平、运行效率和可靠性,降低了维护成本。 适合人群:从事工业自动化领域的工程师和技术人员,尤其是熟悉PLC编程和控制系统设计的专业人士。 使用场景及目标:适用于需要进行老旧设备升级改造的企业,旨在提高生产设备的自动化水平和可靠性,降低故障率和维护成本。具体应用场景包括但不限于金属加工行业中的平面磨床等设备的控制系统改造。 其他说明:文中还分享了一些实际调试中的经验和技巧,如急停逻辑的设计、信号抖动的处理方法等,有助于读者在类似项目中借鉴和应用。

    chromedriver-linux64-136.0.7103.48.zip

    chromedriver-linux64-136.0.7103.48.zip

    IMG_20250421_180507.jpg

    IMG_20250421_180507.jpg

    《网络营销策划实务》项目一-网络营销策划认知.ppt

    《网络营销策划实务》项目一-网络营销策划认知.ppt

    Lianantech_Security-Vulnerabil_1744433229.zip

    Lianantech_Security-Vulnerabil_1744433229

    MybatisCodeHelperNew2019.1-2023.1-3.4.1.zip

    MybatisCodeHelperNew2019.1-2023.1-3.4.1

    《Approaching(Almost)any machine learning problem》中文版第13章(最后一章)

    【深度学习部署】基于Docker的BERT模型训练与API服务部署:实现代码复用与模型共享

Global site tag (gtag.js) - Google Analytics