Spatial search with Lucene

haiker

浏览: 296478 次

最近访客更多访客>>

mumume123

Luckdeng

mft8899

lijianfeng007

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

lucene Spatial

At the Hadoop Hackathon this weekend at Sybit, we've worked on a getting tiled images into HDFS and HBase. A side-story was how to search for these tiles based on GPS coordinates. I took the task to see how you can perform queries based on spatial coordinates. That is, searching for locations which are within a specified radius from an origin:

With the Lucene-Spatial extension, it's quite easy to get this working and I'd like to share my results with you.

First, the usual setup in pom.xml:

<dependency>
 <groupId>org.apache.lucene</groupId>
 <artifactId>lucene-spatial</artifactId>
 <version>3.0.2</version>
</dependency>
<dependency>
 <groupId>org.apache.lucene</groupId>
 <artifactId>lucene-misc</artifactId>
 <version>3.0.2</version>
</dependency>

This fragment will add Lucene-Core, Lucene-Spatial and the Lucene-Misc modules to the classpath. Don't forget to update the project configuration via Maven > Update Project Configuration when you're using M2Eclipse.

Indexing

Then, let's begin the test case by setting up an example database:

@Test
public void testSpatialSearch() throws Exception {
	Directory dir = new RAMDirectory();
	IndexWriter writer = new IndexWriter(dir,
	  new WhitespaceAnalyzer(),MaxFieldLength.UNLIMITED);
	Examples.createExampleLocations(writer);
	writer.commit();
	writer.close(true);
	...

The Examples class contains a few named GPS locations with their longitude and latitude positions (taken from Wikipedia):

public class Examples {
  public static final Coordinate SYBIT = new Coordinate(47.740688, 8.972429);

  public static void createExampleLocations(IndexWriter w) throws IOException {
	Spatial s = new Spatial();
	s.addLoc(w, "Sybit-Radolfzell", SYBIT);
	s.addLoc(w, "Innovations-Immenstaad", new Coordinate(47.666667, 9.366667));
	s.addLoc(w, "Fino-Radolfzell", new Coordinate(47.736944, 8.969722));
	s.addLoc(w, "Zentrum-Berlin", new Coordinate(52.518611, 13.408056));
}
}

The Spatial class is a wrapper for a Lucene Document. It contains a lot of code, so i'm going to explain it in more detail:

public class Spatial {
  public static final String LAT_FIELD = "lat";
  public static final String LON_FIELD = "lon";
  public static final String TIER_PREFIX_FIELD = "_localTier";
  ...

These constants are used as field names in the Lucene index. Latitude and longitude are obviously for the position of the location. The TIER_PREFIX_FIELD constant is used as a prefix for multiple Document fields. For each "tier", a field is added to the Lucene Document, describing in which "box" the coordinates are located. Now, this sentence may cause some questions to emerge, so I'll explain ..

What is a tier and what is a box?

Spatial search has the problem of checking potentially millions of locations. As this would take a long time to process, the search space is optimized by using a Cartesian Grid. A Cartesian Grid splits the map (of Germany, in this case) into multiple rectangular regions.

What is a Cartesian Grid?

Assume the following map of Germany:

Now, the map is cut into four cells on the first tier. Then each of the cells on this tier are again cut into multiple cells - that's the next tier. This is continued until we reach a level of detail where one cell is approx. 1 mile. The cells are called boxes. The higher the tier, the more boxes exist.

tier = a grid, box = a cell of the grid

Now that we understand how the search space is narrowed down, let's go on with the Spatial.java class file:

public void addLoc(IndexWriter writer, String name, Coordinate coord) {
   Document doc = new Document();
   doc.add(new Field("name", name, Field.Store.YES, Index.ANALYZED));
   doc.add(new Field("metafile", "doc", Store.YES, Index.ANALYZED));
   addSpatialLcnFields(coord, doc);
   writer.addDocument(doc);
}

This code will create a new Lucene Document and add the name of the location, such as "Innovations-Immenstaad". The addSpatialLcnFields() method will add the exact coordinate of the location. In this case, that'd be 47.666667, 9.366667:

private void addSpatialLcnFields(Coordinate coord, Document document) {
  document.add(new Field(LAT_FIELD,
      NumericUtils.doubleToPrefixCoded(coord.getLat()), Field.Store.YES,
      Field.Index.NOT_ANALYZED));
   document.add(new Field(LON_FIELD,
      NumericUtils.doubleToPrefixCoded(coord.getLon()), Field.Store.YES,
      Field.Index.NOT_ANALYZED));
   addCartesianTiers(coord, document);
}

Storing the box for each tier

The interesting part, from a search algorithm point of view, is adding the box of our current coordinate per tier to the search index. That's going to be more complex, but not much:

private double maxMiles = 250, minMiles = 1;
private IProjector projector = new SinusoidalProjector();
private CartesianTierPlotter ctp = new CartesianTierPlotter(0, projector, TIER_PREFIX_FIELD);
// startTier is 14 for 25 miles, 15 for 1 miles in lucene 3.0
private int startTier = ctp.bestFit(maxMiles), endTier = ctp.bestFit(minMiles);

private void addCartesianTiers(Coordinate coord, Document document) {
   for (int tier = startTier; tier <= endTier; tier++) {
      ctp = new CartesianTierPlotter(tier, projector, TIER_PREFIX_FIELD);
      double boxId = ctp.getTierBoxId(coord.getLat(), coord.getLon());
      document.add(new Field(ctp.getTierFieldName(), NumericUtils
         .doubleToPrefixCoded(boxId), Field.Store.YES,
         Field.Index.NOT_ANALYZED_NO_NORMS));
   }
}

This code will iterate through multiple tiers and for each tier, it will generate an ID for the box. In simple terms, this would look like:

Now, as our location "Immenstaad" is near the Lake of Constance, the field "_localTier1" would contain the value "C" (the lower left cell of the yellow grid) and the field "_localTier2" would contain the value "N", which is the second cell at the bottom of the red grid. It will of course go on until it reaches the highest tier with the highest number of total boxes. But for each tier, it will only add the box where the location is, of course.

In reality, the boxId is not a single letter, but a specially encoded double value in the form "00013.00032", which means it's the 13th cell in the horizontal axis and the 32nd cell in the vertical axis. This is due to the nature how Lucene is storing values and saves space. (more about Cartesian Grid in Lucene)

Searching

Now, let's perform an actual query against the database:

   ...
   IndexSearcher searcher = new IndexSearcher(dir);
   {
      List locations = find(searcher, Examples.SYBIT, 5.0d);
      assertEquals(locations.toString(), 2, locations.size());
      assertTrue(locations.contains("Sybit-Radolfzell"));
      assertTrue(locations.contains("Fino-Radolfzell"));
   }
}

private List find(IndexSearcher searcher, Coordinate start,
     double miles)  {

      List result = new ArrayList();
      DistanceQueryBuilder dq = new DistanceQueryBuilder(start.getLat(),
         start.getLon(), miles, Spatial.LAT_FIELD, Spatial.LON_FIELD,
         Spatial.TIER_PREFIX_FIELD, true);

      Query query = new TermQuery(new Term("metafile", "doc"));
      TopDocs hits = searcher.search(dq.getQuery(query), 10);
      for (int i = 0; i < hits.totalHits; i++) {
         Document doc = searcher.doc(hits.scoreDocs[i].doc);
         result.add(doc.get("name"));
      }
      return result;
}

The test case starts the search at Examples.SYBIT, which is a pre-defined Coordinate object from the Examples class. The radius for the distance search is set to 5.0d miles. The expected result of the search are two locations: the Sybit location and the nearby restaurant called Fino (We actually didn't go there for lunch, but that's another story).

The DistanceQueryBuild is part of the Lucene-Spatial contribution and performs queries based on the distance of the start coordinate to all the coordinates in the search space. The search space however is not the whole Lucene index. It is pre-filtered based on a dervied best-fit value. The builder calculates the best matching tier for the search of 5-mile radiuses. When it has identified the tier, it will simply use all boxes within the 5 mile radius of the origin:

This narrows down the search to only a few possible locations - compared to all the locations of Germany or world-wide, this decreases the amount of work to a near minimum.

The last step is to actually calculate the distance between the search origin and all the locations in the boxes:

Finally, after the query has been executed by the IndexSearcher, the results are gathered into a list and returned.