Lucene / Solr 4 Spatial
This document describes how to use the new spatial field types and related functionality in Lucene / Solr 4. The existing spatial support introduced in Solr 3 is still present and is still the default used in Solr's example schema, through the field type named "location".
The bulk of the new spatial implementation lives in the new Lucene 4 spatial module. It replaces the former "Lucene spatial contrib" in v3. The Solr piece is small as it only needs to provide field types which are essentially adapters to the code in the Lucene spatial module. The shape implementations and other core spatial code that isn't related to Lucene is held in a new open-source project called Spatial4j. Presently, polygon support requires an additional dependency -- JTS.
There is a basic demo application that exercises a variety of these features. It's not "live" so you'll have to download and build it first. It's a bit rough around the edges as it's mostly used by the Lucene spatial developers.
Contents
Lucene / Solr 4 Spatial
New features, over Solr 3 spatial
How to Use
Configuration
Indexing
Search
Sorting and Relevancy
Units, Conversion
JTS / WKT / Polygon notes
TODO
Using spatial to search time ranges
New features, over Solr 3 spatial
Note: "Solr 3 spatial" refers to the spatial support introduced in that version of Solr which still exists in v4. Except for a small utility class, Solr 3 spatial does not actually use Lucene 3's defunct spatial contrib module.
These features describe what developer-users of Lucene/Solr 4 will appreciate. Under the hood, it's a framework designed to be extended for different so-called "spatial strategies". I'll assume here the RecursivePrefixTreeStrategy as it should address most use-cases and it has the best tests.
Polygon, LineString and other new shapes. All shapes are supported as indexed shapes and query shapes. Shapes other than point, rectangle and circle are supported via JTS -- an otherwise optional dependency. See JTS caveats below for more information.
Multi-valued indexed fields. This is critical for storing the results of automatic place extraction from text using natural language processing techniques with a gazetteer (a variant of "geocoding"), since a variable number of locations will be found.
Index non-point shapes as well as points. Non-point shapes are essentially pixelated (i.e. gridded) to a configured resolution per shape -- an approximation. By default that resolution is defined by a percentage of the overall shape size, and it applies to query shapes too. Note: If extremely high precision of shape edges needs to be retained for accurate indexing, then this solution probably won't scale too well at indexing time (big indexes, slow indexing). On the other hand, query shapes generally scale well to the maximum configured precision regardless of shape size.
Rectangles with user-specifiable corners. Oddly, Solr 3 spatial only supports the bounding box of a circle.
Multi-value distance sort / score boost. Note: this is a preliminary unoptimized implementation that uses a fair amount of RAM, even when multiValued=false. An alternative should be provided in the future.
Configurable precision which can vary per shape at query time (and sort of at index time). A coarser precision means fewer grid cells to index or match, which improves performance.
Fast filtering. The code was benchmarked once showing it outperforms Solr 3's "LatLonType" at its own game (single valued indexed points), and several 3rd parties anecdotally reported the same, especially for multi-million document indices. It is based on SOLR-2155 which was benchmarked in January 2010; so a new benchmark is a TODO item. Also, Solr 3 LatLonType sometimes requires all the points to be in memory, whereas the new spatial module here doesn't for filtering.
Well Known Text (WKT) support via JTS. WKT is arguably the most widely supported textual format for shapes. However, standard WKT doesn't specify a format for circles.
Of course, the basics in Solr 3 not mentioned here are implemented in this framework. For example, lat-lon bounding boxes and circles.
How to Use
Configuration
First, you must register a spatial field type in the Solr schema.xml file. The instructions in this whole document imply the RecursivePrefixTreeStrategy based field type used in a geospatial context.
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
           spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
           distErrPct="0.025"
           maxDistErr="0.000009"
           units="degrees"
/>
And finally, specify a field that uses this field type:
<field name="geo" type="location_rpt" indexed="true" stored="true" multiValued="true" />
A key feature of the new spatial module is multi-value support, but you aren't required to declare the field multiValued if your data has only one value per document.
The following configuration attributes are common to all new spatial field types based on Lucene 4 spatial:
spatialContextFactory: (Spatial4j) If polygons or other WKT formatted shape support is needed, then use the JTS based class as shown above, otherwise this can be omitted. The JTS jar file must be on Solr's classpath as well. Due to a combination of things, JTS can't simply be referenced by a "<lib>" entry in solrconfig.xml; it needs to be in WEB-INF/lib in Solr's war file, basically.
units="degrees": This parameter is mandatory, and currently the only value supported is "degrees". This affects the interpretation of the maxDistErr attribute, circle radius distances, and other absolute distances. There are approximately 111.2 kilometers in a degree, based on the average earth radius.
geo="true": Whether the spatial fields' coordinates are latitude / longitude WGS84 based (if true) or whether they are pure Euclidean / Cartesian based. It defaults to true. When set to false, you should indicate worldBounds and probably maxDistErr as well.
worldBounds="minX minY maxX maxY": Set the valid numerical ranges for x & y. If geo="true" then this is assumed to be "-180 -90 180 90". When geo="false" the default is the limits of a Java double; however, those values have been shown to not work (yet), so definitely choose your own boundaries for non-geospatial uses.
distCalculator="haversine": Set the distance calculation algorithm. If geo="true" then haversine is the default, otherwise cartesian is. The possible values are: haversine, lawOfCosines (warning: faulty), vincentySphere, cartesian, and cartesian^2.
A PrefixTree based field sees the world as a grid. Each grid cell is further decomposed as another set of grid cells at the next level. The top-most world level is known as "level 1", the next detailed is "level 2" and so on. Here are the attributes specific to PrefixTree based fields:
prefixTree="geohash": Choose the spatial grid implementation. "geohash" uses the Geohash algorithm, which has 32 children at each level, and it's limited to use when geo="true". The other implementation is "quad", which has 4 children at each level.
maxLevels="10": Set the maximum level (aka grid depth) for indexed data. It's easier to think in terms of a real distance and use maxDistErr instead.
maxDistErr="0.000009": The highest level of detail required for indexed data. If you specify nothing then it is a meter -- which is just a hair less than 0.000009 degrees. The units of this attribute are as indicated in the "units" attribute. On initialization, the prefix tree will determine what maxLevels should be to satisfy the desired distance precision. Unless you pick a maxDistErr at an exact threshold, the actual distance error will be even more precise. maxLevels is logged at startup.
distErrPct="0.025": Specifies the default precision of non-point shapes, as a fraction between 0.0 (fully precise up to maxLevels) and 0.5. Shapes are basically pixelated on an indexed grid. This number is approximated as the fraction of the distance between the center of a shape and the farthest corner of its bounding box. The closer this number is to zero, the more accurate the shape will be, but an indexed shape will use more disk space and it will take longer to index. The default is 2.5%. It applies to both index and query shapes, but it is overridable for query shapes.
A couple more obscure attributes are defaultFieldValuesArrayLen (affects memory use in distance sorting) and prefixGridScanLevel (tunes heuristics for filter performance).
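To make the relationship between maxDistErr and maxLevels more concrete, here is a rough Python sketch for the "quad" grid. It assumes that level 1 is the first split of the -180..180 by -90..90 world into four cells and that every further level halves the cell width and height. The function name and the stopping rule are illustrative assumptions, not the actual Lucene code, so treat the maxLevels value logged at startup as the authoritative number.

```python
def quad_levels_for_max_dist_err(max_dist_err, world_width=360.0,
                                 world_height=180.0, level_cap=50):
    """Return the first quad-grid level whose cells are no larger than
    max_dist_err (in degrees) in both width and height.

    Assumption: level 1 is the first 2x2 split of the world bounds.
    """
    if max_dist_err <= 0:
        raise ValueError("max_dist_err must be positive")
    width, height = world_width, world_height
    for level in range(1, level_cap + 1):
        # Each level splits every cell into 4 children (2 x 2).
        width /= 2.0
        height /= 2.0
        if width <= max_dist_err and height <= max_dist_err:
            return level
    raise ValueError("max_dist_err is too small for level_cap levels")


if __name__ == "__main__":
    # About one meter, the value used in the example field type above.
    print(quad_levels_for_max_dist_err(0.000009))
```

Because each level only halves the cell size, the cell size at the chosen level is usually somewhat smaller than the requested maxDistErr, which matches the remark that the actual distance error tends to be more precise than what was asked for.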
Indexing
Points are indexed just as they are in Solr 3 spatial:
<field name="geo">43.17614,-90.57341</field>
If a comma is omitted, then it is in x-y (lon-lat) order:
<field name="geo">-90.57341 43.17614</field>
A lat-lon rectangle can be indexed with 4 numbers in minX minY maxX maxY order:
<field name="geo">-74.093 41.042 -69.347 44.558</field>
A circle is specified like so:
<field name="geo">Circle(4.56,1.23 d=0.0710)</field>
The first part of it is the center point, in either "lon lat" or "lat,lon" format, then the "d" distance radius is in degrees.
For polygons, use the WKT standard (Well Known Text) like so:
<field name="geo">POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))</field>
In WKT, coordinates are in "x y" (lon lat) order, and each coordinate pair is separated from the next by a comma. (The double parentheses are not a typo; see the WKT spec.)
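The shape formats above are easy to get wrong by swapping latitude and longitude, so a client might wrap them in small helpers. The following Python sketch only builds the field value strings shown in this section; the function names are hypothetical and nothing here talks to Solr.

```python
def point_value(lat, lon):
    """Point in "lat,lon" form (with a comma, latitude comes first)."""
    return f"{lat},{lon}"


def rectangle_value(min_x, min_y, max_x, max_y):
    """Rectangle as "minX minY maxX maxY" (x is longitude, y is latitude)."""
    return f"{min_x} {min_y} {max_x} {max_y}"


def circle_value(lat, lon, radius_deg):
    """Circle with a "lat,lon" center and a radius in degrees."""
    return f"Circle({lat},{lon} d={radius_deg})"


def polygon_wkt(vertices):
    """WKT polygon from (lon, lat) pairs; the ring is closed if needed."""
    ring = list(vertices)
    if len(ring) < 3:
        raise ValueError("a polygon needs at least 3 vertices")
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # WKT rings must end where they start
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON(({coords}))"


if __name__ == "__main__":
    print(point_value(43.17614, -90.57341))
    print(rectangle_value(-74.093, 41.042, -69.347, 44.558))
    print(circle_value(4.56, 1.23, 0.071))
    print(polygon_wkt([(-10, 30), (-40, 40), (-10, -20), (40, 20), (0, 0)]))
```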
Search
Searching with the new spatial module is done quite differently than in Solr 3 spatial. Here is a Solr filter query parameter for a lat-lon bounding box using the simple shape syntax (non-WKT):
fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
Notice that the query uses the standard default Lucene query parser and its fielded-query syntax, in which a field name is followed by a colon. The spatial operation and shape are provided inside the double quotes. Just use the Intersects operation for now, as the others aren't well supported. The contents of the parentheses are a shape in the very same format used when indexing.
If you want to query by a rectangle shape, you have the option of using Solr's range query syntax:
fq=geo:[-90,-180 TO 90,180]
This is limited to lat,lon style without spaces, Intersects operation only, and you can't specify any extra options as seen below. The left side has the lower left corner, and the right side has the upper right corner.
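When these filters are sent over HTTP, the quotes, spaces and brackets need URL encoding. As a minimal sketch, the Python below builds both filter styles and encodes them with the standard library; the helper names and the "*:*" main query are assumptions made for illustration.

```python
from urllib.parse import urlencode


def intersects_fq(field, shape):
    """Filter using the Intersects operation and a shape string in the
    same format that is used for indexing."""
    return f'{field}:"Intersects({shape})"'


def range_fq(field, min_lat, min_lon, max_lat, max_lon):
    """Rectangle filter in range syntax: lower-left TO upper-right,
    each corner in lat,lon style without spaces."""
    return f"{field}:[{min_lat},{min_lon} TO {max_lat},{max_lon}]"


if __name__ == "__main__":
    fq = intersects_fq("geo", "-74.093 41.042 -69.347 44.558")
    # urlencode takes care of the quotes, spaces and parentheses.
    print(urlencode({"q": "*:*", "fq": fq}))
    print(range_fq("geo", -90, -180, 90, 180))
```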
Keep in mind that the query shape will by default have the distErrPct precision specified by the field type definition, which defaults to 0.025 (2.5%). Interpretation of this figure was described earlier. Here is an example polygon query setting it to 0:
fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))) distErrPct=0"
By setting it to 0, it is as accurate as the grid is (maxDistErr). You can also specify distErr=... to explicitly set the precision for the shape for when you know what the accuracy should be.
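To get a feel for what a given distErrPct means as an absolute distance, the description given earlier (a fraction of the distance from the shape's center to the farthest corner of its bounding box) can be sketched in Python as follows. This uses plain Euclidean math on degrees, which is a simplifying assumption; the real implementation may treat geospatial shapes differently, and the function name is hypothetical.

```python
import math


def approx_dist_err(min_x, min_y, max_x, max_y, dist_err_pct=0.025):
    """Approximate absolute precision (in degrees) for a shape with the
    given bounding box: dist_err_pct times the center-to-corner distance."""
    if not 0.0 <= dist_err_pct <= 0.5:
        raise ValueError("dist_err_pct must be between 0.0 and 0.5")
    center_x = (min_x + max_x) / 2.0
    center_y = (min_y + max_y) / 2.0
    center_to_corner = math.hypot(max_x - center_x, max_y - center_y)
    return dist_err_pct * center_to_corner


if __name__ == "__main__":
    # Bounding box from the earlier rectangle example, default 2.5%.
    print(approx_dist_err(-74.093, 41.042, -69.347, 44.558))
    # distErrPct=0 asks for full grid precision (limited by maxDistErr).
    print(approx_dist_err(-74.093, 41.042, -69.347, 44.558, 0.0))
```

A value computed this way could be a starting point for an explicit distErr setting when the required accuracy is known in advance.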
Sorting and Relevancy
A common spatial requirement is to sort the search results by distance from a point such as the center of a map window. Again, this works quite differently than Solr 3 spatial. Here, the spatial queries seen earlier are capable of returning a distance based score, which can then be sorted, used in relevancy boosting, and even returned in search results.
Here, we show parameters that apply a spatial search filter, sort by distance, and return the distance (as the score), all at once:
&fl=*,score&sort=score asc&q={! score=distance}geo:"Intersects(Circle(54.729696,-98.525391 d=10))"
A user keyword search would in this case be added as an 'fq' param, most likely with a leading {!edismax}. Notice the score=distance local-param here. Without this (or if set to "none"), the query yields a constant 1.0 for all documents. With "distance", the score is the distance in degrees from the center of the query shape to the indexed point(s). You'll probably want to sort these values ascending. Another option is "recipDistance", which applies a reciprocal function such that a distance of 0 yields a score of 1, and a distance at the edge of the query shape yields ~0.1, trailing down closer to 0 beyond that. The "recipDistance" option is intended for use in boosting relevancy, such as using it in dismax's boost parameter.
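The numbers quoted for "recipDistance" (1 at distance 0, roughly 0.1 at the edge of the query shape) are consistent with a reciprocal of the form c / (d + c), where c is one tenth of the center-to-edge distance. That constant is an assumption inferred from those numbers, not something stated here, so take this Python sketch as an illustration of the curve's shape only.

```python
def recip_distance_score(distance_deg, dist_to_edge_deg):
    """Illustrative reciprocal score: 1.0 at the center, about 0.09 at
    the edge of the query shape, approaching 0 farther out.

    Assumption: c = 0.1 * dist_to_edge_deg.
    """
    if dist_to_edge_deg <= 0:
        raise ValueError("dist_to_edge_deg must be positive")
    if distance_deg < 0:
        raise ValueError("distance_deg must not be negative")
    c = 0.1 * dist_to_edge_deg
    return c / (distance_deg + c)


if __name__ == "__main__":
    for d in (0, 5, 10, 20):
        # Query circle with a 10 degree radius, as in the example above.
        print(d, round(recip_distance_score(d, 10), 4))
```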
If you want to sort and to have the distance in the results like in the last example, but don't want the spatial filter, you can do this too. Use this approach in which we sort by a function query referring to a query's score:
&fl=*,distdeg:query($sortsq)&sort=query($sortsq) asc&sortsq={! score=distance}geo:"Intersects(Circle(54.729696,-98.525391 d=10))"
The parameter "sortsq" was named arbitrarily (it's not special); it's referred to in the "fl" parameter and in the "sort" parameter with the same distance-yielding query. If a document has no point in the spatial field, the distance used is 0. Use of this query in two places will result in some redundant calculations but only for the results actually returned, not the potentially millions of matched documents.
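For clients that assemble requests programmatically, the sort-without-filter parameters can be generated from a center point. The Python sketch below reproduces the parameters of the last example; the function name and the default "*:*" main query are assumptions, and "sortsq" is kept only because the example uses that name.

```python
from urllib.parse import urlencode


def distance_sort_params(field, lat, lon, radius_deg, main_query="*:*"):
    """Request parameters that sort by distance and return it as
    "distdeg" without filtering by the circle."""
    sortsq = (
        "{! score=distance}"
        f'{field}:"Intersects(Circle({lat},{lon} d={radius_deg}))"'
    )
    return {
        "q": main_query,                   # assumed main query
        "fl": "*,distdeg:query($sortsq)",  # distance returned as a field
        "sort": "query($sortsq) asc",      # nearest documents first
        "sortsq": sortsq,                  # referenced by fl and sort
    }


if __name__ == "__main__":
    params = distance_sort_params("geo", 54.729696, -98.525391, 10)
    print(urlencode(params))
```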
If you only need to return the distance but don't need to sort, then the most performant approach is to calculate it on the client based on the lat & lon from the search results. Google for the haversine algorithm and your language of choice and you'll find a code snippet. If you ask Solr to do it then it'll put all the points in memory needlessly, but it'll certainly work. This shortcoming may be addressed in the future.
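For the client-side calculation, a haversine function is only a few lines. Here is a minimal Python sketch; the mean earth radius of 6371.0087714 km is an assumption chosen to agree with the figure of roughly 111.2 km per degree used in this document.

```python
import math

EARTH_MEAN_RADIUS_KM = 6371.0087714  # assumed mean earth radius


def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_MEAN_RADIUS_KM):
    """Great-circle distance in kilometers between two lat/lon points
    given in degrees, using the haversine formula."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2.0) ** 2)
    # Clamp to 1.0 to guard against rounding error near antipodal points.
    return 2.0 * radius_km * math.asin(min(1.0, math.sqrt(a)))


if __name__ == "__main__":
    # Distance from the query center used above to an indexed point.
    print(haversine_km(54.729696, -98.525391, 43.17614, -90.57341))
```

Dividing the result by the kilometers-per-degree figure gives a value comparable to the degree-based distances that Solr returns.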
Notes:
If you index non-point data (e.g. polygons), then the PrefixTree based strategy will supply the center points of those shapes for sorting purposes.
If you supply multiple points or other shapes, then the distance to the closest one is used. If you need different behavior then file an issue in JIRA and explain your use-case.
The PrefixTree based field type currently has a sub-par implementation for caching the indexed points in memory. Even if multiValued="false", it's going to use the same big array of List of Point objects in memory. It's wasteful and the implementation is not friendly to real-time search requirements. Until a better implementation arrives, if you have single-valued point fields then use LatLonType for sorting instead. LatLonType also allows the choice of a float based coordinate field, which halves memory use compared to doubles while still keeping the error under 3 meters.
Sorting in Solr, whether on a number/date or on one of these spatial fields, requires some memory for each document, and spatial sorting can involve non-trivial math performed numerous times. Consequently, don't apply sorting without an actual need / requirement, versus a "hey, why not?" choice. The first time you sort on a field (spatial or not), Solr loads some data into memory. This "first time" is the first time since the last commit, to be precise. You probably want to put the sort query into firstSearcher & newSearcher so that an end user's search won't get hit with that penalty.
Units, Conversion
Degrees to kilometers: degrees * 111.2
Degrees to miles: degrees * 69.09
Just divide instead of multiply to go the other way.
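Since circle radii and maxDistErr are given in degrees, a pair of tiny helpers can keep the conversions in one place. This Python sketch uses only the two factors listed above; the function names are made up for illustration.

```python
KM_PER_DEGREE = 111.2
MILES_PER_DEGREE = 69.09


def degrees_to_km(degrees):
    return degrees * KM_PER_DEGREE


def km_to_degrees(km):
    return km / KM_PER_DEGREE


def degrees_to_miles(degrees):
    return degrees * MILES_PER_DEGREE


def miles_to_degrees(miles):
    return miles / MILES_PER_DEGREE


if __name__ == "__main__":
    # A 5 km search radius expressed as the "d" value of a Circle.
    d = km_to_degrees(5)
    print(f"Circle(54.729696,-98.525391 d={d:.6f})")
```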
JTS / WKT / Polygon notes
Shapes other than point, circle, or rectangle require JTS, an otherwise optional dependency. If you want to use Well Known Text (WKT) but only need the basic shapes, you still need JTS -- a restriction likely to be addressed in the near future.
Due to a combination of things, JTS can't simply be referenced by a "<lib>" entry in solrconfig.xml; it needs to be in WEB-INF/lib in Solr's war file, basically.
JTS views the world as a flat plane; the latitude and longitude are mapped to this plane directly. It uses Euclidean math operations, not Geodesic ones. This warps shapes slightly, and the warping can become significant if the vertices are particularly far apart longitudinally.
Dateline crossing is supported. Spatial4j adapts shapes that cross the dateline to be compatible with JTS, and so you shouldn't notice a problem (notwithstanding unknown bugs).
Pole wrapping is not supported. Consequently if you want to index or query by an Antarctica polygon for example, you are out of luck for now. The only shape that can encompass a pole is a Circle. Technically a longitude-wrapping (-180 to +180) lat-lon box that touches a pole will too.
Only the Polygon and MultiPolygon WKT types have been tested. GeometryCollection will not work, but the others like LineString should in theory. Holes in polygons haven't been tested, but there is code in place to support them.
Each vertex of a WKT shape must differ in longitude from the vertex before it by less than 180 degrees, or else the edge will be interpreted as going the wrong way around the globe. The only exception to this is a Polygon representing a rectangle.
All WKT coordinates are normalized into the standard geospatial lat-lon boundaries. So, -184 longitude becomes +176, for example. Both +180 and -180 are kept distinct -- true for all of Spatial4j, not just JTS.
The standard way to specify a rectangle in WKT is a Polygon -- WKT doesn't have a rectangle shape. If you want to specify a Rectangle via WKT (instead of the Spatial4j basic non-WKT syntax), you should take care to specify the coordinates in counter-clockwise order, the WKT standard. If this is done wrong then the rectangle will go the opposite direction longitudinally, even if it means one that spans nearly the entire globe (>180 degrees width). OpenLayers seems to not honor the WKT standard here, and depending on the corner you drag the rectangle from, might use a clockwise order. Some systems like PostGIS don't care what the ordering is, but the problem there is that there is then no way to specify a rectangle that has >= 180 width because there would be ambiguity. Spatial4j follows the WKT spec.
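Because vertex order matters for rectangles expressed as WKT polygons, it can help to generate the ring in code rather than by hand. The Python sketch below emits the corners in counter-clockwise order and checks the orientation with the shoelace formula. It assumes min_x < max_x (no dateline crossing), and the function names are hypothetical.

```python
def signed_area(ring):
    """Shoelace formula over a closed ring of (x, y) pairs.
    A positive result means counter-clockwise order."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def rectangle_to_wkt(min_x, min_y, max_x, max_y):
    """WKT polygon for a rectangle, vertices in counter-clockwise order.
    Assumption: the rectangle does not cross the dateline."""
    if min_x >= max_x or min_y >= max_y:
        raise ValueError("expected min_x < max_x and min_y < max_y")
    ring = [
        (min_x, min_y),  # lower left
        (max_x, min_y),  # lower right
        (max_x, max_y),  # upper right
        (min_x, max_y),  # upper left
        (min_x, min_y),  # back to the start to close the ring
    ]
    if signed_area(ring) <= 0:
        raise AssertionError("ring is not counter-clockwise")
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON(({coords}))"


if __name__ == "__main__":
    print(rectangle_to_wkt(-74.093, 41.042, -69.347, 44.558))
```

The same signed_area check can be applied to rings produced by a map widget before they are sent to Solr, reversing the vertex list when the result is negative.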
TODO
ability to pass d parameter for km or miles for small distances (helper?)
Using spatial to search time ranges
The new spatial support here can actually be used for searching and indexing time durations or other numeric ranges. See SpatialForTimeDurations.
This document describes how to use the new spatial field types and related functionality in Lucene / Solr 4. The existing spatial support introduced in Solr 3 is still present and is still the default used in Solr's example schema via the named "location" field type.
The bulk of the new spatial implementation lives in the new Lucene 4 spatial module. It replaces the former "Lucene spatial contrib" in v3. The Solr piece is small as it only needs to provide field types which are essentially adapters to the code in the Lucene spatial module. The shape implementations and other core spatial code that isn't related to Lucene is held in a new open-source project called Spatial4j. Presently, polygon support requires an additional dependency -- JTS.
There is a basic demo application that exercises a variety of these features. It's not "live" so you'll have to download and build it first. It's a bit rough around the edges as it's mostly used by the Lucene spatial developers.
Contents
Lucene / Solr 4 Spatial
New features, over Solr 3 spatial
How to Use
Configuration
Indexing
Search
Sorting and Relevancy
Units, Conversion
JTS / WKT / Polygon notes
TODO
Using spatial to search time ranges
New features, over Solr 3 spatial
Note: "Solr 3 spatial" refers to the spatial support introduced in that version of Solr which still exists in v4. Except for a small utility class, Solr 3 spatial does not actually use Lucene 3's defunct spatial contrib module.
These features describe what developer-users of Lucene/Solr 4 will appreciate. Under the hood, it's a framework designed to be extended for different so-called "spatial strategies". I'll assume here the RecursivePrefixTreeStrategy as it should address most use-cases and it has the best tests.
Polygon, LineString and other new shapes. All shapes are supported as indexed shapes and query shapes. Shapes other than point, rectangle and circle are supported via JTS -- an otherwise optional dependency. See JTS caveats below for more information.
Multi-valued indexed fields. This is critical for storing the results of automatic place extraction from text using natural language processing techniques with a gazetteer (a variant of "geocoding"), since a variable number of locations will be found.
Index non-point shapes as well as points. Non-point shapes are essentially pixelated (i.e. gridded) to a configured resolution per shape -- an approximation. By default that resolution is defined by a percentage of the overall shape size, and it applies to query shapes too. Note: If extremely high precision of shape edges needs to be retained for accurate indexing, then this solution probably won't scale too well at indexing time (big indexes, slow indexing). On the other hand, query shapes generally scale well to the maximum configured precision regardless of shape size.
Rectangles with user-specifiable corners. Oddly, Solr 3 spatial only supports the bounding box of a circle.
Multi-value distance sort / score boost. Note: this is a preliminary unoptimized implementation that uses a fair amount of RAM, even when multiValued=false. An alternative should be provided in the future.
Configurable precision which can vary per shape at query time (and sort of at index time). This enhances the performance.
Fast filtering. The code was benchmarked once showing it outperforms Solr 3's "LatLonType" at its own game (single valued indexed points), and several 3rd parties anecdotally reported the same, especially for multi-million document indices. It is based on SOLR-2155 which was benchmarked in January 2010; so a new benchmark is a TODO item. Also, Solr 3 LatLonType sometimes requires all the points to be in memory, whereas the new spatial module here doesn't for filtering.
Well Known Text (WKT) support via JTS. WKT is arguably the most widely supported textual format for shapes. However, standard WKT doesn't specify a format for circles.
Of course, the basics in Solr 3 not mentioned here are implemented in this framework. For example, lat-lon bounding boxes and circles.
How to Use
Configuration
First, you must register a spatial field type in the Solr schema.xml file. The instructions in this whole document imply the RecursivePrefixTreeStrategy based field type used in a geospatial context.
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
distErrPct="0.025"
maxDistErr="0.000009"
units="degrees"
/>
And finally, specify a field that uses this field type: <field name="geo" type="location_rpt" indexed="true" stored="true" multiValued="true" />
A key feature of the new spatial module is multi-value support but you certainly aren't required to declare the field multiValued if it isn't.
The following configuration attributes are common to all new spatial field types based on Lucene 4 spatial:
spatialContextFactory: (Spatial4j) If polygons or other WKT formatted shape support is needed, then use the JTS based class as shown above, otherwise this can be omitted. The JTS jar file must be on Solr's classpath as well. Due to a combination of things, JTS can't simply be referenced by a "<lib>" entry in solrconfig.xml; it needs to be in WEB-INF/lib in Solr's war file, basically.
units="degrees": This parameter is mandatory, and currently the only value supported is "degrees". This affects the interpretation of the maxDistErr attribute, circle radius distances, and other absolute distances. There are approximately 111.2 kilometers in a degree, based on the average earth radius.
geo="true": Wether the spatial fields' coordinates are latitude / longitude WGS84 based (if true) or whether they are pure Euclidean / Cartesian based. It defaults to true. When set to false, you should indicate worldBounds and probably maxDistErr as well.
worldBounds="minX minY maxX maxY": Set the valid numerical ranges for x & y. If geo="true" then this is assumed "-180 -90 180 90". When geo="false" this is the limits of a Java double however those values have been shown to not work (yet), so definitely choose your boundaries for non-geospatial uses.
distCalculator="haversine": Set the distance calculation algorithm. If geo="true" then haversine is the default, otherwise cartesian is. The possible values are: haversine, lawOfCosines (warning: faulty), vincentySphere, cartesian, and cartesian^2.
A PrefixTree based field sees the world as a grid. Each grid cell is further decomposed as another set of grid cells at the next level. The top-most world level is known as "level 1", the next detailed is "level 2" and so on. Here are the attributes specific to PrefixTree based fields:
prefixTree="geohash": Choose the spatial grid implementation. "geohash" uses the Geohash algorithm which has 32 children at each level, and its limited to use when geo="true". The other implementation is "quad" which has 4 children a each level.
maxLevels="10": Set the maximum level (aka grid depth) for indexed data. It's easier to think in terms of a real distance and use maxDistErr instead.
maxDistErr="0.000009": The highest level of detail required for indexed data. If you specify nothing then it is a meter -- which is just a hair less than 0.000009 degrees. The units of this attribute are as indicated in the "units" attribute. On initialization, the prefix tree will determine what maxLevels should be to satisfy the desired distance precision. Unless you pick a maxDistErr at an exact threshold, the actual distance error will be even more precise. maxLevels is logged at startup.
distErrPct="0.025": Specifies the default precision of non-point shapes, as a fraction between 0.0 (fully precise up to maxLevels) and 0.5. Shapes are basically pixelated on an indexed grid. This number is approximated as the fraction of the distance between the center of a shape and the farthest corner of its bounding box. The closer this number is to zero, the more accurate the shape will be, but an indexed shape will use more disk space and it will take longer to index. The default is 2.5%. It applies to both index and query shapes, but it is overridable for query shapes.
A couple more obscure attributes are defaultFieldValuesArrayLen (affects memory use in distance sorting) and prefixGridScanLevel (tunes heuristics for filter performance).
Indexing
Points are indexed just as they are in Solr 3 spatial:
<field name="geo">43.17614,-90.57341</field>
If a comma is omitted, then it is in x-y (lon-lat) order:
<field name="geo">-90.57341 43.17614</field>
A lat-lon rectangle can be indexed with 4 numbers in minX minY maxX maxY order:
<field name="geo">-74.093 41.042 -69.347 44.558</field>
A circle is specified like so:
<field name="geo">Circle(4.56,1.23 d=0.0710)</field>
The first part of it is the center point, in either "lon lat" or "lat,lon" format, then the "d" distance radius is in degrees.
For polygons, use the WKT standard (Well Known Text) like so:
<field name="geo">POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))</field>
In WKT, coordinates are in "x y" (lon lat) order, and the coordinates are each separated by commas. (The double parenthesis is not a typo; see the WKT spec.)
Search
Searching with the new spatial module is done significantly different than Solr 3 spatial. Here is a Solr filter query parameter for a lat-lon bounding box using the simple shape syntax (non-WKT):
fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
Notice that the query uses the standard default Lucene query parser and uses its fielded-query syntax in which a field is referenced followed by a colon. The spatial operation and shape are provided in the double-quotes. Just use Intersects operation for now, as the others aren't well supported. The contents of the parenthesis are a shape in the very same format used when indexing.
If you want to query by a rectangle shape, you have the option of using Solr's range query syntax:
fq=geo:[-90,-180 TO 90,180]
This is limited to lat,lon style without spaces, Intersects operation only, and you can't specify any extra options as seen below. The left side has the lower left corner, and the right side has the upper right corner.
Keep in mind that the query shape will by default have the distErrPct precision specified by the field type definition, which defaults to 0.025 (2.5%). Interpretation of this figure was described earlier. Here is an example polygon query setting it to 0:
fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))) distErrPct=0"
By setting it to 0, it is as accurate as the grid is (maxDistErr). You can also specify distErr=... to explicitly set the precision for the shape for when you know what the accuracy should be.
Sorting and Relevancy
A common spatial requirement is to sort the search results by distance from a point such as the center of a map window. Again, this works quite differently than Solr 3 spatial. Here, the spatial queries seen earlier are capable of returning a distance based score, which can then be sorted, used in relevancy boosting, and even returned in search results.
Here, we show parameters that do a spatial search filter & sort & returns the distance (as the score) simultaneously:
&fl=*,score&sort=score asc&q={! score=distance}geo:"Intersects(Circle(54.729696,-98.525391 d=10))"
Adding a user keyword search in this case would be added as an 'fq' param, most likely with the leading {!edismax}. Notice the score=distance local-param here. Without this (or if set to "none"), the query would yield a constant 1.0 for all documents. With "distance", it is the distance in degrees from the center of the query shape to the indexed point(s). You'll probably want to sort these values ascending. Another option is "recipDistance" which will use the reciprocal function such that distance 0 yields a score of 1, and a distance at the edge of the query shape yields ~0.1, trailing down closer to 0 beyond that. The "recipDistance" option is intended for use in boosting relevancy, such as using it in dismax's boost parameter.
If you want to sort and to have the distance in the results like in the last example, but don't want the spatial filter, you can do this too. Use this approach in which we sort by a function query referring to a query's score:
&fl=*,distdeg:query($sortsq)&sort=query($sortsq) asc&sortsq={! score=distance}geo:"Intersects(Circle(54.729696,-98.525391 d=10))"
The parameter "sortsq" was named arbitrarily (it's not special); it's referred to in the "fl" parameter and in the "sort" parameter with the same distance-yielding query. If a document has no point in the spatial field, the distance used is 0. Use of this query in two places will result in some redundant calculations but only for the results actually returned, not the potentially millions of matched documents.
If you only need to return the distance but don't need to sort, then the most performant approach is to calculate it on the client based on the lat & lon from the search results. Google for the haversine algorithm and your language of choice and you'll find a code snippet. If you ask Solr to do it then it'll put all the points in memory needlessly, but it'll certainly work. This shortcoming may be addressed in the future.
Notes:
If you index non-point data (e.g. polygons), then the PrefixTree based strategy will supply the center points of those shapes for sorting purposes
If you supply multiple points or other shapes, then the distance to the closest one is used. If you need different behavior then file an issue in JIRA and explain your use-case.
The PrefixTree based field type currently has a sub-par implementation for caching the indexed points in memory. Even if multiValued="false", it uses the same big array of List of Point objects in memory. It's wasteful and the implementation is not friendly to real-time search requirements. Until a better implementation arrives, if you have single-valued point fields then use LatLonType for sorting instead. LatLonType also allows the choice of a float based coordinate field, which halves memory compared to doubles while still giving better than 3 meters of precision.
Sorting in Solr, whether it be on a number/date or one of these spatial fields, requires some memory for each document, and spatial sorting can involve some non-trivial math performed numerous times. Consequently, don't apply sorting without an actual need / requirement, versus a "hey, why not?" choice. The first time you sort on a field (spatial or not), Solr loads some data into memory. This "first time" is the first time since the last commit, to be precise. You probably want to put the sort query into firstSearcher & newSearcher so that an end user's search won't get hit with that penalty.
Units, Conversion
Degrees to kilometers: degrees * 111.2
Degrees to miles: degrees * 69.09
Just divide instead of multiply to go the other way.
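A small helper along these lines keeps the factors in one place. This is a sketch using the two constants given above; the function names are hypothetical.

```python
KM_PER_DEGREE = 111.2      # factor from the text above
MILES_PER_DEGREE = 69.09   # factor from the text above


def degrees_to_km(degrees):
    return degrees * KM_PER_DEGREE


def km_to_degrees(km):
    return km / KM_PER_DEGREE


def degrees_to_miles(degrees):
    return degrees * MILES_PER_DEGREE


def miles_to_degrees(miles):
    return miles / MILES_PER_DEGREE


# Example: a 25 km radius expressed in degrees for the d= parameter.
d_param = km_to_degrees(25.0)
```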
JTS / WKT / Polygon notes
Shapes other than point, circle, or rectangle require JTS, an otherwise optional dependency. If you want to use Well Known Text (WKT) but only need the basic shapes, you still need JTS -- a restriction likely to be addressed in the near future.
Due to a combination of things, JTS can't simply be referenced by a "<lib>" entry in solrconfig.xml; it needs to be in WEB-INF/lib in Solr's war file, basically.
JTS views the world as a flat plane; the latitude and longitude are mapped to this plane directly. It uses Euclidean math operations, not Geodesic ones. This effectively warps shapes slightly, although it can be a bit much if the vertices are particularly far apart longitudinally.
Dateline crossing is supported. Spatial4j adapts shapes that cross the dateline to be compatible with JTS, and so you shouldn't notice a problem (notwithstanding unknown bugs).
Pole wrapping is not supported. Consequently if you want to index or query by an Antarctica polygon for example, you are out of luck for now. The only shape that can encompass a pole is a Circle. Technically a longitude-wrapping (-180 to +180) lat-lon box that touches a pole will too.
Only the Polygon and MultiPolygon WKT types have been tested. GeometryCollection will not work, but the others, such as LineString, should work in theory. Holes in polygons haven't been tested, but there is code in place to support them.
In a WKT shape, each vertex must differ by less than 180 degrees of longitude from the vertex before it; otherwise the edge will be interpreted as going the wrong way around the globe. The only exception to this is a Polygon representing a rectangle.
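If you generate WKT programmatically, a pre-flight check like the following can catch such edges before indexing. It is a minimal sketch: the function name is hypothetical, the ring is assumed to be a list of (lon, lat) tuples, and it compares raw coordinate values, so it does not account for the rectangle exception.

```python
def wide_longitude_steps(ring):
    """Return the indices of vertices that lie 180 or more degrees of
    longitude away from the vertex before them.

    ring: list of (lon, lat) tuples in degrees.
    """
    return [
        i for i in range(1, len(ring))
        if abs(ring[i][0] - ring[i - 1][0]) >= 180.0
    ]


ok_ring = [(0, 0), (100, 0), (100, 10), (0, 10), (0, 0)]
bad_ring = [(-100, 0), (100, 0), (100, 10), (-100, 10), (-100, 0)]
```

An empty result means no edge spans 180 degrees or more; for bad_ring, adding intermediate vertices along the long edges would be one way to resolve it.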
All WKT coordinates are normalized into the standard geospatial lat-lon boundaries. So, -184 longitude becomes +176, for example. Both +180 and -180 are kept distinct -- true for all of Spatial4j, not just JTS.
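The longitude part of that normalization can be sketched as follows. This illustrates the behavior described above and is not Spatial4j's actual code; the function name is hypothetical, and latitude handling is left out.

```python
def normalize_lon(lon):
    """Wrap a longitude in degrees into [-180, 180].

    Values already in range are returned unchanged, so +180 and -180
    stay distinct as described above.
    """
    if -180.0 <= lon <= 180.0:
        return lon
    # Shift to [0, 360), wrap, and shift back.
    return (lon + 180.0) % 360.0 - 180.0
```

For example, normalize_lon(-184.0) gives 176.0, matching the case mentioned above.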
The standard way to specify a rectangle in WKT is a Polygon -- WKT doesn't have a rectangle shape. If you want to specify a Rectangle via WKT (instead of the Spatial4j basic non-WKT syntax), you should take care to specify the coordinates in counter-clockwise order, the WKT standard. If this is done wrong then the rectangle will go the opposite direction longitudinally, even if it means one that spans nearly the entire globe (>180 degrees width). OpenLayers seems to not honor the WKT standard here, and depending on the corner you drag the rectangle from, might use a clockwise order. Some systems like PostGIS don't care what the ordering is, but the problem there is that there is then no way to specify a rectangle that has >= 180 width because there would be ambiguity. Spatial4j follows the WKT spec.
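One way to avoid ordering mistakes is to generate the rectangle's WKT from its edges rather than from drag corners. The helper below is a hypothetical sketch; it assumes WKT's "x y" (longitude latitude) coordinate order and that west and east name the actual western and eastern edges.

```python
def rectangle_wkt(west, south, east, north):
    """Return a WKT Polygon for a lat-lon rectangle, with vertices in
    counter-clockwise order: SW, SE, NE, NW, and back to SW."""
    ring = [
        (west, south),
        (east, south),
        (east, north),
        (west, north),
        (west, south),  # WKT rings are closed: repeat the first vertex
    ]
    coords = ", ".join("%s %s" % (x, y) for x, y in ring)
    return "POLYGON((%s))" % coords


wkt = rectangle_wkt(-10, 20, 30, 40)
```

Because the bottom edge is always emitted west to east, the direction of the rectangle follows from which value you pass as west, not from the order in which a user happened to drag it.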
TODO
ability to pass d parameter for km or miles for small distances (helper?)
Using spatial to search time ranges
The new spatial support here can actually be used for searching and indexing time durations or other numeric ranges. See SpatialForTimeDurations.