Overview of the Endeca Content Acquisition System
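The Endeca Content Acquisition System (CAS) is a set of components for adding, configuring, and crawling data sources for an Endeca application. Data sources include file systems, content management systems (CMS), Web servers, and custom data sources. CAS converts the documents and files it crawls into Endeca records and stores them in Record Store instances, from which a Forge pipeline can then read them.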
The Endeca Content Acquisition System is made up of the following components:
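1 CAS Service: a servlet container that runs the CAS Server, the Component Instance Manager, any number of Record Store instances, and the Dimension Value Id Manager.
2 CAS Server: the component that manages all file system and CMS crawl operations.
3 CAS Console: a Web-based application within Endeca Workbench for crawling data sources, including file systems and CMS repositories. During CAS installation, CAS Console is installed as an additional Workbench extension.
4 CAS Server API: lets you write code that communicates with the CAS Server. It has a WSDL interface and a command-line utility.
5 Dimension Value Id Manager: a CAS component that creates, stores, and retrieves dimension value identifiers.
6 Endeca Web Crawler: manages all Web crawling operations.
7 Endeca CMS Connectors: provide access to, and crawling of, data sources in various types of CMS.
8 Component Instance Manager: creates, lists, and deletes Record Store instances. It has a WSDL interface and the CIM command-line utility.
9 Endeca Record Store: provides persistent storage for the generated records. It also has a WSDL interface and the Record Store command-line utility. The CAS Server writes the crawl output of each data source to its own unique Record Store instance.
10 CAS Extension API: provides a set of interfaces and classes for building custom extensions such as custom data sources.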
Note: The CAS Service can host multiple Record Store instances (one per data source), and an application can have more than one Dimension Value Id Manager.
Starting the CAS Service:
Windows: <install path>/CAS/<version>/bin/cas-service.bat (or cas-service-wrapper.exe)
Linux: <install path>/CAS/<version>/bin/cas-service.sh
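CAS Server features:
1 Includes the CAS Document Conversion Module, which lets the CAS Server convert binary files into text.
2 Uses include and exclude filters to specify which files and folders are crawled and which are skipped.
3 Supports incremental crawls.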
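To make the include/exclude filter idea concrete, here is a minimal Python sketch. It is not CAS code: the glob-style patterns, the should_crawl helper, and the rule that an exclude match always wins over an include match are assumptions chosen for illustration. The real filter types and their precedence are defined by the CAS crawl configuration.

```python
from fnmatch import fnmatchcase


def should_crawl(path, include_patterns, exclude_patterns):
    """Return True if `path` passes the (hypothetical) filter rules.

    Assumption: a path is crawled when it matches at least one include
    pattern (or no include patterns are given) and matches no exclude pattern.
    """
    included = not include_patterns or any(
        fnmatchcase(path, pattern) for pattern in include_patterns
    )
    excluded = any(fnmatchcase(path, pattern) for pattern in exclude_patterns)
    return included and not excluded


# Hypothetical paths and patterns, for illustration only.
paths = ["docs/guide.pdf", "docs/tmp/draft.pdf", "src/main.c"]
selected = [p for p in paths if should_crawl(p, ["docs/*"], ["*/tmp/*"])]
print(selected)  # ['docs/guide.pdf']
```

Only docs/guide.pdf survives: the draft is included by docs/* but then removed by the */tmp/* exclude, and src/main.c never matches an include pattern.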
Record Store:
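A Web service that provides persistent storage for the records that crawls generate. Instead of writing crawl output to files, CAS writes it to a Record Store instance, where Forge can read it later or CAS can query it.
1 Provides an efficient repository for records. (Previously, data from different sources sat in separate directories; now it lives in one place, which removes the need to copy and move files between directories.)
2 Can supply data for both baseline and incremental indexing.
3 Supports asynchronous operation: CAS can write data to a Record Store instance while Forge reads from it.
4 A separate Record Store instance is created for each data source.
5 Cleans up stale data automatically.
6 Is easy to configure and manage through its command-line utility.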
Dimension value ID Managers
1 There is a command line interface (cas-cmd) to the component to manually perform the following operations:
• Create a Dimension Value Id Manager.
• Generate dimension value Ids.
• Export and import dimension value Ids.
• Get dimension value Ids.
• Delete a Dimension Value Id Manager.
The initialize_services script creates a new instance of a Dimension Value Id Manager. CAS generates dimension value Ids as part of writing MDEX output. Before removing an Endeca application, delete its Dimension Value Id Manager manually with the cas-cmd utility.
2 Backing up and restoring dimension value Ids
Back up: the exportDimensionValueIdMappings task of cas-cmd.
Restore: the importDimensionValueIdMappings task of cas-cmd.
3 Propagating dimension value Ids across environments
You often have to move the dimension value Id mapping file between environments such as dev, UAT, and prod.
You can coordinate this work in your Deployment Template script by calling exportDimensionValueIdMappings() on the ContentAcquisitionServerComponent, copying the file to the necessary machine, and calling importDimensionValueIdMappings() to load the file into another instance of a Dimension Value Id Manager. (In other words, an export followed by an import.)
Overview of the default CAS data sources and manipulators
Chapter 2: Create A Crawl
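You can use CAS Console, the CAS Server Command-line Utility, or the CAS Server API to create and configure any number of crawls for an application. If you use CAS Console, note that a crawl is equivalent to a data source.
Specify the following configuration options:
1 The name of the crawl
2 The location of the source data to crawl
3 Filters that include or exclude files and folders
4 Repository properties, for CMS data sources
5 Manipulators that modify Endeca records as part of the crawl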
Chapter 3 Load data into an MDEX Engine
1 Creating a Forge pipeline to read from or write to a Record Store
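This section describes how to build a Forge pipeline that reads Endeca records from one or more Record Store instances.
To read records into a Forge pipeline, add an input record adapter. If the record adapter reads from a CAS output file, specify whether the file format is XML or binary.
The URL specifies the location of the file.
If the record adapter reads from a Record Store instance, configure it as a custom adapter.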
1. Create a record adapter to read the Endeca records that CAS produced (required).
2. Map the record properties to Endeca properties and dimensions (required, but not documented in this guide; see Endeca Developer Studio Help).
Creating a record adapter to read from one or more Record Store instances:
1 New Record Adapter
2 From format list, choose Custom Adapter
3 Specify the JAVA_HOME, Class, and Class Path. For example:
Class:
com.endeca.itl.recordstore.forge.RecordStoreSource
Class Path:
<install path>/CAS/<version>/lib/recordstore-forge-adapter/recordstore-forge-adapter-<version>.jar
4 Select the Pass Throughs tab of the Record Adapter editor.
5 On the Pass Throughs tab, create the following name/value pairs:
5.1 Set a HOST pass-through to the fully qualified host name of the machine running the Endeca CAS Service. For example, HOST = hostname.endeca.com.
5.2 Set a PORT pass-through to the port number that the Endeca CAS Service is listening on. For example, PORT = 8500.
5.3 If reading from one Record Store instance, set an INSTANCE_NAME pass-through to the name of the
Record Store instance that you want Forge to read from. For example, INSTANCE_NAME = crawlID.
This pass-through is not required if the adapter is reading from multiple Record Store instances.
5.4 For a baseline pipeline, set a READ_TYPE pass-through to BASELINE. The BASELINE setting instructs Forge to read the latest version of all records in the Record Store. For example, READ_TYPE = BASELINE.
For a partial-update pipeline, set a READ_TYPE pass-through to DELTA. The DELTA setting instructs Forge to read records that have been modified or added between the last committed generation in the Record Store and the last generation read by the same client as identified by CLIENT_ID setting. For example, READ_TYPE = DELTA.
5.5 Set a CLIENT_ID pass-through to a string that distinguishes this client from others that may also be reading from the Record Store instances. For example, CLIENT_ID = FORGE. The CLIENT_ID pass-through specifies the client ID to be set for the generation that is being read in. In effect, this pass-through is performing the set-last-read-generation task that can be performed with the CAS Server Command-line Utility (i.e., state is being set for the client, which is Forge in this case). This pass-through can be used only for READ_TYPE operations.
5.6 Optionally, set a RECORDS_PER_TRANSFER pass-through to the number of records to transfer at a time for each Record Store instance. The default is 500. Click OK to add the new record adapter to the project.
5.7 Optionally, to enable SSL with server-only authentication, add pass-through options for the truststore location (SSL_TRUSTSTORE), type (SSL_TRUSTSTORE_TYPE), password (SSL_TRUSTSTORE_PASSWORD), and CAS port usage (IS_PORT_SSL).
5.8 Optionally, to enable SSL with mutual authentication, add pass-through options for the keystore location (SSL_KEYSTORE), type (SSL_KEYSTORE_TYPE), and password (SSL_KEYSTORE_PASSWORD). For example: SSL_KEYSTORE = C:\Endeca\CAS\workspace\conf\keystore.ks, SSL_KEYSTORE_TYPE = JKS, SSL_KEYSTORE_PASSWORD = endeca, IS_PORT_SSL = false.
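As a quick sanity check on the pass-throughs listed in step 5, the following Python sketch validates a set of name/value pairs before you type them into the Record Adapter editor. It is an illustration only, not part of Forge or CAS. The helper names are hypothetical, and two rules are assumptions drawn from the text above rather than documented requirements: that HOST and PORT are always needed, and that a DELTA read needs a CLIENT_ID because the delta is computed relative to that client's last-read generation.

```python
VALID_READ_TYPES = {"BASELINE", "DELTA"}
DEFAULT_RECORDS_PER_TRANSFER = 500  # default stated in step 5.6


def check_pass_throughs(pt):
    """Return a list of problems found in a dict of pass-through name/value pairs."""
    problems = []
    for name in ("HOST", "PORT"):  # assumption: both are always required
        if name not in pt:
            problems.append(f"missing {name}")
    if "PORT" in pt and not str(pt["PORT"]).isdigit():
        problems.append("PORT must be a number")
    if "READ_TYPE" in pt and pt["READ_TYPE"] not in VALID_READ_TYPES:
        problems.append("READ_TYPE must be BASELINE or DELTA")
    if pt.get("READ_TYPE") == "DELTA" and "CLIENT_ID" not in pt:
        problems.append("DELTA reads need a CLIENT_ID")  # assumption, see lead-in
    return problems


def records_per_transfer(pt):
    """Effective batch size: the explicit value, or the default of 500."""
    return int(pt.get("RECORDS_PER_TRANSFER", DEFAULT_RECORDS_PER_TRANSFER))


# Values taken from the examples in steps 5.1 to 5.5.
baseline = {
    "HOST": "hostname.endeca.com",
    "PORT": "8500",
    "INSTANCE_NAME": "crawlID",
    "READ_TYPE": "BASELINE",
    "CLIENT_ID": "FORGE",
}
incomplete = {"HOST": "hostname.endeca.com", "READ_TYPE": "DELTA"}

print(check_pass_throughs(baseline))    # []
print(check_pass_throughs(incomplete))  # ['missing PORT', 'DELTA reads need a CLIENT_ID']
print(records_per_transfer(baseline))   # 500
```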
In some cases, you may get an Out of Memory error if Forge is reading or writing records from a Record Store instance. To work around this error, you can increase the amount of memory allocated to the JVM running
Forge. To increase the memory, run Forge with --javaArgument flag and the -Xmx argument, for example --javaArgument -Xmx512m.
Record properties for all dimension values
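In the crawled data, each dimension value record produces one dimension value. Each such record may have the record properties listed below.
dimval.spec: the dimension value Id. It must be unique.
dimval.dimension_name: the name of the dimension (not the name of the dimension value).
dimval.display_order (optional): a numeric value that controls the display order of dimension values. Lower values appear first; a dimension value without this property is placed after those that have it.
dimval.parent_spec: the Id of the parent dimension value. If the dimension value is a root, the value is /.
dimval.display_name: the name of the dimension value.
dimval.match.use_spec (optional): whether dimval.spec is used to match property values. For a range dimension value the default is false; otherwise the default is true.
dimval.search_synonym: a synonym for the dimension value.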
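The ordering rule for dimval.display_order (lower values first, values without the property last) can be sketched in a few lines of Python. The records below are hypothetical, and the numeric values are plain integers for simplicity. How ties are broken among records that lack the property is not stated in these notes, so the sketch simply keeps their input order.

```python
def display_sort_key(record):
    """Sort key: records with dimval.display_order first (ascending), the rest after."""
    order = record.get("dimval.display_order")
    return (order is None, order if order is not None else 0)


# Hypothetical dimension value records for a "Color" dimension.
dimvals = [
    {"dimval.spec": "red", "dimval.dimension_name": "Color",
     "dimval.display_name": "Red", "dimval.display_order": 2},
    {"dimval.spec": "misc", "dimval.dimension_name": "Color",
     "dimval.display_name": "Misc"},
    {"dimval.spec": "blue", "dimval.dimension_name": "Color",
     "dimval.display_name": "Blue", "dimval.display_order": 1},
]

ordered = [r["dimval.spec"] for r in sorted(dimvals, key=display_sort_key)]
print(ordered)  # ['blue', 'red', 'misc']
```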
Record properties for range dimension values
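dimval.range.lower_bound (optional): the minimum value of the range.
dimval.range.lower_bound_inclusive (optional): whether the lower_bound value itself is part of the range.
dimval.range.upper_bound (optional): the maximum value of the range.
dimval.range.upper_bound_inclusive (optional): whether the upper_bound value itself is part of the range.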
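The four range properties combine into a simple membership test. The Python sketch below shows one way to read them. It uses numeric values directly and treats a missing inclusive flag as false; that default is an assumption made for the sketch, not something these notes specify. The in_range helper and the sample record are hypothetical.

```python
def in_range(value, rec):
    """Return True if `value` falls inside the range described by `rec`."""
    lower = rec.get("dimval.range.lower_bound")
    upper = rec.get("dimval.range.upper_bound")
    # Assumption: a missing *_inclusive flag is treated as False.
    lower_inclusive = rec.get("dimval.range.lower_bound_inclusive", False)
    upper_inclusive = rec.get("dimval.range.upper_bound_inclusive", False)

    if lower is not None:
        if value < lower or (value == lower and not lower_inclusive):
            return False
    if upper is not None:
        if value > upper or (value == upper and not upper_inclusive):
            return False
    return True


# Hypothetical range dimension value: 10 <= price < 20.
price_10_to_20 = {
    "dimval.range.lower_bound": 10,
    "dimval.range.lower_bound_inclusive": True,
    "dimval.range.upper_bound": 20,
    "dimval.range.upper_bound_inclusive": False,
}

matches = [v for v in (9, 10, 15, 20) if in_range(v, price_10_to_20)]
print(matches)  # [10, 15]
```

Because both bounds are optional, an open-ended range such as "100 and above" is expressed by leaving out the upper bound.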
About automatically generating dimension values
CAS can generate dimension values automatically from the property values of data records. To enable this, set isAutoGen to true on the dimension and then run a full crawl that produces MDEX-compatible output. The Dimension Value Id Manager then generates the dimension value Ids.
2 Creating a CAS crawl to write MDEX-compatible output
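Any crawl can be configured to write MDEX-compatible output, but the most common approach is to create a Record Store Merger crawl that does the writing. When it runs in full-crawl mode, the following happens:
1 Index configuration from multiple owners is merged.
2 Dimensions, properties, precedence rules, and dimension value records are processed.
3 Data records are processed.
4 The configuration and records are written as MDEX-compatible output.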
Chapter 4 CAS Command Line Utilities
The command syntax for executing the tasks is:
cas-cmd task-name [options]
You get the capabilities for a data source or manipulator by running the listModules task or the getModuleSpec task of cas-cmd.
cas-cmd.bat listModules -h localhost -p 8500
The getAllCrawlMetrics task retrieves a list of crawl IDs and their associated metrics
cas-cmd getAllCrawlMetrics [-h HostName] [-p PortNumber] [-l true|false]
Getting the status of a crawl
cas-cmd getCrawlStatus -id CrawlName [-h HostName] [-p PortNumber] [-l true|false]
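If you call cas-cmd from scripts, it can help to assemble the argument list in one place. The Python sketch below builds command lines that follow the syntax shown above, using only the flags that appear in these notes (-id, -h, -p, -l). The build_cas_cmd helper and the crawl name MyCrawl are hypothetical, and the sketch only prints the command. Running it, for example with subprocess.run(argv), assumes cas-cmd is on your PATH.

```python
import shlex


def build_cas_cmd(task, task_options=None, host=None, port=None, l_flag=None):
    """Build an argv list of the form: cas-cmd task-name [options]."""
    argv = ["cas-cmd", task]
    for flag, value in (task_options or {}).items():
        argv += [flag, str(value)]
    if host is not None:
        argv += ["-h", host]
    if port is not None:
        argv += ["-p", str(port)]
    if l_flag is not None:
        # The -l option takes the literal strings true or false.
        argv += ["-l", "true" if l_flag else "false"]
    return argv


status_cmd = build_cas_cmd(
    "getCrawlStatus", {"-id": "MyCrawl"}, host="localhost", port=8500
)
metrics_cmd = build_cas_cmd("getAllCrawlMetrics", host="localhost", port=8500)

print(shlex.join(status_cmd))   # cas-cmd getCrawlStatus -id MyCrawl -h localhost -p 8500
print(shlex.join(metrics_cmd))  # cas-cmd getAllCrawlMetrics -h localhost -p 8500
```

Keeping the arguments as a list rather than one string avoids quoting problems when a crawl name contains spaces.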
Component Instance Manager Command-line Utility
Command-line options
The command syntax for executing the tasks is:
component-manager-cmd task-name [options]
The create-component task creates a Record Store instance:
component-manager-cmd create-component -n RecordStoreName -t RecordStore [-h HostName] [-p PortNumber] [-l true|false]
The delete-component task deletes a Record Store:
component-manager-cmd delete-component -n RecordStoreName
[-h HostName] [-p PortNumber] [-l true|false]
Listing components:
The list-components task lists all component instances that are managed by the Component Instance Manager.
component-manager-cmd list-components [-h HostName] [-p PortNumber] [-l true|false]
Listing types:
The list-types task lists all component types that are managed by the Component Instance Manager. Executing the task returns a list of all managed component types in the CAS Service. In this release, the only supported component type is RecordStore.
The syntax for this task is:
component-manager-cmd list-types [-h HostName] [-p PortNumber]
[-l true|false]
Record Store Command-line Utility
Command-line options
With one exception, the command syntax for executing the tasks is:
recordstore-cmd task-name [options]
Writing tasks:
The write task writes a list of records into a specified Record Store instance.
The syntax for this task is:
recordstore-cmd write -a RecordStoreInstanceName [-b] -f InputFile [-h HostName] [-l true|false] [-p PortNumber] [-r Type] [-x Id]
Reading tasks:
The read-baseline task reads the baseline records from a Record Store instance.
The syntax for this task is:
recordstore-cmd read-baseline -a RecordStoreInstanceName
[-c] [-f FileName.xml] [-g GenId] [-h HostName] [-l true|false]
[-p PortNumber] [-n NumRecs] [-x id]
Cleaning a Record Store instance:
recordstore-cmd clean -a RecordStoreInstanceName [-h HostName]
[-l true|false] [-p PortNumber]
Clearing the last read generation:
recordstore-cmd clear-last-read-generation -a RecordStoreInstanceName
-c ClientId [-h HostName] [-l true|false] [-p PortNumber] [-x Id]
Committing transactions:
recordstore-cmd commit-transaction -a RecordStoreInstanceName -x Id
[-h HostName] [-l true|false] [-p PortNumber]
Getting the configuration of a Record Store instance:
recordstore-cmd get-configuration -a RecordStoreInstanceName
-f FileName.xml [-h HostName] [-l true|false] [-n] [-p PortNumber]
Getting the ID of the last-committed generation:
recordstore-cmd get-last-committed-generation -a RecordStoreInstanceName [-h HostName] [-l true|false] [-p PortNumber] [-x Id]
Getting the last-read generation:
recordstore-cmd get-last-read-generation -a RecordStoreInstanceName
-c ClientId [-h HostName] [-l true|false] [-p PortNumber] [-x Id]
Setting the configuration of a Record Store instance:
recordstore-cmd set-configuration -a RecordStoreInstanceName
-f FileName.xml [-h HostName] [-l true|false] [-p PortNumber]
Listing generations:
recordstore-cmd list-generations -a RecordStoreInstanceName
[-h Hostname] [-l true|false] [-p PortNumber]
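Several of the tasks above (commit-transaction, get-last-committed-generation, get-last-read-generation, clear-last-read-generation, list-generations) revolve around generations. The toy Python model below is one way to picture how generations relate to baseline and delta reads. It is an in-memory illustration under simplifying assumptions, not the Record Store implementation: the ToyRecordStore class and its methods are hypothetical, each commit creates exactly one generation, generation Ids are consecutive integers starting at 1, and deletes and cleanup are not modeled.

```python
class ToyRecordStore:
    """In-memory stand-in for a Record Store instance (illustration only)."""

    def __init__(self):
        self.generations = []  # generations[i] holds records committed in generation i + 1
        self.last_read = {}    # client id -> last generation id that client has read

    def commit(self, records):
        """Commit a batch of records (keyed by record id) as a new generation."""
        self.generations.append(dict(records))
        return len(self.generations)

    def last_committed_generation(self):
        return len(self.generations)

    def read_baseline(self, client_id):
        """Latest version of every record, merged across all generations."""
        merged = {}
        for generation in self.generations:
            merged.update(generation)
        self.last_read[client_id] = self.last_committed_generation()
        return merged

    def read_delta(self, client_id):
        """Records added or modified since the client's last-read generation."""
        start = self.last_read.get(client_id, 0)
        changed = {}
        for generation in self.generations[start:]:
            changed.update(generation)
        self.last_read[client_id] = self.last_committed_generation()
        return changed

    def clear_last_read_generation(self, client_id):
        self.last_read.pop(client_id, None)


store = ToyRecordStore()
store.commit({"a": "v1", "b": "v1"})
baseline = store.read_baseline("FORGE")  # {'a': 'v1', 'b': 'v1'}
store.commit({"b": "v2", "c": "v1"})
delta = store.read_delta("FORGE")        # {'b': 'v2', 'c': 'v1'}
print(baseline, delta)
```

In this model, clearing the last-read generation for a client makes its next delta read return everything, which is why the client Id matters for partial updates.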
Record properties generated by crawling
Common record properties
Endeca.Action:[UPSERT|DELETE]
Endeca.SourceType:[FILESYSTEM|WEB|CMS|EXTENSION]
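Endeca.Id: RECORD_IDENTIFIER. For a file system this may be the file path; for a Web server it may be the URL.
Endeca.SourceId: the data source name. It should match the crawlId in the crawl configuration file.
Endeca.File.IsArchive: whether the file is an archive (compressed) file.
Endeca.File.IsInArchive: whether the file was extracted from an archive.
Endeca.File.Size: the file size in bytes.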
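To show how the common properties fit together, the Python sketch below applies a small batch of crawl records to a dictionary keyed by Endeca.Id, honoring Endeca.Action. The sample records, the source name files, and the apply_records helper are hypothetical. Real records carry many more properties, and the actual downstream processing is done by Forge, not by code like this.

```python
def apply_records(index, records):
    """Apply UPSERT and DELETE records to `index`, a dict keyed by Endeca.Id."""
    for record in records:
        record_id = record["Endeca.Id"]
        action = record["Endeca.Action"]
        if action == "UPSERT":
            index[record_id] = record      # insert a new record or replace the old one
        elif action == "DELETE":
            index.pop(record_id, None)     # deleting a missing record is a no-op here
        else:
            raise ValueError(f"unknown Endeca.Action: {action}")
    return index


# Hypothetical file system crawl output.
crawl_output = [
    {"Endeca.Action": "UPSERT", "Endeca.SourceType": "FILESYSTEM",
     "Endeca.Id": "/data/a.txt", "Endeca.SourceId": "files", "Endeca.File.Size": "120"},
    {"Endeca.Action": "UPSERT", "Endeca.SourceType": "FILESYSTEM",
     "Endeca.Id": "/data/b.txt", "Endeca.SourceId": "files", "Endeca.File.Size": "64"},
    {"Endeca.Action": "DELETE", "Endeca.SourceType": "FILESYSTEM",
     "Endeca.Id": "/data/a.txt", "Endeca.SourceId": "files"},
]

index = apply_records({}, crawl_output)
print(sorted(index))  # ['/data/b.txt']
```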