By dananicula-Oracle on Jun 30, 2015
Written by Phil Franklin, Principal Instructor, Oracle University
This article is intended to give Oracle Commerce platform users information about the Forge-less pipeline design for indexing and configuring data for the Endeca Guided Search MDEX engine.
The Content Acquisition System (CAS) is a key component in the indexing architecture, whether you're a full Oracle Commerce platform user or you're using Endeca Guided Search as a stand-alone product.
First, let’s take a look at the history of data ingest into the Endeca MDEX engine.
NOTE: RS = record store, LMC = last-mile-crawl, ECR = Endeca Configuration Repository
Blue arrow = pull operation; green arrow = push operation
Referring to the numbering on the diagram above:
1. Forge/Dgidx pipeline (Information Transformation Layer) - this is the ‘traditional’ way of preparing data and configuration for the MDEX index. Forge pipelines were designed with Developer Studio, which allowed visual modification of the configuration files in the indexing project’s config\pipeline folder. Forge is a 32-bit process that can only pull data from data sources for transformation.
It is also single-threaded, so processing large data sets can be challenging. Workarounds included running multiple instances of Forge or writing multithreaded Java manipulators. Forge can, however, join data from multiple data sources and generate Endeca record structures from the result of the join.
2. CAS-Forge-Dgidx - the Content Acquisition System was originally created to acquire data from ‘unstructured’ data sources, such as file systems and various third-party CMS systems, where large text fields or documents could be found. Document conversion can be plugged in to convert documents into metadata source fields and text for indexing. Early uses of CAS pulled data into the Forge process via a custom adapter that connected to the CAS service’s record stores.
CAS was later extended to cover data sources that originally could only be imported directly into Forge, while retaining its unstructured data capabilities. After the Oracle acquisition, the third-party CMS adapters were removed and CAS became the recommended method for ingesting data. It managed dimension-value configuration items for indexing.
The most sophisticated usage of this data ingest design in Oracle Commerce is represented by the Product Catalog integration template used to integrate the core platform (ATG) with Endeca up to the release of version 11-1 (see diagram above).
This indexing project template uses two instances of Forge: one to generate configuration files and a second to use the generated files (and some manually managed files) to generate the information for Dgidx. Using Developer Studio became difficult in this design, as most configuration information was now stored elsewhere.
For both (1) and (2), Forge can still be a bottleneck, and dimension values (classification values) require careful management, as there is no central place of record for them. They are generated to data\state and to forge_output, so moving them between environments (staging and production) has to be handled carefully.
3. CAS-Dgidx - this is now the recommended design as of Oracle Commerce release 11.1; it completely removes Forge (and the use of Developer Studio for configuration).
- Data can either be pushed (via command line, scripts or API) into a CAS record store or acquired via a pull mechanism (CAS Crawl).
- CAS is a 64-bit, multi-threaded server that can carry out these operations on demand and in parallel.
- CAS record stores support both full and incremental updates, which can take place asynchronously from the indexing process.
- Document conversion and data manipulation in CAS prior to indexing are also supported.
- Data and configuration are integrated by the ‘last-mile-crawl’ component dedicated to each indexing project. Configuration is obtained from three main places: managed dimension values from a CAS record store, schema and precedence rules from the Endeca Configuration Repository (ECR), and the indexing project’s config\mdex folder. The ECR also has a programmatic API, so it can be populated automatically. The ‘last-mile-crawl’ writes out all the information required for Dgidx to index the project.
- The project also has a centralised dimension ID manager, which provides a central place of record for managing dimension values for the indexing project – a big improvement over the Forge-based approach.
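To make the idea of a central place of record for dimension values concrete, here is a minimal sketch in Python. It is not the Endeca dimension ID manager or its API: the class name `DimensionIdRegistry`, the method `id_for` and the starting ID of 1000 are all hypothetical, chosen only to show why a single registry keeps IDs stable across repeated runs.

```python
from dataclasses import dataclass, field


@dataclass
class DimensionIdRegistry:
    """Hypothetical central registry: one stable numeric ID per (dimension, value)."""

    next_id: int = 1000  # arbitrary starting ID, for illustration only
    ids: dict = field(default_factory=dict)

    def id_for(self, dimension: str, value: str) -> int:
        """Return the existing ID for a dimension value, or allocate a new one."""
        key = (dimension, value)
        if key not in self.ids:
            self.ids[key] = self.next_id
            self.next_id += 1
        return self.ids[key]


registry = DimensionIdRegistry()
first = registry.id_for("color", "red")
second = registry.id_for("color", "blue")
again = registry.id_for("color", "red")  # same value, so the same ID comes back
```

Because every lookup goes through one registry, a dimension value seen in a later update receives the ID it was given the first time, rather than a freshly generated one.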
The diagram above shows the essential elements of a Forge-less pipeline. Note that for the 11-1 Catalog integration project, data and dimension values (dimvals) are pushed into record stores via the APIs. Only the last-mile-crawl is executed as a crawl, via the baseline and partial update scripts.
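The difference between the full and incremental record store updates mentioned earlier can be pictured with a small in-memory model. This is an illustrative sketch only: `RecordStoreSketch` and its methods are hypothetical names and do not correspond to the CAS record store API, and the sample records are invented.

```python
class RecordStoreSketch:
    """Hypothetical in-memory stand-in for a record store keyed by record ID."""

    def __init__(self):
        self.records = {}

    def full_update(self, records):
        # A full (baseline) load replaces everything previously stored.
        self.records = {r["id"]: r for r in records}

    def incremental_update(self, upserts=(), deletes=()):
        # An incremental load touches only the records named in the delta.
        for r in upserts:
            self.records[r["id"]] = r
        for record_id in deletes:
            self.records.pop(record_id, None)


store = RecordStoreSketch()
store.full_update([{"id": "p1", "name": "Kettle"}, {"id": "p2", "name": "Toaster"}])
store.incremental_update(
    upserts=[{"id": "p2", "name": "Toaster XL"}, {"id": "p3", "name": "Blender"}],
    deletes=["p1"],
)
remaining_ids = sorted(store.records)
```

In this model the pushing side decides when to load data, and a separate step (the last-mile-crawl in the real design) reads whatever the store holds at indexing time.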
Note that the CAS design can be extended to include other inputs, for example from third-party sources, which need not be extracted or pushed from a core platform repository.
The 11-1 release represents the first time the Forge-less design became the default approach for data and config integration for indexing. Currently, CAS cannot join data natively, so any data joining must be done externally before the record stores are loaded. We can perhaps expect further developments in this area in future (please note: the opinions are solely those of the author).
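As a rough illustration of joining data outside CAS before it is loaded, the following Python sketch merges two record sets on a shared key. The function `join_records`, the `sku` key and the sample product and price records are assumptions made for this example; how the joined records are then pushed into a record store is deliberately left out.

```python
def join_records(products, prices, key="sku"):
    """Left-join price fields onto product records by a shared key."""
    prices_by_key = {p[key]: p for p in prices}
    joined = []
    for product in products:
        merged = dict(product)  # copy so the input records are not modified
        # Products with no matching price record pass through unchanged.
        merged.update(prices_by_key.get(product[key], {}))
        joined.append(merged)
    return joined


products = [{"sku": "A1", "name": "Kettle"}, {"sku": "B2", "name": "Toaster"}]
prices = [{"sku": "A1", "price": 24.99}]
joined = join_records(products, prices)
```

A left join is used here so that no product is dropped when its price is missing; whether that is the right behaviour depends on the data being indexed.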
What can be said is that the Forge-less approach is the product direction, so for any new project or major update to an existing project using Endeca Guided Search, we recommend you take a good look at it.
Note that the Oracle University “Oracle Commerce: Implementing Guided Search 11-1 (3 days)” class covers this area in detail.
About the Author
Phil Franklin has over 25 years of experience building, delivering and managing consulting and education services in privately owned and publicly quoted companies, with emphasis on software development and process improvement. He has successfully held CTO and service director management positions over the course of his career. Over the last 10 years, Phil has specialized in technical development education for ecommerce and search solutions, coming to Oracle via the Endeca acquisition. He teaches Oracle Commerce (Endeca/ATG) and Oracle Knowledge, as well as development classes for Cloud and the Java language.