How to consolidate external sources into Knowledge v3 search
Knowledge v2 (Eureka and earlier) used a feature called Navigation Add-ons to show search results from external sources. That feature is now deprecated.
The new v3 framework for crawling external sources such as Documentation and Community uses scheduled jobs to extract content from those sources and relies on Zing Search to index the content and provide search results. The following diagram shows the flow of data.
About the framework
The framework consists mainly of a scheduled job that uses the APIs of the external content source to extract content. The scheduled job then saves the extracted content as individual articles in a separate Knowledge Base. Periodic runs of the scheduled job keep the KB synchronized with the source.
This framework provides a guideline for you to implement external source searching for your instance.
To manage multiple content sources effectively, you need a configuration object. It stores the details of the external content source and how its content is mapped into the Knowledge Base within your instance. Consider managing the details listed in the following table.
|Target Knowledge Base||The KB in which the extracted content is stored as individual articles. Create the target Knowledge Base before you configure this.|
|Security||Set up access to the articles that are being created through the scheduled job from a particular source.|
|Source URL endpoint (or other config)||A config or URL that can be used by the scheduled job to call the external content source.|
|Redirect URL||When returning search results, you may need a base URL from which to build the final URL of the source content.|
|Security||Define the content security.|
|Category Mapping||Define which KB categories are assigned to the articles created from the content source. Create the categories in the target Knowledge Base before you define this mapping.|
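The details in the table above could be held in a single record per content source. The following is a minimal sketch in Python, not the platform's own data model: the class name `ExternalSourceConfig`, all field names, the example URLs, and the longest-prefix rule for category mapping are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExternalSourceConfig:
    """Hypothetical configuration record for one external content source."""

    name: str
    target_knowledge_base: str   # KB that receives the wrapper articles
    source_endpoint: str         # URL (or other config) the scheduled job calls
    redirect_base_url: str       # base used to build the final source URL
    roles: list = field(default_factory=list)             # security: who may read
    category_mapping: dict = field(default_factory=dict)  # path prefix -> KB category
    default_category: str = "Uncategorized"               # assumed fallback

    def category_for(self, source_path: str) -> str:
        """Return the category mapped to the longest matching path prefix."""
        matches = [p for p in self.category_mapping if source_path.startswith(p)]
        if not matches:
            return self.default_category
        return self.category_mapping[max(matches, key=len)]


# Example values are placeholders, not real endpoints.
docs = ExternalSourceConfig(
    name="Documentation",
    target_knowledge_base="External Docs",
    source_endpoint="https://docs.example.com/api/pages",
    redirect_base_url="https://docs.example.com",
    roles=["employee"],
    category_mapping={
        "/admin/": "Administration",
        "/admin/security/": "Security",
    },
)
```

With this sketch, `docs.category_for("/admin/security/acl")` resolves to the more specific `"Security"` category, while a path that matches no prefix falls back to the default category.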
Once the content is extracted, compare it against the existing articles. Comparing a checksum of the extracted content with the checksum stored for the article is a good way to detect changes.
- If the content is new, create a new article.
- If the content exists in the KB but has been updated, update the article.
- If the content exists in the KB and has not been modified, ignore it.
- If the content exists in the KB but is no longer found in the content source, expire the wrapper article so that it is no longer available in search.
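The four cases above can be sketched as a small decision function. This is an illustrative Python sketch, assuming each piece of source content has a stable identifier and that a SHA-256 checksum of its content is stored with the article; the function names and the action labels are hypothetical.

```python
import hashlib


def checksum(text: str) -> str:
    """Return a SHA-256 hex digest of the content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(extracted: dict, existing: dict) -> dict:
    """Decide what to do with each article.

    extracted: source id -> content just pulled from the external source
    existing:  source id -> checksum stored with the article in the KB
    Returns:   source id -> "create" | "update" | "ignore" | "expire"
    """
    actions = {}
    for source_id, content in extracted.items():
        if source_id not in existing:
            actions[source_id] = "create"
        elif existing[source_id] != checksum(content):
            actions[source_id] = "update"
        else:
            actions[source_id] = "ignore"
    # Anything in the KB that the source no longer returns is expired.
    # This is only safe when `extracted` holds the complete source content.
    for source_id in existing:
        if source_id not in extracted:
            actions[source_id] = "expire"
    return actions


existing = {"a": checksum("A"), "b": checksum("B"), "c": checksum("C")}
extracted = {"a": "A", "b": "B (revised)", "d": "D"}
plan = plan_sync(extracted, existing)
```

Note that the expire step assumes `extracted` contains everything the source currently offers. If only a partial extract is available, skip that step so that untouched articles are not expired by mistake.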
Incremental crawl vs. Full crawl
Fine-tune the synchronization process based on how frequently the external content source is modified.
An Incremental crawl (adding new articles and updating existing ones) can be performed more frequently.
A Full crawl (which also compares existing content to detect deletions) should be done less frequently. Because this process places a heavy load on the source content system, plan the runs and throttle the requests.
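One way to combine the two crawl types with throttling is sketched below in Python. The intervals (hourly incremental, weekly full) are arbitrary assumptions to be tuned to how often the source changes, and `choose_crawl` and `throttled` are hypothetical helper names, not part of any product API.

```python
import time
from datetime import datetime, timedelta

# Assumed intervals; tune them to how often the source content changes.
INCREMENTAL_INTERVAL = timedelta(hours=1)
FULL_INTERVAL = timedelta(days=7)


def choose_crawl(now, last_incremental, last_full):
    """Return which crawl is due: "full", "incremental", or "skip"."""
    if now - last_full >= FULL_INTERVAL:
        return "full"
    if now - last_incremental >= INCREMENTAL_INTERVAL:
        return "incremental"
    return "skip"


def throttled(items, delay_seconds, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)
        yield item


now = datetime(2024, 1, 8, 12, 0)
mode = choose_crawl(now, now - timedelta(hours=2), now - timedelta(days=1))
```

During a Full crawl, wrapping the list of pages to request in `throttled(pages, delay_seconds=0.5)` spaces out the calls to the source system. The `sleep` parameter exists only so the pause can be replaced in tests.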
Each article created by this process also needs the values listed in the following table.
|Knowledge Base||Set the Knowledge Base according to the config object for that particular external source.|
|Category||Set the category based on the config object that maps an attribute of the source (like URL) to a particular category defined for that Knowledge Base.|
|Roles||The roles to which the external content is available.|
|Language||The language in which the imported article is available.|
|Text Body||It's advisable to strip off the HTML from the external content and store only the text of the article within the wrapper article.|
|Meta||You may use an external service to generate meta that will be associated with that article.|
|Valid to date||The date until which this wrapper article is available. Keep this period short so that stale articles are not indexed.|
|Workflow State||Set the article state to Published to ensure that the article is visible to all. Do not assign any workflows.|
|Click-through URL||The default Article URL needs to be replaced with the URL of the source. You will typically need to concatenate a base URL of the content source with the URL for that article.|
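The values in the table above could be assembled as follows. This is a minimal Python sketch using only the standard library: the dictionary keys are illustrative names rather than actual table columns, the `page` structure and the 30-day validity period are assumptions, and the HTML stripping is deliberately simple.

```python
from datetime import date, timedelta
from html.parser import HTMLParser
from urllib.parse import urljoin


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the text of an HTML fragment with whitespace collapsed.

    Simplification: text inside <script> or <style> is not filtered out,
    and a space is inserted at every tag boundary.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def build_wrapper_article(page, base_url, knowledge_base, category, roles,
                          today, valid_days=30, language="en"):
    """Build the field values for one wrapper article (hypothetical keys)."""
    return {
        "knowledge_base": knowledge_base,
        "category": category,
        "roles": list(roles),
        "language": language,
        "text": strip_html(page["html"]),
        "valid_to": (today + timedelta(days=valid_days)).isoformat(),
        "workflow_state": "published",
        # Base URL of the content source joined with the article's own path.
        "click_through_url": urljoin(base_url.rstrip("/") + "/",
                                     page["path"].lstrip("/")),
    }


page = {"path": "/guide/reset.html",
        "html": "<h1>Reset</h1><p>Open <b>Settings</b> &amp; click Reset.</p>"}
article = build_wrapper_article(page, "https://docs.example.com", "External Docs",
                                "Administration", ["employee"], date(2024, 1, 1))
```

Passing `today` explicitly keeps the function deterministic. A scheduled job would normally pass the current date and refresh the valid-to date on each successful sync, so that only articles missing from the source are left to lapse.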