Why can’t search be more like Google!

Well, if you have deployed Documentum before, you have probably heard this from at least one of your clients. The great thing about Google is the speed and simplicity of search execution (I won’t discuss search relevancy, since this can be very subjective). The problem with Google as it relates to ECM is that all content is created equal. In other words, all content that Google indexes is treated identically: there is no individual security assigned to each document, web page, etc. I am aware that you can configure Google to filter content located in different file stores/locations to limit access to certain groups of users. However, this is not true document-level security.

In the ECM world, one must be able to apply different permissions to different documents, and these permissions must be adhered to while performing searches. This used to be problematic for Google, until the release of the Google Search Appliance 5.0 (GSA). With GSA 5.0, there is now an Enterprise Connector for EMC Documentum.

The connector first crawls the Documentum repository using a superuser account. The crawling process extracts metadata and content from the repository, which GSA uses to generate its index. Users perform searches against the GSA, not against the Content Server directly.
When GSA returns results, it prompts the user to enter his/her credentials using the Webtop interface. The connector passes these credentials along with the search results to the Content Server. The Content Server evaluates whether the user has permission to view each document in the search results and then returns the authorized result set to the connector.

Finally, GSA displays only the results that the user has access to. Problem solved! Maybe…
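
The flow described above (search the index first, then check each hit against the repository's permissions) can be sketched in a few lines. This is only an illustration of the idea, not the connector's actual code: the `ACL` table, `can_view` and `filter_authorized` are hypothetical names, and the in-memory dictionary stands in for whatever permission check the Content Server really performs.

```python
from typing import Callable, Dict, List, Set

# Hypothetical stand-in for repository ACLs: object id -> users allowed to read it.
ACL: Dict[str, Set[str]] = {
    "doc-1": {"alice", "bob"},
    "doc-2": {"alice"},
    "doc-3": set(),
}


def can_view(user: str, object_id: str) -> bool:
    """Stand-in for the permission check the Content Server would perform."""
    return user in ACL.get(object_id, set())


def filter_authorized(
    user: str,
    hits: List[str],
    check: Callable[[str, str], bool] = can_view,
) -> List[str]:
    """Keep only the hits the user may see, preserving the index's ranking order."""
    return [object_id for object_id in hits if check(user, object_id)]


# Raw hits as the index might rank them, before any security filtering.
raw_hits = ["doc-3", "doc-1", "doc-2"]
visible_to_bob = filter_authorized("bob", raw_hits)
visible_to_alice = filter_authorized("alice", raw_hits)
```

Note that this approach costs one permission check per hit, so the work grows with the size of the result set.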

I have not implemented this solution provided by Google, and I still have a few questions:

What is the performance like, given that the connector has to communicate with the Content Server to evaluate permissions on the result set? This shouldn’t be a concern if the result set is small, but what happens if the result set is large?
Why is Webtop needed? It would be ideal if the connector provided an interface to authenticate directly with the Content Server.
How is document versioning handled (i.e. updated content, newer versions, etc.)? Are renditions supported?

13 responses to “Why can’t search be more like Google!”

quicktimeuser | December 20, 2007 at 3:29 pm | Reply

Actually, do we really want search to be like Google? Definitely not on the server and technical side (there are many better engines out there), but I don’t think even the user interface is worth striving for. Sure, it is simple, but honestly, its features are very limited and I can’t just accept that it is The Solution for an Enterprise Search function. What about faceted navigation and integration with the metadata attributes that we already have in the repository?

The thing that makes me most curious is why the internal search listing inside WebTop/DAM is so bad. There I would at least appreciate a first step to make it a bit more Google-like, but the potential for making it much better is so great with a Documentum repository as a foundation, especially with a “social computing” approach that integrates data from information analytics and user behaviour into the results. And why aren’t there any search analytics exposed inside DAM?
johnnygee | December 20, 2007 at 5:48 pm | Reply

I appreciate your comments, quicktimeuser. I didn’t mean to imply that ALL users want Documentum search to behave like Google. The ones I have heard asking for this are usually less tech-savvy and do not understand the power of Advanced Search.

As for search analytics, this is entirely dependent on what FAST provides – the Index/Search engine that comes with Documentum. Documentum OEMs this Instream product from FAST.
cristianorellana | December 21, 2007 at 3:08 pm | Reply

Johnny,

I implemented the GSA and the Documentum Connector, so let me answer some of your questions:

Why is Webtop needed?
This is because the results are in DRL format; when you click on a result, you are asked for your credentials.

How is document versioning handled?
The versions are handled by the DRL component of Webtop. If you click on a document that has a new version, the DRL component shows a link to the CURRENT version of the document.

Does the connector have to communicate with the Content Server to evaluate permissions?
The connector just accesses the Content Server using DQL queries in sets of 100 elements (return top 100), using the modify date and the last r_object_id crawled. The permissions are evaluated once the user clicks on a result.

We are fighting to change the results screen to put the document name in the link instead of the first line of the content (the default).

Well, I hope that you read this post. Best regards.
johnnygee | December 21, 2007 at 6:59 pm | Reply

Hi Cristian,
Thanks for responding. To follow up on your replies:

1) Having a full Webtop install just to support the DRL component seems excessive. Since Google has already developed code to “crawl” the repository for content, why can’t they create their own “DRL” component/service?

2) If I want to look for a document that has three versions, does GSA return three hits (one for each version)? Hopefully, it knows how to hide non-CURRENT versions. I would not want to see previous versions of the content show up in my search results.

3) As for the connector, what happens with the second group of 100 results? You say that the permissions are only evaluated after a user clicks on a result. What happens if the user has NONE permission on the document? Does he/she still see the “ghost” image of the document? If so, this is problematic. I remember that when 5.3 first shipped, it had similar problems. Imagine seeing a document called “JDoe Dismissal.Doc” even though you don’t have permissions on the actual content. The name of the document may have some significance.

Any further information would be appreciated. Thanks.
cristianorellana | December 21, 2007 at 7:27 pm | Reply

Hello,

1. In the GSA you configure a URL to show the results. This URL is a DRL, so you don’t need a full Webtop installation on the GSA side: the GSA shows the DRL with the r_object_id of the document, but when you click on it you go to your Webtop server.

2. In the result set you see only the current versions, but the “cache” link on the Google result can take you to a previous version.

3. About your doubts, here is an example of the DQL used by the GSA:
select i_chronicle_id, r_object_id, r_modify_date from dm_sysobject where r_object_type='dm_document' and ((r_modify_date = date('2007-11-27 20:01:02','yyyy-mm-dd hh:mi:ss') and r_object_id > '090007d2800efd1a') OR ( r_modify_date > date('2007-11-27 20:01:02','yyyy-mm-dd hh:mi:ss'))) order by r_modify_date,r_object_id ENABLE (return_top 100)
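
The query above pages through the repository using a checkpoint (the last modify date and r_object_id seen) rather than an offset. A minimal Python sketch of that pattern follows, assuming the crawler simply remembers the last row of each batch. The function names are hypothetical, `fetch_batch` is an in-memory stand-in for actually running the DQL, and nothing here is taken from the connector's source.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # (r_modify_date, r_object_id)
DATE_FORMAT = "yyyy-mm-dd hh:mi:ss"


def build_crawl_query(last_date: str, last_id: str, batch_size: int = 100) -> str:
    """Build the DQL for the next batch after the given checkpoint.

    Values are interpolated directly because they come from the crawler's own
    checkpoint, not from user input.
    """
    d = f"date('{last_date}','{DATE_FORMAT}')"
    return (
        "select i_chronicle_id, r_object_id, r_modify_date "
        "from dm_sysobject where r_object_type='dm_document' "
        f"and ((r_modify_date = {d} and r_object_id > '{last_id}') "
        f"OR (r_modify_date > {d})) "
        "order by r_modify_date,r_object_id "
        f"ENABLE (return_top {batch_size})"
    )


def fetch_batch(rows: List[Row], checkpoint: Row, batch_size: int) -> List[Row]:
    """In-memory stand-in for the query: same predicate, ordering and limit.

    Tuple comparison on (date, id) matches the DQL's
    "same date and larger id, OR later date" condition.
    """
    return sorted(r for r in rows if r > checkpoint)[:batch_size]


def crawl(rows: List[Row], start: Row, batch_size: int = 100) -> List[Row]:
    """Walk the whole set batch by batch, advancing the checkpoint each time."""
    checkpoint, seen = start, []
    while True:
        batch = fetch_batch(rows, checkpoint, batch_size)
        if not batch:
            return seen
        seen.extend(batch)
        checkpoint = batch[-1]  # last row of the batch becomes the new checkpoint


sample = [
    ("2007-11-27 20:01:02", "090007d2800efd1b"),
    ("2007-11-27 20:01:02", "090007d2800efd1a"),
    ("2007-11-28 09:15:00", "090007d2800efd00"),
]
crawled = crawl(sample, ("", ""), batch_size=2)
```

Because the r_object_id breaks ties between documents that share a modify date, no row is skipped or returned twice when a batch boundary falls in the middle of a group of identical timestamps.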

I don’t have the GSA available at the moment, so I can’t test the permissions issue in depth.

Regards.
cristianorellana | December 21, 2007 at 7:30 pm | Reply

I forgot something: the DQL is used by the crawler. When you search in the GSA you get all the matching results, not just the first 100.
johnnygee | December 21, 2007 at 9:08 pm | Reply

1) So, in theory, you could replace the “DRL” URL to point to some custom servlet that could serve up the content.

2) Good to know about “cache” link

3) Waiting to hear back about #3

Thanks again. I’m actually thinking about proposing GSA to a client if my concerns are alleviated.
cristianorellana | December 26, 2007 at 12:27 pm | Reply

1.- Right, you can write your own servlet
3.- Still without GSA

We will upgrade a client platform from 5.3 to D6. One step is to replace the FAST Index Server with a GSA. We are also developing an XSL to put the object_name in the Google results page.
johnnygee | December 26, 2007 at 2:31 pm | Reply

Another question: Can you configure GSA to put more weighting on the attribute values (vs. content) that are indexed? I heard that you can do this with FAST in the D6 version.
Pingback: Well if you dislike FAST now, you may like it even less when Microsoft acquires them « Ask Johnny! - Documentum Guru
Mahesh T | February 5, 2008 at 2:43 pm | Reply

Johnny,
I don’t think GSA can be configured to put more weighting on the attribute values (vs. content).
The only possibility is to configure GSA to consider r_modify_date and display results with a higher rank for new/updated documents.
In GSA terminology, this is possible through “date biasing”.
johnnygee | February 5, 2008 at 4:39 pm | Reply

Thanks for the additional info about GSA Mahesh.
swordgroup | February 5, 2008 at 9:58 pm | Reply

Hi Johnny,

Sword was very proud to develop the Documentum connector for Google GSA.

They liked it so much they bought it from us!!

Hope all is well, David Warren