A feasible way to get full-text search in UR

Yanni · #1 02-16-2005, 04:51 AM

Keyword indexing is undoubtedly the way to go for fast searches. At times, however, one needs a phrase search; anticipating that need and inserting a specific phrase as a keyword is often not practical.

SUGGESTION: As soon as the user types a second word in the search field, UR switches to phrase search. It finds the documents that contain the word with the fewest occurrences and does a full-text search only on those documents. Although slower than UR's normal keyword search, this will still be much faster than a full-text search on all documents.

EXAMPLE: I type the term "unlimited possibilities." UR knows that the keyword "unlimited" is found in 40 documents while "possibilities" occurs in some 200 documents. So it starts a full-text search for "unlimited possibilities" on the 40 documents that contain "unlimited." (Or, if time-effective, the total length of the documents that contain each word can be the factor that decides which documents are searched.) Using wildcards or regular expressions would of course make the process a bit more complicated, but it would still be faster than a brute-force full-text search.
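
A rough sketch of the suggestion in Python, for illustration only. It assumes a simple keyword index that maps each word to the set of documents containing it; the names build_keyword_index and phrase_search and the sample documents are made up here and say nothing about how UR stores its index internally.

```python
import re

def tokenize(text):
    # Lowercase the text and split it into runs of word characters.
    return re.findall(r"\w+", text.lower())

def build_keyword_index(docs):
    # Hypothetical keyword index: word -> set of ids of documents containing it.
    index = {}
    for doc_id, text in docs.items():
        for word in set(tokenize(text)):
            index.setdefault(word, set()).add(doc_id)
    return index

def phrase_search(phrase, docs, index):
    words = tokenize(phrase)
    if not words:
        return []
    # Pick the word that occurs in the fewest documents.
    rarest = min(words, key=lambda w: len(index.get(w, ())))
    candidates = index.get(rarest, set())
    # Full-text scan, but only over the candidate documents.
    pattern = re.compile(
        r"\b" + r"\W+".join(re.escape(w) for w in words) + r"\b",
        re.IGNORECASE,
    )
    return sorted(d for d in candidates if pattern.search(docs[d]))

docs = {
    1: "Unlimited possibilities await.",
    2: "The possibilities are not unlimited.",
    3: "Many possibilities here.",
}
index = build_keyword_index(docs)
print(phrase_search("unlimited possibilities", docs, index))  # [1]
```

Here "unlimited" is in two documents and "possibilities" in three, so only the two documents containing "unlimited" are scanned for the phrase.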

kevina · #2 02-17-2005, 09:31 AM

Your suggestion is a good one, and while simpler than implementing a complete full-text index, it will be a significant change that will require some research to test and implement.

This will be put on the list of things to do for a future release of Ultra Recall.

ExtraLean · #3 02-17-2005, 10:23 AM

Quote:

Originally posted by kevina

This will be put on the list of things to do for a future release of Ultra Recall.

Thanks for agreeing to look into this. As much as I like UR, the searching capability is probably the area that needs the most work, IMO. It does no good to build up a wealth of knowledge if it is hard to find it later. I sorely miss having the capability to do a full-text search!

danson · #4 01-24-2007, 07:04 PM

I bet there is some way to make this even cleverer.

Can you think of some kind of data structure that lets you index not only which words occur in which documents but also each word's offset from the beginning of the document?

I suppose the current index looks like:

WORD DOCUMENT-ID
================
wordA: 2 5 9 1 3
wordB: 2 12 99 293

You could update the index to show not just what documents the word lies in but also its position:

wordA: 2(4) 5(29)...
wordB: 2(5) 9(23)...

So wordA occurs in document 2, offset 4 and document 5, offset 29.

Then searching for the phrase "wordA wordB" would simply be a case of finding the documents that contain both words and checking whether any pair of offsets differs by 1 (or perhaps by some tolerance factor).

That final comparison can probably also be optimised with the right algorithm.
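
To make the idea concrete, here is a minimal sketch in Python of a positional index and a phrase lookup on top of it. Everything in it is an assumption for illustration: the function names, the use of word offsets rather than character offsets, and the nested-dictionary layout. It is not a description of UR's actual index. The offset comparison is done with set intersection, which is one possible approach; a merge over sorted offset lists would be another.

```python
import re

def tokenize(text):
    # Lowercase the text and split it into runs of word characters.
    return re.findall(r"\w+", text.lower())

def build_positional_index(docs):
    # Hypothetical layout: word -> {doc_id: [word offsets within that document]}.
    index = {}
    for doc_id, text in docs.items():
        for offset, word in enumerate(tokenize(text)):
            index.setdefault(word, {}).setdefault(doc_id, []).append(offset)
    return index

def phrase_search(phrase, index):
    words = tokenize(phrase)
    if not words or any(w not in index for w in words):
        return []
    # Only documents that contain every word can contain the phrase.
    common = set(index[words[0]])
    for w in words[1:]:
        common &= set(index[w])
    hits = []
    for doc_id in sorted(common):
        # Candidate start offsets: wherever the first word occurs.
        starts = set(index[words[0]][doc_id])
        for i, w in enumerate(words[1:], start=1):
            # The i-th word must sit exactly i offsets after the start.
            starts &= {p - i for p in index[w][doc_id]}
            if not starts:
                break
        if starts:
            hits.append(doc_id)
    return hits

docs = {2: "wordA wordB foo", 5: "wordB then wordA"}
index = build_positional_index(docs)
print(phrase_search("wordA wordB", index))  # [2]
```

Both documents contain both words, but only in document 2 do the offsets differ by exactly 1 in the right order. A tolerance factor could be added by accepting a small range of differences instead of an exact match.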

Perhaps though you do something much more clever already...

Daniel