#1
A feasible way to get full-text search in UR
Keyword indexing is undoubtedly the way to go for fast searches. At times, however, one needs a phrase search; anticipating that need and inserting a specific phrase as a keyword is often not practical.
SUGGESTION: As soon as the user types a second word in the search field, UR switches to phrase search. It finds the documents that contain the word with the fewest occurrences and does a full-text search only on those documents. Although slower than UR's normal keyword search, this will still be much faster than a full-text search on all documents.
EXAMPLE: I type the term "unlimited possibilities." UR knows that the keyword "unlimited" is found in 40 documents while "possibilities" occurs in some 200 documents. So it starts a full-text search for "unlimited possibilities" on the 40 documents that contain "unlimited." (Or, if time-effective, the total length of the documents that contain each word can be the factor that decides which documents are searched.)
Using wildcards or regular expressions would of course make the process a bit more complicated, but it would still be faster than a brute-force full-text search.
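To make the idea concrete, here is a rough Python sketch of the suggestion above. It is only an illustration under assumptions: the names `build_keyword_index` and `phrase_search` are hypothetical, the keyword index is modelled as a plain dict from word to document IDs, and nothing here reflects how UR actually stores its index.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split into alphanumeric words (an assumed, simplistic tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())

def build_keyword_index(docs):
    # Hypothetical stand-in for a keyword index: word -> set of document IDs.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def contains_phrase(text, words):
    # The "full-text" step: look for the words as a consecutive run of tokens.
    tokens = tokenize(text)
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def phrase_search(phrase, docs, index):
    words = tokenize(phrase)
    if not words:
        return []
    # Pick the word that appears in the fewest documents...
    rarest = min(words, key=lambda w: len(index.get(w, ())))
    candidates = index.get(rarest, set())
    # ...and scan only those documents for the whole phrase.
    return sorted(d for d in candidates if contains_phrase(docs[d], words))

docs = {
    1: "Unlimited possibilities await.",
    2: "The possibilities are endless.",
    3: "Unlimited storage, few possibilities.",
}
index = build_keyword_index(docs)
print(phrase_search("unlimited possibilities", docs, index))  # prints [1]
```

In this toy data "unlimited" occurs in two documents and "possibilities" in three, so only the two "unlimited" documents are scanned. Choosing by total document length instead of document count would only change the `key` used to pick the rarest word.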
#2
Your suggestion is a good one, and while simpler than implementing a complete full-text index, it will be a significant change that will require some research to test and implement.
This will be put on the list of things to do for a future release of Ultra Recall.
#3
Quote:
#4
I bet there is some way to make this even cleverer.
Can you think of some kind of data structure that allows you to index not only which words occur in which documents but also each word's offset from the beginning of the document? I suppose the current index looks like:

    WORD     DOCUMENT-ID
    ====================
    wordA:   2 5 9 1 3
    wordB:   2 12 99 293

You could update the index to show not just which documents the word occurs in but also its position:

    wordA:   2(4) 5(29)...
    wordB:   2(5) 9(23)...

So wordA occurs in document 2 at offset 4 and in document 5 at offset 29. Then searching for the phrase "wordA wordB" would simply be a case of finding the documents that contain both words and comparing offsets, keeping those where the offsets differ by 1 (or perhaps by some tolerance factor). That final comparison can probably also be optimised with the right algorithm.
Perhaps though you do something much more clever already...
Daniel
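For what it's worth, the offset idea above is usually called a positional index. Below is a small Python sketch of how the lookup might work; the function names and the nested-dict layout are assumptions made for illustration, not a description of UR's index, and it handles only exact adjacency (no tolerance factor).

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split into alphanumeric words (an assumed, simplistic tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    # Hypothetical layout: word -> {document ID -> list of word offsets}.
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index

def phrase_search_positional(phrase, index):
    words = tokenize(phrase)
    if not words or any(w not in index for w in words):
        return []
    # Only documents that contain every word can contain the phrase.
    common = set.intersection(*(set(index[w]) for w in words))
    hits = []
    for doc_id in sorted(common):
        # Candidate start offsets: every position of the first word.
        starts = set(index[words[0]][doc_id])
        for k, w in enumerate(words[1:], start=1):
            # Keep a start only if word k sits exactly k positions after it.
            starts &= {p - k for p in index[w][doc_id]}
        if starts:
            hits.append(doc_id)
    return hits

docs = {
    2: "alpha beta gamma wordA wordB",
    5: "wordB then wordA",
}
index = build_positional_index(docs)
print(phrase_search_positional("wordA wordB", index))  # prints [2]
```

No document text is read at query time here; the trade-off is a larger index, since every occurrence of every word is stored rather than one entry per document.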