problems with PDF import; and non-US encoding [Archive]

TMF

04-02-2007, 05:24 PM

No matter what kind of PDF I throw into UR, it only indexes and displays part of it, not the full document. I tried PDFs of all kinds from various sources.

-----

Search doesn't work perfectly with Eastern European encoding, I believe Win-1251.

If there is a word MANDOLÍNA (entered into a UR note item), searching for mandolína will not find it; it is only found when searching for MANDOLÍNA.

By contrast, if I leave out the accent mark over the I and write just MANDOLINA, it is found by searches for both MANDOLINA and mandolina.

So I guess UR can't tell that Í and í are the same character, one upper case and the other lower case.

I'm on default settings, UR Pro 3.03, just installed.

kevina

04-03-2007, 09:47 AM

Please send some sample pdf files that don't keyword properly to support@kinook.com for our evaluation.

Ultra Recall uses standard Windows APIs to do the conversion to lowercase, which are documented as using the Windows locale to control this conversion.

Are the 'accented' words that don't search properly in the same code page as your Windows code page? Our testing (in the 1252 code page) shows the keywords containing the uppercase version of the accented character - but I would expect this since that accented character isn't in codepage 1252.

Searching for the uppercase of the word should work as a less than ideal workaround...
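
A minimal sketch of one way this symptom could arise, assuming (this is not confirmed anywhere in the thread) that the lower-casing step only handles the letters A-Z. The helpers `ascii_lower` and `keyword_matches` are hypothetical names for illustration and are not part of Ultra Recall or any Windows API:

```python
def ascii_lower(text: str) -> str:
    """Lower-case only the letters A-Z; every other character is left as-is."""
    return "".join(chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch for ch in text)


def keyword_matches(stored_word: str, query: str) -> bool:
    """Compare a stored word and a query after the same limited lower-casing."""
    return ascii_lower(stored_word) == ascii_lower(query)


# The accented letter survives in upper case, so the stored keyword is "mandolÍna".
print(ascii_lower("MANDOLÍNA"))

print(keyword_matches("MANDOLÍNA", "mandolína"))  # False: "mandolÍna" != "mandolína"
print(keyword_matches("MANDOLÍNA", "MANDOLÍNA"))  # True: both become "mandolÍna"
print(keyword_matches("MANDOLINA", "mandolina"))  # True: plain ASCII lower-cases fine
```

Under that assumption the stored keyword keeps the uppercase accented character, which would explain why only the uppercase query matches.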

TMF

04-03-2007, 02:25 PM

VisBuildPro.pdf just downloaded from your web is a fine example:
copy/paste from my Acrobat Reader 6.0, into a new note in UR, resulted in keyword count of 6207. Moreover, that note loads very quickly.

Importing the same file into UR through drag&drop produced only about 100 keywords and took very long. Opening that document also takes extremely long, I'd say 10-15 minutes, but I didn't measure it and have no time to experiment further. I have only seen the speed issues with this one file, though, not with the ones imported previously.

Tested with fresh install, completely new file, UR 3.0.3, WinXP.

As for the accented words: I entered them using the keyboard, so they should be in the same code page as the one installed on Windows. It's Win-1250, now that I have researched it.

One other program, and Notepad.exe, handle the same search correctly. UR can also find the string when using the CTRL+F search inside the note. It's just the database-wide search that doesn't find it.

I don't know whether the code page can be set up within UR; if that is possible somewhere, I haven't done it.

kevina

04-03-2007, 02:45 PM

What version of Ultra Recall are you using, Standard or Professional? The standard version does not support keywording of pdf files...

When Ultra Recall imports a document that is keyworded, it retrieves the plain text of the item, then finds all relevant "keywords", lower-cases them, and saves them in an indexed table to optimize later searches. The lower-casing is done to facilitate case-insensitive searching, however, for some reason the text is not being lower-cased properly in your case. We will investigate this to see if a solution can be found.
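
The keywording steps described above can be sketched roughly as follows. This is an illustration only, not Ultra Recall's implementation: the in-memory dictionary stands in for the indexed table, the names `build_index` and `search` are hypothetical, and Python's `str.lower()` is Unicode-aware, so the accented word from the first post is lower-cased as intended here:

```python
import re
from collections import defaultdict

# For str patterns, \w is Unicode-aware, so accented letters count as word characters.
WORD_RE = re.compile(r"\w+")


def build_index(items: dict[int, str]) -> dict[str, set[int]]:
    """Map each lower-cased keyword to the ids of the items that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for item_id, plain_text in items.items():
        for word in WORD_RE.findall(plain_text):
            index[word.lower()].add(item_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Case-insensitive lookup: lower-case the query the same way as the keywords."""
    return index.get(query.lower(), set())


# Made-up sample items for the sketch.
items = {1: "Tuning a MANDOLÍNA", 2: "Notes on Visual Build Pro"}
index = build_index(items)

print(search(index, "mandolína"))  # {1}
print(search(index, "MANDOLÍNA"))  # {1}
```

The point of the design is that the stored keywords and the query go through the same lower-casing, so a search only works if that step treats accented letters consistently on both sides.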

When you copy the text from Adobe Acrobat Reader, then paste into Ultra Recall, you are essentially using that application to get the rich text of the item, then pasting it as rich text into Ultra Recall (which the Standard version does keyword)...

TMF

04-03-2007, 03:41 PM

I'm using the Pro version trial. I briefly trialed the Standard version a few days ago, but uninstalled it.

Btw, my text probably does get lower-cased. The upper-case encoding problem I encountered is completely independent of the PDF problem; it happens on manual entry. I'm sorry for the confusion. I tried to separate the two topics, but I thought it was a pretty minor issue, so I posted everything in one post.

kevina

04-05-2007, 06:22 PM

The pdf keywording (and performance issue) have been confirmed here with our own visbuildpro.pdf file (that you reference).

Apparently the pdf to text component used by Ultra Recall doesn't work properly with that pdf file. Other pdf files we tested do parse properly.