pdf import not indexed

hartmut · #1 03-26-2006, 01:45 PM

I have imported several pdf files into one UR database.
The pdf files were downloaded from different sites on the net.
Now I see in the Item Attributes window that some of the files are correctly indexed and others have only 3 keywords.
The files that are not indexed are all from the same site.
Does anyone here have an idea about the reason, or a suggestion for resolving this problem?
As the files are in the same UR database and I did not change the import options, it must be something with the pdf files.
Thank you

Hartmut

Leoram · #2 03-27-2006, 08:03 AM

I have experienced the same problem with some extensions (originally with mht): http://www.kinook.com/Forum/showthre...ight=Keyworded It also happens very often with pdf.

I've realized that the software or method used to generate the pdf document is what causes this issue (for example, if the document is generated by an application other than Adobe Acrobat, or by using the Print command from most programs).

I currently have many stored items that were keyworded incorrectly and have only a few keywords, many of them pdfs. I hope very soon there will be a way to efficiently re-keyword many items in batch, as this issue impacts the search feature negatively. And I don't have plans to purchase Adobe Acrobat (a proprietary pdf document maker).

Leoram

srdiamond · #3 03-27-2006, 03:50 PM

Quote:

Originally posted by hartmut
I have imported several pdf files into one UR database.
The pdf files were downloaded from different sites on the net.
Now I see in the Item Attributes window that some of the files are correctly indexed and others have only 3 keywords.
The files that are not indexed are all from the same site.
Does anyone here have an idea about the reason, or a suggestion for resolving this problem?
As the files are in the same UR database and I did not change the import options, it must be something with the pdf files.
Thank you

Hartmut

There are apparently two kinds of pdfs--text and image. I think one pdf can have both elements. All the pdfs I have ever come across, however, have been mostly image. (When only a few words are indexed, it's because they are the only text in the document.)

For UR to index image pdfs, it would need OCR software. I finally found a product that turns pdfs into Word documents--very imperfectly, as can be expected since we're really talking about a scanning task. It is by Nuance (formerly Scansoft) and costs about $50. I doubt it would be efficient for UR to include a component to do OCR on image pdfs. But the way to get them indexed is to turn them into something else, using software like the Scansoft product.

Leoram · #4 03-27-2006, 04:21 PM

Quote:

Originally posted by srdiamond
There are apparently two kinds of pdfs--text and image. I think one pdf can have both elements. All the pdfs I have ever come across, however, have been mostly image. (When only a few words are indexed, it's because they are the only text in the document.)

I don't know, but I have pdf documents, among many in my database, that do not get keyworded correctly. For instance, I have a 31-page item that is mostly text with very few images. In my UR environment I only get 26 keywords, and it has many words in it. I can't send it as an attachment here as this forum only allows files of at most 250 KB. I can send it by email upon request though.

Hope this helps.

Leoram

srdiamond · #5 03-27-2006, 04:30 PM

Quote:

Originally posted by Leoram
I don't know, but I have pdf documents, among many in my database, that do not get keyworded correctly. For instance, I have a 31-page item that is mostly text with very few images. In my UR environment I only get 26 keywords, and it has many words in it. I can't send it as an attachment here as this forum only allows files of at most 250 KB. I can send it by email upon request though.

Hope this helps.

Leoram

I think what you are calling text is really an image of text.

kinook · #6 03-28-2006, 07:46 AM

srdiamond is correct that some PDFs use images even for textual information, and UR can't parse those for keywords. UR also can't extract keywords from encrypted PDFs. If you can ZIP a couple of problem PDFs and send them to support@kinook.com, we'll investigate to see if that is the problem or if there is an issue with the PDF parsing component we're using. Thanks.

Leoram · #7 03-28-2006, 08:29 AM

Quote:

Originally posted by kinook
srdiamond is correct that some PDFs use images even for textual information, and UR can't parse those for keywords. UR also can't extract keywords from encrypted PDFs.

Could be, and I understand that possibility. I'm sending you right now a zipped file with two pdf documents to the address you provided. I hope this will be of help. Thanks.

Leoram

Textfarm · #8 03-28-2006, 12:29 PM

Chances are that some of your PDFs are encrypted, which would prevent them from being indexed properly. Open them in a text editor to find out!
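
A minimal sketch of that check in Python, assuming the usual convention that an encrypted PDF carries an /Encrypt entry in its trailer (or cross-reference stream) dictionary. The function names are made up for illustration, and a raw byte scan is only a heuristic, not a PDF parser: it can give a false positive if the bytes /Encrypt happen to occur elsewhere in the file.

```python
def pdf_looks_encrypted(data: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes contain an /Encrypt key."""
    # A PDF header is expected near the start of the file.
    if b"%PDF-" not in data[:1024]:
        return False
    # Encrypted PDFs normally reference an encryption dictionary by this key.
    return b"/Encrypt" in data


def file_looks_encrypted(path: str) -> bool:
    """Read a file in binary mode and apply the same heuristic."""
    with open(path, "rb") as handle:
        return pdf_looks_encrypted(handle.read())
```

Calling file_looks_encrypted on each problem file (the path is whatever your PDF is called) gives a quick first answer before sending anything off for closer inspection.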

hartmut · #9 03-28-2006, 12:49 PM

Well, I tried to convert these pdfs to other formats (plain text, Word or htm) with the trial versions of several converters (ABC pdf converter, pdf995), but without success.
Then I opened the pdf as a picture in my OCR program (ABBYY FineReader). Only with the OCR could I extract the text.
Therefore I think that these pdfs are images.
Thank you
regards
Hartmut

kevina · #10 03-28-2006, 03:42 PM

Of the two provided files, one seems to keyword as expected (at least some of the document) as I get 231 keywords. The other pdf document is not keywordable. One way to confirm this is to open the document in Adobe Acrobat, select and copy some text, then paste into a text editor. When this is done with text in the non-parsing pdf file, garbage is pasted (in notepad).
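
The copy-and-paste check can be roughly automated once the pasted text is in hand. The Python sketch below is a heuristic built on assumptions, not anything UR or Acrobat does: the function name, the 0.8 threshold and the list of accepted punctuation are all invented for illustration. It flags text made mostly of control characters or symbols, but it would not catch garbage that happens to consist of ordinary letters.

```python
def looks_like_readable_text(pasted: str, threshold: float = 0.8) -> bool:
    """Return True if most non-space characters look like normal prose."""
    chars = [c for c in pasted if not c.isspace()]
    if not chars:
        # Nothing was copied at all, which is typical for an image-only page.
        return False
    # Count letters, digits and a few common punctuation marks as "normal".
    normal = sum(1 for c in chars if c.isalnum() or c in ".,;:!?'\"()-")
    return normal / len(chars) >= threshold
```

Paste the copied text into a string and pass it in; a False result suggests the PDF's text layer is missing or unusable for keywording.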

Leoram · #11 03-29-2006, 11:11 AM

Thanks for your time in the investigation of this problem.

Please take an extra moment to look further into the one for which Kevina reports 231 keywords. In my case that one is indexed with only 26 keywords, as I have configured a list of very common words to be excluded (this, that, here, there, etc.). But please keep in mind that for a 31-page document with many, many words, even 231 keywords is few, so there is still something odd. The other document (the smaller one) shows only 6 keywords for me, and it is 3 pages long.

I performed the test you suggested. I copied a page of text using Adobe Reader version 7.0.7 from the document "10 things access reports.pdf" and then pasted it into a word processor, but the result is clean text. I'm slightly frustrated. I'm very confident that this will be clarified though. UR is an excellent program.
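
To illustrate how an exclusion list can shrink a keyword count, here is a small Python sketch. It assumes keywords are simply the distinct lowercase words in the text; how UR actually tokenizes documents is not stated in this thread, and the function name is hypothetical.

```python
import re


def count_keywords(text: str, excluded=frozenset()) -> int:
    """Count distinct lowercase words in text that are not excluded."""
    # Treat any run of letters or digits as one word (an assumption).
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(words - excluded)


sample = "This report is here and that report is there"
print(count_keywords(sample))                                     # 7
print(count_keywords(sample, {"this", "that", "here", "there"}))  # 3
```

The same document therefore reports different counts on two machines if their exclusion lists differ, which is one possible reason for 231 versus 26.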

Leoram

srdiamond · #12 03-29-2006, 12:24 PM

Quote:

Originally posted by Leoram
I performed the test you suggested. I copied a page of text using Adobe Reader version 7.0.7 from the document "10 things access reports.pdf" and then pasted it into a word processor, but the result is clean text.

That's not quite the test suggested. A word processor differs from a text editor like notepad; it will usually accept images. Certainly MS Word does. Still, you would see a difference....

Leoram · #13 03-29-2006, 01:04 PM

Quote:

Originally posted by srdiamond
That's not quite the test suggested. A word processor differs from a text editor like notepad

Yes. The test I performed was using Notepad. I used the wrong term when I wrote "word processor" above. I apologize.

Leoram

urer · #14 05-24-2006, 05:22 PM

I have experienced the same problem.

BumbleBee · #15 06-06-2006, 08:22 AM

The sample PDF gave 19 indexed entries.

I then loaded the PDF into OmniPage, ran it through text recognition, and saved it from there.
It is now even smaller, but produces close to 900 indexed entries.

So the problem is within the PDF and not in Ultra Recall, which I have just started testing.

Bernie