Kinook Software Forum

  #1  
Old 03-26-2006, 01:45 PM
hartmut hartmut is online now
Registered User
 
Join Date: 06-12-2005
Posts: 70
pdf import not indexed

I have imported several PDF files into one UR database.
The PDF files were downloaded from different sites on the net.
Now I see in the Item Attributes window that some of the files are correctly indexed and others have only 3 keywords.
The files which are not indexed are all from the same site.
Does anyone here have an idea about the cause, or a suggestion for resolving this problem?
As the files are in the same UR database and I did not change the import options, it must be something with the PDF files.
Thank you


Hartmut
  #2  
Old 03-27-2006, 08:03 AM
Leoram Leoram is offline
Registered User
 
Join Date: 08-03-2005
Posts: 119
I have experienced the same problem with some extensions (originally with mht): http://www.kinook.com/Forum/showthre...ight=Keyworded It happens very often with pdf too.

I've realized that the software or method used to generate the pdf document is what causes this issue (if the document is generated by some applications other than Adobe Acrobat for example, or by using the Print command from most programs).

I currently have many stored items, a lot of them pdfs, that were keyworded incorrectly and have only a few keywords. I hope there will soon be a way to efficiently re-keyword many items in batch, as this issue impacts the search feature negatively. And I don't have plans to purchase Adobe Acrobat (the proprietary pdf document maker).

Leoram

Last edited by Leoram; 03-27-2006 at 10:27 AM.
  #3  
Old 03-27-2006, 03:50 PM
srdiamond srdiamond is online now
Registered User
 
Join Date: 11-23-2004
Location: Los Angeles
Posts: 126
Re: pdf import not indexed

Quote:
Originally posted by hartmut
I have imported several PDF files into one UR database.
The PDF files were downloaded from different sites on the net.
Now I see in the Item Attributes window that some of the files are correctly indexed and others have only 3 keywords.
The files which are not indexed are all from the same site.
Does anyone here have an idea about the cause, or a suggestion for resolving this problem?
As the files are in the same UR database and I did not change the import options, it must be something with the PDF files.
Thank you


Hartmut
There are apparently two kinds of pdfs--text and image. I think one pdf can have both elements. All the pdfs I have ever come across, however, have been mostly image. (When only a few words are indexed, it's because it's the only text in the document.)


For UR to index image pdfs, it would need OCR software. I finally found a product that turns pdfs into Word documents--very imperfectly, as can be expected since we're really talking about a scanning task. It is by Nuance (formerly Scansoft) and costs about $50. I doubt it would be efficient for UR to include a component to do OCR on image pdfs. But the way to get them indexed is to turn them into something else, using software like the Scansoft product.
  #4  
Old 03-27-2006, 04:21 PM
Leoram Leoram is offline
Registered User
 
Join Date: 08-03-2005
Posts: 119
Re: Re: pdf import not indexed

Quote:
Originally posted by srdiamond
There are apparently two kinds of pdfs--text and image. I think one pdf can have both elements. All the pdfs I have ever come across, however, have been mostly image. (When only a few words are indexed, it's because it's the only text in the document.)
I don't know, but I have pdf documents, among many in my database, that do not get keyworded correctly. For instance, I have a 31-page item that is mostly text with very few images. In my UR environment I only get 26 keywords, and it has many words in it. I can't send it as an attachment here as this forum only allows files up to 250Kb. I can send it by email upon request though.

Hope this helps.

Leoram

Last edited by Leoram; 03-27-2006 at 04:24 PM.
  #5  
Old 03-27-2006, 04:30 PM
srdiamond srdiamond is online now
Registered User
 
Join Date: 11-23-2004
Location: Los Angeles
Posts: 126
Re: Re: Re: pdf import not indexed

Quote:
Originally posted by Leoram
I don't know, but I have pdf documents, among many in my database, that do not get keyworded correctly. For instance, I have a 31-page item that is mostly text with very few images. In my UR environment I only get 26 keywords, and it has many words in it. I can't send it as an attachment here as this forum only allows files up to 250Kb. I can send it by email upon request though.

Hope this helps.

Leoram
I think what you are calling text is really an image of text.
  #6  
Old 03-28-2006, 07:46 AM
kinook kinook is online now
Administrator
 
Join Date: 03-06-2001
Location: Colorado
Posts: 6,034
srdiamond is correct that some PDFs use images even for textual information, and UR can't parse those for keywords. UR also can't extract keywords from encrypted PDFs. If you can ZIP and send a couple problem PDFs to support@kinook.com we'll investigate to see if that is the problem or if there is an issue with the PDF parsing component we're using. Thanks.
  #7  
Old 03-28-2006, 08:29 AM
Leoram Leoram is offline
Registered User
 
Join Date: 08-03-2005
Posts: 119
Quote:
Originally posted by kinook
srdiamond is correct that some PDFs use images even for textual information, and UR can't parse those for keywords. UR also can't extract keywords from encrypted PDFs.
Could be, and I understand that possibility. I'm sending you right now a zipped file with two pdf documents to the address you provided. I hope this will be of help. Thanks.

Leoram
  #8  
Old 03-28-2006, 12:29 PM
Textfarm Textfarm is online now
Registered User
 
Join Date: 02-19-2006
Posts: 5
Chances are that some of your PDFs are encrypted, which would prevent them from being indexed properly. Open them in a text editor to find out!
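
What to look for in the text editor: an encrypted PDF normally declares an /Encrypt entry in its trailer (or in its cross-reference stream dictionary), and that key is usually readable in the raw file. The following is only a rough sketch of that check in Python; the function names are made up for illustration, and the check can be fooled, for example by an unencrypted file whose uncompressed content happens to contain the same token.

```python
import re


def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: report True if the raw PDF data contains an /Encrypt key.

    The word boundary keeps /EncryptMetadata from matching on its own.
    """
    return re.search(rb"/Encrypt\b", pdf_bytes) is not None


def file_looks_encrypted(path: str) -> bool:
    """Convenience wrapper (hypothetical helper): read a file and apply the check."""
    with open(path, "rb") as handle:
        return looks_encrypted(handle.read())
```

A file that passes this check can still fail to index for other reasons, such as containing only scanned page images.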
  #9  
Old 03-28-2006, 12:49 PM
hartmut hartmut is online now
Registered User
 
Join Date: 06-12-2005
Posts: 70
Well, I tried to convert these PDFs to other formats (plain text, Word or HTML) with the trial versions of several converters (ABC pdf converter, pdf995), but without success.
Then I opened the PDF as a picture in my OCR software (ABBYY FineReader). Only with the OCR could I extract the text.
Therefore I think that these PDFs are images.
Thank you
regards
Hartmut
  #10  
Old 03-28-2006, 03:42 PM
kevina kevina is online now
Registered User
 
Join Date: 03-27-2003
Posts: 825
Of the two provided files, one seems to keyword as expected (at least some of the document) as I get 231 keywords. The other pdf document is not keywordable. One way to confirm this is to open the document in Adobe Acrobat, select and copy some text, then paste into a text editor. When this is done with text in the non-parsing pdf file, garbage is pasted (in notepad).
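
For anyone who wants to repeat the copy-and-paste test on many files, the "is it garbage?" judgment can be roughly approximated in code. This is only a sketch under stated assumptions: the function name, the set of allowed characters, and any pass/fail threshold are illustrative choices, not anything UR or Acrobat does, and the measure is biased toward English text.

```python
import string

# Characters treated as "normal" for plain English text (an assumption).
ALLOWED = set(string.ascii_letters + string.digits + " \t\r\n.,;:!?'\"()-")


def readable_ratio(text: str) -> float:
    """Return the fraction of characters in text that belong to ALLOWED."""
    if not text:
        return 0.0
    readable = sum(1 for ch in text if ch in ALLOWED)
    return readable / len(text)
```

Text pasted from a normally parsable PDF should score close to 1.0, while control characters and other debris pull the ratio down. Garbage that happens to consist of ordinary letters (wrong letters rather than odd symbols) would not be caught this way.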
  #11  
Old 03-29-2006, 11:11 AM
Leoram Leoram is offline
Registered User
 
Join Date: 08-03-2005
Posts: 119
Thanks for your time in the investigation of this problem.

Please take an extra moment to look further into the one for which Kevina reports 231 keywords. In my case only 26 keywords are indexed for it, because I have configured a list of very common words to be excluded (this, that, here, there, etc.). But please keep in mind that for a 31-page document with many, many words, even 231 keywords is few, so there is still something odd. The other document (the smaller one) gives me only 6 keywords, and it is 3 pages long.
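
To illustrate how an exclusion list changes the count, here is a small sketch in Python. It assumes that keywords are simply the distinct lowercase words in the text; how UR actually tokenizes is not stated in this thread, so the function and its regular expression are assumptions made for illustration.

```python
import re


def keywords(text: str, excluded: frozenset = frozenset()) -> set:
    """Return the distinct lowercase words in text, minus the excluded ones."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {word for word in words if word not in excluded}


sample = "This report shows that the report is here and there."
common_words = frozenset({"this", "that", "here", "there"})

print(len(keywords(sample)))                # 9 distinct words
print(len(keywords(sample, common_words)))  # 5 once the common words are excluded
```

An exclusion list only removes the words it names, so on a long document it lowers the keyword count without explaining a very small one by itself.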

I performed the test you suggested. I copied a page of text using Adobe Reader version 7.0.7 from the document "10 things access reports.pdf" and then pasted it into a word processor, but the result is clean text. I'm slightly frustrated. I'm very confident that this will be clarified though. UR is an excellent program.

Leoram
  #12  
Old 03-29-2006, 12:24 PM
srdiamond srdiamond is online now
Registered User
 
Join Date: 11-23-2004
Location: Los Angeles
Posts: 126
Quote:
Originally posted by Leoram
I performed the test you suggested. I copied a page of text using Adobe Reader version 7.0.7 from the document "10 things access reports.pdf" and then pasted it into a word processor, but the result is clean text.
That's not quite the test suggested. A word processor differs from a text editor like notepad; it will usually accept images. Certainly MS Word does. Still, you would see a difference....
  #13  
Old 03-29-2006, 01:04 PM
Leoram Leoram is offline
Registered User
 
Join Date: 08-03-2005
Posts: 119
Quote:
Originally posted by srdiamond
That's not quite the test suggested. A word processor differs from a text editor like notepad
Yes. The test I performed was using Notepad. I confused the term by writing "word processor" above. I apologize.

Leoram
  #14  
Old 05-24-2006, 05:22 PM
urer urer is online now
Registered User
 
Join Date: 05-24-2006
Posts: 11
I have experienced the same problem.
Attached Files
File Type: zip problem pdf.zip (201.1 KB, 3127 views)
  #15  
Old 06-06-2006, 08:22 AM
BumbleBee BumbleBee is online now
Registered User
 
Join Date: 06-06-2006
Posts: 1
The sample PDF gave 19 indexed entries.

I then loaded the PDF into OmniPage, ran it through text recognition, and saved it from there.
It is now even smaller, but produces close to 900 indexed entries.

So the problem is within the PDF and not in Ultra Recall, which I have just started testing.

Bernie
Attached Files
File Type: zip how_to_win_the_adwords_game_omnipage.zip (112.8 KB, 7805 views)
All times are GMT -5.