#1
pdf import not indexed
I have imported several PDF files into one UR database.
The PDF files were downloaded from different sites on the net. Now I see in the Item Attributes window that some of the files are correctly indexed and others have only 3 keywords. The files which are not indexed are all from the same site. Does anyone here have an idea about the reason and/or a suggestion to resolve this problem? As the files are in the same UR database and I did not change the import options, it must be something with the PDF files.
Thank you
Hartmut
#2
I have experienced the same problem with some extensions (originally with mht): http://www.kinook.com/Forum/showthre...ight=Keyworded It happens very often with PDFs too.
I've realized that the software or method used to generate the PDF document is what causes this issue (for example, if the document is generated by an application other than Adobe Acrobat, or by using the Print command from most programs). I currently have many stored items, a lot of them PDFs, that were keyworded badly and have only a few keywords. I hope there will very soon be a way to efficiently re-keyword many items in batch, as this issue impacts the search feature negatively. And I don't have plans to purchase Adobe Acrobat (the proprietary PDF document maker).
Leoram
Last edited by Leoram; 03-27-2006 at 10:27 AM.
#3
Re: pdf import not indexed
Quote:
For UR to index image PDFs, it would need OCR software. I finally found a product that turns PDFs into Word documents--very imperfectly, as can be expected since we're really talking about a scanning task. It is by Nuance (formerly Scansoft) and costs about $50. I doubt it would be efficient for UR to include a component to do OCR on image PDFs. But the way to get them indexed is to turn them into something else, using software like the Scansoft product.
#4
Re: Re: pdf import not indexed
Quote:
Hope this helps.
Leoram
Last edited by Leoram; 03-27-2006 at 04:24 PM.
#5
Re: Re: Re: pdf import not indexed
Quote:
#6
srdiamond is correct that some PDFs use images even for textual information, and UR can't parse those for keywords. UR also can't extract keywords from encrypted PDFs. If you can ZIP and send a couple problem PDFs to support@kinook.com we'll investigate to see if that is the problem or if there is an issue with the PDF parsing component we're using. Thanks.
#7
Quote:
Leoram
#8
Chances are that some of your PDFs are encrypted, which would prevent them from being indexed properly. Open them in a text editor to find out!
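If you would rather script that check than eyeball each file, here is a rough sketch. It relies on the fact that an encrypted PDF references an `/Encrypt` dictionary from its trailer, so it simply looks for that name in the raw bytes. The function names `looks_encrypted` and `check_file` are made up for this example and have nothing to do with UR itself, and a plain byte search can give a false positive if the same string happens to appear elsewhere in the file.

```python
def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic only: report whether the raw PDF bytes mention /Encrypt.

    Encrypted PDFs carry an /Encrypt entry in the trailer (or in the
    cross-reference stream dictionary), which is stored uncompressed.
    """
    return b"/Encrypt" in pdf_bytes


def check_file(path: str) -> bool:
    """Read a PDF from disk and apply the heuristic (hypothetical helper)."""
    with open(path, "rb") as handle:
        return looks_encrypted(handle.read())


if __name__ == "__main__":
    # Tiny hand-written trailers, just to exercise the check.
    plain = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R >>\n%%EOF"
    locked = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R >>\n%%EOF"
    print(looks_encrypted(plain), looks_encrypted(locked))
```

A hit only tells you the file is probably encrypted; it says nothing about whether text extraction is actually forbidden by its permissions.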
#9
Well, I tried to convert these PDFs to other formats (plain text, Word or htm) with the test versions of several converters (ABC pdf converter, pdf995), but without success.
Then I opened the PDF as a picture in my OCR program (ABBYY FineReader). Only with the OCR could I extract the text. Therefore I think that these PDFs are images.
Thank you
regards
Hartmut
#10
Of the two provided files, one seems to keyword as expected (at least for some of the document), as I get 231 keywords. The other PDF document is not keywordable. One way to confirm this is to open the document in Adobe Acrobat, select and copy some text, then paste it into a text editor. When this is done with text from the non-parsing PDF file, garbage is pasted (in Notepad).
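For anyone who wants to run that copy-and-paste test on a batch of files, the "is this garbage?" judgment can be roughly automated. The sketch below is only an assumption about what garbage looks like (control characters, private-use glyph codes and the like); the names `readable_ratio` and `looks_like_garbage` and the 0.85 threshold are invented for illustration and are not part of UR or Acrobat.

```python
import string


def readable_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are letters, digits or
    ordinary ASCII punctuation. Returns 0.0 when there is no visible text."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    ok = sum(1 for ch in visible if ch.isalnum() or ch in string.punctuation)
    return ok / len(visible)


def looks_like_garbage(text: str, threshold: float = 0.85) -> bool:
    """Flag pasted text whose readable share falls below the (arbitrary)
    threshold. Empty text is flagged too, since nothing was extracted."""
    return readable_ratio(text) < threshold


if __name__ == "__main__":
    clean = "Ten things you can do with Access reports."
    pasted = "\x01\x02\x03\x04 \uf020\uf021\uf022 ab"
    print(looks_like_garbage(clean), looks_like_garbage(pasted))
```

This will miss garbage that happens to consist of ordinary letters (a badly mapped font can paste as plausible-looking letter soup), so treat a "clean" result as a hint rather than proof.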
#11
Thanks for your time in the investigation of this problem.
Please take an extra moment to look further into the one that in your case, Kevina, reports 231 keywords. In my case that one is indexed with only 26 keywords, as I have configured a list of very common words to be excluded (this, that, here, there, etc.). Please keep in mind, though, that for a 31-page document with many, many words, even 231 keywords is few, so there is still something odd. The other document (the smallest in size) gives me only 6 keywords, and it is 3 pages long.
I performed the test you suggested. I copied a page of text from the document "10 things access reports.pdf" using Adobe Reader version 7.0.7 and pasted it into a word processor, but the result is clean text.
I'm slightly frustrated, though I'm very confident that this will be clarified. UR is an excellent program.
Leoram
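On the excluded-words point above: two people can get quite different keyword counts from the same text purely because of the exclusion list. The sketch below is not how UR indexes documents; it is a minimal illustration, with a made-up function name `extract_keywords` and a simple "runs of ASCII letters" definition of a word, of how an exclusion list shrinks the count.

```python
import re


def extract_keywords(text: str, excluded: frozenset = frozenset()) -> set:
    """Return the unique lowercase words in text, minus the excluded ones.

    A "word" here is just a run of ASCII letters, which is an assumption
    made for this example.
    """
    words = re.findall(r"[A-Za-z]+", text.lower())
    return {word for word in words if word not in excluded}


if __name__ == "__main__":
    sample = "This report shows that the report here is there for this database."
    common = frozenset({"this", "that", "here", "there", "the", "is", "for"})
    print(len(extract_keywords(sample)), len(extract_keywords(sample, common)))
```

So a gap like 231 versus 26 can come partly from configuration, but an exclusion list cannot explain a long document yielding only a handful of keywords in the first place.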
#12
Quote:
#13
Quote:
Leoram
#14
I have experienced the same problem.
#15
The sample PDF gave 19 indexed entries.
I then loaded the PDF into OmniPage, ran it through text recognition, and saved it from there. It is now even smaller, but produces close to 900 indexed entries. So the problem is within the PDF and not in Ultra Recall, which I have just started testing.
Bernie