com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException:
TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@6f8fdac3
Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature;
read 0x615C316674725C7B, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
Is this the right place to ask? Or should I open an issue somewhere else?
Which version are you using to test? I just tried with the latest Importer snapshot and I was able to parse the first 3 of the 5 files you sent me. I sent you an email with the parsed output.
I will investigate the other two further (eq-all-en.xls and metod-2013.doc).
I found out why some are failing. You are using HTTP Collector and the content type from the HTTP response does not match the real content type of the file. For instance, your ...agreement.rtf file is identified as application/msword when returned by your web server. In reality the document is application/rtf.
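The exception message itself hints at this mismatch. As a minimal sketch (assuming the signature in the message is the first 8 bytes of the file read as a little-endian long, which is consistent with the "expected" value matching the OLE2 magic bytes D0 CF 11 E0 A1 B1 1A E1), the "read" value can be decoded back into the bytes found at the start of the file. The class and method names here are illustrative only and are not part of POI, Tika, or the Importer:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name), not part of POI or the Importer.
final class SignatureDecoder {

    // Turns the 64-bit signature from the exception message back into the
    // first 8 bytes of the file, assuming it was read in little-endian order.
    static String decode(long signature) {
        byte[] bytes = ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
        // ISO-8859-1 maps each byte to exactly one character.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Value copied from the "read 0x..." part of the exception above.
        System.out.println(decode(0x615C316674725C7BL)); // prints {\rtf1\a
    }
}
```

The decoded prefix `{\rtf1\a` is how an RTF file typically begins, which fits a document that is really RTF being handed to the OLE2-based Office parser.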
The Importer module will try to guess the content type when it is not provided. It otherwise uses the provided one (from the HTTP response in your case). So you should really be looking at why your web server does not return the proper content type for some documents. Fixing this will fix many of your errors.
If you do not control the site or it is otherwise impossible for you to do, we can make a feature request to always "guess" the content type instead of trusting the web server. I am not sure whether this could cause additional issues though (when guessing is wrong).
Your metod-2013.doc file fails due to a bug that will be fixed in the next Tika release: https://issues.apache.org/jira/browse/TIKA-2198
Until it is released, I have integrated that single fix myself; it is available in the latest snapshot version of the Importer.
Your eq-all-en.xls file is more tricky. I researched this specific exception and it seems to point to an issue with how the file was written, even if it can be opened properly by Excel. A known workaround is to save the file in the newer Microsoft format (.xlsx). I tested this and it works.
I am most interested in fixing org.apache.poi.poifs.filesystem.NotOLE2FileException. The rest are isolated cases.
feature request to always "guess" the content type instead of trusting the web server
That would be great. But could it be more flexible: compare the content type sent by the server with the real content type, and if they differ, use the real one instead of the one provided by the server?
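To make the suggestion concrete, here is a rough sketch of the "compare and prefer the real type" idea. It is only an illustration under simplifying assumptions: it recognises just two formats by their leading bytes, and the `ContentTypeResolver` class with its `sniff` and `resolve` methods are hypothetical names, not the Collector's or Tika's actual detection logic:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; real detection (e.g. in Tika)
// covers far more formats and edge cases.
final class ContentTypeResolver {

    private static final byte[] RTF_MAGIC =
            "{\\rtf".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private static boolean startsWith(byte[] content, byte[] prefix) {
        return content.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(content, prefix.length), prefix);
    }

    // Returns a content type guessed from the leading bytes, or null if unknown.
    static String sniff(byte[] content) {
        if (startsWith(content, RTF_MAGIC)) {
            return "application/rtf";
        }
        if (startsWith(content, PDF_MAGIC)) {
            return "application/pdf";
        }
        return null;
    }

    // Prefers the sniffed type when it is known and differs from the
    // server-declared one; otherwise keeps the declared type.
    static String resolve(String declared, byte[] content) {
        String detected = sniff(content);
        if (detected != null && !detected.equalsIgnoreCase(declared)) {
            return detected;
        }
        return declared;
    }

    public static void main(String[] args) {
        byte[] rtf = "{\\rtf1\\ansi Hello}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(resolve("application/msword", rtf));
    }
}
```

With this approach a wrong guess can only override the server when the sniffer positively recognises the bytes; anything it does not recognise falls back to the declared type.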
A new snapshot release of the HTTP Collector was made which now offers to detect the content type and character encoding instead of relying on the HTTP header response. It can be enabled like this (in your crawler section):
<documentFetcher detectContentType="true" detectCharset="true"/>
FYI, when used standalone, the Importer will always try to detect the content type and character encoding when not specified.
I can confirm the same behavior (org.apache.tika.exception.TikaException: TIKA-198) also for .jpg in .zip archives.
Norconex Collector 2.8.1
Did you try <documentFetcher detectContentType="true" detectCharset="true"/>? If so, can you confirm whether you are using pre-parse handlers? If you are, make sure you are not performing text operations on binary files (using restrictTo). If that is not your issue, please open a new ticket (since this one is closed) with details to reproduce.