TikaException: TIKA-198 · Issue #41 · Norconex/importer


There are a lot of exceptions of this kind in the log file.

com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@6f8fdac3
Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x615C316674725C7B, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
Is this the right place to ask, or should I open an issue somewhere else?
Which version are you using to test? I just tried with the latest Importer snapshot and was able to parse the first 3 of the 5 files you sent me. I sent you an email with the parsed output.
I will investigate the other two further (eq-all-en.xls and metod-2013.doc).
I found out why some are failing. You are using the HTTP Collector, and the content type in the HTTP response does not match the real content type of the file. For instance, your ...agreement.rtf file is identified as application/msword when returned by your web server. In reality the document is application/rtf.
The Importer module only tries to guess the content type when none is provided. Otherwise it uses the provided one (from the HTTP response in your case). So you should really look into why your web server does not return the proper content type for some documents. Fixing this will fix many of your errors.
If you do not control the site, or fixing it is otherwise impossible for you, we can make a feature request to always "guess" the content type instead of trusting the web server. I am not sure whether this could cause additional issues though (when the guess is wrong).
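As a side note, the exception message itself hints at what the file really is. POI reports the first eight bytes of the file as a little-endian 64-bit value, so turning the "read" value back into bytes shows the actual file header. Here is a minimal sketch using only the JDK; the class and method names are made up for illustration and are not part of POI, Tika, or the Importer:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class SignatureDecoder {

    // The signature is read as a little-endian long, so the first byte
    // of the file is the lowest-order byte of the reported value.
    static byte[] toFileBytes(long signature) {
        return ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    // Interprets the eight header bytes as ASCII text.
    static String toAscii(long signature) {
        return new String(toFileBytes(signature), StandardCharsets.US_ASCII);
    }

    // Renders the eight header bytes as space-separated hex pairs.
    static String toHex(long signature) {
        StringBuilder sb = new StringBuilder();
        for (byte b : toFileBytes(signature)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The "read" value from the exception: prints {\rtf1\a
        System.out.println(toAscii(0x615C316674725C7BL));
        // The "expected" value: prints D0 CF 11 E0 A1 B1 1A E1 (OLE2 magic)
        System.out.println(toHex(0xE11AB1A1E011CFD0L));
    }
}
```

The decoded header starts with `{\rtf1`, which is how an RTF document begins, and that is consistent with the file being RTF rather than an OLE2 (.doc) document.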
Your metod-2013.doc file fails due to a bug that will be fixed in the next Tika release: https://issues.apache.org/jira/browse/TIKA-2198

Until it is released, I have integrated that single fix myself. It is available in the latest snapshot version of the Importer.
Your eq-all-en.xls file is trickier. I researched this specific exception and it seems to point to an issue with how the file was written, even though Excel can open it properly. A known workaround is to save the file in the newer Microsoft format (.xlsx). I tested this and it works.
I am most interested in fixing org.apache.poi.poifs.filesystem.NotOLE2FileException. The rest are isolated cases.
feature request to always "guess" the content type instead of trusting the web server
That would be great. But could it be more flexible and compare the content type sent by the server with the real content type? If they differ, use the real one instead of the one provided by the server.
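To illustrate the comparison idea, here is a rough sketch. It is not how the Importer or Tika actually detects types: it only recognizes two well-known leading byte sequences (RTF and PDF), it assumes the declared type is a bare media type with no parameters, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ContentTypeCheck {

    // Well-known leading bytes ("magic numbers") for two formats.
    private static final byte[] RTF_MAGIC =
            "{\\rtf".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Returns a media type when the leading bytes are recognized, else null.
    static String sniff(byte[] head) {
        if (startsWith(head, RTF_MAGIC)) {
            return "application/rtf";
        }
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        return null;
    }

    // Prefers the sniffed type when it contradicts the declared one;
    // keeps the declared type when the content is not recognized.
    static String resolve(String declared, byte[] head) {
        String detected = sniff(head);
        if (detected != null && !detected.equalsIgnoreCase(declared)) {
            return detected;
        }
        return declared;
    }

    public static void main(String[] args) {
        // A file served as application/msword that really starts like RTF.
        byte[] head = "{\\rtf1\\ansi".getBytes(StandardCharsets.US_ASCII);
        System.out.println(resolve("application/msword", head));
    }
}
```

The sketch falls back to the server-provided type whenever the bytes are not recognized, so a wrong guess can only happen for the formats it knows about. A real detector needs many more signatures and has to handle container formats such as OLE2 and ZIP, where the leading bytes alone do not identify the document type.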
A new snapshot release of the HTTP Collector is available. It can now detect the content type and character encoding itself instead of relying on the HTTP response headers. You can enable this as follows (in your crawler section):
<documentFetcher detectContentType="true" detectCharset="true"/>
FYI, when used standalone, the Importer will always try to detect the content type and character encoding when not specified.
I can confirm the same behavior (org.apache.tika.exception.TikaException: TIKA-198) for .jpg files inside .zip archives as well.
Norconex Collector 2.8.1
Did you try <documentFetcher detectContentType="true" detectCharset="true"/>? If so, can you confirm whether you are using pre-parse handlers? If you are, have you made sure (using restrictTo) that you are not performing text operations on binary files? If that is not your issue, please open a new ticket (since this one is closed) with details to reproduce.