>>> doc = fitz.open(...)
>>> rect = fitz.Rect(0, 0, 50, 50) # put thumbnail in upper left corner
>>> pix = fitz.Pixmap("some.jpg") # an image file
>>> for page in doc:
...     page.insertImage(rect, pixmap = pix)
>>> doc.save(...)
Notes:
If the same image is already present in the PDF, only a reference to it is inserted, which saves considerable disk space and processing time. To detect this, existing PDF images must be compared with the new one. This is done by storing an MD5 code for each image in a table and comparing only the new image's code against its entries. The MD5 table is generated only when triggered by the first image insertion, which may therefore have an extended response time.
You can use this method to provide a background image for the page, such as a copyright notice, a watermark or a background color. You can also combine it with searchFor() to achieve a text marker effect.
The image may be inserted uncompressed, e.g. if a Pixmap is used or if the image has an alpha channel. Therefore, consider using deflate = True when saving the file.
The image content is stored in its original size, which may be much larger than the size at which it is displayed. Consider reducing the stored image size by using the pixmap option and shrinking or scaling the pixmap down first (see the Pixmap chapter). The file size savings can be very significant.
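The duplicate detection described in the first note can be pictured with a minimal sketch. This is not PyMuPDF's internal code: the `ImageStore` class, its method names and the integer reference numbers are hypothetical, and only `hashlib.md5` is a real library call. The sketch shows the two behaviours the note describes: the digest table is built lazily on the first insertion, and an image whose digest is already in the table yields the existing reference instead of a new copy.

```python
import hashlib


class ImageStore:
    """Toy model of a PDF's image objects (hypothetical, for illustration)."""

    def __init__(self, existing_images):
        # Reference number -> raw image bytes already present in the file.
        self.images = dict(enumerate(existing_images, start=1))
        self.digests = None  # MD5 table; built on the first insertion only

    def _build_table(self):
        # One pass over all existing images; this is the step that can make
        # the first insertion noticeably slower than later ones.
        self.digests = {
            hashlib.md5(data).digest(): ref for ref, data in self.images.items()
        }

    def insert(self, data):
        """Return a reference number for data, reusing a stored copy if any."""
        if self.digests is None:
            self._build_table()
        key = hashlib.md5(data).digest()
        if key in self.digests:
            return self.digests[key]  # duplicate: only a reference is needed
        ref = len(self.images) + 1
        self.images[ref] = data
        self.digests[key] = ref
        return ref


store = ImageStore([b"logo-bytes", b"photo-bytes"])
ref_dup = store.insert(b"logo-bytes")     # already present in the file
ref_new = store.insert(b"stamp-bytes")    # stored as a new image
ref_again = store.insert(b"stamp-bytes")  # now a duplicate as well
```

Comparing fixed-length digests instead of full image streams keeps each lookup cheap once the table exists, which is why only the first insertion pays the cost of hashing every existing image.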
getText(output = 'text')
Retrieves the text of a page. Depending on the output parameter, the results of the TextPage extract methods are returned.
If 'text' is specified, plain text is returned in the order specified during PDF creation (which is not necessarily the normal reading order). The result may not always look as expected; consider using (and probably modifying) the example program PDF2TextJS.py, which tries to re-arrange text according to the Western reading layout convention “from top-left to bottom-right”.
Parameters: output (str) – A string indicating the requested text format, one of "text" (default), "html", "json", "xml" or "xhtml".
Return type: string
Returns: The page’s text as one string.
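As a rough illustration of the re-arrangement into reading order mentioned above, text fragments can be sorted by position. This is not the code of PDF2TextJS.py; it is a minimal sketch that assumes each fragment is a hypothetical `(x0, y0, text)` tuple with the origin at the top-left of the page, and that fragments whose `y0` values lie within a chosen tolerance of a line's first fragment belong to that line. The function name `reading_order` and the tolerance value are assumptions as well.

```python
def reading_order(fragments, tolerance=3.0):
    """Sort (x0, y0, text) tuples top-to-bottom, then left-to-right.

    Fragments whose y0 lies within `tolerance` of the first fragment of the
    current line are treated as part of that line.
    """
    lines = []  # each entry: [reference_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: f[1]):
        if lines and abs(frag[1] - lines[-1][0]) <= tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag[1], [frag]])

    ordered = []
    for _, members in lines:
        # Within one line, order fragments by their left edge.
        ordered.extend(sorted(members, key=lambda f: f[0]))
    return ordered


fragments = [
    (300.0, 100.5, "world"),
    (72.0, 400.0, "Second paragraph"),
    (72.0, 100.0, "Hello"),
]
text = " ".join(f[2] for f in reading_order(fragments))
```

A simple sort like this breaks down for multi-column layouts, where a column should be read to its end before moving right, so it is only a starting point.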
Use this method to convert the document into a valid HTML version by wrapping it with appropriate header and trailer strings, as in the following snippet. Creating XML, XHTML or JSON documents works in exactly the same way. For XML and JSON you may also include an arbitrary filename like so: fitz.ConversionHeader("xml", filename = doc.name). Also see Controlling Quality of HTML Output.
>>> doc = fitz.open(...)
>>> ofile = open(doc.name + ".html", "w")
>>> ofile.write(fitz.ConversionHeader("html"))
>>> for page in doc: ofile.write(page.getText("html"))
>>> ofile.write(fitz.ConversionTrailer("html"))
>>> ofile.close()
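The same wrapping pattern can be written so that the output stream is closed even if extraction fails part-way. The sketch below is a stand-in, not PyMuPDF code: `write_wrapped` is a hypothetical helper, and the header, trailer and page fragments are placeholder strings standing in for the results of `fitz.ConversionHeader()`, `fitz.ConversionTrailer()` and `page.getText()`.

```python
import io


def write_wrapped(out, header, fragments, trailer):
    """Write header, every fragment in order, then trailer to a text stream."""
    out.write(header)
    for fragment in fragments:
        out.write(fragment)
    out.write(trailer)


# Placeholder strings; in real use they would come from the library calls.
header = "<html><body>\n"
trailer = "</body></html>\n"
pages = ["<p>page 1</p>\n", "<p>page 2</p>\n"]

# io.StringIO stands in for open(..., "w"); "with" closes it on any exit.
with io.StringIO() as out:
    write_wrapped(out, header, pages, trailer)
    result = out.getvalue()
```

Because the helper accepts any text stream, the same function serves for files on disk and for in-memory buffers.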
getTextBlocks(images = False)
Extract all text blocks as a Python list. Provides basic positioning information without the need to interpret the output of TextPage.extractJSON() or TextPage.extractXML(). The block sequence is as specified in the document. All lines of a block are concatenated into one string, separated by a space.
Parameters: images (bool) – Also extract image blocks. Default is False. This serves as a means to get complete page layout information. Only metadata, not the image data itself, is extracted. Use TextPage.extractJSON() to access the image data.
Return type: list
Returns: a list whose items have the following entries.