PyPDF2: Extracting the table of contents/outlines and their page numbers


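I am trying to extract the TOC/outlines and their page numbers from a PDF with Python (PyPDF2). I know about reader.outlines, but it does not return the correct page numbers.

Example PDF: https://www.annualreports.com/HostedData/AnnualReportArchive/l/NASDAQ_LOGM_2018.pdf

The output of reader.outlines is: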

[{'/Title': '2018 Highlights', '/Page': IndirectObject(5, 0), '/Type': '/Fit'},
{'/Title': 'Letter to Stockholders', '/Page': IndirectObject(6, 0), '/Type': '/Fit'}, 
{'/Title': 'Part I', '/Page': IndirectObject(10, 0), '/Type': '/Fit'}, 
[{'/Title': 'Item 1. Business', '/Page': IndirectObject(10, 0), '/Type': '/Fit'}, 
{'/Title': 'Item 1A. Risk Factors', '/Page': IndirectObject(19, 0), '/Type': '/Fit'}
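For example, Part I is not supposed to start on page 10. Am I missing something?
Does anyone have an alternative?
I have already tried PyMuPDF, Tabula, and the getDestinationPageNumber method, with no luck.
Thanks in advance.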


1 comment

@KJ I just read the PDF with PdfFileReader (from PyPDF2) and simply printed the outlines, which is why I find it strange.


Tags: python, pdf, pypdf2, tableofcontents


3 answers

Accepted answer (0 upvotes)


          
           
Martin Thoma's answer is exactly what I needed (PyMuPDF). Diblo Dk's answer is also an interesting workaround (PyPDF2).

I am quoting Martin Thoma's code here:
          
from typing import Dict

import fitz  # pip install pymupdf


def get_bookmarks(filepath: str) -> Dict[int, str]:
    # WARNING! One page can have multiple bookmarks!
    # Only the last title seen for a given page is kept.
    bookmarks = {}
    with fitz.open(filepath) as doc:
        # get_toc() is the current spelling of the older getToC() in recent PyMuPDF releases
        toc = doc.get_toc()  # [[lvl, title, page, …], …]
        for level, title, page in toc:
            bookmarks[page] = title
    return bookmarks


print(get_bookmarks("my.pdf"))
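The warning in the code above matters in practice: the dict is keyed by page, so a later bookmark on the same page overwrites an earlier one. In the question's output, 'Part I' and 'Item 1. Business' point at the same page object, so one of them would be lost. Below is a minimal sketch of one way around that, assuming the TOC has the [level, title, page] shape that the snippet above unpacks. The function name group_bookmarks_by_page and the sample data are hypothetical and are not part of PyMuPDF.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_bookmarks_by_page(toc: Sequence[Sequence]) -> Dict[int, List[str]]:
    """Map each page number to every bookmark title that points at it."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for entry in toc:
        # Only the first three fields are used, so longer entries also work.
        level, title, page = entry[:3]
        grouped[page].append(title)
    return dict(grouped)


# Made-up sample data in [level, title, page] form, not taken from a real PDF.
sample_toc = [
    [1, "Part I", 3],
    [2, "Item 1. Business", 3],
    [2, "Item 1A. Risk Factors", 12],
]
print(group_bookmarks_by_page(sample_toc))
```

With a real file, the list returned by the TOC call inside get_bookmarks could be passed to this function in place of sample_toc.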


          
           
You should refer to this: PDF outlines and their page numbers
            
           
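from PyPDF2 import PdfFileReader  # legacy API names (PyPDF2 before 3.0)

targetPDFFile = 'your_pdf_filename.pdf'
pdfFileObj = open(targetPDFFile, 'rb')
pdfReader = PdfFileReader(pdfFileObj)

# use the outline instead of bookmarks; the outline is more accurate than bookmarks
result = {}


def outline_dict(bookmark_list):
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call for nested (child) outline entries
            outline_dict(item)
        else:
            try:
                # getDestinationPageNumber is 0-based, so add 1
                pageNum = pdfReader.getDestinationPageNumber(item) + 1
                # print("key=" + str(pageNum) + ",title=" + item.title)
                # items on the same page overwrite each other
                result[pageNum] = item.title
            except Exception:
                print("except:" + str(item))


outline_dict(pdfReader.getOutlines())
pdfFileObj.close()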
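The recursive walk above discards the nesting depth and keeps only one title per page. Below is a sketch of a variant that keeps document order and depth, assuming the outline is a list in which a nested list holds the children of the preceding item (the shape shown in the question). The names flatten_outline and resolve_page are hypothetical. resolve_page stands in for whatever turns an outline item into a page number, for example a small wrapper around pdfReader.getDestinationPageNumber(item) + 1.

```python
from typing import Any, Callable, List, Tuple


def flatten_outline(
    outline: List[Any],
    resolve_page: Callable[[Any], int],
    level: int = 1,
) -> List[Tuple[int, str, int]]:
    """Return (level, title, page) tuples in document order."""
    rows: List[Tuple[int, str, int]] = []
    for item in outline:
        if isinstance(item, list):
            # A nested list holds the children of the previous item.
            rows.extend(flatten_outline(item, resolve_page, level + 1))
        else:
            rows.append((level, item["/Title"], resolve_page(item)))
    return rows


# Made-up stand-in data: plain dicts with a fake "page" key instead of real
# PyPDF2 outline objects, so the sketch runs without a PDF file.
sample_outline = [
    {"/Title": "Letter to Stockholders", "page": 2},
    {"/Title": "Part I", "page": 5},
    [
        {"/Title": "Item 1. Business", "page": 5},
        {"/Title": "Item 1A. Risk Factors", "page": 14},
    ],
]
for row in flatten_outline(sample_outline, lambda item: item["page"]):
    print(row)
```

Returning a list of tuples rather than a dict keyed by page means two entries on the same page both survive, and the level field can be used to indent titles when printing.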