Reading PDF text content with PDFBox

筑基期的小菜鸟

Licensed under CC 4.0 BY-SA

Category: Technical articles. Tags: java

Article link: https://blog.csdn.net/weixin_45949103/article/details/110449681

Reading PDF Text Content

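For this feature I needed to extract the text content from PDF files, and at first I had no idea how to do it. After reading a lot of posts online and doing a bit of research, I settled on PDFBox for the job. I'm recording the problems I ran into along the way here; advice from more experienced developers is welcome!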

The Maven coordinates are as follows:

    <dependency>
      <groupId>org.apache.pdfbox</groupId>
      <artifactId>pdfbox</artifactId>
      <version>1.8.8</version>
    </dependency>
    <dependency>
      <groupId>org.apache.pdfbox</groupId>
      <artifactId>fontbox</artifactId>
      <version>1.8.8</version>
    </dependency>

If the pdfbox and fontbox versions do not match, or the fontbox dependency is missing altogether, the following error occurs:
	Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/fontbox/afm/AFMParser
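A quick way to tell whether fontbox is missing is to look up the class named in the error before running any PDF code. The following is a minimal sketch using only the JDK; `ClasspathCheck` and `isClassPresent` are hypothetical names made up for this illustration, not part of PDFBox. Note that it only detects a missing jar. It cannot tell you whether the pdfbox and fontbox versions match.

```java
class ClasspathCheck {
    /** Returns true if the named class can be located on the current classpath. */
    static boolean isClassPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the NoClassDefFoundError message
        String fontboxClass = "org.apache.fontbox.afm.AFMParser";
        System.out.println("fontbox on classpath: " + isClassPresent(fontboxClass));
    }
}
```

If this prints `false`, the fontbox dependency is probably not being resolved, and the Maven coordinates are the first thing to check.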

Reading a local file (verbose version, practical):

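    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    // PDFBox 1.8.x package; in 2.0.x the class is org.apache.pdfbox.text.PDFTextStripper
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PdfTextReader {
        public static void main(String[] args) throws IOException {
            File file = new File("local file path");
            PDDocument pdDocument = null;
            // Back-end code usually receives an InputStream already, so creating the File can often be skipped
            try (InputStream inputStream = new FileInputStream(file)) {
                // Load the PDF document.
                // 1.8.x (e.g. 1.8.8): PDFParser accepts the InputStream directly, no cast needed.
                // 2.0.x (e.g. 2.0.8): PDFParser expects a RandomAccessRead. Casting a FileInputStream
                // to RandomAccessRead compiles but throws ClassCastException at runtime, so use
                // PDDocument.load(inputStream) there instead.
                PDFParser parser = new PDFParser(inputStream);
                /*
                 * Without this call you get:
                 * Exception in thread "main" java.io.IOException:
                 * You must call parse() before calling getDocument
                 */
                parser.parse();
                pdDocument = parser.getPDDocument();
                // Total number of pages
                int pages = pdDocument.getNumberOfPages();
                // Extract the text content
                PDFTextStripper stripper = new PDFTextStripper();
                // Output order: sort text by its position on the page
                stripper.setSortByPosition(true);
                stripper.setStartPage(1);
                stripper.setEndPage(pages);
                String text = stripper.getText(pdDocument);
                System.out.println(text);
            } finally {
                // Release the document even if extraction fails
                if (pdDocument != null) {
                    pdDocument.close();
                }
            }
        }
    }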

Reading a local file (simple version, of limited use):

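    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    // PDFBox 1.8.x package; in 2.0.x the class is org.apache.pdfbox.text.PDFTextStripper
    import org.apache.pdfbox.util.PDFTextStripper;

    public class SimplePdfTextReader {
        public static void main(String[] args) throws IOException {
            // Load the PDF document
            PDDocument pdDocument = PDDocument.load(new File("file path"));
            try {
                // Total number of pages
                int pages = pdDocument.getNumberOfPages();
                // Extract the text content
                PDFTextStripper stripper = new PDFTextStripper();
                // Output order: sort text by its position on the page
                stripper.setSortByPosition(true);
                stripper.setStartPage(1);
                stripper.setEndPage(pages);
                System.out.println(stripper.getText(pdDocument));
            } finally {
                // Closing resources is a good habit; finally makes it happen even on failure
                pdDocument.close();
            }
        }
    }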
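Both listings close the document explicitly. The sketch below has no PDFBox dependency and illustrates why the close call should be tied to the `try` statement itself: with try-with-resources, `close()` runs even when the body throws. `TrackedResource` and `extractAndFail` are hypothetical stand-ins invented for this demonstration. In PDFBox 2.0.x, `PDDocument` implements `Closeable`, so the same pattern should apply to it directly.

```java
class CloseDemo {
    /** Stand-in for a document handle; records whether close() was called. */
    static class TrackedResource implements AutoCloseable {
        boolean closed = false;

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Simulates text extraction that fails midway and returns the resource for inspection. */
    static TrackedResource extractAndFail() {
        TrackedResource resource = new TrackedResource();
        try (TrackedResource r = resource) {
            throw new IllegalStateException("simulated extraction failure");
        } catch (IllegalStateException e) {
            // close() has already run by the time this handler executes
        }
        return resource;
    }

    public static void main(String[] args) {
        // prints "closed after failure: true"
        System.out.println("closed after failure: " + extractAndFail().closed);
    }
}
```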

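Summary:

	That's all for reading text from PDF documents for now. I only researched as far as this feature required, so there are bound to be gaps. If you can't find the jar, leave a comment and I'll upload it!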