Requirement
When PDFBox is used to extract text from a PDF, the content can come out in the wrong order, because a PDF stores text as positioned fragments rather than as a structured document. If you need the text, extract it line by line, which makes the content easy to compare.
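To illustrate why line-by-line text is convenient for comparison, here is a minimal sketch that uses only the JDK. It assumes the text has already been extracted into a String (for example with PDFBox's `PDFTextStripper.getText`, optionally after `setSortByPosition(true)`). The class name `PdfLineCompare` and the methods `toLines` and `diff` are hypothetical helpers for this example, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: compares two extracted texts line by line.
class PdfLineCompare {

    // Split extracted text into lines, dropping blank lines and
    // collapsing runs of whitespace so layout noise is not reported as a change.
    static List<String> toLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String raw : text.split("\\R")) { // \R matches any line break
            String line = raw.trim().replaceAll("\\s+", " ");
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Return one message per line position (1-based) where the two lists differ.
    static List<String> diff(List<String> expected, List<String> actual) {
        List<String> out = new ArrayList<>();
        int max = Math.max(expected.size(), actual.size());
        for (int i = 0; i < max; i++) {
            String e = i < expected.size() ? expected.get(i) : "<missing>";
            String a = i < actual.size() ? actual.get(i) : "<missing>";
            if (!e.equals(a)) {
                out.add("line " + (i + 1) + ": expected [" + e + "] but was [" + a + "]");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample strings standing in for text extracted from two PDFs.
        String expected = "Invoice 001\nTotal: 100\n";
        String actual = "Invoice 001\r\n\r\nTotal:   120\n";
        for (String message : diff(toLines(expected), toLines(actual))) {
            System.out.println(message);
        }
    }
}
```

The comparison here is positional, so one inserted line shifts every line after it. A real diff algorithm would handle that case better; this sketch only shows the basic idea.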
Add the Maven dependency (the latest version as of 2023):
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId