Reemplazar texto dentro de un archivo PDF utilizando iText

Guardar Furio:

Im usar iText(5.5.13)la biblioteca para leer un .PDF y reemplazar un patrón dentro del archivo. El problema es que el patrón no se encuentra porque de alguna manera algunos caracteres extraños aparecen cuando la biblioteca lee el pdf.

Por ejemplo, en la frase:

"This is a test in order to see if the"

se convierte en éste cuando estoy tratando de leerlo:

[(This is a )9(te)-3(st)9( in o)-4(rd)15(er )-2(t)9(o)-5( s)8(ee)7( if t)-3(h)3(e )]

Así que si trataba de buscar y reemplazar "test", sin "test"palabra se encontraría en el PDF y no será reemplazado

aquí está el código que estoy usando:

public void processPDF(String src, String dest) {

    try {

      PdfReader reader = new PdfReader(src);
      PdfArray refs = null;
      PRIndirectReference reference = null;

      int nPages = reader.getNumberOfPages();

      for (int i = 1; i <= nPages; i++) {
        PdfDictionary dict = reader.getPageN(i);
        PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
        if (object.isArray()) {
          refs = dict.getAsArray(PdfName.CONTENTS);
          ArrayList<PdfObject> references = refs.getArrayList();

          for (PdfObject r : references) {

            reference = (PRIndirectReference) r;
            PRStream stream = (PRStream) PdfReader.getPdfObject(reference);
            byte[] data = PdfReader.getStreamBytes(stream);
            String dd = new String(data, "UTF-8");

            dd = dd.replaceAll("@pattern_1234", "trueValue");
            dd = dd.replaceAll("test", "tested");

            stream.setData(dd.getBytes());
          }

        }
        if (object instanceof PRStream) {
          PRStream stream = (PRStream) object;

          byte[] data = PdfReader.getStreamBytes(stream);
          String dd = new String(data, "UTF-8");
          System.out.println("content---->" + dd);
          dd = dd.replaceAll("@pattern_1234", "trueValue");
          dd = dd.replaceAll("This", "FIRST");

          stream.setData(dd.getBytes(StandardCharsets.UTF_8));
        }
      }
      PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
      stamper.close();
      reader.close();
    }

    catch (Exception e) {
    }
  }

MKL:

Como ya se ha mencionado en los comentarios y respuestas, PDF no es un formato destinado a la edición de texto . Es un formato final, y la información sobre el flujo de texto, su diseño, e incluso su asignación a Unicode es opcional.

Por lo tanto, aun suponiendo que la información opcional en glifos de mapeo para Unicode están presentes, la aproximación a esta tarea con iText puede parecer un poco insatisfactorio: El primero sería determinar la posición del texto en cuestión mediante una estrategia de extracción de texto personalizado, y luego continuar retirando el contenido actual del todo en esa posición con el PdfCleanUpProcessor, y finalmente extraer el texto de reemplazo en el hueco.

En esta respuesta me gustaría presentar una clase de ayuda que permite combinar los dos primeros pasos, encontrar y eliminar el texto existente, con la ventaja de que, efectivamente, sólo el texto se elimina, no también cualquier gráficos de fondo, etc. , como en el caso de PdfCleanUpProcessorredacción. El ayudante devuelve además las posiciones del texto eliminado permitiendo estampación de reemplazo de la misma.

La clase de ayuda se basa en la PdfContentStreamEditorpresentada en esta respuesta anterior . Por favor utilice la versión de esta clase en github , sin embargo, como la clase original se ha mejorado un poco desde la concepción.

La SimpleTextRemoverclase de ayuda ilustra lo que es necesario eliminar correctamente el texto de un PDF. En realidad, es limitado en algunos aspectos:

Sólo se reemplaza texto en los flujos de contenido real de la página.

Para sustituir también el texto en XObjects incrustados, uno tiene que repetir por los recursos XObject de la página respectiva en cuestión de forma recursiva y también aplicar el editor de ellos.
Es "simple" de la misma forma en que el SimpleTextExtractionStrategyes: Asume el texto muestra las instrucciones que aparezcan en el contenido en orden de lectura.

También para el trabajo con flujos de contenido para los que el orden es diferente y las instrucciones que debe ser ordenada, y esto implica que todas las instrucciones recibidas y la información pertinente rendir deben ser guardados hasta el final de la página, no sólo unos pocos instrucción a la vez. A continuación, la información de render puede ser ordenada, para eliminar secciones se pueden identificar en el ordenado rendir información, las instrucciones asociadas pueden ser manipulados, y, finalmente, las instrucciones se pueden almacenar.
No trata de identificar las brechas entre los glifos que representan visualmente un espacio en blanco, mientras que en realidad no hay glifo en absoluto.

Para determinar las deficiencias del código debe ampliarse para comprobar si dos glifos consecutivos exactamente seguirán unos a otros o si existe una brecha o un salto de línea.
Al calcular la distancia a la licencia de donde se extrae un glifo, que aún no toma el espacio entre caracteres y palabras en cuenta.

Para mejorar esto, el cálculo de ancho de glifo debe ser mejorada.

Teniendo en cuenta el ejemplo extracto de su flujo de contenido, sin embargo, estas restricciones probablemente no obstaculizar usted también.

public class SimpleTextRemover extends PdfContentStreamEditor {
    public SimpleTextRemover() {
        super (new SimpleTextRemoverListener());
        ((SimpleTextRemoverListener)getRenderListener()).simpleTextRemover = this;
    }

    /**
     * <p>Removes the string to remove from the given page of the
     * document in the PDF reader the given PDF stamper works on.</p>
     * <p>The result is a list of glyph lists each of which represents
     * a match can can be queried for position information.</p>
     */
    public List<List<Glyph>> remove(PdfStamper pdfStamper, int pageNum, String toRemove) throws IOException {
        if (toRemove.length()  == 0)
            return Collections.emptyList();

        this.toRemove = toRemove;
        cachedOperations.clear();
        elementNumber = -1;
        pendingMatch.clear();
        matches.clear();
        allMatches.clear();
        editPage(pdfStamper, pageNum);
        return allMatches;
    }

    /**
     * Adds the given operation to the cached operations and checks
     * whether some cached operations can meanwhile be processed and
     * written to the result content stream.
     */
    @Override
    protected void write(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        cachedOperations.add(new ArrayList<>(operands));

        while (process(processor)) {
            cachedOperations.remove(0);
        }
    }

    /**
     * Removes any started match and sends all remaining cached
     * operations for processing.
     */
    @Override
    public void finalizeContent() {
        pendingMatch.clear();
        try {
            while (!cachedOperations.isEmpty()) {
                if (!process(this)) {
                    // TODO: Should not happen, so warn
                    System.err.printf("Failure flushing operation %s; dropping.\n", cachedOperations.get(0));
                }
                cachedOperations.remove(0);
            }
        } catch (IOException e) {
            throw new ExceptionConverter(e);
        }
    }

    /**
     * Tries to process the first cached operation. Returns whether
     * it could be processed.
     */
    boolean process(PdfContentStreamProcessor processor) throws IOException {
        if (cachedOperations.isEmpty())
            return false;

        List<PdfObject> operands = cachedOperations.get(0);
        PdfLiteral operator = (PdfLiteral) operands.get(operands.size() - 1);
        String operatorString = operator.toString();

        if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            return processTextShowingOp(processor, operator, operands);

        super.write(processor, operator, operands);
        return true;
    }

    /**
     * Tries to processes a text showing operation. Unless a match
     * is pending and starts before the end of the argument of this
     * instruction, it can be processed. If the instructions contains
     * a part of a match, it is transformed to a TJ operation and
     * the glyphs in question are replaced by text position adjustments.
     * If the original operation had a side effect (jump to next line
     * or spacing adjustment), this side effect is explicitly added.
     */
    boolean processTextShowingOp(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        PdfObject object = operands.get(operands.size() - 2);
        boolean isArray = object instanceof PdfArray;
        PdfArray array = isArray ? (PdfArray) object : new PdfArray(object);
        int elementCount = countStrings(object);

        // Currently pending glyph intersects parameter of this operation -> cannot yet process
        if (!pendingMatch.isEmpty() && pendingMatch.get(0).elementNumber < processedElements + elementCount)
            return false;

        // The parameter of this operation is subject to a match -> copy as is
        if (matches.size() == 0 || processedElements + elementCount <= matches.get(0).get(0).elementNumber || elementCount == 0) {
            super.write(processor, operator, operands);
            processedElements += elementCount;
            return true;
        }

        // The parameter of this operation contains glyphs of a match -> manipulate 
        PdfArray newArray = new PdfArray();
        for (int arrayIndex = 0; arrayIndex < array.size(); arrayIndex++) {
            PdfObject entry = array.getPdfObject(arrayIndex);
            if (!(entry instanceof PdfString)) {
                newArray.add(entry);
            } else {
                PdfString entryString = (PdfString) entry;
                byte[] entryBytes = entryString.getBytes();
                for (int index = 0; index < entryBytes.length; ) {
                    List<Glyph> match = matches.size() == 0 ? null : matches.get(0);
                    Glyph glyph = match == null ? null : match.get(0);
                    if (glyph == null || processedElements < glyph.elementNumber) {
                        newArray.add(new PdfString(Arrays.copyOfRange(entryBytes, index, entryBytes.length)));
                        break;
                    }
                    if (index < glyph.index) {
                        newArray.add(new PdfString(Arrays.copyOfRange(entryBytes, index, glyph.index)));
                        index = glyph.index;
                        continue;
                    }
                    newArray.add(new PdfNumber(-glyph.width));
                    index++;
                    match.remove(0);
                    if (match.isEmpty())
                        matches.remove(0);
                }
                processedElements++;
            }
        }
        writeSideEffect(processor, operator, operands);
        writeTJ(processor, newArray);

        return true;
    }

    /**
     * Counts the strings in the given argument, itself a string or
     * an array containing strings and non-strings.
     */
    int countStrings(PdfObject textArgument) {
        if (textArgument instanceof PdfArray) {
            int result = 0;
            for (PdfObject object : (PdfArray)textArgument) {
                if (object instanceof PdfString)
                    result++;
            }
            return result;
        } else 
            return textArgument instanceof PdfString ? 1 : 0;
    }

    /**
     * Writes side effects of a text showing operation which is going to be
     * replaced by a TJ operation. Side effects are line jumps and changes
     * of character or word spacing.
     */
    void writeSideEffect(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        switch (operator.toString()) {
        case "\"":
            super.write(processor, OPERATOR_Tw, Arrays.asList(operands.get(0), OPERATOR_Tw));
            super.write(processor, OPERATOR_Tc, Arrays.asList(operands.get(1), OPERATOR_Tc));
        case "'":
            super.write(processor, OPERATOR_Tasterisk, Collections.singletonList(OPERATOR_Tasterisk));
        }
    }

    /**
     * Writes a TJ operation with the given array unless array is empty.
     */
    void writeTJ(PdfContentStreamProcessor processor, PdfArray array) throws IOException {
        if (!array.isEmpty()) {
            List<PdfObject> operands = Arrays.asList(array, OPERATOR_TJ);
            super.write(processor, OPERATOR_TJ, operands);
        }
    }

    /**
     * Analyzes the given text render info whether it starts a new match or
     * finishes / continues / breaks a pending match. This method is called
     * by the {@link SimpleTextRemoverListener} registered as render listener
     * of the underlying content stream processor.
     */
    void renderText(TextRenderInfo renderInfo) {
        elementNumber++;
        int index = 0;
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos()) {
            int matchPosition = pendingMatch.size();
            pendingMatch.add(new Glyph(info, elementNumber, index));
            if (!toRemove.substring(matchPosition, matchPosition + info.getText().length()).equals(info.getText())) {
                reduceToPartialMatch();
            }
            if (pendingMatch.size() == toRemove.length()) {
                matches.add(new ArrayList<>(pendingMatch));
                allMatches.add(new ArrayList<>(pendingMatch));
                pendingMatch.clear();
            }
            index++;
        }
    }

    /**
     * Reduces the current pending match to an actual (partial) match
     * after the addition of the next glyph has invalidated it as a
     * whole match.
     */
    void reduceToPartialMatch() {
        outer:
        while (!pendingMatch.isEmpty()) {
            pendingMatch.remove(0);
            int index = 0;
            for (Glyph glyph : pendingMatch) {
                if (!toRemove.substring(index, index + glyph.text.length()).equals(glyph.text)) {
                    continue outer;
                }
                index++;
            }
            break;
        }
    }

    String toRemove = null;
    final List<List<PdfObject>> cachedOperations = new LinkedList<>();

    int elementNumber = -1;
    int processedElements = 0;
    final List<Glyph> pendingMatch = new ArrayList<>();
    final List<List<Glyph>> matches = new ArrayList<>();
    final List<List<Glyph>> allMatches = new ArrayList<>();

    /**
     * Render listener class used by {@link SimpleTextRemover} as listener
     * of its content stream processor ancestor. Essentially it forwards
     * {@link TextRenderInfo} events and ignores all else.
     */
    static class SimpleTextRemoverListener implements RenderListener {
        @Override
        public void beginTextBlock() { }

        @Override
        public void renderText(TextRenderInfo renderInfo) {
            simpleTextRemover.renderText(renderInfo);
        }

        @Override
        public void endTextBlock() { }

        @Override
        public void renderImage(ImageRenderInfo renderInfo) { }

        SimpleTextRemover simpleTextRemover = null;
    }

    /**
     * Value class representing a glyph with information on
     * the displayed text and its position, the overall number
     * of the string argument of a text showing instruction
     * it is in and the index at which it can be found therein,
     * and the width to use as text position adjustment when
     * replacing it. Beware, the width does not yet consider
     * character and word spacing!
     */
    public static class Glyph {
        public Glyph(TextRenderInfo info, int elementNumber, int index) {
            text = info.getText();
            ascent = info.getAscentLine();
            base = info.getBaseline();
            descent = info.getDescentLine();
            this.elementNumber = elementNumber;
            this.index = index;
            this.width = info.getFont().getWidth(text);
        }

        public final String text;
        public final LineSegment ascent;
        public final LineSegment base;
        public final LineSegment descent;
        final int elementNumber;
        final int index;
        final float width;
    }

    final PdfLiteral OPERATOR_Tasterisk = new PdfLiteral("T*");
    final PdfLiteral OPERATOR_Tc = new PdfLiteral("Tc");
    final PdfLiteral OPERATOR_Tw = new PdfLiteral("Tw");
    final PdfLiteral OPERATOR_Tj = new PdfLiteral("Tj");
    final PdfLiteral OPERATOR_TJ = new PdfLiteral("TJ");
    final static List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    final static Glyph[] EMPTY_GLYPH_ARRAY = new Glyph[0];
}

( SimpleTextRemover clase de ayuda)

Se puede utilizar de esta manera:

PdfReader pdfReader = new PdfReader(SOURCE);
PdfStamper pdfStamper = new PdfStamper(pdfReader, RESULT_STREAM);
SimpleTextRemover remover = new SimpleTextRemover();

System.out.printf("\ntest.pdf - Test\n");
for (int i = 1; i <= pdfReader.getNumberOfPages(); i++)
{
    System.out.printf("Page %d:\n", i);
    List<List<Glyph>> matches = remover.remove(pdfStamper, i, "Test");
    for (List<Glyph> match : matches) {
        Glyph first = match.get(0);
        Vector baseStart = first.base.getStartPoint();
        Glyph last = match.get(match.size()-1);
        Vector baseEnd = last.base.getEndPoint();
        System.out.printf("  Match from (%3.1f %3.1f) to (%3.1f %3.1f)\n", baseStart.get(I1), baseStart.get(I2), baseEnd.get(I1), baseEnd.get(I2));
    }
}

pdfStamper.close();

( RemovePageTextContent prueba testRemoveTestFromTest)

con la siguiente salida de la consola para mi archivo de prueba:

test.pdf - Test
Page 1:
  Match from (134,8 666,9) to (177,8 666,9)
  Match from (134,8 642,0) to (153,4 642,0)
  Match from (172,8 642,0) to (191,4 642,0)

y las apariciones de "Prueba" faltante en esas posiciones en la salida PDF.

En lugar de dar salida a las coordenadas de los partidos, puede usarlos para extraer texto de sustitución en la posición en cuestión.

Reemplazar texto dentro de un archivo PDF utilizando iText

Supongo que te gusta