How can I efficiently write the contents (700,000 lines) of a for loop to a file in Java?

adjective_noun:

I wrote the following code to fetch results as XML responses and write some of their contents to a file from Java. It does this by receiving one XML response for each of about 700,000 queries to a public database.

However, before the code can write to the file, it stops with some random exception (from the server) at a random position in the code. I tried writing to the file from inside the for loop itself, but was not able to. So I tried storing the chunks from the received responses in a Java HashMap and writing the HashMap to the file in a single call. But before the code receives all the responses in the for loop and stores them in the HashMap, it stops with some exception (maybe at the 15,000th iteration!). Is there any other efficient way to write to a file in Java when this many iterations are needed to fetch the data?

The local file I use for this code is here.

My code is:

import java.io.BufferedReader;              

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringWriter;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
import org.json.XML;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;


public class random {

    static FileWriter fileWriter;
    static PrintWriter writer;

    public static void main(String[] args) {

        // Hashmap to store the MeSH values for each PMID 
        Map<String, String> universalMeSHMap = new HashMap<String, String>();

        try {

            // FileWriter for MeSH terms
            fileWriter = new FileWriter("/home/user/eclipse-workspace/pmidtomeshConverter/src/main/resources/outputFiles/pmidMESH.txt", true);
            writer = new PrintWriter(fileWriter);

            // Read the PMIDS from this file 
            String filePath = "file_attached_to_Post.txt";
            String line = null;
            BufferedReader bufferedReader = new BufferedReader(new FileReader(filePath));


            String[] pmidsAll = null;

            int x = 0;
            try {
                // Read only the first line, which holds the comma-separated PMIDs
                while(((line = bufferedReader.readLine()) != null) && x < 1) {
                    pmidsAll = line.split(",");
                    x++;
                }   
            }
            finally {   
                bufferedReader.close();         
            }

            // List of strings containing the PMIDs
            List<String> pmidList = Arrays.asList(pmidsAll);

            // Iterate through the list of PMIDs to fetch the XML files from PubMed using eUtilities API service from PubMed
            for (int i = 0; i < pmidList.size(); i++) {


                String baseURL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&rettype=abstract&id=";

                // Process to get the PMIDs
                String indPMID_p0 = pmidList.get(i).toString().replace("[", "");
                String indPMID_p1 = indPMID_p0.replace("]", "");
                String indPMID_p2 = indPMID_p1.replace("\\", "");
                String indPMID_p3 = indPMID_p2.replace("\"", "");

                // Fetch XML response from the eUtilities into a document object 
                Document doc = parseXML(new URL(baseURL + indPMID_p3));

                // Convert the retrieved XMl into a Java String 
                String xmlString = xml2String(doc); // Converts xml from doc into a string

                // Convert the Java String into a JSON Object
                JSONObject jsonWithMeSH = XML.toJSONObject(xmlString);  // Converts the xml-string into JSON

                // -------------------------------------------------------------------
                // Getting the MeSH terms from a JSON Object
                // -------------------------------------------------------------------
                JSONObject ind_MeSH = jsonWithMeSH.getJSONObject("PubmedArticleSet").getJSONObject("PubmedArticle").getJSONObject("MedlineCitation");

                // List to store multiple MeSH types
                List<String> list_MeSH = new ArrayList<String>();
                if (ind_MeSH.has("MeshHeadingList")) {

                    for (int j = 0; j < ind_MeSH.getJSONObject("MeshHeadingList").getJSONArray("MeshHeading").length(); j++) {
                        list_MeSH.add(ind_MeSH.getJSONObject("MeshHeadingList").getJSONArray("MeshHeading").getJSONObject(j).getJSONObject("DescriptorName").get("content").toString());
                    }
                } else {

                    list_MeSH.add("null");

                }

                universalMeSHMap.put(indPMID_p3, String.join("\t", list_MeSH));

                writer.write(indPMID_p3 + ":" + String.join("\t", list_MeSH) + "\n");



            System.out.println("Completed iteration for " + i + " PMID");

        }

        // Write to the file here
        for (Map.Entry<String,String> entry : universalMeSHMap.entrySet()) {

            writer.append(entry.getKey() + ":" +  entry.getValue() + "\n");

        }

        System.out.print("Completed writing the file");

    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (ParserConfigurationException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (SAXException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (TransformerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } finally {
        // Closing the PrintWriter flushes it and closes the underlying FileWriter
        if (writer != null) {
            writer.close();
        }
    }

}

private static String xml2String(Document doc) throws TransformerException {

    TransformerFactory transfac = TransformerFactory.newInstance();
    Transformer trans = transfac.newTransformer();
    trans.setOutputProperty(OutputKeys.METHOD, "xml");
    trans.setOutputProperty(OutputKeys.INDENT, "yes");
    trans.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", Integer.toString(2));

    StringWriter sw = new StringWriter();
    StreamResult result = new StreamResult(sw);
    DOMSource source = new DOMSource(doc.getDocumentElement());

    trans.transform(source, result);
    String xmlString = sw.toString();
    return xmlString;

}

private static Document parseXML(URL url) throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc = db.parse((url).openStream());
    doc.getDocumentElement().normalize();
    return doc;
}

private static String readAll(Reader rd) throws IOException {
    StringBuilder sb = new StringBuilder();
    int cp;
    while ((cp = rd.read()) != -1) {
        sb.append((char) cp);
    }
    return sb.toString();
}

public static JSONObject readJsonFromUrl(String url) throws IOException, JSONException {
    InputStream is = new URL(url).openStream();
    try {
        BufferedReader rd = new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8")));
        String jsonText = readAll(rd);
        JSONObject json = new JSONObject(jsonText);
        return json;
    } finally {
        is.close();
    }
}

}

This is what is printed on the console before the exception.

Completed iteration for 0 PMID
Completed iteration for 1 PMID
Completed iteration for 2 PMID
Completed iteration for 3 PMID
Completed iteration for 4 PMID
Completed iteration for 5 PMID
And it keeps writing until the exception given below appears ...

So, at some random point in the loop, I get the exception below.

java.io.FileNotFoundException: https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1890)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:263)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:647)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1304)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1270)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:264)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1161)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1045)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:959)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at pmidtomeshConverter.Convert2MeSH.parseXML(Convert2MeSH.java:240)
    at pmidtomeshConverter.Convert2MeSH.main(Convert2MeSH.java:121)

rustyx:

There is no need to use a map; just write directly to the file. For better performance, use a BufferedWriter.
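As a rough illustration of writing each line as soon as it is available, here is a minimal sketch. The class name DirectWriteSketch, the method writeAll, and the fetch function are hypothetical stand-ins for the download-and-extract step in the question; the sketch also assumes that fetch reports failures as unchecked exceptions, so one bad record is skipped instead of ending the whole run.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.function.Function;

class DirectWriteSketch {

    // Writes one "id:value" line per id straight to the file and
    // returns how many ids were skipped because fetch failed.
    // `fetch` is a hypothetical placeholder for the per-PMID lookup.
    static int writeAll(List<String> ids, Function<String, String> fetch, Path out)
            throws IOException {
        int failures = 0;
        // try-with-resources flushes and closes the writer even if the loop aborts
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < ids.size(); i++) {
                String id = ids.get(i);
                try {
                    String value = fetch.apply(id);
                    writer.write(id + ":" + value);
                    writer.newLine();
                } catch (RuntimeException e) {
                    // One failed record should not stop the remaining ones
                    failures++;
                    System.err.println("Skipping " + id + ": " + e);
                }
                // Flush periodically so a crash loses at most the last batch
                if ((i + 1) % 1000 == 0) {
                    writer.flush();
                }
            }
        }
        return failures;
    }
}
```

Nothing is kept in memory beyond the writer's buffer, and the periodic flush bounds how much output is lost if the process dies mid-run.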

I would also check that there is no rate limiting or anything of that nature on the server side (you can guess that from the error you are getting). Save the response to a separate file when parsing or downloading fails; that way you will be able to diagnose the problem better.
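One possible way to keep the failing response for later inspection is sketched below. It assumes the response body has already been read into a byte array before parsing; the class name FailedResponseDump, the method parseOrDump, and the failedDir directory are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class FailedResponseDump {

    // Parses the raw response bytes. If the XML cannot be parsed, the bytes
    // are saved unchanged to failedDir/<id>.xml and null is returned.
    static Document parseOrDump(String id, byte[] body, DocumentBuilderFactory dbf, Path failedDir)
            throws IOException, ParserConfigurationException {
        DocumentBuilder db = dbf.newDocumentBuilder();
        try {
            return db.parse(new ByteArrayInputStream(body));
        } catch (SAXException e) {
            // Keep the exact bytes the server sent so the failure can be reviewed later
            Files.createDirectories(failedDir);
            Files.write(failedDir.resolve(id + ".xml"), body);
            return null;
        }
    }
}
```

The caller can treat a null result as "skip this PMID for now" and carry on with the next one.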

I would also invest some time in implementing a restart mechanism, so that you can resume the process from the last failed location instead of starting from the beginning every time. It can be as simple as providing a skip counter as input to skip the first N requests.
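The skip counter could look roughly like the sketch below. The class name SkipCounterSketch, the method remaining, and the idea of passing the counter as the first command-line argument are assumptions made for illustration; the ids in main are placeholders, not real PMIDs.

```java
import java.util.Arrays;
import java.util.List;

class SkipCounterSketch {

    // Returns the ids that are still to be processed after skipping the
    // first `skip` entries. Out-of-range values are clamped to the list bounds.
    static List<String> remaining(List<String> ids, int skip) {
        int from = Math.max(0, Math.min(skip, ids.size()));
        return ids.subList(from, ids.size());
    }

    public static void main(String[] args) {
        // Hypothetical usage: pass the number of already completed requests
        // as the first argument, e.g. `java SkipCounterSketch 15000`
        int skip = args.length > 0 ? Integer.parseInt(args[0]) : 0;
        List<String> ids = Arrays.asList("id1", "id2", "id3", "id4");
        for (String id : remaining(ids, skip)) {
            System.out.println("Would fetch " + id);
        }
    }
}
```

Since the original loop already prints the index of each completed iteration, the last printed index plus one is a natural value for the counter on the next run, provided the output file is opened in append mode.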

You should reuse the DocumentBuilderFactory so that it does not load the same DTD every time. Additionally, you may want to disable DTD validation altogether (unless you want only valid documents, in which case it is good to catch the exception and dump the bad XML into a separate file for review).

private static DocumentBuilderFactory dbf;

public static void main(String[] args) {
    dbf = DocumentBuilderFactory.newInstance();
    dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    dbf.setFeature("http://xml.org/sax/features/validation", false);
    ...
}

private static Document parseXML(URL url) throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc = db.parse((url).openStream());
    doc.getDocumentElement().normalize();
    return doc;
}
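
To check the effect of these settings without touching the network, a small self-contained sketch follows. It assumes the JDK's built-in Xerces-based parser, which recognises both feature URIs; the class name NoDtdParseSketch and the method rootName are hypothetical, and the XML string in the usage note points at a DTD file that deliberately does not exist.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NoDtdParseSketch {

    // One shared factory, configured once and reused for every document
    static final DocumentBuilderFactory DBF = newFactory();

    static DocumentBuilderFactory newFactory() {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            // Do not fetch the external DTD named in the DOCTYPE declaration
            f.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            // Do not validate the document against a DTD
            f.setFeature("http://xml.org/sax/features/validation", false);
            return f;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Parser does not support these features", e);
        }
    }

    // Parses an XML string and returns the name of its root element
    static String rootName(String xml) throws Exception {
        DocumentBuilder db = DBF.newDocumentBuilder();
        Document doc = db.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getNodeName();
    }
}
```

Calling rootName on a document such as `<!DOCTYPE PubmedArticleSet SYSTEM "file:///no/such/dir/missing.dtd"><PubmedArticleSet/>` should succeed with this factory, because the parser never tries to open the DTD location.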
