How can I efficiently write the (700,000 lines) content from a For-loop into a file efficiently from Java?

adjective_noun :

I wrote following code to fetch the results in form of XML responses and write some of its content to the a file from Java. This is done by receiving an XML-response for about 700,000 queries to a public database.

However, before the code can write to the file, it is either stopped by some random exception (from the server) at a random position in code. I tried writing to the file from the For-loop itself, but was not able to. So I tried to store the chunks from received responses into Java HashMap and write the HashMap to the file in a single call. But before the code receives all the responses in the for-loop and stores them into a HashMap, it stops with some exception (maybe at the 15000th iteration!!). Is there any other efficient way to write to the file in Java when one requires such iterations to fetch the data?

The local file that I use for this code is here.

My code is,

import java.io.BufferedReader;              

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringWriter;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
import org.json.XML;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;


public class random {

    static FileWriter fileWriter;
    static PrintWriter writer;

    public static void main(String[] args) {

        // Hashmap to store the MeSH values for each PMID 
        Map<String, String> universalMeSHMap = new HashMap<String, String>();

        try {

            // FileWriter for MeSH terms
            fileWriter = new FileWriter("/home/user/eclipse-workspace/pmidtomeshConverter/src/main/resources/outputFiles/pmidMESH.txt", true);
            writer = new PrintWriter(fileWriter);

            // Read the PMIDS from this file 
            String filePath = "file_attached_to_Post.txt";
            String line = null;
            BufferedReader bufferedReader = new BufferedReader(new FileReader(filePath));


            String[] pmidsAll = null;

            int x = 0;
            try {
                //print first 2 lines or all if file has less than 2 lines
                while(((line = bufferedReader.readLine()) != null) && x < 1) {
                    pmidsAll = line.split(",");
                    x++;
                }   
            }
            finally {   
                bufferedReader.close();         
            }

            // List of strings containing the PMIDs
            List<String> pmidList = Arrays.asList(pmidsAll);

            // Iterate through the list of PMIDs to fetch the XML files from PubMed using eUtilities API service from PubMed
            for (int i = 0; i < pmidList.size(); i++) {


                String baseURL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&rettype=abstract&id=";

                // Process to get the PMIDs
                String indPMID_p0 = pmidList.get(i).toString().replace("[", "");
                String indPMID_p1 = indPMID_p0.replace("]", "");
                String indPMID_p2 = indPMID_p1.replace("\\", "");
                String indPMID_p3 = indPMID_p2.replace("\"", "");

                // Fetch XML response from the eUtilities into a document object 
                Document doc = parseXML(new URL(baseURL + indPMID_p3));

                // Convert the retrieved XMl into a Java String 
                String xmlString = xml2String(doc); // Converts xml from doc into a string

                // Convert the Java String into a JSON Object
                JSONObject jsonWithMeSH = XML.toJSONObject(xmlString);  // Converts the xml-string into JSON

                // -------------------------------------------------------------------
                // Getting the MeSH terms from a JSON Object
                // -------------------------------------------------------------------
                JSONObject ind_MeSH = jsonWithMeSH.getJSONObject("PubmedArticleSet").getJSONObject("PubmedArticle").getJSONObject("MedlineCitation");

                // List to store multiple MeSH types
                List<String> list_MeSH = new ArrayList<String>();
                if (ind_MeSH.has("MeshHeadingList")) {

                    for (int j = 0; j < ind_MeSH.getJSONObject("MeshHeadingList").getJSONArray("MeshHeading").length(); j++) {
                        list_MeSH.add(ind_MeSH.getJSONObject("MeshHeadingList").getJSONArray("MeshHeading").getJSONObject(j).getJSONObject("DescriptorName").get("content").toString());
                    }
                } else {

                    list_MeSH.add("null");

                }

                universalMeSHMap.put(indPMID_p3, String.join("\t", list_MeSH));

                writer.write(indPMID_p3 + ":" + String.join("\t", list_MeSH) + "\n");



            System.out.println("Completed iteration for " + i + " PMID");

        }

        // Write to the file here
        for (Map.Entry<String,String> entry : universalMeSHMap.entrySet()) {

            writer.append(entry.getKey() + ":" +  entry.getValue() + "\n");

        }

        System.out.print("Completed writing the file");

    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (ParserConfigurationException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (SAXException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (TransformerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } finally {
        writer.flush();
        writer_pubtype.flush();
        writer.close();
        writer_pubtype.close();
    }

}

private static String xml2String(Document doc) throws TransformerException {

    TransformerFactory transfac = TransformerFactory.newInstance();
    Transformer trans = transfac.newTransformer();
    trans.setOutputProperty(OutputKeys.METHOD, "xml");
    trans.setOutputProperty(OutputKeys.INDENT, "yes");
    trans.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", Integer.toString(2));

    StringWriter sw = new StringWriter();
    StreamResult result = new StreamResult(sw);
    DOMSource source = new DOMSource(doc.getDocumentElement());

    trans.transform(source, result);
    String xmlString = sw.toString();
    return xmlString;

}

private static Document parseXML(URL url) throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc = db.parse((url).openStream());
    doc.getDocumentElement().normalize();
    return doc;
}

private static String readAll(Reader rd) throws IOException {
    StringBuilder sb = new StringBuilder();
    int cp;
    while ((cp = rd.read()) != -1) {
        sb.append((char) cp);
    }
    return sb.toString();
}

public static JSONObject readJsonFromUrl(String url) throws IOException, JSONException {
    InputStream is = new URL(url).openStream();
    try {
        BufferedReader rd = new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8")));
        String jsonText = readAll(rd);
        JSONObject json = new JSONObject(jsonText);
        return json;
    } finally {
        is.close();
    }
}

}

This is what it prints on the Console before the exception.

  • Completed iteration for 0 PMID
  • Completed iteration for 1 PMID
  • Completed iteration for 2 PMID
  • Completed iteration for 3 PMID
  • Completed iteration for 4 PMID
  • Completed iteration for 5 PMID
  • And it writes until the below given exception appears...

So at any random point in the loop, I get the exception below.

java.io.FileNotFoundException: https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1890) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492) at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:263) at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:647) at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1304) at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1270) at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:264) at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1161) at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1045) at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:959) at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141) at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243) at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339) at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121) at pmidtomeshConverter.Convert2MeSH.parseXML(Convert2MeSH.java:240) at pmidtomeshConverter.Convert2MeSH.main(Convert2MeSH.java:121)

rustyx :

There is no need to use a Map; just write directly to the file. For better performance use a BufferedWriter.

I would also check that there is no rate limit or anything of that nature on the server side (you can guess that from the error you're getting). Save the response in a separate file when parsing or downloading fails, that way you will be able to diagnose the issue better.

I would also invest some time into implementing a restart mechanism, such that you can restart the process from the last failed location instead of starting from the beginning every time. It can be as simple as providing a skip counter as input to skip the first N requests.

You should re-use the DocumentBuilderFactory so that it doesn't load the same DTD every time. Additionally you may want to disable DTD validation altogether (unless you want only valid documents, in which case it's good to catch that exception and dump the bad XML to a separate file for review).

private static DocumentBuilderFactory dbf;

public static void main(String[] args) {
    dbf = DocumentBuilderFactory.newInstance();
    dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    dbf.setFeature("http://xml.org/sax/features/validation", false);
    ...
}

private static Document parseXML(URL url) throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc = db.parse((url).openStream());
    doc.getDocumentElement().normalize();
    return doc;
}

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=113019&siteId=1