Java: Most efficient way to loop through CSV and sum values of one column for each unique value in another column

FinDev :

I have a CSV file with 500,000 rows of data and 22 columns. This data represents all commercial flights in the USA for one year. I am being tasked with finding the tail number of the plane that flew the most miles in the data set. Column 5 contains the airplane's tail number for each flight. Column 22 contains the total distance traveled.

Please see my extractQ3 method below. First, I created a HashMap for the whole CSV using the createHashMap() method. Then I ran a for loop to identify every unique tail number in the dataset and stored them in an ArrayList called tailNumbers. Then, for each unique tail number, I looped through the entire HashMap to calculate the total distance in miles for that tail number.

The code runs fine on smaller datasets, but once the size increased to 500,000 rows the code becomes horribly inefficient and takes an eternity to run. Can anyone provide me with a faster way to do this?

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Scanner;

public class FlightData {

    HashMap<String, String[]> dataMap;

    public static void main(String[] args) {
        FlightData map1 = new FlightData();
        map1.dataMap = map1.createHashMap();

        String answer = map1.extractQ3(map1);
    }

    public String extractQ3(FlightData map1) {
        ArrayList<String> tailNumbers = new ArrayList<String>();
        ArrayList<Integer> tailMiles = new ArrayList<Integer>();

        // Fill the list with every unique tail number
        for (String[] value : map1.dataMap.values()) {
            if (!tailNumbers.contains(value[4])) {
                tailNumbers.add(value[4]);
            }
        }

        // For each unique tail number, loop over the whole map and sum its miles
        for (int i = 0; i < tailNumbers.size(); i++) {
            String tempName = tailNumbers.get(i);
            int miles = 0;

            for (String[] value : map1.dataMap.values()) {
                if (value[4].contentEquals(tempName) && value[19].contentEquals("0")) {
                    miles = miles + Integer.parseInt(value[21]);
                }
            }
            tailMiles.add(miles);
        }

        Integer maxVal = Collections.max(tailMiles);
        Integer maxIdx = tailMiles.indexOf(maxVal);
        String maxPlane = tailNumbers.get(maxIdx);

        return maxPlane;
    }

    public HashMap<String, String[]> createHashMap() {
        File flightFile = new File("flights_small.csv");
        HashMap<String, String[]> flightsMap = new HashMap<String, String[]>();

        try {
            Scanner s = new Scanner(flightFile);
            while (s.hasNextLine()) {
                String info = s.nextLine();
                String[] piecesOfInfo = info.split(",");
                String flightKey = piecesOfInfo[4] + "_" + piecesOfInfo[2] + "_" + piecesOfInfo[11]; // Setting the key
                String[] values = Arrays.copyOfRange(piecesOfInfo, 0, piecesOfInfo.length);

                flightsMap.put(flightKey, values);
            }
            s.close();
        } catch (FileNotFoundException e) {
            System.out.println("Cannot open: " + flightFile);
        }

        return flightsMap;
    }
}
andrewjames :

The answer depends on what you mean by "most efficient", "horribly inefficient" and "takes an eternity". These are subjective terms. The answer may also depend on specific technical factors (speed vs. memory consumption; the number of unique flight keys compared to the number of overall records; etc.).

I would recommend applying some basic streamlining to your code, to start with. See if that gets you a better (acceptable) result. If you need more, then you can consider more advanced improvements.

Whatever you do, take some timings to understand the broad impacts of any changes you make.

Focus on going from "horrible" to "acceptable" - and then worry about more advanced tuning after that (if you still need it).

Consider using a BufferedReader instead of a Scanner, although the Scanner may be just fine for your needs (i.e. if it's not a bottleneck).
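
For illustration, a minimal sketch of the read loop using a BufferedReader might look like this (the file name is taken from your code; the java.io imports are assumed):

// Read the file line by line with a BufferedReader (requires java.io.BufferedReader,
// java.io.FileReader and java.io.IOException). The file name comes from the question.
try (BufferedReader reader = new BufferedReader(new FileReader("flights_small.csv"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] piecesOfInfo = line.split(",");
        // ... process one row here ...
    }
} catch (IOException e) {
    System.out.println("Cannot open: flights_small.csv");
}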

Consider using logic within your scanner loop to capture tail numbers and accumulated mileage in one pass of the data. The following is deliberately basic, for clarity and simplicity:

// The string is a tail number.
// The integer holds the accumulated miles flown for that tail number:
Map<String, Integer> planeMileages = new HashMap<>();

if (planeMileages.containsKey(tailNumber)) {
    // add miles to existing total:
    int accumulatedMileage = planeMileages.get(tailNumber) + flightMileage;
    planeMileages.put(tailNumber, accumulatedMileage);
} else {
    // capture new tail number:
    planeMileages.put(tailNumber, flightMileage);
}
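
As an aside, the same accumulation can be written more compactly with Map.merge (available since Java 8); it behaves exactly like the if/else above:

// Adds flightMileage to the existing total, or inserts it if the
// tail number has not been seen before.
planeMileages.merge(tailNumber, flightMileage, Integer::sum);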

After that, once you have completed the scanner loop, you can iterate over your planeMileages to find the largest mileage:

String maxMilesTailNumber = null;
int maxMiles = 0;
for (Map.Entry<String, Integer> entry : planeMileages.entrySet()) {
    int planeMiles = entry.getValue();
    if (planeMiles > maxMiles) {
        maxMilesTailNumber = entry.getKey();
        maxMiles = planeMiles;
    }
}

WARNING - This approach is just for illustration. It will only capture one tail number. There could be multiple planes with the same maximum mileage. You would have to adjust your logic to capture multiple "winners".
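
For example, one simple way to collect every tied winner (just a sketch, reusing the planeMileages map and maxMiles value from above) is a second pass over the map once the maximum is known:

// Collect every tail number whose accumulated total equals the maximum.
List<String> winners = new ArrayList<>();
for (Map.Entry<String, Integer> entry : planeMileages.entrySet()) {
    if (entry.getValue() == maxMiles) {
        winners.add(entry.getKey());
    }
}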

The above approach removes the need for several of your existing data structures, and related processing.
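
To make that concrete, here is a minimal end-to-end sketch of the single-pass approach. It is only an illustration, not a drop-in replacement: it assumes the column layout from your code (tail number at index 4, the index-19 flag equal to "0" for rows you want to count, distance at index 21), assumes there is no header row, and it reports just one winner. The class name is made up for the sketch.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class MaxMileageFinder {   // hypothetical class name, just for the sketch

    public static void main(String[] args) {
        Map<String, Integer> planeMileages = new HashMap<>();

        // Single pass over the file: parse each row and accumulate miles per tail number.
        try (BufferedReader reader = new BufferedReader(new FileReader("flights_small.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] values = line.split(",");
                if (values[19].equals("0")) {   // same filter as the original code
                    planeMileages.merge(values[4], Integer.parseInt(values[21]), Integer::sum);
                }
            }
        } catch (IOException e) {
            System.out.println("Cannot open: flights_small.csv");
            return;
        }

        // Much smaller second pass: find the largest accumulated mileage.
        String maxMilesTailNumber = null;
        int maxMiles = 0;
        for (Map.Entry<String, Integer> entry : planeMileages.entrySet()) {
            if (entry.getValue() > maxMiles) {
                maxMilesTailNumber = entry.getKey();
                maxMiles = entry.getValue();
            }
        }

        System.out.println(maxMilesTailNumber + " flew " + maxMiles + " miles");
    }
}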

If you still face problems, put in some timers to see which specific areas of your code are slowest - and then you will have more specific tuning opportunities you can focus on.
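
Something as simple as System.nanoTime() around a suspect block is usually enough for this kind of coarse measurement, for example:

// Coarse timing of one section of code.
long start = System.nanoTime();
// ... section you want to measure ...
long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
System.out.println("Section took " + elapsedMillis + " ms");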
