How to cluster my data with a custom distance matrix using the Smile library's CLARANS method

Vahe Karapetyan :

I want to cluster my data with a custom distance matrix rather than a built-in distance (e.g. Euclidean), and there seems to be no clear way of doing it.

I've tried adding some of my code to the demos in the Smile project, and I've also tested it in my own project. Here's a chunk of the code:

        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = vrpJsonFromFile.readLine()) != null) {
            sb.append(line).append("\n");
        }
        JSONArray jsonArray = new JSONObject(sb.toString()).getJSONArray("services");
        Double[][] data = new Double[jsonArray.length()][2];
        for (int i = 0; i < jsonArray.length(); i++) {
            JSONObject address = jsonArray.getJSONObject(i).getJSONObject("address");
            data[i][0] = Double.parseDouble(address.getString("lon"));
            data[i][1] = Double.parseDouble(address.getString("lat"));
        }

        // here: the distance is plain Euclidean on (lon, lat)
        Distance<Double[]> distance1 = (x, y) -> Math.sqrt(Math.pow(y[1]-x[1],2) + Math.pow(y[0]-x[0], 2));
        CLARANS<Double[]> clarans = new CLARANS<>(data, distance1, 3);
        System.out.println(clarans);

This code creates a CLARANS clustering with the Euclidean distance (see the line below the // here comment). I would like to replace it with my own distance matrix, and I hope there is a way of doing that in Smile.

Has QUIT--Anony-Mousse :

You can likely use

        Distance<Integer> d = (i,j) -> matrix[i][j];

to cluster the object numbers (the row/column indices of your matrix), not their vectors.
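To make the index idea concrete, here is a minimal sketch. It does not depend on Smile: the nested IndexDistance interface is a hypothetical stand-in for smile.math.distance.Distance<Integer> (which, as far as I know, declares a single method d(x, y)), and the class name, helper names, and matrix values are made up for illustration.

```java
class PrecomputedDistanceSketch {

    // Hypothetical stand-in for smile.math.distance.Distance<Integer>,
    // so that this sketch compiles without Smile on the classpath.
    interface IndexDistance {
        double d(Integer i, Integer j);
    }

    // Wraps a precomputed n x n matrix as a distance over row/column indices.
    static IndexDistance fromMatrix(double[][] matrix) {
        for (double[] row : matrix) {
            if (row.length != matrix.length) {
                throw new IllegalArgumentException("distance matrix must be square");
            }
        }
        return (i, j) -> matrix[i][j];
    }

    // Builds {0, 1, ..., n-1}: the "objects" handed to the clustering algorithm.
    static Integer[] indices(int n) {
        Integer[] ids = new Integer[n];
        for (int i = 0; i < n; i++) {
            ids[i] = i;
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up symmetric distances between three services.
        double[][] matrix = {
            {0.0, 1.5, 7.0},
            {1.5, 0.0, 6.2},
            {7.0, 6.2, 0.0}
        };
        Integer[] ids = indices(matrix.length);
        IndexDistance d = fromMatrix(matrix);
        System.out.println(d.d(ids[0], ids[2])); // prints 7.0

        // Assumption: with Smile available, the same lambda typed as Distance<Integer>
        // would be passed with ids instead of the coordinate array, using the
        // constructor form from the question:
        //   CLARANS<Integer> clarans = new CLARANS<>(ids, (i, j) -> matrix[i][j], 3);
    }
}
```

Since the clustered objects are just positions in ids, whatever cluster assignment comes back for position i applies to data[i] in the original array.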

But it may be worth looking at ELKI instead. It has predefined classes for distance matrices and uses optimized representations for sets of objects, so you avoid the expensive boxed Integer arguments of the lambda above. Because i and j are boxed integers, each distance computation requires an additional memory indirection (and possible cache misses), which can reduce performance a lot. ELKI also has the improved FastCLARANS algorithm, as well as FastPAM, which are supposedly O(k) times faster.
