Coursera Algorithms II, Week 5 programming assignment: Burrows-Wheeler

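No school and no offer right now, so I'm going back over the programming assignments and lectures from the Algorithms course I did a while ago. There seem to be few blog posts about the last assignment, so I decided to start from the last one and work backwards.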

Code first: https://github.com/RedemptionC/CourseraAlgorithms/tree/master/burrows

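When doing the programming assignments, the two most important references are the specification and the FAQ (formerly the checklist). The former works like a requirements document plus system design; the latter answers common questions and suggests implementation steps.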

FAQ https://coursera.cs.princeton.edu/algs4/assignments/burrows/faq.php

Specification https://coursera.cs.princeton.edu/algs4/assignments/burrows/specification.php

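From the spec we learn that we are implementing the Burrows-Wheeler compression algorithm, which has three parts: the Burrows-Wheeler transform, move-to-front encoding, and Huffman encoding.

We only need to implement the first two.

For the concrete steps I followed the order suggested in the FAQ:

1. First, implement CircularSuffixArray (it is used by the Burrows-Wheeler transform). The circular suffix array looks roughly like this:

For example, if the original data is abcd, the corresponding circular suffix array is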

0 a b c d

1 b c d a

2 c d a b

3 d a b c

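As shown above, it looks like we could build this array with a queue: dequeue a character and enqueue it again, which moves the head of the queue to the tail.

In practice we can be much more efficient: storing, for each circular suffix, the index of its first character in the original data is enough to represent the whole array indirectly.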
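To make the "store only the start index" idea concrete, here is a minimal sketch. The class name RotationView and its helper methods are made up for illustration and are not part of the assignment API. The point is that character k of the suffix starting at index start can be read straight from the original string with modular arithmetic, so no rotated copies are needed.

```java
public class RotationView {
    // Character at offset k of the circular suffix that starts at index start.
    static char charAt(String s, int start, int k) {
        return s.charAt((start + k) % s.length());
    }

    // Materializes one rotation, for display only; a real implementation
    // would keep just the start index.
    static String rotation(String s, int start) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < s.length(); k++) {
            sb.append(charAt(s, start, k));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "abcd";
        for (int i = 0; i < s.length(); i++) {
            System.out.println(i + " " + rotation(s, i));
        }
    }
}
```

Running main prints the four rotations of abcd listed above, one per line.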

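One point worth stressing about these assignments: program to the API. Your submission is scored by the autograder, so your code must match the given API exactly, and you should know the API requirements well before you start coding.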

public class CircularSuffixArray {

    // circular suffix array of s
    public CircularSuffixArray(String s)

    // length of s
    public int length()

    // returns index of ith sorted suffix
    public int index(int i)

    // unit testing (required)
    public static void main(String[] args)

}

The index method mentioned above must return the index of the ith sorted suffix:

i       Original Suffixes           Sorted Suffixes         index[i]
--    -----------------------     -----------------------    --------
0    A B R A C A D A B R A !     ! A B R A C A D A B R A    11
1    B R A C A D A B R A ! A     A ! A B R A C A D A B R    10
2    R A C A D A B R A ! A B     A B R A ! A B R A C A D    7
3    A C A D A B R A ! A B R     A B R A C A D A B R A !    0
4    C A D A B R A ! A B R A     A C A D A B R A ! A B R    3
5    A D A B R A ! A B R A C     A D A B R A ! A B R A C    5
6    D A B R A ! A B R A C A     B R A ! A B R A C A D A    8
7    A B R A ! A B R A C A D     B R A C A D A B R A ! A    1
8    B R A ! A B R A C A D A     C A D A B R A ! A B R A    4
9    R A ! A B R A C A D A B     D A B R A ! A B R A C A    6
10   A ! A B R A C A D A B R     R A ! A B R A C A D A B    9
11   ! A B R A C A D A B R A     R A C A D A B R A ! A B    2

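As shown above, once we have the circular suffixes we sort them by ASCII value. index[i] = x means that original suffix number x ends up at position i after sorting.

Here is the implementation, code first: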

import java.util.Arrays;
import java.util.Comparator;

public class CircularSuffixArray {
    private String s;
    private Integer[] circularSuffixIndex;

    // circular suffix array of s
    public CircularSuffixArray(String s) {
        if (s == null)
            throw new IllegalArgumentException("arg can not be null");
        this.s = s;
        circularSuffixIndex = new Integer[s.length()];
        // circularSuffixIndex[i] stands for the suffix that starts at character i,
        // which is exactly row i of the original suffixes
        for (int i = 0; i < s.length(); i++) {
            circularSuffixIndex[i] = i;
        }
        Comparator<Integer> cmp = (a1, a2) -> {
            // a1 and a2 are elements of the index array; compare the circular
            // suffixes they stand for character by character, from the front
            for (int i = 0; i < s.length(); i++) {
                int i1 = (a1 + i) % s.length();
                int i2 = (a2 + i) % s.length();
                if (s.charAt(i1) == s.charAt(i2))
                    continue;
                else
                    return Character.compare(s.charAt(i1), s.charAt(i2));
            }
            return 0;
        };
        // sort the circular suffixes
        Arrays.sort(circularSuffixIndex, cmp);
    }

    // length of s
    public int length() {
        return s.length();
    }

    // returns index of ith sorted suffix
    public int index(int i) {
        if (i < 0 || i >= s.length())
            throw new IllegalArgumentException("index out of range");
        return circularSuffixIndex[i];
    }

    // unit testing (required)
    public static void main(String[] args) {
        String s = "ABRACADABRA!";
        CircularSuffixArray circularSuffixArray = new CircularSuffixArray(s);
        for (int i = 0; i < s.length(); i++) {
            System.out.println(circularSuffixArray.index(i));
        }
        // String s = "banana";
        // Queue<Character> queue = new Queue<>();
        // for (char c : s.toCharArray()) {
        //     queue.enqueue(c);
        // }
        // int i = s.length();
        // // System.out.println(queue.toString());
        // while (i-- > 0) {
        //     char ch = queue.dequeue();
        //     queue.enqueue(ch);
        //     System.out.println(queue.toString());
        // }
    }
}

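As mentioned earlier, this does not actually build all the circular suffixes. Storing the index of each suffix's first character in the original data represents all of them indirectly.

As for the sort, what we ultimately want is the index array, i.e. the correspondence between the original circular suffixes and the sorted ones. Storing the original start indices, besides indirectly storing the suffixes, also records each suffix's original number: the i-th suffix starts at the i-th character of the original data (0-indexed). Sorting those start indices therefore yields index[i] directly. For example:

position i 0 1 2 3 4 5 6 7 8 9 10 11

start indices before sorting 0 1 2 3 4 5 6 7 8 9 10 11

start indices after sorting 11 10 7 0 3 5 8 1 4 6 9 2

So index[11] = 2, and so on.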
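As a sanity check on the index array, here is a naive sketch that builds every rotation as a real String and sorts the start indices by those strings. It uses quadratic memory, so it is only a cross-check for small inputs and not a substitute for the class above. The class name NaiveSuffixIndex is hypothetical.

```java
import java.util.Arrays;

public class NaiveSuffixIndex {
    // Returns index[], where index[i] is the start position in s of the
    // rotation that ends up at row i after sorting.
    static int[] index(String s) {
        int n = s.length();
        String[] rotations = new String[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            rotations[i] = s.substring(i) + s.substring(0, i);
            order[i] = i;
        }
        // String.compareTo compares char by char, like the comparator above.
        Arrays.sort(order, (a, b) -> rotations[a].compareTo(rotations[b]));
        int[] index = new int[n];
        for (int i = 0; i < n; i++) {
            index[i] = order[i];
        }
        return index;
    }

    public static void main(String[] args) {
        // prints [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
        System.out.println(Arrays.toString(index("ABRACADABRA!")));
    }
}
```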

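2. Use the CircularSuffixArray we just built to implement the Burrows-Wheeler transform

This covers both transform and inverse transform. Start with the forward transform.

The forward transform is simple. It outputs two things: first, and then the last column of the sorted circular suffix array described above.

first is the row number at which the original data appears in the sorted circular suffix array.

In this example, A B R A C A D A B R A ! is at row 3 after sorting, so the transform outputs first = 3 followed by the last column: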

ARD!RCAAAABB

The implementation:

    // apply Burrows-Wheeler transform,
    // reading from standard input and writing to standard output
    public static void transform() {
        // output: first, then the last column of the sorted suffixes
        String s = BinaryStdIn.readString();
        CircularSuffixArray circularSuffixArray = new CircularSuffixArray(s);
        int first = -1;
        for (int i = 0; i < s.length(); i++) {
            if (circularSuffixArray.index(i) == 0) {
                first = i;
                break;
            }
        }
        BinaryStdOut.write(first);
        for (int i = 0; i < s.length(); i++) {
            BinaryStdOut
                    .write(s.charAt((circularSuffixArray.index(i) + s.length() - 1) % s.length()));
        }
        BinaryStdOut.close();
    }
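To try the transform without the algs4 I/O classes, here is an in-memory sketch. BwtSketch and its method names are made up for illustration; it sorts the start indices with the same character-by-character comparison used in CircularSuffixArray and returns first and the last column as plain values.

```java
import java.util.Arrays;

public class BwtSketch {
    // Start indices of the rotations of s, in sorted order.
    static Integer[] sortedStarts(String s) {
        int n = s.length();
        Integer[] starts = new Integer[n];
        for (int i = 0; i < n; i++) {
            starts[i] = i;
        }
        Arrays.sort(starts, (a, b) -> {
            for (int k = 0; k < n; k++) {
                int cmp = Character.compare(s.charAt((a + k) % n), s.charAt((b + k) % n));
                if (cmp != 0) return cmp;
            }
            return 0;
        });
        return starts;
    }

    // Row of the sorted array that holds the original string (start index 0).
    static int first(String s) {
        Integer[] starts = sortedStarts(s);
        for (int i = 0; i < starts.length; i++) {
            if (starts[i] == 0) return i;
        }
        return -1; // only reached for the empty string
    }

    // Last column: the character just before each sorted rotation's start.
    static String lastColumn(String s) {
        int n = s.length();
        StringBuilder sb = new StringBuilder();
        for (int start : sortedStarts(s)) {
            sb.append(s.charAt((start + n - 1) % n));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "ABRACADABRA!";
        System.out.println(first(s));      // prints 3
        System.out.println(lastColumn(s)); // prints ARD!RCAAAABB
    }
}
```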

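Next comes the inverse transform, which is the harder part of this assignment.

First define next[i] = x as follows: if the j-th original circular suffix lands at row i after sorting, then next[i] is the row where the (j+1)-th original suffix lands after sorting.
For example, first = 3 and next[first] = 7: row 3 holds the original string (original suffix 0), and original suffix 1 lands at row 7 after sorting.

The input to the inverse transform is first plus the last column of the sorted suffix array. From the last column we immediately get the first column, because the first column is just the last column sorted.

Suppose we already have the next array. Then we can rebuild the original data, because the first character of the i-th original suffix is the i-th character of the original data.

Here first is 3, and we know the first and last columns of the sorted array, so the original data starts with A and ends with !. Since next[first] = 7, the second character is the first-column character of row 7, which is B. Repeating this step recovers the whole string.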
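The walk described above can be sketched as follows. The next values in main are worked out by hand for the ABRACADABRA! example and are hard-coded only for illustration; NextWalk and rebuild are hypothetical names.

```java
public class NextWalk {
    // Follows next[] starting from row first, collecting the first-column
    // character of every row visited.
    static String rebuild(int first, char[] firstColumn, int[] next) {
        StringBuilder sb = new StringBuilder();
        int row = first;
        for (int i = 0; i < next.length; i++) {
            sb.append(firstColumn[row]);
            row = next[row];
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First column = the last column ARD!RCAAAABB in sorted order.
        char[] firstColumn = "!AAAAABBCDRR".toCharArray();
        int[] next = {3, 0, 6, 7, 8, 9, 10, 11, 5, 2, 1, 4};
        System.out.println(rebuild(3, firstColumn, next)); // prints ABRACADABRA!
    }
}
```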

The remaining question is how to get the next array. Doing it by hand is easier, and the spec describes how:

consider the suffix that starts with 'C'. By inspecting the first column, it appears 8th in the sorted order. The next original suffix after this one will have the character 'C' as its last character. By inspecting the last column, the next original appears 5th in the sorted order. Thus, next[8] = 5.

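For a more visual explanation, see this video: https://www.youtube.com/watch?v=4WRANhDiSHM&list=LLy4ez7_C8lxwhTAWJV7sxEw&index=2&t=0s

Implementing this in code, and in linear time, is harder to wrap your head around (I still don't fully understand it ><).

The trick is that next can be computed while sorting! But how can sorting run in linear time? The answer is key-indexed counting (a form of counting sort), which applies when the range of key values is small.

Here is the code: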

    // Assumes every character value is at least 0 (true for any Java char)
    private static char[] keyIndexedCounting(char[] a, int[] next) {
        int max = -1;
        int min = 10000;
        for (char i : a) {
            if (i > max) {
                max = i;
            }
            if (i < min) {
                min = i;
            }
        }
        int[] count = new int[max - min + 1 + 1];
        for (int i = 0; i < a.length; i++) {
            count[a[i] - min + 1]++;
        }
        for (int i = 0; i < count.length - 1; i++) {
            count[i + 1] += count[i];
        }
        char[] aux = new char[a.length];
        for (int i = 0; i < a.length; i++) {
            aux[count[a[i] - min]] = a[i];
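            // a[i] lands at row k = count[a[i] - min] of the first column.
            // Shifting row k left by one character gives row i (which ends
            // in a[i]), and the sort is stable, so next[k] = i.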
            // System.out.printf("%d -> %d\n", count[a[i] - min], i);
            next[count[a[i] - min]] = i;
            count[a[i] - min]++;
        }
        return aux;
    }
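For comparison, here is an in-memory sketch of the whole inverse transform. Instead of scanning for min and max, it uses the fixed radix R = 256 (an assumption: every character fits in 8 bits, as with BinaryStdIn.readString). InverseBwtSketch is a made-up name, not part of the assignment API.

```java
public class InverseBwtSketch {
    private static final int R = 256; // assumes 8-bit characters

    // Rebuilds the original string from first and the last column t.
    static String inverse(int first, String t) {
        int n = t.length();
        // count[c + 1] = occurrences of c, then prefix sums give start rows.
        int[] count = new int[R + 1];
        for (int i = 0; i < n; i++) {
            count[t.charAt(i) + 1]++;
        }
        for (int r = 0; r < R; r++) {
            count[r + 1] += count[r];
        }
        // Stable distribution: position i of the last column lands at some
        // row of the first column, and that row's next entry is i.
        char[] firstColumn = new char[n];
        int[] next = new int[n];
        for (int i = 0; i < n; i++) {
            int row = count[t.charAt(i)]++;
            firstColumn[row] = t.charAt(i);
            next[row] = i;
        }
        // Walk next[] from first to read the original characters in order.
        StringBuilder sb = new StringBuilder();
        int row = first;
        for (int i = 0; i < n; i++) {
            sb.append(firstColumn[row]);
            row = next[row];
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(inverse(3, "ARD!RCAAAABB")); // prints ABRACADABRA!
    }
}
```

Every loop is a single pass over either the n input characters or the R counters, which is where the linear running time comes from.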

With the next array, the inverse transform follows:

// apply Burrows-Wheeler inverse transform,
    // reading from standard input and writing to standard output
    public static void inverseTransform() {
        int first = BinaryStdIn.readInt();
        String s = BinaryStdIn.readString();
        // System.out.println(first + "\n" + s);
        int[] next = new int[s.length()];
        char[] firstColumn = keyIndexedCounting(s.toCharArray(), next);
        // for (char c : firstColumn) {
        //     System.out.println(c);
        // }
        // How to compute next in O(n)? No idea yet, so start with O(n*n)
        // StringBuilder sb = new StringBuilder(s);
        // char[] temp = s.toCharArray();
        // int[] next = new int[s.length()];
        // for (int i = 0; i < s.length(); i++) {
        //     for (int j = 0; j < temp.length; j++) {
        //         if (firstColumn[i] == temp[j]) {
        //             next[i] = j;
        //             temp[j] = '\0';
        //             break;
        //         }
        //     }
        // }
        // for (int c : next) {
        //     System.out.println(c);
        // }
        // StringBuilder sb = new StringBuilder();
        int t = first;
        for (int i = 0; i < next.length; i++) {
            // sb.append(firstColumn[t]);
            BinaryStdOut.write(firstColumn[t], 8);
            t = next[t];
        }
        // BinaryStdOut.write(sb.toString());
        BinaryStdOut.close();
    }

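3. Next we implement move-to-front. The encoding starts from a list of the 256 extended ASCII characters; for each character read, it outputs that character's current position in the list and then moves the character to the front.

Here is the code: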

import edu.princeton.cs.algs4.BinaryStdIn;
import edu.princeton.cs.algs4.BinaryStdOut;

import java.util.LinkedList;

public class MoveToFront {

    // apply move-to-front encoding, reading from standard input and writing to standard output
    public static void encode() {
        LinkedList<Character> ascii = new LinkedList<>();
        for (int i = 0; i < 256; i++) {
            ascii.add((char) i);
        }
        while (!BinaryStdIn.isEmpty()) {
            char c = BinaryStdIn.readChar(8);
            // System.out.printf("%c %d\n", c, ascii.get((int) c));
            int i = ascii.indexOf((c));
            // System.out.println(c + " " + i);
            BinaryStdOut.write(i, 8);
            Character t = ascii.remove(i);
            ascii.addFirst(t);
        }
        // System.out.println();
        BinaryStdOut.close();
    }

    // apply move-to-front decoding, reading from standard input and writing to standard output
    public static void decode() {
        LinkedList<Character> ascii = new LinkedList<>();
        for (int i = 0; i < 256; i++) {
            ascii.add((char) i);
        }
        while (!BinaryStdIn.isEmpty()) {
            char c = BinaryStdIn.readChar(8);
            BinaryStdOut.write(ascii.get((int) c), 8);
            Character t = ascii.remove((int) c);
            ascii.addFirst(t);
        }
        BinaryStdOut.close();
    }

    // if args[0] is "-", apply move-to-front encoding
    // if args[0] is "+", apply move-to-front decoding
    public static void main(String[] args) {
        String method = args[0];
        if (method.equals("-")) {
            MoveToFront.encode();
        }
        else if (method.equals("+")) {
            MoveToFront.decode();
        }
    }
}
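To check the encoding by hand, here is an in-memory sketch that swaps the LinkedList for a plain char array and works on strings and int arrays instead of standard input and output. MtfSketch is a hypothetical name, and the code assumes every input character is below 256.

```java
import java.util.Arrays;

public class MtfSketch {
    private static final int R = 256; // extended ASCII alphabet size

    private static char[] initialSequence() {
        char[] seq = new char[R];
        for (int i = 0; i < R; i++) {
            seq[i] = (char) i;
        }
        return seq;
    }

    // Moves the character at position pos to the front, shifting the
    // characters before it one slot to the right.
    private static void moveToFront(char[] seq, int pos) {
        char c = seq[pos];
        System.arraycopy(seq, 0, seq, 1, pos);
        seq[0] = c;
    }

    // Each output value is the character's current position in the sequence.
    static int[] encode(String input) {
        char[] seq = initialSequence();
        int[] out = new int[input.length()];
        for (int k = 0; k < input.length(); k++) {
            char c = input.charAt(k); // assumed to be < 256
            int pos = 0;
            while (seq[pos] != c) pos++;
            out[k] = pos;
            moveToFront(seq, pos);
        }
        return out;
    }

    // Each input value is a position; the character found there is emitted.
    static String decode(int[] codes) {
        char[] seq = initialSequence();
        StringBuilder sb = new StringBuilder();
        for (int pos : codes) {
            sb.append(seq[pos]);
            moveToFront(seq, pos);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = encode("ABRACADABRA!");
        // prints [65, 66, 82, 2, 68, 1, 69, 1, 4, 4, 2, 38]
        System.out.println(Arrays.toString(codes));
        System.out.println(decode(codes)); // prints ABRACADABRA!
    }
}
```

Both versions do O(R) work per character; the array just avoids the boxing and node traversal of LinkedList.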
