2014-06-08

Ko-Aluru suffix array construction algorithm

Suffix array是解决string search问题的一个工具，它是一个字串所有后缀的排序。

线性时间的构建算法主要有Kärkkäinen-Sanders(¹)、Ko-Aluru(²)、KSP三种，其中KSP算法使用了类似suffix tree的构建方式，Kärkkäinen-Sanders算法时间复杂度的递推式为，通常认为性能不如的Ko-Aluru。

按照³的注记，借鉴⁴对LMS substring递归排序使用的方法，对原始的Ko-Aluru算法进行改动以简化内存使用和提升性能。我的实现在字串和suffix array外仅使用两个数组：计数排序使用的桶和一个表示L/S类型的数组，反映了Ko-Aluru算法能达到和⁵同等的精简内存占用。

Ka-Aluru suffix array construction algorithm

规定为哨兵字符，且小于字符集中任意一个字符。用表示后缀。对于后缀，根据和的大小关系得到后缀的类型，然后根据把所有后缀分为两类：

也称为L类，也称为S类。

令字符集为，再根据后缀首字母分类：

并结合以上两种分类定义：

对于，中的后缀排在中后缀的前面，因为如果两个后缀和的第一个字符、不同，那么和的大小关系可以直接确定。另外中的后缀排在中后缀的前面，因为中后缀均小于(重复的)，而中后缀均大于。

根据可以把suffix array粗略划分为若干个桶，每个桶内的后缀都小于右边的桶内的任意后缀。最多有个桶：。

根据⁶的induced sorting，只要确定了的大小顺序，那么的顺序也可以确定。同样地，只要确定了的大小顺序，那么的顺序也可以确定。根据平均数原理和至少有一个，不妨设，只要求出中所有后缀的大小顺序，那么使用induced sorting即可得到的大小顺序。再根据中的后缀排在中后缀的前面，即可得到所有后缀的大小顺序。

下面考虑如何递归求出的大小顺序。⁷使用了一个复杂的方法，根据-distance计算。借鉴⁸对LMS substring递归排序使用的方法，我在实现中使用了一个简单的方法。

定义L-substring为夹在相邻两个L(-)类型字符间的子串。定义S-substring为夹在相邻两个S(+)类型字符间的子串。定义一个L-substring 小于另一个L-substring 当且仅当字典序小于，其中两种取值规定L小于S。

对于两个长度不同的L-substring 和，它们不会相等，且根据L-substring的大小关系可以得到和的大小关系。如果两L-substring相等，找到在原串后的L-substring ，两者共用一个字符。再找到在原串后的L-substring ，和的大小关系即为和的大小关系。如果这两个子串也相同了，那么可以继续找紧接着的下一对L-substring。根据以上比较方式可以发现如果取出中所有L-substring，并把这些L-substring看作字符，就可以得到子问题递归调用算法解决。现在考虑如何确定所有L-substring的大小关系，使得长度不同的可以区分大小，而长度相同的允许相等(所以才需要递归确定大小)。

先用桶排序求出所有L类型字符的大小，使用一次induced sorting根据L计算S，这样便得到了所有形如子串的大小。再用一次induced sorting根据S计算L，便得到了所有形如类型子串的大小，而这些串就是我们需要的L-substring。对L-substring计算新字符集大小，若等于L-substring的数目则说明所有L-substring的顺序已经确定，否则取出L字符递归计算suffix array。得到这些L-substring的顺序即的顺序后使用induced sorting即可得到整个suffix array。

实现技巧

为原串，为类型数组，为桶排序使用的桶。

整个实现分为几个阶段，仍然设：

计算数组，需要使用
使用两次induced sorting计算L-substring的大小，需要使用
根据L-substring构建新字串递归计算子suffix array
根据子suffix array计算部分的suffix array
使用induced sorting根据部分的suffix array得到整体suffix array

其中第2、3步需要特别注意，使用了很多技巧来节省空间占用。比如新串的长度不大于，因此可以和临时数组共享sa[]数组的空间，递归调用后对sa[]进行原地操作进行位置变换。

注意到，因此新字串和子suffix array的空间占用和，可以让它们共用原始的suffix array的空间。这样根据子suffix array计算部分的suffix array的过程变成了把suffix array扩张的过程，而扩张时需要根据新字串和的字符对应关系的映射进行，因为映射的性质可以共用一个数组实现，扩张中suffix array储存了三部分的内容：映射、子suffix array、部分的suffix array，其中部分的suffix array渐渐变大，子suffix array渐渐变小。求得部分的suffix array后，原地桶排序得到原始下标的。递归也需要使用数组，因为计算是线性的，可以让递归时继续使用该数组，递归函数返回时重新计算被覆盖的部分。这样可以省去数组在递归中不断被创建的开销。

多数文献都在末尾添加哨兵元素以方便讲解，但个人感觉实现中哨兵元素通常会导致更复杂的判断，弊大于利。不使用哨兵元素只需要在三个地方注意：

为L，因为把哨兵字符看作字典序最小。
根据S-substring计算L-substring的induced sorting过程提前把放入桶中。否则因为它是最后一个字符，无法被放入桶中，当该字符的桶包含两个或更多元素时可能会导致它的位置被覆盖而出错。
计算新字符集时，需要注意比较过程中到达边界的情况。

Ko-Aluru算法的伪代码如下：

fn ka S sa n
  计算T
  确定少数类型minor: -或+

  if minor为L(-)
    计数排序S中L(-)类型的字符，S[n-1]需要提前放
    用induced sorting根据L(-)计算S(+)，即确定了所有形如++..+-的串的大小
    用induced sorting根据S(+)计算L(-)，即确定了所有形如--..-++..+-的串的大小
  else
    类似

  把依序的各个L-substring作为新的串
  若新字符集大小小于新串长度则递归调用，计算新串的sa2[]

  把sa2[]的值(新串的位置)映射为原串的位置
  if minor为L(-)
    把sa2[]的元素移动到字符代表的桶的前部
    用induced sorting根据S(+)计算L(-)，得到桶的后部
  else
    把sa2[]的元素移动到字符代表的桶的后部
    用induced sorting根据L(-)计算S(+)，得到桶的前部

实现

C++：

#include <algorithm>
#include <cstdio>
#include <cstring>
using namespace std;

#define REP(i, n) for (int i = 0; i < (n); i++)
#define FOR(i, a, b) for (int i = (a); i < (b); i++)
#define ROF(i, a, b) for (int i = (b); --i >= (a); )

namespace KoAluru
{
  bool *t;
  int *b;

  template<typename T>
  void bucket(T a[], int n, int k, bool end)
  {
    fill_n(b, k, 0);
    REP(i, n) b[a[i]]++;
    if (end)
      FOR(i, 1, k) b[i] += b[i-1];
    else {
      int s = 0;
      REP(i, k)
        s += b[i], b[i] = s-b[i];
    }
  }

  template<typename T>
  void plus_to_minus(T a[], int sa[], int n, int k)
  {
    bucket(a, n, k, false);
    sa[b[a[n-1]]++] = n-1;
    REP(i, n-1) {
      int j = sa[i]-1;
      if (j >= 0 && ! t[j])
        sa[b[a[j]]++] = j;
    }
  }

  template<typename T>
  void minus_to_plus(T a[], int sa[], int n, int k)
  {
    bucket(a, n, k, true);
    ROF(i, 0, n) {
      int j = sa[i]-1;
      if (j >= 0 && t[j])
        sa[--b[a[j]]] = j;
    }
  }

  template<typename T>
  void ka(T a[], int sa[], int n, int k)
  {
    t[n-1] = false;
    ROF(i, 0, n-1)
      t[i] = a[i] < a[i+1] || a[i] == a[i+1] && t[i+1];
    bool minor = 2 * count(t, t+n, false) > n;

    bucket(a, n, k, minor);
    fill_n(sa, n, -1);
    if (minor) {
      REP(i, n)
        if (t[i])
          sa[--b[a[i]]] = i;
      plus_to_minus(a, sa, n, k);
      minus_to_plus(a, sa, n, k);
    } else {
      sa[b[a[n-1]]++] = n-1;
      REP(i, n-1)
        if (! t[i])
          sa[b[a[i]]++] = i;
      minus_to_plus(a, sa, n, k);
      plus_to_minus(a, sa, n, k);
    }

    int last = -1, name = 0, nn = count(t, t+n, minor);
    int *sa2, *pi;
    if (minor)
      sa2 = sa, pi = sa+n-nn;
    else
      sa2 = sa+n-nn, pi = sa;
    fill_n(b, n, -1);
    REP(i, n)
      if (sa[i] >= 0 && minor == t[sa[i]]) {
        bool diff = last == -1;
        int p = sa[i];
        if (! diff)
          REP(j, n) {
            if (last+j >= n || p+j >= n || a[last+j] != a[p+j] || t[last+j] != t[p+j]) {
              diff = true;
              break;
            } else if (j > 0 && (minor == t[last+j] || minor == t[p+j]))
              break;
          }
        if (diff) {
          name++;
          last = p;
        }
        b[p] = name-1;
      }
    nn = 0;
    REP(i, n)
      if (b[i] >= 0)
        pi[nn++] = b[i];

    if (name < nn)
      ka(pi, sa2, nn, name);
    else
      REP(i, nn)
        sa2[pi[i]] = i;

    ROF(i, 0, nn)
      t[i] = a[i] < a[i+1] || a[i] == a[i+1] && t[i+1];

    nn = 0;
    bucket(a, n, k, minor);
    REP(i, n)
      if (minor == t[i])
        pi[nn++] = i;
    if (minor) {
      REP(i, nn)
        sa[i] = pi[sa2[i]];
      ROF(i, 0, nn) {
        int j = sa[i];
        sa[i] = -1;
        sa[--b[a[j]]] = j;
      }
    } else {
      ROF(i, 0, nn)
        sa[n-nn+i] = pi[sa2[i]];
      REP(i, nn) {
        int j = sa[n-nn+i];
        sa[n-nn+i] = -1;
        sa[b[a[j]]++] = j;
      }
    }
    if (minor)
      plus_to_minus(a, sa, n, k);
    else
      minus_to_plus(a, sa, n, k);
  }

  template<typename T>
  void main(T a[], int sa[], int b[], int n, int k)
  {
    KoAluru::b = b;
    if (n > 0) {
      t = new bool[n];
      ka(a, sa, n, k);
      delete[] t;
    }
  }

  template<typename T>
  void calc_rank_lcp(T a[], int sa[], int n, int rank[], int lcp[])
  {
    REP(i, n)
      rank[sa[i]] = i;
    int k = 0;
    lcp[0] = 0;
    REP(i, n)
      if (rank[i]) {
        for (int j = sa[rank[i]-1]; i+k < n && j+k < n && a[i+k] == a[j+k]; k++);
        lcp[rank[i]] = k;
        k && k--;
      }
  }
};

const int N = 1992;
unsigned char a[N+1];
int sa[N], rnk[N], lcp[N], q[N];

int main()
{
  gets((char *)a);
  int n = int(strlen((char *)a));
  KoAluru::main(a, sa, q, n, 256);
  KoAluru::calc_rank_lcp(a, sa, n, rnk, lcp);

  puts("sa");
  REP(i, n)
    printf("%d: %d\n", i, sa[i]);
  puts("rank");
  REP(i, n)
    printf("%d: %d\n", i, rnk[i]);
  puts("lcp");
  REP(i, n)
    printf("%d: %d\n", i, lcp[i]);
}

OCaml：

module KoAluru = struct
  (** [k]: Elements of [a] are in \[0,k) *)
  let ka a k =
    let n = Array.length a in
    let t = Array.make n false in
    let b = Array.make (max n k) 0 in
    let sa = Array.make n 0 in

    let bucket ai n k minor =
      Array.fill b 0 k 0;
      for i = 0 to n-1 do
        b.(a.(ai+i)) <- b.(a.(ai+i)) + 1
      done;
      if minor then
        for i = 1 to k-1 do
          b.(i) <- b.(i) + b.(i-1)
        done
      else
        let s = ref 0 in
        for i = 0 to k-1 do
          let t = !s in
          s := !s + b.(i);
          b.(i) <- t
        done
    in

    let minus_to_plus ai sai n k =
      bucket ai n k true;
      for i = n-1 downto 0 do
        let j = sa.(sai+i) - 1 in
        if j >= 0 && t.(j) then (
          b.(a.(ai+j)) <- b.(a.(ai+j)) - 1;
          sa.(b.(a.(ai+j))) <- j
        )
      done
    in

    let plus_to_minus ai sai n k =
      bucket ai n k false;
      sa.(b.(a.(ai+n-1))) <- n-1;
      b.(a.(ai+n-1)) <- b.(a.(ai+n-1)) + 1;
      for i = 0 to n-1 do
        let j = sa.(sai+i) - 1 in
        if j >= 0 && not t.(j) then (
          sa.(b.(a.(ai+j))) <- j;
          b.(a.(ai+j)) <- b.(a.(ai+j)) + 1
        )
      done
    in

    let rec ka_iter ai sai n k =
      (* calculate t[] *)
      t.(n-1) <- false;
      let minus_cnt = ref 1 in
      for i = n-2 downto 0 do
        t.(i) <- a.(ai+i) < a.(ai+i+1) || a.(ai+i) = a.(ai+i+1) && t.(i+1);
        if not t.(i) then
          incr minus_cnt
      done;
      let minor = !minus_cnt > n - !minus_cnt in
      let minor_cnt = if minor then n - !minus_cnt else !minus_cnt in

      (* use two induced sorting to order L/S-substrings, identical L/S-substrings are ordered arbitrarily *)
      bucket ai n k minor;
      Array.fill sa sai n (-1);
      if minor then (
        for i = 0 to n-1 do
          if t.(i) then (
            b.(a.(ai+i)) <- b.(a.(ai+i)) - 1;
            sa.(b.(a.(ai+i))) <- i
          )
        done;
        plus_to_minus ai sai n k;
        minus_to_plus ai sai n k
      ) else (
        for i = 0 to n-1 do
          if not t.(i) then (
            sa.(b.(a.(ai+i))) <- i;
            b.(a.(ai+i)) <- b.(a.(ai+i)) + 1
          )
        done;
        minus_to_plus ai sai n k;
        plus_to_minus ai sai n k
      );

      (* label L/S-substrings, identical L/S-substrings share the same label *)
      Array.fill b 0 n (-1);
      let last = ref (-1)
      and name = ref 0 in
      for i = 0 to n-1 do
        if sa.(i) >= 0 && minor = t.(sa.(sai+i)) then (
          let p = sa.(sai+i) in
          let rec check_lsl j =
            if !last+j >= n || p+j >= n || a.(!last+j) <> a.(p+j) || t.(!last+j) <> t.(p+j) then
              true
            else if j > 0 && (minor = t.(!last+j) || minor = t.(p+j)) then
              false
            else
              check_lsl (j+1)
          in
          if !last = -1 || check_lsl 0 then (
            incr name;
            last := p
          );
          b.(p) <- !name - 1
        )
      done;

      (* calculate SA of L/S-substrings *)
      let pi = if minor then sai+n-minor_cnt else sai in
      let sa2i = if minor then sai else sai+n-minor_cnt in
      let rec collect_lsl nn i =
        if i = n then
          nn
        else (
          if b.(i) >= 0 then (
            sa.(pi+nn) <- b.(i);
            collect_lsl (nn+1) (i+1)
          ) else
            collect_lsl nn (i+1)
        )
      in
      let nn = collect_lsl 0 0 in
      if !name < nn then
        ka_iter pi sa2i nn !name
      else
        for i = 0 to nn-1 do
          sa.(sa2i+sa.(pi+i)) <- i
        done;

      (* restore destroyed t[] *)
      for i = nn-1 downto 0 do
        t.(i) <- a.(ai+i) < a.(ai+i+1) || a.(ai+i) = a.(ai+i+1) && t.(i+1)
      done;

      (* transform calculated sa2[] *)
      let rec collect_minor_indices nn i =
        if i = n then
          nn
        else if minor = t.(i) then (
          sa.(pi+nn) <- i;
          collect_minor_indices (nn+1) (i+1)
        ) else
          collect_minor_indices nn (i+1)
      in
      let nn = collect_minor_indices 0 0 in
      for i = 0 to nn-1 do
        sa.(sa2i+i) <- sa.(pi+sa.(sa2i+i))
      done;

      (* move transformed sa2[] to front/rear of buckets *)
      bucket ai n k minor;
      if minor then (
        for i = nn-1 downto 0 do
          let j = sa.(i) in
          sa.(i) <- -1;
          b.(a.(ai+j)) <- b.(a.(ai+j)) - 1;
          sa.(b.(a.(ai+j))) <- j
        done;
        plus_to_minus ai sai n k
      ) else (
        for i = 0 to nn-1 do
          let j = sa.(sa2i+i) in
          sa.(sa2i+i) <- -1;
          sa.(b.(a.(ai+j))) <- j;
          b.(a.(ai+j)) <- b.(a.(ai+j)) + 1
        done;
        minus_to_plus ai sai n k;
      )
    in

    ka_iter 0 0 n k;
    sa

  let calc_rank_lcp a sa =
    let n = Array.length a in
    let rank = Array.make n 0 in
    let lcp = Array.make n 0 in
    for i = 0 to n-1 do
      rank.(sa.(i)) <- i
    done;
    let rec go k i =
      if i < n then
        if rank.(i) > 0 then (
          let j = sa.(rank.(i)-1) in
          let rec go2 k =
            if i+k < n && j+k < n && a.(i+k) = a.(j+k) then
              go2 (k+1)
            else
              k
          in
          let k = go2 k in
          lcp.(rank.(i)) <- k;
          go (k-1 |> max 0) (i+1)
        ) else
          go (k-1 |> max 0) (i+1)
    in
    go 0 0;
    rank, lcp
end

let () =
  let a = [| 1;1;2;1;2;3;2;3;1;3;0 |] in
  let n = Array.length a in
  let sa = KoAluru.ka a 4 in
  let rank,lcp = KoAluru.calc_rank_lcp a sa in
  for i = 0 to n-1 do
    Printf.printf "%d " rank.(i)
  done

参考文献

Juha Kärkkäinen and Peter Sanders. “Simple Linear Work Suffix Array Construction”. English. In: Automata, Languages and Programming. Ed. by JosC.M. Baeten et al. Vol. 2719. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2003, pp. 943–955. isbn: 978-3-540-40493-4. doi: 10.1007/3-540-45061-0_73. url: http://dx.doi.org/10.1007/3-540-45061-0_73.↩︎
Pang Ko and Srinivas Aluru. “Space efficient linear time construction of suffix arrays”. In: Journal of Discrete Algorithms. Springer, 2003, pp. 200–210.↩︎
YANG Zhe. Suffix Array Construction: KA's algorithm and Induced Sorting, which is faster? 2012. url: http://yangzhe1990.wordpress.com/2012/11/04/ka-and-induced-sortingwhich-is-faster/.↩︎
Ge Nong, Sen Zhang, and Wai Hong Chan. “Two Efficient Algorithms for Linear Time Suffix Array Construction”. In: IEEE Transactions on Computers 60.10 (2011), pp. 1471–1484. issn: 0018-9340. doi: http://doi.ieeecomputersociety.org/10.1109/TC.2010.188.↩︎
Ge Nong, Sen Zhang, and Wai Hong Chan. “Two Efficient Algorithms for Linear Time Suffix Array Construction”. In: IEEE Transactions on Computers 60.10 (2011), pp. 1471–1484. issn: 0018-9340. doi: http://doi.ieeecomputersociety.org/10.1109/TC.2010.188.↩︎
Pang Ko and Srinivas Aluru. “Space efficient linear time construction of suffix arrays”. In: Journal of Discrete Algorithms. Springer, 2003, pp. 200–210.↩︎
Pang Ko and Srinivas Aluru. “Space efficient linear time construction of suffix arrays”. In: Journal of Discrete Algorithms. Springer, 2003, pp. 200–210.↩︎
Ge Nong, Sen Zhang, and Wai Hong Chan. “Two Efficient Algorithms for Linear Time Suffix Array Construction”. In: IEEE Transactions on Computers 60.10 (2011), pp. 1471–1484. issn: 0018-9340. doi: http://doi.ieeecomputersociety.org/10.1109/TC.2010.188.↩︎