2023-12-30

reviews.llvm.org became a read-only archive

For approximately 10 years, reviews.llvm.org functioned as the code view site for the LLVM project, utilizing a Phabricator instance. This website hosted numerous invaluable code review discussions. However, following LLVM's transition to GitHub pull requests, there arises a necessity for a read-only archive of the existing Phabricator instance. (https://archive.org/ archives a subset of the reviews.llvm.org/Dxxxxx pages.)

The intent is to eliminate a SQL engine. Phabicator operates on a complex database scheme. To minimize time investment, the most feasible approach seems to involve downloading the static HTML pages and employing a lightweight scraping process.

Raphaël Gomès developed phab-archive to serve a read-only archive for Mercurial's Phabricator instance. I have modified the code to suit reviews.llvm.org.

The DNS records of reviews.llvm.org have been pointed to the archive website.

Read-only pages

The review discussions primarily happen on /Dxxx pages, which should be archived. There are much fewer discussions on /rL$svn_rev (when LLVM used svn) and /rG$git_commit pages. We skip archiving them as a compromise.

Some /Dxxx pages contain a large number of modified files (usually tests). Phabricator presents a "Load File" button. If we expand every button, the end HTML can be very large. We need to limit the number of buttons to click.

The file hierarchy is quite straightforward. archive/unprocessed/diffs contains raw HTML pages while templates/diffs contains scraped HTML pages alongside patch files.

% tree archive/unprocessed/diffs | head -n 12
archive/unprocessed/diffs
├── 1
│   ├── D1-4.html
│   ├── D1-5.html
│   └── D1.html
├── 10
│   ├── D10-33.html
│   └── D10.html
├── 100
│   ├── D100000-335683.html
│   ├── D100000-335688.html
│   ├── D100000-335689.html
% tree templates/diffs/ | head -n 20
templates/diffs/
├── 1
│   ├── D1-4.diff
│   ├── D1-4.html
│   ├── D1-5.diff
│   ├── D1-5.html
│   ├── D1.diff
│   └── D1.html
├── 10
│   ├── D10-33.diff
│   ├── D10-33.html
│   ├── D10.diff
│   └── D10.html
├── 100
│   ├── D100000-335683.diff
│   ├── D100000-335683.html
│   ├── D100000-335688.diff
│   ├── D100000-335688.html
│   ├── D100000-335689.diff
│   ├── D100000-335689.html
% cat templates/diffs/1/D1-4.diff
Index: include/llvm/ADT/StringMap.h
===================================================================
--- include/llvm/ADT/StringMap.h
+++ include/llvm/ADT/StringMap.h
@@ -34,7 +34,7 @@
 public:
   template <typename InitTy>
   static void Initialize(StringMapEntry<ValueTy> &T, InitTy InitVal) {
-    T.second = InitVal;
+    T.test= InitVal;
   }
 };

% du -sh archive/unprocessed/
270G    archive/unprocessed/
% du -sh templates/diffs
282G    templates/diffs

At present, some https://reviews.llvm.org/Dxxxxx pages might be inaccessible. https://reviews.llvm.org/Dxxxxx?download=true is an alternative if you just need the patch file but not discussions.

Embedded images are currently unavailable. https://reviews.llvm.org/D71786 is an example. https://reviews.llvm.org/D135657 is another example with embedded images in a comment.

1 2	% rg -l 'phabricator-remarkup-embed-image' templates/diffs/ \| wc -l 3332

Nginx

I aim to utilize Nginx solely to serve URIs.

/D2 => /diffs/2/D2.html
/D2?id=&download=true => /diffs/2/D2.diff
/D2?id=10 => /diffs/2/D2-10.html
/D2?id=10&download=true => /diffs/2/D2-10.diff

/D123?id=5 => /diffs/123/D123-5.html
/D1234?id=5 => /diffs/123/D1234-5.html

/rL$svn_rev => https://github.com/llvm/llvm-project/commit/$git_commit
/rG$git_commit => https://github.com/llvm/llvm-project/commit/$git_commit

We just need URL mapping and some Nginx location directives.

map_hash_max_size 400000;
map_hash_bucket_size 128;
map $request_uri $svn_rev {
  ~^/rL([0-9]+) $1;
}
map $svn_rev $git_commit {
  include /var/www/phab-archive/svn_url_rewrite.conf;
}

server {
  listen 80 default_server;
  listen [::]:80 default_server;

  if ($git_commit) {
    return 301 https://github.com/llvm/llvm-project/commit/$git_commit;
  }

  root /var/www/phab-archive/www;
  server_name reviews.llvm.org;

  types {
    text/html html;
    text/plain diff;
  }

  location ~ "^/D(?<diff>.{1,3})$" {
    set $ext ".html";
    if ($arg_download) { set $ext ".diff"; }
    if ($arg_id ~ ^(\d+)$) { rewrite ^ /diffs/$diff/D$diff-$arg_id$ext? last; }
    try_files /diffs/$diff/D$diff$ext =404;
  }
  location ~ ^/D(?<dir>...)(?<tail>.+) {
    set $ext ".html";
    if ($arg_download) { set $ext ".diff"; }
    if ($arg_id ~ ^(\d+)$) { rewrite ^ /diffs/$dir/D$dir$tail-$arg_id$ext? last; }
    try_files /diffs/$dir/D$dir$tail$ext =404;
  }
}

The second round of crawling

Among D1 to D159553, there were 1669 pages that were not downloaded. These differentials might be deleted by the author, had a permission error (e.g. the author did it make it publicly readable), or the crawler encountered an error (e.g. an emulated button click failed).

In January 2024, I got access to the machine hosting the Phabricator instance and crawled 759 differentials. Among them, 184 differentials have a state other than "Closed".

Statistics

We can make a copy of process-html.py and modify it to get some statistics.

def process_html(html, diff):
    soup = BeautifulSoup(html, "html.parser")
    status = soup.select_one(".phui-tag-core").text
    title = soup.select_one(".phui-header-header").text
    author = soup.select_one(".phui-head-thing-view > strong").text
    sub = []
    for div in soup.select(".phui-handle.phui-link-person"):
        if 'commits' in div.text:
            sub.append(div.text)
    print(diff, status, title, author, ','.join(sub), sep='\t')

I have collected differentials that are not “Closed” at https://gist.githubusercontent.com/MaskRay/798de69eb9e7ec7c3e98507265dc5514/raw/. The majority of differentials are "Closed" (indicating a landed patch, unless mis-tagged), therefore not interesting. The rows contain subscribers that look like *-commits, e.g. llvm-commits (a mailing list). This should help find pending patches for subprojects, such as clang, flang, and libcxx.