For approximately 10 years, reviews.llvm.org served as the code review site for the LLVM project, running on a Phabricator instance. The site hosted numerous invaluable code review discussions. However, following LLVM's transition to GitHub pull requests, a read-only archive of the existing Phabricator instance became necessary. (https://archive.org/ archives a subset of the reviews.llvm.org/Dxxxxx pages.)
The goal is to avoid running a SQL engine. Phabricator operates on a complex database schema. To minimize time investment, the most feasible approach seems to be downloading the static HTML pages and running a lightweight scraping process over them.
Raphaël Gomès developed phab-archive to serve a read-only archive for Mercurial's Phabricator instance. I have modified the code to suit reviews.llvm.org.
The DNS records of reviews.llvm.org have been pointed to the archive website.
Read-only pages
The review discussions primarily happen on /Dxxx pages, which should be archived. There are far fewer discussions on /rL$svn_rev (from when LLVM used svn) and /rG$git_commit pages. We skip archiving them as a compromise.
Some /Dxxx pages contain a large number of modified files (usually tests). Phabricator presents a "Load File" button. If we expand every button, the resulting HTML can be very large, so we need to limit the number of buttons to click.
The file hierarchy is quite straightforward. archive/unprocessed/diffs contains raw HTML pages while templates/diffs contains scraped HTML pages alongside patch files.
% tree archive/unprocessed/diffs | head -n 12
% du -sh archive/unprocessed/
At present, some https://reviews.llvm.org/Dxxxxx pages might be inaccessible. https://reviews.llvm.org/Dxxxxx?download=true is an alternative if you just need the patch file but not the discussions.
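As a rough illustration of the `?download=true` fallback, here is a minimal sketch that builds the patch URL for a differential and fetches it with the Python standard library. The helper names `patch_url` and `fetch_patch` are hypothetical and are not part of phab-archive.

```python
from urllib.request import urlopen

BASE_URL = "https://reviews.llvm.org"


def patch_url(diff_id: int) -> str:
    """Return the raw-patch URL for differential D<diff_id>."""
    return f"{BASE_URL}/D{diff_id}?download=true"


def fetch_patch(diff_id: int, timeout: float = 30.0) -> bytes:
    """Download the patch bytes (hypothetical helper).

    Raises urllib.error.URLError if the page is missing or unreachable.
    """
    with urlopen(patch_url(diff_id), timeout=timeout) as response:
        return response.read()
```

Calling `fetch_patch(71786)` would request https://reviews.llvm.org/D71786?download=true; whether a given differential is actually available depends on the archive.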
Embedded images are currently unavailable. https://reviews.llvm.org/D71786 is an example. https://reviews.llvm.org/D135657 is another example with embedded images in a comment.
% rg -l 'phabricator-remarkup-embed-image' templates/diffs/ | wc -l
3332
Nginx
I aim to utilize Nginx solely to serve URIs.
/D2 => /diffs/2/D2.html
We just need URL mapping and some Nginx location directives.
map_hash_max_size 400000;
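The full Nginx configuration is not reproduced here, so the following is only a sketch under one assumption: that the URL mapping lives in a generated file included from the body of an Nginx `map` block, where each line has the form `source value;`. The function names are hypothetical; the path pattern follows the /D2 => /diffs/2/D2.html example above.

```python
def map_entry(diff_id: int) -> str:
    """Format one `source value;` line for the body of an Nginx map block."""
    return f"/D{diff_id} /diffs/{diff_id}/D{diff_id}.html;"


def map_entries(diff_ids) -> str:
    """Return map lines for all given IDs, sorted numerically, one per line."""
    return "\n".join(map_entry(i) for i in sorted(diff_ids)) + "\n"


if __name__ == "__main__":
    # Print entries for a few sample IDs; a real run would pass every archived ID.
    print(map_entries([2, 71786, 135657]), end="")
```

With one entry per differential, the map grows to well over a hundred thousand keys, which is presumably why a large `map_hash_max_size` is needed.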
The second round of crawling
Among D1 to D159553, 1669 pages were not downloaded. These differentials might have been deleted by the author, might have hit a permission error (e.g. the author did not make the differential publicly readable), or the crawler might have encountered an error (e.g. an emulated button click failed).
In January 2024, I got access to the machine hosting the Phabricator instance and crawled 759 differentials. Among them, 184 differentials have a state other than "Closed".
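To find which differentials still need a second crawl, one option is to compare the IDs present on disk against the full range. This is a sketch, assuming pages are stored as `D<n>.html` somewhere under a root directory (as the /diffs/2/D2.html mapping suggests); `downloaded_ids` and `missing_ids` are hypothetical names.

```python
import re
from pathlib import Path

PAGE_NAME = re.compile(r"D(\d+)\.html")


def downloaded_ids(root: Path) -> set[int]:
    """Collect the numeric ID of every D<n>.html file found under root."""
    ids = set()
    for path in root.rglob("D*.html"):
        match = PAGE_NAME.fullmatch(path.name)
        if match:
            ids.add(int(match.group(1)))
    return ids


def missing_ids(present: set[int], last_id: int) -> list[int]:
    """Return IDs in 1..last_id (inclusive) that are absent from present."""
    return [i for i in range(1, last_id + 1) if i not in present]
```

For example, `missing_ids(downloaded_ids(Path("templates/diffs")), 159553)` would list the candidates for another crawl, provided the directory layout matches the assumption.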
Statistics
We can make a copy of process-html.py and modify it to get some statistics.
from bs4 import BeautifulSoup

def process_html(html, diff):
    soup = BeautifulSoup(html, "html.parser")
    status = soup.select_one(".phui-tag-core").text
    title = soup.select_one(".phui-header-header").text
    author = soup.select_one(".phui-head-thing-view > strong").text
    # Keep only subscribers whose name contains "commits" (e.g. llvm-commits).
    sub = []
    for div in soup.select(".phui-handle.phui-link-person"):
        if 'commits' in div.text:
            sub.append(div.text)
    # Emit one tab-separated row per differential.
    print(diff, status, title, author, ','.join(sub), sep='\t')
I have collected differentials that are not “Closed” at https://gist.githubusercontent.com/MaskRay/798de69eb9e7ec7c3e98507265dc5514/raw/.
The majority of differentials are "Closed" (indicating a landed patch, unless mis-tagged) and are therefore not interesting. The rows contain subscribers that look like *-commits, e.g. llvm-commits (a mailing list). This should help find pending patches for subprojects such as clang, flang, and libcxx.
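As a sketch of how the rows could be filtered by subproject, the function below assumes the five tab-separated columns printed by `process_html` (differential, status, title, author, comma-separated subscribers). The name `pending_rows` is hypothetical, and the list name passed in is whatever `*-commits` subscriber you are interested in.

```python
def pending_rows(lines, mailing_list):
    """Yield (diff, status, title, author) for rows that are not "Closed"
    and whose subscriber column contains mailing_list."""
    for line in lines:
        # Strip only the newline so an empty trailing subscriber column survives.
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip rows that do not match the expected layout
        diff, status, title, author, subscribers = fields
        if status != "Closed" and mailing_list in subscribers.split(","):
            yield diff, status, title, author
```

Given the rows saved to a local file, something like `for row in pending_rows(open("rows.tsv"), "libcxx-commits"): print(*row, sep="\t")` would list the candidates; the file name here is a placeholder.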