2012-11-21

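Setting up a Node Packaged Modules mirror at Tsinghua University

Motivation

Hacker News recently featured an article, How to create a private npm.js repository. After reading it, I decided to set up an npm mirror for http://mirror.tuna.tsinghua.edu.cn/.

Setup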

sudo aptitude install couchdb
sudo vim /etc/couchdb/local.ini  # change the admin password
sudo install -d -o couchdb -g couchdb /var/run/couchdb

Then configure CouchDB:

$ sudo vim /etc/couchdb/local.ini
[httpd]
secure_rewrites = false

[couchdb]
database_dir = /mirror/npm/couchdb # where CouchDB stores its database files

This sets the database directory to /mirror/npm/couchdb. On August 17, 2013, once the sync had completed, registry.couch in this directory was 72 GB.

sudo /etc/init.d/couchdb start

Then start the sync. CouchDB is administered through HTTP requests:

curl -X POST http://127.0.0.1:5984/_replicate -d \
  '{"source":"http://isaacs.iriscouch.com/registry/", "target":"registry", "create_target":true, "continuous":true}' \
  -H "Content-Type: application/json"

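Note the name/value pair "continuous":true. It makes the replication (sync) continuous rather than one-off, so there is no need to re-run a sync command periodically from a cron job as you would with an rsync mirror.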
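For scripting, the same request can be assembled with Ruby's standard library. This is only a minimal sketch of the curl command above; the helper name build_replicate_request is made up for illustration, and the request is built but not sent.

```ruby
require 'json'
require 'net/http'

# Hypothetical helper: builds (but does not send) the POST /_replicate
# request that the curl command above issues.
def build_replicate_request(source:, target:, continuous: true)
  req = Net::HTTP::Post.new('/_replicate')
  req['Content-Type'] = 'application/json'
  req.body = JSON.generate(
    'source' => source,
    'target' => target,
    'create_target' => true,
    'continuous' => continuous
  )
  req
end

req = build_replicate_request(
  source: 'http://isaacs.iriscouch.com/registry/',
  target: 'registry'
)
puts req.body

# To actually send it (assumes CouchDB is listening on 127.0.0.1:5984):
#   Net::HTTP.new('127.0.0.1', 5984).request(req)
```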

Checking the sync status

Use the following command:

curl -s localhost:5984/_active_tasks | jq .

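This shows the sync status. jq is a powerful command-line JSON processor, similar in spirit to sed. In this command, jq's basic filter . simply pretty-prints the JSON output.

As an aside: CSS selectors inspired Zen Coding (since renamed Emmet), a snippet tool for generating HTML quickly, and jq gives a similar feeling.

The command above produces JSON output like the following: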

[
  {
    "updated_on": 1376700136,
    "missing_revisions_found": 1678,
    "docs_written": 1678,
    "docs_read": 1678,
    "doc_write_failures": 0,
    "doc_id": null,
    "continuous": true,
    "checkpointed_source_seq": 620684,
    "pid": "<0.1872.0>",
    "progress": 99,
    "replication_id": "d2d78ebb34eb57384335196827cdc81e+continuous",
    "revisions_checked": 12615,
    "source": "http://isaacs.iriscouch.com/registry/",
    "source_seq": 624071,
    "started_on": 1376513978,
    "target": "registry",
    "type": "replication"
  }
]

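CouchDB's documentation is incomplete and I could not find the meaning of each field, so what follows is my own guess:

The progress field is the progress percentage, at most 100, computed as floor(checkpointed_source_seq * 100 / source_seq). Once progress reaches 100, the mirror has fully caught up to some historical state of the official database. Below 100, the mirror is in an inconsistent state in which the metadata may not match the actual packages, but the impact of this inconsistency is small and it usually causes no problems.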
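The guessed formula can be checked against the sample output above. The sketch below assumes the formula is right (it is the author's guess, not documented behavior); the function name replication_progress is made up for illustration.

```ruby
# Guessed formula: floor(checkpointed_source_seq * 100 / source_seq).
# Ruby's Integer#/ already floors for non-negative integers.
def replication_progress(task)
  task.fetch('checkpointed_source_seq') * 100 / task.fetch('source_seq')
end

# Values taken from the sample _active_tasks output above.
task = { 'checkpointed_source_seq' => 620684, 'source_seq' => 624071 }
puts replication_progress(task) # prints 99, matching "progress": 99
```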

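Forbidding PUT/POST/DELETE for ordinary users

The npm client only uses GET requests to fetch data from the server. Our mirror is a slave database of the official one, not the authoritative server, and it has no copy of the official server's user accounts, so it cannot offer user login. Nor should a mirror accept uploads, so the PUT/POST/DELETE methods are never needed. CouchDB, however, lets any user make modifications by default, which is a serious security risk, so these methods need to be blocked. The solution is to add the following validate_doc_update validation function to _design/security in the registry database: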

function(newDoc, oldDoc, userCtx, secObj) {
  if (! userCtx || ~ userCtx.roles.indexOf('_admin'))
    log('Admin change on read-only db: ' + newDoc._id);
  else
    throw {'forbidden':'This database is read-only'};
}

The following command adds the validation function above:

curl -X PUT admin:password@localhost:5984/registry/_design/security -d \
  '{"validate_doc_update": "function(newDoc, oldDoc, userCtx, secObj) { if (! userCtx || ~ userCtx.roles.indexOf('\''_admin'\'')) log('\''Admin change on read-only db: '\'' + newDoc._id); else throw {'\''forbidden'\'':'\''This database is read-only'\''}; }"}'
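The '\'' quoting in the curl command is easy to get wrong. One way around it, sketched here, is to let Ruby's JSON library produce the request body from the function source. The constant and variable names are made up for illustration, and the upload step is only outlined in comments.

```ruby
require 'json'

# The validation function from above, kept as plain JavaScript source.
VALIDATE_FN = <<~JS.strip
  function(newDoc, oldDoc, userCtx, secObj) {
    if (! userCtx || ~ userCtx.roles.indexOf('_admin'))
      log('Admin change on read-only db: ' + newDoc._id);
    else
      throw {'forbidden':'This database is read-only'};
  }
JS

# JSON.generate takes care of escaping quotes and newlines.
design_doc_body = JSON.generate('validate_doc_update' => VALIDATE_FN)
puts design_doc_body

# To upload (assumes an admin account on a local CouchDB):
#   req = Net::HTTP::Put.new('/registry/_design/security')
#   req.basic_auth 'admin', 'password'
#   req.body = design_doc_body
#   Net::HTTP.new('localhost', 5984).request(req)
```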

Using Nginx as a reverse proxy

CouchDB supports vhosts, but to let it coexist better with the other mirrors, Nginx is used as a reverse proxy.

server {
  listen [::]:80;
  server_name npm.*;
  root /mirror/npm/www;
  index index.html;

  location /registry {
    proxy_redirect off;
    proxy_set_header Host $host;
    proxy_pass http://127.0.0.1:5984/registry/_design/app/_rewrite;
    # proxy_pass http://127.0.0.1:5984/registry;
  }

  location /registry/_design {
    proxy_redirect off;
    proxy_set_header Host $host;
    proxy_pass http://127.0.0.1:5984/registry/_design;
  }
}

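The commented-out proxy_pass http://127.0.0.1:5984/registry; does not work, for reasons I have not figured out.

fqj1994 reported and fixed a problem: the host in the URLs returned in metadata is taken from the Host: header of the request, so proxy_set_header Host $host; is needed for CouchDB to generate correct URLs. Without it, the npm client receives metadata containing 127.0.0.1, tries to fetch data from its own machine, and naturally fails.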
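A quick way to spot this problem is to scan a package's metadata for tarball URLs that point at the wrong host. The sketch below assumes the usual registry layout in which each version carries a dist.tarball URL. The sample document and the package name foo are made up, and foreign_tarballs is a hypothetical helper.

```ruby
require 'uri'

# Hypothetical helper: returns the tarball URLs whose host differs from
# the host that clients are supposed to use.
def foreign_tarballs(metadata, expected_host)
  metadata.fetch('versions', {}).values
          .map { |version| version.dig('dist', 'tarball') }
          .compact
          .reject { |url| URI.parse(url).host == expected_host }
end

# Made-up sample: one URL generated without the Host header fix, one with it.
sample = {
  'versions' => {
    '1.0.0' => { 'dist' => { 'tarball' => 'http://127.0.0.1:5984/registry/foo/-/foo-1.0.0.tgz' } },
    '1.0.1' => { 'dist' => { 'tarball' => 'http://npm.tuna.tsinghua.edu.cn/registry/foo/-/foo-1.0.1.tgz' } }
  }
}

p foreign_tarballs(sample, 'npm.tuna.tsinghua.edu.cn')
```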

In addition, leecade suggested that when the local mirror returns a 404

% curl http://npm.tuna.tsinghua.edu.cn/registry/xxxxxxxx
{"error":"not_found","reason":"document not found"}

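Nginx should act as a forward proxy and request the official registry on the client's behalf. This is useful when a package has just been published to the official registry and the local mirror has no record of it yet. To do this, change the location /registry and location /registry/_design blocks in the configuration: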

location /registry {
  proxy_redirect off;
  proxy_set_header Host $host;
  proxy_pass http://127.0.0.1:5984/registry/_design/app/_rewrite;
  # proxy_pass http://127.0.0.1:5984/registry;
  proxy_intercept_errors on;
  error_page 404 @official;
}

location /registry/_design {
  proxy_redirect off;
  proxy_set_header Host $host;
  proxy_pass http://127.0.0.1:5984/registry/_design;
  proxy_intercept_errors on;
  error_page 404 @official;
}

and add:

location @official {
  proxy_redirect off;
  proxy_pass http://registry.npmjs.org;
}

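proxy_intercept_errors makes Nginx intercept backend responses with status codes of 300 or higher instead of passing them straight to the client, so that error_page can take effect and redirect the 404 to the @official block.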
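The fallback behaviour amounts to "ask the local mirror first, and only on a 404 ask the official registry". The sketch below models that decision with plain lambdas standing in for the two backends. It does not describe how Nginx is implemented, and all names in it are made up for illustration.

```ruby
Response = Struct.new(:status, :body)

# Try the local mirror; fall back to the official registry only on 404.
# Other statuses are returned as-is, mirroring "error_page 404 @official".
def fetch_with_fallback(path, local:, official:)
  res = local.call(path)
  res.status == 404 ? official.call(path) : res
end

# Stand-in backends: the local mirror only knows one package.
local = lambda do |path|
  path == '/registry/known' ? Response.new(200, 'local') : Response.new(404, 'not_found')
end
official = ->(_path) { Response.new(200, 'official') }

puts fetch_with_fallback('/registry/known', local: local, official: official).body
puts fetch_with_fallback('/registry/xxxxxxxx', local: local, official: official).body
```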

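Landing page

Next, a landing page needs to be designed.

Jade is certainly the most convenient template engine, though adding trailing whitespace inside tags is hard to do with it. Slim, from the Ruby community, improves on Jade with some handy additions and is nicer to use.

For CSS, consider Stylus together with the nib plugin.

Using a proxy when syncing from upstream

When syncing from upstream, traffic is occasionally filtered and the sync cannot proceed. I have run into this several times: the sync makes no progress even after days, and ls -l shows that the database file's modification time never changes.

When that happens, temporarily remove the proxy-less upstream configuration and replace it with one that goes through a proxy: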

curl -sX POST admin:password@127.0.0.1:5984/_replicate -d '{"source":"http://isaacs.iriscouch.com/registry/", "target":"registry", "continuous":true, "cancel":true}' -H "Content-Type: application/json"
curl -sX POST admin:password@127.0.0.1:5984/_replicate -d '{"source":"http://isaacs.iriscouch.com/registry/", "target":"registry", "continuous":true, "create_target":true, "proxy":"http://127.0.0.1:xxxx"}' -H "Content-Type: application/json"

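The first command removes the existing proxy-less upstream configuration; the second adds one that uses the proxy.

After a few seconds you can switch back to the proxy-less upstream configuration: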

curl -sX POST admin:password@127.0.0.1:5984/_replicate -d '{"source":"http://isaacs.iriscouch.com/registry/", "target":"registry", "continuous":true, "cancel":true, "proxy":"http://127.0.0.1:xxxx"}' -H "Content-Type: application/json"
curl -sX POST admin:password@127.0.0.1:5984/_replicate -d '{"source":"http://isaacs.iriscouch.com/registry/", "target":"registry", "continuous":true, "create_target":true}' -H "Content-Type: application/json"

The first command removes the upstream configuration that uses the proxy; the second adds the proxy-less one back.
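In each pair of commands, the cancel body is the start body with "create_target" dropped and "cancel":true added. A small sketch that derives one from the other makes the pairing explicit. The helper names are made up, and the proxy port is left as the xxxx placeholder from the commands above.

```ruby
require 'json'

# Body used to start a replication, with an optional proxy.
def start_payload(proxy: nil)
  payload = {
    'source' => 'http://isaacs.iriscouch.com/registry/',
    'target' => 'registry',
    'continuous' => true,
    'create_target' => true
  }
  payload['proxy'] = proxy if proxy
  payload
end

# Matching cancel body, shaped like the curl commands above.
def cancel_payload(start)
  start.reject { |key, _| key == 'create_target' }.merge('cancel' => true)
end

proxied = start_payload(proxy: 'http://127.0.0.1:xxxx')
puts JSON.generate(cancel_payload(start_payload)) # cancel the proxy-less replication
puts JSON.generate(proxied)                       # start the proxied one
puts JSON.generate(cancel_payload(proxied))       # later: cancel the proxied one
puts JSON.generate(start_payload)                 # and go back to proxy-less
```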

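Sync script

The sync script below tracks the modification time of the registry.couch file to decide whether CouchDB's replication is stuck, and if so temporarily switches to the proxy:

Update, October 16, 2013: according to http://wiki.apache.org/couchdb/Replication#Cancel_replication, the way to cancel a replication changed starting with CouchDB 1.2.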

#!/usr/bin/env ruby
require 'net/http'
require 'net/smtp'
require 'json'

LOG_ROOT_DIR = File.expand_path '~mirror/log'
LOG_DIR = File.join LOG_ROOT_DIR, 'npm'
STATUS_FILE = File.join LOG_DIR, 'status.txt'
DB_PATH = '/srv/local/npm/couchdb/registry.couch'
USER = 'admin'
PASSWORD, HTTP_PROXY = JSON.parse(File.read '/srv/local/npm/couchdb/passwd').values_at 'password', 'http_proxy'
WAIT = 30
DEBUG = false

replicate = ->opts=({}) {
  req = Net::HTTP::Post.new '/_replicate'
  req.basic_auth USER, PASSWORD
  if opts[:cancel]
    form_data = {'replication_id' => opts[:replication_id], 'cancel' => true}
  else
    form_data = {'source'=>'http://isaacs.iriscouch.com/registry/',
      'target'=>'registry', 'continuous'=>true, 'create_target'=>true}
    form_data.update 'proxy'=>HTTP_PROXY if opts[:proxy]
  end
  req.set_content_type 'application/json'
  req.body = form_data.to_json
  res = Net::HTTP.new('localhost', 5984).request req
  puts res.body if DEBUG
}

cancel_all_replicates = ->sources {
  sources.each {|source|
    puts "replication: #{source['replication_id']}"
    replicate[cancel: true, replication_id: source['replication_id']]
  }
}

get = ->path {
  req = Net::HTTP::Get.new path
  req.basic_auth USER, PASSWORD
  Net::HTTP.new('localhost', 5984).request req
}

get_json = ->path {
  JSON.parse get[path].body
}

stuck = ->{
  Time.new - File.stat(DB_PATH).mtime > 60 * 10
}

write_log = ->log{
  puts "write log: #{log}" if DEBUG
  File.write STATUS_FILE, log
}

send_email = ->{
  name = 'Ray'
  from = 'issues.tuna@gmail.com'
  to = 'i@maskray.me'
  server = 'localhost'
  msg = <<E
From: #{name} <#{from}>
To: #{to}
Subject: npm ERROR on #{Time.now}

failed to sync
E

  Net::SMTP.start(server) do |smtp|
    smtp.send_message msg, from, to
  end
}

# Load the last status record; fall back to a default one if the file is missing or empty.
status = (File.readlines(STATUS_FILE)[0].chomp.split(',') rescue 'failed,1377760143,-,0,74.5217GB,0,0,0,0,0,0,0,0'.split(','))
sources = get_json['/_active_tasks']
last_fail = Time.now

loop do
  begin
    sources = get_json['/_active_tasks'] # refresh the task list on every pass
    size = get_json['/registry']['disk_size']
    failed = ->{
      status[0] = 'failed'
      status[4] = size
      write_log["#{status.join(',')}\n"]
      if Time.now - last_fail > 60 * 60 * 24
        send_email[]
        last_fail = Time.now
      end
    }

    if sources.empty?
      replicate[]
    elsif stuck[]
      cancel_all_replicates[sources]
      replicate[proxy: true]
      sleep WAIT
      if stuck[]
        cancel_all_replicates[sources]
        failed[]
      else
        cancel_all_replicates[sources]
        replicate[]
      end
    else
      updated_on, _progress = sources[0].values_at 'updated_on', 'progress'
      write_log["success,#{updated_on},-,0,#{size},0,0,0,0,0,0,0,0"]
    end
  rescue => e
    $stderr.puts e
  ensure
    sleep 60
  end
end