2013-08-19

Implementing a "Most-Read Posts" Feature for a Static Site with the Google Analytics API

Hexo

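I have always stuck with static blogs. I first built the blog with make+m4, then switched to Hakyll, which is written in Haskell. Because its web toolchain (HTML, CSS, JS template engines, etc.) fell short, I moved to DocPad from the Node.js community last November. A recent DocPad upgrade broke the site, so in early August I migrated once more, this time to Hexo.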

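zippera has written several excellent articles about Hexo.

These are the functional features I can think of for a blog site:

Comment system
Web analytics (fairly easy to implement)
Related posts
Multiple languages, e.g. https://www.byvoid.com/zhs/
A list of the most-read posts

For comments I use Disqus, and for site analytics I use Google Analytics.

The rest of this post describes how to implement the "most-read posts" feature with Google Analytics.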

Google Analytics Tracking Code

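If you have already added the Google Analytics tracking code to your site, skip this section.

First add the tracking code to your pages; see Tracking Basics (Asynchronous Syntax). In practice you do not need to read the documentation: Hexo's default theme, light, already supports this, so you only need to edit this line in themes/light/_config.yml:

google_analytics: UA-35225578-1

Replace UA-35225578-1 with the property ID of your own site.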

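Fetching pageviews for all posts with the Google Analytics API

First register a new Gmail account (this method logs in with a password, so for safety do not use your main account), then grant it the User permission on the Admin > User Management page of Google Analytics.

We also need the profile ID of the site tracked by GA, which is somewhat hard to find. When you view your site in Google Analytics, the URL looks like this:

https://www.google.com/analytics/web/?hl=en&pli=1#croverview/cr-overview/aXXXXXXXXwYYYYYYYYpZZZZZZZZ/

Here ZZZZZZZZ is the profile ID.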
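
As a small illustration of where the profile ID sits, here is a minimal Python sketch that pulls it out of a report URL. The function name `extract_profile_id` and the sample digits are made up for this example, and it assumes the `a…w…p…` segment always appears in the form shown above.

```python
import re

def extract_profile_id(report_url):
    """Return the digits after 'p' in the a<digits>w<digits>p<digits> segment, or None."""
    # Assumption: the segment follows a '/' and consists of three digit runs.
    m = re.search(r"/a(\d+)w(\d+)p(\d+)", report_url)
    return m.group(3) if m else None

# Hypothetical URL with placeholder digits, shaped like the one above.
url = ("https://www.google.com/analytics/web/?hl=en&pli=1"
       "#croverview/cr-overview/a11111111w22222222p33333333/")
print(extract_profile_id(url))
```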

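The LiveScript script below authenticates with the Gmail account's username and password, then returns ga:pageviews for every post under the given profile ID, printed as JSON.

The script first logs in to the newly registered Gmail account using ClientLogin for Installed Applications and receives a string of the form Auth=xxxxxxxxxxxxxx from Google. It then uses this token to send a GET request to https://www.googleapis.com/analytics/v3/data/ga.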
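
The two steps can be sketched offline in Python without touching the network. This is only an illustration of the token handling and the query string: `parse_client_login`, `build_data_request` and the sample response body are hypothetical, and the sketch assumes the login response is a list of `key=value` lines containing an `Auth` entry, which is what the script's regular expression looks for.

```python
from urllib.parse import urlencode

def parse_client_login(body):
    """Parse a key=value-per-line response body into a dict."""
    fields = {}
    for line in body.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            fields[key.strip()] = value.strip()
    return fields

def build_data_request(token, profile_id):
    """Return (url, headers) for the GET request the script sends."""
    query = urlencode({
        "ids": f"ga:{profile_id}",
        "start-date": "2006-01-01",
        "end-date": "2026-01-01",
        "metrics": "ga:pageviews",
        "dimensions": "ga:pagePath,ga:pageTitle",
        "sort": "-ga:pageviews",
    })
    url = "https://www.googleapis.com/analytics/v3/data/ga?" + query
    headers = {"Authorization": f"GoogleLogin Auth={token}"}
    return url, headers

# Made-up response body; real values are long opaque strings.
sample_body = "SID=aaa\nLSID=bbb\nAuth=ccc\n"
token = parse_client_login(sample_body)["Auth"]
url, headers = build_data_request(token, 12345678)
print(url)
print(headers)
```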

% ls
node_modules/  out/  public/  scaffolds/  scripts/  source/  themes/  util/  db.json  package.json  twistd.log  _config.yml
% npm install -g LiveScript
% lsc util/generate-popular-json.ls

# util/generate-popular-json.ls
require! 'https'
require! 'querystring'

pp = (o) ->
  console.log JSON.stringify o, null, 2

class GA extends require('events').EventEmitter
  (@user, @password) ->

  receive: (res, cb) ->
    chunks = []
    len = 0
    res.on 'data', (chunk) ->
      chunks.push chunk
      len += chunk.length
    res.on 'end', ->
      buf = new Buffer len
      offset = 0
      for c in chunks
        c.copy buf, offset, 0
        offset += c.length
      cb buf.toString!

  login: (cb) ->
    options =
      host: 'www.google.com'
      port: 443
      method: 'POST'
      path: '/accounts/ClientLogin'
      headers: 'Content-Type': 'application/x-www-form-urlencoded'

    post-data =
      Email: @user
      Passwd: @password
      accountType: 'HOSTED_OR_GOOGLE'
      source: 'curl-accountFeed-v2'
      service: 'analytics'

    req = https.request options, (res) ~>
      @receive res, (data) ~>
        if m = data.match /(Auth=[^\s*]*)\s/
          @token = m[1]
          cb null, @token
        else
          cb data
    req
      ..write querystring.stringify post-data
      ..end!

  get: (request, cb) ->
    if @debug
      console.log 'token:', @token
    options =
      method: 'GET'
      host: 'www.googleapis.com'
      port: 443
      path: "/analytics/v3/data/ga?#{querystring.stringify request}"
      headers:
        Authorization: "GoogleLogin #{@token}"
        'GData-Version': 2
    req = https.request options, (res) ~>
      @receive res, (raw-data) ->
        data = JSON.parse raw-data
        if data.error?
          cb data.error.message
        else
          cb null, data
    req.end!

ga = new GA 'email address', 'password'
profile-id = 64586883

(err, _token) <- ga.login!
console.error(err)+process.exit(1) if err?

(err, results) <- ga.get {
  dimensions: 'ga:pagePath,ga:pageTitle'
  ids: "ga:#{profile-id}"
  'start-date': '2006-01-01'
  'end-date': '2026-01-01'
  metrics: 'ga:pageviews'
  sort: '-ga:pageviews'
}
console.error(err)+process.exit(2) if err?

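# Convert each [pagePath, pageTitle, pageviews] row into the {path, title, pageView} object that popular.js reads
pp [{path: r[0], title: r[1], pageView: parseInt(r[2], 10)} for r in results.rows]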

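Because the site is static, we need to refresh the pageviews periodically and sync the result to the web server. I use fcron (which has many more features than vixie-cron, cronie, etc.):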

% fcrontab -l
PATH=/usr/bin:/bin:/home/ray/bin:/home/ray/.local/bin
0 */6 * * *      cd ~/maskray.me; lsc util/generate-popular-json.ls > out/api/popular.json

That is, every six hours it fetches ga:pageviews for the posts and regenerates the static file served at http://maskray.me/api/popular.json.

Adding a Hexo widget that displays the "most-read posts"

Create the file themes/light/layout/_widget/popular.ejs:

<div class="widget popular">
  <h3 class="title">Popular</h3>
  <ul class="entry" id="js-popular">
  </ul>
</div>

In themes/light/_config.yml, add - popular under widgets:.

I created a directory named out to serve as the output directory of the static site, where:

% readlink out/blog
../public
% ls out/js/popular.js
out/js/popular.js

Compile the following LiveScript into out/js/popular.js:

truncate = (s) ->
  if s.length > 24
    "#{s.slice(0, 21)}..."
  else
    s

$.get '/api/popular.json', (data) ->
  data = JSON.parse data if typeof data is 'string'
  for row in data
    $('#js-popular').append $('<li>').append $('<a>').attr('href', row.path).text("#{truncate row.title} (#{row.pageView})")

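Append the following to the end of themes/light/layout/_partial/after_footer.ejs:

<script src="/js/popular.js"></script>

There is currently only one asset file, /js/popular.ls, so after each change I simply run lsc -c out/js/ by hand to regenerate it. Once there are more assets it will be time to switch to Grunt.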

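Update, June 3, 2017: OAuth 2 replaces ClientLogin

At some point in 2015, ClientLogin, which authenticated with a Google account password, stopped being supported, so you now have to use the Google API Client. Because the paths of some posts have changed over time, I also use a dirty trick to drop the stale links. The API I was using before 2017 has changed as well, so I updated the script again. Below is essentially the generation script I use now, with the ID information redacted: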

#!/usr/bin/env python3
# pip install --user pyopenssl google-api-python-client oauth2client httplib2
import httplib2

from googleapiclient.discovery import build
from googleapiclient.http import HttpError
from oauth2client.service_account import ServiceAccountCredentials
import re, sys, json, os

VIEW_ID = 'ga:XXXXXXXX'

def get_metrics():
    credentials = ServiceAccountCredentials.from_json_keyfile_name(
            os.path.join(os.path.dirname(os.path.abspath(__file__)), 'API Project-XXXXXXXXXXXX.json'),
            scopes=['https://www.googleapis.com/auth/analytics.readonly'])
    http = credentials.authorize(httplib2.Http())
    analytics = build('analytics', 'v4', http=http, discoveryServiceUrl='https://analyticsreporting.googleapis.com/$discovery/rest')
    result = analytics.reports().batchGet(body={'reportRequests': [{
        'viewId': VIEW_ID,
        'dateRanges': [{'startDate': '2012-01-01', 'endDate': 'today'}],
        'metrics': [{'expression': 'ga:pageviews'}],
        'dimensions': [{'name':'ga:pagePath'}, {'name':'ga:pageTitle'}],
    }]}).execute()['reports'][0]['data']['rows']

    #with open('/tmp/g.json') as f:
    #    result = json.load(f)['reports'][0]['data']['rows']
    return result

def normalize_path(path):
    return re.sub('[.?].*|\\/$', '', path)

def normalize_title(title):
    return re.sub('MaskRay [|]|[|] MaskRay', '', title).strip()

r = [(normalize_path(x['dimensions'][0]), normalize_title(x['dimensions'][1]), int(x['metrics'][0]['values'][0])) for x in get_metrics()]
r = sorted([x for x in r if re.match('/blog/20..-..-..-', x[0])], key=lambda x: x[0])
i = 0
rr = []
while i < len(r):
    j = i
    pv = 0
    opt_pv = i
    while j < len(r) and r[i][0] == r[j][0]:
        pv += r[j][2]
        if r[j][2] > r[opt_pv][2] and r[j][1] != '(not set)':
            opt_pv = j
        j += 1
    if not re.search('-document-viewer|build-system-tup|2013-07-28-beauty-of-programming|2012-11-12-build-website-with-docpad|2012-11-12-migrate-to-docpad|asc14-to-isc15-of-my', r[i][0]):
    #if not re.search('-document-viewer|build-system-tup|/-document-viewer|build-system-tup|2012-11-19-ai9|2012-11-12-build-website-with-docpad|2013-07-28-beauty-of-programming|2012-11-12-migrate-to-docpad|2015-05-01-jq-internals-bytecode|2015-03-26-leetcode-best-time-to-buy-and-sell-stock-iv|2015-03-25-elf-hacks|2015-03-22-bctf-2015-camlmaze|2015-03-13-debug-hacks-2|2012-11-01-perfect-maze-generation|2014-11-23-jsxajs-workgroup|2015-06-15-morris-post-order-traversal|2014-10-13-wechat-export|2012-09-09-parallel-n-body|2013-03-13-xv-olimpiada-informatyczna-etap-1-klo|2015-06-29-bmc-firmware-reverse-enginnering|2014-12-30-summary', r[i][0]):
        rr.append({'path':r[i][0], 'title':r[opt_pv][1], 'pageView':pv})
    i = j
rr.sort(key=lambda x: - x['pageView'])

# remove entries with wrong date
rrr = []
exist = set()
for x in rr:
    m = re.match('/blog/20..-..-..-([-\\w]*)', x['path'])
    if not m:
        rrr.append(x)
    elif m.groups()[0] not in exist:
        exist.add(m.groups()[0])
        rrr.append(x)
json.dump(rrr, sys.stdout, ensure_ascii=False)
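
The grouping loop in the script above merges rows whose normalized paths are equal, sums their pageviews, and prefers a title that is not "(not set)". As a rough sketch of the same idea using a dict instead of a sorted scan, the following may be easier to follow. `merge_rows` and the sample rows are invented for illustration, and the tie-breaking is not guaranteed to match the original loop exactly.

```python
import re
from collections import defaultdict

def normalize_path(path):
    # Same pattern as the script: drop everything from the first '.' or '?',
    # or a trailing '/'.
    return re.sub(r"[.?].*|/$", "", path)

def merge_rows(rows):
    """rows: iterable of (path, title, pageviews) tuples.

    Returns a list of {'path', 'title', 'pageView'} dicts, most viewed first.
    """
    totals = defaultdict(int)
    best = {}  # normalized path -> (pageviews, title) of the top variant with a real title
    for path, title, pv in rows:
        key = normalize_path(path)
        totals[key] += pv
        if title != "(not set)" and pv > best.get(key, (-1, ""))[0]:
            best[key] = (pv, title)
    merged = [
        {"path": key, "title": best.get(key, (0, "(not set)"))[1], "pageView": total}
        for key, total in totals.items()
    ]
    merged.sort(key=lambda x: -x["pageView"])
    return merged

# Made-up rows: three URL variants of one post plus a second post.
sample_rows = [
    ("/blog/2013-08-19-foo.html", "Foo", 10),
    ("/blog/2013-08-19-foo/", "(not set)", 50),
    ("/blog/2013-08-19-foo", "Foo (old title)", 3),
    ("/blog/2014-01-01-bar/", "Bar", 20),
]
merged = merge_rows(sample_rows)
print(merged)
```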

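Passing ensure_ascii=False makes the output contain Chinese characters directly instead of escapes such as \u1234. This makes it easier to convert the file into consistent Simplified and Traditional Chinese with OpenCC; see the post on Simplified/Traditional Chinese support in Nginx based on Accept-Language.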
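
To see the difference the flag makes, here is a minimal sketch with a made-up row. Both strings decode to the same data; only the second keeps the characters in a form that a text converter can work on directly.

```python
import json

# Hypothetical entry in the same shape as the script's output.
rows = [{"path": "/blog/2013-08-19-example", "title": "阅读最多文章", "pageView": 42}]

escaped = json.dumps(rows)                       # default: non-ASCII becomes \uXXXX escapes
readable = json.dumps(rows, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(readable)
```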
