Suggesting Related Posts with Ruby and Jekyll

Having a static website blog comes with many limitations, and I have, for the most part, learned to live with them. The speed, simplicity, low cost, and security of a static website outweigh a lot of its shortcomings.

One thing I tried building is a feature that automatically displays related posts. It would be easy if this site ran on a server that calculates stuff on the fly, but it doesn’t. I need to find a clever way to efficiently recalculate what is related to each blog post when the site builds.

The Goal

Here’s what I’m trying to achieve:

When this static site builds, it should display related posts for each post, ranked by relevancy.
The process should be fairly automatic, with no manual intervention or tagging.
It must work with my myriad of posts in Chinese.
It must work with my Bundler-based tech stack on Netlify; a Java environment is out of the question.

The Method

tf–idf Model

tf–idf stands for Term Frequency–Inverse Document Frequency. There are many websites that write about it, so I’m not gonna. A good starting point is just this Wikipedia page. Essentially, tf–idf scores a word by how often it shows up in a given post, offset by how common that same word is across all posts. A word that is frequent in one post but rare elsewhere says a lot about that post's topic.

With this model, we should be heading in the right direction: once every post is a vector of tf–idf weights, we can calculate the similarity of any two given blog posts.
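To make the idea concrete, here is a minimal sketch in plain Ruby that scores a toy corpus by hand. It uses the textbook tf–idf weighting and cosine similarity; the document names and word lists are made up, and the exact formula inside the tf-idf-similarity gem may well differ from this one.

```ruby
# Toy corpus (hypothetical): each document is already split into words.
docs = {
  "snowboard_a" => %w[snowboarding cypress mountain powder],
  "snowboard_b" => %w[snowboarding whistler powder],
  "recipe"      => %w[noodle soup recipe]
}

# Inverse document frequency: words found in fewer documents score higher.
def idf(term, docs)
  containing = docs.values.count { |words| words.include?(term) }
  Math.log(docs.size.to_f / containing)
end

# tf-idf vector for one document, as a Hash of term => weight.
def tfidf_vector(words, docs)
  words.group_by(&:itself).map do |term, occurrences|
    tf = occurrences.size.to_f / words.size
    [term, tf * idf(term, docs)]
  end.to_h
end

# Cosine similarity between two sparse vectors stored as Hashes.
def cosine(a, b)
  dot = a.sum { |term, weight| weight * b.fetch(term, 0.0) }
  norm = ->(v) { Math.sqrt(v.values.sum { |w| w * w }) }
  denom = norm.call(a) * norm.call(b)
  denom.zero? ? 0.0 : dot / denom
end

vectors = docs.transform_values { |words| tfidf_vector(words, docs) }
sim_snow  = cosine(vectors["snowboard_a"], vectors["snowboard_b"])
sim_mixed = cosine(vectors["snowboard_a"], vectors["recipe"])
```

The two snowboarding documents share words and get a positive score, while the recipe shares nothing with them and scores zero.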

Stop words

Stop words are connector words that don’t contribute much to a document’s topic. The most common ones are and, or, because, etc. Removing stop words before feeding the text to the tf–idf model will help it understand relevancy a lot better.

And because I write a lot of stuff in Chinese, I downloaded a stop word list that contains both English and Chinese words and phrases.
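As a rough illustration of stop-word removal on mixed English and Chinese text, the sketch below builds one regular expression from a tiny, made-up list. Matching English stop words only as whole words is an assumption of this sketch (so that "and" does not eat into "sandy"); Chinese entries are matched anywhere, since Chinese text has no spaces between words.

```ruby
# Hypothetical, tiny stop-word list; a real list would be far longer.
stop_words = %w[and or because 的 了]

# Build a single regex: Latin-script words need word boundaries,
# everything else (here, Chinese) is matched as a plain substring.
def stop_word_regex(stop_words)
  latin, other = stop_words.partition { |w| w.match?(/\A[a-z']+\z/i) }
  parts = []
  parts << /\b(?:#{Regexp.union(latin).source})\b/i unless latin.empty?
  parts << Regexp.union(other) unless other.empty?
  Regexp.union(parts)
end

# Remove stop words, then collapse the double spaces left behind.
def remove_stop_words(text, stop_words)
  text.gsub(stop_word_regex(stop_words), "").squeeze(" ")
end

cleaned = remove_stop_words("sandy shore and 我的 board", stop_words)
```

Here "and" and "的" are dropped, while "sandy" and "shore" survive intact even though they contain "and" and "or".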

Jekyll Plug-in

Liquid is nice and all, but it’s apparently not powerful or flexible enough to handle a proper tf–idf problem. Jekyll is built on top of Ruby on Rails, and we can easily write a plug-in to do the heavy lifting.

With this plug-in, we want to be able to—

Iterate through the posts and build a tf–idf model quickly and efficiently
Calculate similarity score for any given two posts
For each blog post, find the top 3 relevant posts. Note that:
1. We hope to include text in the title, too, and assign heavier weight to the text in the title. To achieve that, I just repeat the text in the title 5 times, and prepend that to the content of the post.
2. We don’t want relevant posts to contain the previous or the next post, whose links will appear right beside the links of related posts. For that, I actually extract the top 5 posts by similarity score, so that 3 are still left even if up to 2 of them can’t be included.
3. We need a similarity cutoff, the minimum score to reach for a post to be considered “related”. I arbitrarily settled on 0.25 after some tuning.
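The three notes above can be sketched in isolation, without Jekyll. The helper names (`weighted_text`, `pick_related`), the post slugs, and the scores below are all hypothetical; only the numbers 5, 3, and 0.25 come from the notes.

```ruby
# Note 1: give the title extra weight by repeating it before the content.
def weighted_text(title, content, title_weight: 5)
  ([title] * title_weight).join(" ") + " " + content
end

# Notes 2 and 3: keep scores above the cutoff, take a pool of the best 5,
# drop the previous/next posts, and show at most 3 of what is left.
def pick_related(scores, neighbours:, cutoff: 0.25, pool: 5, limit: 3)
  candidates = scores
    .select { |_slug, score| score > cutoff }
    .sort_by { |_slug, score| -score }
    .first(pool)
    .map(&:first)
  (candidates - neighbours).first(limit)
end

# Made-up similarity scores of one post against the others.
scores = {
  "post-a" => 0.62, "post-b" => 0.31, "post-c" => 0.12, "post-d" => 0.48,
  "post-e" => 0.27, "post-f" => 0.90, "post-g" => 0.26
}

related = pick_related(scores, neighbours: ["post-a", "post-x"])
```

With these numbers the pool of 5 is f, a, d, b, e; dropping the neighbour post-a leaves f, d, b as the three related posts.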

Chinese Pinyin Conversion

Another crucial piece is the conversion from Chinese characters to Pinyin (or latinized) “words”.

After some searching, I found that this chinese_pinyin gem works well. Most importantly, it runs without any external requirements, so it fits Netlify's build environment. It may not be perfect, but it gets the job done.

I chose to convert to Pinyin with tone marks, so that “你好” becomes nĭ hăo instead of plain ni hao. Without tone marks, the same Latin spelling may represent many different words, which reduces the accuracy of the tf–idf model. I’ve tried that.
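A quick way to see the ambiguity, using only Ruby's standard library: the sketch below strips tone marks from two Pinyin syllables that mean opposite things. The helper `strip_tone_marks` is hypothetical and is not part of the chinese_pinyin gem; it also removes the diaeresis on ü, which is fine for a demonstration but not for real use.

```ruby
# Decompose accented letters (NFD) and drop the combining marks.
def strip_tone_marks(pinyin)
  pinyin.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

buy  = "mǎi" # 买, "to buy"
sell = "mài" # 卖, "to sell"

# With tone marks the two tokens differ; without them they collide.
distinct_with_tones = buy != sell
same_without_tones  = strip_tone_marks(buy) == strip_tone_marks(sell)
```

To a tf–idf model, tokens that collide like this count as the same term, so posts about buying and posts about selling would look more alike than they are.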

The Code

These are the gems I used:

matrix and tf-idf-similarity for the backbone of tf–idf implementation
chinese_pinyin for converting Chinese characters into pinyin
narray for more efficient vector-based calculation

Below is the entire Jekyll plug-in that I wrote in Ruby, placed in the _plugins directory:

require 'matrix'
require 'tf-idf-similarity'
require 'chinese_pinyin'
require 'narray'

module PostSimilarity
  class Generator < Jekyll::Generator
	def generate(site)
	  
	  # Get stop words
	  sw = site.data["stopwords"]
	  
	  # Generate corpus
	  corpus = []
	  all_posts = site.posts.docs
	  for post in site.posts.docs
		document = generate_doc(post, sw)
		corpus.append(document)
	  end
	  
	  # Build model
	  model = TfIdfSimilarity::TfIdfModel.new(corpus, :library => :narray)
	  matrix = model.similarity_matrix
	  
	  # Loop through posts
	  all_posts.each_with_index do | post, index |
		doc = corpus[index]
		top5_scores = []
		top5_posts = []
		all_posts.each_with_index do | comparing_post, comparing_index |
		  if index != comparing_index
			comparing_doc = corpus[comparing_index]
			sim = matrix[
			  model.document_index(doc), 
			  model.document_index(comparing_doc)
			]
			
			rank = top5_scores.count { |s| sim < s }
			if sim > 0.25 && rank < 5 && (comparing_post.data["exclude_from_related"] || false) == false
			  top5_scores.insert(rank, sim)
			  top5_posts.insert(rank, comparing_post)
			  if top5_scores.size > 5
				top5_scores.pop()
				top5_posts.pop()
			  end
			end
		  end
		end
		
		# Store the candidates once all other posts have been compared
		post.data["related_posts"] = top5_posts
	  end
	  
	end
	
	def generate_doc(post, stopwords)
	  title_array = Array.new(5, post.data["title"])
	  title_string = title_array.join(" ")
	  text = title_string + " " + post.content
	  
	  stopwords_regex = Regexp.union(stopwords)
	  text = text.gsub(stopwords_regex, "")
	  
	  latin = Pinyin.t(text, tonemarks: true)
	  doc = TfIdfSimilarity::Document.new(latin)
	  doc
	end
	
  end
end

And this link gives you the list of stop words that I added to the _data directory. You can obviously Google and use any sensible list of stop words.

Finally, you can use the page.related_posts property to insert HTML elements in your layout. It holds up to 5 candidates, so the template skips the previous and next posts and stops after 3:

{% assign post = page %}
{% assign next_post = ... %}
{% assign prev_post = ... %}

{% if post.related_posts %}
	{% assign count = 0 %}
	{% for p in post.related_posts %}
		{% if p != next_post and p != prev_post %}
			<h1 class='related'><a href="{{ p.url }}">
				{{ p.title }}
			</a></h1>
			{% assign count = count | plus: 1 %}
			{% if count >= 3 %}{% break %}{% endif %}
		{% endif %}
	{% endfor %}
{% endif %}

The Result

You are looking at this website as the result—related posts are pulled automatically and added to the end of each blog post.

Take a look at this post about snowboarding on Cypress Mountain, for example: the related posts are about snowboarding at the same resort on another day, snowboarding somewhere else, or about some hikes also in Cypress Mountain.

Another example is my round-up for 2022: the related posts are all look-back posts or postmortems. I’m happy with the results.