In honor of search engine algorithm pioneer Professor Karen Spärck Jones, let’s take a closer look at her crowning achievement: Inverse Document Frequency (idf). It still has significant impact today.
Idf and its variants underpin practically all sophisticated modern search engine algorithms, including those used by Google, Bing, and DuckDuckGo. She published the idea behind tf*idf in a 1972 paper entitled “A statistical interpretation of term specificity and its application in retrieval” after working on the problem of term-based search throughout the 1960s.
Karen Spärck Jones is the “Einstein of Search”
Professor Spärck Jones invented what is arguably the most important component of relevance ranking, one that remains a key part of search engine algorithms today. Even where later modifications adjust ranking strategies for particular collections, her innovation is what let users search with more natural language.
Prior to her work, search was fairly difficult. Term frequency (tf) alone sorts documents by a measure of word density: the term count divided by the document’s word count. With tf alone you can’t easily use natural language and get meaningful results, because commonplace words such as “the” add noise.
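As a rough illustration of the term-count-over-word-count measure, here is a minimal Python sketch. The function name `term_frequency` and the sample sentence are made up for this example, and splitting on whitespace is a simplifying assumption rather than how any particular engine tokenizes text.

```python
def term_frequency(term, document):
    """Return the term count divided by the total word count of one document."""
    # Assumption: lowercase the text and split on whitespace to get words.
    words = document.lower().split()
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


doc = "the cat sat on the mat near the door"
print(term_frequency("the", doc))  # 3 of the 9 words
print(term_frequency("cat", doc))  # 1 of the 9 words
```

On this sentence the commonplace word “the” gets three times the score of “cat”, which is the kind of noise that makes tf hard to use on its own.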
The concept behind tf*idf is breathtaking in both its simplicity and its elegance, not unlike Einstein’s Theory of Relativity. “The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs,” Spärck Jones wrote.
The Algorithm has a Logarithm
Idf is the logarithm of the inverse document frequency, which can be thought of as a fraction: the number of all documents over the number of documents in which the search term appears. The effect is that words that rarely appear in the collection gain importance in relevance rankings.
At the same time, stop words are demoted because they appear in many documents, sometimes all the way down to zero when they appear in every document of a collection. If the word “the” appears in all 100 documents of a 100-document collection, then “the” scores exactly zero.
log(100/100) equals zero.
Variants of the math can avoid such words scoring zero by adding 1 to the document total. The result is still going to be a very small score.
log(101/100) equals roughly 0.004, using a base-10 logarithm.
A word that is rare, say appearing in only 1 document out of 100, is going to score far higher.
log(100/1) equals 2.
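The three calculations above can be reproduced with a short Python sketch. The function name `idf` and the `add_to_total` parameter are hypothetical names chosen for this example; the base-10 logarithm is inferred from the figures in the text, and real search engines use a range of different smoothing variants.

```python
import math


def idf(total_docs, docs_with_term, add_to_total=0):
    """Base-10 log of (document total plus optional adjustment) over documents containing the term."""
    if docs_with_term <= 0:
        raise ValueError("the term must appear in at least one document")
    return math.log10((total_docs + add_to_total) / docs_with_term)


print(idf(100, 100))               # "the" appears in every document: 0.0
print(round(idf(100, 100, 1), 4))  # add 1 to the document total: 0.0043
print(idf(100, 1))                 # a word found in only 1 document: 2.0
```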
To Stop or Not Stop Words
Whether to avoid a score of zero, for example, depends on the outcome you want, and you adjust the math accordingly. That’s the type of work you do when testing and tuning your search algorithms against your document collection. When your collection grows to the scale of the modern Web, you’re going to adjust constantly to improve relevancy and to compensate for the spam anomalies that crop up.
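One way to try out such an adjustment on a small collection is sketched below in Python. Everything here is an assumption made for illustration: the helper names `tf`, `idf`, and `rank`, the `add_to_total` parameter, the three sample documents, and the choice to score a document by summing tf times idf over the query words. It is a toy, not a description of how any production engine ranks results.

```python
import math


def tf(term, words):
    """Term count over word count for one tokenized document."""
    return words.count(term) / len(words) if words else 0.0


def idf(term, collection, add_to_total=0):
    """Base-10 log of the (optionally adjusted) document total over documents containing the term."""
    containing = sum(1 for words in collection if term in words)
    if containing == 0:
        return 0.0  # a term absent from the collection contributes nothing
    return math.log10((len(collection) + add_to_total) / containing)


def rank(query, documents, add_to_total=0):
    """Return (document, score) pairs sorted from highest to lowest score."""
    collection = [doc.lower().split() for doc in documents]
    terms = query.lower().split()
    scored = []
    for index, words in enumerate(collection):
        score = sum(tf(t, words) * idf(t, collection, add_to_total) for t in terms)
        scored.append((score, index))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(documents[i], s) for s, i in scored]


documents = [
    "the cat sat on the mat",
    "the dog chased the ball",
    "the the the the weather",
]
for adjustment in (0, 1):
    print(f"add_to_total={adjustment}")
    for text, score in rank("the cat", documents, adjustment):
        print(f"  {score:.4f}  {text}")
```

With no adjustment, “the” contributes nothing and only the document containing “cat” scores above zero. Adding 1 to the document total gives “the” a small weight, which changes the order of the remaining documents but still leaves the “cat” document on top.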
PageRank Has a Logarithm
Guess what other algorithm scores documents along a logarithmic scale? Google’s PageRank.
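That’s right: PageRank scores, at least as Google displayed them publicly, are widely understood to follow a logarithmic scale, much like idf. PageRank itself is a link-analysis algorithm rather than a term-weighting one, so it is better described as a companion to Professor Spärck Jones’s tf*idf than as a direct descendant. That’s not to say Google hasn’t altered the math to accommodate its huge collection of the Web’s documents. It most certainly has, to such an extent that it’s become complicated and heavily engineered.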
To paraphrase Gary Illyes: RankBrain is a machine learning ranking component that uses historical search data to predict what a user would most likely click for a previously unseen query. It frequently solves problems that Google used to run into with traditional algorithms. It saved us countless times whenever traditional algos were like, e.g. “oh look a ‘not’ in the query string! let’s ignore the hell out of it!” It’s relying on old data about what happened on the results page itself, not on the landing page.
It’s solid advice for webmasters to ignore what he calls “made up crap” about so-called dwell time, domain authority, click through rate (on search results), and so on because anything an enterprising SEO tries to prove with a limited study is bound to be missing several vital factors that aren’t translatable to others. The world is big enough that a cadre of like-minded folks will gather and reinforce made up crap. It’s natural. Instead, he says: “Search is much more simple than people think.”
Professor Spärck Jones’s tf*idf in Modern Search
Google is getting more sophisticated all the time. That doesn’t mean you should do SEO guesswork. Concentrate on making search engine-friendly websites with valuable and unique content. Let tf*idf be your guide. Search marketers should worry less about making sure specific popular keywords are on their pages and think more about writing unique content. Google is getting smarter at figuring out the words you would naturally use.
Classification of knowledge domains and document sets in collections, classification of websites, link analysis, and the modeling of website users and search users: work in all of these areas draws on Karen Spärck Jones’s invention of tf*idf, which, interestingly, has been modified in experiments so that it can be applied to each of them.
She was keen to stay up to date, as you can read in correspondence about tf*idf in 2004: “AltaVista applied tf*idf from the start, and it seems that most engines, somewhere, use something of the sort as one component of their matching strategies. It thus took about twenty five years for a simple, obvious, useful idea to reach the real world, even the fast-moving information technology one.”