Technical Digest: The basics of Keyword Vector Spaces

Whilst browsing this site, you may have noticed that certain pages present you with a list of “related items”, drawn both from our own content (“from headshift moments”) and from external websites (“from sites we read”). These links are generated automatically by the system, which vastly reduces the time we need to spend managing the site.
The suggested links are generated using a custom-built keyword vector space engine. Vector space techniques are very good at calculating the proximity of two content items. The basic procedure relies on maintaining a mathematical model of the website content and running simple algorithms in real time to calculate item proximity.
The mathematical model is based upon the words used in each content item. The procedure is as follows:

  • When an item is added to the system, it is first stripped of all punctuation, HTML and common “noise” words such as “and”, “so”, “but”, “because” etc.
  • The list of remaining words is then run through a Porter stemming algorithm, which reduces each word to a common stem so that singular and plural forms, different verb tenses and so on are grouped together.
  • Duplicate keywords are filtered, but their frequency is recorded.
  • This leaves us with a list of the distinct keywords contained in the item, together with the number of times each keyword is used.
  • Now comes the maths. Each keyword is set up as an orthogonal axis in our keyword vector space. The item can then be plotted as a vector on these axes. The diagram below shows an example item that contains the keyword “web” twice, and “blog” once. Of course, most items will contain more than two distinct keywords, but the idea is easier to visualise with two axes.
  • [Diagram: a basic vector space with one item]

  • Now that the vector space is set up, other content items can be plotted as vectors on the same space. If another item contains the word “blog” twice and “web” once, its vector will look like this:
  • [Diagram: a basic vector space with two items]

  • The angle between these two vectors is the measure of how close they are to each other. Those who remember the maths they did at school will now be shooting their hands up in the air with the words “dot product” erupting from their mouths. For those who can’t remember that far back, the dot product is a simple formula that, once divided by the product of the two vectors’ lengths, gives the cosine of the angle between them. From the above diagram, we can see that the shallower the angle, the closer the two items are. The cosine of zero degrees is 1 (the items use the same keywords in the same proportions) and the cosine of 90 degrees is zero (the items share no keywords).
  • Armed with these algorithms, whenever you view an item the website compares the keywords in your item with all of the other items in the database (both internal and externally syndicated). The top five matches, assuming they are close enough, are displayed to you, the user, for your viewing pleasure.
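
For the curious, the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code that runs this site: the noise-word list is a tiny made-up sample, `naive_stem` is a crude stand-in for a real Porter stemmer, and the function names and the `threshold` value for “close enough” are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny sample list; the site's real noise-word list is not given.
NOISE_WORDS = {"and", "so", "but", "because", "the", "a", "is", "of", "to", "in"}


def naive_stem(word):
    """Crude stand-in for a Porter stemmer: it only strips a plural "s"."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def keyword_vector(text, stem=naive_stem):
    """Map each distinct keyword in the text to the number of times it is used."""
    text = re.sub(r"<[^>]+>", " ", text)            # strip HTML tags
    words = re.findall(r"[a-z0-9]+", text.lower())  # strip punctuation
    return Counter(stem(w) for w in words if w not in NOISE_WORDS)


def cosine(a, b):
    """Cosine of the angle between two keyword vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    length_a = math.sqrt(sum(c * c for c in a.values()))
    length_b = math.sqrt(sum(c * c for c in b.values()))
    if length_a == 0 or length_b == 0:
        return 0.0
    return dot / (length_a * length_b)


def related_items(current_id, vectors, limit=5, threshold=0.1):
    """Return up to `limit` (item_id, score) pairs, closest first.

    `vectors` maps item ids to keyword vectors. `threshold` is a
    hypothetical cut-off standing in for "close enough".
    """
    current = vectors[current_id]
    scored = [
        (cosine(current, vec), item_id)
        for item_id, vec in vectors.items()
        if item_id != current_id
    ]
    close = sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
    return [(item_id, score) for score, item_id in close[:limit]]
```

Taking the two example items from the diagrams, `keyword_vector("web web blog")` and `keyword_vector("blog blog web")` give the vectors (2, 1) and (1, 2) on the “web” and “blog” axes. Their dot product is 2×1 + 1×2 = 4 and each has length √5, so `cosine` returns roughly 0.8: close, but not identical.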

2 Responses to Technical Digest: The basics of Keyword Vector Spaces

  1. By Headshift on April 28, 2006 at 5:06 pm

    How this site was built

    We have worked hard to create a site that is simple on the outside but clever on the inside to provide useful knowledge sharing features relevant to our areas of interest.

  2. By Karen Skelton on November 17, 2009 at 1:05 pm

    Hi,
    Thanks for the information on algorithms and vector space. Do you know which recruitment companies use matching algorithms and vector space techniques to match job seekers against jobs, so that they can match on skills and experience rather than just keyword search or CV parsing?
    Do companies such as Monster.com and hotjobs.com use this technology?
    thanks,
    Karen