Pavel Simakov - Looking At People Through Their Words

by Pavel Simakov on 2007-05-07

The Problem

I planned to attend the BarCamp Toronto meeting and was reviewing the list of attendees. The list was long, over 150 people, and I did not know most of them. Each registered person had a link back to his home page or blog. I tried reading a couple of pages from each person's blog or web site, hoping to get a sense of the person's interests and areas of expertise. By the 5th or 6th site I began thinking "there has got to be a better way": it all took too long and I was getting nowhere.

What if there was a software tool that could answer these questions:

What is a blog or a web site about?
What are the personal interests of a blogger?
What are the hot topics covered by a web site?

The Solution

The most obvious way to figure out what a web site or blog is all about is to simply look at its title. But wait a minute: I get e-mails daily with the subject "You have just won a huge pile of money! Click here to claim!" and I have never trusted them. Why should I trust a web site or blog title? I might, if I knew the authors to begin with, but I don't know the authors. And does the title reflect the content? Here is the title of Slashdot: "News for nerds, stuff that matters". You decide...

It would be better to have a software tool to crawl the site and summarize it. I have seen tools for text analysis, but not quite for the job at hand...

Google comes to mind now. Google has to perform a similar analysis to make its AdSense advertising work. For advertising to be relevant, Google has to summarize and classify all crawled web pages by their relevance to specific ad keywords. Google works well at the moment and it would be a great tool for data mining, but, unfortunately, there is no way today to use its crawled pages for non-Google purposes. The Alexa Web Search Platform might be of use here, but I had trouble estimating the CPU cost and, consequently, the price tag on my research.

Naive Bayes classifiers work very well for spam filtering. They can also work well enough for forum moderation and document classification, as we can gauge from the success story of the spam-free "Joel On Software" discussion boards. I could use a Bayes classifier to classify the blogs and web pages of the meeting attendees. But first I would have to train the classifier by showing it which pages belong to which categories. Since I do not know the people attending or any of their work, training is difficult. Self-clustering Bayes classifiers are probably out there, but that's a whole other story.

So instead I wrote a small software tool that I call WordMatrix, which turned out to do what I needed. For given groups of web pages or blogs, WordMatrix creates an HTML report containing two lists: the list of words common to all groups and the list of words common in only one group. Each word is color coded so that the highest-frequency words are full-intensity blue, fading to white as word frequency decreases. Each word or phrase in the report is followed by a subscript number showing its score in some difficult-to-explain units, with 100 being the highest and 0 being the lowest frequency of occurrence in the text.
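The scoring and coloring described above can be sketched roughly as follows. This is only an illustration, not the actual WordMatrix code: the class name ScoreColoring, both method names, and the simple "divide by the largest count" scaling are my assumptions, since the real score units are not spelled out here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of WordMatrix itself.
final class ScoreColoring {

    /** Rescales raw counts so that the largest count becomes 100 (assumed scaling). */
    static Map<String, Integer> scaleTo100(Map<String, Integer> counts) {
        int max = 0;
        for (int c : counts.values()) {
            max = Math.max(max, c);
        }
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int score = (max == 0) ? 0 : Math.round(100f * e.getValue() / max);
            scores.put(e.getKey(), score);
        }
        return scores;
    }

    /** Maps a 0..100 score to a CSS color: 100 is pure blue, 0 is white. */
    static String toCssColor(int score) {
        int clamped = Math.max(0, Math.min(100, score));
        // Red and green rise from 0 (pure blue) to 255 (white) as the score drops.
        int rg = Math.round(255f * (100 - clamped) / 100);
        return String.format("#%02x%02xff", rg, rg);
    }
}
```

With this mapping a word scoring 100 would be rendered as #0000ff and a word scoring 0 as #ffffff.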

In a process similar to the single-word analysis, phrase analysis is performed for phrases of different lengths; the most useful phrase length turned out to be 2. I call the reports the Frequent Words Report and the Frequent Phrases Report respectively. Overall, the processing requires large amounts of memory but is fast: it takes only seconds to analyze the web pages and blogs used in this article.
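Counting phrases of a given length is conceptually simple. Here is a minimal sketch, assuming the text has already been split into tokens; the class PhraseCounter and its method are hypothetical names for illustration and are not taken from WordMatrix.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of WordMatrix itself.
final class PhraseCounter {

    /** Counts every run of n consecutive tokens as one phrase. */
    static Map<String, Integer> countPhrases(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("phrase length must be at least 1");
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            // Join the window [i, i + n) into a single space-separated phrase.
            String phrase = String.join(" ", tokens.subList(i, i + n));
            counts.merge(phrase, 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling countPhrases(tokens, 2) gives the word-pair counts; calling it with 1 degenerates into plain word counting.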

The Subjects

Now that we have the programming issues out of the way, let me show you how I used WordMatrix to summarize some web sites and to discover what their focus areas and hot topics are. I have selected some well-known online properties for analysis. Their authors are among the best writers and most respected authorities (in their respective fields) in the world. Their writing touches on software and social engineering subjects. I also happen to read most of them, which made it easier for me to check whether WordMatrix's reports supported my personal observations. Here are the sites I chose:

Martin Fowler's Bliki
Clay Shirky's articles
Dr. Phil's web site
Noam Chomsky's writing
Joel Spolsky's blog
"Joel On Software" discussion boards

Pretend for a moment that you do not know who these people are or what they write about. Let's see if WordMatrix helps us find out.

Comparing Interests of Several Authors

In this exercise we try to find out the unique interests that each author has compared to all the other authors. As a side effect we will find the common vocabulary that all authors prefer to use. For this test I have selected 5-10 articles from each author as a representative set for WordMatrix to process. The 10 most common words per author are:

Martin Fowler: dsl₁₀₀ |inject₈₈ |workbench₈₅ |software₈₂ |agile₇₉ |mock₇₇ |use₇₄ |language₇₂ |representation₆₅
Joel Spolsky: window₁₀₀ |hungarian₇₁ |software₇₀ |code₅₇ |microsoft₅₆ |bug₅₂ |use₄₉ |string₄₅ |charge₃₈ |joel₃₇
Noam Chomsky: terror₁₀₀ |state₃₉ |terrorist₃₉ |israeli₃₄ |bomb₃₀ |us₂₉ |military₂₇ |israel₂₆ |mai₂₅ |war₂₅
Clay Shirky: web₁₀₀ |group₉₄ |user₉₂ |software₉₀ |weblog₈₀ |social₇₈ |pattern₆₇ |people₅₄ |because₅₂ |semantic₄₂
Dr. Phil: family₁₀₀ |rituals₅₅ |factor₅₅ |rhythm₅₅ |crisis₄₅ |child₃₆ |parent₃₆ |children₂₇ |meaningful₁₈ |step-by-step₁₈

The 5 most common word phrases per author are:

Martin Fowler: language workbench₁₀₀ |software developer₅₇ |service locator₅₅ |language oriented₄₅ |agile method₄₄
Joel Spolsky: hungarian notation₁₀₀ |windows api₁₀₀ |coding convention₈₉ |operating system₈₈ |joel spolsky₇₈
Noam Chomsky: international terror₁₀₀ |united states₈₄ |human rights₅₆ |domestic constituencies₃₅ |terrorist act₃₂
Clay Shirky: semantic web₁₀₀ |social software₈₀ |power law₆₆ |web school₆₀ |law distribution₄₀
Dr. Phil: family life₁₀₀ |promote rhythm₁₀₀ |traditions family₁₀₀ |crisis will₁₀₀ |children ears₅₀

We have just discovered that Martin Fowler writes about programming languages and agile software development, Joel Spolsky worries about coding conventions and the Windows API, Noam Chomsky focuses on international terrorism and human rights, and Dr. Phil is all about family and child-parent relations.

Here are the complete WordMatrix analysis reports for your own review: Frequent Words Report + Frequent Phrases Report

Discovering Change in Person's Interests over Time

In this exercise we will try to observe how the interests of one author change over time. Even in our fast-paced and quickly changing world, several authors have managed to keep their past articles in good order. I collected articles and grouped them on a yearly basis for WordMatrix to process.

I have quickly learned that Noam Chomsky covers:

in 2006: kamm₁₀₀ |m-w₆₅ |medical assistance₁₀₀ |oil natural₁₀₀ |energy sector₁₀₀
in 2005: language₁₀₀ |chavez₆₄ |orleans₅₇ |social security₁₀₀ |intelligent design₉₅ |internal language₈₅
in 2004: haiti₁₀₀ |palestinian₉₂ |moral values₁₀₀ |chemical warfare₉₆ |war terror₈₉
in 2003: iraq₁₀₀ |saddam₇₉ |preventive war₁₀₀ |war terror₇₁ |grand strategy₄₈
in 2002: taliban₁₀₀ |afghan₇₉ |war terror₁₀₀ |bin laden₉₃ |international terror₈₃
in 2001: voter₁₀₀ |disenfranchised₅₇ |permanent interests₁₀₀ |neoliberal reforms₇₂ |capital mobility₇₂
in 2000: kosovo₁₀₀ |serb₈₆ |albanian₆₃ |nato₅₈ |colombia plan₁₀₀ |nato bombs₇₉ |bombing campaign₆₆
in 1999: kosovo₁₀₀ |fbi₉₃ |east timor₁₀₀

Over the same period, but in the software engineering part of the world, Joel Spolsky covers:

in 2006: ajax calendar₁₀₀ |pointers recursion₁₀₀ |functional program₁₀₀ |cs degree₆₇ |wiki₃₇ |ajax₃₇
in 2005: project aardvark₁₀₀ |usability test₁₀₀ |hungarian notation₉₈ |coding convention₉₃ |hiring top₈₂
in 2004: social interface₁₀₀ |rosh gadol₁₀₀ |rosh katan₁₀₀ |windows api₆₁ |demand curve₄₃ |social software₃₇
in 2003: aol₁₀₀ |lease₁₀₀ |landlord₈₉ |code point₁₀₀ |tenant broker₆₄ |office space₅₆ |character set₆₀
in 2002: dave₁₀₀ |groove₁₀₀ |commodity₉₄ |product vision₁₀₀ |leaky abstraction₇₉ |asp net₇₉ |vnc₆₈ |open source₆₇
in 2001: citydesk₁₀₀ |tile floor₁₀₀ |task switch₁₀₀ |citydesk beta₈₆ |pascal string₇₅ |usability test₇₃ |dog food₇₁
in 2000: netscape₁₀₀ |spam spam₁₀₀ |work₈₆ |bonus₇₇ |stock option₁₀₀ |program manager₉₆
in 1999: sabbatical₁₀₀ |next big₁₀₀ |last job₈₆

Here are the corresponding WordMatrix analysis reports for your own review:

Noam Chomsky writing during 1967-2006: Frequent Words Report + Frequent Phrases Report
Joel Spolsky "Joel On Software" writing during 1999-2006: Full Length Articles Frequent Words Report + Frequent Phrases Report, Old Front Pages Frequent Words Report + Frequent Phrases Report

Discovering Hot Topics in a Website or Blog

The last exercise in the series is about finding hot topics in a web site or blog. Larger blogs and web sites are organized into sections, categories and so on. It is quite challenging to name these sections or categories with meaningful titles. The titles are short and might not properly reflect the content collected under them, especially as blogs and web sites evolve over time. What kind of content do you think the categories "General", "Recent", "Interviews", "Design", "Leisure" or "Tools" have? What topics do they cover? What is the web site about?

Using WordMatrix I quickly discover that:

"Leisure" category for Martin Fowler's Bliki focuses on: board game₁₀₀ |music₇₄ |saba₄₇ |film₄₃ |dive₄₀ |us jazz₃₈
"Interviews" category for Clay Shirky's writings means: micropayment₁₀₀ |media₈₉ |news₈₂ |media outlet₁₀₀ |good design₆₉ |recording industry₆₇ |music industry₅₈ |cable channels₅₄

At the same time there seems to be no dominant theme on any of the "Joel On Software" Discussion Boards, with the possible exception of the Tech. Interviews board, where members talk a lot about C code that manipulates char*. In my experience this is quite typical of a forum, where many individual contributors post short pieces of content with varying styles and purposes.

Here are the corresponding WordMatrix analysis reports for your own review:

Martin Fowler's Bliki Frequent Words Report + Frequent Phrases Report
Clay Shirky's Writings About the Internet Single Word Report + Word Pair Report
"Joel On Software" Discussion Boards Single Word Report + Word Pair Report

Final Word

The results of WordMatrix are quite encouraging. You might suspect this work of being a self-fulfilling prophecy, because I picked articles by people whose work I already knew. But it is not. The approach and the tool are applicable to any blog and any web site. These days I always run WordMatrix to learn about any new person I plan to meet or communicate with. It helps me to communicate better with other people. It helps me to use the right words, so to speak.

Just remember that we have entered a new world where every word you say, print, blog, SMS, draw or click can be recorded. It can be further analyzed, classified, processed, translated, stemmed, and cross-correlated with anything, including your mother's maiden name, the color of your eyes, the time you took to comprehend the page you have just read, and the IP address you use to connect to your favorite web pages, potentially including this one...

The Implementation Details

After a couple of attempts to use the Classifier4J code base to process OPML and RSS feeds from Bloglines and Java Blogs, I gave up on the pure Bayesian classifier approach. What is needed is a smart summarizer, a smart filter, a correlation finder - not a classifier.

In WordMatrix, as in the Naive Bayesian approach, blogs and web pages are modeled as sets of words with independent probabilities. But instead of computing a document score, the analysis is conducted on the basis of the scores for individual words and phrases. The words, not the documents, are classified as likely or not likely to be used by a specific author or set of authors. If a given word has a high probability of occurring in the articles of all authors, it is classified as "common to all authors". But if a given word has a high probability of occurring only in the articles of one author, it is classified as "specific to particular author". This approach works without modification for any pairwise comparison, or for measuring the similarity between any pair of authors or any grouping of authors.
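To make the word-level classification concrete, here is a minimal sketch of the idea. It assumes that per-author relative word frequencies are already available and that "high probability" simply means "at or above some threshold"; the threshold, the class WordClassifier and the exact label strings are my illustrative assumptions, and the real WordMatrix scoring is more involved than this.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of WordMatrix itself.
final class WordClassifier {

    static final String COMMON = "common to all authors";

    /**
     * Labels each word as COMMON or "specific to <author>". Words that are
     * frequent for some, but not all, authors get no label in this sketch.
     */
    static Map<String, String> classify(Map<String, Map<String, Double>> freqByAuthor,
                                        double threshold) {
        Set<String> vocabulary = new HashSet<>();
        for (Map<String, Double> freqs : freqByAuthor.values()) {
            vocabulary.addAll(freqs.keySet());
        }
        Map<String, String> labels = new TreeMap<>();
        for (String word : vocabulary) {
            int frequentIn = 0;
            String lastAuthor = null;
            for (Map.Entry<String, Map<String, Double>> e : freqByAuthor.entrySet()) {
                if (e.getValue().getOrDefault(word, 0.0) >= threshold) {
                    frequentIn++;
                    lastAuthor = e.getKey();
                }
            }
            if (frequentIn == freqByAuthor.size()) {
                labels.put(word, COMMON);
            } else if (frequentIn == 1) {
                labels.put(word, "specific to " + lastAuthor);
            }
        }
        return labels;
    }
}
```

Because the input is just a map from group name to word frequencies, the same method works whether a group is one author, one year of an author's writing, or one category of a blog.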

I wrote the WordMatrix analyzer tool in Java. The analyzer is a command line tool; it takes a single XML configuration file as input and produces a report. The input file contains the list of web pages to crawl, the list of web page groups and the assignment of each page to a group. The various processing options include the use of stemming, a list of common words to ignore, etc.

For each page: the page is fetched over HTTP and converted to plain text using the Tidy HTML parsing library and a custom XML DOM processor. The XML DOM processor makes it possible to selectively include or exclude parts of the HTML document in the resulting text. Thus we can filter out <script>, <head>, and the HTML header/footer that are identical on all pages of a specific web site.
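The DOM filtering step can be illustrated with the standard Java XML APIs. The sketch below assumes the page has already been cleaned up into well-formed XHTML (the job Tidy does in the real tool) and only filters by tag name; the class DomTextExtractor is a hypothetical name, and filtering site-specific headers and footers would need extra rules that are not shown.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not part of WordMatrix itself.
final class DomTextExtractor {

    /** Returns the visible text of well-formed XHTML, skipping the excluded tags. */
    static String extractText(String xhtml, Set<String> excludedTags) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), excludedTags, out);
            // Collapse the whitespace left over from markup into single spaces.
            return out.toString().trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    private static void collect(Node node, Set<String> excludedTags, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && excludedTags.contains(node.getNodeName().toLowerCase(Locale.ROOT))) {
            return; // skip the whole subtree, e.g. <script> or <head>
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, excludedTags, out);
        }
    }
}
```

Passing a set containing "head" and "script" drops the page title, metadata and any inline JavaScript from the extracted text.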

The resulting text is tokenized and stemmed using the Snowball stemmer with the default set of stop words. The word frequencies are computed by counting the word occurrences in the source text. For each page group: the word frequencies in the group are computed by aggregating the word frequencies of the individual pages in the group, with weights proportional to the page size.
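Here is a rough sketch of the counting and aggregation just described. Several things in it are assumptions made for illustration: the class GroupFrequencies is a made-up name, stemming is left out because the Snowball stemmer is a separate library, and "page size" is taken to mean the number of tokens remaining on the page after stop words are removed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of WordMatrix itself.
final class GroupFrequencies {

    /** Lowercases, splits on non-letters, and drops stop words. No stemming here. */
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty() && !stopWords.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Relative frequency of each word within a single page. */
    static Map<String, Double> pageFrequencies(List<String> tokens) {
        Map<String, Double> freq = new HashMap<>();
        for (String t : tokens) {
            freq.merge(t, 1.0 / tokens.size(), Double::sum);
        }
        return freq;
    }

    /** Weighted average of page frequencies; a page's weight is its share of the group's tokens. */
    static Map<String, Double> groupFrequencies(List<List<String>> pages) {
        int total = 0;
        for (List<String> page : pages) {
            total += page.size();
        }
        Map<String, Double> group = new HashMap<>();
        if (total == 0) {
            return group;
        }
        for (List<String> page : pages) {
            double weight = (double) page.size() / total;
            for (Map.Entry<String, Double> e : pageFrequencies(page).entrySet()) {
                group.merge(e.getKey(), weight * e.getValue(), Double::sum);
            }
        }
        return group;
    }
}
```

With token-count weights, a long article influences the group profile more than a short note, which is the effect the size-proportional weighting is meant to achieve.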

When all pages in all groups are processed, the similarity of groups and the word scores are computed using simple linear algebra. The various correlation reports are produced thereafter.
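One common way to compare two word-frequency profiles with simple linear algebra is cosine similarity, which treats each group as a vector indexed by word. I am showing it here only as an example of the kind of computation involved; the class GroupSimilarity is a hypothetical name, and this sketch should not be read as the exact formula WordMatrix uses.

```java
import java.util.Map;

// Hypothetical helper, not part of WordMatrix itself.
final class GroupSimilarity {

    /** Cosine of the angle between two sparse word-frequency vectors. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared words contribute
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0; // an empty profile is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

For non-negative frequencies the result lies between 0 (no shared vocabulary) and 1 (identical word proportions).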

Related Projects

While I was working on this article I found that Nilesh Bansal has developed BlogScope. BlogScope is a very cool product that analyzes the "blogosphere" and finds blogs correlated on the basis of the terms they use. It visualizes the popularity and correlation of query terms as a function of time. Additionally, it displays a list of keywords closely associated with the query terms over the selected time window, hence providing an exploratory navigation system.

I am not familiar with the details of the BlogScope implementation, but I still think it can be adapted to conduct various forms of automated text analysis similar to the ones I am illustrating in this article. It's not clear, however, whether the BlogScope indexing mechanisms will be able to work on a subset of all blogs and feeds for focused correlation.