by Pavel Simakov on 2007-05-07
The Problem
I planned to attend the BarCamp Toronto meeting and was reviewing the list of attendees. The list was long, over 150 people, and I did not know most of them. Each person registered to attend had a link back to a home page or a blog. I tried reading a couple of pages from each person's blog or web site, hoping to get a sense of the person's interests and areas of expertise. By the 5th or 6th site I began thinking "there has got to be a better way" - it all took too long and I was getting nowhere.
What if there were a software tool that could answer these questions:
The Solution
The most obvious way to figure out what a web site or blog is all about is to simply look at its title. But wait a minute: I get e-mails daily with the subject "You have just won a huge pile of money! Click here to claim!" and I have never trusted them. Why should I trust a web site or blog title, after all? I might, if I knew the authors to begin with, but I don't know the authors. And does the title reflect the content? Here is the title of Slashdot: "News for nerds, stuff that matters". You decide...
It would be better to have a software tool to crawl the site and summarize it. I have seen tools for text analysis, but not quite for the job at hand...
Google comes to mind. Google has to perform a similar analysis to make its AdSense advertising work. For advertising to be relevant, Google has to summarize and classify all crawled web pages by their relevance to specific ad keywords. Google works well at the moment and would be a great tool for data mining, but, unfortunately, there is no way today to use its crawled pages for non-Google purposes. Alexa Web Search Platform might be of use here, but I had trouble estimating the CPU cost and, consequently, the price tag of my research.
Naive Bayes classifiers work very well for spam filtering. They can also work well enough for forum moderation and document classification, as the success of the spam-free "Joel on Software" discussion boards suggests. I could use a Bayes classifier to classify the blogs and web pages of the meeting attendees. But first, I would have to train the classifier by showing it which pages belong to which categories. Since I do not know the people attending or any of their work, training is difficult. Self-clustering Bayes classifiers are probably out there, but that's a whole other story.
So, instead, I wrote a small software tool that I call WordMatrix, which turned out to do what I needed. For given groups of web pages or blogs, WordMatrix creates an HTML report containing two lists: the list of words common to all groups and the list of words common to only one group. Each word is color-coded so that the highest-frequency words appear in full-intensity blue, fading to white as word frequency decreases. Each word or phrase in the report is followed by a subscript number showing its score in some difficult-to-explain units, with 100 being the highest and 0 the lowest frequency of occurrence in the text.
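To make the scoring concrete, here is a minimal sketch of how raw word counts could be scaled to the 0-100 range the report uses. The class name and the exact formula are my own assumptions; the article deliberately leaves the units vague.

```java
import java.util.*;

public class WordScore {
    // Scale raw word counts to 0..100, with the most frequent word at 100.
    // This is only a plausible reading of the report's scoring -- the
    // article does not define the exact formula.
    public static Map<String, Integer> score(Map<String, Integer> counts) {
        int max = counts.values().stream().mapToInt(Integer::intValue).max().orElse(1);
        Map<String, Integer> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            scores.put(e.getKey(), (int) Math.round(100.0 * e.getValue() / max));
        }
        return scores;
    }
}
```

With counts like {software: 10, people: 5}, "software" scores 100 and "people" scores 50, matching the shape of the scores shown later in the article.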
Phrase analysis is performed in a process similar to the single-word analysis, for phrases of different lengths; the most useful phrase length was found to be 2. I call the reports the Frequent Words Report and the Frequent Phrases Report, respectively. Overall, the processing requires large amounts of memory, but is fast. It took only seconds to analyze the web pages and blogs for this article.
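The phrase extraction itself can be sketched as a simple sliding window over the token stream; the class and method names here are hypothetical, not WordMatrix's actual code.

```java
import java.util.*;

public class Phrases {
    // Slide a window of length n over the token stream to collect phrases.
    // The article found n = 2 (word pairs) to be the most useful length.
    public static List<String> ngrams(List<String> tokens, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }
}
```

For the tokens ["bug", "tracking", "software"] and n = 2 this yields "bug tracking" and "tracking software", the kind of pairs the Frequent Phrases Report is built from.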
Now that we have the programming issues out of the way, let me show you how I used WordMatrix to summarize some web sites and to discover what their focus areas and hot topics are. I have selected for analysis some well-known online properties. Their authors are among the best writers and most respected authorities (in their respective fields) in the world. Their writing touches on software and social engineering subjects. I also happen to read most of them, which made it easier for me to check whether WordMatrix's reports supported my personal observations. Here are the sites I chose:
Pretend for a moment that you do not know who these people are or what they write about. Let's see if WordMatrix will help us find this out.
Comparing Interests of Several Authors
In this exercise we try to find the unique interests that each author has compared to all the other authors. As a side effect we will find the common vocabulary that all the authors prefer to use. For this test I selected 5-10 articles from each author as a representative set for WordMatrix to process. The 10 most common words per author are:
The 5 most common phrases per author are:
We have just discovered that Martin Fowler writes about programming languages and agile software development, Joel Spolsky worries about coding conventions and the Windows API, Noam Chomsky focuses on international terrorism and human rights, and Dr. Phil is all about family and child-parent relations.
Here are the complete WordMatrix analysis reports for your own review: Frequent Words Report | Frequent Phrases Report
Discovering Change in a Person's Interests over Time
In this exercise we will try to observe how the interests of one author change over time. Even in our fast-paced and quickly changing world, several authors have managed to keep their past articles in good order. I collected articles and grouped them on a yearly basis for WordMatrix to process.
I quickly learned that Noam Chomsky covers:
Some things do not change for Noam Chomsky, however. For all these years his persistent interests, according to WordMatrix, are: state(100) | us(94) | world(93) | war(91) | united states(100) | human right(91) | york times(89) | state department(83) | years ago(83) | security council(80) | united nations(80) | international law(78) | national security(78) | tens thousands(78).
Over the same time period, but in the software engineering part of the world, Joel Spolsky covers:
Some things do not change for Joel Spolsky, either. For all these years his persistent interests, according to WordMatrix, are: software(100) | people(93) | thing(91) | company(88) | fog creek(100) | joel software(96) | bug tracking(94) | software developer(92) | creek copilot(89) | joel spolsky(88) | citydesk fogbugz(88) | york city(83) | writing software(82).
Here are the corresponding WordMatrix analysis reports for your own review:
Discovering Hot Topics in a Website or Blog
The last exercise in the series is about finding hot topics in a web site or blog. Larger blogs and web sites are organized into sections, categories, and so on. It is quite challenging to name these sections or categories with meaningful titles. The titles are short and might not properly reflect the content collected under them, especially as blogs and web sites evolve over time. What kind of content do you think the categories "General", "Recent", "Interviews", "Design", "Leisure" or "Tools" hold? What topics do they cover? What is the web site about?
Using WordMatrix I quickly discovered that:
At the same time there seems to be no dominant theme on any of the "Joel On Software" discussion boards, with the possible exception of the Tech Interviews board, where members talk a lot about char* manipulation in C code. In my experience this is quite typical of a forum, where many individual contributors post short pieces of content with varying styles and purposes.
Here are the corresponding WordMatrix analysis reports for your own review:
Final Word
The results of WordMatrix are quite encouraging. You might suspect this work to be a self-fulfilling prophecy, because I analyzed articles by people I already knew. But it is not. The approach and the tool are applicable to any blog and any web site. These days I always run WordMatrix to learn about any new person I plan to meet or communicate with. It helps me communicate better with other people. It helps me use the right words, so to speak.
Just remember that we have entered a new world where every word you say, print, blog, SMS, draw or click can be recorded. It can be further analyzed, classified, processed, translated, stemmed, and cross-correlated with anything, including your mother's maiden name, the color of your eyes, the time it took you to comprehend the page you have just read, and the IP address you use to connect to your favorite web pages, potentially including this one...
The Implementation Details
After a couple of attempts to use the Classifier4J code base to process OPML and RSS feeds from Bloglines and Java Blogs, I gave up on the pure Bayesian classifier approach. What is needed is a smart summarizer, a smart filter, a correlation finder - not a classifier.
In WordMatrix, similar to the Naive Bayesian approach, blogs and web pages are modeled as sets of words with independent probabilities. But instead of computing a document score, the analysis is conducted on the basis of the scores of individual words and phrases. The words, not the documents, are classified as likely or not likely to be used by a specific author or set of authors. If a given word has a high probability of occurring in the articles of all authors, it is classified as "common to all authors". But if a given word has a high probability of occurring only in the articles of one author, it is classified as "specific to a particular author". This approach works without modification for any pairwise comparison, for measuring the similarity between any pair of authors, or for any grouping of authors.
I wrote the WordMatrix analyzer tool in Java. The analyzer is a command-line tool; it takes a single XML configuration file as input and produces a report. The input file contains the list of web pages to crawl, the list of web page groups, and the assignments of pages to groups. The various processing options include the use of stemming, a list of common words to ignore, etc.
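The common-versus-specific classification can be sketched roughly as follows. A simple vocabulary-presence test stands in for the probability thresholds the real tool would use, and all names are hypothetical.

```java
import java.util.*;

public class WordClassifier {
    // Classify a word by which author groups use it: "common" if every
    // group uses it, "specific" if exactly one does, "mixed" otherwise.
    // A presence test is a stand-in for the probability comparison the
    // article describes but does not spell out.
    public static String classify(String word, Map<String, Set<String>> vocabularyByAuthor) {
        long users = vocabularyByAuthor.values().stream()
                .filter(vocab -> vocab.contains(word))
                .count();
        if (users == vocabularyByAuthor.size()) return "common";
        if (users == 1) return "specific";
        return "mixed";
    }
}
```

Given two authors who both use "software" but only one of whom uses "agile", this classifies "software" as common and "agile" as specific, which is exactly the split the two report lists are built from.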
Each page is fetched over HTTP and converted to plain text using the Tidy HTML parsing library and a custom XML DOM processor. The XML DOM processor allows us to selectively include or exclude parts of the HTML document from the resulting text. Thus we can filter out <script>, <head>, and the HTML header/footer that are identical on all pages of a specific web site.
The resulting text is tokenized and stemmed using the Snowball stemmer with a default set of stop words. The word frequencies are computed by counting word occurrences in the source text. For each page group, the word frequencies in the group are computed by aggregating the word frequencies of the individual pages in the group, with word weights proportional to the page size.
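The size-weighted aggregation can be sketched like this. Note that weighting each page's relative frequencies by its share of the group's total word count reduces algebraically to dividing the raw counts by the group total; the class name is my own invention.

```java
import java.util.*;

public class GroupFrequency {
    // Aggregate per-page word counts into relative group frequencies.
    // Weighting each page's relative frequency (count / pageSize) by the
    // page's share of the group (pageSize / totalWords) simplifies to
    // count / totalWords, which is what we compute here.
    public static Map<String, Double> aggregate(List<Map<String, Integer>> pageCounts) {
        long totalWords = pageCounts.stream()
                .mapToLong(m -> m.values().stream().mapToLong(Integer::longValue).sum())
                .sum();
        Map<String, Double> group = new HashMap<>();
        for (Map<String, Integer> page : pageCounts) {
            for (Map.Entry<String, Integer> e : page.entrySet()) {
                group.merge(e.getKey(), (double) e.getValue() / totalWords, Double::sum);
            }
        }
        return group;
    }
}
```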
When all pages in all groups are processed, the similarity of the groups and the word scores are computed using simple linear algebra. The various correlation reports are produced thereafter.
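The article does not name the similarity measure, so the sketch below uses cosine similarity between group word-frequency vectors as one plausible reading of "simple linear algebra"; the class name is hypothetical.

```java
import java.util.*;

public class GroupSimilarity {
    // Cosine similarity between two sparse word-frequency vectors,
    // represented as word -> frequency maps. Returns 1.0 for identical
    // directions, 0.0 for vectors with no words in common.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Two authors whose frequency vectors point in the same direction would score near 1.0; authors with disjoint vocabularies would score 0.0.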
While I was working on this article I found that Nilesh Bansal has developed BlogScope. BlogScope is a very cool product that analyzes the "blogosphere" and finds blogs correlated on the basis of the terms they use. It visualizes the popularity and correlation of query terms as a function of time. Additionally, it displays a list of keywords closely associated with the query terms over the selected time window, thus providing an exploratory navigation system.
I am not familiar with the details of the BlogScope implementation, but I still think it could be adapted to conduct various forms of automated text analysis similar to the ones I am illustrating in this article. It is not clear, however, whether BlogScope's indexing mechanisms would be able to work on a subset of all blogs and feeds for focused correlation.
Software Secret Weapons (TM) Copyright (C) 2004-2017 by Pavel Simakov