Shakespeare Sonnet Sourced Markov Text Generation

And play the mother’s part, kiss me, be kind;
So will I pray that thou mayst prove me.
Weary with toil, I haste me thence?
Till I return, of posting is no remedy,
It is the time with thoughts of love as oft as thou shalt find
Those children nursed, deliver’d from thy heart,
And take thou my love shall in these black lines be seen,
And they shall live, and he stole that word
From thy behaviour; beauty doth he give,
And found such fair assistance in my will no fair acceptance shine?
The sea, all water, yet receives rain still,
And then thou hast her it is built anew,
Grows fairer than at first, more strong, greater.

Number of key tuples: 14916
Mean Choices: 1.174175
Min Choices: 1
Max Choices: 27
Sum of Squares: 13581.492357
Standard Deviation: 116.539660

I have had a vague “todo” in my head for a while to do something with Markov text generation. One of the things I look forward to when I scan through my spam folders is finding interesting text that was obviously generated. I was lazily browsing through some of my news feeds and following links when I came across a post titled Markov and You by Jeff Atwood. I especially liked the Garkov reference he used to illustrate the technique. I decided to try something similar and found a basic Markov implementation in Python at this Usware blog post.

After playing with the “quick brown fox” implementation and pulling in some CNN articles, I decided to do something a bit more interesting and downloaded Shakespeare’s Sonnets from Project Gutenberg.

I modified my generator implementation to stop after a given number of emitted lines, and added a reseed function that selects a new start tuple whenever no key matches the current search tuple. Finally, I added some simple output statistics to get a feel for how the constructed database looked as I tweaked the chain length parameter. Shorter chain lengths tended to make the text too random, while longer ones pulled in too much of an existing sonnet sequence. Watching the number of key tuples and the summary statistics of the choices per key helped me tune the code for this corpus.
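To make the idea concrete, here is a minimal sketch of that kind of generator. It is not the code linked below; the function names (`build_chain`, `generate`, `choice_stats`) are made up for illustration, and for simplicity it counts emitted words rather than lines.

```python
import math
import random
from collections import defaultdict


def build_chain(words, chain_length=2):
    """Map each tuple of `chain_length` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    for i in range(len(words) - chain_length):
        key = tuple(words[i:i + chain_length])
        chain[key].append(words[i + chain_length])
    return dict(chain)


def generate(chain, num_words, rng=None):
    """Emit `num_words` words, reseeding when the current tuple is not a key."""
    rng = rng or random.Random()
    keys = sorted(chain)  # sorted so a seeded rng gives repeatable output
    key = rng.choice(keys)
    output = list(key)
    while len(output) < num_words:
        choices = chain.get(key)
        if not choices:
            # Dead end (e.g. the last words of the corpus): pick a new start tuple.
            key = rng.choice(keys)
            output.extend(key)
            continue
        word = rng.choice(choices)
        output.append(word)
        key = key[1:] + (word,)  # slide the window forward by one word
    return output[:num_words]


def choice_stats(chain):
    """Summarise how many follow-on choices each key tuple has."""
    counts = [len(followers) for followers in chain.values()]
    n = len(counts)
    mean = sum(counts) / n
    sum_sq = sum((c - mean) ** 2 for c in counts)
    return {
        "keys": n,
        "mean": mean,
        "min": min(counts),
        "max": max(counts),
        "sum_sq": sum_sq,
        "std": math.sqrt(sum_sq / n),  # population standard deviation
    }


if __name__ == "__main__":
    corpus = "the quick brown fox jumps over the lazy dog".split()
    chain = build_chain(corpus, chain_length=2)
    print(" ".join(generate(chain, 20, random.Random(0))))
    print(choice_stats(chain))
```

A mean close to 1 choice per key means the generator mostly replays the source text, which is the “too much of an existing sonnet” effect at longer chain lengths. Note that this sketch divides the sum of squares by the number of keys before taking the square root, so its standard deviation is not directly comparable with the figure printed above, which equals the square root of the listed sum of squares.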

The code is available here.

MS Communicator (SIP) + Ubuntu 9.04

I found an interesting post this morning talking about using Pidgin on Fedora Core to communicate with a MS Communicator server. I was able to follow the instructions from Louis van der Merwe to get this working on 64-bit Ubuntu 9.04.

First I downloaded the Pidgin client plugin sourceball from the SIPE Project and started running configure --prefix=/usr to identify missing libraries. I installed the missing dependencies using sudo apt-get install intltool libpurple comerr-dev; other systems may need more or fewer packages. After configuring, making, and installing the plugin, I fired up Pidgin.

The only modification I had to make to the configuration instructions was to use SSL instead of TCP for communications as we use a certificate at my workplace. Although contacts and groups were pulled in successfully, my presence notification doesn’t seem to work (I show up as offline). However, this is good enough for me right now and I’m really happy to be able to access the corporate communication channel in my primary development environment.

*Updated to 1.6.3*
Presence notification seems to be working now, thank you SIPE team!

Summary statistics by group with R

I was working on profiling some code today and wanted to obtain summary statistics for groups defined by two factors. The original source was a log4j file that included entries from an aspect-based logger I had enabled. I had already written a small Perl script to extract the pertinent information and generate a CSV file with (clazz,method,elapsed) entries, so I was looking for standard statistics such as mean and median for each clazz+method combination.

Code, Statistics, Data Visualization