I was profiling some code today and wanted summary statistics grouped by two factors. The original source was a log4j file that included entries from an aspect-based logger I had enabled. I had already written a small Perl script to extract the pertinent information and generate a CSV file with (clazz,method,elapsed) entries, so I was looking for standard statistics like mean, median, etc. for each clazz+method combination.
My initial approach looked like:
dce <- read.csv('some_metrics.csv',header=TRUE)
aggregate(dce$elapsed, by=list(CLAZZ=dce$clazz,METHOD=dce$method), median) -> medians
aggregate(dce$elapsed, by=list(CLAZZ=dce$clazz,METHOD=dce$method), mean) -> means
aggregate(dce$elapsed, by=list(CLAZZ=dce$clazz,METHOD=dce$method), min) -> mins
aggregate(dce$elapsed, by=list(CLAZZ=dce$clazz,METHOD=dce$method), max) -> maxes
aggregate(dce$elapsed, by=list(CLAZZ=dce$clazz,METHOD=dce$method), length) -> lengths
aggregate(dce$elapsed, by=list(CLAZZ=dce$clazz,METHOD=dce$method), sum) -> sums
s <- mins
s$MIN <- s$x
s$x <- NULL
s$MAX = maxes$x
s$MEAN = means$x
s$MEDIAN = medians$x
s$NUM = lengths$x
s$SUM = sums$x
rm(mins,means,maxes,medians,lengths,sums)
This was obviously less than ideal. Although I could wrap it in a function, it is ugly and cumbersome. I searched the R-help mailing list and found some references to the doBy package, which “grew out of a need to calculate groupwise summary statistics in a simple way”. The summaryBy function in this package turned out to be exactly what I needed and simplified my code to:
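summarize <- function(csvfile) {
  library(doBy)  # provides summaryBy
  metrics.csv <- read.csv(csvfile, header=TRUE)
  # One row per clazz+method pair, with one column per statistic
  metrics <- summaryBy(elapsed ~ clazz + method, data=metrics.csv, FUN=c(mean,median,min,max,sum,length))
  # Export the summary table, then return it to the caller
  write.csv(metrics, file='export.csv', quote=FALSE, row.names=FALSE)
  metrics
}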
metrics <- summarize('some_metrics.csv')
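For comparison, the same kind of grouped summary can probably be had from base R alone, by giving aggregate a function that returns several statistics at once. This is only a rough sketch, not what I actually ran: the sample data frame below is made up to stand in for some_metrics.csv, and the names agg and stats are just for illustration.

```r
# Hypothetical sample data standing in for some_metrics.csv
metrics.csv <- data.frame(
  clazz   = c("Foo", "Foo", "Foo", "Bar", "Bar"),
  method  = c("run", "run", "stop", "run", "run"),
  elapsed = c(10, 30, 5, 2, 4)
)

# One aggregate() call; the anonymous function returns a named vector per group
agg <- aggregate(
  elapsed ~ clazz + method,
  data = metrics.csv,
  FUN  = function(x) c(min = min(x), max = max(x), mean = mean(x),
                       median = median(x), num = length(x), sum = sum(x))
)

# aggregate() stores the statistics as a single matrix column;
# do.call(data.frame, ...) flattens it into ordinary columns
stats <- do.call(data.frame, agg)
print(stats)
```

The flattened columns come out prefixed with the variable name (elapsed.min, elapsed.max and so on). It avoids the extra package, but the summaryBy version is still shorter to read.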