I was profiling some code today and wanted summary statistics grouped by two factors. The original source was a log4j file that included entries from an aspect-based logger I had enabled. I had already written a small Perl script to extract the pertinent information and generate a CSV file with (clazz,method,elapsed) entries, so I was looking for standard statistics such as mean and median for each clazz+method combination.

My initial approach looked like:

    metrics <- read.csv('some_metrics.csv',header=T)
    # one aggregate() call per statistic, all grouped by clazz + method
    aggregate(metrics$elapsed, by=list(CLAZZ=metrics$clazz,METHOD=metrics$method), median) -> medians
    aggregate(metrics$elapsed, by=list(CLAZZ=metrics$clazz,METHOD=metrics$method), mean) -> means
    aggregate(metrics$elapsed, by=list(CLAZZ=metrics$clazz,METHOD=metrics$method), min) -> mins
    aggregate(metrics$elapsed, by=list(CLAZZ=metrics$clazz,METHOD=metrics$method), max) -> maxes
    aggregate(metrics$elapsed, by=list(CLAZZ=metrics$clazz,METHOD=metrics$method), length) -> lengths
    aggregate(metrics$elapsed, by=list(CLAZZ=metrics$clazz,METHOD=metrics$method), sum) -> sums
    # start from the mins table and bolt the other statistics on as columns
    s <- mins
    s$MIN <- s$x
    s$x <- NULL
    s$MAX = maxes$x
    s$MEAN = means$x
    s$MEDIAN = medians$x
    s$NUM = lengths$x
    s$SUM = sums$x
    # drop the intermediate tables
    rm(mins,means,maxes,medians,lengths,sums)

This was obviously less than ideal: although I could wrap it in a function, it is still ugly and cumbersome. I searched the R-help mailing list and found some references to the doBy package, which “grew out of a need to calculate groupwise summary statistics in a simple way”. The summaryBy function in this package turned out to be exactly what I needed and simplified my code to:

    summarize <- function(csvfile) {
      require(doBy)
      metrics.csv <- read.csv(csvfile,header=T)
      # all six statistics for every clazz + method combination in one call
      metrics <- summaryBy(elapsed ~ clazz + method, data=metrics.csv, FUN=c(mean,median,min,max,sum,length))
      write.csv(metrics, file='export.csv', quote=F, row.names=F)
      metrics
    }

    metrics <- summarize('some_metrics.csv')
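For anyone who wants to try the grouping without the CSV file or the doBy package at hand, here is a rough base-R sketch of the same idea. It is not the summaryBy approach itself: it swaps in a single aggregate() call whose function returns a named vector. The sample data frame is made up purely for illustration.

```r
# Hypothetical sample data standing in for some_metrics.csv
metrics.csv <- data.frame(
  clazz   = c("Foo", "Foo", "Foo", "Bar", "Bar"),
  method  = c("run", "run", "stop", "run", "run"),
  elapsed = c(10, 30, 5, 7, 9)
)

# One aggregate() call; the function returns a named vector for each group
stats <- aggregate(elapsed ~ clazz + method, data = metrics.csv,
                   FUN = function(x) c(mean = mean(x), median = median(x),
                                       min = min(x), max = max(x),
                                       sum = sum(x), length = length(x)))

# aggregate() stores those vectors as one matrix column;
# do.call(data.frame, ...) flattens it into elapsed.mean, elapsed.median, ...
stats <- do.call(data.frame, stats)
print(stats)
```

The flattening step is the only awkward part, which is one reason summaryBy remains the more convenient choice for everyday use.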


Hey, great work. I had a question: can you use this to calculate trimmed means (removing the extreme 10%) as well?
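On the trimmed-mean question, a minimal sketch: base R's mean() takes a trim argument, so a small wrapper can be used as the grouping function. Note that trim = 0.1 drops 10% of the observations from each end, which may or may not be what "extreme 10%" means here. The helper name trimmed and the sample data are hypothetical, and the sketch uses base aggregate() rather than summaryBy. The same helper could presumably be added to the FUN vector passed to summaryBy, but I have not verified how the resulting column would be named.

```r
# Hypothetical sample data: one group of ten timings with a single outlier (100)
d <- data.frame(
  clazz   = rep("Foo", 10),
  method  = rep("run", 10),
  elapsed = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
)

# Hypothetical helper: mean after dropping 10% of the values from each end
trimmed <- function(x) mean(x, trim = 0.1)

# Trimmed mean per clazz + method combination
tm <- aggregate(elapsed ~ clazz + method, data = d, FUN = trimmed)
print(tm)
```

With ten values, one observation is removed from each end, so the outlier no longer pulls the group mean upward.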