Summary statistics by group with R

I was profiling some code today and wanted summary statistics by group, with the groups defined by two factors. The original source was a log4j file that included entries from an aspect-based logger I had enabled. I had already written a small Perl script to extract the pertinent information and generate a CSV file of (clazz,method,elapsed) entries, so I was looking for standard statistics such as mean and median for each clazz+method combination.

My initial approach looked like:

# Load the CSV of (clazz,method,elapsed) entries
dce <- read.csv('some_metrics.csv', header=TRUE)
# One aggregate() call per statistic, each grouped by clazz and method
aggregate(dce$elapsed, by=list(CLAZZ=dce$clazz,METHOD=dce$method), median) -> medians
aggregate(dce$elapsed, by=list(CLAZZ=dce$clazz,METHOD=dce$method), mean) -> means
aggregate(dce$elapsed, by=list(CLAZZ=dce$clazz,METHOD=dce$method), min) -> mins
aggregate(dce$elapsed, by=list(CLAZZ=dce$clazz,METHOD=dce$method), max) -> maxes
aggregate(dce$elapsed, by=list(CLAZZ=dce$clazz,METHOD=dce$method), length) -> lengths
aggregate(dce$elapsed, by=list(CLAZZ=dce$clazz,METHOD=dce$method), sum) -> sums
# Start from the mins table and rename its generic x column
s <- mins
s$MIN <- s$x
s$x <- NULL
# Attach the remaining statistics as extra columns
s$MAX <- maxes$x
s$MEAN <- means$x
s$MEDIAN <- medians$x
s$NUM <- lengths$x
s$SUM <- sums$x
# Clean up the intermediate tables
rm(mins,means,maxes,medians,lengths,sums)

This was obviously less than ideal; although I could wrap it in a function, it is ugly and cumbersome. I searched the R-help mailing list and found some references to the doBy package, which “grew out of a need to calculate groupwise summary statistics in a simple way”. The summaryBy function in this package turned out to be exactly what I needed and simplified my code to:

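summarize <- function(csvfile) {
	# Fail loudly if doBy is not installed
	library(doBy)
	metrics.csv <- read.csv(csvfile, header=TRUE)
	# One row per clazz+method combination, one column per statistic
	metrics <- summaryBy(elapsed ~ clazz + method, data=metrics.csv, FUN=c(mean,median,min,max,sum,length))
	# Save a copy of the summary table alongside returning it
	write.csv(metrics, file='export.csv', quote=FALSE, row.names=FALSE)
	metrics
}
metrics <- summarize('some_metrics.csv')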
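
For comparison, roughly the same table can probably be built in base R with a single aggregate() call, by passing a function that returns all of the statistics at once. This is only a minimal sketch: the small data frame stands in for some_metrics.csv, and the stats helper and the sample values are made up for illustration, not taken from the original script.

```r
# Hypothetical sample data standing in for some_metrics.csv
metrics.csv <- data.frame(
  clazz   = c("Foo", "Foo", "Foo", "Bar", "Bar"),
  method  = c("run", "run", "stop", "run", "run"),
  elapsed = c(10, 20, 5, 7, 9)
)

# Hypothetical helper: every statistic for one group, as a named vector
stats <- function(x) {
  c(MIN = min(x), MAX = max(x), MEAN = mean(x),
    MEDIAN = median(x), NUM = length(x), SUM = sum(x))
}

# aggregate() stores the vector result as a single matrix column
agg <- aggregate(elapsed ~ clazz + method, data = metrics.csv, FUN = stats)

# Flatten the matrix column into ordinary columns (elapsed.MIN, elapsed.MAX, ...)
s <- do.call(data.frame, agg)
print(s)
```

The do.call(data.frame, ...) step is needed because aggregate() keeps the six statistics together in one matrix column. summaryBy avoids that extra step, which is part of why it reads more cleanly.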

2 thoughts on “Summary statistics by group with R”

  1. Hey, great work. I had a question: can you use this to calculate trimmed means (removing the extreme 10%) as well?
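
On the trimmed-mean question, base R's mean() takes a trim argument, so one possible approach is to wrap it in a small function and use that as the grouping statistic. The sketch below uses aggregate() and made-up data; the trimmed helper is hypothetical. Note that trim = 0.1 drops 10% of the observations from each end, which may or may not be what "extreme 10%" was meant to describe. Presumably the same function could also be passed in summaryBy's FUN argument, but that is an assumption rather than something shown in the post.

```r
# Hypothetical sample data; column names match the CSV described in the post
d <- data.frame(
  clazz   = rep(c("Foo", "Bar"), each = 10),
  method  = "run",
  elapsed = c(1:9, 100, rep(5, 10))
)

# Hypothetical helper: mean after dropping 10% of values from each tail
trimmed <- function(x) mean(x, trim = 0.1)

# Trimmed mean of elapsed for each clazz+method combination
tm <- aggregate(elapsed ~ clazz + method, data = d, FUN = trimmed)
print(tm)
```

For the Foo group the outlier of 100 and the smallest value are both discarded before averaging, so the result is far less sensitive to a single slow call than the plain mean.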
