# Summary statistics by group with R

I was profiling some code today and wanted summary statistics for groups defined by two factors. The original source was a log4j file that included entries from an aspect-based logger I had enabled. I had already written a small Perl script to extract the pertinent information and generate a CSV file with (clazz,method,elapsed) entries, so what I needed were standard statistics such as mean, median, min and max for each clazz+method combination.

My initial approach looked like:

```r
metrics <- read.csv('some_metrics.csv', header = TRUE)
# One aggregate() call per statistic, each grouped by clazz + method
aggregate(metrics$elapsed, by=list(CLAZZ=metrics$clazz,METHOD=metrics$method), median) -> medians
aggregate(metrics$elapsed, by=list(CLAZZ=metrics$clazz,METHOD=metrics$method), mean) -> means
aggregate(metrics$elapsed, by=list(CLAZZ=metrics$clazz,METHOD=metrics$method), min) -> mins
aggregate(metrics$elapsed, by=list(CLAZZ=metrics$clazz,METHOD=metrics$method), max) -> maxes
aggregate(metrics$elapsed, by=list(CLAZZ=metrics$clazz,METHOD=metrics$method), length) -> lengths
aggregate(metrics$elapsed, by=list(CLAZZ=metrics$clazz,METHOD=metrics$method), sum) -> sums
# Start from the mins table and attach the other statistics as columns
s <- mins
s$MIN <- s$x
s$x <- NULL
s$MAX <- maxes$x
s$MEAN <- means$x
s$MEDIAN <- medians$x
s$NUM <- lengths$x
s$SUM <- sums$x
# Drop the intermediate tables
rm(mins, means, maxes, medians, lengths, sums)
```

This was obviously less than ideal. I could wrap it in a function, but it would still be ugly and cumbersome. I searched the R-help mailing list and found some references to the doBy package, which “grew out of a need to calculate groupwise summary statistics in a simple way”. The summaryBy function in this package turned out to be exactly what I needed and simplified my code to:

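```r
library(doBy)

summarize <- function(csvfile) {
  # Read the (clazz,method,elapsed) entries from the CSV file
  raw <- read.csv(csvfile, header = TRUE)
  # One row per clazz + method combination, one column per statistic
  metrics <- summaryBy(elapsed ~ clazz + method, data = raw,
                       FUN = c(mean, median, min, max, sum, length))
  write.csv(metrics, file = 'export.csv', quote = FALSE, row.names = FALSE)
  metrics
}
metrics <- summarize('some_metrics.csv')
```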
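For comparison, a similar table can be built in base R alone by giving a single `aggregate()` call a function that returns a named vector. This is only a minimal sketch: the inline data frame is made-up sample data standing in for the CSV file, and `stats` is a hypothetical helper name, not something from doBy or from my original script.

```r
# Made-up sample data standing in for some_metrics.csv (assumption)
metrics <- data.frame(
  clazz   = c("A", "A", "A", "B", "B"),
  method  = c("run", "run", "stop", "run", "run"),
  elapsed = c(10, 20, 5, 7, 9)
)

# Hypothetical helper: all statistics for one group as a named vector
stats <- function(x) {
  c(MIN = min(x), MAX = max(x), MEAN = mean(x),
    MEDIAN = median(x), NUM = length(x), SUM = sum(x))
}

# One call computes every statistic per clazz + method combination
s <- aggregate(elapsed ~ clazz + method, data = metrics, FUN = stats)

# aggregate() stores the statistics as a single matrix column;
# flatten it into ordinary columns (elapsed.MIN, elapsed.MAX, ...)
s <- do.call(data.frame, s)
print(s)
```

This avoids the extra package, at the cost of the flattening step that summaryBy handles for you.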

## 2 thoughts on “Summary statistics by group with R”

1. Apeddeftondow says:

Hi, Congratulations to the site owner for this marvelous work you’ve done. It has lots of useful and interesting data.

2. David says:

Hey, great work. I had a question: can you use this to calculate trimmed means (removing the extreme 10%) as well?