Good stuff for programming geeks

Start/2005-08-10/1

Created by tmoertel.

Simple data formats are not going away

I saw that ziggy wrote a small program to compute summary statistics and emit the output in name=value format. This is the same output format that my stats program uses, and it's great. First, it's easy to eyeball:

$ stats rnorm100.dat
count   = 100
min     = -2.567090756
10% cut = -0.9992493139
25% cut = -0.5645971535
median  = 0.0074455585
mean    = 0.0347402067700001
75% cut = 0.542629871
90% cut = 1.3079738435
max     = 2.051265585
var     = 0.788500663119147
stdev   = 0.88797559826785
popvar  = 0.780615656487955
popstdv = 0.883524564733746

Second, it's easy to manipulate. If I just want the median and mean, for example, I grep for them:

$ stats rnorm100.dat | grep me
median  = 0.0074455585
mean    = 0.0347402067700001
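The grep above works because only the median and mean lines contain "me"; a substring match like this can pick up unrelated names in other outputs. When a single value is needed by its exact name, a few lines of awk will do. The following is a minimal sketch, and the helper name stat_value is hypothetical (it is not part of stats):

```shell
# stat_value NAME: hypothetical helper, not part of the stats program.
# Reads name=value lines on stdin and prints the value whose name
# matches NAME exactly (after trimming surrounding blanks).
stat_value() {
    awk -v want="$1" '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        {
            eq = index($0, "=")
            if (eq == 0) next                  # ignore lines without "="
            if (trim(substr($0, 1, eq - 1)) == want)
                print trim(substr($0, eq + 1))
        }'
}
```

For example, stats rnorm100.dat | stat_value mean would print only the number on the mean line, and stat_value '10% cut' works the same way for names that contain spaces.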

For mass analysis, however, this format is too verbose: I do not want to look at one hundred of these summaries to try to figure out the big picture. What I want is to see all of the stats at once. I want a summary table: each data set in its own row and each summary statistic in its own column.

While it is easy to write an ad hoc program to compile the individual summaries into a mass summary, I wrote a small program, tabulate, that is more flexible and reusable. It reads a stream of name=value pairs, deduces the record structure of the stream, and emits a corresponding summary table. I can concatenate a bunch of summaries, and tabulate will figure out how to split them up.
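To make the idea concrete, here is a minimal sketch of how a tabulate-like filter could be written with awk. It is not the real tabulate program, and the function name tabulate_sketch is hypothetical. It rests on one assumption about how the record structure might be deduced: a name that has already appeared in the current record marks the start of the next record. Columns appear in the order their names are first seen, and values are printed exactly as read, with no numeric reformatting.

```shell
# tabulate_sketch: hypothetical stand-in, NOT the real tabulate program.
# Assumption: a name that repeats within the current record starts a new one.
tabulate_sketch() {
    awk '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

    # Print one cell padded to its column width plus two spaces;
    # the last column is left unpadded and ends the row.
    function put(text, c,    fmt) {
        if (c == ncol) { print text; return }
        fmt = "%-" (width[c] + 2) "s"
        printf fmt, text
    }

    {
        eq = index($0, "=")
        if (eq == 0) next                      # ignore lines without "="
        name  = trim(substr($0, 1, eq - 1))
        value = trim(substr($0, eq + 1))

        # Start a new record on the first pair or on a repeated name.
        if (nrec == 0 || (nrec, name) in cell) nrec++

        # Columns keep the order in which names first appear.
        if (!(name in colno)) {
            colno[name] = ++ncol
            header[ncol] = name
            width[ncol] = length(name)
        }
        c = colno[name]
        cell[nrec, name] = value
        if (length(value) > width[c]) width[c] = length(value)
    }

    END {
        for (c = 1; c <= ncol; c++) put(header[c], c)
        for (r = 1; r <= nrec; r++)
            for (c = 1; c <= ncol; c++) put(cell[r, header[c]], c)
    }'
}
```

Piping concatenated summaries into tabulate_sketch yields one row per record. A record that lacks some name simply gets an empty cell in that column, which is one reasonable choice among several.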

For example, if I give tabulate a single summary, it gives me back a single-row table:

$ stats rnorm100.dat | grep me | tabulate
median        mean
0.0074455585  0.03474020677

If I give it two summaries (there are two data sets in my working directory), it gives me back two rows:

$ for set in *.dat; do stats "$set" | grep me; done | tabulate
median        mean
0.5122150670  0.88521159344
0.0074455585  0.03474020677

Now, however, I can't tell which set of statistics is which. But this problem is easy to solve: I just prepend each data set's name to its stream of summary statistics. The simple data format makes this easy, and I can even use echo to do the job in my shell's for loop:

for dataset in *.dat; do
    echo "dataset = $dataset"  # insert name into stream
    stats "$dataset" | grep me
done |
tabulate

Now I get the results I need:

dataset       median        mean
exp100.dat    0.5122150670  0.88521159344
rnorm100.dat  0.0074455585  0.03474020677

I would not want to try that with XML, which is why I think that simple ASCII formats will be with us forever.

community.moertel.com | Copyright © 2003–07 Moertel Consulting