Simple data formats are not going away 
I saw that ziggy wrote a small program to compute summary statistics
and emit the output in name=value format. This is the same output
format that my stats program uses, and it's great. First, it's easy
to eyeball:
$ stats rnorm100.dat
count = 100
min = -2.567090756
10% cut = -0.9992493139
25% cut = -0.5645971535
median = 0.0074455585
mean = 0.0347402067700001
75% cut = 0.542629871
90% cut = 1.3079738435
max = 2.051265585
var = 0.788500663119147
stdev = 0.88797559826785
popvar = 0.780615656487955
popstdv = 0.883524564733746
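To make the format concrete, here is a minimal sketch of an emitter that writes a few statistics as name = value lines. The name mini_stats is hypothetical, and this is not the stats program shown above: it assumes one number per input line and computes only count, min, max, and mean.

```shell
# mini_stats: hypothetical, much-reduced stand-in for a stats-like tool.
# Assumes one number per line, read from stdin or from the named files.
mini_stats() {
    awk '
    NF == 0 { next }                      # ignore blank lines
    {
        x = $1 + 0                        # force numeric interpretation
        if (n == 0 || x < min) min = x
        if (n == 0 || x > max) max = x
        sum += x
        n++
    }
    END {
        if (n == 0) exit                  # no data, no output
        print "count = " n
        print "min = " min
        print "max = " max
        print "mean = " (sum / n)
    }' "$@"
}
```

Each statistic lands on its own line, so the output can be piped through grep or any other line-oriented tool.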
Second, it's easy to manipulate. If I just want the median and mean,
for example, I grep for them:
$ stats rnorm100.dat | grep me
median = 0.0074455585
mean = 0.0347402067700001
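Since grep matches substrings, the pattern me happens to catch both median and mean. When only the value of one exactly named statistic is wanted, a small awk filter can do the job. The helper below is a sketch: get_stat is a hypothetical name, and it assumes each line has the form name = value with no leading whitespace.

```shell
# get_stat NAME: print the value of the pair whose name is exactly NAME.
# Hypothetical helper; assumes "name = value" lines on stdin.
get_stat() {
    awk -F ' *= *' -v want="$1" '$1 == want { print $2 }'
}
```

For example, stats rnorm100.dat | get_stat median would print just the median, and get_stat '10% cut' would select a name that contains a space.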
For mass analysis, however, this format is too verbose: I do not
want to look at one hundred of these summaries to try to figure out
the big picture. What I want is to see all of the stats at once. I
want a summary table: each data set in its own row and each summary
statistic in its own column.
While it is easy to write an ad-hoc program to compile the individual
summaries into a mass summary, I wrote a small program, tabulate, that
is more flexible and reusable. It reads a stream of name=value pairs,
deduces the record structure of the stream, and emits a corresponding
summary table. I can concatenate a bunch of summaries, and tabulate
will figure out how to split them up.
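The idea can be sketched in a few lines of awk. This is not the actual tabulate program. The name tabulate_sketch is hypothetical, and the record-splitting rule is an assumption: a new record begins whenever a name repeats within the current record. Columns appear in order of first appearance, and the output is tab-separated rather than aligned.

```shell
# tabulate_sketch: hypothetical stand-in for tabulate, not the real program.
# Reads "name = value" lines on stdin and prints a tab-separated table.
tabulate_sketch() {
    awk '
    {
        eq = index($0, "=")
        if (eq == 0) next                  # skip lines that are not pairs
        name  = substr($0, 1, eq - 1)
        value = substr($0, eq + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", name)  # trim surrounding whitespace
        gsub(/^[ \t]+|[ \t]+$/, "", value)

        # Assumed rule: a name seen again in this record starts a new record.
        if (nrec == 0 || (name in seen)) {
            nrec++
            split("", seen)                # clear the per-record name set
        }
        seen[name] = 1

        # Assign each new name the next column, in order of first appearance.
        if (!(name in col)) { col[name] = ++ncol; header[ncol] = name }
        cell[nrec, col[name]] = value
    }
    END {
        if (nrec == 0) exit
        line = header[1]
        for (c = 2; c <= ncol; c++) line = line "\t" header[c]
        print line
        for (r = 1; r <= nrec; r++) {
            line = cell[r, 1]
            for (c = 2; c <= ncol; c++) line = line "\t" cell[r, c]
            print line
        }
    }'
}
```

Under that rule, two concatenated summaries that each contain median and mean come out as a header line followed by two rows. A record that lacks some name gets an empty cell in that column.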
For example, if I give tabulate a single summary, it gives me back a
single-row table:
$ stats rnorm100.dat | grep me | tabulate
median        mean
0.0074455585  0.03474020677
If I give it two summaries (there are two data sets in my working directory), it gives me back two rows:
$ for set in *.dat; do stats "$set" | grep me; done | tabulate
median        mean
0.5122150670  0.88521159344
0.0074455585  0.03474020677
Now, however, I can't tell which set of statistics is which. But
this problem is easy to solve: I just prepend each data set's name to
its stream of summary statistics. The simple data format makes this
easy, and I can even use echo to do the job in my shell's for loop:
for dataset in *.dat; do
    echo "dataset = $dataset"   # insert name into stream
    stats "$dataset" | grep me
done |
tabulate
Now I get the results I need:
dataset       median        mean
exp100.dat    0.5122150670  0.88521159344
rnorm100.dat  0.0074455585  0.03474020677
I would not want to try that with XML, which is why I think that simple
ASCII formats will be with us forever.