Text::Bib -- parse Unix .bib files
This module provides routines for parsing in the contents
of bibliographic databases (usually found lurking on Unix-like
operating systems): these are simple text files which contain
one or more bibliography records. Each record describes a single
paper, book, or article. Users of nroff/troff often employ
such databases when typesetting papers.
Even if you don't use *roff, this simple, easily-parsed parameter-value
format is still useful for recording/exchanging bibliographic
information. With the Text::Bib module, you can easily post-process
.bib files: search them, convert them into LaTeX, whatever.
IMPORTANT NOTE FOR OLD Bib:: USERS :
After conversations with the High-Muckity-Mucks of the CPAN, this
module has been renamed from Bib:: to the more appropriate Text::Bib.
(From the GNU manpage, grefer(1) :)
The bibliographic database is a text file consisting of
records separated by one or more blank lines. Within each
record fields start with a % at the beginning of a line.
Each field has a one character name that immediately follows
the %. It is best to use only upper and lower case
letters for the names of fields. The name of the field
should be followed by exactly one space, and then by the
contents of the field. Empty fields are ignored. The
conventional meaning of each field is as follows:
-
A
-
The name of an author. If the name contains a
title such as Jr. at the end, it should be separated
from the last name by a comma. There can be multiple
occurrences of the A field. The order is significant.
It is a good idea always to supply an A field or a Q field.
-
B
-
For an article that is part of a book, the title of the book
-
C
-
The place (city) of publication.
-
D
-
The date of publication. The year should be specified in full.
If the month is specified, the name rather than the number of
the month should be used, but only the first three letters are required.
It is a good idea always to supply a D field; if the date is unknown,
a value such as "in press" or "unknown" can be used.
-
E
-
For an article that is part of a book, the name of an editor of the book.
Where the work has editors and no authors, the names of the editors should
be given as A fields and , (ed) or , (eds) should be
appended to the last author.
-
G
-
US Government ordering number.
-
I
-
The publisher (issuer).
-
J
-
For an article in a journal, the name of the journal.
-
K
-
Keywords to be used for searching.
-
L
-
Label.
NOTE: Uniquely identifies the entry. For example, "Able94".
-
N
-
Journal issue number.
-
O
-
Other information. This is usually printed at the end of the reference.
-
P
-
Page number. A range of pages can be specified as m-n.
-
Q
-
The name of the author, if the author is not a person.
This will only be used if there are no A fields. There can only be one
Q field.
NOTE: Thanks to Mike Zimmerman for clarifying this for me:
it means a "corporate" author: when the "author" is listed
as an organization such as the UN, or RAND Corporation, or whatever.
I've changed the access/storage/etc. methods to "corpAuthor" to access it,
but "android" will still work for now.
-
R
-
Technical report number.
-
S
-
Series name.
-
T
-
Title. For an article in a book or journal, this should be the title
of the article.
-
V
-
Volume number of the journal or book.
-
X
-
Annotation.
NOTE: Basically, a brief abstract or description.
For all fields except A and E, if there is more than one occurrence of a particular field in a record, only the last such field will be used.
If accent strings are used, they should follow the character
to be accented. This means that the AM macro must be
used with the -ms macros. Accent strings should not be
quoted: use one \ rather than two.
Here's a possible .bib file with three entries:
%T Cyberiad
%A Stanislaw Lem
%K robot fable
%I Harcourt/Brace/Jovanovich

%T Invisible Cities
%A Italo Calvino
%K city fable philosophy
%X In this surreal series of fables, Marco Polo tells an
aged Kublai Khan of the many cities he has visited in
his lifetime.

%T Angels and Visitations
%A Neil Gaiman
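As a rough illustration of the record layout described above, here is a minimal sketch that splits .bib text into records on blank lines and collects each field's values in order. It is not the module's own parser; the name parse_records is hypothetical, and keeping empty fields as empty strings is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: split .bib text into records, each a hash
# mapping a one-character field name to an array of its values.
sub parse_records {
    my ($text) = @_;
    my @records;
    # Records are separated by one or more blank lines.
    for my $chunk (split /\n(?:[ \t]*\n)+/, $text) {
        my (%rec, $last);
        for my $line (split /\n/, $chunk) {
            if ($line =~ /^%([\041-\176])(?: (.*))?$/) {
                # New field: one-character name, one space, then contents.
                push @{ $rec{$1} }, defined $2 ? $2 : '';
                $last = \$rec{$1}[-1];
            }
            elsif (defined $last) {
                # Continuation line: append to the current field value.
                $$last .= "\n$line";
            }
        }
        push @records, \%rec if %rec;
    }
    return @records;
}

my @recs = parse_records(<<'END');
%T Cyberiad
%A Stanislaw Lem
%K robot fable

%T Invisible Cities
%A Italo Calvino
%X In this surreal series of fables, Marco Polo tells an
   aged Kublai Khan of the many cities he has visited.
END

print scalar(@recs), " records; first title: $recs[0]{T}[0]\n";
```

Every value lives in an array, so repeated fields such as %A keep their order.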
See refer(1) or grefer(1) for a description of .bib files.
To parse a .bib file, just do this:
require Text::Bib;
my $bib;
while ($bib = Text::Bib->read($anyOldFileHandle)) {
# ...do stuff with $bib...
}
defined($bib) || die("error parsing input");
You will nearly always use the read() constructor to create
new instances, and nearly always as shown above. Notice that
read() returns the following:
- The new object, on success.
- The value '0', on expected end-of-file.
- The undefined value, on error.
Since read() returns "true" if and only if a new object could be read,
and it returns two distinct "false" values otherwise, it's very easy to
iterate through a .bib stream and to know why the iteration stopped.
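The three-valued return convention can be sketched without the module itself. In the following, read_record is a hypothetical stand-in for a read()-style constructor, and the rule that text before any field is a syntax error is an assumption made only for this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a read()-style constructor:
# returns a record (hash ref) on success, 0 at end-of-file,
# and undef on a syntax error.
sub read_record {
    my ($fh) = @_;
    my (%rec, $field);
    while (defined(my $line = <$fh>)) {
        chomp $line;
        if ($line =~ /^\s*$/) {
            last if %rec;      # blank line ends the current record
            next;              # skip blank lines before a record
        }
        if ($line =~ /^%(\S)(?: (.*))?$/) {
            $field = $1;
            push @{ $rec{$field} }, defined $2 ? $2 : '';
        }
        elsif (defined $field) {
            $rec{$field}[-1] .= "\n$line";   # continuation line
        }
        else {
            return undef;      # text before any field: syntax error
        }
    }
    return %rec ? \%rec : 0;   # 0 signals expected end-of-file
}

my $text = "%T Cyberiad\n%A Stanislaw Lem\n\n%T Angels and Visitations\n";
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";

my $count = 0;
my $rec;
while ($rec = read_record($fh)) {
    $count++;
}
defined($rec) or die "error parsing input";
print "read $count records\n";
```

The loop stops on either false value; the defined() test afterwards tells a clean end-of-file (0) from an error (undef).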
By default, the parser accepts any one-character field name that is
a printable character (no whitespace). Formally, this is:
[\041-\176]
Use of characters outside this range is a syntax error. You may define
a narrower range using the GoodFields parser-option: however, this will
slow down your parser, so you may want to consider whether or not you
really need it.
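To make the character range concrete, this sketch tests a few candidate field names against the default class and against a narrower one. The helper is_good_field is hypothetical, and the narrow class '[AEFZ]' is only an illustrative value.

```perl
use strict;
use warnings;

# Default class from the text: any printable, non-whitespace ASCII character.
my $default_fields = '[\041-\176]';
# A narrower class, of the kind one might supply via GoodFields (illustrative).
my $narrow_fields  = '[AEFZ]';

# Hypothetical helper: is $name a legal one-character field name?
sub is_good_field {
    my ($name, $class) = @_;
    return $name =~ /\A$class\z/ ? 1 : 0;
}

for my $name ('A', 'u', '%', ' ') {
    printf "'%s': default=%d narrow=%d\n",
        $name,
        is_good_field($name, $default_fields),
        is_good_field($name, $narrow_fields);
}
```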
For every one of the standard fields in a .bib record, the Text::Bib
module has designated a high-level attribute name:
- A - author
- B - book
- C - city
- D - date
- E - editor
- G - govtNo
- I - publisher
- J - journal
- K - keywordList
- L - label
- N - number
- O - otherInfo
- P - page
- Q - corpAuthor (the older name, android, still works for now)
- R - reportNo
- S - series
- T - title
- V - volume
- X - abstract
Then, for each high-level attribute name attr , Text::Bib:: defines three
methods:
-
attr()
-
All access methods of this form (e.g., date() , title() ),
return a single scalar value for that particular attribute,
or undef if there is no such value. For example:
$date = $bib->date();
If the Bib object has more than one value defined for attr ,
the last value that was read in is used.
-
attrs()
-
All access methods of this form (e.g., dates() , titles() ),
return the array of all values of that attribute, as follows:
- If invoked in an array context, an array of values is returned, or the empty array if there are no values for that particular attribute.
- If invoked in a scalar context, a reference to an array of values is returned, or undef if there are no values for that particular attribute.
For example:
# Get and print the first author in the list:
(@authors = $bib->authors()) || die("no authors");
print "first author = $authors[0]\n";
# Virtually the same thing, but more efficient if many authors:
($authorsRef = $bib->authors()) || die("no authors");
print "first author = $authorsRef->[0]\n";
-
setAttrs()
-
All methods of this form (e.g., setAuthors() ,
setEditors() ) set the array of all values of that attribute.
Supply the list of values as the arguments; for example:
$bib->setAuthors('C. Clausticus', 'H. Hydronimous', 'F. Fwitch');
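The three method forms can be pictured with a small stand-alone class. This is a sketch under assumptions, not the module's code: SketchBib is a hypothetical class, and only the author attribute is written out by hand.

```perl
use strict;
use warnings;

package SketchBib;   # hypothetical class, not Text::Bib itself

sub new { my ($class, %fields) = @_; return bless { %fields }, $class }

# Scalar accessor: last value read, or undef if there is none.
sub author {
    my $self = shift;
    return defined $self->{A} ? $self->{A}[-1] : undef;
}

# List accessor: a list in list context, an array ref in scalar context.
sub authors {
    my $self = shift;
    my $vals = $self->{A};
    return wantarray ? ($vals ? @$vals : ()) : $vals;
}

# Storage method: replace all values with the supplied list.
sub setAuthors {
    my $self = shift;
    $self->{A} = [@_];
    return $self;
}

package main;

my $bib = SketchBib->new;
$bib->setAuthors('C. Clausticus', 'H. Hydronimous', 'F. Fwitch');
my @all = $bib->authors;      # list context: the values themselves
my $ref = $bib->authors;      # scalar context: an array reference
print scalar(@all), " authors; last is ", $bib->author, "\n";
```

Using wantarray lets one method serve both calling styles shown in the examples above.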
If you are writing a subclass, you can use the makeMethods() class method to add new fields, or override the interpretation of
existing ones:
package MyBibSubclass;
@ISA = qw(Text::Bib);
# In our files, %Y holds the year, which is *really* the date:
MyBibSubclass->makeMethods('Y', 'date');
# Also in our files, %u fields hold the URLs of any on-line copies:
MyBibSubclass->makeMethods('u', 'url');
...
while ($bib = MyBibSubclass->read($FH)) {
$date = $bib->date(); # return date, from %Y
@urls = $bib->urls(); # return array of URLs, from %u
$anyUrl = $bib->url(); # return the last URL encountered
...
}
The normal way to output Bib objects in .bib format is to
use the method:
$bib->output($filehandle);
The filehandle may be omitted; in such a case, the currently-selected filehandle
is used. The fields are output with %L first (if it exists), and then the
remaining fields in alphabetical order. The following "safety measures"
are taken:
- Lines longer than 77 characters are wrapped at the first whitespace character before that length.
- Any occurrences of '%' immediately after a newline are preceded by a single space.
These safety measures are slightly time-consuming, and are silly if you
are merely outputting a Bib object which you have read in verbatim
(i.e., using the default parser-options) from a valid .bib file.
Thus, we define a faster method, without the seatbelts:
$bib->dump($filehandle);
Warning: this method does no fixup on the values at all: they are
output as-is. That means if you used parser-options which destroyed any
of the formatting whitespace (e.g., Newline=TOSPACE with LeadWhite=KILLALL),
there is a risk that the output object will be an invalid Bib record.
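For a feel of what the two safety measures involve, here is a minimal sketch of a formatter that applies both. The helper safe_field is hypothetical, and breaking at the last whitespace that fits within the limit is this sketch's reading of the wrapping rule, not a statement of how output() is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: format one field for output, applying both
# safety measures. $max is the longest line allowed (77 in the text).
sub safe_field {
    my ($name, $value, $max) = @_;
    $max = 77 unless defined $max;
    my @out;
    for my $line (split /\n/, "%$name $value") {
        # Break over-long lines at the last whitespace that fits.
        while (length($line) > $max
               && $line =~ /\A(.{1,$max})\s+(\S.*)\z/) {
            push @out, $1;
            $line = $2;
        }
        push @out, $line;
    }
    # The first line keeps its leading '%'; any later line that starts
    # with '%' gets one space in front so it cannot look like a field.
    for my $i (1 .. $#out) {
        $out[$i] =~ s/\A%/ %/;
    }
    return join("\n", @out) . "\n";
}

print safe_field('X', "a rate of 100\n% is impossible");
```

A dump()-style method would skip both loops and print the stored value unchanged, which is why it is faster and less forgiving.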
Note: users of 1.8 and previous releases will notice that the
print() method is now undefined by default: it is deprecated in favor of
the perfectly-equivalent output() method. If you absolutely cannot
change your method calls just yet, simply change your "require" line:
require Text::Bib;
Text::Bib->DEFINE_PRINT_METHOD;
That will define the deprecated Text::Bib:: print() as being equivalent to
Text::Bib:: output() .
Each .bib object has instance variables corresponding to the actual
field names: for example, the .bib record:
%T The Non-Linear Existence of Menger-Sierpinski Dragons
%A S. Trurl
%A L. Klapaucius
%A C. Cybr
%E Abbarat Hyperion
%C
%K dragon nonlinear Menger Sierpinski irrational hat-rack
%X Of the many varieties of non-existent dragons, perhaps the
most fascinating one to not exist is the Menger-Sierpinski Dragon,
a.k.a. the Fractal Dragon. This paper discusses how these "fragons"
are, in fact, irrationally-dimensional (e.g., pi-dimensional) curves,
and concludes with the proof that a nonexistent dragon which nonexists
in such an impossible manner must logically exist in conventional
space -- surprisingly, as a hat-rack.
%D 1996
Would, when parsed, result in a Bib object with the following instance
variables:
$self->{T} = ["The Non-Linear ... Dragons"];
$self->{A} = ["S. Trurl",
"L. Klapaucius",
"C. Cybr"];
$self->{C} = [""];
$self->{E} = ["Abbarat Hyperion"];
$self->{K} = ["dragon nonlinear Menger Sierpinski irrational hat-rack"];
$self->{D} = ["1996"];
Notice that, for maximum flexibility and consistency (but at the cost of
some space and access-efficiency), the semantics of .bib records do
not come into play at this time: since everything resides in an array,
you can have as many %K , %D , etc. records as you like, and give them
entirely different semantics. For example, the Library Of Boring Stuff
That Everyone Reads (LOBSTER) uses the unused %Y as a "year" field.
The parser accommodates this case by politely not choking on LOBSTER
bibliographies.
The .bib semantics come into play in the storage/access methods... which,
of course, you can override in subclasses. So, while the default date-access
looks something like this:
sub date {
my $self = shift;
defined($self->{D}) ? $self->{D}[-1] : undef;
}
The LOBSTER would create a subclass LobsterBib::, and override the date()
method to be:
sub date {
my $self = shift;
defined($self->{Y}) ? $self->{Y}[-1] : undef;
}
Furthermore, since this is identical in format to a "standard" scalar-access
method, the LOBSTER could just place in LobsterBib.pm the line:
LobsterBib->makeMethods('Y', 'date');
And voila, the appropriate methods will be defined.
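One way such method generation could work is sketched below. The names SketchBib and make_methods are hypothetical, and the sketch installs closures into the symbol table rather than compiling source text with eval, so it shows the idea and not the module's actual mechanism.

```perl
use strict;
use warnings;

package SketchBib;   # hypothetical base class, not Text::Bib itself

sub new { return bless {}, shift }

# Hypothetical analogue of makeMethods(): install attr(), attrs(),
# and setAttrs() for field $field in the calling class.
sub make_methods {
    my ($class, $field, $attr) = @_;
    my $setter = 'set' . ucfirst($attr) . 's';
    my %method = (
        $attr      => sub { my $v = $_[0]{$field}; $v ? $v->[-1] : undef },
        "${attr}s" => sub { my $v = $_[0]{$field};
                            wantarray ? ($v ? @$v : ()) : $v },
        $setter    => sub { my $self = shift; $self->{$field} = [@_] },
    );
    no strict 'refs';   # needed to assign into the symbol table by name
    *{"${class}::$_"} = $method{$_} for keys %method;
    return;
}

package LobsterBib;
our @ISA = ('SketchBib');
LobsterBib->make_methods('Y', 'date');   # %Y holds the date

package main;

my $bib = LobsterBib->new;
$bib->setDates('1994', '1996');
print "date = ", $bib->date, "\n";       # last %Y value
```

Because the methods are installed in the subclass's own package, they shadow any inherited date() without touching the base class.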
Before you parse a Bib object, you can set certain parser options
to adjust for the peculiarities in a particular .bib -flavored file.
Since we're trying to steer clear of package-level state information,
we pass the parser options right into the read() call, as the
optional second argument:
my $opts = Text::Bib->makeOpts(LeadWhite => 'KEEP',
GoodFields => '[AEFZ]');
while ($bib = Text::Bib->read($fh, $opts)) {
# ...do stuff...
}
The options are as follows:
-
GoodFields
-
By default, the parser accepts any (one-character) field name that is
a printable character (no whitespace). Formally, this is:
[\041-\176]
However, when compiling parser options, you can supply your own regular
expression for validating (one-character) field names.
( note: you must supply the square brackets; they are there to remind
you that you should give a well-formed single-character expression).
One standard expression is provided for you:
$Text::Bib::GroffFields = '[A-EGI-LN-TVX]'; # legal groff fields
Illegal fields which are encountered during parsing result in a syntax error.
NOTE: You really shouldn't use this unless you absolutely need to.
The added regular expression test slows down the parser.
-
LeadWhite
-
In many .bib files, continuation lines (the 2nd, 3rd, etc. lines of a
field) are written with leading whitespace, like this:
%T Incontrovertible Proof that Pi Equals Three
   (for Large Values of Three)
%A S. Trurl
%X The author shows how anyone can use various common household
   objects to obtain successively less-accurate estimations of
   pi, until finally arriving at a desired integer approximation,
   which nearly always is three.
This leading whitespace serves two purposes:
- It makes it impossible to mistake a continuation line for a field, since % can no longer be the first character.
- It makes the .bib entries easier to read.
The LeadWhite option controls what is done with this whitespace:
- KEEP - default; the whitespace is untouched
- KILLONE - exactly one character of leading whitespace is removed
- KILLALL - all leading whitespace is removed
See the section below on "using the parser options" for hints and warnings.
-
Newline
-
The Newline option controls what is done with the newlines that
separate adjacent lines in the same field:
- KEEP - default; the newlines are kept in the field value
- TOSPACE - convert each newline to a single space
- KILL - the newlines are removed
See the section below on "using the parser options" for hints and warnings.
Default values will be used for any options which are left unspecified.
The default values for Newline and LeadWhite will preserve the
input text exactly.
The Newline=TOSPACE option, when used in conjunction with the
LeadWhite=KILLALL option, effectively "word-wraps" the text of
each field into a single line.
Be careful! If you use the Newline=KILL option with either the
LeadWhite=KILLONE or the LeadWhite=KILLALL option,
you could end up eliminating all whitespace that separates the word
at the end of one line from the word at the beginning of the next line.
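The interaction between the two options is easy to see on a two-line field. The sketch below uses a hypothetical helper, apply_opts, that follows the option descriptions given here; it is not the module's parser, and it assumes the field has already been split into its raw lines.

```perl
use strict;
use warnings;

# Hypothetical helper: apply LeadWhite and Newline settings to the raw
# lines of one field value, following the option descriptions above.
sub apply_opts {
    my ($lines, %opt) = @_;
    my $lead = $opt{LeadWhite} || 'KEEP';
    my $nl   = $opt{Newline}   || 'KEEP';
    my @l = @$lines;
    for my $i (1 .. $#l) {                 # continuation lines only
        $l[$i] =~ s/\A\s//  if $lead eq 'KILLONE';
        $l[$i] =~ s/\A\s+// if $lead eq 'KILLALL';
    }
    my $sep = $nl eq 'TOSPACE' ? ' ' : $nl eq 'KILL' ? '' : "\n";
    return join $sep, @l;
}

my @raw = ('Incontrovertible Proof that Pi Equals Three',
           '   (for Large Values of Three)');

# Word-wrapped into a single line:
print apply_opts(\@raw, Newline => 'TOSPACE', LeadWhite => 'KILLALL'), "\n";
# The risky combination: the two lines are glued together.
print apply_opts(\@raw, Newline => 'KILL',    LeadWhite => 'KILLALL'), "\n";
```

In the second call nothing is left between "Three" and "(for", which is exactly the hazard the warning describes.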
Since you generally will parse an entire file with the same parser options,
it's silly to have to determine the options used (and fill-in the defaults
for unspecified options) on every call to read() . So instead, if
you want to provide parser options, you specify them in a call to
makeOpts() : this method will "compile" your options for fastest-possible
usage, and then return a parser-options "object" to you which you can
plug into read() .
If a Text::Bib:: method returns an error value (usually undef), you can get
the last error by using any of these forms:
# If you happen to be using Bib objects:
Text::Bib->lastError();
# If you happen to be using MyBibSubclass objects:
MyBibSubclass->lastError();
# If you happen to have an instance on hand:
$bibobject->lastError();
It doesn't matter which form you use: they're all equivalent.
All return a string representation of the last error, which will
look like this:
"syntax: unexpected end of file"
The error message will always be of the form "category: description" ,
where the currently-legal categories include...
- ok - not really an error: e.g., expected end-of-file
- syntax - syntax error in parsing
NOTE: This error string is for diagnostics only: you shouldn't depend
on it for flow-control.
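If you want to show the two parts separately in a log or report, the string can be split at the first colon. This is a small sketch for display purposes only; split_error is a hypothetical helper and the message is the sample given above.

```perl
use strict;
use warnings;

# Hypothetical helper: split a "category: description" message in two.
sub split_error {
    my ($msg) = @_;
    # Limit of 2 keeps any later colons inside the description.
    my ($category, $description) = split /:\s*/, $msg, 2;
    return ($category, $description);
}

my ($cat, $desc) = split_error('syntax: unexpected end of file');
print "category=$cat description=$desc\n";
```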
Tolerable... barely. Even with a lot of hacking to speed things up,
it parses a typical 500 KB .bib file (of 1600 records) in 13
seconds of user time on my 66 MHz/32 MB RAM/I486 box running Linux 1.1.18.
So, figure about 125 records/sec, or about 40 KB/sec.
By contrast, a C program which does the same work is about 8 times
as fast. But of course, the C code is 8 times as large, and 8 times
as ugly. :-)
Since the parsing doesn't really "need" regular expressions, I'm
willing to bet that a variation of the parser which uses
dynamically-loaded C functions would be a little faster. Perhaps such
an alternate parser-method would be a parser-option, available for
people who've compiled their Perl5 to support dynamic-loading. But,
for now, we go with a more-portable approach.
Bottom line: I'd recommend using this module to process .bib files,
but if you're looking for a query tool... well... maybe we need someone
to implement a readInfo() substitute in C, which this module could
load.
I actually do not use .bib files for *roffing... I used them as a
quick-and-dirty database for WebLib, and that's where this code comes
from. If you're a serious user of .bib files, and this module doesn't
do what you need it to, please contact me: I'll add the functionality
in.
Compiles a lot of storage/access methods that the user may not need
(e.g., authors() , setAuthors() , etc.). In the future, the creation of
these methods should be done on-demand, by a custom AUTOLOAD routine.
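A sketch of what such on-demand creation might look like follows. Everything in it is hypothetical (the LazyBib class, the partial %FIELD_FOR map, and the choice to build only the scalar accessor); it illustrates the general AUTOLOAD technique and is not code from this module.

```perl
use strict;
use warnings;

package LazyBib;   # hypothetical class illustrating the idea only

our $AUTOLOAD;
my %FIELD_FOR = (author => 'A', date => 'D', title => 'T');  # partial map

sub new { my ($class, %f) = @_; return bless { %f }, $class }

sub AUTOLOAD {
    (my $name = $AUTOLOAD) =~ s/.*:://;
    return if $name eq 'DESTROY';          # never autoload destructors
    my $field = $FIELD_FOR{$name}
        or die "no such method: $name";
    # Build the scalar accessor once and install it, so later calls
    # bypass AUTOLOAD entirely.
    my $code = sub {
        my $v = $_[0]{$field};
        return $v ? $v->[-1] : undef;
    };
    no strict 'refs';
    *{$AUTOLOAD} = $code;
    goto &$code;                           # run it with the original @_
}

package main;

my $bib = LazyBib->new(D => ['1995', '1996']);
print "date = ", $bib->date, "\n";         # accessor created on first call
print "installed: ", (LazyBib->can('date') ? 'yes' : 'no'), "\n";
```

Only the accessors a program actually calls are ever built, at the cost of one slower first call per method.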
To speed up the access/storage methods calls, the full methods are created
and loaded (as opposed to having one-line "stubs" which call some
generic "back-end" function). The access/storage methods are pretty small,
but still... this means that all the more Perl code must be eval'ed and
loaded, and it may or may not have been a good design choice.
If any of the auto-compiled storage/access methods are invoked improperly,
the error messages are very cryptic, since the "filename" mentioned
is "eval".
Some combinations of parser-options are silly.
$Id: Bib.pm,v 1.18 1995/12/21 19:26:41 eryq Exp $
Copyright (C) 1995 by Eryq.
The author may be reached at
eryq@rhine.gsfc.nasa.gov
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
For a copy of the GNU General Public License, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.