Dusting the bookshelf
Posted on 2022-11-30
I’m currently in the process of writing my Ph.D. thesis. This means I need to gather all my published papers and fight against LaTeX and Lhs2TeX until everything fits together and looks somewhat decent. One of the many challenges of this process was to deal with the multiple bibliography (.bib) files I collected over the years. Hopefully I’m not the only one in this situation:
- Paper 0: inherit a gigantic .bib from advisor and add new references as needed
- Paper n: pick the .bib from some paper k (with 0 ≤ k < n) and add new references as needed
So here I was, sitting on ~50 kLOC worth of .bib files that I needed to organize. On top of that, I wanted to collect all the references in a single section of the thesis to avoid repetition and inconsistencies. In principle, I could have simply used the good ol’ bibtool utility to merge all my .bib files together:
$ bibtool -s foo.bib bar.bib baz.bib
The problem with this approach is that I would have ended up again with a gigantic .bib file when, in reality, I only needed a handful of entries from it. To solve this, I put together a small bash script that pulls references from .bib files on demand. The trick is to ask for forgiveness rather than for permission:
- Compile the project with an empty .bib file
- Parse the log file to find which citations are missing references
- Extract each of those references from some existing .bib file and append them to the project’s .bib file
- Compile the project again, now without missing references :D
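As a rough sketch, the four steps could be wired together as follows. The names MAIN, OUTPUT, compile, count_missing, fetch_missing and rebuild are placeholders chosen for illustration, and the pdflatex/bibtex sequence is an assumption that will differ from project to project (for instance when a preprocessor such as Lhs2TeX is involved):

```shell
#!/usr/bin/env bash
# Sketch only: every name below is a placeholder, not part of the actual script.
MAIN=${MAIN:-thesis}               # assumed main .tex file, without extension
OUTPUT=${OUTPUT:-references.bib}   # the project's .bib file

# One full LaTeX/BibTeX round, so that citations get a chance to resolve.
compile () {
  pdflatex -interaction=nonstopmode "$MAIN" > /dev/null
  bibtex "$MAIN" > /dev/null
  pdflatex -interaction=nonstopmode "$MAIN" > /dev/null
  pdflatex -interaction=nonstopmode "$MAIN" > /dev/null
}

# Number of "undefined citation" warnings in the latest log file.
count_missing () {
  grep -c "LaTeX Warning: Citation" "$MAIN.log"
}

# Stand-in for step 3: pulling the missing entries into $OUTPUT.
fetch_missing () {
  :
}

rebuild () {
  touch "$OUTPUT"                           # 1. start from a (possibly empty) .bib
  compile
  echo "Missing before: $(count_missing)"   # 2. inspect the log
  fetch_missing                             # 3. extract and append entries
  compile                                   # 4. compile again
  echo "Missing after: $(count_missing)"
}
```

Calling rebuild from the project directory would perform the whole round trip; the interesting part is what goes into fetch_missing.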
Finding missing citations
If we inspect the log files created by pdflatex/bibtex, missing citations are reported as:
LaTeX Warning: Citation 'afl' on page 21 undefined on input line 16664.
So we need to find these warnings, extract the citation keys (i.e., afl in the line above) from them and remove any duplicates:
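LOGFILE=thesis.log
# Take the 4th field of each warning (the quoted key), strip the quotes
# (LaTeX may open the quote with a backtick), then sort and deduplicate.
MISSING=$(
  grep "LaTeX Warning: Citation" "$LOGFILE" |
  awk '{print $4}' | tr -d "'\`" |
  sort -u
)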
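To sanity-check the extraction without compiling anything, here is a small sketch that runs the same pipeline against a hand-written log. The log contents and the extract_missing helper name are invented for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical sample log; keys, pages and line numbers are made up.
LOGFILE=$(mktemp)
cat > "$LOGFILE" <<'EOF'
LaTeX Warning: Citation 'afl' on page 21 undefined on input line 16664.
LaTeX Warning: Citation 'asan' on page 22 undefined on input line 16670.
LaTeX Warning: Citation 'afl' on page 30 undefined on input line 17001.
LaTeX Warning: Reference 'fig:x' on page 3 undefined on input line 120.
EOF

# Same pipeline as above, wrapped in a function that takes the log file.
extract_missing () {
  grep "LaTeX Warning: Citation" "$1" |
  awk '{print $4}' | tr -d "'\`" |
  sort -u
}

MISSING=$(extract_missing "$LOGFILE")
echo "$MISSING"
rm -f "$LOGFILE"
```

The repeated afl warning collapses into a single key, and the undefined \ref warning is ignored because it does not match the "Citation" pattern.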
With the culprits at hand, we can move on to the next step.
Retrieving citations on demand
To find out whether a given reference exists in a given citation file, we can use bibtool’s select resource:
find_ref () {
  # $1: citation key to look for, $2: .bib file to search in
  bibtool -q -r biblatex '--expand.macros=ON' '--print.all.strings=OFF' '--select{$key "'"$1"'"}' "$2"
}
This way, find_ref $key $bibfile will try to find the reference $key in the file $bibfile, returning either its corresponding entry if it can find it, or an empty string otherwise. The -r biblatex option is needed to allow indexing @online entries, whereas the expand.macros and print.all.strings options are needed to ensure that string macros are expanded, so we don’t need to print them every time we look for a citation key.
Then, we can iterate over the existing .bib files to see if we find each missing citation:
BIBFILES=$(ls bib/*.bib) # Existing .bib files
OUTPUT=references.bib    # The project's .bib file
for missing in $MISSING; do
  echo -n "Looking for $missing: "
  # Reference is already there
  FOUND=$(find_ref "$missing" "$OUTPUT")
  if [ -n "$FOUND" ]; then
    echo "Already in $OUTPUT!"
    continue
  fi
  # Reference is in some .bib file
  for bibfile in $BIBFILES; do
    FOUND=$(find_ref "$missing" "$bibfile")
    if [ -n "$FOUND" ]; then
      echo "Found in $bibfile!"
      echo "$FOUND" >> "$OUTPUT"
      break
    fi
  done
  # Reference is M.I.A.
  if [ -z "$FOUND" ]; then
    echo "Could not find any entry!"
  fi
done
The reason we first look in the project’s .bib file is to avoid producing duplicate entries when:
- The user already added them manually, or
- We have an overlapping entry key that matches a previously searched missing reference. 1
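A quick way to double-check that no duplicates slipped through is to list the keys that appear more than once in the project’s .bib file. This is only a minimal sketch: dup_keys is a made-up helper, and it assumes that every entry starts on its own line as @type{key, (optionally with whitespace around the brace):

```shell
#!/usr/bin/env bash
# Print every entry key that occurs more than once in the given .bib file.
# Assumption: entry headers look like "@type{key," and sit on their own line.
dup_keys () {
  awk '
    /^@/ {
      key = $0
      sub(/^@[A-Za-z]+[ \t]*[{(][ \t]*/, "", key)   # drop the "@type{" prefix
      sub(/[ \t]*,.*$/, "", key)                    # drop the comma and the rest
      print key
    }
  ' "$1" | sort | uniq -d
}
```

Running dup_keys references.bib should print nothing when every key is unique.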
Trying it out
We can now run this little snippet to gather only the references we need:
Looking for afl: Found in bib/foo.bib!
Looking for asan: Found in bib/bar.bib!
Looking for peach: Found in bib/baz.bib!
...
And my curated .bib file now has “only” a little over 2000 lines of code ¯\_(ツ)_/¯
You can find the bash script here in case you want to give it a try.
1. This happens because bibtool’s select resource looks for all the entries whose keys contain the given key string. So, for instance, both keys foo and foobar will match with foo, possibly returning more than one entry per find_ref. If you know how to avoid this, please let me know!
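One possible workaround, offered as a sketch rather than a verified fix, is to post-filter whatever find_ref returns and keep only the entry whose key is exactly the requested one. The exact_entry helper below is hypothetical, and it assumes each entry starts on its own line as @type{key, (possibly with whitespace after the brace) and that no other line begins with @:

```shell
#!/usr/bin/env bash
# Read .bib entries on stdin and print only the one whose key equals $1.
# Assumption: entry headers look like "@type{ key," on a line of their own.
exact_entry () {
  awk -v key="$1" '
    /^@/ {
      header = $0
      sub(/^@[A-Za-z]+[ \t]*[{(][ \t]*/, "", header)   # drop the "@type{" prefix
      sub(/[ \t]*,.*$/, "", header)                    # drop the comma and the rest
      keep = (header == key)                           # exact comparison, not substring
    }
    keep { print }
  '
}
```

It would be used as find_ref "$key" "$bibfile" | exact_entry "$key". Since the select pattern is described as a regular expression in the bibtool manual, anchoring it (e.g. "^foo$") might already be enough on its own, but that is untested here.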