Saturday, September 15, 2012

HOWTO: Batch Authority Control

I received a request in email to share how I am doing my batch authority control linking in Evergreen, so I thought I'd write a blog post to explain it.

In order to batch authority control on Evergreen, you will need three pieces of software:

authority_control_fields_batcher.pl
You can get authority_control_fields_batcher.pl in my evergreen_utilities repository.
disbatcher.pl
disbatcher.pl is available from here.
authority_control_fields.pl
This program comes with Evergreen and recent installations should put it in your /openils/bin/ directory.

Run authority_control_fields_batcher.pl and direct the output to a file:

authority_control_fields_batcher.pl > batches

This will produce a file that you can use with disbatcher.pl. This file will have entries that will run authority_control_fields.pl over all of the undeleted bibs in your Evergreen database in batches of 10,000. If you want different options, you should read the comments in authority_control_fields_batcher.pl.

Next, you should schedule disbatcher.pl to run via at or cron with some appropriate options:

disbatcher.pl -s 7200 -n 8 -f /full/path/to/batches -v

Depending on your system and where you are running this (see below), you will likely need different options.
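As a rough sketch of the cron route, the snippet below builds and prints a crontab line that you could paste into `crontab -e`. The disbatcher.pl options are copied unchanged from the example above. The schedule (10:00 pm every Saturday), the location of disbatcher.pl, and the log file path are placeholders chosen for illustration, so adjust all of them. Cron runs jobs with a minimal environment, which is why everything is given as a full path.

```shell
# Hypothetical schedule: minute hour day-of-month month day-of-week.
# "0 22 * * 6" means 22:00 every Saturday.
schedule='0 22 * * 6'

# Options copied from the example above; the script path is a placeholder.
job='/full/path/to/disbatcher.pl -s 7200 -n 8 -f /full/path/to/batches -v'

# Placeholder log file, so the output is not lost or mailed by cron.
log='/full/path/to/disbatcher.log'

# Assemble and print the crontab line; nothing is installed by this snippet.
cron_line="$schedule $job >> $log 2>&1"
printf '%s\n' "$cron_line"
```

Printing the line rather than installing it lets you review it before it goes anywhere near your crontab.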

If you just want to start it now and don't care to specify any extra options, you could just run the following. Remember not to log out until it finishes, or run it in a screen session:

authority_control_fields_batcher.pl | disbatcher.pl

Again, you will likely want to specify some options, particularly to disbatcher.pl.

I run this on my workstation in the MVLC Central Site offices. I can do this because I use Ubuntu GNU/Linux and have installed the OpenSRF and OpenILS libraries and configured them to communicate with our production installation. If you don't have a GNU/Linux workstation, then you could run this on your utility server. If you don't have a utility server, then you could run this directly on your production server. However, in that case, you may be no better off than just running authority_control_fields.pl over your entire database without batching. I found this to be the case when running it on my development virtual machine image.

Ideally, you want to run this when your system isn't that busy. Nights and weekends seem to work well for us. Determining the best time to run the batches requires a bit of experimentation. I started by running just 4 batches, all simultaneously, by editing the input file to include only the first four lines of output from authority_control_fields_batcher.pl. When that went well, I upped the number to 8 the next night. After that run, I decided to run all of the remaining batches, 8 at a time, until they finished. This last run starts on a Saturday night. I have not actually run that last batch yet, so I won't know how it worked until tomorrow, but I suspect it should finish by 9:00 pm on Sunday night. I'll post a follow-up blog on Monday to share how it went.
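Trimming the input file to its first four lines can be done in a text editor, but head and tail do the same job and leave the original file untouched. The sketch below is one way to do it, not necessarily how it was done for the runs described above. It uses a scratch directory and a stand-in batches file with placeholder lines so that it runs anywhere; the file names batches.trial and batches.rest are made up for illustration.

```shell
# Scratch directory and a stand-in for the real batches file.
# The real file comes from authority_control_fields_batcher.pl;
# these ten lines are placeholders only.
workdir=$(mktemp -d)
printf 'placeholder batch line %s\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/batches"

# Input for a small trial run: only the first four lines.
head -n 4 "$workdir/batches" > "$workdir/batches.trial"

# Everything from line 5 onward, saved for the later, larger runs.
tail -n +5 "$workdir/batches" > "$workdir/batches.rest"

# Show how many lines ended up in each file.
wc -l "$workdir/batches.trial" "$workdir/batches.rest"
```

With your real file, you would then point disbatcher.pl at the trial file for the first night and at the remainder once you are happy with how the trial went.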
