Monday, September 17, 2012

Authority Control: After Action Report

The run of authority_control_fields.pl that I started at 7:00 pm on Saturday ran through Sunday and finished at 4:35 am this (Monday) morning. At 33 hours and 35 minutes, it took a little longer than I had hoped, but it finished well within the bounds of what I needed.

For those of you following along at home, there were some cleanup issues this morning.

The output contained 1,699 lines about what appeared to be bib records that were missing subfield codes in various tags, mostly 400, 410 and 670. These lines were typically surrounded by messages about wide characters in warn.

I checked all of the reported bib records and the one thing that they all had in common was that they did not contain the datafield that was supposedly missing subfield entries. I mentioned this on IRC and Galen Charlton suggested that it could be bad authorities.

So, I modified my copy of authority_control_fields.pl to add print("$rec_id : $auth_id\n"); on or about line 461. This way it would print each bibliographic record id along with the id of every authority record it matched. I then wrote a script to take the list of reported bibs, run the modified authority_control_fields.pl on each one individually using its --record parameter, and capture the output to a file. This run mysteriously produced no error output, and all of the bibs now appear to be linked to authorities.
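
A wrapper along those lines can be sketched as follows. This is a minimal sketch, not the script actually used: the function name, the ACF variable, and the file names are made up for illustration, and the --record=ID spelling is an assumption (the post only names the --record parameter), so check the script's own usage first.

```shell
# Path to the linking script; override ACF to try this with a stand-in command.
ACF="${ACF:-/openils/bin/authority_control_fields.pl}"

# relink_bibs FILE: run the linker once per bib id listed in FILE (one id per line).
# Hypothetical helper; --record=ID is an assumed spelling of the --record parameter.
relink_bibs() {
    while IFS= read -r id; do
        # Skip blank lines in the id list.
        [ -n "$id" ] || continue
        # Keep the command from reading the id list on stdin; fold stderr into stdout.
        "$ACF" --record="$id" < /dev/null 2>&1
    done < "$1"
}

# Usage (file names are examples):
#   relink_bibs bad_bibs.txt > relink_output.txt
```

Running one record per invocation is slow, but it keeps the output for each bib together, which is what you want when chasing down odd errors.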

I then sorted the output of authority ids and uniquified the list. After checking the authorities by dumping their MARCXML to a file and going over it, none of them looked bad.
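
Pulling the authority ids out of that output and reducing them to a unique list takes one short pipeline. The sketch below assumes the captured lines look exactly like the print statement's "bib id : authority id" format; the function name is hypothetical.

```shell
# unique_auth_ids: read "bib_id : authority_id" lines on stdin and
# print each distinct authority id once, in numeric order.
# Lines that do not have exactly two " : "-separated fields are ignored.
unique_auth_ids() {
    awk -F' : ' 'NF == 2 { print $2 }' | sort -n -u
}

# Usage (file names are examples):
#   unique_auth_ids < relink_output.txt > authority_ids.txt
```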

Galen called this a "heisenbug" since the behavior seems to change as you observe it. However, I think the strange output may be due to some difference in the environment when I run jobs via at. I normally use the UTF-8 character set, and this may not be set in the environment when at runs a job.
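
If the environment really is the cause, which is not confirmed, one way to rule it out is to set the locale explicitly for the job instead of relying on whatever the scheduler passes along. A minimal sketch, assuming en_US.UTF-8 is installed (pick one from the output of locale -a); the helper name is hypothetical.

```shell
# with_utf8 COMMAND [ARGS...]: run a command with an explicit UTF-8 locale,
# regardless of what the calling environment provides.
# en_US.UTF-8 is an ASSUMED locale name; substitute one your system lists.
with_utf8() {
    env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 "$@"
}

# Usage:
#   with_utf8 disbatcher.pl -s 7200 -n 8 -f /full/path/to/batches -v
```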

The upshot of the above is, if you get errors when running your batched authority_control_fields.pl jobs, then run it again on the errored records. This may just fix those.

Saturday, September 15, 2012

HOWTO: Batch Authority Control

I received a request in email to share how I am doing my batch authority control linking in Evergreen, so I thought I'd write a blog post to explain it.

In order to batch authority control on Evergreen, you will need three pieces of software:

authority_control_fields_batcher.pl
You can get authority_control_fields_batcher.pl in my evergreen_utilities repository.
disbatcher.pl
disbatcher.pl is available from here.
authority_control_fields.pl
This program comes with Evergreen and recent installations should put it in your /openils/bin/ directory.

Run authority_control_fields_batcher.pl and direct the output to a file:

authority_control_fields_batcher.pl > batches

This will produce a file that you can use with disbatcher.pl. This file will have entries that will run authority_control_fields.pl over all of the undeleted bibs in your Evergreen database in batches of 10,000. If you want different options, you should read the comments in authority_control_fields_batcher.pl.
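
To illustrate what batching by id range means, here is a small stand-in that turns a list of bib ids into one command line per batch. It is not authority_control_fields_batcher.pl: it reads ids from standard input, the function name is made up, and the output format is modeled on the "dispatched:" lines in the run reports, which is an assumption about what the real file contains.

```shell
# make_batches [SIZE]: read bib ids on stdin (one per line) and print one
# authority_control_fields.pl command per SIZE ids, using the first and
# last id of each group as --start_id and --end_id. SIZE defaults to 10000.
make_batches() {
    size="${1:-10000}"
    sort -n | awk -v size="$size" -v cmd="/openils/bin/authority_control_fields.pl" '
        # The first id of each group becomes the start of the range.
        NR % size == 1 || size == 1 { start = $1 }
        { last = $1 }
        # A full group: print its range and reset.
        NR % size == 0 {
            printf "%s --start_id=%d --end_id=%d\n", cmd, start, last
            start = ""
        }
        # A final, partial group.
        END {
            if (start != "")
                printf "%s --start_id=%d --end_id=%d\n", cmd, start, last
        }
    '
}
```

Because the ranges come from ids that actually exist, gaps in the id sequence show up as gaps between one batch's end and the next batch's start.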

Next, you should schedule disbatcher.pl to run via at or cron with some appropriate options:

disbatcher.pl -s 7200 -n 8 -f /full/path/to/batches -v
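
Queuing that command with at could look something like this. A sketch only: the helper name and the AT variable are hypothetical, and the time is just an example. at reads the commands to run from standard input and normally mails the job's output to you.

```shell
# AT names the scheduler command; override it to test without queuing a real job.
AT="${AT:-at}"

# schedule_disbatcher WHEN BATCH_FILE: queue the disbatcher.pl run shown above.
# Hypothetical helper; WHEN is any time specification that at(1) accepts.
schedule_disbatcher() {
    printf 'disbatcher.pl -s 7200 -n 8 -f %s -v\n' "$2" | "$AT" "$1"
}

# Usage:
#   schedule_disbatcher 19:00 /full/path/to/batches
```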

Depending on your system and where you are running this (see below), you will likely need different options.

If you just want to start it now and don't care to specify any extra options, you could just run the following. Remember not to log out until it finishes, or use a screen session:

authority_control_fields_batcher.pl | disbatcher.pl

Again, you will likely want to specify some options, particularly to disbatcher.pl.

I run this on my workstation in the MVLC Central Site offices. I can do this because I use Ubuntu GNU/Linux and have installed the OpenSRF and OpenILS libraries and configured them to communicate with our production installation. If you don't have a GNU/Linux workstation, then you could run this on your utility server. If you don't have a utility server, then you could run this directly on your production server. However, in that case, you may be no better off than just running authority_control_fields.pl over your entire database without batching. I found this to be the case when running it on my development virtual machine image.

Ideally, you want to run this when your system isn't that busy. Nights and weekends seem to work well for us. Determining the best time to run the batches requires a bit of experimentation. I started with just 4 batches, all running simultaneously, by editing the input file to include only the first four lines of output from authority_control_fields_batcher.pl. When that went well, I upped the number to 8 the next night. After that run, I decided to run all of the remaining batches, 8 at a time, until they finished. This last run starts on a Saturday night. I have not actually run it yet, so I won't know how it worked until tomorrow, but I suspect it should finish by 9:00 pm on Sunday night. I'll post a follow-up on Monday to share how it went.
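
Trimming the input file for a trial run does not have to be done by hand. The sketch below splits a batches file into the first N lines and the rest; the function name and the .first and .rest suffixes are arbitrary choices for illustration.

```shell
# split_batches FILE N: put the first N lines of FILE in FILE.first
# and every line after them in FILE.rest. FILE itself is left untouched.
split_batches() {
    head -n "$2" "$1" > "$1.first"
    tail -n +"$(( $2 + 1 ))" "$1" > "$1.rest"
}

# Usage:
#   split_batches batches 4
#   disbatcher.pl -n 4 -f /full/path/to/batches.first -v
```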

Color Me Impressed: Another Authority Control Report

The batch of 8 files all processed within two hours last night. (See the output below.) That's a full 20 minutes ahead of the 4 that ran on Thursday. I chalk the improved performance up to there being less happening on the servers on a Friday night.
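
For anyone who wants to check the arithmetic, the two elapsed times (2:21:18 for Thursday's run of 4 and 1:59:44 for Friday's run of 8, both taken from the reports) can be compared like this. The helper name is made up for illustration.

```shell
# to_seconds H:MM:SS: convert an elapsed time to seconds.
# awk does the math so a field like "08" is not read as octal.
to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

thursday=$(to_seconds 2:21:18)   # 4 batches
friday=$(to_seconds 1:59:44)     # 8 batches
diff=$(( thursday - friday ))

# Prints: 21 minutes 34 seconds faster
printf '%d minutes %d seconds faster\n' $(( diff / 60 )) $(( diff % 60 ))
```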

Given that level of success, I plan to run the remaining batches starting tonight at 7:00 pm. I'll have it run 8 at a time and give it the full list of remaining commands. It should finish sometime Sunday night or early Monday morning.

My next blog post will explain how I'm doing this, so watch this space.

Output:

dispatched: /openils/bin/authority_control_fields.pl --start_id=11 --end_id=21439
1 of 8 running
dispatched: /openils/bin/authority_control_fields.pl --start_id=21441 --end_id=43928
2 of 8 running
dispatched: /openils/bin/authority_control_fields.pl --start_id=43931 --end_id=65785
3 of 8 running
dispatched: /openils/bin/authority_control_fields.pl --start_id=65791 --end_id=86020
4 of 8 running
dispatched: /openils/bin/authority_control_fields.pl --start_id=86026 --end_id=102506
5 of 8 running
dispatched: /openils/bin/authority_control_fields.pl --start_id=102507 --end_id=119262
6 of 8 running
dispatched: /openils/bin/authority_control_fields.pl --start_id=119263 --end_id=136363
7 of 8 running
dispatched: /openils/bin/authority_control_fields.pl --start_id=136364 --end_id=152938
8 of 8 running
1 of 8 processed
7 of 8 running
2 of 8 processed
6 of 8 running
3 of 8 processed
5 of 8 running
4 of 8 processed
4 of 8 running
5 of 8 processed
3 of 8 running
6 of 8 processed
2 of 8 running
7 of 8 processed
1 of 8 running
8 of 8 processed
0.01user 0.00system 1:59:44elapsed 0%CPU (0avgtext+0avgdata 14480maxresident)k
752inputs+8outputs (0major+1308minor)pagefaults 0swaps

Friday, September 14, 2012

More issa changes.

After some thinking and some other bib-related work, I've decided to make issa create new copies as pre-cat bibs like it should have to begin with. Since there is an installed base (however small) of issa users, this new feature will be optional, but turned on by default, so it will be there for any new installations. If an existing installation wants to use the new feature, then they'll need to update their issa code and add the option in their configuration file. I'll explain how it works and how to activate the feature once I've actually coded the solution.

Authority Control Linking: Results

The first batch of authority control fields linking went well last night. It finished in two hours and twenty-one minutes. Here's the report that I received in email:

dispatched: /openils/bin/authority_control_fields.pl --start_id=11 --end_id=21439
1 of 4 running
dispatched: /openils/bin/authority_control_fields.pl --start_id=21441 --end_id=43928
2 of 4 running
dispatched: /openils/bin/authority_control_fields.pl --start_id=43931 --end_id=65785
3 of 4 running
dispatched: /openils/bin/authority_control_fields.pl --start_id=65791 --end_id=86020
4 of 4 running
4 of 4 running
4 of 4 running
1 of 4 processed
3 of 4 running
2 of 4 processed
2 of 4 running
3 of 4 processed
1 of 4 running
4 of 4 processed
0.00user 0.01system 2:21:18elapsed 0%CPU (0avgtext+0avgdata 14480maxresident)k
0inputs+8outputs (0major+1189minor)pagefaults 0swaps

Tonight, we'll try running 8 batches, all 8 simultaneously, to see if that takes longer or just as long. Depending on the results of tonight's test, we may just run the rest through starting Saturday night, or we'll continue running batches each night.

Thursday, September 13, 2012

Authority Control Linking

This post is more for MVLC member libraries' staff than for the community at large, which is a bit of a switch for us. This blog is meant to be for the benefit of our members as much as it is for the benefit of the community.

Starting tonight, September 13, 2012, MVLC central site staff will run the script for linking authorities with bibs in Evergreen, authority_control_fields.pl. We plan to run it on batches of 10,000 bibs at a time with up to four batches running simultaneously. We will do just four batches on the first night to see how long that takes. Depending on the results, we may bump the number up to 8 or 16 batches per night, or adjust the number of simultaneously running batches downward.

Depending upon how many batches we can successfully complete in a night, this will take us anywhere from six to twenty-two days to complete.
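
The range follows from dividing the number of batches by how many run each night and rounding up. The sketch below assumes about 88 batches in total; that figure is not stated anywhere in this post and is only inferred because it is consistent with both ends of the estimate (4 per night and 16 per night).

```shell
# nights_needed BATCHES PER_NIGHT: ceiling division with integer arithmetic.
nights_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

total=88   # ASSUMED batch count, inferred from the estimate; not a stated figure

nights_needed "$total" 4    # prints 22
nights_needed "$total" 8    # prints 11
nights_needed "$total" 16   # prints 6
```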

While I don't expect this to have any impact on production performance whatsoever, we are still running this at night as a precaution.

If there's any interest in the comments, I'll post updates as this progresses or not.

Saturday, September 1, 2012

All issa, all the time

Yes, another post concerning issa. Now that other sites are using it, some unexpected situations have come up. The latest changes to the code will fail more gracefully if your configuration file still says to use a stat cat entry on copies created by issa, but neither that stat cat entry nor the corresponding stat cat exists.