Monday, September 17, 2012

Authority Control: After Action Report

The run of authority_control_fields.pl that I started at 7:00 pm on Saturday ran through Sunday and finished at 4:35 am this (Monday) morning. At 33 hours and 35 minutes, it took a little longer than I had hoped, but it finished well within the bounds of what I needed.

For those of you following along at home, there were some cleanup issues this morning.

The output contained 1,699 lines about what appeared to be bib records that were missing subfield codes in various tags, mostly 400, 410 and 670. These lines were typically surrounded by messages about wide characters in warn.

I checked all of the reported bib records and the one thing that they all had in common was that they did not contain the datafield that was supposedly missing subfield entries. I mentioned this on IRC and Galen Charlton suggested that it could be bad authorities.

So, I modified my copy of authority_control_fields.pl to add print("$rec_id : $auth_id\n"); on or about line 461. This way it would print every bibliographic record id along with the id of each matching authority record. I then wrote a script to take the list of reported bibs, run the modified authority_control_fields.pl on each one individually using its --record parameter, and capture the output to a file. This run mysteriously produced no error output, and all of the bibs now appear to be linked to authorities.
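A minimal sketch of what such a rerun script could look like, in shell. This is an illustration, not the exact script from that run: the ACF variable, the rerun_bibs function name and the input and log files are placeholders, and only the --record parameter comes from authority_control_fields.pl itself. Your install may well need additional options, so adjust the command line to match.

```shell
#!/bin/sh
# Placeholder path to the script; override ACF to suit your install.
ACF="${ACF:-./authority_control_fields.pl}"

# rerun_bibs BIB_LIST LOG_FILE
# Runs the linker once per bib id (one id per line in BIB_LIST) and
# captures both stdout and stderr of every run in LOG_FILE.
rerun_bibs() {
    while IFS= read -r bib_id; do
        # Skip blank lines in the id list.
        [ -n "$bib_id" ] || continue
        # Redirect stdin so the command cannot swallow the rest of the list.
        "$ACF" --record="$bib_id" </dev/null
    done < "$1" > "$2" 2>&1
}
```

Calling rerun_bibs bad_bibs.txt rerun.log would then process every id listed in bad_bibs.txt, one run per record, which keeps any error output next to the record that caused it.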

I then pulled the authority ids out of that output, sorted them and uniquified the list. I checked the authorities by dumping their MARCXML to a file and going over it, and none of them looked bad.
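For the sort-and-uniquify step, something along these lines should work, assuming the captured output contains lines in the "rec_id : auth_id" form produced by the added print statement, mixed in with other messages. The function name unique_auth_ids is made up for this sketch.

```shell
#!/bin/sh
# unique_auth_ids LOG_FILE
# Prints each distinct authority id once, in numeric order.
# Only lines that look exactly like "digits : digits" are considered,
# so warnings and other noise in the log are ignored.
unique_auth_ids() {
    awk -F' : ' '/^[0-9]+ : [0-9]+$/ { print $2 }' "$1" | sort -n -u
}
```

The resulting list can then be fed to whatever you use to dump the authorities as MARCXML.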

Galen called this a "heisenbug" since the behavior seems to change as you observe it. However, I think the strange output may be due to some difference in the environment when I run jobs via at. I normally use the UTF-8 character set, and this may not be set in the environment when at runs a job.
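One way to test that guess would be to have the job record its locale settings before it starts. The sketch below only inspects the usual locale variables in the POSIX order of precedence (LC_ALL, then LC_CTYPE, then LANG); the function names are hypothetical, and it does not prove that at changes the environment, it just shows what the job sees.

```shell
#!/bin/sh
# Print the locale name that decides character handling:
# LC_ALL wins over LC_CTYPE, which wins over LANG.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
}

# Succeed if the effective locale name carries a UTF-8 codeset suffix.
is_utf8_env() {
    case "$(effective_ctype)" in
        *.UTF-8*|*.utf-8*|*.UTF8*|*.utf8*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Putting a line such as is_utf8_env || echo "locale is '$(effective_ctype)'" >&2 at the top of the script handed to at would show whether the batch run and the interactive run really differ.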

The upshot of the above is, if you get errors when running your batched authority_control_fields.pl jobs, then run it again on the errored records. This may just fix those.
