The Evergreen 2.4 to 2.5 upgrade process requires a reingest of your bibliographic records so that new features, such as browse search, will work properly. Traditional methods of reingesting records with a SQL script are slow, since the search indexing for each bibliographic record is updated in turn. They also require that you tinker with global flags in the database.
To remedy some of these issues, 2.4 modified the database function metabib.reingest_metabib_field_entries to accept three boolean flags in addition to the record id of the bibliographic record that needs reingesting. These flags indicate which of the metabib indexes you'd like to skip for the given bib record id: facet, browse, or search, in that order. (To make it perfectly clear: setting a flag to TRUE causes that index reingest to be skipped and not run. This logic is the opposite of what you might typically expect, so it bears repeating.) The flags all default to FALSE, so if you want to reingest everything you can still use the function in the old way. However, using these flags to reingest only what needs to be reingested can save you some time, and it also permits us to write a program that does a complete bibliographic reingest in parallel. The latter is a feat that was rather difficult to achieve prior to 2.4.
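For example, given the flag order (facet, browse, search), a call like the following reingests only the search index for a single record; the record id 123 here is just a placeholder:

```sql
-- TRUE means "skip": skip the facet and browse reingests and run only
-- the search reingest for the bib record with id 123 (a placeholder).
SELECT metabib.reingest_metabib_field_entries(123, TRUE, TRUE, FALSE);
```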
The main advantage of the 2.4 version of metabib.reingest_metabib_field_entries over a SQL script that simply updates your bibliographic records is the fine-grained control you gain over the ingest process when you use the flags to skip different ingest methods. Updating a bibliographic record causes all of the reingest methods to run on that record. In the course of normal operation, this is exactly what you want: if a MARC record is edited, for instance, you want the changes to show up in all of the indexes. However, when you are doing a planned reingest of all of your records, such as during an upgrade or after adding a custom metabib field, you may want more control over which indexes get updated. Updating the facet, browse, or search index only when necessary will save you some time when indexing all of your records at once. If you have added a new configuration for a facet, search, or browse metabib field, you will want to ask your database administrator to run a simple SQL script that reingests all of your bibs using the metabib.reingest_metabib_field_entries function with the appropriate flags. While updating only one metabib index will save you some time, indexing all of your records this way will still take several hours. If you want to update all of your indexes for all of your bibliographic records in one go, you will definitely want to do it in parallel, and the flags on metabib.reingest_metabib_field_entries make that possible.
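As a sketch of such a script, the following reingests only the facet index for every live bib. Treat it as an illustration: the choice to exclude deleted records and ids at or below zero (Evergreen's stub records) is my assumption about what you would typically want.

```sql
-- Reingest only the facet index (skip browse and search) for all
-- non-deleted bib records; id > 0 excludes Evergreen's stub record.
SELECT metabib.reingest_metabib_field_entries(id, FALSE, TRUE, TRUE)
  FROM biblio.record_entry
 WHERE NOT deleted AND id > 0;
```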
Before you can run the reingest in parallel, you need to know a little about how the different ingest routines work. The facet, browse, and search ingests can all happen at the same time; that is, they can run in parallel with each other. The browse ingest, however, cannot run in parallel with other browse ingests: you run the risk of database conflicts when different processes do browse updates at the same time. This means you have to partition the work so that the facet and search ingests run in parallel while the browse ingest runs sequentially over each record. You can still run the browse ingest while the parallel facet and search ingests run. If that sounds a bit complicated, never fear. I have written pingest.pl, a smallish Perl program that will do all of that for you.
While you may have great success with just downloading the program, copying it to one of your Evergreen servers, and running it without knowing how it does what it does, you should probably understand a few things about how it works and the assumptions that it makes before you attempt to run it.
First, it assumes that you have set the PGHOST, PGPORT, PGDATABASE, PGUSER, and PGPASSWORD environment variables as described here. If you don't have those set, you will need to either set them or modify the three lines that call DBI->connect('DBI:Pg') so that the program can find your database.
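Setting those variables might look like the following; the host, database name, and credentials shown are placeholders for your own values:

```shell
# Placeholder connection settings for libpq-aware tools and for
# pingest.pl's DBI:Pg handles; substitute your own values.
export PGHOST=db.example.org
export PGPORT=5432
export PGDATABASE=evergreen
export PGUSER=evergreen
export PGPASSWORD=example-password
```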
Second, the program will run 8 parallel processes by default, and will use batches of 10,000 records each when ingesting for the facet and search indexes. These values work in my environment; you may want to use different numbers depending on the capabilities of your database server and the number of records in your database. These values are set by the constants MAXCHILD and BATCHSIZE defined near the top of the file. MAXCHILD controls how many processes are used for the parallel ingest, and BATCHSIZE controls how many records are processed by each of the parallel processes. The browse ingest, which runs sequentially over all records as a single batch, also counts against the limit set by MAXCHILD. Because the browse ingest operates more or less sequentially, it serves as the main limit on how long the total reingest takes: with any reasonable number of processes and batch size, the combined facet and search ingests will likely finish several hours before the browse ingest does. As a general rule of thumb, you should probably set MAXCHILD to one half the number of cores, or threads if hyper-threading is enabled, on your database server, and set BATCHSIZE to approximately one one-hundredth of the number of your bibliographic records. There is room to fudge here, and if you're doing this during an upgrade, you could just go ahead and use all of the cores on your database server. You should experiment and find numbers that work for you. You might discover that larger batches work just fine in your situation.
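In the file itself, those tunables might look something like this; the exact declarations in pingest.pl may differ, so treat this as a sketch of what to look for and edit:

```perl
# Tunables near the top of pingest.pl (a sketch; check the script
# itself for the exact declarations).
use constant MAXCHILD  => 8;       # parallel ingest processes
use constant BATCHSIZE => 10_000;  # records per facet/search batch
```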
Finally, the program itself spends most of its time waiting on the database, so it uses very few resources on the computer where it runs. If you run it from a server or workstation other than your database server, you generally should not have to worry about how many CPU cores that machine has. The database server's resources and utilization are your main concerns.
Here at MVLC, we use this script quite frequently when updating our development and training servers, as well as during upgrades when necessary. We hope you also find it useful. We know there are ways it could be improved, such as moving the maximum child and batch size parameters from constants to command line options. If you make any modifications that would be useful to others, we would be happy to incorporate them.