OL: How to generate the dump files, step by step

Here are the instructions for generating the ol_dump.txt.gz files.

Run all of the steps below on ol-home.

1 :: Dump the data table from ol-db1
This task takes about 1 hour to complete.

giovanni@ol-home:/1/var/tmp$ psql -h ol-db1 -U openlibrary openlibrary -c "copy data to stdout" | gzip -c > data.txt.gz

2 :: Activate the virtual environment /opt/openlibrary/venv

giovanni@ol-home:/1/var/tmp$ source /opt/openlibrary/venv/bin/activate

3 :: Generate the metadata table dump from archive db
This task takes about 1 hour to complete.

(venv)giovanni@ol-home:/1/var/tmp$ ARCHIVE_DB_PASSWORD=`/opt/.petabox/dbserver`
(venv)giovanni@ol-home:/1/var/tmp$ python /opt/openlibrary/openlibrary/scripts/2012/dump-ia-items.py --host db-current --user archive --password $ARCHIVE_DB_PASSWORD --database archive | gzip -c > ia_metadata_dump_2015-03-11.txt.gz

4 :: Generate the dump of all revisions of all documents.
This task takes about 8 hours to complete.

(venv)giovanni@ol-home:/1/var/tmp$ /opt/openlibrary/openlibrary/scripts/oldump.py cdump data.txt.gz 2015-03-11 | gzip -c > ol_cdump.txt.gz
(venv)giovanni@ol-home:/1/var/tmp$ rm data.txt.gz
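
Because data.txt.gz is removed at the end of this step, it may be worth confirming first that ol_cdump.txt.gz decompresses cleanly. The sketch below is one possible check, not part of the existing scripts; the helper name count_rows is hypothetical.

```python
import gzip
import os
import tempfile

def count_rows(path):
    """Count the lines in a gzipped file by streaming it.

    Reading to the end also exercises the whole archive, so a truncated
    or corrupt file raises an exception instead of returning a count.
    """
    with gzip.open(path, "rb") as stream:
        return sum(1 for _line in stream)

# Demo on a throwaway file; on ol-home the argument would be ol_cdump.txt.gz.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt.gz")
    with gzip.open(demo_path, "wb") as out:
        out.write(b"row one\nrow two\nrow three\n")
    print(count_rows(demo_path))  # prints 3
```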

5 :: Generate the dump of latest revisions of all documents.
This task takes about 6 hours to complete.

(venv)giovanni@ol-home:/1/var/tmp$ gzip -cd ol_cdump.txt.gz | python /opt/openlibrary/openlibrary/scripts/oldump.py sort --tmpdir /1/var/tmp | python /opt/openlibrary/openlibrary/scripts/oldump.py dump | gzip -c > ol_dump_2015-03-11.txt.gz
(venv)giovanni@ol-home:/1/var/tmp$ rm -rf /1/var/tmp/oldumpsort
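
Conceptually, this step reduces the all-revisions dump to one row per document. The sketch below illustrates that idea only. It is not the oldump.py implementation, and the (key, revision, payload) tuples and the keys in the sample are a simplified, made-up stand-in for the real rows.

```python
from itertools import groupby
from operator import itemgetter

def latest_revisions(rows):
    """Yield the highest-revision row for each key.

    rows: iterable of (key, revision, payload) tuples in any order.
    """
    # Sort by key, then revision, so the last row of each group is the newest.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _key, group in groupby(ordered, key=itemgetter(0)):
        *_older, newest = group
        yield newest

# Made-up sample rows for illustration.
sample = [
    ("/works/OL1W", 1, "first draft"),
    ("/authors/OL1A", 1, "only revision"),
    ("/works/OL1W", 2, "second draft"),
]
for row in latest_revisions(sample):
    print(row)
```

The command above passes --tmpdir to the sort step, which suggests the real sort works on disk; the in-memory sorted() call here is only suitable for small samples.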

6 :: Split the dump into authors, editions, works, and redirects

(venv)giovanni@ol-home:/1/var/tmp$ gzip -cd ol_dump_2015-03-11.txt.gz | python /opt/openlibrary/openlibrary/scripts/oldump.py split --format ol_dump_%s_2015-03-11.txt.gz
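
The --format argument looks like a printf-style pattern in which %s stands for the document type. As a minimal sketch, assuming that is how the pattern is expanded, the four output names can be previewed as follows. The helper split_filenames is hypothetical and not part of oldump.py.

```python
# Hypothetical helper, not part of oldump.py: previews the file names that
# a printf-style pattern would yield for each document type.
DOC_TYPES = ["authors", "editions", "works", "redirects"]

def split_filenames(pattern, doc_types=DOC_TYPES):
    """Return one file name per document type by filling in the %s slot."""
    return [pattern % doc_type for doc_type in doc_types]

for name in split_filenames("ol_dump_%s_2015-03-11.txt.gz"):
    print(name)
```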

7 :: Generate the denormalized works dump <<---- TO FIX: the script returns exceptions
Each row of this dump contains a JSON document with the following fields:

  • work – The work document
  • editions – List of the editions that belong to this work
  • authors – All the authors of this work
  • ia – List of IA metadata for all the IA items referenced in the editions
  • duplicates – Dictionary of duplicates (key -> its duplicates) of the work and edition documents mentioned above
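
Once the script works, a row could be checked against the field list above roughly as follows. This is a sketch under the assumption that each row is nothing but the JSON document; the helper parse_deworks_row and the sample keys are made up for illustration.

```python
import json

# Field names taken from the list above.
EXPECTED_FIELDS = {"work", "editions", "authors", "ia", "duplicates"}

def parse_deworks_row(line):
    """Parse one row and report which of the expected fields are absent."""
    doc = json.loads(line)
    missing = EXPECTED_FIELDS - set(doc)
    return doc, sorted(missing)

# Made-up sample row for illustration; real rows are far larger.
sample_row = json.dumps({
    "work": {"key": "/works/OL1W"},
    "editions": [{"key": "/books/OL1M"}],
    "authors": [{"key": "/authors/OL1A"}],
    "ia": [],
    "duplicates": {},
})
doc, missing = parse_deworks_row(sample_row)
print(doc["work"]["key"], missing)
```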

(venv)giovanni@ol-home:/1/var/tmp$ python /opt/openlibrary/openlibrary/scripts/2011/09/generate_deworks.py ol_dump_2015-03-11.txt.gz ia_metadata_dump_2015-03-11.txt.gz | gzip -c > ol_dump_deworks_2015-03-11.txt.gz

(venv)giovanni@ol-home:/1/var/tmp$ ls
ia_metadata_dump_2015-03-11.txt.gz  ol_dump_2015-03-11.txt.gz
ol_dump_redirects_2015-03-11.txt.gz ol_dump_authors_2015-03-11.txt.gz
ol_dump_deworks_2015-03-11.txt.gz   ol_dump_editions_2015-03-11.txt.gz
ol_dump_works_2015-03-11.txt.gz
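
To confirm that a run left all seven files behind, a check along these lines could be used. It is a sketch that assumes every file carries the same date stamp; both helper names are hypothetical.

```python
import os
import tempfile

def expected_dump_files(date):
    """File names this procedure should leave behind for one date stamp."""
    parts = ["", "_authors", "_editions", "_works", "_redirects", "_deworks"]
    names = ["ol_dump%s_%s.txt.gz" % (part, date) for part in parts]
    names.append("ia_metadata_dump_%s.txt.gz" % date)
    return names

def missing_dump_files(directory, date):
    """Return the expected file names that are not present in directory."""
    present = set(os.listdir(directory))
    return [name for name in expected_dump_files(date) if name not in present]

# Demo in a throwaway directory so it runs anywhere; on ol-home the
# directory would be /1/var/tmp.
with tempfile.TemporaryDirectory() as tmp:
    for name in expected_dump_files("2015-03-11")[:2]:
        open(os.path.join(tmp, name), "w").close()
    print(missing_dump_files(tmp, "2015-03-11"))
```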