Here are the instructions for generating the ol_dump.txt.gz files.
ol-home is the right place to do this.
1 :: Dump the data table from ol-db1
This task takes around 1 hour to complete.
giovanni@ol-home:/1/var/tmp$ psql -h ol-db1 -U openlibrary openlibrary -c "copy data to stdout" | gzip -c > data.txt.gz
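One caveat with this and the later "command | gzip -c > file" pipelines: by default the shell reports only the exit status of gzip, so a failure in the first command can leave a truncated but valid-looking .gz file. The following is a minimal bash sketch of one way to guard against that. The function name dump_to_gzip is hypothetical and is not part of the Open Library scripts.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of the Open Library scripts):
# run a command, gzip its stdout into a file, and return non-zero
# if either the command or gzip fails.
dump_to_gzip() {
    local out="$1"
    shift
    # The subshell keeps the pipefail setting from leaking into the caller.
    ( set -o pipefail; "$@" | gzip -c > "$out" )
}
```

Under that assumption, step 1 could be run as: dump_to_gzip data.txt.gz psql -h ol-db1 -U openlibrary openlibrary -c "copy data to stdout"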
2 :: Activate the virtual environment /opt/openlibrary/venv
giovanni@ol-home:/1/var/tmp$ source /opt/openlibrary/venv/bin/activate
3 :: Generate the metadata table dump from archive db
This task takes around 1 hour to complete.
(venv)giovanni@ol-home:/1/var/tmp$ ARCHIVE_DB_PASSWORD=`/opt/.petabox/dbserver`
(venv)giovanni@ol-home:/1/var/tmp$ python /opt/openlibrary/openlibrary/scripts/2012/dump-ia-items.py --host db-current --user archive --password $ARCHIVE_DB_PASSWORD --database archive | gzip -c > ia_metadata_dump_2015-03-11.txt.gz
4 :: Generate the dump of all revisions of all documents.
This task takes around 8 hours to complete.
(venv)giovanni@ol-home:/1/var/tmp$ /opt/openlibrary/openlibrary/scripts/oldump.py cdump data.txt.gz 2015-03-11 | gzip -c > ol_cdump.txt.gz
(venv)giovanni@ol-home:/1/var/tmp$ rm data.txt.gz
5 :: Generate the dump of latest revisions of all documents.
This task takes around 6 hours to complete.
(venv)giovanni@ol-home:/1/var/tmp$ gzip -cd ol_cdump.txt.gz | python /opt/openlibrary/openlibrary/scripts/oldump.py sort --tmpdir /1/var/tmp | python /opt/openlibrary/openlibrary/scripts/oldump.py dump | gzip -c > ol_dump_2015-03-11.txt.gz
(venv)giovanni@ol-home:/1/var/tmp$ rm -rf /1/var/tmp/oldumpsort
6 :: Split the dump into authors, editions, works and redirects
(venv)giovanni@ol-home:/1/var/tmp$ gzip -cd ol_dump_2015-03-11.txt.gz | python /opt/openlibrary/openlibrary/scripts/oldump.py split --format ol_dump_%s_2015-03-11.txt.gz
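The --format value looks like a printf-style template in which %s is replaced by the record type. This is an assumption inferred from the file names in the directory listing at the end of these notes, not from the script's source. A small shell sketch of the file names to expect (the helper name expand_dump_name is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: expand the --format template for one record type.
# Assumes %s is substituted printf-style, as the directory listing suggests.
expand_dump_name() {
    # $1 = template containing a single %s, $2 = record type
    printf "$1\n" "$2"
}

# Print the four file names the split step is expected to produce.
for record_type in authors editions works redirects; do
    expand_dump_name 'ol_dump_%s_2015-03-11.txt.gz' "$record_type"
done
```

If the assumption holds, the loop prints ol_dump_authors_2015-03-11.txt.gz, ol_dump_editions_2015-03-11.txt.gz, ol_dump_works_2015-03-11.txt.gz and ol_dump_redirects_2015-03-11.txt.gz, one per line.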
7 :: Generate the denormalized works dump <<---- TO FIX: the script raises exceptions
Each row of the dump contains a JSON document with the following fields:
- work – The work document
- editions – List of editions that belong to this work
- authors – All the authors of this work
- ia – IA metadata for all the ia items referenced in the editions as a list
- duplicates – Dictionary of duplicates (key -> its duplicates) of the work and edition docs mentioned above
(venv)giovanni@ol-home:/1/var/tmp$ python /opt/openlibrary/openlibrary/scripts/2011/09/generate_deworks.py ol_dump_2015-03-11.txt.gz ia_metadata_dump_2015-03-11.txt.gz | gzip -c > ol_dump_deworks_2015-01-11.txt.gz
(venv)giovanni@ol-home:/1/var/tmp$ ls
ia_metadata_dump_2015-03-11.txt.gz ol_dump_2015-03-11.txt.gz
ol_dump_redirects_2015-03-11.txt.gz ol_dump_authors_2015-03-11.txt.gz
ol_dump_deworks_2015-01-11.txt.gz ol_dump_editions_2015-03-11.txt.gz
ol_dump_works_2015-03-11.txt.gz
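Since the long-running steps can be interrupted, it may be worth confirming that each generated file is an intact gzip archive before publishing it. The sketch below uses gzip -t for that. The function name check_gzip_files is hypothetical, and the check says nothing about whether the contents are complete or correct.

```shell
#!/bin/sh
# Hypothetical helper: test each argument with "gzip -t" and report the result.
# Returns non-zero if at least one file is missing or not an intact gzip file.
check_gzip_files() {
    status=0
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "ok      $f"
        else
            echo "BROKEN  $f"
            status=1
        fi
    done
    return "$status"
}

# Example usage, run from the directory holding the dumps:
#   check_gzip_files *.txt.gz
```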
/olsystem/bin/cron/oldump.sh automates this whole process.