Updates from March, 2015

  • gio 10:00 pm on March 31, 2015 Permalink  

    OL: Interacting with memcached 

    On any host:

    $ cd /olsystem/etc
    $ . /opt/openlibrary/venv/bin/activate
    $ python

    in python:

    >>> import yaml
    >>> import memcache
    >>> y = yaml.safe_load(open('openlibrary.yml'))
    >>> mc = memcache.Client(y['memcache_servers'])


    >>> y['memcache_servers']
    ['ol-mem0:11211', 'ol-mem1:11211', 'ol-mem2:11211']

    -: to GET the memcache entry:

    >>> mc.get('ia.get_metadata-"houseofscorpion00farmrich"')

    -: to DELETE a memcached entry:

    >>> mc.delete('ia.get_metadata-"houseofscorpion00farmrich"')

    [ref: http://dev.blog.archive.org/2014/02/14/manually-deleting-stale-cache-entries-from-ol-memcache]
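
    A minimal sketch (Python 3) of wrapping the GET and DELETE calls above in one helper. The names metadata_cache_key, purge_metadata and FakeClient are hypothetical; FakeClient is only an in-memory stand-in so the snippet runs without a memcached server. On a real host you would pass the mc object created above instead.

```python
def metadata_cache_key(item_id):
    # Key shape copied from the example above: ia.get_metadata-"<item id>"
    return 'ia.get_metadata-"%s"' % item_id


def purge_metadata(client, item_id):
    """Fetch the cached entry (None if absent), delete it, return the old value."""
    key = metadata_cache_key(item_id)
    cached = client.get(key)
    client.delete(key)
    return cached


class FakeClient(object):
    """In-memory stand-in exposing the get/set/delete calls used here."""

    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value
        return True

    def delete(self, key):
        self.store.pop(key, None)
        return True


if __name__ == "__main__":
    mc = FakeClient()
    mc.set(metadata_cache_key("houseofscorpion00farmrich"), {"title": "stale"})
    print(purge_metadata(mc, "houseofscorpion00farmrich"))
    print(mc.get(metadata_cache_key("houseofscorpion00farmrich")))
```

    Returning the old value before deleting makes it easy to log what was purged.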

  • gio 10:21 pm on March 30, 2015 Permalink  

    GitLab for the Internet Archive 

    Tips for admins:

    :: Install and configure GitLab.

    :: How to convert a SVN repository to GIT.

    :: How to add users in bulk on GitLab.

    Cookbooks for users:

    :: Pro Git: a good book with everything you need to know to use Git.

    :: Git – SVN: a crash course for switching from svn to git.

    :: Become a Git guru: a good tutorial.

  • gio 9:24 pm on March 27, 2015 Permalink  

    GitLab: how to add users in bulk 

    For this task we will use the GitLab API.

    1. Obtain the PRIVATE-TOKEN

    You need a token to authenticate your API requests:

    giovanni@vm-gittest:~/git$ curl "http://vm-gittest.us.archive.org/api/v3/session" --data 'login=root&password=ThePassword' | python -mjson.tool
    {
        "avatar_url": "http://www.gravatar.com/avatar/e64c7d89f26bd197acbd4d13d7dd61?s=40&d=identicon",
        "bio": null,
        "can_create_group": true,
        "can_create_project": true,
        "color_scheme_id": 1,
        "created_at": "2015-03-23T17:41:18.649Z",
        "email": "admin@example.com",
        "id": 1,
        "identities": [],
        "is_admin": true,
        "linkedin": "",
        "name": "Administrator",
        "private_token": "2xyxyxysyxsyxsUe8",
        "projects_limit": 10000,
        "skype": "",
        "state": "active",
        "theme_id": 2,
        "twitter": "",
        "username": "root",
        "website_url": ""
    }

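    As a small sketch (Python 3), the token can be pulled out of the session response programmatically instead of being copied by hand. The function name private_token is hypothetical, and the sample body is a trimmed copy of the output above with its placeholder token.

```python
import json


def private_token(response_body):
    """Return the private_token field from a /session JSON response body."""
    return json.loads(response_body)["private_token"]


if __name__ == "__main__":
    # Trimmed copy of the response shown above (placeholder token).
    body = '{"id": 1, "username": "root", "private_token": "2xyxyxysyxsyxsUe8"}'
    print(private_token(body))
```
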
    2. Create the users

    You can add a new user on GitLab using the command:

    curl --header "PRIVATE-TOKEN: YourPrivateToken" -d "email=user@archive.org&password=defaultPassword&username=username&name=name" "http://vm-gittest.us.archive.org/api/v3/users"

    To add users in bulk we need a file containing all the users to be created like:

    mark = mark <mark@archive.org>
    zella = zella <zella@archive.org>
    uie = uie <uie@archive.org>
    kers = kers <kers@archive.org>

    and run the command:

    for i in `awk '{print "email="$1"@archive.org&password=defaultPassword&username="$1"&name="$1}' authors-transform.txt`;
       do  curl --header "PRIVATE-TOKEN: YourPrivateToken" -d "$i" "http://vm-gittest.us.archive.org/api/v3/users";
    done

    WARNING: at this point all the users will receive a confirmation email to activate their account.
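
    The awk step above can be sketched in Python 3 as follows. This only builds the form-encoded request bodies and sends nothing. The function name user_payload is hypothetical, and unlike the shell loop it takes the email from between the angle brackets rather than appending @archive.org to the username.

```python
from urllib.parse import urlencode


def user_payload(line, password="defaultPassword"):
    """Turn 'mark = mark <mark@archive.org>' into a form-encoded request body."""
    username = line.split("=", 1)[0].strip()
    email = line[line.index("<") + 1:line.index(">")]
    return urlencode({
        "email": email,
        "password": password,
        "username": username,
        "name": username,
    })


if __name__ == "__main__":
    for entry in ["mark = mark <mark@archive.org>", "zella = zella <zella@archive.org>"]:
        print(user_payload(entry))
```

    urlencode percent-encodes the "@" in the address, which the server decodes back to the same value that curl -d sends unencoded.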

  • gio 11:32 pm on March 26, 2015 Permalink  

    Git: how to convert a SVN repository to GIT 

    For the conversion we follow John Albin’s document.

    1. Retrieve a list of all Subversion committers

    From the root of your local Subversion checkout, run this command:

    svn log -q | awk -F '|' '/^r/ {sub("^ ", "", $2); sub(" $", "", $2); print $2" = "$2" <"$2">"}' | sort -u > authors-transform_alpha.txt

    Git requires an email address for each author, so you have to edit each line, converting:

    fred = fred <fred>

    into:

    fred = fred <fred@archive.org>

    For our purposes we can do that quickly by running this command:

     sed "s/>/@archive.org>/g" authors-transform_alpha.txt > authors-transform.txt
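
    For illustration, the awk and sed steps above can be combined in one Python 3 function. This is a sketch, assuming the input is the text printed by svn log -q, where each revision line has the form "r42 | fred | date"; the function name authors_from_svn_log and the sample log are hypothetical.

```python
def authors_from_svn_log(log_text, domain="archive.org"):
    """Build sorted, de-duplicated 'user = user <user@domain>' lines."""
    users = set()
    for line in log_text.splitlines():
        # Revision lines look like: r42 | fred | 2015-03-26 ... (separators are skipped)
        if line.startswith("r") and "|" in line:
            users.add(line.split("|")[1].strip())
    return ["%s = %s <%s@%s>" % (u, u, u, domain) for u in sorted(users)]


if __name__ == "__main__":
    sample = "\n".join([
        "-" * 72,
        "r2 | fred | 2015-03-26 11:32:00 -0700 (Thu, 26 Mar 2015)",
        "-" * 72,
        "r1 | mark | 2015-03-25 09:00:00 -0700 (Wed, 25 Mar 2015)",
        "-" * 72,
    ])
    for author_line in authors_from_svn_log(sample):
        print(author_line)
```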

    2. Clone the Subversion repository using git-svn

    This will do the standard git-svn transformation (using the authors-transform.txt file you created in step 1) and place the git repository in the “~/temp” folder inside your home directory.

    git svn clone svn://home.us.archive.org/petabox -A authors-transform.txt --stdlayout --prefix=origin/ ~/temp

    3. Convert svn:ignore properties to .gitignore

    cd ~/temp
    git svn show-ignore > .gitignore
    git add .gitignore
    git commit -m 'Convert svn:ignore properties to .gitignore.'

    4. Push repository to a new project on GitLab

    git remote add gitlab git@git.domain.com.au:dev-team/favourite-project.git
    git push --set-upstream gitlab master

    Congratulations, you are done!

    —The following instructions are useful only if you are not using GitLab and you want a bare repository—

    4b. Push repository to a bare git repository
    First create the new bare repository (for example with: git init --bare ~/petabox.git), then push the temp repository to it.

    cd ~/temp
    git remote add bare ~/petabox.git
    git config remote.bare.push 'refs/remotes/*:refs/heads/*'
    git push bare

    5. Rename “trunk” branch to “master”
    Your main development branch will be named “trunk”, which matches its name in Subversion. You’ll want to rename it to Git’s standard “master” branch using:

    cd ~/petabox.git
    git branch -m trunk master

    6. Clean up branches and tags
    git-svn makes all of Subversion’s tags into very short branches in Git of the form “tags/name”. You’ll want to convert all those branches into actual Git tags using:

    cd ~/petabox.git
    git for-each-ref --format='%(refname)' refs/heads/tags |
    cut -d / -f 4 |
    while read ref
    do
      git tag "$ref" "refs/heads/tags/$ref";
      git branch -D "tags/$ref";
    done
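
    To make the loop's effect explicit, here is a sketch in Python 3 that lists the git commands it runs for each ref, without executing anything. The function name tag_commands is hypothetical. It takes the fourth "/"-separated field exactly as cut does, so a tag name that itself contains "/" would be truncated in the same way.

```python
def tag_commands(refnames):
    """Return the git commands that turn refs/heads/tags/<name> branches into tags."""
    commands = []
    for refname in refnames:
        name = refname.split("/")[3]  # same field as `cut -d / -f 4`
        commands.append(["git", "tag", name, "refs/heads/tags/" + name])
        commands.append(["git", "branch", "-D", "tags/" + name])
    return commands


if __name__ == "__main__":
    for command in tag_commands(["refs/heads/tags/v1.0", "refs/heads/tags/v1.1"]):
        print(" ".join(command))
```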

  • gio 6:29 pm on March 26, 2015 Permalink
    Tags: howto, petabox   

    Petabox: git-svn quick how-to 

    How to commit:

    • rebase: git svn rebase
    • edit files: vim foo.ff
    • git add foo.ff
    • git commit -m 'note about the commit'
    • git svn rebase
    • if the rebase fails because of memory problems: ulimit -v unlimited and rebase again
    • dry commit: git svn dcommit --dry-run
    • check the patch: git diff-tree ea56092a94b7b0266cdfbb08f69245d7761bba09~1 ea56092a94b7b0266cdfbb08f69245d7761bba09 -p
    • git svn dcommit
  • gio 5:25 pm on March 26, 2015 Permalink
    Tags: stats   

    OL: graphs statistics 

    OL statistics are shown in two different graphs:

    Screen Shot 2015-03-26 at 10.17.14 AM

    Screen Shot 2015-03-26 at 10.18.25 AM

    They are generated through two scripts:

    :: /opt/openlibrary/openlibrary/scripts/ipstats.py runs on ol-www1
    and creates the graph: Screen Shot 2015-03-31 at 3.50.06 PM

    :: /opt/openlibrary/openlibrary/scripts/store_counts.py runs on ol-home
    To generate the count stats of the past n days, execute the command:

    giovanni@ol-home:~$ sudo -s
    root@ol-home:/home/giovanni# su openlibrary
    openlibrary@ol-home:/home/giovanni$ source /opt/openlibrary/venv/bin/activate
    (venv)openlibrary@ol-home:/home/giovanni$ cd /opt/openlibrary/openlibrary/scripts/

    and run the command:

    $ python store_counts.py /opt/openlibrary/olsystem/etc/infobase.yml /opt/openlibrary/olsystem/etc/openlibrary.yml /opt/openlibrary/olsystem/etc/coverstore.yml  n

    this creates the graphs: Screen Shot 2015-03-31 at 3.49.59 PM

    The scripts run on the schedule defined in /etc/cron.d/openlibrary:

    0 * * * * openlibrary /olsystem/bin/verify-node.sh ol-home && /olsystem/bin/olenv $SCRIPTS/store_counts.py /opt/openlibrary/olsystem/etc/infobase.yml /opt/openlibrary/olsystem/etc/openlibrary.yml /opt/openlibrary/olsystem/etc/coverstore.yml 1
    0 * * * * www-data /olsystem/bin/verify-node.sh ol-www1 && /olsystem/bin/olenv $SCRIPTS/ipstats.py  /opt/openlibrary/olsystem/etc/openlibrary.yml
    59 23 * * * openlibrary /olsystem/bin/verify-node.sh ol-home && /olsystem/bin/olenv $SCRIPTS/store_counts.py /opt/openlibrary/olsystem/etc/infobase.yml /opt/openlibrary/olsystem/etc/openlibrary.yml /opt/openlibrary/olsystem/etc/coverstore.yml 1
    59 23 * * * www-data /olsystem/bin/verify-node.sh ol-www1 && /olsystem/bin/olenv $SCRIPTS/ipstats.py  /opt/openlibrary/olsystem/etc/openlibrary.yml

    When the graphs are broken for some reason, you have to run the scripts manually.
    Be careful to run them for the right window of days.

    See the code for the details:

  • gio 4:15 pm on March 26, 2015 Permalink  

    Cluster RePublisher: Workflow 

    Screen Shot 2015-03-26 at 9.15.05 AM
    Thanks to Raj for the graph.

  • gio 11:33 pm on March 23, 2015 Permalink  

    GitLab: Install and configure 

    Installing and configuring a GitLab server on vm-gittest.us.archive.org

    For the basic requirements please check the document

    Install and configure the necessary dependencies:

    giovanni@vm-gittest:~$ sudo apt-get install openssh-server
    giovanni@vm-gittest:~$ sudo apt-get install postfix

    Download and install the Debian package for Ubuntu 14.04:

    giovanni@vm-gittest:~$ wget https://downloads-packages.s3.amazonaws.com/ubuntu-14.04/gitlab_7.9.0-omnibus.2-1_amd64.deb
    giovanni@vm-gittest:~$ sudo dpkg -i gitlab_7.9.0-omnibus.2-1_amd64.deb

    We want to run CPU_NUM+1 unicorn workers, so we edit the file /etc/gitlab/gitlab.rb and add the lines:

    unicorn['worker_timeout'] = 60
    unicorn['worker_processes'] = 3
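
    The CPU_NUM+1 rule can be sketched in Python 3 as below; with 2 CPUs it yields the 3 workers used above. The function name unicorn_settings is hypothetical, and the script only prints the two lines to paste into /etc/gitlab/gitlab.rb, it does not edit the file.

```python
import os


def unicorn_settings(cpu_count, timeout=60):
    """Return gitlab.rb lines for CPU_NUM+1 unicorn workers."""
    return [
        "unicorn['worker_timeout'] = %d" % timeout,
        "unicorn['worker_processes'] = %d" % (cpu_count + 1),
    ]


if __name__ == "__main__":
    # os.cpu_count() may return None on some platforms; fall back to 1.
    for line in unicorn_settings(os.cpu_count() or 1):
        print(line)
```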

    Re-configure GitLab

    giovanni@vm-gittest:~$ sudo gitlab-ctl reconfigure

    Start GitLab

    giovanni@vm-gittest:~$ sudo gitlab-ctl start

    The default root credentials are:

    Username: root 
    Password: 5iveL!fe

    The server needs a few minutes to become usable; until then you will receive a 502 status code.

    To make sure the UI is reachable only within the IA network we have to edit the file
    adding the lines

    server {
      deny   all;

    The git test vm is reachable at: http://vm-gittest.us.archive.org/
    (the admin/root password is the usual one)

  • gio 11:00 pm on March 11, 2015 Permalink

    OL: how to generate the dump files, step-by-step 

    Here are the instructions for generating the ol_dump.txt.gz files.

    ol-home is the right place to do this.

    1 :: Dumping the data table from ol-db1
    this task requires around 1 hour to complete.

    giovanni@ol-home:/1/var/tmp$ psql -h ol-db1 -U openlibrary openlibrary -c "copy data to stdout" | gzip -c > data.txt.gz

    2 :: Activate the virtual environment /opt/openlibrary/venv

    giovanni@ol-home:/1/var/tmp$ source /opt/openlibrary/venv/bin/activate

    3 :: Generate the metadata table dump from archive db
    this task requires around 1 hour to complete.

    (venv)giovanni@ol-home:/1/var/tmp$ ARCHIVE_DB_PASSWORD=`/opt/.petabox/dbserver`
    (venv)giovanni@ol-home:/1/var/tmp$ python /opt/openlibrary/openlibrary/scripts/2012/dump-ia-items.py --host db-current --user archive --password $ARCHIVE_DB_PASSWORD --database archive | gzip -c > ia_metadata_dump_2015-03-11.txt.gz

    4 :: Generate the dump of all revisions of all documents.
    this task requires around 8 hours to complete.

    (venv)giovanni@ol-home:/1/var/tmp$ /opt/openlibrary/openlibrary/scripts/oldump.py cdump data.txt.gz 2015-03-11 | gzip -c > ol_cdump.txt.gz
    (venv)giovanni@ol-home:/1/var/tmp$ rm data.txt.gz

    5 :: Generate the dump of latest revisions of all documents.
    this task requires around 6 hours to complete.

    (venv)giovanni@ol-home:/1/var/tmp$ gzip -cd ol_cdump.txt.gz | python /opt/openlibrary/openlibrary/scripts/oldump.py sort --tmpdir /1/var/tmp | python /opt/openlibrary/openlibrary/scripts/oldump.py dump | gzip -c > ol_dump_2015-03-11.txt.gz
    (venv)giovanni@ol-home:/1/var/tmp$ rm -rf /1/var/tmp/oldumpsort

    6 :: Splitting the Dump into authors, editions, works, redirects

    (venv)giovanni@ol-home:/1/var/tmp$ gzip -cd ol_dump_2015-03-11.txt.gz | python /opt/openlibrary/openlibrary/scripts/oldump.py split --format ol_dump_%s_2015-03-11.txt.gz
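
    A rough sketch (Python 3) of what the split step produces, not the actual oldump.py code. It assumes each dump row is tab-separated with the record type (for example /type/author) in the first column, and that the %s in --format is filled with authors, editions, works or redirects; both the TYPE_TO_NAME mapping and the function name split_rows are assumptions made for illustration.

```python
TYPE_TO_NAME = {
    "/type/author": "authors",
    "/type/edition": "editions",
    "/type/work": "works",
    "/type/redirect": "redirects",
}


def split_rows(rows, fmt="ol_dump_%s_2015-03-11.txt.gz"):
    """Group rows by target file name; rows of other types are skipped."""
    buckets = {}
    for row in rows:
        record_type = row.split("\t", 1)[0]
        name = TYPE_TO_NAME.get(record_type)
        if name is not None:
            buckets.setdefault(fmt % name, []).append(row)
    return buckets


if __name__ == "__main__":
    sample = [
        "/type/author\t/authors/OL1A\t1\t2015-03-11T00:00:00\t{}",
        "/type/edition\t/books/OL1M\t1\t2015-03-11T00:00:00\t{}",
    ]
    for file_name, grouped in sorted(split_rows(sample).items()):
        print(file_name, len(grouped))
```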

    7 :: Generate the denormalized works Dump <<---- TO FIX: the script returns exceptions
    where each row contains a JSON document with the following fields:

    • work – The work documents
    • editions – List of editions that belong to this work
    • authors – All the authors of this work
    • ia – IA metadata for all the ia items referenced in the editions as a list
    • duplicates – dictionary of duplicates (key -> its duplicates) of work and edition docs mentioned above

    (venv)giovanni@ol-home:/1/var/tmp$ python /opt/openlibrary/openlibrary/scripts/2011/09/generate_deworks.py ol_dump_2015-03-11.txt.gz ia_metadata_dump_2015-03-11.txt.gz | gzip -c > ol_dump_deworks_2015-01-11.txt.gz
    (venv)giovanni@ol-home:/1/var/tmp$ ls
    ia_metadata_dump_2015-03-11.txt.gz  ol_dump_2015-03-11.txt.gz
    ol_dump_redirects_2015-03-11.txt.gz ol_dump_authors_2015-03-11.txt.gz
    ol_dump_deworks_2015-01-11.txt.gz   ol_dump_editions_2015-03-11.txt.gz