AC UserManual

NOTE: Curly braces {} in the following notes mark variables and must be removed when typing commands in the terminal.

Storage

Home directory

Once you have your PIC account, you can access the UI machines:

ssh {USER}@ui.pic.es

and you are logged in to your "home":

~{USER}

This directory is your main storage area for software, scripts, logs, and long-term data files. It is backed up and has a capacity of 10 GiB.
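To keep an eye on the 10 GiB capacity mentioned above, the following is a minimal sketch that sums the sizes of the files below a directory. The function name `directory_size_bytes` and the constant `HOME_QUOTA_BYTES` are hypothetical, and the result is the apparent file size, which may differ from what the storage system actually accounts for.

```python
import os

HOME_QUOTA_BYTES = 10 * 1024 ** 3  # 10 GiB, the home capacity stated above


def directory_size_bytes(root):
    """Sum the sizes of all regular files below root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return total
```

For example, `directory_size_bytes(os.path.expanduser("~"))` can be compared against `HOME_QUOTA_BYTES`. Walking a large home directory may take a while.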

Massive storage

Each project has (in general) a massive storage space accessible at the following path:

/pnfs/pic.es/data/astro/{PROJECT} (ask for the actual path to your contact person)

which has only read permissions for the project's users.

Inside the directory there are two different paths corresponding to two different back-ends:

Tape

/pnfs/pic.es/data/astro/{PROJECT}/tape

As its name suggests, the data in the tape path is stored on magnetic tapes. It holds critical data, such as raw data or data that is very difficult to obtain again. For technical reasons, files are usually large, from 1-2 GB up to 100-200 GB (they are usually iso or tar.bz2 files). Data on tape is not accessed very often.

Before accessing any file on tape, you MUST notify your contact person so they can pre-stage the files you require. You must also state the interval during which you need access to those files. The pre-stage operation reads all the data you requested and puts it on a disk buffer. Only after that will your files be readable (using the same path). Once the specified interval has passed, the pre-staged files are removed from the disk buffer and are no longer readable.

Disk

/pnfs/pic.es/data/astro/{PROJECT}/disk

Disk data is usually the data currently in use by the project, and it is accessed very often. File size is not important here.

Scratch

Each user has a scratch space at the following path:

/nfs/astro/{USER}

This space is intended as a volatile sandbox. If you produce results that may be important for the project, ask your contact person and they will move the data into the /pnfs storage.

Please note that all data older than 6 months may be erased at any time without prior notice.

Storing data in any location other than the paths above is not allowed; such contents will be erased on sight.
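Since scratch data older than 6 months may be erased, it can help to list such files before they disappear. The following is a minimal sketch, not an official tool: the manual does not say how the age is measured, so using the modification time and approximating 6 months as 183 days are both assumptions, and the name `files_older_than` is hypothetical.

```python
import os
import time

SIX_MONTHS_SECONDS = 183 * 24 * 3600  # approximation of "6 months" (assumption)


def files_older_than(root, max_age_seconds=SIX_MONTHS_SECONDS, now=None):
    """Return the sorted paths below root whose modification time is older than the limit."""
    if now is None:
        now = time.time()
    cutoff = now - max_age_seconds
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if mtime < cutoff:
                old.append(path)
    return sorted(old)
```

Calling `files_older_than("/nfs/astro/{USER}")` (with your user name filled in) returns the candidates to hand over to your contact person or to clean up.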

Working with Python environments (in UI)

1.1 Environments are usually all saved in the same directory (e.g. ~/env). If it does not exist yet, create it:

mkdir ~/env/

1.2 Create a new environment ({PYTHON_VERSION} = 2.7.14):

cd ~/env/
/software/astro/sl6/python/{PYTHON_VERSION}/bin/virtualenv {ENV_NAME}

1.3 Activate environment:

source ~/env/{ENV_NAME}/bin/activate

1.4 Upgrade pip (only needed the first time):

pip install --upgrade pip

1.5 Install any package you need (if you have any problem with a package, please contact us)

e.g. the numpy package:

pip install numpy

1.6 To see the different packages included in the environment:

pip freeze
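To confirm that the activated environment is the one actually in use, the interpreter can be asked directly. This is a small sketch with a hypothetical helper name, `in_virtualenv`; it relies on `sys.real_prefix` (set by classic virtualenv) and on `sys.base_prefix` differing from `sys.prefix` (newer virtualenv and venv).

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Classic virtualenv sets sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer virtualenv and venv make sys.base_prefix differ from sys.prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("interpreter: {}".format(sys.executable))
print("inside a virtual environment: {}".format(in_virtualenv()))
```

When the environment is active, the interpreter path printed should be below ~/env/{ENV_NAME}/bin.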

Accessing a remote jupyter notebook

These are the instructions to work with a jupyter notebook running in a workernode at PIC from your web browser.

After creating and activating a virtual environment, you will need to create an SSH tunnel from your computer to the workernode through the UI in order to access the notebook.

These are the steps you have to follow:

From one terminal, log in to a UI:

ssh {USER}@ui.pic.es

Log in to the ASTRO workernode:

ssh {USER}@wn.astro.pic.es

Activate the virtual environment (in case it has not been created yet, see previous section):

. ~/env/{ENV_NAME}/bin/activate

In case jupyter is not already installed:

pip install jupyter
jupyter-notebook --generate-config
jupyter-notebook password     # (for security reasons when opening the notebook in your browser afterwards)

Execute the jupyter notebook command

jupyter-notebook --ip='*' --no-browser

Note: In the terminal output, one of the lines will contain a message like this one:

[I 15:44:17.162 NotebookApp] The Jupyter Notebook is running at:
[I 15:44:17.162 NotebookApp] http://[all ip addresses on your system]:{WN_PORT}/

Please take note of the value of {WN_PORT}.
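If you capture the start-up output in a file or script, the port can be extracted automatically. The sketch below uses a hypothetical helper, `notebook_port`, and assumes the log line has the shape shown above; the port 8888 in the usage example is an illustrative value, not one taken from this manual.

```python
import re

# Matches the port in a URL such as "http://<host>:8888/"; the host part may
# contain spaces, as in "[all ip addresses on your system]".
_PORT_RE = re.compile(r"https?://.*?:(\d+)/")


def notebook_port(log_line):
    """Return the port number found in a Jupyter start-up log line, or None."""
    match = _PORT_RE.search(log_line)
    return int(match.group(1)) if match else None
```

For instance, `notebook_port("[I 15:44:17.162 NotebookApp] http://[all ip addresses on your system]:8888/")` gives the integer port to use as {WN_PORT}.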

Open another terminal and create a tunnel from your laptop to the workernode through the UI:

Choose any {LOCAL_PORT} higher than 1024, e.g. 9000.

ssh -L {LOCAL_PORT}:wn.astro.pic.es:{WN_PORT} ui.pic.es

From a web browser on your local computer, access the following URL:

http://localhost:{LOCAL_PORT}
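The tunnel command above can also be assembled programmatically, which makes it harder to mix up the two ports. This is only a sketch: `tunnel_command` is a hypothetical helper, and the optional `user` argument is an addition for the case where your local user name differs from your PIC one.

```python
def tunnel_command(local_port, remote_host, remote_port, gateway="ui.pic.es", user=None):
    """Build the argument list for the 'ssh -L' tunnel described above."""
    for name, port in (("local_port", local_port), ("remote_port", remote_port)):
        if not 1 <= int(port) <= 65535:
            raise ValueError("{} must be between 1 and 65535".format(name))
    if int(local_port) <= 1024:
        raise ValueError("local_port should be higher than 1024")
    # Log in to the gateway as 'user' when given, otherwise let ssh decide.
    target = gateway if user is None else "{}@{}".format(user, gateway)
    forward = "{}:{}:{}".format(int(local_port), remote_host, int(remote_port))
    return ["ssh", "-L", forward, target]
```

The returned list can be passed to `subprocess.call`, or joined with spaces and pasted into a terminal.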

Download code and git rules

These are the git rules for developers at PIC.

The methodology described below is an attempt to support the team's code development and is aimed at non-expert git users.

It has been compiled from the official git documentation, which we strongly recommend reading (at least the first three chapters):

Git documentation

And from this git branch model:

A successful Git branching model

We assume you already have a PIC account and have already created a Python virtual environment.

Code is hosted at https://gitlab.pic.es.

1. Download the code (first you need to have permissions to do it)

1.1. Access ui:

ssh {user}@ui.pic.es

1.2 Software is usually stored in the same directory:

mkdir ~/src/

1.3 Create a directory in which you are going to develop your code, e.g.:

mkdir -p ~/src/{software_project_name}

1.4 Clone the code from the gitlab repository into the created directory:

cd ~/src/{software_project_name}
git clone https://gitlab01.pic.es/{software_project_name}/{pipeline}.git

1.5 Activate your environment:

. ~/env/{env_name}/bin/activate

1.6 Locate the `setup.py` file of the project (usually in the main software directory), and deploy the code:

cd ~/src/{software_project_name}
pip install -e .

2. Create your own branch

Every project has two main protected branches: master and develop. Protected means that you, as a standard developer, do not have permission to write to them. Therefore, to develop your features you need to create your own branch, which must always be created from the develop branch.

2.1 Enter the project directory:

cd {pipeline}

2.2 Create the branch:

git checkout -b feature_branch_name origin/develop

3. Modify the code

3.1 Daily tip:

Every day, before you start modifying the code, bring your branch up to date with the changes made in the develop branch:

3.1.1 Download any modifications to the code

git fetch

3.1.2 Incorporate changes in the develop branch into your feature_branch_name branch

git rebase origin/develop

(Hopefully there will be no conflicts if all developers are working in independent branches. If this is not the case and you have doubts after reading the references given in point 5 below, please call us before messing it up!)
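If you prefer to script the two daily commands of 3.1.1 and 3.1.2, the sketch below wraps them with `subprocess`. The helper names `daily_sync_commands` and `run_daily_sync` are hypothetical; `subprocess.check_call` raises an exception as soon as a command fails, so a rebase that stops on a conflict is not silently ignored.

```python
import subprocess


def daily_sync_commands(base="origin/develop"):
    """Return the git commands of steps 3.1.1 and 3.1.2 as argument lists."""
    return [["git", "fetch"], ["git", "rebase", base]]


def run_daily_sync(repo_dir, base="origin/develop", runner=subprocess.check_call):
    """Run the commands inside repo_dir, stopping at the first failure."""
    for command in daily_sync_commands(base):
        runner(command, cwd=repo_dir)
```

A call such as `run_daily_sync("/path/to/your/clone")` is meant to be run while your feature branch is checked out; the path is a placeholder.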

3.2 See the changes you have made

git status

3.3 Add changes

git add changed_files

3.4 Commit changes

git commit -m "message describing the modifications"

4. Finish the new feature

Once you have finished developing, debugging and testing the new feature, send us an email.

We will immediately send you back another one saying that your feature has been integrated into the develop branch.

Note that your branch will be deleted.

5. Incorporate changes and start a new feature again

In order to start a new feature, you need to incorporate the changes we just made (integrating the feature into develop) and create a new branch:

git fetch

git checkout -b another_feature_branch_name origin/develop

Jupyter notebook on Spark

From one terminal, log in to a UI:

ssh {user}@ui.pic.es

Log in to the DATA.ASTRO machine:

ssh {user}@data.astro.pic.es

Create a new virtual environment (only necessary the first time). It is mandatory that it is created from the DATA.ASTRO machine. If ~/env does not exist yet, create it first with mkdir ~/env/.

cd ~/env/
virtualenv {env_name}

Activate the environment and install Jupyter in it:

. ~/env/{env_name}/bin/activate
pip install --upgrade pip
pip install --upgrade setuptools
pip install jupyter

Go to the directory where you have your notebooks or where you want to create a new one (if you don't have one, create it first with mkdir ~/{notebooks_directory}):

cd ~/{notebooks_directory}

It is necessary that all code you want to use is visible and accessible from every node in the Hadoop cluster. The best way is to only use shared filesystems, such as /nfs/astro.

Copy the following command into your favourite text editor and set {data_astro_port} (e.g. 8895) and {env_name}. This command starts a DYNAMIC application, which requests additional executors as more resources are needed and releases them as soon as possible. This avoids resource starvation for Hive queries:

SPARK_MAJOR_VERSION=2 ENV=~/env/{env_name} HADOOP_CONF_DIR=/etc/hadoop/conf PYSPARK_PYTHON=${ENV}/bin/python PYSPARK_DRIVER_PYTHON=${ENV}/bin/jupyter \
PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip='data.astro.pic.es' --port {data_astro_port}" pyspark --master yarn --driver-memory 2G \
--executor-cores 3 --executor-memory 9G --conf spark.yarn.executor.memoryOverhead=0 --conf spark.python.profile=true --conf \
spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.rdd.compress=true --conf spark.kryoserializer.buffer.max=128m --conf \
spark.driver.maxResultSize=2G --driver-library-path \
/usr/lib/jvm/java-1.8.0-openjdk/jre/lib/amd64/server/:/usr/hdp/2.6.1.0-129/usr/lib/:/usr/hdp/current/hadoop-client/lib/native --conf \
spark.executor.extraLibraryPath=/usr/lib/jvm/java-1.8.0-openjdk/jre/lib/amd64/server/:/usr/hdp/2.6.1.0-129/usr/lib/:\
/usr/hdp/current/hadoop-client/lib/native

Note: In the terminal output, one of the lines will tell you what to do with the URL:

Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://data.astro.pic.es:{data_astro_port}/?token=7bbee6c505a9044c900c33b6555da81949efcf2d46a42e5c

Please take note of the value of {data_astro_port}.

Open another terminal and create a tunnel from your laptop to the DATA.ASTRO machine through the UI:

Choose any {local_port} higher than 1024, e.g. 9000.

ssh -L {local_port}:data.astro.pic.es:{data_astro_port} ui.pic.es

From a web browser on your local computer, access the following URL:

http://localhost:{local_port}
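When the notebook asks for a token, the URL printed by Jupyter can be rewritten so that it points at the local end of the tunnel while keeping the token. This is a minimal sketch with a hypothetical helper, `local_notebook_url`; it assumes the notebook accepts the token unchanged when reached through the tunnel.

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit  # Python 2


def local_notebook_url(remote_url, local_port):
    """Point a Jupyter URL at localhost:local_port, keeping path, query and token."""
    parts = urlsplit(remote_url)
    netloc = "localhost:{}".format(int(local_port))
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Pass it the URL copied from the terminal together with your {local_port}, and open the result in the browser.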

It will take some time to initialize the Spark Context.

Once it has been initialized, it will be accessible through the 'sc' global variable.

Example

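import numpy as np
import pandas as pd

from scipic.mocks.hod.base import Galaxy
from scipic.mocks.hod.kravtsov import Kravtsov

from pyspark.sql import HiveContext, Row

# 'sc' is the Spark Context made available by the pyspark launcher.
hive = HiveContext(sc)
hive.sql('USE cosmohub').collect()
cat = hive.sql('SELECT unique_halo_id AS halo_id, lmhalo AS mass FROM micecatv2_0 LIMIT 20').cache()

# HOD model instance; named so that it is not confused with the loop variables below.
kravtsov = Kravtsov(12, 13, 1)

def gals(partition):
    # Materialize the rows of this partition; an empty partition yields no rows.
    data = list(partition)
    if not data:
        return []
    df = pd.DataFrame(data, columns=data[0].__fields__)
    # Populate the halos with galaxies; each key's 'name' attribute becomes a column name.
    # (.iteritems() is Python 2, matching the Python version used for the environments.)
    hod = {key.name: value for key, value in kravtsov.galaxies(df.halo_id, df.mass).iteritems()}
    df = pd.DataFrame(hod)

    # Convert each DataFrame row back into a Spark Row.
    return [Row(**t._asdict()) for t in df.itertuples(index=False)]

# In Spark 2, mapPartitions is available on the underlying RDD, not on the DataFrame.
print(cat.rdd.mapPartitions(gals, True).collect())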