Difference between revisions of "AC UserManual"
(Updated python env instructions) |
|||
(60 intermediate revisions by 5 users not shown) | |||
Line 1: | Line 1: | ||
+ | NOTE: '''Brackets {} in the following notes have to be removed when typing in the terminal'''. They are used to define variables. | ||
+ | |||
== Storage == | == Storage == | ||
+ | |||
+ | === Home directory === | ||
+ | |||
+ | Once you have your PIC account you are able to access the UI's machines: | ||
+ | |||
+ | ssh {USER}@ui.pic.es | ||
+ | |||
+ | and you are you are logged in to your "home": | ||
+ | |||
+ | ~{USER} | ||
+ | |||
+ | This directory is your main place for storage for software, scripts, logs, and long term data files. | ||
+ | It is backed-up and has 10GiB of capacity. | ||
+ | |||
+ | |||
=== Massive storage === | === Massive storage === | ||
− | Each project has a massive storage space accessible at the following path: | + | Each project has (in general) a massive storage space accessible at the following path: |
− | /pnfs/pic.es/data/astro/{ | + | /pnfs/pic.es/data/astro/{PROJECT} (ask for the actual path to your contact person) |
which has only read permissions for the project's users. | which has only read permissions for the project's users. | ||
Line 8: | Line 25: | ||
==== Tape ==== | ==== Tape ==== | ||
− | /pnfs/pic.es/data/astro/{ | + | /pnfs/pic.es/data/astro/{PROJECT}/tape |
As its name suggests, the data in the tape path is stored in magnetic tapes, and is critical, such as raw data or very difficult data to obtain or to get. | As its name suggests, the data in the tape path is stored in magnetic tapes, and is critical, such as raw data or very difficult data to obtain or to get. | ||
The size of each file is usually large, from 1-2GB to 100-200GB, due to technical reasons (they are usually ''iso'' or ''tar.bz2'' files). | The size of each file is usually large, from 1-2GB to 100-200GB, due to technical reasons (they are usually ''iso'' or ''tar.bz2'' files). | ||
− | Data in tapes is not very often accessed. | + | Data in tapes is not very often accessed. |
+ | '''Before accessing any file on tape, you MUST notify your contact person so they can perform a pre-stage on the files you require'''. You have to provide also the interval during which you need to access those files. The pre-stage operation will read all the data you requested and put them on a disk buffer. Only after that, your files will be readable (using the same path). After the specified interval has passed, the pre-staged files will be removed from the disk buffer and be no longer readable. | ||
==== Disk ==== | ==== Disk ==== | ||
− | /pnfs/pic.es/data/astro/{ | + | /pnfs/pic.es/data/astro/{PROJECT}/disk |
Disk data is usually the data being currently used by the project, and it is being very often accessed. The size of the files is not important here. | Disk data is usually the data being currently used by the project, and it is being very often accessed. The size of the files is not important here. | ||
Line 20: | Line 38: | ||
Each user has a scratch space at the following path: | Each user has a scratch space at the following path: | ||
− | /nfs/astro/{ | + | /nfs/astro/{USER} |
− | This space is thought as a sandbox. | + | This space is thought as a volatile sandbox. If you produce results that may be important for the project, ask your contact person and they will move the data into the /pnfs storage. |
− | + | ||
+ | Please note that all '''data older than 6 months may be erased at any time''' without prior notice. | ||
+ | |||
+ | '''Any location not included in the former paths is not allowed and its contents erased on sight.''' | ||
+ | |||
+ | == Using the HTCondor computing farm == | ||
+ | |||
+ | The HTCondor computing cluster is designed to run many independent jobs in parallel. That means that in contrast to MPI jobs there is no interdependence or communication between them. | ||
+ | |||
+ | Let's say we want to run a python script several times with different input parameters. Here, we show the example of calculating the shear maps of a single seed of a dark matter simulation. First of all, connect to the userinterface ui04.pic.es, which is the one with a CentOS7 operating system: | ||
+ | ssh neissner@ui04.pic.es | ||
+ | |||
+ | Now, you would need the executable python script. Put that somewhere in your home directory: | ||
+ | [neissner@ui04 ~]$ ls -la ~neissner/scripts/python/test/test.py | ||
+ | -rwxr-xr-x 1 neissner pic 5942 Apr 24 15:03 /nfs/pic.es/user/n/neissner/scripts/python/test/test.py | ||
+ | |||
+ | We can have a look which modules are required in order to create the virtual environment: | ||
+ | [neissner@ui04 ~]$ cat scripts/python/test/test.py | grep import | ||
+ | import os | ||
+ | import sys | ||
+ | import numpy as np | ||
+ | import healpy as hp | ||
+ | import pandas as pd | ||
+ | import bz2 | ||
+ | from pathlib import Path | ||
+ | import pycdlib | ||
+ | import io | ||
+ | import pyarrow | ||
+ | |||
+ | With this information we create our virtual environment: | ||
+ | [neissner@ui04 ~]$ /software/astro/centos7/python/3.7.2/bin/python3 -m venv ~neissner/env/test | ||
+ | [neissner@ui04 ~]$ . ~neissner/env/test/bin/activate | ||
+ | (test) [neissner@ui04 ~]$ pip install --upgrade pip | ||
+ | [...] | ||
+ | (test) [neissner@ui04 ~]$ pip install --no-cache-dir numpy healpy pandas pathlib io pycdlib pyarrow | ||
+ | [...] | ||
+ | (test) [neissner@ui04 ~]$ deactivate | ||
+ | |||
+ | The python script needs three input parameters: the seed number (here 600), the lowest healpix step (here 134) and the healpix step to be calculated (here a number between 134 and 400) and three input files: | ||
+ | [neissner@ui04 ~]$ ls /nfs/astro/neissner/data/test/ | ||
+ | kappa.npy seed00600.iso steps.ssv | ||
+ | |||
+ | It is more comfortable to run the python program from inside a bash script. Write a bash script that looks the following way: | ||
+ | [neissner@ui04 ~]$ cat ~neissner/scripts/bash/test/test.sh | ||
+ | #!/bin/bash | ||
+ | SEED=$1 | ||
+ | LSTEP=$2 | ||
+ | STEP=$3 | ||
+ | . ~neissner/env/test/bin/activate | ||
+ | python ~neissner/scripts/python/test/test.py $SEED $LSTEP $STEP | ||
+ | deactivate | ||
+ | |||
+ | The bash script requires the same input parameters as the python script. | ||
+ | |||
+ | Now we can execute the bash script on one of the HTCondor execution hosts by sending it to the scheduler. In order to do that we need a submit file that looks like this: | ||
+ | |||
+ | [neissner@ui04 ~]$ cat test.sub | ||
+ | executable = /nfs/pic.es/user/n/neissner/scripts/bash/test/test.sh | ||
+ | args = 600 134 134 | ||
+ | output = output-600-134.out | ||
+ | error = error-600-134.err | ||
+ | log = log-600-134.log | ||
+ | request_memory=6GB | ||
+ | request_cpus=2 | ||
+ | +experiment="des" | ||
+ | queue | ||
+ | |||
+ | This creates a single job to be run in the computing cluster. Now we want to run al steps from 134 to 400 at once, therefore we change the submit file to this one: | ||
+ | |||
+ | [neissner@ui04 ~]$ cat test.sub | ||
+ | executable = /nfs/pic.es/user/n/neissner/scripts/bash/test/test.sh | ||
+ | args = 600 134 $(Item) | ||
+ | output = output-600-$(Item).out | ||
+ | error = error-600-$(Item).err | ||
+ | log = log-600-$(Item).log | ||
+ | request_memory=6GB | ||
+ | request_cpus=2 | ||
+ | +experiment="des" | ||
+ | queue from seq 134 1 400 | | ||
+ | |||
+ | This will create 267 independent jobs that can run simultaneously in the computing cluster. | ||
+ | |||
+ | Finally we can send the job to the HTCondor scheduler: | ||
+ | [neissner@ui04 ~]$ ssh submit01.pic.es 'condor_submit test.sub | ||
+ | Submitting job(s)...................................................................................... | ||
+ | ................................................................................................................. | ||
+ | .................................................................... | ||
+ | 267 job(s) submitted to cluster 1486. | ||
+ | |||
+ | The cluster number is the job ID which can be used to monitor the progress: | ||
+ | |||
+ | [neissner@ui04 ~]$ ssh submit01.pic.es 'condor_q -all' | ||
+ | -- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 04/25/19 12:48:58 | ||
+ | OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS | ||
+ | neissner ID: 1486 4/25 12:47 _ 4 263 267 1486.0-266 | ||
+ | Total for query: 267 jobs; 0 completed, 0 removed, 263 idle, 4 running, 0 held, 0 suspended | ||
+ | Total for all users: 267 jobs; 0 completed, 0 removed, 263 idle, 4 running, 0 held, 0 suspended | ||
+ | |||
+ | E.g. in this example 4 out of 267 jobs are already running while 263 are still in the queue. This will progress in time. | ||
+ | [neissner@ui04 ~]$ ssh submit01.pic.es 'condor_q -all' | ||
+ | -- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 04/25/19 13:01:13 | ||
+ | OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS | ||
+ | neissner ID: 1486 4/25 12:47 4 8 255 267 1486.4-266 | ||
+ | Total for query: 263 jobs; 0 completed, 0 removed, 255 idle, 8 running, 0 held, 0 suspended | ||
+ | Total for all users: 263 jobs; 0 completed, 0 removed, 255 idle, 8 running, 0 held, 0 suspended | ||
+ | |||
+ | |||
+ | == Creating a python environment to work on wn-el7.astro.pic.es == | ||
+ | |||
+ | 1. Login to UI | ||
+ | |||
+ | ssh {USER}@ui.pic.es | ||
+ | |||
+ | 2. Login to data.astro.pic.es (because virtualenv3.6 is installed there): | ||
+ | |||
+ | ssh {USER}@data.astro.pic.es | ||
+ | |||
+ | 3. Usually environments are all saved in the same directory (e.g. ~/env). | ||
+ | 3.1. Create an environment (if not yet created): | ||
+ | |||
+ | mkdir ~/env/ | ||
+ | cd ~/env/ | ||
+ | virtualenv-3.6 {ENV_NAME} | ||
+ | |||
+ | 5. Activate environment | ||
+ | |||
+ | source ~/env/{ENV_NAME}/bin/activate | ||
+ | |||
+ | 6. Upgrade pip command (the environment has to be activated): | ||
+ | |||
+ | pip install --upgrade pip | ||
+ | |||
+ | 7. Install all necessary libraries | ||
+ | |||
+ | pip install numpy scipy matplotlib astropy pandas jupyter psycopg2 psycopg2-binary | ||
+ | |||
+ | 8. Open another terminal and login to wn-el7.astro.pic.es: | ||
+ | |||
+ | ssh {USER}@ui.pic.es | ||
+ | ssh {USER}@wn-el7.astro.pic.es | ||
+ | |||
+ | 9. Activate environment | ||
+ | |||
+ | . {ENV_NAME}/bin/activate | ||
+ | |||
+ | 10. '''Go to the directory where your notebook is''' and launch jupyter-notebook: | ||
+ | |||
+ | jupyter-notebook --ip=$(hostname) --no-browser | ||
+ | |||
+ | Note1: In the prompt, in one of the lines that appear, there will be a message like this one: | ||
+ | |||
+ | [I 15:44:17.162 NotebookApp] The Jupyter Notebook is running at: | ||
+ | [I 15:44:17.162 NotebookApp] http://[all ip addresses on your system]:'''{WN_PORT}'''/ | ||
+ | |||
+ | Please, take note of the value of {WN_PORT}. | ||
+ | |||
+ | 11. Open another terminal and create a tunnel from your laptop to the workernode through the UI: | ||
+ | Choose any {LOCAL_PORT} higher than 1024, i.e. 9000. | ||
+ | |||
+ | ssh -L {LOCAL_PORT}:wn-el7.astro.pic.es:{WN_PORT} {USER}@ui.pic.es | ||
+ | |||
+ | * From a web browser in your local computer, access the following url: | ||
+ | |||
+ | http://localhost:{LOCAL_PORT} | ||
+ | |||
+ | == Working with Python environments (in UI) == | ||
+ | |||
+ | 1.1 Usually environments are all saved in the same directory (e.g. ~/env). In case it is not created: | ||
+ | |||
+ | mkdir ~/env/ | ||
+ | |||
+ | 1.2 Create a new environment (python_version = 2.7.14): | ||
+ | cd ~/env/ | ||
+ | /software/astro/sl6/python/{PYTHON_VERSION}/bin/virtualenv {ENV_NAME} | ||
+ | |||
+ | 1.3 Activate environment: | ||
+ | |||
+ | source ~/env/{ENV_NAME}/bin/activate | ||
+ | |||
+ | 1.4 Update pip command (only for the first time): | ||
+ | |||
+ | pip install --upgrade pip | ||
+ | |||
+ | 1.5 Install any package you need (in case you have any problem with some package, please contact us) | ||
+ | |||
+ | e.g numpy package: | ||
+ | pip install numpy | ||
+ | |||
+ | 1.6 To see the different packages included in the environment: | ||
+ | pip freeze | ||
+ | |||
+ | == Accessing a remote jupyter notebook == | ||
+ | |||
+ | These are the instructions to work with a jupyter notebook running in a workernode at PIC from your web browser. | ||
+ | |||
+ | After creating and activating a virtual environment, you will need to create an SSH tunnel from your computer to the workernode through the UI in order to access the notebook. | ||
+ | |||
+ | These are the steps you have to follow: | ||
+ | |||
+ | * From one terminal login in a UI: | ||
+ | |||
+ | ssh {USER}@ui.pic.es | ||
+ | |||
+ | * Login in the ASTRO workernode: | ||
+ | |||
+ | ssh {USER}@wn.astro.pic.es | ||
+ | |||
+ | * Activate the virtual environment (in case it has not been created yet, see previous section): | ||
+ | |||
+ | source ~/env/{ENV_NAME}/bin/activate | ||
+ | |||
+ | * In case jupyter is not already installed: | ||
+ | |||
+ | pip install jupyter | ||
+ | jupyter-notebook --generate-config | ||
+ | jupyter-notebook password # (for security reasons when opening the notebook in your browser afterwards) | ||
+ | |||
+ | * Execute the jupyter notebook command | ||
− | + | jupyter-notebook --ip='*' --no-browser (try --ip=$(hostname) instead of --ip='*' when there ir an error) | |
− | + | Note1: In the prompt, in one of the lines that appear, there will be a message like this one: | |
− | |||
− | |||
− | |||
− | + | [I 15:44:17.162 NotebookApp] The Jupyter Notebook is running at: | |
− | / | + | [I 15:44:17.162 NotebookApp] http://[all ip addresses on your system]:'''{WN_PORT}'''/ |
− | + | Please, take note of the value of {WN_PORT}. | |
+ | |||
+ | |||
+ | * Open another terminal and create a tunnel from your laptop to the workernode through the UI: | ||
+ | Choose any {LOCAL_PORT} higher than 1024, i.e. 9000. | ||
+ | ssh -L {LOCAL_PORT}:wn.astro.pic.es:{WN_PORT} {USER}@ui.pic.es | ||
+ | * From a web browser in your local computer, access the following url: | ||
+ | http://localhost:{LOCAL_PORT} | ||
== Download code and git rules == | == Download code and git rules == | ||
Line 54: | Line 293: | ||
− | We assume you already have a PIC account | + | We assume you already have a PIC account and you have already created a python virtual environment |
Codes are hosted at https://gitlab.pic.es. | Codes are hosted at https://gitlab.pic.es. | ||
Line 60: | Line 299: | ||
1. Download the code (first you need to have permissions to do it) | 1. Download the code (first you need to have permissions to do it) | ||
− | + | 1.1. Access ui: | |
− | + | ssh {USER}@ui.pic.es | |
− | + | 1.2 Usually software is stored in the same directory: | |
− | + | mkdir ~/src/ | |
− | + | 1.2 Create a directory in which you are going to develop your codes, e.g: | |
− | + | mkdir -p ~/src/{software_project_name} | |
− | + | 1.3 Copy the code from the gitlab repository in the created directory: | |
− | + | cd ~/src/{software_project_name} | |
+ | git clone https://gitlab01.pic.es/{software_project_name}/{pipeline}.git | ||
− | + | 1.4 Activate your environment | |
+ | |||
+ | source ~/env/{ENV_NAME}/bin/activate | ||
+ | 1.5 Locate the `setup.py` file of the project (usually in the main software directory), and deploit the code: | ||
+ | |||
+ | cd ~/src/{software_project_name} | ||
+ | pip install -e . | ||
2. Create your own branch | 2. Create your own branch | ||
Line 90: | Line 336: | ||
git checkout -b feature_branch_name origin/develop | git checkout -b feature_branch_name origin/develop | ||
− | |||
3. Modify the code | 3. Modify the code | ||
Line 107: | Line 352: | ||
(Hopefully there will be no conflicts if all developers are working in ''independent'' branches. If this is not the case and you have doubts after reading the references given in point 5 below, please call us before meshing it up!) | (Hopefully there will be no conflicts if all developers are working in ''independent'' branches. If this is not the case and you have doubts after reading the references given in point 5 below, please call us before meshing it up!) | ||
− | |||
3.2 See the changes you have done | 3.2 See the changes you have done | ||
Line 137: | Line 381: | ||
git checkout -b anoter_feature_branch_name origin/develop | git checkout -b anoter_feature_branch_name origin/develop | ||
+ | == Jupyter notebook on Spark == | ||
+ | |||
+ | * From one terminal login in a UI: | ||
+ | |||
+ | ssh {USER}@ui.pic.es | ||
+ | |||
+ | * Login in the DATA.ASTRO machine: | ||
+ | |||
+ | ssh {USER}@data.astro.pic.es | ||
+ | |||
+ | * Create a new virtual environment (necessary for the first time only) '''BUT it is mandatory that it has been created from the data.astro machine''': | ||
+ | (in case ~/env is not created: mkdir ~/env/) | ||
+ | cd ~/env/ | ||
+ | virtualenv {ENV_NAME} | ||
+ | source ~/env/{ENV_NAME}/bin/activate | ||
+ | pip install --upgrade pip | ||
+ | |||
+ | * In case jupyter is not already installed: | ||
+ | |||
+ | pip install jupyter | ||
+ | jupyter-notebook --generate-config | ||
+ | jupyter-notebook password # (for security reasons when opening the notebook in your browser afterwards) | ||
+ | |||
+ | * Go to the directory where you have your notebooks or where you want to create a new one: | ||
+ | |||
+ | cd ~/notebooks (in case you don't have one, just create it: mkdir ~/notebooks) | ||
+ | |||
+ | It is necessary that all code you want to use is visible and accessible from every node in the Hadoop cluster. | ||
+ | The best way is to only use a shared filesystems, such as your home in ~{USER}, or /nfs/astro/{USER}. | ||
+ | |||
+ | * Launch Jupyter using our helper script ('''NOTE: when launching the jupyter on spark, you CANNOT have the environment active'''): | ||
+ | |||
+ | /software/astro/scripts/jupyter_pyspark.sh {ENV_NAME} | ||
+ | |||
+ | Note: In the prompt, in one of the lines that appear, there will be a message to tell you what to do with the url: | ||
+ | |||
+ | Copy/paste this URL into your browser when you connect for the first time, | ||
+ | to login with a token: | ||
+ | http://data.astro.pic.es:{DATA_ASTRO_PORT} | ||
+ | |||
+ | Please, take note of the value of {DATA_ASTRO_PORT}. | ||
+ | |||
+ | * Open another terminal and create a tunnel from your laptop to the DATA.ASTRO through the UI: | ||
+ | Choose any {LOCAL_PORT} higher than 1024, i.e. 9000. | ||
+ | |||
+ | ssh -L {LOCAL_PORT}:data.astro.pic.es:{DATA_ASTRO_PORT} {USER}@ui.pic.es | ||
+ | |||
+ | * From a web browser in your local computer, access the following url: | ||
+ | |||
+ | http://localhost:{LOCAL_PORT} | ||
+ | |||
+ | |||
+ | |||
+ | It will take some time to initialize the Spark Context. | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | Once it has been initialized, it will be accessible through the 'sc' global variable. | ||
+ | |||
+ | === Example === | ||
+ | |||
+ | import numpy as np | ||
+ | import pandas as pd | ||
+ | |||
+ | from scipic.mocks.hod.base import Galaxy | ||
+ | from scipic.mocks.hod.kravtsov import Kravtsov | ||
+ | |||
+ | from pyspark.sql import Row | ||
+ | |||
+ | hive = HiveContext(sc) | ||
+ | hive.sql('USE cosmohub').collect() | ||
+ | cat = hive.sql('SELECT unique_halo_id AS halo_id, lmhalo as mass FROM micecatv2_0 LIMIT 20').cache() | ||
+ | |||
+ | k = Kravtsov(12, 13, 1) | ||
+ | |||
+ | def gals(p): | ||
+ | data = list(p) | ||
+ | df = pd.DataFrame(data, columns=data[0].__fields__) | ||
+ | hod = {k.name:v for k,v in k.galaxies(df.halo_id, df.mass).iteritems()} | ||
+ | df = pd.DataFrame(hod) | ||
+ | |||
+ | return [Row(**fields) for fields in [t._asdict() for t in df.itertuples(index=False)]] | ||
+ | |||
+ | print cat.mapPartitions(gals, True).collect() | ||
+ | |||
+ | == Necessary information to ingest a catalog into CosmoHub / PAUdm == | ||
+ | |||
+ | Two files: | ||
+ | |||
+ | - The catalog itself in the current accepted formats: CSV, CSV.BZ2, FITS and HDF5. | ||
+ | Files should occupy > 100Mb and < 1Gb. | ||
+ | |||
+ | - The metadata of the catalog: a yaml file with the following information: | ||
+ | |||
+ | - name: Name of the catalogue | ||
+ | - version: Version of the catalogue | ||
+ | - software_version: Code version to create the catalogue | ||
+ | - date_create: When the catalogue was created | ||
+ | - description: Brief description (one liner) | ||
+ | - summary: Summary of the catalogue | ||
+ | - fields: | ||
+ | - column1: Column 1 description | ||
+ | - column2: Column 2 description | ||
+ | |||
+ | In addition, the following fields are necessary for ingestion into PAUdm. | ||
+ | |||
+ | - paudm_input_production_id : MEMBA production_id | ||
+ | |||
+ | == Ingest PAUdm catalogs into CosmoHub == | ||
+ | |||
+ | === Forced aperture production (production = 941) === | ||
+ | |||
+ | |||
+ | * Get catalog from paudb (from UI): | ||
+ | |||
+ | psql -U readonly -W -h db.pau.pic.es dm -c "COPY (SELECT * FROM public.forced_aperture_coadd WHERE production_id = 941) TO stdout DELIMITER ',' CSV HEADER" | lbzip2 | pv > forced_aperture_coadd_production_id_941.csv.bz2 | ||
+ | |||
+ | * Copy to pnfs | ||
+ | |||
+ | cp forced_aperture_coadd_production_id_941.csv.bz2 /pnfs/pic.es/data/astro/cosmohub/disk/raw/paudm/forced_aperture_coadd/forced_aperture_coadd_production_id_941.csv.bz2 | ||
+ | |||
+ | * Beeline: | ||
+ | |||
+ | beeline --color=true --showHeader=true --verbose=true --silent=false -u "jdbc:hive2://ambarisrv02.pic.es:2181,ambarisrv03.pic.es:2181,ambarisrv01.pic.es:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2" | ||
+ | |||
+ | * Create partition (from beeline) | ||
+ | |||
+ | ALTER TABLE jcarrete.paudm_forced_aperture_coadd_csv ADD IF NOT EXISTS PARTITION(production_id=941); | ||
+ | |||
+ | * Copy csv.bz2 to hdfs: | ||
+ | |||
+ | hdfs dfs -put /nfs/astro/jcarrete/sandbox/forced_aperture_coadd_production_id_941.csv.bz2 /user/jcarrete/data/paudm/forced_aperture_coadd/production_id\=941/forced_aperture_coadd_production_id_941.csv.bz2 | ||
+ | |||
+ | * Insert overwrite cosmohub table: | ||
+ | |||
+ | INSERT OVERWRITE TABLE cosmohub.paudm_forced_aperture_coadd PARTITION(production_id=941) | ||
+ | SELECT `ref_id`, `band`, `flux`, `flux_error`, `chi2`, `n_coadd`, `run` | ||
+ | FROM jcarrete.paudm_forced_aperture_coadd_csv WHERE production_id=941; | ||
+ | |||
+ | === Production table === | ||
+ | |||
+ | psql -U readonly -W -h db.pau.pic.es dm -c "COPY (SELECT * FROM public.production) TO stdout DELIMITER ',' CSV HEADER" | pv > production_table_04_11_2019.csv | ||
+ | hdfs dfs -rm /user/jcarrete/data/paudm/production/* | ||
+ | hdfs dfs -copyFromLocal /nfs/astro/jcarrete/sandbox/production_table_04_11_2019.csv /user/jcarrete/data/paudm/production/production_table_04_11_2019.csv | ||
+ | INSERT OVERWRITE TABLE cosmohub.paudm_production SELECT id, input_production, pipeline, release, software_version, job_id, comments, created FROM jcarrete.paudm_production_csv WHERE id IS NOT NULL; | ||
+ | |||
+ | === PAUdm photoz results from M.Eriksen file === | ||
+ | The file is here: | ||
− | + | /nfs/astro/eriksen/kcorr/bcnz_v29.parquet | |
− | + | * I use python pyarrow library to read the parquet file and saved it into csv. These are the lines in python: | |
− | + | import pyarrow.parquet as pq | |
− | + | data = pq.read_table('/nfs/astro/eriksen/kcorr/bcnz_v29.parquet') | |
+ | df_new = data.to_pandas() | ||
+ | df_new['production_id'] = 952 | ||
+ | df_new.reset_index(inplace=True) | ||
+ | column_list = ['production_id', 'ref_id', 'zb', 'odds', 'pz_width', 'zb_mean', 'chi2', 'n_band', 'ebv', 'qz', 'best_run'] | ||
+ | df_new[column_list].to_csv('/nfs/astro/jcarrete/sandbox/parquet_bcnz_v29_some_fields.csv', header=True, index=False) | ||
− | = | + | * Add by hand in the paudb, in the production table, the production of the new photoz data production id = 952 |
− | + | * Create partition (from beeline) | |
− | + | ALTER TABLE jcarrete.paudm_photoz_bcnz_updated_v1_1_csv ADD IF NOT EXISTS PARTITION(production_id=952); | |
− | + | * Copy csv to hdfs: | |
− | + | hdfs dfs -put /nfs/astro/jcarrete/sandbox/parquet_bcnz_v29_some_fields.csv /user/jcarrete/data/paudm/photoz_bcnz_updated_v1_1_csv/production_id\=952/parquet_bcnz_v29_some_fields.csv | |
− | |||
− | + | * Insert overwrite table: | |
− | + | INSERT OVERWRITE TABLE cosmohub.paudm_photoz_bcnz_updated_v1_1 PARTITION(production_id=952) | |
− | + | SELECT ref_id, zb, odds, pz_width, zb_mean, chi2, n_band, ebv, qz, best_run | |
+ | FROM jcarrete.paudm_photoz_bcnz_updated_v1_1_csv WHERE production_id = 952 AND ref_id IS NOT NULL; |
Latest revision as of 17:28, 4 November 2019
NOTE: Brackets {} in the following notes have to be removed when typing in the terminal. They are used to define variables.
Storage
Home directory
Once you have your PIC account you are able to access the UI's machines:
ssh {USER}@ui.pic.es
and you are you are logged in to your "home":
~{USER}
This directory is your main place for storage for software, scripts, logs, and long term data files. It is backed-up and has 10GiB of capacity.
Massive storage
Each project has (in general) a massive storage space accessible at the following path:
/pnfs/pic.es/data/astro/{PROJECT} (ask for the actual path to your contact person)
which has only read permissions for the project's users.
Inside the directory there are two different paths corresponding to two different back-ends:
Tape
/pnfs/pic.es/data/astro/{PROJECT}/tape
As its name suggests, the data in the tape path is stored in magnetic tapes, and is critical, such as raw data or very difficult data to obtain or to get. The size of each file is usually large, from 1-2GB to 100-200GB, due to technical reasons (they are usually iso or tar.bz2 files). Data in tapes is not very often accessed. Before accessing any file on tape, you MUST notify your contact person so they can perform a pre-stage on the files you require. You have to provide also the interval during which you need to access those files. The pre-stage operation will read all the data you requested and put them on a disk buffer. Only after that, your files will be readable (using the same path). After the specified interval has passed, the pre-staged files will be removed from the disk buffer and be no longer readable.
Disk
/pnfs/pic.es/data/astro/{PROJECT}/disk
Disk data is usually the data being currently used by the project, and it is being very often accessed. The size of the files is not important here.
Scratch
Each user has a scratch space at the following path:
/nfs/astro/{USER}
This space is thought as a volatile sandbox. If you produce results that may be important for the project, ask your contact person and they will move the data into the /pnfs storage.
Please note that all data older than 6 months may be erased at any time without prior notice.
Any location not included in the former paths is not allowed and its contents erased on sight.
Using the HTCondor computing farm
The HTCondor computing cluster is designed to run many independent jobs in parallel. That means that in contrast to MPI jobs there is no interdependence or communication between them.
Let's say we want to run a python script several times with different input parameters. Here, we show the example of calculating the shear maps of a single seed of a dark matter simulation. First of all, connect to the userinterface ui04.pic.es, which is the one with a CentOS7 operating system:
ssh neissner@ui04.pic.es
Now, you would need the executable python script. Put that somewhere in your home directory:
[neissner@ui04 ~]$ ls -la ~neissner/scripts/python/test/test.py -rwxr-xr-x 1 neissner pic 5942 Apr 24 15:03 /nfs/pic.es/user/n/neissner/scripts/python/test/test.py
We can have a look which modules are required in order to create the virtual environment:
[neissner@ui04 ~]$ cat scripts/python/test/test.py | grep import import os import sys import numpy as np import healpy as hp import pandas as pd import bz2 from pathlib import Path import pycdlib import io import pyarrow
With this information we create our virtual environment:
[neissner@ui04 ~]$ /software/astro/centos7/python/3.7.2/bin/python3 -m venv ~neissner/env/test [neissner@ui04 ~]$ . ~neissner/env/test/bin/activate (test) [neissner@ui04 ~]$ pip install --upgrade pip [...] (test) [neissner@ui04 ~]$ pip install --no-cache-dir numpy healpy pandas pathlib io pycdlib pyarrow [...] (test) [neissner@ui04 ~]$ deactivate
The python script needs three input parameters: the seed number (here 600), the lowest healpix step (here 134) and the healpix step to be calculated (here a number between 134 and 400) and three input files:
[neissner@ui04 ~]$ ls /nfs/astro/neissner/data/test/ kappa.npy seed00600.iso steps.ssv
It is more comfortable to run the python program from inside a bash script. Write a bash script that looks the following way:
[neissner@ui04 ~]$ cat ~neissner/scripts/bash/test/test.sh #!/bin/bash SEED=$1 LSTEP=$2 STEP=$3 . ~neissner/env/test/bin/activate python ~neissner/scripts/python/test/test.py $SEED $LSTEP $STEP deactivate
The bash script requires the same input parameters as the python script.
Now we can execute the bash script on one of the HTCondor execution hosts by sending it to the scheduler. In order to do that we need a submit file that looks like this:
[neissner@ui04 ~]$ cat test.sub executable = /nfs/pic.es/user/n/neissner/scripts/bash/test/test.sh args = 600 134 134 output = output-600-134.out error = error-600-134.err log = log-600-134.log request_memory=6GB request_cpus=2 +experiment="des" queue
This creates a single job to be run in the computing cluster. Now we want to run al steps from 134 to 400 at once, therefore we change the submit file to this one:
[neissner@ui04 ~]$ cat test.sub executable = /nfs/pic.es/user/n/neissner/scripts/bash/test/test.sh args = 600 134 $(Item) output = output-600-$(Item).out error = error-600-$(Item).err log = log-600-$(Item).log request_memory=6GB request_cpus=2 +experiment="des" queue from seq 134 1 400 |
This will create 267 independent jobs that can run simultaneously in the computing cluster.
Finally we can send the job to the HTCondor scheduler:
[neissner@ui04 ~]$ ssh submit01.pic.es 'condor_submit test.sub Submitting job(s)...................................................................................... ................................................................................................................. .................................................................... 267 job(s) submitted to cluster 1486.
The cluster number is the job ID which can be used to monitor the progress:
[neissner@ui04 ~]$ ssh submit01.pic.es 'condor_q -all' -- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 04/25/19 12:48:58 OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS neissner ID: 1486 4/25 12:47 _ 4 263 267 1486.0-266 Total for query: 267 jobs; 0 completed, 0 removed, 263 idle, 4 running, 0 held, 0 suspended Total for all users: 267 jobs; 0 completed, 0 removed, 263 idle, 4 running, 0 held, 0 suspended
E.g. in this example 4 out of 267 jobs are already running while 263 are still in the queue. This will progress in time.
[neissner@ui04 ~]$ ssh submit01.pic.es 'condor_q -all' -- Schedd: submit01.pic.es : <193.109.174.82:9618?... @ 04/25/19 13:01:13 OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS neissner ID: 1486 4/25 12:47 4 8 255 267 1486.4-266 Total for query: 263 jobs; 0 completed, 0 removed, 255 idle, 8 running, 0 held, 0 suspended Total for all users: 263 jobs; 0 completed, 0 removed, 255 idle, 8 running, 0 held, 0 suspended
Creating a python environment to work on wn-el7.astro.pic.es
1. Login to UI
ssh {USER}@ui.pic.es
2. Login to data.astro.pic.es (because virtualenv3.6 is installed there):
ssh {USER}@data.astro.pic.es
3. Usually environments are all saved in the same directory (e.g. ~/env). 3.1. Create an environment (if not yet created):
mkdir ~/env/ cd ~/env/ virtualenv-3.6 {ENV_NAME}
5. Activate environment
source ~/env/{ENV_NAME}/bin/activate
6. Upgrade pip command (the environment has to be activated):
pip install --upgrade pip
7. Install all necessary libraries
pip install numpy scipy matplotlib astropy pandas jupyter psycopg2 psycopg2-binary
8. Open another terminal and login to wn-el7.astro.pic.es:
ssh {USER}@ui.pic.es ssh {USER}@wn-el7.astro.pic.es
9. Activate environment
. {ENV_NAME}/bin/activate
10. Go to the directory where your notebook is and launch jupyter-notebook:
jupyter-notebook --ip=$(hostname) --no-browser
Note1: In the prompt, in one of the lines that appear, there will be a message like this one:
[I 15:44:17.162 NotebookApp] The Jupyter Notebook is running at: [I 15:44:17.162 NotebookApp] http://[all ip addresses on your system]:{WN_PORT}/
Please, take note of the value of {WN_PORT}.
11. Open another terminal and create a tunnel from your laptop to the workernode through the UI: Choose any {LOCAL_PORT} higher than 1024, i.e. 9000.
ssh -L {LOCAL_PORT}:wn-el7.astro.pic.es:{WN_PORT} {USER}@ui.pic.es
- From a web browser in your local computer, access the following url:
http://localhost:{LOCAL_PORT}
Working with Python environments (in UI)
1.1 Usually environments are all saved in the same directory (e.g. ~/env). In case it is not created:
mkdir ~/env/
1.2 Create a new environment (python_version = 2.7.14):
cd ~/env/ /software/astro/sl6/python/{PYTHON_VERSION}/bin/virtualenv {ENV_NAME}
1.3 Activate environment:
source ~/env/{ENV_NAME}/bin/activate
1.4 Update pip command (only for the first time):
pip install --upgrade pip
1.5 Install any package you need (in case you have any problem with some package, please contact us)
e.g numpy package:
pip install numpy
1.6 To see the different packages included in the environment:
pip freeze
Accessing a remote jupyter notebook
These are the instructions to work with a jupyter notebook running in a workernode at PIC from your web browser.
After creating and activating a virtual environment, you will need to create an SSH tunnel from your computer to the workernode through the UI in order to access the notebook.
These are the steps you have to follow:
- From one terminal login in a UI:
ssh {USER}@ui.pic.es
- Login in the ASTRO workernode:
ssh {USER}@wn.astro.pic.es
- Activate the virtual environment (in case it has not been created yet, see previous section):
source ~/env/{ENV_NAME}/bin/activate
- In case jupyter is not already installed:
pip install jupyter jupyter-notebook --generate-config jupyter-notebook password # (for security reasons when opening the notebook in your browser afterwards)
- Execute the jupyter notebook command
jupyter-notebook --ip='*' --no-browser (try --ip=$(hostname) instead of --ip='*' when there ir an error)
Note1: In the prompt, in one of the lines that appear, there will be a message like this one:
[I 15:44:17.162 NotebookApp] The Jupyter Notebook is running at: [I 15:44:17.162 NotebookApp] http://[all ip addresses on your system]:{WN_PORT}/
Please, take note of the value of {WN_PORT}.
- Open another terminal and create a tunnel from your laptop to the workernode through the UI:
Choose any {LOCAL_PORT} higher than 1024, i.e. 9000.
ssh -L {LOCAL_PORT}:wn.astro.pic.es:{WN_PORT} {USER}@ui.pic.es
- From a web browser in your local computer, access the following url:
http://localhost:{LOCAL_PORT}
Download code and git rules
These are the git rules for developers at PIC.
The methodology written below is a try to help the code development of the team and they are thought for non-experts git users.
It has been compiled from the official git documentation, which we strongly recommend to look at (at least the first three chapters):
And from this git branch model:
A successful Git branching model
We assume you already have a PIC account and you have already created a python virtual environment
Codes are hosted at https://gitlab.pic.es.
1. Download the code (first you need to have permissions to do it)
1.1. Access ui:
ssh {USER}@ui.pic.es
1.2 Usually software is stored in the same directory:
mkdir ~/src/
1.2 Create a directory in which you are going to develop your codes, e.g:
mkdir -p ~/src/{software_project_name}
1.3 Copy the code from the gitlab repository in the created directory:
cd ~/src/{software_project_name} git clone https://gitlab01.pic.es/{software_project_name}/{pipeline}.git
1.4 Activate your environment
source ~/env/{ENV_NAME}/bin/activate
1.5 Locate the `setup.py` file of the project (usually in the main software directory), and deploit the code:
cd ~/src/{software_project_name} pip install -e .
2. Create your own branch
Every project has two main protected branches: master and develop. Protected means you, as a standard developer, will not have permissions to write on them. Therefore in order to develop your features in the code you need to create your own branch that will always come from the develop branch.
2.1 Enter in the project directory:
cd {pipeline}
2.2 Create the branch:
git checkout -b feature_branch_name origin/develop
3. Modify the code
3.1 Day Tip:
Everyday you sit in your computer and want to modify the code, in order not to be outdated in the changes made in the develop branch, you should do:
3.1.1 Download any modification in the code
git fetch
3.1.2 Incorporate changes in the develop branch into your feature_branch_name branch
git rebase origin/develop
(Hopefully there will be no conflicts if all developers are working in independent branches. If this is not the case and you have doubts after reading the references given in point 5 below, please call us before meshing it up!)
3.2 See the changes you have done
git status
3.3 Add changes
git add changed_files
3.4 Commit changes
git commit -m "message describing the modifications"
4. Finish the new feature
Once you finish to develop, debug and test the new feature you send us an email.
We immediately will send you back another one saying that your feature has been integrated into the develop branch.
Note that your branch will be deleted.
5. Incorporate changes and start a new feature again
In order to start with a new feature you need to incorporate the changes we just did (integrate the feature into develop) and create another new branch:
git fetch
git checkout -b anoter_feature_branch_name origin/develop
Jupyter notebook on Spark
- From one terminal login in a UI:
ssh {USER}@ui.pic.es
- Login in the DATA.ASTRO machine:
ssh {USER}@data.astro.pic.es
- Create a new virtual environment (necessary for the first time only) BUT it is mandatory that it has been created from the data.astro machine:
(in case ~/env is not created: mkdir ~/env/) cd ~/env/ virtualenv {ENV_NAME} source ~/env/{ENV_NAME}/bin/activate pip install --upgrade pip
- In case jupyter is not already installed:
pip install jupyter jupyter-notebook --generate-config jupyter-notebook password # (for security reasons when opening the notebook in your browser afterwards)
- Go to the directory where you have your notebooks or where you want to create a new one:
cd ~/notebooks (in case you don't have one, just create it: mkdir ~/notebooks)
It is necessary that all code you want to use is visible and accessible from every node in the Hadoop cluster. The best way is to only use a shared filesystems, such as your home in ~{USER}, or /nfs/astro/{USER}.
- Launch Jupyter using our helper script (NOTE: when launching the jupyter on spark, you CANNOT have the environment active):
/software/astro/scripts/jupyter_pyspark.sh {ENV_NAME}
Note: In the prompt, in one of the lines that appear, there will be a message to tell you what to do with the url:
Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://data.astro.pic.es:{DATA_ASTRO_PORT}
Please, take note of the value of {DATA_ASTRO_PORT}.
- Open another terminal and create a tunnel from your laptop to the DATA.ASTRO through the UI:
Choose any {LOCAL_PORT} higher than 1024, i.e. 9000.
ssh -L {LOCAL_PORT}:data.astro.pic.es:{DATA_ASTRO_PORT} {USER}@ui.pic.es
- From a web browser in your local computer, access the following url:
http://localhost:{LOCAL_PORT}
It will take some time to initialize the Spark Context.
Once it has been initialized, it will be accessible through the 'sc' global variable.
Example
import numpy as np import pandas as pd from scipic.mocks.hod.base import Galaxy from scipic.mocks.hod.kravtsov import Kravtsov from pyspark.sql import Row hive = HiveContext(sc) hive.sql('USE cosmohub').collect() cat = hive.sql('SELECT unique_halo_id AS halo_id, lmhalo as mass FROM micecatv2_0 LIMIT 20').cache() k = Kravtsov(12, 13, 1) def gals(p): data = list(p) df = pd.DataFrame(data, columns=data[0].__fields__) hod = {k.name:v for k,v in k.galaxies(df.halo_id, df.mass).iteritems()} df = pd.DataFrame(hod) return [Row(**fields) for fields in [t._asdict() for t in df.itertuples(index=False)]] print cat.mapPartitions(gals, True).collect()
Necessary information to ingest a catalog into CosmoHub / PAUdm
Two files:
- The catalog itself in the current accepted formats: CSV, CSV.BZ2, FITS and HDF5. Files should occupy > 100Mb and < 1Gb.
- The metadata of the catalog: a yaml file with the following information:
- name: Name of the catalogue - version: Version of the catalogue - software_version: Code version to create the catalogue - date_create: When the catalogue was created - description: Brief description (one liner) - summary: Summary of the catalogue - fields: - column1: Column 1 description - column2: Column 2 description
In addition, the following fields are necessary for ingestion into PAUdm.
- paudm_input_production_id : MEMBA production_id
Ingest PAUdm catalogs into CosmoHub
Forced aperture production (production = 941)
- Get catalog from paudb (from UI):
psql -U readonly -W -h db.pau.pic.es dm -c "COPY (SELECT * FROM public.forced_aperture_coadd WHERE production_id = 941) TO stdout DELIMITER ',' CSV HEADER" | lbzip2 | pv > forced_aperture_coadd_production_id_941.csv.bz2
- Copy to pnfs
cp forced_aperture_coadd_production_id_941.csv.bz2 /pnfs/pic.es/data/astro/cosmohub/disk/raw/paudm/forced_aperture_coadd/forced_aperture_coadd_production_id_941.csv.bz2
- Beeline:
beeline --color=true --showHeader=true --verbose=true --silent=false -u "jdbc:hive2://ambarisrv02.pic.es:2181,ambarisrv03.pic.es:2181,ambarisrv01.pic.es:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2"
- Create partition (from beeline)
ALTER TABLE jcarrete.paudm_forced_aperture_coadd_csv ADD IF NOT EXISTS PARTITION(production_id=941);
- Copy csv.bz2 to hdfs:
hdfs dfs -put /nfs/astro/jcarrete/sandbox/forced_aperture_coadd_production_id_941.csv.bz2 /user/jcarrete/data/paudm/forced_aperture_coadd/production_id\=941/forced_aperture_coadd_production_id_941.csv.bz2
- Insert overwrite cosmohub table:
INSERT OVERWRITE TABLE cosmohub.paudm_forced_aperture_coadd PARTITION(production_id=941) SELECT `ref_id`, `band`, `flux`, `flux_error`, `chi2`, `n_coadd`, `run` FROM jcarrete.paudm_forced_aperture_coadd_csv WHERE production_id=941;
Production table
psql -U readonly -W -h db.pau.pic.es dm -c "COPY (SELECT * FROM public.production) TO stdout DELIMITER ',' CSV HEADER" | pv > production_table_04_11_2019.csv hdfs dfs -rm /user/jcarrete/data/paudm/production/* hdfs dfs -copyFromLocal /nfs/astro/jcarrete/sandbox/production_table_04_11_2019.csv /user/jcarrete/data/paudm/production/production_table_04_11_2019.csv INSERT OVERWRITE TABLE cosmohub.paudm_production SELECT id, input_production, pipeline, release, software_version, job_id, comments, created FROM jcarrete.paudm_production_csv WHERE id IS NOT NULL;
PAUdm photoz results from M.Eriksen file
The file is here:
/nfs/astro/eriksen/kcorr/bcnz_v29.parquet
- I use python pyarrow library to read the parquet file and saved it into csv. These are the lines in python:
import pyarrow.parquet as pq data = pq.read_table('/nfs/astro/eriksen/kcorr/bcnz_v29.parquet') df_new = data.to_pandas() df_new['production_id'] = 952 df_new.reset_index(inplace=True) column_list = ['production_id', 'ref_id', 'zb', 'odds', 'pz_width', 'zb_mean', 'chi2', 'n_band', 'ebv', 'qz', 'best_run'] df_new[column_list].to_csv('/nfs/astro/jcarrete/sandbox/parquet_bcnz_v29_some_fields.csv', header=True, index=False)
- Add by hand in the paudb, in the production table, the production of the new photoz data production id = 952
- Create partition (from beeline)
ALTER TABLE jcarrete.paudm_photoz_bcnz_updated_v1_1_csv ADD IF NOT EXISTS PARTITION(production_id=952);
- Copy csv to hdfs:
hdfs dfs -put /nfs/astro/jcarrete/sandbox/parquet_bcnz_v29_some_fields.csv /user/jcarrete/data/paudm/photoz_bcnz_updated_v1_1_csv/production_id\=952/parquet_bcnz_v29_some_fields.csv
- Insert overwrite table:
INSERT OVERWRITE TABLE cosmohub.paudm_photoz_bcnz_updated_v1_1 PARTITION(production_id=952) SELECT ref_id, zb, odds, pz_width, zb_mean, chi2, n_band, ebv, qz, best_run FROM jcarrete.paudm_photoz_bcnz_updated_v1_1_csv WHERE production_id = 952 AND ref_id IS NOT NULL;