Difference between revisions of "Photoz bcnz"

From Public PIC Wiki
Jump to navigation Jump to search
(Created page with "Version parquet v29 I use a notebook to read and clean the data since there are wrong or weird values in the file: e.g. NaN values, odds > 1.e60, etc.. The notebook is calle...")
 
 
(One intermediate revision by the same user not shown)
Line 1: Line 1:
Version parquet v29
+
==== Version parquet v29 ====
  
 
I use a notebook to read and clean the data since there are wrong or weird values in the file: e.g. NaN values, odds > 1.e60, etc..
 
I use a notebook to read and clean the data since there are wrong or weird values in the file: e.g. NaN values, odds > 1.e60, etc..
Line 7: Line 7:
 
     PAUdm photoz parquet files.ipynb
 
     PAUdm photoz parquet files.ipynb
  
I saved the file in csv format in order to ingest it into paudb.
+
It is necessary to make a clip for the fields odds, pz_width, zb_mean and chi2, e.g.
 +
 
 +
    ## Clipping large and small values
 +
    df_new['odds'].clip(lower=1.e-37, upper=1.e37, inplace=True)
 +
 
 +
And even with that, afterwards it is necessary to remove NaN values in some entries.
 +
I also put 1.e37 as value:
 +
 
 +
    df_new.fillna(value=1.e37, inplace=True)
 +
 
 +
I saved the file in csv format in order to ingest it into paudb:
 +
 
 +
    outfile = '/cephfs/pic.es/astro/scratch/jcarrete/sandbox/parquet_bcnz_v29_some_fields.csv'
 +
 
 +
 
 +
And then I insert it into paudb:
 +
 
 +
tail -n+2 parquet_bcnz_v29_some_fields.csv | pv -l | psql -U postgres -W -h db.pau.pic.es dm -c "COPY photoz_bcnz (production_id, ref_id, zb, odds, pz_width, zb_mean, chi2, n_band, ebv, qz, best_run) FROM STDIN DELIMITER ',' CSV"
 +
 
 +
And also into CosmoHub:
 +
 
 +
First it is necessary to create the partition in the external table:
 +
 
 +
    ALTER TABLE jcarrete.paudm_photoz_bcnz_updated_v1_1_csv ADD IF NOT EXISTS PARTITION(production_id=952);
 +
 
 +
Probably it is necessary to give write permissions for hive (or for everybody) to the just created directory of the partition.
 +
The hdfs directory is here:
 +
 
 +
    /user/jcarrete/data/paudm/photoz_bcnz_updated_v1_1_csv/production_id=952
 +
 
 +
And then one can ingest the data into the cosmohub table:
 +
 
 +
    INSERT OVERWRITE TABLE cosmohub.paudm_photoz_bcnz_updated_v1_1 PARTITION (production_id=952) SELECT ref_id, zb, odds, pz_width, zb_mean, chi2, n_band, ebv, qz, best_run FROM jcarrete.paudm_photoz_bcnz_updated_v1_1_csv WHERE ref_id IS NOT NULL AND production_id = 952;

Latest revision as of 14:30, 17 February 2020

Version parquet v29

I use a notebook to read and clean the data since there are wrong or weird values in the file: e.g. NaN values, odds > 1.e60, etc..

The notebook is called:

   PAUdm photoz parquet files.ipynb

It is necessary to make a clip for the fields odds, pz_width, zb_mean and chi2, e.g.

   ## Clipping large and small values
   df_new['odds'].clip(lower=1.e-37, upper=1.e37, inplace=True)

And even with that, afterwards it is necessary to remove NaN values in some entries. I also put 1.e37 as value:

   df_new.fillna(value=1.e37, inplace=True)

I saved the file in csv format in order to ingest it into paudb:

   outfile = '/cephfs/pic.es/astro/scratch/jcarrete/sandbox/parquet_bcnz_v29_some_fields.csv'


And then I insert it into paudb:

tail -n+2 parquet_bcnz_v29_some_fields.csv | pv -l | psql -U postgres -W -h db.pau.pic.es dm -c "COPY photoz_bcnz (production_id, ref_id, zb, odds, pz_width, zb_mean, chi2, n_band, ebv, qz, best_run) FROM STDIN DELIMITER ',' CSV"

And also into CosmoHub:

First it is necessary to create the partition in the external table:

   ALTER TABLE jcarrete.paudm_photoz_bcnz_updated_v1_1_csv ADD IF NOT EXISTS PARTITION(production_id=952);

Probably it is necessary to give write permissions for hive (or for everybody) to the just created directory of the partition. The hdfs directory is here:

   /user/jcarrete/data/paudm/photoz_bcnz_updated_v1_1_csv/production_id=952

And then one can ingest the data into the cosmohub table:

   INSERT OVERWRITE TABLE cosmohub.paudm_photoz_bcnz_updated_v1_1 PARTITION (production_id=952) SELECT ref_id, zb, odds, pz_width, zb_mean, chi2, n_band, ebv, qz, best_run FROM jcarrete.paudm_photoz_bcnz_updated_v1_1_csv WHERE ref_id IS NOT NULL AND production_id = 952;