Photoz bcnz
Version parquet v29
I use a notebook to read and clean the data, since the file contains wrong or weird values: e.g. NaN values, odds > 1.e60, etc.
The notebook is called:
    PAUdm photoz parquet files.ipynb
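For context, a minimal sketch of the reading step in the notebook, assuming pandas and a hypothetical location for the v29 parquet file (the actual input path is not given here):

    import pandas as pd

    # Hypothetical input path; the real v29 parquet file may live elsewhere.
    infile = '/cephfs/pic.es/astro/scratch/jcarrete/sandbox/parquet_bcnz_v29.parquet'

    # Load the photo-z catalogue into a DataFrame for cleaning.
    df_new = pd.read_parquet(infile)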
It is necessary to clip the fields odds, pz_width, zb_mean and chi2, e.g.:

    ## Clipping large and small values
    df_new['odds'].clip(lower=1.e-37, upper=1.e37, inplace=True)
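The same clipping can be applied to the other affected columns; a minimal sketch, assuming the column names listed above:

    # Clip every affected column to the same range used for 'odds'.
    for col in ['odds', 'pz_width', 'zb_mean', 'chi2']:
        df_new[col].clip(lower=1.e-37, upper=1.e37, inplace=True)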
Even with that, it is afterwards still necessary to remove NaN values from some entries. I also fill them with 1.e37:

    df_new.fillna(value=1.e37, inplace=True)
I saved the file in CSV format in order to ingest it into paudb:

    outfile = '/cephfs/pic.es/astro/scratch/jcarrete/sandbox/parquet_bcnz_v29_some_fields.csv'
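A minimal sketch of the export itself, assuming the cleaned DataFrame is df_new and that its column order matches the COPY column list below; the header row written by default is what the tail -n+2 in the ingestion command skips:

    # Write the cleaned table as CSV; keep the header row (it is skipped later with tail -n+2).
    df_new.to_csv(outfile, index=False)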
Then I insert it into paudb:

    tail -n+2 parquet_bcnz_v29_some_fields.csv | pv -l | psql -U postgres -W -h db.pau.pic.es dm -c "COPY photoz_bcnz (production_id, ref_id, zb, odds, pz_width, zb_mean, chi2, n_band, ebv, qz, best_run) FROM STDIN DELIMITER ',' CSV"
The data also goes into CosmoHub. First it is necessary to create the partition in the external table:
    ALTER TABLE jcarrete.paudm_photoz_bcnz_updated_v1_1_csv ADD IF NOT EXISTS PARTITION(production_id=952);
It is probably necessary to give write permissions to hive (or to everybody) on the just-created partition directory; see the sketch after the path below. The HDFS directory is:

    /user/jcarrete/data/paudm/photoz_bcnz_updated_v1_1_csv/production_id=952
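A minimal sketch of opening up the permissions, assuming the hdfs command-line client is available and that making the directory world-writable is acceptable here:

    # Make the new partition directory writable by hive (here: by everybody).
    hdfs dfs -chmod -R 777 /user/jcarrete/data/paudm/photoz_bcnz_updated_v1_1_csv/production_id=952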
Then one can ingest the data into the CosmoHub table:

    INSERT OVERWRITE TABLE cosmohub.paudm_photoz_bcnz_updated_v1_1 PARTITION (production_id=952) SELECT ref_id, zb, odds, pz_width, zb_mean, chi2, n_band, ebv, qz, best_run FROM jcarrete.paudm_photoz_bcnz_updated_v1_1_csv WHERE ref_id IS NOT NULL AND production_id = 952;
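As a quick sanity check after the insert, one can count the rows that ended up in the new partition (a hypothetical verification query, not part of the original recipe):

    SELECT COUNT(*) FROM cosmohub.paudm_photoz_bcnz_updated_v1_1 WHERE production_id = 952;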