Spark on Jupyter

From Public PIC Wiki

Accessing PIC Big Data Services from JupyterLab

PIC's Jupyter service is integrated with the Big Data platform. This allows you to access services such as HDFS, Hive, and Spark directly from your JupyterLab environment.

Requirements

1. Kerberos Authentication

You must have a valid Kerberos ticket before accessing any Big Data service.

Step-by-step

  1. Open a terminal session on the same node where your Jupyter session is running.
    • In JupyterLab, use the Launcher → Terminal tile.
  2. Run the following commands (the first obtains an anonymous ticket used as FAST armor, the second performs the armored kinit for your user):
kinit -n -c ~/.fast.ccache @PIC.ES
kinit -T ~/.fast.ccache

Verify your ticket

klist

You should see a valid ticket whose Expires timestamp lies in the future.

Notes

  • If the ticket expires, you must repeat the process.
  • Without a valid Kerberos ticket, access to HDFS, Hive, or Spark will fail.

Quick test (HDFS access)

hdfs dfs -ls /

If this works, your authentication is correctly configured.
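The expiry check above can also be scripted. Below is a minimal sketch of a helper that parses `klist` output and reports whether a ticket is still valid; the helper name and the assumed MIT Kerberos output layout (and its `%m/%d/%Y` date format, which varies by locale) are illustrative, not part of the PIC setup:

```python
# Hypothetical helper: parse `klist` output and report whether any
# listed ticket is still valid. Assumes the common MIT Kerberos layout;
# adjust the date format to match your locale.
from datetime import datetime

def ticket_is_valid(klist_output: str, now: datetime) -> bool:
    """Return True if any listed ticket expires after `now`."""
    for line in klist_output.splitlines():
        parts = line.split()
        # Data rows look like: start_date start_time expiry_date expiry_time principal
        if len(parts) >= 5 and "/" in parts[0]:
            try:
                expires = datetime.strptime(parts[2] + " " + parts[3],
                                            "%m/%d/%Y %H:%M:%S")
            except ValueError:
                continue
            if expires > now:
                return True
    return False
```

In a notebook you would feed it the captured output of `klist` (for example via `subprocess.run(["klist"], capture_output=True, text=True).stdout`).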

---

2. Python Environment

You need specific Python packages depending on the service:

Service   Required packages
-------   -----------------
Spark     findspark, pyspark
Hive      pyhive[hive_pure_sasl, kerberos]

Installation example

pip install findspark pyspark
pip install "pyhive[hive_pure_sasl, kerberos]"
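To confirm the installation from inside the notebook kernel itself, a quick sketch like the following can check that the required import names resolve (note that import names are checked here, not pip package names — the pip package `pyhive` is imported as `pyhive`):

```python
# Sanity check (a sketch): report which required import names are not
# available in the current kernel.
import importlib.util

REQUIRED = {
    "Spark": ["findspark", "pyspark"],
    "Hive": ["pyhive"],
}

def missing_packages(service: str) -> list:
    """Return the import names from REQUIRED[service] that are not installed."""
    return [name for name in REQUIRED[service]
            if importlib.util.find_spec(name) is None]
```

An empty list from `missing_packages("Spark")` means the Spark requirements are satisfied for this kernel.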

---

Using Spark from Jupyter

Setup

import os

# Environment configuration
os.environ["HADOOP_HOME"] = "/usr/local/hadoop"
os.environ["HADOOP_CONF_DIR"] = "/usr/local/hadoop/etc/hadoop"
os.environ["HIVE_HOME"] = "/usr/local/hive"
os.environ["HIVE_CONF_DIR"] = "/usr/local/hive/conf"

import findspark
findspark.init()
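A variation on the setup above, in case your site profile already exports some of these variables: using `os.environ.setdefault` keeps any value that is already set and only fills in the missing ones. The paths are the assumed defaults from this page; adjust them to your installation.

```python
# Sketch: fill in environment variables only when they are not already
# exported, so existing site configuration wins over these defaults.
import os

DEFAULTS = {
    "HADOOP_HOME": "/usr/local/hadoop",
    "HADOOP_CONF_DIR": "/usr/local/hadoop/etc/hadoop",
    "HIVE_HOME": "/usr/local/hive",
    "HIVE_CONF_DIR": "/usr/local/hive/conf",
}

for var, path in DEFAULTS.items():
    os.environ.setdefault(var, path)
```

As in the setup above, run this before `findspark.init()` so Spark picks up the Hadoop and Hive configuration.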

Create a Spark session

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example")
    .enableHiveSupport()
    .getOrCreate()
)

sc = spark.sparkContext

Example: Read a Hive table

df = spark.sql("SELECT * FROM some_database.some_table LIMIT 10")
df.show()

---

Using Hive (PyHive) from Jupyter

Connect to Hive

from pyhive import hive

conn = hive.connect(
    host="hsrv01.pic.es",
    port=10000,
    kerberos_service_name="hive",
    auth="KERBEROS",
)

cursor = conn.cursor()

Execute a query

query = "SELECT * FROM some_database.some_table LIMIT 10"
cursor.execute(query)

rows = cursor.fetchall()
colnames = [c[0] for c in cursor.description]

Convert to Astropy Table (optional)

from astropy.table import Table

table = Table(rows=rows, names=colnames)
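If Astropy is not available, the same `(rows, colnames)` pair returned by the cursor can be turned into plain dictionaries, one per row; a minimal sketch:

```python
# Sketch: pair each row tuple with the column names taken from
# cursor.description, yielding one dict per row.
def rows_to_dicts(rows, colnames):
    return [dict(zip(colnames, row)) for row in rows]
```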

---

Troubleshooting

Kerberos errors

  • Run klist and check expiration
  • Re-run kinit if needed

Hive connection fails

  • Ensure Kerberos ticket is valid
  • Verify correct host (hsrv01.pic.es)
  • Check that required Python packages are installed

Spark does not start

  • Verify environment variables (HADOOP_HOME, HIVE_HOME)
  • Ensure findspark.init() is executed before creating the session
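The environment-variable checks above can be sketched as a small diagnostic helper (the function name and the "problem description" return format are illustrative):

```python
# Sketch: report which expected environment variables are missing or
# point to directories that do not exist.
import os

def check_spark_env(required=("HADOOP_HOME", "HADOOP_CONF_DIR",
                              "HIVE_HOME", "HIVE_CONF_DIR")):
    """Return a dict mapping variable name -> problem (empty if all OK)."""
    problems = {}
    for var in required:
        value = os.environ.get(var)
        if value is None:
            problems[var] = "not set"
        elif not os.path.isdir(value):
            problems[var] = "path does not exist: " + value
    return problems
```

Running `check_spark_env()` in a fresh notebook cell before creating the Spark session makes a misconfigured environment visible immediately.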

---

Summary

  1. Obtain a Kerberos ticket (kinit)
  2. Install required Python packages
  3. Configure environment variables
  4. Use Spark or Hive from your notebook

---

For further assistance, contact PIC support (user.support@pic.es).