Spark on Jupyter

From Public PIC Wiki

Accessing PIC Big Data Services from JupyterLab

PIC's Jupyter service is integrated with the Big Data platform. This allows you to access services such as HDFS, Hive, and Spark directly from your JupyterLab environment.

Requirements

1. Kerberos Authentication

You must have a valid Kerberos ticket before accessing any Big Data service.

Step-by-step

  1. Open a terminal session on the same node where your Jupyter session is running.
    • In JupyterLab, use the Launcher → Terminal tile.
  2. Run the following commands (the first obtains an anonymous ticket used as FAST armor, the second performs the armored kinit for your user):
kinit -n -c ~/.fast.ccache @PIC.ES
kinit -T ~/.fast.ccache

Verify your ticket

klist

You should see a valid ticket whose Expires timestamp lies in the future.

Notes

  • If the ticket expires, you must repeat the process.
  • Without a valid Kerberos ticket, access to HDFS, Hive, or Spark will fail.

Quick test (HDFS access)

hdfs dfs -ls /

If this works, your authentication is correctly configured.
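The expiry check above can also be scripted. Below is a minimal sketch of a helper that parses `klist` output and reports whether a ticket is still valid; the helper name and the assumed MIT Kerberos output layout (and its `%m/%d/%Y` date format, which varies by locale) are illustrative, not part of the PIC setup:

```python
# Hypothetical helper: parse `klist` output and report whether any
# listed ticket is still valid. Assumes the common MIT Kerberos layout;
# adjust the date format to match your locale.
from datetime import datetime

def ticket_is_valid(klist_output: str, now: datetime) -> bool:
    """Return True if any listed ticket expires after `now`."""
    for line in klist_output.splitlines():
        parts = line.split()
        # Data rows look like: start_date start_time expiry_date expiry_time principal
        if len(parts) >= 5 and "/" in parts[0]:
            try:
                expires = datetime.strptime(parts[2] + " " + parts[3],
                                            "%m/%d/%Y %H:%M:%S")
            except ValueError:
                continue
            if expires > now:
                return True
    return False
```

In a notebook you would feed it the captured output of `klist` (for example via `subprocess.run(["klist"], capture_output=True, text=True).stdout`).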

---

2. Python Environment

You need specific Python packages depending on the service:

Service   Required packages
-------   -----------------
Spark     findspark, pyspark
Hive      pyhive[hive_pure_sasl, kerberos]

Installation example

pip install findspark pyspark
pip install "pyhive[hive_pure_sasl, kerberos]"
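To confirm the installation from inside the notebook kernel itself, a quick sketch like the following can check that the required import names resolve (note that import names are checked here, not pip package names — the pip package `pyhive` is imported as `pyhive`):

```python
# Sanity check (a sketch): report which required import names are not
# available in the current kernel.
import importlib.util

REQUIRED = {
    "Spark": ["findspark", "pyspark"],
    "Hive": ["pyhive"],
}

def missing_packages(service: str) -> list:
    """Return the import names from REQUIRED[service] that are not installed."""
    return [name for name in REQUIRED[service]
            if importlib.util.find_spec(name) is None]
```

An empty list from `missing_packages("Spark")` means the Spark requirements are satisfied for this kernel.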

---

Using Spark from Jupyter

Setup

import os

# Environment configuration
os.environ["HADOOP_HOME"] = "/usr/local/hadoop"
os.environ["HADOOP_CONF_DIR"] = "/usr/local/hadoop/etc/hadoop"
os.environ["HIVE_HOME"] = "/usr/local/hive"
os.environ["HIVE_CONF_DIR"] = "/usr/local/hive/conf"

import findspark
findspark.init()
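A variation on the setup above, in case your site profile already exports some of these variables: using `os.environ.setdefault` keeps any value that is already set and only fills in the missing ones. The paths are the assumed defaults from this page; adjust them to your installation.

```python
# Sketch: fill in environment variables only when they are not already
# exported, so existing site configuration wins over these defaults.
import os

DEFAULTS = {
    "HADOOP_HOME": "/usr/local/hadoop",
    "HADOOP_CONF_DIR": "/usr/local/hadoop/etc/hadoop",
    "HIVE_HOME": "/usr/local/hive",
    "HIVE_CONF_DIR": "/usr/local/hive/conf",
}

for var, path in DEFAULTS.items():
    os.environ.setdefault(var, path)
```

As in the setup above, run this before `findspark.init()` so Spark picks up the Hadoop and Hive configuration.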

Create a Spark session

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example")
    .enableHiveSupport()
    .getOrCreate()
)

sc = spark.sparkContext

Example: Read a Hive table

df = spark.sql("SELECT * FROM some_database.some_table LIMIT 10")
df.show()

---

Using Hive (PyHive) from Jupyter

Connect to Hive

from pyhive import hive

conn = hive.connect(
    host="hsrv01.pic.es",
    port=10000,
    kerberos_service_name="hive",
    auth="KERBEROS",
)

cursor = conn.cursor()

Execute a query

query = "SELECT * FROM some_database.some_table LIMIT 10"
cursor.execute(query)

rows = cursor.fetchall()
colnames = [c[0] for c in cursor.description]

Convert to Astropy Table (optional)

from astropy.table import Table

table = Table(rows=rows, names=colnames)
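If Astropy is not available, the same `(rows, colnames)` pair returned by the cursor can be turned into plain dictionaries, one per row; a minimal sketch:

```python
# Sketch: pair each row tuple with the column names taken from
# cursor.description, yielding one dict per row.
def rows_to_dicts(rows, colnames):
    return [dict(zip(colnames, row)) for row in rows]
```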

---

Troubleshooting

Kerberos errors

  • Run klist and check expiration
  • Re-run kinit if needed

Hive connection fails

  • Ensure Kerberos ticket is valid
  • Verify correct host (hsrv01.pic.es)
  • Check that required Python packages are installed

Spark does not start

  • Verify environment variables (HADOOP_HOME, HIVE_HOME)
  • Ensure findspark.init() is executed before creating the session
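The environment-variable checks above can be sketched as a small diagnostic helper (the function name and the "problem description" return format are illustrative):

```python
# Sketch: report which expected environment variables are missing or
# point to directories that do not exist.
import os

def check_spark_env(required=("HADOOP_HOME", "HADOOP_CONF_DIR",
                              "HIVE_HOME", "HIVE_CONF_DIR")):
    """Return a dict mapping variable name -> problem (empty if all OK)."""
    problems = {}
    for var in required:
        value = os.environ.get(var)
        if value is None:
            problems[var] = "not set"
        elif not os.path.isdir(value):
            problems[var] = "path does not exist: " + value
    return problems
```

Running `check_spark_env()` in a fresh notebook cell before creating the Spark session makes a misconfigured environment visible immediately.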

---

Summary

  1. Obtain a Kerberos ticket (kinit)
  2. Install required Python packages
  3. Configure environment variables
  4. Use Spark or Hive from your notebook

---

For further assistance, contact PIC support (user.support@pic.es).