Spark on Jupyter
Accessing PIC Big Data Services from JupyterLab
PIC's Jupyter service is integrated with the Big Data platform. This allows you to access services such as HDFS, Hive, and Spark directly from your JupyterLab environment.
Requirements
1. Kerberos Authentication
You must have a valid Kerberos ticket before accessing any Big Data service.
Step-by-step
- Open a terminal session on the same node where your Jupyter session is running.
- In JupyterLab, use the Launcher → Terminal.
- Run the following commands:
kinit -n -c ~/.fast.ccache @PIC.ES
kinit -T ~/.fast.ccache
Verify your ticket
klist
You should see a valid ticket with a non-expired timestamp.
Notes
- If the ticket expires, you must repeat the process.
- Without a valid Kerberos ticket, access to HDFS, Hive, or Spark will fail.
Quick test (HDFS access)
hdfs dfs -ls /
If this works, your authentication is correctly configured.
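This check can also be automated from a notebook cell: `klist -s` (the MIT Kerberos "silent" mode) exits with status 0 only when a valid, non-expired ticket is present. A minimal sketch:

```python
import subprocess

def has_valid_ticket() -> bool:
    """Return True if a valid (non-expired) Kerberos ticket exists.

    Uses `klist -s`, which exits 0 only when a usable ticket is found.
    Returns False if klist is not installed or no ticket is present.
    """
    try:
        result = subprocess.run(["klist", "-s"], check=False)
        return result.returncode == 0
    except FileNotFoundError:
        return False

if not has_valid_ticket():
    print("No valid Kerberos ticket: run kinit before using HDFS, Hive, or Spark")
```

Running this at the top of a notebook gives an early, explicit failure instead of an opaque authentication error later on.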
---
2. Python Environment
You need specific Python packages depending on the service:
| Service | Required packages |
|---|---|
| Spark | findspark, pyspark |
| Hive | pyhive[hive_pure_sasl, kerberos] |
Installation example
pip install findspark pyspark
pip install "pyhive[hive_pure_sasl, kerberos]"
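Before running the examples below, you can verify the installation succeeded. The standard library's `importlib.util.find_spec` reports whether a module is importable without actually importing it:

```python
from importlib.util import find_spec

# Top-level module names for the packages in the table above
required = ["findspark", "pyspark", "pyhive"]

missing = [name for name in required if find_spec(name) is None]
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed")
```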
---
Using Spark from Jupyter
Setup
import os

# Environment configuration
os.environ["HADOOP_HOME"] = "/usr/local/hadoop"
os.environ["HADOOP_CONF_DIR"] = "/usr/local/hadoop/etc/hadoop"
os.environ["HIVE_HOME"] = "/usr/local/hive"
os.environ["HIVE_CONF_DIR"] = "/usr/local/hive/conf"

import findspark
findspark.init()
Create a Spark session
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName("example")
.enableHiveSupport()
.getOrCreate()
)
sc = spark.sparkContext
Example: Read a Hive table
df = spark.sql("SELECT * FROM some_database.some_table LIMIT 10")
df.show()
---
Using Hive (PyHive) from Jupyter
Connect to Hive
from pyhive import hive
conn = hive.connect(
host="hsrv01.pic.es",
port=10000,
kerberos_service_name="hive",
auth="KERBEROS",
)
cursor = conn.cursor()
Execute a query
query = "SELECT * FROM some_database.some_table LIMIT 10"
cursor.execute(query)
rows = cursor.fetchall()
colnames = [c[0] for c in cursor.description]
Convert to Astropy Table (optional)
from astropy.table import Table

table = Table(rows=rows, names=colnames)
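If Astropy is not available, the same `rows` and `colnames` can be reshaped with the standard library alone, for example into a list of dictionaries. The sample values below are illustrative stand-ins, not real PIC data:

```python
# Illustrative stand-ins for cursor.fetchall() and cursor.description results
rows = [(1, "alpha"), (2, "beta")]
colnames = ["id", "name"]

# One dict per row, keyed by column name
records = [dict(zip(colnames, row)) for row in rows]
print(records)  # [{'id': 1, 'name': 'alpha'}, {'id': 2, 'name': 'beta'}]
```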
---
Troubleshooting
Kerberos errors
- Run klist and check the expiration time
- Re-run kinit if needed
Hive connection fails
- Ensure your Kerberos ticket is valid
- Verify the correct host (hsrv01.pic.es)
- Check that the required Python packages are installed
Spark does not start
- Verify the environment variables (HADOOP_HOME, HIVE_HOME)
- Ensure findspark.init() is executed before creating the session
---
Summary
- Obtain a Kerberos ticket (kinit)
- Install the required Python packages
- Configure environment variables
- Use Spark or Hive from your notebook
---
For further assistance, contact PIC support (user.support@pic.es).