<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://pwiki.pic.es/index.php?action=history&amp;feed=atom&amp;title=Spark_on_Jupyter</id>
	<title>Spark on Jupyter - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://pwiki.pic.es/index.php?action=history&amp;feed=atom&amp;title=Spark_on_Jupyter"/>
	<link rel="alternate" type="text/html" href="https://pwiki.pic.es/index.php?title=Spark_on_Jupyter&amp;action=history"/>
	<updated>2026-04-16T08:08:32Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.35.14</generator>
	<entry>
		<id>https://pwiki.pic.es/index.php?title=Spark_on_Jupyter&amp;diff=1340&amp;oldid=prev</id>
		<title>Tallada: Created page with &quot;= Accessing PIC Big Data Services from JupyterLab =  PIC's Jupyter service is integrated with the Big Data platform. This allows you to access services such as '''HDFS''', '''...&quot;</title>
		<link rel="alternate" type="text/html" href="https://pwiki.pic.es/index.php?title=Spark_on_Jupyter&amp;diff=1340&amp;oldid=prev"/>
		<updated>2026-04-08T08:51:24Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;= Accessing PIC Big Data Services from JupyterLab =  PIC&amp;#039;s Jupyter service is integrated with the Big Data platform. This allows you to access services such as &amp;#039;&amp;#039;&amp;#039;HDFS&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;= Accessing PIC Big Data Services from JupyterLab =&lt;br /&gt;
&lt;br /&gt;
PIC's Jupyter service is integrated with the Big Data platform. This allows you to access services such as '''HDFS''', '''Hive''', and '''Spark''' directly from your JupyterLab environment.&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
=== 1. Kerberos Authentication ===&lt;br /&gt;
&lt;br /&gt;
You must have a '''valid Kerberos ticket''' before accessing any Big Data service.&lt;br /&gt;
&lt;br /&gt;
==== Step-by-step ====&lt;br /&gt;
&lt;br /&gt;
# Open a terminal session on the same node where your Jupyter session is running.&lt;br /&gt;
#* In JupyterLab, use the ''Launcher'' → ''Terminal''.&lt;br /&gt;
# Run the following commands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
kinit -n -c ~/.fast.ccache @PIC.ES&lt;br /&gt;
kinit -T ~/.fast.ccache&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Verify your ticket ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
klist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should see a ticket whose ''Expires'' timestamp is still in the future.&lt;br /&gt;
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
* If the ticket expires, you must repeat the process.&lt;br /&gt;
* Without a valid Kerberos ticket, access to HDFS, Hive, or Spark will fail.&lt;br /&gt;
&lt;br /&gt;
==== Quick test (HDFS access) ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
hdfs dfs -ls /&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If this works, your authentication is correctly configured.&lt;br /&gt;
&lt;br /&gt;
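From a notebook cell you can also check programmatically whether a valid ticket is cached before launching jobs. This is a minimal sketch using MIT Kerberos' &amp;lt;code&amp;gt;klist -s&amp;lt;/code&amp;gt; (silent mode, which exits non-zero when no valid ticket is present); the helper name is ours, not part of any library:&lt;br /&gt;

```python
import subprocess

def has_valid_ticket(command=("klist", "-s")):
    """Return True if *command* exits with status 0.

    By default this runs ``klist -s``, which (with MIT Kerberos) exits 0
    only when the credential cache holds a valid, non-expired ticket.
    """
    try:
        return subprocess.run(list(command), capture_output=True).returncode == 0
    except FileNotFoundError:
        # klist itself is not installed on this node
        return False

if not has_valid_ticket():
    print("No valid Kerberos ticket: run kinit before using HDFS, Hive, or Spark")
```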
----&lt;br /&gt;
&lt;br /&gt;
=== 2. Python Environment ===&lt;br /&gt;
&lt;br /&gt;
You need specific Python packages depending on the service:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Service !! Required packages&lt;br /&gt;
|-&lt;br /&gt;
| Spark || findspark, pyspark&lt;br /&gt;
|-&lt;br /&gt;
| Hive || pyhive[hive_pure_sasl, kerberos]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Installation example ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
pip install findspark pyspark&lt;br /&gt;
pip install &amp;quot;pyhive[hive_pure_sasl, kerberos]&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
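After installing, you can verify from a notebook cell that the packages are importable. A small sketch using the standard library's &amp;lt;code&amp;gt;importlib.util.find_spec&amp;lt;/code&amp;gt; (the helper name and the package list are ours; adjust the list to the service you use):&lt;br /&gt;

```python
import importlib.util

def missing_packages(names):
    """Return the subset of *names* that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Package names taken from the table above; adjust to the service you use.
missing = missing_packages(["findspark", "pyspark", "pyhive"])
if missing:
    print("Install before continuing:", ", ".join(missing))
```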
----&lt;br /&gt;
&lt;br /&gt;
== Using Spark from Jupyter ==&lt;br /&gt;
&lt;br /&gt;
=== Setup ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import os&lt;br /&gt;
&lt;br /&gt;
# Environment configuration&lt;br /&gt;
os.environ[&amp;quot;HADOOP_HOME&amp;quot;] = &amp;quot;/usr/local/hadoop&amp;quot;&lt;br /&gt;
os.environ[&amp;quot;HADOOP_CONF_DIR&amp;quot;] = &amp;quot;/usr/local/hadoop/etc/hadoop&amp;quot;&lt;br /&gt;
os.environ[&amp;quot;HIVE_HOME&amp;quot;] = &amp;quot;/usr/local/hive&amp;quot;&lt;br /&gt;
os.environ[&amp;quot;HIVE_CONF_DIR&amp;quot;] = &amp;quot;/usr/local/hive/conf&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import findspark&lt;br /&gt;
findspark.init()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Create a Spark session ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from pyspark.sql import SparkSession&lt;br /&gt;
&lt;br /&gt;
spark = (&lt;br /&gt;
    SparkSession.builder&lt;br /&gt;
    .appName(&amp;quot;example&amp;quot;)&lt;br /&gt;
    .enableHiveSupport()&lt;br /&gt;
    .getOrCreate()&lt;br /&gt;
)&lt;br /&gt;
&lt;br /&gt;
sc = spark.sparkContext&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Example: Read a Hive table ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
df = spark.sql(&amp;quot;SELECT * FROM some_database.some_table LIMIT 10&amp;quot;)&lt;br /&gt;
df.show()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Using Hive (PyHive) from Jupyter ==&lt;br /&gt;
&lt;br /&gt;
=== Connect to Hive ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from pyhive import hive&lt;br /&gt;
&lt;br /&gt;
conn = hive.connect(&lt;br /&gt;
    host=&amp;quot;hsrv01.pic.es&amp;quot;,&lt;br /&gt;
    port=10000,&lt;br /&gt;
    kerberos_service_name=&amp;quot;hive&amp;quot;,&lt;br /&gt;
    auth=&amp;quot;KERBEROS&amp;quot;,&lt;br /&gt;
)&lt;br /&gt;
&lt;br /&gt;
cursor = conn.cursor()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Execute a query ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
query = &amp;quot;SELECT * FROM some_database.some_table LIMIT 10&amp;quot;&lt;br /&gt;
cursor.execute(query)&lt;br /&gt;
&lt;br /&gt;
rows = cursor.fetchall()&lt;br /&gt;
colnames = [c[0] for c in cursor.description]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
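Depending on the server's configuration (&amp;lt;code&amp;gt;hive.resultset.use.unique.column.names&amp;lt;/code&amp;gt;), HiveServer2 may report column names qualified with the table name, e.g. &amp;lt;code&amp;gt;some_table.ra&amp;lt;/code&amp;gt; instead of &amp;lt;code&amp;gt;ra&amp;lt;/code&amp;gt;. A small helper (ours, not part of PyHive) strips the qualifier from &amp;lt;code&amp;gt;cursor.description&amp;lt;/code&amp;gt;:&lt;br /&gt;

```python
def clean_colnames(description):
    """Strip a leading 'table.' qualifier from DBAPI column names.

    HiveServer2 may return names like 'some_table.col'; keeping only the
    part after the last dot gives the bare column name.
    """
    return [col[0].rsplit(".", 1)[-1] for col in description]

# Example with a fake cursor.description-style sequence:
desc = [("some_table.ra", None), ("some_table.dec", None)]
print(clean_colnames(desc))  # ['ra', 'dec']
```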
=== Convert to Astropy Table (optional) ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from astropy.table import Table&lt;br /&gt;
&lt;br /&gt;
table = Table(rows=rows, names=colnames)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Troubleshooting ==&lt;br /&gt;
&lt;br /&gt;
=== Kerberos errors ===&lt;br /&gt;
* Run &amp;lt;code&amp;gt;klist&amp;lt;/code&amp;gt; and check the ticket's expiration time&lt;br /&gt;
* Re-run &amp;lt;code&amp;gt;kinit&amp;lt;/code&amp;gt; if needed&lt;br /&gt;
&lt;br /&gt;
=== Hive connection fails ===&lt;br /&gt;
* Ensure Kerberos ticket is valid&lt;br /&gt;
* Verify correct host (&amp;lt;code&amp;gt;hsrv01.pic.es&amp;lt;/code&amp;gt;)&lt;br /&gt;
* Check that required Python packages are installed&lt;br /&gt;
&lt;br /&gt;
=== Spark does not start ===&lt;br /&gt;
* Verify environment variables (&amp;lt;code&amp;gt;HADOOP_HOME&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;HIVE_HOME&amp;lt;/code&amp;gt;)&lt;br /&gt;
* Ensure &amp;lt;code&amp;gt;findspark.init()&amp;lt;/code&amp;gt; is executed before creating the session&lt;br /&gt;
&lt;br /&gt;
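The environment-variable checks above can be automated in a single notebook cell. A minimal sketch (the helper name is ours) that reports which of the variables from the Setup section are unset or empty:&lt;br /&gt;

```python
import os

REQUIRED_VARS = ["HADOOP_HOME", "HADOOP_CONF_DIR", "HIVE_HOME", "HIVE_CONF_DIR"]

def unset_vars(required=REQUIRED_VARS, env=os.environ):
    """Return the required variables that are missing or empty in *env*."""
    return [v for v in required if not env.get(v)]

problems = unset_vars()
if problems:
    print("Set these before findspark.init():", ", ".join(problems))
```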
----&lt;br /&gt;
&lt;br /&gt;
== Summary ==&lt;br /&gt;
&lt;br /&gt;
# Obtain a Kerberos ticket (&amp;lt;code&amp;gt;kinit&amp;lt;/code&amp;gt;)&lt;br /&gt;
# Install required Python packages&lt;br /&gt;
# Configure environment variables&lt;br /&gt;
# Use Spark or Hive from your notebook&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
For further assistance, contact PIC support (user.support@pic.es).&lt;/div&gt;</summary>
		<author><name>Tallada</name></author>
	</entry>
</feed>