<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://pwiki.pic.es/index.php?action=history&amp;feed=atom&amp;title=Spark_on_Jupyter</id>
	<title>Spark on Jupyter - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://pwiki.pic.es/index.php?action=history&amp;feed=atom&amp;title=Spark_on_Jupyter"/>
	<link rel="alternate" type="text/html" href="https://pwiki.pic.es/index.php?title=Spark_on_Jupyter&amp;action=history"/>
	<updated>2026-04-16T08:08:32Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.35.14</generator>
	<entry>
		<id>https://pwiki.pic.es/index.php?title=Spark_on_Jupyter&amp;diff=1340&amp;oldid=prev</id>
		<title>Tallada: Created page with &quot;= Accessing PIC Big Data Services from JupyterLab =  PIC's Jupyter service is integrated with the Big Data platform. This allows you to access services such as '''HDFS''', '''...&quot;</title>
		<link rel="alternate" type="text/html" href="https://pwiki.pic.es/index.php?title=Spark_on_Jupyter&amp;diff=1340&amp;oldid=prev"/>
		<updated>2026-04-08T08:51:24Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;= Accessing PIC Big Data Services from JupyterLab =  PIC&amp;#039;s Jupyter service is integrated with the Big Data platform. This allows you to access services such as &amp;#039;&amp;#039;&amp;#039;HDFS&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;= Accessing PIC Big Data Services from JupyterLab =&lt;br /&gt;
&lt;br /&gt;
PIC's Jupyter service is integrated with the Big Data platform. This allows you to access services such as '''HDFS''', '''Hive''', and '''Spark''' directly from your JupyterLab environment.&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
=== 1. Kerberos Authentication ===&lt;br /&gt;
&lt;br /&gt;
You must have a '''valid Kerberos ticket''' before accessing any Big Data service.&lt;br /&gt;
&lt;br /&gt;
==== Step-by-step ====&lt;br /&gt;
&lt;br /&gt;
# Open a terminal session on the same node where your Jupyter session is running.&lt;br /&gt;
#* In JupyterLab, use the ''Launcher'' → ''Terminal''.&lt;br /&gt;
# Run the following commands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
kinit -n -c ~/.fast.ccache @PIC.ES&lt;br /&gt;
kinit -T ~/.fast.ccache&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Verify your ticket ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
klist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should see a ticket whose ''Expires'' timestamp is still in the future.&lt;br /&gt;
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
* If the ticket expires, you must repeat the process.&lt;br /&gt;
* Without a valid Kerberos ticket, access to HDFS, Hive, or Spark will fail.&lt;br /&gt;
&lt;br /&gt;
==== Quick test (HDFS access) ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
hdfs dfs -ls /&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If this works, your authentication is correctly configured.&lt;br /&gt;
&lt;br /&gt;
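From a notebook cell you can also check programmatically whether a valid ticket is cached before launching jobs. This is a minimal sketch using MIT Kerberos' &amp;lt;code&amp;gt;klist -s&amp;lt;/code&amp;gt; (silent mode, which exits non-zero when no valid ticket is present); the helper name is ours, not part of any library:&lt;br /&gt;

```python
import subprocess

def has_valid_ticket(command=("klist", "-s")):
    """Return True if *command* exits with status 0.

    By default this runs ``klist -s``, which (with MIT Kerberos) exits 0
    only when the credential cache holds a valid, non-expired ticket.
    """
    try:
        return subprocess.run(list(command), capture_output=True).returncode == 0
    except FileNotFoundError:
        # klist itself is not installed on this node
        return False

if not has_valid_ticket():
    print("No valid Kerberos ticket: run kinit before using HDFS, Hive, or Spark")
```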
----&lt;br /&gt;
&lt;br /&gt;
=== 2. Python Environment ===&lt;br /&gt;
&lt;br /&gt;
You need specific Python packages depending on the service:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Service !! Required packages&lt;br /&gt;
|-&lt;br /&gt;
| Spark || findspark, pyspark&lt;br /&gt;
|-&lt;br /&gt;
| Hive || pyhive[hive_pure_sasl, kerberos]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Installation example ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
pip install findspark pyspark&lt;br /&gt;
pip install &amp;quot;pyhive[hive_pure_sasl, kerberos]&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
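After installing, you can verify from a notebook cell that the packages are importable. A small sketch using the standard library's &amp;lt;code&amp;gt;importlib.util.find_spec&amp;lt;/code&amp;gt; (the helper name and the package list are ours; adjust the list to the service you use):&lt;br /&gt;

```python
import importlib.util

def missing_packages(names):
    """Return the subset of *names* that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Package names taken from the table above; adjust to the service you use.
missing = missing_packages(["findspark", "pyspark", "pyhive"])
if missing:
    print("Install before continuing:", ", ".join(missing))
```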
----&lt;br /&gt;
&lt;br /&gt;
== Using Spark from Jupyter ==&lt;br /&gt;
&lt;br /&gt;
=== Setup ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import os&lt;br /&gt;
&lt;br /&gt;
# Environment configuration&lt;br /&gt;
os.environ[&amp;quot;HADOOP_HOME&amp;quot;] = &amp;quot;/usr/local/hadoop&amp;quot;&lt;br /&gt;
os.environ[&amp;quot;HADOOP_CONF_DIR&amp;quot;] = &amp;quot;/usr/local/hadoop/etc/hadoop&amp;quot;&lt;br /&gt;
os.environ[&amp;quot;HIVE_HOME&amp;quot;] = &amp;quot;/usr/local/hive&amp;quot;&lt;br /&gt;
os.environ[&amp;quot;HIVE_CONF_DIR&amp;quot;] = &amp;quot;/usr/local/hive/conf&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import findspark&lt;br /&gt;
findspark.init()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Create a Spark session ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from pyspark.sql import SparkSession&lt;br /&gt;
&lt;br /&gt;
spark = (&lt;br /&gt;
    SparkSession.builder&lt;br /&gt;
    .appName(&amp;quot;example&amp;quot;)&lt;br /&gt;
    .enableHiveSupport()&lt;br /&gt;
    .getOrCreate()&lt;br /&gt;
)&lt;br /&gt;
&lt;br /&gt;
sc = spark.sparkContext&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Example: Read a Hive table ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
df = spark.sql(&amp;quot;SELECT * FROM some_database.some_table LIMIT 10&amp;quot;)&lt;br /&gt;
df.show()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Using Hive (PyHive) from Jupyter ==&lt;br /&gt;
&lt;br /&gt;
=== Connect to Hive ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from pyhive import hive&lt;br /&gt;
&lt;br /&gt;
conn = hive.connect(&lt;br /&gt;
    host=&amp;quot;hsrv01.pic.es&amp;quot;,&lt;br /&gt;
    port=10000,&lt;br /&gt;
    kerberos_service_name=&amp;quot;hive&amp;quot;,&lt;br /&gt;
    auth=&amp;quot;KERBEROS&amp;quot;,&lt;br /&gt;
)&lt;br /&gt;
&lt;br /&gt;
cursor = conn.cursor()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Execute a query ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
query = &amp;quot;SELECT * FROM some_database.some_table LIMIT 10&amp;quot;&lt;br /&gt;
cursor.execute(query)&lt;br /&gt;
&lt;br /&gt;
rows = cursor.fetchall()&lt;br /&gt;
colnames = [c[0] for c in cursor.description]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
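Depending on the server's configuration (&amp;lt;code&amp;gt;hive.resultset.use.unique.column.names&amp;lt;/code&amp;gt;), HiveServer2 may report column names qualified with the table name, e.g. &amp;lt;code&amp;gt;some_table.ra&amp;lt;/code&amp;gt; instead of &amp;lt;code&amp;gt;ra&amp;lt;/code&amp;gt;. A small helper (ours, not part of PyHive) strips the qualifier from &amp;lt;code&amp;gt;cursor.description&amp;lt;/code&amp;gt;:&lt;br /&gt;

```python
def clean_colnames(description):
    """Strip a leading 'table.' qualifier from DBAPI column names.

    HiveServer2 may return names like 'some_table.col'; keeping only the
    part after the last dot gives the bare column name.
    """
    return [col[0].rsplit(".", 1)[-1] for col in description]

# Example with a fake cursor.description-style sequence:
desc = [("some_table.ra", None), ("some_table.dec", None)]
print(clean_colnames(desc))  # ['ra', 'dec']
```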
=== Convert to Astropy Table (optional) ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from astropy.table import Table&lt;br /&gt;
&lt;br /&gt;
table = Table(rows=rows, names=colnames)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Troubleshooting ==&lt;br /&gt;
&lt;br /&gt;
=== Kerberos errors ===&lt;br /&gt;
* Run &amp;lt;code&amp;gt;klist&amp;lt;/code&amp;gt; and check the ticket's expiration time&lt;br /&gt;
* Re-run &amp;lt;code&amp;gt;kinit&amp;lt;/code&amp;gt; if needed&lt;br /&gt;
&lt;br /&gt;
=== Hive connection fails ===&lt;br /&gt;
* Ensure Kerberos ticket is valid&lt;br /&gt;
* Verify correct host (&amp;lt;code&amp;gt;hsrv01.pic.es&amp;lt;/code&amp;gt;)&lt;br /&gt;
* Check that required Python packages are installed&lt;br /&gt;
&lt;br /&gt;
=== Spark does not start ===&lt;br /&gt;
* Verify environment variables (&amp;lt;code&amp;gt;HADOOP_HOME&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;HIVE_HOME&amp;lt;/code&amp;gt;)&lt;br /&gt;
* Ensure &amp;lt;code&amp;gt;findspark.init()&amp;lt;/code&amp;gt; is executed before creating the session&lt;br /&gt;
&lt;br /&gt;
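The environment-variable checks above can be automated in a single notebook cell. A minimal sketch (the helper name is ours) that reports which of the variables from the Setup section are unset or empty:&lt;br /&gt;

```python
import os

REQUIRED_VARS = ["HADOOP_HOME", "HADOOP_CONF_DIR", "HIVE_HOME", "HIVE_CONF_DIR"]

def unset_vars(required=REQUIRED_VARS, env=os.environ):
    """Return the required variables that are missing or empty in *env*."""
    return [v for v in required if not env.get(v)]

problems = unset_vars()
if problems:
    print("Set these before findspark.init():", ", ".join(problems))
```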
----&lt;br /&gt;
&lt;br /&gt;
== Summary ==&lt;br /&gt;
&lt;br /&gt;
# Obtain a Kerberos ticket (&amp;lt;code&amp;gt;kinit&amp;lt;/code&amp;gt;)&lt;br /&gt;
# Install required Python packages&lt;br /&gt;
# Configure environment variables&lt;br /&gt;
# Use Spark or Hive from your notebook&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
For further assistance, contact PIC support (user.support@pic.es).&lt;/div&gt;</summary>
		<author><name>Tallada</name></author>
	</entry>
</feed>