Spark on Hadoop integration with Jupyter

For several years, Jupyter notebook has established itself as the notebook solution in the Python universe. Historically, Jupyter is the tool of choice for data scientists who mainly develop in Python. Over the years, Jupyter has evolved and now has a wide range of features thanks to its plugins. Also, one of the main advantages of Jupyter is its ease of deployment.

More and more Spark developers favor Python over Scala to develop their various jobs for speed of development.

In this article, we will see together how to connect a Jupyter server to a Spark cluster running on Hadoop Yarn secured with Kerberos.

How to install Jupyter?

We cover two methods to connect Jupyter to a Spark cluster:

  1. Set up a script to launch a Jupyter instance that will have a Python Spark interpreter.
  2. Connect Jupyter notebook to a Spark cluster via the Sparkmagic extension.

Method 1: Create a

Read more