How to Install Apache Spark on Windows

October 10, 2024

Introduction

Apache Spark is an open-source framework for processing large volumes of batch and streaming data from multiple sources. It is used in distributed computing for machine learning, data analytics, and graph processing tasks.

Learn how to install Apache Spark on Windows and verify the installation works.

Prerequisites

  • A system running Windows 10 or 11.
  • A user account with administrator privileges.
  • Access to Windows Command Prompt or PowerShell.
  • A tool to extract .tar files, such as 7-Zip or WinRAR.

Install and Set Up Apache Spark on Windows

To set up Apache Spark, you must install Java, download the Spark package, and set up environment variables. Python is also required if you want to use PySpark, Spark's Python API.

If you already have Java 8 (or later) and Python 3 (or later) installed, you can skip the first step of this guide.

Step 1: Install Spark Dependencies

Using the Windows winget utility is a convenient way to install the necessary dependencies for Apache Spark:

1. Open Command Prompt or PowerShell as an Administrator.

2. Enter the following command to install the Azul Zulu OpenJDK 21 (Java Development Kit) and Python 3.9:

winget install --id Azul.Zulu.21.JDK -e --source winget && winget install --id Python.Python.3.9 -e --source winget
Installing Java and Python from Windows cmd.

Adjust the command if you need a specific Java or Python version or a different OpenJDK vendor.
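
If you are unsure which package identifiers or versions winget offers, you can search its repository first. For example, the following commands list matching Java and Python packages (adjust the search terms as needed):

winget search Azul.Zulu && winget search Python.Python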

3. The OpenJDK installation is in the C:\Program Files\Zulu\zulu-21 folder by default. The space in the path can cause issues when launching Apache Spark. Avoid this by moving the installation to a folder without spaces. Use the following command to create a new Zulu folder in the root directory and move the installation:

mkdir C:\Zulu && robocopy "C:\Program Files\Zulu\zulu-21" "C:\Zulu\zulu-21" /E /MOVE

4. Verify the installation by checking the Java and Python versions:

java -version && python --version
Checking the Java and Python versions in the Windows cmd.

The output confirms your system is using OpenJDK 21 and Python 3.9.

Step 2: Download Apache Spark

To download the latest Apache Spark release:

1. Open a browser and navigate to the official Apache Spark download page.

2. The latest Spark release is selected by default. At the time of writing, this is Spark 3.5.3, pre-built for Apache Hadoop 3.3.

3. Click the spark-3.5.3-bin-hadoop3.tgz download link.

Downloading the latest Apache Spark version.

4. Select a location from a list of mirror servers to begin the download.

Apache Spark mirror links.

5. Save the file, for example, to the Downloads folder.
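
If you prefer to download from the command line, curl can also fetch the archive. The example URL below assumes the release is still hosted on the primary Apache mirror; older releases move to archive.apache.org, so adjust the link if necessary:

curl -L -o %USERPROFILE%\Downloads\spark-3.5.3-bin-hadoop3.tgz https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz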

Step 3: Verify Spark Software File

To verify that you are working with an unaltered, uncorrupted file, check its checksum:

1. Navigate to the Spark Download page and open the checksums link, preferably in a new tab.

Apache Spark checksums link.

2. Open Command Prompt and use the cd command to navigate to the folder where you downloaded Apache Spark. For example, if the file is in the Downloads folder, enter:

cd C:\Users\username\Downloads

Replace username with your actual Windows username.

3. Use the following command to calculate the downloaded file's checksum:

certutil -hashfile spark-3.5.3-bin-hadoop3.tgz SHA512
A checksum calculation for an Apache Spark download on Windows.

The system displays the SHA-512 hash as a long alphanumeric string, followed by the message CertUtil: -hashfile command completed successfully.

4. Manually compare the checksum output with the one on the Apache Spark website.

If they match, the file is legitimate.
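
If you prefer not to compare the strings by eye, PowerShell can do it for you. The following is a minimal sketch that assumes you paste the checksum published on the Spark website in place of the placeholder; it prints True when the hashes match:

$expected = "<paste the checksum from the Spark website here>" -replace '\s',''
(Get-FileHash .\spark-3.5.3-bin-hadoop3.tgz -Algorithm SHA512).Hash -eq $expected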

Step 4: Install Apache Spark

To install Apache Spark, extract the downloaded file to a desired location:

1. For example, create a new Spark folder in the root of the C: drive using the following command:

cd \ && mkdir Spark

2. Use 7-Zip to extract the Spark .tar file from the .tgz compressed file and place it in the Spark folder:

"C:\Program Files\7-Zip\7z.exe" x "C:\Users\username\Downloads\spark-3.5.3-bin-hadoop3.tgz" -oC:\Spark
Unzipping the Spark .tgz file using 7-Zip in Windows.

Replace username in the file path with your username.

Note: The file paths are wrapped in double quotes because they contain spaces (for example, Program Files).

3. Extract the Spark binaries from the Spark .tar file:

"C:\Program Files\7-Zip\7z.exe" x "C:\Spark\spark-3.5.3-bin-hadoop3.tar" -oC:\Spark
Extracting Apache Spark binaries in Windows.

4. Use the dir command to list the contents of the Spark folder:

cd C:\Spark && dir
The unzipped Spark installation folder.

The spark-3.5.3-bin-hadoop3 folder contains the necessary files to run Spark.
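
The intermediate .tar archive is no longer needed once the binaries are extracted. Optionally, delete it to save space:

del C:\Spark\spark-3.5.3-bin-hadoop3.tar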

Step 5: Add winutils.exe File

The winutils utility enables Apache Spark and other Hadoop-based tools to run on Windows. You need to download the winutils.exe file that matches the Hadoop version used by your Spark installation:

1. Create a hadoop\bin folder in the C: drive to store the winutils.exe file:

cd \ && mkdir C:\hadoop\bin

2. Use the curl command to download the file from the winutils GitHub repository into the newly created folder:

curl --ssl-no-revoke -L -o C:\hadoop\bin\winutils.exe https://github.com/cdarlint/winutils/raw/master/hadoop-3.3.5/bin/winutils.exe
Downloading the winutils utility for Spark in Windows.

In this example, the Hadoop version is 3.3.5. If necessary, edit the command to correspond to the Hadoop version your Spark installation uses (a quick way to check this is shown after the list below).

3. Verify that the file has been downloaded:

cd C:\hadoop\bin && dir
The winutils.exe file in the Hadoop folder in Windows.

The winutils.exe file is listed in the folder.
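
If you are not sure which Hadoop version your Spark build targets, one way to check (assuming the install path used in this guide) is to list the bundled Hadoop client jars, whose file names include the version number:

dir C:\Spark\spark-3.5.3-bin-hadoop3\jars\hadoop-client*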

Step 6: Configure Environment Variables

Add the Spark, Java, and Hadoop locations to your system's Path environment variable to run the Spark shell directly from the CLI.

You can edit environment variables manually using the Windows UI. However, setting User Variables from Command Prompt can be faster and more efficient.

Use the setx command to set the SPARK_HOME User Variable:

setx SPARK_HOME "C:\Spark\spark-3.5.3-bin-hadoop3"

To set HADOOP_HOME, enter:

setx HADOOP_HOME "C:\hadoop"

To set JAVA_HOME, type:

setx JAVA_HOME "C:\Zulu\zulu-21"
Set the Spark variables in Windows.

The paths in the commands are set based on the previous installation steps. If you used different versions and file paths, adjust the commands accordingly.

Note that setx writes to the registry but does not update the current session, so open a new Command Prompt window before continuing; otherwise, the variables set above will not expand in the next commands. Then enter the following command to retrieve the current Path from the registry and store it in the oldPath variable:

for /f "tokens=2*" %A in ('reg query "HKCU\Environment" /v Path') do set "oldPath=%B"

Append the Spark, Hadoop, and Java paths to the Path variable without overwriting existing entries:

setx Path "%oldPath%;%SPARK_HOME%\bin;%HADOOP_HOME%\bin;%JAVA_HOME%\bin"
Append Spark paths to Path variable in Windows.

Restart Command Prompt to apply the changes. In rare instances, a full system reboot may be required.
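
After reopening Command Prompt, you can confirm that the variables and Path entries resolved correctly, for example:

echo %SPARK_HOME% && where spark-shell.cmd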

Step 7: Launch Spark

To start the Spark Scala shell, enter the following:

spark-shell

This opens an interactive shell interface with Spark notifications and information. The output includes details about the Spark version, configuration, and available resources.

Starting the Spark Scala shell in Windows.
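
As an optional sanity check, you can run a small computation directly in the shell. The spark-shell provides a preconfigured SparkSession named spark, so the following should return 1000:

spark.range(1000).count()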

To access Spark's Web UI, open a web browser and navigate to the following URL:

http://localhost:4040/

You can replace localhost with the name or IP address of your system. The Spark Web UI provides details about running jobs, stages, and resource usage.

The Executors tab in the Spark Web UI.
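
If port 4040 is already in use, Spark binds the Web UI to the next available port. You can check the actual address from inside the Scala shell, for example:

sc.uiWebUrl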

Use the following command to exit Spark and close the Scala shell:

:quit

Alternatively, press CTRL+D in the shell. To run Spark with a Python interface instead, enter the following command:

pyspark

This launches the Spark shell with a Python interface.

Launching pyspark in Spark in Windows.
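
As with the Scala shell, pyspark provides a preconfigured SparkSession named spark, so you can run the same quick check:

spark.range(1000).count()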

To exit pyspark, type:

quit()

Test Spark

To test the Spark installation, use the Scala interface to read and manipulate a file. In this example, the name of the file is pnaptest.txt. Open Command Prompt and navigate to the folder with the file you want to use:

1. Launch the Spark shell:

spark-shell

2. Load the file in the Spark context by defining a variable and specifying the file name (including its extension):

val x = sc.textFile("pnaptest.txt")

The command loads the file into a Resilient Distributed Dataset (RDD), which allows you to perform actions and transformations on the data.

3. Enter the following command to perform an action on the RDD and view the contents of the file:

x.take(11).foreach(println)
Perform an action on a file in Spark Windows.

Spark displays the first 11 lines of the file. You can adjust the number of lines by changing the number in the take() method.

4. Use the map transformation to create a new RDD based on the existing one. For example, reverse each line of the file using the following command:

val y = x.map(_.reverse)

This creates a new RDD (y) in which each line of the original RDD (x) is reversed.

5. Enter the following command to display the first 11 lines of the transformed RDD:

y.take(11).foreach(println)
Transforming RDDs in Apache Spark in Windows.

The output prints 11 lines of the pnaptest.txt file, with each line reversed.
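
Other RDD operations work the same way. For example, count() is another simple action that returns the number of lines in an RDD; since map() preserves the number of elements, both commands below should return the same value:

x.count()
y.count()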

When ready, enter the :quit command or press CTRL+D to exit the Spark shell.

Conclusion

You now have a working installation of Apache Spark on Windows 10 or 11. This local setup is ideal for running basic tests and getting familiar with Spark's core features.

Read about Spark DataFrames and Spark Streaming to explore more advanced functionalities.
