Recently, I stumbled on a website (https://therinspark.com/) that made me want to learn more about big data analytics with Hadoop and Spark using R. There is more than one interface for learning big data analytics with Hadoop and Spark, and Python is the usual choice, but I have no experience with Python and I didn't want to mess up my working computer environment. So, I decided to jump in and get my hands dirty preparing the environment.
So, I followed the instructions written on the web. The first obstacle was installing Hadoop through the sparklyr library. Either my connection was not good enough or the repository mirror did not tolerate the delay, and the download of spark-3.2.3-bin-hadoop2.7 failed. After poking around for other ways to install Spark-Hadoop, I found a clue that I could install it another way. Here are the steps I took to make Spark-Hadoop work with R on my Manjaro Linux:
1. Install the sparklyr package using install.packages("sparklyr").
2. Load the library: library(sparklyr).
3. Check the available Spark-Hadoop versions using spark_available_versions().
4. The newest version is Spark 3.3, but trying to download it gave an error message.
5. So, I used version 3.2 instead and executed spark_install(version = "3.2").
Fig 2 shows that the spark-hadoop download failed. However, the error message included the download link, so I downloaded the file with a web browser and put it in the working directory.
6. Load the sparklyr library and execute spark_install_tar("<file name>").
7. The installation process will begin; once it finishes, you can use Spark-Hadoop for learning big data analytics.
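The steps above can be sketched as a short R session. This is only a sketch: the tarball name below assumes the Spark 3.2.3 / Hadoop 2.7 file mentioned earlier, so adjust it to whatever file you actually downloaded.

```r
# Install and load sparklyr (skip the install if it is already present)
install.packages("sparklyr")
library(sparklyr)

# List the Spark/Hadoop versions that sparklyr knows how to install
spark_available_versions()

# Try the automatic download first; if the mirror times out,
# note the download URL printed in the error message
spark_install(version = "3.2")

# Fallback: download the tarball manually in a browser, place it in the
# working directory, then install from the local file.
# The file name assumes the spark-3.2.3-bin-hadoop2.7 build from above.
spark_install_tar("spark-3.2.3-bin-hadoop2.7.tgz")

# Quick sanity check: open and close a local Spark connection
sc <- spark_connect(master = "local")
spark_disconnect(sc)
```

If the connection opens without errors, the local Spark-Hadoop installation is ready to use.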
Note: don't forget to check your JDK version; the book uses OpenJDK 8. If your Java version is newer, consider downgrading Java and setting your system's JAVA_HOME environment variable accordingly.
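On Manjaro (Arch-based), installed Java runtimes are managed with archlinux-java. The commands below are a sketch of the downgrade, assuming the OpenJDK 8 package is named jdk8-openjdk and installs under /usr/lib/jvm/java-8-openjdk (verify both on your system):

```shell
# Show the currently active Java version
java -version

# Install OpenJDK 8 and make it the default (package name may vary)
sudo pacman -S jdk8-openjdk
archlinux-java status                    # list installed Java environments
sudo archlinux-java set java-8-openjdk   # switch the system default

# Point JAVA_HOME at the JDK 8 install so Spark picks it up
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
```

Adding the export line to your shell profile makes the setting persistent across sessions.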