Stoat Cluster
NOTE: As of Summer 2018, the PDL Stoat cluster is no longer available. All of the documentation below is strictly for historical purposes.
About Stoat
Stoat is an Apache Hadoop computing facility running Cloudera CDH5.4 with the following components:
- HDFS
- Hive
- Oozie
- Sqoop2
- YARN
- ZooKeeper
About Hadoop
Hadoop is an open-source software framework for distributed storage and distributed processing of large datasets. You can find more information about Hadoop from the [[https://hadoop.apache.org/][Apache Software Foundation]].
About Cloudera
Cloudera is a software company that provides Apache Hadoop-based software, support, and services. Cloudera's Hadoop distribution (CDH) includes several Apache-licensed open-source projects that work with Hadoop.
Getting Started
Requesting an account
Please contact us to be added to the stoat group.
Accessing the cluster
To access the cluster, you need to be on campus or connected to the
General Campus VPN.
Proxy Server
If you want to monitor your jobs with a web browser, you will need to configure your web browser to use a
ProxyServer.
Use SSH to open a session on the cluster's login node:
ssh shell.stoat.pdl.local.cmu.edu
From the login node you can launch your jobs.
Submit a job
Run the following command in a Linux shell on the login node:
hadoop jar YourJar.jar YourClass CommandLineArguments
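The general form above can be wrapped in a small shell helper. This is only a sketch under assumptions: the function name submit_job, the DRY_RUN variable, and the jar, class, and path names in the example are all hypothetical and not part of the cluster setup. With DRY_RUN=1 the helper prints the command it would run instead of running it, which may help when checking arguments before submitting.

```shell
# Hypothetical helper: print (dry run) or execute a `hadoop jar` submission.
# Usage: submit_job JAR CLASS [ARGS...]
submit_job() {
    if [ "$#" -lt 2 ]; then
        echo "usage: submit_job JAR CLASS [ARGS...]" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: show the command without contacting the cluster.
        echo "hadoop jar $*"
    else
        hadoop jar "$@"
    fi
}

# Example with placeholder names (not real files on the cluster).
DRY_RUN=1
submit_job wordcount.jar org.example.WordCount /user/alice/in /user/alice/out
# prints: hadoop jar wordcount.jar org.example.WordCount /user/alice/in /user/alice/out
```

Unsetting DRY_RUN (or setting it to 0) makes the same call run the job for real.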
Example Jobs:
MapReduce
hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar pi 10 10
Spark (YARN Client Mode)
source /etc/spark/conf/spark-env.sh
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client $SPARK_HOME/lib/spark-examples.jar 100
Monitoring your jobs
You can monitor applications at
http://rm.stoat.pdl.local.cmu.edu:8088/ (note that you have to configure a
ProxyServer first).
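Besides the web interface, the YARN ResourceManager also exposes a REST API on the same port. The sketch below builds the status URL for one application. The helper name app_status_url and the application ID are hypothetical placeholders, and whether the REST API was reachable on Stoat is an assumption.

```shell
# ResourceManager address, taken from the monitoring URL above.
RM_URL="http://rm.stoat.pdl.local.cmu.edu:8088"

# Hypothetical helper: print the REST endpoint that reports one application's status.
app_status_url() {
    echo "$RM_URL/ws/v1/cluster/apps/$1"
}

# Placeholder application ID; real IDs are printed when a job is submitted.
app_status_url application_0000000000000_0001

# From the login node, the YARN command-line client offers similar information:
#   yarn application -list
#   yarn application -status <application ID>
```

The printed URL could be fetched with a tool such as curl, through the proxy if you are not on the cluster network.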