Overview:
OpenCloud was a data-intensive Hadoop cluster that was in service from 2009 through 2018. It was used by both systems researchers and application researchers. This section is intended for application researchers; more systems-oriented information is available elsewhere.
OpenCloud is composed of compute nodes, storage nodes, and login/master nodes. In general, a user logs into a login node and submits jobs from there. Job control systems like Hadoop allocate compute node processes and manage those processes on the user's behalf. Users monitor status using services available on the login nodes, or via web services. Users do not need to access the compute nodes, storage nodes, or service master nodes directly; all of these other nodes are controlled and managed by the tools and systems invoked from the login nodes.
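For example, a typical Hadoop MapReduce job is compiled into a jar on a login node and submitted with the hadoop command; the cluster then schedules the map and reduce tasks on compute nodes. The word-count driver below is a minimal sketch of such a job, assuming a Hadoop 2.x-style client environment on the login node; the class names and input/output paths are illustrative, not part of any installed software on the cluster.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in its input split.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sums the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            // On a login node, the default Configuration picks up the cluster's
            // Hadoop site files, so the job goes to the cluster's scheduler.
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS (must not already exist)
            System.exit(job.waitForCompletion(true) ? 0 : 1);       // blocks until the cluster finishes the job
        }
    }

Packaged as, say, wordcount.jar, a job like this would be launched from a login node with something like: hadoop jar wordcount.jar WordCount <input dir> <output dir>.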
The storage nodes provide over 250 TB of RAID-protected storage, in addition to the storage resources in the compute and login nodes. There are 10 dedicated storage nodes.
Compute Node Configuration
There are two classes of compute nodes, OCX and OCY. The main differences are disk and network; they are largely equivalent in CPU and RAM. There are currently 104 OCX worker nodes, each with 8 cores, 16 GB DRAM, 4x 1 TB disks, and 10 GbE connectivity between nodes. There are 30 OCY nodes, each with a single 250 GB or 400 GB disk and a 1x GbE network connection. Seen as a single computer, the OpenCloud compute cluster provides over one tera-operation per second of compute, over 1 TB of memory, 296 1 TB disks, and over 40 Gbps of bisection bandwidth.
The compute node configuration is detailed below.
Item        | OCX Specifications                         | OCY Specifications
CPU         | 53x 8-core Xeon E5440 2.83 GHz             | 8-core Xeon L5420 2.50 GHz
CPU         | 11x Xeon E5450 3.00 GHz                    | 8-core Xeon L5420 2.50 GHz
Disks       | 4x 1.0 TB 3.5" SATA disks                  | 1x 250 GB or 400 GB 3.5" SATA disk
Form factor | 1U single system                           |
MTU         | 9000 bytes                                 | 1500 bytes
Network     | 1x QLogic/NetXen QLE3142-CU-CK 10 GbE NIC  | 1x Broadcom NetXtreme BCM5722 GbE NIC
RAM         | 16 GB                                      | 16 GB
There are various master nodes, each with less disk than the compute nodes. In addition to providing login services for users, these nodes provide a range of master services, such as ZooKeeper replicated system state, Hadoop and Maui/Torque job control queues, NFS service and storage, the HDFS master, the HBase master, Ganglia and Nagios monitoring, and others to come.
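As an illustration of how client code reaches one of these master services, the sketch below connects to a ZooKeeper ensemble and lists the top-level znodes. Application code normally reaches these services through the Hadoop and HBase client libraries rather than directly, and the host names in the connect string are placeholders; the actual master node names are site-specific and not given here.

    import java.util.List;
    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class ListRoot {
        public static void main(String[] args) throws Exception {
            // Placeholder connect string: substitute the cluster's real master host names.
            String connect = "master1:2181,master2:2181,master3:2181";

            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper(connect, 30000, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    // Release the latch once the session is established.
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();

            // List the top-level znodes that services such as HBase register under.
            List<String> children = zk.getChildren("/", false);
            for (String child : children) {
                System.out.println("/" + child);
            }
            zk.close();
        }
    }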
Network
The network for the OpenCloud cluster is made up of three 10 GbE switches, as follows:
- Each compute node and storage node connects to one of two 48-port Arista 7148S switches.
- These compute/storage node switches connect to each other and to the head nodes via a 48-port Force10 S4810 head-end switch.
Storage
There are several different storage facilities available to a user of OpenCloud.
Due to their scale, the storage facilities in the cloud cluster are not backed up, so data loss is possible. Back up your data or risk losing it.
- Scratch space on each compute node. This is not replicated and not really managed; it can be lost or deleted by various events in the system. Users should never use it directly. Some services (MapReduce/Hadoop, for example) use it temporarily.
- HDFS storage in the compute cluster. This is replicated, persistent storage, but embedded in the compute cluster. It is the primary storage users will read input from and write output to; a minimal HDFS client is sketched after this list. There may be multiple HDFS services, sometimes to isolate which nodes experience HDFS "interference" from other jobs, and sometimes because experimental services are being put into use. There are TBs of HDFS capacity, but not hundreds of TBs.
- PDL home directories. Inputs and code will often be installed on the login nodes and made available to the compute nodes via NFS. There are GBs of space for home directories.
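As a sketch of how a program touches HDFS from a login node, the example below writes and then reads a small file using the Hadoop FileSystem API. It assumes the default HDFS service configured on the login node; the /user/alice path is illustrative only.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // On a login node, new Configuration() reads the cluster's site files,
            // so FileSystem.get() returns a client for the default HDFS service.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file into the user's HDFS directory (path is illustrative).
            Path out = new Path("/user/alice/example.txt");
            try (FSDataOutputStream os = fs.create(out, true)) {
                os.writeBytes("hello from OpenCloud\n");
            }

            // Read it back and print it.
            try (BufferedReader r = new BufferedReader(new InputStreamReader(fs.open(out)))) {
                String line;
                while ((line = r.readLine()) != null) {
                    System.out.println(line);
                }
            }

            fs.close();
        }
    }

The same operations are available interactively from the login nodes via the hadoop fs command (for example, hadoop fs -put, -get, and -ls).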