Go to the top of the NLANR/DAST web site

AAD | Advisor | Autobuf v2.0 | Multicast Beacon | BIMA | Iperf | NextINet | Tools | Web100 | All Projects


Search this site with Google

About:
- DAST
- NLANR
- FAQ
- Staff
- Contact DAST

End User Tools and Projects
- NextINet
- Advanced Applications
Database

- DAST Projects/Tools
- Network Performance
and Measurement Tools

End User Support
- Getting Started Guide
- Networking Glossary
- Other Projects/Organizations
- Funding Opportunities

Documents
- Guides/Tutorials
- Papers/Articles
- Presentations
- Reference Books

WebCT Courses
- Tuning Applications

Events
- NLANR/DAST Training
- NLANR Packets Calendar
- Idesk Travel Schedule

News
- Press Releases
- Alliance Data Link
- I2 Newswire Archives

Reports & Statistics
- Monthly Updates and QSRs
- Abilene "Weather Map"
- Web Server Stats

Contents

Introduction

Connections

Performance

Methods

Examples

Resources

Glossary

Getting Started Guide
Methods and Issues 
of Distributed Computing

  1. Middleware Solutions
    1. CORBA
    2. Legion
    3. Globus
  2. Challenges
    1. Security
    2. Accessing Remote Resources
    3. Resource Discovery
    4. Comunication/Data Access
    5. Fault Detection/Tolerance
  3. Middleware Summary


The massive growth of the Internet has given rise to a wide variety of protocols, libraries, and applications allowing various resources to talk to one another. Developing distributed applications has become increasingly difficult when faced with the growing number of technologies from which to choose. This situation is further complicated by the growing number of technical issues that need to be addressed when attempting to develop an application.

This chapter attempts to provide basic information about some of the technical issues involved in distributed computing, and how they are addressed by some of the current paradigms. This discussion is limited to TCP/IP, CORBA, Legion, and Globus. Although TCP/IP does not provide a framework to address the various challenges raised in developing applications on the Grid, it is discussed briefly because it is used by most, if not all, of the other distributed computing frameworks. Whether you incorporate Legion, Globus,  CORBA, or some other middlware solution into your application, you are indirectly using TCP/IP at some point to communicate data. Understanding a little about how TCP/IP works can help you see how other distributed computing frameworks perform communication and understand some of the additional overhead resulting in decreased network performance.

1. The TCP/IP Protocols

TCP/IP is a family of network protocols that has gained general acceptance over nearly 20 years. It is responsible for most of the tools you use everyday to browse web pages, download files, and use ping to check other hosts. Below are discussed the three most general protocols -- IP, UDP, and TCP -- and  some simple client/server examples are provided.

The development of TCP/IP created the notion of layered protocols. IP (Internet Protocol), the most basic protocol of TCP/IP, provides the minimum requirements to route a message and move it from one machine to another over a network interface. For example, the ping command queries the network availability of remote hosts using IP (or raw sockets). Sockets are the end points of a connection and are created with and used by function calls. The Connections Section provides more information on network connections. Application developers typically don't program at the IP (Internet Protocol) level because IP is too low-level and does not meet general reliability requirements that most distributed applications have.

The User Datagram Protocol (UDP) provides a bit more functionality than using raw sockets and is a very basic transport protocol built on top of IP. Because UDP does not guarantee that messages will reach their final destinations, UDP is used for certain applications like streaming audio and video where packet loss recovery is not as important. The loss of a packet or two in a whole series of packets comprising a video image does not greatly change the end user's experience.

Like UDP, the Transmission Control Protocol (TCP) is also layered on top of IP but provides more reliabilty by ensuring that packets that are dropped get retransmitted and reach their destination in the order they were sent. TCP addresses the following transport issues:

  • handshaking mechanisms to establish a connection between two machines
  • capabilities for flow control (how much data can be sent at a time)
  • congestion control (what to do when packets are dropped)
  • polling for messages (how long to wait for an incoming packet)
  • retransmission of lost or corrupted data
TCP/IP specifies a sockets API that lets developers write their own network code to transmit data using the TCP or UDP protocol. Servers can be designed that process requests from clients and return updated information. The steps involved in building a server are:
  1. initialize the socket
  2. bind the socket to a chosen port greater than 1024
  3. listen for incoming connections
  4. accept (or not) any incoming requests
In order to connect to the server, the client must:
  1. initialize a socket
  2. connect to the server
Once a connection is established, client and server can read and write data. See Examples  for some examples.

Unfortunately, sockets programming is still too low-level for developers not intimately familiar with some of the more esoteric and finer details of TCP/IP and network programming. For this reason, developers typically rely on distributed computing frameworks, or middleware, discussed below to more effectively enable application development. See W. Richard Stevens, Unix Network Programming, Vol. 1 for more information.

2. Middleware Solutions

Middleware provides a software framework that attempts to overcome many of the limitations of using TCP/IP for communication, as well as supporting advanced capabilities and services for distributed applications. Middleware provides advanced architecture that sits on top of TCP/IP and hides various communications issues and enhanced functionality. It also provides:
  • more robust communications libraries that take advantage of TCP/IP and provide the ability for clients and servers to communicate via remote method invocations or remote procedure calls. This allow clients and servers to specify remote tasks besides just sending data, which is an advantage.
  • advanced architectures for addressing other technical challenges that face the design and development of a distributed application. The most common issues that arise are security, resource management, resource discovery, and fault tolerance in a heterogenous environment.
The next sections focus on the design architecture of some of the more popular choices of middleware such as CORBA, Globus, and Legion. Technical challenges -- security, resource access, resource discovery, communication access, and fault tolerance -- are discussed for CORBA, Globus, and Legion to convey the advantages and disadvantages of each design architecture.

Keep in mind as you read that CORBA's Interface Definition Language makes it possible for both Legion and Globus to interoperate with it. Developers can therefore take advantage of the various services Legion and Globus offer while still working within the CORBA model.

2.1 CORBA

Complex interoperability issues in the business world were acknowledged in 1989 when the Object Management Group (OMG) was established to create a standard for the design of middleware. The result was the Common Object Request Broker Architecture (CORBA). The components of CORBA are illustrated in Figure 1.


Figure 1: Common Object Request Broker Architecture
(Courtesy: Steve Vinoski, IONA)

CORBA defines an Interface Definition Language (IDL) and the Application Programming Interfaces (API) that enable client/server interaction within a particular Object Request Broker (ORB). The ORB is the middleware that establishes the client-server relationships between objects. Client and server applications (objects) are specified using IDL and then compiled into the stub (or skeleton) interfaces (the actual code to describe the object's methods). The languages supported by the IDL compiler depend on the ORB implementation. CORBA clients and servers can be developed to dynamically (at run-time) invoke remote methods using the Dynamic Invocation Interface for the client and the Dynamic Skeleton Interface (DSI) for the server implementation.

2.2 Legion

Legion shares with CORBA its object-based approach. Legion calls itself the "Grid OS," an object-based metasystem designed to connect various computers and networks together and provide the illusion of a single system. However, unlike CORBA, Legion is also intended for high-performance computing users. The Legion architecture is shown in Figure 2.


Figure 2: Overview of Legion Architecture
(Courtesy: Andrew Grimshaw, University of Virginia)

Legion sits on top of a user's operating system, negotiating between that computer's resources and whichever resources or applications are required. Legion handles resources scheduling and security issues. A user-controlled naming system called context space is used, so that users can easily monitor and use objects. A Legion resource maintains Core Objects, such as the host and vault objects used to represent processors and persistent storage respectively. As part of Legion, many high-level Unix-like tools exist for working with Legion objects. Using Legion's Interface Description Language (IDL), implementation code can be written in any of the supported languages: Java, C, C++, Fortran, and the CORBA IDL.

2.3 Globus

Globus, like Legion, focuses on the requirements of the high-performance computing spectrum. The Globus architecture is shown in Figure 3.


Figure 3: Overview of Globus Architecture

Systems with Globus deployed are, like Legion, capable of secure authentication and job scheduling. However, the design approach taken by the Globus developers is entirely unlike Legion or CORBA. Rather than define an object-oriented architecture, Globus provides libraries, written in C, for addressing individual areas like security, remote job submission, file staging, and resource discovery. In the Globus framework, applications can take advantage of only the libraries they need without adopting an entire architecture to create a distributed application. Globus also offers a number of higher-level tools to allow users to stage files, execute programs remotely, and inquire about other resources.

3. Challenges

In this section the differing approaches to using Globus, Legion, or CORBA are compared in the context of five challenge areas: security, communicating with remote resources, resource discovery, communication and data access, and fault tolerance.

3.1 Security

The first technical challenge encountered in harnessing multiple resources consists of authentication and authorization. In this case, it is assumed that authorization -- permission to access a system -- is determined by the systems administrator and subject to local site policy. Authentication, however, is the means by which you establish your identity to the remote site. Generally, sites have their own policies regarding how you can authenticate to a resource. Such policies might include using security services such as SSH (Secure Shell) or Kerberos or perhaps even allowing plain-text passwords (i.e., no secure authentication required).

Your decision to adopt a particular distributed computing framework may be highly dependent on the security requirements of the site resources you plan on using.

CORBA.   The Object Management Group defined four security specifications:

  1. The CORBA Security Service Specification, which provides mechanisms to address delegation (passing credentials around), authentication, and access control using a variety of mechanisms and policies including X.509 certificates (authentication of entities involved in a distributed directory service), SSL (Secure Sockets Layer, a standard security protocol), and Kerberos. The security service is dependent on a vendor's implementation of the Object Request Broker (ORB).
  2. The Secure Interoperability/secIOP Specification provides mechanisms for secure interoperability between different vendor's CORBA security services.
  3. The CORBA ORB-SSL Integration Specification discusses the use of SSL within the Internet Inter-Orb Protocol-level (IIOP) of the ORB. (See Communication below for information on IIOP.)
  4. The CORBA/Firewall Specification describes how a firewall can process IIOP, provide access control to client and server objects and protecting server objects.
Legion.   Legion objects are capable of implementing their own security policies and include the following features:
  • authentication is supported via Kerberos v5, and PKI-based (public key infrastructure) security using X.509 certificates is under development.
  • an access control module known as MayI permits developers to define their own method for gaining access to a Legion object.
  • allowing for three message-layer security modes based on the RSAREF 2.0 libraries: private (encrypted communication), protected (fast digested communication with unforgeable secrets), and no security.
Globus.   Globus relies on the Grid Security Infrastructure (GSI) to support the following features:
  • standard X.509 certificates that are commonly used in secure Internet transactions to provide a user with a global identity.
  • authentication using the Generic Security Services API under Eric Young's implementation of SSL.
  • single sign-on, which means a user only has to login once to use all Grid compute resources to which the user has authorization.
  • Mapping of a global identity, a Grid user identification, to your local resource login is by means of a grid-mapfile that associates the identity with the resource login.

3.2 Accessing Remote Resources

Assuming you have been granted access to a particular computer, you may want to use specific hardware/software resources that must be shared by all users. This brings into question the notion of a resource broker that can take incoming requests and allocate system resources in an efficient and fair manner. In typical client/server applications, the server is responsible for spawning processes and allocating machine resources such as memory, processors, and bandwidth. In a Grid environment, you may want to use several resources simultaneously, called resource co-allocation. Co-allocation represents a difficult scheduling problem because now you need to coordinate resource usage between multiple sites, each operating under its own policy. The level of complexity of a resource broker is largely dependent on the application. For example, a large-scale distributed simulation allowing for remote data visualization may require resource brokers at each site to coordinate the allocation of multiple processors at a given time (called advanced reservation), as well as providing for a certain amount of network bandwidth for the visualization.

CORBA.   Although CORBA provides no direct support for the execution and scheduling of programs on remote computers, this feature could be designed using remote method invocation. Remote method invocation (RMI) allows one CORBA object, or program, to call functions from another CORBA object on another computer. For example, an existing code or existing executable could be "wrapped" into a CORBA server object that could then be invoked by the client using RMI. A drawback in the high-performance computing world is that CORBA has no support for specific scheduling systems such as LSF, PBS, or NQE. It is up to the developer to design server-side code that can create and submit a batch script to the scheduling system if necessary.

Legion.   Legion allows you to run your existing code on Legion resources by following three steps:

  1. compile your code with the Legion libraries and header files
  2. register the code using something like legion_register_program (or legion_mpi_register for an MPI code)
  3. run the program using something like legion_run (or legion_mpi_run for an MPI code)
Legion also supports local scheduling systems such as LoadLeveler, Codine, PBS, and NQS. You can copy and move files across computers by performing legion_cp or legion_mv, Unix-like commands for manipulating Legion objects.

A Collection is a Legion object (what actually runs in the computer) used to collect and maintain information about currently registered host and vault objects. Tools are provided for system administrators of hosts (computational resources) and vaults (storage resources) to set a scheduling policy and add resources to a Collection.

Globus.   When using Globus, globus-job-run is a useful command that is based on the Globus Resource and Allocation Manager (GRAM). As a part of GRAM, a gatekeeper runs on systems deploying Globus. A gatekeeper is responsible for deciding if you have been granted permission to use the resource. It does this by comparing your certificate name with a list of authorized users in a gridmap file. Once authorized, the gatekeeper starts up a job-manager that is responsible for executing and monitoring your job. The gatekeeper can run jobs in interactive mode, or it can submit the job to be run by the local scheduling system on a given queue. Common schedulers are PBS (Portable Batch System), LSF (Load Sharing Facility), LoadLeveler, or NQE (Network Queuing Environment). The Resource Specification Language (RSL) is another Globus component, used by GRAM to specify executables, input, output, and other information for running jobs. RSL strings are attribute value pairs like memory=X or directory=/u/ncsa/novotny.

3.3 Resource Discovery

One of the central problems facing distributed application developers is keeping track of the resources being used and allowing for dynamic resource discovery. Rather than specifying the names of the machines you want to utilize, ideally you would like to specify the application requirements (computing power, network performance, or installed software) that you need to run. An information or naming system could compare your requirements with its dynamic database of resources and find the machines most appropriate for your needs. Not only could a naming service provide capabilities for querying, but, in many instances, you might want to publish relevant data to the database, allowing for other applications to take advantage of this information (e.g., currently running jobs, or benchmark data to be used by later runs).

CORBA.   CORBA provides various services that are useful for locating CORBA server objects. The Corba Naming Service provides a white-pages directory that can query CORBA objects by name. A CORBA server must register its name and object reference with the naming service, where you can query by name as you do with the white pages of your phonebook. The CORBA Trader Service is useful for querying objects by property, much the way you look services up in the yellow pages.

Legion.   Legion offers a common context space to display Legion objects. Higher-level tools exist for querying hosts (compute resource objects) and vaults (storage server objects) and other objects. However, Legion expects you to add the objects to your context space before it can be queried.

Globus.   In Globus, the Grid Information Service (GIS) helps Grid users locate resources and resource attributes from a dynamic database. Typically, when a resource registers itself under Globus, information such as memory, number of processors, and other capabilities are communicated to an LDAP (Lightweight Directory Access Protocol) server. LDAP has become a standard protocol for maintaining a catalog in a directory tree structure. Information about the Globus implementation of the LDAP server is available on the Globus website. Also provided is information about computer attributes like memory, operating system, cache, and local scheduling system.

3.4 Communication / Data Access

Communication methods lie at the heart of distributed computing and provide the necessary glue for connecting heterogenous resources together. TCP/IP addresses this issue by providing several different communications protocols for the transmission of data. Depending on the application requirements, it may be necessary to support other protocols as well. For instance videoconferencing applications may use unreliable multicast (disbursing multiple copies of a packet to multiple recipients where some packets may be lost) for audio and video. If your application involves high-performance compute resources, then you need to think about the communications methods used to share memory or pass messages between processors. Typically this takes the form of threads or OpenMP directives to share memory and MPI or PVM for message passing.

CORBA.   In CORBA, communication between client and server is done by the object request broker (ORB) allowing remote method invocations. The underlying protocol between ORBs is the General Inter-Orb Protocol (GIOP). GIOP specifies the message layouts and types. The common data representation (CDR) details the primitive data types, the composition of constructed types, and the format and layout of the various pseudo-objects. The CDR takes care of the data conversion between client and server machines. The GIOP has certain transport assumptions. For instance, it is based on a connection-oriented and reliable byte stream, which fits into the model of TCP/IP. The Internet Inter-Orb Protocol (IIOP) is a mapping of GIOP into a TCP/IP session. It is a commonly deployed standard between ORBs running on the Internet.

Legion.   Under Legion, communication between objects is achieved via non-blocking remote method invocations. Method invocations are subject to a scheduling and security policy. Legion uses a three-level naming system (see Figure 4).


Figure 4: Legion Object

Legion objects are maintained in a single, shared context space and are referred to by their context names. Context names are mapped to LOIDs (Legion Object Identifiers) that contain a public key from RSA Security Inc. Because LOIDs are location-independent, they are mapped to a LOA (Legion Object Address) for communication. A LOA is a physical address that contains the required information (IP address and port number) to permit other objects to communicate with the object. MPI processes and PVM tasks are represented as Legion objects, enabling heterogenous parallel computing.

Globus.   Similar to TCP/IP, Globus also provides layered communication protocols above TCP/IP. The simplest, globus_io, is a communications library that sits atop TCP/IP and provides a higher level sockets library capable of asynchronous, multithreaded communication. This communication is possible without you knowing the ugly details about asynchronous I/O and POSIX threads. The globus_io library not only allows you to set socket attributes, but uses GSI, the Grid Security Infrastructure to create secure communications channels and authorize connections. The globus_io library is used by GRAM for client/gatekeeper handshaking and by the Globus Access to Secondary Storage (GASS) component. GASS extends the globus_io library by including support for the HTTP/1.1 protocol and GSI, so that data can be transferred securely via https.

Nexus is a core communications library in Globus that provides a multimethod communications protocol capable of supporting TCP, UDP, reliable and unreliable multicast, shared memory and Posix threads. Nexus is intended to be used more by compiler writers and low-level Grid developers. One of the most exciting uses of Nexus is the MPICH-G library. MPICH-G allows parallel programs that have been written using MPI to run effectively over the wide-area network across heterogenous machines simply by compiling and linking your MPI code with the MPICH-G headers and libraries.

3.5 Fault Detection / Tolerance

What happens if one of the computers required for your simulation or the electron microscope you want to gather data from goes down? In an ever-changing Grid environment, you would like to have some way of knowing which resources may be down and which ones you can use. Providing basic fault detection can pave the way for the application developers to incorporate dynamic reconfiguration in a distributed application or checkpointing/restarting in the case where one of the resources required in a computation goes down.

CORBA.   In CORBA, fault tolerance must be devised by the application developer. One approach is object/data replication. In order to minimize machine failure, a server object is replicated across multiple servers. Databases, too, may be mirrored (replicated) at additional sites. A client request may be handled by a server that contains a pointer to a replicated server and database. If the replicated server dies, the query can be re-routed to another server, transparently to the client. As long as one replicated server is still up and running, the client will eventually get the result.

Legion.   Legion provides a checkpointing library for SPMD-style (Single Program Multiple Data) applications. In SPMD applications, each task is typically responsible for a region of data and exchanging data between neighboring tasks. Library routines can be used to create a "checkpoint" file that periodically saves the necessary data that is required when restarting the program in case of a failure. Special options used when running a program via legion_mpi_run (see Connections) can activate checkpointing and specify how long after a failure the application should be restarted (if at all).

Globus.   The Globus Heart Beat Monitor (HBM) provides a library intended to allow applications to register with a local monitor and provide on-going status to a data collector located on another machine. The data collector can then be queried for status of the running application. As of this writing, the Heart Beat Monitor (HBM) is not quite ready for prime-time, but may evolve into a Globus event model in the future.
 

4. Middleware Summary

This table summarizes the various features of each of the three paradigms described in this chapter. For more information refer to the appropriate section of this chapter.
 
CORBA
Legion
Globus
Security Kerberos, SSL, and others Kerberos v5; GSS API to come GSI offers GSS API, SSL with X509 certificates, Kerberos
Accessing resources Invokes methods on other objects Tools permit remote job submission, file staging Tools permit remote job submission, file staging
Resource discovery Offers naming and trading services Limited MDS handles resource information
Communication ORB used for method invocation LOIDs  Nexus and globus_io library used for remote procedure calls
Fault tolerance None Checkpointing via libraries, C, C++ Can query for machine status


Contents

Introduction

Connections

Performance

Methods

Examples

Resources

Glossary


Contact DASTBlank Space Last reviewed: December 31, 1969
NLANR || Applications Support || Engineering Support || Measurement and Network Analysis