About:
- DAST
- NLANR
- FAQ
- Staff
- Contact DAST
End User Tools and Projects
- NextINet
- Advanced Applications
  Database
- DAST Projects/Tools
- Network Performance
  and Measurement Tools
End User Support
- Getting Started Guide
- Networking Glossary
- Other Projects/Organizations
- Funding Opportunities
Documents
- Guides/Tutorials
- Papers/Articles
- Presentations
- Reference Books
WebCT Courses
- Tuning Applications
Events
- NLANR/DAST Training
-
NLANR Packets Calendar
- Idesk Travel Schedule
News
- Press Releases
- Alliance Data Link
- I2 Newswire Archives
Reports & Statistics
- Monthly Updates and QSRs
-
Abilene "Weather Map"
- Web Server Stats
|
Getting Started Guide
Methods and Issues
of Distributed Computing
-
Middleware Solutions
-
CORBA
-
Legion
-
Globus
-
Challenges
-
Security
-
Accessing Remote Resources
-
Resource Discovery
-
Comunication/Data Access
-
Fault Detection/Tolerance
-
Middleware Summary
The massive growth of the Internet has given rise to a wide variety
of protocols, libraries, and applications allowing various resources to
talk to one another. Developing distributed applications has become increasingly
difficult when faced with the growing number of technologies from which
to choose. This situation is further complicated by the growing number
of technical issues that need to be addressed when attempting to develop
an application.
This chapter attempts to provide basic information about some of the
technical issues involved in distributed computing, and how they are addressed
by some of the current paradigms. This discussion is limited to TCP/IP,
CORBA, Legion, and Globus. Although TCP/IP does not provide a framework
to address the various challenges raised in developing applications on
the Grid, it is discussed briefly because it is used by most, if not all,
of the other distributed computing frameworks. Whether you incorporate
Legion, Globus, CORBA, or some other middlware solution into your
application, you are indirectly using TCP/IP at some point to communicate
data. Understanding a little about how TCP/IP works can help you see how
other distributed computing frameworks perform communication and understand
some of the additional overhead resulting in decreased network performance.
1. The TCP/IP Protocols
TCP/IP is a family of network protocols that has gained general acceptance
over nearly 20 years. It is responsible for most of the tools you use everyday
to browse web pages, download files, and use ping to check other
hosts. Below are discussed the three most general protocols -- IP, UDP, and
TCP -- and some simple client/server examples
are provided.
The development of TCP/IP created the notion of layered protocols. IP
(Internet Protocol), the most basic protocol of TCP/IP, provides the minimum
requirements to route a message and move it from one machine to another
over a network interface. For example, the ping command queries
the network availability of remote hosts using IP (or raw sockets). Sockets
are the end points of a connection and are created with and used by function
calls. The Connections Section provides more information
on network connections. Application developers
typically don't program at the IP (Internet Protocol) level because IP
is too low-level and does not meet general reliability requirements that
most distributed applications have.
The User Datagram Protocol (UDP) provides a bit more functionality than
using raw sockets and is a very basic transport protocol built on top of
IP. Because UDP does not guarantee that messages will reach their final
destinations, UDP is used for certain applications like streaming audio
and video where packet loss recovery is not as important. The loss of a packet or
two in a whole series of packets comprising a video image does not greatly
change the end user's experience.
Like UDP, the Transmission Control Protocol (TCP) is also layered on
top of IP but provides more reliabilty by ensuring that packets that are
dropped get retransmitted and reach their destination in the order they
were sent. TCP addresses the following transport issues:
-
handshaking mechanisms to establish a connection between two machines
-
capabilities for flow control (how much data can be sent at a time)
-
congestion control (what to do when packets are dropped)
-
polling for messages (how long to wait for an incoming packet)
-
retransmission of lost or corrupted data
TCP/IP specifies a sockets API that lets developers write their own network
code to transmit data using the TCP or UDP protocol. Servers can be designed
that process requests from clients and return updated information. The
steps involved in building a server are:
-
initialize the socket
-
bind the socket to a chosen port greater than 1024
-
listen for incoming connections
-
accept (or not) any incoming requests
In order to connect to the server, the client must:
-
initialize a socket
-
connect to the server
Once a connection is established, client and server can read and write
data. See Examples for some examples.
Unfortunately, sockets programming is still too low-level for developers
not intimately familiar with some of the more esoteric and finer details
of TCP/IP and network programming. For this reason, developers typically
rely on distributed computing frameworks, or middleware, discussed
below to more effectively enable application development. See W. Richard
Stevens,
Unix
Network Programming, Vol. 1 for more information.
2. Middleware Solutions
Middleware provides a software framework that attempts to overcome
many of the limitations of using TCP/IP for communication, as well as supporting
advanced capabilities and services for distributed applications. Middleware
provides advanced architecture that sits on top of TCP/IP and hides various
communications issues and enhanced functionality. It also provides:
-
more robust communications libraries that take advantage of TCP/IP and
provide the ability for clients and servers to communicate via remote method
invocations or remote procedure calls. This allow clients and servers to
specify remote tasks besides just sending data, which is an advantage.
-
advanced architectures for addressing other technical challenges that face
the design and development of a distributed application. The most common
issues that arise are security, resource management, resource discovery,
and fault tolerance in a heterogenous environment.
The next sections focus on the design architecture of some of the more
popular choices of middleware such as CORBA, Globus, and Legion. Technical
challenges -- security, resource access, resource discovery, communication
access, and fault tolerance -- are discussed for CORBA, Globus, and Legion
to convey the advantages and disadvantages of each design architecture.
Keep in mind as you read that CORBA's Interface Definition Language
makes it possible for both Legion and Globus to interoperate with it. Developers
can therefore take advantage of the various services Legion and Globus
offer while still working within the CORBA model.
2.1 CORBA
Complex interoperability issues in the business world were acknowledged
in 1989 when the Object Management Group
(OMG) was established to create a standard for the design of middleware.
The result was the Common Object Request Broker
Architecture (CORBA). The components of CORBA are illustrated in Figure
1.
Figure 1: Common Object Request Broker Architecture
(Courtesy: Steve Vinoski, IONA)
CORBA defines an Interface Definition Language (IDL) and the Application
Programming Interfaces (API) that enable client/server interaction within
a particular Object Request Broker (ORB). The ORB is the middleware that
establishes the client-server relationships between objects. Client and
server applications (objects) are specified using IDL and then compiled
into the stub (or skeleton) interfaces (the actual code to
describe the object's methods). The languages supported by the IDL compiler
depend on the ORB implementation. CORBA clients and servers can be developed
to dynamically (at run-time) invoke remote methods using the Dynamic Invocation
Interface for the client and the Dynamic Skeleton Interface (DSI) for the
server implementation.
2.2 Legion
Legion shares with CORBA
its object-based approach. Legion calls itself the "Grid OS," an object-based
metasystem designed to connect various computers and networks together
and provide the illusion of a single system. However, unlike CORBA, Legion
is also intended for high-performance computing users. The Legion architecture
is shown in Figure 2.
Figure 2: Overview of Legion Architecture
(Courtesy: Andrew Grimshaw, University of Virginia)
Legion sits on top of a user's operating system, negotiating between
that computer's resources and whichever resources or applications are required.
Legion handles resources scheduling and security issues. A user-controlled
naming system called context space is used, so that users can easily
monitor and use objects. A Legion resource maintains Core Objects, such
as the host and vault objects used to represent processors and persistent
storage respectively. As part of Legion, many high-level Unix-like tools
exist for working with Legion objects. Using Legion's Interface Description
Language (IDL), implementation code can be written in any of the supported
languages: Java, C, C++, Fortran, and the CORBA IDL.
2.3 Globus
Globus, like Legion, focuses on the
requirements of the high-performance computing spectrum. The Globus architecture
is shown in Figure 3.
Figure 3: Overview of Globus Architecture
Systems with Globus deployed are, like Legion, capable of secure authentication
and job scheduling. However, the design approach taken by the Globus developers
is entirely unlike Legion or CORBA. Rather than define an object-oriented
architecture, Globus provides libraries, written in C, for addressing individual
areas like security, remote job submission, file staging, and resource
discovery. In the Globus framework, applications can take advantage of
only the libraries they need without adopting an entire architecture to
create a distributed application. Globus also offers a number of higher-level
tools to allow users to stage files, execute programs remotely, and inquire
about other resources.
3. Challenges
In this section the differing approaches to using Globus, Legion, or CORBA
are compared in the context of five challenge areas: security, communicating
with remote resources, resource discovery, communication and data access,
and fault tolerance.
3.1 Security
The first technical challenge encountered in harnessing multiple resources
consists of authentication and authorization. In this case, it is assumed
that authorization -- permission to access a system -- is determined
by the systems administrator and subject to local site policy. Authentication,
however, is the means by which you establish your identity to the remote
site. Generally, sites have their own policies regarding how you can authenticate
to a resource. Such policies might include using security services such
as SSH (Secure Shell) or Kerberos
or perhaps even allowing plain-text passwords (i.e., no secure authentication
required).
Your decision to adopt a particular distributed computing framework
may be highly dependent on the security requirements of the site resources
you plan on using.
CORBA. The Object Management Group defined four security
specifications:
-
The CORBA Security Service Specification, which provides mechanisms to
address delegation (passing credentials around), authentication, and access
control using a variety of mechanisms and policies including X.509 certificates
(authentication of entities involved in a distributed directory service),
SSL (Secure Sockets Layer, a standard security protocol), and Kerberos.
The security service is dependent on a vendor's implementation of the Object
Request Broker (ORB).
-
The Secure Interoperability/secIOP Specification provides mechanisms for
secure interoperability between different vendor's CORBA security services.
-
The CORBA ORB-SSL Integration Specification discusses the use of SSL within
the Internet Inter-Orb Protocol-level (IIOP) of the ORB. (See Communication
below for information on IIOP.)
-
The CORBA/Firewall Specification describes how a firewall can process IIOP,
provide access control to client and server objects and protecting server
objects.
Legion. Legion objects are capable of implementing their
own security policies and include the following features:
-
authentication is supported via Kerberos v5, and PKI-based (public
key infrastructure) security using X.509 certificates is under development.
-
an access control module known as MayI permits developers to define
their own method for gaining access to a Legion object.
-
allowing for three message-layer security modes based on the RSAREF 2.0
libraries: private (encrypted communication), protected (fast digested
communication with unforgeable secrets), and no security.
Globus. Globus relies on the Grid
Security Infrastructure (GSI) to support the following features:
-
standard X.509 certificates that are commonly used in secure Internet transactions
to provide a user with a global identity.
-
authentication using the Generic Security Services API under Eric Young's
implementation of SSL.
-
single sign-on, which means a user only has to login once to use all Grid
compute resources to which the user has authorization.
-
Mapping of a global identity, a Grid user identification, to your
local resource login is by means of a grid-mapfile that associates
the identity with the resource login.
3.2 Accessing Remote Resources
Assuming you have been granted access to a particular computer, you may
want to use specific hardware/software resources that must be shared by
all users. This brings into question the notion of a resource broker that
can take incoming requests and allocate system resources in an efficient
and fair manner. In typical client/server applications, the server is responsible
for spawning processes and allocating machine resources such as memory,
processors, and bandwidth. In a Grid environment, you may want to use several
resources simultaneously, called resource co-allocation. Co-allocation
represents a difficult scheduling problem because now you need to coordinate
resource usage between multiple sites, each operating under its own policy.
The level of complexity of a resource broker is largely dependent on the
application. For example, a large-scale distributed simulation allowing
for remote data visualization may require resource brokers at each site
to coordinate the allocation of multiple processors at a given time (called
advanced
reservation), as well as providing for a certain amount of network
bandwidth for the visualization.
CORBA. Although CORBA provides no direct support
for the execution and scheduling of programs on remote computers, this
feature could be designed using remote method invocation. Remote method
invocation (RMI) allows one CORBA object, or program, to call functions
from another CORBA object on another computer. For example, an existing
code or existing executable could be "wrapped" into a CORBA server object
that could then be invoked by the client using RMI. A drawback in the high-performance
computing world is that CORBA has no support for specific scheduling systems
such as LSF, PBS, or NQE. It is up to the developer to design server-side
code that can create and submit a batch script to the scheduling system
if necessary.
Legion. Legion allows you to run your existing code
on Legion resources by following three steps:
-
compile your code with the Legion libraries and header files
-
register the code using something like legion_register_program
(or legion_mpi_register for an MPI code)
-
run the program using something like legion_run (or legion_mpi_run
for an MPI code)
Legion also supports local scheduling systems such as LoadLeveler, Codine,
PBS, and NQS. You can copy and move files across computers by performing
legion_cp
or legion_mv, Unix-like commands for manipulating Legion objects.
A Collection is a Legion object (what actually runs in
the computer) used to collect and maintain information about currently
registered host and vault objects. Tools are provided for system administrators
of hosts (computational resources) and vaults (storage resources) to set
a scheduling policy and add resources to a Collection.
Globus. When using Globus, globus-job-run
is a useful command that is based on the Globus
Resource and Allocation Manager (GRAM). As a part of GRAM, a gatekeeper
runs on systems deploying Globus. A gatekeeper is responsible for
deciding if you have been granted permission to use the resource. It does
this by comparing your certificate name with a list of authorized users
in a gridmap file. Once authorized, the gatekeeper starts up a job-manager
that is responsible for executing and monitoring your job. The gatekeeper
can run jobs in interactive mode, or it can submit the job to be run by
the local scheduling system on a given queue. Common schedulers are PBS
(Portable Batch System), LSF (Load Sharing Facility), LoadLeveler, or NQE
(Network Queuing Environment). The Resource
Specification Language (RSL) is another Globus component, used by GRAM
to specify executables, input, output, and other information for running
jobs. RSL strings are attribute value pairs like memory=X or
directory=/u/ncsa/novotny.
3.3 Resource Discovery
One of the central problems facing distributed application developers is
keeping track of the resources being used and allowing for dynamic resource
discovery. Rather than specifying the names of the machines you want to
utilize, ideally you would like to specify the application requirements
(computing power, network performance, or installed software) that you
need to run. An information or naming system could compare your requirements
with its dynamic database of resources and find the machines most appropriate
for your needs. Not only could a naming service provide capabilities for
querying, but, in many instances, you might want to publish relevant data
to the database, allowing for other applications to take advantage of this
information (e.g., currently running jobs, or benchmark data to be used
by later runs).
CORBA. CORBA provides various services that are useful
for locating CORBA server objects. The Corba Naming Service provides a
white-pages directory that can query CORBA objects by name. A CORBA server
must register its name and object reference with the naming service, where
you can query by name as you do with the white pages of your phonebook.
The CORBA Trader Service is useful for querying objects by property, much
the way you look services up in the yellow pages.
Legion. Legion offers a common context space to display
Legion objects. Higher-level tools exist for querying hosts (compute resource
objects) and vaults (storage server objects) and other objects. However,
Legion expects you to add the objects to your context space before it can
be queried.
Globus. In Globus, the Grid Information Service (GIS)
helps Grid users locate resources and resource attributes from a dynamic
database. Typically, when a resource registers itself under Globus, information
such as memory, number of processors, and other capabilities are communicated
to an LDAP (Lightweight Directory Access Protocol) server. LDAP has become
a standard protocol for maintaining a catalog in a directory tree structure.
Information about the Globus
implementation of the LDAP server is available on the Globus website.
Also provided is information about computer attributes like memory, operating
system, cache, and local scheduling system.
3.4 Communication / Data Access
Communication methods lie at the heart of distributed computing and provide
the necessary glue for connecting heterogenous resources together. TCP/IP
addresses this issue by providing several different communications protocols
for the transmission of data. Depending on the application requirements,
it may be necessary to support other protocols as well. For instance videoconferencing
applications may use unreliable multicast (disbursing multiple copies of
a packet to multiple recipients where some packets may be lost) for audio
and video. If your application involves high-performance compute resources,
then you need to think about the communications methods used to share memory
or pass messages between processors. Typically this takes the form of threads
or OpenMP directives to share memory and MPI or PVM for message passing.
CORBA. In CORBA, communication between client and
server is done by the object request broker (ORB) allowing remote method
invocations. The underlying protocol between ORBs is the General Inter-Orb
Protocol (GIOP). GIOP specifies the message layouts and types. The common
data representation (CDR) details the primitive data types, the composition
of constructed types, and the format and layout of the various pseudo-objects.
The CDR takes care of the data conversion between client and server machines.
The GIOP has certain transport assumptions. For instance, it is based on
a connection-oriented and reliable byte stream, which fits into the model
of TCP/IP. The Internet Inter-Orb Protocol (IIOP) is a mapping of GIOP
into a TCP/IP session. It is a commonly deployed standard between ORBs
running on the Internet.
Legion. Under Legion, communication between objects
is achieved via non-blocking remote method invocations. Method invocations
are subject to a scheduling and security policy. Legion uses a three-level
naming system (see Figure 4).
Figure 4: Legion Object
Legion objects are maintained in a single, shared context space and
are referred to by their context names. Context names are mapped
to LOIDs (Legion Object Identifiers) that contain a public key from RSA
Security Inc. Because LOIDs are location-independent, they are mapped to
a LOA (Legion Object Address) for communication. A LOA is a physical address
that contains the required information (IP address and port number) to
permit other objects to communicate with the object. MPI processes and
PVM tasks are represented as Legion objects, enabling heterogenous parallel
computing.
Globus. Similar to TCP/IP, Globus also provides layered
communication protocols above TCP/IP. The simplest, globus_io,
is a communications library that sits atop TCP/IP and provides a higher
level sockets library capable of asynchronous, multithreaded communication.
This communication is possible without you knowing the ugly details about
asynchronous I/O and POSIX threads. The globus_io library not
only allows you to set socket attributes, but uses GSI, the Grid Security
Infrastructure to create secure communications channels and authorize connections.
The globus_io library is used by GRAM for client/gatekeeper handshaking
and by the Globus Access to Secondary
Storage (GASS) component. GASS extends the globus_io library
by including support for the HTTP/1.1 protocol and GSI, so that data can
be transferred securely via https.
Nexus is a core communications
library in Globus that provides a multimethod communications protocol capable
of supporting TCP, UDP, reliable and unreliable multicast, shared memory
and Posix threads. Nexus is intended to be used more by compiler writers
and low-level Grid developers. One of the most exciting uses of Nexus is
the MPICH-G library. MPICH-G allows
parallel programs that have been written using MPI to run effectively over
the wide-area network across heterogenous machines simply by compiling
and linking your MPI code with the MPICH-G headers and libraries.
3.5 Fault Detection / Tolerance
What happens if one of the computers required for your simulation or the
electron microscope you want to gather data from goes down? In an ever-changing
Grid environment, you would like to have some way of knowing which resources
may be down and which ones you can use. Providing basic fault detection
can pave the way for the application developers to incorporate dynamic
reconfiguration in a distributed application or checkpointing/restarting
in the case where one of the resources required in a computation goes down.
CORBA. In CORBA, fault tolerance must be devised
by the application developer. One approach is object/data replication.
In order to minimize machine failure, a server object is replicated across
multiple servers. Databases, too, may be mirrored (replicated) at additional
sites. A client request may be handled by a server that contains a pointer
to a replicated server and database. If the replicated server dies, the
query can be re-routed to another server, transparently to the client.
As long as one replicated server is still up and running, the client will
eventually get the result.
Legion. Legion provides a checkpointing library for
SPMD-style (Single Program Multiple Data) applications. In SPMD applications,
each task is typically responsible for a region of data and exchanging
data between neighboring tasks. Library routines can be used to create
a "checkpoint" file that periodically saves the necessary data that is
required when restarting the program in case of a failure. Special options
used when running a program via legion_mpi_run (see
Connections) can activate checkpointing and specify how long after a failure
the application should be restarted (if at all).
Globus. The Globus Heart
Beat Monitor (HBM) provides a library intended to allow applications
to register with a local monitor and provide on-going status to a data
collector located on another machine. The data collector can then be
queried for status of the running application. As of this writing, the
Heart Beat Monitor (HBM) is not quite ready for prime-time, but may evolve
into a Globus event model in the future.
4. Middleware Summary
This table summarizes the various features of each of the three paradigms
described in this chapter. For more information refer to the appropriate
section of this chapter.
|
CORBA
|
Legion
|
Globus
|
| Security |
Kerberos, SSL, and others |
Kerberos v5; GSS API to come |
GSI offers GSS API, SSL with X509 certificates,
Kerberos |
| Accessing resources |
Invokes methods on other objects |
Tools permit remote job submission, file staging |
Tools permit remote job submission, file staging |
| Resource discovery |
Offers naming and trading services |
Limited |
MDS handles resource information |
| Communication |
ORB used for method invocation |
LOIDs |
Nexus and globus_io library used for remote procedure
calls |
| Fault
tolerance |
None |
Checkpointing via libraries, C, C++ |
Can query for machine status |
|