Data mining usually requires performing work on several hosts in
the cluster. Running the processes in parallel on these machines
can speed up this work. Refer to Brewster Kahle's P-Machine
working notes for the underlying ideas of how this is
done on the archive.
Creating parallel data mining jobs is best approached by first
writing code that performs the desired operation
on just one machine. Once this code works properly, the program
p2 can be used to launch it in parallel on the
specified remote hosts.
p2 will establish all of the connections with the remote
hosts, launch the specified job on the remote hosts, and collect
the data that each data mining job on each host returns. The
gist of p2's function is to rsh out to each specified host
and then launch a particular command or sequence of commands
on those hosts. Each host will produce some output and then
feed it back to p2, which will then combine the data flow
into one data stream. The following is the basic syntax to
launch a p2 job:
> p2 'my_data_mining_job' -p $ARCS
The command, including any pipelines and redirection, that is to be
issued to each remote host must be enclosed in single quotes,
followed by the option '-p', followed by the machines on which
the command will be run. Anything between the single
quotes will be executed on each specified remote machine as-is. The
previous example will launch the script my_data_mining_job
on all of the hosts in the archive.
The output from the p2 operation will be returned to whichever terminal
launched it. The output arrives asynchronously and carries no
indication of which machine produced which line. The av_pp tool,
described later in this section, addresses this issue.
Any valid shell command can be placed within the quotes, including Perl
scripts, redirection, and pipes. Keep in mind that any file
creation and redirection will occur independently on each
specified host. Because homeserver is NFS-mounted on each remote
host, each host can access scripts located in a user's personal
directory, which eliminates the need to copy a given script
onto each remote host. This also means that log files from
each process running on each specified remote host can be
saved directly into the user's personal directory.
The remote machines on which to run the p2 command are listed after the
-p flag. For example, to run p2 on only two machines, ia00100
and ia00200, the syntax following the '-p' will be /net/ia00100
/net/ia00200. To run your script on all of the machines
in the first rack, simply use the environment variable $rack1.
To run the script on all of the machines in the archive,
use the environment variable $ARCS.
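
As a rough illustration of why the host list after -p is written without
quotes, the sketch below assumes that a variable such as $rack1 holds a
space-separated list of /net/... paths. That layout is an assumption (this
page does not show the variable's contents), and count_hosts is a
hypothetical helper that is not part of p2.

```shell
#!/bin/sh
# Assumed contents for illustration only; the real $rack1 is set up by the
# archive environment and is not shown in this document.
rack1="/net/ia00100 /net/ia00101 /net/ia00102"

# count_hosts (hypothetical helper): report how many hosts a list expands to.
count_hosts() {
    # Unquoted on purpose: the shell splits the list on whitespace, so each
    # host becomes its own argument, as it would after the -p flag.
    set -- $1
    echo "$#"
}

count_hosts "$rack1"
```

Quoting the variable after -p would instead pass the whole list as a single
argument, which is probably not what is wanted.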
Launch a p2 job on just one host first to make
sure the desired result is achieved, and specify
more hosts only once that test succeeds. p2 creates many
processes and temporary files on each specified remote host
in order to coordinate the job, so canceling a malformed job
can be quite a chore.
When the p2 job is launched, each specified host will return output
one line at a time on standard output. p2 ensures that each output
line is retrieved in its entirety and is not interrupted by
the output data from other hosts. However, the ordering of
the results is not guaranteed. p2 collects data from the standard
output of any machine that provides a complete line of data.
Faster machines will provide data sooner than slower ones,
and this will be reflected in the unpredictable ordering of output
collected by p2.
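
The fan-out and merge behavior described above can be imitated locally. The
following is a minimal sketch and not p2 itself: fan_out is a hypothetical
function, each "host" is just a background sh process on the local machine
standing in for an rsh session, and the way p2 actually keeps lines intact
is not shown in this document.

```shell
#!/bin/sh
# fan_out (hypothetical, local simulation only): run one command per host
# name in parallel and let all of the output flow into a single stream.
fan_out() {
    cmd=$1
    shift
    for host in "$@"; do
        # P is set for each job, mirroring the variable p2 sets per host.
        P=$host sh -c "$cmd" &
    done
    wait    # return only after every simulated host has finished
}

# Each job prints one short line. The lines arrive in whatever order the
# jobs happen to finish, so the order can differ from run to run.
fan_out 'echo "result from $P"' /net/ia00100 /net/ia00101 /net/ia00102
```

Because nothing in the merged stream records where a line came from unless
the command prints $P itself, the output is ambiguous in the same way as
raw p2 output.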
When the command
[homeserver] p2 'uptime' -p /net/ia00100 /net/ia00101
is executed, the resulting output is:
P begun : Fri Jun 28 13:13:28 2002
1:14pm up 21 days, 2:11, 0 users, load average: 0.00, 0.00, 0.00
1:12pm up 21 days, 2:11, 0 users, load average: 0.00, 0.02, 0.08
P ended : Fri Jun 28 13:13:29 2002
The uptime information for each machine is returned, but there is no
way to know which line corresponds to which machine. However,
when a p2 job is launched, an environment variable "P"
is set on each machine to record the name of the host as specified
after the -p flag of the p2 command. The AV tool av_pp
prepends the value of this P environment variable to each output line from
each host. The usage and result of av_pp are shown below:
[homeserver] p2 'uptime | av_pp' -p /net/ia00100 /net/ia00101
P begun : Fri Jun 28 13:14:54 2002
/net/ia00100 1:14pm up 21 days, 2:12, 0 users, load average: 0.00, 0.01, 0.07
/net/ia00101 1:15pm up 21 days, 2:12, 0 users, load average: 0.00, 0.00, 0.00
P ended : Fri Jun 28 13:14:55 2002
Now the output is unambiguous.
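
The source of av_pp is not shown here, but a filter with the same visible
effect is short. The sketch below is an assumption about how such a filter
could look, not the actual tool: prepend_host is a hypothetical name, and P
is set by hand because no p2 job is involved.

```shell
#!/bin/sh
# prepend_host (hypothetical stand-in for av_pp): copy standard input to
# standard output, prefixing every line with the value of $P and a space.
prepend_host() {
    # The extra test keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$P" "$line"
    done
}

# Simulate one remote job; under p2, P would already be set on the host.
P=/net/ia00100
export P
printf 'first line\nsecond line\n' | prepend_host
```

Once every line starts with its host name, passing the combined output
through sort groups the lines by host.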
For more examples using 'P', see the example code pages.