Researcher access is currently not available pending redesign. This material has been retained for reference and was current information as of late 2002.

Parallel Operations using 'P'

Data mining will most often require performing work on several hosts in the cluster. Running the processes in parallel on these machines can speed up this work. Refer to Brewster Kahle's P-Machine working notes for the underlying ideas of how this is done on the archive.

Creating parallel data mining jobs is best approached by first constructing code that performs the desired operation on just one machine. Once this code works properly, the program p2 can be used to launch it in parallel on any specified remote hosts.

p2 will establish all of the connections with the remote hosts, launch the specified job on the remote hosts, and collect the data that each data mining job on each host returns. The gist of p2's function is to rsh out to each specified host and then launch a particular command or sequence of commands on those hosts. Each host will produce some output and then feed it back to p2, which will then combine the data flow into one data stream. The following is the basic syntax to launch a p2 job:

> p2 'my_data_mining_job' -p $ARCS

The command to be issued to each remote host, including any pipelines and redirection, must be enclosed in single quotes, followed by the option '-p', followed by the machines on which the command will be run. Anything located between the single quotes will be executed on each specified remote machine as-is. The previous example will launch the script my_data_mining_job on all of the hosts in the archive.

The output from the p2 operation will be returned to whichever terminal launched it. The output arrives asynchronously and carries no information about which machine produced which data. We will deal with this issue momentarily.
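The fan-out and merge behaviour described above can be illustrated with a minimal local sketch. This is not p2 itself: the function name fanout is hypothetical, and it runs the quoted command once per listed host as a local background process rather than connecting with rsh. It only shows how several concurrent jobs feed one combined output stream.

```shell
# Hypothetical stand-in for p2, for illustration only: it runs the quoted
# command once per listed host as a LOCAL background process. The real p2
# would rsh out to each host instead.
fanout() {
    cmd=$1
    shift
    for host in "$@"; do
        # $host is unused in this local simulation; it marks where a
        # remote connection to that machine would be made.
        sh -c "$cmd" &
    done
    # Block until every background job has finished.
    wait
}

fanout 'echo job finished' /net/ia00100 /net/ia00101
```

Both jobs print the same line, so the combined stream gives no hint of which job produced which line, which is the ambiguity noted above.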

Any valid shell command can be placed within the quotes, including Perl scripts, redirection, and pipes. Keep in mind that any file creation and redirection will occur independently on each specified host. Since homeserver is NFS mounted on each remote host, each host can access scripts located in a user's personal directory thereby eliminating the need to copy a given script onto each remote host. This also means that log files from each process running on each specified remote host can be saved directly into the user's personal directory.
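Because the personal directory is shared over NFS, hosts that redirect to the same file name would all be writing to the same file. One way to avoid this is to derive a per-host file name from the host path. The sketch below assumes the host is identified by a path of the form /net/ia00100 held in a variable P (p2 sets such a variable on each host, as described further down); the log directory is a hypothetical temporary directory standing in for a directory under the user's home.

```shell
# P is set by p2 on each remote host; it is assigned here only so the
# sketch can run locally.
P=/net/ia00100

# Hypothetical log location: a temporary directory stands in for a
# directory under the user's NFS-mounted personal directory.
logdir=$(mktemp -d)

# Drop the leading /net/ so the host name is usable as a file name.
logfile="$logdir/$(basename "$P").log"

echo "finished on $P" > "$logfile"
cat "$logfile"
```

Inside a p2 command the same idea would be written within the single quotes, so that the file name is computed on each remote host rather than on the launching machine.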

Remote machines on which to run the p2 command are listed after the -p flag. For example, to run p2 on only two machines, ia00100 and ia00200, the syntax following the '-p' will be /net/ia00100 /net/ia00200. To run your script on all of the machines in the first rack, simply use the environment variable $rack1. To run the script on all of the machines in the archive, use the environment variable $ARCS.

It is advised to launch a p2 job on just one host first to make sure the desired result is achieved. One can then specify more hosts if the test was successful. p2 instantiates many processes and temporary files on each specified remote host in order to coordinate the job, so canceling a malformed job can be quite a chore.
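A trial run on a single host can be prepared by taking the first entry from a host list. The sketch below assumes that $ARCS is a whitespace-separated list of /net/... paths, which the syntax above suggests but this page does not state. The list is given a stand-in value so the example runs anywhere, and the p2 command is only printed, not executed.

```shell
# Stand-in host list; on the archive, $ARCS is already set in the environment.
ARCS="/net/ia00100 /net/ia00101 /net/ia00200"

# Split the list on whitespace and keep the first entry.
set -- $ARCS
first_host=$1

# Print the trial command instead of running it, since p2 is not assumed
# to be installed where this sketch is run.
echo "p2 'my_data_mining_job' -p $first_host"
```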

When the p2 job is launched, each specified host will return output one line at a time to standard out. p2 ensures that each output line is retrieved in its entirety and is not interrupted by the output data from other hosts. However, the ordering of the results is not guaranteed. p2 will collect data from standard out of any machine that provides a complete line of data. Faster machines will provide data sooner than slower ones, and this will be reflected in the random ordering of output collected by p2.

When the command

[homeserver] p2 'uptime' -p /net/ia00100 /net/ia00101

is executed, the resulting output is:

P begun : Fri Jun 28 13:13:28 2002
1:14pm up 21 days, 2:11, 0 users, load average: 0.00, 0.00, 0.00
1:12pm up 21 days, 2:11, 0 users, load average: 0.00, 0.02, 0.08
P ended : Fri Jun 28 13:13:29 2002

The uptime information for each machine is returned, but there is no way to know which line corresponds to which machine. However, when a p2 job is launched, an environment variable "P" is set on each machine to record the name of the host as specified after the -p flag of the p2 command. The AV tool av_pp will prepend this P environment variable to each output line from each host. The usage and result of av_pp are shown below:

[homeserver] p2 'uptime | av_pp' -p /net/ia00100 /net/ia00101

P begun : Fri Jun 28 13:14:54 2002
/net/ia00100 1:14pm up 21 days, 2:12, 0 users, load average: 0.00, 0.01, 0.07
/net/ia00101 1:15pm up 21 days, 2:12, 0 users, load average: 0.00, 0.00, 0.00
P ended : Fri Jun 28 13:14:55 2002

Now the output is unambiguous.
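The prefixing step can be sketched in plain shell. The function prefix_with_p below is a hypothetical stand-in and not the actual av_pp tool; it only reproduces the behaviour described above, assuming P holds the host name. Once every line carries its host, the combined stream can also be sorted to undo the arbitrary arrival order.

```shell
# Hypothetical stand-in for av_pp (not the actual tool): prefix every
# input line with the value of P.
prefix_with_p() {
    while IFS= read -r line; do
        printf '%s %s\n' "$P" "$line"
    done
}

# Simulate two hosts reporting in arbitrary order, using the uptime lines
# from the example above, then sort the merged stream by host prefix.
{
    echo "1:15pm up 21 days, 2:12, 0 users, load average: 0.00, 0.00, 0.00" |
        ( P=/net/ia00101; prefix_with_p )
    echo "1:14pm up 21 days, 2:12, 0 users, load average: 0.00, 0.01, 0.07" |
        ( P=/net/ia00100; prefix_with_p )
} | sort
```

Note that sorting real p2 output this way would also reorder the "P begun" and "P ended" banner lines, so those may need to be filtered out first.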

For more examples using 'P', see the example code pages.

