The McKoy Group Pentium III Cluster


Through an equipment grant from Intel, we in the research group of Prof. Vincent McKoy have constructed a cluster of Intel-architecture workstations that we operate as a parallel computer. We are using this cluster to carry out studies of electron collisions with etchant gases used in plasma processing of semiconductors, work supported by SEMATECH and carried out in collaboration with Dr. S. Shankar of Intel's TCAD and Dr. W. L. Morgan of Kinema Research.

The current characteristics of our cluster are:

Cluster Status

Our 32-processor Pentium III cluster is the successor to a 16-processor Pentium II cluster that we operated successfully for 3 years. At present we have two 8-machine, 16-processor subclusters operating over separate 8-port Gbit switches. In the next phase of integration, we will fuse these subclusters by installing a second Gbit NIC in 4 machines from each cluster and interconnecting them either via a third switch or by direct crossover connections.

Some Performance Data

The Cluster vs. Other Parallel Machines

The tables below give an indication of the relative performance of our application on several parallel architectures. The "medium" test case is several years old and is really too small to scale well on current machines. The "large" problem is a fragment of a recent production calculation. Results are given in wallclock seconds versus number of processors used (NPEs). (Note: The Pentium II cluster results in these tables were obtained over a 100-Mbit network; the Pentium III results were obtained on a Gbit network.)

"Medium" Test Case

NPEs  HP X-Class  CRAY T3D  300 MHz CRAY T3E  300 MHz P-II Cluster  700 MHz quad P-III  933 MHz P-III Cluster
1     548         2599      572               646                   292.2               222
2     273         1316      289               360                   146.9               112
4     146         621       152               213                   79.4                60
8     75          314       77                185                   ---                 32
16    44          179       42                157                   ---                 22

"Large" Test Case

NPEs  300 MHz CRAY T3E  SGI Origin 2000  HP Superdome  300 MHz P-II Cluster  933 MHz P-III Cluster
1     ---               ---              ---           ---                   13619
2     ---               7469             ---           20517                 6983
4     ---               3936             2696          11587                 3531
8     4427              1966             1352          5772                  1780
16    2267              1083             713           4612                  1126
32    1158              548              374           ---                   ---

Performance can be measured in a number of ways. One important statistic is absolute speed per processor. For the matrix-multiplication step in the "large" test case, we achieve speeds of up to 675 MFLOPS per processor, which we consider to be very good. Performance on still larger matrices should be even better.

Another important measure is the parallel efficiency, defined as (T_1/N)/T_N, where T_1 is the time to do a task on 1 processor and T_N is the time to do the same task on N processors. On the "large" problem, the Pentium III cluster has a high parallel efficiency (96% or better) up to 8 processors. At 16 processors, there is a noticeable degradation: the parallel efficiency drops to 76%. This falloff occurs because, beyond 8 processors, it is necessary to send large messages over the network in the matrix-multiplication step. As the system software for networking continues to improve, we hope to see improvements in performance.
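
The efficiency figures quoted above can be checked directly from the timings. The following is a minimal sketch in Python, not part of our production code: the timings are the 933 MHz P-III Cluster entries for the "large" test case, and the names LARGE_P3_SECONDS, parallel_efficiency, and efficiency_table are introduced here purely for illustration.

```python
# Wallclock seconds for the "large" test case on the 933 MHz P-III cluster,
# keyed by number of processors (NPEs). Values are read from the table of
# "large" test case results; the variable name is hypothetical.
LARGE_P3_SECONDS = {1: 13619, 2: 6983, 4: 3531, 8: 1780, 16: 1126}


def parallel_efficiency(t1, tn, n):
    """Return the parallel efficiency (T1/N)/TN for a run on n processors."""
    return (t1 / n) / tn


def efficiency_table(timings):
    """Map each processor count to its efficiency relative to the 1-processor run."""
    t1 = timings[1]
    return {n: parallel_efficiency(t1, tn, n) for n, tn in sorted(timings.items())}


if __name__ == "__main__":
    for n, eff in efficiency_table(LARGE_P3_SECONDS).items():
        print(f"{n:3d} processors: {eff:6.1%}")
```

Run as a script, this prints one efficiency per processor count. Rounded to whole percentages, the values at 8 and 16 processors come out to 96% and 76%, in agreement with the figures given above.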


Current Production Work

As part of our ongoing studies of electron collisions with fluorocarbon and fluorosilane molecules relevant to plasma etching, we have conducted calculations of elastic electron collisions with tetrafluorosilane, hexafluoroethane, and perfluorocyclobutane. Work on the largest of these molecules, perfluorocyclobutane, has involved continuous runs of over 48 hours on eight processors.

We also use the cluster for exploratory calculations as part of the continual development of our code and methodology. For example, we have looked at vibrational effects on low-energy electron scattering by small hydrocarbons.

Just before we began upgrading to the Pentium III cluster, we completed a study of electron collisions with perfluoroethylene carried out entirely on the Pentium II cluster. Initial results from those calculations were presented at the 51st Gaseous Electronics Conference (GEC) in October 1998; a much more extensive set of results was presented at the 53rd GEC in October 2000, and has now been published.

Where to From Here?

As the performance data above demonstrate, the time-to-solution achievable by running around the clock on the full 32-processor cluster can be quite competitive with that achievable when sharing a much larger parallel supercomputer. The very largest jobs may still require resources (both computational speed and aggregate memory) that are only available on large machines like the HP Superdome or SGI Origin 2000, but many jobs that only a few years ago required supercomputer resources are now feasible "in-house" on clustered workstations.

Some Recent Presentations and Publications


Last updated January 21, 2002