# High-Performance Computing Using FPGAs

One approach is to analyze on the order of a hundred samples, each with tens of thousands of gene expressions, to find correlations between expression patterns and disease phenomena.

The kernel operation is a series of dot-product-and-sum (DPS) calculations feeding covariance, matrix inversion, and regression (CIR) logic. Usually the solution involves a very deep pipeline, hundreds or even thousands of stages long.

Difficulty arises, however, when successive functions have different rates of sourcing and sinking data. The solution is to rate-match sequential functions by replicating the slower functions and then using them in rotation for the desired throughput.

FPGAs are often viewed as homogeneous substrates that can be configured into arbitrary logic.

In the past five years, however, an ever larger fraction of their chip area has been devoted to hard-wired components, such as integer multipliers and independently accessible BRAMs.

For example, the Xilinx VP has independently addressable, quad-ported BRAMs; it achieves a sustained bandwidth of 20 terabytes per second at capacity. Using this bandwidth greatly facilitates high performance and is an outstanding asset of current-generation FPGAs.

In molecular dynamics, efficient algorithms for computing the electrostatic interaction often involve mapping charges onto a 3D grid. The first phase of each iteration computes the 3D charge distribution, while the second phase locates each atom in that field and applies a force to it according to its charges and that region of the force field.

Because atoms almost never align to the grid points on which the field is computed, trilinear interpolation uses the eight grid points nearest to the atom to determine field strength.
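The eight-point interpolation described above can be sketched in software as follows. This is a minimal illustration, not the hardware design: the function name `trilinear` and the nested-list grid layout are assumptions made for the example, and the position is assumed to lie strictly inside the grid so that all eight neighbors exist.

```python
import math

def trilinear(grid, x, y, z):
    """Interpolate grid[i][j][k] at the fractional position (x, y, z)."""
    i, j, k = math.floor(x), math.floor(y), math.floor(z)
    fx, fy, fz = x - i, y - j, z - k  # offsets within the enclosing cell
    value = 0.0
    # Visit the eight corners of the cell that contains the position.
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                weight = ((fx if di else 1 - fx) *
                          (fy if dj else 1 - fy) *
                          (fz if dk else 1 - fz))
                value += weight * grid[i + di][j + dj][k + dk]
    return value
```

In software the eight reads happen one after another; the point of the hardware structure discussed next is to perform them in the same cycle.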

Key to such a structure is simultaneous access to all grid points surrounding the atom. This in turn requires appropriate partitioning of the 3D grid among the BRAMs to enable collisionless access, and also efficient logic to convert atom positions into BRAM addresses.
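One way to obtain collisionless access is to choose the bank from the low-order part of each coordinate and the address from the high-order part. The sketch below assumes that scheme; the source does not specify its partitioning, and the names `bank_and_address`, `neighborhood_banks`, `side`, and `grid_dim` are hypothetical. With `side=2` it serves the eight trilinear neighbors, and with `side=4` the 64 points of a tricubic neighborhood.

```python
from itertools import product

def bank_and_address(x, y, z, side, grid_dim):
    """Map grid point (x, y, z) to (bank, address) using side**3 banks.

    Assumed scheme: the bank comes from the coordinates modulo `side`,
    the address within the bank from the coordinates divided by `side`.
    """
    bank = (x % side) + side * (y % side) + side * side * (z % side)
    words_per_axis = -(-grid_dim // side)  # ceil(grid_dim / side)
    address = (x // side) + words_per_axis * ((y // side) + words_per_axis * (z // side))
    return bank, address

def neighborhood_banks(i, j, k, side, grid_dim):
    """Banks touched by the side x side x side block whose corner is (i, j, k)."""
    return [bank_and_address(i + dx, j + dy, k + dz, side, grid_dim)[0]
            for dx, dy, dz in product(range(side), repeat=3)]
```

Because any `side` consecutive coordinates cover every residue modulo `side` exactly once, each point of the neighborhood falls in a different bank, so all of them can be fetched in one cycle regardless of where the atom sits.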

We have prototyped a memory-access configuration that supports tricubic interpolation by fetching 64 neighboring grid-point values per cycle. We have also generalized this technique into a tool that creates custom interleaved memories for access kernels of various sizes, shapes, and dimensionality.

With high-end microprocessors having wide data paths, it is often overlooked that many bioinformatics and computational biology (BCB) applications require only a few bits of precision.

In fact, even the canonical floating point of MD is often implemented with substantially reduced precision, although this remains controversial. In contrast with microprocessors, FPGAs enable configuration of data paths into arbitrary sizes, allowing a tradeoff between precision and parallelism.

An additional benefit of minimizing precision comes from shorter propagation delays through narrower arithmetic units. All BCB applications described here benefit substantially from the selection of nonstandard data type sizes.

For example, microarray values and biological sequences require only two to five bits, and shape characterization of a rigid molecule requires only two to seven bits. While most MD applications require more than the 24 bits provided by single-precision floating point, they might not need double precision (53 bits).

We return to the modeling molecular interactions case study to illustrate the tradeoff between PE complexity and degree of parallelism.

That study examined six different models describing intermolecular forces. Molecule descriptions ranged from two to seven bits per voxel, and scoring functions varied with the application. The number of PEs that fit the various maximum-sized cubical computing arrays into a Xilinx XC2VP70 ranged from 8^3 (512) to 14^3 (2,744), according to the resources each PE needed.

Since clock speeds also differed for each application-specific accelerator, they covered a 7: If we had been restricted to, for example, 8-bit arithmetic, the performance differential would have been even greater.

Microprocessors provide support for integers and floating point and, depending on multimedia features, 8-bit saturated values. In digital signal processing systems, however, cost concerns often require DSPs to have only integers. Software can emulate floating point when required; block floating point is also common. Alternatives include block floating point, log representations, and semi-floating point. We would generally use double-precision floating point for further computations.

Careful analysis shows that the number of distinct alignments that must be computed is quite small even though the range of exponents is large. This enables the use of a stripped-down floating-point mode, particularly one that does not require a variable shift. The resulting force pipelines are 25 percent smaller than ones built with a commercial single-precision floating-point library.

The relative costs of arithmetic functions are different on FPGAs than on microprocessors.

For example, FPGA integer multiplication is efficient compared to addition, while division is orders-of-magnitude slower.

Even if the division logic is fully pipelined to hide its latency, the cost remains high in chip area, especially if the logic must be replicated. Thus, restructuring arithmetic with respect to an FPGA cost function can substantially increase performance. The microarray data analysis kernel as originally formulated requires division.

We represent numbers as rationals, with a separate numerator and denominator, replacing division operations with multiplication. This doubles the required number of bits, but rational values are needed only at a short, late segment of the data path. Consequently, the additional logic required for the wider data path is far lower than the logic for division would have been.

The final two methods deal with two familiar HPC issues. These methods differ from the others in that they require design tools not widely in use, either because they are currently proprietary [11] or exist only as prototypes.
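The rational-number restructuring described above for the microarray kernel can be illustrated with a small sketch. The two operations shown, a comparison and a product, are chosen for illustration only; the source does not say which operations its data path performs on rationals, and the function names are hypothetical. Denominators are assumed to be positive.

```python
def rational_less_than(n1, d1, n2, d2):
    """Return True if n1/d1 < n2/d2, assuming d1 > 0 and d2 > 0.

    Cross-multiplying avoids both divisions.
    """
    return n1 * d2 < n2 * d1

def rational_multiply(n1, d1, n2, d2):
    """Multiply n1/d1 by n2/d2 without dividing; the result stays a rational."""
    return n1 * n2, d1 * d2
```

Each value now carries two integers instead of one, which is the doubling of data-path width the text mentions, but every operation is built from multipliers only.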

HPC applications are often complex and highly parameterized, resulting in variations in applied algorithms as well as data format.

Contemporary object-oriented technology can easily support these variations, including function parameterization. This level of parameterization is far more difficult to implement in current hardware description languages, but it enables higher reuse of the design, amortizes development cost over a larger number of uses, and relies less on skilled hardware developers for each application variation.

Other essential methods for searching biological databases are based on dynamic programming. Although generally referred to by the name of one particular variation, Smith-Waterman, DP-based approximate string matching actually consists of a large number of related algorithms that vary significantly in purpose and complexity.

In traditional hardware design systems, components comprise black boxes with limited internal parameterization.
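As a concrete reference point, one member of the DP-based family can be sketched in a few lines. This is the standard Smith-Waterman recurrence with a linear gap penalty, written as ordinary software rather than as the hardware array; the scoring values `match`, `mismatch`, and `gap` are illustrative defaults, not values from the source. Leaving them as parameters hints at how other family members differ mainly in scoring and comparison rules.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

Only two rows are kept at a time, since each cell depends on its left, upper, and upper-left neighbors.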

Reuse largely entails creating communication and synchronization structures and connecting these to the memory subsystems. System performance thus depends on memory, synchronization, and communication, which are the aspects most unfamiliar to traditional programmers.

The term application family describes a computation that matches this description, and DP-based approximate string matching offers an example. Each level of design hierarchy has fixed interfaces to the components above and below in that hierarchy. The fixed interface includes data types defined and used in that level, but possibly also passed through communication channels at other levels.

Within a hierarchical level, each component type has several possible implementations, including definitions of its data elements.

The fixed interface, however, hides that variability from other design layers.

Figure: Logical structure of the application family for DP-based approximate string matching.

Our initial implementation allowed a large number of combinations of the three component types, with many more variations possible through parameter settings. This structure was quite natural in the object-oriented algorithms we used but required more configurability than VHDL features provide.

Given the frequency at which larger FPGAs become available, automated sizing of complex arrays will become increasingly important for porting applications among FPGA platforms. All the case studies can be scaled to use additional hardware resources.

FPGA capacity has terms for each of the available hardware resources, including hard multipliers and BRAMs as well as general-purpose logic elements. Depending on the application, any of the resources can become the limiting one. As shown in Figure 2a, arrays can be simple linear structures. In this case, the array can grow only in increments of whole rows or columns; architectural parameters are not literal numbers of PEs. Computing arrays like those in Figure 2c have multiple subsystems of related sizes and different algebraic growth laws.

Figure 2d represents a tree-structured array, showing how arrays can grow according to exponential or other nonlinear laws. Of course, a computing array can include multiple architectural parameters, nonlinear growth patterns, coupled subsystems growing according to different algebraic laws, and multiple resource types.

Figure 2. Growth laws for computing arrays specified in terms of architectural parameters.

In string matching, for example, PE size depends on the number of bits in the string element (2 bits for DNA or 5 bits for proteins) and on the type of comparison performed. Array dimensions naturally grow when larger FPGAs offer more resources, and they decrease when complex applications consume more resources per PE.
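The sizing problem can be made concrete with a small sketch for one growth law, a cubical array of N^3 PEs. This is only an illustration of the idea, not the LAMP tool: the function name, the resource names, and the dictionary layout are assumptions, and the capacity figures used with it below are made up for the example.

```python
def max_cubic_array(capacity, per_pe, overhead=None):
    """Largest N such that an N x N x N array of PEs fits every resource budget.

    capacity, per_pe, and overhead map resource names (for example 'logic',
    'bram') to counts; all names and numbers here are illustrative.
    """
    overhead = overhead or {}
    if not any(per_pe.get(r, 0) > 0 for r in capacity):
        raise ValueError("at least one budgeted resource must be used per PE")
    n = 0
    # Grow the array one step at a time until some resource is exhausted.
    while all((n + 1) ** 3 * per_pe.get(r, 0) + overhead.get(r, 0) <= capacity[r]
              for r in capacity):
        n += 1
    return n
```

For instance, with a hypothetical budget of 33,000 logic cells and 12 cells per PE the result is 14, while a PE five times larger drops it to 8. Adding a second resource, such as one BRAM per PE with 300 available, can make that resource the limiting one instead.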

Automated sizing is possible within the experimental LAMP design system [4] but cannot be expressed in mainstream design tools or methodologies.

High-performance computing programmers are a highly sophisticated but scarce resource. Such programmers are expected to readily use new technology but lack the time to learn a completely new skill such as logic design. As a result, developers have expended much effort to develop design tools that translate high-level language programs to FPGA configurations, but with modest expectations of results.

In other words, what support would enable an HPC programmer to use these methods? While there is potential for enormous speedup in FPGA-based acceleration of HPC applications, achieving it demands both selecting appropriate applications and specific design methods that ensure such applications are flexible, scalable, and at least somewhat portable.

Such methods are firmly entrenched in HPC tools and practices.

We thank the anonymous reviewers for their many helpful suggestions.

Martin C. Herbordt's research interests include computer architecture, applying configurable logic to high-performance computing, and design automation. Herbordt received a PhD in computer science from the University of Massachusetts.

Tom VanCourt is a senior member of the technical staff, software engineering, at Altera Corp. His research interests include applications and tools for reconfigurable computing.

Gu received an MS in computer science from Fudan University.

Sukhwani received an MS in electrical and computer engineering from the University of Arizona. He is a student member of the IEEE.

His research interests include using FPGAs and other hybrid architectures in high-performance image and signal processing applications. Conti received an MS in electrical and computer engineering from Northeastern University and a BS in computer systems engineering from Boston University. He is a member of the IEEE.

His research interests include the use of FPGAs in scientific computing and hyperspectral image processing.

Computer (Long Beach, Calif.). Author manuscript; available in PMC.

Boston University.

Abstract: Numerous application areas, including bioinformatics and computational biology, demand increasing amounts of processing capability.

In this case, accelerating HPC applications with FPGAs is similar to porting uniprocessor applications to massively parallel processors, with two key distinctions.

Table: Type of support required and methods supported.

We selected widely used applications with high potential parallelism and, preferably, low precision. In terms of programming effort, we considered a few months to a year or two (depending on potential impact) as being realistic. In addition, we avoided low-level issues related to logic design and synthesis in electronic design automation, as well as high-level issues such as partitioning and scheduling in parallel processing.

Although we focused on our own BCB work, the methods apply largely to other domains in which FPGAs are popular, such as signal and image processing.

Method 1: Use an algorithm optimal for FPGAs

Having multiple plausible algorithms is common for a given task; application and target hardware determine the final selection.

Application example: Modeling molecular interactions, or docking, is a key computational method used for in silico drug screening. Fast Fourier transforms are used to compute the 3D correlations. First, small data type sizes, such as 1-bit values for representing interior versus exterior information, offer little advantage on a microprocessor. On an FPGA, however, smaller processing elements allow for more PEs in a given amount of computing fabric, and implementing products of 1-bit values is trivial.

In addition, systolic arrays for correlation are efficient. The form we chose requires one input value and generates one output value per cycle, while holding hundreds of partial sums in on-chip registers.

Finally, our implementation, after a brief setup phase, delivers one multiply-accumulate operation per clock cycle per PE, times hundreds to thousands of PEs in the computing array.
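The behavior described above (one input and one output per cycle, with partial sums held in registers and one multiply-accumulate per PE per cycle) can be modeled in software. The sketch below is a 1D model using a broadcast-input arrangement chosen for brevity; the authors' array is 3D and its exact data flow is not given in the text, so treat `systolic_correlate` and its structure as assumptions.

```python
def systolic_correlate(signal, weights):
    """Cycle-by-cycle model of a 1D correlation array.

    Computes out[m] = sum over k of weights[k] * signal[m + k]. Each PE holds
    one coefficient and one partial-sum register; every cycle one input
    arrives and, once the array is full, one finished sum leaves.
    """
    k = len(weights)
    coeffs = list(reversed(weights))  # PE p holds coeffs[p]
    regs = [0] * k                    # regs[p]: partial sum leaving PE p
    outputs = []
    for cycle, x in enumerate(signal):
        # One multiply-accumulate per PE; partial sums shift toward PE 0.
        regs = [coeffs[p] * x + (regs[p + 1] if p + 1 < k else 0)
                for p in range(k)]
        if cycle >= k - 1:            # brief setup phase while the array fills
            outputs.append(regs[0])
    return outputs
```

After the first `len(weights) - 1` cycles, every cycle yields a completed correlation value, which matches the one-result-per-cycle rate described in the text.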

Because good computing modes for software are not necessarily good computing modes for hardware, restructuring an application can often substantially improve its performance. For example, while random-access and pointer-based data structures are staples of serial computing, they may yield poor performance on FPGAs.

Streaming, systolic, and associative computing structures, and arrays of fine-grained automata, are preferable.

The most commonly used applications are based on the basic local alignment search tool (BLAST), which operates in multiple phases. BLAST first determines seeds, or good matches of short subsequences, then extends these seeds to find promising candidates, and finally processes the candidates in detail, often using dynamic programming (DP) methods.

The first dimension generates, on every cycle, the character-character match scores for a particular alignment of the sequence of interest versus the database. The second dimension processes the score sequence to find the maximal local alignment.
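The two steps just described, scoring one alignment character by character and then finding the best-scoring local stretch of that score sequence, can be sketched sequentially. The hardware uses a pipelined tree for the second step; the version below uses a simple running-sum scan instead, and the `match` and `mismatch` values are illustrative assumptions.

```python
def best_local_segment(seq, db_window, match=1, mismatch=-1):
    """Best ungapped local alignment score for one fixed alignment.

    Step 1: character-by-character match scores for this alignment.
    Step 2: maximal-scoring contiguous run of those scores.
    """
    scores = [match if a == b else mismatch for a, b in zip(seq, db_window)]
    best = running = 0
    for s in scores:
        running = max(0, running + s)  # drop a prefix once it goes negative
        best = max(best, running)
    return best
```

Sliding the database by one position and repeating gives the score for the next alignment, which is what the array produces at the streaming rate.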

The tree structure keeps the hardware cost low; pipelining assures generation of maximal local alignments at the streaming rate.

Method 3: Use appropriate FPGA structures

Certain data structures such as stacks, trees, and priority queues are ubiquitous in application programs, as are basic operations such as search, reduction, parallel prefix, and the use of suffix trees. Equally ubiquitous in digital logic, the analogous structures and operations usually differ from what is obtained by directly translating software structures into hardware.

Application example: Another important bioinformatics task is analyzing DNA or protein sequences for patterns indicative of disease or other functions fundamental to cell processes. These patterns are often repetitive structures, such as tandem arrays and palindromes under various mismatch models.

This is sometimes difficult to achieve with existing HPC code; for example, profiling often points to kernels that comprise just 60 to 80 percent of execution time.
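The consequence of a kernel that covers only 60 to 80 percent of execution time follows from Amdahl's law, which the passage above relies on without stating. A minimal sketch, with hypothetical function names:

```python
def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: whole-program speedup when only the kernel is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

def speedup_limit(kernel_fraction):
    """Upper bound on speedup, reached only if the kernel takes zero time."""
    return 1.0 / (1.0 - kernel_fraction)
```

A kernel covering 60 percent of run time caps the overall speedup at 2.5 times, and one covering 80 percent caps it at 5 times, however fast the accelerator is. That is why enlarging the acceleratable fraction matters as much as accelerating it.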

The problem is especially severe with legacy codes and may require a substantial rewrite. Not all is lost, however. The nonkernel code may lend itself to substantial improvement; as its relative execution time increases, expending effort on its optimization may become worthwhile. Also, combining computations not equally amenable to FPGA acceleration may have optimized the original code; separating them can increase the acceleratable kernel.

Application example: Central to computational biochemistry, molecular dynamics applications predict molecular structure and interactions. The MD computation itself is an iterative application of Newtonian mechanics on particle ensembles and alternates between two phases: force computation and motion update. The force computation comprises several terms, some of which involve bonds. The motion update and bonded force computations are O(N) in the number of particles being simulated, while the nonbonded are O(N log N) or O(N^2).
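To show where the O(N^2) cost comes from, the sketch below computes one nonbonded term by visiting every pair of particles. The Lennard-Jones potential is used here as an assumed example of a nonbonded term; the source does not name the force model, and `lj_forces`, `epsilon`, and `sigma` are illustrative.

```python
def lj_forces(positions, epsilon=1.0, sigma=1.0):
    """All-pairs Lennard-Jones forces; the double loop is the O(N^2) part."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [positions[i][c] - positions[j][c] for c in range(3)]
            r2 = sum(x * x for x in d)
            s6 = (sigma * sigma / r2) ** 3
            # Force magnitude divided by r: 24*eps*(2*(s/r)^12 - (s/r)^6) / r^2
            f_over_r = 24.0 * epsilon * (2.0 * s6 * s6 - s6) / r2
            for c in range(3):
                forces[i][c] += f_over_r * d[c]  # force on i
                forces[j][c] -= f_over_r * d[c]  # equal and opposite on j
    return forces
```

The inner body is a short, regular arithmetic pipeline executed N(N-1)/2 times per step, which is the kind of kernel that lends itself to acceleration.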

The latter comprises the acceleratable kernel.

Method 5: Hide latency of independent functions

Latency hiding is a basic technique for achieving high performance in parallel applications. Overlap between computation and communication is especially desirable. In FPGA implementations, further opportunities arise: rather than allocating tasks to processors that must communicate with one another, latency hiding simply lays out functions on the same chip to operate in parallel.

Application example: Returning to the example of modeling molecular interactions, the docking algorithm must repeat the correlations at a large number of three-axis rotations for typical sampling intervals.

Implementations on sequential processors typically rotate the molecule in a step separate from the correlation. The preferred technique is based on runtime index calculation and has two distinctive features. First, index computation can be pipelined to generate indices at operating frequency due to the predictable order of access to voxels.

Method 6: Use rate-matching to remove bottlenecks

Computations often consist of independent function sequences, such as a signal passing through a series of filters and transformations.

Multiprocessor implementations offer some flexibility in partitioning by function or data, but on an FPGA, functions are necessarily laid out on the chip, so function-level parallelism is built in (although functions can also be replicated for data parallelism). This implies pipelining not only within, but also across, functions.
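One way to rate-match functions with different throughputs is to replicate the slower function and use the copies in rotation. The sketch below shows the arithmetic and the rotation in software; the function names and the cycles-per-item parameters are hypothetical and stand in for whatever rates the actual pipeline stages have.

```python
import math
from itertools import cycle

def replicas_needed(slow_cycles_per_item, fast_cycles_per_item):
    """Copies of the slow stage needed to keep pace with the fast stage."""
    return math.ceil(slow_cycles_per_item / fast_cycles_per_item)

def dispatch_round_robin(items, n_replicas):
    """Hand items to the replicas in rotation; returns one work list per replica."""
    queues = [[] for _ in range(n_replicas)]
    for item, replica in zip(items, cycle(range(n_replicas))):
        queues[replica].append(item)
    return queues
```

If a producer emits an item every 4 cycles and the consumer needs 12 cycles per item, three consumer copies used in rotation restore the producer's throughput.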

Application example: DNA microarrays simultaneously measure the expression of tens of thousands of genes and are used to investigate numerous questions in biology. One approach is to analyze on the order of a hundred samples, each with tens of thousands of gene expressions, to find correlations between expression patterns and disease phenomena.

The methods also include: use rate-matching to remove bottlenecks; take advantage of FPGA-specific hardware; use appropriate arithmetic precision; create families of applications, not point solutions; and scale the application for maximal use of FPGA hardware.

Finding information about a newly discovered gene or protein by searching biomedical databases for similar sequences is a fundamental bioinformatics task.
