In computing, hardware acceleration is the use of computer hardware specially made to perform some functions more efficiently than is possible in software running on a general-purpose central processing unit (CPU). Any transformation of data or routine that can be computed, can be calculated purely in software running on a generic CPU, purely in custom-made hardware, or in some mix of both. An operation can be computed faster in application-specific hardware designed or programmed to compute the operation than specified in software and performed on a general-purpose computer processor. Each approach has advantages and disadvantages. The implementation of computing tasks in hardware to decrease latency and increase throughput is known as hardware acceleration.
Typical advantages of the software include more rapid development (leading to faster times to market), lower non-recurring engineering costs, heightened portability, and ease of updating features or patching bugs, at the cost of overhead to compute general operations. Advantages of hardware include speedup, reduced power consumption, lower latency, increased parallelism and bandwidth, and better utilization of area and functional components available on an integrated circuit; at the cost of lower ability to update designs once etched onto silicon and higher costs of functional verification and times to market. In the hierarchy of digital computing systems ranging from general-purpose processors to fully customized hardware, there is a tradeoff between flexibility and efficiency, with efficiency increasing by orders of magnitude when any given application is implemented higher up that hierarchy. This hierarchy includes general-purpose processors such as CPUs, more specialized processors such as GPUs, fixed-function implemented on field-programmable gate arrays (FPGAs), and fixed-function implemented on an application-specific integrated circuit (ASICs).
Hardware acceleration is advantageous for performance, and practical when the functions are fixed so updates are not as needed as in software solutions. With the advent of reprogrammable logic devices such as FPGAs, the restriction of hardware acceleration to fully fixed algorithms has eased since 2010, allowing hardware acceleration to be applied to problem domains requiring modification to algorithms and processing control flow.
Integrated circuits can be created to perform arbitrary operations on analog and digital signals. Most often in computing, signals are digital and can be interpreted as binary number data. Computer hardware and software operate on information in binary representation to perform computing; this is accomplished by calculating boolean functions on the bits of input and outputting the result to some output device downstream for storage or further processing.
Computational equivalence of hardware and software
Either software or hardware can compute any computable function. Custom hardware offers higher performance per watt for the same functions that can be specified in software. Hardware description languages (HDLs) such as Verilog and VHDL can model the same semantics as software and synthesize the design into a netlist that can be programmed to an FPGA or composed into logic gates of an application-specific integrated circuit.
The vast majority of software-based computing occurs on machines implementing the von Neumann architecture, collectively known as stored-program computers. Computer programs are stored as data and executed by processors, typically one or more CPU cores. Such processors must fetch and decode instructions as well as data operands from memory as part of the instruction cycle to execute the instructions constituting the software program. Relying on a common cache for code and data leads to the von Neumann bottleneck, a fundamental limitation on the throughput of software on processors implementing the von Neumann architecture. Even in the modified Harvard architecture, where instructions and data have separate caches in the memory hierarchy, there is overhead to decoding instruction opcodes and multiplexing available execution units on a microprocessor or microcontroller, leading to low circuit utilization. Intel's hyper-threading technology provides simultaneous multithreading by exploiting the under-utilization of available processor functional units and instruction-level parallelism between different hardware threads.
Hardware execution units do not in general rely on the von Neumann or modified Harvard architectures and do not need to perform the instruction fetch and decode steps of an instruction cycle and incur those stages' overhead. If needed calculations are specified in a register transfer level (RTL) hardware design, the time and circuit area costs that would be incurred by instruction fetch and decoding stages can be reclaimed and put to other uses.
This reclamation saves time, power, and circuit area in computation. The reclaimed resources can be used for increased parallel computation, other functions, communication, or memory, as well as increased input/output capabilities. This comes at the opportunity cost of less general-purpose utility.
Greater RTL customization of hardware designs allows emerging architectures such as in-memory computing, transport triggered architectures (TTA), and networks-on-chip (NoC) to further benefit from increased locality of data to execution context, thereby reducing computing and communication latency between modules and functional units.
Custom hardware is limited in parallel processing capability only by the area and logic blocks available on the integrated circuit die. Therefore, the hardware is much freer to offer massive parallelism than software on general-purpose processors, offering a possibility of implementing the parallel random-access machine (PRAM) model.
It is common to build multicore and manycore processing units out of microprocessor IP core schematics on a single FPGA or ASIC. Similarly, specialized functional units can be composed in parallel as in digital signal processing without being embedded in a processor IP core. Therefore, hardware acceleration is often employed for repetitive, fixed tasks involving little conditional branching, especially on large amounts of data. This is how Nvidia's CUDA line of GPUs are implemented.
As device mobility has increased, the relative performance of specific acceleration protocols has required new metricizations, considering the characteristics such as physical hardware dimensions, power consumption and operations throughput. These can be summarized into three categories: task efficiency, implementation efficiency, and flexibility. Appropriate metrics consider the area of the hardware along with both the corresponding operations throughput and energy consumed.