Abstract
Combinatorial optimizations are significantly important to a variety of engineering, scientific, and marketing problems, such as VLSI circuit layout design, machine learning, logistics, and financial markets. However, most of the combinatorial problems are extremely challenging for the complexity of those algorithms tends to be NP-hardness, which means that we cannot find an accurate solution within polynomial computation time. In order to effectively solve these problems, many heuristic methods have been proposed like simulated annealing, quantum annealing, quantum bifurcation, and simulated bifurcation. Because researchers have identified that many combinatorial problems can be mapped to the so-called Ising problem which is established from thermodynamics and statistical physics, various Ising machines have sprung up, and one of the cutting-edge Ising machines is called a coherent Ising machine(CIM), which is based on the optical bifurcation phenomenon. Since the requirements for quantum bifurcation machines and CIM are much more complicated, for example, we need refrigerators and lasers for quantum machines and CIM, a recently proposed simulated bifurcation machine (SBM) aroused much attention for its high degree of parallelism and good performance to find the approximate solution to most scenarios which need to find the optimal solution instantaneously. Considering the practicality of deployment and the high level of configurability and parallelism of FPGA, therefore, we are determined to realize a FPGA-based simulated bifurcation machine(SBM) that supports a large sparse dataset.For a large sparse dataset, to begin with, the size of all the data usually amounts to millions of floating point numbers. Since the representation of a floating-point number, be it a float or double, at least needs 4 bytes to represent, we particularly quantized the dataset to only 1 byte without losing the accuracy of the SBM algorithm. From our experiments, if the initial quantized position/momentum value and some parameters are appropriately tuned, we can even get a better result compared with the floating-point number. It turns out that the SB algorithm is sensitive to the initial conditions and quantization is also fit for this algorithm. In addition, considering the sparsity of the dataset, in which there are many zeros padding within the matrix. People often compressed the dataset using compressed sparse row(CSR) or compressed sparse column(CSC) format. However, considering the convenience of hardware design targeting for ASIC or FPGA on which we describe our circuits using a kind of hardware description language(HDL), such as Verilog, VHDL, or System Verilog, we use the coordinate format(COO) to compress our dataset. By using this kind of representation, we can easily address the sparse matrix representation without any additional computation module for the conversion from CSR to index our effective values. The reason for that, we will discuss it in the introduction part. Regarding the circuit design for SBM, since the major and most time-consuming operation in this algorithm is sparse matrix dense vector multiplication(SPMV), we draw the inspiration from the GAS computation model for graph processing and encode and compress the sparse matrix values as edges and for the numbers in the vector, we quantized the floating point values to Int8 numbers as vertexes. First, we dispatch the edges and vertexes data to our PEs(Processing Elements) through AXI4 protocal, within our PEs, then those PEs will fetch the respective edges and vertexes to do the computation in parallel and after we get the SPMV results, we can do the time evolution(TE) part for the SBM algorithm. During the time evolution period, we use the ping-pong buffer to make the computation in a pipelined fashion such that we can continuously dispatch our vertexes and edges to make the PEs run without idle states. The concrete circuits will be described in more detail in a later chapter. In our experiments, by means of our quantization techniques, we can have a 10x speed up without any accuracy loss compared with our double values using the PyTorch framework and we also implement our circuit design using ZCU102 in the condition of 128 bits per cycle from DDR4 which is under the control of PS(Processing System) part.
| Date of Award | 2023 |
|---|---|
| Original language | English |
| Awarding Institution |
|
| Supervisor | Wei ZHANG (Supervisor) |
Cite this
- Standard