# YOLOv2 Accelerator on Xilinx Zynq SoCs (PYNQ-Z2, Zedboard, and ZCU102)
A demo for accelerating YOLOv2 on Xilinx FPGA boards: PYNQ-Z2, Zedboard, and ZCU102.
__I graduated from Jiangnan University, China, on July 1, 2019. The related papers are now available.__
Master thesis ["Research of Scalability on FPGA-based Neural Network Accelerator"](https://kns.cnki.net/KCMS/detail/detail.aspx?dbcode=CMFD&dbname=CMFDTEMP&filename=1019228234.nh&uid=WEEvREcwSlJHSldRa1FhdXNXaEhoOGhUTzA5T0tESzdFZ2pyR1NJR1ZBaz0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!&v=MjE5NTN5dmdXN3JBVkYyNkY3RzZGdFBQcTVFYlBJUjhlWDFMdXhZUzdEaDFUM3FUcldNMUZyQ1VSTE9lWnVkdUY=)
Journal article ["Design and implementation of FPGA-based deep learning object detection system"](https://kns.cnki.net/KCMS/detail/detail.aspx?dbcode=CJFQ&dbname=CJFDLAST2019&filename=DZJY201908009&uid=WEEvREcwSlJHSldRa1FhdXNXaEhoOGhUTzA5T0tESzdFZ2pyR1NJR1ZBaz0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!&v=MDU0NDJDVVJMT2VadWR1Rnl2Z1c3ck1JVGZCZDdHNEg5ak1wNDlGYllSOGVYMUx1eFlTN0RoMVQzcVRyV00xRnI=)
Journal article ["Design and Implementation of YOLOv2 Accelerator Based on Zynq7000 FPGA Heterogeneous Platform"](https://kns.cnki.net/KCMS/detail/detail.aspx?dbcode=CJFQ&dbname=CJFDTEMP&filename=KXTS201910005&uid=WEEvREcwSlJHSldRa1FhdXNXaEhoOGhUTzA5T0tESzdFZ2pyR1NJR1ZBaz0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!&v=MjkwNzdXTTFGckNVUkxPZVp1ZHVGeXZnVzdyT0xqWGZmYkc0SDlqTnI0OUZZWVI4ZVgxTHV4WVM3RGgxVDNxVHI=)
For PYNQ-Z2 and Zedboard, apart from the final Linux application (for PYNQ, see the PYNQ directory; for Zedboard and ZCU102, see the SDK and PetaLinux directories), the remaining steps are almost the same:
## (1) Software Simulation
Firstly, download the darknet source from [https://github.com/pjreddie/darknet](https://github.com/pjreddie/darknet) and the pretrained yolov2.weights from [https://pjreddie.com/media/files/yolov2.weights](https://pjreddie.com/media/files/yolov2.weights).
Secondly, modify darknet's weight-loading function to extract the weights and biases we want (here, the batch-normalization parameters can be folded into the weights and biases).
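Concretely, for each output channel the batch-norm transform y = gamma * (x - mean) / sqrt(var + eps) + beta can be folded into the preceding convolution by scaling its weights and adjusting its bias. A minimal sketch of the idea; the array layout and function name are my assumptions, not darknet's actual code:

```cpp
#include <cmath>

// Fold per-output-channel batch-norm parameters (gamma, beta, mean, var)
// into the conv weights and biases, so inference only needs a plain
// multiply-accumulate. M output channels, N input channels, K x K kernels,
// weights flattened row-major as w[m][n][k][k].
void fold_batchnorm(float *w, float *b,
                    const float *gamma, const float *beta,
                    const float *mean, const float *var,
                    int M, int N, int K, float eps = 1e-5f)
{
    for (int m = 0; m < M; ++m) {
        float scale = gamma[m] / std::sqrt(var[m] + eps);
        for (int i = 0; i < N * K * K; ++i)
            w[m * N * K * K + i] *= scale;          // scale every kernel weight
        b[m] = beta[m] + scale * (b[m] - mean[m]);  // fold mean/beta into the bias
    }
}
```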
Thirdly, since multiply and add units implemented in hardware logic consume too many FPGA resources at float-32 precision [3][6], lower-precision arithmetic should be used instead. Here, I follow [3] and [6] and quantize the input/output feature maps, weights, and biases to dynamic fixed-16, then replace the float-32 multiply, add, and ReLU operations with their fixed-16 counterparts.
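The core of dynamic fixed-point quantization is choosing, per layer, how many of the 16 bits are fractional so that the largest magnitude still fits. A minimal sketch of that idea (the exact scheme in [3][6] may differ):

```cpp
#include <cstdint>
#include <cmath>
#include <algorithm>

// Pick the number of fractional bits for one layer: 1 sign bit, enough
// integer bits to hold the largest magnitude, and the rest is fraction.
int choose_frac_bits(const float *x, int n)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; ++i)
        max_abs = std::max(max_abs, std::fabs(x[i]));
    int int_bits = (max_abs >= 1.0f) ? (int)std::ceil(std::log2(max_abs)) : 0;
    return 15 - int_bits;
}

// Quantize floats to int16 with the chosen scale, saturating at the int16 range.
void quantize_fixed16(const float *x, int16_t *q, int n, int frac_bits)
{
    const float scale = std::ldexp(1.0f, frac_bits);  // 2^frac_bits
    for (int i = 0; i < n; ++i) {
        long v = std::lround(x[i] * scale);
        q[i] = (int16_t)std::min(32767L, std::max(-32768L, v));
    }
}
```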
## (2) HLS Accelerator and Simulation
This part is too involved to introduce briefly. __The current design does not include C/RTL co-simulation, because the testbench always overflows.__ If anyone can solve this, please tell me and upload a fix. Thanks!
## (3) Vivado Block Design
Just connect the YOLOv2 IP in the Vivado block design. Only the clocking wizard configuration needs care: the input clock is 100 MHz, the output clock is 150 MHz, and the __reset pin is active-low__. That's all.
## (4) Vivado SDK for Zedboard
This step produces the executable file that drives and controls the YOLOv2 accelerator in the PL. Here, I reserved 0x1000_0000 bytes of memory for the accelerator to read/write feature maps and read weights.
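As an illustration, a bare-metal control routine in the SDK might look like the sketch below. The base address, register offsets, and buffer split are placeholders: an HLS-generated IP does expose an ap_start/ap_done control register at offset 0x00 of its AXI-Lite map, but the data-register offsets here are made up, not the repo's actual map.

```cpp
#include <cstdint>

// Illustrative addresses: the AXI-Lite base comes from the Vivado Address
// Editor, and 0x1000_0000 bytes of DDR are set aside for the accelerator.
#define YOLO2_CTRL_BASE 0x43C00000u
#define ACC_DDR_BASE    0x10000000u

static inline void reg_write(uint32_t off, uint32_t val) {
    *(volatile uint32_t *)(YOLO2_CTRL_BASE + off) = val;
}
static inline uint32_t reg_read(uint32_t off) {
    return *(volatile uint32_t *)(YOLO2_CTRL_BASE + off);
}

void run_one_layer(void)
{
    reg_write(0x10, ACC_DDR_BASE);             // hypothetical: ifm/weight base
    reg_write(0x18, ACC_DDR_BASE + 0x800000);  // hypothetical: ofm base
    reg_write(0x00, 0x1);                      // ap_start
    while ((reg_read(0x00) & 0x2) == 0) { }    // poll ap_done
}
```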
## (5) PetaLinux
The related steps have been updated in the PetaLinux directory. Just use the two files generated from the Vivado project (__the .hdf file and the .bit file__) to create a PetaLinux project, then test the YOLOv2 accelerator in it.
Every directory contains notes to help you further implement or study this accelerator.
# Design and Optimization of YOLOv2 Accelerator Based on FPGA
According to the analysis of the YOLOv2 network, most layers are serially processed, except for the routing layer. The routing layer can be implemented by setting a specific address in advance.
From the accelerator's perspective, the work is to interact with memory in order: read data from memory, process it, and write the results back. Since the amount of input and output data is very large, the loop tiling technique is applied to reuse data and reduce the number of memory accesses; it tiles the convolution loops R, C, M, N into Tr, Tc, Tm, Tn [8], as sketched below.
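The resulting loop structure looks roughly like the following, where load_tile, compute_tile, and store_tile are hypothetical stand-ins for the scatter/gather and compute modules described next:

```cpp
// Hypothetical helpers: DRAM -> buffer, compute, buffer -> DRAM.
void load_tile(int r, int c, int m, int n);
void compute_tile(int r, int c, int m, int n);
void store_tile(int r, int c, int m);

// Tile the convolution loops R, C, M, N into Tr, Tc, Tm, Tn [8]: one
// Tr x Tc x Tm output tile stays on chip while the input channels stream in.
void conv_layer_tiled(int R, int C, int M, int N,
                      int Tr, int Tc, int Tm, int Tn)
{
    for (int r = 0; r < R; r += Tr)
        for (int c = 0; c < C; c += Tc)
            for (int m = 0; m < M; m += Tm) {
                for (int n = 0; n < N; n += Tn) {
                    load_tile(r, c, m, n);     // read a Tr x Tc x Tn input tile + weights
                    compute_tile(r, c, m, n);  // accumulate Tm partial sums
                }
                store_tile(r, c, m);           // write the finished output tile back
            }
}
```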
The overall architecture of the accelerator is shown below:

Similar to [4,5,8], the accelerator has two AXI4 master interfaces and one AXI4-Lite slave interface. The AXI4-Lite slave interface is responsible for reading and writing the control, data, and status register sets. The input feature maps and weights are read concurrently through the two master interfaces, and the output feature maps are written back through the write channel.
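In Vivado HLS this interface arrangement is expressed with INTERFACE pragmas on the top function. The sketch below shows the idea; the port and bundle names are illustrative, not the repo's exact ones:

```cpp
// Two AXI4 masters (one reads feature maps, one reads weights; outputs go
// back over a write channel) plus one AXI4-Lite slave for control/status.
void yolov2_top(short *ifm, short *weights, short *ofm, int layer_cfg)
{
#pragma HLS INTERFACE m_axi     port=ifm     offset=slave bundle=DATA_BUS0
#pragma HLS INTERFACE m_axi     port=weights offset=slave bundle=DATA_BUS1
#pragma HLS INTERFACE m_axi     port=ofm     offset=slave bundle=DATA_BUS0
#pragma HLS INTERFACE s_axilite port=layer_cfg bundle=CTRL_BUS
#pragma HLS INTERFACE s_axilite port=return    bundle=CTRL_BUS
    // ... scatter, compute, gather ...
}
```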
The Data Scatter module generates the corresponding write addresses and distributes the data read from DRAM to the on-chip buffers. The Data Gather module generates the DRAM write-back addresses and writes the data in the output buffers back to DRAM. The other red modules in the figure handle the convolutional layer (Conv and Leaky ReLU), the max-pooling layer (Pool), and the reorg layer (Reorg).
## Weight Arrangement
The effective FPGA bandwidth goes up with increasing burst length and finally flattens out above some burst-length threshold [7]. The data tiling technique usually results in discontinuous DRAM accesses for a row-major data layout. To reduce the number of memory accesses and increase the effective memory bandwidth, we arrange the kernel weights for an entire tile into a continuous block, ensuring high utilization of the external memory bandwidth [3].
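A host-side rearrangement pass of this kind might look like the following sketch; the M x N x K x K row-major source layout is an assumption:

```cpp
#include <algorithm>

// Copy the weights so that everything one (Tm, Tn) tile needs is contiguous
// in DRAM, turning many short reads into one long AXI burst [3].
void arrange_weights(const short *src, short *dst,
                     int M, int N, int K, int Tm, int Tn)
{
    int idx = 0;
    for (int m0 = 0; m0 < M; m0 += Tm)
        for (int n0 = 0; n0 < N; n0 += Tn)
            for (int m = m0; m < std::min(m0 + Tm, M); ++m)
                for (int n = n0; n < std::min(n0 + Tn, N); ++n)
                    for (int k = 0; k < K * K; ++k)
                        dst[idx++] = src[(m * N + n) * K * K + k];
}
```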
## Parallel Convolution Engine
The acceleration strategy for the convolutional layer is similar to [5][6]: it exploits both input and output parallelism. Multiple parallel multiplication units and adder trees realize input parallelism (Tn) and output parallelism (Tm) in the convolution calculation: the Tm*Tn multiplication units operate in parallel, and adder trees of depth log2(Tn) accumulate their products in a pipeline to generate the partial sums.
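One fully unrolled cycle of such an engine might look like the sketch below; Tn = 4 and Tm = 32 are chosen only to match the table further down, and the code is illustrative, not the repo's exact engine:

```cpp
const int TN = 4;   // input parallelism
const int TM = 32;  // output parallelism

// One cycle: TM*TN multipliers fire in parallel, then a log2(TN)-deep
// adder tree per output channel reduces the products into a partial sum.
// Loops inside the pipelined region are fully unrolled by HLS.
void compute_cell(short ifm[TN], short wgt[TM][TN], int psum[TM])
{
#pragma HLS PIPELINE II=1
    for (int m = 0; m < TM; ++m) {
#pragma HLS UNROLL
        int tree[TN];
        for (int n = 0; n < TN; ++n)
            tree[n] = ifm[n] * wgt[m][n];      // parallel multipliers
        for (int s = TN / 2; s > 0; s >>= 1)   // adder tree, depth log2(TN)
            for (int n = 0; n < s; ++n)
                tree[n] += tree[n + s];
        psum[m] += tree[0];                    // accumulate the partial sum
    }
}
```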
## Ping-Pong Operation
Similar to [8], the design uses ping-pong buffers to overlap the reading of input feature maps and weights, the writing of output feature maps, and the computation, which greatly improves the dynamic utilization of the computing engines.
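The control flow is roughly the following software-style sketch; in the actual hardware the load of one buffer and the compute on the other run concurrently, and the helpers and tile size are hypothetical:

```cpp
const int TILE_WORDS = 26 * 26 * 32;  // illustrative tile size

// Hypothetical helpers standing in for the scatter and compute stages.
void load_tile_into(short *buf, int tile_idx);
void compute_tile_from(const short *buf);

// Ping-pong buffering: while tile i is computed from one buffer, tile i+1
// is already being fetched into the other, hiding the memory latency [8].
void process_all_tiles(int num_tiles)
{
    static short buf[2][TILE_WORDS];
    load_tile_into(buf[0], 0);                    // prologue: fetch the first tile
    for (int i = 0; i < num_tiles; ++i) {
        int cur = i & 1;
        if (i + 1 < num_tiles)
            load_tile_into(buf[cur ^ 1], i + 1);  // prefetch the next tile...
        compute_tile_from(buf[cur]);              // ...while this one is computed
    }
}
```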
# Evaluation
Experiments show that in HLS a floating-point addition requires three DSPs and a floating-point multiplication requires two, while a 16-bit fixed-point multiplication requires one DSP and a 16-bit fixed-point addition can be implemented with LUTs alone. After place and route, the resource consumption of the fixed-16 design (Tn=2, Tm=32, Tr=26, Tc=26) is as follows:
| Configuration | DSP | BRAM | LUT | FF | Freq | Device |
| ----- | ----- | ----- | ----- | ----- | ----- |----- |
|INT16(n4m32) old| 153(69%) | 88(63%) | 35977(68%) | 36247(34%) | 150MHz |Zedboard|
|FT32(n4m23) old| 209(95%) |115(82%) | 36348(68%) | 64077(60%) | 140MHz |Zedboard|
|INT16(n4m32) old| 147(6%) | 88(10%) | 36759(13%) | 30447(6%) | 180MHz |ZCU102 |
|FT32-(n8m28,CONV II=3,POOL II=2) default float32| 259(72%) | 91(42%) | 31985(45%) | 53728(38%) | 180MHz |EdgeBoard(ZU3EG) |
|FT32-(n4m36,CONV II=3,POOL II=2) current float32 mp| 334(93%) | 109(50%) | 43877(62%) | 73854(52%) | 150MHz |EdgeBoard(ZU3EG) |
In the current design, DSP and BRAM are the scarcest resources. The DSP cost can be further reduced (many multiplications have redundant bit width), and the BRAM cost can also be reduced (as Shen [1] notes, HLS allocates BRAM in power-of-two sizes, so many BRAMs are partly unused).
The performance of the two versions (float-32 and fixed-16) was also compared.