# YOLOv2 Accelerator on Xilinx Zynq SoCs (PYNQ-Z2, Zedboard, and ZCU102)
A demo for accelerating YOLOv2 on Xilinx FPGA boards: PYNQ-Z2, Zedboard, and ZCU102.
__I graduated from Jiangnan University, China, on July 1, 2019. The related papers are now available.__
Master thesis ["Research of Scalability on FPGA-based Neural Network Accelerator"](https://kns.cnki.net/KCMS/detail/detail.aspx?dbcode=CMFD&dbname=CMFDTEMP&filename=1019228234.nh&uid=WEEvREcwSlJHSldRa1FhdXNXaEhoOGhUTzA5T0tESzdFZ2pyR1NJR1ZBaz0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!&v=MjE5NTN5dmdXN3JBVkYyNkY3RzZGdFBQcTVFYlBJUjhlWDFMdXhZUzdEaDFUM3FUcldNMUZyQ1VSTE9lWnVkdUY=)
Journal article ["Design and implementation of FPGA-based deep learning object detection system"](https://kns.cnki.net/KCMS/detail/detail.aspx?dbcode=CJFQ&dbname=CJFDLAST2019&filename=DZJY201908009&uid=WEEvREcwSlJHSldRa1FhdXNXaEhoOGhUTzA5T0tESzdFZ2pyR1NJR1ZBaz0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!&v=MDU0NDJDVVJMT2VadWR1Rnl2Z1c3ck1JVGZCZDdHNEg5ak1wNDlGYllSOGVYMUx1eFlTN0RoMVQzcVRyV00xRnI=)
Journal article ["Design and Implementation of YOLOv2 Accelerator Based on Zynq7000 FPGA Heterogeneous Platform"](https://kns.cnki.net/KCMS/detail/detail.aspx?dbcode=CJFQ&dbname=CJFDTEMP&filename=KXTS201910005&uid=WEEvREcwSlJHSldRa1FhdXNXaEhoOGhUTzA5T0tESzdFZ2pyR1NJR1ZBaz0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!&v=MjkwNzdXTTFGckNVUkxPZVp1ZHVGeXZnVzdyT0xqWGZmYkc0SDlqTnI0OUZZWVI4ZVgxTHV4WVM3RGgxVDNxVHI=)
For PYNQ-Z2 and Zedboard, apart from the final Linux application (for PYNQ, see the PYNQ directory; for Zedboard and ZCU102, see the SDK and PetaLinux directories), the remaining steps are almost the same:
## (1) Software Simulation
Firstly, download the darknet source from [https://github.com/pjreddie/darknet](https://github.com/pjreddie/darknet) and the pretrained yolov2.weights from [https://pjreddie.com/media/files/yolov2.weights](https://pjreddie.com/media/files/yolov2.weights).
Secondly, modify darknet's weight-loading function to extract the weights and biases we want (here, the batch-normalization parameters can be folded into the weights and biases).
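Concretely, for each output channel the batch-norm transform y = gamma * (x - mean) / sqrt(var + eps) + beta can be folded into the preceding convolution by scaling its weights and adjusting its bias. A minimal sketch of the idea; the array layout and function name are my assumptions, not darknet's actual code:

```cpp
#include <cmath>

// Fold per-output-channel batch-norm parameters (gamma, beta, mean, var)
// into the conv weights and biases, so inference only needs a plain
// multiply-accumulate. M output channels, N input channels, K x K kernels,
// weights flattened row-major as w[m][n][k][k].
void fold_batchnorm(float *w, float *b,
                    const float *gamma, const float *beta,
                    const float *mean, const float *var,
                    int M, int N, int K, float eps = 1e-5f)
{
    for (int m = 0; m < M; ++m) {
        float scale = gamma[m] / std::sqrt(var[m] + eps);
        for (int i = 0; i < N * K * K; ++i)
            w[m * N * K * K + i] *= scale;          // scale every kernel weight
        b[m] = beta[m] + scale * (b[m] - mean[m]);  // fold mean/beta into the bias
    }
}
```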
Thirdly, since multiply and add units implemented in hardware logic consume too many FPGA resources at float-32 precision [3][6], lower-precision arithmetic should be used instead. Here, I follow [3] and [6] and quantize the input/output feature maps, weights, and biases to dynamic fixed-16, then replace the float-32 multiply, add, and ReLU operations with their fixed-16 counterparts.
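The core of dynamic fixed-point quantization is choosing, per layer, how many of the 16 bits are fractional so that the largest magnitude still fits. A minimal sketch of that idea (the exact scheme in [3][6] may differ):

```cpp
#include <cstdint>
#include <cmath>
#include <algorithm>

// Pick the number of fractional bits for one layer: 1 sign bit, enough
// integer bits to hold the largest magnitude, and the rest is fraction.
int choose_frac_bits(const float *x, int n)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; ++i)
        max_abs = std::max(max_abs, std::fabs(x[i]));
    int int_bits = (max_abs >= 1.0f) ? (int)std::ceil(std::log2(max_abs)) : 0;
    return 15 - int_bits;
}

// Quantize floats to int16 with the chosen scale, saturating at the int16 range.
void quantize_fixed16(const float *x, int16_t *q, int n, int frac_bits)
{
    const float scale = std::ldexp(1.0f, frac_bits);  // 2^frac_bits
    for (int i = 0; i < n; ++i) {
        long v = std::lround(x[i] * scale);
        q[i] = (int16_t)std::min(32767L, std::max(-32768L, v));
    }
}
```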
## (2) HLS Accelerator and Simulation
This part is too involved to introduce briefly. __The current design does not include C/RTL co-simulation, because the testbench always overflows.__ If anyone can solve this, please tell me and upload a fix. Thanks!
## (3) Vivado Block Design
Just connect the YOLOv2 IP in the Vivado block design. Only the clocking wizard configuration needs care: the input clock is 100 MHz, the output clock is 150 MHz, and the __reset pin is active-low__. That's all.
## (4) Vivado SDK for Zedboard
This step produces the executable file that drives and controls the YOLOv2 accelerator in the PL. Here, I reserved 0x1000_0000 bytes of memory for the accelerator to read/write feature maps and read weights.
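As an illustration, a bare-metal control routine in the SDK might look like the sketch below. The base address, register offsets, and buffer split are placeholders: an HLS-generated IP does expose an ap_start/ap_done control register at offset 0x00 of its AXI-Lite map, but the data-register offsets here are made up, not the repo's actual map.

```cpp
#include <cstdint>

// Illustrative addresses: the AXI-Lite base comes from the Vivado Address
// Editor, and 0x1000_0000 bytes of DDR are set aside for the accelerator.
#define YOLO2_CTRL_BASE 0x43C00000u
#define ACC_DDR_BASE    0x10000000u

static inline void reg_write(uint32_t off, uint32_t val) {
    *(volatile uint32_t *)(YOLO2_CTRL_BASE + off) = val;
}
static inline uint32_t reg_read(uint32_t off) {
    return *(volatile uint32_t *)(YOLO2_CTRL_BASE + off);
}

void run_one_layer(void)
{
    reg_write(0x10, ACC_DDR_BASE);             // hypothetical: ifm/weight base
    reg_write(0x18, ACC_DDR_BASE + 0x800000);  // hypothetical: ofm base
    reg_write(0x00, 0x1);                      // ap_start
    while ((reg_read(0x00) & 0x2) == 0) { }    // poll ap_done
}
```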
## (5) PetaLinux
The related steps have been updated in the PetaLinux directory. Just use the two files generated from the Vivado project (__the .hdf file and the .bit file__) to create a PetaLinux project, then test the YOLOv2 accelerator in it.
Every directory contains notes to help you further implement or study this accelerator.
# Design and Optimization of YOLOv2 Accelerator Based on FPGA
According to the analysis of the YOLOv2 network, most layers are serially processed, except for the routing layer. The routing layer can be implemented by setting a specific address in advance.
From the accelerator's perspective, the work is to interact with memory in order: read data from memory, process it, and write the results back. Since the amount of input and output data is very large, the loop tiling technique is applied to reuse data and reduce the number of memory accesses; it tiles the convolution loops R, C, M, N into Tr, Tc, Tm, Tn [8], as sketched below.
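The resulting loop structure looks roughly like the following, where load_tile, compute_tile, and store_tile are hypothetical stand-ins for the scatter/gather and compute modules described next:

```cpp
// Hypothetical helpers: DRAM -> buffer, compute, buffer -> DRAM.
void load_tile(int r, int c, int m, int n);
void compute_tile(int r, int c, int m, int n);
void store_tile(int r, int c, int m);

// Tile the convolution loops R, C, M, N into Tr, Tc, Tm, Tn [8]: one
// Tr x Tc x Tm output tile stays on chip while the input channels stream in.
void conv_layer_tiled(int R, int C, int M, int N,
                      int Tr, int Tc, int Tm, int Tn)
{
    for (int r = 0; r < R; r += Tr)
        for (int c = 0; c < C; c += Tc)
            for (int m = 0; m < M; m += Tm) {
                for (int n = 0; n < N; n += Tn) {
                    load_tile(r, c, m, n);     // read a Tr x Tc x Tn input tile + weights
                    compute_tile(r, c, m, n);  // accumulate Tm partial sums
                }
                store_tile(r, c, m);           // write the finished output tile back
            }
}
```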
The overall architecture of the accelerator is shown below:

Similar to [4,5,8], the accelerator has two AXI4 master interfaces and one AXI4-Lite slave interface. The AXI4-Lite slave interface is responsible for reading and writing the control, data, and status register sets. The input feature maps and weights are read concurrently through the two master interfaces, and the output feature maps are written back through the write channel.
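In Vivado HLS this interface arrangement is expressed with INTERFACE pragmas on the top function. The sketch below shows the idea; the port and bundle names are illustrative, not the repo's exact ones:

```cpp
// Two AXI4 masters (one reads feature maps, one reads weights; outputs go
// back over a write channel) plus one AXI4-Lite slave for control/status.
void yolov2_top(short *ifm, short *weights, short *ofm, int layer_cfg)
{
#pragma HLS INTERFACE m_axi     port=ifm     offset=slave bundle=DATA_BUS0
#pragma HLS INTERFACE m_axi     port=weights offset=slave bundle=DATA_BUS1
#pragma HLS INTERFACE m_axi     port=ofm     offset=slave bundle=DATA_BUS0
#pragma HLS INTERFACE s_axilite port=layer_cfg bundle=CTRL_BUS
#pragma HLS INTERFACE s_axilite port=return    bundle=CTRL_BUS
    // ... scatter, compute, gather ...
}
```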
The Data Scatter module generates the corresponding write addresses and distributes the data read from DRAM to the on-chip buffers. The Data Gather module generates the DRAM write-back addresses and writes the data in the output buffers back to DRAM. The other red modules in the figure handle the convolutional layer (Conv and Leaky ReLU), the max-pooling layer (Pool), and the reorg layer (Reorg).
## Weight Arrangement
The effective FPGA bandwidth goes up with increasing burst length and finally flattens out above some burst-length threshold [7]. The data tiling technique usually results in discontinuous DRAM accesses for a row-major data layout. To reduce the number of memory accesses and increase the effective memory bandwidth, we arrange the kernel weights for an entire tile into a continuous block, ensuring high utilization of the external memory bandwidth [3].
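A host-side rearrangement pass of this kind might look like the following sketch; the M x N x K x K row-major source layout is an assumption:

```cpp
#include <algorithm>

// Copy the weights so that everything one (Tm, Tn) tile needs is contiguous
// in DRAM, turning many short reads into one long AXI burst [3].
void arrange_weights(const short *src, short *dst,
                     int M, int N, int K, int Tm, int Tn)
{
    int idx = 0;
    for (int m0 = 0; m0 < M; m0 += Tm)
        for (int n0 = 0; n0 < N; n0 += Tn)
            for (int m = m0; m < std::min(m0 + Tm, M); ++m)
                for (int n = n0; n < std::min(n0 + Tn, N); ++n)
                    for (int k = 0; k < K * K; ++k)
                        dst[idx++] = src[(m * N + n) * K * K + k];
}
```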
## Parallel Convolution Engine
The acceleration strategy for the convolutional layer is similar to [5][6]: it exploits both input and output parallelism. Multiple parallel multiplication units and adder trees realize input parallelism (Tn) and output parallelism (Tm) in the convolution calculation: the Tm*Tn multiplication units operate in parallel, and adder trees of depth log2(Tn) accumulate their products in a pipeline to generate the partial sums.
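One fully unrolled cycle of such an engine might look like the sketch below; Tn = 4 and Tm = 32 are chosen only to match the table further down, and the code is illustrative, not the repo's exact engine:

```cpp
const int TN = 4;   // input parallelism
const int TM = 32;  // output parallelism

// One cycle: TM*TN multipliers fire in parallel, then a log2(TN)-deep
// adder tree per output channel reduces the products into a partial sum.
// Loops inside the pipelined region are fully unrolled by HLS.
void compute_cell(short ifm[TN], short wgt[TM][TN], int psum[TM])
{
#pragma HLS PIPELINE II=1
    for (int m = 0; m < TM; ++m) {
#pragma HLS UNROLL
        int tree[TN];
        for (int n = 0; n < TN; ++n)
            tree[n] = ifm[n] * wgt[m][n];      // parallel multipliers
        for (int s = TN / 2; s > 0; s >>= 1)   // adder tree, depth log2(TN)
            for (int n = 0; n < s; ++n)
                tree[n] += tree[n + s];
        psum[m] += tree[0];                    // accumulate the partial sum
    }
}
```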
## Ping-Pong Operation
Similar to [8], the design uses ping-pong buffers to overlap the reading of input feature maps and weights, the writing of output feature maps, and the computation, which greatly improves the dynamic utilization of the computing engines.
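The control flow is roughly the following software-style sketch; in the actual hardware the load of one buffer and the compute on the other run concurrently, and the helpers and tile size are hypothetical:

```cpp
const int TILE_WORDS = 26 * 26 * 32;  // illustrative tile size

// Hypothetical helpers standing in for the scatter and compute stages.
void load_tile_into(short *buf, int tile_idx);
void compute_tile_from(const short *buf);

// Ping-pong buffering: while tile i is computed from one buffer, tile i+1
// is already being fetched into the other, hiding the memory latency [8].
void process_all_tiles(int num_tiles)
{
    static short buf[2][TILE_WORDS];
    load_tile_into(buf[0], 0);                    // prologue: fetch the first tile
    for (int i = 0; i < num_tiles; ++i) {
        int cur = i & 1;
        if (i + 1 < num_tiles)
            load_tile_into(buf[cur ^ 1], i + 1);  // prefetch the next tile...
        compute_tile_from(buf[cur]);              // ...while this one is computed
    }
}
```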
# Evaluation
Experiments show that in HLS a floating-point addition requires three DSPs and a floating-point multiplication requires two, while a 16-bit fixed-point multiplication requires one DSP and a 16-bit fixed-point addition can be implemented with LUTs alone. After place and route, the resource consumption of the fixed-16 design (Tn=2, Tm=32, Tr=26, Tc=26) is as follows:
| Configuration | DSP | BRAM | LUT | FF | Freq | Device |
| ----- | ----- | ----- | ----- | ----- | ----- |----- |
|INT16(n4m32) old| 153(69%) | 88(63%) | 35977(68%) | 36247(34%) | 150MHz |Zedboard|
|FT32(n4m23) old| 209(95%) |115(82%) | 36348(68%) | 64077(60%) | 140MHz |Zedboard|
|INT16(n4m32) old| 147(6%) | 88(10%) | 36759(13%) | 30447(6%) | 180MHz |ZCU102 |
|FT32-(n8m28,CONV II=3,POOL II=2) default float32| 259(72%) | 91(42%) | 31985(45%) | 53728(38%) | 180MHz |EdgeBoard(ZU3EG) |
|FT32-(n4m36,CONV II=3,POOL II=2) current float32 mp| 334(93%) | 109(50%) | 43877(62%) | 73854(52%) | 150MHz |EdgeBoard(ZU3EG) |
In the current design, DSP and BRAM are the scarcest resources. The DSP cost can be further reduced (many multiplications have redundant bit width), and the BRAM cost can also be reduced (as Shen [1] notes, HLS allocates BRAM in power-of-two sizes, so many BRAMs are partly unused).
The performance of the two versions (float-32 and fixed-16) was also compared.