32. Linux on Zynq / Real-Time Video Processing
Post date: Sep 01, 2016 2:41:28 PM
Recently completed another project with the Zybo! The video can be seen below. Also, the GitHub repository can be found here! Since I really enjoyed working on this project, I will describe the RTL design and some of the software here! By the way, after this post, I will start posting these small blogs on my HACKADAY.IO page, since a lot of other hobbyists / makers are building a community there.
RTL Design. The design from the perspective of the Vivado IP Integrator can be viewed below! I really enjoy the graphical approach to creating a top module; I find it all too easy to make a mistake when creating a regular top module in either Verilog or VHDL.
RTL design
So, in every project I do I try to learn something new. In the case of this project, one of the new things I needed to learn was the AXI VDMA. In general, a DMA is a component within a computer system that allows a peripheral device to access memory without having to go through a host device. The host device is typically a more general-purpose processor that runs the system's higher-level, application software. DMAs are particularly useful when the host device needs to spend its precious clock cycles on time-critical ( i.e. real-time ) tasks and can't afford to lose time on costly context switching. Prior to this project, most of my experience with DMAs was with the AXI DMA ( performs transfers between memory and a device ) and the AXI Central DMA ( CDMA ) ( performs transfers between memory and memory ).
Of course, the VDMA is nothing but a specialized DMA for transferring video frames! Similar to the AXI DMA, main memory is accessed through AXI4-Full interfaces, whereas access to a device is done through the simpler AXI4-Stream interface. The major difference is that synchronization on each new video frame is done with either the user-defined ( tuser ) signal in the AXI4-Stream interface or an fsync signal on the VDMA itself. The last ( tlast ) signal in the AXI4-Stream interface signifies the end of a video line. But a lot of the lower-level details are in fact handled by the VDMA and the Xilinx HLS Video library. I should point out that the only core I created myself is the HLS Video core.
Close look at how the HLS Video core is connected to a VDMA core. The AXI memory-map to stream ( M_AXI_MM2S ) and stream to memory-map ( M_AXI_S2MM ) interfaces connect to main memory through the PS.
Despite the fact that the RTL design utilizes a significant amount of resources, one of my biggest regrets with the project is that I don't do as much video processing in the FPGA as I originally wanted. Specifically, I only perform the resizing of 320x240 video frames to 640x480 video frames in the HLS Video core. The other image processing algorithms ( i.e. Sobel, Harris, and Canny ) are computed with OpenCV image processing functions in software. In my future project with my HD camera, I definitely intend to focus a bit more on hardware acceleration with the FPGA!
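To give an idea of how little HDL is actually involved, here's a minimal sketch, assuming Vivado HLS and the HLS Video library, of what a resize core like this can look like; the function name and template parameters are my own placeholders, not the project's exact source:

    // Minimal sketch of a 320x240 -> 640x480 resize core built on hls_video.h.
    #include "hls_video.h"
    #include "ap_axi_sdata.h"
    #include "hls_stream.h"

    typedef ap_axiu<32, 1, 1, 1>          pixel_t;     // 4 x 8-bit channels packed into 32 bits
    typedef hls::stream<pixel_t>          AXI_STREAM;
    typedef hls::Mat<240, 320, HLS_8UC4>  SRC_IMAGE;
    typedef hls::Mat<480, 640, HLS_8UC4>  DST_IMAGE;

    void hls_video_core(AXI_STREAM &src_axi, AXI_STREAM &dst_axi)
    {
    #pragma HLS INTERFACE axis      port=src_axi
    #pragma HLS INTERFACE axis      port=dst_axi
    #pragma HLS INTERFACE s_axilite port=return
    #pragma HLS DATAFLOW

        SRC_IMAGE src(240, 320);
        DST_IMAGE dst(480, 640);

        // The library handles tuser ( start of frame ) and tlast ( end of line ).
        hls::AXIvideo2Mat(src_axi, src);
        hls::Resize(src, dst);              // bilinear interpolation by default
        hls::Mat2AXIvideo(dst, dst_axi);
    }

The library takes care of the tuser / tlast handshaking mentioned above, so the core itself only ever sees pixels.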
There are two VDMAs implemented in this project. The Digilent AXI Display Core, which drives the VGA interface, acquires video frames with one VDMA, and the HLS Video core depends on the second. Both VDMAs are configured in the free-running / parking modes of operation; however, the VDMA connected to the AXI Display Core only needs to read.
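For reference, here's roughly how the read channel of a VDMA can be configured with the Standalone driver for free-running operation and then parked on a frame; the handle names, dimensions, and the three frame buffers are assumptions for illustration, not the project's actual values:

    // Rough sketch: set up the VDMA read channel and park it on frame 0.
    #include "xaxivdma.h"

    int setup_display_vdma(XAxiVdma *vdma, u16 device_id, UINTPTR frame_addrs[3])
    {
        XAxiVdma_Config *cfg = XAxiVdma_LookupConfig(device_id);
        // Under Linux, the mmap'd virtual base address goes here instead of the
        // physical address from the config structure.
        XAxiVdma_CfgInitialize(vdma, cfg, cfg->BaseAddress);

        XAxiVdma_DmaSetup setup = {0};
        setup.VertSizeInput     = 480;         // lines per frame
        setup.HoriSizeInput     = 640 * 4;     // bytes per line ( 4 bytes per pixel )
        setup.Stride            = 640 * 4;
        setup.EnableCircularBuf = 1;           // free-running over all frame buffers
        for (int i = 0; i < 3; i++)
            setup.FrameStoreStartAddr[i] = frame_addrs[i];

        XAxiVdma_DmaConfig(vdma, XAXIVDMA_READ, &setup);
        XAxiVdma_DmaSetBufferAddr(vdma, XAXIVDMA_READ, setup.FrameStoreStartAddr);
        XAxiVdma_DmaStart(vdma, XAXIVDMA_READ);

        // Parking makes the channel re-read the same frame until told otherwise.
        return XAxiVdma_StartParking(vdma, 0, XAXIVDMA_READ);
    }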
Software. I of course needed to learn how to configure the USB controller as an EHCI host. Luckily, Xilinx has a wiki where they demonstrate how to properly edit the device tree and enable the right drivers in the kernel ( it's almost nice that I'm doing this type of project so late ). Not mentioned in the wiki, though, was the fact that the USB libraries need to be included in the root file system. And, in order to run the USB utilities, those need to be enabled in the root file system as well. Apart from changing the device tree, all of these configurations are done with the petalinux-config tool, of course!
Another issue I ran into was how to take advantage of interrupts from a user application, without having to create a separate Linux driver module for each core. Plus, I especially wanted to use the Standalone drivers, even for the GPIO. Previously, my go-to driver for the AXI GPIO core in Linux had been the sysfs GPIO interface, but I didn't like the fact that I would have to constantly access files in order to utilize a few GPIO signals. Instead, I took full advantage of the generic userspace I/O ( UIO ) driver, which not only lets me import Standalone drivers but also makes accessing interrupts super easy! I still plan on implementing a Linux driver module ( just a little goal of mine ), but I want to explore HLS a bit more in depth first.
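Waiting on an interrupt through UIO really is as simple as a blocking read on the device node. A minimal sketch ( the device node name is an assumption; it depends on the device tree entry ):

    // Sketch: wait for interrupts through the generic UIO driver.
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int wait_for_interrupts(void)
    {
        int fd = open("/dev/uio0", O_RDWR);
        if (fd < 0) { perror("open"); return -1; }

        for (;;) {
            uint32_t enable = 1;
            write(fd, &enable, sizeof(enable));   // ( re )enable the interrupt if the UIO driver requires it

            uint32_t count;
            read(fd, &count, sizeof(count));      // blocks until the interrupt fires
            printf("interrupt number %u\n", count);
        }
    }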
So, here's a small outline on how the software is structured. I'm going to keep the code snippets down to a few short sketches, since I find a ton of snippets is not as appealing as a graphical representation, for instance the screenshots taken from the IP Integrator.
Perform the necessary memory maps between the application's virtual addresses and the specified physical memory! The Standalone drivers and the drivers I write are intended to run closer to the hardware; in other words, the virtual addresses granted by the Linux kernel for user applications are no good!
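A small sketch of that mapping, using /dev/mem; the Standalone drivers are then initialized with the returned virtual address instead of the physical base they would normally receive:

    // Sketch: map a core's register space ( or a reserved frame buffer ) into the process.
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void *map_physical(uintptr_t phys_addr, size_t length)
    {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0)
            return NULL;

        void *virt = mmap(NULL, length, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, (off_t)phys_addr);
        close(fd);
        return (virt == MAP_FAILED) ? NULL : virt;
    }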
Configure the camera and set the resolution to 320x240! Programming this step is simple, once you ensure the kernel has the right drivers and libraries! And, by simple, I am referring to the OpenCV VideoCapture class.
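In other words, something along these lines ( property names shown for OpenCV 3.x; 2.x uses the CV_CAP_PROP_* equivalents ):

    // Sketch: open the webcam and drop the capture resolution to 320x240.
    #include <opencv2/opencv.hpp>

    cv::VideoCapture open_camera()
    {
        cv::VideoCapture cap(0);                   // first V4L2 device, e.g. /dev/video0
        cap.set(cv::CAP_PROP_FRAME_WIDTH,  320);
        cap.set(cv::CAP_PROP_FRAME_HEIGHT, 240);
        return cap;
    }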
Configure the display! Because I am using the Digilent AXI Display core, I only needed to make a few modifications to their driver. In fact, I needed to make similar modifications to the other Standalone drivers since they all depend on physical addresses, not the kernel's virtual addresses. ( In an effort not to repeat myself, I won't repeat this step but know that it's implied. )
Associate OpenCV with the video frame buffers. This step involves associating the video frame buffers --- which are placed outside the memory space of the kernel --- with OpenCV Mat objects. This step is crucial since I want to depend on the functionality of OpenCV instead of re-implementing all the filters! More on this detail in the main loop of the application!
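The trick is nothing more than the cv::Mat constructor that wraps existing memory instead of allocating its own. A sketch, assuming the frame buffer has already been mapped into the application:

    // Sketch: wrap an mmap'd frame buffer in a cv::Mat header. No pixel data is
    // copied; OpenCV writes straight into the memory the VDMA reads from.
    #include <opencv2/core/core.hpp>

    cv::Mat wrap_frame_buffer(void *mapped_buffer)
    {
        // 640x480, 4 channels of 8 bits, 640 * 4 bytes per row.
        return cv::Mat(480, 640, CV_8UC4, mapped_buffer, 640 * 4);
    }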
Configure GPIO driver for user I/O. Nothing too fancy here, other than I am once again using the Standalone driver. A separate thread is launched in order to avoid polling the GPIO core for new input; the thread is written such that it waits on the GPIO's interrupt.
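A simplified sketch of that thread, blocking on the GPIO core's UIO node and then reading the switches with the Standalone XGpio driver; the handles here are assumed to be initialized elsewhere:

    // Sketch: GPIO input thread that waits on the interrupt instead of polling.
    #include <stdint.h>
    #include <unistd.h>
    #include "xgpio.h"

    extern XGpio gpio;         // initialized against the mmap'd base address
    extern int   gpio_uio_fd;  // the /dev/uioN node tied to the GPIO interrupt

    void *gpio_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            uint32_t count;
            read(gpio_uio_fd, &count, sizeof(count));           // sleep until a GPIO interrupt
            uint32_t switches = XGpio_DiscreteRead(&gpio, 1);   // channel 1
            XGpio_InterruptClear(&gpio, XGPIO_IR_CH1_MASK);
            // ... hand the new switch state to the main loop ...
            (void)switches;
        }
        return NULL;
    }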
Configure the HLS Video core. This step not only involves setting up the HLS Video core, but also the respective VDMA. A nice feature of the HLS tool is that a software driver is automatically generated for both Linux ( using the generic UIO driver ) and Standalone... but I didn't like how the Linux / UIO driver was structured, so I ended up doing what I did for the other cores that had Standalone drivers!
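Just to illustrate the flavor of the generated Standalone driver: HLS derives the X<Top>_* functions from the top function's name, so the identifiers and header below are hypothetical, not the ones in the repository:

    // Sketch: kick off the HLS core together with its VDMA ( hypothetical names ).
    #include "xhls_video_core.h"   // hypothetical generated header
    #include "xaxivdma.h"

    void start_resize(XHls_video_core *core, XAxiVdma *vdma)
    {
        // Write channel: HLS core to memory; read channel: memory to HLS core.
        XAxiVdma_DmaStart(vdma, XAXIVDMA_WRITE);
        XAxiVdma_DmaStart(vdma, XAXIVDMA_READ);

        XHls_video_core_Start(core);
        while (!XHls_video_core_IsDone(core))
            ;                                   // or wait on the core's UIO interrupt
    }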
Run the main loop! This step of course is composed of multiple steps! Before getting to what those steps are, I want to point out that frames are buffered to ensure the visuals shown on the display appear to change smoothly; without frame buffering, you can actually see the changes occurring at a line and pixel level. Thus, existing outside the memory space of the kernel are a process frame and a display frame, both of which have a resolution of 640x480 and 4-channel 8-bit pixels.
Display the frame for which the processing is finished. Pretty self-explanatory. The details of this operation are abstracted away by the Digilent driver. Since the source code is freely available, you can see that changing the displayed frame involves configuring the VDMA to park at a selected frame. I'm not sure whether the term "park" is specific to the AXI VDMA core or general to other DMAs specialized for video, but it basically means the VDMA is configured to continuously read over a single frame.
Perform the processing on a separate frame. After a frame is captured from the webcam, a filtering algorithm can be performed. The slide switches on the Zybo board select the filtering operation: Gray, Sobel, Harris, or Canny. To be honest, the focus of the project wasn't necessarily on the filters, mainly because they're OpenCV function calls. The important sub-steps are to copy the filtered frame into an available frame buffer that exists outside the memory space of the kernel, and then trigger the VDMA core associated with the HLS Video core to perform the resizing.
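Putting those steps together, here's a condensed sketch of the main loop; the helper names are placeholders carried over from the earlier sketches, and the real control flow is in the GitHub repository:

    // Condensed sketch of the main loop ( filter choice hard-coded to Canny here ).
    #include <opencv2/opencv.hpp>

    void main_loop(cv::VideoCapture &cap,
                   cv::Mat &src_buf)   // 320x240 CV_8UC4 buffer read by the HLS core's VDMA
    {
        cv::Mat captured, gray, filtered;

        for (;;) {
            // 1 ) Park the display VDMA on the frame whose processing just finished
            //     ( abstracted away by the Digilent display driver ).

            // 2 ) Grab a 320x240 frame and apply the filter selected by the switches.
            cap >> captured;
            cv::cvtColor(captured, gray, cv::COLOR_BGR2GRAY);
            cv::Canny(gray, filtered, 50, 150);              // or Sobel / Harris / plain gray

            // 3 ) Copy the result into the frame buffer outside the kernel's memory
            //     and trigger the HLS core ( through its VDMA ) to resize it to 640x480.
            cv::cvtColor(filtered, src_buf, cv::COLOR_GRAY2BGRA);
        }
    }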
That's it! There's a lot of text that went into explaining the main points of the project. So, I would suggest looking at the source code in the GitHub repository ( see link at beginning of blog ).
What's next? Well, I am disappointed with two aspects of this project. 1 ) I didn't do as much video processing in the FPGA as I originally planned. I could have done more, but I also planned to develop the embedded design such that different filters could be selected. As far as I can see, you can't easily multiplex different video streams within a single HLS core. In retrospect, I should have learned how to utilize the AXI4-Stream Switch, but I completely overlooked its existence. 2 ) I've had an HD camera for a while, but never bothered to get the cable to connect it to the Zybo board's HDMI port. And, I've seen a video on YouTube of someone streaming video from the HD camera to the Zybo! If I can do the same, I can avoid having to reduce the resolution so drastically!
The HD camera I'll hopefully be able to use!
Moreover, this project only showcases real-time video processing, but doesn't do anything with the extracted features ( i.e. the filtered images ). So, in the future, I plan to conceive a scenario and build an embedded design that will force me to not only perform video processing in real time, but also perform pattern recognition / decision making!
But before another video capture project, I want to return to doing another audio / video-oriented project. Specifically, I have already started remaking ( and improving ) a project I did a few months ago on the Nexys 4 DDR: Visual Equalizer!