from the functions f and h. The Jacobian matrices F and H are given by equations (6) and (7).

$$ F = \frac{\partial f(x)}{\partial x} \qquad (6) $$

$$ H = \frac{\partial h(x)}{\partial x} \qquad (7) $$

estimation is calculated by equations (9) and (10). These equations multiply the differentiated quaternion, the value observed by the gyroscope minus the bias held in the state vector, and the value dt, which is the sampling interval of the sensors. After n is calculated, it is normalized. Finally, the normalized value is added to the current state.

$$ n = \frac{1}{2}\,\Xi(q)\,(\omega - b)\,dt \qquad (9) $$

$$ f(\hat{x}) = \hat{x} + \begin{bmatrix} \dfrac{n^{T}}{|n|} & 0 & 0 & 0 \end{bmatrix}^{T} \qquad (10) $$

3) Observation Model
for H and F are 6 and 7.
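The Jacobian matrices of equations (6) and (7) collect the partial derivatives of f and h with respect to the state. A common way to check an analytically derived Jacobian in a C test bench is to compare it with a central-difference approximation. The sketch below shows that idea only; the observation dimension of 6, the step size, and the names numerical_jacobian and toy_h are assumptions made for illustration and are not taken from the original implementation.

```c
#define N_STATE 7  /* state dimension used in this paper */
#define N_OBS   6  /* observation dimension, assumed for this sketch */

/* Signature of a vector-valued function y = func(x). */
typedef void (*obs_func_t)(const float x[N_STATE], float y[N_OBS]);

/* Central-difference approximation of J[i][j] = d func_i / d x_j. */
void numerical_jacobian(obs_func_t func, const float x[N_STATE], float eps,
                        float J[N_OBS][N_STATE])
{
    float xp[N_STATE], xm[N_STATE], yp[N_OBS], ym[N_OBS];

    for (int j = 0; j < N_STATE; j++) {
        /* Perturb only element j, in both directions. */
        for (int k = 0; k < N_STATE; k++) {
            xp[k] = x[k];
            xm[k] = x[k];
        }
        xp[j] += eps;
        xm[j] -= eps;

        func(xp, yp);
        func(xm, ym);

        for (int i = 0; i < N_OBS; i++)
            J[i][j] = (yp[i] - ym[i]) / (2.0f * eps);
    }
}

/* Toy observation function, used only to exercise the routine. */
void toy_h(const float x[N_STATE], float y[N_OBS])
{
    for (int i = 0; i < N_OBS; i++)
        y[i] = 2.0f * x[i] + x[N_STATE - 1];
}
```

For toy_h, the expected Jacobian has 2 on the diagonal of its first six columns and 1 in the last column, so a mismatch points to an error in the differencing code rather than in the model.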
$$ J = \frac{\partial f}{\partial x} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix} $$

5) Noise components
For the noise variances, we set $\sigma_q = 0.000012$ for the quaternion, $\sigma_b = 0.00303$ for the biases of the angular rate measurements, $\sigma_a = 0.0254$ for acceleration, and $\sigma_m = 0.221$, respectively. Thus, we get the covariance matrices as follows.

$$ R = \mathrm{diag}(\sigma_a, \sigma_a, \sigma_a, \sigma_m, \sigma_m, \sigma_m) $$

Fig. 3. Design Flow in Vivado HLS

In the first step, we implement the algorithms and a test bench. Functions are written in a C-based language (C, C++ or SystemC), and the implemented programs are then tested with a test bench also written in a C-based language. The test bench code includes dummy inputs, function calls and result checking. This step is repeated until the functions work correctly.

In the second step, the C-based program is translated to RTL code using the C synthesizer. We define interfaces and directives to optimize the defined C functions. The synthesizer reports the number of calculation cycles, the latency in cycles, the maximum clock frequency and the occupied FPGA resources. Hence, developers can go back to the first step and optimize the C code if needed.

In the final step, the generated RTL code is checked by C/RTL co-simulation. This step contains the following stages.
1. Execute the test bench code in C to generate input and output vectors for the RTL design
2. Simulate the generated RTL using the input vector

Fig. 4. Flowchart of AHRS

First, the state vector, the error covariance matrix, the measurement noise matrix and the model noise covariance matrix are initialized. The dataset for the evaluation of this implementation is embedded in the program. The main loop contains the time update and the measurement update processes.
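The control flow of Fig. 4 (initialization, then a main loop that alternates time update and measurement update over a dataset embedded in the program) can be illustrated with a deliberately simplified scalar Kalman filter. This is not the 7-state EKF of this paper; the one-dimensional model, the type kf1d_t and all function names are assumptions chosen to keep the sketch short.

```c
#include <stddef.h>

/* Hypothetical scalar filter state: estimate x and its variance p. */
typedef struct {
    float x;
    float p;
} kf1d_t;

/* Time update: constant-state model, variance grows by process noise q. */
static void kf1d_time_update(kf1d_t *kf, float q)
{
    kf->p += q;
}

/* Measurement update with observation z and measurement noise r. */
static void kf1d_measurement_update(kf1d_t *kf, float z, float r)
{
    const float k = kf->p / (kf->p + r);  /* Kalman gain */

    kf->x += k * (z - kf->x);
    kf->p *= (1.0f - k);
}

/* Main loop over a dataset embedded in the program. */
float kf1d_run(const float *samples, size_t count,
               float x0, float p0, float q, float r)
{
    kf1d_t kf = { x0, p0 };  /* initialization step */

    for (size_t i = 0; i < count; i++) {
        kf1d_time_update(&kf, q);
        kf1d_measurement_update(&kf, samples[i], r);
    }
    return kf.x;
}
```

In the full design the scalar operations become the matrix multiplications, additions and the single inversion counted in TABLE II, but the loop structure stays the same.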
In the time update step, the system updates the estimated state vector and the covariance matrix P, which represent the probabilistic model of the state description. In the measurement update step, we update the state vector and the covariance matrix using these estimates and the observation values obtained from the sensors. The output is reused in the time update step. In this paper, we used values from data embedded in the program instead of sensor data to concentrate on the evaluation of this circuit.

In the implementation of a circuit for the proposed AHRS, the most resource- and time-consuming parts are matrix multiplication and inversion. TABLE II shows the number of these operations. Matrix multiplication is executed 10 times, and matrix inversion is executed once. These two operations have high computational complexity but also high parallelism; hence, we can improve performance using an FPGA. We describe the implementation of each critical module of this system with Vivado HLS in the next subsection.

TABLE II. NUMBERS OF MATRIX OPERATIONS IN AHRS CIRCUIT
Type of operation    Number of executions
Multiplication       10
Inversion            1
Addition             2
Subtraction          1
Transposition        2

C. Arithmetic Operations
A part of the arithmetic operation implementation is shown in Fig. 5. The function matrix_mult contains a straightforward implementation of matrix-matrix multiplication using a simple triple loop. The function includes the pipeline directive for optimization at synthesis. This increases the performance about 18 times and increases resources about 2 times in our case. The matrix_inverse function employs the Gaussian elimination algorithm for matrix inversion. We also use a pipelining directive in the innermost loop of the time-consuming loop. As a result, this optimization increases performance about 2.9 times and increases resources by about 1 % in our case.

The functions H and h, described in equations (7) and (11), require addition and multiplication; hence, the synthesizer adds DSPs automatically. However, in this instance, the added DSPs do not improve performance effectively. To address this, we adopt the allocation directive to restrain the use of DSPs.

The data type (data_t) in this calculation is the single-precision 32-bit floating-point format defined in the IEEE-754 standard.

    /* matrix service functions */
    /* define D as matrix dimension */
    /* Type data_t is 32bit floating point */

    /* Accumulates a * b into c, so c must hold zeros on entry
       to obtain the plain product. */
    void matrix_mult(data_t a[D][D],
        data_t b[D][D], data_t c[D][D]){
      for(int i = 0; i < D; i++)
        for(int j = 0; j < D; j++)
          for(int k = 0; k < D; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

    /* Inverts a into inv_a. The input a is overwritten, and rows
       are not swapped when a pivot is zero. */
    void matrix_inverse(data_t a[D][D],
        data_t inv_a[D][D]){
      data_t buf;
      int i, j, k;

      /* initialization: inv_a starts as the identity matrix */
      for(i = 0; i < D; i++)
        for(j = 0; j < D; j++)
          inv_a[i][j] = (i == j) ? 1.0 : 0.0;

      /* Gaussian elimination */
      for(i = 0; i < D; i++) {
        /* scale row i so that the pivot becomes 1 */
        buf = a[i][i] != 0 ? 1 / a[i][i] : 0;
        for(j = 0; j < D; j++) {
          a[i][j] *= buf;
          inv_a[i][j] *= buf;
        }
        /* eliminate column i from every other row */
        for(j = 0; j < D; j++) {
          if (i != j) {
            buf = a[j][i];
            for (k = 0; k < D; k++) {
              a[j][k] -= a[i][k] * buf;
              inv_a[j][k] -= inv_a[i][k] * buf;
            }
          }
        }
      }
    }

Fig. 5. Synthesizable C-code for matrix multiplication and inversion

V. EVALUATION AND DISCUSSION
In this section, we evaluate the basic matrix operations and the whole AHRS system described in Section IV. We used Vivado Design Suite HLx 2016.2 as the synthesis tool and Vivado HLS 2016.2 for the C-based implementation. These are provided by Xilinx Inc. We implemented our system on a Zynq-7020 (XC7Z020-1CLG484C).

1) Evaluation environment
We used a Zynq-7020 board that includes an XC7Z020 for the performance evaluation of the AHRS as a system. This chip has two areas: the Processing System (PS) and the Programmable Logic (PL). The PS includes an ARM Cortex-A9 processor with a configurable clock frequency, peripheral controllers and a memory interface; thus, this system can work independently. The PL, on the other hand, is an FPGA connected to the PS via several ports.

Our proposed circuit is contained in the PL and connected to the 32-bit general-purpose AXI master and slave ports. The slave port provides memory-mapped I/O registers through which the processor in the PS controls the circuit. The AHRS circuit accesses its data, namely the state vector and the noise covariance matrix, via the AXI master port.
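From the processor side, controlling a circuit through memory-mapped registers on the AXI slave port typically amounts to writing a few registers and polling a status bit. The sketch below illustrates that pattern only. The register layout, the bit positions and the names ahrs_regs_t, ahrs_start and ahrs_wait_done are all hypothetical; the actual register map of the circuit is not given in this paper.

```c
#include <stdint.h>

/* Hypothetical register block of the AHRS circuit as seen through the
 * AXI slave port. The layout is an assumption for illustration only. */
typedef struct {
    volatile uint32_t ctrl;        /* bit 0: start, bit 1: done (assumed) */
    volatile uint32_t state_addr;  /* address of the state vector (assumed) */
    volatile uint32_t cov_addr;    /* address of the noise covariance matrix (assumed) */
} ahrs_regs_t;

#define AHRS_CTRL_START (1u << 0)
#define AHRS_CTRL_DONE  (1u << 1)

/* Tell the circuit where its data lives, then raise the start bit. */
void ahrs_start(ahrs_regs_t *regs, uint32_t state_addr, uint32_t cov_addr)
{
    regs->state_addr = state_addr;
    regs->cov_addr = cov_addr;
    regs->ctrl |= AHRS_CTRL_START;
}

/* Poll the done bit at most max_polls times; returns 1 if it was seen. */
int ahrs_wait_done(const ahrs_regs_t *regs, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if (regs->ctrl & AHRS_CTRL_DONE)
            return 1;
    }
    return 0;
}
```

On the target, regs would point at the base address assigned to the slave port; in a host-side test it can simply point at an ordinary struct in memory.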
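The matrix multiplication unit evaluated in this section is based on matrix_mult from Fig. 5, for which a pipeline directive is mentioned but not shown. One possible way to write it is sketched below. The placement of the directive is an assumption, not the authors' confirmed source; the variant also clears the accumulator itself, so the output matrix does not have to be zeroed by the caller. The name matrix_mult_clear is hypothetical.

```c
#define D 7            /* matrix dimension; 7 matches the state vector */
typedef float data_t;  /* 32-bit floating point, as in Fig. 5 */

/* c = a * b, with the result element accumulated in a local variable. */
void matrix_mult_clear(data_t a[D][D], data_t b[D][D], data_t c[D][D])
{
    for (int i = 0; i < D; i++) {
        for (int j = 0; j < D; j++) {
/* Assumed placement: pipelining this loop lets the tool unroll the
 * nested k loop. Ordinary C compilers ignore the unknown pragma. */
#pragma HLS PIPELINE
            data_t acc = 0;
            for (int k = 0; k < D; k++)
                acc += a[i][k] * b[k][j];
            c[i][j] = acc;
        }
    }
}
```

Because the pragma is ignored outside the HLS tool, the same source can be compiled and checked in a plain C test bench before synthesis.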
Fig. 6. Proposed Programmable SoC Architecture

B. Performance of arithmetic operations
TABLE III shows the resources used by a matrix multiplication unit and their utilization of the programmable logic part of the Zynq-7020. We set the dimension of the matrix to 7, which corresponds to the dimension of the state vector. The clock frequency of this circuit is 106 MHz. The maximum latency and the interval to compute the multiplication are 211 and 211 clock cycles, respectively.

TABLE III. RESOURCE UTILIZATION OF MATRIX MULTIPLICATION
                            BRAM   DSP   FF     LUT
Number of used resources    0      10    2746   2362
Utilization (%)             0      4     2      4

TABLE IV shows the synthesized results of the matrix inversion function. The maximum frequency of this module is 118 MHz. The maximum latency and the interval for the computation of the matrix inversion are 1943 and 1493 clock cycles, respectively.

TABLE IV. RESOURCE UTILIZATION OF MATRIX INVERSION
                            BRAM   DSP   FF     LUT
Number of used resources    0      10    2766   3633
Utilization (%)             0      4     2      6

We show the basic performance of the matrix operations in this section. In the system description and synthesis, each function written in C is embedded into a higher-level module by the synthesizer to reduce the overhead of function calls. Nevertheless, we can understand the basic behavior of the matrix operations synthesized from C, and we can use the result to estimate the system resource occupation.

TABLE V shows the synthesized results of the functions h() and H(). The maximum frequency of function h() is 121 MHz and that of function H() is 111 MHz. From this result, the FPGA resource utilization is reduced by the optimization described in Section III.

TABLE V. RESOURCE UTILIZATION OF MODULED FUNCTIONS
Resource utilization    BRAM     DSP      FF           LUT
H(x)                    0 (0%)   7 (3%)   1389 (1%)    1807 (3%)
h(x)                    0 (0%)   7 (3%)   1062 (~0%)   1607 (3%)

TABLE VI. CALCULATION CYCLES OF PROPOSED CIRCUIT
                     Minimum   Maximum
Latency (cycles)     5598      5696
Interval (cycles)    5599      5697

TABLE VII. RESOURCE UTILIZATION OF PROPOSED CIRCUIT
Used resources     BRAM   DSP   FF       LUT
Total              45     48    13270    27088
Available          280    220   106400   53200
Utilization (%)    16     21    12       32

Since this circuit calculates the time update and measurement update processes autonomously, processors are released from the work of maintaining the matrix dataflow that general matrix accelerator implementations need.

D. Comparison with other configurations
For the evaluation of the proposed AHRS system, we compare the performance of the following three system configurations.
RISC-SYS1
RISC-SYS2
Proposed System

RISC-SYS1 and RISC-SYS2 are software-only implementations running on a RISC processor, the ARM Cortex-A9. They differ in the configuration of the clock frequency. RISC-SYS1 is configured at 667 MHz, the maximum frequency of the target board. RISC-SYS2, on the other hand, is configured at 72 MHz to estimate the result of a software system running on a Cortex-M4 processor, which is often used for small UAV systems such as LibrePilot [11] and Cleanflight [12]. Finally, the block diagram of the Proposed System is shown in Fig. 6. The task is divided into two parts: attitude estimation and conversion between quaternion and Euler angles. The first task is computed in the AHRS circuit, and the second is calculated by the ARM processor. The processor is configured at 667 MHz.

In this evaluation, we assume that the proposed system reads input samples from the static memory array embedded in the program. The dataset contains 9000 samples of data from 3 sensors with intentionally injected noise. We measured the calculation
time for the samples in the dataset and divided the time by the number of samples to estimate the time per sample.

TABLE VIII. AHRS EXECUTION RESULTS
                       Processing time (msec)   Maximum sampling freq. (Hz)
(A) RISC-SYS1          0.345                    2898
(B) Proposed System    0.062                    16129
(C) RISC-SYS2          3.198                    312

TABLE VIII shows the AHRS calculation time per sample. Since the current sampling frequency of typical sensors is 100 Hz, real-time processing can be achieved without a special-purpose circuit when the CPU frequency is set to 667 MHz, as shown in (A) of TABLE VIII. With the CPU set to 72 MHz, assuming an actual small UAV system, we could not obtain real-time performance, as shown in (C). However, the system combining a CPU and the AHRS circuit could achieve real-time processing. In addition, system (B) can handle a sensor sampling rate of 16 kHz, which is needed for more precise navigation systems.

VI. CONCLUSION
In this paper, we proposed extended Kalman filter (EKF) hardware for a real-time attitude and heading reference system (AHRS). The EKF reduces the effects of noise by sensor fusion, and real-time, highly accurate attitude estimation is required for autonomous flight of UAVs. To reduce the load on processors in the real-time calculation of the EKF, we implemented an EKF circuit in a Programmable SoC by C-based hardware design, called high-level synthesis (HLS), using Vivado HLS. In the EKF, the most resource- and time-consuming parts are matrix multiplication and inversion. We used straightforward implementations of these parts and added some optimizations for HLS. As a result, we achieved EKF processing on the FPGA that is 5 times faster than processing on a single core of an ARM Cortex-A9 running at 667 MHz. Our system can handle a 16 kHz sampling rate of high-speed sensors, compared with the 100 Hz of typical current systems.

REFERENCES
[1] R. Konomura and K. Hori, “Phenox: Zynq 7000 based quadcopter robot,” in 2014 International Conference on ReConFigurable Computing and FPGAs (ReConFig14), 2014, pp. 1–6.
[2] J. F. Guerrero-Castellanos, H. Madrigal-Sastre, S. Durand, N. Marchand, W. F. Guerrero-Sanchez, and B. B. Salmeron, “Design and implementation of an Attitude and Heading Reference System (AHRS),” in 2011 8th International Conference on Electrical Engineering, Computing Science and Automatic Control, 2011, pp. 1–5.
[3] J. L. Marins, X. Yun, E. R. Bachmann, R. B. McGhee, and M. J. Zyda, “An extended Kalman filter for quaternion-based orientation estimation using MARG sensors,” in Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems. Expanding the Societal Role of Robotics in the Next Millennium (Cat. No.01CH37180), vol. 4, pp. 2003–2011.
[4] H. Chenini, D. Heller, C. Dezan, J. Diguet, and D. Campbell, “Embedded real-time localization of UAV based on an hybrid device,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 1543–1547.
[5] D. Pritsker, “Hybrid implementation of Extended Kalman Filter on an FPGA,” in 2015 IEEE Radar Conference (RadarCon), 2015, pp. 0077–0082.
[6] V. Bonato, R. Peron, D. F. Wolf, J. A. M. de Holanda, E. Marques, and J. M. P. Cardoso, “An FPGA Implementation for a Kalman Filter with Application to Mobile Robotics,” in 2007 International Symposium on Industrial Embedded Systems, 2007, pp. 148–155.
[7] Xilinx, “Zynq-7000 All Programmable SoC Data Sheet: Overview,” 2017.
[8] Intel FPGA, “Intel SoCs Overview,” 2016. [Online]. Available: https://www.altera.com/products/soc/overview.html. [Accessed: 19-Jul-2017].
[9] A. M. Sabatini, “Quaternion-Based Extended Kalman Filter for Determining Orientation by Inertial and Magnetic Sensing,” IEEE Trans. Biomed. Eng., vol. 53, no. 7, pp. 1346–1356, Jul. 2006.
[10] Xilinx, “Vivado High-Level Synthesis,” 2017. [Online]. Available: https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html. [Accessed: 10-Jul-2017].
[11] LibrePilot, “LibrePilot – Open – Collaborative – Free,” 2016. [Online]. Available: https://www.librepilot.org/site/index.html. [Accessed: 19-Jul-2017].
[12] Cleanflight, “Cleanflight.” [Online]. Available: http://cleanflight.com/. [Accessed: 19-Jul-2017].