B-Tech Thesis
Bachelor of Technology
MONTH YEAR
IIIT Hyderabad
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this project, titled Custom Query Co-Processing on FPGAs, by Jairaj Bhattacharya, R Mukund, Gaurav Shah and Sagar Gupta, has been carried out under my supervision in partial fulfilment of the degree.
Signature_______________________________________________
Dr. R Govindarajulu
Acknowledgements
I am very thankful to my advisors, Dr. R Govindarajulu and Dr. Suresh Purini, for giving me complete freedom and flexibility to work on this topic. Both of them have been very encouraging and motivating, and their encouragement has only grown with time. I am greatly indebted to them.
I owe a huge debt of gratitude to Dr. Chidamber Kulkarni of Xilinx (India) for his invaluable help and guidance in the project. I am also very thankful to Bharghava and Abinesh in the CVEST (Centre for VLSI and Embedded Systems) department; our small discussions were very helpful and provided me with really valuable information.
Abstract
The rapid increase in the data volumes for the past few decades has intensified the
need for high processing power for database and data mining applications.
Researchers have actively sought to design and develop new architectures for
improving the performance. Recent research shows that the performance can be
significantly improved using either (a) effective utilization of architectural features
and memory hierarchies used by the conventional processors, or (b) the high
computational power and memory bandwidth in commodity hardware such as
network processing units (NPUs), and graphics processing units (GPUs). Our project focused on extensive research into how a custom query co-processor can be designed with data management in mind, to help build heterogeneous chips for query processing applications. Our project highlights the intra- and inter-query parallelism that can be exploited in dedicated architectures. A basic model has been implemented and analysed in detail for simple operators to verify the correctness of the designed architectures.
List of Contents
1) Introduction
List of Figures
1. Introduction
Databases are widely used in nearly every field of computing today, ranging from network applications to embedded applications. The two important factors by which any application is judged today are its speed and its power consumption, and there is constant work to improve both of these parameters in the field of query processing.
Google provides results to millions of queries every second. To most, this seems like a solved problem. However, there are many challenging problems left in query processing: for example, how to index and search through hours of audio or video. There are multiple approaches to tackling this problem. We propose to investigate the architectural aspects of implementing efficient search engines by digging into the hardware to develop enhanced query processing capability.
In a typical workload, two queries are almost independent of each other, so more than one query can be executed at a time in parallel. Multiple queries can be run at the hardware level in parallel using a multi-core architecture, providing huge performance benefits.
In addition to using multiple query processors to exploit query-level parallelism, one can also exploit the fact that each query can be factored into multiple stages of operations, and exploit the parallelism obtained by pipelining each query into multiple stages (e.g., SIMD: single instruction, multiple data), which we call intra-query parallelism. Thus, we have two degrees of parallelism to exploit: first, query-level parallelism, by having multiple query processors in a single FPGA, and second, intra-query parallelism, by pipelining each query into multiple stages of operations. Together these techniques should result in enhanced figures of merit.
Although no direct work has been done in the area of hardware-based query co-processors, there is a lot of material available in which the relevant ideas and techniques are discussed. We have studied these papers in detail and present a short summary of each of them in the following paragraphs.
The first paper [1] was authored by Prof. Stavros Harizopoulos and Prof. Anastassia Ailamaki of the Computer Science Department of Carnegie Mellon University, and Prof. Vladislav Shkapenyuk of the Computer Science Department of Rutgers University.
They introduce two concepts, namely on-demand simultaneous pipelining (OSP) and QPipe. OSP helps maximize data and work sharing across parallel query executions and pipelines an operator's output simultaneously to multiple parent nodes, thereby enabling dynamic operator sharing. QPipe is a relational engine which supports OSP. Each relational operator is encapsulated in a micro-engine serving query tasks from a queue, naturally exploiting all data and work sharing opportunities.
A buffer pool manager keeps track of the memory pages being read and being brought in, decides when to write pages back to memory, and replaces old pages using the LRU policy. Materialized view selection is typically applied to workloads known in advance, in order to speed up queries that contain common sub-expressions. The results from queries which have already been processed are stored in a cache, which enables a faster response for queries that reuse that data. Multiple-query optimization identifies common sub-expressions in query plans during optimization, and produces globally optimal plans.
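To make the replacement policy concrete, a buffer pool's LRU eviction can be sketched in C as follows; the pool size, page IDs and timestamp scheme are our own simplifications, not details from [1]:

```c
#include <assert.h>

#define POOL_SIZE 4   /* hypothetical number of in-memory frames */

typedef struct {
    int page[POOL_SIZE];   /* page id held in each frame, -1 = empty */
    int stamp[POOL_SIZE];  /* last-access time of each frame */
    int clock;             /* logical access counter */
} BufferPool;

void pool_init(BufferPool *bp) {
    for (int i = 0; i < POOL_SIZE; i++) { bp->page[i] = -1; bp->stamp[i] = 0; }
    bp->clock = 0;
}

/* Access a page: returns 1 on a hit, 0 on a miss (the LRU frame is evicted). */
int pool_access(BufferPool *bp, int page_id) {
    int lru = 0;
    bp->clock++;
    for (int i = 0; i < POOL_SIZE; i++) {
        if (bp->page[i] == page_id) {        /* hit: refresh its timestamp */
            bp->stamp[i] = bp->clock;
            return 1;
        }
        if (bp->stamp[i] < bp->stamp[lru]) lru = i;  /* track oldest frame */
    }
    bp->page[lru] = page_id;                 /* miss: replace the LRU frame */
    bp->stamp[lru] = bp->clock;
    return 0;
}
```

A real buffer pool would also track dirty bits and write-back, which are omitted here.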
[1] then discusses the challenges of simultaneous pipelining in detail. For the run-time detection of overlapping queries, the engine must track every query currently being executed. The engine has to compare the relational operators of the input query with the current set of operators in execution. This run-time detection should be as efficient as possible and scale well with the number of active queries.
In the case of multiple scan consumers, when new scan requests for the same table arrive repeatedly and dynamically share a single scan, a large number of partial scans will be active on the same relation. Sometimes the query plan is generated such that the order of the scanned tuples is important and cannot be modified. There is also the risk of the engine getting caught in deadlocks during parallel query processing.
The current query processing model can be termed "query-centric", as engine designs conform to the "many operators, one query" pattern. The system consists of an optimizer which takes the input query and outputs an optimal query plan for the query processing. The query plan is constructed as a tree, with the input points as leaves and the relational operators as internal nodes. Each query is assumed to be independent of the others and is executed in parallel by the engine. The data common across query executions is provided by a buffer pool of memory, and this pool maintains the information through a replacement policy.
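The plan-tree structure described above can be sketched with a simple C node type; the operator codes here are hypothetical placeholders:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operator codes for a query-plan tree. */
typedef enum { OP_SCAN, OP_SELECT, OP_JOIN, OP_PROJECT } OpKind;

typedef struct PlanNode {
    OpKind op;
    struct PlanNode *left;   /* NULL for a leaf (table input point) */
    struct PlanNode *right;  /* second input, used only by joins */
} PlanNode;

/* Count the leaf nodes, i.e. the plan's input points. */
int count_inputs(const PlanNode *n) {
    if (n == NULL) return 0;
    if (n->left == NULL && n->right == NULL) return 1;
    return count_inputs(n->left) + count_inputs(n->right);
}
```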
Support for OSP is implemented in the following manner. Whenever a new packet enters the queue of a µEngine, the whole queue is checked for overlapping operators. Two elements are common to all µEngines: the OSP Coordinator and the Deadlock Detector. The OSP Coordinator does the groundwork for integrating an incoming query into an "in-progress" query if their operators overlap, while the Deadlock Detector ensures a deadlock-free execution of simultaneously pipelined schedules.
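The per-µEngine overlap check can be sketched as a linear scan over the queued packets; the packet fields are our own simplification of what QPipe actually tracks:

```c
#include <assert.h>

/* A queued packet: which relational operator it runs and on which table
 * (both fields are hypothetical simplifications). */
typedef struct { int op_code; int table_id; } Packet;

/* On-demand detection: return the queue index of the first in-progress
 * packet running the same operator on the same table, or -1 if none. */
int find_overlap(const Packet *queue, int len, Packet incoming) {
    for (int i = 0; i < len; i++)
        if (queue[i].op_code == incoming.op_code &&
            queue[i].table_id == incoming.table_id)
            return i;   /* attach the new query to this packet's output */
    return -1;          /* no sharing opportunity: enqueue normally */
}
```

A linear scan is acceptable here because the queue of one µEngine is short; scaling with many active queries would require an index over in-flight operators.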
In conclusion, they have provided several techniques and concepts that facilitate the efficient parallelization of query processing using the concepts of data reusability and simultaneous pipelining. We were strongly influenced by the ideas presented and have incorporated them into our design.
The second paper [2] was authored by Nikos Hardavellas, Ippokratis Pandis, Ryan Johnson, Naju G. Mancheril, Anastassia Ailamaki and Babak Falsafi of the Database Group and Computer Architecture Lab (CALCM) of Carnegie Mellon University.
According to their claims, until now the main bottleneck for database systems has been off-chip data stalls, so the major effort was directed towards bringing more and more data into the caches. At the same time, because of the ever-increasing level of integration in processor chips, newer processors have larger (but slower) data caches. They find that the major bottleneck of current software is data cache stalls, with L2 hit stalls rising from oblivion to become the dominant execution-time component in some cases.
They recognize that chip multiprocessor designs can be split into two families, or "camps." The fat camp uses wide-issue out-of-order processors, which address data stalls by exploiting instruction-level parallelism (ILP) (e.g., Intel Core Duo, IBM Power 5). The lean camp uses heavily multithreaded in-order processors, which hide data stalls by overlapping data access latencies with useful computation across threads (e.g., Sun UltraSPARC T1). Database workloads are divided into saturated workloads, in which idle processors always find an available thread to run, and unsaturated workloads, in which processors may not always find threads to run, thereby exposing data access latencies.
Figure 5 of their paper shows that in six out of eight combinations of hardware and workloads, data stalls dominate execution time. They analyze the components of data cache stalls. Large on-chip L2 caches result in higher hit rates that shift stalls from memory to L2 hits, increasing the relative importance of L2 hit stalls. Integrating multiple processing cores on a single chip means that data accesses which would result in long-latency coherence misses on an SMP system are converted into L2 hits on the shared L2 cache of the CMP. As a result, the L2 hit stall component increases by a factor of 7, which explains the difference between their results and prior research findings. Chip multiprocessors integrate multiple cores on a single chip, which promotes sharing of data through the common L2 cache, but contention for the shared hardware resources negates the positive effects of on-chip data sharing.
They then propose a few measures to improve the performance of DBMS systems on CMPs. The first is to increase parallelism, which is a very easy task as databases are inherently parallel: queries can be broken up and worked on concurrently. All of this conforms to the discussion of the first paper [1] explained earlier, and gives added impetus to develop our engine along similar lines. They also propose improving data locality; for example, for threads bearing a producer-consumer relationship, execution would be faster if they were bound to the same core. The third measure is to construct a staged database system wherein the work is carried out in "stages" rather than as monolithic requests. A query is decomposed into "packets" which are then sent to different stages according to its requirements. This type of system can also improve data locality through explicit scheduling of the packets.
The third paper [3], authored by the same group from CMU, introduces a staged design for high-performance, evolvable DBMSs that are easy to maintain. They propose to break the database system into modules and encapsulate them into self-contained stages connected to each other through queues.
Each stage is an independent server with its own queue, thread support and resource management, which communicates and interacts with other stages through a well-defined interface. Stages accept packets, each carrying a query. They have adopted two levels of CPU scheduling: local thread scheduling within a stage, and global scheduling across stages. At the top level, they divide the actions that a database server performs into five query execution stages: connect, parse, optimize, execute, and disconnect. The execute stage typically represents the largest part of a query's lifetime and is further decomposed into several stages.
Stages provide two basic operations, enqueue and dequeue, on a queue of incoming packets. The system works through the exchange of packets between stages. A packet represents work that the server must perform for a specific query at a given stage. Each stage has a pool of worker threads (the stage threads) that continuously call dequeue on the stage's queue, and one thread reserved for scheduling purposes. Multiple threads per stage help mask I/O events while still executing in the same stage (when there is more than one packet in the queue). If all threads happen to suspend for I/O, or the stage has used its time quantum, then a stage-level scheduling policy specifies the next stage to execute.
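The enqueue/dequeue interface of a stage can be sketched as a fixed-capacity ring buffer of packets; the capacity is arbitrary, and the worker threads and scheduler of the real system are omitted:

```c
#include <assert.h>

#define QCAP 8   /* arbitrary stage queue capacity */

typedef struct { int packets[QCAP]; int head, tail, count; } StageQueue;

void sq_init(StageQueue *q) { q->head = q->tail = q->count = 0; }

/* Enqueue a packet; returns 0 if the queue is full. */
int sq_enqueue(StageQueue *q, int pkt) {
    if (q->count == QCAP) return 0;
    q->packets[q->tail] = pkt;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    return 1;
}

/* Dequeue the next packet into *pkt; returns 0 if empty (a worker thread
 * would block here, or yield to the stage-level scheduler). */
int sq_dequeue(StageQueue *q, int *pkt) {
    if (q->count == 0) return 0;
    *pkt = q->packets[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return 1;
}
```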
Operators that use only a small portion of common or shared data and code are grouped together, while operators that access a large common code base or common data are kept separate. The execution engine implementation maps the different operators onto five distinct stages: file scan ("fscan") and index scan ("iscan"), for accessing stored data sequentially or through an index respectively; sort; join, which includes three join algorithms; and a fifth stage that includes the aggregate operators (min, max, average, etc.). The "fscan" and "iscan" stages are replicated and are separately attached to the database tables. This way, queries that access the same tables can take advantage of relations that already lie in the higher levels of the memory hierarchy.
Other than studying research papers on related work, we also worked in parallel on the implementation of our design. Our first task was to understand the present query processing architecture. To do that, we built a high-level model of the present architecture and then correlated it with a benchmarked database management system. We worked on a query processor code which implemented operations for searching data in tables based on simple engines such as "from", "where" and "select". This helped us design a framework for the existing sequential architecture. Once we finished our implementation, we started testing our design on the SAM (SPARC Architecture Model) simulator on OpenSPARC T2. We then profiled our code to analyse the overheads and latencies in the existing code. This helped us understand the intricate details of query processing and the places where searching can be parallelized in multi-core, multi-threaded architectures.
The following flow chart gives a brief overview of the work done in the implementation of the high-level modelling:
[Flow chart: high-level modelling work flow]
1. Study database benchmarks such as TPC to model our sequential architecture.
2. Frame the high-level sequential architecture for implementation.
3. C code implementation: code the cache models by accessing the schema and hashing tables; implement the query plan analyser using the "where" engine, and filter the cache using the "from" and "select" engines.
4. Analyse, test and profile the code in SAS (SPARC Architecture Simulator), and identify parallel implementations for the architecture based on the related work studied.
The high-level module, which was written in C, was then analysed for its power consumption with the Soft Explorer analysis tool. Soft Explorer is an automatic power/energy consumption estimation tool. Its entry point can be either the assembly code generated by the compiler (or ASM code written directly by the programmer) or the C code itself. The tool gives the global and local (per loop) consumption, as well as the distribution of that consumption. It can also give the energy consumption of the external memory and the split of energy consumption between the processor and the external memory.
The power consumption of the design can be split into four major components:
1. DMA Power
3. Fetch Power
4. Clock Power
We simulated the sequential C code on the CS5510 DSP processor to view the power/energy consumption of a serial query processor, so that we would have a baseline for comparison against the pipelined Verilog code, which we show to have better performance. The power analysis of the sequential C code is displayed in the figure below:
This analysis shows that 52 percent and 41 percent of the power consumption is accounted for by the clock and the fetch unit respectively, making them the dominant factors.
The next simulation was carried out to determine the share of the total power consumption and of the fetch power consumption among the different data structures in the code.
The following illustrates the power consumption breakdown across the data structures used in the C code. This gives us a hint of how the data elements are used in the high-level modelling.
The breakdown of fetch power can be seen in the next simulation graph.
The following picture shows the process by which any SQL query can be compiled to generate its optimized query plan without executing the query. Once this is done, we proceed to generate the packet for communication with the FPGA. The packet generation is done by a shell script. The packet is formed roughly as follows:
Packets are formed after this. The packets for a query can be either dependent or independent. Here is an example of how we form the query packets for communication.
Depending on the query plan, the packets will be modified by the encoder and decoder modules.
FROM ENGINE:
The primary function of the "from" engine is to fetch table data from the common memory into the local memory of the co-processor. This engine accepts instructions in the following packet format from the pre-processor:
EI, or Engine ID, is unique to each engine and is "0000" for the "from" engine. The IFU uses this ID to ensure the correct routing of all instructions to their respective engines.
Table Add(ress) is a 32-bit field which specifies the starting address, in the common memory, of the table data to be fetched into the local memory.
Table Size is an x-bit field composed of two sub-fields: Length, of y bits, and Width, of z bits. These values indicate the size of the table to be retrieved, as a number of rows and columns.
After the data is processed, this engine generates another intermediate instruction, in the above format, for consumption by the "where" or "join" engine.
Table Add(ress) is a 32-bit field which specifies the starting address of the table data in the local memory.
Table ID is a unique ID attached to each table stored in the local memory.
All the other data fields are the same as in the previous case.
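As a sketch only, the "from" instruction packet could be modelled in C as below; since the exact widths x, y and z are left unspecified above, the field types are placeholders:

```c
#include <assert.h>
#include <stdint.h>

#define EI_FROM 0x0u   /* "0000": engine ID of the "from" engine */

/* Placeholder widths: 16-bit Length (y) and 16-bit Width (z) sub-fields. */
typedef struct {
    uint8_t  engine_id;   /* 4-bit EI, stored in a byte for simplicity */
    uint32_t table_addr;  /* start address of the table in common memory */
    uint16_t rows;        /* Length sub-field of Table Size */
    uint16_t cols;        /* Width  sub-field of Table Size */
} FromInstr;

/* IFU-side routing check: does this instruction belong to the FROM engine? */
int routes_to_from(const FromInstr *in) { return in->engine_id == EI_FROM; }
```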
JOIN ENGINE:
The join engine's task is to take two SQL tables and apply a merge join to them, producing a single resultant table required by the other engines for further processing of the query.
It then transmits the information about the new table through an intermediate instruction with the above-specified format.
Temp Table Address is the temporary address of the newly formed table in the local memory.
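The core of the join engine's operation, a merge join over inputs sorted on the join key, can be sketched in C on single-column integer keys (a simplification of full tuples):

```c
#include <assert.h>

/* Merge-join two key arrays that are already sorted ascending; write the
 * matching keys to out and return the match count. Duplicate keys are
 * simplified away here: each match advances both inputs. */
int merge_join(const int *a, int na, const int *b, int nb, int *out) {
    int i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])      i++;        /* a is behind: advance a */
        else if (a[i] > b[j]) j++;        /* b is behind: advance b */
        else { out[n++] = a[i]; i++; j++; }  /* keys match: emit */
    }
    return n;
}
```

Merge join suits a hardware pipeline well because both inputs are consumed strictly in order, with a single pass and no random access.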
WHERE ENGINE:
The where engine receives a temporary table, which serves as its working data set, and a set of conditions, each specified in a separate instruction, which are to be applied to that data set. The pre-processor parses the multiple "where" conditions in the input query and splits each one into a separate instruction to be passed to the "where" engine.
The OPID for the where engine gives the information about the operation that has to be performed on the attributes, which finally filters the useful tuples from the temporary table.
FILTER
ATR1 and ATR2 are the fields on which the operation is performed; the results give us the constraints which, when applied to the temporary table, yield the set of tuples that is stored in memory in the form of a temporary table.
The J bit tells whether or not the input temporary table is the result of a join operation.
This engine outputs a new temporary table. The data in this table is a subset of the main database table and satisfies all conditions set forth in the query. The information about the new table is encapsulated in an intermediate instruction of the above format and passed on to the "select" engine.
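The tuple filtering performed by the "where" engine can be sketched in C as below; the OPID encodings are hypothetical, and attributes are simplified to integer columns of a row-major table:

```c
#include <assert.h>

enum { OP_EQ, OP_LT, OP_GT };   /* hypothetical OPID encodings */

/* Apply one "where" condition (compare column atr1 against column atr2
 * under opid) to each row; copy surviving row indices into out and
 * return how many rows pass into the new temporary table. */
int where_filter(const int *table, int rows, int cols,
                 int opid, int atr1, int atr2, int *out) {
    int n = 0;
    for (int r = 0; r < rows; r++) {
        int v1 = table[r * cols + atr1], v2 = table[r * cols + atr2];
        int keep = (opid == OP_EQ) ? (v1 == v2)
                 : (opid == OP_LT) ? (v1 <  v2)
                 :                   (v1 >  v2);
        if (keep) out[n++] = r;   /* row survives into the temporary table */
    }
    return n;
}
```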
SELECT ENGINE:
The ATTRIBUTE FIELD specifies the attributes that have to be displayed as output; the select engine therefore filters those attributes from the temporary table, whose location is sent in the form of an intermediate packet (where → select).
NO OP means no operation.
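The attribute filtering done by the "select" engine is a column projection, which can be sketched in C on a row-major integer table (a simplification of the real tuple layout):

```c
#include <assert.h>

/* Project the columns listed in attrs (nattrs of them) out of a row-major
 * table into out, preserving row order. */
void project(const int *table, int rows, int cols,
             const int *attrs, int nattrs, int *out) {
    for (int r = 0; r < rows; r++)
        for (int k = 0; k < nattrs; k++)
            out[r * nattrs + k] = table[r * cols + attrs[k]];
}
```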
The above four engines are the crux of the defined architecture. Since they are all separate units, we can exploit pipelining even while solving a single query, which is what we defined as intra-query parallelism. When multiple queries are being solved, the architecture advances to inter-query parallelism.
Dataflow:
In this section we shall see how a query is processed by the custom query processor. The IFU (Instruction Fetch Unit) sends the packets one by one to the common bus whenever the common bus is free, i.e. when the common bus gives a ready-to-receive signal on being requested by the IFU.
The query has been segmented into parts such that each engine can perform its specified task. The encoder and decoder units are common to every engine architecture. The decoder decodes the packets and sends the information in them to the operation unit, which performs the specified task and then sends the information to the encoder unit. The encoder encodes the output in the form of a packet and forwards it to the common bus so that it can reach its destination engine; otherwise it simply displays the output.
The first task of every query is to retrieve the tables on which the manipulations have to be done. This task is performed by the "from" engine, which retrieves the data from the main memory and stores it in the local memory of the co-processor. The address of this local memory is forwarded in the form of packets to the "join" engine if it is required; otherwise the packets are forwarded to the "where" engine.
According to the query, the "where" engine gets, in the intermediate packets from the "from" or "join" engine, the address of the temporary table on which it has to do the manipulations. The "join" engine comes into play if the data manipulation has to be done on two or more tables. The manipulations to be done are also sent in the form of packets by the preprocessor, forwarded to the engines by the IFU through the common bus. The decoder in the "where" engine decodes these packets and sends the information to the operation unit. The resultant tuples are stored in another temporary table in memory, and the address of this temporary table is forwarded to the "select" engine in the form of intermediate packets.
In the "select" engine, the decoder has to decode two kinds of packets: the intermediate packets formed by the "where" engine, and the packets sent by the preprocessor, which give information about the attributes that have to be selected from the resultant tuples produced by the "where" engine, so that the output can be displayed.
The temporary tables are stored in the local memory of the co-processor. This speeds up data fetches, with an almost 100 percent hit rate. The input buffer and the output buffer are simply parts of the local SRAM. The memory operations in the coprocessor are synchronized by the memory controller, which is responsible for data integrity.
Memory Controller:
This is the memory interface of the engine. It talks to the external memory. Its job is to fetch data into the data buffer and to store data back to the memory. It also receives the address of the written output block in the memory, to send it to the encoder module.
Data Buffer:
This is the local memory provided to each engine to hide the memory access latency. It contains two buffers, an input buffer and an output buffer, for storing input and output data respectively.
Operator:
This is the main core of the engine and performs the desired operation. It performs all the architectural operations, such as sequential scan, index scan and projection.
The above figure shows the architecture for the execution of a query by the query engines. The signals are as follows:
The IFU receives queries in the form of packets for further processing by the query processor and puts the packets on the common bus. The common bus is read by all the engines; if the destination address field in a packet matches the engine code, that packet is stored in the decode queue of the engine. There are two decoding queues, namely the dependent and the independent packet queue. Depending upon the nature of the packet, it is stored in the respective queue. If the memory controller is free, it asserts the 'tx_status' signal and waits to receive data from the Encoder/Decoder unit. The Encoder/Decoder unit sends the required data to the memory controller and simultaneously sends the 'metadata' of the table to the operator. Once the data buffer starts receiving the table data, the operator commences its operation, and writes the data back into the output buffer.
1. Common Bus: A 32-bit bus used to transfer packets from the IFU to an operator and from one operator to another. It is a shared bus with handshaking, as discussed below.
2. bus_rd_rq (active LOW): Initially all the engines read the bus, hence 'bus_rd_rq' is HIGH. On seeing a packet designated to itself, the corresponding operator de-asserts the signal. It is asserted again when an EOP (End of Packet) is received.
9. buffer_full (active HIGH): The data buffer asserts this signal to indicate that the input data buffer is full. On seeing it HIGH, the memory controller de-asserts the 'mem_data_valid' signal. The signal is de-asserted again when the input buffer of the data buffer falls below a threshold level.
10. mem_wr_rq (active HIGH): The data buffer asserts this signal when its output data buffer has more entries than a certain threshold. It is de-asserted when the data writing is finished.
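The buffer_full / mem_data_valid flow control described in signals 9 and 10 can be sketched as a small C simulation of the input-buffer side; the buffer depth and threshold below are placeholders, not values from the design:

```c
#include <assert.h>

#define BUF_DEPTH 16   /* placeholder input-buffer depth */
#define THRESHOLD 4    /* placeholder refill threshold */

typedef struct {
    int fill;            /* words currently held in the input buffer */
    int buffer_full;     /* active HIGH, driven by the data buffer */
    int mem_data_valid;  /* active HIGH, driven by the memory controller */
} Link;

/* One clock cycle: the controller pushes a word while mem_data_valid is
 * HIGH; the operator optionally consumes one word. */
void step(Link *l, int consume) {
    if (l->mem_data_valid && l->fill < BUF_DEPTH) l->fill++;
    if (consume && l->fill > 0) l->fill--;
    if (l->fill == BUF_DEPTH)     l->buffer_full = 1;  /* buffer is full */
    else if (l->fill < THRESHOLD) l->buffer_full = 0;  /* fell below threshold */
    l->mem_data_valid = !l->buffer_full;               /* back-pressure */
}

/* Run n cycles with a fixed consume setting. */
void run(Link *l, int n, int consume) { while (n--) step(l, consume); }
```

The hysteresis between the full level and the threshold keeps the controller from toggling mem_data_valid every cycle.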
To verify the functional correctness of the design blocks, we generated an input file which fed the input ports with a bit-stream. By running the simulation tool on this dataset, we checked the data bits at various stages of processing. The outputs at each encoder and decoder were also verified.
The following shows the functional simulation results of the IFU, encoder and decoder systems in GTKWave:
The following module interfaces the IFU with the FROM, WHERE and SELECT encoder and decoder units. The FROM decoder unit has been implemented with a serial FIFO, whereas the WHERE and SELECT units have been implemented with flexible FIFO systems which can break their links based on the query ID of the packet placed in the FIFO. The system is connected via the IFU and was tested over clock cycles, giving correct simulation outputs.
The operator module has then been implemented along with its interfaces to the SRAM, the memory controller and the clock. The module engines have been tested in logic simulation and have given correct results.
The engines are implemented as clock-based state machines in which control signals change on clock edges. A screenshot of the simulations of this module is shown here:
The RTL synthesis of both modules has been done in Synopsys design tools and Xilinx ISE. The calculated results are shown below:
FINAL RESULTS:
This report gives us information about the logic as well as the device utilization for our particular design on the Xilinx Virtex-II Pro FPGA.
RTL Top Level Output File Name : fromdecoder.ngr
Top Level Output File Name : fromdecoder
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : NO
Design Statistics
# IOs : 133
Cell Usage :
# BELS : 637
# BUF : 2
# INV : 1
# LUT3 : 4
# LUT3_D : 1
# LUT3_L : 3
# LUT4 : 589
# LUT4_D : 29
# LUT4_L : 8
# FlipFlops/Latches : 716
# FDE : 76
# FDRE : 640
# Clock Buffers : 1
# BUFGP : 1
# IO Buffers : 132
# IBUF : 67
# OBUF : 65
=========================================================================
The Design was simulated for package no 2vp2fg256-7 and the following reports
show the utilization summary:
---------------------------
Timing Analysis :
Timing Detail:
--------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 2.831ns (frequency: 353.280MHz)
=========================================================================
Mapping Analysis:
Mapping Report
--------------
Logic Utilization:
Number of Slice Flip Flops: 716 out of 2,816 (25%)
Number of 4 input LUTs: 636 out of 2,816 (22%)
Logic Distribution:
Number of occupied Slices: 401 out of 1,408 (28%)
Number of Slices containing only related logic: 401 out of 401 (100%)
Number of Slices containing unrelated logic: 0 out of 401 (0%)
Total Number of 4 input LUTs: 636 out of 2,816 (22%)
Number of bonded IOBs:
6.0 Conclusion
This work has given a direction towards implementing an architecture for accelerating query operations. It also provides definite results to support our case, and probes into the details of customizing indexing, merging and hashing operations with hardware state-machine (SM) based operators. Further work in this direction would be to make the hardware scalable, to interface it with PostgreSQL, and to run TPC benchmarks on the designed coprocessor.
7.0 References: