

CUSTOM QUERY CO-PROCESSING ON FPGAs

B.Tech Project submitted in partial fulfillment

of the requirements for the degree

of Bachelor of Technology

in Electronics and Communication Engineering

Jairaj Bhattacharya 200630012 jairaj@students.iiit.ac.in

Gaurav Shah 200630011 gaurav_shah@students.iiit.ac.in
Mukund R 200630022 r_mukund@students.iiit.ac.in
Sagar Gupta 200630030 sagargupta@students.iiit.ac.in

International Institute of Information Technology

Hyderabad - 500 032, INDIA

MONTH YEAR


Copyright © Jairaj Bhattacharya, Mukund R, Gaurav Shah, Sagar Gupta, 2009

All Rights Reserved


International Institute of Information Technology

Hyderabad, India

CERTIFICATE

It is certified that the work contained in this project, titled Custom Query Co-Processing on FPGAs, by Jairaj Bhattacharya, R Mukund, Gaurav Shah and Sagar Gupta, has been carried out under my supervision in partial fulfilment of the requirements for the degree.

Signature_______________________________________________

Dr. R Govindarajulu

Dr. Suresh Purini

Dr. Chidamber Kulkarni


Dedicated to our dear Parents and friends


Acknowledgements

I am very thankful to my advisors, Dr. R Govindarajulu and Dr. Suresh Purini, for giving me complete freedom and flexibility to work on this topic. Both of them have been very encouraging and motivating, and their encouragement has only grown with time. I am greatly indebted to them.

I owe a huge debt of gratitude to Dr. Chidamber Kulkarni of Xilinx (India) for his invaluable help and guidance in the project. I am also very thankful to Bharghava and Abinesh in the CVEST (Centre for VLSI and Embedded Systems) department. Those small discussions were very helpful and provided me with really valuable information.


Abstract

The rapid increase in data volumes over the past few decades has intensified the need for high processing power for database and data mining applications. Researchers have actively sought to design and develop new architectures for improving performance. Recent research shows that performance can be significantly improved using either (a) effective utilization of the architectural features and memory hierarchies of conventional processors, or (b) the high computational power and memory bandwidth of commodity hardware such as network processing units (NPUs) and graphics processing units (GPUs). Our project focused on extensive research into how a custom query co-processor can be designed with data management in mind, to help build heterogeneous chips for query processing applications. It highlights the intra- and inter-query parallelism that can be exploited in dedicated architectures. A basic model has been implemented and analysed in detail for simple operators to verify the correctness of the designed architectures.


List of Contents

1) Introduction

2) Preliminary Work

2.1) Investigative Research

2.2) QPipe

2.3) Database Servers on Chip Multiprocessors

2.4) A Case for Staged Database Systems

3.1) Implementation of a High-Level Model

3.2) Analysing the High-Level Model

4.0) Hardware Implementation

4.1) System Architecture

5.0) Implementation and Results

5.1) RTL Schematics

6.0) Conclusion

7.0) References


List of Figures

1.0) Breakdown of Execution Time

2.0) Flow Graph of Project

3.0) Soft Explorer Data-Flow

4.0) Data Re-partition

5.0) Power Table Graph

5.1) Power Fetch Graph

6.0) Pre-Processor Model

7.0) Query Plan Screen Shot

7.0) System Architecture

7.1) From Packet

7.2) Where Packet

7.3) Select Packet

7.4) Data-Flow

8.0) Engine Architecture Flow

8.1) Engine Architecture Graph

8.2) Simulation of Encoder-Decoder

8.3) Simulation of Operator Module

8.4) Synthesis of From Decoder

8.5) Synthesis of From and Join Engine


1. Introduction

Query processing is a fundamental operation for most modern computing applications, especially those that rely on the Internet to reach the end user. Current state-of-the-art solutions rely on a cluster of computers to perform concurrent processing of individual queries. Thus building efficient query processing architectures is central to building efficient search engines.

Databases are widely used in nearly every field of computing today, ranging from network applications to embedded applications. Two important factors by which any application is judged today are its speed and its power consumption, and there is constant work to improve both of these parameters in the field of query processing. Google provides results to millions of queries every second, so to most this seems like a solved problem. However, there are many challenging problems left in query processing, for example, how to index and search through hours and hours of audio or video. There are multiple approaches to tackling this problem. We propose to investigate the architectural aspects of implementing efficient search engines by digging into the hardware to develop enhanced query processing capability.

Our goal is to design and implement a custom query processing engine as an accelerator to the traditional processor, improving the efficiency of query processing on current systems. We propose to do this through the use of FPGAs. In addition to performance-related efficiency, we also focus on reducing the price-to-performance ratio of server-based query processing systems.

2.0 Preliminary Work


We proposed to use FPGAs as the key building block for testing and performance analysis of our query processing architecture. A typical FPGA consists of arrays of CLBs (Configurable Logic Blocks) and routing channels, with Block RAMs for memory.

In a typical workload, two queries are almost independent of each other; hence more than one query can be executed at a time. Multiple queries can be run in parallel at the hardware level using a multi-core architecture, providing huge performance benefits.

In addition to using multiple query processors to exploit query-level parallelism, one can exploit the fact that each query can be factored into multiple stages of operations (e.g., SIMD: single instruction, multiple data), and exploit the parallelism obtained by pipelining each query across those stages, which we call intra-query parallelism. Thus, we have two degrees of parallelism to exploit: first, query-level parallelism, by having multiple query processors in a single FPGA, and second, intra-query parallelism, by pipelining each query into multiple stages of operations. Together these techniques should result in enhanced figures of merit.

2.1 Investigative Research

Although no direct work has been done in the area of hardware-based query co-processors, there is a lot of material in which the ideas and techniques for such a design are discussed. We have studied these papers in detail and present a short summary of each of them in the following paragraphs.

2.2 QPipe: A Simultaneously Pipelined Relational Query Engine

This paper [1] was authored by Stavros Harizopoulos and Anastassia Ailamaki of the Computer Science Department of Carnegie Mellon University, and Vladislav Shkapenyuk of the Computer Science Department of Rutgers University.

They introduce two concepts, namely on-demand simultaneous pipelining (OSP) and QPipe. OSP maximizes data and work sharing across parallel query executions by pipelining an operator's output simultaneously to multiple parent nodes, thereby enabling dynamic operator sharing. QPipe is a relational engine which supports OSP. Each relational operator is encapsulated in a micro-engine serving query tasks from a queue, naturally exploiting all data and work sharing opportunities.


QPipe exhibits a speedup of 2x on a 4-disk Pentium 4 server system, and its throughput increases as more clients are added. The reason for this improved performance is that the memory pages brought in by one query are shared and reused by all the queries currently in execution. In the ideal case, this would be done for all relational operators and not just table scans, and this capability is termed OSP. The challenge is to implement OSP with as little overhead as possible.

We face several challenges in implementing OSP on a standard query engine: (a) efficient detection of overlapping operators and their eligibility for sharing; (b) coping with variations in the input and output speeds of the queries; and (c) handling restrictions on query execution order while maximizing sharing opportunities. If the aforementioned “one query, many operators” design were retained, the overhead incurred would negate any performance benefits.

A buffer pool manager keeps track of the memory pages being read and brought in, decides when to write pages back to memory, and replaces old pages using the LRU policy. Materialized view selection is typically applied to workloads known in advance, in order to speed up queries that contain common sub-expressions. The results of queries which have already been processed are stored in a cache, which enables faster responses for queries that reuse that data. Multiple-query optimization identifies common sub-expressions in query plans during optimization and produces globally-optimal plans.

[1] then discusses the challenges of simultaneous pipelining in detail. For the runtime detection of overlapping queries, the engine must track every query currently being executed and compare the relational operators of the incoming query with the set of operators already in execution. This run-time detection should be as efficient as possible and scale well with the number of active queries.

In the case of multiple scan consumers, when new scan requests for the same table arrive repeatedly and dynamically share a single scan, a large number of partial scans will be active on the same relation. Sometimes the query plan is generated such that the order of the scanned tuples matters and cannot be modified. There is also the risk of the engine getting caught in deadlocks during parallel query processing.


The current query processing model can be termed “query-centric”, as engine designs conform to the “many operators, one query” standard. The system consists of an optimizer which takes the input query and outputs an optimized query plan for processing the query. The query plan is constructed in the form of a tree, with the inputs as leaves and the relational operators as internal nodes. Each query is assumed to be independent of the others and is executed in parallel by the engine. The only sharing of common data across query executions is provided by the buffer pool, which maintains this data under a replacement policy.

In the implementation of QPipe, the work to be processed enters the system in the form of packets and is distributed to various µEngines, each of which handles a specific relational operator. A µEngine queues the incoming packets, which contain all information about the data to be processed and the location where the output data is to be stored, and then processes them one by one. The work of the individual µEngines is linked by producer-consumer relationships, as the output of one µEngine acts as the input of another.

Support for OSP is implemented in the following manner. Whenever a new packet enters the queue of a µEngine, the whole queue is checked for overlapping operators. Two elements are common to all µEngines: the OSP Coordinator and the Deadlock Detector. The OSP Coordinator does the groundwork for merging an incoming query into an “in-progress” query when their operators overlap, while the Deadlock Detector ensures a deadlock-free execution of simultaneously pipelined schedules.

In conclusion, they provide several techniques and concepts that facilitate the efficient parallelization of query processing through data reusability and simultaneous pipelining. We were strongly influenced by the ideas presented and have incorporated them in our design.

2.3 Database Servers on Chip Multiprocessors

This paper [2] was authored by Nikos Hardavellas, Ippokratis Pandis, Ryan Johnson, Naju G. Mancheril, Anastassia Ailamaki and Babak Falsafi of the Database Group and Computer Architecture Lab (CALCM) at Carnegie Mellon University.


According to their claims, until now the main bottleneck for database systems has been off-chip data stalls, so the major effort was directed towards bringing more and more data into the caches. At the same time, because of the ever-increasing level of integration in processor chips, newer processors have larger (but slower) data caches. They find that the major bottleneck of current software is data cache stalls, with L2 hit stalls rising from oblivion to become the dominant execution time component in some cases.

With recent developments in semiconductor technologies, the size of on-chip data caches has increased exponentially, to 16MB in the Dual-Core Intel Xeon 7100 and 24MB in the Dual-Core Intel Itanium 2. This trend shows no sign of changing, as computer designers assume that larger caches translate into proportionally higher performance. The disadvantage of larger caches is their higher latency: on-chip L2 latency has increased more than 3-fold during the past decade, e.g., from 4 cycles in the Intel Pentium III (1995) to 14 cycles in the IBM Power5 (2004).

They recognize that chip multiprocessor designs can be split into two families, or "camps." The fat camp uses wide-issue out-of-order processors which address data stalls by exploiting instruction-level parallelism (ILP) (e.g., Intel Core Duo, IBM Power5). The lean camp uses heavily multithreaded in-order processors which hide data stalls across threads by overlapping data access latencies with useful computation (e.g., Sun UltraSPARC T1). Database workloads are divided into saturated workloads, in which idle processors always find an available thread to run, and unsaturated workloads, in which processors may not always find threads to run, thereby exposing data access latencies.

The software used is FLEXUS, a cycle-accurate full-system simulator that simulates both user-level and operating system code, allowing accurate simulation of chip multiprocessors and symmetric multiprocessors running unmodified commercial database workloads. They characterize the performance of database workloads on an LC-camp and an FC-camp machine with the UltraSPARC III ISA running the Solaris 8 operating system. The LC camp employs four 2-issue superscalar in-order cores, while the FC camp employs four aggressive out-of-order cores. Cache access times are assessed with Cacti 4.2, an integrated cache access time, cycle time, area, leakage, and dynamic power model. The workloads consist of OLTP (TPC-C) and DSS (TPC-H) benchmarks running on a commercial DBMS.

Figure 5 of that paper shows that in six out of eight combinations of hardware and workload, data stalls dominate execution time. They analyze the components of data cache stalls: large on-chip L2 caches yield higher hit rates, which shift stalls from memory to L2 hits and increase the relative importance of L2 hit stalls. Integrating multiple processing cores on a single chip means that data accesses which would result in long-latency coherence misses in an SMP system are converted into L2 hits on the shared L2 cache of the CMP. As a result, the L2 hit stall component increases by a factor of 7, which distinguishes their results from prior research findings. Chip multiprocessors integrate multiple cores on a single chip, which promotes sharing of data through the common L2 cache, but contention for the shared hardware resources negates some of the positive effects of on-chip data sharing.

They then propose a few measures to improve the performance of DBMS systems on CMPs. The first is to increase parallelism, which is relatively easy as databases are inherently parallel: queries can be broken up and worked on concurrently. All of this conforms to the discussion of [1] above and gives added impetus to develop our engine along similar lines. They also propose improving data locality; for example, threads bearing a producer-consumer relationship execute faster if they are bound to the same core. The third measure is to construct a staged database system wherein the work is carried out in “stages” rather than as monolithic requests. Each query is decomposed into “packets” which are sent to different stages according to its requirements. This type of system can also improve data locality through explicit scheduling of the packets.


2.4 A Case for Staged Database Systems

This paper [3], authored by the same group from CMU, introduces a staged design for high-performance, evolvable DBMSs that are easy to maintain. They propose to break the database system into modules and to encapsulate them into self-contained stages connected to each other through queues.

The demand for executing time-critical operations while supporting a user base of millions is increasing. To cope with such demands, modern database systems need to adopt a work-centric, multi-threaded and multi-process execution model for parallelism. In the proposed design, every stage exclusively owns a data structure, allocates hardware resources, and makes its own scheduling decisions. Frequent switching between threads of the same program interleaves unrelated memory accesses, thereby reducing locality. They address memory performance from a single application's point of view, improving locality across its threads.

All stages are independent servers, each with its own queue, thread support, and resource management, that communicate and interact with other stages through a well-defined interface. Stages accept packets, each carrying a query. They adopt two levels of CPU scheduling: local thread scheduling within a stage and global scheduling across stages. At the top level, the actions that a database server performs are divided into five query execution stages: connect, parse, optimize, execute, and disconnect. The execute stage typically represents the largest part of a query's lifetime and is further decomposed into several stages.

Stages provide two basic operations, enqueue and dequeue, on a queue of incoming packets. The system works through the exchange of packets between stages; a packet represents work that the server must perform for a specific query at a given stage. Each stage has a pool of worker threads (the stage threads) that continuously call dequeue on the stage's queue, and one thread reserved for scheduling purposes. Having more than one thread per stage helps mask I/O events while still executing in the same stage (when there is more than one packet in the queue). If all threads happen to be suspended for I/O, or the stage has used up its time quantum, then a stage-level scheduling policy specifies the next stage to execute.
Operators which use a small portion of the common or shared data and code are grouped together, while operators that access a large common code base or common data are separated. The execution engine implementation maps the different operators into five distinct stages: file scan (“fscan”) and index scan (“iscan”), for accessing stored data sequentially or with an index, respectively; sort; join, which includes three join algorithms; and a fifth stage that includes the aggregate operators (min-max, average, etc.). The “fscan” and “iscan” stages are replicated and separately attached to the database tables. This way, queries that access the same tables can take advantage of relations that already lie in the higher levels of the memory hierarchy.

3.1 Implementation of a High-Level Model

Besides studying research papers on related work, we also worked in parallel on the implementation of our design. Our first task was to understand the present query processing architecture. To do so, we built a high-level model of the present architecture and correlated it with a benchmarked database management system. We worked on query processor code which implemented searching of data in tables using simple engines for “FROM”, “WHERE” and “SELECT”. This helped us design a framework for the existing sequential architecture. Once we finished our implementation, we started testing our design on the SAM (SPARC Architecture Model) simulator for OpenSPARC T2. We then profiled our code to analyse the overheads and latencies in the existing code. This helped us understand the intricate details of query processing and identify the places where searching can be parallelized in multi-core, multi-threaded architectures.

The following flow chart gives a brief overview of the work done in the implementation of the high-level model:


[Flow graph of the project, reconstructed from the original chart:
1. Study database benchmarks like TPC to model our sequential architecture.
2. Frame a high-level sequential architecture for implementation.
3. C code implementation: a code-parsing engine for query search with parsing ability for FROM, WHERE and SELECT.
4. Coding of cache models by accessing schema and hashing tables.
5. Implementation of a query plan analyser using the WHERE engine, and filtering of the cache using the FROM and SELECT engines.
6. Analysis, testing and profiling of the code in SAS (SPARC Architecture Simulator), and identification of parallel implementations for the architecture based on the related work studied.]


3.2 Analysing the High-Level Model

The high-level module, written in C, was then analysed for its power consumption with the Soft Explorer analysis tool. Soft Explorer is an automatic power/energy consumption estimation tool. Its entry point can be either the assembly code generated by the compiler (or ASM code written directly by the programmer) or the C code itself. The tool gives the global and local (per loop) consumption as well as the consumption repartition. It can also give the energy consumption of the external memory and the repartition of energy consumption between the processor and the external memory.

The power consumption of the design can be split into four major components:

1. DMA Power

2. Processing Unit Power

3. Fetch Power

4. Clock Power


We simulated the sequential C code on the C5510 DSP processor to view the power/energy consumption of a serial query processor, giving us a baseline for comparison with the pipelined Verilog design, which we show to have better performance. The power analysis of the sequential C code is displayed in the figure below:

This analysis shows that 52 percent and 41 percent of the power is consumed by the clock and the fetch unit respectively, making them the dominant factors.

The next simulation was carried out to determine how the total power consumption and the fetch power consumption are distributed amongst the different data structures in the code.


The following figure illustrates the power consumption breakdown across the data structures used in the C code. This gives us a hint of how the data elements are used in the high-level model.

The fetch power per data structure can be seen in the next simulation graph.


4.0 Hardware Implementation:

Our implementation has narrowed down to the data-flow diagram shown below. The distinction between the pre-processor and the co-processor, which identifies the point at which data handling shifts into hardware, is shown here. We further summarize the implementation scheme as well as the architecture in this part of the report.

The following picture shows how any SQL query can be compiled to generate its optimized query plan without executing the query. Once this is done, we proceed to the generation of packets for communication with the FPGA. The generation of the packet is done by a shell script. The packet is structured roughly as follows:


Packets are formed after this. The packets for a query can be either dependent or independent. Here is an example of how we form the query packets for communication.


Depending on the query plan, the packets are modified by the encoder and decoder modules.

4.1 System Architecture


FROM ENGINE:
The primary function of the “from” engine is to fetch table data from the common memory into the local memory of the co-processor. This engine accepts instructions in the following packet format from the pre-processor:

QI, or Query ID, is unique to a specific query and is assigned by the pre-processor. It is used by the co-processor to maintain coherency between different instruction segments of the same query.

EI, or Engine ID, is unique to each engine and is “0000” for the “FROM” engine. The IFU makes use of this ID to ensure the correct routing of all instructions to their respective engines.

OP ID, or Operator ID, has a constant value for the FROM engine, as it consists of a single operator.

Table Add(ress) is a 32-bit field which specifies the starting address in the common memory of the table data which has to be fetched into the local memory.

Table Size is an x-bit field composed of two sub-fields: Length, of y bits, and Width, of z bits. These values indicate the size of the table to be retrieved in terms of the number of rows and columns.

J specifies whether or not a join operation has to be performed on the table.

The data is processed, and this engine generates another intermediate instruction, in the format above, for consumption by the where or join engine.


Table Add(ress) is a 32-bit field which specifies the starting address of the table data in the local memory.

Table ID is a unique ID attached to each table stored in the local memory.

All the other data fields are as in the previous case.
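
To make the packet layout concrete, the sketch below shows how a FROM-engine instruction of the above form could be unpacked in Verilog, the implementation language used later in this report. Only the 4-bit engine ID and the 32-bit table address are fixed by the text; the overall 64-bit word size and the widths chosen for QI, OP ID and the Length/Width sub-fields are illustrative assumptions, not the thesis's actual encoding.

// Hypothetical sketch of the FROM-packet field extraction. The thesis fixes
// only the 4-bit engine ID and the 32-bit table address; the remaining field
// widths are assumed here for concreteness.
module from_packet_decode (
    input  wire [63:0] packet,     // raw instruction word from the decoder queue
    output wire [7:0]  qi,         // query ID, assigned by the pre-processor
    output wire [3:0]  ei,         // engine ID; "0000" selects the FROM engine
    output wire [2:0]  op_id,      // operator ID (constant for FROM)
    output wire [31:0] table_addr, // start address of the table in common memory
    output wire [7:0]  table_len,  // number of rows (the y-bit Length sub-field)
    output wire [7:0]  table_wid,  // number of columns (the z-bit Width sub-field)
    output wire        j           // join flag: table also feeds the join engine
);
    assign qi         = packet[63:56];
    assign ei         = packet[55:52];
    assign op_id      = packet[51:49];
    assign table_addr = packet[48:17];
    assign table_len  = packet[16:9];
    assign table_wid  = packet[8:1];
    assign j          = packet[0];
endmodule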

JOIN ENGINE:
The join engine's task is to take two SQL tables and apply a merge join to them, producing the single resultant table required by the other engines for further processing of the query.

It then transmits the information about the new table through an intermediate instruction with the format specified above.

Temp Table Address is the temporary address of the newly formed table in the local
memory.

The other fields have been explained earlier.
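
Since the thesis does not give the join engine's RTL, the following is only a minimal behavioural sketch of one merge-join step, assuming both input tables arrive sorted on the join key and that the head tuple of each stream is visible each cycle; all names and widths are hypothetical.

// Minimal sketch of a merge-join step over two key-sorted tuple streams.
module merge_join_step #(parameter KW = 16) (
    input  wire          clk,
    input  wire          valid_a, valid_b,  // head tuples present on each stream
    input  wire [KW-1:0] key_a, key_b,      // join keys of the head tuples
    output reg           adv_a, adv_b,      // pop the corresponding stream
    output reg           emit               // keys match: emit a joined tuple
);
    always @(posedge clk) begin
        emit  <= 1'b0;
        adv_a <= 1'b0;
        adv_b <= 1'b0;
        if (valid_a && valid_b) begin
            if (key_a == key_b) begin
                emit  <= 1'b1;       // matching keys: output a result row
                adv_a <= 1'b1;       // (simplification: unique keys assumed,
                adv_b <= 1'b1;       //  so both streams advance together)
            end else if (key_a < key_b)
                adv_a <= 1'b1;       // the smaller key can never match again
            else
                adv_b <= 1'b1;
        end
    end
endmodule

A real implementation must also handle duplicate keys by holding one stream while the other replays its matching run; the sketch sidesteps this for brevity.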


WHERE ENGINE:

The where engine receives a temporary table, which serves as its working data set, and a set of conditions, each specified in a separate instruction, which are to be applied to that data set. The pre-processor parses the multiple “where” conditions in the input query and splits each one into a separate instruction to be passed to the “where” engine.

The OP ID for the where engine gives the information about the operation which has to be performed on the attributes, which finally filters the useful tuples from the temporary table.


ATR1 and ATR2 are the fields on which the operation is performed; the result gives us the constraints which, when applied to the temporary table, yield the set of tuples that is stored in memory in the form of a new temporary table.

The J bit tells whether or not the input temporary table is the result of a join operation.

The other fields have been explained earlier.


This engine outputs a new temporary table. The data in this table is a subset of the main database table and satisfies all conditions set forth in the query. The information about the new table is encapsulated in an intermediate instruction of the above format and passed on to the “SELECT” engine.
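
A possible rendering of the WHERE operator core is sketched below: the decoded OP ID selects a comparison applied to the ATR1/ATR2 fields of each tuple, and the 'keep' flag marks tuples to be written into the new temporary table. The OP ID encodings and field widths are assumptions for illustration.

// Illustrative sketch of the WHERE operator core: only tuples whose
// predicate holds are written to the output buffer.
module where_filter #(parameter AW = 16) (
    input  wire          clk,
    input  wire          tuple_valid,   // a tuple's attributes are presented
    input  wire [2:0]    op_id,         // decoded from the WHERE packet
    input  wire [AW-1:0] atr1, atr2,    // ATR1/ATR2 fields under comparison
    output reg           keep           // write this tuple to the temp table
);
    localparam OP_EQ = 3'd0, OP_NE = 3'd1, OP_LT = 3'd2, OP_GT = 3'd3;
    always @(posedge clk) begin
        keep <= 1'b0;
        if (tuple_valid)
            case (op_id)
                OP_EQ:   keep <= (atr1 == atr2);
                OP_NE:   keep <= (atr1 != atr2);
                OP_LT:   keep <= (atr1 <  atr2);
                OP_GT:   keep <= (atr1 >  atr2);
                default: keep <= 1'b0;
            endcase
    end
endmodule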

SELECT ENGINE:

This engine is fed by the intermediate instructions formed by the “WHERE” engine. The task performed here is to retrieve from this new table the data fields which have been specified in the query. This information is extracted by the pre-processor and conveyed to this engine through an instruction of the following format:

ATTRIBUTE FIELD specifies the attributes that have to be displayed as output; the select engine filters these attributes from the temporary table whose location is sent in the form of an intermediate packet (where -> select).

AN specifies the number of attributes that have to be selected.


NO OP denotes no operation.

The other fields have been explained earlier.
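
The projection step of the SELECT engine might look like the sketch below, where the ATTRIBUTE FIELD is treated as a per-column bit mask and unselected columns are simply zeroed; a real engine would also compact the selected columns. Column count and widths are illustrative assumptions.

// Sketch of the SELECT (projection) step over a tuple of packed columns.
module select_project #(parameter NCOL = 8, parameter AW = 16) (
    input  wire               clk,
    input  wire               row_valid,  // a filtered tuple arrives
    input  wire [NCOL*AW-1:0] row,        // tuple as NCOL packed columns
    input  wire [NCOL-1:0]    attr_mask,  // ATTRIBUTE FIELD as a bit mask
    output reg  [NCOL*AW-1:0] out_row,    // selected columns, others zeroed
    output reg                out_valid
);
    integer c;
    always @(posedge clk) begin
        out_valid <= row_valid;
        for (c = 0; c < NCOL; c = c + 1)
            // pass through only the columns named in the query; compaction
            // of the surviving columns is omitted in this sketch
            out_row[c*AW +: AW] <= attr_mask[c] ? row[c*AW +: AW] : {AW{1'b0}};
    end
endmodule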

The above four engines are the crux of the architecture. Since they are all separate units, we can exploit pipelining even while solving a single query, which is defined as intra-query parallelism. When multiple queries are solved at once, the architecture advances to inter-query parallelism.

Dataflow:

In this section we shall see how a query is processed by the custom query processor. The IFU (Instruction Fetch Unit) sends the packets one by one to the common bus whenever the common bus is free, i.e. when the common bus gives a ready-to-receive signal on being requested by the IFU.

The query has been segmented into parts such that each engine can perform its specified task. The encoder and decoder units are common to every engine architecture. The decoder decodes the packets and sends the information in them to the operation unit, which performs the specified task and then sends the result to the encoder unit; the encoder encodes the output in the form of a packet and forwards it to the common bus so that it can reach its destination engine, or else simply displays the output.

The first task of every query is to retrieve the tables on which the manipulations have to be done. This task is performed by the “from” engine, which retrieves the data from the main memory and stores it in the local memory of the co-processor. The address in this local memory is forwarded in the form of packets to the “join” engine if it is required; otherwise the packets are forwarded to the where engine.

According to the query, the “where” engine gets, in the intermediate packets from the “from” or “join” engine, the address of the temporary table on which it has to do its manipulations. The “join” engine comes into play if the data manipulation has to be done on two or more tables. The manipulations to be done are also sent in the form of packets by the pre-processor and forwarded to the engines by the IFU through the common bus. The decoder in the where engine decodes these packets and sends the information to the operation unit. The resultant tuples are stored in another temporary table in memory, and the address of this temporary table is forwarded to the “select” engine in the form of intermediate packets.

In the "select" engine the decoder has to decode two kind of packets , one that are
intermediate packets formed by the "where" engine and the other send by the
preprocessor which gives the information about the attributes that have to be
selected among the resultant tuples manipulated in the "where" engine and hence
display the output.

The temporary tables are stored in the local memory of the co-processor. This enables a speedup, as data is fetched with an almost 100 percent hit rate. The input buffer and the output buffer are simply parts of the local SRAM. The memory operations in the co-processor are synchronized by the memory controller, which is responsible for data integrity.


Memory Controller:
This is the memory interface of the engine. It talks to the external memory. Its job is to fetch data into the data buffer and then store result data back to memory. It also receives the address of the written output block in memory and sends it to the encoder module.

Data Buffer:
This is the local memory provided to each engine to hide the memory access latency. There are two buffers inside: an input buffer and an output buffer, for storing input and output data respectively.
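
A hedged sketch of one half of such a data buffer is given below: a small synchronous FIFO whose fill level drives the 'buffer_full' back-pressure signal described in the handshaking list later in this section. The depth and the slack threshold are assumptions.

// Sketch of one half of the data buffer: a synchronous FIFO with a
// near-full back-pressure flag. Pointer/counter widths fit DEPTH = 64.
module data_buffer #(parameter W = 32, parameter DEPTH = 64) (
    input  wire         clk, rst,
    input  wire         wr_en,        // memory controller pushes a word
    input  wire [W-1:0] wr_data,
    input  wire         rd_en,        // operator pops a word
    output reg  [W-1:0] rd_data,
    output wire         buffer_full,  // asserted near capacity
    output wire         empty
);
    reg [W-1:0] mem [0:DEPTH-1];
    reg [6:0]   count;                // sized for the default DEPTH = 64
    reg [5:0]   wrptr, rdptr;         // wrap naturally at DEPTH = 64
    assign buffer_full = (count >= DEPTH - 4);  // slack for in-flight words
    assign empty       = (count == 0);
    always @(posedge clk) begin
        if (rst) begin
            count <= 0; wrptr <= 0; rdptr <= 0;
        end else begin
            if (wr_en && !buffer_full) begin
                mem[wrptr] <= wr_data;
                wrptr      <= wrptr + 1;
            end
            if (rd_en && !empty) begin
                rd_data <= mem[rdptr];
                rdptr   <= rdptr + 1;
            end
            case ({wr_en && !buffer_full, rd_en && !empty})
                2'b10:   count <= count + 1;  // write only
                2'b01:   count <= count - 1;  // read only
                default: count <= count;      // both or neither
            endcase
        end
    end
endmodule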

Operator:
This is the main core of the engine and performs the desired operation. It implements the architectural operations such as sequence scan, index scan and projection.


Encoder and Decoder Module:

The decoder buffer stores queries coming from the IFU or from an operator, according to their dependent or independent nature; this is discussed in detail later in the document. The decoder decodes the query packets, extracts the ENGINE CODE, query_id, table address, table size, etc., and sends them to the respective modules. The encoder makes and stores the new packet to be sent to the other engines.

The figure above shows the architecture for the execution of a query by the query engines. The signals are as follows.

The IFU receives queries in the form of packets for further processing by the query processor and puts the packets on the common bus. The common bus is read by all the engines; if the destination address field in a packet matches an engine's code, that packet is stored in the decode queue of that engine. There are two decoding queues, namely the dependent and the independent packet queue, and a packet is stored in the respective queue depending upon its nature. If the memory controller is free, it asserts the 'tx_status' signal and waits to receive data from the encoder/decoder unit. The encoder/decoder unit sends the required data to the memory controller and simultaneously sends the 'meta data' of the table to the operator. Once the data buffer starts receiving the table data, the operator commences its operation and writes the results back into the output buffer.
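
The packet-steering front end of the decoder described above could look like the following sketch, in which a packet whose destination field matches the engine's code is pushed into either the dependent or the independent queue. The positions of the destination field and the dependence flag, and the 4-bit ENGINE_CODE parameter, are assumptions.

// Sketch of the decoder front end: capture bus packets addressed to this
// engine and steer them into the dependent or independent queue.
module decode_queue_front #(parameter [3:0] ENGINE_CODE = 4'b0000) (
    input  wire        clk,
    input  wire        bus_valid,   // a packet word is on the common bus
    input  wire [31:0] bus_data,    // the 32-bit common bus
    output reg         push_dep,    // enqueue into dependent-packet queue
    output reg         push_indep,  // enqueue into independent-packet queue
    output reg  [31:0] pkt
);
    wire match     = (bus_data[31:28] == ENGINE_CODE); // destination field
    wire dependent = bus_data[27];                     // assumed flag bit
    always @(posedge clk) begin
        push_dep   <= bus_valid && match &&  dependent;
        push_indep <= bus_valid && match && !dependent;
        pkt        <= bus_data;
    end
endmodule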

1. Common Bus: It is a 32-bit bus used to transfer packets from the IFU to an operator and from operator to operator. It is a shared bus with handshaking as discussed below; a behavioural sketch of the engine side of the bus handshake follows this list.

Handshaking signals interfacing with the IFU:

2. bus_rd_rq (Active LOW): Initially all the engines read the bus, so 'bus_rd_rq' is HIGH. On seeing a packet designated to itself, the corresponding engine de-asserts the signal. It is asserted again when EOP (End Of Packet) is received.

3. bus_rd_grant (Active HIGH): Seeing 'bus_rd_rq' LOW from a particular engine, the IFU asserts 'bus_rd_grant' HIGH for that engine and LOW for all the other engines. It is de-asserted when 'bus_rd_rq' of that engine goes HIGH again.

Handshaking signals interfacing with the external memory controller:

4. mem_rq (Active HIGH): The engine's memory controller asserts it when it has a table address and a table size to send to memory, and then waits for the 'mem_grant' signal to go HIGH. It is de-asserted once all the data has been sent.

5. mem_grant (Active HIGH): The memory controller, on seeing 'mem_rq' HIGH from an engine, asserts this signal based on the availability of memory. It is de-asserted once 'mem_rq' is de-asserted.

6. mem_add_rq (Active HIGH): When 'op_block_done' is HIGH, the engine asserts this signal to get the address and offset of the written output block in memory from the memory controller. It is de-asserted once 'mem_add_grant' is HIGH.

7. mem_add_grant (Active HIGH): The memory controller, on seeing 'mem_add_rq' HIGH, asserts this signal and sends the address and offset of the stored block. It is de-asserted again once 'mem_add_rq' is LOW.

8. mem_data_valid (Active HIGH): The memory controller asserts it to tell the data buffer of an engine that it is sending valid data. It is de-asserted temporarily if 'buffer_full' is HIGH, and permanently when 'ip_table_done' is asserted.


9. buffer_full (Active HIGH): The data buffer asserts it to indicate that the input data buffer is full. Seeing it HIGH, the memory controller de-asserts the 'mem_data_valid' signal. It is de-asserted when the input buffer's fill level goes below a threshold.

10. mem_wr_rq (Active HIGH): The data buffer asserts this signal when its output data buffer has more entries than a certain threshold. It is de-asserted when the data writing is finished.

11. mem_wr_grant (Active HIGH): On seeing 'mem_wr_rq' HIGH, the memory controller, based on memory availability, asserts 'mem_wr_grant' to grant the data buffer access to write its output data block. It is de-asserted once 'mem_wr_rq' is de-asserted.

12. ip_table_done (Active HIGH): It is asserted to signal to the data buffer that the table fetch is finished. It is de-asserted when 'mem_data_valid' is asserted.
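
As referenced under the Common Bus item above, here is a behavioural sketch of the engine side of the bus read handshake (signals 2 and 3): bus_rd_rq is driven LOW while a packet addressed to this engine is being received and is released at EOP. This is a plausible rendering of the protocol as described, not the thesis's actual RTL.

// Sketch of the engine side of the common-bus read handshake.
module bus_read_handshake #(parameter [3:0] ENGINE_CODE = 4'b0000) (
    input  wire        clk, rst,
    input  wire [31:0] bus_data,
    input  wire        bus_valid,
    input  wire        eop,           // End Of Packet marker
    input  wire        bus_rd_grant,  // asserted HIGH by the IFU
    output reg         bus_rd_rq      // active LOW request, per the text
);
    localparam IDLE = 1'b0, RECV = 1'b1;
    reg state;
    wire mine = bus_valid && (bus_data[31:28] == ENGINE_CODE);
    always @(posedge clk) begin
        if (rst) begin
            state     <= IDLE;
            bus_rd_rq <= 1'b1;          // HIGH = not requesting
        end else case (state)
            IDLE: if (mine) begin
                bus_rd_rq <= 1'b0;      // de-assert on seeing own packet
                state     <= RECV;
            end
            RECV: if (eop && bus_rd_grant) begin
                bus_rd_rq <= 1'b1;      // re-assert at End Of Packet
                state     <= IDLE;
            end
        endcase
    end
endmodule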

5.0 Implementation and Results

The preceding architecture was implemented in Verilog HDL. The implementation procedure was carried out in two parts:

a) IFU-Encoder-Decoder system
This consisted of all the engines along with their encoder-decoder queues and memory controllers. The engines were then interfaced with the common IFU unit to complete this section of the design.

b) Operator architecture
This part consisted of designing the core of each engine. The operator design required extra care so that latencies could be kept in check.

To verify the functional correctness of the design blocks, we generated an input file which fed the input ports with a bit-stream. By running the simulation tool on such a dataset, we checked the data bits at various stages of processing. The outputs at each encoder and decoder were also verified.
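
A minimal testbench sketch of this verification flow is shown below: a file of bit-vectors is loaded with $readmemb, driven onto the design one word per clock, and the waveform is dumped for inspection in GTKWave. The file name, word width and the commented-out DUT instantiation are placeholders.

// Minimal testbench: drive a stored bit-stream into the DUT and dump a VCD.
module tb;
    reg         clk = 0;
    reg  [31:0] stim [0:255];  // bit-stream loaded from the input file
    reg  [31:0] packet;
    integer     i;

    // device under test (placeholder instantiation)
    // from_decoder dut (.clk(clk), .packet(packet) /* ... */);

    always #5 clk = ~clk;      // free-running test clock

    initial begin
        $readmemb("input_vectors.txt", stim);  // hypothetical stimulus file
        $dumpfile("wave.vcd");                 // waveform viewed in GTKWave
        $dumpvars(0, tb);
        for (i = 0; i < 256; i = i + 1)
            @(posedge clk) packet <= stim[i];  // one stimulus word per cycle
        $finish;
    end
endmodule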

The following are the functional simulation results of the IFU-encoder-decoder system in GTKWave:


The following module interfaces the IFU with the FROM, WHERE and SELECT encoder and decoder units. The FROM decoder unit has been implemented with a serial FIFO, whereas the WHERE and SELECT decoders use flexible FIFO systems which can break their link based on the query ID of a packet placed in the FIFO (a sketch of this idea follows below). The system is connected via the IFU and was tested over many clock cycles, giving correct simulation outputs.
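
One way such a "flexible FIFO" could be realized is sketched below: a small packet store that can pop not only its head but the first entry whose query-ID field matches a requested ID, letting all packets of one query be drained together. Depth, widths, the QI field position and the one-operation-per-cycle restriction are simplifying assumptions.

// Sketch of a flexible FIFO: entries may be popped out of strict arrival
// order by matching on the packet's query-ID field. At most one push OR
// one pop is serviced per cycle, and strict ordering among same-QI packets
// is not guaranteed by this simplified slot allocation.
module flexible_fifo #(parameter W = 32, parameter D = 8) (
    input  wire         clk, rst,
    input  wire         push,
    input  wire [W-1:0] din,
    input  wire         pop_by_qi,  // dequeue first entry matching qi_req
    input  wire [7:0]   qi_req,
    output reg  [W-1:0] dout,
    output reg          hit         // a matching entry was found and popped
);
    reg [W-1:0] mem  [0:D-1];
    reg         used [0:D-1];
    integer k;
    reg done;
    always @(posedge clk) begin
        hit <= 1'b0;
        if (rst) begin
            for (k = 0; k < D; k = k + 1) used[k] <= 1'b0;
        end else if (push) begin
            done = 0;               // insert into the lowest free slot
            for (k = 0; k < D; k = k + 1)
                if (!done && !used[k]) begin
                    mem[k]  <= din;
                    used[k] <= 1'b1;
                    done    = 1;
                end
        end else if (pop_by_qi) begin
            done = 0;               // scan for the first matching entry
            for (k = 0; k < D; k = k + 1)
                if (!done && used[k] && mem[k][23:16] == qi_req) begin
                    dout    <= mem[k];   // assumed QI field: bits [23:16]
                    used[k] <= 1'b0;
                    hit     <= 1'b1;
                    done    = 1;
                end
        end
    end
endmodule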

SIMULATION of IFU-ENCODER-DECODER Unit

The operator module was then implemented along with its interfaces to the SRAM, the memory controller and the clock. The module engines were tested in logic simulation and gave correct results.

The engines are built around clock-based state-machine logic in which the control signals change on clock edges. A screenshot of the simulations of this module is shown here:


SIMULATION of OPERATOR MODULE

The RTL synthesis of both modules was done with Synopsys design tools and Xilinx ISE. The calculated results are shown below:

FINAL RESULTS:
This report gives information about the logic and device utilization of our design on the Xilinx Virtex-2 Pro FPGA.
RTL Top Level Output File Name : fromdecoder.ngr
Top Level Output File Name : fromdecoder
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : NO

Design Statistics
# IOs : 133

Cell Usage :
# BELS : 637
# BUF : 2
# INV : 1
# LUT3 : 4
# LUT3_D : 1
# LUT3_L : 3
# LUT4 : 589
# LUT4_D : 29
# LUT4_L : 8
# FlipFlops/Latches : 716
# FDE : 76
# FDRE : 640
# Clock Buffers : 1
# BUFGP : 1
# IO Buffers : 132
# IBUF : 67
# OBUF : 65
=========================================================================

The design was synthesized for the 2vp2fg256-7 package, and the following reports show the utilization summary:

Device utilization summary:
---------------------------

Selected Device : 2vp2fg256-7

 Number of Slices:            435 out of 1408   30%
 Number of Slice Flip Flops:  716 out of 2816   25%
 Number of 4 input LUTs:      635 out of 2816   22%
 Number of IOs:               133
 Number of bonded IOBs:       133 out of  140   95%
 Number of GCLKs:               1 out of   16    6%

---------------------------

Timing Analysis:
Timing Detail:
--------------
All values displayed in nanoseconds (ns)

=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 2.831ns (frequency: 353.280MHz)


Total number of paths / destination ports: 6057 / 1368
-------------------------------------------------------------------------
Delay: 2.831ns (Levels of Logic = 2)
Source: ins1/wrptr_2_1 (FF)
Destination: ins1/mem_7_63 (FF)
Source Clock: clk rising
Destination Clock: clk rising

Data Path: ins1/wrptr_2_1 to ins1/mem_7_63

 Cell:in->out    fanout  Gate Delay  Net Delay
 ----------------------------------------------
 FDE:C->Q             6       0.370      0.581
 LUT3_D:I0->O         1       0.275      0.349
 LUT4:I2->O          64       0.275      0.717
 FDRE:CE                      0.263
 ----------------------------------------------
 Total                        2.831ns (1.183ns logic, 1.648ns route)
                                      (41.8% logic, 58.2% route)

=========================================================================

Mapping Analysis:

Mapping Report
--------------

Logic Utilization:
 Number of Slice Flip Flops:   716 out of 2,816   25%
 Number of 4 input LUTs:       636 out of 2,816   22%
Logic Distribution:
 Number of occupied Slices:    401 out of 1,408   28%
 Number of Slices containing only related logic:  401 out of 401   100%
 Number of Slices containing unrelated logic:       0 out of 401     0%
 Total Number of 4 input LUTs: 636 out of 2,816   22%
 Number of bonded IOBs:        133 out of   140   95%
 Number of BUFGMUXs:             1 out of    16    6%

5.1 RTL SCHEMATICS OF SYNTHESIS:

These screenshots were generated by the Xilinx synthesis tool.

a) The IFU to FROM DECODER unit:


b) RTL One Level Down:

c) FROM AND JOIN OPERATOR SCHEMATIC:


6.0 Conclusion
This work has given a direction towards implementing an architecture for accelerating query operations. It also provides definite results to support our case and probes into the details of customizing indexing, merging and hashing operations with hardware SM-based operators. Further work in this direction would aim to make the hardware scalable, interface it with PostgreSQL, and run TPC benchmarks on the designed co-processor.

7.0 References:

[1] QPipe: A Simultaneously Pipelined Relational Query Engine. Stavros Harizopoulos, Vladislav Shkapenyuk, Anastassia Ailamaki. SIGMOD 2005, June 14-16, 2005, Baltimore, Maryland, USA.
[2] Database Servers on Chip Multiprocessors: Limitations and Opportunities. Nikos Hardavellas, Ippokratis Pandis, Ryan Johnson, Naju G. Mancheril, Anastassia Ailamaki and Babak Falsafi. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, USA.
[3] A Case for Staged Database Systems. S. Harizopoulos and A. Ailamaki. In proceedings of the First International Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, January 2003.
[4] Database System Implementation. Hector Garcia-Molina. 2nd edition.
[5] Database System Concepts. Silberschatz, Korth, Sudarshan.
[6] L. D. Shapiro. Join processing in database systems with large main memories. ACM Trans. Database Syst., 11(3):239-264, 1986.
[7] K. Stocker, D. Kossmann, R. Braumandl, and A. Kemper. Integrating semi-join-reducers into state of the art query processors. In ICDE, pages 575-584, 2001.
[8] K. A. Ross. Modeling the performance of algorithms on flash memory devices. In DaMoN, pages 11-16, 2008.
