These slides released under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License
What we'll cover
Not just about the new tool
− Background on the problems related to the functional testing of database systems
− We'll survey and assess the state of the testing tool landscape
Applications / strengths / limitations
Why the history lesson?
Testing is a difficult task, particularly for complex systems like database servers
− This task presents its own interesting set of obstacles to overcome.
A little bit of background is necessary to understand the reasoning behind this approach.
“Not ignorance, but ignorance of ignorance is
the death of knowledge”
− Alfred North Whitehead
Where do you get these ideas?
Based on research provided by Microsoft's SQL Server team
− One of the only sources for material on this topic
− Their work has provided invaluable insight into long-term results of strategies and inspiration for newer testing techniques
Hands-on experience working with MySQL and Drizzle
− Almost 4 years of blood, sweat, and tears ; )
Testing Databases is Crazy Hard!
It's a big task
Table composition (data types and combinations thereof)
Table population (size and data distribution)
Query space – SQL is expressive!
Table access methods (optimizations, etc.)
− index_merge/intersection = the unicorn of optimizer testing ; )
Effects of various switches (materialization, semijoin, etc.)
It's a really big task
Essentially infinite input space + ever-growing feature set =
− Exhaustive testing – 'not gonna happen'
− Need to be smart about what we do test
− Need to be ruthless about what tests we accept as good
Maintenance is costly – can't waste time on useless tests
Some additional things
Not easy to unit-test
− Logical separations are well-understood; testing them is not so easy.
Semantics of the test are as suspect as the code
− Hack up a parse tree for a subquery-heavy bit of SQL?
Time to benefit ratio = not so good / lots of effort
− Our unit-testing GSoC student ran away after the summer and hasn't come near Drizzle since ; )
End-to-end testing (SQL queries) is most effective / productive
Focus on functional testing
Other types of tests are important, but having a really fast server that delivers incorrect results doesn't matter
One could also argue that such tests evolve from a set of solid queries that exercise the server code
We concern ourselves with useful query / test case generation
Useful queries and tests?
A test with 1 million SELECT 1's technically does something, but nothing any user would likely ever care about.
Additionally, the mysql test suite is filled with random tests where devs thought they were doing something, but there is no definitive proof.
− One test with 10k rows of data and 2 simple selects – why?
It 'seemed' like it was doing something?
− Devolving into superstition at this point
The evolution of testing tools
Understanding history
Let's look at the various functional tools available
− Understanding their strengths and limitations helps us to understand how subsequent tools came to be
Hand-crafted tests
Random / stochastic testing tools
feedback-based random query generation
− aka genetic algorithms
mad science...mwa ha ha! ; )
hand-crafted tests
e.g. drizzle-test-run / mysql-test-run
In the beginning...
How almost all testing starts.
Quick
Easy
VERY good for targeted testing
− Easily verified results / limited domain
LENGTH(), SUBSTR()
small, limited functions
Can apply equivalence class partitioning
It is how most systems test
− Testing based on this strategy helped make MySQL into a solid and widely-used product
− Postgres' test suite is based on similar tests
− Drizzle still uses it for a significant portion of its own testing
Significant time and effort have been put into most tests
Waste not, want not
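Equivalence class partitioning, mentioned above, can be sketched quickly: rather than testing every possible string input to something like LENGTH(), pick one representative per class of inputs the server should treat alike. The class names and representative values below are illustrative assumptions, not from any real suite.

```python
# One representative input per equivalence class for a LENGTH() test.
# Class names and values here are illustrative assumptions.
EQUIVALENCE_CLASSES = {
    "empty string": "''",
    "single character": "'a'",
    "multi-byte character": "'\u00e9'",
    "long string": "'" + "x" * 1024 + "'",
    "NULL input": "NULL",
}

def length_test_queries():
    """Emit one targeted query per equivalence class."""
    return ["SELECT LENGTH(%s);" % value
            for value in EQUIVALENCE_CLASSES.values()]

for query in length_test_queries():
    print(query[:60])
```

A handful of such queries covers the input space about as well as thousands of random strings would, which is why this style works so well for small, limited functions.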
Hand-crafted tests
Can be anything
− we generally view DTR .test files as a case
− mysql_protocol.prototest uses python scripting
Generally mean a highly targeted test case that was written by a human and that will likely require maintenance and extension by one as well
Example test (slave plugin)
--disable_warnings
DROP TABLE IF EXISTS t1;
--enable_warnings
--echo Populating master server
CREATE TABLE t1 (a int not null auto_increment, primary key(a));
INSERT INTO t1 VALUES (),(),();
SELECT * FROM t1;
….
--echo Connecting to slave...
connect (slave_con,127.0.0.1,root,,test, $BOT0_S1);
echo Using connection slave_con...;
connection slave_con;
--sleep 3
--echo Checking slave contents...
--source include/wait_for_slave_plugin_to_sync.inc
SHOW CREATE TABLE t1;
SELECT * FROM t1;
Example result
DROP TABLE IF EXISTS t1;
Populating master server
CREATE TABLE t1 (a int not null auto_increment, primary key(a));
INSERT INTO t1 VALUES (),(),();
SELECT * FROM t1;
a
1
2
3
Connecting to slave...
Using connection slave_con...
Checking slave contents...
SHOW CREATE TABLE t1;
Table Create Table
t1 CREATE TABLE `t1` (
`a` INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`a`)
) ENGINE=InnoDB COLLATE = utf8_general_ci
SELECT * FROM t1;
a
1
2
3
Switching to default connection
DROP TABLE t1;
So, what's the issue?
Where to begin?
− Scalability / sustainability
Human effort / attention required just to maintain existing tests can be considerable
Effort to write new tests, particularly for complicated features
Microsoft's research has shown an average of 0.5 hours for a good, hand-crafted test
− Testing complicated features == EXPENSIVE!
Also shown that test development time can far exceed code development time
Bad strategery, cont'd
Lack of coverage
− Due to the nature of human development cycles, simpler cases are typically created first. As a result, more complex bugs can be missed until much later in the development cycle (if at all)
− Microsoft has admitted that bugs were found long after the defective feature had been rolled out, due to the non-standard circumstances required to trigger them
Lack of coverage, cont'd
More complex queries, such as the heavy use of subqueries for optimizer tests, literally can't be written by hand
− The effort required to create valid complex queries isn't worth it
− Validation is also quite problematic
Will discuss solutions to this problem in a bit...
− If anyone is actually good at these tasks, I am scared of them
Break out the Turing test!
Crazy-complex queries!
1998 (yes, 1998!) – Microsoft publishes a paper outlining the RAGS system
− Brute force, intelligently applied
− Automated tool for query generation
Allows rapid generation of complex test queries
− Microsoft research refers to a 1 million-fold increase in query volume
Recognizes that query generation and execution are essentially mechanical tasks
Leaves the human free to use creativity and attention where it is effective
RAGS
Randomly generated queries
− General rules for query construction via stochastic parse tree
− Throw it at the server
Validation through comparison
− Different DBMSs
− Same software w/ different settings
Crash detection
~50% of generated queries executed / returned results
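The stochastic parse tree idea above can be sketched as recursive expansion of a grammar: each rule maps to a list of alternatives, and one is picked at random per expansion. The toy grammar and rule names here are illustrative assumptions, not the actual RAGS rules.

```python
import random

# Illustrative toy grammar: rule name -> list of alternatives.
GRAMMAR = {
    "query": ["SELECT select_list FROM table_name WHERE condition"],
    "select_list": ["*", "a", "a, b"],
    "table_name": ["t1", "t2"],
    "condition": ["a comparison_op 5", "condition AND condition"],
    "comparison_op": ["=", "<", ">", "<=", ">=", "!="],
}

def generate(symbol, rng, depth=0):
    """Expand a grammar symbol by picking one alternative at random."""
    if symbol not in GRAMMAR:
        return symbol                                  # terminal token
    alternatives = GRAMMAR[symbol]
    if depth > 5:                                      # bound the self-referencing rules
        alternatives = [a for a in alternatives if symbol not in a] or alternatives
    choice = rng.choice(alternatives)
    return " ".join(generate(tok, rng, depth + 1) for tok in choice.split())

rng = random.Random(42)    # fixed seed: same run = same queries (repeatability!)
print(generate("query", rng) + ";")
```

Note the fixed seed: this is where the repeatability property of randgen-style tools comes from, since the same seed replays the exact same query stream.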
MySQL vs. the randgen
2007-ish, the random query generator (aka randgen) is unleashed on the MySQL codebase
− Based on Microsoft's RAGS research
− Put the hurting on the Falcon storage engine
− Also part of why we are just now seeing 6.0 optimizer features being reintroduced into MySQL >: )
Admittedly a lot of edge cases
− Broken is broken
− Lots of edge-case bugs are worrisome as well
What the hell is going on in the code?
Sample randgen grammar
query:
SELECT * FROM _table WHERE int_field comparison_operator _digit ;
comparison_operator:
> | >= | < | <= | > | < | > | < | = | != ;
int_field:
col_int | col_int_key | col_int_not_null | col_int_not_null_key | pk ;
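Note the repeated > and < alternatives in comparison_operator: a rule is resolved by a uniform pick among its alternatives, so listing an operator multiple times is how a grammar weights it more heavily. A small Python sketch of that resolution, using the rule's contents as a plain list:

```python
import random

# The comparison_operator rule from the grammar above, as a list of alternatives:
comparison_operator = ['>', '>=', '<', '<=', '>', '<', '>', '<', '=', '!=']

rng = random.Random(7)
counts = {}
for _ in range(10000):
    op = rng.choice(comparison_operator)    # uniform pick among alternatives
    counts[op] = counts.get(op, 0) + 1

# '>' and '<' each appear 3 times out of 10 alternatives, so each is drawn
# roughly 3x as often as single-listing operators like '>='.
print(counts['>'] > counts['>='])
```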
Applications
Good for providing an initial baseline
− Determine if the code is solid enough for a human to devote craftiness to breaking it
− Frees QA devs' time for better things
think up more challenges
work on fine-grained testing (hand-crafted tests)
It is vital to remember that QA is a creative (not mechanical) task. Outsmarting buggy code requires time
“See what shakes out”
Applications, cont'd
Good for testing some things, not others
− Covers a lot of ground, but not so easy to express certain things in a stochastic manner
− Optimizer validation = great!
− Transaction log = great!
− Testing complex scenarios = not so great
Drizzledump migration
Now for the bad stuff
There are tradeoffs – can do some things at the expense of not doing other tasks
− We trade precision for brute force
− The wrong tool can make a seemingly easy task much, much more difficult ; )
Like picking up a dime with a pair of gloves
Again, testing Drizzledump migration
− pcrews.egg_on_face=True
Tests can still be expensive
We can cover a lot of ground for our efforts, but it still takes development and maintenance time
− Creating verifiable tests
− Tuning the tests
To hit desired code
To generate valid queries
Often cyclical
− Maintenance
How hard to update or change?
Expensive tests
Optimizer grammars took ~4 months to produce
− It was a first effort, but it shows it is not trivial to become familiar with things
Outer join grammars took ~2 months
Need to figure in feature complexity, value, etc.
Additional costs
Development and tuning
− How easily can the tools be expanded as we discover new ideas / needs?
− Certain things are hard to express
can't always change the server state as we want
Changing the tests can also be expensive
− Tradeoff between tuning and robustness
How complex could a test be?
join:
{ $stack->push() }
table_or_join
{ $stack->set("left",$stack->get("result")); }
left_right outer JOIN table_or_join
ON
join_condition ;

join_condition:
int_condition | char_condition ;

int_condition:
{ my $left = $stack->get("left"); my %s=map{$_=>1} @$left; my @r=(keys %s); my $table_string = $prng->arrayElement(\@r); my @table_array = split(/AS/, $table_string); $table_array[1] } . int_indexed =
{ my $right = $stack->get("result"); my %s=map{$_=>1} @$right; my @r=(keys %s); my $table_string = $prng->arrayElement(\@r); my @table_array = split(/AS/, $table_string); $table_array[1] } . int_indexed
{ my $left = $stack->get("left"); my $right = $stack->get("result"); my @n = (); push(@n,@$right); push(@n,@$left); $stack->pop(\@n); return undef } |
int_field_name:
`pk` | `col_int_key` | `col_int` |
`col_bigint` | `col_bigint_key` |
`col_int_not_null` | `col_int_not_null_key` ;
char_field_name:
`col_char_10` | `col_char_10_key` | `col_text_not_null` | `col_text_not_null_key` |
`col_text_key` | `col_text` | `col_char_10_not_null_key` | `col_char_10_not_null` |
`col_char_1024` | `col_char_1024_key` | `col_char_1024_not_null` |
`col_char_1024_not_null_key` ;
int_indexed:
`pk` | `col_int_key` | `col_bigint_key` | `col_int_not_null_key` ;
char_indexed:
`col_char_1024_key` | `col_char_1024_not_null_key` |
`col_char_10_key` | `col_char_10_not_null_key` ;
Tuning vs. Robustness
# 2011-03-22T20:45:24 Rows returned:
$VAR1 = {
' 0' => 148,
' 1' => 8,
' 2' => 1,
' 3' => 1,
' 4' => 1,
' -1' => 76,
' 10' => 2,
'>10' => 2,
'>100' => 1
};
# 2011-03-22T20:45:24 Rows affected:
$VAR1 = undef;
# 2011-03-22T20:45:24 Explain items:
$VAR1 = undef;
# 2011-03-22T20:45:24 Errors:
$VAR1 = {
'(no error)' => 173,
'Unknown column \'%s\' in \'IN/ALL/ANY subquery\'' => 12,
'Unknown column \'%s\' in \'field list\'' => 37,
'Unknown column \'%s\' in \'having clause\'' => 9,
'Unknown column \'%s\' in \'where clause\'' => 1,
'Unknown table \'%s\'' => 18
};
Valid queries
Was considered a large problem at Microsoft
Have run into similar issues with the randgen
Difficult to express more complex queries / sets of queries while keeping them valid and worthwhile
− Makes it harder to hit difficult / rare code paths or combinations of them
Wasteful / not reusable
Every time we run the randgen, we generate the same invalid queries
− Good thing – every run with a given seed = same data and queries produced
Repeatability is a mantra of QA!
− Bad thing – we waste cycles on queries that don't make it deep into the system
No way to organize queries so we can filter them according to criteria
− At least not yet, randgen devs are a crafty lot!
feedback-based query generation
e.g. kewpie
Microsoft leads the way again
To overcome the limitations of purely stochastic systems, they adopted a genetic-based approach
Generate / execute / evaluate / mutate
Uses a variety of feedback from the system under test to determine the 'fitness' of a query
− Keep it?
− Mutate it further?
Genetic-based testing
Progressive building of valid queries
− SELECT col1 FROM table1;
− SELECT col1 FROM table1 WHERE col2 < 'value'
− SELECT col1 FROM table1 WHERE col2 < 'value' AND...
We end up with a set of queries that have some marked effect on the database
Organizing queries
MS uses a data warehouse of these test queries
− Provides a pool for all new testing efforts
Have a new measure of 'interesting'?
− Pull some queries and put them through the system!
− Easily organized / sorted / manipulated
Provides a set of well-cataloged building blocks for future tests
kewpie...finally ; )
Drizzle's efforts at creating this technology
Our testing experiences have been in line with Microsoft's, and we recognize similar needs
kewpie? = query probulator
− Futurama for the win!
The probulator!
kewpie
Still very early in development
− Sorry, it won't make your database webscale overnight
Hire a marketing department for that ; )
The idea is to teach something how to create queries once and then provide a means of directing the query generation
− Use of feedback
evaluation functions
− Use of specific mutation patterns
Favoring some / probability tweaks
Evaluation functions
Check the effects of our query(ies) on the database
− code coverage
− gdb output (Igor delta debugger project)
− EXPLAIN plans
− changes in select variables
− log output
− custom code instrumentation
The possibilities are endless!
kewpie, cont'd
Written in Python
Currently tightly integrated with dbqp.py, Drizzle's experimental test runner
− Likely to become a more separate tool over time
− Expedience and all of that
Originally based on SQLAlchemy
− They try to help you succeed a bit too much for nefarious testing purposes ; )
Design ideas
First and foremost a query generator
− Create good, effective, well-cataloged queries
− Building blocks for more complex tests
Stress tests
Performance
Durability
etc
Design ideas, cont'd
We want to have a 'query' be:
− easily manipulated
− easily analyzed / broken down
Provide a robust set of functions for working with query objects
− addColumn(type=None, aggregate=False...)
− addTable(name=None, rowCount=None...)
Lots of knobs for tuning things
What can it do?
Not a lot quite yet... : /
− Generation of SELECT lists, certain JOINs and WHERE conditions
− Still a lot to do...was a bit distracted:
We had this GA release thing we were working on...
Structure
As with all dbqp 'modes', we have a custom test executor and test manager
− testManager
what a testCase looks like / how to package the relevant data for execution
manages testCases for the testExecutor
− testExecutor
Set up for the test
Execute it
Evaluate the results
Structure
query_manager:
− populates and manipulates queries
add_tables()
− add a given number of tables to the query object from what is available in the test bed
add_columns()
− add a column from the tables used in the query
add_where()
− add a where clause using an available column from the tables used in the query
Structure
query generation = all about the tables
− we center everything on this as it determines what columns are available for valid queries
− generating invalid SQL will be necessary and useful at some point, but it is entirely too easy to pick a column not used in the query
The randgen requires you to pick a bit blindly in terms of column / table combinations
− means invalid queries (boo, hiss!)
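The table-centered idea can be sketched as follows. The add_tables / add_where names come from the earlier slide; the test-bed schema and everything inside the methods are illustrative assumptions. The point is structural: columns are only ever drawn from tables already in the query, so the WHERE clause cannot reference an unknown column.

```python
import random

# Illustrative test-bed schema: table -> its columns.
TEST_BED = {
    "t1": ["pk", "col_int", "col_int_key"],
    "t2": ["pk", "col_char_10", "col_int"],
}

class QueryManager:
    """Sketch of table-centered generation: valid columns by construction."""

    def __init__(self, rng):
        self.rng = rng
        self.tables = []
        self.where = []

    def add_tables(self, count):
        self.tables += self.rng.sample(sorted(TEST_BED), count)

    def add_where(self):
        table = self.rng.choice(self.tables)          # only tables in the query
        column = self.rng.choice(TEST_BED[table])     # only their real columns
        self.where.append("%s.%s > %d" % (table, column, self.rng.randint(0, 9)))

    def render(self):
        sql = "SELECT * FROM " + ", ".join(self.tables)
        if self.where:
            sql += " WHERE " + " AND ".join(self.where)
        return sql + ";"

qm = QueryManager(random.Random(3))
qm.add_tables(2)
qm.add_where()
print(qm.render())
```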
Structure
query_evaluator
− Runs the various bits of evaluator code
− Currently very primitive
Only have the row_count evaluator
Can add other evaluations as needed
Will eventually need proper fitness functions too
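A minimal sketch of what a row_count evaluator could look like. The evaluate() interface and the pass/fail-plus-reason shape are illustrative assumptions, not kewpie's actual API.

```python
class RowCountEvaluator:
    """Pass a query if it returned at least min_rows rows (sketch only)."""

    def __init__(self, min_rows=1):
        self.min_rows = min_rows

    def evaluate(self, query, result_rows):
        if result_rows is None:            # the query errored out entirely
            return (False, "query failed: %s" % query)
        if len(result_rows) < self.min_rows:
            return (False, "only %d row(s) returned" % len(result_rows))
        return (True, "ok")

evaluator = RowCountEvaluator(min_rows=1)
print(evaluator.evaluate("SELECT 1", [(1,)]))        # (True, 'ok')
print(evaluator.evaluate("SELECT 1 WHERE 1=0", []))  # (False, 'only 0 row(s) returned')
```

Other evaluators (EXPLAIN output, code coverage, log scraping) could share the same evaluate() shape so the query_evaluator can run them interchangeably.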
We prime the system via a python cnf file
− Determines the initial query population and their structure
All generated on-the-fly for now
Eventually will be able to pull from a database of test queries
[mutators]
add_table = 2
add_column = 4
add_where = 3
[test_servers]
servers = [[--innodb.replication-log]]
[evaluators]
row_count = True
explain_output = False
test execution
We create an initial set of queries
We then execute each query
If it passes evaluation, we then create a copy
− The original good query can serve as a seed for further mutations
− The copy is mutated and executed
We use max_mutate_count to limit query lifespan (no endless runs)
Next steps
How long do you have to listen? ; )
− database storage and retrieval of queries
− more fine-grained control over query generation
− more extensive / complex query generation via query mixing
subqueries
unions
− trimming the query pool
Next steps...still
More evaluation code
− gcov, gdb, EXPLAIN...
Fitness functions
The list goes on
− As mentioned earlier, the test domain is essentially infinite
− code will evolve to solve problems
Demo Time!
Summary
Testing Challenge
drizzle-test-run/mysql-test-run
random query generator
kewpie
References
Microsoft
− RAGS – “Massive Stochastic Testing of SQL”
http://research.microsoft.com/pubs/69660/tr-98-21.ps
− Genetic testing – “A genetic approach for random testing of database systems”
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.3435&rep=rep1&type=pdf
Random query generator
− https://launchpad.net/randgen
− http://forge.mysql.com/wiki/RandomQueryGenerator
− http://datacharmer.blogspot.com/2008/12/guest-post-philip-stoev-if-you-love-it.html
− http://carotid.blogspot.com/2008_09_01_archive.html#521833683342482424
References, cont'd
Drizzle + Drizzle testing tools
− http://drizzle.org/
− https://launchpad.net/drizzle
− http://docs.drizzle.org/testing/test-run.html
− http://docs.drizzle.org/testing/dbqp.html
− http://docs.drizzle.org/testing/randgen.html
kewpie_demo_tree:
− lp:~patrick-crews/drizzle/dbqp_kewpie_demo