Objectives
1. Learn the basics of relational databases
2. Learn how to use MySQL
3. Learn how to use the Structured Query Language (SQL)
Outline
Why are databases important in bioinformatics?
Brief background in databases
Introduction to the Structured Query Language
A worked example: a sequence database in MySQL
What is a database?
Collection of information
  Spreadsheet
  Filing cabinet
  Oracle database
Databases help us efficiently organise, integrate and query data in order to make scientific inferences
http://bioteach.ubc.ca
Databases and bioinformatics (2004 data)
Nucleotide records                        36,653,899
Protein sequences                         4,436,362
3D structures                             19,640
Interactions & complexes                  52,385
Human Unigene Cluster                     118,517
Maps and Complete Genomes                 6,948
Different taxonomy Nodes                  283,121
Human dbSNP                               13,179,601
Human RefSeq records                      22,079
bp in Human Contigs > 5,000 kb (116)      2,487,920,000
PubMed records                            12,570,540
OMIM records                              15,138
HELP!
RELATIONAL DATABASES
Relational Databases
A brief history
Developed by E.F. Codd (IBM), 1969-70
  Died 2003
  Received the Turing Award for his work (the computer science equivalent of the Nobel Prize)
Codd defined 12 rules for a relational database; they call for a language to define, manipulate and query the data in the database
One of these rules led to the Structured Query Language (SQL), which is used by every RDBMS on the market
  ANSI standard (SQL-92, SQL:1999)
Advanced Software Development Workshop 2009- 9
Relational Model
All data stored in tables
A table is a relation made up of columns (fields) and rows (records)
The intersection of a column and a row is a typed value
  Integer, Real, Varchar, Text, Blob, etc.
Operations on tables produce tables
Data independence
  Shielding the data from the application
Efficiency
  Storage, retrieval, integration
Data integrity/security
  Constraints, access controls
ACID test
In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction. An example of a transaction is a transfer of funds from one bank account to another, even though it might consist of multiple individual operations (such as debiting one account and crediting another).
Atomicity
  All-or-nothing transaction
  If one operation fails, all fail
Consistency
  Data integrity constraints hold before and after every transaction
Isolation
  Every transaction has a consistent view of the database regardless of what other transactions are being processed
Durability
  Once a transaction is complete, the newly updated data will survive failures of any kind
  Logs
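The bank-transfer example above can be sketched in a few lines. This is a minimal illustration of atomicity only, assuming Python's built-in sqlite3 module as a stand-in for a full RDBMS server; the account table, the names in it and the transfer function are hypothetical and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the CHECK constraint forbids overdrawn accounts.
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(connection, src, dst, amount):
    """Credit dst and debit src as one all-or-nothing transaction."""
    try:
        with connection:  # commits on success, rolls back on an exception
            connection.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            connection.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The debit broke the CHECK constraint, so the credit was undone too.
        return False


ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # neither update survives
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

The second transfer credits bob before the debit fails, so the unchanged balances afterwards show the rollback rather than a lucky early exit.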
Research fuelled by corporate databases gives us great technology for biological science
30+ years of research into robust systems
Industry standards for databases
Vendors committed to high-quality products
  Oracle, DB2, Sybase, MS SQL Server, etc.
Emergence of the internet and database-driven web content set the stage for bioinformatics
Data mining tools for creating statistical associations
  Diapers and beer? Teradata, a division of NCR Corporation
Commercial RDBMS
Oracle
  According to Forbes, Larry Ellison is the 9th richest person in the US ($18 billion)
DB2
  IBM's solution, free for academics
Microsoft SQL Server
  For Windows
PostgreSQL
http://www.postgresql.org/
"The world's most advanced Open Source database software"
Began in 1986 at UC Berkeley
For many years considered the most sophisticated open-source RDBMS
Performance?
Comes with most Linux distros
Small but loyal user community
MySQL
Free
  As in free beer
  As in free speech
Fast
  Extremely fast reads for certain table types
  Outperforms any RDBMS for reads
Functional
  Ease of use
  APIs in Perl, C, C++, Java
  Client/server architecture
  Works well with Apache/PHP for a very popular open-source dynamic web solution
BASE (http://base.thep.lu.se) BioArray Software Environment a web-based database solution for microarrays
Divide records into useful fields that describe the particular record
The meta-data
Create a model based on the useful fields
Create a database from the model
Insert the data into the database
The data is now computable
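The first step, dividing a record into fields, might look like this for the header of a GenBank flat-file record such as the one used in this worked example. It is only a sketch: the parse_header function and the choice of fields are assumptions made for illustration, the header lines are assumed to follow the usual GenBank keyword layout, and the molecule type is kept as text here although the Sequence table in the slides stores moltype as an integer code.

```python
import re

# Header lines of the L32174 record, in standard GenBank keyword layout.
record_header = """\
LOCUS       YSCITRSA2               2075 bp    DNA     linear   PLN 26-APR-2004
ACCESSION   L32174
VERSION     L32174.1  GI:46561769
"""


def parse_header(text):
    """Pull the fields needed for a Sequence row out of a GenBank header."""
    fields = {}
    for line in text.splitlines():
        keyword, _, rest = line.partition(" ")
        rest = rest.strip()
        if keyword == "LOCUS":
            match = re.search(r"(\d+) bp\s+(\S+)", rest)
            fields["length"] = int(match.group(1))
            fields["moltype"] = match.group(2)
        elif keyword == "ACCESSION":
            fields["accession"] = rest.split()[0]
        elif keyword == "VERSION":
            # "L32174.1" -> version number 1
            fields["version"] = int(rest.split()[0].split(".")[1])
    return fields


fields = parse_header(record_header)
print(fields)
```

Once the fields are separated like this, each one maps onto a column of the model, which is what makes the data computable.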
Data

LOCUS       YSCITRSA2               2075 bp    DNA     linear   PLN 26-APR-2004
DEFINITION  Saccharomyces cerevisiae isoleucyl tRNA synthetase (LAF1) gene,
            partial cds; and unknown gene.
ACCESSION   L32174
VERSION     L32174.1  GI:46561769
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 2075)
  AUTHORS   Chen,E. and Bretscher,A.P.
  TITLE     The LAF1 open reading frame encodes a second isoleucyl tRNA
            synthetase in the yeast Saccharomyces cerevisiae
  JOURNAL   Unpublished
FEATURES             Location/Qualifiers
     source          1..2075
                     /organism="Saccharomyces cerevisiae"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:4932"
     gene            <1..1204
                     /gene="LAF1"
     CDS             <1..1204
                     /gene="LAF1"
                     /note="disruption results in an abnormal actin
                     cytoskeleton; putative"
                     /codon_start=2
                     /product="isoleucyl tRNA synthetase"
                     /protein_id="AAT01099.1"
                     /db_xref="GI:46561770"
                     /translation="SLKLSKLPSPLYQVCLEGSDQHRGWFQSSLLTKVASSNVPVAPY
                     EEVITHGFTLDENGLKMSKSVGNTISPEAIIRGDENLGLPALGVVGLRYLIAHSNFTT
                     DIVAGPTVMKHVGEALKKVRTNFRYLLSNLQKSQDFNLLPIEQLRRVDQYTLYKINEL
                     LETTREHYQKYNFSKVLITLQYHLNNELSAFYFDISKDILYSNQISWSWQEGRSNNAC
                     PYTNAYRAILAPILPVMVQEVWKYIPEGWLQGQEHIDINPMRGKWPFLDSNTEIVTSF
                     ENFELKILKQFQEEFKRLSLEEGVTKTTHSHVTIFTKHHLPFSSDELCDILQSSAVDI
                     LQMDSNNNSHPTIELGRGINVQILVNVQILVERSKRHNCPRCWKANSAEEDKLCDRCK
                     EAVDHLMS"
     CDS             1452..2075
                     /note="putative"
                     /codon_start=1
                     /product="unknown"
                     /protein_id="AAT01100.1"
                     /db_xref="GI:46561771"
                     /translation="MTVMNLFFRPCQLQMGSGPLELMLKRPTQLTTFMNTRPGGSTQI
                     RFISGNLDPVKRREDRLRKIFSKSRLLTRLNKNPKFSHYFDRLSEAGTVPTLTSFFIL
                     HEVTANTTTVLLWWLLYNLDLSDDFKLPNFLNGLMDSCHTAMEKFVGKRYQECLNKNK
                     LILSGTVAYVTVKLLYPVRIFISIWGAPYFGKWLLLPFQKLKHLIKK"
ORIGIN
        1 aagcttaaag ttgtcaaaac tcccatcccc cctgtaccaa gtttgtctag aaggatctga
       61 tcaacataga ggatggtttc aaagttcact gctaacaaaa gtagcatcaa gtaatgtccc
      121 tgttgcacca tatgaagaag tgattactca tggttttacc ctagatgaga atggtctgaa
      181 aatgtcaaaa tctgtgggaa atacaatttc tcccgaagca ataattcgag gcgatgaaaa
      241 cttaggctta ccagctttgg gtgttgtagg cttgaggtat ctgatagcac attcgaattt
      301 cacaactgat atagttgctg gcccgactgt gatgaaacat gtaggagaag ctctaaaaaa
      361 ggttaggact aactttcgct atttattgag taatttacag aagtcccaag atttcaacct
      421 tttgccgatt gaacaattac gccgtgttga tcaatatacc ttgtataaga taaacgaact
      481 gctggaaacg acgagagaac actaccaaaa gtacaacttt tccaaggttc tcattactct
      541 acaatatcat ttaaataacg agctatcggc gttttatttt gatatctcaa aggatatttt
      601 atattccaac caaatatctt ggtcatggca agaaggcagg tcaaacaacg cttgtccata
      661 tactaatgca tatagggcaa ttcttgcacc aatattaccc gttatggtcc aagaagtatg
      721 gaagtatata ccagaaggat ggttacaagg acaagaacat atagacatta atccgatgcg
      781 tggaaaatgg ccgtttttgg actcaaatac ggaaatcgtc acctcctttg aaaactttg

Fields highlighted in the record:
  Length and accession: 2075 bp, L32174
  gene: <1..1204 /gene="LAF1"
CREATE Sequence
CREATE TABLE Sequence (
    sequence_id INT NOT NULL AUTO_INCREMENT,
    sequence    LONGTEXT NOT NULL,
    defline     TEXT,
    accession   VARCHAR(255) NOT NULL,
    version     INT DEFAULT 0,
    length      INT DEFAULT 0,
    moltype     INT NOT NULL,
    PRIMARY KEY (sequence_id)
);
CREATE Ontology
CREATE TABLE Ontology (
    ontology_id INT NOT NULL AUTO_INCREMENT,
    term        VARCHAR(255) NOT NULL,
    description TEXT NOT NULL,
    PRIMARY KEY (ontology_id)
);
CREATE Feature
CREATE TABLE Feature (
    feature_id  INT NOT NULL AUTO_INCREMENT,
    sequence_id INT NOT NULL,
    ontology_id INT NOT NULL,
    FOREIGN KEY (sequence_id) REFERENCES Sequence (sequence_id),
    FOREIGN KEY (ontology_id) REFERENCES Ontology (ontology_id),
    PRIMARY KEY (feature_id)
);
CREATE Location
CREATE TABLE Location (
    location_id INT NOT NULL AUTO_INCREMENT,
    feature_id  INT NOT NULL,
    start       INT NOT NULL,
    stop        INT NOT NULL,
    strand      INT NOT NULL,
    FOREIGN KEY (feature_id) REFERENCES Feature (feature_id),
    PRIMARY KEY (location_id)
);
CREATE Qualifier
CREATE TABLE Qualifier (
    qualifier_id INT NOT NULL AUTO_INCREMENT,
    feature_id   INT NOT NULL,
    ontology_id  INT NOT NULL,
    value        TEXT NOT NULL,
    FOREIGN KEY (feature_id) REFERENCES Feature (feature_id),
    FOREIGN KEY (ontology_id) REFERENCES Ontology (ontology_id),
    PRIMARY KEY (qualifier_id)
);
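A minimal sketch of what the FOREIGN KEY clauses in these tables are meant to guarantee. It assumes Python's built-in sqlite3 module as a stand-in for a MySQL server, so INTEGER PRIMARY KEY takes the place of AUTO_INCREMENT and the tables are trimmed to the columns needed for the demonstration. In MySQL itself, whether the constraint is enforced depends on the storage engine: InnoDB enforces foreign keys, MyISAM parses and ignores them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
CREATE TABLE Sequence (
    sequence_id INTEGER PRIMARY KEY,   -- SQLite counterpart of AUTO_INCREMENT
    sequence    TEXT NOT NULL,
    accession   TEXT NOT NULL
);
CREATE TABLE Ontology (
    ontology_id INTEGER PRIMARY KEY,
    term        TEXT NOT NULL,
    description TEXT NOT NULL
);
CREATE TABLE Feature (
    feature_id  INTEGER PRIMARY KEY,
    sequence_id INTEGER NOT NULL REFERENCES Sequence (sequence_id),
    ontology_id INTEGER NOT NULL REFERENCES Ontology (ontology_id)
);
""")

seq_id = conn.execute(
    "INSERT INTO Sequence (sequence, accession) VALUES (?, ?)",
    ("ATGACGATCAGCATCAGCTACAGCTG", "seq1")).lastrowid
onto_id = conn.execute(
    "INSERT INTO Ontology (term, description) VALUES (?, ?)",
    ("exon", "an exon in genomic sequence")).lastrowid

# A feature that points at existing rows is accepted.
conn.execute("INSERT INTO Feature (sequence_id, ontology_id) VALUES (?, ?)",
             (seq_id, onto_id))

# A feature that points at a sequence that does not exist is rejected.
try:
    conn.execute("INSERT INTO Feature (sequence_id, ontology_id) VALUES (?, ?)",
                 (999, onto_id))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

feature_count = conn.execute("SELECT COUNT(*) FROM Feature").fetchone()[0]
print(orphan_rejected, feature_count)
```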
INSERT an ontology
mysql> INSERT INTO Ontology (term, description) VALUES
    -> ('exon', 'an exon in genomic sequence');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO Ontology (term, description) VALUES
INSERT a sequence
mysql> DESC Sequence;
+-------------+--------------+------+-----+---------+----------------+
| Field       | Type         | Null | Key | Default | Extra          |
+-------------+--------------+------+-----+---------+----------------+
| sequence_id | int(11)      |      | PRI | NULL    | auto_increment |
| sequence    | longtext     |      |     |         |                |
| defline     | text         | YES  |     | NULL    |                |
| accession   | varchar(255) |      |     |         |                |
| version     | int(11)      | YES  |     | 0       |                |
| length      | int(11)      | YES  |     | 0       |                |
| moltype     | int(11)      |      |     | 0       |                |
+-------------+--------------+------+-----+---------+----------------+
7 rows in set (0.00 sec)

mysql> INSERT INTO Sequence (sequence, defline, accession, version, length, moltype)
    -> VALUES ('ATGACGATCAGCATCAGCTACAGCTG', '> seq1', 'seq1', 1, 26, 1);
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM Sequence;
+-------------+----------------------------+---------+-----------+---------+--------+---------+
| sequence_id | sequence                   | defline | accession | version | length | moltype |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
|           2 | ATGACGATCAGCATCAGCTACAGCTG | > seq1  | seq1      |       1 |     26 |       1 |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
1 row in set (0.03 sec)

mysql> SELECT * FROM Ontology;
+-------------+-------------+---------------------------------------------+
| ontology_id | term        | description                                 |
+-------------+-------------+---------------------------------------------+
|           3 | start codon | denotes an Methionine codon of a transcript |
|           4 | exon        | an exon in genomic sequence                 |
|           5 | exon type   | 3'UTR, initial, internal, terminal, 5'UTR   |
+-------------+-------------+---------------------------------------------+
3 rows in set (0.00 sec)

mysql>
INSERT a Location
mysql> SELECT * FROM Feature;
+------------+-------------+-------------+
| feature_id | sequence_id | ontology_id |
+------------+-------------+-------------+
|          1 |           2 |           3 |
+------------+-------------+-------------+
1 row in set (0.01 sec)

mysql> DESC Location;
+-------------+---------+------+-----+---------+----------------+
| Field       | Type    | Null | Key | Default | Extra          |
+-------------+---------+------+-----+---------+----------------+
| location_id | int(11) |      | PRI | NULL    | auto_increment |
| feature_id  | int(11) |      |     | 0       |                |
| start       | int(11) |      |     | 0       |                |
| stop        | int(11) |      |     | 0       |                |
| strand      | int(11) |      |     | 0       |                |
+-------------+---------+------+-----+---------+----------------+
5 rows in set (0.00 sec)

mysql>
Joining tables
mysql> SELECT * FROM Feature;
+------------+-------------+-------------+
| feature_id | sequence_id | ontology_id |
+------------+-------------+-------------+
|          1 |           2 |           3 |
|          2 |           2 |           4 |
|          3 |           2 |           4 |
+------------+-------------+-------------+
3 rows in set (0.04 sec)

Setting up a complex query
Consider sequence seq1 with the following features:
  Initial exon from 1..6
  Internal exon from 15..20
Note that with the relational model the term exon only appears once in the database

mysql> SELECT * FROM Sequence WHERE sequence_id = 2;
+-------------+----------------------------+---------+-----------+---------+--------+---------+
| sequence_id | sequence                   | defline | accession | version | length | moltype |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
|           2 | ATGACGATCAGCATCAGCTACAGCTG | > seq1  | seq1      |       1 |     26 |       1 |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
1 row in set (0.04 sec)
Complex query
mysql> SELECT * FROM Feature WHERE sequence_id = 2;
+------------+-------------+-------------+
| feature_id | sequence_id | ontology_id |
+------------+-------------+-------------+
|          1 |           2 |           3 |
|          2 |           2 |           4 |
|          3 |           2 |           4 |
+------------+-------------+-------------+
3 rows in set (0.03 sec)

mysql> SELECT * FROM Location;
+-------------+------------+-------+------+--------+
| location_id | feature_id | start | stop | strand |
+-------------+------------+-------+------+--------+
|           1 |          1 |     1 |    3 |      1 |
|           2 |          2 |     1 |    6 |      1 |
|           3 |          3 |    15 |   20 |      1 |
+-------------+------------+-------+------+--------+
3 rows in set (0.20 sec)
The relational model stores data efficiently and makes the data easy to modify.
What if the term exon changes to something else? It is stored once, in Ontology, so only that one row needs updating.
mysql> SELECT * FROM Ontology;
+-------------+-------------+---------------------------------------------+
| ontology_id | term        | description                                 |
+-------------+-------------+---------------------------------------------+
|           3 | start codon | denotes an Methionine codon of a transcript |
|           4 | exon        | an exon in genomic sequence                 |
|           5 | exon type   | 3'UTR, initial, internal, terminal, 5'UTR   |
+-------------+-------------+---------------------------------------------+
3 rows in set (0.00 sec)

mysql> SELECT * FROM Qualifier;
+--------------+------------+-------------+----------+
| qualifier_id | feature_id | ontology_id | value    |
+--------------+------------+-------------+----------+
|            1 |          2 |           5 | initial  |
|            2 |          3 |           5 | internal |
+--------------+------------+-------------+----------+
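The slides show the individual tables but not the query that ties them together. One way such a join might be written is sketched here, again assuming Python's built-in sqlite3 module in place of a MySQL server and loading only the rows shown in the slides; the exact query used in the workshop is not known, and the table definitions are cut down to the columns the join touches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Ontology  (ontology_id INTEGER PRIMARY KEY, term TEXT NOT NULL);
CREATE TABLE Feature   (feature_id INTEGER PRIMARY KEY,
                        sequence_id INTEGER NOT NULL,
                        ontology_id INTEGER NOT NULL);
CREATE TABLE Location  (location_id INTEGER PRIMARY KEY,
                        feature_id INTEGER NOT NULL,
                        start INTEGER NOT NULL, stop INTEGER NOT NULL,
                        strand INTEGER NOT NULL);
CREATE TABLE Qualifier (qualifier_id INTEGER PRIMARY KEY,
                        feature_id INTEGER NOT NULL,
                        ontology_id INTEGER NOT NULL,
                        value TEXT NOT NULL);
""")
# Rows copied from the SELECT output in the slides.
conn.executemany("INSERT INTO Ontology VALUES (?, ?)",
                 [(3, "start codon"), (4, "exon"), (5, "exon type")])
conn.executemany("INSERT INTO Feature VALUES (?, ?, ?)",
                 [(1, 2, 3), (2, 2, 4), (3, 2, 4)])
conn.executemany("INSERT INTO Location VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 3, 1), (2, 2, 1, 6, 1), (3, 3, 15, 20, 1)])
conn.executemany("INSERT INTO Qualifier VALUES (?, ?, ?, ?)",
                 [(1, 2, 5, "initial"), (2, 3, 5, "internal")])

# Every feature of sequence 2, with its term, optional qualifier and location.
rows = conn.execute("""
    SELECT o.term, q.value, l.start, l.stop
    FROM Feature f
    JOIN Ontology o       ON o.ontology_id = f.ontology_id
    JOIN Location l       ON l.feature_id  = f.feature_id
    LEFT JOIN Qualifier q ON q.feature_id  = f.feature_id
    WHERE f.sequence_id = 2
    ORDER BY l.start, l.stop
""").fetchall()
for row in rows:
    print(row)
```

The LEFT JOIN keeps the start codon in the result even though it has no qualifier; a plain JOIN on Qualifier would return only the two exons.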
Aggregate queries
mysql> SELECT * FROM Sequence;
+-------------+----------------------------+---------+-----------+---------+--------+---------+
| sequence_id | sequence                   | defline | accession | version | length | moltype |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
|           2 | ATGACGATCAGCATCAGCTACAGCTG | > seq1  | seq1      |       1 |     26 |       1 |
|           3 | SLKLSKLPSPLYQVCLE          | > seq2  | L32174    |       1 |     17 |       3 |
|           4 | MASQQQCGAR                 | > seq   | seq3      |       1 |     10 |       3 |
+-------------+----------------------------+---------+-----------+---------+--------+---------+

mysql> SELECT count(*),
+----------+---------+
| count(*) | moltype |
+----------+---------+
|        1 |       1 |
|        2 |       3 |
+----------+---------+
2 rows in set (0.08 sec)
Using LIMIT
mysql>
+-------------+----------------------------+---------+-----------+---------+--------+---------+
| sequence_id | sequence                   | defline | accession | version | length | moltype |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
|           2 | ATGACGATCAGCATCAGCTACAGCTG | > seq1  | seq1      |       1 |     26 |       1 |
|           3 | SLKLSKLPSPLYQVCLE          | > seq2  | L32174    |       1 |     17 |       3 |
+-------------+----------------------------+---------+-----------+---------+--------+---------+
2 rows in set (0.08 sec)
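The aggregate and LIMIT statements themselves are cut off in the slides, so the queries below are a guess at their shape based on the output that survives: a count per moltype, and the first two rows of Sequence. The sketch assumes Python's built-in sqlite3 module instead of a MySQL server and a Sequence table reduced to four columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Sequence (
        sequence_id INTEGER PRIMARY KEY,
        sequence    TEXT NOT NULL,
        accession   TEXT NOT NULL,
        moltype     INTEGER NOT NULL
    )
""")
# The three rows shown in the slides.
conn.executemany(
    "INSERT INTO Sequence VALUES (?, ?, ?, ?)",
    [
        (2, "ATGACGATCAGCATCAGCTACAGCTG", "seq1", 1),
        (3, "SLKLSKLPSPLYQVCLE", "L32174", 3),
        (4, "MASQQQCGAR", "seq3", 3),
    ],
)

# Aggregate: how many sequences are there of each molecule type?
counts = conn.execute(
    "SELECT count(*), moltype FROM Sequence GROUP BY moltype ORDER BY moltype"
).fetchall()

# LIMIT: return only the first two rows of an ordered result.
first_two = conn.execute(
    "SELECT sequence_id, accession FROM Sequence ORDER BY sequence_id LIMIT 2"
).fetchall()

print(counts)
print(first_two)
```

An explicit ORDER BY is used with LIMIT because without one the database is free to return any two rows.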
UPDATING a table
mysql> SELECT * FROM Qualifier;
+--------------+------------+-------------+----------+
| qualifier_id | feature_id | ontology_id | value    |
+--------------+------------+-------------+----------+
|            1 |          2 |           5 | initial  |
|            2 |          3 |           5 | internal |
+--------------+------------+-------------+----------+
2 rows in set (0.00 sec)

mysql>
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql>
Query OK, 1 row affected (0.04 sec)

mysql> SELECT * FROM Qualifier;
+--------------+------------+-------------+---------+
| qualifier_id | feature_id | ontology_id | value   |
+--------------+------------+-------------+---------+
|            1 |          2 |           5 | initial |
+--------------+------------+-------------+---------+
1 row in set (0.03 sec)
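The UPDATE and DELETE statements did not survive in the slides; only their "Query OK" responses did. The sketch below shows statements of the kind that would leave the Qualifier table in the final state shown above. It assumes Python's built-in sqlite3 module in place of a MySQL server, and the new value 'terminal' is purely hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Qualifier (
        qualifier_id INTEGER PRIMARY KEY,
        feature_id   INTEGER NOT NULL,
        ontology_id  INTEGER NOT NULL,
        value        TEXT NOT NULL
    )
""")
conn.executemany("INSERT INTO Qualifier VALUES (?, ?, ?, ?)",
                 [(1, 2, 5, "initial"), (2, 3, 5, "internal")])

# UPDATE changes column values in the rows matched by WHERE.
# 'terminal' is an invented value, used only for illustration.
updated = conn.execute(
    "UPDATE Qualifier SET value = ? WHERE qualifier_id = ?",
    ("terminal", 2)).rowcount

# DELETE removes the rows matched by WHERE.
deleted = conn.execute(
    "DELETE FROM Qualifier WHERE qualifier_id = ?", (2,)).rowcount
conn.commit()

remaining = conn.execute("SELECT * FROM Qualifier").fetchall()
print(updated, deleted, remaining)
```

Leaving out the WHERE clause would apply the UPDATE or DELETE to every row in the table, so it is worth running the same WHERE in a SELECT first.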
Optimisation
Perking up MySQL
Queries
Database server
Indexing
In general, indexing your data makes retrieval orders of magnitude faster
Consider a list of 1,000,000 sequences with accession numbers
You need to find the one sequence with accession number AC123456
Without an index on the accession field, the lookup takes O(1,000,000) operations
  Equivalent to scanning through a list
With an index on the accession field, the lookup takes O(log(1,000,000)) operations: about 6 steps counted in base 10, about 20 in base 2
  Somewhat like a binary search of a sorted list (a hash table lookup would be O(1))
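The difference between the two lookup costs can be counted directly. This sketch uses a plain sorted Python list as a simplified stand-in for an index (a real MySQL index is typically a B-tree, which is not modelled here); the accession numbers are generated, and the two step-counting functions are written only for this illustration.

```python
import math

# One million made-up accession numbers; zero-padding keeps them sorted.
accessions = [f"AC{n:06d}" for n in range(1_000_000)]
target = "AC123456"


def linear_steps(items, wanted):
    """Count comparisons when scanning the list from the start."""
    for steps, item in enumerate(items, start=1):
        if item == wanted:
            return steps
    return len(items)


def binary_steps(items, wanted):
    """Count comparisons for a binary search over a sorted list."""
    lo, hi, steps = 0, len(items), 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if items[mid] < wanted:
            lo = mid + 1
        else:
            hi = mid
    return steps, lo


scan_cost = linear_steps(accessions, target)
search_cost, position = binary_steps(accessions, target)
bound = math.ceil(math.log2(len(accessions)))

print("unindexed scan comparisons:", scan_cost)
print("sorted search comparisons: ", search_cost, "(upper bound", bound, ")")
```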
Types of indexes
PRIMARY KEY
To identify the main accessor field of the table
UNIQUE
Constraint to ensure that all entries in a field are different
INDEX
Creates a way to quickly search on a given field
FULLTEXT
For keyword searches within large TEXT fields (> 255 characters)
Drawbacks to indexing
  Need more disk space
  Can slow down inserts
Know your data and the queries you will perform on the data
  Only index fields you think you will query on
  Requires spending time in the design phase to define requirements of the database
Creating an index
mysql> CREATE INDEX acindex ON Sequence (accession);
Query OK, 1 row affected (0.18 sec)
Records: 1  Duplicates: 0  Warnings: 0
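To check that an index is actually used, most databases can report their query plan. The sketch below assumes Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN statement rather than a MySQL server (MySQL has its own EXPLAIN with different output); the table holds generated accession numbers and the query_plan helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sequence ("
    " sequence_id INTEGER PRIMARY KEY,"
    " accession TEXT NOT NULL)"
)
# Made-up accession numbers, only to give the table some content.
conn.executemany("INSERT INTO Sequence (accession) VALUES (?)",
                 [(f"AC{n:06d}",) for n in range(1000)])


def query_plan(connection):
    """Return SQLite's plan for a lookup by accession as one string."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM Sequence WHERE accession = ?",
        ("AC000456",)).fetchall()
    return " ".join(str(row[-1]) for row in rows)  # last column is the detail


plan_before = query_plan(conn)   # full table scan
conn.execute("CREATE INDEX acindex ON Sequence (accession)")
plan_after = query_plan(conn)    # search using acindex

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan differs between SQLite versions, which is why the sketch prints it rather than relying on a fixed string.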
DBA
Variables (--variable-name=value)
and boolean options {FALSE|TRUE}  Value (after reading options)
--------------------------------- -----------------------------
basedir                           /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/
bdb-home                          (No default value)
bdb-logdir                        (No default value)
bdb-tmpdir                        (No default value)
bind-address                      (No default value)
console                           FALSE
chroot                            (No default value)
character-sets-dir                /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/share/mysql/charsets/
datadir                           /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/data/
default-character-set             latin1
enable-locking                    FALSE
enable-pstack                     FALSE
gdb                               FALSE
innodb_data_home_dir              (No default value)
innodb_log_group_home_dir         (No default value)
innodb_log_arch_dir               (No default value)
innodb_flush_log_at_trx_commit    1
innodb_flush_method               (No default value)
innodb_fast_shutdown              TRUE
innodb_max_dirty_pages_pct        90
init-file                         (No default value)
log                               (No default value)
language                          /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/share/mysql/english/
local-infile                      TRUE
log-bin                           (No default value)
log-bin-index                     (No default value)
log-isam                          myisam.log
log-update                        (No default value)
log-slow-queries                  (No default value)
log-slave-updates                 FALSE
low-priority-updates              FALSE
master-host                       (No default value)
master-user                       test
master-port                       3306
master-connect-retry              60
master-retry-count                86400
master-info-file                  master.info
master-ssl                        FALSE
master-ssl-key                    (No default value)
master-ssl-cert                   (No default value)
master-ssl-capath                 (No default value)
master-ssl-cipher                 (No default value)
myisam-recover                    OFF
memlock                           FALSE
disconnect-slave-event-count      0
abort-slave-event-count           0
max-binlog-dump-events            0
sporadic-binlog-dump-fail         FALSE
new                               FALSE
old-protocol                      10
old-rpl-compat                    FALSE
pid-file                          /raid/db/mysql/mysql-max4.0.14-pc-linux-i686/data/watson.pid
log-error
port                              3306
report-host                       (No default value)
report-user                       (No default value)
report-password                   (No default value)
report-port                       3306
rpl-recovery-rank                 0
relay-log                         (No default value)
relay-log-index                   (No default value)
safe-user-create                  FALSE
server-id                         1
show-slave-auth-info              FALSE
concurrent-insert                 TRUE
skip-grant-tables                 FALSE
skip-slave-start                  FALSE
relay-log-info-file               relay-log.info
slave-load-tmpdir                 /raid/tmp/
socket                            /tmp/mysql.sock
sql-bin-update-same               FALSE
sql-mode                          OFF
temp-pool                         TRUE
tmpdir                            /raid/tmp
external-locking                  FALSE
use-symbolic-links                TRUE
symbolic-links                    TRUE
log-warnings                      FALSE
warnings                          FALSE
back_log                          50
bdb_cache_size                    8388600
bdb_log_buffer_size               0
bdb_max_lock                      10000
bdb_lock_max                      10000
binlog_cache_size                 32768
connect_timeout                   5
delayed_insert_timeout            300
delayed_insert_limit              100
delayed_queue_size                1000
flush_time                        0
ft_min_word_len                   4
ft_max_word_len                   254
ft_max_word_len_for_sort          20
ft_stopword_file                  (No default value)
innodb_mirrored_log_groups        1
innodb_log_files_in_group         2
innodb_log_file_size              5242880
innodb_log_buffer_size            1048576
innodb_buffer_pool_size           8388608
innodb_additional_mem_pool_size   1048576
innodb_file_io_threads            4
innodb_lock_wait_timeout          50
innodb_thread_concurrency         8
innodb_force_recovery             0
interactive_timeout               28800
join_buffer_size                  131072
key_buffer_size                   402653184
long_query_time                   10
lower_case_table_names            FALSE
max_allowed_packet                1047552
max_binlog_cache_size             4294967295
max_binlog_size                   1073741824
max_connections                   100
max_connect_errors                10
max_delayed_threads               20
max_heap_table_size               16777216
To see what values a running MySQL server is using, type 'mysqladmin variables' instead of 'mysqld --help'.
Tuning the system to your needs
Need to think about uses of the database
  How many concurrent connections?
  Will there be large records?
  Will there be repetitive queries?
  Will I need large indexes?
Tuning the system can give huge gains in performance and lets you get the most out of the system
Important parameters
max_allowed_packet
Largest amount of data to be transmitted to the client in 1 packet
max_connections
The largest number of concurrent connections to the database server
datadir
The location of the data files on the system
query_cache_size
Size of cache for repetitive queries
Communicating with MySQL
Through the Unix command line
  MySQL client
  Comes with MySQL
Through APIs (Application Programming Interfaces)
  MySQL C API
  Perl DBI
  MySQL++ (C++)
http://dev.mysql.com/downloads/other/plusplus/
Choose the method that is right for the job
Administration
  MySQL CC
  phpMyAdmin
Standalone application
  APIs
Web application
  PHP/Java servlets
Low-throughput queries
  Command line client
Topics not covered
MySQL tools
  mysqldump
    Tool to dump a schema, the data, or both
  mysqlimport
    Tool to import delimited files
    Look before you parse!
  mysqladmin
    For DBAs to create databases, change passwords, etc.
Read the MySQL documentation
Summary
Relational databases are necessary in bioinformatics
Relational databases allow us to efficiently store and query large amounts of data
MySQL is a good choice of RDBMS engine because it is highly functional at no cost
Resources
MySQL
http://www.mysql.com
http://dev.mysql.com.mysql/en/index.html
http://www.mysql.com/products/mysqlcc/
http://dev.mysql.com/doc/connector/j/en