Computational Biology Lab

Tutorial for open queries to the Chili database 

Here you may learn to query the chili database by following examples.
To communicate with the database you will use statements in the SQL (structured query language)

SQL statements have the general form

    select <which variable(s)> from <which table(s)> where <one or more conditions> <other modifiers>

The only compulsory words are select and from. But you must tell which variable(s) and from which table(s) you want to display information.

There are many tutorials to learn SQL, for example this one (I did not reviewed it).

To see the tables that exist in the database you need to consult the ChiliTables page. There you will also see the variables ("Field") that compose each table. When you have a SQL statement ready, you must copy it in the window of the "Open Query". You need to input your statement into the "Consulta" field and press "Enviar". Then the results will be displayed, and you can optionally save them with the "Exportar" option.

Note that by default, in the "Open Query" window you will obtain ONLY the first 10 results of your query; however, when you export ("Exportar") your results, ALL of them will be send to the corresponding file.

If you send an incorrect query, an error message will be displayed; but do not worry, nothing terrible will happen.

To really learn from the examples presented here, you must make some variations and see the results of those. Try to understand the logic of the queries. By simply copying and pasting the statements you will NOT learn anything! Interesting variations that you can try include modifying the variables, tables and or conditions. Be creative and experiment! Statements that you can input are highlighted in dark red. First, just copy and paste the statement, then try some variations.

Using a single table

1. Looking at all values (rows) in all variables (columns) of a given table.
select * from seq_ident
Comment: Here the asterisk ("*")  function as a wildcard meaning "all variables". Also because we do not give any condition after the name of the table, all rows will be selected. This kind of queries (without conditions) are likely to give huge results, thus you must be careful (in some cases it can take many seconds to perform the query and very large files will be generated). Try substituting the asterisk by one or more of the variables that exist in the table. Do not use this statement with other tables, by the reason explained before.

2. Restricting our results with some condition(s)
select * from seq_ident where seq_id<=10
Comment: Here only the records for which the identifier of the sequence ("seq_id") is smaller or equal than ten will be presented.

3. One good way to evaluate the likeness of two sequences is the "bitscore" given by BLAST (this is a better way to evaluate the quality of a blast result, better than the "Expected value" or E-value). Let's see some sequences that are identified with high certainty
select contig, gene, perc_align, bitscore from seq_ident where bitscore>5500
Comment: This results in only two genes; note that one of the genes is reported with a percentage of alignment (perc_align) larger than 100. This is an error than needs to be check and corrected in the database!

4. Counting and obtaining statistics from a table. Let's count how many rows we have in the seq_ident table  and at the same time obtain the minimum (min), average (avg), maximum (max) and standard deviation ("std" in MySQL, "S" in usual notation) for the length of the BLAST alignment (length_alin_bp) that was used to identify the sequences.
select count(*) 'rows', min(length_alin_bp) 'minAlin', round( avg(length_alin_bp)) 'AvgAlin', max(length_alin_bp) 'MaxAlin', round(std(length_alin_bp)) 'S' from seq_ident
Comment: We see that, even when the average length of the alignment used to identify a sequence is large enough (around 750 bp), there are at least some sequences which were identified with a short alignment. How many are there?

5. Making the previous statment limited to alignment with less than 100 bp.
select count(*) 'rows', min(length_alin_bp) 'minAlin', round( avg(length_alin_bp)) 'AvgAlin', max(length_alin_bp) 'MaxAlin', round(std(length_alin_bp)) 'S' from seq_ident where length_alin_bp<100
Comment: We see that there are around 1500 sequences with poor identification.


6. We can count (and obtain statistics) categorizing for some specific variable. In the table "seq_ident" the variable "BLAST_DB" indicates from which database was obtained the identification. Let's see the average length of alignment and bit score for each one of these databases.
select BLAST_DB, count(*) 'sequences', round(avg(length_alin_bp)) 'AvgAlin', round(avg(bitscore)) 'AvgBS' from seq_ident group by BLAST_DB order by count(*) desc
Comment: Can you guess what is the meaning of each one of the words in this statement? Experiment by letting out one of the words at the time and see what happens.

7. Looking at partial matches with a variable. Assume that you want to find putative transcription factors in the database. We are going to look for a partial match between the description and the phrase of interest.
select seq_name, description from seq_ident where description like '%transcription factor%'
Comment: Here the percent character, "%" is used as a wild card meaning "any thing"; thus in words we are lokking for sequences in which the description is "anything" followed by "transcription factor" followed by "anything". Note that different sequences could give the same hit, thus we need to count...

8. Counting different instances in the previous statement
select count(*) 'sequences', count(distinct contig) 'contigs', count(distinct gene) 'genes', count(distinct description) 'descriptions' from seq_ident where description like '%transcription factor%'
Comment: Note that there is the same number of rows and contigs (because identification was performed at contigs level), however, the number of genes (identifying a sequence) is less than the number of contigs. This means that, for some case, different contigs are identified with the same gene. Moreover, the number of descriptions is less than the number of genes, thus there are cases where different genes have the same description. Nothing is as simple as seen at first sight!


Using more than one table: The real power of relational databases and SQL

9. Now assume that we want to get the sequences that code for transcription factors. The sequences are in the table "seq", while their descriptions are in table "seq_ident". We need to link ("relation") these two tables by at least one of the variables that they have in common. Note that "seq_name" in "seq" and "seq_ident" are the same variable, thus we can obtain the (different) sequences representing putative transcription factors by something like:
select seq.seq_name 'seq_name', seq 'sequence', description from seq, seq_ident where seq.seq_name=seq_ident.seq_name and description like '%transcription factor%'
Comment: Note that there will be some sequences that have the same identifier.

10. Selecting only very high quality identifications of transcription factors by asking for a very high bitscore
select seq.seq_name 'seq_name', gene, seq 'sequence', description from seq, seq_ident where seq.seq_name=seq_ident.seq_name and description like '%transcription factor%' and bitscore>1000
Comment: Note that these sequences are large and very well identified.

11. Looking for a DNA motiv in a sequence. Even when there is not the best way (that is why Blast was designed), it is possible to quickly see if a particular strech of DNA is present in one or more sequences of the database
select seq.seq_name 'seq_name', description from seq, seq_ident where seq.seq_name=seq_ident.seq_name and seq like '%TTT%TGA%GTC%ACTTGCACCCTT%'
Comment: Note that we use various wild cards ("%" characters) in our comparison

12. Selecting sequences that are putatively involved with  CAPSAICINOID BIOSYNTHESIS. First, let's obtain only the seq_name, gene and description from the seq_ident (conditioning to find the term of interest in the table annotKegPathExpress)
select seq_name, gene, description from seq_ident, annotKegPathExpress where gene=id_gene and  pathway_hit like '%CAPSAICINOID BIOSYNTHESIS%'

13. Finally we are going to modify the previous statement to obtain also the sequence. We will also be more strict, asking that the sequence is identified by a large bit score and we will order the sequences by their description. Note that in this case we are making a query from three different tables and thus there are various equalities that must be fulfilled in order to select only the correct sequences.
select seq.seq_name 'seq_name', gene, description, seq from seq_ident, annotKegPathExpress, seq where seq.seq_name=seq_ident.seq_name and gene=id_gene and  pathway_hit like '%CAPSAICINOID BIOSYNTHESIS%' and bitscore>300 order by description
Comment: Note that the condition " seq.seq_name=seq_ident.seq_name " imply that we will be selecting the same entities from these tables ("seq" and "seq_ident"), while the condition " gene=id_gene " implies that the "gene" selected from "seq_ident" will be the same than the one from "annotKegPathExpress". Also, we ask that the pathway hit must contain the phrase 'CAPSAICINOID BIOSYNTHESIS' somewhere, and we select only hits (concordances) for which the bit score is at leas 300 and order the results by the variable description.

Investigating gene expression results

Not all tables with results are presented in this version of the database (that will change in the near future). However, table "discoveries" presents a summary of the statistical results aimed for non-statisticians.
We will see the power of datamining this table, coupled with others.

14. In how many genes do we have significant (at false discovery rate of 1%) discoveries of differential expression at each one of the contrasts?
select sum(D_DE) 'general DE', sum(abs(I10to20)) 'DE 10 to 20', sum(abs(I20to40)) 'DE 20 to 40', sum(abs(I10to20)) 'DE 40 to 60' from discoveries
Comment: In here we add (sum) the values of the indicator variables that denote a discoverie for the experiment as a whole (D_DE) as well as the absolute values of the indicatos of discoveries at each one of the contrasts (10 to 20 DAA, etc). The use of the absolute value (abs) is needed because these variables (I10to20, I20to40 and I40to60) are signed indicators, being 1 if there was an increse in value between the treatments and -1 if there was a decrease. These variables are zero if the change was not significant..

15. Let's investigate how many genes had a significant change in expression between the 10 and 20 DAA treatments. We can also see the net (sum) and average change (avg) per gene.
select I10to20, count(*) 'n_genes', sum(ch10to20) 'netChangeTPM', round(avg(ch10to20)) 'avgChangeTPM' from discoveries group by I10to20
Comment: Variable "I10to20" has values 1 if there was a positive (increase) and significant change from 10 to 20 DAA; a value of 0 if the change was not significant and a value of -1 if there was a significant decrease from 10 to 20 DAA. On the other hand, the variable "ch10to20" has the change in TPM (transcripts per million) for each gene. As excercise, investigate the changes between the other two transitions (from 20 to 40 and 40 to 60 DAA). Which of the transitions had more changes in gene expression?

16. How many putative transcription factors significantly changed their expression (increasing or decreasing) from the 10 to the 20 DAA?
select count(*) 'n_TF' from discoveries, seq_ident where gene=id_gene and abs(I10to20)=1 and description like '%transcription factor%'
Comment: Here we use the absolute value of "I10to20" to include both the changes increasing (1) and decreasing (-1).

17. We want to know which transcription factors changed and how large was the change between 10 and 20 DAA.
select distinctrow id_gene, ch10to20, description from discoveries, seq_ident where gene=id_gene and abs(I10to20)=1 and description like '%transcription factor%' order by ch10to20 desc
Comment: We ask to see only the rows that present distinct results. Ask to see the name of the gene, the net change in the period and the description where the conditions set in the previous statement are fulfilled.

18. Assume that a researcher is interested in the 'RNA splicing, via endonucleolytic cleavage and ligation'. How many genes annotated with this biological process (BP) change at each one of the transitions? and Which are the sizes of the changes from one stage to the other?
select distinctrow discoveries.id_gene 'id_gene', seq_ident.description 'Description', ch10to20, ch20to40, ch40to60 from annotBPexpress, seq_ident, discoveries where annotBPexpress.id_gene=seq_ident.gene and annotBPexpress.id_gene=discoveries.id_gene and seq_ident.gene=discoveries.id_gene and annotBPexpress.description='RNA splicing, via endonucleolytic cleavage and ligation' and ((I10to20 !=0)or(I20to40!=0)or(I40to60!=0)) order by abs(ch10to20)+abs(ch20to40)+abs(ch40to60) desc
Comment: This complex query get information from three different tables. Study it and modify it to look for genes that are annotated in other procecess, for example 'RNA processing' or 'gene silencing by RNA'.