Tutorial

The user can query the ConjoinG database in several ways. The ConjoinG ‘Query’ page is divided into the following two sections:

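General Queries: This section provides basic and frequently used queries, including: chromosome, gene symbol, mRNA accession, function keyword(s), experimental status, disorder implicated, and known conjoined genes.

Advanced Queries: This section provides additional, in-depth queries about the CGs, including: proteins formed by the CGs, tissue expression information, genomic features and functional information of parent genes, splicing patterns of CGs, and conservation of CGs in other vertebrate genomes.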

An example query page is shown below:

General Query Options

Chromosome: Search for CGs present on one or more chromosomes, as shown in the figure.
Gene Symbol: Search for CGs by gene symbol for either the CG or parent genes. As text is entered, matching suggestions are automatically listed.
By default, the gene symbol entered is searched for in the ‘official HGNC gene symbol’ list and the list of all gene aliases. However, if the ‘Exclude Aliases’ option is checked, then the search is made only in the ‘official HGNC gene symbol’ list.
mRNA Accession: Search for CGs by mRNA or EST accession number. This search field has a built-in wildcard search option. For example, the following query will return all CGs which are supported by mRNA or EST accessions containing the letter 'A'.
Function Keyword(s): Search for CGs formed by parent genes whose annotation contains the user-entered function keyword(s). This search is performed on the following parent gene data fields: HGNC official gene symbol and full name, aliases, summary, Gene Ontology, and KOGs. CGs which are already found in the NCBI Entrez Gene database will also be included in the search. This search allows the following Boolean operators: AND, OR, and NOT.
For example, the query shown here will return all hits which contain the keywords ‘cyto’ and ‘Meta’. Similarly, the query shown here will return all hits which do not contain the keyword ‘cyto’.
Experimental Status: Search for CGs according to their experimental status. For a subset of CGs, we have attempted to confirm them by RT-PCR and sequencing of the PCR products in 16 human tissues.
Confirmed: This search option returns those CGs which we have experimentally confirmed in one or more tissues.
Attempted but not confirmed: This search option returns those CGs which we attempted but failed to experimentally confirm in any of the 16 tissues.
Not attempted experimentally: This search option returns those CGs for which we have not yet performed any experimental confirmation.
Disorder Implicated: Search for CGs which are formed by one or more parent genes implicated in a genetic disorder. As the disorder name is entered, matching suggestions included in the database are automatically listed.
Known Conjoined Genes: If checked, the user can search for CGs which are currently found in the NCBI Entrez Gene database.
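The Boolean behaviour described for the Function Keyword(s) search can be illustrated with a minimal Python sketch. This is not the ConjoinG implementation: the function name matches, the record IDs, and the annotation strings are hypothetical, and case-insensitive substring matching is an assumption made for illustration.

```python
def matches(text, all_of=(), any_of=(), none_of=()):
    """Return True if text satisfies the keyword constraints.

    all_of  -> keywords joined by AND
    any_of  -> keywords joined by OR
    none_of -> keywords preceded by NOT
    Assumption: keywords match as case-insensitive substrings.
    """
    haystack = text.lower()

    def has(keyword):
        return keyword.lower() in haystack

    if not all(has(kw) for kw in all_of):
        return False
    if any_of and not any(has(kw) for kw in any_of):
        return False
    return not any(has(kw) for kw in none_of)


# Hypothetical annotation strings, for illustration only.
records = {
    "CG_A": "cytochrome involved in metabolic process",
    "CG_B": "cytoskeleton organization",
    "CG_C": "metal ion binding",
}

# 'cyto' AND 'Meta'
both = [cg for cg, text in records.items() if matches(text, all_of=["cyto", "Meta"])]
# NOT 'cyto'
not_cyto = [cg for cg, text in records.items() if matches(text, none_of=["cyto"])]

print(both)      # ['CG_A']
print(not_cyto)  # ['CG_C']
```

Passing the three keyword groups as separate arguments avoids having to parse a query string, which keeps the sketch short; the actual query syntax accepted by the search box is shown only in the figures.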
Advanced Query Options
The second section provides some additional, in-depth queries about the CGs, including: proteins formed by the CGs, tissue expression information, genomic features and functional information of parent genes, splicing patterns of CGs, and conservation of CGs in other vertebrate genomes. Following are more detailed explanations and information about these additional query options. By default, the queries in this section are collapsed and can be expanded by clicking anywhere on the query title:
Under the Protein Formed by Conjoined Genes query there are three sub-queries.
CG CDS Type: Search for CGs according to the type of CDS formed.
Chimeric CDS: This search option returns all CGs where the CDS is predicted to be translated in the same reading frame as both the parent genes, thereby forming a Chimeric protein.
Same as 5’ gene CDS: This search option returns all CGs where the CDS is predicted to be translated in the same reading frame as the 5' parent gene.
Same as Middle gene CDS: This search option returns all CGs where the CDS is predicted to be translated in the same reading frame as the middle parent gene.
Same as 3’ gene CDS: This search option returns all CGs where the CDS is predicted to be translated in the same reading frame as the 3' parent gene.
Novel CDS: This search option returns all CGs where the CDS is predicted to be translated in a different frame than either of the parent genes, thereby forming a novel protein.
Both 5’ and 3’ CDS: This search option returns all CGs which are supported by more than one mRNA such that two different CDSs are predicted, one in the same reading frame as the 5’ parent gene and the other in that of the 3’ parent gene.
No CDS: This search option returns all CGs for which no reliable (longest ORF starting with a methionine) CDS could be predicted.
No Prediction: This search option returns all those CGs for which no clear prediction could be made.
CG Transcript NMD Prediction: Search for CGs according to their nonsense-mediated decay (NMD) prediction.
NMD predicted: This search option returns all CGs which contain a premature stop codon and are expected to undergo NMD.
NMD not predicted: This search option returns all CGs which do not contain a premature stop codon and thus are not expected to undergo NMD.
Insufficient information: This search option returns all CGs for which no NMD prediction could be made because either (a) their 3’ ends extend beyond the 3’ end of the downstream parent gene, thereby including a novel exon, or (b) no ORF could be found or predicted for that CG.
CG Protein Length: Search for CGs according to predicted protein length. Select one of four protein length categories: very small (<100 aa), small (100-300 aa), medium (301-500 aa), or large (>500 aa) proteins.
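The four protein length categories listed above translate directly into a small lookup. The following Python sketch uses the boundaries stated in the tutorial; the function name protein_length_category is hypothetical.

```python
def protein_length_category(length_aa):
    """Map a predicted protein length (in amino acids) to a ConjoinG length category."""
    if length_aa < 100:
        return "very small"   # <100 aa
    if length_aa <= 300:
        return "small"        # 100-300 aa
    if length_aa <= 500:
        return "medium"       # 301-500 aa
    return "large"            # >500 aa


for length in (99, 100, 300, 301, 500, 501):
    print(length, protein_length_category(length))
```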
Under the Tissue Expression Information query there are three sub-queries.

Parent Genes: Search for CGs which are formed by parent genes found expressed in tissues containing the keyword entered by the user. Information about the expression in cancer or tumor tissues is obtained from the NCBI UniGene database. The search field has auto-suggest and lists all the tissues for the parent genes included in the database containing the keyword entered, as shown in the figure below. For example, by entering the keyword "kid" in the search box, all CGs formed by the parent genes found expressed in kidney, fetal kidney, and kidney tumor tissues are returned.
CG mRNA: Search for CGs which are found expressed in tissues containing the keyword entered by the user. For this search, the tissue expression information obtained from our experiments is pooled with that obtained from the NCBI GenBank database for each CG mRNA accession. The search field has auto-suggest and lists all the tissues for the CGs included in the database containing the keyword entered, as shown in the figure below. For example, by entering the keyword "liv" in the search box, all CGs found expressed in liver, fetal liver, and liver tumor tissues are returned.
CGs expressed in Cancer or Tumor Tissues: If checked, the user can search for those CGs which are found expressed in any cancer or tumor tissues. Information about expression in cancer and tumor tissues is obtained from the NCBI GenBank database for each CG mRNA accession.
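The keyword matching described for the tissue search fields ("kid" returning kidney, fetal kidney, and kidney tumor) behaves like a substring filter. A minimal Python sketch follows; the function name suggest_tissues is hypothetical, the tissue list is a small illustrative sample, and case-insensitive matching is an assumption.

```python
def suggest_tissues(keyword, tissues):
    """Return the tissue names that contain the keyword (case-insensitive), sorted."""
    needle = keyword.strip().lower()
    if not needle:
        return []
    return sorted(t for t in tissues if needle in t.lower())


# Illustrative sample only; the database holds many more tissue names.
tissues = [
    "kidney", "fetal kidney", "kidney tumor",
    "liver", "fetal liver", "liver tumor",
]

print(suggest_tissues("kid", tissues))  # ['fetal kidney', 'kidney', 'kidney tumor']
print(suggest_tissues("liv", tissues))  # ['fetal liver', 'liver', 'liver tumor']
```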
Under the Parent Genes’ Genomic Features query there are five sub-queries.

Arrangement of Parent Genes: Search for CGs by how the parent gene coding regions are arranged with respect to each other on the genomic DNA. The parent gene pairs have been divided into four categories: "Overlapping" (a pair of adjacent genes whose coding regions partially overlap), "Non-overlapping" (a pair of adjacent genes whose coding regions do not overlap), "Gene-within-gene" (the same gene encodes two different sequences due to a frame shift (overlapping coding region) and produces two completely different proteins), and "No Information" (where no information could be obtained because the coordinates of one or more of the parent genes are unavailable).
Distance between Parent Genes: Search for CGs formed by parent genes separated by a distance equal to (=), less than (<), or greater than (>) a value in kilobases specified by the user.
No. of Parent Genes: Search for CGs by the number of parent genes that form them, as specified by the user. In our database, the vast majority (734) of the CGs are formed by just two parent genes, while a few rare cases are formed by three (13) or four (4) parent genes.
Gene Family: Search for CGs formed by parent genes belonging to the same, different, or unknown gene families.
NCBI Entrez Gene ID: Search for CGs formed by the parent gene with the NCBI Entrez Gene ID specified by the user.
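The Distance between Parent Genes filter (=, <, or > a value in kilobases) can be sketched in Python as follows. How ConjoinG measures the distance is not stated in this tutorial, so the sketch assumes it is the gap between the end of the upstream gene and the start of the downstream gene; the function names, pair labels, and coordinates are hypothetical.

```python
import operator

# The three comparison operators offered by the query form.
COMPARATORS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def intergenic_distance_kb(upstream, downstream):
    """Gap in kilobases between two (start, end) intervals in base pairs.

    Assumption: distance is the gap from the upstream gene's end to the
    downstream gene's start; overlapping genes are given a distance of 0.
    """
    gap_bp = max(0, downstream[0] - upstream[1])
    return gap_bp / 1000


def distance_matches(distance_kb, op, threshold_kb):
    """Apply one of '=', '<', '>' to a distance and a user-supplied threshold."""
    return COMPARATORS[op](distance_kb, threshold_kb)


# Hypothetical parent gene coordinates (start, end) in base pairs.
pairs = {
    "pair_1": ((1000, 5000), (7500, 12000)),
    "pair_2": ((20000, 30000), (95000, 110000)),
}

# Keep the pairs whose parent genes are less than 10 kb apart.
close_pairs = [
    name for name, (up, down) in pairs.items()
    if distance_matches(intergenic_distance_kb(up, down), "<", 10)
]
print(close_pairs)  # ['pair_1']
```

Exact equality on fractional kilobase values is fragile, so a real implementation would probably compare rounded values or base-pair integers for the "=" case.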
Under the Parent Genes’ Functional Information query there are four sub-queries.

KOGs Functional Category: Search for CGs formed by the parent genes classified under the KOGs functional category selected by the user.
The following functional categories from the eukaryotic clusters of orthologous groups (KOGs) have been used:

Description | Code | Functional Class
Cell division and chromosome partitioning | D | Cellular processes
Cell envelope biogenesis, outer membrane | M | Cellular processes
Cell motility and secretion | N | Cellular processes
Posttranslational modification, protein turnover, chaperones | O | Cellular processes
Inorganic ion transport and metabolism | P | Cellular processes
Signal transduction mechanisms | T | Cellular processes
Translation, ribosomal structure and biogenesis | J | Information storage and processing
Transcription | K | Information storage and processing
DNA replication, recombination and repair | L | Information storage and processing
Energy production and conversion | C | Metabolism
Amino acid transport and metabolism | E | Metabolism
Nucleotide transport and metabolism | F | Metabolism
Carbohydrate transport and metabolism | G | Metabolism
Coenzyme metabolism | H | Metabolism
Lipid metabolism | I | Metabolism
Secondary metabolites biosynthesis, transport and catabolism | Q | Metabolism
General function prediction only | R | Poorly characterized
Function unknown | S | Poorly characterized
GO Function: Search for CGs formed by the parent genes classified with the Gene Ontology Function containing the keyword entered by the user. The search field has auto-suggest and lists all the GO Functions for the parent genes included in the database containing the entered keyword, as shown in the figure below.
GO Process: Search for CGs formed by the parent genes classified with the Gene Ontology Process containing the keyword specified by the user. The search field has auto-suggest and lists all the GO Processes for the parent genes included in the database containing the entered keyword, as shown in the figure below.
GO Component: Search for CGs formed by the parent genes classified with the Gene Ontology Component containing the keyword entered by the user. The search field has auto-suggest and lists all the GO Components for the parent genes included in the database containing the entered keyword, as shown in the figure below.
For the CG Conservation query the user can select from a list of genomes in which the CGs were found to be conserved, as shown in the figure below.

The results page displays a list of all the CGs which match the query options selected by the user along with their ConjoinG ID number, the chromosome they are located on, the participating parent gene symbols, and the mRNA or EST accession numbers used as evidence to confirm them, as shown in the image below.

The list of CGs can be sorted by ConjoinG ID number, chromosome number, or first (5') parent gene symbol. The results of the search can be downloaded in plain text format by clicking on the link located on the right just above the table. Clicking on the ConjoinG ID number takes the user to the profile page for that CG which provides detailed information about the CG.

The profile page provides detailed information about the conjoined genes from both our analysis and from other resources. The page is divided into the following sections:

The user can go to a specific section by using the Table of Contents in the left panel as shown in the image below:

Conjoined gene summary: This section provides a summary of the CG, such as the gene symbol we propose, the mRNA/EST accession numbers used as evidence to confirm the CG, and the parent gene symbols. For known CGs, it also provides detailed information, if available, from the NCBI Entrez Gene database, such as the Entrez Gene ID number and NCBI summary, together with the official gene symbol, full name, and aliases from HGNC, as shown in the figure below:
Genomic context: This section provides genomic information about the CG, such as its chromosomal location and strand, the distance between the parent genes, and their arrangement in the genome. The user can also view the genomic location table for the mRNA or EST sequences used as evidence to confirm the CG, as shown in the figure below:
Map of CG region: This section gives a graphical representation of the location of the parent genes and the CG along with their respective CDSs. An example is shown in the figure below. There are three types of views: a combined view (all the mRNA, EST, and CDS tracks are shown for the parent genes and the CG), an mRNA only view (only the mRNA and EST tracks are shown), and a CDS only view (only the CDS tracks are shown). The image is hyperlinked to the locations, full sequences, and exon sequences of all the mRNA, EST, and CDS sequences for the CG and the parent genes, as shown in the figure above. These images show which exons from the parent genes participate in the formation of the CG. Novel exons appearing only in the CG can also be seen.

The user can view the CG mRNA full nucleotide sequence and protein sequence by clicking on the NM IDs or mRNA accession numbers to the left of the image. Individual exon sequences (both nucleotide and protein) can be seen by clicking on the exons. Mousing over an exon displays the location of the exon on the chromosome.

By default, the graph shows only the mRNA sequences supporting the CG. By clicking on “Show EST”, the user can also view the supporting ESTs, if available.

Parent gene information: This section provides detailed information about all the parent genes participating in the formation of the CG in separate sub-sections, as shown in the figure below. The information is extensively hyperlinked where applicable back to its source database.
General protein information: This section provides information about the CG protein formed, if any. As shown in the figure below, the user can select the CG mRNA accessions and parent gene NM IDs to see what regions of the parent protein sequences match with that of the CG. The image shows an example of a chimeric protein formed by a CG that combines the exons of the two parent genes in such a way that the translation of the mRNA occurs in the same frame as both parent genes. The coding region from the upstream parent gene is shown in red in both the CG and the parent gene (TRIM6), and that from the downstream parent gene is shown in blue. The unique regions are shown in black in each sequence, and the overlapping regions are highlighted in grey in the CG sequence. This section also provides information about whether or not the CG mRNA has been predicted to undergo nonsense-mediated decay (NMD).
CG conservation: In this section, and as shown in the image below, information is given about the conservation of the "CG junction exon" in other vertebrate genomes. For this analysis, the mRNA and EST libraries of the vertebrate genomes were used. A list of CG mRNA/EST accession numbers is provided so that the user can browse through the information for each sequence. By clicking on the links as shown below, the user can also see the percent similarity of the sequences and the aligned sequences themselves highlighted in red in both the CG and the sequence from the other vertebrate genome.
Experimental validation: This section gives information about experimental validation of the conjoined gene, including the accession used for primer design, details of the primers used for RT-PCR, the PCR images, the tissues in which the CG was confirmed in our experiments and from GenBank, sequences of the PCR products mapped back to the genome, etc. An example is shown below:

The user can submit up to 1000 nucleotide or protein sequences of interest in FASTA format to find a match in the ConjoinG database. The dataset for the “Nucleotide search” option comprises the nucleotide sequences of the conjoined mRNAs, ESTs, and parent genes. For the “Protein search” option, the predicted protein sequences of the conjoined mRNAs and parent genes are used. The search is made using the BLAST homology search tool with a default E-value cut-off of 10^-6 and with the option to filter the query sequence for low-complexity regions turned off. An example of the Align input page is shown below:

The highest scoring pair for each hit is returned and the user can either view the alignment of their query sequence with the matches or go to the respective conjoined gene profile page to find more details about the matching conjoined gene. An example of the results page is shown below:
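Before submitting a batch to the Align page, it can be useful to check locally that the input is valid FASTA and stays within the 1000-sequence limit mentioned above. The Python sketch below is one way to do that; it is not part of ConjoinG, and the function names parse_fasta and check_submission are hypothetical.

```python
MAX_SEQUENCES = 1000  # submission limit stated in the tutorial


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_submission(text, limit=MAX_SEQUENCES):
    """Return the parsed records, or raise ValueError if the batch is empty or too large."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    if len(records) > limit:
        raise ValueError(f"{len(records)} sequences exceed the limit of {limit}")
    return records


# Made-up sequences, for illustration only.
example = ">query_1 example\nATGGCC\nTTAGGA\n>query_2\nGGCATT\n"
print(check_submission(example))
# [('query_1 example', 'ATGGCCTTAGGA'), ('query_2', 'GGCATT')]
```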