Question: What are conjoined genes (CGs)?
Answer: A “conjoined gene” (CG) is defined as a gene formed at the time of transcription by combining at least part of one exon from each of two or more distinct (parent) genes which lie on the same chromosome, in the same orientation, and translate independently into different proteins. In some cases, the transcripts formed by CGs are translated to form chimeric or completely novel proteins. Figure 1: Cartoon representation of the formation of conjoined gene A-B from parent genes A and B. In Figure 1, conjoined gene A-B combines at least one exon (complete or partial) from both gene A and gene B. Conjoined genes are usually only supported by a few mRNA or EST sequences, and rarely by a CCDS.
Question: How many conjoined genes are found in the NCBI Entrez Gene database?
Answer: Currently, about 34 CGs are listed in the NCBI Entrez Gene database (see Table 1 below for a complete list), 29 of which could be found by our approach. For the remaining four cases, no aligned mRNA or EST sequences connecting the parent genes were found in the UCSC Genome Browser. Table 1: List of CGs found in the NCBI Entrez Gene database
Chromosome Proposed Symbol (5' > 3') Parent gene 1 Parent gene 2 Predicted coding status
2 STON1-GTF2A1L STON1 GTF2A1L protein coding
2 LY75-CD302 LY75 CD302 NA
3 MDS1-EVI1 MDS1 EVI1 protein coding
5 ANKHD1-EIF4EBP3 ANKHD1 EIF4EBP3 protein coding
9 PALM2-AKAP2 PALM2 AKAP2 protein coding
10 CYP2C18-CYP2C19 CYP2C18 CYP2C19 NA
11 ZFP91-CNTF ZFP91 CNTF non-coding(NMD candidate)
11 TRIM6-TRIM34 TRIM6 TRIM34 protein coding
11 MS4A7-MS4A14 MS4A7 MS4A14 NA
15 JMJD7-PLA2G4B JMJD7 PLA2G4B protein coding (?)
16 DDX19-DDX19L DDX19 DDX19L protein coding
17 CCL15-CCL14 CCL15 CCL14 non-coding(NMD candidate)
17 NME1-NME2 NME1 NME2 protein coding
17 TNFSF12-TNFSF13 TNFSF12 TNFSF13 protein coding
17 DRG2-MYO15A DRG2 MYO15A non-coding
19 PPAN-P2RY11 PPAN P2RY11 protein coding
20 Kua-UBE2V1 Kua UBE2V1 protein coding
22 PRR5-ARHGAP8 PRR5 ARHGAP8 protein coding
Question: How are conjoined genes identified?
Answer: The conjoined genes in the human genome were identified using the alignments of known genes, mRNAs, and ESTs to the human genome generated by UCSC (human assembly hg18, March 2006). The algorithm “Conjoin” was developed and applied to the entire human genome, and a list of conjoined gene candidates was obtained (see text for details). In a manual curation step, false positives arising from misalignments of genes from the same gene family (Figure 2) and from poor-quality sequences, including short EST sequences, were removed. We also manually removed false positives arising from splice variants of single gene loci with multiple gene names (Figure 3). Figure 2: Example of a false positive case due to a duplicated gene or region on chromosome 7 (CYP3A5 - CYP3A7). Figure 3: Example of a false positive case due to gene name variants of the same gene on chromosome 2 (UGT1A family).
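The core idea behind candidate detection can be illustrated with a small sketch. This is not the Conjoin algorithm itself, whose details are not given here; it only assumes, following the definition of a CG above, that a candidate is a transcript alignment whose exons overlap exons of two or more distinct genes on the same chromosome and strand. The `Gene` class, the function names, and the half-open coordinate convention are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; exons are (start, end) half-open intervals."""
    name: str
    chrom: str
    strand: str
    exons: tuple


def overlaps(a, b):
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def shares_exon(transcript_exons, gene):
    """True if any aligned transcript block overlaps any exon of the gene."""
    return any(overlaps(t, e) for t in transcript_exons for e in gene.exons)


def conjoined_candidates(chrom, strand, transcript_exons, genes):
    """Return the names of the genes a transcript joins, or [] if fewer than two.

    Only genes on the transcript's chromosome and strand are considered,
    mirroring the same-chromosome, same-orientation requirement.
    """
    hits = [
        g for g in genes
        if g.chrom == chrom and g.strand == strand
        and shares_exon(transcript_exons, g)
    ]
    return sorted(g.name for g in hits) if len(hits) >= 2 else []
```

A candidate list produced this way would still need the manual curation described above, since paralogous genes and multiply named loci also satisfy this simple test.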
Question: Are human conjoined genes conserved across other vertebrate genomes?

Answer: We examined the conservation of conjoined genes across 23 other vertebrate genomes. In order to determine conservation, we selected a ‘junction exon’ from every CG transcript. The junction exons can be defined as the exon(s) of CGs which contain DNA sequence from both participating parent genes, viz., the terminal exon of the upstream gene and the initial exon of the downstream gene. In cases where the terminal and initial exons of the two parent genes form separate exons in the CGs, with or without any other exon between them, the entire region was used. These junction exons were then searched for in the mRNA and EST databases of 23 vertebrate genomes using BLAT, selecting only those hits with an E-value cutoff of less than 10^-6. In addition, only matches where more than 90% of the junction exon sequence was conserved with more than 90% sequence identity were considered. A complete list of the 23 vertebrate genomes is given below in Table 2.

Table 2: List of vertebrate genomes used to measure CG conservation

Genome Release Total # of mRNA/EST sequences used Total # of CG mRNAs found conserved
Chimpanzee Mar-06 8,652,108 809
Rhesus macaque Jan-06 61,957 9
Orangutan Jul-07 51,889 9
Cow Oct-07 1,535,130 7
Rat Nov-04 1,100,941 3
Mouse Jul-07 5,179,572 7
Dog May-05 368,099 1
Horse Sep-07 38,189 3
Guinea Pig Feb-08 20,616 0
Chicken May-06 628,398 0
Marmoset Jun-07 3,585 0
Zebra Finch Jul-08 96,696 0
Zebrafish Jul-07 1,509,475 0
X. tropicalis Aug-05 1,289,358 0
Medaka Oct-05 617,589 0
Stickleback Feb-06 279,353 0
Lamprey Mar-07 121,720 0
Tetraodon Feb-04 99,539 0
Fugu Oct-04 26,879 0
Platypus Mar-07 9,854 0
Cat Mar-06 1,787 0
Opossum Jan-06 1,429 0
Lizard Jan-07 82 0
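
The hit-selection thresholds described before Table 2 can be written as a small filter. This is a minimal sketch, assuming each BLAT hit has already been reduced to four numbers; the function name, parameter names, and the way coverage and identity are computed are illustrative assumptions, not the pipeline actually used.

```python
def is_conserved_hit(junction_len, aligned_len, matches, e_value,
                     min_coverage=0.90, min_identity=0.90, max_e_value=1e-6):
    """Apply the three thresholds stated in the text to one alignment hit.

    junction_len: length of the human junction exon query (bases)
    aligned_len:  number of query bases covered by the alignment
    matches:      number of identical bases within the aligned region
    e_value:      reported E-value of the hit
    """
    if junction_len <= 0 or aligned_len <= 0:
        return False
    coverage = aligned_len / junction_len   # fraction of the junction exon aligned
    identity = matches / aligned_len        # fraction of aligned bases that match
    return (e_value < max_e_value
            and coverage > min_coverage
            and identity > min_identity)
```

For example, a 200-base junction exon with 190 aligned bases, 180 of them identical, passes both the coverage and identity thresholds, whereas the same exon with only 170 aligned bases fails on coverage.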

More than 70% of the human CG junction exons were found to be conserved across eight other vertebrate genomes. No significant conservation was observed in the lower-order vertebrates including zebrafish, X. tropicalis, medaka, stickleback, lamprey, tetraodon, and fugu although a large number of mRNA and EST sequences were available for all these genomes. Among the higher-order vertebrates, maximum conservation of CG junction exons was observed in the chimpanzee genome. We observed a large decrease in the number of conserved human CG junction exons from chimpanzee to macaque and orangutan. This could be due to the poor quality of transcriptome annotation of these genomes. With the advancement of high-throughput transcriptome sequencing technologies, such as RNAseq, more RNA sequence data is expected to become available, leading to the detection of additional conserved human CGs in these and other genomes. Nevertheless, all other genomes showed much less conservation of human CG junction exons (Figure 4). Therefore, it is evident from our analysis that CG conservation does not depend on the amount of sequence data available for any given genome; instead, it is correlated with the order of complexity of the vertebrate genomes. This also strengthens the hypothesis that conjoined genes have well-defined functional roles.

Figure 4: Distribution of human CG junction exons found conserved across other vertebrate genomes normalized with respect to the transcriptome information available for the other genomes. Higher-order vertebrates including chimpanzee, macaque and orangutan are expected to show higher conservation as compared to the lower-order vertebrates such as dog, mouse, and rat.
Question: Does any pattern exist in the splicing of the exons participating in the formation of conjoined genes?
Answer: In 58% of the cases, new splice sites have evolved in the conjoined genes by either “parent gene exon truncation” (Figure 5a) or “parent gene exon extension” (Figure 5b). Interestingly, in 85% of these cases splicing occurred at canonical sites (GT-AG). The splice site pairs observed in these cases include the canonical GT-AG as well as the non-canonical GC-AG, CT-AG, GG-AG, GA-AG, CA-AG, GT-CG, GT-GG, TG-AG, GT-TG, AT-AG, TC-AG, GT-AC, CC-AG, TT-AG, GT-AA, AA-AG, GT-AT, GT-GC, AC-AG, AG-AG, CT-AT, GT-CC, GA-CT, GT-CT, GT-GA, CC-GC, GG-GT, GT-GT, GT-TC, CT-TT, and GT-TT.

Figure 5: Diagram showing the formation of new splice sites in the exons of the conjoined genes (a) by parent gene exon truncation, and (b) by parent gene exon extension

However, in almost all the CGs (99%) in which splice sites remain conserved from the parent genes, splicing occurred at canonical sites by using the universal donor (GT) and acceptor (AG) sequences (Figure 6). The splice site pairs observed in these CGs include the canonical GT-AG as well as the non-canonical GC-AG, GA-AG, AT-AG, GT-AC, GT-AT, GT-GG, and GT-TT.

Figure 6: Canonical splice sites are found for a majority of the conjoined genes

An interesting pattern of exon splicing was observed in 42% of the conjoined genes, where the terminal exons of the upstream (5’-) parent genes and the initial exons of the downstream (3’-) parent genes did not participate in forming the conjoined genes. This results in the creation of a new intron spanning the terminal exon and 3’ UTR region of the upstream (5’-) gene, followed by the intergenic region between the two parent genes, and then the 5’ UTR and initial exon of the downstream (3’-) gene (Figure 7).

Figure 7: Diagram showing the pattern of exons selected in the conjoined genes from their respective parent genes

The formation of CGs not only created novel introns, but in 46% of the cases it also resulted in the appearance of completely novel exons. These conjoined genes harbored several types of novel exons, including those from the intergenic regions, the intronic regions, or the external regions lying upstream of the 5’-gene or downstream of the 3’-gene. Interestingly, the novel exons from the intergenic regions outnumbered all other types (Figure 8).

Figure 8: Diagram showing the distribution of the location of novel exons found in the CGs
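The canonical versus non-canonical distinction used in this answer can be made concrete with a short sketch. It assumes the intron sequence is supplied 5' to 3' on the transcribed strand, so that its first two bases are the donor and its last two the acceptor; the function name is hypothetical.

```python
CANONICAL_PAIR = ("GT", "AG")  # universal donor and acceptor dinucleotides


def classify_splice_site(intron_seq):
    """Return (pair, label) for an intron, e.g. ("GT-AG", "canonical").

    The donor is read from the first two bases of the intron and the
    acceptor from the last two. Assumes the sequence is given on the
    transcribed strand.
    """
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron must contain at least four bases")
    donor, acceptor = seq[:2], seq[-2:]
    label = "canonical" if (donor, acceptor) == CANONICAL_PAIR else "non-canonical"
    return f"{donor}-{acceptor}", label
```

Under this rule only GT-AG is labeled canonical; every other pair listed above, such as GC-AG or AT-AC-type variants, falls into the non-canonical class.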
Question: Do conjoined genes form chimeric proteins?
Answer: Yes. Out of 409 selected mRNAs representing 297 CGs for which a reasonable ORF could be obtained, 55 (13%) formed chimeric proteins by joining the domains of the two parent genes.
Question: How many of the CGs identified by us were also found by other research groups?
Answer: In early 2006, Akiva et al. (2006) and Parra et al. (2006) independently analyzed the entire human genome and identified 212 and 127 CGs, respectively, using mRNA and EST information available in public databases. At the same time, another analysis by Kim et al. (2006) identified 258 unique CGs in the entire human genome using the ECgene clustering system (Kim et al. 2005a, Kim et al. 2005b). On close examination, we found that some of these genes have now become obsolete because of the lack of reasonable alignments between the CGs and their proposed parent genes, revisions in the coordinates of the participating parent genes in the current version of the human genome database, different chromosomal locations of the parent genes, or opposite orientation of the parent genes. As a result, only 193, 123, and 83 CGs remain valid in the Akiva, Kim, and Parra datasets, respectively. To estimate the total number of unique CGs, we compared the 751 CGs identified by our approach with the above three datasets (Figure 9). We found that 232 out of 751 CGs overlapped with at least one other dataset, leaving 519 CGs uniquely identified by our method. Forty-three CGs were identified by at least one of the Akiva, Kim, and Parra analyses but not by us.

Figure 9: Venn diagram showing the overlap of the CGs identified by our approach (ConjoinG) with those identified by Akiva et al. (2006), Parra et al. (2006), and Kim et al. (2006) (ChimerDB). Twelve CGs are common to all four datasets. 519 CGs were uniquely identified by our approach.

We also compared the 13 CGs obtained by RT-PCR by Denoeud et al. (2007) in the ENCODE regions to our dataset. Only two CGs were found in common, and five were doubtful because (1) the parent genes are in opposite orientation to each other, (2) the parent genes are in opposite orientation to the CG, or (3) neither parent gene could be found in the human genome (see table). Six CGs in the ENCODE study were identified by them but not by us. The sequences used to confirm these CGs were not found in the version of the human genome data from UCSC that was used for our analysis. Thus, adding the unique CGs identified by Akiva, Kim, Parra, and the ENCODE analysis to ours results in 800 unique CGs identified in the entire human genome to date.
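The counts quoted in this answer fit together as follows. The sketch below only restates the arithmetic using the figures given in the text; the variable names are illustrative.

```python
# Figures taken directly from the answer above.
ours_total = 751        # CGs identified by our approach
ours_shared = 232       # of those, CGs also present in at least one other dataset
others_only = 43        # CGs found by Akiva, Kim, or Parra but not by us
encode_only = 6         # CGs found only in the ENCODE RT-PCR study

# CGs that only our approach identified.
ours_unique = ours_total - ours_shared

# Non-redundant total: everything we found plus what only others found.
combined_unique = ours_total + others_only + encode_only

print(ours_unique, combined_unique)
```

The shared CGs are counted once, inside `ours_total`, which is why only the sets found exclusively by other groups are added to reach the combined total.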
Question: Are conjoined genes found in other genomes too?
Answer: Yes. Using our algorithm, we have identified conjoined genes in other genomes, such as mouse and Drosophila, as shown in Table 4. Table 4: Table showing the number of CGs identified in human, mouse, and fruit fly.
Question: How far apart are the parent genes that form conjoined genes?
Answer: The median distance between parent genes forming CGs is 10 kb (see table), except for a few outliers such as DOCK5-PPP2R2A, LASP1-PPP1R1B, FIP1L1-PDGFRA, and MATR3-PURA, which are formed by genes lying as far apart as 700-800 kb on the same chromosome and which skip over exons of several genes in the intergenic regions between the two parent genes involved.
Question: How many CGs could be validated experimentally in human?
Answer: We attempted to experimentally validate 353 out of 751 CGs using RT-PCR and sequencing methods in 16 human tissues: brain, heart, kidney, liver, lung, pancreas, prostate, skeletal muscle, spleen, stomach, testis, uterus, fetal brain, fetal kidney, fetal liver, and fetal lung. We confirmed the CGs in 291 out of 353 (82%) cases by sequencing the expected conjoined mRNA in at least one of the above tissues. Interestingly, alternative splicing was observed in 63% (184 out of 291) of the CGs, and many of these alternatively spliced products were found to harbor novel exons.
Question: In what human tissues are CGs expressed?
Answer: Among the tissues we examined, brain and testis were particularly enriched in CG sequences. Liver, fetal lung, spleen, stomach, and pancreas harbored the fewest detectable CGs. Figure 10 shows the distribution of CG expression in the human tissues tested in our experiments. Figure 10: Distribution of human tissues in which the conjoined genes were expressed in our experiments.
Question: How many CGs are widely expressed and how many showed tissue-specific expression?
Answer: Because 69% (202/291) of CGs were found to be widely expressed (see table), formation of CGs is expected to be a more widespread phenomenon occurring in most tissues. However, 31% (89/291) of CGs showed tissue-specific expression, suggesting selective expression for many of them. Certainly, these CGs could be expressed in tissues other than the ones we examined.
Question: Could new conjoined genes be identified in human from regions where no prior conjoined mRNA evidence was available?
Answer: We randomly selected ten pairs of genes (test cases) from the human genome which satisfied the 'minimum' criterion for formation of CGs, that is, they had to be on the same chromosome, on the same strand, and less than 10 kb apart. Since the most preferred pattern of CG splicing is to exclude the terminal exon of the upstream gene and the initial exon of the downstream gene, we designed primer sequences from the second-to-last exon of the upstream gene and the second exon of the downstream gene for these ten regions and performed RT-PCR followed by sequencing of the PCR fragments. Surprisingly, out of the ten test cases, we were able to verify by sequencing eight novel conjoined genes for the selected parent gene pairs (Table 5). In addition, alternative splicing with inclusion of novel exons was observed in these test cases. Table 5: List of test cases in human with no prior conjoined mRNA evidence.
5' Gene Symbol 3' Gene Symbol Strand Chromosome CG Confirmed? Alternative Splicing
ACTG2 DGUOK + 2 Yes Yes
ABHD14A ACY1 + 3 Yes Yes
SELK ACTR8 - 3 Yes Yes
ABCG4 NLRX1 + 11 Yes Yes
ACAD10 ALDH2 + 12 Yes Yes
ACCN2 SMARCD1 + 12 Yes No
ING4 ACRBP - 12 Yes Yes
DPM1 ADNP - 20 Yes Yes
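
The 'minimum' criterion used to pick the test cases above (same chromosome, same strand, less than 10 kb apart) can be expressed as a simple check. This is a minimal sketch: the dictionary layout, the function name, and the choice to reject overlapping genes are assumptions made for illustration, and the coordinates in the usage note are invented rather than taken from any real gene.

```python
def meets_minimum_criterion(gene_1, gene_2, max_gap=10_000):
    """Check whether two genes could, in principle, form a conjoined gene.

    Each gene is a dict with keys "chrom", "strand", "start", and "end"
    (hypothetical layout). The gap is measured from the end of the gene
    that lies first on the chromosome to the start of the other one.
    """
    if gene_1["chrom"] != gene_2["chrom"]:
        return False
    if gene_1["strand"] != gene_2["strand"]:
        return False
    gap = max(gene_1["start"], gene_2["start"]) - min(gene_1["end"], gene_2["end"])
    # Assumption: overlapping genes (negative gap) are not treated as candidates.
    return 0 <= gap < max_gap
```

For instance, two same-strand genes at positions 1,000-5,000 and 9,000-15,000 on one chromosome are 4,000 bases apart and pass the check, while moving the second gene to start at 20,000 makes the pair fail.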
  • Akiva, P., Toporik, A., Edelheit, S., Peretz, Y., Diber, A., Shemesh, R., Novik, A., & Sorek, R. (2006) Transcription-mediated gene fusion in the human genome. Genome Res. 16: 30-36.
  • Denoeud, F., Kapranov, P., Ucla, C., Frankish, A., Castelo, R., Drenkow, J., Lagarde, J., Alioto, T., Manzano, C., Chrast, J. et al. (2007) Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 17: 746-759.
  • Kim, N., Shin, S., & Lee, S. (2005a) ECgene: genome-based EST clustering and gene modeling for alternative splicing. Genome Res. 15: 566-576.
  • Kim, P., Kim, N., Lee, Y., Kim, B., Shin, Y., & Lee, S. (2005b) ECgene: genome annotation for alternative splicing. Nucleic Acids Res. 33: D75-D79.
  • Kim, N., Kim, P., Nam, S., Shin, S., & Lee, S. (2006) ChimerDB--a knowledgebase for fusion sequences. Nucleic Acids Res. 34: D21-D24.
  • Parra, G., Reymond, A., Dabbouseh, N., Dermitzakis, E. T., Castelo, R., Thomson, T. M., Antonarakis, S. E., & Guigo, R. (2006) Tandem chimerism as a means to increase protein complexity in the human genome. Genome Res. 16: 37-44.