De novo transcriptome assembly using hybridoma mRNA-seq data reconstructs transcripts containing Igh and Igl/Igk genes. We developed an automation tool (igfinder) to extract Igh and Igl/Igk gene sequences from assembled transcripts (e.g., Trinity output).
Biopython is required to run igfinder.
extracts immunoglobulin sequences from assembled transcripts (e.g., Trinity output).
python igfinder.py [-h] -i I [-o O] [-c C] -r R
-h, --help show this help message and exit
-i I input_filename(fasta_format)
-o O output_dir
-c C Analysis frame: default=starts from M
-r R sequence_reference csv file
The input file should be in fasta format.
Example: Trinity.fasta
>TR1
GTTCTCGATACTTCGTTGTGGTTGTGAACTCTGTCCGGCAGCCTCGGGCCTGC
GGTCTTGAGACGGAGCACCATGCCTACGATAAAGTTGCAGAGTTCTGATGGAG
AGATATTTGAAGTTGATGTAGAAATTGCCAAACAATCTGTGACTATCAAGACC
>TR2
CGTACAGTTGGAAACATTGAGTATTCGAGGCAATAACATTCGGTATTTTATTC
TACCAGACAGCTTACCTCTAGATACACTACTTGTGGATGTTGAACCTAAGGTG
AAGTCTAAGAAAAGAGAAGCTGTTGCAGGAAGAGGCCGAGGCCGGGGTAGAGG
AAGAGGACGTGGTCGTGGCAG
The reference file should be in csv format.
Example: rat.csv
Ighg1,TGTGCCCAGAAACTGTGGAG,1200
Ighg2a,GCCAAGGGAATGCAATCCTTG,1200
Ighg2b,CAAACAACAGCCCCATCTGTCTAT,1200
Ighg2c,AGAACAACAGCCCCATCTGTCTA,1200
Igk,ACCAACTGTATCTATCTTCCCACCATCCAC,600
Igl1,CAACCCAAGGCTACGCCCTC,600
Igl2,CAGCCCAAGTCCACTCCCAC,600
1st field - name of target gene
2nd field - unique sequence of target gene (nucleotide)
3rd field - predicted minimum length of target gene (CDS only, not including UTR)
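As an illustration of how such a reference file might be consumed, the sketch below reads the three fields with Python's csv module and checks whether a transcript carries a marker sequence on either strand. The names RefEntry, load_reference, and find_marker are hypothetical and are not taken from igfinder; the actual tool may match sequences differently.

```python
import csv
import io
from typing import NamedTuple, Optional


class RefEntry(NamedTuple):
    name: str        # 1st field: name of target gene
    marker: str      # 2nd field: unique nucleotide sequence of target gene
    min_length: int  # 3rd field: predicted minimum CDS length


def load_reference(handle) -> list:
    """Read (name, marker, min_length) rows from an open CSV handle."""
    entries = []
    for row in csv.reader(handle):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        entries.append(RefEntry(row[0].strip(), row[1].strip().upper(), int(row[2])))
    return entries


_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_marker(transcript: str, entry: RefEntry) -> Optional[str]:
    """Return '+' or '-' for the strand carrying the marker, or None if absent."""
    seq = transcript.upper()
    if entry.marker in seq:
        return "+"
    if entry.marker in reverse_complement(seq):
        return "-"
    return None


# Two rows copied from the rat.csv example above.
example = io.StringIO(
    "Igk,ACCAACTGTATCTATCTTCCCACCATCCAC,600\n"
    "Igl1,CAACCCAAGGCTACGCCCTC,600\n"
)
reference = load_reference(example)
```

Checking both strands matters because an assembler can report a transcript in either orientation; the "rc" in the Ighg2b record name in the tutorial output below suggests that igfinder handles reverse-complemented hits in some way.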
All igfinder output files are in fasta format.
Example: final_output.fasta
>Target1
CAGGTTGCCTCCTCAAAATGAAGTTGCCTGTTAGGCTGTTGGTGCTGATGTTC
TGGATTCCTGCTTCCAGCAGTGATGTTTTGATGACCCAAACTCCACTCTCCCT
GCCTGTCAGTCTTGGAGATCAAGCCTCCATC
>Target2
CCTACATGGAGCTCCGCACCCTGACATCTGAGGACTCTGCCGTCTATTACTGT
ACAAGTTACGGGGGAGTTTATTGGGGCCAAGGGACTCTGGTCACTGTCTCTGC
AGCCAAAACAACACCCCCATCAGTCTATCCACTGGCCCCTGGGTGTGGAGATA
CAACTGGTTCCTCCGTGACTCTGGGATGCCTGGTCAAGGGCTACTTCCCTGAG
TCAGTGACTGTGACTTGGAACTCTGGATCCCTGTCCAG
Example: AA_output.fasta
>Target1
GCLLKMKLPVRLLVLMFWIPASSSDVLMTQTPLSLPVSLGDQASISCRSSQSI
VHSNGNTYLEWYLQKPGQSPKLLIYKVSNRFSGVPDRFSGSGTGTDFTLK
>Target2
IAGVQSQVQLQQSGAELVRPGASVTLSCKASGYTFTDYEIHWVKQTPVHGLEW
IGAIDPETGGTAYNQKFKGKATLTADKSSSTAYMELRTLTSEDSAVYYCTSYG
GVYWGQGTLVTVSAAKTTPPSVYPLAPGCGDTTGSSVTLGCLV
This tutorial identifies the rat anti-Brg1 antibody sequences produced by the 4E5 hybridoma. The example data are mRNA-seq reads from 4E5.
Download the 4E5 hybridoma mRNA-seq fastq files from NCBI and the rat reference file (rat.csv).
Each fastq file contains more than 40M reads. Subsample the reads before de novo assembly, because assembling all reads takes much longer. More than 30,000 reads are enough to recover Igh and Igl/Igk.
To subsample the first 100,000 reads (a fastq record is 4 lines, so 400,000 lines):
zcat all-R1.fastq | sed -n 1,400000p > 100000-R1.fastq
zcat all-R2.fastq | sed -n 1,400000p > 100000-R2.fastq
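The same subsampling can be sketched in Python, which makes the four-lines-per-read arithmetic explicit. The function name head_fastq is hypothetical; the sketch assumes gzip-compressed input and, like the sed commands, takes the first reads in the file rather than a random sample.

```python
import gzip
from itertools import islice


def head_fastq(src_gz: str, dst: str, n_reads: int) -> int:
    """Copy the first n_reads records from a gzipped FASTQ to a plain FASTQ.

    Returns the number of complete records written.
    """
    lines_written = 0
    with gzip.open(src_gz, "rt") as fin, open(dst, "w") as fout:
        for line in islice(fin, n_reads * 4):  # 4 lines per FASTQ record
            fout.write(line)
            lines_written += 1
    return lines_written // 4
```

A call such as head_fastq("all-R1.fastq.gz", "100000-R1.fastq", 100000) would mirror the first command (the .gz file name is an assumption). Use the same read count for R1 and R2 so that the read pairs stay in sync.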
Step 1: de novo transcriptome assembly
In the case of Trinity
Trinity --seqType fq --left 100000-R1.fastq(.gz) --right 100000-R2.fastq(.gz) --CPU 2 --max_memory 12G
When Trinity completes, Trinity.fasta is created as an output file.
Step 2: Igh and Igl/Igk gene sequences extraction
python igfinder.py -i Trinity.fasta -o 4E5hybridoma -r rat.csv
The final_output.fasta file in the 4E5hybridoma directory contains the extracted Igh and Igl/Igk nucleotide sequences, and AA_output.fasta contains the IgH and IgL/IgK protein sequences in the correct reading frame.
final_output.fasta
>Igk_TR109|c0_g1_i1233
GAAGGTCTTTCTCAGGGCTGTGTCATGATCCACATAAACTCGAGGAAAGCCCAAAGATGGT
GTTCAAATTTCAGATCCTTGGACTTCTGCTTTTCTGGATTTCAGCCTCTAGAGGGGACATC
GTGCTGACTCAGTCTCCAACCACCCTGTCTGTGACTCCAGGAGAGACAGTCAGTCTCTCCT
GCAGGGCTAGCCATAGTATTGGCACAAATCTACACTGGTATCAGCAAAAAACAAATGAGTC
TCCAAGGCTTCTCATCAAGTATTCTTCCCAGTCCATCTCTGGGATCCCCTCCAGGTTCAGT
GCCAGTGGATCAGGGACAGATTTTACTCTCAACATCAACAATGTGGAGTTTGATGATGTCT
CAAGTTATTTTTGTCAACAGACTCAAAGCTGGCCCGACACGTTTGGAGCTGGGACCAAGCT
GGAACTGAAACGGGCTGATGCTGCACCAACTGTATCTATCTTCCCACCATCCACGGAACAG
TTAGCAACTGGAGGTGCCTCAGTCGTGTGCCTCATGAACAACTTCTATCCCAGAGACATCA
GTGTCAAGTGGAAGATTGATGGCACTGAACGACGAGATGGTGTCCTGGACAGTGTTACTGA
TCAGGACAGCAAAGACAGCACGTACAGCATGAGCAGCACCCTCTCGTTGACCAAGGCTGAC
TATGAAAGTCATAACCTCTATACCTGTGAGGTTGTTCATAAGACATCATCCTCACCCGTCG
TCAAGAGCTTCAACAGGAATGAGTGTTAGACCCAAAGGTCCTGAGGTGCCACCTGCTCCCC
AGCTCCTTCCAATCTTCCCTCCTAAGGTCTTGGAGACTTCCCCACAAGCGACCTACCACTG
TTGCGGTGCTCCAAACCTCCTCCCCACCTCATCCTCCTTCCTTTCCTTGGCTTTGATCATG
CTAATATTTGGGGAATATTAAATAAAGTGAA
>Ighg2b_rcTR179|c0_g1_i1476
AATGTTTTCTCTACAGTCACTGAATCACAACATCCTCATTATGAAATGCAGGTGGATCATC
CTCTTCTTGATGGCAGTAGCTACAGGGGTCAACTCAGAAGTCCAGCTGCAGCAATCTGGGC
CTGAGCTTCAGAGACCCGGGGCCTCAGTCAAGTTGTCGTGCAAGGCTTCTGGCTATACCTT
TACAGAATACTATATGTACTGGGTGAAGCAGAGGCCTAAACAGGGCCTGGAATTAATAGGA
AGGATTGATCCTGAAGACGGTAGTACTGATTATGTTGAGAAGTTCAAAAACAAGGCCACAC
TGACTGCAGATACATCGTCCAACACAGCCTACATGCAACTCAGCAGCCTGACATCTGAGGA
CACAGCAACCTATTTTTGTGCGGCGGGAACTAGGTGGGGCCAAGGAGTCATGGTCACAGTC
TCCTCAGCCCAAACAACAGCCCCATCTGTCTATCCACTGGCTCCTGGATGTGGTGATACAA
CCAGCTCCACGGTGACTCTGGGATGCCTGGTCAAGGGCTATTTCCCTGAGCCAGTCACCGT
GACCTGGAACTCTGGAGCCCTGTCCAGCGATGTGCACACCTTTCCAGCTGTCCTGCAGTCT
GGGCTCTACACTCTCACCAGCTCAGTGACCTCCAGCACCTGGCCCAGCCAGACCGTCACCT
GCAACGTAGCCCACCCGGCCAGCAGCACCAAGGTGGACAAGAAAGTTGAGCGCAGAAATGG
CGGCATTGGACACAAATGCCCTACATGCCCTACATGTCACAAATGCCCAGTTCCTGAACTC
TTGGGTGGACCATCTGTCTTCATCTTCCCGCCAAAGCCCAAGGACATCCTCTTGATCTCCC
AGAACGCCAAGGTCACGTGTGTGGTGGTGGATGTGAGCGAGGAGGAGCCGGACGTCCAGTT
CAGCTGGTTTGTGAACAACGTAGAAGTACACACAGCTCAGACACAACCCCGTGAGGAGCAG
TACAACAGCACCTTCAGAGTGGTCAGTGCCCTCCCCATCCAGCACCAGGACTGGATGAGCG
GCAAGGAGTTCAAATGCAAGGTCAACAACAAAGCCCTCCCAAGCCCCATCGAGAAAACCAT
CTCAAAACCCAAAGGGCTAGTCAGAAAACCACAGGTATACGTCATGGGTCCACCGACAGAG
CAGTTGACTGAGCAAACGGTCAGTTTGACCTGCTTGACCTCAGGCTTCCTCCCTAACGACA
TCGGTGTGGAGTGGACCAGCAACGGGCATATAGAAAAGAACTACAAGAACACCGAGCCAGT
GATGGACTCTGACGGTTCTTTCTTCATGTACAGCAAGCTCAATGTGGAAAGGAGCAGGTGG
GATAGCAGAGCGCCCTTCGTCTGCTCCGTGGTCCACGAGGGTCTGCACAATCACCACGTGG
AGAAGAGCATCTCCCGGCCTCCGGGTAAATGAGCACGGCACCCAGAAAGCTCTCAGGTCCT
AAGGGACACTGACACCCATCTCCACCCTTCCCTTGTGTAAATAAAGCACCCAGCACTGCCC
TGGGACCCTGCT
AA_output.fasta
>Igk_TR109|c0_g1_i1233
RSFSGLCHDPHKLEESPKMVFKFQILGLLLFWISASRGDIVLTQSPTTLSVTPGETVSLSC
RASHSIGTNLHWYQQKTNESPRLLIKYSSQSISGIPSRFSASGSGTDFTLNINNVEFDDVS
SYFCQQTQSWPDTFGAGTKLELKRADAAPTVSIFPPSTEQLATGGASVVCLMNNFYPRDIS
VKWKIDGTERRDGVLDSVTDQDSKDSTYSMSSTLSLTKADYESHNLYTCEVVHKTSSSPVV
KSFNRNEC*TQRS*GATCSPAPSNLPS*GLGDFPTSDLPLLRCSKPPPHLILLPFLGFDHA
NIWGILNKV
>Ighg2b_rcTR179|c0_g1_i1476
MFSLQSLNHNILIMKCRWIILFLMAVATGVNSEVQLQQSGPELQRPGASVKLSCKASGYTF
TEYYMYWVKQRPKQGLELIGRIDPEDGSTDYVEKFKNKATLTADTSSNTAYMQLSSLTSED
TATYFCAAGTRWGQGVMVTVSSAQTTAPSVYPLAPGCGDTTSSTVTLGCLVKGYFPEPVTV
TWNSGALSSDVHTFPAVLQSGLYTLTSSVTSSTWPSQTVTCNVAHPASSTKVDKKVERRNG
GIGHKCPTCPTCHKCPVPELLGGPSVFIFPPKPKDILLISQNAKVTCVVVDVSEEEPDVQF
SWFVNNVEVHTAQTQPREEQYNSTFRVVSALPIQHQDWMSGKEFKCKVNNKALPSPIEKTI
SKPKGLVRKPQVYVMGPPTEQLTEQTVSLTCLTSGFLPNDIGVEWTSNGHIEKNYKNTEPV
MDSDGSFFMYSKLNVERSRWDSRAPFVCSVVHEGLHNHHVEKSISRPPGK*ARHPESSQVL
RDTDTHLHPSLV*IKHPALPWDPA
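The relationship between the two files can be checked by hand: translating a nucleotide record in the right frame should reproduce the corresponding protein record, with * marking stop codons. The sketch below does this with the standard genetic code in plain Python. The translate helper is hypothetical and is not igfinder's own code, and the frame offset has to be supplied by the user.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str, frame: int = 0) -> str:
    """Translate seq starting at 0-based offset `frame`.

    Stop codons become '*'; codons with unknown bases become 'X'.
    A trailing partial codon is ignored.
    """
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# First 31 nt of the Igk record above, read from offset 2.
print(translate("GAAGGTCTTTCTCAGGGCTGTGTCATGATCC", frame=2))  # RSFSGLCHD
```

The printed peptide matches the start of the Igk entry in AA_output.fasta, so that record is read from the third nucleotide of the transcript.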
Check the ORFs (open reading frames) in the identified sequences with a sequence viewer such as ApE. You should find ORFs of around 1400 bp and 700 bp coding for Igh and Igl/Igk, respectively.
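For a quick scripted check before opening the sequences in a viewer, the sketch below reports the longest ATG-to-stop ORF across the three forward frames of a sequence. It is a plain-Python illustration and not part of igfinder. The longest_orf helper is hypothetical, and it assumes the record is already in the coding orientation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str):
    """Return (start, end, length) of the longest ATG...stop ORF on the forward strand.

    Coordinates are 0-based and end-exclusive, and the length includes the
    stop codon. Returns None if no complete ORF is found.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # position of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                length = i + 3 - start
                if best is None or length > best[2]:
                    best = (start, i + 3, length)
                start = None  # close this ORF and look for the next ATG
    return best


print(longest_orf("CCATGAAATTTTAGGG"))  # (2, 14, 12)
```

Run on each record of final_output.fasta, the reported length can be compared with the expected sizes given above.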