igfinder


What's this for?

De novo transcriptome assembly using hybridoma mRNA-seq data reconstructs transcripts containing Igh and Igl/Igk genes. We developed an automation tool (igfinder) to extract Igh and Igl/Igk gene sequences from assembled transcripts (e.g., Trinity output).

Download

igfinder
mouse.csv
rat.csv

Installation

Biopython is required to run igfinder.

Usage

igfinder

extracts immunoglobulin sequences from assembled transcripts (e.g., Trinity output).

python igfinder.py [-h] -i I [-o O] [-c C] -r R  
    -h, --help     show this help message and exit   
    -i I           input_filename(fasta_format)
    -o O           output_dir
    -c C           Analysis frame: default=starts from M
    -r R           sequence_reference csv file

File formats

Input file

The input file should be in fasta format.

Example: Trinity.fasta

>TR1  
GTTCTCGATACTTCGTTGTGGTTGTGAACTCTGTCCGGCAGCCTCGGGCCTGC   
GGTCTTGAGACGGAGCACCATGCCTACGATAAAGTTGCAGAGTTCTGATGGAG   
AGATATTTGAAGTTGATGTAGAAATTGCCAAACAATCTGTGACTATCAAGACC   
>TR2
CGTACAGTTGGAAACATTGAGTATTCGAGGCAATAACATTCGGTATTTTATTC   
TACCAGACAGCTTACCTCTAGATACACTACTTGTGGATGTTGAACCTAAGGTG   
AAGTCTAAGAAAAGAGAAGCTGTTGCAGGAAGAGGCCGAGGCCGGGGTAGAGG   
AAGAGGACGTGGTCGTGGCAG
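As an illustration of the input format, the sketch below reads multi-line FASTA records such as the Trinity.fasta example above using only the Python standard library. The function name `read_fasta` is hypothetical and is not part of igfinder, which relies on Biopython for its own parsing.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file or file-like object.

    Hypothetical helper for illustration only; not igfinder's own parser.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()  # the example lines carry trailing spaces
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as the record name.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Small demonstration on an in-memory record set.
demo = StringIO(">TR1  \nGTTCTC   \nGGTCTT\n>TR2\nCGTACA\n")
for record_name, sequence in read_fasta(demo):
    print(record_name, len(sequence))
```

For a file on disk, pass an open handle instead, e.g. `with open("Trinity.fasta") as fh: records = list(read_fasta(fh))`.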

Reference file

The reference file should be in csv format.

Example: rat.csv

Ighg1,TGTGCCCAGAAACTGTGGAG,1200
Ighg2a,GCCAAGGGAATGCAATCCTTG,1200
Ighg2b,CAAACAACAGCCCCATCTGTCTAT,1200
Ighg2c,AGAACAACAGCCCCATCTGTCTA,1200
Igk,ACCAACTGTATCTATCTTCCCACCATCCAC,600
Igl1,CAACCCAAGGCTACGCCCTC,600
Igl2,CAGCCCAAGTCCACTCCCAC,600

1st field - name of target gene
2nd field - unique sequence of target gene (nucleotide)
3rd field - predicted minimum length of target gene (CDS only, not including UTR)
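The three fields can be read and used roughly as follows. This is a minimal sketch, assuming that a transcript is assigned to a gene when the gene's unique sequence occurs on either strand; how igfinder actually matches sequences and applies the minimum length is not described here, and the names `load_reference`, `find_gene`, and `reverse_complement` are hypothetical.

```python
import csv
from io import StringIO

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def load_reference(handle):
    """Parse reference rows into (gene, unique_sequence, min_length) tuples."""
    refs = []
    for row in csv.reader(handle):
        if len(row) < 3:
            continue  # skip blank or incomplete rows
        gene, anchor, min_len = (field.strip() for field in row[:3])
        refs.append((gene, anchor.upper(), int(min_len)))
    return refs


def find_gene(transcript, refs):
    """Return (gene, strand) for the first reference found in the transcript.

    Assumption for illustration: plain substring search on both strands.
    Returns None when no unique sequence matches.
    """
    seq = transcript.upper()
    rc = reverse_complement(seq)
    for gene, anchor, _min_len in refs:
        if anchor in seq:
            return gene, "+"
        if anchor in rc:
            return gene, "-"
    return None


# Two rows copied from the rat.csv example above.
example_csv = StringIO(
    "Ighg1,TGTGCCCAGAAACTGTGGAG,1200\n"
    "Igk,ACCAACTGTATCTATCTTCCCACCATCCAC,600\n"
)
references = load_reference(example_csv)
print(find_gene("GGGACCAACTGTATCTATCTTCCCACCATCCACTTT", references))
```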

Output file

All output files of igfinder are in fasta format.

Example: final_output.fasta

>Target1
CAGGTTGCCTCCTCAAAATGAAGTTGCCTGTTAGGCTGTTGGTGCTGATGTTC     
TGGATTCCTGCTTCCAGCAGTGATGTTTTGATGACCCAAACTCCACTCTCCCT   
GCCTGTCAGTCTTGGAGATCAAGCCTCCATC 
>Target2   
CCTACATGGAGCTCCGCACCCTGACATCTGAGGACTCTGCCGTCTATTACTGT   
ACAAGTTACGGGGGAGTTTATTGGGGCCAAGGGACTCTGGTCACTGTCTCTGC    
AGCCAAAACAACACCCCCATCAGTCTATCCACTGGCCCCTGGGTGTGGAGATA    
CAACTGGTTCCTCCGTGACTCTGGGATGCCTGGTCAAGGGCTACTTCCCTGAG    
TCAGTGACTGTGACTTGGAACTCTGGATCCCTGTCCAG

Example: AA_output.fasta

>Target1
GCLLKMKLPVRLLVLMFWIPASSSDVLMTQTPLSLPVSLGDQASISCRSSQSI   
VHSNGNTYLEWYLQKPGQSPKLLIYKVSNRFSGVPDRFSGSGTGTDFTLK
>Target2
IAGVQSQVQLQQSGAELVRPGASVTLSCKASGYTFTDYEIHWVKQTPVHGLEW
IGAIDPETGGTAYNQKFKGKATLTADKSSSTAYMELRTLTSEDSAVYYCTSYG
GVYWGQGTLVTVSAAKTTPPSVYPLAPGCGDTTGSSVTLGCLV
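The relation between the two example files can be checked by hand: the first residues of Target1 in AA_output.fasta correspond to the first line of Target1 in final_output.fasta read from the third nucleotide (frame offset 2). The sketch below shows this with the standard genetic code. The `translate` function is a hypothetical stand-in written for illustration; it is not igfinder's code, and how igfinder chooses the frame is not shown here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame=0):
    """Translate a nucleotide string starting at offset `frame` (0, 1 or 2).

    Stop codons become '*'; codons with unknown bases become 'X'.
    Trailing bases that do not fill a codon are ignored.
    """
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )


# First line of Target1 from the final_output.fasta example above.
target1_start = "CAGGTTGCCTCCTCAAAATGAAGTTGCCTGTTAGGCTGTTGGTGCTGATGTTC"
print(translate(target1_start, frame=2))  # GCLLKMKLPVRLLVLMF
```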

Tutorial

This tutorial identifies the sequence of the rat anti-Brg1 antibody produced by the 4E5 hybridoma. The example data are mRNA-seq reads from 4E5.

Prerequisites

Download the fastq files of the 4E5 hybridoma mRNA-seq from NCBI and the rat reference file (rat.csv).

Each of the fastq files contains more than 40M reads. Subsample the reads before de novo assembly, because assembling all reads takes much longer. More than 30,000 reads are enough to recover Igh and Igl/Igk.

To subsample 100,000 reads (each fastq record spans 4 lines, so 400,000 lines):

zcat all-R1.fastq.gz | sed -n 1,400000p > 100000-R1.fastq
zcat all-R2.fastq.gz | sed -n 1,400000p > 100000-R2.fastq
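The same subsampling can be sketched in Python for systems without zcat or sed. Like the commands above, it keeps the first reads of the file and does not sample at random. The function names and the file names in the usage note are placeholders, not part of igfinder.

```python
import gzip
from itertools import islice

LINES_PER_READ = 4  # header, sequence, '+', qualities


def first_reads(lines, n_reads):
    """Return an iterator over the lines of the first `n_reads` fastq records."""
    return islice(lines, n_reads * LINES_PER_READ)


def subsample_fastq_gz(src, dst, n_reads):
    """Write the first `n_reads` records of a gzipped fastq to a plain fastq."""
    with gzip.open(src, "rt") as fin, open(dst, "w") as fout:
        fout.writelines(first_reads(fin, n_reads))
```

For example, `subsample_fastq_gz("all-R1.fastq.gz", "100000-R1.fastq", 100000)` would keep the first 400,000 lines. Apply the same read count to both files of a pair so that the mates stay in step.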

Two steps to identify antibody sequence

Step 1: de novo transcriptome assembly

In the case of Trinity

Trinity --seqType fq --left 100000-R1.fastq(.gz) --right 100000-R2.fastq(.gz) --CPU 2 --max_memory 12G   

When Trinity completes, Trinity.fasta is created as an output file.

Step 2: extraction of Igh and Igl/Igk gene sequences

python igfinder.py -i Trinity.fasta -o 4E5hybridoma -r rat.csv    

The final_output.fasta file in the 4E5hybridoma directory contains the extracted Igh and Igl/Igk nucleotide sequences, and AA_output.fasta contains the IgH and IgL/IgK protein sequences in the correct frame.

final_output.fasta

>Igk_TR109|c0_g1_i1233
GAAGGTCTTTCTCAGGGCTGTGTCATGATCCACATAAACTCGAGGAAAGCCCAAAGATGGT   
GTTCAAATTTCAGATCCTTGGACTTCTGCTTTTCTGGATTTCAGCCTCTAGAGGGGACATC   
GTGCTGACTCAGTCTCCAACCACCCTGTCTGTGACTCCAGGAGAGACAGTCAGTCTCTCCT   
GCAGGGCTAGCCATAGTATTGGCACAAATCTACACTGGTATCAGCAAAAAACAAATGAGTC   
TCCAAGGCTTCTCATCAAGTATTCTTCCCAGTCCATCTCTGGGATCCCCTCCAGGTTCAGT   
GCCAGTGGATCAGGGACAGATTTTACTCTCAACATCAACAATGTGGAGTTTGATGATGTCT   
CAAGTTATTTTTGTCAACAGACTCAAAGCTGGCCCGACACGTTTGGAGCTGGGACCAAGCT   
GGAACTGAAACGGGCTGATGCTGCACCAACTGTATCTATCTTCCCACCATCCACGGAACAG   
TTAGCAACTGGAGGTGCCTCAGTCGTGTGCCTCATGAACAACTTCTATCCCAGAGACATCA   
GTGTCAAGTGGAAGATTGATGGCACTGAACGACGAGATGGTGTCCTGGACAGTGTTACTGA   
TCAGGACAGCAAAGACAGCACGTACAGCATGAGCAGCACCCTCTCGTTGACCAAGGCTGAC   
TATGAAAGTCATAACCTCTATACCTGTGAGGTTGTTCATAAGACATCATCCTCACCCGTCG   
TCAAGAGCTTCAACAGGAATGAGTGTTAGACCCAAAGGTCCTGAGGTGCCACCTGCTCCCC   
AGCTCCTTCCAATCTTCCCTCCTAAGGTCTTGGAGACTTCCCCACAAGCGACCTACCACTG   
TTGCGGTGCTCCAAACCTCCTCCCCACCTCATCCTCCTTCCTTTCCTTGGCTTTGATCATG   
CTAATATTTGGGGAATATTAAATAAAGTGAA
>Ighg2b_rcTR179|c0_g1_i1476
AATGTTTTCTCTACAGTCACTGAATCACAACATCCTCATTATGAAATGCAGGTGGATCATC   
CTCTTCTTGATGGCAGTAGCTACAGGGGTCAACTCAGAAGTCCAGCTGCAGCAATCTGGGC   
CTGAGCTTCAGAGACCCGGGGCCTCAGTCAAGTTGTCGTGCAAGGCTTCTGGCTATACCTT   
TACAGAATACTATATGTACTGGGTGAAGCAGAGGCCTAAACAGGGCCTGGAATTAATAGGA   
AGGATTGATCCTGAAGACGGTAGTACTGATTATGTTGAGAAGTTCAAAAACAAGGCCACAC   
TGACTGCAGATACATCGTCCAACACAGCCTACATGCAACTCAGCAGCCTGACATCTGAGGA   
CACAGCAACCTATTTTTGTGCGGCGGGAACTAGGTGGGGCCAAGGAGTCATGGTCACAGTC   
TCCTCAGCCCAAACAACAGCCCCATCTGTCTATCCACTGGCTCCTGGATGTGGTGATACAA   
CCAGCTCCACGGTGACTCTGGGATGCCTGGTCAAGGGCTATTTCCCTGAGCCAGTCACCGT   
GACCTGGAACTCTGGAGCCCTGTCCAGCGATGTGCACACCTTTCCAGCTGTCCTGCAGTCT   
GGGCTCTACACTCTCACCAGCTCAGTGACCTCCAGCACCTGGCCCAGCCAGACCGTCACCT   
GCAACGTAGCCCACCCGGCCAGCAGCACCAAGGTGGACAAGAAAGTTGAGCGCAGAAATGG   
CGGCATTGGACACAAATGCCCTACATGCCCTACATGTCACAAATGCCCAGTTCCTGAACTC   
TTGGGTGGACCATCTGTCTTCATCTTCCCGCCAAAGCCCAAGGACATCCTCTTGATCTCCC   
AGAACGCCAAGGTCACGTGTGTGGTGGTGGATGTGAGCGAGGAGGAGCCGGACGTCCAGTT   
CAGCTGGTTTGTGAACAACGTAGAAGTACACACAGCTCAGACACAACCCCGTGAGGAGCAG   
TACAACAGCACCTTCAGAGTGGTCAGTGCCCTCCCCATCCAGCACCAGGACTGGATGAGCG   
GCAAGGAGTTCAAATGCAAGGTCAACAACAAAGCCCTCCCAAGCCCCATCGAGAAAACCAT   
CTCAAAACCCAAAGGGCTAGTCAGAAAACCACAGGTATACGTCATGGGTCCACCGACAGAG   
CAGTTGACTGAGCAAACGGTCAGTTTGACCTGCTTGACCTCAGGCTTCCTCCCTAACGACA   
TCGGTGTGGAGTGGACCAGCAACGGGCATATAGAAAAGAACTACAAGAACACCGAGCCAGT   
GATGGACTCTGACGGTTCTTTCTTCATGTACAGCAAGCTCAATGTGGAAAGGAGCAGGTGG      
GATAGCAGAGCGCCCTTCGTCTGCTCCGTGGTCCACGAGGGTCTGCACAATCACCACGTGG   
AGAAGAGCATCTCCCGGCCTCCGGGTAAATGAGCACGGCACCCAGAAAGCTCTCAGGTCCT   
AAGGGACACTGACACCCATCTCCACCCTTCCCTTGTGTAAATAAAGCACCCAGCACTGCCC   
TGGGACCCTGCT

AA_output.fasta

>Igk_TR109|c0_g1_i1233
RSFSGLCHDPHKLEESPKMVFKFQILGLLLFWISASRGDIVLTQSPTTLSVTPGETVSLSC   
RASHSIGTNLHWYQQKTNESPRLLIKYSSQSISGIPSRFSASGSGTDFTLNINNVEFDDVS   
SYFCQQTQSWPDTFGAGTKLELKRADAAPTVSIFPPSTEQLATGGASVVCLMNNFYPRDIS   
VKWKIDGTERRDGVLDSVTDQDSKDSTYSMSSTLSLTKADYESHNLYTCEVVHKTSSSPVV   
KSFNRNEC*TQRS*GATCSPAPSNLPS*GLGDFPTSDLPLLRCSKPPPHLILLPFLGFDHA   
NIWGILNKV
>Ighg2b_rcTR179|c0_g1_i1476
MFSLQSLNHNILIMKCRWIILFLMAVATGVNSEVQLQQSGPELQRPGASVKLSCKASGYTF   
TEYYMYWVKQRPKQGLELIGRIDPEDGSTDYVEKFKNKATLTADTSSNTAYMQLSSLTSED   
TATYFCAAGTRWGQGVMVTVSSAQTTAPSVYPLAPGCGDTTSSTVTLGCLVKGYFPEPVTV   
TWNSGALSSDVHTFPAVLQSGLYTLTSSVTSSTWPSQTVTCNVAHPASSTKVDKKVERRNG   
GIGHKCPTCPTCHKCPVPELLGGPSVFIFPPKPKDILLISQNAKVTCVVVDVSEEEPDVQF   
SWFVNNVEVHTAQTQPREEQYNSTFRVVSALPIQHQDWMSGKEFKCKVNNKALPSPIEKTI   
SKPKGLVRKPQVYVMGPPTEQLTEQTVSLTCLTSGFLPNDIGVEWTSNGHIEKNYKNTEPV   
MDSDGSFFMYSKLNVERSRWDSRAPFVCSVVHEGLHNHHVEKSISRPPGK*ARHPESSQVL   
RDTDTHLHPSLV*IKHPALPWDPA

Please check the ORFs (open reading frames) in the identified sequences with ApE. You should find ORFs of around 1400 bp and 700 bp coding for Igh and Igl/Igk, respectively.
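For a quick scripted check alongside a sequence viewer, the sketch below works on a protein string from AA_output.fasta, where `*` marks a stop codon as in the listings above. It reports the longest stretch that starts at an M and ends at a stop. Both function names are hypothetical, and the rule "first M in each stop-free segment" is an assumption made for illustration.

```python
def longest_complete_orf(protein):
    """Return the longest M-initiated stretch that is followed by a stop ('*').

    The text after the last '*' is ignored because it has no stop codon.
    Returns an empty string when no complete ORF is found.
    """
    best = ""
    segments = protein.split("*")
    for segment in segments[:-1]:  # every segment except the last ends at a stop
        start = segment.find("M")
        if start != -1 and len(segment) - start > len(best):
            best = segment[start:]
    return best


def orf_length_bp(orf):
    """Length in base pairs of an ORF given as protein, counting the stop codon."""
    return 3 * (len(orf) + 1)


# Toy example; paste a record from AA_output.fasta in its place.
toy_protein = "RSFSKMVFKF*TQRS*GAMCCCCCCCC"
orf = longest_complete_orf(toy_protein)
print(orf, orf_length_bp(orf))
```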