The workflow used for gene-specific qPCR primer design
- Step 1 Download the coding gene, mNRA and genome sequence for target species for which complete genome sequences have been assembled, annotated and published. It is recommended to retrieve above-mentioned sequences from integrative databases, such as NCBI (https://www.ncbi.nlm.nih.gov), Ensembl (http://www.ensembl.org), USCS (http://genome.ucsc.edu), Phytozome (https://phytozome.jgi.doe.gov/pz/portal.html) and Gramene (http://www.gramene.org). Researchers also could obtain required sequences from specific genome databases or prepared their own de novo assembled transcriptome sequence for gene-specific qPCR primer design. For each gene, only the longest coding/mRNA sequence for each gene was retained as representative coding/mNRA sequence, all duplicated coding/mNRA sequences will be removed.
- Step 2 To design as many qPCR primers as possible for each gene, all sequences retained in Step 1 are shredded into a series of short, overlapping template fragments using BBMap (http://jgi.doe.gov/data-and-tools/bbtools) with the specified window size and step size currently set to 300 bp and 50 bp, respectively. The window and step size can be adjusted for organisms with complex genomes, but this will dramatically increase the computation time.
Step 3 Step 3 performs automatic primer design for all of the shredded template fragments using Primer3 with the following parameters: amplicon size 80–300 bp; amplicon GC content 40–60%, with an optimal GC content of 50%; primer length 18–28 nt, with an optimal length of 22 nt; and melting temperature (Tm) 58–64ºC, with an optimal Tm of 60ºC and maximum Tm difference per primer pair of less than 3ºC. For each gene, duplicated candidate primer pairs generated at neighboring windows are removed.
Step 4 The output file of Primer3 were transformed into ePCR input format using an in-house perl script. Then, ePCR was employed for the first-round of specificity check to all designed primer pairs. All primers with one mismatch or indel against any off-target sequence are filtered out. The remaining candidate primer pairs are used for the next round of specificity validation.
Step 5 Step 5 indexes the databases for the genome sequence and coding/mRNA sequence of the target organism based on the k-mer index algorithm using the thermodynamics-based gene-specificity checking program MFEprimer-2.0. By default, the k-mer is set to 9, which is sufficient for primer specificity checking; this value can be smaller if more stringent conditions are desired, but much more time will be needed for computation. The specificity of the qPCR primer pairs against genomic sequence and coding/mRNA sequence databases is then estimated, and related parameters such as the binding site in the sequence, the size and GC content of the amplicon, and the Tm and Gibbs free energy (ΔG) of the forward and reverse primers can be saved.
Step 6 The primer pairs for each gene are divided into three levels based on PPC and the binding stability of the binding site, ΔG, as follows:
Standard for selection of level 1 primer pairs: the binding ability index, primer pair coverage (PPC) = 10. The primer pairs will be removed if their PPCs against any non-specific amplicons are higher than 10.
Standard for selection of level 2 primer pairs: the binding stability index ΔG < −9 kcal/mol. The primer pairs will be filtered out if the ΔG of binding of either the forward or reverse primer to off-target sequences is less than −9 kcal/mol.
Standard for selection of level 3 primer pairs: the binding stability index Gibbs free energy (ΔG) < −11 kcal/mol. The primer pairs will be filtered out if the ΔG of binding of either the forward or reverse primer to off-target sequences is less than −11 kcal/mol.
In addition, all qualified primer pairs are further divided into two categories: the best primers and alternative primers. The former category includes only one best, unique primer pair with the lowest total ΔG of the pair, which is the gene-specific qPCR primer pair recommended for the detection of gene expression. The latter category includes all alternative gene-specific qPCR primer pairs.
Step 7 Step 7 carries out a BLAT search using each pair of primers against the genomic sequence. The number of exons spanned and the genomic position of each primer and amplicon can be determined to select primers spanning an exon-intron boundary, and this information can be used to avoid amplifying contaminating genomic DNA.
After computation, our workflow generated a total of 51,091,785 gene-specific qPCR primer pairs, corresponding to 93.44% of the average coverage ratios per organism and an average of 15.33 primer pairs per gene. In human and mouse, more than 19 and 22 primer pairs per gene were obtained, with a coverage ratio of 78.83 and 90.32%, respectively, indicating that the results of our workflow are in accordance with those in MRPrimerW and PrimerBank. Moreover, it is more difficult to design qPCR primers for plants than for animals. Using the same workflow, the coverage ratios decreased from 93.80% in animals to 93.08% in plants, and the number of primer pairs per gene significantly dropped from 17.20 to 12.55 (permutation test P-value < 0.001). This reduction might have been due to the difficulty in designing qPCR primers for polyploid plants, such as oilseed rape, soybean (Glycine max), potato (Solanum tuberosum), and tomato (Solanum lycopersicum). Hence, our database will greatly facilitate qPCR experiments using organisms with complex genomes.
qPrimerDB is implemented using PostgreSQL 9.6 (http://www.postgresql.org), PHP 5.5 (http://www.php.net), Apache Web Server 2.4 (http://www.apache.org), and BioPerl 1.6 (http://bioperl.org) on a Linux CentOS 6.8 operating system. Our database is supported by integration with the Chado database schema and Drupal, a popular Content Management System (CMS). There are four layers in the qPrimerDB architecture, as described for SilkPathDB. In the core layer, all qPrimerDB data are stored within the Chado schema and maintained using the PostgreSQL. The second layer governs communications between the core layer and the outer layers, as a configurable layer. The third layer invokes background programs to process and respond to queries, and the Drupal CMS is used for website content management. The outermost layer is a user interface layer that allows the users to directly interact with the database. To provide a friendly interface for users accessing the database from desktop and mobile devices, the Bootstrap framework (http://getboot strap.com) is employed. PHP and BioPerl scripts are used to generate primer index and search results pages by retrieving primer information for query genes and organisms provided by the user. The BLAST database is built with NCBI BLAST+ 2.6.0, with the multiple database search function. Finally, jQWidgets (http://www.jqwidgets.com) is used to organize and present the feature index, search, and BLAST results.