Abstract

Camellia nitidissima Chi (CNC), a species of golden Camellia, is well known as “the queen of camellias.” It is an ornamental, medicinal, and edible plant grown in China. In this study, we conducted a genome survey sequencing analysis and simple sequence repeat (SSR) identification of CNC using the Illumina sequencing platform. The 21-mer analysis predicted its genome size to be 2,778.82 Mb, with heterozygosity and repetition rates of 1.42% and 65.27%, respectively. The CNC genome sequences were assembled into 9,399,197 scaffolds, covering ∼2,910 Mb with a scaffold N50 of 869 bp. Its genomic characteristics were found to be similar to those of Camellia oleifera. In addition, 1,940,616 SSRs were identified from the genome data, including mono- (61.85%), di- (28.71%), tri- (6.51%), tetra- (1.85%), penta- (0.57%), and hexanucleotide (0.51%) motifs. We believe these data will provide a useful foundation for the development of novel molecular markers for CNC as well as for further whole-genome sequencing of CNC.

1. Introduction

Camellia nitidissima Chi (CNC), a species of golden Camellia, is well known as “the queen of camellias” [1, 2]. It is largely grown in Guangxi Province, China, and has been introduced into Fujian Province, China. C. nitidissima is a well-known ornamental plant because of its golden yellow flowers [2], which contain several flavonoids and polyphenols [3]. In addition, C. nitidissima is a well-known medicinal and edible plant in China [4]. The leaves and flowers of CNC have antioxidant and antimicrobial activities [1, 5–7] and are used as pancreatic lipase inhibitors [8] and potential anticancer drugs for gastric and colon cancers [9, 10].

Simple sequence repeats (SSRs), also known as microsatellites, are stretches of DNA consisting of tandemly repeated short units, 1‒6 base pairs (bp) in length [11], which have been identified and characterized in the genus Camellia. In the last 15 years, several SSR markers have been developed from microRNA (miRNA), mRNA, genome, and chloroplast sequences to study the genetic variation and population structure of different species in the genus Camellia [12–41], such as C. sinensis, C. osmanthus, C. vietnamensis, C. gauchowensis, C. huana, C. sasanqua, C. oleifera, C. japonica, and C. reticulata. In the last three years, SSR markers in the genus Camellia have emerged as a highly interesting research topic, with at least 14 studies on SSR markers [28–41], including genome-wide SSR markers, SSR identification in single resistance genes, gene families, and whole sets of transcription factors, and the development of SSR databases. For example, an SSR marker was used as a molecular marker to tag the blister blight disease-resistance trait of C. sinensis [29, 35]. Similarly, 72 SSR loci were detected in 14 and 15 phospholipase D gene families of C. sinensis for marker-assisted selection of resistance genes [37]. In addition, 3,687 SSR loci were identified from 2,776 transcription factor gene transcripts for potential use in trait dissection [40]. TeaMiD, a database of SSR markers of C. sinensis, includes 935,547 SSRs [41].

However, only 15 polymorphic microsatellite loci have been isolated and characterized from C. nitidissima [42]. Genome-wide SSR markers of C. nitidissima have not been identified because of a lack of genome sequences. Therefore, it is necessary to estimate the genome size and identify genome-wide SSRs in C. nitidissima using next-generation sequencing (NGS), which will be useful for further whole-genome sequencing and assessing genetic diversity within and among populations.

2. Materials and Methods

2.1. Plant Materials

CNC was obtained from Longyan City, Fujian Province, China. The leaf tissue was immediately collected from CNC, washed in sterile phosphate-buffered saline (PBS), frozen in liquid nitrogen, and stored at −80°C for further analysis.

2.2. DNA Extraction and Genome Sequencing

The total DNA of CNC was isolated using the cetyltrimethylammonium bromide (CTAB) DNA extraction protocol [43, 44]. The purity and concentration of the obtained gDNA were tested using a NanoPhotometer® spectrophotometer (Implen, CA, USA) and a Qubit® 2.0 fluorometer (Life Technologies, CA, USA), respectively [45]. Sequencing libraries for the quality-checked gDNA were generated using a TrueLib DNA Library Rapid Prep Kit for Illumina sequencing (Illumina, Inc., CA, USA) [45]. The libraries were subjected to size distribution analysis using an Agilent 2100 bioanalyzer (Agilent Technologies, Inc., CA, USA), followed by a real-time PCR quantitative test [45]. The successfully generated libraries were sequenced using an Illumina NovaSeq 6000 platform (Illumina, Inc., CA, USA), generating 150-bp paired-end reads with an insert size of approximately 350 bp [45].

2.3. DNA Data Cleaning and Genome Assessment

The obtained raw reads were filtered to obtain clean reads using Trimmomatic version 0.36 (https://www.usadellab.org/cms/index.php?page=trimmomatic) [46]. The quality control (QC) standards of reads from DNA were as follows:
(1) Trimming adapter sequences,
(2) Trimming low-quality or N bases (below quality 3) at the front of the reads,
(3) Trimming low-quality or N bases (below quality 3) at the tail of the reads,
(4) Scanning the read with a 4-base-wide sliding window, cutting when the average quality per base drops below 15,
(5) Removing reads with <51 bases.
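As an illustration only, QC rules (2) to (5) can be sketched in plain Python on the quality scores of a single read. This is a simplified sketch, not Trimmomatic's own implementation: adapter trimming (rule 1) is omitted, and the function name `trim_read` and its parameters are hypothetical names chosen for this example.

```python
def trim_read(quals, leading=3, trailing=3, window=4, min_avg=15, min_len=51):
    """Return (start, end) of the kept region, or None if the read is dropped.

    quals: list of per-base Phred quality scores for one read.
    All names and defaults are illustrative; they mirror rules (2)-(5) only.
    """
    start, end = 0, len(quals)
    # Rule 2: drop bases at the front whose quality is below the threshold.
    while start < end and quals[start] < leading:
        start += 1
    # Rule 3: drop bases at the tail whose quality is below the threshold.
    while end > start and quals[end - 1] < trailing:
        end -= 1
    # Rule 4: slide a fixed-width window and cut at the first window
    # whose mean quality falls below min_avg.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            end = i
            break
    # Rule 5: discard reads that end up shorter than min_len.
    if end - start < min_len:
        return None
    return start, end
```

For example, a read with two very low-quality bases at the front and one at the tail keeps only its high-quality core, whereas a 40-base read is discarded regardless of quality.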

To assess contamination from other species, 20,000 reads (10,000 reads from read 1 and 10,000 reads from read 2) were randomly selected from the resulting high-quality clean reads and searched against the NCBI nonredundant nucleotide sequence (NT) database using the blastn software version 2.2.28 (https://blast.ncbi.nlm.nih.gov/Blast.cgi) [47, 48], with an E-value threshold of 1 × 10⁻⁵.

The resulting high-quality clean reads from DNA sequencing were subjected to K-mer analysis using Jellyfish version 2.3.0 (https://genome.umd.edu/jellyfish.html) [49], counting only canonical K-mers (−C) with K-mer sizes (−m) of 19, 21, and 23. Genome size, heterozygosity ratio, read duplication ratio, and read error ratio were estimated using GenomeScope version 2.0 (https://qb.cshl.edu/genomescope/) [50] with R version 4.1.3. The repeat rate was estimated as the percentage of K-mers with a depth greater than 1.8 times the main peak depth relative to the total number of K-mers.
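The two ideas in this paragraph, canonical K-mer counting and the 1.8-fold repeat-rate rule, can be illustrated with a minimal Python sketch. It is not how Jellyfish or GenomeScope work internally, and it assumes that "number of K-mers" means total K-mer occurrences (depth multiplied by the number of distinct K-mers at that depth); all function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a K-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)

def count_canonical_kmers(reads, k):
    """Count canonical K-mers over all reads, skipping K-mers that contain N."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[canonical(kmer)] += 1
    return counts

def depth_histogram(counts):
    """Map depth -> number of distinct K-mers observed at that depth."""
    return Counter(counts.values())

def repeat_rate(histogram, main_peak_depth, fold=1.8):
    """Percentage of K-mer occurrences with depth above fold * main peak depth.

    Assumption: occurrences (depth * distinct count) are used for both the
    numerator and the denominator.
    """
    total = sum(depth * n for depth, n in histogram.items())
    repeated = sum(depth * n for depth, n in histogram.items()
                   if depth > fold * main_peak_depth)
    return 100.0 * repeated / total
```

With a main peak depth of 101, as reported for the 21-mer analysis, the threshold in this sketch would be a depth of 181.8.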

2.4. Genome Assembly, GC Content Analysis, SSRs Identification, and Primer Design

The CNC genome was assembled using SOAPdenovo2 version 2.40 (https://github.com/aquaskyline/SOAPdenovo2) [51] with a K-mer value of 51 and other default settings. The GC content was calculated using contigs longer than 500 bp. SSRs were identified using MISA version 2.1 [11] with default parameters (SSR pattern as unit size‒minimum number of repeats: 1‒10, 2‒6, 3‒5, 4‒5, 5‒5, and 6‒5; the maximum length of sequence between two SSRs to register as a compound SSR was 100 bp). Primer pairs were designed using Primer3 version 2.6.1 [52] and were selected to meet the following criteria: the expected PCR product size ranged from 100 to 280 bp; primer length ranged from 18 to 23 bp (optimum length: 20 bp); primer melting temperature ranged from 57.0 to 60°C (optimum temperature: 5°C); and primer GC content ranged from 40 to 70%.
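To make the SSR pattern thresholds concrete, the following Python sketch scans a sequence left to right and reports perfect repeats that meet the unit size and minimum repeat settings listed above. It is a simplified illustration rather than a reimplementation of MISA: compound SSRs (two SSRs within 100 bp) are not merged, and the names `MIN_REPEATS`, `is_primitive`, and `find_ssrs` are hypothetical.

```python
# Unit size -> minimum number of repeats, as in the SSR pattern above.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % d == 0 and motif == motif[:d] * (n // d)
                   for d in range(1, n))

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return a list of (start, end, motif, repeats) for perfect SSRs.

    Greedy left-to-right scan; shorter unit sizes are tried first and a
    reported SSR is skipped over before the scan continues.
    """
    seq = seq.upper()
    hits = []
    i = 0
    while i < len(seq):
        for unit_len, min_rep in sorted(min_repeats.items()):
            motif = seq[i:i + unit_len]
            if (len(motif) < unit_len or set(motif) - set("ACGT")
                    or not is_primitive(motif)):
                continue
            repeats = 1
            while seq.startswith(motif, i + repeats * unit_len):
                repeats += 1
            if repeats >= min_rep:
                hits.append((i, i + repeats * unit_len, motif, repeats))
                i += repeats * unit_len - 1  # jump to the last base of the SSR
                break
        i += 1
    return hits
```

In this sketch, `(AT)6` and `(A)10` are reported, whereas `(AT)5` falls below the dinucleotide threshold and is ignored.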

3. Results

3.1. Sequencing and QC of CNC

Approximately 343.06 Gb of high-quality, clean reads were obtained using the Trimmomatic software [46] from approximately 382.21 Gb of raw reads generated on the Illumina NovaSeq platform for the CNC genome survey (Table 1). The Q20, Q30, and GC content values of the clean reads were 95.67%, 89.52%, and 37%, respectively. The top six species matched by the 20,000 randomly selected clean reads in the NT database were C. sinensis (2.26%), C. taliensis (0.17%), Vitis vinifera (0.11%), Helianthus maximiliani (0.05%), C. yunnanensis (0.05%), and C. pitardii (0.03%), indicating that there was no contamination from other species.

3.2. Genome Assessment

We estimated the CNC genome size using K-mer values of 19, 21, and 23 (Table 2). Following the recommendation to use 21-mers [50], the CNC genome size and K-mer depth were 2,778,823,868 bp and 101, respectively (Figure 1). The error and duplication rates of the reads were 0.248% and 0.706%, respectively. The heterozygosity and repeat rates of the sequences were 1.42% and 65.27%, respectively. The heterozygous peak was located at a K-mer frequency of 50, about half the main peak depth. Together, these results indicate that the CNC genome has high heterozygosity (heterozygosity rate ≥0.8%) and high repetition (repetition rate ≥50%).

3.3. Genome Assembly and GC Content Analysis

The clean reads were assembled into 9,994,482 contigs and 9,399,197 scaffolds using the SOAPdenovo software with a K-mer value of 51 (Table 3). The total lengths of the contigs and scaffolds were 2,844,296,380 and 2,910,885,755 bp, respectively. Among the significant peaks of the CNC contig distribution (Figure 2), the peak located at half the position of the main peak was the heterozygous peak [44], which further supported the high heterozygosity of the CNC genome. Because of the high heterozygosity, the assembled haploid genome was larger than predicted. The maximum lengths of the contigs and scaffolds were 73,907 bp and 88,303 bp, respectively. The N50 lengths of the contigs and scaffolds were 649 bp and 869 bp, respectively. The GC contents of the contigs and scaffolds were 36.00% and 34.00%, respectively. The GC content of the scaffolds was lower than that of the contigs owing to the presence of N bases. The GC depth analysis (Figure 3) indicated that the GC content of the windows was mostly concentrated in the range of 20‒60% and did not show any apparent abnormalities or GC bias [44]. The GC depth distribution was divided into two layers, which indicated the high heterozygosity of the CNC genome.
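For readers unfamiliar with the assembly statistics reported here, the following Python sketch shows one common way to compute N50 and GC content. It is illustrative only and assumes that GC content is computed over the full sequence length including N bases, which is one way gap-filling N bases in scaffolds can lower the reported value; the function names are hypothetical.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequences given")

def gc_content(seq):
    """GC percentage over the full sequence length (assumption: N bases count
    toward the denominator, so gaps dilute the GC percentage)."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
```

Under this assumption, a contig `GCAT` has a GC content of 50%, while the same bases joined to four N bases in a scaffold give 25%.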

3.4. SSR Identification

A total of 1,940,616 SSRs were identified from 1,026,855 scaffolds in the CNC genome, including 346,619 SSRs involved in compound formation. In total, 332,308 scaffolds contained more than one SSR. The largest group of motifs was mononucleotide repeats (1,200,317 motifs, 61.85%). This was followed by dinucleotide (557,218 motifs, 28.71%), trinucleotide (126,286 motifs, 6.51%), tetranucleotide (35,890 motifs, 1.85%), pentanucleotide (10,975 motifs, 0.57%), and hexanucleotide (9,930 motifs, 0.51%) repeats. With an increase in the repeat motif length, the number of SSRs decreased. Among the mononucleotides (Table 4), A/T repeats were the predominant type (1,174,392 motifs, 97.84%). Among the dinucleotides (Table 5), AG/CT (277,157 motifs, 49.74%) and AT/AT repeats (228,679 motifs, 41.04%) were dominant, followed by AC/GT repeats (49,972 motifs, 8.97%), whereas CG/GC repeats (1,410 motifs, 0.25%) were the least frequent. Among the trinucleotides (Table 6), the most frequent motif was AAT/ATT repeats (47,924 motifs, 37.95%), followed by AAG/CTT (26,511 motifs, 20.99%) and ACC/GGT (22,235 motifs, 17.61%) repeats. ACG/CGT repeats (725 motifs, 0.57%) were the least frequent trinucleotide motifs. The most frequent tetra-, penta-, and hexanucleotide SSR repeats were AAAT/ATTT (23,406 motifs, 65.22%), AAAAT/ATTTT (2,951 motifs, 26.89%), and AAAAAT/ATTTTT (1,187 motifs, 11.95%), respectively (Tables 7–9). To provide more information for SSR primer verification in future research, primers were designed for 49,046 SSRs (tri- and tetranucleotide). Primer information is presented in Supplementary Table 1.
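Motif labels such as AG/CT and AAT/ATT group together every rotation of a repeat unit and of its reverse complement, because the same SSR can be read from either strand and from any starting base. A minimal Python sketch of this grouping, under that assumption and with hypothetical function names, is:

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def rotations(motif):
    """All cyclic rotations of a repeat unit, e.g. 'GA' -> ['GA', 'AG']."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]

def motif_class(motif):
    """Strand- and rotation-independent label for a repeat unit.

    The label pairs the smallest rotation of the motif with the smallest
    rotation of its reverse complement, in alphabetical order.
    """
    motif = motif.upper()
    revcomp = motif.translate(_COMPLEMENT)[::-1]
    forward = min(rotations(motif))
    reverse = min(rotations(revcomp))
    first, second = sorted((forward, reverse))
    return f"{first}/{second}"
```

With this grouping, the repeat units GA, AG, TC, and CT all map to the class AG/CT, so counts per class can be tallied with, for example, `collections.Counter`.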

4. Discussion

In the genus Camellia, the genomes of C. sinensis and C. oleifera have been sequenced and assembled [53, 54]. The genome size of C. sinensis ranged from 3,062.62 Mb (C. sinensis var. assamica) to 3,113.46 Mb (C. sinensis isolate G240). The CNC genome size was close to that of C. oleifera, which was 2,889.51 Mb [54]. However, it was smaller than that of C. sinensis. The GC content of C. oleifera was 34.5189% [54]. The median GC content of C. sinensis was 38.5319% in the NCBI genome database. The GC content of CNC was close to that of C. oleifera but lower than that of C. sinensis. These results suggest that C. oleifera is phylogenetically closer to CNC than C. sinensis is, which is consistent with previous studies [55]. The genome assembly strategies used for other species in the genus Camellia, such as Illumina combined with PacBio (or Oxford Nanopore Technologies) and Hi-C-based assembly, can be applied to CNC, and genome assembly should be as difficult as that of C. oleifera but less difficult than that of C. sinensis. Genome size estimation using NGS becomes more difficult in cases of high heterozygosity and high duplication; the estimate can be further verified by measuring the constant value (C-value) using flow cytometry. SSR motifs containing A or T were more abundant than those containing C or G, and their characteristics and distributions were similar to those reported in previous studies on C. sinensis [41]. Further validation studies of SSR markers are needed for the CNC population.

In the current study, the whole genome of CNC was sequenced using NGS for the first time, which will play an important role in future whole-genome sequencing projects. Statistical analysis of the differences in the quantity and motifs of SSRs provided a foundation for the further construction of high-density genetic maps of CNC. The wild CNC is an endangered plant in China. Therefore, the CNC genome survey will have important ecological significance.

5. Conclusions

In the present study, an approximate genome size of 2,778.82 Mb of CNC was estimated using the 21-mer analysis, with heterozygosity and repetition rates of 1.42% and 65.27%, respectively. The results showed that the genomic characteristics of CNC were similar to those of C. oleifera. In total, 1,940,616 SSRs were identified in the genome data. We believe these results will provide meaningful data for conducting further genomic studies and a useful basis for the development of novel molecular markers. To obtain a chromosome-level genome assembly, state-of-the-art techniques, such as Illumina combined with PacBio HiFi and Hi-C-based assembly, will need to be applied.

Data Availability

The following information was supplied regarding the deposition of DNA sequences: the raw data can be obtained from the Sequence Read Archive at NCBI under accession number SRR19315149. The associated BioProject and BioSample numbers are PRJNA839723 and SAMN28548419, respectively.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The DNA-seq in this study was supported by Novogene Co., Ltd. This work was supported by Discipline and Master’s Site Construction Project of Guiyang University by Guiyang City Financial Support Guiyang University (KJY-2020), Science And Technology Support Program (Soft Science) Research Project Key Project (QKHZC[2018]20102; QKHZC[2019]20027H), Young Sci-Tech Talents Growth Program from the Department of Education of Guizhou Province under grant number QJHKYZ[2020]086, and Guizhou Fundamental Research Program (Natural Science Project) under grant number QianKeHeJiChu-ZK[2022]YiBan006.

Supplementary Materials

Supplementary Table 1. SSR primer pairs of the CNC genome.