Pan‐genome: A promising resource for noncoding RNA discovery in plants

Plant genomes contain both protein‐coding and noncoding sequences including transposable elements (TEs) and noncoding RNAs (ncRNAs). The ncRNAs are recognized as important elements that play fundamental roles in the structural organization and function of plant genomes. Despite various hypotheses, TEs are believed to be a major precursor of ncRNAs. Transposable elements are also prime factors that cause genomic variation among members of a species. Hence, TEs pose a major challenge in the discovery and analysis of ncRNAs. With the increase in the number of sequenced plant genomes, it is now accepted that a single reference genome is insufficient to represent the complete genomic diversity and contents of a species, and exploring the pan‐genome of a species is critical. In this review, we summarize the recent progress in the field of plant pan‐genomes. We also discuss TEs and their roles in ncRNA biogenesis and present our perspectives on the application of pan‐genomes for the discovery of ncRNAs to fully explore and exploit their biological roles in plants.


INTRODUCTION
In plants, a major portion of the genome is developmentally transcribed to produce ncRNAs, which play a crucial role in regulating many critical processes (Hou et al., 2019). The ncRNAs involved in regulatory processes are generally categorized into short ncRNAs, such as small interfering RNAs (siRNAs) and microRNAs (miRNAs), long ncRNAs (lncRNAs), and circular RNAs (circRNAs) (Wei, Huang, Yang, & Kang, 2017). Among the short ncRNAs, siRNAs (20-25 bp) play important roles in several processes including plant defense and stress responses (Borges et al., 2018; Martinez et al., 2018; Zhang et al., 2016), whereas miRNAs (∼22 nt) participate in a variety of crucial biological processes including metabolism, development, stress response, transcription factor regulation, and antisilencing (Bartel, 2004; Fahlgren et al., 2007; Li, Li, Xia, & Jin, 2011; Voinnet, 2009). By contrast, lncRNAs (usually >200 bp long) are involved in a variety of cellular functions and gene regulation or silencing pathways (Wierzbicki, Haag, & Pikaard, 2008). Additionally, circRNAs, new members of the ncRNA family, are ubiquitously expressed in plant genomes and play important roles in processes such as protein binding, miRNA binding, and transcriptional regulation (Zhao, Chu, & Jiao, 2019). Although a large number of ncRNAs have been identified to date, their true biological functions, mechanisms of origin, and role in genomic diversity remain largely unknown.
Advances in high-throughput sequencing technologies over the last decade have led to innovative approaches for analyzing plant genome content, diversity, and evolution. Previously, genomic studies were commonly focused on a single reference genome using standard approaches, such as the low-throughput and expensive Sanger sequencing method, with limited applications (Schmid, Ramos-Onsins, Ringys-Beckstein, Weisshaar, & Mitchell-Olds, 2005; Zhang & Hewitt, 2003). With the advent of next-generation sequencing (NGS) technologies, the focus of researchers has shifted from single-genome analysis to multiple-genome analysis and population studies (Redon et al., 2006). Ever since the first plant genome sequence became available (The Arabidopsis Genome Initiative, 2000), comparative genomic studies have largely focused on single nucleotide polymorphisms (SNPs) in different plant species (Gore et al., 2009; Lai et al., 2015; McNally et al., 2009). A general consensus of the current research is that a single reference genome does not adequately represent the complete genetic content and diversity of a species because of the presence of structural variations (SVs) and TEs, which significantly modify the genetic makeup of different individuals within a species (Saxena, Edwards, & Varshney, 2014). The SVs mainly include presence/absence variants (PAVs) and copy number variants (CNVs) (Supplemental Figure S1). Presence/absence variants are specific sequences or genes that are variably present in individuals of a species and result in extreme structural and phenotypic variations among members of a species (Saxena et al., 2014; Wendel, Jackson, Meyers, & Wing, 2016). Copy number variants are specific sequences or genes present in different copy numbers among members of a species; these varied copy numbers are caused by deletions, insertions, or duplications (Saxena et al., 2014; Scherer et al., 2007). The presence of these SVs in plant genomes and their impact on important agronomic traits of plant species have been widely documented (Supplemental Table S1). Several comparative genomic studies in plants indicate that TEs are one of the key factors responsible for extensive variations in both genic and intergenic regions within a species or among closely related species (Morgante, De Paoli, & Radovic, 2007). Brunner, Fengler, Morgante, Tingey, and Rafalski (2005) compared four allelic genomic regions between maize (Zea mays L.) inbred lines, B73 and Mo17, and revealed that >50% of the compared sequence differs between these two inbred lines mainly because of TE insertions. Subsequently, Anderson et al. (2019) compared uniformly annotated genome sequences of two more maize genotypes (W22 and PH207) with B73 and Mo17 reference genomes and found that TEs are located within or near 78% of all genes in all four maize genotypes [the authors also identified that 78% of the variable TEs are missing in at least one of the maize genotypes (Anderson et al., 2019)]. Similar genomic diversity has also been observed in other plant species (Golicz, Batley, & Edwards, 2016; Morgante et al., 2007; Tao, Zhao, Mace, Henry, & Jordan, 2018; Tranchant-Dubreuil, Rouard, & Sabot, 2019). Additionally, recent studies showed that TEs are the prime source of ncRNAs (Hou et al., 2019). Therefore, characterizing ncRNAs to explore their roles in improving crop productivity requires a better understanding of the mobility of TEs and their involvement in genic and nongenic variations. Knowledge of the TEs underlying genetic variation may provide resources for the identification and functional characterization of ncRNAs.
Analysis of pan-genomes is critical for analyzing the complete genomic contents of any given species and for increasing the efficiency of ncRNA discovery. In this review, we focus mainly on the recent progress in plant pan-genomics, the different approaches available for pan-genome analysis, and the need for pan-genomes for ncRNA discovery. We also describe the origin of ncRNAs from TEs and how pan-genomes could facilitate ncRNA analyses.

Core Ideas
• ncRNAs play important regulatory functions in plant development.
• Recent studies revealed that ncRNAs have evolved from TEs.
• TEs create genomic variation not captured by a single reference genome, affecting ncRNA analyses.
• Pan-genomes can capture the entire variation and help to explore the functions of ncRNAs.

THE PAN-GENOME PERSPECTIVE
To study phenotypic and genomic variations in a given species and to capture their entire genomic contents, it is vital to construct a pan-genome (Tahir Ul Qamar, Zhu, Xing, & Chen, 2019; Tao et al., 2018). The concept of a pan-genome was first introduced by Tettelin et al. (2005), who constructed the pan-genome of a bacterium, Streptococcus agalactiae. Subsequently, definitions and objectives of the pan-genome were modified and interpreted differently by various research groups (Alcaraz et al., 2010; Carlos Guimaraes et al., 2015; Plissonneau, Hartmann, & Croll, 2018; Rasko et al., 2008; Snipen, Almoy, & Ussery, 2009; Tetz, 2005), leading to the notion that a pan-genome can be either sequence-based or gene-based (Golicz, Bayer, Bhalla, Batley, & Edwards, 2020). A sequence-based pan-genome refers to a complete collection of nonredundant sequences within members of a species. The advantage of a sequence-based pan-genome is that it captures genic as well as nongenic sequences. On the other hand, a gene-based pan-genome contains a complete set of genes or orthologous gene families within members of a species (Hubner et al., 2019; Sun et al., 2016; Tranchant-Dubreuil et al., 2019; Wang et al., 2018). The pan-genome is further divided into the core genome and the variable genome (Golicz et al., 2020; Khan et al., 2019). The core genome refers to a common set of sequences or genes present in all individuals of a species and is described as a minimal genome required for an individual to survive and perform basic functions (Gordon et al., 2017; Segerman, 2012; Tranchant-Dubreuil et al., 2019; Wang et al., 2018). The variable genome (also known as an accessory, dispensable, or shell genome) is a collection of sequences or genes present in only some members of a species (Gordon et al., 2017; Li et al., 2014; Segerman, 2012; Vernikos, Medini, Riley, & Tettelin, 2015) (Supplemental Figure S2a). Genes present in the variable genome of an individual are usually involved in biotic and abiotic stress responses (Tranchant-Dubreuil et al., 2019). The core and variable genomes are further subdivided into core or soft-core and variable or member-specific genomes, respectively (Gordon et al., 2017; Medini, Donati, Tettelin, Masignani, & Rappuoli, 2005; Tranchant-Dubreuil et al., 2019).
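The core, soft-core, variable, and member-specific categories described above can be illustrated with a minimal Python sketch that labels genes from a presence/absence table. The function name `classify_pan_genome`, the toy data, and in particular the `soft_core_fraction` threshold are assumptions made for illustration only; the studies cited here do not prescribe a single cutoff.

```python
from typing import Dict, List, Set


def classify_pan_genome(
    presence: Dict[str, Set[str]],
    members: List[str],
    soft_core_fraction: float = 0.95,  # assumed cutoff, not taken from the cited studies
) -> Dict[str, str]:
    """Label each gene by how many members of the species carry it.

    presence maps a gene identifier to the set of members in which it was found.
    """
    n_members = len(members)
    labels: Dict[str, str] = {}
    for gene, carriers in presence.items():
        n_carriers = len(carriers & set(members))
        if n_carriers == n_members:
            labels[gene] = "core"              # present in every member
        elif n_carriers == 1:
            labels[gene] = "member-specific"   # present in exactly one member
        elif n_carriers / n_members >= soft_core_fraction:
            labels[gene] = "soft-core"         # present in nearly all members
        else:
            labels[gene] = "variable"          # present in only some members
    return labels


if __name__ == "__main__":
    toy_members = ["A", "B", "C", "D"]
    toy_presence = {
        "g1": {"A", "B", "C", "D"},
        "g2": {"A", "B", "C"},
        "g3": {"A", "B"},
        "g4": {"D"},
    }
    # A lenient 0.75 cutoff is used because the toy panel has only four members.
    print(classify_pan_genome(toy_presence, toy_members, soft_core_fraction=0.75))
```

With only a handful of members, the soft-core class is hard to distinguish from the core class, which is one reason the cutoff is best treated as a study-specific parameter.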
A pan-genome can also be characterized as open or closed (Khan et al., 2019) (Supplemental Figure S2b). An open pan-genome contains no fixed number of genes or gene families, and sequencing of additional members of a species gradually increases the size of the pan-genome. By contrast, in a closed pan-genome, the number of genes or gene families is fixed, and the addition of members does not affect the size of the pan-genome (Tao et al., 2018; Tranchant-Dubreuil et al., 2019).
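The distinction between open and closed pan-genomes can be made concrete with a small accumulation curve. The following Python sketch is illustrative only: the function names and the toy gene sets are hypothetical, and it simply tracks how the pan-genome and core genome sizes change as members are added in a given order.

```python
from typing import Dict, List, Set, Tuple


def accumulation_curve(
    gene_sets: Dict[str, Set[str]], order: List[str]
) -> List[Tuple[int, int, int]]:
    """Return (members_added, pan_size, core_size) after each added member."""
    pan: Set[str] = set()
    core: Set[str] = set()
    curve: List[Tuple[int, int, int]] = []
    for i, member in enumerate(order, start=1):
        genes = gene_sets[member]
        pan |= genes                                     # union grows the pan-genome
        core = set(genes) if i == 1 else core & genes    # intersection shrinks the core
        curve.append((i, len(pan), len(core)))
    return curve


def new_genes_per_step(curve: List[Tuple[int, int, int]]) -> List[int]:
    """Number of genes newly added to the pan-genome by each member."""
    sizes = [pan_size for _, pan_size, _ in curve]
    if not sizes:
        return []
    return [sizes[0]] + [later - earlier for earlier, later in zip(sizes, sizes[1:])]


if __name__ == "__main__":
    toy_sets = {
        "m1": {"a", "b", "c"},
        "m2": {"a", "b", "d"},
        "m3": {"a", "b", "c", "d"},
    }
    toy_curve = accumulation_curve(toy_sets, ["m1", "m2", "m3"])
    print(toy_curve)
    print(new_genes_per_step(toy_curve))
```

If the number of new genes per added member falls to zero and stays there, the curve has reached a plateau, which is the behavior expected of a closed pan-genome; a curve that keeps rising suggests an open one. Because the result depends on the order in which members are added, such curves are usually averaged over many random orderings in practice.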

METHODS FOR PLANT PAN-GENOME ASSEMBLY AND ANALYSIS
The quality of pan-genome analysis is greatly influenced by the method used to develop a pan-genome, together with the selection of suitable samples, ploidy level of samples, quality of genome assembly and annotation, and the approaches used to detect orthologous genes (Tao et al., 2018). The commonly used methods for constructing plant pan-genomes (Figure 1) can be categorized into distinct groups described below.

Methods for high-quality coverage data
If the members of a species have been sequenced with high-quality coverage, a comparative de novo assembly approach is an excellent choice for pan-genome analyses. In this method, genomes are separately assembled and annotated, followed by all-vs.-all alignment for screening core and variable sequences. This method is costly, time-consuming, and error-prone, especially if short-read sequencing is used for large genomes. However, rapid developments in sequencing technologies, specifically long-read sequencing technologies, and improvements in assembly tools will gradually reduce the cost of sequencing, minimize the time needed for sequence assembly, and empower the use of the comparative de novo assembly approach for plant pan-genome analyses in the near future (Tao et al., 2018; Tranchant-Dubreuil et al., 2019). Additionally, genotyping-by-sequencing (GBS) together with the machine-learning (ML) approach can also be used if members of a species have been sequenced with high-quality coverage and a high-quality reference genome is available. First, members are sequenced by GBS, and then the GBS tags are mapped onto a high-quality reference genome using genome-wide association study (GWAS) and joint linkage mapping in the nested association mapping population. Furthermore, ML approaches are used to anchor all mapped tags to construct a pan-genome (Lu et al., 2015).

FIGURE 1 Illustration of different approaches for pan-genome assembly and analyses. (a) Comparative de novo assembly approach; (b) iterative mapping and assembly approach; (c) de Bruijn graph method; (d) metagenome-like method; (e) presence/absence variant (PAV)-based pan-genome construction method; (f) map-to-pan approach; (g) pan-transcriptomics approach; (h) genotyping-by-sequencing (GBS) coupled with machine learning (ML) approach

Methods for low-quality coverage and short-read sequencing data
If members of a species have been sequenced at low coverage by short-read sequencing, iterative mapping and assembly, and map-to-pan approaches can be used. In both approaches, a single whole-genome assembly is first developed as a pan-genome reference, which is then used for annotating and mapping all sequence reads. However, iterative mapping and assembly and map-to-pan approaches differ in the steps involved in pan-genome construction. In the iterative mapping and assembly approach, a pan-genome is constructed by first developing a single whole-genome assembly reference and then mapping reads from other members of the species serially onto the reference. The reference genome is updated at each turn with assembled unmapped reads, and the updated reference genome is used to successively map reads of other members until the final pan-genome is established. This approach saves time but can cause errors if similar sequences are designated as extra copies, duplicates, novel dispensable sequences, or alleles (Hurgobin et al., 2018; Monat et al., 2018; Montenegro et al., 2017; Ou et al., 2018; Pinosio et al., 2016; Tao et al., 2018; Tranchant-Dubreuil et al., 2019). In the map-to-pan approach, a pan-genome is constructed by de novo assembly of genomes of individual members of a species and then low-quality assemblies are mapped onto a high-quality reference genome (Hu et al., 2017; Tao et al., 2018; Wang et al., 2018). Additionally, metagenome-like and de Bruijn graph methods can also be used for low-coverage sequencing data. The metagenome-like method uses a bacterial metagenome-like approach and assembles all sequences. After assembling whole-genome sequences of all samples, contigs are reassigned to each member by mapping its data onto the metagenome assembly. This method is compatible with low-coverage data but may produce errors in the form of chimeric assembly of artificial sequences (Tranchant-Dubreuil et al., 2019; Yao et al., 2015). In the de Bruijn graph method, all sequences are divided into smaller fragments of length K (K-mers). The association between K-mers of different samples represents graph edges. In the graph, every K-mer represents a node, and overlapping nodes are linked by an edge. A single graph can represent several edges connected by similar nodes that form a network. Information about the origin of a node is crucial if multiple genomes are present. Nodes originating from different genomes can be color coded for tracking purposes, and the whole pan-genome can be represented as multiple colored graphs. A balance between the total variant number and mapping accuracy is very important since an imbalance between these two parameters may lead to false positive mappings (Iqbal, Caccamo, Turner, Flicek, & McVean, 2012; Lin et al., 2014; Marcus, Lee, & Schatz, 2014). The graph-based pan-genome analysis approach is advantageous compared with linear pan-genome analysis approaches, as it allows storing complex genomic variations in a compact and balanced manner, thus ensuring that their actual annotation remains preserved. Unlike plant pan-genomics, human pan-genomics has led to a consensus that graph-based pan-genome approaches could overcome the challenge of storing all human sequence variation at a single platform. A comprehensive discussion of graph-based human pan-genomics can be found in a recent review by Sherman and Salzberg (2020).
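The colored de Bruijn graph idea described above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the implementation used by any of the cited tools: the function names and toy sequences are hypothetical, each sample is treated as a single sequence, and reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_colored_dbg(
    sequences: Dict[str, str], k: int
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Build a toy colored de Bruijn graph.

    Nodes are K-mers; an edge links two K-mers that are adjacent in a sequence
    (they overlap by k - 1 bases); the 'colors' of a node are the samples in
    which that K-mer occurs.
    """
    node_colors: Dict[str, Set[str]] = defaultdict(set)
    edges: Dict[str, Set[str]] = defaultdict(set)
    for sample, seq in sequences.items():
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            node_colors[kmer].add(sample)          # record the origin of the node
        for left, right in zip(kmers, kmers[1:]):
            edges[left].add(right)                 # consecutive K-mers share an edge
    return dict(node_colors), dict(edges)


def shared_nodes(node_colors: Dict[str, Set[str]], samples: List[str]) -> List[str]:
    """K-mers carrying every color, i.e., present in all samples."""
    wanted = set(samples)
    return sorted(kmer for kmer, colors in node_colors.items() if colors >= wanted)


if __name__ == "__main__":
    toy = {"s1": "ACGTA", "s2": "ACGGA"}
    colors, graph = build_colored_dbg(toy, k=3)
    print(shared_nodes(colors, ["s1", "s2"]))  # K-mers common to both samples
    print(sorted(graph["ACG"]))                # branch point where the samples diverge
```

In this toy case, the node shared by both samples corresponds to conserved sequence, whereas the branch leaving it marks a site of variation between the two samples; tracking colors is what allows such branches to be traced back to individual genomes.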

Method for hybridizing different sequencing approaches
In contrast to the methods mentioned above, if members of a species have been sequenced by different sequencing approaches, including short reads and long reads, with reasonable coverage and quality, the PAV-based pan-genome construction method (Tahir Ul Qamar et al., 2019) would be appropriate for constructing the pan-genome. This method requires a high-quality genome assembly as a reference. Then, genomes of other members of the species are iteratively mapped onto the reference genome assembly rather than mapping the reads against the selected reference. Additionally, this method identifies genic and nongenic PAVs and improves and annotates the reference genome successively. This method is less time-consuming and more cost-effective.

Method used to construct a pan-genome using transcriptomic data
An alternative approach for pan-genome assembly is the pan-transcriptomics approach, which is mainly based on transcriptomic data, although genomic data can also be used (Hubner et al., 2019). In this method, transcriptomes of all members of a species are sequenced and then all mRNA sequence reads are mapped onto a good-quality reference genome. Unmapped reads are de novo assembled, and novel transcripts are identified. Finally, the reference genome is updated to the pan-genome. This method generates data not only on genomic PAVs but also on transcript PAVs. However, this method does not provide information on nongenic regions (Hirsch et al., 2014; Tao et al., 2018).
Plant pan-genomes are mainly based on the knowledge of TEs and SVs (Morgante et al., 2007; Tao et al., 2018). The core genome in plant pan-genomes is composed of conserved genes that have lower SNP density and longer average length and are usually involved in basic cellular functions. By contrast, the dispensable genome in plant pan-genomes is based on genes that enable plants to withstand environmental stresses, have higher SNP density and nonsynonymous to synonymous substitution ratio (Ka/Ks), have shorter average length, and are likely involved in structural and functional diversity (Tao et al., 2018). Although the available pan-genome data are not readily comparable, since they are based on different sample sizes and assembly approaches, the dispensable genome in plant pan-genomes varies from 8 to 61% (regardless of the method used for analysis), and the available methods of pan-genome analyses are unable to capture all functional variation to fully explore the dispensable genes (Tao et al., 2018).
In maize, two pan-genome studies have been reported. Hirsch et al. (2014) developed a closed pan-genome of 503 maize inbred lines using the pan-transcriptomics approach. The authors revealed that 8,681 representative transcript assemblies were missing in the B73 maize reference genome (Hirsch et al., 2014). In another study of 14,129 maize inbred lines, Lu et al. (2015) developed a pan-genome using GBS and ML and revealed 1.1 million PAV tags.
In rice, four pan-genome studies have been published to date. Schatz et al. (2014) studied three divergent rice cultivars using a comparative de novo assembly approach and revealed that 92% of all genes are core genes and only the remaining ∼8% are variable genes. The authors also showed that variable genes in rice have a lower number of exons than core genes. Yao et al. (2015) explored the dispensable genome of 1,483 rice cultivars using the metagenome-like method and showed that 8,000 genes are absent from the Nipponbare rice reference genome. Zhao et al. (2018) constructed a pan-genome of 66 rice accessions (57 divergent accessions and nine modern cultivars) using the comparative de novo assembly approach and identified 23 million intergenomic sequence variants. In another study, Wang et al. (2018) established a closed pan-genome of rice based on 3,010 diverse Asian cultivars and found more than 90,000 SVs and over 10,000 novel full-length protein-coding genes. The study involving 3,010 rice cultivars showed a higher percentage of variable genome (∼41%) compared with the study involving only three rice cultivars (8% variable genome). This clearly shows that the sample size influences pan-genome studies; however, an increase in sample size can lead to a point where any further increase would not expand the pan-genome size. In addition, the study involving 66 diverse rice accessions (Zhao et al., 2018) revealed 42,580 genes (38% variable; Tao et al., 2018), whereas the study involving 3,010 Asian cultivars (Wang et al., 2018) revealed 48,098 genes (41% variable). This clearly demonstrates that the type of samples used in the study (i.e., similar or diverse) also influences pan-genome analyses.
The pan-genome of Brassica species has been investigated in four studies. Lin et al. (2014) compared and functionally annotated three B. rapa genomes {turnip (B. rapa subsp. rapa), rapid-cycling B. rapa, and Chinese cabbage [B. rapa subsp. pekinensis (Lour.) Hanelt]} using the de Bruijn graph method and revealed significant divergence prior to their domestication. Another study examined the closed pan-genome of cabbage (Brassica oleracea L.) based on nine cultivars using the iterative mapping and assembly approach and found that ∼80% of all genes are core genes, while the remaining ∼20% are affected by PAVs. Hurgobin et al. (2018) constructed the pan-genome of rape (Brassica napus L.) based on 53 cultivars using the iterative mapping and assembly approach and showed that 62% of pan-genes are core genes and 38% of genes are variable, possibly because of homologous exchange. In a recent study, Song et al. (2020) constructed the pan-genome of rape based on eight cultivars using the PAV-based approach. The authors showed that 56% of pan-genes are core genes, 42% of genes are variable, and the remaining 2% are species-specific (Song et al., 2020). Details of other plant pan-genomes are listed in Table 2.

PAN-GENOMES AND TRANSPOSABLE ELEMENTS
Transposable elements are defined as genomic elements capable of moving from one location to another in the genome and represent one of the major components of eukaryotic genomes, especially those of plants. Transposable elements are categorized into two classes: Class I (retrotransposons) and Class II (DNA transposons) (Cho, 2018; Feschotte, Jiang, & Wessler, 2002; Hadjiargyrou & Delihas, 2013; Morgante et al., 2007; Wicker et al., 2007). Retrotransposons move from one location to another via RNA intermediates, which are later converted into cDNAs, thus creating several extra copies in the genome. Retrotransposons mainly include long terminal repeat (LTR) retrotransposons, long interspersed nuclear elements, and short interspersed nuclear elements (Cho, 2018). Long terminal repeat retrotransposons are predominant in plant genomes, and their amplification is believed to be one of the key factors responsible for a substantial increase in genome size. Several studies indicate that amplification of certain LTR retrotransposon families can cause significant variation in genome size among closely related species (Bennetzen, 2002), and certain LTR retrotransposons can also induce gene translocation and chromosomal rearrangement (Xiao, Jiang, Schaffner, Stockinger, & van der Knaap, 2008). Hawkins, Kim, Nason, Wing, and Wendel (2006) showed that amplification of Gorge3 LTR retrotransposon families led to a substantial increase in the genome size of cotton (Gossypium spp.). Studies suggest that LTR retrotransposons jump into the enhancer, repressor, and promoter regions of genes to regulate their expression (Kashkush, Feldman, & Levy, 2002; Schramke & Allshire, 2003). DNA transposons move in the genome using a copy-and-paste mechanism, which involves rolling circle amplification. DNA transposons are further divided into two subclasses: Helitrons and miniature inverted-repeat transposable elements (MITEs) (Cho, 2018; Liu et al., 2019; Morgante et al., 2007). These MITEs regulate the expression of host genes and are usually found in close proximity to genic regions (Piriyapongsa, Marino-Ramirez, & Jordan, 2007; Zerjal, Joets, Alix, Grandbastien, & Tenaillon, 2009). The insertion and removal of MITEs can lead to PAVs in the host genome, which is thought to be an important aspect of host genome evolution and diversity (Sampath et al., 2014; Sun et al., 2020).
Recent high-throughput sequencing data show that plant genomes are highly dynamic, plastic, and variable, and TEs are largely responsible for the SV in plant genomes (Cho, 2018; Liu et al., 2019; Morgante et al., 2007). Transposable elements are widely documented in several important crops including maize, rice, wheat, and barley (Hordeum vulgare L.) and constitute at least 50% of the genomes of cereal crops (Cho, 2018; Hadjiargyrou & Delihas, 2013; Tenaillon, Hollister, & Gaut, 2010). Recently, we developed a pan-genome of the model plant Arabidopsis using whole-genome sequence assemblies of 19 ecotypes (Tahir Ul Qamar et al., 2019) (http://cbi.hzau.edu.cn/ppsPCP/) and identified TEs in the gradually developing pan-genome using RepeatMasker (Tarailo-Graovac & Chen, 2009) with default parameters. Our results showed that as the number of pan-genes increased with the addition of each assembly to the developing pan-genome, the number of pan-TEs also increased (Supplemental Table S2). Interestingly, the curve of pan-TEs was more open than that of pan-genes (Figure 2), possibly because of the less conserved and more variable nature of TEs. The reference genome of A. thaliana ecotype Columbia (Col-0) contains 33,467 genes and 34,255 TEs, whereas Arabidopsis ecotypes used to establish the pan-genome presented 34,899 pan-genes and 37,288 pan-TEs. In total, 3,033 new TEs were identified in the Arabidopsis pan-genome, which were absent in the Col-0 reference genome. These results further confirm that pan-genomes are crucial for characterizing the diversity of a species and for exploring and exploiting the mobility and functions of TEs and their involvement in genic and nongenic variations.
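The gene and TE counts reported above can be compared directly with a short calculation. The Python sketch below uses only the four counts given in the text; the helper name `relative_gain` is hypothetical, and the percentages it prints are derived here for illustration rather than quoted from the original analysis.

```python
from typing import Tuple


def relative_gain(reference_count: int, pan_count: int) -> Tuple[int, float]:
    """Return the absolute and percentage gain of the pan-genome over the reference."""
    gain = pan_count - reference_count
    return gain, 100.0 * gain / reference_count


# Counts taken from the text: Col-0 reference vs. the 19-ecotype pan-genome.
counts = {"genes": (33_467, 34_899), "TEs": (34_255, 37_288)}

for label, (reference, pan) in counts.items():
    gain, percent = relative_gain(reference, pan)
    print(f"{label}: +{gain} ({percent:.2f}% over Col-0)")
```

Under these counts, the relative gain in TEs is roughly twice that in genes, which is consistent with the observation that the pan-TE curve is more open than the pan-gene curve.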
Moreover, recent studies show that a large number of lncRNAs and circRNAs originate from TE transcripts (Figure 3b,c), thus reinforcing the dynamic evolutionary role of TEs in ncRNA biogenesis (Chen et al., 2018; Kapusta et al., 2013; Kelley & Rinn, 2012; Liang & Wilusz, 2014; Liu et al., 2012; Z. P. Wang et al., 2017). Recently, D. Wang et al. (2017) studied lncRNAs in Arabidopsis, rice, and maize under different abiotic stress conditions and identified 47, 611, and 398 TE-associated lncRNAs, respectively (Cho, 2018; D. Wang et al., 2017). In maize, Z. P. Wang et al. (2017) showed that sequences related to LINE1-like elements and their reverse complementary pairs (LLERCPs) are significantly enriched in regions flanking circRNAs. The authors concluded that TEs could potentially be involved in the development of circRNAs in plants (Z. P. Wang et al., 2017). Chen et al. (2018) also studied the association of TEs with circRNAs in maize and found similar results. Genes overlapping LLERCP-derived circRNAs are mostly enriched in loci associated with phenotypic variation. These results strengthen the concept that circRNAs likely originate from TEs and play important roles in phenotypic variation (Chen et al., 2018).

The Plant Genome
TABLE 3 Number of predicted microRNAs (miRNAs), long noncoding RNAs (lncRNAs), and circular RNAs (circRNAs) in plant species

HOW PLANT PAN-GENOMES CAN INCREASE THE EFFICIENCY OF NONCODING RNA DISCOVERY
The origin of regulatory ncRNAs is not well understood. Besides the role of TEs, other mechanisms of ncRNA biogenesis have been proposed recently (Hou et al., 2019). For example, miRNAs are hypothesized to originate (a) from inverted duplication of pre-existing protein-coding genes (Tanzer & Stadler, 2004; Zhang, Jiang, & Gao, 2011) and (b) through endogenous hairpins that create stem-loop structures in intronic or intergenic regions (Brodersen et al., 2008; Hou et al., 2019). Similarly, lncRNAs are also thought to originate from the duplication of pre-existing lncRNAs and by the decay of pseudo-protein-coding genes (Hou et al., 2019). Additionally, circRNAs can originate from exons, introns, or intergenic regions of protein-coding genes (Hou et al., 2019). Although the presence of these ncRNAs (miRNAs, lncRNAs, and circRNAs) in major crops has been documented recently (Table 3), their mechanisms of biogenesis and functions remain poorly understood.
Recent studies suggest that improvements are required in ncRNA detection procedures. Using a pan-genome for predicting ncRNAs will increase the efficiency of predictions and the overall number of ncRNAs predicted compared with using a single reference genome. Because a pan-genome is a collection of all coding and noncoding sequences of all members of a species, it presents a more accurate and complete set of genomic contents (Figure 4). For example, in a recent study, Gao et al. (2019) constructed a pan-genome for tomato and revealed that 4,873 crucial genes involved in important agronomic traits were missing in the tomato reference genome. Moreover, if a pan-genome is used for ncRNA discovery, reads from several members of a species can be simultaneously mapped to the pan-genome, and there will be no need for matching or merging the results generated using different reference genomes to consolidate a single ncRNA data set. Thus, using a pan-genome for ncRNA discovery will save considerable labor, time, and computational resources. This approach will also enable further investigation of the variable genome and will help determine the exact number, type, and function of ncRNAs contributed by each individual to the core and variable portions of the pan-genome.

CONCLUSIONS
The biological mechanisms and functions of ncRNAs, previously overlooked, have recently been explored. The discovery of ncRNAs has largely been influenced by the repetitive nature of TEs, short-read length of NGS data, and inability of a single reference genome to represent the entire diversity of a species. For example, short reads generated by NGS cause ambiguity and misannotation while mapping ncRNA reads. A single genome cannot represent the complete set of ncRNAs of a species. Therefore, if reads do not align accurately or cannot find target sites in the reference genome, the number of ncRNAs is over- or underestimated, thus affecting downstream analyses. It is believed that analyses using a pan-genome, rather than a single reference genome, will capture the entire genomic contents and genetic diversity of a species. Thus, we conclude that the unknown functions of ncRNAs in plants can be explored using the pan-genome as a reference.

ACKNOWLEDGMENTS
The authors would like to acknowledge the National Key Research and Development Program of China, the National Natural Science Foundation of China, the Hubei Provincial Natural Science Foundation of China, Guangxi University, China, and Huazhong Agricultural University, China, for providing facilities for this study.

AUTHOR CONTRIBUTIONS
L.L.C. and M.T.Q. conceived and designed this study. M.T.Q. performed the literature review and wrote the paper. L.L.C., X.Z., M.S.K., and F.X. analyzed the data and contributed to improving the paper.

COMPETING INTERESTS
The authors declare no competing interests.