This vignette describes how to use Qploidy, an R package designed for ploidy and aneuploidy estimation using genotyping platform data. For a detailed explanation of the Qploidy methodology, please refer to its publication.

When does the `Qploidy` methodology work?

The Qploidy approach is effective under the following conditions:
- Your marker data originates from Axiom or Illumina genotyping arrays.
- Your marker data is derived from targeted sequencing platforms (e.g., DArTag, GTseq, AgriSeq).
- All DNA samples were prepared following the same library preparation protocol.
- You know the ploidy of at least a subset of 60 samples or you know the most common ploidy in the dataset.
- Your dataset includes heterozygous samples.

When does the `Qploidy` methodology NOT work?

The methodology will not be effective under the following circumstances:
- Your marker data comes from RADseq or GBS (Genotyping-by-Sequencing) platforms.
- You intend to combine datasets from different sequencing batches.
- For example: If you extracted DNA and sequenced two plates (192 samples) as one batch, and later sequenced an additional three plates (288 samples) as a second batch, you would need to analyze the two batches separately in Qploidy. Combining all 480 samples into a single analysis will lead to incorrect results.
- You do not have a subset of samples with known ploidy or lack a predominant ploidy in your dataset.
- Your samples consist of inbred lines (homozygous individuals).

Installation

To install the Qploidy package, you can use the following commands:

# Install the latest stable version from CRAN
install.packages("Qploidy")

# Install the development version from GitHub
#install.packages("devtools")
devtools::install_github("cristianetaniguti/Qploidy")

Input files

In this vignette, we demonstrate the workflow using a simulated VCF file containing 50 samples and 500 genetic markers, allowing for a concise and computationally efficient example. We also present results derived from a real-world rose Axiom array dataset, which includes 524 garden rose cultivars and a total of 137,786 genetic markers.

Target sequencing VCF file

Let’s generate our simulated VCF file. This file contains 500 markers and 60 samples, with a mix of tetraploid, diploid, and triploid samples. The tetraploid samples are used as the reference for standardization because they are the most common ploidy in the dataset.

library(Qploidy)

# Simulate a VCF file with 500 markers and 60 samples
simulate_vcf(
  seed = 1234, file_path = "vcf_example_simulated.vcf",
  n_tetraploid = 35, n_diploid = 5, n_triploid = 10,
  n_markers = 500
)

data <- qploidy_read_vcf(vcf_file = "vcf_example_simulated.vcf")
head(data)

##   MarkerName  SampleName   X   Y   R     ratio
## 1   chr1_mk1 Tetraploid1  62 153 215 0.7116279
## 2   chr1_mk1 Tetraploid2  98 104 202 0.5148515
## 3   chr1_mk1 Tetraploid3  97  91 188 0.4840426
## 4   chr1_mk1 Tetraploid4 156  56 212 0.2641509
## 5   chr1_mk1 Tetraploid5 158  44 202 0.2178218
## 6   chr1_mk1 Tetraploid6 147  48 195 0.2461538

From Axiom array summary file

In case your samples were genotyped using the Axiom array platform, you should have a summary file or .CEL files. If you have .CEL files, you can convert them into a summary file using the apt-probeset-genotype plugin available in the Analysis Power Tools (APT) platform from Thermo Fisher Scientific (Waltham, MA). A summary is composed by a header that can have many lines and a body that looks like this:

##     probeset_id   Sample1    Sample2    Sample3    Sample4     Sample5
## 1 AX-95696335-A 66.387446  18.201038 104.790822 105.118016 104.5052361
## 2 AX-76690965-B 21.957280   7.406288  11.772518  53.006989   0.8532014
## 3 AX-83704427-A 12.368064  49.120263 110.625211  99.037832  98.9612293
## 4 AX-50778632-B  4.733086  17.542838   5.805427   4.022787   2.4019124
## 5 AX-52238885-A 97.683744 116.197899  96.799558  92.523135   5.5733040
## 6 AX-36184579-B 60.884661  16.177990  46.776783  12.399668  10.0071255
##     Sample6    Sample7    Sample8    Sample9  Sample10
## 1 102.55972 118.808939 102.848765  33.786032 97.345342
## 2  70.20994  15.659821  91.132107   3.416389  3.827478
## 3  46.44331  62.658054 106.543788 105.964029 91.509631
## 4  52.23666  35.481627  21.569666 123.018371  7.954378
## 5 107.16436   7.086115  12.757439 109.580758 96.399672
## 6  10.85804  59.444760   2.278098  74.499855 87.020296

In Qploidy, use the read_axiom function to read the summary file. It skips the header and reads the body of the file. The function will return a data frame.

data <- read_axiom("AxiomGT1_summary.txt")
head(data)

   MarkerName     SampleName         X         Y        R     ratio
1 AX-86752740      Lemon_Fiz  977.7548  793.4656 1771.220 0.4479768
2 AX-86752740   High_Voltage 1292.2205  582.7886 1875.009 0.3108191
3 AX-86752740    Lemon_Fiz.1  919.8055  870.9619 1790.767 0.4863624
4 AX-86752740 High_Voltage.1 1368.1266  645.9481 2014.075 0.3207170
5 AX-86752740     Brite_Eyes  238.5416 1242.7593 1481.301 0.8389648
6 AX-86752740 Morden_Blush.1  302.1575 1542.4037 1844.561 0.8361900

From Illumina array

In case you have a Illumina array summary file, Qploidy expects the following format:

[Header]
GSGT Version    2.0.5
Processing Date 4/17/2023 9:31 AM
Content         specieX.bpm
Num SNPs        26251
Total SNPs      26251
Num Samples     100
Total Samples   100

[Data]
SNP Name   Sample ID   GC Score   Theta   X      Y      X Raw   Y Raw   Log R Ratio
mk-1       SAMP-2      0.6814     0.247   0.023  0.009  504     100     0.5211668
mk-10      SAMP-2      0.9204     0.000   0.006  0.000  346     27      -1.021583
mk-100     SAMP-2      0.1695     0.367   0.030  0.020  576     158     1.177404
mk-101     SAMP-2      0.8655     0.222   0.026  0.010  538     102     0.3459635
mk-103     SAMP-2      0.8932     0.000   0.019  0.000  465     18      0.7578704
mk-104     SAMP-2      0.5146     0.175   0.041  0.012  675     114     1.016254
mk-105     SAMP-2      0.8060     0.450   0.018  0.015  458     132     0.4305636
mk-106     SAMP-2      0.0000     0.386   0.002  0.002  315     57      NaN
mk-107     SAMP-2      0.4804     0.495   0.026  0.025  534     189     1.0151

Verify if you file look like this one. If confirmed, you can use the read_illumina_array function to read the data. If you include more than one file, the function will merge them into a single data frame.

data <- read_illumina_array(
  "data/illumina/Set1.txt", # This is a example code, data not available
  "data/illumina/Set2.txt",
  "data/illumina/Set3.txt"
)
head(input_data)

MarkerName     SampleName     X     Y     R     ratio
1        CT1         S1-2_1 0.023 0.009 0.032 0.2812500
2       CT10         S1-2_1 0.006 0.000 0.006 0.0000000
3      CT100         S1-2_1 0.030 0.020 0.050 0.4000000
4      CT101         S1-2_1 0.026 0.010 0.036 0.2777778
5      CT103         S1-2_1 0.019 0.000 0.019 0.0000000
6      CT104         S1-2_1 0.041 0.012 0.053 0.2264151

Define Reference Samples - Choose Path (1) or (2)

Qploidy applies a standardization method to allele intensities or read counts to enable the comparison of copy numbers across the genome of a sample. To achieve this, it requires a set of reference samples. There are two ways to select these reference samples:

If you have a subset of samples with known ploidy, Qploidy will use this subset as the reference.
If you know the most common ploidy among your samples, you can initially run Qploidy using all samples as the reference. From this first run, you can identify samples with the common ploidy, then select this subset to use as the reference in a second run of Qploidy.

Below are the steps to proceed for each case, tagged as (1) and (2):

(1) Using a Subset with Known Ploidy

For this method, you will need to separate your subset of samples with known ploidy:

In our simulated example, the sample names tell us which ploidy they are. Let’s just create a vector with the tetraploid samples for us to use later:

# All samples
all_samples <- unique(data$SampleName)
all_samples

##  [1] "Tetraploid1"  "Tetraploid2"  "Tetraploid3"  "Tetraploid4"  "Tetraploid5" 
##  [6] "Tetraploid6"  "Tetraploid7"  "Tetraploid8"  "Tetraploid9"  "Tetraploid10"
## [11] "Tetraploid11" "Tetraploid12" "Tetraploid13" "Tetraploid14" "Tetraploid15"
## [16] "Tetraploid16" "Tetraploid17" "Tetraploid18" "Tetraploid19" "Tetraploid20"
## [21] "Tetraploid21" "Tetraploid22" "Tetraploid23" "Tetraploid24" "Tetraploid25"
## [26] "Tetraploid26" "Tetraploid27" "Tetraploid28" "Tetraploid29" "Tetraploid30"
## [31] "Tetraploid31" "Tetraploid32" "Tetraploid33" "Tetraploid34" "Tetraploid35"
## [36] "Diploid1"     "Diploid2"     "Diploid3"     "Diploid4"     "Diploid5"    
## [41] "Triploid1"    "Triploid2"    "Triploid3"    "Triploid4"    "Triploid5"   
## [46] "Triploid6"    "Triploid7"    "Triploid8"    "Triploid9"    "Triploid10"

tetraploid_samples <- all_samples[grep("Tetraploid", all_samples)]
tetraploid_samples

##  [1] "Tetraploid1"  "Tetraploid2"  "Tetraploid3"  "Tetraploid4"  "Tetraploid5" 
##  [6] "Tetraploid6"  "Tetraploid7"  "Tetraploid8"  "Tetraploid9"  "Tetraploid10"
## [11] "Tetraploid11" "Tetraploid12" "Tetraploid13" "Tetraploid14" "Tetraploid15"
## [16] "Tetraploid16" "Tetraploid17" "Tetraploid18" "Tetraploid19" "Tetraploid20"
## [21] "Tetraploid21" "Tetraploid22" "Tetraploid23" "Tetraploid24" "Tetraploid25"
## [26] "Tetraploid26" "Tetraploid27" "Tetraploid28" "Tetraploid29" "Tetraploid30"
## [31] "Tetraploid31" "Tetraploid32" "Tetraploid33" "Tetraploid34" "Tetraploid35"

(2) Using the Most Common Ploidy

For this method, you will use the entire dataset as the reference during the first round of the Qploidy run:

data_reference <- data

(1) and (2) Run Dosage Caller

To proceed, you need to determine the dosage for all reference samples. If you haven’t done this yet, you can use any suitable dosage caller.

For array data, we recommend using the fitPoly package.
For sequencing data, we suggest using the updog or polyRAD packages.

These packages determine dosages by analyzing the distribution of allele intensities or read counts across all samples for each marker, using an informed ploidy model. You will provide either the most common ploidy in your population or the known ploidy of your selected subset as the informed ploidy.

Path (2): If you provide the most common ploidy, samples with different ploidy levels may receive incorrect dosages. However, these samples can be filtered out in the next step.

Warning:
Depending on the number of samples and markers, this step may take a significant amount of time to complete. It is highly recommended to run it on a high-performance computing system where you can utilize multiple cores and avoid issues with excessive RAM usage.

Target sequencing VCF data

If the VCF file already contains dosage/genotype information called for the subset/most common ploidy:

genos <- qploidy_read_vcf("vcf_example_simulated.vcf", geno = TRUE)
dim(genos)

## [1] 50000     4

genos.pos <- qploidy_read_vcf("vcf_example_simulated.vcf", geno.pos = TRUE)
head(genos.pos)

##   MarkerName Chromosome Position
## 1   chr1_mk1       chr1   249732
## 2   chr1_mk2       chr1   347883
## 3   chr1_mk3       chr1   709781
## 4   chr1_mk4       chr1   758053
## 5   chr1_mk5       chr1   794442
## 6   chr1_mk6       chr1   823933

If the VCF file lacks marker names in the ID column of the fixed portion, the MarkerName column will contain NAs. To resolve this, you can use the following code to concatenate the chromosome and position as marker names:

genos.pos$MarkerName <- paste0(genos.pos$Chromosome, "_", genos.pos$Position)
head(genos.pos)

(1) Filter the genos object to keep only the known ploidy samples subset

genos <- genos[which(genos$SampleName %in% tetraploid_samples), ]

If the VCF file doesn’t contain dosage/genotype information (GT format field) called for the subset/most common ploidy:

library(tidyr)

# Prepare inputs for updog
## Approach (1)
# tobe_genotyped <- data[which(data$SampleName %in% tetraploid_samples),]

## Approach (2)
tobe_genotyped <- data

ref <- pivot_wider(tobe_genotyped[, 1:3], names_from = SampleName, values_from = X)
ref <- as.matrix(ref)
rownames_ref <- ref[, 1]
ref <- ref[, -1]
ref <- apply(ref, 2, as.numeric)
rownames(ref) <- rownames_ref

size <- pivot_wider(tobe_genotyped[, c(1, 2, 5)], names_from = SampleName, values_from = R)
size <- as.matrix(size)
rownames_size <- size[, 1]
size <- size[, -1]
size <- apply(size, 2, as.numeric)
rownames(size) <- rownames_size

size[1:10, 1:10]

##           Tetraploid1 Tetraploid2 Tetraploid3 Tetraploid4 Tetraploid5
## chr1_mk1          215         202         188         212         202
## chr1_mk2          184         194         208         205         213
## chr1_mk3          182         200         193         196         214
## chr1_mk4          189         202         202         203         203
## chr1_mk5          194         209         200         199         207
## chr1_mk6          186         206         178         195         205
## chr1_mk7          216         200         206         191         193
## chr1_mk8          196         215         201         209         199
## chr1_mk9          198         216         211         197         205
## chr1_mk10         204         195         205         187         205
##           Tetraploid6 Tetraploid7 Tetraploid8 Tetraploid9 Tetraploid10
## chr1_mk1          195         195         205         210          196
## chr1_mk2          200         195         209         200          195
## chr1_mk3          205         199         205         194          220
## chr1_mk4          215         204         218         203          188
## chr1_mk5          213         208         191         191          208
## chr1_mk6          203         191         200         197          200
## chr1_mk7          214         198         204         203          198
## chr1_mk8          212         200         195         197          205
## chr1_mk9          204         198         200         189          208
## chr1_mk10         212         200         201         199          200

ref[1:10, 1:10]

##           Tetraploid1 Tetraploid2 Tetraploid3 Tetraploid4 Tetraploid5
## chr1_mk1           62          98          97         156         158
## chr1_mk2           93          92          56         105         158
## chr1_mk3          143         110         146          52          98
## chr1_mk4           43         102         149         158          95
## chr1_mk5          101         107         200         153         151
## chr1_mk6          148         102          40          49          56
## chr1_mk7          112           0         150         141          94
## chr1_mk8           50          56          54         155         101
## chr1_mk9           98         104          58         101         108
## chr1_mk10         145         148         151         145         106
##           Tetraploid6 Tetraploid7 Tetraploid8 Tetraploid9 Tetraploid10
## chr1_mk1          147         149         154          99          145
## chr1_mk2          103          50         104           0           45
## chr1_mk3          150          53         100         105          109
## chr1_mk4           55          50         105         165          144
## chr1_mk5           56         154         154          98          101
## chr1_mk6          150          94           0         147            0
## chr1_mk7           70          99          50          98          154
## chr1_mk8          151         151         140          46          114
## chr1_mk9           54          44         101          86          156
## chr1_mk10          96         200         152         150          200

library(updog)
multidog_obj <- multidog(
  refmat = ref,
  sizemat = size,
  model = "norm", # It depends of your population structure
  ploidy = 4, # Most common ploidy in your population
  nc = 6
) # Change the parameters accordingly!!

##     |                                   *.#,%    
##    |||                                 *******/  
##  |||||||    (**..#**.                  */   **/  
## |||||||||    */****************************/*%   
##    |||    &****..,*.************************/    
##    |||     (....,,,*,...****%********/(******    
##    |||                ,,****%////,,,,./.****/    
##    |||                  /**//         .*///....  
##    |||                  .*/*/%#         .,/   ., 
##    |||               , **/   #%         .*    .. 
##    |||                               ,,,*        
## 
## Working on it...

## Warning in checkNumberOfLocalWorkers(workers): Careful, you are setting up 6
## localhost parallel workers with only 4 CPU cores available for this R process
## (per 'system'), which could result in a 150% load. The soft limit is set to
## 100%. Overusing the CPUs has negative impact on the current R process, but also
## on all other processes of yours and others running on the same machine. See
## help("parallelly.maxWorkers.localhost", package = "parallelly") for further
## explanations and how to override the soft limit that triggered this warning

## done!

genos <- data.frame(
  MarkerName = multidog_obj$inddf$snp,
  SampleName = multidog_obj$inddf$ind,
  geno = multidog_obj$inddf$geno,
  prob = multidog_obj$inddf$maxpostprob
)

head(genos)

##   MarkerName  SampleName geno      prob
## 1   chr1_mk1 Tetraploid1    1 0.9999976
## 2   chr1_mk1 Tetraploid2    2 0.9999999
## 3   chr1_mk1 Tetraploid3    2 0.9999999
## 4   chr1_mk1 Tetraploid4    3 0.9999997
## 5   chr1_mk1 Tetraploid5    3 1.0000000
## 6   chr1_mk1 Tetraploid6    3 1.0000000

Notice that, if you are following approach (2) diploids and triploids samples genotypes will be wrongly called as tetraploids. But, because they are not the most common samples in the dataset, they should not interfere significantly in the dosage call of the tetraploids. You can go back to this step after a first round of ploidy evaluation and call the dosage of the tetraploids only.

Array data

Here we show an option of genotype calling for array data using fitpoly software:

library(fitPoly)
saveMarkerModels(
  ploidy = 4, # Most common ploidy among Texas Garden Roses collection
  data = data_reference, # output of the previous function
  filePrefix = "fitpoly_output/texas_roses", # Define the path and prefix for the result files
  ncores = 2
) # Change it accordingly to your system!!!!

library(vroom) # Package to speed up reading files
fitpoly_scores <- vroom("fitpoly_output/texas_roses_scores.dat")

genos <- data.frame(
  MarkerName = fitpoly_scores$MarkerName,
  SampleName = fitpoly_scores$SampleName,
  geno = fitpoly_scores$maxgeno,
  prob = fitpoly_scores$maxP
)
head(genos)

   MarkerName  SampleName geno      prob
1 AX-86752740 1000 Wishes    2 0.7831438
2 AX-86752740  10004-N008    2 0.9892338
3 AX-86752740  10037_N046    2 0.9844171
4 AX-86752740  10038-N001    2 0.9916813
5 AX-86752740  10043_N019    2 0.9871057
6 AX-86752740  10043_N049    2 0.9955792

Differently than a VCF file, a summary file does not contain the genomic positions of the markers. You will need to provide these information in a separate file as the following example.

# File containing markers genomic positions
genos.pos_file <- read.table("geno.pos_roses_texas.txt", header = T)
head(genos.pos)

            SNP      probes chr      pos
1 Affx-86843634 AX-86752747   6   255852
2 Affx-86842821 AX-86752763   6 62032267
3 Affx-86839613 AX-86752769   5  9169310
4 Affx-86838724 AX-86752790   1 44259488
5 Affx-86840823 AX-86752809   7 12991931
6 Affx-86842443 AX-86752817   2 53861719

# Edit for Qploidy input format
genos.pos <- data.frame(
  MarkerName = genos.pos_file$probes,
  Chromosome = genos.pos_file$chr,
  Position = genos.pos_file$pos
)
head(genos.pos)

   MarkerName Chromosome Position
1 AX-86752747          6   255852
2 AX-86752763          6 62032267
3 AX-86752769          5  9169310
4 AX-86752790          1 44259488
5 AX-86752809          7 12991931
6 AX-86752817          2 53861719

(1) and (2) Standardization

## Generating standardized BAFs...
## [1] "Percentage of genotypes turned into missing data because of low genotype probability:2.27"
## [1] "Markers removed because of excess of missing data:0"
## Going to parallel mode...
## Back to single core usage
## [1] "Markers removed because of smaller number of clusters than set threshold:13"
## Going to parallel mode...
## Back to single core usage
## BAFs ready!
## Generating z scores...
## Z scores ready!
## Merging results into qploidy_standardization object...
## Writting Qploidy app input file:/tmp/RtmpwT629w/file1983ee3ea0f.tsv.gzDone!

## This is on object of class 'ploidy_standardization'
## --------------------------------------------------------------------
## Parameters
##                                                                         
## 1 standardization type:                                      intensities
## 2 Ploidy:                                                    4          
## 3 Minimum number of heterozygous classes (clusters) present: 5          
## 4 Maximum number of missing genotype by marker:              0.1        
## 5 Minimum genotype probability:                              0.8        
## --------------------------------------------------------------------
## Filters
##                                                                            
## 1 Number of markers at raw data:                                           
## 2 Percentage of filtered genotypes by probability threshold:               
## 3 Number of markers filtered by missing data:                              
## 4 Number of markers filtered for not having the minimum number of clusters:
## 5 Number of markers filtered for not having genomic information:           
## 6 Number of markers with estimated BAF:                                    
##                
## 1 1000 (100%)  
## 2 -    (2.27 %)
## 3 0    (0 %)   
## 4 13   (1.3 %) 
## 5 0    (0 %)   
## 6 987  (98.7 %)

simu_data_standardized <- standardize(
  data = data,
  genos = genos,
  geno.pos = genos.pos,
  ploidy.standardization = 4,
  threshold.n.clusters = 5,
  n.cores = 1,
  out_filename = "my_standardized_data.tsv.gz",
  verbose = TRUE
)

simu_data_standardized

How it look like for our real roses dataset:

This is on object of class 'ploidy_standardization'
--------------------------------------------------------------------
Parameters
                                                                        
1 standardization type:                                      intensities
2 Ploidy:                                                    4          
3 Minimum number of heterozygous classes (clusters) present: 5          
4 Maximum number of missing genotype by marker:              0.1        
5 Minimum genotype probability:                              0.8        
--------------------------------------------------------------------
Filters
                                                                                            
1 Number of markers at raw data:                                            137786 (100%)   
2 Percentage of filtered genotypes by probability threshold:                -      (16.79 %)
3 Number of markers filtered by missing data:                               5659   (4.11 %) 
4 Number of markers filtered for not having the minimum number of clusters: 101853 (73.92 %)
5 Number of markers filtered for not having genomic information:            252    (0.18 %) 
6 Number of markers with estimated BAF:                                     28100  (20.39 %)

See the details of the parameters with ?standardize.

The print function of the resulting object provides information about marker filtering during the process. One critical filtering step involves the number of dosage clusters represented for each marker. If your samples are highly inbred at certain loci, you might not have individuals representing all possible dosages for those marker locations.

For example, when using tetraploids as the reference, some markers may have individuals with dosages of only 0, 1, 3, and 4 (missing dosage 2). Without representatives for the missing dosage, Qploidy standardization cannot be performed.

If threshold.n.clusters is set to ploidy + 1, markers with missing dosages will be discarded.
If threshold.n.clusters is smaller than ploidy + 1, the missing dosage cluster centers will be imputed for standardization purposes.

Recover the `ploidy_standardization` Object from a Saved File

To revisit this step later without re-running the upstream functions, you can load the previously generated file using the read_qploidy_standardization function:

simu_data_standardized <- read_qploidy_standardization("my_standardized_data.tsv.gz")

(1) and (2) Plot results

Qploidy provides several options for visualization of the results. Check the complete description with ?plot_qploidy_standaridization. Bellow you can see examples:

# Select a sample to be evaluated
samples <- unique(simu_data_standardized$data$SampleName)
head(samples)

## [1] "Tetraploid1" "Tetraploid2" "Tetraploid3" "Tetraploid4" "Tetraploid5"
## [6] "Tetraploid6"

sample <- "Tetraploid1"

# Proportion of heterozygous loci, BAF (Qploidy standardized ratio), and zscore by genomic positions
p <- plot_qploidy_standardization(
  x = simu_data_standardized,
  sample = sample,
  type = c("het", "BAF", "zscore"),
  dot.size = 0.05,
  chr = 1:2
)

## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).

## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

See how it look like for our real roses dataset:

# Heterozygous frequency, BAF (Qploidy standardized ratio), and zscore by genomic positions
# centromere positions added
p <- plot_qploidy_standardization(
  x = simu_data_standardized,
  sample = sample,
  type = c("het", "BAF", "zscore"),
  dot.size = 0.05,
  chr = 1:2,
  add_centromeres = TRUE,
  centromeres = c("chr1" = 15000000, "chr2" = 19000000)
)

## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).

## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

See how it look like for our real roses dataset:

# Raw allele intensity or read count ratio and BAF (Qploidy standardized ratio) histograms
# combining all markers in the sample (sample level resolution)
p <- plot_qploidy_standardization(
  x = simu_data_standardized,
  sample = sample,
  type = c("Ratio_hist_overall", "BAF_hist_overall"),
  chr = 1:2,
  ploidy = 4,
  add_expected_peaks = TRUE
)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Notice that, because it is a simulated data, the ratio and the BAF don’t differ much. See how it look like for our real roses dataset:

# BAF histograms (chromosome level resolution) and Z score
p <- plot_qploidy_standardization(
  x = simu_data_standardized,
  sample = sample,
  type = c("BAF_hist", "zscore"),
  chr = 1:2,
  add_expected_peaks = TRUE,
  ploidy = c(4, 4)
)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).

## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

See how it look like for our real roses dataset:

# BAF histograms combining all markers in the sample (chromosome-arm level resolution) and Z score
p <- plot_qploidy_standardization(
  x = simu_data_standardized,
  sample = sample,
  type = c("BAF_hist", "zscore"),
  chr = 1:2,
  ploidy = c(4, 4, 4, 4), # Provide ploidy for each arm
  add_expected_peaks = TRUE,
  add_centromeres = TRUE,
  centromeres = c("chr1" = 15000000, "chr2" = 19000000)
)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).

## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

See how it look like for our real roses dataset:

It is important to note that the quality of results can vary across samples, as observed in different plots. Key aspects to assess include:
- How sharp the peaks are in the histogram plots.
- How well the dots cluster in the BAF vs. genomic position plots.
- Whether the patterns match the decay or increase of the Z-score (Z).

In our publication, we categorize sample quality into different resolutions based on the ability to estimate copy numbers, ranked from highest to lowest resolution:
1. Chromosome-arm resolution: When the copy number of all chromosome arms can be estimated.
2. Chromosome resolution: When at least one chromosome arm’s copy number cannot be estimated.
3. Sample resolution: When the copy number of at least one entire chromosome cannot be estimated.
4. None: When the histogram peaks, considering all markers, do not fit any of the expected ploidy levels tested.

For more details and examples, please refer to the publication.

(1) and (2) Ploidy Estimation for All Samples

Warning: This method may not be fully accurate in certain situations. While it offers a helpful initial guide, visual inspection of the plots described in the previous section is essential to assess the resolution and confirm the ploidy.

# If sample level resolution
estimated_ploidies_sample <- area_estimate_ploidy(
  qploidy_standardization = simu_data_standardized, # standardization result object
  samples = "all", # Samples or "all" to estimate all samples
  level = "sample", # Resolution level
  ploidies = c(2, 3, 4, 5)
) # Ploidies to be tested
estimated_ploidies_sample

## Object of class qploidy_area_ploidy_estimation                                                  
## 1 Number of samples:                     50       
## 2 Chromosomes:                           chr1,chr2
## 3 Tested ploidies:                       1,2,3,4,5
## 4 Number of euploid samples:             50       
## 5 Number of potential aneuploid samples: 0        
## 6 Number of highly inbred samples:       0

See how it look like for our real roses dataset:

Object of class qploidy_area_ploidy_estimation                                                        
1 Number of samples:                     524            
2 Chromosomes:                           1,3,5,2,6,7,4,0
3 Tested ploidies:                       1,2,3,4,5      
4 Number of euploid samples:             459            
5 Number of potential aneuploid samples: 0              
6 Number of highly inbred samples:       65

Highly inbred samples cannot be evaluated for copy number using this method. As a result, Qploidy returns a missing value (NA) for these cases.

head(estimated_ploidies_sample$ploidy)

##             [,1]
## Tetraploid1    4
## Tetraploid2    4
## Tetraploid3    4
## Tetraploid4    4
## Tetraploid5    4
## Tetraploid6    4

See how it look like for our real roses dataset:

               [,1]
Lemon_Fiz         4
High_Voltage      4
Lemon_Fiz.1       4
High_Voltage.1    4
Brite_Eyes        4
Morden_Blush.1    4

# If chromosome resolution
estimated_ploidies_chromosome <- area_estimate_ploidy(
  qploidy_standardization = simu_data_standardized,
  samples = "all",
  level = "chromosome",
  ploidies = c(2, 3, 4, 5)
)
estimated_ploidies_chromosome

## Object of class qploidy_area_ploidy_estimation                                                  
## 1 Number of samples:                     50       
## 2 Chromosomes:                           chr1,chr2
## 3 Tested ploidies:                       1,2,3,4,5
## 4 Number of euploid samples:             50       
## 5 Number of potential aneuploid samples: 0        
## 6 Number of highly inbred samples:       0

See how it look like for our real roses dataset:

Object of class qploidy_area_ploidy_estimation                                                        
1 Number of samples:                     524            
2 Chromosomes:                           0,1,2,3,4,5,6,7
3 Tested ploidies:                       1,2,3,4,5      
4 Number of euploid samples:             375            
5 Number of potential aneuploid samples: 92             
6 Number of highly inbred samples:       57

head(estimated_ploidies_chromosome$ploidy)

##             chr1 chr2
## Tetraploid1    4    4
## Tetraploid2    4    4
## Tetraploid3    4    4
## Tetraploid4    4    4
## Tetraploid5    4    4
## Tetraploid6    4    4

See how it look like for our real roses dataset:

               0 1 2 3 4 5 6 7
Lemon_Fiz      4 4 4 4 4 4 4 4
High_Voltage   4 4 4 4 4 4 4 4
Lemon_Fiz.1    4 4 4 4 4 4 4 4
High_Voltage.1 4 4 4 4 4 4 4 4
Brite_Eyes     4 4 4 4 4 4 4 4
Morden_Blush.1 4 4 4 4 4 4 4 4

# If chromosome-arm resolution
estimated_ploidies_chromosome_arm <- area_estimate_ploidy(
  qploidy_standardization = simu_data_standardized,
  samples = "all",
  level = "chromosome-arm",
  ploidies = c(2, 3, 4, 5),
  centromeres = c("chr1" = 15000000, "chr2" = 19000000)
)

estimated_ploidies_chromosome_arm

## Object of class qploidy_area_ploidy_estimation                                                                    
## 1 Number of samples:                     50                         
## 2 Chromosomes:                           chr1.1,chr1.2,chr2.1,chr2.2
## 3 Tested ploidies:                       1,2,3,4,5                  
## 4 Number of euploid samples:             50                         
## 5 Number of potential aneuploid samples: 0                          
## 6 Number of highly inbred samples:       0

See how it look like for our real roses dataset:

Object of class qploidy_area_ploidy_estimation                                                                                                  
1 Number of samples:                     524                                                      
2 Chromosomes:                           0,1.1,1.2,2.1,2.2,3.1,3.2,4.1,4.2,5.1,5.2,6.1,6.2,7.1,7.2
3 Tested ploidies:                       1,2,3,4,5                                                
4 Number of euploid samples:             304                                                      
5 Number of potential aneuploid samples: 171                                                      
6 Number of highly inbred samples:       49

head(estimated_ploidies_chromosome_arm$ploidy)

##             chr1.1 chr1.2 chr2.1 chr2.2
## Tetraploid1      4      4      4      4
## Tetraploid2      4      4      4      4
## Tetraploid3      4      4      4      4
## Tetraploid4      4      4      4      4
## Tetraploid5      4      4      4      4
## Tetraploid6      4      4      4      4

See how it look like for our real roses dataset:

               0 1.1 1.2 2.1 2.2 3.1 3.2 4.1 4.2 5.1 5.2 6.1 6.2 7.1 7.2
Lemon_Fiz      4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
High_Voltage   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
Lemon_Fiz.1    4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
High_Voltage.1 4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
Brite_Eyes     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
Morden_Blush.1 4   4   4   4   4  NA   4   4   4   4   4   4   4   4   4

Note that one of the chromosome arms (Morden_Blush.1 - 3.2) returned a value of NA. This occurs because the arm contains a high number of inbred loci.

estimated_ploidies_format <- merge_arms_format(estimated_ploidies_chromosome_arm)
estimated_ploidies_format

## Object of class qploidy_area_ploidy_estimation                                                  
## 1 Number of samples:                     50       
## 2 Chromosomes:                           chr1,chr2
## 3 Tested ploidies:                       1,2,3,4,5
## 4 Number of euploid samples:             50       
## 5 Number of potential aneuploid samples: 0        
## 6 Number of highly inbred samples:       0

See how it look like for our real roses dataset:

Object of class qploidy_area_ploidy_estimation                                                        
1 Number of samples:                     524            
2 Chromosomes:                           0,1,2,3,4,5,6,7
3 Tested ploidies:                       1,2,3,4,5      
4 Number of euploid samples:             304            
5 Number of potential aneuploid samples: 171            
6 Number of highly inbred samples:       49

head(estimated_ploidies_format$ploidy)

##             chr1 chr2
## Tetraploid1    4    4
## Tetraploid2    4    4
## Tetraploid3    4    4
## Tetraploid4    4    4
## Tetraploid5    4    4
## Tetraploid6    4    4

See how it look like for our real roses dataset:

               0   1   2   3      4   5   6   7  
Lemon_Fiz      "4" "4" "4" "4"    "4" "4" "4" "4"
High_Voltage   "4" "4" "4" "4"    "4" "4" "4" "4"
Lemon_Fiz.1    "4" "4" "4" "4"    "4" "4" "4" "4"
High_Voltage.1 "4" "4" "4" "4"    "4" "4" "4" "4"
Brite_Eyes     "4" "4" "4" "4"    "4" "4" "4" "4"
Morden_Blush.1 "4" "4" "4" "NA/4" "4" "4" "4" "4"

Export ploidy result table

write.csv(estimated_ploidies_format$ploidy, file = "ploidies_before_visual_evaluation.csv")

(1) and (2) Save plots for all samples

Facilitate individual visual inspection by generating figures for all samples in a loop:

dir.create("figures")

for (i in seq_along(samples)) {
  print(paste0("Part:", i, "/", length(samples)))
  print(paste("Generating figures for", samples[i], "..."))
  # This function will generate figures for sample, chromosome and chromosome-arm resolution level evaluation
  all_resolutions_plots(
    data_standardized = simu_data_standardized,
    sample = samples[i],
    ploidy = estimated_ploidies_chromosome$ploidy[i, 1:2],
    chr = 1:2,
    centromeres = c("chr1" = 15000000, "chr2" = 19000000),
    file_name = paste0("figures/", samples[i])
  )
  print(paste("Done!"))
}

By visualizing the plots, you can correct the results from area_estimate_ploidy and add resolution information for each sample. In our perfect simulated data, all samples correctly identified therefore we will just move forward with the analysis.

ploidy_corrected <- estimated_ploidies_chromosome$ploidy

(2) Improve standardization selecting a known ploidy subset after first round

If you used the most common ploidy as the reference instead of a subset with known ploidy, the standardization may not be as accurate. You can improve the resolution by following these steps:
i. Select the highest resolution plots from the first standardization round.
ii. Filter these samples from the data object.
iii. Run standardization again using this subset as the reference.
iv. Reevaluate the plots and estimate the ploidy once more.

# i) Select the highest resolution plots from the first standardization round.
tetraploids <- apply(ploidy_corrected, 1, function(x) all(x == 4))
tetraploids

data_reference2 <- rownames(ploidy_corrected)[which(tetraploids)]
length(data_reference2) 

# ii) Filter these samples from the `data` object.
genos_round2 <- genos[which(genos$SampleName %in% data_reference2), ]

# iii) Run standardization again using this subset as the reference.
simu_data_standardized_round2 <- standardize(
  data = data,
  genos = genos_round2,
  geno.pos = genos.pos,
  ploidy.standardization = 4,
  threshold.n.clusters = 5,
  n.cores = 1,
  out_filename = "my_round2_standardization.tsv.gz",
  verbose = TRUE
)
simu_data_standardized_round2

For our perfect simulated data, plot resolution didn’t change between first and second round but see how it look like for our real roses dataset:

This is on object of class 'ploidy_standardization'
--------------------------------------------------------------------
Parameters
                                                                        
1 standardization type:                                      intensities
2 Ploidy:                                                    4          
3 Minimum number of heterozygous classes (clusters) present: 5          
4 Maximum number of missing genotype by marker:              0.1        
5 Minimum genotype probability:                              0.8        
--------------------------------------------------------------------
Filters
                                                                                            
1 Number of markers at raw data:                                            137786 (100%)   
2 Percentage of filtered genotypes by probability threshold:                -      (16.34 %)
3 Number of markers filtered by missing data:                               6108   (4.43 %) 
4 Number of markers filtered for not having the minimum number of clusters: 111977 (81.27 %)
5 Number of markers filtered for not having genomic information:            145    (0.11 %) 
6 Number of markers with estimated BAF:                                     17634  (12.8 %)

New plots for roses real data after round 1 and 2 of standardization:

Round 1

Round 2

Round 1

Round 2

HMM-based Copy Number Estimation and Visualization

HMM-based Copy Number Estimation

Qploidy provides a powerful function, hmm_estimate_CN, for estimating copy number (CN) across the genome using a Hidden Markov Model (HMM) approach. This method combines standardized depth (z-score) and BAF (standardized ratio) information in windows to infer CN states for each region.

# Estimate copy number for a sample using HMM
hmm_result <- hmm_estimate_CN(
  qploidy_standarize_result = simu_data_standardized, # standardized object
  sample_id = sample,                                 # sample name
  chr = 1:2,                                          # chromosomes to analyze
  snps_per_window = 20,                               # SNPs per window
  min_snps_per_window = 5,                            # minimum SNPs per window
  cn_grid = c(2, 3, 4, 5),                            # ploidies to test
  M = 100,                                            # BAF bins
  bw = 0.03,                                          # BAF template bandwidth
  exp_ploidy = 4                                      # expected ploidy
)
head(hmm_result$result)

The output is a list with: - result: a data frame with CN calls, posterior probabilities, and window coordinates. - params: the fitted HMM parameters (means, transition matrix, etc.).

See ?hmm_estimate_CN for full details.

Parallel HMM Estimation for Multiple Samples

To efficiently estimate copy number for all samples in your dataset, use the new function hmm_estimate_CN_multi. This function runs HMM estimation in parallel across samples, leveraging multiple CPU cores for faster analysis. It is cross-platform compatible (Windows, macOS, Linux) and automatically handles cluster setup and package loading.

# Estimate copy number for all samples in parallel
multi_hmm_result <- hmm_estimate_CN_multi(
  qploidy_standarize_result = simu_data_standardized, # standardized object
  sample_ids = "all",                               # or a vector of sample names
  n_cores = 4,                                       # number of CPU cores to use
  chr = 1:2,                                         # chromosomes to analyze
  snps_per_window = 20,                              # SNPs per window
  min_snps_per_window = 5,                           # minimum SNPs per window
  cn_grid = c(2, 3, 4, 5),                           # ploidies to test
  M = 100,                                           # BAF bins
  bw = 0.03,                                         # BAF template bandwidth
  exp_ploidy = 4                                     # expected ploidy
)

The output is a list of HMM results for each sample.

Summarizing Parallel HMM Results

After running parallel HMM estimation, you can summarize the results using the new function summarize_cn_mode. This function extracts the most likely copy number state for each region and provides a summary table at the desired resolution (sample, chromosome, or chromosome-arm).

# Summarize HMM results for all samples
hmm_summary <- summarize_cn_mode(
  multi_hmm_result,           # output from hmm_estimate_CN_multi
  level = "chromosome"       # resolution: "sample", "chromosome", or "chromosome-arm"
)
head(hmm_summary)

You can further merge these results with area-based ploidy estimates using merge_cn_summary_with_estimates for integrated reporting.

See ?hmm_estimate_CN_multi, ?summarize_cn_mode, and ?merge_cn_summary_with_estimates for details and examples.

# Plot CN track for a sample
plot_cn_track(
  hmm_CN = hmm_result,         # output from hmm_estimate_CN
  sample_id = sample,          # sample name
  show_window_lines = TRUE     # show window boundaries
)

See ?plot_cn_track for more options and details.

Interpreting the CN Track Plot

The CN track plot provides a visual summary of copy number states across the genome for a given sample: - Top panel: Shows the z-score (standardized depth) for each window, highlighting regions of copy number variation. - Bottom panel: Displays the called copy number segments, colored by confidence (posterior probability). Each segment represents a genomic window, and the color intensity reflects the certainty of the CN call. - Window boundaries: Vertical lines indicate the start and end of each window used in the HMM analysis. - Legend: The plot includes a legend for copy number states and confidence levels, helping you quickly identify regions of aneuploidy or CN changes.

Use this plot to: - Assess the quality and resolution of CN calls across chromosomes. - Identify regions with high or low confidence in CN estimation. - Compare CN patterns between samples or across chromosomes.

Try the Qploidy Shiny App!

For an interactive experience, explore your data and run analyses using the Qploidy Shiny app. The app provides a user-friendly interface for data upload, parameter selection, visualization, and CN estimation.

To launch the app, simply run:

library(Qploidy)
run_app()

The app can be used locally or deployed on a server for collaborative analysis. It includes all major features of the package, including standardization, HMM-based CN estimation, and interactive plotting.

For more information, see the documentation and help files in the app, or visit the Qploidy GitHub page.

How to cite

Taniguti, C. H., Lau, J., Hochhaus, T., Arias, D. C. L., Hokanson, S. C., Zlesak, D. C., Byrne, D. H., Klein, P. E., & Riera-Lizarazu, O. (2025). Exploring chromosomal variations in garden roses: Insights from high-density SNP array data and a new tool, Qploidy. The Plant Genome, e70044. https://doi.org/10.1002/tpg2.70044

Acknowledgments

Version 1.0.0

Funded in part by the Robert E. Basye Endowment in Rose Genetics, Dept. of Horticultural Sciences, Texas A&M University, and USDA’s National Institute of Food and Agriculture (NIFA), Specialty Crop Research Initiative (SCRI) projects: ‘‘Tools for Genomics-Assisted Breeding in Polyploids: Development of a Community Resource’’ (Award No. 2020-51181-32156); and ‘‘Developing Sustainable Rose Landscapes via Rose Rosette Disease Education, Socioeconomic Assessments, and Breeding RRD-Resistant Roses with Stable Black Spot Resistance’’ (Award No. 2022-51181-38330).

Versions > 1.0.0

Supported by Breeding Insight.

Qploidy

When does the Qploidy methodology work?

When does the Qploidy methodology NOT work?

Installation

Input files

Target sequencing VCF file

From Axiom array summary file

From Illumina array

Define Reference Samples - Choose Path (1) or (2)

(1) Using a Subset with Known Ploidy

(2) Using the Most Common Ploidy

(1) and (2) Run Dosage Caller

Target sequencing VCF data

(1) Filter the genos object to keep only the known ploidy samples subset

Array data

(1) and (2) Standardization

Recover the ploidy_standardization Object from a Saved File

(1) and (2) Plot results

(1) and (2) Ploidy Estimation for All Samples

Export ploidy result table

(1) and (2) Save plots for all samples

(2) Improve standardization selecting a known ploidy subset after first round

HMM-based Copy Number Estimation and Visualization

HMM-based Copy Number Estimation

Parallel HMM Estimation for Multiple Samples

Summarizing Parallel HMM Results

Interpreting the CN Track Plot

Try the Qploidy Shiny App!

How to cite

Acknowledgments

Version 1.0.0

Versions > 1.0.0

When does the `Qploidy` methodology work?

When does the `Qploidy` methodology NOT work?

Recover the `ploidy_standardization` Object from a Saved File