This vignette describes how to use
Qploidy, an R package designed for ploidy and aneuploidy
estimation using genotyping platform data. For a detailed explanation of
the Qploidy methodology, please refer to its
publication.
Qploidy methodology work?The Qploidy approach is effective under the following
conditions:
- Your marker data originates from Axiom or
Illumina genotyping arrays.
- Your marker data is derived from targeted sequencing
platforms (e.g., DArTag, GTseq, AgriSeq).
- All DNA samples were prepared following the same library
preparation protocol.
- You know the ploidy of at least a subset of 60
samples or you know the most common ploidy in
the dataset.
- Your dataset includes heterozygous samples.
Qploidy methodology NOT work?The methodology will not be effective under the following
circumstances:
- Your marker data comes from RADseq or
GBS (Genotyping-by-Sequencing) platforms.
- You intend to combine datasets from different sequencing
batches.
- For example: If you extracted DNA and sequenced two plates (192
samples) as one batch, and later sequenced an additional three plates
(288 samples) as a second batch, you would need to analyze the two
batches separately in Qploidy. Combining
all 480 samples into a single analysis will lead to incorrect
results.
- You do not have a subset of samples with known ploidy
or lack a predominant ploidy in your dataset.
- Your samples consist of inbred lines (homozygous
individuals).
To install the Qploidy package, you can use the
following commands:
In this vignette, we demonstrate the workflow using a simulated VCF file containing 50 samples and 500 genetic markers, allowing for a concise and computationally efficient example. We also present results derived from a real-world rose Axiom array dataset, which includes 524 garden rose cultivars and a total of 137,786 genetic markers.
Let’s generate our simulated VCF file. This file contains 500 markers and 60 samples, with a mix of tetraploid, diploid, and triploid samples. The tetraploid samples are used as the reference for standardization because they are the most common ploidy in the dataset.
library(Qploidy)
# Simulate a VCF file with 500 markers and 60 samples
simulate_vcf(
seed = 1234, file_path = "vcf_example_simulated.vcf",
n_tetraploid = 35, n_diploid = 5, n_triploid = 10,
n_markers = 500
)## MarkerName SampleName X Y R ratio
## 1 chr1_mk1 Tetraploid1 62 153 215 0.7116279
## 2 chr1_mk1 Tetraploid2 98 104 202 0.5148515
## 3 chr1_mk1 Tetraploid3 97 91 188 0.4840426
## 4 chr1_mk1 Tetraploid4 156 56 212 0.2641509
## 5 chr1_mk1 Tetraploid5 158 44 202 0.2178218
## 6 chr1_mk1 Tetraploid6 147 48 195 0.2461538
In case your samples were genotyped using the Axiom array platform,
you should have a summary file or .CEL files.
If you have .CEL files, you can convert them into a
summary file using the apt-probeset-genotype
plugin available in the Analysis Power Tools (APT) platform from Thermo
Fisher Scientific (Waltham, MA). A summary is composed by a
header that can have many lines and a body that looks like this:
## probeset_id Sample1 Sample2 Sample3 Sample4 Sample5
## 1 AX-95696335-A 66.387446 18.201038 104.790822 105.118016 104.5052361
## 2 AX-76690965-B 21.957280 7.406288 11.772518 53.006989 0.8532014
## 3 AX-83704427-A 12.368064 49.120263 110.625211 99.037832 98.9612293
## 4 AX-50778632-B 4.733086 17.542838 5.805427 4.022787 2.4019124
## 5 AX-52238885-A 97.683744 116.197899 96.799558 92.523135 5.5733040
## 6 AX-36184579-B 60.884661 16.177990 46.776783 12.399668 10.0071255
## Sample6 Sample7 Sample8 Sample9 Sample10
## 1 102.55972 118.808939 102.848765 33.786032 97.345342
## 2 70.20994 15.659821 91.132107 3.416389 3.827478
## 3 46.44331 62.658054 106.543788 105.964029 91.509631
## 4 52.23666 35.481627 21.569666 123.018371 7.954378
## 5 107.16436 7.086115 12.757439 109.580758 96.399672
## 6 10.85804 59.444760 2.278098 74.499855 87.020296
In Qploidy, use the read_axiom function to
read the summary file. It skips the header and reads the
body of the file. The function will return a data frame.
MarkerName SampleName X Y R ratio
1 AX-86752740 Lemon_Fiz 977.7548 793.4656 1771.220 0.4479768
2 AX-86752740 High_Voltage 1292.2205 582.7886 1875.009 0.3108191
3 AX-86752740 Lemon_Fiz.1 919.8055 870.9619 1790.767 0.4863624
4 AX-86752740 High_Voltage.1 1368.1266 645.9481 2014.075 0.3207170
5 AX-86752740 Brite_Eyes 238.5416 1242.7593 1481.301 0.8389648
6 AX-86752740 Morden_Blush.1 302.1575 1542.4037 1844.561 0.8361900
In case you have a Illumina array summary file, Qploidy
expects the following format:
[Header]
GSGT Version 2.0.5
Processing Date 4/17/2023 9:31 AM
Content specieX.bpm
Num SNPs 26251
Total SNPs 26251
Num Samples 100
Total Samples 100
[Data]
SNP Name Sample ID GC Score Theta X Y X Raw Y Raw Log R Ratio
mk-1 SAMP-2 0.6814 0.247 0.023 0.009 504 100 0.5211668
mk-10 SAMP-2 0.9204 0.000 0.006 0.000 346 27 -1.021583
mk-100 SAMP-2 0.1695 0.367 0.030 0.020 576 158 1.177404
mk-101 SAMP-2 0.8655 0.222 0.026 0.010 538 102 0.3459635
mk-103 SAMP-2 0.8932 0.000 0.019 0.000 465 18 0.7578704
mk-104 SAMP-2 0.5146 0.175 0.041 0.012 675 114 1.016254
mk-105 SAMP-2 0.8060 0.450 0.018 0.015 458 132 0.4305636
mk-106 SAMP-2 0.0000 0.386 0.002 0.002 315 57 NaN
mk-107 SAMP-2 0.4804 0.495 0.026 0.025 534 189 1.0151
Verify if you file look like this one. If confirmed, you can use the
read_illumina_array function to read the data. If you
include more than one file, the function will merge them into a single
data frame.
data <- read_illumina_array(
"data/illumina/Set1.txt", # This is a example code, data not available
"data/illumina/Set2.txt",
"data/illumina/Set3.txt"
)
head(input_data)MarkerName SampleName X Y R ratio
1 CT1 S1-2_1 0.023 0.009 0.032 0.2812500
2 CT10 S1-2_1 0.006 0.000 0.006 0.0000000
3 CT100 S1-2_1 0.030 0.020 0.050 0.4000000
4 CT101 S1-2_1 0.026 0.010 0.036 0.2777778
5 CT103 S1-2_1 0.019 0.000 0.019 0.0000000
6 CT104 S1-2_1 0.041 0.012 0.053 0.2264151
Qploidy applies a standardization method to allele
intensities or read counts to enable the comparison of copy numbers
across the genome of a sample. To achieve this, it requires a set of
reference samples. There are two ways to select these reference
samples:
Qploidy will use this subset as the reference.Qploidy using all
samples as the reference. From this first run, you can identify samples
with the common ploidy, then select this subset to use as the reference
in a second run of Qploidy.Below are the steps to proceed for each case, tagged as (1) and (2):
For this method, you will need to separate your subset of samples with known ploidy:
In our simulated example, the sample names tell us which ploidy they are. Let’s just create a vector with the tetraploid samples for us to use later:
## [1] "Tetraploid1" "Tetraploid2" "Tetraploid3" "Tetraploid4" "Tetraploid5"
## [6] "Tetraploid6" "Tetraploid7" "Tetraploid8" "Tetraploid9" "Tetraploid10"
## [11] "Tetraploid11" "Tetraploid12" "Tetraploid13" "Tetraploid14" "Tetraploid15"
## [16] "Tetraploid16" "Tetraploid17" "Tetraploid18" "Tetraploid19" "Tetraploid20"
## [21] "Tetraploid21" "Tetraploid22" "Tetraploid23" "Tetraploid24" "Tetraploid25"
## [26] "Tetraploid26" "Tetraploid27" "Tetraploid28" "Tetraploid29" "Tetraploid30"
## [31] "Tetraploid31" "Tetraploid32" "Tetraploid33" "Tetraploid34" "Tetraploid35"
## [36] "Diploid1" "Diploid2" "Diploid3" "Diploid4" "Diploid5"
## [41] "Triploid1" "Triploid2" "Triploid3" "Triploid4" "Triploid5"
## [46] "Triploid6" "Triploid7" "Triploid8" "Triploid9" "Triploid10"
## [1] "Tetraploid1" "Tetraploid2" "Tetraploid3" "Tetraploid4" "Tetraploid5"
## [6] "Tetraploid6" "Tetraploid7" "Tetraploid8" "Tetraploid9" "Tetraploid10"
## [11] "Tetraploid11" "Tetraploid12" "Tetraploid13" "Tetraploid14" "Tetraploid15"
## [16] "Tetraploid16" "Tetraploid17" "Tetraploid18" "Tetraploid19" "Tetraploid20"
## [21] "Tetraploid21" "Tetraploid22" "Tetraploid23" "Tetraploid24" "Tetraploid25"
## [26] "Tetraploid26" "Tetraploid27" "Tetraploid28" "Tetraploid29" "Tetraploid30"
## [31] "Tetraploid31" "Tetraploid32" "Tetraploid33" "Tetraploid34" "Tetraploid35"
To proceed, you need to determine the dosage for all reference samples. If you haven’t done this yet, you can use any suitable dosage caller.
fitPoly
package.updog or polyRAD
packages.These packages determine dosages by analyzing the distribution of allele intensities or read counts across all samples for each marker, using an informed ploidy model. You will provide either the most common ploidy in your population or the known ploidy of your selected subset as the informed ploidy.
Warning:
Depending on the number of samples and markers, this step may take a
significant amount of time to complete. It is highly recommended to run
it on a high-performance computing system where you can utilize multiple
cores and avoid issues with excessive RAM usage.
## [1] 50000 4
## MarkerName Chromosome Position
## 1 chr1_mk1 chr1 249732
## 2 chr1_mk2 chr1 347883
## 3 chr1_mk3 chr1 709781
## 4 chr1_mk4 chr1 758053
## 5 chr1_mk5 chr1 794442
## 6 chr1_mk6 chr1 823933
ID column of
the fixed portion, the MarkerName column will contain
NAs. To resolve this, you can use the following code to
concatenate the chromosome and position as marker names:library(tidyr)
# Prepare inputs for updog
## Approach (1)
# tobe_genotyped <- data[which(data$SampleName %in% tetraploid_samples),]
## Approach (2)
tobe_genotyped <- data
ref <- pivot_wider(tobe_genotyped[, 1:3], names_from = SampleName, values_from = X)
ref <- as.matrix(ref)
rownames_ref <- ref[, 1]
ref <- ref[, -1]
ref <- apply(ref, 2, as.numeric)
rownames(ref) <- rownames_ref
size <- pivot_wider(tobe_genotyped[, c(1, 2, 5)], names_from = SampleName, values_from = R)
size <- as.matrix(size)
rownames_size <- size[, 1]
size <- size[, -1]
size <- apply(size, 2, as.numeric)
rownames(size) <- rownames_size
size[1:10, 1:10]## Tetraploid1 Tetraploid2 Tetraploid3 Tetraploid4 Tetraploid5
## chr1_mk1 215 202 188 212 202
## chr1_mk2 184 194 208 205 213
## chr1_mk3 182 200 193 196 214
## chr1_mk4 189 202 202 203 203
## chr1_mk5 194 209 200 199 207
## chr1_mk6 186 206 178 195 205
## chr1_mk7 216 200 206 191 193
## chr1_mk8 196 215 201 209 199
## chr1_mk9 198 216 211 197 205
## chr1_mk10 204 195 205 187 205
## Tetraploid6 Tetraploid7 Tetraploid8 Tetraploid9 Tetraploid10
## chr1_mk1 195 195 205 210 196
## chr1_mk2 200 195 209 200 195
## chr1_mk3 205 199 205 194 220
## chr1_mk4 215 204 218 203 188
## chr1_mk5 213 208 191 191 208
## chr1_mk6 203 191 200 197 200
## chr1_mk7 214 198 204 203 198
## chr1_mk8 212 200 195 197 205
## chr1_mk9 204 198 200 189 208
## chr1_mk10 212 200 201 199 200
## Tetraploid1 Tetraploid2 Tetraploid3 Tetraploid4 Tetraploid5
## chr1_mk1 62 98 97 156 158
## chr1_mk2 93 92 56 105 158
## chr1_mk3 143 110 146 52 98
## chr1_mk4 43 102 149 158 95
## chr1_mk5 101 107 200 153 151
## chr1_mk6 148 102 40 49 56
## chr1_mk7 112 0 150 141 94
## chr1_mk8 50 56 54 155 101
## chr1_mk9 98 104 58 101 108
## chr1_mk10 145 148 151 145 106
## Tetraploid6 Tetraploid7 Tetraploid8 Tetraploid9 Tetraploid10
## chr1_mk1 147 149 154 99 145
## chr1_mk2 103 50 104 0 45
## chr1_mk3 150 53 100 105 109
## chr1_mk4 55 50 105 165 144
## chr1_mk5 56 154 154 98 101
## chr1_mk6 150 94 0 147 0
## chr1_mk7 70 99 50 98 154
## chr1_mk8 151 151 140 46 114
## chr1_mk9 54 44 101 86 156
## chr1_mk10 96 200 152 150 200
library(updog)
multidog_obj <- multidog(
refmat = ref,
sizemat = size,
model = "norm", # It depends of your population structure
ploidy = 4, # Most common ploidy in your population
nc = 6
) # Change the parameters accordingly!!## | *.#,%
## ||| *******/
## ||||||| (**..#**. */ **/
## ||||||||| */****************************/*%
## ||| &****..,*.************************/
## ||| (....,,,*,...****%********/(******
## ||| ,,****%////,,,,./.****/
## ||| /**// .*///....
## ||| .*/*/%# .,/ .,
## ||| , **/ #% .* ..
## ||| ,,,*
##
## Working on it...
## Warning in checkNumberOfLocalWorkers(workers): Careful, you are setting up 6
## localhost parallel workers with only 4 CPU cores available for this R process
## (per 'system'), which could result in a 150% load. The soft limit is set to
## 100%. Overusing the CPUs has negative impact on the current R process, but also
## on all other processes of yours and others running on the same machine. See
## help("parallelly.maxWorkers.localhost", package = "parallelly") for further
## explanations and how to override the soft limit that triggered this warning
## done!
genos <- data.frame(
MarkerName = multidog_obj$inddf$snp,
SampleName = multidog_obj$inddf$ind,
geno = multidog_obj$inddf$geno,
prob = multidog_obj$inddf$maxpostprob
)
head(genos)## MarkerName SampleName geno prob
## 1 chr1_mk1 Tetraploid1 1 0.9999976
## 2 chr1_mk1 Tetraploid2 2 0.9999999
## 3 chr1_mk1 Tetraploid3 2 0.9999999
## 4 chr1_mk1 Tetraploid4 3 0.9999997
## 5 chr1_mk1 Tetraploid5 3 1.0000000
## 6 chr1_mk1 Tetraploid6 3 1.0000000
Notice that, if you are following approach (2) diploids and triploids samples genotypes will be wrongly called as tetraploids. But, because they are not the most common samples in the dataset, they should not interfere significantly in the dosage call of the tetraploids. You can go back to this step after a first round of ploidy evaluation and call the dosage of the tetraploids only.
Here we show an option of genotype calling for array data using fitpoly software:
library(fitPoly)
saveMarkerModels(
ploidy = 4, # Most common ploidy among Texas Garden Roses collection
data = data_reference, # output of the previous function
filePrefix = "fitpoly_output/texas_roses", # Define the path and prefix for the result files
ncores = 2
) # Change it accordingly to your system!!!!
library(vroom) # Package to speed up reading files
fitpoly_scores <- vroom("fitpoly_output/texas_roses_scores.dat")
genos <- data.frame(
MarkerName = fitpoly_scores$MarkerName,
SampleName = fitpoly_scores$SampleName,
geno = fitpoly_scores$maxgeno,
prob = fitpoly_scores$maxP
)
head(genos) MarkerName SampleName geno prob
1 AX-86752740 1000 Wishes 2 0.7831438
2 AX-86752740 10004-N008 2 0.9892338
3 AX-86752740 10037_N046 2 0.9844171
4 AX-86752740 10038-N001 2 0.9916813
5 AX-86752740 10043_N019 2 0.9871057
6 AX-86752740 10043_N049 2 0.9955792
Differently than a VCF file, a summary file does not
contain the genomic positions of the markers. You will need to provide
these information in a separate file as the following example.
# File containing markers genomic positions
genos.pos_file <- read.table("geno.pos_roses_texas.txt", header = T)
head(genos.pos) SNP probes chr pos
1 Affx-86843634 AX-86752747 6 255852
2 Affx-86842821 AX-86752763 6 62032267
3 Affx-86839613 AX-86752769 5 9169310
4 Affx-86838724 AX-86752790 1 44259488
5 Affx-86840823 AX-86752809 7 12991931
6 Affx-86842443 AX-86752817 2 53861719
# Edit for Qploidy input format
genos.pos <- data.frame(
MarkerName = genos.pos_file$probes,
Chromosome = genos.pos_file$chr,
Position = genos.pos_file$pos
)
head(genos.pos) MarkerName Chromosome Position
1 AX-86752747 6 255852
2 AX-86752763 6 62032267
3 AX-86752769 5 9169310
4 AX-86752790 1 44259488
5 AX-86752809 7 12991931
6 AX-86752817 2 53861719
## Generating standardized BAFs...
## [1] "Percentage of genotypes turned into missing data because of low genotype probability:2.27"
## [1] "Markers removed because of excess of missing data:0"
## Going to parallel mode...
## Back to single core usage
## [1] "Markers removed because of smaller number of clusters than set threshold:13"
## Going to parallel mode...
## Back to single core usage
## BAFs ready!
## Generating z scores...
## Z scores ready!
## Merging results into qploidy_standardization object...
## Writting Qploidy app input file:/tmp/RtmpwT629w/file1983ee3ea0f.tsv.gzDone!
## This is on object of class 'ploidy_standardization'
## --------------------------------------------------------------------
## Parameters
##
## 1 standardization type: intensities
## 2 Ploidy: 4
## 3 Minimum number of heterozygous classes (clusters) present: 5
## 4 Maximum number of missing genotype by marker: 0.1
## 5 Minimum genotype probability: 0.8
## --------------------------------------------------------------------
## Filters
##
## 1 Number of markers at raw data:
## 2 Percentage of filtered genotypes by probability threshold:
## 3 Number of markers filtered by missing data:
## 4 Number of markers filtered for not having the minimum number of clusters:
## 5 Number of markers filtered for not having genomic information:
## 6 Number of markers with estimated BAF:
##
## 1 1000 (100%)
## 2 - (2.27 %)
## 3 0 (0 %)
## 4 13 (1.3 %)
## 5 0 (0 %)
## 6 987 (98.7 %)
simu_data_standardized <- standardize(
data = data,
genos = genos,
geno.pos = genos.pos,
ploidy.standardization = 4,
threshold.n.clusters = 5,
n.cores = 1,
out_filename = "my_standardized_data.tsv.gz",
verbose = TRUE
)
simu_data_standardizedHow it look like for our real roses dataset:
This is on object of class 'ploidy_standardization'
--------------------------------------------------------------------
Parameters
1 standardization type: intensities
2 Ploidy: 4
3 Minimum number of heterozygous classes (clusters) present: 5
4 Maximum number of missing genotype by marker: 0.1
5 Minimum genotype probability: 0.8
--------------------------------------------------------------------
Filters
1 Number of markers at raw data: 137786 (100%)
2 Percentage of filtered genotypes by probability threshold: - (16.79 %)
3 Number of markers filtered by missing data: 5659 (4.11 %)
4 Number of markers filtered for not having the minimum number of clusters: 101853 (73.92 %)
5 Number of markers filtered for not having genomic information: 252 (0.18 %)
6 Number of markers with estimated BAF: 28100 (20.39 %)
See the details of the parameters with ?standardize.
The print function of the resulting object provides
information about marker filtering during the process. One critical
filtering step involves the number of dosage clusters represented for
each marker. If your samples are highly inbred at certain loci, you
might not have individuals representing all possible dosages for those
marker locations.
For example, when using tetraploids as the reference, some markers
may have individuals with dosages of only 0, 1, 3, and 4 (missing dosage
2). Without representatives for the missing dosage, Qploidy
standardization cannot be performed.
threshold.n.clusters is set to
ploidy + 1, markers with missing dosages will be
discarded.threshold.n.clusters is smaller than
ploidy + 1, the missing dosage cluster centers will be
imputed for standardization purposes.ploidy_standardization Object from a Saved
FileTo revisit this step later without re-running the upstream functions,
you can load the previously generated file using the
read_qploidy_standardization function:
Qploidy provides several options for visualization of
the results. Check the complete description with
?plot_qploidy_standaridization. Bellow you can see
examples:
# Select a sample to be evaluated
samples <- unique(simu_data_standardized$data$SampleName)
head(samples)## [1] "Tetraploid1" "Tetraploid2" "Tetraploid3" "Tetraploid4" "Tetraploid5"
## [6] "Tetraploid6"
sample <- "Tetraploid1"
# Proportion of heterozygous loci, BAF (Qploidy standardized ratio), and zscore by genomic positions
p <- plot_qploidy_standardization(
x = simu_data_standardized,
sample = sample,
type = c("het", "BAF", "zscore"),
dot.size = 0.05,
chr = 1:2
)## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
See how it look like for our real roses dataset:
# Heterozygous frequency, BAF (Qploidy standardized ratio), and zscore by genomic positions
# centromere positions added
p <- plot_qploidy_standardization(
x = simu_data_standardized,
sample = sample,
type = c("het", "BAF", "zscore"),
dot.size = 0.05,
chr = 1:2,
add_centromeres = TRUE,
centromeres = c("chr1" = 15000000, "chr2" = 19000000)
)## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
See how it look like for our real roses dataset:
# Raw allele intensity or read count ratio and BAF (Qploidy standardized ratio) histograms
# combining all markers in the sample (sample level resolution)
p <- plot_qploidy_standardization(
x = simu_data_standardized,
sample = sample,
type = c("Ratio_hist_overall", "BAF_hist_overall"),
chr = 1:2,
ploidy = 4,
add_expected_peaks = TRUE
)## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Notice that, because it is a simulated data, the ratio and the BAF don’t differ much. See how it look like for our real roses dataset:
# BAF histograms (chromosome level resolution) and Z score
p <- plot_qploidy_standardization(
x = simu_data_standardized,
sample = sample,
type = c("BAF_hist", "zscore"),
chr = 1:2,
add_expected_peaks = TRUE,
ploidy = c(4, 4)
)## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
See how it look like for our real roses dataset:
# BAF histograms combining all markers in the sample (chromosome-arm level resolution) and Z score
p <- plot_qploidy_standardization(
x = simu_data_standardized,
sample = sample,
type = c("BAF_hist", "zscore"),
chr = 1:2,
ploidy = c(4, 4, 4, 4), # Provide ploidy for each arm
add_expected_peaks = TRUE,
add_centromeres = TRUE,
centromeres = c("chr1" = 15000000, "chr2" = 19000000)
)## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
See how it look like for our real roses dataset:
It is important to note that the quality of results can vary across
samples, as observed in different plots. Key aspects to assess
include:
- How sharp the peaks are in the histogram plots.
- How well the dots cluster in the BAF vs. genomic position plots.
- Whether the patterns match the decay or increase of the Z-score
(Z).
In our publication, we categorize sample quality into different
resolutions based on the ability to estimate copy
numbers, ranked from highest to lowest resolution:
1. Chromosome-arm resolution: When the copy number of
all chromosome arms can be estimated.
2. Chromosome resolution: When at least one chromosome
arm’s copy number cannot be estimated.
3. Sample resolution: When the copy number of at least
one entire chromosome cannot be estimated.
4. None: When the histogram peaks, considering all
markers, do not fit any of the expected ploidy levels tested.
For more details and examples, please refer to the publication.
Warning: This method may not be fully accurate in certain situations. While it offers a helpful initial guide, visual inspection of the plots described in the previous section is essential to assess the resolution and confirm the ploidy.
# If sample level resolution
estimated_ploidies_sample <- area_estimate_ploidy(
qploidy_standardization = simu_data_standardized, # standardization result object
samples = "all", # Samples or "all" to estimate all samples
level = "sample", # Resolution level
ploidies = c(2, 3, 4, 5)
) # Ploidies to be tested
estimated_ploidies_sample## Object of class qploidy_area_ploidy_estimation
## 1 Number of samples: 50
## 2 Chromosomes: chr1,chr2
## 3 Tested ploidies: 1,2,3,4,5
## 4 Number of euploid samples: 50
## 5 Number of potential aneuploid samples: 0
## 6 Number of highly inbred samples: 0
See how it look like for our real roses dataset:
Object of class qploidy_area_ploidy_estimation
1 Number of samples: 524
2 Chromosomes: 1,3,5,2,6,7,4,0
3 Tested ploidies: 1,2,3,4,5
4 Number of euploid samples: 459
5 Number of potential aneuploid samples: 0
6 Number of highly inbred samples: 65
Highly inbred samples cannot be evaluated for copy number using this
method. As a result, Qploidy returns a missing value
(NA) for these cases.
## [,1]
## Tetraploid1 4
## Tetraploid2 4
## Tetraploid3 4
## Tetraploid4 4
## Tetraploid5 4
## Tetraploid6 4
See how it look like for our real roses dataset:
[,1]
Lemon_Fiz 4
High_Voltage 4
Lemon_Fiz.1 4
High_Voltage.1 4
Brite_Eyes 4
Morden_Blush.1 4
# If chromosome resolution
estimated_ploidies_chromosome <- area_estimate_ploidy(
qploidy_standardization = simu_data_standardized,
samples = "all",
level = "chromosome",
ploidies = c(2, 3, 4, 5)
)
estimated_ploidies_chromosome## Object of class qploidy_area_ploidy_estimation
## 1 Number of samples: 50
## 2 Chromosomes: chr1,chr2
## 3 Tested ploidies: 1,2,3,4,5
## 4 Number of euploid samples: 50
## 5 Number of potential aneuploid samples: 0
## 6 Number of highly inbred samples: 0
See how it look like for our real roses dataset:
Object of class qploidy_area_ploidy_estimation
1 Number of samples: 524
2 Chromosomes: 0,1,2,3,4,5,6,7
3 Tested ploidies: 1,2,3,4,5
4 Number of euploid samples: 375
5 Number of potential aneuploid samples: 92
6 Number of highly inbred samples: 57
## chr1 chr2
## Tetraploid1 4 4
## Tetraploid2 4 4
## Tetraploid3 4 4
## Tetraploid4 4 4
## Tetraploid5 4 4
## Tetraploid6 4 4
See how it look like for our real roses dataset:
0 1 2 3 4 5 6 7
Lemon_Fiz 4 4 4 4 4 4 4 4
High_Voltage 4 4 4 4 4 4 4 4
Lemon_Fiz.1 4 4 4 4 4 4 4 4
High_Voltage.1 4 4 4 4 4 4 4 4
Brite_Eyes 4 4 4 4 4 4 4 4
Morden_Blush.1 4 4 4 4 4 4 4 4
# If chromosome-arm resolution
estimated_ploidies_chromosome_arm <- area_estimate_ploidy(
qploidy_standardization = simu_data_standardized,
samples = "all",
level = "chromosome-arm",
ploidies = c(2, 3, 4, 5),
centromeres = c("chr1" = 15000000, "chr2" = 19000000)
)
estimated_ploidies_chromosome_arm## Object of class qploidy_area_ploidy_estimation
## 1 Number of samples: 50
## 2 Chromosomes: chr1.1,chr1.2,chr2.1,chr2.2
## 3 Tested ploidies: 1,2,3,4,5
## 4 Number of euploid samples: 50
## 5 Number of potential aneuploid samples: 0
## 6 Number of highly inbred samples: 0
See how it look like for our real roses dataset:
Object of class qploidy_area_ploidy_estimation
1 Number of samples: 524
2 Chromosomes: 0,1.1,1.2,2.1,2.2,3.1,3.2,4.1,4.2,5.1,5.2,6.1,6.2,7.1,7.2
3 Tested ploidies: 1,2,3,4,5
4 Number of euploid samples: 304
5 Number of potential aneuploid samples: 171
6 Number of highly inbred samples: 49
## chr1.1 chr1.2 chr2.1 chr2.2
## Tetraploid1 4 4 4 4
## Tetraploid2 4 4 4 4
## Tetraploid3 4 4 4 4
## Tetraploid4 4 4 4 4
## Tetraploid5 4 4 4 4
## Tetraploid6 4 4 4 4
See how it look like for our real roses dataset:
0 1.1 1.2 2.1 2.2 3.1 3.2 4.1 4.2 5.1 5.2 6.1 6.2 7.1 7.2
Lemon_Fiz 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
High_Voltage 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
Lemon_Fiz.1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
High_Voltage.1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
Brite_Eyes 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
Morden_Blush.1 4 4 4 4 4 NA 4 4 4 4 4 4 4 4 4
Note that one of the chromosome arms (Morden_Blush.1 - 3.2) returned
a value of NA. This occurs because the arm contains a high
number of inbred loci.
estimated_ploidies_format <- merge_arms_format(estimated_ploidies_chromosome_arm)
estimated_ploidies_format## Object of class qploidy_area_ploidy_estimation
## 1 Number of samples: 50
## 2 Chromosomes: chr1,chr2
## 3 Tested ploidies: 1,2,3,4,5
## 4 Number of euploid samples: 50
## 5 Number of potential aneuploid samples: 0
## 6 Number of highly inbred samples: 0
See how it look like for our real roses dataset:
Object of class qploidy_area_ploidy_estimation
1 Number of samples: 524
2 Chromosomes: 0,1,2,3,4,5,6,7
3 Tested ploidies: 1,2,3,4,5
4 Number of euploid samples: 304
5 Number of potential aneuploid samples: 171
6 Number of highly inbred samples: 49
## chr1 chr2
## Tetraploid1 4 4
## Tetraploid2 4 4
## Tetraploid3 4 4
## Tetraploid4 4 4
## Tetraploid5 4 4
## Tetraploid6 4 4
See how it look like for our real roses dataset:
0 1 2 3 4 5 6 7
Lemon_Fiz "4" "4" "4" "4" "4" "4" "4" "4"
High_Voltage "4" "4" "4" "4" "4" "4" "4" "4"
Lemon_Fiz.1 "4" "4" "4" "4" "4" "4" "4" "4"
High_Voltage.1 "4" "4" "4" "4" "4" "4" "4" "4"
Brite_Eyes "4" "4" "4" "4" "4" "4" "4" "4"
Morden_Blush.1 "4" "4" "4" "NA/4" "4" "4" "4" "4"
Facilitate individual visual inspection by generating figures for all samples in a loop:
dir.create("figures")
for (i in seq_along(samples)) {
print(paste0("Part:", i, "/", length(samples)))
print(paste("Generating figures for", samples[i], "..."))
# This function will generate figures for sample, chromosome and chromosome-arm resolution level evaluation
all_resolutions_plots(
data_standardized = simu_data_standardized,
sample = samples[i],
ploidy = estimated_ploidies_chromosome$ploidy[i, 1:2],
chr = 1:2,
centromeres = c("chr1" = 15000000, "chr2" = 19000000),
file_name = paste0("figures/", samples[i])
)
print(paste("Done!"))
}By visualizing the plots, you can correct the results from
area_estimate_ploidy and add resolution information for
each sample. In our perfect simulated data, all samples correctly
identified therefore we will just move forward with the analysis.
If you used the most common ploidy as the reference instead of a
subset with known ploidy, the standardization may not be as accurate.
You can improve the resolution by following these steps:
i. Select the highest resolution plots from the first standardization
round.
ii. Filter these samples from the data object.
iii. Run standardization again using this subset as the reference.
iv. Reevaluate the plots and estimate the ploidy once more.
# i) Select the highest resolution plots from the first standardization round.
tetraploids <- apply(ploidy_corrected, 1, function(x) all(x == 4))
tetraploids
data_reference2 <- rownames(ploidy_corrected)[which(tetraploids)]
length(data_reference2)
# ii) Filter these samples from the `data` object.
genos_round2 <- genos[which(genos$SampleName %in% data_reference2), ]
# iii) Run standardization again using this subset as the reference.
simu_data_standardized_round2 <- standardize(
data = data,
genos = genos_round2,
geno.pos = genos.pos,
ploidy.standardization = 4,
threshold.n.clusters = 5,
n.cores = 1,
out_filename = "my_round2_standardization.tsv.gz",
verbose = TRUE
)
simu_data_standardized_round2For our perfect simulated data, plot resolution didn’t change between first and second round but see how it look like for our real roses dataset:
This is on object of class 'ploidy_standardization'
--------------------------------------------------------------------
Parameters
1 standardization type: intensities
2 Ploidy: 4
3 Minimum number of heterozygous classes (clusters) present: 5
4 Maximum number of missing genotype by marker: 0.1
5 Minimum genotype probability: 0.8
--------------------------------------------------------------------
Filters
1 Number of markers at raw data: 137786 (100%)
2 Percentage of filtered genotypes by probability threshold: - (16.34 %)
3 Number of markers filtered by missing data: 6108 (4.43 %)
4 Number of markers filtered for not having the minimum number of clusters: 111977 (81.27 %)
5 Number of markers filtered for not having genomic information: 145 (0.11 %)
6 Number of markers with estimated BAF: 17634 (12.8 %)
New plots for roses real data after round 1 and 2 of standardization:
Qploidy provides a powerful function,
hmm_estimate_CN, for estimating copy number (CN) across the
genome using a Hidden Markov Model (HMM) approach. This method combines
standardized depth (z-score) and BAF (standardized ratio) information in
windows to infer CN states for each region.
# Estimate copy number for a sample using HMM
hmm_result <- hmm_estimate_CN(
qploidy_standarize_result = simu_data_standardized, # standardized object
sample_id = sample, # sample name
chr = 1:2, # chromosomes to analyze
snps_per_window = 20, # SNPs per window
min_snps_per_window = 5, # minimum SNPs per window
cn_grid = c(2, 3, 4, 5), # ploidies to test
M = 100, # BAF bins
bw = 0.03, # BAF template bandwidth
exp_ploidy = 4 # expected ploidy
)
head(hmm_result$result)The output is a list with: - result: a data frame with
CN calls, posterior probabilities, and window coordinates. -
params: the fitted HMM parameters (means, transition
matrix, etc.).
See ?hmm_estimate_CN for full details.
To efficiently estimate copy number for all samples in your dataset,
use the new function hmm_estimate_CN_multi. This function
runs HMM estimation in parallel across samples, leveraging multiple CPU
cores for faster analysis. It is cross-platform compatible (Windows,
macOS, Linux) and automatically handles cluster setup and package
loading.
# Estimate copy number for all samples in parallel
multi_hmm_result <- hmm_estimate_CN_multi(
qploidy_standarize_result = simu_data_standardized, # standardized object
sample_ids = "all", # or a vector of sample names
n_cores = 4, # number of CPU cores to use
chr = 1:2, # chromosomes to analyze
snps_per_window = 20, # SNPs per window
min_snps_per_window = 5, # minimum SNPs per window
cn_grid = c(2, 3, 4, 5), # ploidies to test
M = 100, # BAF bins
bw = 0.03, # BAF template bandwidth
exp_ploidy = 4 # expected ploidy
)The output is a list of HMM results for each sample.
After running parallel HMM estimation, you can summarize the results
using the new function summarize_cn_mode. This function
extracts the most likely copy number state for each region and provides
a summary table at the desired resolution (sample, chromosome, or
chromosome-arm).
# Summarize HMM results for all samples
hmm_summary <- summarize_cn_mode(
multi_hmm_result, # output from hmm_estimate_CN_multi
level = "chromosome" # resolution: "sample", "chromosome", or "chromosome-arm"
)
head(hmm_summary)You can further merge these results with area-based ploidy estimates
using merge_cn_summary_with_estimates for integrated
reporting.
See ?hmm_estimate_CN_multi,
?summarize_cn_mode, and
?merge_cn_summary_with_estimates for details and
examples.
# Plot CN track for a sample
plot_cn_track(
hmm_CN = hmm_result, # output from hmm_estimate_CN
sample_id = sample, # sample name
show_window_lines = TRUE # show window boundaries
)See ?plot_cn_track for more options and details.
The CN track plot provides a visual summary of copy number states across the genome for a given sample: - Top panel: Shows the z-score (standardized depth) for each window, highlighting regions of copy number variation. - Bottom panel: Displays the called copy number segments, colored by confidence (posterior probability). Each segment represents a genomic window, and the color intensity reflects the certainty of the CN call. - Window boundaries: Vertical lines indicate the start and end of each window used in the HMM analysis. - Legend: The plot includes a legend for copy number states and confidence levels, helping you quickly identify regions of aneuploidy or CN changes.
Use this plot to: - Assess the quality and resolution of CN calls across chromosomes. - Identify regions with high or low confidence in CN estimation. - Compare CN patterns between samples or across chromosomes.
For an interactive experience, explore your data and run analyses using the Qploidy Shiny app. The app provides a user-friendly interface for data upload, parameter selection, visualization, and CN estimation.
To launch the app, simply run:
The app can be used locally or deployed on a server for collaborative analysis. It includes all major features of the package, including standardization, HMM-based CN estimation, and interactive plotting.
For more information, see the documentation and help files in the app, or visit the Qploidy GitHub page.
Taniguti, C. H., Lau, J., Hochhaus, T., Arias, D. C. L., Hokanson, S. C., Zlesak, D. C., Byrne, D. H., Klein, P. E., & Riera-Lizarazu, O. (2025). Exploring chromosomal variations in garden roses: Insights from high-density SNP array data and a new tool, Qploidy. The Plant Genome, e70044. https://doi.org/10.1002/tpg2.70044
Funded in part by the Robert E. Basye Endowment in Rose Genetics, Dept. of Horticultural Sciences, Texas A&M University, and USDA’s National Institute of Food and Agriculture (NIFA), Specialty Crop Research Initiative (SCRI) projects: ‘‘Tools for Genomics-Assisted Breeding in Polyploids: Development of a Community Resource’’ (Award No. 2020-51181-32156); and ‘‘Developing Sustainable Rose Landscapes via Rose Rosette Disease Education, Socioeconomic Assessments, and Breeding RRD-Resistant Roses with Stable Black Spot Resistance’’ (Award No. 2022-51181-38330).
Supported by Breeding Insight.