What are they for?

This tool helps you compare and classify data sets using diversity metrics (e.g. scaffold counts, fingerprint similarity, molecular properties), describe how diverse each data set is, and determine which one is the most diverse.

What are their advantages?

Users are free to calculate and use the metrics they consider most appropriate or that best fit their objectives.

We save users the pain of analyzing each metric separately and help them avoid the unreliable results obtained when all data sets are compared using only one diversity metric!

Users can compare data sets of different sizes.

1. For each data set calculate at least two of the metrics described.

2. Download the template (.csv file). Do not change the position of the columns or the header of the columns. The program will read your .csv file in that order and using those names.

The template will have these columns:

DataSet

Here you can enter letters or numbers to help you identify your data sets. This name will be shown on the dot representing the data set, so we suggest you use a short name. You can enter as many data sets as you want.

Size

You can represent the relative size of the data set with a number between 1 and 20 (1 for the smallest and 20 for the largest data set). Do not leave this column blank; if you do not want to represent the size, use the same number for all your data sets.

MACCs, ECFP and Fingerprints

Here you can enter a representative similarity value calculated with the fingerprints (e.g. mean or median). You will choose one of these metrics to be plotted on the x axis of the plots, which represents the fingerprint diversity.

Chem/NumComp, AUC, F50 and SSEn

You will have a single Chem/NumComp, AUC, F50 and SSEn value for each data set. These values will be plotted on the y axis of your plot, which represents the scaffold diversity.

Molecular_properties

Here you can enter a representative value of the property-based diversity (e.g. mean or median Euclidean distance of the molecular properties) to set the color of each data set. Data sets with the lowest intra-data set distance (i.e. the least diverse) will be green, the most diverse data sets will be red and intermediate data sets will be brown-orange. Do not leave this column blank.

3. Fill the .csv with the metrics you are interested in. If you leave one of the columns blank, the plot for that metric will give the following error: Discrete value supplied to continuous scale. The plot only works for metrics with a numeric value.

4. Do not leave the Size and Molecular_properties columns blank. If you do not have a value for these columns, you can fill them with zero.

5. Save your template as .csv and upload it on the panel with the plot for the chemotype fraction, F50 and SSEn. The plot for AUC will use the same file.

6. For each plot you can choose a scaffold diversity metric and one fingerprint.

7. Enter a number from 0 to 1 for each threshold. These numbers depend on the data sets and the properties you are analyzing. You could use the median of the metric plotted on each axis.

8. You can save your CDPs by left-clicking on the plots.
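Steps 2 to 4 and the threshold suggestion in step 7 can be sketched in Python as follows. This is a minimal illustration, not the tool's own code: the header names and their order are copied from the column list above and are an assumption (the downloaded template is authoritative), build_template is a hypothetical helper, and the values are made up.

```python
import csv
import io
from statistics import median

# Assumed header names and order, taken from the column list above.
# If the downloaded template differs, the template wins.
COLUMNS = ["DataSet", "Size", "MACCs", "ECFP", "Fingerprints",
           "Chem/NumComp", "AUC", "F50", "SSEn", "Molecular_properties"]
REQUIRED = ["DataSet", "Size", "Molecular_properties"]


def build_template(rows):
    """Return CSV text for a list of dicts keyed by column name."""
    for row in rows:
        for name in REQUIRED:
            if row.get(name) in (None, ""):
                raise ValueError(f"{name} must not be blank")
    out = io.StringIO()
    # Metrics that were not calculated are written as empty cells.
    writer = csv.DictWriter(out, fieldnames=COLUMNS, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical values for two data sets, for illustration only.
rows = [
    {"DataSet": "A", "Size": 5, "MACCs": 0.42, "F50": 0.31,
     "Molecular_properties": 3.2},
    {"DataSet": "B", "Size": 20, "MACCs": 0.58, "F50": 0.12,
     "Molecular_properties": 1.7},
]
template_text = build_template(rows)

# Step 7 suggestion: the median of each plotted metric as its threshold.
x_threshold = median(r["MACCs"] for r in rows)
y_threshold = median(r["F50"] for r in rows)
```

Only the metrics that have numeric values in every row (here MACCs and F50) can then be plotted.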

Which metrics could you use and what do they mean?

These plots let you visualize the global diversity of your data sets: they compare the diversity measured by scaffold content with the diversity measured by how similar the entire structures of the compounds are. In addition, you can visualize how diverse your data sets are in terms of their molecular properties.

About the axes:

For the y axis you can choose between four metrics to measure the scaffold or chemotype diversity:

1. Fraction of chemotypes (Chem/NumComp). This fraction is the number of scaffolds or chemotypes in your data set divided by the number of compounds in your data set. The highest value you could get is one and that would be the case only if each compound in your data set has a different scaffold, therefore values closer to one indicate high scaffold diversity. Reference

2. Area under the curve (AUC). This value comes from the CSR curves, which plot the fraction of scaffolds or chemotypes on the x axis and the fraction of compounds that contain those scaffolds on the y axis. The highest value you could get is 1, which corresponds to a steeply rising curve, meaning that all your compounds share the same scaffold (i.e. low diversity). The lowest value you could get for AUC is 0.5, which indicates a straight line where each compound in your data set has a different scaffold (i.e. high scaffold diversity). Reference

3. F50, the fraction of scaffolds required to retrieve 50% of the compounds in the corresponding data set. In this case the highest value you could get is 0.5, which means each compound in your data set has a different scaffold. Reference

4. Scaled Shannon Entropy (SSEn). This entropy-based metric takes into account the frequency distribution of the compounds in the n most populated scaffolds or chemotypes. SSEn values range between 0, where all the compounds are contained in one chemotype, and 1, where each chemotype contains an equal number of compounds. Therefore, SSEn values closer to 1 indicate high diversity within the n most populated chemotypes. Reference

You can define the scaffolds using the method of your preference.
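As a rough illustration of the first three metrics, the sketch below computes Chem/NumComp, AUC and F50 from a list of scaffold IDs (one ID per compound). It assumes the CSR curve is built by ranking scaffolds from most to least populated, that the area is taken with the trapezoidal rule, and that F50 is read at the first point where at least 50% of the compounds are covered; the tool itself may interpolate or round differently. All function names are hypothetical.

```python
from collections import Counter


def csr_points(scaffold_ids):
    """Return the (x, y) points of the CSR curve, starting at (0, 0)."""
    # Scaffold populations, most populated first.
    counts = sorted(Counter(scaffold_ids).values(), reverse=True)
    n_scaffolds, n_compounds = len(counts), sum(counts)
    points, covered = [(0.0, 0.0)], 0
    for rank, count in enumerate(counts, start=1):
        covered += count
        points.append((rank / n_scaffolds, covered / n_compounds))
    return points


def chemotype_fraction(scaffold_ids):
    """Number of distinct scaffolds divided by number of compounds."""
    return len(set(scaffold_ids)) / len(scaffold_ids)


def auc(points):
    """Trapezoidal area under the CSR curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def f50(points):
    """Smallest fraction of scaffolds covering at least 50% of compounds."""
    return next(x for x, y in points if y >= 0.5)


# Hypothetical data sets: one scaffold per compound vs. one dominant scaffold.
diverse = ["s1", "s2", "s3", "s4"]
skewed = ["s1"] * 8 + ["s2", "s3"]

diverse_curve = csr_points(diverse)
skewed_curve = csr_points(skewed)
```

For the diverse set the curve is the diagonal, so the AUC is 0.5 and F50 is 0.5; the skewed set gives a larger AUC and a smaller F50, in line with the descriptions above.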

On the x axis you can plot the intermolecular diversity of the entire molecules using MACCS keys, extended-connectivity fingerprints (ECFP) or any other fingerprint of your choice, summarized with a representative similarity value (e.g. mean or median).
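One possible way to obtain such a representative value is sketched below. It assumes the Tanimoto coefficient as the similarity measure (a common choice, but not one this page prescribes) and represents each fingerprint as a plain set of on-bit positions, so no cheminformatics library is required. The bit sets are invented for illustration.

```python
from itertools import combinations
from statistics import mean, median


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def pairwise_similarities(fingerprints):
    """Similarity of every unordered pair of compounds in one data set."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]


# Hypothetical on-bit sets standing in for real MACCS or ECFP fingerprints.
fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8}]
similarities = pairwise_similarities(fingerprints)

# Either summary could go in the fingerprint column of the template.
x_value_mean = mean(similarities)
x_value_median = median(similarities)
```

Lower mean or median similarity indicates a more structurally diverse data set.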

About the thresholds:

The thresholds define four quadrants on your plot, and each quadrant has a meaning in terms of diversity. Suppose you are using F50 on the y axis and ECFP on the x axis. Data sets with an F50 close to 0.5 and a similarity value close to 0 are diverse both by scaffold content and when the entire compounds are analyzed; these data sets fall in the red quadrant. The opposite case (i.e. low F50 and high similarity) falls in the white quadrant. Data sets with many different scaffolds but high similarity values fall in the yellow quadrant; these data sets contain cyclic systems with few side chains that do not contribute to the structural diversity. Finally, data sets with mostly acyclic systems and low similarity values fall in the blue quadrant.

To set the thresholds, we suggest using the mean or median of the values of your data sets for the metric on each axis. However, you can set the thresholds in whatever way best fulfills your diversity requirements.
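The quadrant logic described above can be sketched as a small function. This is only an illustration: it assumes a y-axis metric where larger values mean higher scaffold diversity (F50, Chem/NumComp or SSEn; AUC runs the other way), it takes the blue quadrant to be the low scaffold diversity, low similarity corner, and the handling of values that fall exactly on a threshold is an arbitrary choice. The data values are made up.

```python
from statistics import median


def quadrant(similarity, scaffold_diversity, x_threshold, y_threshold):
    """Return the quadrant color for one data set."""
    diverse_scaffolds = scaffold_diversity >= y_threshold
    diverse_structures = similarity < x_threshold
    if diverse_scaffolds and diverse_structures:
        return "red"      # diverse by scaffolds and by whole structures
    if diverse_scaffolds:
        return "yellow"   # many scaffolds, but similar whole structures
    if diverse_structures:
        return "blue"     # few scaffolds, but dissimilar whole structures
    return "white"        # low diversity on both axes


# Hypothetical (similarity, F50) pairs for four data sets.
data_sets = {"A": (0.20, 0.45), "B": (0.70, 0.40),
             "C": (0.25, 0.05), "D": (0.80, 0.10)}

# Suggested default: the median of each metric across the data sets.
x_threshold = median(sim for sim, _ in data_sets.values())
y_threshold = median(f50 for _, f50 in data_sets.values())

colors = {name: quadrant(sim, f50, x_threshold, y_threshold)
          for name, (sim, f50) in data_sets.items()}
```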

The size and color of the dot for each data set

The size of the dot depends on the size of the data set it represents: bigger dots correspond to bigger data sets. The color scale of the dots represents the diversity of molecular properties. You can determine the intra-data set distances using the molecular properties and then use the mean or the median of these distances to set the color of each data set. Green dots represent data sets with lower molecular property diversity and red dots represent data sets with higher molecular property diversity.
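A value for the Molecular_properties column could be computed along these lines. The sketch assumes each compound is described by a tuple of numeric properties that have already been put on comparable scales (the scaling step is left out), and it uses the mean pairwise Euclidean distance; the median would work the same way. The property vectors are invented.

```python
from itertools import combinations
from math import dist
from statistics import mean


def property_diversity(property_rows):
    """Mean Euclidean distance over all pairs of compounds in a data set."""
    return mean(dist(a, b) for a, b in combinations(property_rows, 2))


# Hypothetical two-property vectors for three compounds.
compounds = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
color_value = property_diversity(compounds)
```

Larger values mean the compounds are more spread out in property space.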

Template

Scaffold diversity

Fingerprint

Thresholds

Molecular fingerprint mean/median

Scaffold diversity mean/median

Axis maximum value

Change the x axis maximum value

Change the y axis maximum value

Axis minimum value

Change the x axis minimum value

Change the y axis minimum value

Download image

Cyclic System Recovery (CSR) curves

Here you can compute the scaffold diversity of all your data sets. Just upload a comma-delimited (,) file in which each column holds a different data set and each row holds the scaffold ID of a compound in that data set. Compounds with the same scaffold must have the same ID. Reference.

Here you can obtain the number of compounds (M), number of chemotypes (N), fraction of chemotypes (FNM), number of singletons (NSING), fractions of singletons (FNSING, FNSINGM), area under the curve (AUC) and the fraction of scaffolds required to retrieve 50% of the compounds (F50) in the corresponding data set. You can use this information for the CDPlots.
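To make the input format and the count-based outputs concrete, here is a small sketch that reads a column-per-data-set file and derives M, N, NSING and the related fractions. The definitions FNM = N/M, FNSING = NSING/N and FNSINGM = NSING/M are assumptions inferred from the abbreviations, not something this page states; the file content and helper names are hypothetical. Shorter data sets are assumed to leave empty cells at the bottom of their column.

```python
import csv
import io
from collections import Counter


def read_scaffold_columns(csv_text):
    """Map each data set name (column header) to its list of scaffold IDs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in columns:
            value = row.get(name)
            if value:  # skip the padding cells of shorter data sets
                columns[name].append(value)
    return columns


def summarize(scaffold_ids):
    """Counts and fractions for one data set (fraction formulas assumed)."""
    counts = Counter(scaffold_ids)
    m, n = len(scaffold_ids), len(counts)
    nsing = sum(1 for c in counts.values() if c == 1)
    return {"M": m, "N": n, "FNM": n / m, "NSING": nsing,
            "FNSING": nsing / n, "FNSINGM": nsing / m}


# Hypothetical file: LibA has four compounds, LibB has three.
csv_text = "LibA,LibB\ns1,t1\ns1,t2\ns2,t3\ns3,\n"
summary = {name: summarize(ids)
           for name, ids in read_scaffold_columns(csv_text).items()}
```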

Download CSR curves data

Download image

Cumulative Distribution Function (CDF)

Here you can compute the fingerprint diversity and obtain the Cumulative Distribution Function (CDF) of your data sets. Just upload a comma-delimited (,) file in which the first column contains the SMILES, the second column the data set names and the third column the compound IDs.
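For readers unfamiliar with the CDF, the sketch below builds an empirical CDF from a list of pairwise similarity values: each point gives a similarity and the fraction of pairs at or below it. It is a generic illustration with made-up values and a hypothetical function name, and it does not reproduce how this panel computes fingerprints from SMILES.

```python
def ecdf(values):
    """Empirical CDF as (value, fraction of values <= value) points."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


# Hypothetical pairwise similarity values for one data set.
similarities = [0.6, 0.0, 0.2, 0.2]
cdf_points = ecdf(similarities)
```

A curve that rises early (most pairs have low similarity) points to a more diverse data set than one that rises late.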

Select a fingerprint

Select the diameter for topological fingerprints

Wait until the plot appears before you download the data. This could take a few minutes, depending on the number of compounds you are analyzing.

You can use the similarity mean or median for the CDPlots.

Download the similarity summary

Download image

SSEn plot and data

Here you can compute the Scaled Shannon Entropy from the 10 (SSE10) to the 60 (SSE60) most populated scaffolds in all your data sets. Just upload a comma delimited (,) file in which each column will have a different data set and each row will have the scaffold ID for the compounds in the data set. Compounds with the same scaffold must have the same ID.

Plot

Select the entropy of the n most populated scaffolds.

SSEn

Write the number of the column with the data set you want to plot.

Column number

You will obtain the SSE10, SSE20, SSE30, SSE40, SSE50, SSE60 of the scaffolds or chemotypes in the corresponding data set. You can use this information for the CDPlots.
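The calculation behind these values can be sketched as follows, assuming the usual definition: the Shannon entropy (base 2) of the compound distribution over the n most populated scaffolds, divided by log2(n). How the tool treats data sets with fewer than n scaffolds is not stated here; this sketch simply keeps log2(n) as the divisor. The example uses n = 4 instead of 10 to 60 to stay small, and the scaffold IDs are invented.

```python
from collections import Counter
from math import log2


def scaled_shannon_entropy(scaffold_ids, n):
    """SSEn over the n most populated scaffolds (assumed definition)."""
    top = [count for _, count in Counter(scaffold_ids).most_common(n)]
    if len(top) < 2:
        return 0.0  # a single scaffold carries no entropy
    total = sum(top)
    entropy = -sum((c / total) * log2(c / total) for c in top)
    return entropy / log2(n)


# Hypothetical data sets with four scaffolds each.
even = ["a"] * 5 + ["b"] * 5 + ["c"] * 5 + ["d"] * 5
skewed = ["a"] * 17 + ["b", "c", "d"]
single = ["a"] * 20

sse_even = scaled_shannon_entropy(even, 4)
sse_skewed = scaled_shannon_entropy(skewed, 4)
sse_single = scaled_shannon_entropy(single, 4)
```

An even distribution gives 1, a single scaffold gives 0, and the skewed set falls in between.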

Download the SSE data

Download image