Consensus Diversity Plots Version 2 (CDPs-V.2)

DIFACQUIM


Welcome to the first online version of the Consensus Diversity Plots (CDPs)!

These plots were designed by members of DIFACQUIM, based at the Universidad Nacional Autónoma de México.

If you use this app, please cite it using the following reference:

Gonzalez-Medina M, Prieto-Martinez FD, Owen JR, Medina-Franco JL. Consensus Diversity Plots: A Global Diversity Analysis of Chemical Libraries. J. Cheminform. 8(1),63 (2016). Visit our website
This Shiny app was developed by Mariana Gonzalez Medina, DIFACQUIM
Last update August 2017

What are they for?

This tool will help you compare and classify data sets using diversity metrics (e.g. scaffold counts, fingerprint similarity, molecular properties), describe how diverse each data set is, and determine which one is the most diverse.


What are their advantages?

Users are free to calculate and use the metrics they consider most appropriate or that best fit their objectives.

We save users the pain of analyzing each metric separately and help them avoid the misleading results obtained when all data sets are compared using only one diversity metric!

Users can compare data sets of different sizes.

1. For each data set calculate at least two of the metrics described.

2. Download the template (.csv file). Do not change the position of the columns or the header of the columns. The program will read your .csv file in that order and using those names.

The template will have these columns:

DataSet

Here you can enter letters or numbers to help you identify your data sets. This name will be shown on the dot representing the data set, so we suggest you use a short name. You can enter as many data sets as you want.

Size

You can represent the relative size of the data set with a number between 1 and 20 (1 for the smallest and 20 for the biggest data set). Do not leave this column blank; if you do not want to represent the size, use the same number for all your data sets.

MACCs, ECFP and Fingerprints

Here you can enter a representative similarity value calculated with the fingerprints (e.g. mean or median). You will choose one of these metrics to be plotted on the x axis of the plots, which represents the fingerprint diversity.

Chem/NumComp, AUC, F50 and SSEn

You will have a single Chem/NumComp, F50, SSEn and AUC value for each data set. These values will be plotted on the y axis of your plot, which represents the scaffold diversity.

Molecular_properties

Here you can enter a representative value of the diversity using properties (e.g. mean or median Euclidean distances of the molecular properties) to set the color of each data set. Data sets with the highest intra-data set distance (i.e. the most diverse) will be red, the least diverse data sets will be green and intermediate data sets will be brown-orange. Do not leave this column blank.

3. You can fill the .csv with the metrics you are interested in. If you leave one of the columns blank, the plot for that metric will give you the following error: Discrete value supplied to continuous scale. The plot will only work for the metrics with a numeric value.

4. Do not leave the columns Size and Molecular_properties blank. If you do not have a value for these columns you can fill them with zero.

5. Save your template as .csv and upload it on the panel with the plot for the chemotype fraction, F50 and SSEn. The plot for AUC will use the same file.

6. For each plot you can choose a scaffold diversity metric and one fingerprint.

7. Enter a number from 0 to 1 for the thresholds. These numbers will depend on the data sets and the properties you are analyzing. You could use the median of the property for each axis.

8. You can save your CDPs by left-clicking on the plots.
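
The template-filling steps above can be sketched in Python as follows. This is only an illustration: the header names and their order are assumed from the column descriptions in this section (the template you download is the authority), and `check_rows` and `to_csv_text` are hypothetical helper names, not part of the app.

```python
import csv
import io

# Assumed header, taken from the column descriptions above. Always check it
# against the template you actually downloaded.
ASSUMED_HEADER = ["DataSet", "Size", "MACCs", "ECFP", "Fingerprints",
                  "Chem/NumComp", "AUC", "F50", "SSEn", "Molecular_properties"]


def check_rows(rows):
    """Return a list of problems found in rows (dicts keyed by column name)."""
    problems = []
    for i, row in enumerate(rows, start=1):
        # Size and Molecular_properties must never be blank.
        for col in ("Size", "Molecular_properties"):
            if str(row.get(col, "")).strip() == "":
                problems.append(f"row {i}: {col} is blank")
        # Every metric that is filled in must be numeric.
        for col in ASSUMED_HEADER[1:]:
            value = str(row.get(col, "")).strip()
            if value == "":
                continue
            try:
                float(value)
            except ValueError:
                problems.append(f"row {i}: {col} is not numeric")
    return problems


def to_csv_text(rows):
    """Write rows as comma-separated text, keeping the assumed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=ASSUMED_HEADER,
                            delimiter=",", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # metrics that were not supplied are left blank
    return buffer.getvalue()
```

Calling `check_rows` before `to_csv_text` catches blank Size or Molecular_properties cells and non-numeric entries before the file is uploaded.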


Which metrics could you use and what do they mean?

On these plots you are going to visualize the global diversity of your data sets. This means you will compare their diversity as measured by scaffold content with their diversity as measured by how similar the entire structures of their compounds are. In addition, you can visualize how diverse your data sets are in terms of their molecular properties.

About the axes:

For the y axis you can choose between four metrics to measure the scaffold or chemotype diversity:

1. Fraction of chemotypes (Chem/NumComp). This fraction is the number of scaffolds or chemotypes in your data set divided by the number of compounds in your data set. The highest value you could get is one, which happens only if each compound in your data set has a different scaffold; therefore, values closer to one indicate high scaffold diversity. Reference

2. Area under the curve (AUC). This value comes from the CSR curves, which plot the fraction of scaffolds or chemotypes on the x axis and the fraction of compounds that contain those scaffolds on the y axis. The highest value you could get is 1; this corresponds to a steeply rising curve, meaning that all your compounds share the same scaffold (i.e. low diversity). The lowest value you could get for AUC is 0.5; this indicates a straight line, where each of the compounds in your data set has a different scaffold (i.e. high scaffold diversity). Reference

3. F50 or the fraction of scaffolds required to retrieve 50% of the compounds in the corresponding data set. In this case the highest value you could get is 0.5, which means each of the compounds in your data set has a different scaffold. Reference

4. Scaled Shannon Entropy (SSEn). This entropy-based metric takes into account the frequency distribution of the compounds in the n most populated scaffolds or chemotypes. SSEn values range between 0, where all the compounds are contained in one chemotype, and 1, where each chemotype contains an equal number of compounds. Therefore, SSEn values closer to 1 indicate large diversity within the n most populated chemotypes. Reference
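
To make the first three scaffold metrics concrete, here is a minimal Python sketch that derives them from a list of scaffold IDs (one ID per compound). It assumes a simple discretisation: the CSR curve starts at the origin, the area is computed with trapezoids, and F50 is the first point on the curve that reaches 50% of the compounds. The app may interpolate differently, so values can differ slightly. The function names are hypothetical.

```python
from collections import Counter


def csr_points(scaffold_ids):
    """Points of the CSR curve: (fraction of scaffolds, fraction of compounds)."""
    # Most populated scaffolds first.
    counts = sorted(Counter(scaffold_ids).values(), reverse=True)
    n_scaffolds, n_compounds = len(counts), sum(counts)
    points = [(0.0, 0.0)]
    cumulative = 0
    for rank, count in enumerate(counts, start=1):
        cumulative += count
        points.append((rank / n_scaffolds, cumulative / n_compounds))
    return points


def scaffold_metrics(scaffold_ids):
    """Chem/NumComp, AUC and F50 under the assumptions stated above."""
    points = csr_points(scaffold_ids)
    n_scaffolds = len(points) - 1
    fraction = n_scaffolds / len(scaffold_ids)
    # Trapezoidal area under the CSR curve.
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    # Smallest fraction of scaffolds that covers at least half of the compounds.
    f50 = next(x for x, y in points if y >= 0.5)
    return {"Chem/NumComp": fraction, "AUC": auc, "F50": f50}
```

With four compounds that all have different scaffolds, this sketch gives Chem/NumComp = 1, AUC = 0.5 and F50 = 0.5, matching the limits described above. With a single scaffold the discrete curve collapses to one segment, so the sketch is only meaningful for data sets with several scaffolds.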

You can define the scaffolds using the method of your preference.

On the x axis you can plot the intermolecular diversity of the entire molecules using MACCS keys, Extended-Connectivity fingerprints (ECFP) or any other fingerprint of your choice, summarized by a representative similarity value (e.g. mean or median).
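
As an illustration of a representative similarity value, the sketch below computes all pairwise similarities within one data set and leaves the choice of mean or median to you. It assumes the Tanimoto coefficient, which is a common choice but is not prescribed by this page, and it represents each fingerprint as a Python set of "on" bit positions. Generating the fingerprints themselves requires a cheminformatics toolkit and is not shown.

```python
from itertools import combinations
from statistics import mean, median


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def pairwise_similarities(fingerprints):
    """Similarity of every unordered pair of compounds in one data set."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]


# Toy fingerprints (made-up bit positions, for illustration only).
toy = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
similarities = pairwise_similarities(toy)
representative = {"mean": mean(similarities), "median": median(similarities)}
```

Either value in `representative` could go into the fingerprint column of the template. Lower values indicate a more diverse data set.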

About the thresholds:

The thresholds define 4 quadrants on your plot, and each quadrant has a meaning in terms of diversity. Let us say you are using F50 on your y axis and ECFP on your x axis. Data sets with an F50 close to 0.5 and a similarity value close to 0 are considered diverse both by their scaffold content and when the entire compounds are analyzed; these data sets will be in the quadrant colored in red. The opposite case (i.e. low F50 and high similarity) will be in the white quadrant. Data sets with many different scaffolds but high similarity values will be in the yellow quadrant; these data sets contain cyclic systems with few side chains that do not contribute to the structural diversity. On the other hand, data sets with mostly acyclic systems and low similarity values can be found in the blue quadrant.

To set the thresholds we suggest that you use the mean or median of your data sets' values, depending on the metric you are using on each axis. However, you can set the thresholds in the way that best fulfills your diversity requirements.
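
The quadrant logic for the F50 versus ECFP example can be sketched as below. The values are made up, the medians are used as thresholds as suggested above, and the treatment of points that fall exactly on a threshold is an assumption. Note that the orientation would flip for a y-axis metric such as AUC, where lower values mean higher diversity.

```python
from statistics import median


def quadrant(similarity, f50, x_threshold, y_threshold):
    """Quadrant color for one data set (F50 on y, fingerprint similarity on x)."""
    diverse_scaffolds = f50 >= y_threshold             # high F50: many scaffolds
    diverse_fingerprints = similarity <= x_threshold   # low similarity: diverse
    if diverse_scaffolds and diverse_fingerprints:
        return "red"
    if diverse_scaffolds:
        return "yellow"
    if diverse_fingerprints:
        return "blue"
    return "white"


# Made-up (similarity, F50) pairs for four hypothetical data sets.
data = {"A": (0.20, 0.40), "B": (0.60, 0.35), "C": (0.25, 0.05), "D": (0.70, 0.10)}
x_threshold = median(sim for sim, _ in data.values())
y_threshold = median(f50 for _, f50 in data.values())
colors = {name: quadrant(sim, f50, x_threshold, y_threshold)
          for name, (sim, f50) in data.items()}
```

In this toy example data set A lands in the red quadrant, B in yellow, C in blue and D in white.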

The size and color of the dot for each data set

The size of the dot depends on the size of the data set it represents: bigger dots correspond to bigger data sets. The color scale of the dots represents the molecular properties diversity. You can determine the intra-data set distances using the molecular properties and then use the mean or the median of these distances to set the color of each data set. Green dots represent data sets with lower molecular properties diversity and red dots represent data sets with higher molecular properties diversity.
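
A possible way to prepare the Size and Molecular_properties values is sketched below. Both helpers are hypothetical: the linear mapping of compound counts onto the 1 to 20 range is an assumption (any monotonic mapping would do), and the property vectors are assumed to be already scaled so that no single property dominates the Euclidean distance.

```python
from itertools import combinations
from math import dist
from statistics import mean


def relative_sizes(compound_counts, low=1, high=20):
    """Map compound counts linearly onto whole numbers between low and high."""
    smallest = min(compound_counts.values())
    largest = max(compound_counts.values())
    if smallest == largest:
        # All data sets have the same size: use one value for all of them.
        return {name: low for name in compound_counts}
    span = largest - smallest
    return {name: round(low + (high - low) * (count - smallest) / span)
            for name, count in compound_counts.items()}


def mean_property_distance(property_rows):
    """Mean pairwise Euclidean distance between compounds of one data set."""
    distances = [dist(a, b) for a, b in combinations(property_rows, 2)]
    return mean(distances)
```

For example, `relative_sizes({"A": 100, "B": 1000, "C": 500})` assigns 1 to the smallest data set and 20 to the largest, with the third one in between.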

Template

Here you can download the .csv file that contains the header with the names the program will read. You can fill the template with your data.

DO NOT change the headers or the order of the columns.

If your .csv is saved with semicolons (;) as separators, it will give the following error: object 'Size' not found. Make sure it is saved with commas (,) as separators.
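
If your spreadsheet program insists on semicolons, a small script can convert the file before upload. The sketch below is one hypothetical way to do it in Python. It also assumes that a semicolon-separated file may use decimal commas (e.g. 0,25), which it rewrites as decimal points so the values stay numeric.

```python
import csv
import io
import re

# A cell such as "0,25" or "-1,5": digits, one comma, digits.
DECIMAL_COMMA = re.compile(r"^-?\d+,\d+$")


def to_comma_separated(text):
    """Return text as comma-separated CSV if it was semicolon-separated."""
    lines = text.splitlines()
    if not lines:
        return text
    header = lines[0]
    if ";" not in header or "," in header:
        return text  # already comma-separated (or not recognisable): untouched
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for row in csv.reader(io.StringIO(text), delimiter=";"):
        writer.writerow([cell.replace(",", ".")
                         if DECIMAL_COMMA.match(cell.strip()) else cell
                         for cell in row])
    return output.getvalue()
```

Only the header line is inspected to decide whether a conversion is needed, so a file that already uses commas is returned unchanged.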

Click here to download the template

Thresholds

Axis maximum value

Axis minimum value

Cyclic System Recovery (CSR) curves

Here you can compute the scaffold diversity of all your data sets. Just upload a comma delimited (,) file in which each column will have a different data set and each row will have the scaffold ID for the compounds in the data set. Compounds with the same scaffold must have the same ID. Reference.

Here you can obtain the number of compounds (M), number of chemotypes (N), fraction of chemotypes (FNM), number of singletons (NSING), fractions of singletons (FNSING, FNSINGM), area under the curve (AUC) and the fraction of scaffolds required to retrieve 50% (F50) of the compounds in the corresponding data set. You can use this information for the CDPlots.
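
The count-based quantities in this summary can be reproduced offline with a few lines of Python, for instance to check an input file before uploading it. The sketch makes several assumptions that are not stated on this page: the first row holds the data set names, shorter columns are padded with empty cells, FNSING is the number of singletons divided by N, and FNSINGM is the number of singletons divided by M.

```python
import csv
import io
from collections import Counter


def csr_summary(csv_text):
    """Count-based scaffold statistics for each column (data set) of the file."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    names, body = rows[0], rows[1:]  # assumed: first row = data set names
    summary = {}
    for col, name in enumerate(names):
        # Scaffold IDs of this data set, skipping the padding cells.
        ids = [row[col].strip() for row in body
               if col < len(row) and row[col].strip()]
        counts = Counter(ids)
        m, n = len(ids), len(counts)
        nsing = sum(1 for c in counts.values() if c == 1)
        summary[name] = {
            "M": m, "N": n, "FNM": n / m,
            "NSING": nsing,
            "FNSING": nsing / n,    # assumed definition: singletons / chemotypes
            "FNSINGM": nsing / m,   # assumed definition: singletons / compounds
        }
    return summary
```

Each column is expected to contain at least one scaffold ID; an empty column would cause a division by zero in this sketch.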

Download CSR curves data

Cumulative Distribution Function (CDF)

Here you can compute the fingerprint diversity and obtain the Cumulative Distribution Function (CDF) of your data sets. Just upload a comma delimited (,) file in which the first column will have SMILES, the second column your data set names and the third column your compound IDs.

Wait until the plot appears before you download the data. This could take a few minutes, depending on the number of compounds you are analyzing.

You can use the similarity mean or median for the CDPlots.
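
For readers who want to see what the CDF represents, the following sketch builds an empirical CDF from a list of pairwise similarity values that is assumed to be already computed. The values are made up and the function name is hypothetical. The app's own calculation may differ in detail.

```python
from statistics import mean, median


def empirical_cdf(values):
    """Points (similarity, fraction of pairs with similarity <= that value)."""
    ordered = sorted(values)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]


# Made-up pairwise similarity values for one data set.
similarities = [0.5, 1.0, 0.5, 0.25]
cdf = empirical_cdf(similarities)
summary = {"mean": mean(similarities), "median": median(similarities)}
```

A curve that rises early (most pairs have low similarity) indicates a more diverse data set than one that rises late.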

Download the similarity summary

SSEn plot and data

Here you can compute the Scaled Shannon Entropy from the 10 (SSE10) to the 60 (SSE60) most populated scaffolds in all your data sets. Just upload a comma delimited (,) file in which each column will have a different data set and each row will have the scaffold ID for the compounds in the data set. Compounds with the same scaffold must have the same ID.

Plot

Select the entropy of the n most populated scaffolds.

Write the number of the column with the data set you want to plot.

You will obtain the SSE10, SSE20, SSE30, SSE40, SSE50, SSE60 of the scaffolds or chemotypes in the corresponding data set. You can use this information for the CDPlots.
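
As a rough guide to how such values arise, the sketch below computes the scaled Shannon entropy of the n most populated scaffolds: the Shannon entropy of the compound distribution over those scaffolds, divided by its maximum possible value, log2(n). Scaling by the number of scaffolds actually present when a data set has fewer than n scaffolds is an assumption of this sketch, as is the function name.

```python
from collections import Counter
from math import log2


def scaled_shannon_entropy(scaffold_ids, n):
    """SSEn for the n most populated scaffolds of one data set."""
    top_counts = [count for _, count in Counter(scaffold_ids).most_common(n)]
    k = len(top_counts)  # may be smaller than n for small data sets
    if k < 2:
        return 0.0       # a single chemotype has no diversity
    total = sum(top_counts)
    entropy = -sum((c / total) * log2(c / total) for c in top_counts)
    return entropy / log2(k)
```

Two equally populated scaffolds give a value of 1, while a data set whose compounds all share one scaffold gives 0, in line with the range described for SSEn.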

Download the SSE data

We would appreciate your feedback!

If you have questions or suggestions, please send an email to:

Mariana Gonzalez Medina mgm_14392@comunidad.unam.mx

Jose L. Medina Franco medinajl@unam.mx

Funding

UNAM: PAPIME PE200116; PAIP 5000-9163

Laboratorios Senosiain S.A. de C.V.

Nuevas Alternativas de Tratamiento para Enfermedades Infecciosas, Instituto de Investigaciones Biomedicas, UNAM