A Benchmark for Biomedical Knowledge Graph based Similarity

This notebook guides you through the evaluation of semantic similarity measures. The project's GitHub repository explains how to use the benchmark data sets to calculate and evaluate semantic similarity measures.

Table of Contents

  1. Importing libraries
  2. Selecting datasets
  3. Correlation calculation
  4. Correlation plotting
  5. Protein-protein interaction prediction

1. Importing libraries

In this notebook, Python 3, Jupyter Notebook as well as the Python libraries Pandas, NumPy, Scikit-learn, Matplotlib and SciPy are used.

You should have Anaconda installed on your computer, which already includes Python 3, Jupyter, Scikit-learn, Pandas, NumPy, Matplotlib and SciPy.

If you have any issues, the library versions used in the development of this notebook can be found in the table below.

Library        Version
Anaconda       5.3.0
Matplotlib     3.0.3
NumPy          1.15.4
Pandas         0.23.4
Scikit-learn   0.20.1
SciPy          1.1.0
In [1]:
import ppi_prediction as ppi
import evaluating as evlt
import matplotlib.pyplot as plt
from IPython import display
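
If results differ from the ones shown here, comparing your installed versions with the table above is a reasonable first step. The following is a minimal sketch, not part of the benchmark code. It assumes Python 3.8 or later (for importlib.metadata), and the helper name installed_versions is hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as published on PyPI for the libraries in the table.
found = installed_versions(["matplotlib", "numpy", "pandas", "scikit-learn", "scipy"])
for name, version in found.items():
    print(f"{name}: {version or 'not installed'}")
```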

2. Selecting datasets

In this section, you should state the type of benchmark data sets you're dealing with (dataset_type = PPI, MF or GENE, case-insensitive) as well as the path for your test_dataset and benchmark_dataset.

The datasets for benchmark_dataset can be retrieved from GitHub.

Beware that using benchmark and test datasets of different sizes will raise an error in the evaluation and that selecting the wrong dataset_type can lead to wrongly calculated results.

In [2]:
dataset_type = 'ppi'
test_dataset = 'Results/PPI_DM_3.csv'
benchmark_dataset = 'Datasets/PPI_DM_3.csv'
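
Because a size mismatch between the two files raises an error later on, it can help to compare row counts up front. The sketch below is an illustration only: count_pairs is a hypothetical helper, and the in-memory files stand in for real datasets whose exact layout is assumed, not taken from the benchmark.

```python
import csv
import io


def count_pairs(handle, sep=";", has_header=True):
    """Count data rows in an open delimited file, ignoring blank lines."""
    rows = [row for row in csv.reader(handle, delimiter=sep) if row]
    if has_header and rows:
        return len(rows) - 1
    return len(rows)


# Tiny in-memory stand-ins for a test and a benchmark file (made-up content).
test_file = io.StringIO("P1;P2;0.5\nP3;P4;0.7\n")
bench_file = io.StringIO("P1;P2;1\nP3;P4;0\n")

n_test = count_pairs(test_file, has_header=False)
n_bench = count_pairs(bench_file, has_header=False)
if n_test != n_bench:
    raise ValueError(f"size mismatch: {n_test} test rows vs {n_bench} benchmark rows")
```

With real files, the same helper can be called on handles obtained from open(test_dataset) and open(benchmark_dataset).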

test_dataset should be the path to the dataset containing the test semantic similarity measures. extract_new_ssm extracts only the column containing the measure (the default is col_n = -1, the last column). The default delimiter for these datasets is ;, but it can be changed with the parameter sep. Additionally, if the dataset has no header, the parameter head = None should be added. The cell below passes sep and col_n explicitly.

measures is the name of the dictionary containing all the measures you want to evaluate, the key being the measure's name (that will feature in the plots and dataframes throughout this notebook) and the value being the calculated similarities for the pairs in the data set using that measure.

In [4]:
measures = {'Measure 1':evlt.extract_new_ssm(test_dataset, sep = ',', col_n = 3), 
            'Measure 2':evlt.extract_new_ssm(test_dataset, sep = ',', col_n = 4)}
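
To make the role of sep, col_n and the header option concrete, here is a rough stand-in for the column extraction step. It is not the implementation of evlt.extract_new_ssm. The function extract_column, its has_header flag and the sample file layout are all assumptions made for illustration.

```python
import csv
import io


def extract_column(handle, sep=";", col_n=-1, has_header=True):
    """Return one column of a delimited file as a list of floats."""
    rows = [row for row in csv.reader(handle, delimiter=sep) if row]
    if has_header:
        rows = rows[1:]  # drop the header row
    return [float(row[col_n]) for row in rows]


# Made-up comma-separated file: two proteins, a label and two measures.
sample = io.StringIO(
    "prot1,prot2,label,m1,m2\n"
    "P1,P2,1,0.81,0.77\n"
    "P3,P4,0,0.12,0.20\n"
)
measure_1 = extract_column(sample, sep=",", col_n=3)

sample.seek(0)  # rewind before reading the same handle again
measure_2 = extract_column(sample, sep=",")  # col_n=-1 picks the last column
```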

3. Correlation calculation

This module calculates the Pearson correlation coefficient between the test semantic similarity measures and the similarity proxies available for the type of dataset you're working with. The result is a dataframe that lets you compare state-of-the-art semantic similarity measures with the test semantic similarity measures.

In [5]:
results = evlt.correlation_calculation(benchmark_dataset, measures, dataset_type)
display.display(results)
   similarity proxy             BMA Resnik  BMA Seco  GIC Resnik  GIC Seco  Measure 1  Measure 2
0  sequence                     0.394638    0.475430  0.483673    0.496959  0.501267   0.488958
1  protein protein interaction  0.639708    0.606774  0.470468    0.468584  0.462193   0.465210
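
For reference, the Pearson correlation coefficient behind each value in this table can be written out in a few lines. This is a plain-Python sketch of the standard formula, not the code used by evlt.correlation_calculation, and the sample values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    if len(x) != len(y):
        raise ValueError("both sequences must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented values: a similarity proxy and a test measure for four pairs.
proxy = [0.1, 0.4, 0.35, 0.8]
measure = [0.2, 0.5, 0.3, 0.9]
r = pearson(proxy, measure)
print(f"Pearson r = {r:.4f}")
```

In practice, scipy.stats.pearsonr computes the same coefficient and also returns a p-value.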

4. Correlation plotting

In this section you can plot each test semantic similarity measure and the similarity proxies available for the type of dataset you're working with. Semantic similarity measures can be plotted against the following proxies, according to dataset type:

Dataset type   Similarity proxy
PPI            sim(seq)
MF             sim(seq) and sim(MF)
GENE           sim(PS)

You can customize your plots by changing the axis labels (xlabel and ylabel) and by adding size = int to any of the plotting functions, which changes the size of the plot points. The color of the plot points can also be changed with the argument col. Check the Matplotlib documentation for more information on colors.
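
As a point of comparison, this is how marker size, color and axis labels are set on a plain Matplotlib scatter plot. It is only a sketch with invented data. Whether the evlt plotting functions forward size and col to Matplotlib in exactly this way is an assumption.

```python
import matplotlib.pyplot as plt

# Invented values: a similarity proxy and a test measure for four pairs.
proxy = [0.1, 0.4, 0.35, 0.8]
measure = [0.2, 0.5, 0.3, 0.9]

fig, ax = plt.subplots()
# s sets the marker area in points squared; c accepts any Matplotlib color.
points = ax.scatter(proxy, measure, s=20, c="tab:blue")
ax.set(xlabel="Sequence Similarity", ylabel="Measure 1")
```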

If you're working with an MF dataset, two side-by-side plots will be produced. Alternatively, you can run the two cells below for separate plots.

In [6]:
selected_measure = "Measure 1"
In [7]:
if evlt.molecular_function_dataset(dataset_type):
    plot = evlt.plot_molecular_function(benchmark_dataset, measures[selected_measure])
    plot[0].set(xlabel = 'Sequence Similarity', ylabel=selected_measure)
    plot[1].set(xlabel = 'Molecular Function Similarity', ylabel=selected_measure)
else:
    plot = evlt.correlation_plot(benchmark_dataset, measures[selected_measure], dataset_type, selected_measure) 
In [8]:
if evlt.molecular_function_dataset(dataset_type):
    plot_pfam = evlt.plotting_molecular_function(benchmark_dataset, measures[selected_measure], selected_measure)
In [9]:
if evlt.molecular_function_dataset(dataset_type):    
    plot_seq = evlt.plotting_sequence(benchmark_dataset, measures[selected_measure], selected_measure)

5. Protein-protein interaction prediction

If you're dealing with a PPI dataset, you can calculate how well the semantic similarity measures predict protein-protein interactions. This module has two functionalities:

  1. Plot a precision-recall curve and find the highest F1-score. For each measure (state-of-the-art and test), 10-fold cross-validation is used to test a range of thresholds and select the best one.
  2. Calculate precision, recall and F1-score for a given threshold (set by changing the value of thresh). This produces a dataframe and a plot comparing precision, recall and F1-score for all semantic similarity measures at that threshold. The color of the plot points can be changed by adding the parameter col = [list of strings with size = number of measures], where each string is a valid color. You can also add size = int to change the size of the plot points.

For both plots, you can check Matplotlib for more information on the colors for your plot, or check the comments for code examples. Beware that for both plots, the default color list only supports 10 different similarity measures. If you want to plot more than 10 measures in the same plot, you must input your own col list with Matplotlib colors.
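
The two functionalities above rest on the same calculation, which can be sketched in plain Python. This is not the ppi module's code. It assumes that a pair is predicted as interacting when its similarity is at least the threshold and that labels are 1 (interacting) or 0. The threshold search is a simple sweep over the whole data set, without the 10-fold cross-validation the notebook uses. The function names and data are hypothetical.

```python
def precision_recall_f1(scores, labels, thresh):
    """Precision, recall and F1 when pairs scoring >= thresh are predicted positive."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= thresh
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(scores, labels, candidates):
    """Return the first candidate threshold that reaches the highest F1."""
    return max(candidates, key=lambda t: precision_recall_f1(scores, labels, t)[2])


# Invented similarities and interaction labels for five protein pairs.
scores = [0.9, 0.8, 0.3, 0.25, 0.1]
labels = [1, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(scores, labels, 0.2)
best = best_threshold(scores, labels, [i / 100 for i in range(1, 100)])
```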

In [10]:
if ppi.interaction_dataset(dataset_type):
    results = ppi.plot_precision_recall(benchmark_dataset, measures)
    display.display(results)
   metric          Measure 1  Measure 2  BMA Resnik  BMA Seco  GIC Resnik  GIC Seco
0  Best Threshold  0.090000   0.080000   0.280000    0.370000  0.080000    0.090000
1  Precision       0.973684   0.973684   1.000000    0.950000  0.973684    0.973684
2  Recall          0.840909   0.840909   0.818182    0.863636  0.840909    0.840909
3  F1-score        0.902439   0.902439   0.900000    0.904762  0.902439    0.902439
In [11]:
# Input your threshold of choice
thresh = 0.2

if ppi.interaction_dataset(dataset_type):
    df_f1 = ppi.get_f1_score_df(benchmark_dataset, measures, thresh)
    ppi.plot_f1(df_f1)
    display.display(df_f1)
C:\Users\Carlota\Desktop\Jupyter\precisionRecall.py:32: RuntimeWarning: divide by zero encountered in double_scalars
  return (f * p / (2 * p - f))
   metric     BMA Resnik  BMA Seco  GIC Resnik  GIC Seco  Measure 1  Measure 2
0  Precision  0.745098    0.709677  1.000000    0.962963  0.960000   1.000000
1  Recall     0.863636    1.000000  0.590909    0.590909  0.545455   0.545455
2  F-measure  0.800000    0.830189  0.742857    0.732394  0.695652   0.705882

Authors

  • Carlota Cardoso
  • Rita Sousa
  • Cátia Pesquita

License

See the LICENSE.md file for details.

Acknowledgments

This project was funded by the Portuguese FCT through the LASIGE Research Unit (UID/CEC/00408/2019), and also by the SMILAX project (PTDC/EEI-ESS/4633/2014).