This notebook uses Python 3 and Jupyter Notebook, together with the Python libraries Pandas, NumPy, Scikit-learn, Matplotlib and SciPy.
You should have Anaconda installed on your computer, which already includes Python 3, Jupyter, Scikit-learn, Pandas, NumPy, Matplotlib and SciPy.
If you run into any issues, the library versions used to develop this notebook are listed in the table below.
Library | Version |
---|---|
Anaconda | 5.3.0 |
Matplotlib | 3.0.3 |
NumPy | 1.15.4 |
Pandas | 0.23.4 |
Scikit-learn | 0.20.1 |
SciPy | 1.1.0 |
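If something does not behave as expected, comparing your installed versions with the table above is a reasonable first check. The snippet below is a minimal sketch using the standard library's `importlib.metadata` (available from Python 3.8); the helper name `installed_versions` and the package list are illustrative and not part of this notebook's modules.

```python
from importlib import metadata

# Distribution names as published on PyPI (assumed to match your installation).
PACKAGES = ["matplotlib", "numpy", "pandas", "scikit-learn", "scipy"]

def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions().items():
    print(f"{name:<14}{version or 'not installed'}")
```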
import ppi_prediction as ppi
import evaluating as evlt
import matplotlib.pyplot as plt
from IPython import display
In this section, state the type of benchmark dataset you're working with (`dataset_type` = PPI, MF or GENE, case-insensitive) and the paths to your `test_dataset` and `benchmark_dataset`.
The datasets for `benchmark_dataset` can be retrieved from GitHub.
Beware that using benchmark and test datasets of different sizes raises an error during the evaluation, and that selecting the wrong `dataset_type` can lead to incorrect results.
dataset_type = 'ppi'
test_dataset = 'Results/PPI_DM_3.csv'
benchmark_dataset = 'Datasets/PPI_DM_3.csv'
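Because mismatched dataset sizes and a wrong `dataset_type` are the two pitfalls mentioned above, a quick check before running the evaluation can save time. The following is only a sketch: `check_inputs` is a hypothetical helper, not a function of `evaluating` or `ppi_prediction`, and it assumes that "size" means the number of non-empty lines and that both files either have a header or both lack one.

```python
import csv

VALID_TYPES = {"ppi", "mf", "gene"}

def check_inputs(dataset_type, test_path, benchmark_path):
    """Validate the dataset type and compare the number of non-empty lines in both files."""
    kind = dataset_type.lower()  # dataset_type is case-insensitive
    if kind not in VALID_TYPES:
        raise ValueError(f"dataset_type must be PPI, MF or GENE, got {dataset_type!r}")

    def count_rows(path):
        # Only lines are counted, so the delimiter does not matter here.
        with open(path, newline="") as handle:
            return sum(1 for row in csv.reader(handle) if row)

    n_test, n_bench = count_rows(test_path), count_rows(benchmark_path)
    if n_test != n_bench:
        raise ValueError(f"row count mismatch: test has {n_test}, benchmark has {n_bench}")
    return kind, n_test

# Usage with the variables defined in the cell above:
# check_inputs(dataset_type, test_dataset, benchmark_dataset)
```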
`test_dataset` should be the path to the dataset containing the test semantic similarity measures. `extract_new_ssm` extracts only the column containing a given measure (the default is `col_n = -1`). The default delimiter for these datasets is `;`, but it can be changed with the parameter `sep`. Additionally, if the dataset has no header, add the parameter `head = None`. The cell below shows an example of how to use these parameters.
`measures` is the dictionary containing all the measures you want to evaluate. Each key is a measure's name (which appears in the plots and dataframes throughout this notebook) and each value holds the similarities calculated with that measure for the pairs in the dataset.
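# sep and col_n select the delimiter and the column holding each measure;
# for a dataset without a header, also pass head=None.
measures = {'Measure 1': evlt.extract_new_ssm(test_dataset, sep=',', col_n=3),
            'Measure 2': evlt.extract_new_ssm(test_dataset, sep=',', col_n=4)}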
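To make the roles of the delimiter, column index and header options more concrete, here is a rough standard-library sketch of reading a single column from a delimited file. It is not the implementation of `extract_new_ssm`; `read_column` and its `has_header` flag are hypothetical names that merely mirror the `sep`, `col_n` and `head` parameters described above.

```python
import csv

def read_column(path, sep=";", col_n=-1, has_header=True):
    """Read one column of a delimited text file as a list of floats."""
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter=sep) if row]
    if has_header:
        rows = rows[1:]  # drop the header line
    # A negative col_n counts from the end of each row, as in normal Python indexing.
    return [float(row[col_n]) for row in rows]
```

With the defaults, the function reads the last column of a `;`-separated file that has a header line.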
This module calculates the Pearson correlation coefficient between the test semantic similarity measures and the similarity proxies available for the type of dataset you're working with. The result is a dataframe that lets you compare state-of-the-art semantic similarity measures with the test semantic similarity measures.
results = evlt.correlation_calculation(benchmark_dataset, measures, dataset_type)
display.display(results)
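For readers who want to see what the coefficient measures, the sketch below computes the Pearson correlation directly from its definition using only the standard library. It is an illustration with made-up values, not the code behind `correlation_calculation`; `scipy.stats.pearsonr` computes the same quantity.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) when either sequence is constant.
    return cov / math.sqrt(var_x * var_y)

# Made-up similarity values, for illustration only.
measure = [0.1, 0.4, 0.35, 0.8]
proxy = [0.2, 0.5, 0.3, 0.9]
print(round(pearson(measure, proxy), 3))
```

The length check reflects the same constraint noted earlier: the test and benchmark datasets must contain the same pairs in the same number.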
In this section you can plot each test semantic similarity measure against the similarity proxies available for the type of dataset you're working with. The proxies available for each dataset type are:
Dataset type | Similarity proxy |
---|---|
PPI | sim(seq) |
MF | sim(seq) and sim(MF) |
GENE | sim(PS) |
You can customize your plots by changing the axis labels (`xlabel` and `ylabel`) and by adding `size = int` to any of the plotting functions, which changes the size of the plot points. The color of the plot points can be changed with the argument `col`. Check the Matplotlib documentation for more information on colors, and the comments in the cell below for information on how to use these optional parameters.
If you're working with an MF dataset, two side-by-side plots will be produced. Alternatively, you can run the two cells below for separate plots.
selected_measure = "Measure 1"
# Optional arguments for any plotting function, e.g. size=20, col='red',
# change the size and color of the plot points.
if evlt.molecular_function_dataset(dataset_type):
    # MF datasets: two side-by-side plots, one per similarity proxy
    plot = evlt.plot_molecular_function(benchmark_dataset, measures[selected_measure])
    plot[0].set(xlabel='Sequence Similarity', ylabel=selected_measure)
    plot[1].set(xlabel='Molecular Function Similarity', ylabel=selected_measure)
else:
    plot = evlt.correlation_plot(benchmark_dataset, measures[selected_measure], dataset_type, selected_measure)
# Separate plot against molecular function similarity (MF datasets only)
if evlt.molecular_function_dataset(dataset_type):
    plot_pfam = evlt.plotting_molecular_function(benchmark_dataset, measures[selected_measure], selected_measure)
# Separate plot against sequence similarity (MF datasets only)
if evlt.molecular_function_dataset(dataset_type):
    plot_seq = evlt.plotting_sequence(benchmark_dataset, measures[selected_measure], selected_measure)
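To see how point size, color and axis labels interact in a plain Matplotlib scatter plot, consider the sketch below. It assumes, without confirmation from the notebook's modules, that the `size` and `col` arguments described above map onto Matplotlib's `s` and `c` arguments of `scatter`; the data is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt

# Made-up values, for illustration only.
proxy = [0.2, 0.5, 0.3, 0.9]
measure = [0.1, 0.4, 0.35, 0.8]

fig, ax = plt.subplots()
# s sets the marker size (in points squared) and c sets the marker color.
points = ax.scatter(proxy, measure, s=20, c="tab:blue")
# Axis labels are set the same way as on the plots returned in the cells above.
ax.set(xlabel="Sequence Similarity", ylabel="Measure 1")
```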
If you're working with a PPI dataset, you can also calculate how well the semantic similarity measures predict protein-protein interactions. This module has two different functionalities:
- A precision-recall plot for all semantic similarity measures (`plot_precision_recall`).
- An evaluation at a threshold of your choice (`thresh`). A dataframe and plot comparing precision, recall and F1-score for all semantic similarity measures at that threshold will be produced. The color of the plot points can be changed by adding the parameter `col` = [list of strings with size = number of measures], where each string is a valid color. You can also add `size = int` to change the size of the plot points.
For both plots, check the Matplotlib documentation for more information on colors, or check the comments for code examples. Beware that, for both plots, the default color list only supports 10 different similarity measures. To plot more than 10 measures in the same plot, you must supply your own `col` list of Matplotlib colors.
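When more than 10 measures are plotted, the `col` list has to be built by hand. One possible way, sketched below with the standard library's `colorsys`, is to spread hues evenly and format them as hex strings, which Matplotlib accepts as colors. The helper `distinct_colors` and the 12 measure names are hypothetical.

```python
import colorsys

def distinct_colors(n):
    """Return n hex color strings with evenly spaced hues."""
    colors = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.75, 0.85)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors

measure_names = [f"Measure {i}" for i in range(1, 13)]  # 12 hypothetical measures
col = distinct_colors(len(measure_names))  # one color per measure
```

The resulting list has exactly one entry per measure, which is the size the `col` parameter expects.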
if ppi.interaction_dataset(dataset_type):
    results = ppi.plot_precision_recall(benchmark_dataset, measures)
    display.display(results)
# Input your threshold of choice
thresh = 0.2
if ppi.interaction_dataset(dataset_type):
    df_f1 = ppi.get_f1_score_df(benchmark_dataset, measures, thresh)
    ppi.plot_f1(df_f1)
    display.display(df_f1)
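As a reminder of how precision, recall and F1-score follow from a threshold, the sketch below scores a simple threshold classifier on made-up data. It assumes a pair is predicted to interact when its similarity is greater than or equal to `thresh`; whether `get_f1_score_df` uses exactly this rule is not confirmed here, and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(similarities, labels, thresh):
    """Score a threshold classifier: predict an interaction when similarity >= thresh."""
    predicted = [s >= thresh for s in similarities]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    # Each score falls back to 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up data: 1 = interacting pair, 0 = non-interacting pair.
similarities = [0.9, 0.1, 0.4, 0.3, 0.05]
labels = [1, 0, 1, 0, 1]
print(precision_recall_f1(similarities, labels, thresh=0.2))
```

Raising the threshold typically trades recall for precision, which is why comparing measures at a single threshold and across the whole precision-recall plot can lead to different rankings.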
See the LICENSE.md file for details.
This project was funded by the Portuguese FCT through the LASIGE Research Unit (UID/CEC/00408/2019), and also by the SMILAX project (PTDC/EEI-ESS/4633/2014).