Title: | Hierarchical Ensemble Methods for Directed Acyclic Graphs |
---|---|
Description: | An implementation of several Hierarchical Ensemble Methods (HEMs) for Directed Acyclic Graphs (DAGs). 'HEMDAG' package: 1) reconciles flat predictions with the topology of the ontology; 2) can enhance the predictions of virtually any flat learning methods by taking into account the hierarchical relationships between ontology classes; 3) provides biologically meaningful predictions that always obey the true-path-rule, the biological and logical rule that governs the internal coherence of biomedical ontologies; 4) is specifically designed for exploiting the hierarchical relationships of DAG-structured taxonomies, such as the Human Phenotype Ontology (HPO) or the Gene Ontology (GO), but can be safely applied to tree-structured taxonomies as well (as FunCat), since trees are DAGs; 5) scales nicely both in terms of the complexity of the taxonomy and in the cardinality of the examples; 6) provides several utility functions to process and analyze graphs; 7) provides several performance metrics to evaluate HEMs algorithms. (Marco Notaro, Max Schubach, Peter N. Robinson and Giorgio Valentini (2017) <doi:10.1186/s12859-017-1854-y>). |
Authors: | Marco Notaro [aut, cre] |
Maintainer: | Marco Notaro <[email protected]> |
License: | GPL (>= 3) |
Version: | 2.7.4 |
Built: | 2025-02-15 05:41:00 UTC |
Source: | https://github.com/marconotaro/hemdag |
The HEMDAG package:
provides an implementation of several Hierarchical Ensemble Methods (HEMs) for Directed Acyclic Graphs (DAGs);
reconciles flat predictions with the topology of the ontology;
can enhance the predictions of virtually any flat learning method by taking into account the hierarchical relationships between ontology classes;
provides biologically meaningful predictions that obey the true-path-rule, the biological and logical rule that governs the internal coherence of biomedical ontologies;
is specifically designed to exploit the hierarchical relationships of DAG-structured taxonomies, such as the Human Phenotype Ontology (HPO) or the Gene Ontology (GO), but can be safely applied to tree-structured taxonomies as well (e.g., FunCat), since trees are DAGs;
scales well both with the complexity of the taxonomy and with the number of examples;
provides several utility functions to process and analyze graphs;
provides several performance metrics to evaluate HEM algorithms.
A comprehensive tutorial showing how to apply HEMDAG to real biomedical case studies is available at https://hemdag.readthedocs.io.
The HEMDAG package implements the following Hierarchical Ensemble Methods for DAGs:
HTD-DAG: Hierarchical Top Down (htd);
GPAV-DAG: Generalized Pool-Adjacent Violators, Burdakov et al. (gpav);
TPR-DAG: True-Path Rule (tpr.dag);
DESCENS: Descendants Ensemble Classifier (tpr.dag);
ISO-TPR: Isotonic-True-Path Rule (tpr.dag);
Max, And, Or: Heuristic Methods, Obozinski et al. (obozinski.heuristic.methods).
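As a quick orientation, the following minimal sketch (not taken from the package manual) shows how the methods listed above are typically invoked on the toy data shipped with the package; all functions and data objects used here are documented below.
library(HEMDAG);
data(graph); data(scores);                ## toy DAG 'g' and flat score matrix 'S'
root <- root.node(g);                     ## root class of the DAG
S.htd  <- htd(S, g, root);                ## HTD-DAG: per-level top-down correction
S.gpav <- gpav.vanilla(S, g, W=NULL);     ## GPAV-DAG: isotonic regression on the DAG
S.tpr  <- tpr.dag(S, g, root=root, positive="children",
                  bottomup="threshold.free", topdown="gpav");  ## vanilla TPR-DAG + GPAV top-down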
Marco Notaro (https://orcid.org/0000-0003-4309-2200);
Alessandro Petrini (https://orcid.org/0000-0002-0587-1484);
Giorgio Valentini (https://orcid.org/0000-0002-5694-3919);
Maintainer:
Marco Notaro
[email protected]
AnacletoLab, Computational Biology and Bioinformatics Laboratory, Computer Science Department, University of Milan, Italy
Marco Notaro, Max Schubach, Peter N. Robinson and Giorgio Valentini, Prediction of Human Phenotype Ontology terms by means of Hierarchical Ensemble methods, BMC Bioinformatics 2017, 18(1):449, doi:10.1186/s12859-017-1854-y
Compute a binary square upper triangular matrix whose rows and columns correspond to the node names of the graph g.
adj.upper.tri(g)
g |
a graph of class graphNEL. |
The nodes of the matrix are topologically sorted (by using the tsort function of the RBGL package).
Let's denote with adj our adjacency matrix. Then adj represents a partial order data set in which the class j dominates the class i: adj[i,j]=1 means that j dominates i; adj[i,j]=0 means that there is no edge between the class i and the class j. Moreover, the nodes of adj are ordered such that adj[i,j]=1 implies i < j, i.e. adj is upper triangular.
An adjacency matrix which is square, logical and upper triangular.
data(graph); adj <- adj.upper.tri(g);
Compute the Area under the Precision Recall Curve (AUPRC) through precrec package.
auprc.single.class(labels, scores, folds = NULL, seed = NULL) auprc.single.over.classes(target, predicted, folds = NULL, seed = NULL)
labels |
vector of the true labels (0 negative, 1 positive examples). |
scores |
a numeric vector of the values of the predicted labels (scores). |
folds |
number of folds on which computing the AUPRC. If |
seed |
initialization seed for the random generator to create folds. Set |
target |
matrix with the target multilabel: rows correspond to examples and columns to classes.
|
predicted |
a numeric matrix with predicted values (scores): rows correspond to examples and columns to classes. |
The AUPRC (for a single class or for a set of classes) is computed either one-shot or averaged across stratified folds.
auprc.single.class
computes the AUPRC just for a given class.
auprc.single.over.classes
computes the AUPRC for a set of classes, returning also the averaged values across the classes.
For all those classes having zero annotations, the AUPRC is set to 0. These classes are discarded when computing the AUPRC averaged across classes, both in the one-shot and in the cross-validated setting.
Names of rows and columns of labels
and predicted
matrix must be provided in the same order, otherwise a stop message is returned.
auprc.single.class
returns a numeric value corresponding to the AUPRC for the considered class;
auprc.single.over.classes
returns a list with two elements:
average: the average AUPRC across classes;
per.class: a named vector with AUPRC for each class. Names correspond to classes.
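A minimal sketch (the column index 3 and the fold/seed values are only illustrative) contrasting the one-shot AUPRC with the AUPRC averaged across stratified folds:
data(labels); data(scores); data(graph);
root <- root.node(g);
L <- L[,-which(colnames(L)==root)]; S <- S[,-which(colnames(S)==root)];
prc.oneshot <- auprc.single.class(L[,3], S[,3], folds=NULL, seed=NULL);  ## one-shot AUPRC
prc.cv      <- auprc.single.class(L[,3], S[,3], folds=5, seed=23);       ## averaged across 5 stratified folds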
data(labels); data(scores); data(graph); root <- root.node(g); L <- L[,-which(colnames(L)==root)]; S <- S[,-which(colnames(S)==root)]; prc.single.class <- auprc.single.class(L[,3], S[,3], folds=5, seed=23); prc.over.classes <- auprc.single.over.classes(L, S, folds=5, seed=23);
Compute the Area under the ROC Curve (AUROC) through precrec package.
auroc.single.class(labels, scores, folds = NULL, seed = NULL) auroc.single.over.classes(target, predicted, folds = NULL, seed = NULL)
labels |
vector of the true labels (0 negative, 1 positive examples). |
scores |
a numeric vector of the values of the predicted labels (scores). |
folds |
number of folds on which computing the AUROC. If |
seed |
initialization seed for the random generator to create folds. Set |
target |
annotation matrix: rows correspond to examples and columns to classes.
|
predicted |
a numeric matrix with predicted values (scores): rows correspond to examples and columns to classes. |
The AUROC (for a single class or for a set of classes) is computed either one-shot or averaged across stratified folds.
auroc.single.class
computes the AUROC just for a given class.
auroc.single.over.classes
computes the AUROC for a set of classes, including their average values across all the classes.
For all those classes having zero annotations, the AUROC is set to 0.5. These classes are included when computing the AUROC averaged across classes, both in the one-shot and in the cross-validated setting.
Names of rows and columns of labels
and predicted
must be provided in the same order, otherwise a stop message is returned.
auroc.single.class
returns a numeric value corresponding to the AUROC for the considered class;
auroc.single.over.classes
returns a list with two elements:
average: the average AUROC across classes;
per.class: a named vector with AUROC for each class. Names correspond to classes.
data(labels); data(scores); data(graph); root <- root.node(g); L <- L[,-which(colnames(L)==root)]; S <- S[,-which(colnames(S)==root)]; auc.single.class <- auroc.single.class(L[,3], S[,3], folds=5, seed=23); auc.over.classes <- auroc.single.over.classes(L, S, folds=5, seed=23);
Build ancestors for each node of a graph.
build.ancestors(g) build.ancestors.per.level(g, levels) build.ancestors.bottom.up(g, levels)
g |
a graph of class graphNEL. |
levels |
a list of character vectors. Each component represents a graph level and the elements of any component correspond to nodes. The level 0 coincides with the root node. |
build.ancestors returns a named list of vectors. Each component corresponds to a node of the graph and its vector is the set of its ancestors, including the node itself.
build.ancestors.per.level returns a named list of vectors. Each component corresponds to a node of the graph and its vector is the set of its ancestors, including the node itself. The nodes are ordered from root (included) to leaves.
build.ancestors.bottom.up returns a named list of vectors. Each component corresponds to a node of the graph and its vector is the set of its ancestors, including the node itself. The nodes are ordered from leaves to root (included).
data(graph); root <- root.node(g); anc <- build.ancestors(g); lev <- graph.levels(g, root=root); anc.tod <-build.ancestors.per.level(g,lev); anc.bup <- build.ancestors.bottom.up(g,lev);
Build children for each node of a graph.
build.children(g) build.children.top.down(g, levels) build.children.bottom.up(g, levels)
g |
a graph of class graphNEL. |
levels |
a list of character vectors. Each component represents a graph level and the elements of any component correspond to nodes. The level 0 coincides with the root node. |
build.children
returns a named list of vectors. Each component corresponds to a node of the graph and its vector
is the set of its children.
build.children.top.down
returns a named list of character vectors. Each component corresponds to a node
of the graph (i.e. parent node) and its vector is the set of its children. The nodes are ordered from root (included) to leaves.
build.children.bottom.up
returns a named list of character vectors. Each component corresponds to a node
of the graph (i.e. parent node) and its vector is the set of its children. The nodes are ordered from leaves (included) to root.
data(graph); root <- root.node(g); children <- build.children(g); lev <- graph.levels(g, root=root); children.tod <- build.children.top.down(g,lev); children.bup <- build.children.bottom.up(g,lev);
Build a graph in which all nodes are reachable from root.
build.consistent.graph(g = g, root = "00")
g |
an object of class graphNEL. |
root |
name of the class that is on the top-level of the hierarchy ( |
All nodes not accessible from root (if any) are removed from the graph and printed on stdout.
A graph (as an object of class graphNEL
) in which all nodes are accessible from root.
data(graph); root <- root.node(g); G <- graph::addNode(c("X","Y","Z"), g); G <- graph::addEdge(c("X","Y","Z"), c("HP:0011844","HP:0009810","HP:0012385"), G); G <- build.consistent.graph(G, root=root);
Build descendants for each node of a graph.
build.descendants(g) build.descendants.per.level(g, levels) build.descendants.bottom.up(g, levels)
g |
a graph of class graphNEL. |
levels |
a list of character vectors. Each component represents a graph level and the elements of any component correspond to nodes. The level 0 coincides with the root node. |
build.descendants returns a named list of vectors. Each component corresponds to a node of the graph and its vector is the set of its descendants, including the node itself.
build.descendants.per.level returns a named list of vectors. Each component corresponds to a node of the graph and its vector is the set of its descendants, including the node itself. The nodes are ordered from root (included) to leaves.
build.descendants.bottom.up returns a named list of vectors. Each component corresponds to a node of the graph and its vector is the set of its descendants, including the node itself. The nodes are ordered from leaves to root (included).
data(graph); root <- root.node(g); desc <- build.descendants(g); lev <- graph.levels(g, root=root); desc.tod <- build.descendants.per.level(g,lev); desc.bup <- build.descendants.bottom.up(g,lev);
Read an HPO obo file and write the edges of the DAG to a plain text file. The format of the output file is a sequence of rows, each corresponding to an edge represented as a pair of vertices separated by a blank.
build.edges.from.hpo.obo(obofile = "hp.obo", file = "edge.file")
obofile |
an HPO obo file. The extension of the obofile can be plain (".txt") or compressed (".gz"). |
file |
name of the file of the edges to be written. The extension of the file can be plain (".txt") or compressed (".gz"). |
A faster and more flexible parser to handle obo files can be found in the GitHub repository obogaf-parser.
A text file representing the edges in the format: source destination (i.e. one row for each edge).
## Not run: hpobo <- "http://purl.obolibrary.org/obo/hp.obo"; build.edges.from.hpo.obo(obofile=hpobo, file="hp.edge"); ## End(Not run)
Build parents for each node of a graph.
build.parents(g, root = "00") build.parents.top.down(g, levels, root = "00") build.parents.bottom.up(g, levels, root = "00") build.parents.topological.sorting(g, root = "00")
g |
a graph of class graphNEL. |
root |
name of the class that it is on the top-level of the hierarchy ( |
levels |
a list of character vectors. Each component represents a graph level and the elements of any component correspond to nodes. The level 0 represents the root node. |
build.parents
returns a named list of character vectors. Each component corresponds to a node of the graph (i.e. child node)
and its vector is the set of its parents (the root node is not included).
build.parents.top.down
returns a named list of character vectors. Each component corresponds to a node of the graph (i.e. child node)
and its vector is the set of its parents. The order of nodes follows the levels of the graph from root (excluded) to leaves.
build.parents.bottom.up
returns a named list of character vectors. Each component corresponds to a node of the
graph (i.e. child node) and its vector is the set of its parents. The nodes are ordered from leaves to root (excluded).
build.parents.topological.sorting
returns a named list of character vectors. Each component corresponds to a node of the graph (i.e. child node)
and its vector is the set of its parents. The nodes are ordered according to a topological sorting, i.e. parent nodes come before children nodes.
data(graph); root <- root.node(g) parents <- build.parents(g, root=root); lev <- graph.levels(g, root=root); parents.tod <- build.parents.top.down(g, lev, root=root); parents.bup <- build.parents.bottom.up(g, lev, root=root); parents.tsort <- build.parents.topological.sorting(g, root=root);
Build a score matrix from file
build.scores.matrix.from.list(file = "scores.list.txt", split = "[(\t,|)]") build.scores.matrix.from.tupla(file = "scores.tupla.txt")
file |
name of the text file to be read. The matrix of the input file can be either a list (e.g in the form |
split |
character vector containing a regular expression used for splitting. |
A named score matrix.
file.list <- system.file("extdata/scores.list.txt.gz", package="HEMDAG"); file.tupla <- system.file("extdata/scores.tupla.txt.gz", package="HEMDAG"); S <- build.scores.matrix.from.list(file.list, split="[(\t,|)]"); S <- build.scores.matrix.from.tupla(file.tupla);
Build a subgraph with only the supplied nodes and any edges between them.
build.subgraph(nd, g, edgemode = "directed")
nd |
a vector with the nodes for which the subgraph must be built. |
g |
a graph of class graphNEL. |
edgemode |
can be "directed" or "undirected". |
A subgraph with only the supplied nodes.
data(graph); anc <- build.ancestors(g); nd <- anc[["HP:0001371"]]; subg <- build.subgraph(nd, g, edgemode="directed");
Terms having fewer than n annotations are pruned; terms having exactly n annotations are discarded as well.
build.submatrix(ann, n)
ann |
the annotation matrix (0/1). Rows are examples and columns are functional terms. |
n |
an integer number representing the number of annotations to be pruned. |
An annotation matrix having only those terms with more than n annotations.
data(labels); subm <- build.submatrix(L,5);
Assess the integrity of an annotation matrix where a transitive closure of annotations was performed.
check.annotation.matrix.integrity(anc, ann.spec, ann)
anc |
the ancestor list. |
ann.spec |
the annotation matrix of the most specific annotations (0/1): rows are genes and columns are terms. |
ann |
the full annotation matrix (0/1), i.e. the matrix where the transitive closure of the annotation was performed. Rows are examples and columns are terms. |
If the transitive closure of the annotations is performed correctly, OK
is returned, otherwise an error message is printed on the stdout.
data(graph); data(labels); anc <- build.ancestors(g); tca <- transitive.closure.annotations(L, anc); check.annotation.matrix.integrity(anc, L, tca);
Check the integrity of a dag.
check.dag.integrity(g, root = "00")
g |
a graph of class graphNEL. |
root |
name of the class that is on the top-level of the hierarchy ( |
If all the nodes are accessible from the root "dag is ok" is printed, otherwise a message error and the list of the not accessible nodes is printed on the stdout.
data(graph); root <- root.node(g); check.dag.integrity(g, root=root);
Compute a directed graph with edges in the opposite direction.
compute.flipped.graph(g)
g |
a graph of class graphNEL. |
A graph (as an object of class graphNEL
) with edges in the opposite direction w.r.t. g
.
data(graph); g.flipped <- compute.flipped.graph(g);
Return a matrix with two columns and as many rows as there are edges. The entries of the first column are the indexes of the nodes the edges come from (i.e. children nodes); the entries of the second column are the indexes of the nodes the edges go to (i.e. parents nodes). Referring to a DAG, this matrix defines a partial order.
constraints.matrix(g)
g |
a graph of class graphNEL. |
A constraints matrix w.r.t the graph g
.
data(graph); m <- constraints.matrix(g);
Create a data frame for stratified cross-validation.
create.stratified.fold.df(labels, scores, folds = 5, seed = 23)
labels |
vector of the true labels (0 negative, 1 positive). |
scores |
a numeric vector of the values of the predicted labels. |
folds |
number of folds of the cross validation ( |
seed |
initialization seed for the random generator to create folds ( |
Folds are stratified, i.e. contain the same amount of positive and negative examples.
A data frame with three columns:
scores
: contains the predicted scores;
labels
: contains the labels as pos
or neg
;
folds
: contains the index of the fold in which the example falls.
The index can range from 1 to the number of folds.
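A small sketch (the column index 3 and the fold/seed values are only illustrative) showing how the returned data frame can be used to extract the examples of a single stratified fold:
data(labels); data(scores);
df <- create.stratified.fold.df(L[,3], S[,3], folds=5, seed=23);
fold1 <- df[df$folds == 1, ];   ## rows belonging to the first fold
table(fold1$labels);            ## positives ('pos') are spread evenly across folds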
data(labels); data(scores); df <- create.stratified.fold.df(L[,3], S[,3], folds=5, seed=23);
Compute the minimum distance of each node from one of the leaves of the graph.
distances.from.leaves(g)
g |
a graph of class graphNEL. |
A named vector. The names are the names of the nodes of the graph g
, and their values represent the distance from the leaves.
A value equal to 0 is assigned to the leaves, 1 to nodes with distance 1 from a leaf and so on.
data(graph); dist.leaves <- distances.from.leaves(g);
Collection of real sub-datasets used in the examples of the HEMDAG package
data(graph) data(labels) data(scores) data(wadj) data(test.index)
The DAG g
contained in graph
data is an object of class graphNEL
. The graph g
has 23 nodes and 30 edges and
represents the "ancestors view" of the HPO term Camptodactyly of finger ("HP:0100490"
).
The matrix L contained in the labels data is a 100 X 23 matrix, whose rows correspond to genes (Entrez GeneID) and columns to HPO classes. L[i,j]=1 means that the gene i belongs to the class j, L[i,j]=0 means that the gene i does not belong to the class j. The classes of the matrix L correspond to the nodes of the graph g.
The matrix S
contained in the scores
data is a named 100 X 23 flat scores matrix, representing the likelihood
that a given gene belongs to a given class: the higher the value, the higher the likelihood. The classes of the matrix correspond
to the nodes of the graph
g
.
The matrix W contained in the wadj data is a named 100 X 100 symmetric weighted adjacency matrix, whose rows and columns correspond to genes. The gene names (Entrez GeneID) of the adjacency matrix W correspond to the gene names of the flat scores matrix S and of the target multilabel matrix L.
The vector of integer numbers test.index
contained in the test.index
data refers to the index of the examples of the scores
matrix S
to be used in the test set. It is useful only in holdout experiments.
Some examples of full data sets for the prediction of HPO terms are available at the following link.
Note that the processing of the full datasets should be done similarly to the processing of the small data examples provided directly in this package.
Please read the README at the link above for more details about the available full datasets.
Select the best hierarchical F-score by choosing an appropriate threshold in the scores.
find.best.f( target, predicted, n.round = 3, verbose = TRUE, b.per.example = FALSE )
target |
matrix with the target multilabel: rows correspond to examples and columns to classes.
|
predicted |
a numeric matrix with continuous predicted values (scores): rows correspond to examples and columns to classes. |
n.round |
number of rounding digits to be applied to predicted ( |
verbose |
a boolean value. If |
b.per.example |
a boolean value.
|
All the examples having no positive annotations are discarded. The predicted score matrix (predicted
) is rounded
according to parameter n.round
and all the values of predicted
are divided by max(predicted)
.
Then all the thresholds corresponding to all the different values included in predicted
are attempted, and the threshold
leading to the maximum F-measure is selected.
Names of rows and columns of target
and predicted
matrix must be provided in the same order, otherwise a stop message is returned.
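A brief sketch (assuming that the components of the returned average vector are named with the abbreviations listed below, i.e. P, R, S, F, av.F, A, T) showing how the selected threshold can be applied to binarize the flat scores; note that find.best.f works on the scores rounded to n.round digits and rescaled by their maximum:
data(graph); data(labels); data(scores);
root <- root.node(g);
L <- L[,-which(colnames(L)==root)]; S <- S[,-which(colnames(S)==root)];
fscore <- find.best.f(L, S, n.round=3, verbose=FALSE, b.per.example=FALSE);
thr <- fscore$average["T"];                  ## best selected threshold
S.scaled <- round(S, 3)/max(round(S, 3));    ## same rescaling applied internally
S.bin <- ifelse(S.scaled > thr, 1, 0);       ## binarized predictions (strictness of the comparison is an assumption)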
Two different outputs respect to the input parameter b.per.example
:
b.per.example==FALSE
: a list with a single element average. A named vector with 7 elements relative to the best result in terms
of the F.measure: Precision (P), Recall (R), Specificity (S), F.measure (F), av.F.measure (av.F), Accuracy (A) and the best selected Threshold (T).
F is the F-measure computed as the harmonic mean between the average precision and recall; av.F is the F-measure computed as the average across
examples and T is the best selected threshold;
b.per.example==TRUE
: a list with two elements:
average: a named vector with 7 elements relative to the best result in terms of the F.measure: Precision (P), Recall (R), Specificity (S), F.measure (F), av.F.measure (av.F), Accuracy (A) and the best selected Threshold (T);
per.example: a named matrix with the Precision (P), Recall (R), Specificity (S), Accuracy (A), F-measure (F), av.F-measure (av.F) and the best selected Threshold (T) for each example. Row names correspond to examples, column names correspond respectively to Precision (P), Recall (R), Specificity (S), Accuracy (A), F-measure (F), av.F-measure (av.F) and the best selected Threshold (T);
data(graph); data(labels); data(scores); root <- root.node(g); L <- L[,-which(colnames(L)==root)]; S <- S[,-which(colnames(S)==root)]; fscore <- find.best.f(L, S, n.round=3, verbose=TRUE, b.per.example=TRUE);
Find leaves of a directed graph.
find.leaves(g)
g |
a graph of class graphNEL. |
A vector with the names of the leaves of g.
data(graph); leaves <- find.leaves(g);
Compute the best hierarchical Fmax either one-shot or averaged across folds
compute.fmax( target, predicted, n.round = 3, verbose = TRUE, b.per.example = FALSE, folds = NULL, seed = NULL )
target |
matrix with the target multilabel: rows correspond to examples and columns to classes.
|
predicted |
a numeric matrix with predicted values (scores): rows correspond to examples and columns to classes. |
n.round |
number of rounding digits to be applied to predicted ( |
verbose |
a boolean value. If |
b.per.example |
a boolean value.
|
folds |
number of folds on which computing the Fmax If |
seed |
initialization seed for the random generator to create folds. Set |
Names of rows and columns of target
and predicted
matrix must be provided in the same order, otherwise a stop message is returned.
Two different outputs respect to the input parameter b.per.example
:
b.per.example==FALSE
: a list with a single element average. A named vector with 7 elements relative to the best result in terms
of the F.measure: Precision (P), Recall (R), Specificity (S), F.measure (F), av.F.measure (av.F), Accuracy (A) and the best selected Threshold (T).
F is the F-measure computed as the harmonic mean between the average precision and recall; av.F is the F-measure computed as the average across
examples and T is the best selected threshold;
b.per.example==TRUE
: a list with two elements:
average: a named vector with 7 elements relative to the best result in terms of the F.measure: Precision (P), Recall (R), Specificity (S), F.measure (F), av.F.measure (av.F), Accuracy (A) and the best selected Threshold (T);
per.example: a named matrix with the Precision (P), Recall (R), Specificity (S), Accuracy (A), F-measure (F), av.F-measure (av.F) and the best selected Threshold (T) for each example. Row names correspond to examples, column names correspond respectively to Precision (P), Recall (R), Specificity (S), Accuracy (A), F-measure (F), av.F-measure (av.F) and the best selected Threshold (T);
data(graph); data(labels); data(scores); root <- root.node(g); L <- L[,-which(colnames(L)==root)]; S <- S[,-which(colnames(S)==root)]; fmax <- compute.fmax(L, S, n.round=3, verbose=TRUE, b.per.example=TRUE, folds=5, seed=23);
Build a full annotations matrix using the ancestor list and the most specific annotations matrix w.r.t. a given weighted adjacency matrix (wadj). The rows of the full annotation matrix correspond to all the examples of the given weighted adjacency matrix and the columns to the class/terms. The transitive closure of the annotations is performed.
full.annotation.matrix(W, anc, ann.spec)
W |
a symmetric adjacency weighted matrix of the graph. |
anc |
the ancestor list. |
ann.spec |
the annotation matrix of the most specific annotations (0/1): rows are genes and columns are terms. |
The examples present in the annotation matrix (ann.spec
) but not in the adjacency weighted matrix (W
) are purged.
A full annotation table T, i.e. a matrix in which the transitive closure of the annotations has been performed. Rows correspond to genes of the weighted adjacency matrix and columns to terms. T[i,j]=1 means that the gene i is annotated with the term j, T[i,j]=0 means that the gene i is not annotated with the term j.
data(wadj); data(graph); data(labels); anc <- build.ancestors(g); full.ann <- full.annotation.matrix(W, anc, L);
Implementation of the GPAV (Generalized Pool-Adjacent-Violators) algorithm (Burdakov et al., In: Di Pillo G, Roma M, editors. An O(n^2) Algorithm for Isotonic Regression. Boston, MA: Springer US; 2006. p. 25-33. Available from: doi:10.1007/0-387-30065-1_3).
gpav(Y, W = NULL, adj)
Y |
vector of scores relative to a single example. |
W |
vector of weight relative to a single example. If |
adj |
adjacency matrix of the graph which must be sparse, logical and upper triangular. Number of columns of |
Given the constraints adjacency matrix adj of the graph, a vector of scores $\hat{y} \in \mathbb{R}^n$ and a vector of strictly positive weights $w \in \mathbb{R}^n$, the GPAV algorithm returns a vector $\bar{y}$ which is as close as possible, in the least-squares sense, to the response vector $\hat{y}$ and whose components are partially ordered in accordance with the constraints matrix adj. In other words, GPAV solves the following problem:
$$\bar{y} = \arg\min_{y} \sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2 \quad \text{s.t.} \quad y_j \geq y_i \ \text{ whenever } adj[i,j]=1 \ (\text{i.e. } j \text{ dominates } i)$$
where $n$ is the number of vertices of the graph.
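A short sketch (using the toy data shipped with the package) that runs gpav on a single example and verifies that the corrected scores respect the hierarchical constraints:
data(graph); data(scores);
root <- root.node(g);
adj <- adj.upper.tri(g);                 ## logical upper triangular constraints matrix
Y.gpav <- gpav(S[3,], W=NULL, adj);      ## correct the flat scores of the third example
check.hierarchy.single.sample(Y.gpav$YFit, g, root=root);  ## expected to report status OK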
A list of 3 elements:
YFit
: a named vector with the scores of the classes corrected according to the GPAV
algorithm.
blocks
: list of vectors, containing the partitioning of nodes (represented with an integer number) into blocks;
W
: vector of weights.
data(graph); data(scores); Y <- S[3,]; adj <- adj.upper.tri(g); Y.gpav <- gpav(Y,W=NULL,adj);
Correct the computed scores in a hierarchy according to the GPAV
algorithm by applying a classical holdout procedure.
gpav.holdout( S, g, testIndex, W = NULL, parallel = FALSE, ncores = 1, norm = TRUE, norm.type = NULL )
S |
a named flat score matrix with examples on rows and classes on columns (root node included). |
g |
a graph of class graphNEL. |
testIndex |
a vector of integer numbers corresponding to the indexes of the elements (rows) of the score matrix |
W |
vector of weight relative to a single example. If |
parallel |
a boolean value. Should the parallel version
|
ncores |
number of cores to use for parallel execution. Set |
norm |
a boolean value. Should the flat score matrix be normalized? By default |
norm.type |
a string character. It can be one of the following values:
|
A named matrix with the scores of the classes corrected according to the GPAV
algorithm. Rows of the matrix are shrunk to testIndex
.
data(graph); data(scores); data(test.index); S.gpav <- gpav.holdout(S, g, testIndex=test.index, norm=FALSE, norm.type=NULL);
Compute GPAV
across all the examples.
gpav.over.examples(S, g, W = NULL)
S |
a named flat score matrix with examples on rows and classes on columns (root node included). |
g |
a graph of class graphNEL. |
W |
vector of weight relative to a single example. If |
A named matrix with the scores of the classes corrected according to the GPAV
algorithm.
data(graph); data(scores); S.gpav <- gpav.over.examples(S,W=NULL,g);
Compute GPAV
across all the examples (parallel implementation).
gpav.parallel(S, g, W = NULL, ncores = 8)
S |
a named flat score matrix with examples on rows and classes on columns (root node included). |
g |
a graph of class graphNEL. |
W |
vector of weight relative to a single example. If |
ncores |
number of cores to use for parallel execution ( |
A named matrix with the scores of the classes corrected according to the GPAV
algorithm.
data(graph); data(scores); if(Sys.info()['sysname']!="Windows"){ S.gpav <- gpav.parallel(S,W=NULL,g,ncores=2); }
Correct the computed scores in a hierarchy according to the GPAV
algorithm.
gpav.vanilla( S, g, W = NULL, parallel = FALSE, ncores = 1, norm = FALSE, norm.type = NULL )
S |
a named flat score matrix with examples on rows and classes on columns (root node included). |
g |
a graph of class graphNEL. |
W |
vector of weight relative to a single example. If |
parallel |
a boolean value. Should the parallel version
|
ncores |
number of cores to use for parallel execution. Set |
norm |
a boolean value. Should the flat score matrix be normalized? By default |
norm.type |
a string character. It can be one of the following values:
|
A named matrix with the scores of the classes corrected according to the GPAV
algorithm.
data(graph); data(scores); S.gpav <- gpav.vanilla(S, g, W=NULL, parallel=FALSE, ncores=1, norm=FALSE, norm.type=NULL);
Group the nodes of a graph according to their maximum depth. First, it inverts the weights of the graph, then it applies the Bellman-Ford algorithm to find the shortest paths, in this way obtaining the longest paths.
graph.levels(g, root = "00")
g |
an object of class graphNEL. |
root |
name of the class that it is on the top-level of the hierarchy ( |
A list of the nodes grouped w.r.t. the distance from the root: the first element of the list corresponds to the root node (level 0), the second to nodes at maximum distance 1 (level 1), the third to nodes at maximum distance 2 (level 2), and so on.
data(graph); root <- root.node(g); lev <- graph.levels(g, root=root);
Check whether the true path rule is violated. In other words, this function checks if the score of a parent or ancestor node is always larger than or equal to that of its children or descendant nodes.
check.hierarchy.single.sample(y.hier, g, root = "00") check.hierarchy(S.hier, g, root = "00")
y.hier |
vector of scores relative to a single example. It must be a named numeric vector (names are functional classes). |
g |
a graph of class graphNEL. |
root |
name of the class that is on the top-level of the hierarchy ( |
S.hier |
the matrix with the scores of the classes corrected in according to hierarchy. It must be a named matrix: rows are examples and columns are functional classes. |
A list of 3 elements:
status: OK if no hierarchical constraint has been broken; NOTOK if at least one hierarchical constraint has been broken;
hierarchy_constraints_broken: TRUE if the example did not respect the hierarchical constraints; FALSE if the example respected the hierarchical constraints;
hierarchy_constraints_satisfied: how many terms satisfied the hierarchical constraint.
data(graph); data(scores); root <- root.node(g); S.hier <- htd(S,g,root); S.hier.single.example <- S.hier[sample(ncol(S.hier),1),]; check.hierarchy.single.sample(S.hier.single.example, g, root=root); check.hierarchy(S.hier, g, root);
Implementation of the top-down procedure to correct the scores of the hierarchy according to the constraint that the score of a node cannot be greater than the scores of its parents.
htd(S, g, root = "00")
S |
a named flat score matrix with examples on rows and classes on columns. |
g |
a graph of class graphNEL. |
root |
name of the class that it is the top-level of the hierarchy ( |
The HTD-DAG algorithm modifies the flat scores according to the hierarchy of a DAG through a unique run across the nodes of the graph. For a given example $x$, the flat predictions $f(x) = \hat{y}$ are hierarchically corrected to $\bar{y}$ by per-level visiting the nodes of the DAG from top to bottom according to the following simple rule:
$$\bar{y}_i := \begin{cases} \hat{y}_i & \text{if } i \in \text{root}(G) \\ \min_{j \in \text{par}(i)} \bar{y}_j & \text{if } \min_{j \in \text{par}(i)} \bar{y}_j < \hat{y}_i \\ \hat{y}_i & \text{otherwise} \end{cases}$$
The node levels correspond to their maximum path length from the root.
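The following minimal sketch (not the package's internal implementation) applies the rule above to a single example, using the parents list ordered per level from root to leaves:
data(graph); data(scores);
root <- root.node(g);
lev <- graph.levels(g, root=root);
parents <- build.parents.top.down(g, lev, root=root);  ## root excluded, nodes ordered top-down
y <- S[1,];                                            ## flat scores of the first example
for(i in names(parents)){
    min.par <- min(y[parents[[i]]]);                   ## parents already corrected (per-level visit)
    if(min.par < y[i]) y[i] <- min.par;                ## a node cannot exceed its parents
}
## y should now match the corresponding row of htd(S, g, root)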
A matrix with the scores of the classes corrected according to the HTD-DAG
algorithm.
data(graph); data(scores); root <- root.node(g); S.htd <- htd(S,g,root);
Correct the computed scores in a hierarchy according to the HTD-DAG
algorithm applying a classical holdout procedure.
htd.holdout(S, g, testIndex, norm = FALSE, norm.type = NULL)
S |
a named flat score matrix with examples on rows and classes on columns. |
g |
a graph of class graphNEL. |
testIndex |
a vector of integer numbers corresponding to the indexes of the elements (rows) of the score matrix |
norm |
a boolean value. Should the flat score matrix be normalized? By default |
norm.type |
a string character. It can be one of the following values:
|
A matrix with the scores of the classes corrected according to the HTD-DAG
algorithm. Rows of the matrix are shrunk to testIndex
.
data(graph); data(scores); data(test.index); S.htd <- htd.holdout(S, g, testIndex=test.index, norm=FALSE, norm.type=NULL);
Correct the computed scores in a hierarchy according to the HTD-DAG
algorithm.
htd.vanilla(S, g, norm = FALSE, norm.type = NULL)
S |
a named flat score matrix with examples on rows and classes on columns. |
g |
a graph of class graphNEL. |
norm |
a boolean value. Should the flat score matrix be normalized? By default |
norm.type |
a string character. It can be one of the following values:
|
A matrix with the scores of the classes corrected according to the HTD-DAG
algorithm.
data(graph); data(scores); S.htd <- htd.vanilla(S, g, norm=FALSE, norm.type=NULL);
Nodes of a graph are sorted according to a lexicographical topological ordering.
lexicographical.topological.sort(g)
g |
an object of class graphNEL. |
A topological sorting is a linear ordering of the nodes such that, given an edge from u to v, the node u comes before the node v in the ordering. Topological sorting is not possible if the graph g contains self-loops. To implement the topological sorting algorithm we applied Kahn's algorithm.
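A small check (a sketch, not part of the package manual) that every edge of the toy DAG goes from an earlier to a later position in the returned ordering:
data(graph);
ord <- lexicographical.topological.sort(g);
el <- graph::edges(g);    ## adjacency list: each node mapped to its children
ok <- all(unlist(lapply(names(el), function(u) match(u, ord) < match(el[[u]], ord))));
ok;                        ## expected TRUE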
A vector in which the nodes of the graph g
are sorted according to a lexicographical topological order.
data(graph); T <- lexicographical.topological.sort(g);
Method for computing Precision, Recall, Specificity, Accuracy and F-measure for multiclass and multilabel classification.
F.measure.multilabel(target, predicted, b.per.example = FALSE) ## S4 method for signature 'matrix,matrix' F.measure.multilabel(target, predicted, b.per.example = FALSE)
target |
matrix with the target multilabel: rows correspond to examples and columns to classes.
|
predicted |
a numeric matrix with discrete predicted values: rows correspond to examples and columns to classes.
|
b.per.example |
a boolean value.
|
Names of rows and columns of target
and predicted
matrix must be provided in the same order, otherwise a stop message is returned.
Two different outputs respect to the input parameter b.per.example
:
b.per.example==FALSE
: a list with a single element average. A named vector with average precision (P), recall (R),
specificity (S), F-measure (F), average F-measure (avF) and Accuracy (A) across examples. F is the F-measure computed as the
harmonic mean between the average precision and recall; av.F is the F-measure computed as average across examples;
b.per.example==TRUE
: a list with two elements:
average: a named vector with average precision (P), recall (R), specificity (S), F-measure (F), average F-measure (avF) and Accuracy (A) across examples;
per.example: a named matrix with the Precision (P), Recall (R), Specificity (S), Accuracy (A), F-measure (F) and av.F-measure (av.F) for each example. Row names correspond to examples, column names correspond respectively to Precision (P), Recall (R), Specificity (S), Accuracy (A), F-measure (F) and av.F-measure (av.F);
data(labels); data(scores); data(graph); root <- root.node(g); L <- L[,-which(colnames(L)==root)]; S <- S[,-which(colnames(S)==root)]; S[S>0.7] <- 1; S[S<0.7] <- 0; fscore <- F.measure.multilabel(L,S);
Normalize the scores of a score matrix by dividing the score values of each class by the maximum score of that class.
normalize.max(S)
S |
a score matrix. Rows are examples and columns are classes. |
A score matrix with the scores normalized.
data(scores); maxnorm <- normalize.max(S);
Implementation of the Obozinski's heuristic methods Max
, And
, Or
(Obozinski et al., Genome Biology, 2008,
doi:10.1186/gb-2008-9-s1-s6).
obozinski.max(S, g, root = "00") obozinski.and(S, g, root = "00") obozinski.or(S, g, root = "00")
S |
a named flat score matrix with examples on rows and classes on columns. |
g |
a graph of class graphNEL. |
root |
name of the class that it is the top-level of the hierarchy ( |
Obozinski's heuristic methods:
Max: reports the largest logistic regression (LR) value of self and all descendants: $\bar{y}_i = \max_{j \in \text{desc}(i)} \hat{y}_j$;
And: reports the product of LR values of all ancestors and self. This is equivalent to computing the probability that all ancestral terms are "on", assuming that, conditional on the data, all predictions are independent: $\bar{y}_i = \prod_{j \in \text{anc}(i)} \hat{y}_j$;
Or: computes the probability that at least one of the descendant terms is "on", assuming again that, conditional on the data, all predictions are independent: $1 - \bar{y}_i = \prod_{j \in \text{desc}(i)} (1 - \hat{y}_j)$;
where desc(i) and anc(i) denote respectively the descendants and the ancestors of node i, both including i itself.
A matrix with the scores of the classes corrected according to the chosen Obozinski's heuristic algorithm.
data(graph); data(scores); root <- root.node(g); S.max <- obozinski.max(S,g,root); S.and <- obozinski.and(S,g,root); S.or <- obozinski.or(S,g,root);
Compute the Obozinski's heuristic methods Max
, And
, Or
(Obozinski et al., Genome Biology, 2008)
applying a classical holdout procedure.
obozinski.holdout( S, g, testIndex, heuristic = "and", norm = FALSE, norm.type = NULL )
S |
a named flat score matrix with examples on rows and classes on columns. |
g |
a graph of class graphNEL. |
testIndex |
a vector of integer numbers corresponding to the indexes of the elements (rows) of the score matrix |
heuristic |
a string character. It can be one of the following three values:
|
norm |
a boolean value. Should the flat score matrix be normalized? By default |
norm.type |
a string character. It can be one of the following values:
|
A matrix with the scores of the classes corrected according to the chosen heuristic algorithm. Rows of the matrix are shrunk to testIndex
.
data(graph); data(scores); data(test.index); S.and <- obozinski.holdout(S, g, testIndex=test.index, heuristic="and", norm=FALSE, norm.type=NULL);
Compute the Obozinski's heuristic methods Max
, And
, Or
(Obozinski et al., Genome Biology, 2008).
obozinski.methods(S, g, heuristic = "and", norm = FALSE, norm.type = NULL)
S |
a named flat score matrix with examples on rows and classes on columns. |
g |
a graph of class graphNEL. |
heuristic |
a string character. It can be one of the following three values:
|
norm |
a boolean value. Should the flat score matrix be normalized? By default |
norm.type |
a string character. It can be one of the following values:
|
A matrix with the scores of the classes corrected according to the chosen heuristic algorithm.
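A short sketch of the wrapper, assuming (the argument description above is truncated) that the three accepted heuristic values are "max", "and" and "or":
data(graph); data(scores);
S.max <- obozinski.methods(S, g, heuristic="max", norm=FALSE, norm.type=NULL);
S.and <- obozinski.methods(S, g, heuristic="and", norm=FALSE, norm.type=NULL);
S.or  <- obozinski.methods(S, g, heuristic="or",  norm=FALSE, norm.type=NULL);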
data(graph); data(scores); S.and <- obozinski.methods(S, g, heuristic="and", norm=TRUE, norm.type="maxnorm");
Compute the Precision-Recall (PxR) values through precrec package.
precision.at.all.recall.levels.single.class(labels, scores) precision.at.given.recall.levels.over.classes( target, predicted, folds = NULL, seed = NULL, recall.levels = seq(from = 0.1, to = 1, by = 0.1) )
labels |
vector of the true labels (0 negative, 1 positive examples). |
scores |
a numeric vector of the values of the predicted labels (scores). |
target |
matrix with the target multilabel: rows correspond to examples and columns to classes.
|
predicted |
a numeric matrix with predicted values (scores): rows correspond to examples and columns to classes. |
folds |
number of folds on which computing the PXR. If |
seed |
initialization seed for the random generator to create folds. Set |
recall.levels |
a vector with the desired recall levels ( |
precision.at.all.recall.levels.single.class
computes the precision at all recall levels just for a single class.
precision.at.given.recall.levels.over.classes
computes the precision at fixed recall levels over classes.
precision.at.all.recall.levels.single.class
returns a two-column matrix representing pairs of precision and recall values: the first column is the precision, the second the recall;
precision.at.given.recall.levels.over.classes
returns a list with two elements:
average: a vector with the average precision at different recall levels across classes;
fixed.recall: a matrix with the precision at different recall levels: rows are classes, columns precision at different recall levels;
data(labels); data(scores); data(graph); root <- root.node(g); L <- L[,-which(colnames(L)==root)]; S <- S[,-which(colnames(S)==root)]; labels <- L[,1]; scores <- S[,1]; rec.levels <- seq(from=0.25, to=1, by=0.25); pxr.single <- precision.at.all.recall.levels.single.class(labels, scores); pxr <- precision.at.given.recall.levels.over.classes(L, S, folds=5, seed=23, recall.levels=rec.levels);
Read a directed graph from a file and build a graphNEL
object.
read.graph(file = "graph.txt.gz")
file |
name of the file to be read. The format of the file is a sequence of rows and each row corresponds to an edge represented through a pair of vertexes separated by blanks. The extension of the file can be plain (".txt") or compressed (".gz"). |
An object of class graphNEL
.
ed <- system.file("extdata/graph.edges.txt.gz", package= "HEMDAG"); g <- read.graph(file=ed);
Read a graph from a file and build a graphNEL
object. The format of the input file is a sequence of rows.
Each row corresponds to an edge represented through a pair of vertexes (blank separated) and the weight of the edge.
read.undirected.graph(file = "graph.txt.gz")
file |
name of the file to be read. The extension of the file can be plain (".txt") or compressed (".gz"). |
A graph of class graphNEL
.
edges <- system.file("extdata/edges.txt.gz", package="HEMDAG"); g <- read.undirected.graph(file=edges);
Find the root node of a directed graph.
root.node(g)
g |
a graph of class graphNEL. |
Name of the root node.
data(graph); root <- root.node(g);
Normalize a score matrix w.r.t. max normalization (maxnorm) or quantile normalization (qnorm).
scores.normalization(norm.type = "maxnorm", S)
norm.type |
can be one of the following two values:
|
S |
A named flat score matrix with examples on rows and classes on columns. |
To apply the quantile normalization the preprocessCore package must be properly installed.
The matrix of the scores flat normalized w.r.t. maxnorm
or qnorm
.
data(scores); norm.types <- c("maxnorm","qnorm"); for(norm.type in norm.types){ scores.normalization(norm.type=norm.type, S=S); }
Build the annotation list starting from the matrix of the most specific annotations.
specific.annotation.list(ann)
ann |
an annotation matrix (0/1). Rows are examples and columns are the most specific functional terms. It must be a named matrix. |
A named list, where names of each component correspond to examples (genes) and elements of each component are the associated functional terms.
data(labels); spec.list <- specific.annotation.list(L);
Build the annotation matrix of the most specific functional terms.
specific.annotation.matrix(file = "gene2pheno.txt.gz")
file |
text file representing the associations gene-ontology terms. The file must be written as sequence of rows. Each row represents a gene/protein and all its associations with an ontology term (pipe separated), i.e. in the form e.g.: gene1 |obo1|obo2|...|oboN. |
The input plain text file (representing the associations gene-ontology terms) can be obtained by cloning the GitHub repository obogaf-parser, a perl5 module specifically designed to handle HPO and GO obo files and their gene annotation files (gaf file).
The annotation matrix of the most specific annotations (0/1): rows are genes and columns are functional terms (such as GO or HPO).
Let's denote with M the labels matrix. M[i,j]=1 means that the gene i is annotated with the class j, otherwise M[i,j]=0.
gene2pheno <- system.file("extdata/gene2pheno.txt.gz", package="HEMDAG"); spec.ann <- specific.annotation.matrix(file=gene2pheno);
Generate data for the stratified cross-validation.
stratified.cv.data.single.class(examples, positives, kk = 5, seed = NULL) stratified.cv.data.over.classes(labels, examples, kk = 5, seed = NULL)
examples |
indices or names of the examples. Can be either a vector of integers or a vector of names. |
positives |
vector of integers or vector of names. The indices (or names) refer to the indices (or names) of 'positive' examples. |
kk |
number of folds ( |
seed |
seed of the random generator ( |
labels |
labels matrix. Rows are genes and columns are classes. Let's denote |
Folds are stratified, i.e. contain the same amount of positive and negative examples.
stratified.cv.data.single.class returns a list with two components:
fold.non.positives: a list with k components, where k is the number of folds. Each component is a vector with the indices (or names) of the non-positive elements. Indices (or names) refer to row numbers (or names) of a data matrix;
fold.positives: a list with k components, where k is the number of folds. Each component is a vector with the indices (or names) of the positive elements. Indices (or names) refer to row numbers (or names) of a data matrix;
stratified.cv.data.over.classes returns a list with m components, where m is the number of classes of the labels matrix. Each component is in turn a list with k elements, where k is the number of folds. Each fold contains an equal amount of positive and negative examples.
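A brief sketch (column 3 and the fold/seed values are only illustrative) showing how one fold returned by stratified.cv.data.single.class can be turned into a train/test split:
data(labels);
positives <- which(L[,3]==1);
folds <- stratified.cv.data.single.class(1:nrow(L), positives, kk=5, seed=23);
test.index  <- c(folds$fold.positives[[1]], folds$fold.non.positives[[1]]);  ## first fold as test set
train.index <- setdiff(1:nrow(L), test.index);                               ## remaining folds as training set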
data(labels); examples.index <- 1:nrow(L); examples.name <- rownames(L); positives <- which(L[,3]==1); x <- stratified.cv.data.single.class(examples.index, positives, kk=5, seed=23); y <- stratified.cv.data.single.class(examples.name, positives, kk=5, seed=23); z <- stratified.cv.data.over.classes(L, examples.index, kk=5, seed=23); k <- stratified.cv.data.over.classes(L, examples.name, kk=5, seed=23);
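As a minimal sketch of how the returned folds can be used (the component names fold.positives and fold.non.positives are those documented above; the class index 3 is an arbitrary choice), the test and training indices of the first fold can be assembled as follows:
data(labels);
examples.index <- 1:nrow(L);
positives <- which(L[,3]==1);
folds <- stratified.cv.data.single.class(examples.index, positives, kk=5, seed=23);
test.fold1 <- c(folds$fold.positives[[1]], folds$fold.non.positives[[1]]); ## stratified test set of fold 1
train.fold1 <- setdiff(examples.index, test.fold1); ## remaining examples form the training set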
Collection of the true-path-rule-based hierarchical learning ensemble algorithms and their variants.
TPR-DAG
is a family of algorithms whose members differ in the choice of the bottom-up step adopted for the selection of the
positive children (or descendants) and of the top-down step adopted to assure ontology-consistent predictions.
Indeed, in their more general form the TPR-DAG
algorithms adopt a two-step learning strategy:
in the first step they compute a per-level bottom-up visit from the leaves to the root to propagate the positive predictions across the hierarchy;
in the second step they guarantee the consistency of the predictions through a top-down visit.
tpr.dag( S, g, root = "00", positive = "children", bottomup = "threshold.free", topdown = "gpav", t = 0, w = 0, W = NULL, parallel = FALSE, ncores = 1 )
S |
a named flat score matrix with examples on rows and classes on columns. |
g |
a graph of class graphNEL. It represents the hierarchy of the classes. |
root |
name of the class that is on the top-level of the hierarchy (def. root="00"). |
positive |
choice of the positive nodes to be considered in the bottom-up strategy. It can be one of the following values: children (def.), to consider for each node its positive children, or descendants, to consider its positive descendants (DESCENS variants). |
bottomup |
strategy to enhance the flat predictions by propagating the positive predictions from leaves to root. It selects one of the bottom-up variants described in Details, from the parameter-free strategy (threshold.free, def.) to the thresholded (threshold), weighted and DESCENS-tau variants. |
topdown |
strategy to make scores “hierarchy-aware”. It can be one of the following values: htd, to apply the HTD-DAG top-down step, or gpav (def.), to apply the GPAV approach (see Details). |
t |
threshold for the choice of positive nodes (def. t=0). |
w |
weight to balance between the contribution of the node i and that of its positive children (def. w=0). |
W |
vector of weights relative to a single example, used by the GPAV top-down step. If W=NULL (def.), a unitary weight vector is assumed. |
parallel |
a boolean value: if TRUE the parallel version of GPAV is run, if FALSE (def.) the sequential version is run. Use parallel=TRUE only when topdown="gpav". |
ncores |
number of cores to use for the parallel execution (def. ncores=1). Set ncores=1 when parallel=FALSE. |
The vanilla TPR-DAG
adopts a per-level bottom-up traversal of the DAG to correct the flat predictions \hat{y}_i
according to the following formula:

\bar{y}_i := \frac{1}{1 + |\phi_i|} \Big( \hat{y}_i + \sum_{j \in \phi_i} \bar{y}_j \Big)

where \phi_i are the positive children of i.
Different strategies to select the positive children \phi_i can be applied:

threshold-free strategy: the positive nodes are those children that can increment the score of the node i, that is those nodes
that achieve a score higher than that of their parents:

\phi_i := \{ j \in \mathrm{child}(i) \mid \bar{y}_j > \hat{y}_i \}

threshold strategy: the positive children are selected on the basis of a threshold that can be chosen in two different ways:

for each node a constant threshold \bar{t} is a priori selected:

\phi_i := \{ j \in \mathrm{child}(i) \mid \bar{y}_j > \bar{t} \}

For instance, if the predictions represent probabilities, it could be meaningful to a priori select \bar{t} = 0.5.

the threshold is selected to maximize some performance metric \mathcal{M}(j, t) estimated on the training data, as for instance
the Fmax or the AUPRC. In other words, the threshold is selected to maximize some measure of accuracy of the predictions
on the training data for the class j with respect to the threshold t.
The corresponding set of positives, for each node i, is:

\phi_i := \{ j \in \mathrm{child}(i) \mid \bar{y}_j > t_j^{*}, \; t_j^{*} = \arg\max_{t} \mathcal{M}(j, t) \}

For instance, t_j^{*} can be selected from a set of values t \in (0, 1) through internal cross-validation techniques.
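As a minimal numeric sketch of the vanilla bottom-up update under the threshold-free strategy (toy scores, not HEMDAG internals):
y.parent <- 0.3; ## flat score \hat{y}_i of node i
y.children <- c(0.7, 0.2, 0.5); ## bottom-up scores \bar{y}_j of the children of i
positive <- y.children[y.children > y.parent]; ## threshold-free selection of the positive children
y.bar <- (y.parent + sum(positive)) / (1 + length(positive)); ## corrected score \bar{y}_i = 0.5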
The weighted TPR-DAG
version can be designed by adding a weight w \in [0, 1] to balance between the
contribution of the node i and that of its positive children \phi_i, through their convex combination:

\bar{y}_i := w \, \hat{y}_i + \frac{1 - w}{|\phi_i|} \sum_{j \in \phi_i} \bar{y}_j

If w = 1 no weight is attributed to the children and the TPR-DAG
reduces to the HTD-DAG
algorithm, since in this
way only the prediction for node i is used in the bottom-up step of the algorithm. If w = 0
only the predictors associated to the children nodes vote to predict node i.
In the intermediate cases we attribute more importance to the predictor for the
node i or to its children depending on the value of w.
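The same toy scores illustrate the weighted update and its two limit cases:
y.parent <- 0.3; positive <- c(0.7, 0.5); ## toy flat score of node i and its positive children
w <- 0.4; ## weight attributed to the flat score of node i
y.bar.w <- w * y.parent + (1 - w) / length(positive) * sum(positive); ## 0.12 + 0.36 = 0.48
## w = 1 recovers the HTD-DAG behaviour (only the flat score of i is used); w = 0 lets only the positive children vote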
By combining the weighted and the threshold variant, we design the weighted-threshold variant.
Since the contribution of the descendants of a given node decays exponentially with their distance from the node itself, to enhance the
contribution of the most specific nodes to the overall decision of the ensemble we design the ensemble variant DESCENS.
The novelty of DESCENS
consists in strongly considering the contribution of all the descendants of each node instead of
only that of its children. Therefore DESCENS
predictions are more influenced by the information embedded in the leaf nodes,
that are the classes containing the most informative and meaningful information from a biological and medical standpoint.
For the choice of the “positive” descendants we use the same strategies adopted for the selection of the “positive”
children shown above. Furthermore, we designed a variant specific only for DESCENS
, that we named DESCENS-\tau.
The DESCENS-\tau variant balances the contribution between the “positive” children of a node i
and that of its “positive” descendants excluding its children by adding a weight \tau \in [0, 1]:

\bar{y}_i := \frac{\tau}{1 + |\phi_i|} \Big( \hat{y}_i + \sum_{j \in \phi_i} \bar{y}_j \Big) + \frac{1 - \tau}{1 + |\delta_i|} \Big( \hat{y}_i + \sum_{j \in \delta_i} \bar{y}_j \Big)

where \phi_i are the “positive” children of i and \delta_i the “positive” descendants of i without its children.
If \tau = 1 we consider only the contribution of the “positive” children of i; if \tau = 0 only the descendants that are not
children contribute to the score, while for intermediate values of \tau we can balance the contribution of \phi_i and \delta_i
positive nodes.
Simply by replacing the HTD-DAG
top-down step (htd
) with the GPAV
approach (gpav
) we design the ISO-TPR
variant.
The most important feature of ISO-TPR
is that it maintains the hierarchical constraints by construction and it selects the closest
solution (in the least square sense) to the bottom-up predictions that obeys the True Path Rule.
A named matrix with the scores of the classes corrected according to the chosen TPR-DAG
ensemble algorithm.
data(graph); data(scores); data(labels); root <- root.node(g); S.tpr <- tpr.dag(S, g, root, positive="children", bottomup="threshold.free", topdown="gpav", t=0, w=0, W=NULL, parallel=FALSE, ncores=1);
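For comparison, a hedged sketch running the same parameter-free bottom-up step with both documented top-down strategies (topdown="gpav" yields the ISO-TPR variant, topdown="htd" the HTD-DAG step):
data(graph); data(scores); root <- root.node(g);
S.iso <- tpr.dag(S, g, root, positive="children", bottomup="threshold.free", topdown="gpav", t=0, w=0, W=NULL, parallel=FALSE, ncores=1); ## ISO-TPR
S.htd <- tpr.dag(S, g, root, positive="children", bottomup="threshold.free", topdown="htd", t=0, w=0, W=NULL, parallel=FALSE, ncores=1); ## TPR-DAG with the HTD-DAG top-down step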
Correct the computed scores in a hierarchy according to the selected TPR-DAG
ensemble variant.
tpr.dag.cv( S, g, ann, norm = FALSE, norm.type = NULL, positive = "children", bottomup = "threshold", topdown = "gpav", W = NULL, parallel = FALSE, ncores = 1, threshold = seq(from = 0.1, to = 0.9, by = 0.1), weight = 0, kk = 5, seed = 23, metric = "auprc", n.round = NULL )
S |
a named flat score matrix with examples on rows and classes on columns. |
g |
a graph of class graphNEL. It represents the hierarchy of the classes. |
ann |
an annotation matrix: rows correspond to examples and columns to classes. |
norm |
a boolean value. Should the flat score matrix be normalized? By default norm=FALSE. |
norm.type |
a string character specifying the normalization type to be applied when norm=TRUE. If norm=FALSE set norm.type=NULL (def.). |
positive |
choice of the positive nodes to be considered in the bottom-up strategy. It can be one of the following values: children (def.), to consider for each node its positive children, or descendants, to consider its positive descendants (DESCENS variants). |
bottomup |
strategy to enhance the flat predictions by propagating the positive predictions from leaves to root. It selects one of the bottom-up variants described in the tpr.dag documentation, from the parameter-free strategy (threshold.free) to the thresholded (threshold, def.), weighted and DESCENS-tau variants. |
topdown |
strategy to make the scores hierarchy-consistent. It can be one of the following values: htd, to apply the HTD-DAG top-down step, or gpav (def.), to apply the GPAV approach. |
W |
vector of weights relative to a single example, used by the GPAV top-down step. If W=NULL (def.), a unitary weight vector is assumed. |
parallel |
a boolean value: if TRUE the parallel version of GPAV is run, if FALSE (def.) the sequential version is run. Use parallel=TRUE only when topdown="gpav". |
ncores |
number of cores to use for the parallel execution (def. ncores=1). Set ncores=1 when parallel=FALSE. |
threshold |
range of threshold values to be tested in order to find the best threshold (def. threshold=seq(from=0.1, to=0.9, by=0.1)). For the parameter-free variants set threshold=0. |
weight |
range of weight values to be tested in order to find the best weight (def. weight=0). |
kk |
number of folds of the cross validation (def. kk=5). |
seed |
initialization seed for the random generator to create the folds (def. seed=23). |
metric |
a string character specifying the performance metric to maximize when tuning the parametric ensemble variants. It can be one of the following values: auprc (def.), to maximize the area under the precision-recall curve, or fmax, to maximize the Fmax measure. |
n.round |
number of rounding digits (def. n.round=NULL). |
The parametric hierarchical ensemble variants are cross-validated by selecting the parameter values that maximize the performance metric chosen in metric.
A named matrix with the scores of the functional terms corrected according to the chosen TPR-DAG
ensemble algorithm.
data(graph); data(scores); data(labels); S.tpr <- tpr.dag.cv(S, g, ann=NULL, norm=FALSE, norm.type=NULL, positive="children", bottomup="threshold.free", topdown="gpav", W=NULL, parallel=FALSE, ncores=1, threshold=0, weight=0, kk=NULL, seed=NULL, metric=NULL, n.round=NULL);
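A hedged sketch of a parametric run follows: the threshold of the thresholded bottom-up variant is tuned by 5-fold cross-validation maximizing the AUPRC, using the bundled annotation matrix L; grids and defaults are taken from the Usage section above, while the suitability of L as annotation matrix for S and g is an assumption of this sketch:
data(graph); data(scores); data(labels); S.thr <- tpr.dag.cv(S, g, ann=L, norm=FALSE, norm.type=NULL, positive="children", bottomup="threshold", topdown="gpav", W=NULL, parallel=FALSE, ncores=1, threshold=seq(from=0.1, to=0.9, by=0.1), weight=0, kk=5, seed=23, metric="auprc", n.round=NULL);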
Correct the computed scores in a hierarchy according to the selected TPR-DAG
ensemble variant by applying a classical holdout procedure.
tpr.dag.holdout( S, g, ann, testIndex, norm = FALSE, norm.type = NULL, W = NULL, parallel = FALSE, ncores = 1, positive = "children", bottomup = "threshold", topdown = "htd", threshold = seq(from = 0.1, to = 0.9, by = 0.1), weight = seq(from = 0.1, to = 0.9, by = 0.1), kk = 5, seed = 23, metric = "auprc", n.round = NULL )
S |
a named flat score matrix with examples on rows and classes on columns. |
g |
a graph of class graphNEL. It represents the hierarchy of the classes. |
ann |
an annotation matrix: rows correspond to examples and columns to classes. |
testIndex |
a vector of integer numbers corresponding to the indexes of the elements (rows) of the score matrix S to be used in the test set. |
norm |
a boolean value. Should the flat score matrix be normalized? By default norm=FALSE. |
norm.type |
a string character specifying the normalization type to be applied when norm=TRUE. If norm=FALSE set norm.type=NULL (def.). |
W |
vector of weights relative to a single example, used by the GPAV top-down step. If W=NULL (def.), a unitary weight vector is assumed. |
parallel |
a boolean value: if TRUE the parallel version of GPAV is run, if FALSE (def.) the sequential version is run. Use parallel=TRUE only when topdown="gpav". |
ncores |
number of cores to use for the parallel execution (def. ncores=1). Set ncores=1 when parallel=FALSE. |
positive |
choice of the positive nodes to be considered in the bottom-up strategy. It can be one of the following values: children (def.), to consider for each node its positive children, or descendants, to consider its positive descendants (DESCENS variants). |
bottomup |
strategy to enhance the flat predictions by propagating the positive predictions from leaves to root. It selects one of the bottom-up variants described in the tpr.dag documentation, from the parameter-free strategy (threshold.free) to the thresholded (threshold, def.), weighted and DESCENS-tau variants. |
topdown |
strategy to make the scores hierarchy-consistent. It can be one of the following values: htd (def.), to apply the HTD-DAG top-down step, or gpav, to apply the GPAV approach. |
threshold |
range of threshold values to be tested in order to find the best threshold (def. threshold=seq(from=0.1, to=0.9, by=0.1)). For the parameter-free variants set threshold=0. |
weight |
range of weight values to be tested in order to find the best weight (def. weight=seq(from=0.1, to=0.9, by=0.1)). |
kk |
number of folds of the cross validation (def. kk=5). |
seed |
initialization seed for the random generator to create the folds (def. seed=23). |
metric |
a string character specifying the performance metric to maximize when tuning the parametric ensemble variants. It can be one of the following values: auprc (def.), to maximize the area under the precision-recall curve, or fmax, to maximize the Fmax measure. |
n.round |
number of rounding digits (def. n.round=NULL). |
The parametric hierarchical ensemble variants are cross-validated by selecting the parameter values that maximize the performance metric chosen in metric.
A named matrix with the scores of the classes corrected according to the chosen TPR-DAG
ensemble algorithm.
Rows of the matrix are shrunk to testIndex
.
data(graph); data(scores); data(labels); data(test.index); S.tpr <- tpr.dag.holdout(S, g, ann=NULL, testIndex=test.index, norm=FALSE, norm.type=NULL, positive="children", bottomup="threshold.free", topdown="gpav", W=NULL, parallel=FALSE, ncores=1, threshold=0, weight=0, kk=NULL, seed=NULL, metric=NULL, n.round=NULL);
Perform the transitive closure of the annotations using ancestors and the most specific annotation matrix. The annotations are propagated from bottom to top, enriching the most specific annotations table. Rows correspond to genes and columns to functional terms.
transitive.closure.annotations(ann.spec, anc)
ann.spec |
the annotation matrix of the most specific annotations (0/1): rows are genes and columns are functional terms. |
anc |
the ancestor list. |
The annotation table T: rows correspond to genes and columns to ontology terms. T_{ij} = 1 means that the gene i is annotated with the term j, T_{ij} = 0 means that the gene i is not annotated with the term j.
data(graph); data(labels); anc <- build.ancestors(g); tca <- transitive.closure.annotations(L, anc);
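As a minimal sanity check (base R only; the choice of the term is arbitrary), every gene annotated to a term in the transitive closure must also be annotated to all the ancestors of that term:
data(graph); data(labels); anc <- build.ancestors(g); tca <- transitive.closure.annotations(L, anc);
term <- colnames(tca)[1]; ## an arbitrary term of the closure, chosen for illustration
ancs <- intersect(anc[[term]], colnames(tca)); ## ancestors of the term that appear in the closure matrix
all(tca[tca[,term]==1, ancs] == 1); ## TRUE if the closure obeys the true-path-rule for this term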
Transform a named score matrix into a tuple, i.e. into the form nodeX nodeY score.
tupla.matrix(m, output.file = "net.file.gz", digits = 3)
m |
a named score matrix. It can be either a rectangular matrix (examples on rows and classes on columns) or a square named matrix, such as a weighted adjacency matrix. |
output.file |
name of the file on which the matrix must be written. |
digits |
number of digits to be used when saving the scores of m (def. digits=3). |
Only the non-zero interactions are kept, while the zero interactions are discarded.
A tuple score matrix stored in output.file.
data(wadj); file <- tempfile(); tupla.matrix(W, output.file=file, digits=3);
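A minimal sketch to inspect the produced layout (readLines is base R and transparently reads compressed files; the number of rows shown is arbitrary):
data(wadj); file <- tempfile(); tupla.matrix(W, output.file=file, digits=3);
head(readLines(file), n=3); ## each row: nodeX nodeY score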
This function splits a dataset into k folds in an unstratified way, i.e. a fold does not necessarily contain an equal amount of positive and negative examples. This function is used to perform k-fold cross-validation experiments in a hierarchical correction context, where splitting the dataset in a stratified way is not needed.
unstratified.cv.data(S, kk = 5, seed = NULL)
S |
matrix of the flat scores. It must be a named matrix, where rows are examples (e.g. genes) and columns are classes/terms (e.g. GO terms). |
kk |
number of folds in which to split the dataset (def. kk=5). |
seed |
seed for the random generator (def. seed=NULL). |
A list with kk components (folds). Each component of the list is a character vector containing the indexes of the examples,
i.e. the indexes of the rows of the matrix S.
data(scores); foldIndex <- unstratified.cv.data(S, kk=5, seed=23);
Build a symmetric weighted adjacency matrix (wadj matrix) of a graph.
weighted.adjacency.matrix(file = "edges.txt")
file |
name of the plain text file to be read (def. file="edges.txt"). Each row of the file must represent an edge as a pair of vertexes and the corresponding edge weight, separated by blanks. |
A named symmetric weighted adjacency matrix of the graph.
edges <- system.file("extdata/edges.txt.gz", package="HEMDAG"); W <- weighted.adjacency.matrix(file=edges);
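A minimal check of the returned object (base R only), reflecting the properties stated above:
edges <- system.file("extdata/edges.txt.gz", package="HEMDAG"); W <- weighted.adjacency.matrix(file=edges);
isSymmetric(W); ## TRUE: the weighted adjacency matrix is symmetric
identical(rownames(W), colnames(W)); ## TRUE: the matrix is named with the same nodes on rows and columns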
Read an object of class graphNEL
and write the graph as a sequence of rows on a plain text file.
write.graph(g, file = "graph.txt.gz")
g |
a graph of class graphNEL. |
file |
name of the file to be written. The extension of the file can be plain (".txt") or compressed (".gz"). |
A plain text file representing the graph. Each row corresponds to an edge, represented through a pair of vertexes separated by a blank.
data(graph); file <- tempfile(); write.graph(g, file=file);