2024-02-15

This is Marco's daily open-notebook.

Today is 2024.02.15

Notes

Call with Luca

Once I have found the best model, I can proceed as follows:

# Split the graph by edge type: only "biolink:in_taxon" edges vs. everything else
graph_with_only_in_taxon = graph.filter_from_names(
    edge_type_names_to_keep=["biolink:in_taxon"],
)
graph_without_in_taxon = graph.filter_from_names(
    edge_type_names_to_remove=["biolink:in_taxon"],
)
# Positives: the existing in_taxon edges
pos = graph_with_only_in_taxon
# Negatives: as many sampled negative edges as there are positive edges
neg = graph_with_only_in_taxon.sample_negative_graph(
    number_of_negative_samples=graph_with_only_in_taxon.get_number_of_directed_edges(),
    sample_edge_types=False,
    only_from_same_component=False,
    use_scale_free_distribution=True,
    random_state=23391 * (3 + 1),
)

sketching_features = HyperSketchingPy(
    hops=number_of_hops,
    normalize=normalize,
    graph=graph
)
sketching_features.fit()


# sketching for positive training edges
pos_sources = pos.get_directed_source_node_ids()
pos_destinations = pos.get_directed_destination_node_ids()


sk_positive_features = sketching_features.positive(
    sources=pos_sources,
    destinations=pos_destinations,
    feature_combination=combination,
)


# sketching for training negatives
neg_sources = neg.get_directed_source_node_ids()
neg_destinations = neg.get_directed_destination_node_ids()
sk_negative_features = sketching_features.negative(
    sources=neg_sources,
    destinations=neg_destinations,
    feature_combination=combination,
)

import numpy as np

# stack positive and negative features and build the matching 1/0 labels
X = np.concatenate([sk_positive_features, sk_negative_features])
label_pos = np.ones(sk_positive_features.shape[0])
label_neg = np.zeros(sk_negative_features.shape[0])
label = np.concatenate([label_pos, label_neg])

# randomize the order of the training data (same permutation for X and label)
random_state = 43
indices = np.arange(X.shape[0])
rnd = np.random.RandomState(random_state)
rnd.shuffle(indices)
X_shuffled = X[indices]
label_shuffled = label[indices]

Then fit the model on X_shuffled and label_shuffled, and save it.
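
A minimal sketch of the fit-and-save step, assuming a scikit-learn-style classifier. The notes do not say which model won the selection, so `RandomForestClassifier` is only a placeholder. The random arrays stand in for `X_shuffled` and `label_shuffled`, and the file name is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-ins for X_shuffled / label_shuffled from the notes (random data, illustration only)
rng = np.random.RandomState(43)
X_shuffled = rng.rand(200, 8)
label_shuffled = (X_shuffled[:, 0] > 0.5).astype(int)

# Placeholder model: swap in whichever model was selected as the best one
model = RandomForestClassifier(n_estimators=50, random_state=43)
model.fit(X_shuffled, label_shuffled)

# Persist the fitted model, then reload it to check that the round trip works
model_path = os.path.join(tempfile.mkdtemp(), "in_taxon_model.pkl")  # hypothetical path
with open(model_path, "wb") as handle:
    pickle.dump(model, handle)
with open(model_path, "rb") as handle:
    reloaded = pickle.load(handle)

# predict_proba returns one column per class; column 1 is the positive class
scores = reloaded.predict_proba(X_shuffled)[:, 1]
```

Reloading right after saving is a cheap sanity check before the long prediction run over all species.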

Once we have our model, we need to run the predictions on the bipartite graph (the bipartite graph being the 8 billion possible species-molecule pairs).

train graph
iterate over species
for each species, generate the link graph between that species and all molecules --> create the features of that graph
predict
order the predictions and keep scores above 0.75
for each species, create a CSV named after the species
in the CSV, store the molecule corresponding to each node and its score
ATTENTION: flag pairs that are already present in the graph! See graph.get_neighbour_node_ids_from_node_name("wd:Q43656")
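
The post-processing steps above could look roughly like this for one species. This is a sketch on toy data: the scores, node IDs, names, and the `known_neighbours` array are made up, and `write_species_predictions` is a hypothetical helper, not part of any library. In the real pipeline, `scores` would come from `model.predict_proba(...)[:, 1]` and the known neighbours from the graph.

```python
import csv
import os
import tempfile

import numpy as np


def write_species_predictions(species_name, molecule_ids, molecule_names,
                              scores, known_neighbours, out_dir, threshold=0.75):
    """Hypothetical helper: threshold, sort by score, flag known pairs, write CSV."""
    keep = scores > threshold
    kept_ids = molecule_ids[keep]
    kept_names = molecule_names[keep]
    kept_scores = scores[keep]
    order = np.argsort(-kept_scores)  # highest score first
    # True where the (species, molecule) pair already exists in the graph
    already_known = np.isin(kept_ids, known_neighbours)
    path = os.path.join(out_dir, f"{species_name}.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["molecule", "score", "already_in_graph"])
        for i in order:
            writer.writerow(
                [kept_names[i], f"{kept_scores[i]:.4f}", bool(already_known[i])]
            )
    return path


# Toy inputs (all values invented for illustration)
molecule_ids = np.array([101, 102, 103, 104], dtype=np.uint32)
molecule_names = np.array(["mol_a", "mol_b", "mol_c", "mol_d"])
scores = np.array([0.91, 0.40, 0.80, 0.99])
known_neighbours = np.array([104], dtype=np.uint32)  # molecules already linked to the species

csv_path = write_species_predictions(
    "Homo_sapiens", molecule_ids, molecule_names, scores,
    known_neighbours, out_dir=tempfile.mkdtemp(),
)
with open(csv_path, newline="") as handle:
    rows = list(csv.reader(handle))
```

Known pairs are flagged rather than dropped here, so the CSV can also serve as a quick check that the model ranks existing edges highly.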

Example for one species:

sketching_features = HyperSketchingPy(
    hops=number_of_hops,
    normalize=normalize,
    graph=graph,
)
sketching_features.fit()


pair_sketching = sketching_features.unknown(
    sources=molecules.astype("uint32"),
    destinations=homo_sapiens.astype("uint32"),
    feature_combination="addition"
)

model.predict_proba(pair_sketching)

With: molecules = np.array([node IDs of all molecules]) and homo_sapiens = np.array([15, 15, 15, ...]) (the Homo sapiens node ID repeated, presumably once per molecule).
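
One way to build the two arrays, sketched with made-up node IDs: `np.full` repeats the species node ID so that the source and destination arrays have the same length, which is presumably what the pairwise call expects. The molecule IDs below are placeholders; `15` is the species ID used in the note above.

```python
import numpy as np

# Placeholder molecule node IDs; in the notes these would be all molecule nodes of the graph
molecules = np.array([3, 7, 42, 108], dtype=np.uint32)

# Node ID of the species (15 in the note above), repeated once per molecule
species_node_id = 15
homo_sapiens = np.full(molecules.shape[0], species_node_id, dtype=np.uint32)

# Each row i now describes one candidate pair (molecules[i], homo_sapiens[i])
pairs = np.stack([molecules, homo_sapiens], axis=1)
```

Creating both arrays as `uint32` up front would make the later `.astype("uint32")` calls no-ops.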

Todo today

Doing

Done

Todo tomorrow