2023-12-19

This is Marco's daily open-notebook.

Today is 2023.12.19

Notes

Prep work for starting the job in January.

This query: https://w.wiki/8YzQ recursively finds all the parent taxa of a species, all the way up to the "Biota" group. This should be useful for creating the LOTUS graph to use in grape.

I should then:

get all the species that are linked to a referenced molecule using this query: https://w.wiki/8Z7Y
for each species, get its taxonomy using this query: https://w.wiki/8YzQ
create a list of all the nodes (37082 species) and their taxonomy
create a list of all the edges of the tree
once this is done, create a tree of molecules (already done in another repo) and link the two trees together (try to do it with weighted edges between species and molecules)
then talk to Luca
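
The node and edge lists from the steps above could be built along these lines. This is only a minimal sketch, assuming each species' lineage has already been retrieved as an ordered list running from the species up to Biota. The function name `lineages_to_graph` and the taxon names in the example are hypothetical placeholders, not LOTUS data.

```python
def lineages_to_graph(lineages):
    """Merge per-species lineages into sorted node and edge lists.

    Each lineage is assumed to be ordered from the species up to the
    root, e.g. [species, genus, ..., "Biota"].
    """
    nodes = set()
    edges = set()
    for lineage in lineages:
        nodes.update(lineage)
        # Pair every taxon with the next one up: (child, parent)
        edges.update(zip(lineage, lineage[1:]))
    return sorted(nodes), sorted(edges)


# Placeholder lineages (not real taxa): two species sharing a genus
example_lineages = [
    ["species_a", "genus_x", "Biota"],
    ["species_b", "genus_x", "Biota"],
]
nodes, edges = lineages_to_graph(example_lineages)
```

Using sets means that ancestors shared by several species (here the genus and Biota) end up as a single node and a single edge, which is what a tree needs.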

Here is an example of how to retrieve the data:

import sys

import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint_url = "https://query.wikidata.org/sparql"

query = """SELECT ?entity ?entityLabel (count(?mid) as ?depth) WHERE {
  wd:Q27438471 wdt:P171* ?mid.
  ?entity wdt:P225 ?entityLabel.
  ?mid wdt:P171* ?entity.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
} group by ?entity ?entityLabel
order by ?depth"""


def get_results(endpoint_url, query):
    """Run a SPARQL query against the endpoint and return the parsed JSON."""
    # Identify the client to the Wikidata Query Service
    user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
    # TODO adjust user agent; see https://w.wiki/CX6
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()


results = get_results(endpoint_url, query)
# Flatten the nested bindings into a table, one row per ancestor taxon
pd.json_normalize(results['results']['bindings'])
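
The endpoint returns the standard SPARQL JSON results format, where every bound variable is an object carrying a `value` string. As a rough sketch of how those bindings could be turned into an ordered lineage without pandas, assuming the query above and a single parent per taxon: `bindings_to_lineage` is a hypothetical helper, and the sample response below uses placeholder identifiers rather than real query output.

```python
def bindings_to_lineage(results):
    """Return (QID, taxon name) pairs ordered by increasing depth."""
    rows = []
    for binding in results["results"]["bindings"]:
        # The entity is a full URI; keep only the trailing identifier
        qid = binding["entity"]["value"].rsplit("/", 1)[-1]
        name = binding["entityLabel"]["value"]
        # Values arrive as strings, so convert before sorting
        depth = int(binding["depth"]["value"])
        rows.append((depth, qid, name))
    # Assumption: the starting species has the smallest depth, the root the largest
    rows.sort()
    return [(qid, name) for _, qid, name in rows]


# Placeholder response (hypothetical identifiers and names)
sample = {"results": {"bindings": [
    {"entity": {"value": "http://www.wikidata.org/entity/Q_ROOT"},
     "entityLabel": {"value": "root_taxon"}, "depth": {"value": "3"}},
    {"entity": {"value": "http://www.wikidata.org/entity/Q_SPECIES"},
     "entityLabel": {"value": "species_a"}, "depth": {"value": "1"}},
    {"entity": {"value": "http://www.wikidata.org/entity/Q_GENUS"},
     "entityLabel": {"value": "genus_x"}, "depth": {"value": "2"}},
]}}
lineage = bindings_to_lineage(sample)
```

Converting `depth` to an integer matters: sorted as strings, "10" would come before "2".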

Todo today

Packages used: Python 3.11.*, pip, grape

Here are the numbers of missing (NaN) values per taxonomy database for the species in LOTUS. This seems to confirm what I saw during my Master's thesis, where GBIF seemed to know the most species. The total is 37082 species.

NaN in NCBI: 8133
NaN in GBIF: 736
NaN in OTL: 8073
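
To make the comparison explicit, the counts above can be converted into coverage per database. This is a small sketch, assuming that a NaN means the species has no match in that database; the variable names are my own.

```python
TOTAL_SPECIES = 37082
# Missing (NaN) counts per database, taken from the numbers above
missing = {"NCBI": 8133, "GBIF": 736, "OTL": 8073}

coverage = {}
for db, n_missing in missing.items():
    matched = TOTAL_SPECIES - n_missing
    # Store both the absolute count and the fraction of all species
    coverage[db] = (matched, matched / TOTAL_SPECIES)

# Database with the most matched species
best = max(coverage, key=lambda db: coverage[db][0])

for db, (matched, fraction) in coverage.items():
    print(f"{db}: {matched} species matched ({fraction:.1%})")
print(f"Best coverage: {best}")
```

With these numbers GBIF comes out on top, with 36346 of the 37082 species matched.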

Done

Todo tomorrow
