2023-12-19

This is Marco's daily open-notebook.

Today is 2023.12.19

Notes

Prep work for starting the job in January.

This query (https://w.wiki/8YzQ) recursively finds all the parent taxa of a species, all the way up to the "Biota" group. It should be useful for building the LOTUS graph to use in grape.

I should then :

  • get all the species that are linked to a referenced molecule using this query : https://w.wiki/8Z7Y
  • for each species, get its taxonomy using this query : https://w.wiki/8YzQ
  • create a list of all the nodes (species : 37082) and their taxonomy
  • create a list of all the edges of the tree
  • once this is done, create a tree of molecules (already done in another repo) and link the two trees together (try to do it with weighted edges between species and molecules)
  • Then talk to Luca
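The node and edge lists from the steps above could be assembled roughly like this. It is only a sketch under assumptions: each species' lineage is taken to be a list of (taxon id, depth) pairs, where depth is the ?depth value of the taxonomy query (1 for the species itself, largest for the root), and each taxon is assumed to have a single parent. The function names and the toy identifiers are made up for illustration.

```python
def lineage_to_edges(lineage):
    """Turn one lineage into (child, parent) edges.

    lineage: list of (taxon_id, depth) tuples; assumes a single chain
    of parents, so consecutive depths are child/parent pairs.
    """
    ordered = sorted(lineage, key=lambda item: item[1])
    ids = [taxon_id for taxon_id, _ in ordered]
    return list(zip(ids, ids[1:]))


def build_graph(lineages):
    """Merge all lineages into deduplicated node and edge lists."""
    nodes, edges = set(), set()
    for lineage in lineages.values():
        nodes.update(taxon_id for taxon_id, _ in lineage)
        edges.update(lineage_to_edges(lineage))
    return sorted(nodes), sorted(edges)


# Toy data with made-up identifiers (not real Wikidata items).
toy_lineages = {
    "species_a": [("species_a", 1), ("genus_x", 2), ("biota", 3)],
    "species_b": [("species_b", 1), ("genus_x", 2), ("biota", 3)],
}

nodes, edges = build_graph(toy_lineages)
print(nodes)
print(edges)
```

Using sets means the shared upper part of the tree (here genus_x to biota) is stored only once, however many species go through it. If a taxon turns out to have several parents in Wikidata, the single-chain assumption breaks and the edges would need to come from a query that returns the parent explicitly.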

Here is an example of how to retrieve the data :

import sys

import pandas as pd  # needed for pd.json_normalize at the end
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint_url = "https://query.wikidata.org/sparql"

query = """SELECT ?entity ?entityLabel (count(?mid) as ?depth) WHERE {
  wd:Q27438471 wdt:P171* ?mid.
  ?entity wdt:P225 ?entityLabel.
  ?mid wdt:P171* ?entity.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
} group by ?entity ?entityLabel
order by ?depth"""


def get_results(endpoint_url, query):
    user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
    # TODO adjust user agent; see https://w.wiki/CX6
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()


results = get_results(endpoint_url, query)
pd.json_normalize(results['results']['bindings'])
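The JSON returned by the endpoint nests every value under a "value" key, so a small helper to flatten it might be handy. This is a sketch that assumes the standard SPARQL JSON results layout and the three variables of the query above (entity, entityLabel, depth). The helper names, the sample response, its labels and the parent URI are placeholders, not real query output.

```python
def qid_from_uri(uri):
    """Keep only the last path segment of an entity URI."""
    return uri.rsplit("/", 1)[-1]


def bindings_to_rows(results):
    """Flatten SPARQL JSON bindings into plain dicts, sorted by depth."""
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append(
            {
                "qid": qid_from_uri(binding["entity"]["value"]),
                "label": binding["entityLabel"]["value"],
                # literals arrive as strings, so cast the count
                "depth": int(binding["depth"]["value"]),
            }
        )
    return sorted(rows, key=lambda row: row["depth"])


# Hand-written sample in the SPARQL JSON shape (placeholder values).
sample = {
    "results": {
        "bindings": [
            {
                "entity": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_PARENT_PLACEHOLDER"},
                "entityLabel": {"type": "literal", "value": "placeholder parent"},
                "depth": {"type": "literal", "value": "2"},
            },
            {
                "entity": {"type": "uri", "value": "http://www.wikidata.org/entity/Q27438471"},
                "entityLabel": {"type": "literal", "value": "placeholder species"},
                "depth": {"type": "literal", "value": "1"},
            },
        ]
    }
}

rows = bindings_to_rows(sample)
print(rows)
```

The rows come out ordered from the species (depth 1) up towards the root, which is the order needed to read off child/parent pairs.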

Todo today

packages used : python 3.11.*, pip, grape

Here are the counts of NaN (missing) values for the species in LOTUS, out of 37082 species in total. They seem to confirm what I saw during my Master's thesis: GBIF has the fewest missing species, so it seems to know the most species.

  • NaN in NCBI : 8133
  • NaN in GBIF : 736
  • NaN in OTL : 8073
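To make the comparison easier to read, the NaN counts can be turned into the number and share of species each database does cover. A quick sketch using only the numbers above; the variable and function names are my own.

```python
TOTAL_SPECIES = 37082
missing = {"NCBI": 8133, "GBIF": 736, "OTL": 8073}


def coverage(total, missing_counts):
    """Return, per database, how many species are present and the share in %."""
    summary = {}
    for database, n_missing in missing_counts.items():
        present = total - n_missing
        summary[database] = {
            "present": present,
            "percent": round(100 * present / total, 2),
        }
    return summary


summary = coverage(TOTAL_SPECIES, missing)

# Print the databases from best to worst coverage.
for database, stats in sorted(summary.items(), key=lambda kv: -kv[1]["present"]):
    print(f"{database}: {stats['present']} species ({stats['percent']} %)")
```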

Done

Todo tomorrow