SPARQL

anticipated lotus

I am currently trying to make all Wikidata species available for predictions.

I need to use this endpoint to get my data quickly: https://qlever.cs.uni-freiburg.de/wikidata

The idea is to have a dataframe with, for each species, its taxonomy in the same form as in LOTUS. To do so:

You have ?taxon ?parent; from there you can rebuild the whole taxonomy (outside SPARQL, as you presumably did in lotus-search?)
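
A minimal sketch of how the ?taxon ?parent edges could be pulled from QLever and turned into integer pairs. Several things here are assumptions rather than something stated in these notes: the API URL `https://qlever.cs.uni-freiburg.de/api/wikidata` (the link above is the web UI), the use of `wdt:P171` (parent taxon) for the edges, the `text/csv` response format, and the helper names `build_request`, `qid_to_int`, `parse_edges` and `fetch_edges`, which are purely illustrative.

```python
import csv
import urllib.parse
import urllib.request
from typing import Iterable, Iterator, Optional

# Assumption: API counterpart of the QLever web UI linked above.
QLEVER_API = "https://qlever.cs.uni-freiburg.de/api/wikidata"

# Assumption: taxonomy edges come from P171 (parent taxon).
EDGES_QUERY = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?child ?parent WHERE { ?child wdt:P171 ?parent }
"""


def build_request(query: str, endpoint: str = QLEVER_API) -> urllib.request.Request:
    """Build a POST request asking the endpoint for CSV results."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(endpoint, data=body, headers={"Accept": "text/csv"})


def qid_to_int(value: str) -> Optional[int]:
    """Turn 'http://www.wikidata.org/entity/Q140' (or 'Q140') into 140."""
    tail = value.strip().strip("<>").rsplit("/", 1)[-1]
    if tail.startswith("Q") and tail[1:].isdigit():
        return int(tail[1:])
    return None


def parse_edges(lines: Iterable[str]) -> Iterator[tuple]:
    """Yield (child, parent) integer pairs from CSV lines with a header row."""
    reader = csv.reader(lines)
    # Tolerate headers written with or without the leading '?'.
    header = [name.lstrip("?") for name in next(reader)]
    child_index = header.index("child")
    parent_index = header.index("parent")
    for row in reader:
        child = qid_to_int(row[child_index])
        parent = qid_to_int(row[parent_index])
        if child is not None and parent is not None:
            yield child, parent


def fetch_edges(query: str = EDGES_QUERY) -> list:
    """Run the query against the endpoint (network access required)."""
    with urllib.request.urlopen(build_request(query)) as response:
        text = response.read().decode("utf-8")
    return list(parse_edges(text.splitlines()))
```

Calling `fetch_edges()` would return the edge list in memory; writing it out with `csv.writer` under a `child,parent` header would give a file in the shape the script below reads. Stripping the entity prefix on the Python side keeps the SPARQL query trivial.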

 import csv
 import logging
 from collections import defaultdict, deque
 from pathlib import Path
 from tqdm import tqdm
 
 logging.basicConfig(
     level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
 )
 
 
 def convert_to_int_safe(s: str) -> int | None:
     try:
         result = int(s)
         return result
     except ValueError:
         logging.error(f"{s} is not a valid integer.")
         return None
 
 
 def generate_taxon_parents_with_distance(path: Path) -> list[tuple[int, int, int]]:
     graph = defaultdict(list)
     distances = []
     with open(path / "full_wikidata_taxonomy_edges.csv", "r", newline="") as f:
         reader = csv.reader(f)
         headers = next(reader)
         taxon_index = headers.index("child")
         parent_index = headers.index("parent")
 
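         for row in reader:
             # Convert IDs to int; rows with a non-numeric ID are skipped
             taxon_id = convert_to_int_safe(row[taxon_index])
             parent_id = convert_to_int_safe(row[parent_index])
 
             if taxon_id is None or parent_id is None:
                 continue
             graph[taxon_id].append(parent_id)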
     # Good ol' BFS from each taxon: record every ancestor with its distance
     for node in tqdm(list(graph.keys())):
         visited = {node: 0}
         queue = deque([node])
         while queue:
             current_node = queue.popleft()
             current_distance = visited[current_node]
 
             # .get avoids inserting empty entries into the defaultdict
             for neighbor in graph.get(current_node, ()):
                 if neighbor not in visited:
                     queue.append(neighbor)
                     visited[neighbor] = current_distance + 1
                     distances.append((node, neighbor, current_distance + 1))
 
     return distances

This is super fast and results in a dataframe of 9'000'000 rows. We can then use the child as the index and attach the taxon rank of each parent, with the rank labels coming from this query:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?rank ?rank_label WHERE {
    ?rank wdt:P31 wd:Q427626;
        rdfs:label ?rank_label.
    FILTER (LANG(?rank_label) = "en")
}

You have ?taxon ?rank, which you can join on ?taxon.
You have ?rank ?rank_label, which you can join with the previous result on ?rank.
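
The two joins can be sketched in plain Python as follows. This is only an illustration under assumptions: `build_lineages` is a hypothetical helper, the ?taxon ?rank pairs are assumed to come from a separate query (for instance on `wdt:P105`, taxon rank) and to be loaded as a dict, and the IDs in the example are toy values, not real Wikidata entities.

```python
from collections import defaultdict


def build_lineages(distances, taxon_rank, rank_label):
    """Map each taxon to {rank label: ancestor id}.

    distances:  (taxon, ancestor, distance) tuples, as produced by the BFS
    taxon_rank: taxon id -> rank id    (join on ?taxon)
    rank_label: rank id  -> rank label (join on ?rank)
    """
    lineages = defaultdict(dict)
    # Visit closer ancestors first so that, when two ancestors share a rank,
    # the nearest one is kept.
    for taxon, ancestor, _distance in sorted(distances, key=lambda d: d[2]):
        rank = taxon_rank.get(ancestor)   # first join: ancestor -> rank
        label = rank_label.get(rank)      # second join: rank -> label
        if label is None:
            continue  # ancestor without a known, labelled rank
        lineages[taxon].setdefault(label, ancestor)
    return dict(lineages)


# Toy data: taxon 1 (a species) -> 2 (its genus) -> 3 (its family).
toy_distances = [(1, 2, 1), (1, 3, 2), (2, 3, 1)]
toy_taxon_rank = {1: 100, 2: 200, 3: 300}
toy_rank_label = {100: "species", 200: "genus", 300: "family"}

lineages = build_lineages(toy_distances, toy_taxon_rank, toy_rank_label)
print(lineages)  # {1: {'genus': 2, 'family': 3}, 2: {'family': 3}}
```

Each value is one row of the target dataframe: the child as index and one column per rank label. The same joins could be done with dataframe merges instead; dicts are used here only to keep the sketch dependency-free.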
