2023-03-09
This is Maëlle's DBGI daily open-notebook.
Today is 2023.03.09
TODO
- Check discussion: https://github.com/orgs/digital-botanical-gardens-initiative/discussions
- Check Zulip
CODE
NOTES
dbgi meeting
To check: https://dash.gallery/dash-drug-discovery/
WIDS Geneva
Building a data practice from scratch for a reasonable budget: The Good, the Bad, and the Ugly
Christelle Marfaing, Head of Data, May
- Seek and define your Good target
- Avoid the bad traps and dead ends
- Take the ugly way to get there
Build you data stack
- modern data stack movement (good way to start but not perfect) (https://www.alation.com/blog/modern-data-stack-explained/) => Batch ingestion of multiple sources to a DWH
- Important to know what you want to build and what you want to buy
A stack is nothing without a team
- Full stack or specialized team? Specialization typically develops as a team matures and is heavily influenced by the prevailing trends at the time of the team's creation => will shape your HR policy
Data teams models:
- Centralized (one manager - small team)
- Hybrid
- Decentralized (each business unit works on different projects)
- Data Mesh (Trendy in Paris)
How to determine the model for your dream team?
- The degree of centralization of a team depends not only on the number of people but also on the team's maturity
- While decentralization can have advantages, a poorly executed ...
How to avoid spending all your time recruiting?
- Let them learn
- LET IT GO
Key takeaways
- Be prepared to start it all again.
- There is no one-size-fits-all solution, but the more you learn about building a tech stack and team, the more informed your decisions will be.
- Building a tech stack and a team are interdependent
- Recruiting and keeping employees is a big challenge.
Graph Neural Networks - applications and opportunities
Nadya Chernyavskaya, Senior Researcher, CERN
- When? Types of data to benefit from GNNs
- What For?
- How?
Graph and their properties
- nodes and edges with features
- directed/undirected
- connected/disconnected
- Self-loops
- multi-graph (edges of different types => heterogenous graph)
- permutation invariant (challenging to visualise)
==> Graph structure offers much more flexibility:
- arbitrary size and complex topological structure
What for?
Learning on graphs:
- Node level- predict a property of a node
- Edge/Link level - predict links between two nodes (recommendation systems)
- Community/Subgraph level - detect if nodes form a community (user profiling task/ diagnostic tasks for rare disorders)
- graph level - categorize different graphs (molecule property prediction)
Other tasks: Graph evolution, Graph generation (Drug discovery)
How?
How to represent?
- adjacency matrix
- edge list
- adjacency list
How to represent the features:
- Node features x edge features => 3D matrix
Message passing view of GNNs
We can get information about large scale structure by repeating a simple local rule
- Rule: My neighbours tell me what they think i should know
- This is done for all neghbourhoods and the information is updated
- After 3 iterations, i know what my 3 hop neighbours think we all should know
What about the previous information?
- Each node keep tracks
Impact ?
- traffic prediction in Google maps
- Nodes: road segments
- Edges: connectivity between road segments
- Prediction: Time of arrival
Conclusion
- When? Graph nerual networks is a very powerful tool for complex and sparse data
- What for? They can be successfully applied to a variety of different tasks such a s node classification, graph classification, edge prediction, etc.
- How? We learned how to convert graph into a ML ready representation and build a GNN
Databricks
https://github.com/marzervou/wids
MLOps
set of processes and automation for managing code, data and models to improve performance, stability and long-term efficiency of ML systems
Why should care ?
- Helps reduce risk
- technical risk - poorly performing models, fragile infrastructure
- Compliance risk - violating regulatory or corporate policies
- improves long-term efficiency through automation
- catch errors before they hit production
- avoid slow, manual processes
Databricks (https://www.databricks.com/)
- lakehouse platform
- Dev/staging/prod environments can be separated
- prod environments => ML flow(https://mlflow.org/)
Deploy code process
- Dev
- Develop training code
- Develop ancillary code
- promote code
- staging
- Test model training code on subset of data
- prod