Andrea Riba

Data systems,
machine learning,
at scale.

Senior Data Scientist building production-grade data pipelines, ML systems, and large-scale analytical workflows. Focused on distributed systems, cloud infrastructure, and real-world optimisation problems.

Open to remote opportunities in data infrastructure, ML systems, and high-performance data processing.

I design and build data and machine learning systems at scale. My work focuses on distributed data pipelines, real-time architectures, and applied ML, with production experience across cloud environments (GCP), event-driven systems (Kafka), and high-performance data processing (Spark, Polars). I began my career in academia as a computational biologist, publishing in journals such as Nature Communications and PNAS, where I developed deep learning models and large-scale simulations. Today, I bring the same rigour to industry problems, with an emphasis on scalability, performance, and measurable impact. I am currently a Data Scientist at Leroy Merlin, working on supply chain optimisation, pricing, and logistics systems.

TB+ data processed
real-time streaming pipelines
prod ML models deployed
1,020+ Google Scholar citations
12 peer-reviewed publications
10 yr academic research career
2024 — Present
Data Scientist
Leroy Merlin Italia · Milan, Italy
Designed and deployed data-driven systems for supply chain optimisation, automation, and real-time decision making.
  • Built a large-scale pipeline to analyse overstock root causes, covering ~€45M in inventory value and attributing ~85% of overstock to actionable drivers (e.g. replenishment policies, stock constraints); the solution is being scaled across Adeo business units.
  • Developed and deployed an ML model, exposed via an API, that prioritises self-checkout (SCO) cart controls by risk score, improving retrieved amounts by ~3× compared with the previous rule-based approach.
  • Implemented automated validation pipelines for carrier pre-invoices (GDrive ingestion, ZenML orchestration, Mailjet alerts), reducing manual checks and improving data reliability.
  • Built monitoring and alerting systems for logistics parameters using BigQuery, n8n, and Google Sheets, enabling real-time anomaly detection and tracking of corrective actions.
Python · SQL · GCP · ZenML · n8n
2022 — 2024
Senior Consultant
Capgemini Financial Services · Turin, Italy
Built and maintained cloud-based data pipelines and event-driven systems for credit risk modelling at Intesa Sanpaolo. Worked on large-scale financial datasets, focusing on reliability, performance, and production deployment.
Python · GCP · Apache Kafka · Spark
2017 — 2022
Postdoctoral Researcher
IGBMC, University of Strasbourg · France  ·  PI: Nacho Molina
Developed deep learning methods for cell cycle inference from single-cell RNA-seq data. Applied AI and mathematical modelling to transcriptional dynamics and mitosis reactivation.
Python · TensorFlow · scRNA-seq · R
2015 — 2017
Postdoctoral Researcher
Biozentrum, University of Basel · Switzerland  ·  PI: Mihaela Zavolan
Sequencing data analysis, machine learning, and biophysical modelling of mRNA translation. Built the TASEP simulator for ribosome dynamics published in PNAS 2019.
R · C++ · Python · HPC / Slurm
2012 — 2015
PhD in Complex Systems for Life Sciences
University of Torino · Italy  ·  Supervisor: Prof. Michele Caselle
Biophysical modelling of microRNA-dependent gene regulation. Stochastic processes and gene expression dynamics.
C++ · R · Mathematica
2006 — 2011
BSc & MSc in Theoretical Physics
University of Torino · Italy
MSc grade: 110/110 cum laude. BSc grade: 110/110 cum laude.
Data systems
Batch & streaming pipelines · ETL / ELT · Orchestration
Distributed computing
Spark · Kafka · Parallel computing · HPC (Slurm / Grid Engine)
Backend & APIs
FastAPI · Litestar · REST APIs
Data platforms
BigQuery · PostgreSQL · MongoDB
Languages
Python · C++ · SQL · R
Machine learning
TensorFlow · scikit-learn · XGBoost
Scientific computing
Stochastic processes · Simulation · Numerical modelling
Bioinformatics
RNA-seq · scRNA-seq · Ribo-seq

A selection of systems and experiments focused on scalable data processing, machine learning, and simulation.

01

MoonBirths

Open data analysis · 2026

End-to-end data pipeline processing ~3M records from Wikidata, combining high-performance data processing (Polars) with astronomical computations and FFT-based signal analysis. Designed for scalability and reproducibility, the system demonstrates how large-scale statistical pipelines can rigorously test hypotheses on noisy real-world data.

Python · Polars · Wikidata · pyephem · statistics · astronomy
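
To illustrate the kind of periodicity test described above, here is a minimal sketch that measures spectral power at the lunar (synodic) period in a daily count series. It is not the project's code: it swaps the FFT for a direct single-frequency Fourier sum so that it runs on the standard library alone, and the function name `spectral_power`, the synthetic series, and the 17-day comparison period are all assumptions made for illustration.

```python
import cmath
import math

# Mean length of the synodic month in days.
SYNODIC_MONTH_DAYS = 29.530588


def spectral_power(series, period_days):
    """Power of a daily series at one period, via a direct Fourier sum.

    The series mean is removed first so that the constant baseline does
    not leak into the estimate.
    """
    n = len(series)
    mean = sum(series) / n
    freq = 1.0 / period_days  # cycles per day
    acc = sum(
        (x - mean) * cmath.exp(-2j * math.pi * freq * t)
        for t, x in enumerate(series)
    )
    return abs(acc) ** 2 / n


# Synthetic (illustrative) data: ten years of daily counts with a small
# lunar-period modulation on top of a constant baseline.
days = 3650
counts = [
    100.0 + 5.0 * math.cos(2 * math.pi * t / SYNODIC_MONTH_DAYS)
    for t in range(days)
]

lunar_power = spectral_power(counts, SYNODIC_MONTH_DAYS)
control_power = spectral_power(counts, 17.0)  # arbitrary comparison period
print(lunar_power > control_power)
```

On real data the comparison would be made against a null distribution (for example, shuffled dates) rather than a single control period.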
02

tf2tfjs

ML deployment · 2024

Lightweight pipeline for converting TensorFlow models into TensorFlow.js format, enabling browser-based inference and client-side deployment. The project focuses on model portability across runtime environments, demonstrating how trained models can be adapted for low-latency applications and privacy-preserving use cases without server-side inference.

Python · TensorFlow · TensorFlow.js · model deployment · ML systems · web inference
03

Games & AI

Interactive demo

Live demo

A collection of AI algorithms for classic games: minimax and AlphaZero-style policy networks. Play against the AI directly in the browser at Tic-Tac-Toe and Connect Four.

Python · JavaScript · AlphaZero · MCTS
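
As a rough illustration of the minimax approach mentioned above, the sketch below searches the full Tic-Tac-Toe game tree. It is not taken from the project: the board encoding (a 9-character string read row by row) and the function names `winner` and `minimax` are assumptions chosen for brevity.

```python
from functools import lru_cache

# Index triples that form a winning line on a 3x3 board stored row by row.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def minimax(board, player):
    """Return (score, move) for the player to move.

    Scores are from X's point of view: +1 if X wins, -1 if O wins,
    0 for a draw. X maximises the score and O minimises it.
    """
    won = winner(board)
    if won is not None:
        return (1 if won == "X" else -1), None
    if " " not in board:
        return 0, None

    opponent = "O" if player == "X" else "X"
    best_score, best_move = None, None
    for i, cell in enumerate(board):
        if cell != " ":
            continue
        child = board[:i] + player + board[i + 1:]
        score, _ = minimax(child, opponent)
        better = (
            best_score is None
            or (player == "X" and score > best_score)
            or (player == "O" and score < best_score)
        )
        if better:
            best_score, best_move = score, i
    return best_score, best_move


# X has two in a row on the top line and is to move.
print(minimax("XX OO    ", "X"))
```

Exhaustive search is feasible here because the Tic-Tac-Toe state space is tiny; a larger game such as Connect Four would typically need depth limits, pruning, or a learned evaluation.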
04

DeepCycle

Nature Communications · 2022

Published

Cell cycle inference in single-cell RNA-seq using RNA velocity and a circular autoencoder. Assigns a continuous transcriptional phase angle to each cell, revealing hidden oscillatory structure without external perturbations or cell sorting.

Python · TensorFlow · scRNA-seq · RNA velocity · bioinformatics
05

Codon TASEP

PNAS · 2019

Published

Simulation of ribosome dynamics along mRNA transcripts using an inhomogeneous TASEP with codon-specific elongation rates. Combined with ribosome footprinting data, the model revealed that amino acid composition is as important a determinant of translation elongation speed as codon–tRNA adaptation — challenging a long-standing assumption in molecular biology.

C++ · TASEP · ribosome profiling · translation · yeast · biophysics
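
For readers unfamiliar with the model, the following is a minimal sketch of an inhomogeneous TASEP with open boundaries, simulated with the Gillespie algorithm. It is a simplified illustration, not the published C++ simulator: each particle covers a single site (real ribosomes span several codons), and the function name `simulate_tasep`, the rates, and the single slow site are hypothetical values chosen for the example.

```python
import random


def simulate_tasep(hop_rates, alpha, beta, t_max, seed=0):
    """Gillespie simulation of an open-boundary TASEP with site-specific rates.

    hop_rates[i] is the rate of hopping from site i to site i + 1 (the last
    entry is unused, because particles leave the last site at rate beta).
    alpha is the entry rate at the first site.
    Returns (time-averaged occupancy per site, number of completed exits).
    """
    rng = random.Random(seed)
    n = len(hop_rates)
    occupied = [False] * n
    occupied_time = [0.0] * n
    exits = 0
    t = 0.0
    while True:
        # Collect every move allowed by the exclusion rule, with its rate.
        # Move -1 is an entry; move i advances the particle at site i.
        moves = []
        if not occupied[0]:
            moves.append((alpha, -1))
        for i in range(n - 1):
            if occupied[i] and not occupied[i + 1]:
                moves.append((hop_rates[i], i))
        if occupied[-1]:
            moves.append((beta, n - 1))

        # Waiting time to the next event is exponential in the total rate.
        total_rate = sum(rate for rate, _ in moves)
        dt = rng.expovariate(total_rate)
        step = min(dt, t_max - t)
        for i in range(n):
            if occupied[i]:
                occupied_time[i] += step
        if dt >= t_max - t:
            break
        t += dt

        # Pick one move with probability proportional to its rate.
        weights = [rate for rate, _ in moves]
        _, site = rng.choices(moves, weights=weights)[0]
        if site == -1:
            occupied[0] = True
        elif site == n - 1:
            occupied[-1] = False
            exits += 1
        else:
            occupied[site] = False
            occupied[site + 1] = True

    return [x / t_max for x in occupied_time], exits


# Illustrative run: 20 sites, with one slow site in the middle.
rates = [1.0] * 20
rates[9] = 0.05
density, exits = simulate_tasep(rates, alpha=1.0, beta=1.0, t_max=500.0, seed=1)
print([round(d, 2) for d in density], exits)
```

A slow site acts as a bottleneck: particles queue upstream of it and the lattice empties downstream, which is the qualitative signature that codon-specific rates leave in occupancy profiles.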
06

Portfolio App

Web application

A full-stack analytical application combining data processing, forecasting models, and interactive visualisation. Designed to explore portfolio dynamics and simulate future scenarios.

Python · Dash · Plotly · web app
07

Space 3D

Interactive demo

Live demo

Interactive 3D visualisation of the night sky using real HYG star catalogue data and physically accurate spectral colours. Navigate a rendition of the observable universe in your browser with Three.js.

JavaScript · Three.js · WebGL · astronomy · data viz

Background in computational biology with publications in Nature Methods, Nature Communications, PNAS and others.

Cell cycle gene regulation dynamics revealed by RNA velocity and deep-learning
Riba et al.  ·  Nature Communications 2022  ·  1st & corresponding author
nature.com ↗
Nat Comms
Protein synthesis rates and ribosome occupancies reveal determinants of translation elongation rates
Riba et al.  ·  PNAS 2019  ·  1st & corresponding author
pnas.org ↗
PNAS
Explicit modeling of siRNA-dependent on- and off-target repression improves the interpretation of screening results
Riba et al.  ·  Cell Systems 2017  ·  1st author
cell.com ↗
Cell Sys
Terminal exon characterization with TECtool reveals an abundance of cell-specific isoforms
Gruber, Gypas, Riba et al.  ·  Nature Methods 2018
nature.com ↗
Nat Meth

Full list (12 papers, 1,020+ citations) → Google Scholar