FAIRiCUBE-Hub

The Gridded GeoData Working Environment

Validation of Data Processing and ML Applications

Processing and machine learning validation in FAIRiCUBE ensures that the data workflows and algorithms used within the use cases (UCs) are reliable, well documented, and ethically sound. This includes validating data cleaning, transformation, and integration steps, as well as confirming that algorithms are implemented correctly and perform as expected. Benchmarking is used to compare performance across methods and datasets, while thorough documentation supports transparency and reproducibility. The validation process also pays attention to ethical considerations, including identifying and addressing potential biases in machine learning models.

| Process | Check type | Characteristic | Description |
| --- | --- | --- | --- |
| Data processing validation | Algorithm implementation validation | Technical robustness and safety | Ensure robustness and safety of the implementation through, e.g., unit tests that verify that units of code behave as expected in isolation (including individual components of the data processing pipeline and algorithms). |
| | | Assess the interactions | Conduct integration tests to assess the interactions between different components of the system, verifying that data flows smoothly between processing stages and that the overall pipeline functions correctly. |
| | | End-to-end testing | Assess the entire data processing workflow by testing the complete pipeline with controlled (i.e. synthetic) but representative data to ensure that it produces the expected results. |
| | | Cross-validation | Use cross-validation to assess the model's performance across different subsets of the data, ensuring that it generalizes well to new, unseen data. |
| | Benchmarking | Monitor compute resources | Monitor and store the consumption of computational resources as defined and described in the FAIRiCUBE GitHub repository. |
| | | Re-run and compare | If performance is in question, a re-run of the application can be advised. The monitoring results of a compute task can be compared with those of similar tasks listed in the FAIRiCUBE knowledge base (https://fairicube-kb.dev.epsilon-italia.it/). |
| | Comprehensive documentation | Documentation and transparency | Document the rationale behind design choices, assumptions, and dependencies to make the processing and ML application methods transparent. |
| | | Metadata | Update, complete, and maintain the metadata records associated with the dataset. This applies to the FAIRiCUBE analysis/processing metadata. |
| Machine learning validation | | Dataset preparation for training | Create separate subsets of the data for training, testing (and validation), and check the selection method (random, or by consecutive index). |
| | | Define appropriate validation metrics | Select and document appropriate metrics, such as overall accuracy, precision, recall, F1 score, or area under the ROC curve, to evaluate the performance of the ML method. Define expectations first, and establish baseline methods to compare against. |
| | | Prevent/test overfitting and underfitting | Comparing performance/accuracy metrics obtained by applying the ML model to datasets that were not included in its training gives insight into overfitting or underfitting. |
| | | Statistical bias validation | Checking the statistical distribution of the input features, both within one feature space and across features, avoids unwanted biases in the training of the ML model. Some ML methods require certain statistical distributions (e.g. Gaussian or uniform); others require scaling of the data features. |
| | | Human agency and oversight | Implement human oversight mechanisms such as human-in-the-loop, human-on-the-loop, and human-in-command approaches. At each step of the ML application, user feedback and interaction should be foreseen. |

Illustrative code sketches for several of these checks follow; all function names, data, and thresholds in them are placeholders rather than FAIRiCUBE implementations.
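Illustrating the unit-test check: the sketch below tests a single, hypothetical cleaning step (`fill_nodata`, a placeholder rather than FAIRiCUBE code) in isolation, using pytest-style test functions.

```python
import numpy as np


def fill_nodata(arr: np.ndarray, nodata: float = -9999.0) -> np.ndarray:
    """Hypothetical unit under test: replace a nodata sentinel with NaN."""
    out = arr.astype(float).copy()
    out[out == nodata] = np.nan
    return out


def test_fill_nodata_replaces_sentinel():
    result = fill_nodata(np.array([1.0, -9999.0, 3.0]))
    assert np.isnan(result[1])                     # sentinel replaced
    assert result[0] == 1.0 and result[2] == 3.0   # valid values untouched


def test_fill_nodata_does_not_mutate_input():
    arr = np.array([-9999.0])
    fill_nodata(arr)
    assert arr[0] == -9999.0                       # caller's array unchanged
```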
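An integration test then exercises two stages together, checking that data flows correctly from one to the next; both stages here are the same kind of placeholder.

```python
import numpy as np


def fill_nodata(arr, nodata=-9999.0):
    out = arr.astype(float).copy()
    out[out == nodata] = np.nan
    return out


def minmax_scale(arr):
    """Scale valid values to [0, 1], ignoring NaNs."""
    lo, hi = np.nanmin(arr), np.nanmax(arr)
    return (arr - lo) / (hi - lo)


def test_stages_compose():
    raw = np.array([[0.0, 50.0], [-9999.0, 100.0]])
    scaled = minmax_scale(fill_nodata(raw))
    assert scaled.shape == raw.shape               # shape preserved across stages
    assert np.nanmin(scaled) == 0.0 and np.nanmax(scaled) == 1.0
    assert np.isnan(scaled[1, 0])                  # nodata propagated, not rescaled
```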
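End-to-end testing runs the complete pipeline on synthetic input whose correct output is known in advance; `run_pipeline` stands in for a real workflow entry point.

```python
import numpy as np


def run_pipeline(raw):
    """Stand-in for the full workflow: nodata cleaning followed by scaling."""
    out = raw.astype(float).copy()
    out[out == -9999.0] = np.nan
    lo, hi = np.nanmin(out), np.nanmax(out)
    return (out - lo) / (hi - lo)


def test_end_to_end_on_synthetic_scene():
    # Controlled but representative input: a gradient with one nodata pixel.
    raw = np.linspace(0.0, 100.0, 100).reshape(10, 10)
    raw[0, 0] = -9999.0
    result = run_pipeline(raw)
    assert np.isnan(result[0, 0])      # nodata pixel survives the whole pipeline
    assert result[9, 9] == 1.0         # the gradient's maximum maps to 1
    assert np.nanmin(result) == 0.0    # ... and its minimum to 0
```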
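Cross-validation is available off the shelf in scikit-learn; the classifier and synthetic dataset below are placeholders for a UC's actual model and features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder features and labels standing in for gridded UC data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation: each fold is held out once for evaluation,
# so the mean/std reflect performance on unseen subsets of the data.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"fold accuracies: {scores.round(3)}")
print(f"mean ± std: {scores.mean():.3f} ± {scores.std():.3f}")
```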
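The concrete resource-monitoring setup is defined in the FAIRiCUBE GitHub repository; as a generic stand-in, the sketch below records wall-clock time and peak memory of a compute task using only the standard library.

```python
import json
import time
import tracemalloc


def monitored_run(task, *args, **kwargs):
    """Run `task` and write a small JSON record of its resource use."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = task(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    record = {
        "task": task.__name__,
        "wall_clock_s": round(elapsed, 3),
        "peak_mem_mb": round(peak / 1e6, 2),
    }
    # Stored next to the outputs so later runs can be compared against it.
    with open(f"{task.__name__}_resources.json", "w") as fh:
        json.dump(record, fh, indent=2)
    return result
```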
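For dataset preparation, the selection method matters: a random split and a split by consecutive index behave very differently on spatially or temporally ordered data. Both variants, on placeholder arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000, dtype=float).reshape(-1, 1)  # placeholder features
y = (X[:, 0] > 500).astype(int)                  # placeholder labels

# Random split: shuffles before splitting (fix and document the seed).
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

# Split by consecutive index: preserves temporal/spatial order, which
# helps avoid leakage when neighbouring samples are correlated.
cut = int(0.8 * len(X))
X_tr_seq, X_te_seq = X[:cut], X[cut:]
y_tr_seq, y_te_seq = y[:cut], y[cut:]
```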
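The metrics named above all have standard scikit-learn implementations; the ground truth and predictions below are placeholders.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Placeholder ground truth, hard predictions, and class-1 probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))
```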
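A quick overfitting check compares scores on the training data with scores on held-out data: a large gap suggests overfitting, while low scores on both suggest underfitting. The gap threshold below is an arbitrary illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)

print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
if train_acc - test_acc > 0.1:   # illustrative threshold, not a fixed rule
    print("large train/test gap -> possible overfitting")
```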
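For the statistical bias check, each input feature's distribution can be inspected (here with D'Agostino's normality test) and features scaled when the chosen method expects it; the two synthetic features are placeholders.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(50.0, 10.0, 1000),    # roughly Gaussian feature
    rng.exponential(5.0, 1000),      # strongly skewed feature
])

# Per-feature distribution check.
for i in range(X.shape[1]):
    _, p = stats.normaltest(X[:, i])
    verdict = "looks Gaussian" if p > 0.05 else "not Gaussian"
    print(f"feature {i}: p = {p:.3g} ({verdict})")

# Zero-mean / unit-variance scaling for scale-sensitive methods.
X_scaled = StandardScaler().fit_transform(X)
print("means after scaling:", X_scaled.mean(axis=0).round(6))
```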
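Human oversight can take a concrete human-in-the-loop form in code: predictions below a confidence threshold are routed to a person instead of being accepted automatically. The threshold and the review workflow are illustrative assumptions.

```python
def predict_with_review(model, X, threshold=0.8):
    """Accept confident predictions; flag the rest for human review."""
    proba = model.predict_proba(X)          # any scikit-learn-style classifier
    confidence = proba.max(axis=1)
    labels = proba.argmax(axis=1)
    needs_review = confidence < threshold   # illustrative confidence cut-off
    return labels, needs_review

# Usage idea: X[needs_review] is shown to a domain expert, whose decisions
# are logged and can feed back into later retraining (human-in-the-loop).
```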