Researchers often spend a significant amount of time on data wrangling tasks such as reformatting, cleaning, and integrating data from different sources. Despite the availability of software tools, they often end up with difficult-to-reuse workflows that require manual steps.

Omnipy is a new Python library that offers a systematic and scalable approach to wrangling research data and metadata. It allows researchers to import data in various formats and continuously reshape it through typed transformations. For large datasets, Omnipy seamlessly scales up local test jobs and provides persistent access to the data state at every step.
This workshop will provide down-to-earth tutorials and examples to help data scientists from any field make use of Omnipy to wrangle real-world datasets into shape.
The workshop is divided into three parts:
1. The first part will introduce the concepts of models, datasets, and tasks in Omnipy through small examples. We will also touch on Python type hints and pydantic models as needed, since these are important building blocks for Omnipy.
2. In the second part, the participants will be provided with a rough example dataset that requires cleaning. As a hands-on exercise, participants will carry out step-wise parsing and shaping of the data to make it comply with a specified metadata schema.
3. In the last part, the participants will be introduced to the metadata mapping functionality in Omnipy and will be led through another hands-on exercise to set up a transformation that maps data from one metadata schema to another.

This half-day workshop will form the knowledge basis for an intermediate-level workshop after lunch that will focus on defining and orchestrating data flows, including integrating with data sources and deploying flows onto external compute resources.
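For participants unfamiliar with the building blocks mentioned in part 1, here is a minimal, generic sketch of Python type hints combined with a pydantic model. Note that this is plain pydantic, not Omnipy's own API, and the field names are made up for illustration:

```python
from pydantic import BaseModel


class SampleRecord(BaseModel):
    # Type hints on the fields tell pydantic how to parse and validate input
    sample_id: str
    temperature_celsius: float
    replicates: list[int]


# Raw data often arrives as strings; pydantic coerces it to the declared types
raw = {"sample_id": "S-42", "temperature_celsius": "36.6", "replicates": ["1", "2"]}
record = SampleRecord(**raw)

assert record.temperature_celsius == 36.6   # coerced from str to float
assert record.replicates == [1, 2]          # each item coerced to int
```

Passing a value that cannot be coerced (e.g. `temperature_celsius="warm"`) raises a validation error, which is the mechanism that makes stepwise, typed transformations safe.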
Learning outcomes
- Introduction to Python type hints and pydantic models
- How to use type hints to define models, datasets and tasks in Omnipy
- How to wrangle a rough dataset into the shape required by a metadata schema
- How to set up an executable mapping of data from one metadata schema to another
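As a generic illustration of the last outcome, a schema-to-schema mapping can be expressed as a typed function between two pydantic models. This sketch uses plain pydantic with hypothetical schema and field names, not Omnipy's mapping API:

```python
from pydantic import BaseModel


class SourceRecord(BaseModel):
    # Hypothetical source metadata schema
    sample_id: str
    temp_c: float


class TargetRecord(BaseModel):
    # Hypothetical target metadata schema
    identifier: str
    temperature_kelvin: float


def map_record(src: SourceRecord) -> TargetRecord:
    # The mapping consists of field renames plus a unit conversion
    return TargetRecord(
        identifier=src.sample_id,
        temperature_kelvin=src.temp_c + 273.15,
    )


mapped = map_record(SourceRecord(sample_id="S-1", temp_c=25.0))
assert mapped.identifier == "S-1"
assert mapped.temperature_kelvin == 298.15
```

Because both ends of the mapping are typed models, invalid input is rejected at the boundary rather than propagating silently through the transformation.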
Prerequisites
The participant should have some experience with Python programming/scripting. We will not spend time explaining basic syntax and concepts, other than what is related to type hints. Experience with type hints in Python is useful, but not required.
Target audience
PhD students, postdocs, and technical personnel with an interest in, and some experience with, programming in an academic setting. Data science will be a particular focus, but the workshop is open to any interested participant. The use cases will not assume any domain knowledge.
Required material
The participants should bring a laptop. No software installation is required other than a modern browser. We will make use of Jupyter Notebook for the hands-on exercises. An online Jupyter Notebook service will be made available, but participants can also install Jupyter Notebook locally on their laptop if they prefer.