Machine Learning Data Gets Type Checking, Validation with Flyte, Pandera – InApps 2022

Beyond Pandas

Much of the data used in ML today is stored in pandas DataFrames, which are essentially tables of imported data with little or no additional context. Python, however, is a dynamically typed language: it does not check the type of a value assigned to a variable until the code actually runs.

On its own, Python cannot flag in advance when, say, a string is inadvertently entered as a value where an integer was expected. The mistake surfaces only as an error at runtime. Even if all the values are strings, applying Python’s math operators to them can produce undesired results, Bantilan said.

Adding strings and integers together in Python results in runtime errors.
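As a minimal illustration of both problems described above (the variable names are made up for this example), the snippet below shows the runtime error from mixing a string with an integer, and the silent surprise when both operands are strings:

```python
# A number that was read in as text, for example from a CSV file.
quantity = "3"

# Mixing a string and an integer fails, but only when this line runs.
try:
    total = quantity + 2
except TypeError as exc:
    error_message = str(exc)

# When both operands are strings nothing fails, yet the result is not arithmetic.
concatenated = "3" + "2"   # "32", not 5
repeated = "3" * 2         # "33", not 6

# Converting explicitly gives the intended result.
total = int(quantity) + 2  # 5
```

Nothing in the language stops the first two cases from reaching production; they are caught only if that code path happens to run during testing.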

Flyte, a strongly typed “data lineage tracking platform,” can, among other things, perform type checking: preproduction checks that ensure, for example, that only integers appear in an integer column.

With Flyte, the ML engineer writes tasks that can preprocess data. Each task is a Python function marked with a decorator and run in its own container. Tasks can be chained together into workflows, with the input and output of each task clearly defined.

With Flyte schemas, you can build a fully type-safe ML workflow as a directed acyclic graph (DAG), which helps ensure that the data passing through it is correct.

In Flyte you build data processing tasks….

Flyte tasks can be chained together into functions.

“This is a great feature to have. Because now that you have type information, you basically have function types for your functions. And your function now can be analyzed to see what is a valid set of operations. So you can assess your workflow for validity just on the basis of the allowed input and output types,” Bantilan explained.
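The idea of assessing a workflow purely from its input and output types can be sketched in plain Python. This is not Flyte's API: `check_chain`, `clean`, and `total` are hypothetical names, and the sketch assumes every step takes exactly one annotated argument. It inspects annotations only and never runs the steps.

```python
from typing import Callable, List, get_type_hints


def clean(raw: List[str]) -> List[int]:
    """Convert raw text values to integers."""
    return [int(value) for value in raw]


def total(values: List[int]) -> int:
    """Sum a list of integers."""
    return sum(values)


def check_chain(*steps: Callable) -> None:
    """Raise TypeError if a step's return type differs from the next step's input type.

    Only the annotations are compared; no step is executed.
    """
    for upstream, downstream in zip(steps, steps[1:]):
        produced = get_type_hints(upstream)["return"]
        hints = get_type_hints(downstream)
        # The single non-return annotation is the downstream input type.
        expected = next(t for name, t in hints.items() if name != "return")
        if produced != expected:
            raise TypeError(
                f"{upstream.__name__} returns {produced}, "
                f"but {downstream.__name__} expects {expected}"
            )


# Valid order: clean produces List[int], which total accepts.
check_chain(clean, total)

# Invalid order: total produces int, but clean expects List[str].
try:
    check_chain(total, clean)
except TypeError as exc:
    chain_error = str(exc)
```

The second call is rejected before any data is processed, which is the kind of validity check that typed inputs and outputs make possible.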

The resulting Python code can be run locally and deployed into a production environment. Flyte’s Python SDK, flytekit, can be installed with pip.

Beyond Type Safety

Pandera is a statistical typing and data testing tool that can be integrated with Flyte to validate properties beyond data types, in effect adding guardrails to a data processing pipeline.

Statistical typing specifies the properties of collections of data points. For instance, if you already know the range of values for an input, you can check that the data falls within this range. You can check that values match a regular expression, or that null values do not appear too often. You can also check a column for uniqueness, or for its “monotonicity” (are the values increasing or decreasing?).
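To make these properties concrete, here is a plain-Python sketch of each check applied to a single column of values. The function names and sample data are invented for illustration and are not Pandera's API, which expresses the same ideas as checks attached to a schema.

```python
import re
from typing import Optional, Sequence


def in_range(values: Sequence[float], low: float, high: float) -> bool:
    """Every value lies within [low, high]."""
    return all(low <= v <= high for v in values)


def matches(values: Sequence[str], pattern: str) -> bool:
    """Every value matches the regular expression in full."""
    return all(re.fullmatch(pattern, v) is not None for v in values)


def null_fraction_at_most(values: Sequence[Optional[object]], limit: float) -> bool:
    """The share of missing (None) values does not exceed the limit."""
    if not values:
        return True
    return sum(v is None for v in values) / len(values) <= limit


def is_unique(values: Sequence[object]) -> bool:
    """No value appears more than once."""
    return len(set(values)) == len(values)


def is_monotonic_increasing(values: Sequence[float]) -> bool:
    """Each value is greater than or equal to the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


ages = [34, 51, 27]
codes = ["AB-12", "CD-34"]
readings = [1.0, None, 2.5, 3.0]
ids = [1, 2, 2]
timestamps = [1, 2, 2, 5]

ages_ok = in_range(ages, 0, 120)                      # True
codes_ok = matches(codes, r"[A-Z]{2}-\d{2}")          # True
readings_ok = null_fraction_at_most(readings, 0.1)    # False: 25% are missing
ids_ok = is_unique(ids)                               # False: 2 appears twice
timestamps_ok = is_monotonic_increasing(timestamps)   # True
```

Each check describes the column as a whole rather than any single value, which is what distinguishes statistical typing from ordinary type checking.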

With data testing, Pandera can validate both the live data coming in and the functions handling that data. You can encode assumptions about data frames as schemas, which can be used as Python type annotations and checked by way of function calls.

Pandera can check to ensure the data handling functions of your ML workflow are working as intended.
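A rough sketch of how schemas used as type annotations can be checked on every function call is shown below. It uses only the standard library and lists of dictionaries in place of data frames; `Schema`, `RawPatients`, `ScoredPatients`, `validate_annotations`, and `score` are hypothetical names for this illustration, not Pandera's API.

```python
import inspect
from functools import wraps
from typing import Callable, Dict, List, get_type_hints

Row = Dict[str, object]


class Schema:
    """Hypothetical stand-in for a data frame schema: column name -> check."""

    checks: Dict[str, Callable[[object], bool]] = {}

    @classmethod
    def validate(cls, rows: List[Row]) -> List[Row]:
        for index, row in enumerate(rows):
            for column, check in cls.checks.items():
                if column not in row or not check(row[column]):
                    raise ValueError(
                        f"{cls.__name__}: row {index}, column '{column}' failed validation"
                    )
        return rows


class RawPatients(Schema):
    checks = {"age": lambda v: isinstance(v, int) and 0 <= v <= 120}


class ScoredPatients(Schema):
    checks = {
        "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
        "risk": lambda v: isinstance(v, float) and 0.0 <= v <= 1.0,
    }


def validate_annotations(func: Callable) -> Callable:
    """Validate schema-annotated arguments and the return value on each call."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    def is_schema(annotation: object) -> bool:
        return isinstance(annotation, type) and issubclass(annotation, Schema)

    @wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            if is_schema(hints.get(name)):
                hints[name].validate(value)      # check the incoming data
        result = func(*args, **kwargs)
        if is_schema(hints.get("return")):
            hints["return"].validate(result)     # check what the function produced
        return result

    return wrapper


@validate_annotations
def score(patients: RawPatients) -> ScoredPatients:
    """Attach a toy risk score to each row."""
    return [{**row, "risk": row["age"] / 120} for row in patients]


scored = score([{"age": 60}])

try:
    score([{"age": 150}])   # violates the input schema
except ValueError as exc:
    validation_error = str(exc)
```

Because the output is validated as well as the input, a bug inside `score` that produced an out-of-range risk would be reported at the function boundary, with the schema, row, and column named in the error.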

“You can easily integrate data frame types with your pipelines and get informative errors if something goes wrong,” Bantilan said.

Source: InApps.net

Phu Nguyen