Google AI Introduces Croissant: A Metadata Format for Machine Learning-Ready Datasets

When building machine learning (ML) models from existing datasets, practitioners must first familiarize themselves with the data, decipher its structure, and decide which subset to use as features. This effort is large enough that a basic barrier, the wide variety of data formats, is slowing progress in ML.

Text, structured data, images, audio, and video are just a few of the content types found in ML datasets. Even among datasets that cover the same subject matter, there is no standard layout of files or data formats. This obstacle lowers productivity throughout ML development, from data discovery to model training. It also makes it harder to build essential tools for working with large datasets.

Dataset metadata can be expressed in various formats, including schema.org and DCAT. Unfortunately, these formats were not designed with ML data in mind. ML data has unique requirements, such as combining and extracting data from structured and unstructured sources, carrying metadata that enables responsible data use, and describing ML usage characteristics like training, test, and validation sets.

Google has recently introduced Croissant, a new metadata format for ML-ready datasets. Along with the format specification, example datasets, and an open-source Python library for validating, consuming, and generating Croissant metadata, the 1.0 release of Croissant also includes an open-source visual editor for loading, inspecting, and intuitively creating Croissant dataset descriptions.

Although it offers a consistent way to describe and organize data, the Croissant format does not change the data's actual representation (such as image or text file formats). Croissant is an extension of schema.org, the de facto standard for publishing structured data on the web, which is already used by over 40 million datasets. Croissant adds comprehensive layers for data resources, default ML semantics, metadata, and data management to make that standard more relevant to ML.
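To make the layering concrete, here is a minimal sketch of what a Croissant-style description might look like when built as a Python dictionary and serialized to JSON-LD. The dataset name, file path, and field names are hypothetical, and the structure is deliberately simplified: it is an illustration of the idea, not a complete or validated Croissant file. The official specification defines the full context and required properties.

```python
import json


def build_croissant_sketch() -> dict:
    """Return a simplified, Croissant-style description (illustrative only)."""
    return {
        # Simplified context; the real Croissant context is longer.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        # Dataset-level metadata, inherited from schema.org.
        "@type": "sc:Dataset",
        "name": "toy_reviews",  # hypothetical dataset
        "description": "A tiny example dataset of product reviews.",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        # Resource layer: the files that hold the data, left in their own format.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "reviews.csv",
                "name": "reviews.csv",
                "contentUrl": "data/reviews.csv",  # hypothetical path
                "encodingFormat": "text/csv",
            }
        ],
        # Structure layer: how the raw files map to typed records.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "name": "reviews",
                "field": [
                    {"@type": "cr:Field", "name": "text", "dataType": "sc:Text"},
                    {"@type": "cr:Field", "name": "label", "dataType": "sc:Integer"},
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_croissant_sketch(), indent=2))
```

Note that the CSV file itself is untouched; the description only points at it and declares how its contents should be interpreted.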

From the start, a primary objective of the Croissant initiative was to promote Responsible AI (RAI). In addition, the team announced the first release of the Croissant RAI vocabulary extension. This extension enhances Croissant by adding properties that describe various RAI use cases. These include data life cycle management, labeling, participatory data, ML safety and fairness evaluation, explainability, compliance, and more.

Dataset repositories and search engines can use the metadata to help users find the right dataset. The data resources and organization information make tools for data cleaning, refining, and analysis easier to build. Thanks to this metadata and the default ML semantics, ML frameworks can use the data for model training and testing with very little code. Taken as a whole, these improvements significantly reduce the burden of data development.
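As a rough illustration of why declared fields reduce loading code, the sketch below reads CSV rows into typed records using nothing but a field declaration. Everything here is a hypothetical stand-in: the `column` key is a simplification of Croissant's real source mapping, and `iter_records` is not part of the Croissant Python library.

```python
import csv
import io

# Hypothetical mapping from declared data types to Python converters.
CONVERTERS = {"sc:Text": str, "sc:Integer": int, "sc:Float": float}


def iter_records(record_set: dict, csv_text: str):
    """Yield one typed dict per CSV row, guided only by the field declarations."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield {
            field["name"]: CONVERTERS[field["dataType"]](row[field["column"]])
            for field in record_set["field"]
        }


# Simplified record set: each field names its source column and its type.
RECORD_SET = {
    "name": "reviews",
    "field": [
        {"name": "text", "dataType": "sc:Text", "column": "review"},
        {"name": "label", "dataType": "sc:Integer", "column": "stars"},
    ],
}

CSV_TEXT = "review,stars\ngreat,5\npoor,1\n"

records = list(iter_records(RECORD_SET, CSV_TEXT))
```

The loader contains no dataset-specific logic: renaming columns or changing types only requires editing the declaration, which is the kind of saving the metadata is meant to provide.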

Dataset authors also care about the discoverability and use of their datasets. Thanks to the available generation tools and support from ML data platforms, adopting Croissant increases the value of their datasets with minimal effort.

Users can inspect and modify the metadata with the Croissant editor UI (available on GitHub).

By analyzing the data a user provides, the Croissant editor UI can automatically generate a large portion of the Croissant metadata. Important metadata fields, such as RAI properties, can then be filled in before users publish their datasets.

Users can make the Croissant information easily discoverable and reusable by publishing it on their dataset website.

Croissant metadata is generated automatically when users upload their data to a Croissant-compatible repository (e.g., OpenML, Kaggle, or Hugging Face).
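The article does not describe how the editor or the repositories derive metadata from uploaded data. As a hedged illustration of the general idea only, the sketch below guesses field declarations from a CSV sample. The function names and the type-guessing rule are assumptions made for this example and do not reflect how any of the named tools work.

```python
import csv
import io


def guess_type(values):
    """Return the narrowest schema.org-style type that fits every value."""
    for type_name, caster in (("sc:Integer", int), ("sc:Float", float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue  # at least one value did not parse; try the next type
    return "sc:Text"


def infer_fields(csv_text):
    """Build a list of field declarations from the header and rows of a CSV sample."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return [
        {"name": column, "dataType": guess_type([row[column] for row in rows])}
        for column in columns
    ]


fields = infer_fields("review,stars,score\ngreat,5,4.5\npoor,1,2\n")
```

A draft produced this way would still need human review, which matches the article's point that fields such as RAI properties are filled in by the user afterwards.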

Major tools and repositories already support the format. Kaggle, Hugging Face, and OpenML, three popular ML dataset collections, begin supporting the Croissant format today. Users can search for Croissant datasets on the web with the Dataset Search tool. TensorFlow, PyTorch, and JAX, three popular ML frameworks, can load Croissant datasets easily through the TensorFlow Datasets (TFDS) package.

The researchers strongly recommend that platforms hosting datasets make Croissant files available for download and embed Croissant information in dataset web pages, which helps dataset search engines find them more easily. Data analysis and labeling tools, along with other tools that help users work with ML datasets, should also consider adding support for Croissant datasets. By working together, the team believes, the community can reduce the burden of data development and pave the way for a more robust ML research and development ecosystem.

Check out the Blog and Project. All credit for this research goes to the researchers of this project.

Dhanshree Shenwai is a Computer Science Engineer with experience in FinTech companies covering the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier.
