Blog by Sabri Eyuboglu ([email protected]), Arjun Desai ([email protected]), Karan Goel ([email protected])

This blog post introduces Meerkat, a new Python library for wrangling complex datasets across stages of the ML lifecycle. You can find the project on GitHub.

Data is the lifeblood of machine learning. From training and validation data to predictions, embeddings, metadata, and more; it drives all parts of the machine learning development process. Organizing and managing all of this data is challenging, but it's critical for making machine learning work in practice.

This blog post introduces Meerkat, a new data library that we're building to help practitioners and researchers wrangle their data. Fair warning, Meerkat is pretty young (the project is less than 3 months old), and it certainly doesn't address all of the data problems that arise in ML. In this post, we'll highlight the problems that we think are particularly exciting and that Meerkat is designed to address. We'll talk about where we are with Meerkat, and where we hope Meerkat is headed.

Meerkat is motivated by a few trends in ML, many of which have directly impacted our own research:

  1. **Dataset manipulation (**e.g. slicing, shaping, and transforming datasets) ****has become an increasingly important part of the development process. We've realized that the quality of our models and evaluations are primarily products of the data we feed into them, so we're spending more time tuning datasets than tuning models.
  2. Model evaluation is emerging as a new bottleneck when building high-performing ML systems. Models have been commoditized to the extent that resources like Huggingface's fantastic Model Hub can give you a model for text, speech or vision in seconds. But while models are easy to build, they're hard to get right (Taori *et al.,* McCoy *et al.,* Badgeley et al.), and their failure modes can be opaque.
  3. Multi-modal datasets that combine multiple, complex data types are becoming more prevalent (e.g. CLIP from OpenAI combines natural language with images).

So what can Meerkat do? Meerkat provides the DataPanel abstraction, a data structure that takes inspiration from Pandas and the DataFrame. The DataPanel facilitates interactive dataset manipulation, can house diverse data modalities, and lets you evaluate models carefully with Robustness Gym. We built DataPanels like DataFrames because they're naturally interactive and work seamlessly across development contexts: Jupyter Notebooks, Python scripts, and Streamlit. Like them, we hope the Meerkat DataPanel can be an interactive data substrate for modern machine learning across all stages of the ML lifecycle.

Table of Contents

Desiderata for an ML data structure

Of course, there are other data structures out there that we could use to manage our machine learning data. Why don't they suffice?

Below we outline a set of desiderata informed by use cases encountered throughout the ML development pipeline. Popular data structures typically fall into two camps: (i) those that support complex data types and multiple modalities (e.g. PyTorch Dataset, TensorFlow Dataset – desiderata 1-3) and (ii) those that support manipulation and interaction (e.g. Pandas DataFrame – desiderata 4-6). With the Meerkat DataPanel, we support all of these desiderata in one data structure.

  1. Can store complex datatypes (e.g. images, graphs, videos, and time-series).
    Any data substrate for modern ML should support the wide range of complex datatypes that practitioners use. Common data abstractions like the PyTorch Dataset and TensorFlow Dataset give users the flexibility to work with (essentially) any complex data, by allowing arbitrary __getitem__ and __len__ implementations.
  2. Supports datasets that are larger-than-RAM with efficient I/O under-the-hood. Modern ML datasets are enormous (e.g. Kinetics, MIMIC-CXR, ImageNet), routinely exceeding hundreds of gigabytes. Holding this data in memory is not an option, so large objects should be loaded from disk only when they're needed (e.g. for a pass through the model during training). The PyTorch Dataset and TensorFlow Dataset support flexible data storage formats, and use multiprocessing to avoid bottlenecking on I/O. Notably, the Pandas DataFrame **does not** support datasets that are larger-than-RAM, with extensions such as Dask and Modin explicitly designed to ease this restriction.
  3. Supports multi-modal datasets. Increasingly, ML models are trained using a combination of different data modalities – sometimes with very different semantics. For instance, support for multi-modal datasets is critical in medicine, where patients are represented by a combination of structured data, medical images, doctor’s notes, time-series, and lab results. In addition to supporting complex datatypes (desideratum 1), it's important that we can hold these diverse datatypes in a single data structure. Since the PyTorch Dataset and TensorFlow Dataset support an arbitrary __getitem__, they implicitly support multi-modal data. However, an interactive, user-friendly data structure would benefit from making the support for multi-modality more explicit (e.g. by storing each modality in a separate column, each with an assigned type).
  4. Supports data creation and manipulation. In the data abstractions provided by popular deep learning frameworks, datasets are largely static objects. However, in our experience, over the course of model development, datasets change: we pull in "artifacts" like model predictions, embeddings, new labels and metadata. Because we use these artifacts to perform additional training (e.g. á la Liu et al.) or perform fine-grained error analysis down the line (e.g. Goel et al.), it's important that they are stored in an organized manner alongside the original data. In the Pandas API, functions like apply and map allow users to create new columns from existing ones and operations like concat and merge (i.e. database-style join) facilitate organizing all these columns with a relational data model. We'd like our data structure to bring these data wrangling features of Pandas to large datasets and complex data types.
  5. Supports data selection. When performing fine-grained error analysis, we routinely slice and dice our datasets as we look for subsets of examples with specific properties. For example, we might like to understand how model performance varies as a function of the time of day, so we query our dataset for all the images taken at night. In Pandas, these kinds of data selection operations are straightforward to implement with DataFrame and Series indexing. On the other hand, the PyTorch Dataset and TensorFlow Dataset do not provide out-of-the-box support for this sort of indexing.