Hugging Face is both a platform and a community focused on machine learning, particularly natural language processing (NLP), computer vision, and audio processing. Often referred to as the “GitHub of AI,” it hosts the vast majority of models and datasets shared in the AI community. It also provides tools that make downloading models and datasets simple: with the Hugging Face Python libraries, the task takes just a few lines of code. For instance, you can use huggingface_hub to load models and the datasets library to load datasets.
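
As a concrete illustration of both libraries (the repo IDs here are just examples):

from huggingface_hub import hf_hub_download
from datasets import load_dataset

# Download a single file from a model repository into the local cache.
config_path = hf_hub_download(repo_id="gpt2", filename="config.json")

# Download and load an entire dataset in one line.
squad = load_dataset("squad")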

In this post, I won’t delve into using the Hugging Face Python libraries to load models or datasets, given the abundance of tutorials available online. Instead, I’ll address a slightly different challenge. While the datasets library can effortlessly load an entire dataset with a single line of code, I don’t want the complete dataset stored on my computer. My goal is to obtain the file list of a dataset without downloading the actual data. Below is a piece of code that loads a dataset:

from datasets import load_dataset

def loadDataset():
    # Loading a dataset by name downloads all of its data files into the local cache.
    dataName = "tiiuae/falcon-refinedweb"
    dt = load_dataset(dataName)
    print("done")

The load_dataset function performs multiple tasks: it loads the dataset metadata and file lists, then downloads the entire dataset. Given that this dataset is approximately 1.6TB, the download takes a very long time, which is a real problem when I’m only interested in the dataset information and not the actual data.

Since there isn’t a straightforward way to stop the data download from Hugging Face, I examined the source code of the datasets library and interrupted the download manually by setting a breakpoint in VS Code. When I invoke load_dataset, it calls load_dataset_builder, which in turn calls dataset_module_factory. Importantly, dataset_module_factory acquires all the dataset information without initiating the data download, so setting a breakpoint right after this call is the appropriate interruption point.

(Screenshots: breakpoints set in load_dataset_builder and dataset_module_factory)

At this stage, the file list for the dataset has been successfully loaded. I can effortlessly retrieve and write it into a file using the “Debug Console” in VS Code. With over 5000 files in this dataset, I can now leverage other tools to perform batch downloads.
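
In fact, load_dataset_builder is itself part of the public datasets API, so the metadata can also be fetched without a debugger. A minimal sketch (whether config.data_files is already populated here may depend on the dataset and the library version, so treat that as an assumption):

from datasets import load_dataset_builder

# Resolves metadata and the file list, but downloads no data files.
builder = load_dataset_builder("tiiuae/falcon-refinedweb")
print(builder.info.description)
print(builder.config.data_files)  # maps split names to resolved file URLs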

Files in Dataset

The data files are stored in the following variable:

dataset_module.builder_configs_parameters.builder_configs[0].data_files

So run the following code to save the file list to a file. Remember to execute it in the debug console while the program is paused at a breakpoint, after dataset_module has been created by dataset_module_factory.

    with open("starcoderdata.json", mode="w", encoding="utf-8") as f:
        for ds in dataset_module.builder_configs_parameters.builder_configs[0].data_files['train']:
            ds = ds.replace("hf://", "https://huggingface.co/")
            f.write(re.sub(r'@[^/\s]+/', "/resolve/main/", ds))
            f.write("\n")
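
To make the rewrite concrete, here is a hypothetical before/after pair (the revision hash is made up for illustration):

hf://datasets/tiiuae/falcon-refinedweb@0123abc/data/train-00000.parquet
https://huggingface.co/datasets/tiiuae/falcon-refinedweb/resolve/main/data/train-00000.parquet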

Deep in DatasetBuilder

Upon exploring DatasetBuilder more thoroughly, I found that its behavior needs testing across more datasets, because it differs from dataset to dataset. For example, the dataset “bigcode/starcoderdata” returns the full file list in data_files, but the dataset “math-ai/AutoMathText” does not. The difference is that AutoMathText provides multiple config names in metadata_configs. I can print all the config names with the following code (run in the debug console while the program is paused at the breakpoint):

for name in dataset_module.builder_configs_parameters.metadata_configs.keys():
    print(name)

The output will be like this:

web-0.50-to-1.00
web-0.60-to-1.00
web-0.70-to-1.00
web-0.80-to-1.00
web-full
arxiv-0.50-to-1.00
arxiv-0.60-to-1.00
arxiv-0.70-to-1.00
arxiv-0.80-to-1.00
arxiv-full
code-0.50-to-1.00
code-python-0.50-to-1.00
code-python-0.60-to-1.00
code-python-0.70-to-1.00
code-python-0.80-to-1.00
code-full

However, the starcoderdata dataset doesn’t provide any names in metadata_configs. Therefore, getting the full file list becomes a manual job: before recording all the files, I need to check whether metadata_configs offers multiple names to select from, as sketched below.
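
A sketch of that check, run in the debug console at the same breakpoint (the attribute names follow the ones inspected above):

params = dataset_module.builder_configs_parameters
if params.metadata_configs:
    # Multiple named configs: pick one (e.g. "code-full") before collecting files.
    print("available configs:", list(params.metadata_configs.keys()))
else:
    # No named configs: the first builder config carries the full file list.
    print("splits:", list(params.builder_configs[0].data_files.keys()))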

New Injection Point for Code

After several more debugging iterations, I identified a better place to inject the code that saves the file list. The right file is datasets/builder.py, and the right function is _create_builder_config.
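
Rather than editing the library in place, the same effect can be approximated with a monkey patch that wraps _create_builder_config and dumps the resolved file list. This is a sketch built on assumptions: _create_builder_config is a private method whose signature and return value (in some versions a (config, config_id) tuple) vary across datasets releases, so the code tries to handle either shape.

import datasets.builder as db

_original = db.DatasetBuilder._create_builder_config

def _patched(self, *args, **kwargs):
    result = _original(self, *args, **kwargs)
    # Depending on the library version, the result may be a (config, config_id) tuple.
    config = result[0] if isinstance(result, tuple) else result
    data_files = getattr(config, "data_files", None)
    if data_files:
        with open("file_list.txt", "w", encoding="utf-8") as f:
            for split, files in data_files.items():
                for url in files:
                    f.write(f"{split}\t{url}\n")
    return result

db.DatasetBuilder._create_builder_config = _patched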

Set Name in Dataset

For datasets like AutoMathText, we need to specify a config name when loading the dataset. The config name can be found on the dataset page, or in metadata_configs as mentioned above. Here is the code to specify the config name when loading the dataset:

from datasets import load_dataset, DownloadConfig

def loadDataset():
    dataName = "math-ai/AutoMathText"
    dc = DownloadConfig(proxies={
            "http": "http://127.0.0.1:7890",
            "https": "http://127.0.0.1:7890"
        })
    # "name" selects one of the configs listed in metadata_configs.
    dt = load_dataset(dataName, name="code-full", download_config=dc)
    print("done")
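
With name="code-full", only the files belonging to that config are resolved and downloaded, rather than the whole repository.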

Set Proxy in DownloadConfig

Sometimes you might need a proxy to access Hugging Face datasets. To configure one, use the following code.

from datasets import load_dataset, DownloadConfig

def loadDataset():
    dataName = "tiiuae/falcon-refinedweb"
    dc = DownloadConfig(proxies={
            "http": "http://127.0.0.1:7890",
            "https": "http://127.0.0.1:7890"
        })
    dt = load_dataset(dataName, download_config=dc)
    print("done")
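
Alternatively, since the downloads ultimately go through the requests library, the standard proxy environment variables should also work; this is an assumption about your environment rather than a documented datasets feature:

import os

# Set these before any Hugging Face call; requests picks them up automatically.
os.environ["HTTP_PROXY"] = "http://127.0.0.1:7890"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:7890"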

Set Token in DownloadConfig

Not all datasets are public, so it’s better to set a token in DownloadConfig. You can check the Hugging Face documentation for how to obtain a token, then set it with the following code:

from datasets import load_dataset, DownloadConfig

def loadDataset():
    dataName = "bigcode/starcoderdata"
    dc = DownloadConfig(proxies={
            "http": "http://127.0.0.1:7890",
            "https": "http://127.0.0.1:7890"
        }, token="hf_*********")
    dt = load_dataset(dataName, download_config=dc)
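
As an alternative to passing the token on every call, huggingface_hub provides a login helper that stores the token for all subsequent calls:

from huggingface_hub import login

# Stores the token locally so later load_dataset calls can authenticate.
login(token="hf_*********")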

Once you have the list of dataset files, you can use the following command to download each file:

curl -L --header "Authorization: Bearer hf_***" -o train-00000-of-00004.parquet https://hf-mirror.com/datasets/bigcode/starcoderdata/resolve/main/yaml/train-00000-of-00004.parquet
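
If you would rather script the whole batch than run curl once per file, here is a minimal Python sketch under the same assumptions (the URL list written earlier, one URL per line, and a valid token):

import os
import requests

headers = {"Authorization": "Bearer hf_*********"}  # replace with your token

with open("starcoderdata.json", encoding="utf-8") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        filename = url.rsplit("/", 1)[-1]
        if os.path.exists(filename):
            continue  # naive resume: skip files that are already present
        resp = requests.get(url, headers=headers, stream=True)
        resp.raise_for_status()
        with open(filename, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)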