Learn how to access the datasets on the Hugging Face Hub and how to load them remotely using DuckDB and the Datasets library

As an AI platform, Hugging Face builds, trains, and deploys state-of-the-art open source machine learning models. In addition to hosting all these trained models, Hugging Face also hosts datasets (https://huggingface.co/datasets) that you can use in your own projects.
In this article, I'll show you how to access the datasets on Hugging Face and how to download them programmatically onto your local computer. Specifically, I'll show you how to:
- load the datasets remotely using DuckDB’s support for httpfs
- stream the datasets using the Datasets library by Hugging Face
The Hugging Face Datasets server is a lightweight web API for visualizing all the different types of datasets stored on the Hugging Face Hub. You can use the provided REST API to query datasets stored on the Hub. The following sections provide a short tutorial on what you can do with the API at https://datasets-server.huggingface.co/.
Getting a list of datasets hosted on the Hub
To get a list of datasets that you can retrieve from Hugging Face, use the following statement with the valid endpoint:
$ curl -X GET "https://datasets-server.huggingface.co/valid"
You will see a JSON result as shown below:
The datasets that work without errors are listed under the valid key in the result. An example of a valid dataset above is 0-hero/OIG-small-chip2.
Validating a dataset
To validate a dataset, use the following statement with the is-valid endpoint along with the dataset parameter:
$ curl -X GET "https://datasets-server.huggingface.co/is-valid?dataset=0-hero/OIG-small-chip2"
If the dataset is valid, you will see the following result:
{"valid":true}
Getting the list of configurations and splits of a dataset
A dataset typically has splits (training set, validation set, and test set). It may also have configurations, which are sub-datasets within a larger dataset.
Configurations are common for multilingual speech datasets. For more details on splits, visit https://huggingface.co/docs/datasets-server/splits.
To get the splits of a dataset, use the following statement with the splits endpoint and the dataset parameter:
$ curl -X GET "https://datasets-server.huggingface.co/splits?dataset=0-hero/OIG-small-chip2"
The following result will be returned:
{
"splits": [
{
"dataset":"0-hero/OIG-small-chip2",
"config":"0-hero--OIG-small-chip2",
"split":"train"
}
],
"pending":[],
"failed":[]
}
For this dataset, there is only a single train split.
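You can query the splits endpoint from Python in the same way; the following sketch (my own addition) prints each configuration/split pair:
import requests

# List the configurations and splits of a dataset (splits endpoint)
r = requests.get("https://datasets-server.huggingface.co/splits?dataset=0-hero/OIG-small-chip2")
for s in r.json()["splits"]:
    print(s["config"], s["split"])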
Here is an example of a dataset (“duorc”) that has multiple splits and configurations:
{
"splits": [
{
"dataset": "duorc",
"config": "SelfRC",
"split": "train",
"num_bytes": 239852925,
"num_examples": 60721
},
{
"dataset": "duorc",
"config": "SelfRC",
"split": "validation",
"num_bytes": 51662575,
"num_examples": 12961
},
{
"dataset": "duorc",
"config": "SelfRC",
"split": "test",
"num_bytes": 49142766,
"num_examples": 12559
},
{
"dataset": "duorc",
"config": "ParaphraseRC",
"split": "train",
"num_bytes": 496683105,
"num_examples": 69524
},
{
"dataset": "duorc",
"config": "ParaphraseRC",
"split": "validation",
"num_bytes": 106510545,
"num_examples": 15591
},
{
"dataset": "duorc",
"config": "ParaphraseRC",
"split": "test",
"num_bytes": 115215816,
"num_examples": 15857
}
]
}
Obtaining the first 100 rows
Often, you want to retrieve a subset of the dataset for examination purposes instead of downloading everything. The following statement uses the first-rows endpoint with the dataset, config, and split parameters to extract the first 100 rows of the specified dataset:
$ curl -X GET "https://datasets-server.huggingface.co/first-rows?dataset=0-hero/OIG-small-chip2&config=0-hero--OIG-small-chip2&split=train"
The JSON result contains the first 100 rows of the specified dataset:
{
"dataset": "0-hero/OIG-small-chip2",
"config": "0-hero--OIG-small-chip2",
"split": "train",
"features": [{
"feature_idx": 0,
"name": "user",
"type": {
"dtype": "string",
"_type": "Value"
}
}, {
"feature_idx": 1,
"name": "chip2",
"type": {
"dtype": "string",
"_type": "Value"
}
}],
"rows": [{
"row_idx": 0,
"row": {
"user": "I've heard that it's a good idea to have a will. What is a will?nn",
"chip2": "A will is a legal document that specifies how your property should be distributed after you die. It can also specify who should care for any children or other dependents you may have. It's important to make sure that your will is valid and up-to-date, since the laws governing wills vary from state to state."
},
"truncated_cells": []
}, {
"row_idx": 1,
"row": {
"user": "How do I discover a job?nn",
"chip2": "The very best solution to discover a job is to create a powerful resume and canopy letter that highlights your skills, experience, and qualifications. You may as well search online for job openings in your area of experience or contact local employers on to inquire about available positions. Networking with people in the identical field or industry may also be helpful to find employment opportunities."
},
"truncated_cells": []
}, {
"row_idx": 2,
"row": {
"user": "Produce an in depth written description of a dark scene within a mansion with a domed ceiling, pillars, and a door.nn",
"chip2": "Circular in shape, the ground is totally created from square-cut sections of black marble. Large, blackened marble pillars line the partitions, ending with a domed glass ceiling high overhead. A narrow gap on the far side of the room results in a sweeping stairwell with a mithglin-plated door beneath it."
},
"truncated_cells": []
}, {...
}, {
"row_idx": 98,
"row": {
"user": "I want to learn methods to budget my money higher. What advice are you able to give me?nn",
"chip2": "Probably the greatest ways to budget your money higher is to make a budget and track your spending. Start by writing down all of your sources of income and all of your expenses. Go over each item closely to find out what you'll be able to in the reduction of on. Prioritize the things that should be paid first, comparable to rent and utilities, and check out to be certain that you’re setting aside some money for savings. Moreover, setting spending limits for yourself can assist you to stay on course. Finally, repeatedly review your budget to be certain that you might be staying inside your budgeting goals."
},
"truncated_cells": []
}, {
"row_idx": 99,
"row": {
"user": "Produce an in depth written description of a serene scene outside a mountain road with monir trees and a boulder.nn",
"chip2": "A big granite boulder, apparently fallen from the mountain's craggy cliffs in some distant age, lies beside the cobblestoned road. Surrounded by a thick stand of monir trees, the boulder provides a convenient resting place for travellers, and a transient respite from the trials of the journey."
},
"truncated_cells": []
}]
}
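If you want to examine these rows in Python, one possible approach (a sketch, not part of the original article) is to load them into a Pandas DataFrame:
import requests
import pandas as pd

# Fetch the first 100 rows and load them into a DataFrame for inspection
r = requests.get("https://datasets-server.huggingface.co/first-rows?dataset=0-hero/OIG-small-chip2&config=0-hero--OIG-small-chip2&split=train")
rows = [item["row"] for item in r.json()["rows"]]
df = pd.DataFrame(rows)
print(df.head())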
Getting a slice of the dataset
You can also get a slice of the dataset. The following statement uses the rows endpoint with the dataset, config, split, offset, and length parameters to extract three rows of the specified dataset, starting from the third row:
$ curl -X GET "https://datasets-server.huggingface.co/rows?dataset=0-hero/OIG-small-chip2&config=0-hero--OIG-small-chip2&split=train&offset=2&length=3"
The result will look like this:
{
"features": [{
"feature_idx": 0,
"name": "user",
"type": {
"dtype": "string",
"_type": "Value"
}
}, {
"feature_idx": 1,
"name": "chip2",
"type": {
"dtype": "string",
"_type": "Value"
}
}],
"rows": [{
"row_idx": 2,
"row": {
"user": "Produce a detailed written description of a gloomy scene inside of a mansion with a domed ceiling, pillars, and a door.nn",
"chip2": "Circular in shape, the floor is completely made from square-cut sections of black marble. Large, blackened marble pillars line the walls, ending with a domed glass ceiling high overhead. A narrow gap on the far side of the room leads to a sweeping stairwell with a mithglin-plated door beneath it."
},
"truncated_cells": []
}, {
"row_idx": 3,
"row": {
"user": "I even have a sprained ankle and wish to know what I can do to assist it heal faster.nn",
"chip2": "Chances are you'll wish to try some ice therapy, which is able to help reduce swelling and inflammation. Moreover, you can use an elastic bandage or compression wrap to offer additional support on your ankle. It is also essential to remain off of it as much as possible, in order that it could heal properly."
},
"truncated_cells": []
}, {
"row_idx": 4,
"row": {
"user": "How do you roast a turkey?nn",
"chip2": "To roast a turkey, start by preheating your oven to 350 degrees F. Place the turkey in a roasting pan, and rub it with butter or olive oil. Sprinkle with salt and pepper, and every other herbs or seasonings you want. Place the turkey within the oven and cook for roughly 20 minutes per pound, or until the interior temperature registers 165 degrees. Once the turkey is cooked, remove it from the oven and let it sit for 10-Quarter-hour before carving."
},
"truncated_cells": []
}]
}
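Because the rows endpoint accepts offset and length, you can also page through a dataset in chunks. Here is a rough sketch (my own, not from the original article) that fetches the first three pages of 100 rows each:
import requests

# Page through the dataset 100 rows at a time using offset and length
BASE = "https://datasets-server.huggingface.co/rows?dataset=0-hero/OIG-small-chip2&config=0-hero--OIG-small-chip2&split=train"
for offset in range(0, 300, 100):
    batch = requests.get(f"{BASE}&offset={offset}&length=100").json()["rows"]
    print(f"offset {offset}: fetched {len(batch)} rows")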
Getting the Parquet files of a dataset
While the datasets on the Hugging Face Hub can be published in a wide range of formats (CSV, JSONL, etc.), the Datasets server automatically converts all public datasets to the Parquet format. The Parquet format offers significant performance improvements, especially for large datasets, as later sections will show.
Apache Parquet is a file format designed to support fast data processing for complex data. For more information on Parquet, read my earlier article:
To load the dataset in Parquet format, use the following statement with the parquet endpoint and the dataset parameter:
$ curl -X GET "https://datasets-server.huggingface.co/parquet?dataset=0-hero/OIG-small-chip2"
The above statement returns the following JSON result:
{
"parquet_files": [{
"dataset": "0-hero/OIG-small-chip2",
"config": "0-hero--OIG-small-chip2",
"split": "train",
"url": "https://huggingface.co/datasets/0-hero/OIG-small-chip2/resolve/refs%2Fconvert%2Fparquet/0-hero--OIG-small-chip2/parquet-train.parquet",
"filename": "parquet-train.parquet",
"size": 51736759
}],
"pending": [],
"failed": []
}
Specifically, the value of the url key specifies the location where you can download the dataset in Parquet format, which is https://huggingface.co/datasets/0-hero/OIG-small-chip2/resolve/refs%2Fconvert%2Fparquet/0-hero--OIG-small-chip2/parquet-train.parquet in this example.
Now that you have seen how to use the Datasets server REST API, let's see how you can download the datasets programmatically.
In Python, the easiest way is to use the requests library:
import requestsr = requests.get("https://datasets-server.huggingface.co/parquet?dataset=0-hero/OIG-small-chip2")
j = r.json()
print(j)
The result of the json() function is a Python dictionary:
{
'parquet_files': [
{
'dataset': '0-hero/OIG-small-chip2',
'config': '0-hero--OIG-small-chip2',
'split': 'train',
'url': 'https://huggingface.co/datasets/0-hero/OIG-small-chip2/resolve/refs%2Fconvert%2Fparquet/0-hero--OIG-small-chip2/parquet-train.parquet',
'filename': 'parquet-train.parquet',
'size': 51736759
}
],
'pending': [],
'failed': []
}
Using this dictionary result, you can use a list comprehension to find the URL for the dataset in Parquet format:
urls = [f['url'] for f in j['parquet_files'] if f['split'] == 'train']
urls
The urls variable is a list containing the URLs for the dataset's training split:
['https://huggingface.co/datasets/0-hero/OIG-small-chip2/resolve/refs%2Fconvert%2Fparquet/0-hero--OIG-small-chip2/parquet-train.parquet']
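With the URL in hand, you don't strictly need DuckDB; as a simple alternative (a sketch that assumes the pyarrow or fastparquet package is installed), you can download the Parquet file with requests and load it with Pandas:
import io

import pandas as pd
import requests

# Download the Parquet file into memory and load it with Pandas
resp = requests.get(urls[0])
df = pd.read_parquet(io.BytesIO(resp.content))   # needs pyarrow or fastparquet
print(df.shape)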
Downloading the Parquet file using DuckDB
If you use DuckDB, you can use it to load a dataset remotely.
If you are new to DuckDB, you can read up on the basics in this article:
First, make sure you install DuckDB if you have not already done so:
!pip install duckdb
Then, create a DuckDB instance and install httpfs:
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
The httpfs extension is a loadable extension that implements a file system allowing reading and writing of remote files.
Once httpfs is installed and loaded, you can load the Parquet dataset from the Hugging Face Hub using a SQL query:
con.sql(f'''
SELECT * from '{urls[0]}'
''').df()
The df() function above converts the result of the query to a Pandas DataFrame:
One great feature of Parquet is that it stores data in a columnar format. So if your query requests only a single column, only that column is downloaded to your computer:
con.sql(f'''
SELECT "user" from '{urls[0]}'
''').df()
In the above query, only the “user” column is downloaded:
This Parquet feature is particularly useful for large datasets; imagine the time and space you can save by downloading only the columns you need.
In some cases, you don't even need to download the data at all. Consider the following query:
con.sql(f'''
SELECT count(*) from '{urls[0]}'
''').df()
No data needs to be downloaded, as this request can be fulfilled just by reading the metadata of the Parquet file.
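Similarly, you can inspect the schema of the remote Parquet file without pulling any row data; the following sketch (my own addition) uses DuckDB's DESCRIBE, which should only need to read the file's metadata:
# Inspect the column names and types of the remote Parquet file
con.sql(f'''
DESCRIBE SELECT * FROM '{urls[0]}'
''').df()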
Here is another example of using DuckDB to download another dataset ("mstz/heart_failure"):
import requestsr = requests.get("https://datasets-server.huggingface.co/parquet?dataset=mstz/heart_failure")
j = r.json()
urls = [f['url'] for f in j['parquet_files'] if f['split'] == 'train']
con.sql(f'''
SELECT "user" from '{urls[0]}'
''').df()
This dataset has 299 rows and 13 columns:
We could perform some aggregation on the age column:
con.sql(f"""
SELECT
SUM(IF(age<40,1,0)) AS 'Under 40',
SUM(IF(age BETWEEN 40 and 49,1,0)) AS '40-49',
SUM(IF(age BETWEEN 50 and 59,1,0)) AS '50-59',
SUM(IF(age BETWEEN 60 and 69,1,0)) AS '60-69',
SUM(IF(age BETWEEN 70 and 79,1,0)) AS '70-79',
SUM(IF(age BETWEEN 80 and 89,1,0)) AS '80-89',
SUM(IF(age>89,1,0)) AS 'Over 89',
FROM '{urls[0]}'
"""
).df()
Here’s the result:
Using the result, we could also create a bar plot:
con.sql(f"""
SELECT
SUM(IF(age<40,1,0)) AS 'Under 40',
SUM(IF(age BETWEEN 40 and 49,1,0)) AS '40-49',
SUM(IF(age BETWEEN 50 and 59,1,0)) AS '50-59',
SUM(IF(age BETWEEN 60 and 69,1,0)) AS '60-69',
SUM(IF(age BETWEEN 70 and 79,1,0)) AS '70-79',
SUM(IF(age BETWEEN 80 and 89,1,0)) AS '80-89',
SUM(IF(age>89,1,0)) AS 'Over 89',
FROM '{urls[0]}'
"""
).df().T.plot.bar(legend=False)
Using the Datasets library
To make working with data from Hugging Face easy and efficient, Hugging Face has its own Datasets library (https://github.com/huggingface/datasets).
To install the datasets library, use the pip command:
!pip install datasets
The load_dataset() function loads the specified dataset:
from datasets import load_dataset

dataset = load_dataset('0-hero/OIG-small-chip2',
                       split='train')
When you load the dataset for the first time, the entire dataset (in Parquet format) is downloaded to your computer:
The type of the returned dataset is datasets.arrow_dataset.Dataset. So what can you do with it? First, you can convert it to a Pandas DataFrame:
dataset.to_pandas()
You can also get the first row of the dataset by using an index:
dataset[0]
This will return the first row of the data:
{
'user': "I've heard that it's an excellent idea to have a will. What's a will?nn",
'chip2': "A will is a legal document that specifies how your property must be distributed after you die. It may possibly also specify who should take care of any children or other dependents you might have. It is important to be certain that that your will is valid and up-to-date, for the reason that laws governing wills vary from state to state."
}
There are a bunch of other things you can do with this datasets.arrow_dataset.Dataset object. I'll leave it to you to explore further.
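As a quick taste, here is a minimal sketch (my own addition, not from the original article) using the Dataset object's map() and filter() methods:
# Add a column with the prompt length, then keep only the short prompts
def add_length(example):
    return {"user_len": len(example["user"])}

with_len = dataset.map(add_length)
short_prompts = with_len.filter(lambda ex: ex["user_len"] < 100)
print(short_prompts.num_rows)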
Streaming the dataset
Again, when dealing with large datasets, it is not feasible to download the entire dataset to your computer before you do anything with it. In the previous section, calling the load_dataset() function downloaded the entire dataset onto my computer:
This particular dataset took up 82.2 MB of disk space. You can imagine the time and disk space needed for larger datasets.
Fortunately, the Datasets library supports streaming. Dataset streaming lets you work with a dataset without downloading it; the data is streamed as you iterate over the dataset. To use streaming, set the streaming parameter to True in the load_dataset() function:
from datasets import load_dataset

dataset = load_dataset('0-hero/OIG-small-chip2',
                       split='train',
                       streaming=True)
The type of dataset is now datasets.iterable_dataset.IterableDataset, instead of datasets.arrow_dataset.Dataset. So how do you use it? You can use the iter() function on it, which returns an iterator object:
i = iter(dataset)
To get a row, call the next() function, which returns the next item in the iterator:
next(i)
You will now see the first row as a dictionary:
{
'user': "I've heard that it's an excellent idea to have a will. What's a will?nn",
'chip2': "A will is a legal document that specifies how your property must be distributed after you die. It may possibly also specify who should take care of any children or other dependents you might have. It is important to be certain that that your will is valid and up-to-date, for the reason that laws governing wills vary from state to state."
}
Calling the next() function on i again will return the next row:
{
'user': 'How do I find a job?\n\n',
'chip2': 'The best way to find a job is to create a strong resume and cover letter that highlights your skills, experience, and qualifications. You can also search online for job openings in your area of expertise or contact local employers directly to inquire about available positions. Networking with people in the same field or industry can also be helpful in finding employment opportunities.'
}
And so on.
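Rather than calling next() repeatedly, you can also iterate over the streamed dataset directly; here is a small sketch (my own) that takes just the first five rows:
from itertools import islice

# Stream the first five rows without downloading the dataset
for row in islice(dataset, 5):
    print(row["user"][:60])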
Shuffling the dataset
You can also shuffle the dataset by using the shuffle() function on the dataset variable, like this:
shuffled_dataset = dataset.shuffle(seed = 42,
buffer_size = 500)
In the above example, say your dataset has 10,000 rows. The shuffle() function will randomly select examples from a buffer filled with the first 500 rows; as examples are drawn, the buffer is refilled with subsequent rows from the dataset.
By default, the buffer size is 1,000.
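To see the effect, you can pull a few rows from the shuffled stream; here is a small sketch (my own addition) using the take() method of IterableDataset:
# Pull three rows from the shuffled, streamed dataset
for row in shuffled_dataset.take(3):
    print(row["user"])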
Other tasks
You can perform more tasks using streaming, such as:
- splitting the dataset
- interleaving datasets (combining two datasets by alternating rows between them)
- modifying the columns of a dataset
- filtering a dataset
Take a look at https://huggingface.co/docs/datasets/stream for more details.
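As an illustration of the last two items, here is a minimal sketch (my own, not from the original article) that filters a streamed dataset and interleaves it with the original stream:
from datasets import interleave_datasets, load_dataset

# Stream the dataset, keep only rows whose prompt mentions "budget",
# then alternate rows between the filtered stream and the original stream
ds = load_dataset('0-hero/OIG-small-chip2', split='train', streaming=True)
budget_rows = ds.filter(lambda ex: 'budget' in ex['user'].lower())
mixed = interleave_datasets([budget_rows, ds])
print(next(iter(mixed)))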
If you like reading my articles and they have helped your career or studies, please consider signing up as a Medium member. It's $5 a month, and it gives you unlimited access to all the articles (including mine) on Medium. If you sign up using the following link, I'll earn a small commission (at no additional cost to you). Your support means I'll be able to devote more time to writing articles like this.
In this article, I have shown you how to access the datasets stored on the Hugging Face Hub. Because the datasets are stored in Parquet format, you can access them remotely without having to download the entire dataset. You can access the datasets either using DuckDB or using the Datasets library provided by Hugging Face.