To upload your dataset, please follow these steps:
1. Prepare Your Dataset
Your dataset should be uploaded to Hugging Face, ideally with the following columns:
- instruction: A clear and concise instruction for each data point. If your original data contains an input key (common in formats like Alpaca), you must concatenate the input value with the instruction value, using a \n as a separator.
- output: The expected output based on the instruction.
- id (optional): A unique ID for each sample.
- Q_scores (optional): Columns for various normalized scores about Q (instruction), including Clarity, Coherence, Completeness, Complexity, Correctness, Meaningfulness, Difficulty, Deita_Complexity, Thinking_Prob.
- QA_scores (optional): Columns for various normalized scores about QA (instruction + output), including Clarity, Coherence, Completeness, Complexity, Correctness, Meaningfulness, Relevance, IFD, Deita_Quality, Reward_Model, Fail_Rate, A_Length.
Q_scores and QA_scores can be efficiently calculated using our OpenDataArena-Tool Data Scorer. More details about each score can be found in the OpenDataArena-Tool Data Scorer documentation. See example_upload.jsonl for an example.
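The column layout above can be produced from Alpaca-style records with a short conversion script. The following is a minimal sketch, not an official tool: the helper names `to_upload_record` and `write_jsonl` are hypothetical, the integer `id` is just one possible choice of unique ID, and the concatenation order (instruction first, then input) is an assumption, since the guide only specifies that the two values are joined with a \n.

```python
import json


def to_upload_record(sample, sample_id=None):
    """Convert one Alpaca-style dict into the instruction/output layout."""
    instruction = sample["instruction"]
    extra_input = sample.get("input", "")
    if extra_input:
        # Assumed order: instruction first, then input, joined by a newline.
        instruction = instruction + "\n" + extra_input
    record = {"instruction": instruction, "output": sample["output"]}
    if sample_id is not None:
        # The id column is optional; any unique value per sample would do.
        record["id"] = sample_id
    return record


def write_jsonl(records, path):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Illustrative samples only; replace with your own data.
    alpaca_samples = [
        {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
        {"instruction": "Name a prime number.", "input": "", "output": "7"},
    ]
    records = [to_upload_record(s, sample_id=i) for i, s in enumerate(alpaca_samples)]
    write_jsonl(records, "my_dataset.jsonl")
```

Samples with an empty or missing input key keep their instruction unchanged, so no trailing newline is added.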
2. Upload Your Dataset
You will upload your dataset directly to the Hugging Face Hub. For a detailed, step-by-step guide on the upload process, please refer to the official Hugging Face documentation:
https://huggingface.co/docs/hub/datasets-adding
A simple guide for uploading a JSONL file is provided below.
First, install the required packages and log in to Hugging Face:
Bash Commands
    pip install huggingface_hub datasets
    huggingface-cli login  # log in to your Hugging Face account
    python upload_your_ds.py  # run the upload script shown below
Python Code (upload_your_ds.py)
    from datasets import load_dataset

    # --- Configuration ---
    # Your Hugging Face username or organization name and desired dataset name.
    # Replace 'your-username' with your actual username or organization.
    repo_id = "your-username/my-awesome-viewable-dataset"

    # The path to your local JSONL file.
    local_file_path = "my_dataset.jsonl"

    # --- Load the local JSONL file into a Dataset ---
    dataset = load_dataset('json', data_files=local_file_path, split='train')
    print(f"Dataset loaded from {local_file_path}:")
    print(dataset)
    print(dataset[0])

    # --- Push the dataset to Hugging Face Hub ---
    print(f"Attempting to push dataset to {repo_id}...")
    dataset.push_to_hub(repo_id)
    print(f"Successfully pushed dataset to '{repo_id}'")
    print(f"You can view your dataset here: https://huggingface.co/datasets/{repo_id}")
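Before pushing, it may help to check that every line of the JSONL file parses and carries the expected columns. The sketch below is an optional pre-flight check, not part of the official workflow: the function name `find_invalid_lines` is hypothetical, and treating instruction and output as required non-empty strings is an assumption (the guide describes these columns as the ideal layout rather than as a hard requirement).

```python
import json

# Assumed to be required; the optional columns (id, scores) are not checked.
REQUIRED_COLUMNS = ("instruction", "output")


def find_invalid_lines(lines):
    """Return (line_number, reason) pairs for lines that do not fit the layout."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((line_number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "not a JSON object"))
            continue
        for column in REQUIRED_COLUMNS:
            value = record.get(column)
            if not isinstance(value, str) or not value.strip():
                problems.append((line_number, f"missing or empty '{column}'"))
    return problems
```

A possible use is to open the local file, pass the file object to `find_invalid_lines`, and run the upload script only if the returned list is empty.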
3. Information