When you create a dataset, clearly describe the following:
Data collection (e.g., sources of the data).
Preprocessing, if performed (e.g., scripts that you wrote, existing tools that you used).
Annotation scheme and guidelines, with justification, if annotation was conducted.
People involved in this process (e.g., annotators, survey subjects).
Quality of the created data (e.g., inter-annotator agreement).
Statistics and analysis of the original, preprocessed, and annotated data.
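As an illustration of the data quality item, one commonly reported inter-annotator agreement measure for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following is only a minimal sketch, assuming two annotators assigned one categorical label to each of the same items; the function name `cohen_kappa` and the sample labels are hypothetical and do not come from any particular dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    labels_a and labels_b are equal-length sequences of categorical labels
    (hypothetical input format chosen for this sketch).
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty set of items.")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label at random,
    # based on each annotator's own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # Degenerate case: both annotators used one identical label for every item.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations for four items.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

With more than two annotators or with missing labels, a different measure (e.g., Fleiss' kappa or Krippendorff's alpha) is usually more appropriate; whichever measure is used, report it together with the number of annotators and the number of items they labeled in common.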
Here are a few papers presenting new datasets:
Competence-Level Prediction and Resume & Job Description Matching Using Context-Aware Transformer Models, Li et al., EMNLP 2020 (see Section 3).
FriendsQA: Open-Domain Question Answering on TV Show Transcripts, Yang and Choi, SIGDIAL 2019 (see Section 3).