Write the Experiments section in your individual overleaf project.
Recommended length: 100 - 200 lines (including tables and figures).
Submit the PDF version of your current draft up to the Experiments section.
Data Section: are the sources and choices of the datasets reasonably explained? (1 point)
Data Split: are the statistics of training/development/evaluation or cross-validation sets distinctly illustrated? (2 points)
Model Descriptions: are the models designed to soundly distinguish the differences in methods? (2 points)
Evaluation Metrics: are the evaluation metrics clearly explained? (2 points)
Experimental Settings: are the settings described in a way that readers can replicate the experiments? (1 point)
Model Development: is the model development progress depicted? (1 point)
Result Tables: are the experimental results evidently summarized in tables? (2 points)
Result Interpretations: are the key findings from the results convincingly interpreted? (2 points)
This section describes models used for comparative study.
Describe existing models or frameworks commonly adopted by your models (if any):
All our models adopt MODEL (CITATION) as the encoder.
List all models used for your experiments. Give a brief description of each model by referencing specific sections explaining the core methods used by the model:
The following three models are used in our experiments:
BASELINE: DESCRIPTION (Section #)
ADVANCED: BASELINE + METHOD (Section #)
BEST: ADVANCED + METHOD (Section #)
It is important to design models in a way that clearly shows key differences in methods.
Because a neural model produces a different result every time it is trained, you need to train it 3 ~ 5 times and report its average score with the standard deviation (Section 5.3).
Why would a neural model produce a different result every time it is trained?
Thus, indicate how many times each model is trained and what is used as the evaluation metric(s):
Every model is trained 3 times, and its average F1-score with the standard deviation is used as the evaluation metric.
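As a minimal sketch (in Python) of aggregating such scores, assuming the F1-scores from the repeated runs have already been collected; the values shown are placeholders, not real results:

    import statistics

    # Placeholder F1-scores from three training runs of the same model,
    # each run started with a different random seed (not real results).
    f1_scores = [0.842, 0.851, 0.837]

    mean_f1 = statistics.mean(f1_scores)
    std_f1 = statistics.stdev(f1_scores)  # sample standard deviation

    # Reported as "mean ± std" in the result table.
    print(f"F1: {mean_f1:.3f} ± {std_f1:.3f}")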
If you experiment on datasets used in previous work, you must use the same evaluation metrics for fair comparisons. Even if you present a new metric, you still need to evaluate with both the old and new metrics to show the advantage of your new metric.
If you use a non-standard metric that has not been used in previous work because:
The task is new,
The new aspect introduced for this task has never been tested before,
You find a better way of evaluating the task, which has not been used in previous work,
explain why you cannot apply standard metrics to evaluate this task and describe the new metric:
Since TASK has not been evaluated on ASPECT(S) in previous work, we introduce new metrics ...
If your new evaluation metric is novel enough to deserve greater attention, explain it in the approach section so that it is considered one of the main contributions.
Describe hyper-parameters used to build the models (e.g., epoch, learning rate, hidden layer, optimizer, batch size):
MODEL is trained for # epochs using the learning rate of FLOAT, ...
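One lightweight way to keep these settings accurate for the write-up is to save them alongside each checkpoint; the sketch below is only illustrative, and the names and values are placeholders rather than recommendations:

    import json

    # Placeholder hyper-parameter settings; record whatever your model actually uses.
    hparams = {
        "epochs": 20,
        "learning_rate": 3e-5,
        "hidden_size": 768,
        "optimizer": "AdamW",
        "batch_size": 32,
        "random_seed": 42,
    }

    # Saving the settings next to the checkpoint makes them easy to report
    # in the paper and to replicate later.
    with open("hparams.json", "w") as f:
        json.dump(hparams, f, indent=2)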
Explain anything special that you do for training:
Early stopping is adopted to control the number of epochs: training is stopped if the score on the development set does not improve over two consecutive epochs.
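A minimal sketch of such an early-stopping loop, where train_one_epoch and evaluate_dev are assumed to be callables from your own training code rather than from any particular library:

    def train_with_early_stopping(train_one_epoch, evaluate_dev, max_epochs=50, patience=2):
        """Stop training once the dev score has not improved for `patience` epochs."""
        best_score, stale_epochs = float("-inf"), 0
        for epoch in range(max_epochs):
            train_one_epoch()
            score = evaluate_dev()
            if score > best_score:
                best_score, stale_epochs = score, 0
            else:
                stale_epochs += 1
                if stale_epochs >= patience:
                    break  # early stop: no dev improvement over `patience` epochs
        return best_score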
Describe computing devices used for the experiments:
Our experiments use NVIDIA Titan RTX GPUs; training the BASELINE/ADVANCED/BEST models takes 10/20/30 hours, respectively.
It is important to describe the experimental settings for replication, although they are often put in the appendix due to the page limit for the actual paper submission.
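If it helps, the device name and wall-clock training time can be captured programmatically; the sketch below assumes PyTorch, and run_training is a hypothetical placeholder for your own training entry point:

    import time

    import torch

    # Record the GPU model and wall-clock training time for the experimental settings.
    device = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
    start = time.time()
    # run_training()  # your own training entry point goes here
    hours = (time.time() - start) / 3600
    print(f"Device: {device}, training time: {hours:.1f} hours")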
If you observe enhanced training efficiency (e.g., your new loss function requires fewer epochs to train), create a figure (e.g., x-axis: epochs, y-axis: accuracy) describing the training processes of the baseline and the enhanced models.
Our ENHANCED MODEL reaches the same (or higher) accuracy as the BASELINE MODEL after only a third of the epochs.
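A minimal matplotlib sketch for such a figure; the per-epoch dev accuracies below are placeholders to be replaced with your own logged values:

    import matplotlib.pyplot as plt

    # Placeholder dev-accuracy curves; replace with the scores logged per epoch
    # for your BASELINE and ENHANCED models.
    epochs = list(range(1, 11))
    baseline_acc = [0.60, 0.65, 0.69, 0.72, 0.74, 0.75, 0.76, 0.76, 0.77, 0.77]
    enhanced_acc = [0.70, 0.75, 0.77, 0.77, 0.78, 0.78, 0.78, 0.78, 0.78, 0.78]

    plt.plot(epochs, baseline_acc, marker="o", label="BASELINE")
    plt.plot(epochs, enhanced_acc, marker="s", label="ENHANCED")
    plt.xlabel("Epochs")
    plt.ylabel("Dev accuracy")
    plt.legend()
    plt.savefig("training_efficiency.pdf", bbox_inches="tight")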
If you experience unusual phenomena during training (e.g., results on the development set are unstable), describe the phenomena and analyze why they are happening:
This section presents experimental results.
Create a table displaying experimental results from your models on each dataset and evaluation metric. The table should also include results from previous work directly comparable to yours.
If the table is too large (e.g., taking more than 1/3 of the page), it may overwhelm the readers. In this case, shrink it by including only the critical results and put the rest in the appendix.
Here are a few tips for creating the result table:
Expand it to the full page width if it consists of many columns.
Use acronyms for the header titles if too long, and explain them in the caption.
Highlight the key results by making them bold.
Sometimes, it makes more sense to use multiple tables to present your results (e.g., working on multiple tasks), in which case, use a consistent scheme across the tables so they can be easily compared.
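If you generate the table programmatically, the bold highlighting can be added while the LaTeX rows are produced; a small sketch, using placeholder numbers rather than real results:

    # Placeholder results (mean F1 and standard deviation over repeated runs).
    results = [
        ("BASELINE", 0.812, 0.004),
        ("ADVANCED", 0.835, 0.003),
        ("BEST", 0.848, 0.002),
    ]

    best_f1 = max(f1 for _, f1, _ in results)
    for name, f1, std in results:
        cell = f"{f1:.3f} $\\pm$ {std:.3f}"
        if f1 == best_f1:
            cell = f"\\textbf{{{cell}}}"  # bold the best score
        print(f"{name} & {cell} \\\\")  # one LaTeX table row per model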
Once the result table is presented, you need to give an interpretation of the results. First, summarize the overall observations:
Each model shows an incremental improvement over its predecessor.
MODEL 2 shows a noticeable improvement over MODEL 1, indicating the effectiveness of our METHOD.
The ADVANCED MODEL shows a significant improvement of #.#% over the BASELINE MODEL.
Then, describe any key findings:
It is interesting that MODEL 2 shows better performance than MODEL 1 on DATASET 1, but the results are the opposite on DATASET 2.
It is likely because METHOD works well for ASPECTS in DATASET 1, but not necessarily for ASPECTS in DATASET 2 (Section #.#).
In general, high-level interpretations are provided in the Experiments section whereas more detailed analyses are provided in the Analysis section. These two sections, however, can be merged into one if the space is limited.
Finally, explain any additional results that are not included in the table but help readers interpret this work better:
It is worth mentioning that we also experimented with METHOD 1, which showed results similar to METHOD 2.
The interpretation should not simply read off the table. The main goal of the interpretation is to provide insights that are not obvious to readers from the table alone, but that you have learned over the course of this study.
If space allows, include both the average scores and standard deviations. The standard deviation is usually notated with the plus-minus sign (e.g., #.# ± #.#).
Give an interpretation for each key finding (and indicate a specific subsection in the section where further analysis is provided):
This section describes datasets used for the experiments.
If there are datasets for your task that have been widely used by previous work, you need to test your models on them for fair comparisons. For each dataset, briefly describe the dataset:
For our experiments, DATASET is used (CITATION) to evaluate our MODEL.
If it is not widely used but appropriate to evaluate your approach, explain why it is selected:
For our MODEL, the DATASET is used (CITATION) because REASON(S).
If it is a new dataset that you create, briefly mention it and reference the section describing how the dataset is created (e.g., Data Creation):
All our models are evaluated on the EXISTING DATASET (CITATION) as well as YOUR DATASET (Section #).
It is always better to experiment with multiple datasets to validate the generalizability of your approach.
Your work is not comparable to previous work unless it uses the same data split. Indicate which previous work you follow to split the training, development, and evaluation sets:
The same split as CITATION is used for our experiments.
During the development (including hyper-parameter tuning), your model should be trained on the training set and tested on the development set (aka. validation set).
While you are training, you should frequently check the performance of your model (usually once every epoch) and save the model that gives the highest performance on the development set.
Once the training is done, your best model (on the development set) is tested on the evaluation set, and the results are reported in the paper.
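A minimal sketch of this select-on-dev, report-on-evaluation workflow; all callables (train_one_epoch, evaluate, save_checkpoint, load_checkpoint) are assumed to come from your own training code:

    def develop_and_evaluate(train_one_epoch, evaluate, save_checkpoint, load_checkpoint,
                             num_epochs=20):
        """Keep the checkpoint with the best dev score, then test it once on the evaluation set."""
        best_dev = float("-inf")
        for epoch in range(num_epochs):
            train_one_epoch()
            dev_score = evaluate("dev")
            if dev_score > best_dev:
                best_dev = dev_score
                save_checkpoint("best_model")
        load_checkpoint("best_model")
        return evaluate("eval")  # the score reported in the paper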
If a new dataset is used, you need to create your own split. If the dataset has:
< 1K instances, use k-fold cross-validation for evaluation (e.g., k = 5 or 10).
[1K, 10K] instances, use the 75/10/15 split for the training/development/evaluation sets.
> 10K instances, use the 80/10/10 split.
It is important to create training, development, and evaluation sets that follow similar distributions (in terms of labels, etc.) as the entire data.
Cross-validation is typically used for development rather than for evaluation. However, when the data is not sufficiently large, the evaluation set becomes too small, which can cause high variance in the model performance measured on that set. Thus, cross-validation is used to average out this variance.
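A sketch of how such splits could be produced with scikit-learn, assuming a single-label classification dataset given as parallel lists of instances and labels (the function name is hypothetical):

    from sklearn.model_selection import StratifiedKFold, train_test_split

    def split_dataset(instances, labels, seed=42):
        """Stratified splits following the size-based guideline above."""
        n = len(instances)
        if n < 1000:
            # Small data: stratified k-fold cross-validation (k = 5 here);
            # returns a (train indices, evaluation indices) pair per fold.
            kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
            return list(kf.split(instances, labels))
        if n <= 10000:
            held_out, eval_share = 0.25, 0.60   # 75/10/15 overall
        else:
            held_out, eval_share = 0.20, 0.50   # 80/10/10 overall
        train, rest, _, y_rest = train_test_split(
            instances, labels, test_size=held_out, stratify=labels, random_state=seed)
        dev, evl, _, _ = train_test_split(
            rest, y_rest, test_size=eval_share, stratify=y_rest, random_state=seed)
        return train, dev, evl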
Once you split the data, create a table describing the distribution and statistics of each set. This table should include all necessary statistics to help researchers understand your experimental results (e.g., how many entities per sentence for named entity recognition, how many questions per document for question answering).
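As a small sketch of gathering such statistics, assuming a dictionary that maps each split name to its list of labels (add task-specific statistics, such as entities per sentence, as needed):

    from collections import Counter

    def split_statistics(splits):
        """Print the size and label distribution of each split."""
        for name, labels in splits.items():
            counts = Counter(labels)
            dist = ", ".join(f"{label}: {c / len(labels):.1%}"
                             for label, c in sorted(counts.items()))
            print(f"{name}: {len(labels)} instances ({dist})")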
If it is unclear how the data is split from the previous work, contact the authors. If the authors do not respond, create your own split and describe the datasets.