Results

This section presents experimental results.

Tables

Create a table displaying experimental results from your models on each dataset and evaluation metric. The table should also include results from previous work directly comparable to yours.

Excerpted from Xu and Choi, EMNLP 2020.

If the table is too large (e.g., taking more than 1/3 of the page), it may overwhelm the readers. In this case, shrink it by including only the critical results and put the rest in the appendix.

Here are a few tips for creating the result table:

  • Expand it to the full page width if it consists of many columns.

  • Use acronyms for the header titles if they are too long, and explain them in the caption.

  • If space allows, include both the average scores and standard deviations. The standard deviation is usually denoted by the plus-minus sign (e.g., ±0.1).

Sometimes it makes more sense to use multiple tables to present your results (e.g., when working on multiple tasks); in that case, use a consistent scheme across the tables so they can be easily compared.
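
If the table is generated programmatically, a small helper can keep the mean, the plus-minus notation, and the bolding of key results consistent across cells. The sketch below is only an illustration: it assumes LaTeX output, and the scores are placeholder values rather than real results.

```python
# Format result-table cells as "mean ± std" and bold the best score (LaTeX).
# The scores below are placeholders for illustration only, not real results.
from statistics import mean, stdev

runs = {                       # hypothetical F1-scores from three runs each
    "BASELINE": [74.9, 75.3, 75.1],
    "ADVANCED": [76.8, 77.2, 77.0],
}

def cell(scores, bold=False):
    text = f"{mean(scores):.1f} $\\pm$ {stdev(scores):.1f}"
    return f"\\textbf{{{text}}}" if bold else text

best = max(runs, key=lambda model: mean(runs[model]))
for model, scores in runs.items():
    print(f"{model} & {cell(scores, bold=(model == best))} \\\\")
```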

Interpretations

Once the result table is presented, you need to give an interpretation of the results. First, summarize the overall observations:

Each model shows an incremental improvement over its predecessor.

MODEL 2 shows a noticeable improvement over MODEL 1, indicating the effectiveness of our METHOD.

The ADVANCED MODEL shows a significant improvement of #.#% over the BASELINE MODEL.

Then, describe any key findings:

It is interesting that MODEL 2 shows better performance than MODEL 1 on DATASET 1, but the results are the opposite on DATASET 2.

Give an interpretation for each key finding (and indicate a specific subsection in the section where further analysis is provided):

It is likely because METHOD works well for ASPECTS in DATASET 1, but not necessarily for ASPECTS in DATASET 2 (Section #.#).

In general, high-level interpretations are provided in the Experiments section whereas more detailed analyses are provided in the Analysis section. These two sections, however, can be merged into one if the space is limited.

Finally, explain any additional results that are not included in the table but help readers interpret this work better:

It is worth mentioning that we also experimented with METHOD 1, which showed a similar result to METHOD 2.

The interpretation should not simply read off the table. Its main goal is to provide insights that are not obvious to the readers from the table alone, but that you have learned over the course of this study.

Highlight the key results by making them bold.

5.4. Homework

Individual Writing

  • Write the Experiments section in your individual Overleaf project.

  • Recommended length: 100 - 200 lines (including tables and figures).

  • Submit the PDF version of your current draft up to the Experiments section.

Rubric

  • Data Section: are the sources and choices of the datasets reasonably explained? (1 point)

  • Data Split: are the statistics of training/development/evaluation or cross-validation sets distinctly illustrated? (2 points)

  • Model Descriptions: are the models designed to soundly distinguish differences in methods? (2 points)

  • Evaluation Metrics: are the evaluation metrics clearly explained? (2 points)

  • Experimental Settings: are the settings described in a way that readers can replicate the experiments? (1 point)

  • Model Development: is the model development progress depicted? (1 point)

  • Result Tables: are the experimental results evidently summarized in tables? (2 points)

  • Result Interpretations: are the key findings from the results convincingly interpreted? (2 points)

    Datasets

    This section describes datasets used for the experiments.

    Data Selection

    If there are datasets for your task that have been widely used by previous work, you need to test your models on them for fair comparisons. For each dataset, briefly describe it:

    For our experiments, DATASET (CITATION) is used to evaluate our MODEL.

    If it is not widely used but appropriate to evaluate your approach, explain why it is selected:

    For our MODEL, DATASET (CITATION) is used because REASON(S).

    If it is a new dataset that you create, briefly mention it and reference the section describing how the dataset is created (e.g., Data Creation):

    All our models are evaluated on the EXISTING DATASET (CITATION) as well as YOUR DATASET (Section #).

    It is always better to experiment with multiple datasets to validate the generalizability of your approach.

    Data Split

    Your work is not comparable to previous work unless it uses the same data split. Indicate which previous work you follow to split the training, development, and evaluation sets:

    The same split as CITATION is used for our experiments.

    During the development (including hyper-parameter tuning), your model should be trained on the training set and tested on the development set (aka. validation set).

    While you are training, you should frequently check the performance of your model (usually once every epoch) and save the model that gives the highest performance on the development set.

    Once the training is done, your best model (on the development set) is tested on the evaluation set, and the results are reported in the paper.
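
    The three notes above can be summarized in code. Below is a minimal sketch of this protocol, assuming a PyTorch-style setup; train_one_epoch and evaluate are hypothetical stand-ins for your own training and scoring code.

    ```python
    import torch

    def train_with_dev_selection(model, train_one_epoch, evaluate,
                                 train_loader, dev_loader, test_loader,
                                 num_epochs, path="best_model.pt"):
        """Check the development set every epoch and keep the best checkpoint."""
        best_dev = float("-inf")
        for _ in range(num_epochs):
            train_one_epoch(model, train_loader)     # fit on the training set
            dev_score = evaluate(model, dev_loader)  # check the development set
            if dev_score > best_dev:                 # save only the best model
                best_dev = dev_score
                torch.save(model.state_dict(), path)
        model.load_state_dict(torch.load(path))      # restore the best checkpoint
        return evaluate(model, test_loader)          # report this score in the paper
    ```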

    If a new dataset is used, you need to create your own split. If the dataset has:

    • < 1K instances, use n-fold cross-validation for evaluation (n = [4, 5]).

    • [1K, 10K] instances, use the 75/10/15 split for the training/development/evaluation sets.

    • > 10K instances, use the 80/10/10 split.

    It is important to create training, development, and evaluation sets that follow similar distributions (in terms of labels, etc.) as the entire data.
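
    A minimal sketch of such a split using scikit-learn is shown below, assuming parallel lists of inputs and labels; the stratify argument keeps the label distribution similar across the three sets, and the ratios can be adjusted for the 75/10/15 case.

    ```python
    from sklearn.model_selection import train_test_split

    def split_80_10_10(inputs, labels, seed=42):
        """Stratified 80/10/10 split into training/development/evaluation sets."""
        # First hold out 20% of the data for development + evaluation.
        x_train, x_rest, y_train, y_rest = train_test_split(
            inputs, labels, test_size=0.2, stratify=labels, random_state=seed)
        # Then split the held-out 20% in half: 10% development, 10% evaluation.
        x_dev, x_eval, y_dev, y_eval = train_test_split(
            x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
        return (x_train, y_train), (x_dev, y_dev), (x_eval, y_eval)
    ```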

    Cross-validation is typically used for development; however, when the data is not sufficiently large, the evaluation set becomes too small, which can cause a high variance in the model performance tested on that set. Thus, cross-validation is also used for evaluation to average out this variance.
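
    For the small-data case, below is a minimal sketch of n-fold cross-validation (n = 5 here) that averages the score over folds; build_model, train, and score are hypothetical stand-ins for your own code.

    ```python
    from statistics import mean, stdev
    from sklearn.model_selection import StratifiedKFold

    def cross_validate(inputs, labels, build_model, train, score, n=5, seed=42):
        """Return the mean and standard deviation of the score over n folds."""
        folds = StratifiedKFold(n_splits=n, shuffle=True, random_state=seed)
        fold_scores = []
        for train_idx, eval_idx in folds.split(inputs, labels):
            model = build_model()                     # fresh model per fold
            train(model, [inputs[i] for i in train_idx],
                  [labels[i] for i in train_idx])
            fold_scores.append(score(model, [inputs[i] for i in eval_idx],
                                     [labels[i] for i in eval_idx]))
        return mean(fold_scores), stdev(fold_scores)  # average out the variance
    ```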

    Once you split the data, create a table describing the distribution and statistics of each set. This table should include all necessary statistics to help researchers understand your experimental results (e.g., how many entities per sentence for named entity recognition, how many questions per document for question answering).
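
    A small helper along the lines below can collect such statistics per split; it assumes each split is stored as a list of (instance, label) pairs and only counts instances and labels, so task-specific statistics would be added on top.

    ```python
    from collections import Counter

    def split_statistics(splits):
        """Print instance counts and label distributions, e.g., for a data table."""
        # `splits` maps a split name to a list of (instance, label) pairs.
        for name, data in splits.items():
            label_counts = Counter(label for _, label in data)
            print(f"{name}: {len(data)} instances, labels: {dict(label_counts)}")
    ```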

    If it is unclear how the data is split from the previous work, contact the authors. If the authors do not respond, create your own split and describe the datasets.

    Models

    This section describes models used for comparative study.

    Descriptions

    Describe existing models or frameworks commonly adopted by your models (if any):

    All our models adopt MODEL (CITATION) as the encoder.

    List all models used for your experiments. Give a brief description of each model by referencing specific sections explaining the core methods used by the model:

    The following three models are experimented with:

    • BASELINE: DESCRIPTION (Section #)

    • ADVANCED: BASELINE + METHOD (Section #)

    • BEST: ADVANCED + METHOD (Section #)

    It is important to design models in a way that clearly shows key differences in methods.
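
    One way to make those differences explicit in code is to let every variant share the same implementation and differ only in configuration flags. The sketch below is a hypothetical illustration; the flag names should map onto the methods described in your approach section.

    ```python
    from dataclasses import dataclass

    @dataclass
    class ModelConfig:
        name: str
        use_method: bool = False         # the METHOD added by the ADVANCED model
        use_second_method: bool = False  # the extra METHOD added by the BEST model

    CONFIGS = [
        ModelConfig("BASELINE"),
        ModelConfig("ADVANCED", use_method=True),
        ModelConfig("BEST", use_method=True, use_second_method=True),
    ]
    ```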

    Evaluation Metrics

    Because a neural model produces a different result every time it is trained, you need to train it 3-5 times and report its average score with the standard deviation.

    Why would a neural model produce a different result every time it is trained?

    Thus, indicate how many times each model is trained and what is used as the evaluation metric(s):

    Every model is trained 3 times, and its average F1-score and standard deviation are used as the evaluation metrics.
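
    Below is a minimal sketch of this protocol assuming a PyTorch setup; the seeds control sources of randomness such as weight initialization, data shuffling, and dropout, and train_and_evaluate is a hypothetical stand-in that trains one model and returns its F1-score on the evaluation set.

    ```python
    import random
    from statistics import mean, stdev

    import numpy as np
    import torch

    def run_repeated(train_and_evaluate, seeds=(11, 22, 33)):
        """Train with several seeds and report the mean and std of the F1-score."""
        f1_scores = []
        for seed in seeds:
            random.seed(seed)        # Python-level randomness
            np.random.seed(seed)     # NumPy-level randomness (e.g., shuffling)
            torch.manual_seed(seed)  # weight initialization, dropout, etc.
            f1_scores.append(train_and_evaluate())
        return mean(f1_scores), stdev(f1_scores)
    ```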

    If you experiment on datasets used in previous work, you must use the same evaluation metrics for fair comparisons. Even if you present a new metric, you still need to evaluate both the old and new metrics to show the advantage of your new metric.

    If you use a non-standard metric that has not been used in previous work because:

    • The task is new,

    • The new aspect introduced for this task has never been tested before,

    • You have found a better way of evaluating the task that has not been used in previous work,

    explain why you cannot apply standard metrics to evaluate this task and describe the new metric:

    Since TASK has not been evaluated on ASPECT(S) in previous work, we introduce new metrics ...

    If your new evaluation metric is novel enough to deserve greater attention, explain it in the approach section so that it is considered one of the main contributions.

    Experimental Settings

    Describe the hyper-parameters used to build the models (e.g., number of epochs, learning rate, hidden layer sizes, optimizer, batch size):

    MODEL is trained for # epochs using a learning rate of FLOAT, ...
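
    Keeping the hyper-parameters in one place makes it easy to report exactly the values that were used; the sketch below is a hypothetical configuration with placeholder values, not recommended settings.

    ```python
    # Placeholder hyper-parameters for illustration only, not recommended values.
    HYPERPARAMETERS = {
        "num_epochs": 20,       # number of training epochs
        "learning_rate": 1e-4,  # optimizer step size
        "hidden_size": 256,     # hidden layer dimension
        "optimizer": "AdamW",   # optimizer name
        "batch_size": 32,       # instances per training batch
    }
    ```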

    Explain anything special that you do for training:

    Early stopping is adopted to control the number of epochs: training terminates if the score on the development set does not improve over two consecutive epochs.
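
    A minimal sketch of that early-stopping rule is given below; train_one_epoch and evaluate_dev are hypothetical callbacks for one epoch of training and one development-set evaluation.

    ```python
    def train_with_early_stopping(train_one_epoch, evaluate_dev,
                                  max_epochs=50, patience=2):
        """Stop when the dev score has not improved for `patience` epochs."""
        best_score, epochs_without_gain = float("-inf"), 0
        for _ in range(max_epochs):
            train_one_epoch()
            dev_score = evaluate_dev()
            if dev_score > best_score:
                best_score, epochs_without_gain = dev_score, 0
            else:
                epochs_without_gain += 1
                if epochs_without_gain >= patience:  # no improvement over 2 epochs
                    break
        return best_score
    ```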

    Describe computing devices used for the experiments:

    Our experiments use NVIDIA Titan RTX GPUs, which take 10/20/30 hours to train the BASELINE/ADVANCED/BEST models, respectively.

    It is important to describe the experimental settings for replication, although they are often put in the appendix due to the page limit for the actual paper submission.

    Development

    If you observe enhanced training efficiency (e.g., your new loss function requires fewer epochs to train), create a figure (e.g., x-axis: epochs, y-axis: accuracy) describing the training processes of the baseline and the enhanced models.

    Our ENHANCED MODEL reaches the same (or higher) accuracy as the BASELINE model after only a third of the epochs.
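
    Such a figure can be produced with a few lines of matplotlib; in the sketch below, baseline_acc and enhanced_acc are hypothetical lists of per-epoch development-set accuracies.

    ```python
    import matplotlib.pyplot as plt

    def plot_training_curves(baseline_acc, enhanced_acc, path="training_curves.pdf"):
        """Plot development-set accuracy per epoch for both models."""
        plt.plot(range(1, len(baseline_acc) + 1), baseline_acc, label="BASELINE")
        plt.plot(range(1, len(enhanced_acc) + 1), enhanced_acc, label="ENHANCED")
        plt.xlabel("Epochs")
        plt.ylabel("Accuracy")
        plt.legend()
        plt.savefig(path, bbox_inches="tight")
    ```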

    If you experience unusual phenomena during training (e.g., results on the development set are unstable), describe the phenomena and analyze why they are happening.

    Experiments

    This chapter guides you through writing the Experiments section.

    Contents

    • Datasets

    • Models

    • Results