What distinguishes a good search-oriented conversation with an AI from a bad one? Evaluating conversations by a single aspect fails to produce a complete picture, as aspects like information relevance, faithfulness to sources, ease of understanding, and many more play a role. To assist researchers in evaluating their conversational search systems, this shared task calls on the community to collect comprehensive evaluation resources: example data of diverse search-oriented conversations, definitions and annotation guidelines for aspects to evaluate, and implemented evaluation metrics.

  1. Data. To capture the variety of search-oriented conversations (e.g., written or spoken, explorative or target-oriented), we seek contributions of one or more example conversations in English and in a common format.
  2. Aspect definitions. To compile a comprehensive overview of aspects by which to evaluate search-oriented conversations, we seek contributions of aspect definitions and guidelines for human ground-truth annotation.
  3. Metrics. To create a software package for the evaluation of search-oriented conversations, we seek software that attempts to automatically recreate human annotations for search-oriented conversations (see 2.).

You can submit both novel and already published resources to one or more of the three categories. The submission process (see below) is open to allow for feedback and discussions from the start. All contributors will be invited to co-author a publication (in SIGIR Forum) that summarizes the collected evaluation suite. In preparation for discussions at the SCAI workshop and in the paper, we expect every co-author to invest about two hours in February to annotate a few conversations according to the submitted guidelines. Moreover, we would like to discuss with you at the workshop the details of a follow-up competition in which search-oriented conversation systems are evaluated on the collected resources.

REGISTER to stay informed, discuss, and participate!

Schedule

  • October 2, 2023: Submission starts
  • January 15, 2024: Submission deadline for new aspect definitions (but can be updated until Jan. 29)
  • January 29, 2024: Submission deadline for data and metrics
  • February 5, 2024: Annotation period starts
  • February 23, 2024: Annotation period ends
  • March 4, 2024: Results and annotations release
  • TBD: SCAI workshop and planning for overview publication

Data submission

We seek contributions of search-oriented conversations (with or without AI) in English. Already published conversations are also welcome. Submissions must adhere to the following format, but the only required attributes are turns, utterance, and response. If your data contains attributes that are not shown here, we will expand this format upon your request. The current format is:

{
  "id": "ID of the conversation in the original dataset",
  "topic": "Phrase describing the topic of the conversation",
  "turns": [
    {
      "id": "ID of the turn in the original dataset",
      "utterance": "Text that was sent by user (markdown)",
      "response": "Text that was sent by the system in response to utterance (markdown)",
      "rewritten_utterance": [
        {
          "source": "Name of the method used to rewrite the text, or 'human' if some human rewrote it",
          "text": "The utterance (see above) rewritten to make it less context dependent"
        }
      ],
      "topic": "Phrase describing the topic of the turn",
      "provenance": [
        {
          "id": "ID of the text in the dataset",
          "dataset": "Name of the dataset of the text",
          "text": "The text used to create the response (among others)",
          "title": "The title of the text",
          "url": "An URL of the text"
        }
      ]
    }
  ]
}
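
For illustration, a minimal (invented) conversation record that uses only the required attributes could look like this:

{
  "turns": [
    {
      "utterance": "Who directed the movie Arrival?",
      "response": "Arrival (2016) was directed by Denis Villeneuve."
    }
  ]
}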

To submit data, prepare it as an ndjson file (practically identical to JSON Lines or LDJSON files), where each line contains one JSON object as shown above. Use our format checker to validate your file. Then submit it via TIRA (once submission opens), using a separate upload group for each dataset you provide. Upload new versions of the same dataset to the same upload group. Once uploaded, use the edit button to add a short description (at least one paragraph on the data source) to your dataset.
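
The sketch below is not our format checker; it only illustrates how such a file could be prepared, assuming your conversations are already Python dictionaries in the format above (all file names, function names, and example content are invented):

import json

REQUIRED_TURN_KEYS = {"utterance", "response"}  # the only required attributes besides "turns"


def check_conversation(conversation: dict) -> None:
    """Rough check that a conversation carries the required attributes."""
    turns = conversation.get("turns")
    if not isinstance(turns, list) or not turns:
        raise ValueError("every conversation needs a non-empty 'turns' list")
    for turn in turns:
        missing = REQUIRED_TURN_KEYS - turn.keys()
        if missing:
            raise ValueError(f"turn is missing required attributes: {missing}")


def write_ndjson(conversations: list, path: str) -> None:
    """Write one JSON object per line (ndjson)."""
    with open(path, "w", encoding="utf-8") as file:
        for conversation in conversations:
            check_conversation(conversation)
            file.write(json.dumps(conversation, ensure_ascii=False) + "\n")


# Invented example: a single one-turn conversation.
write_ndjson(
    [{"turns": [{"utterance": "Who wrote 'Dune'?",
                 "response": "The novel 'Dune' was written by Frank Herbert."}]}],
    "my-dataset.ndjson",
)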

To request extensions to the data format or ask questions on the submission process, create a topic in the SCAI Eval Data Submission forum (login provided upon registration).

The collected dataset is freely available, and an up-to-date version will be available for download here.

Aspect definition submission

We seek contributions of aspect definitions and associated annotation guidelines for creating a ground truth by annotating search-oriented conversations. To submit, create a topic in the SCAI Eval Aspect Definition Submission forum (login provided upon registration): this brings up a submission template that you need to fill in, including the rationale for evaluating the aspect, the unit of annotation (is it an aspect of a conversation or of a single turn?), whether multiple labels per unit are possible, the data attributes required for annotation (see data submission), the annotation guidelines, and examples for each label.

Expect to discuss your submission with others in the created topic, and have a look at the other submissions to provide feedback. If critical issues are raised, they need to be resolved before the annotation period. Our goal is to have a few conversations of the dataset annotated for every submitted aspect without critical issues. If this is not possible due to a very large number of aspect submissions, we will ask all contributors to vote on which aspects to annotate.

The collected guidelines will be made freely available in a GitHub repository when the annotation starts. We will accept pull requests to add new aspects or change existing ones. Moreover, we will provide information on how to load these guidelines into an annotation tool (which software we will use is still to be decided), so that everyone can easily reproduce our annotation setup.

Metric submission

We seek contributions of containerized metrics that classify conversations (in the format above) to recreate human annotations according to one of the submitted aspect definitions. To submit, create software that reads an ndjson file as shown above and writes a JSON file formatted as follows:

{
  "conversation-or-turn-id": [ "predicted-label" ]
}

Use the conversation ID or turn ID as the key, depending on whether the aspect’s unit of annotation is the whole conversation or single turns.
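
As an illustration of the expected input/output behavior (not a reference implementation), the following sketch implements a toy turn-level metric for a hypothetical aspect with the labels "short" and "long", assuming the turns carry the optional id attribute; a real metric would instead attempt to recreate the human annotations for one of the submitted aspects:

import json
import sys


def predict_label(turn: dict) -> str:
    """Toy decision: label responses by length (placeholder for a real metric)."""
    return "long" if len(turn["response"].split()) > 50 else "short"


def run(input_path: str, output_path: str) -> None:
    predictions = {}
    with open(input_path, encoding="utf-8") as file:
        for line in file:  # one conversation per line (ndjson)
            if not line.strip():
                continue
            conversation = json.loads(line)
            for turn in conversation["turns"]:
                # Turn-level aspect, so predictions are keyed by turn ID.
                predictions[turn["id"]] = [predict_label(turn)]
    with open(output_path, "w", encoding="utf-8") as file:
        json.dump(predictions, file, indent=2)


if __name__ == "__main__":
    run(sys.argv[1], sys.argv[2])  # e.g., python metric.py conversations.ndjson predictions.json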

You then need to submit the software as a Docker image to TIRA (how-to in preparation), which allows everyone to easily use your metric on all other datasets of the same format. We support LLM-based metrics by providing access to several LLMs through Chatnoir Chat, which allows you to switch between LLMs without changing your code and ensures reproducibility through automated response archiving. We are happy to support you in getting your software into a Docker image: please ask for help in the forum (login provided upon registration).

The final versions of the collected metrics will be made available for everyone to use. Moreover, we will provide a script to run all metrics on an input file. In cases where more than one metric is submitted for an aspect, we will compare their results on the annotated dataset as part of the planned overview publication.
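
The interface of this run-all-metrics script is still open. As a rough sketch of what it could look like, under the assumption that each metric is available as a local Docker image that takes its input and output paths as arguments (image names, paths, and the calling convention are invented here):

import pathlib
import subprocess

# Invented image names; the real list will follow from the collected metrics.
METRIC_IMAGES = ["scai-eval/metric-a", "scai-eval/metric-b"]


def run_all_metrics(input_file: str, output_dir: str) -> None:
    input_path = pathlib.Path(input_file).resolve()
    out_dir = pathlib.Path(output_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    for image in METRIC_IMAGES:
        output_path = out_dir / (image.replace("/", "-") + ".json")
        output_path.touch()  # the bind-mount target must exist as a file
        subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{input_path}:/input.ndjson:ro",
             "-v", f"{output_path}:/output.json",
             image, "/input.ndjson", "/output.json"],
            check=True,
        )


if __name__ == "__main__":
    run_all_metrics("conversations.ndjson", "predictions/")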

Task Committee

  • Johannes Kiesel, Bauhaus-Universität Weimar
  • Maik Fröbe, Friedrich-Schiller-Universität Jena
  • Andrea Papenmeier, University of Twente
  • Alexander Frummet, Universität Regensburg
  • Paul Owoicho, University of Glasgow