ChEMU - Cheminformatics Elsevier Melbourne University lab

The ChEMU lab series provides a unique opportunity for the development of information extraction tools over chemical patents. As the third running of the lab (following ChEMU 2020 and ChEMU 2021), ChEMU 2022 focuses on information extraction from chemical patents, comprising five tasks ranging from the document level to the expression level.

Overview of ChEMU 2022

Brought to you by the University of Melbourne natural language processing group in the School of Computing and Information Systems, the Elsevier Data Science Life Sciences team, and RMIT University, the ChEMU lab series provides an opportunity for the development of information extraction models over chemical patents.

News
  • 1 September 2022: Website is up!
Task 1: Expression-level information extraction
  • Task 1a - Named Entity Recognition: This task involves identifying chemical compounds as well as their types in context, i.e., to assign the label of a chemical compound according to the role which the compound plays within a chemical reaction.
  • Task 1b - Event Extraction: This task comprises event trigger detection and argument recognition over chemical reaction descriptions.
  • Task 1c - Anaphora Resolution: This task requires identification of references between expressions in chemical patents.
Task 2: Document-level information extraction
  • Task 2a - Chemical Reaction Reference Resolution: Given a chemical reaction snippet, the task aims to find similar chemical reactions and general conditions that it refers to.
  • Task 2b - Table Semantic Classification: This task is about categorising tables in chemical patents based on their contents.
Key Dates
  • 15 November 2021: Labs Registration opens
  • 16 May 2022: End of evaluation cycle and feedback for participants
  • 9 June 2022: Submission of CLEF 2022 Working Notes (participants)

All of the above deadlines are in the Anywhere on Earth (AoE) time zone.

Task 1a: Named Entity Recognition

In general, a chemical reaction is a process leading to the transformation of one set of chemical substances into another. Task 1a aims to identify chemical compounds and their specific types, i.e. to assign the label of a chemical compound according to the role which it plays within a chemical reaction. In addition to chemical compounds, this task also requires identification of the temperatures and reaction times at which the chemical reaction is carried out, as well as the yields obtained for the final chemical product and the label of the reaction.

In particular, we define 10 different entity type labels as shown in the following table.

Entity Type: Definition
EXAMPLE_LABEL: A label associated with a reaction specification.
REACTION_PRODUCT: A substance that is formed during a chemical reaction.
STARTING_MATERIAL: A substance consumed in the course of a chemical reaction, providing atoms to the products.
REAGENT_CATALYST: A compound added to a system to cause or help with a chemical reaction. Compounds such as catalysts, bases to remove protons, or acids to add protons must also be annotated with this tag.
SOLVENT: A chemical entity that dissolves a solute, resulting in a solution.
TIME: The reaction time of the reaction.
TEMPERATURE: The temperature at which the reaction is carried out.
YIELD_PERCENT: Yields given in percent values.
YIELD_OTHER: Yields provided in units other than percent.
OTHER_COMPOUND: Other chemical compounds that are not products, starting materials, reagents, catalysts, or solvents.

This task aims to identify entities with the above ten class labels. It also requires you to predict the boundaries of those entities.
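Predictions for this task are delivered in BRAT standoff format, where each predicted entity becomes a text-bound `T` line carrying its label, character offsets, and surface text. A minimal sketch of serialising such lines (the example entities and offsets below are made up for illustration):

```python
# Sketch: serialise predicted entities as BRAT standoff "T" lines.
# Each tuple is (label, start_offset, end_offset, surface_text).
def to_brat_lines(entities):
    lines = []
    for i, (label, start, end, text) in enumerate(entities, start=1):
        # BRAT text-bound format: T<id><TAB><label> <start> <end><TAB><text>
        lines.append(f"T{i}\t{label} {start} {end}\t{text}")
    return lines

# Hypothetical predictions for one snippet:
predictions = [
    ("STARTING_MATERIAL", 15, 31, "4-bromobenzamide"),
    ("YIELD_PERCENT", 120, 123, "88%"),
]
for line in to_brat_lines(predictions):
    print(line)
```

Note that the boundary prediction mentioned above corresponds exactly to the start and end character offsets in each `T` line.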

Submission format

A valid submission is a compressed folder (e.g. submission.zip) consisting of prediction files (*.ann files).

  • Please just submit *.ann files, as other files are not necessary for evaluation.
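Packaging the prediction files can be scripted in a few lines. A minimal sketch, assuming your `*.ann` predictions sit in a local directory (the directory name is an assumption):

```python
import zipfile
from pathlib import Path

def build_submission(pred_dir, out_path="submission.zip"):
    """Bundle the *.ann prediction files (and nothing else) into a zip."""
    pred_dir = Path(pred_dir)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for ann in sorted(pred_dir.glob("*.ann")):
            # Store each file at the archive root, without its directory prefix.
            zf.write(ann, arcname=ann.name)
    return out_path
```

Restricting the glob to `*.ann` keeps stray `.txt` or backup files out of the archive, matching the requirement above.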

Evaluation

We use standard precision, recall, and F-score as our primary evaluation metrics. The evaluation system we use on the server is available on the download page of the BRATEval repository.
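For intuition, the exact-match variant of these metrics treats each predicted (label, start, end) triple as correct only if it appears in the gold annotations. A minimal sketch (BRATEval also supports relaxed, overlap-based matching, which this sketch does not cover):

```python
def prf(gold, pred):
    """Exact-match precision, recall, and F1 over (label, start, end) triples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # true positives
    p = tp / len(pred) if pred else 0.0        # precision
    r = tp / len(gold) if gold else 0.0        # recall
    f = 2 * p * r / (p + r) if p + r else 0.0  # F1
    return p, r, f

# One correct entity, one with a boundary off by a character:
print(prf([("TIME", 0, 3), ("SOLVENT", 10, 17)],
          [("TIME", 0, 3), ("SOLVENT", 10, 16)]))  # → (0.5, 0.5, 0.5)
```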

Task 1b: Event Extraction

A chemical reaction leading to an end product often consists of a sequence of individual event steps. Task 1b is to identify those steps and the chemical entities (recognized in Task 1a) that they involve. It requires identification of event trigger words (e.g., "added" and "stirred"), which are all assigned the single type EVENT_TRIGGER, followed by determination of the chemical entity arguments of these events.

When predicting event arguments, we adapt the semantic argument role labels Arg1 and ArgM from the Proposition Bank to label the relations between trigger words and chemical entities. Arg1 labels the relation between an event trigger word and a chemical compound, representing the argument role of being causally affected by another participant in the event. ArgM represents adjunct roles with respect to an event and labels the relation between a trigger word and a temperature, time, or yield entity.
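To make the role labels concrete, BRAT's standard event syntax attaches Arg1/ArgM arguments to a trigger in an `E` line. Whether the lab's gold data uses `E` lines or relation (`R`) lines is defined by its annotation guidelines; the sketch below, with illustrative entity IDs, only shows the standard BRAT serialisation:

```python
# Sketch: serialise one event in BRAT's standard event syntax.
# The trigger and argument IDs (T1, T2, T3) are hypothetical.
def event_line(event_id, trigger_id, args):
    parts = [f"EVENT_TRIGGER:{trigger_id}"]
    parts += [f"{role}:{tid}" for role, tid in args]
    return f"E{event_id}\t" + " ".join(parts)

# An "added" trigger (T3) acting on a compound (T1, Arg1) at a
# temperature (T2, ArgM):
print(event_line(1, "T3", [("Arg1", "T1"), ("ArgM", "T2")]))
# → "E1\tEVENT_TRIGGER:T3 Arg1:T1 ArgM:T2"
```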

Submission format

A valid submission is a compressed folder (e.g. submission.zip) consisting of prediction files (*.ann files).

  • Please just submit *.ann files, as other files are not necessary for evaluation.

Evaluation

We use standard precision, recall, and F-score as our primary evaluation metrics. The evaluation system we use on the server is available on the download page of BRATEval repository.

Task 1c: Anaphora Resolution

This task requires the resolution of general anaphoric dependencies between expressions in chemical patents. In this task, we define five types of anaphoric relationships, common in chemical patents:

  • Co-reference: two expressions/mentions that refer to the same entity.
  • Transformed: two chemical compound entities that are initially based on the same chemical components and have undergone possible changes through various conditions (e.g., pH and temperature).
  • Reaction-associated: the relationship between a chemical compound and its immediate sources via a mixing process. The immediate sources do not need to be reagents, but they need to end up in the corresponding product. The source compounds retain their original chemical structure.
  • Work-up: the relationship between chemical compounds that were used for isolation or purification purposes, and their corresponding output products.
  • Contained: the association holding between chemical compounds and the related equipment in which they are placed. The direction of the relation is from the related equipment to the previous chemical compound.

Data format

There are two types of data files, which are both in plain text using UTF-8 encoding:

  • *.txt files: the text snippets extracted from the original patent PDFs.
  • *.ann files: the annotation files containing span and relation annotations in BRAT standoff format.
    • {COREFERENCE, TRANSFORMED, REACTION_ASSOCIATED, WORK_UP, CONTAINED} are the five anaphoric relations we consider in this task.
    • Every identified relation should be labeled as one of the five types.
    • The referring direction is from anaphor to its corresponding antecedent.
    • The anaphor in a relation has the same label as the relation, i.e. one of the five types, while the antecedent is always labeled as ENTITY.
    • Note that an anaphor/antecedent may have multiple ranges (discontinuous text-bound).
For more details, please refer to our annotation guidelines, which are available here.
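The standoff conventions above can be read back with a small parser. A sketch, assuming BRAT's default serialisation (text-bound `T` lines, with discontinuous ranges separated by semicolons, and relation `R` lines whose arguments are named Arg1 and Arg2):

```python
def parse_ann(text):
    """Parse BRAT standoff lines into entities and relations.

    Entities map id -> (label, [(start, end), ...]); the list of ranges
    accommodates discontinuous anaphors/antecedents. Relations are
    (label, anaphor_id, antecedent_id) triples, following the
    anaphor-to-antecedent direction described above.
    """
    entities, relations = {}, []
    for line in text.splitlines():
        if line.startswith("T"):
            tid, meta, _surface = line.split("\t", 2)
            label, offsets = meta.split(" ", 1)
            ranges = [tuple(map(int, r.split())) for r in offsets.split(";")]
            entities[tid] = (label, ranges)
        elif line.startswith("R"):
            _rid, meta = line.split("\t", 1)
            label, a1, a2 = meta.split()
            relations.append((label, a1.split(":")[1], a2.split(":")[1]))
    return entities, relations
```

Note how a discontinuous mention such as `T1  COREFERENCE 10 15;20 25  ...` yields two (start, end) ranges for a single entity.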

Transitive co-reference relationships

Suppose there are co-reference links T1->T2 and T2->T3; then T1->T3 is also a valid link. In evaluation we look for all valid links, and missing links are counted as false negatives. To help you post-process your submission, we provide code to generate all valid links given the existing ones [python][HTML]. The code appends the new links to the existing files.
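The closure computation itself is a small fixed-point iteration. This is a hedged re-sketch of the idea, not the lab's released script; it operates on (anaphor, antecedent) ID pairs:

```python
def close_links(links):
    """Transitively close co-reference links.

    links: iterable of (anaphor, antecedent) pairs. If T1->T2 and
    T2->T3 are present, T1->T3 is added, repeating until no new
    link can be derived.
    """
    closed = set(links)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                # Chain a->b and b->d into a->d (skipping self-links).
                if b == c and a != d and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed

print(sorted(close_links({("T1", "T2"), ("T2", "T3")})))
# → [('T1', 'T2'), ('T1', 'T3'), ('T2', 'T3')]
```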

Submission format

A valid submission is a compressed folder (e.g. submission.zip) consisting of prediction files (*.ann files).

  • Please just submit *.ann files, as other files are not necessary for evaluation.

Evaluation

We use standard precision, recall, and F-score as our primary evaluation metrics. The evaluation system we use on the server is available on the download page of BRATEval repository.

Task 2a: Chemical Reaction Reference Resolution

Given a reaction description, this task requires identifying references to other reactions that the reaction relates to, and to the general conditions that it depends on.

Assume a set of reaction statements (RSs), each of which corresponds to a description of an individual chemical reaction or a general condition for the reaction. By identifying all the reference relationships amongst these reaction statements, the details of reactions can be fully specified by connecting related reaction statements. Two types of reference relationships are defined in this task, namely Analogous Reactions and General Conditions.

Data format

There are two types of data files, which are both in plain text using UTF-8 encoding:

  • *.txt files: the text files converted from the original patent PDFs.
    • Note that there are some special tags like <img>, <header>, <table>.
  • *.ann files: the annotation files containing span and relation annotations in BRAT standoff format.
    • REACTION_SPAN -> a reaction statement
    • REF -> a reference relation between two REACTION_SPANs
    • To help participants build better models, we provide CUE annotations, where a CUE indicates the analogy in a parent-child reaction pair. Related annotations are CUE, CUE_PARENT, CHILD_CUE.
    • IMG_CUE, IMG_CUE_PARENT, IMG_CHILD_CUE rely on images in the original patents, which are not made available in this task. Participants may ignore these annotations if they are not useful.
For more details, please refer to our annotation guidelines.

Dataset and visualization

The dataset is annotated using our modified version of BRAT. The challenge we faced is that reaction spans are often very long, which makes it hard for annotators to link two spans that are far apart. Therefore, we use a side-by-side view: the original text and annotated spans are displayed on the left side, and dummy nodes corresponding to the spans are shown on the right side, so that annotators can simply link the dummy nodes.

Visualization for the sample data can be found here (it may take a few seconds to load).

In our release, we provide two versions of the dataset, in folders such as sample and sample-vis. The first has the spans and relations for a patent in a single ann file, while the second separates them into two ann files to support the side-by-side view. The second version can be visualized with our modified BRAT; the first is more convenient for programmatic processing.

Submission format

A valid submission is a compressed folder (e.g. submission.zip) consisting of prediction files (*.ann files).

  • Please just submit *.ann files, as other files are not necessary for evaluation.
  • We do not require participants to predict CUEs, and the evaluation is based on REACTION_SPAN and REF only. Please exclude CUE-related predictions.
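If your pipeline emits CUE annotations as a by-product, they can be filtered out before packaging. A minimal sketch over a .ann file's text; the set of CUE-related labels below is taken from the data format description above:

```python
# Labels to exclude from submissions, per the task description.
CUE_LABELS = {"CUE", "CUE_PARENT", "CHILD_CUE",
              "IMG_CUE", "IMG_CUE_PARENT", "IMG_CHILD_CUE"}

def strip_cues(ann_text):
    """Drop CUE-related annotation lines, keeping REACTION_SPAN and REF."""
    kept = []
    for line in ann_text.splitlines():
        fields = line.split("\t")
        # The label is the first token of the second tab-separated field.
        label = fields[1].split()[0] if len(fields) > 1 else ""
        if label not in CUE_LABELS:
            kept.append(line)
    return "\n".join(kept)
```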

Evaluation

We use standard precision, recall, and F-score as our primary evaluation metrics. The evaluation system we use on the server is available on the download page of BRATEval repository.

Task 2b: Table Semantic Classification

This task is about categorising tables in chemical patents based on their contents. This supports identification of tables containing key information.

In particular, we define 8 different types of tables as shown in the following table.

Label: Description (Examples)
SPECT: Spectroscopic data (mass spectrometry, IR/NMR spectroscopy)
PHYS: Physical data (melting point, quantum chemical calculations)
IDE: Identification of compounds (chemical name, structure, formula, label)
RX: All properties of reactions (starting materials, products, yields)
PHARM: Pharmacological data (pharmacological usage of chemicals)
COMPOSITION: Compositions of mixtures (compositions made up of multiple ingredients)
PROPERTY: Properties of chemicals (e.g., the time of resistance of a photoresist)
OTHER: Other tables

This task aims to classify tables into the above 8 classes.

Submission format

A valid submission is a compressed folder (e.g. submission.zip) consisting of prediction files (*.ann files).

  • Please just submit *.ann files, as other files are not necessary for evaluation.

Evaluation

We use overall accuracy as the main evaluation metric for this task.
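Overall accuracy here is simply the fraction of tables whose predicted label matches the gold label. A minimal sketch with hypothetical labels:

```python
def accuracy(gold, pred):
    """Overall accuracy: fraction of tables labelled correctly."""
    assert len(gold) == len(pred), "one prediction per table"
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold) if gold else 0.0

# Three of four hypothetical tables classified correctly:
print(accuracy(["RX", "SPECT", "IDE", "OTHER"],
               ["RX", "SPECT", "PHYS", "OTHER"]))  # → 0.75
```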

Annotation Guidelines

To learn how the datasets are annotated and gain further insight into the tasks, please see the annotation guidelines.

Pre-trained ChemPatent Word Embeddings

In related work, we have released a set of new word embeddings, named ChemPatent Word Embeddings, trained on a collection of 84,076 full patent documents (1B tokens) across 7 patent offices. We have also released an ELMo model pre-trained on the same corpus, which provides contextualized word representations. We have demonstrated that the ChemPatent word embeddings yield better performance than word embeddings pre-trained on biomedical literature corpora.

To access and use the released ChemPatent word embeddings and the pre-trained ELMo model, please see the GitHub repository for ChemPatent Embeddings.

For detailed information about the embeddings, please see the original paper at https://www.aclweb.org/anthology/W19-5035.pdf.
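Static word embeddings in the common word2vec text layout can be loaded without any special tooling. A minimal sketch; whether the released ChemPatent files use exactly this layout (optional header line, then one word-vector pair per line) should be checked against the repository's README:

```python
def load_text_embeddings(path):
    """Load word2vec text-format embeddings into a {word: [float, ...]} dict.

    Assumes the common text layout: an optional "<vocab_size> <dim>"
    header line, then one "<word> <v1> ... <vd>" line per word.
    """
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        first = fh.readline().rstrip().split(" ")
        if len(first) != 2:  # no header: the first line is already a vector
            vectors[first[0]] = [float(v) for v in first[1:]]
        for line in fh:
            if not line.strip():
                continue
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors
```

Libraries such as gensim also read this format directly; the plain-Python version above just makes the layout explicit.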

Relevant Background:
  1. Nguyen DQ, Zhai Z, Yoshikawa H, Fang B, Druckenbrodt C, Thorne C, Hoessel R, Akhondi SA, Cohn T, Baldwin T and Verspoor K. (2020) ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In ECIR 2020. PDF.
  2. Zhai Z, Nguyen DQ, Akhondi S, Thorne C, Druckenbrodt C, Cohn T, Gregory M and Verspoor K. (2019) Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings. Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2019. https://www.aclweb.org/anthology/W19-5035.pdf
  3. Yoshikawa H, Verspoor K, Baldwin T, Nguyen DQ, Zhai Z, Akhondi SA, Thorne C, Druckenbrodt C. (2019) Detecting Chemical Reaction Schemes in Patents. Australian Language Technology Association Workshop (ALTA 2019). Sydney, Australia, December 2019. https://www.aclweb.org/anthology/U19-1014.pdf
In preparation
  1. Zhai, Z, Druckenbrodt, C, Thorne, C, Akhondi, SA, Nguyen, DQ, Cohn, T, & Verspoor, K. (2020) ChemTables: A Dataset for Semantic Classification of Tables in Chemical Patents. [paper]
  2. He J, Zhai Z, Druckenbrodt C, Akhondi SA, Thorne C, Yoshikawa H, Verspoor K. ChEMU-BERT: A pre-trained Language Model for Chemical Patents.
2021
  1. Yoshikawa H, Akhondi SA, Thorne C, Druckenbrodt C, Hoessel R, Zhai Z, He J, Baldwin T, Verspoor K. (2021) Chemical Reaction Reference Resolution in Patents. PatentSemTech2021.
  2. He J, Nguyen DQ, Akhondi SA, Druckenbrodt C, Thorne C, Hoessel R, Afzal Z, Zhai Z, Fang B, Yoshikawa H, Albahem A, Cavedon L, Cohn T, Baldwin T, Verspoor K. (2021) ChEMU 2020: Natural Language Processing Methods are Effective for Information Extraction from Chemical Patents. Frontiers in Research Metrics and Analytics, section Text-mining and Literature-based Discovery, special issue on Information Extraction from Bio-Chemical Text.
  3. He J, Fang B, Yoshikawa H, Li Y, Akhondi SA, Druckenbrodt C, Thorne C, Afzal Z, Zhai Z, Cavedon L, Cohn T, Baldwin T, Verspoor K. (2021) ChEMU 2021: Reaction Reference Resolution and Anaphora Resolution in Chemical Patents. In: Advances in Information Retrieval. ECIR 2021.
  4. Fang B, Druckenbrodt C, Akhondi SA, He J, Baldwin T, Verspoor K. (2021) ChEMU-Ref: A corpus for modeling anaphora resolution in the chemical domain. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021).
  5. Elangovan A, He J, Verspoor K. (2021) Memorization vs. Generalization: Quantifying Data Leakage in NLP Performance Evaluation. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021).
  6. Fang B, Druckenbrodt C, Yeow HS, Novakovic S, Hössel R, Akhondi SA, He J, Mistica M, Baldwin T, Verspoor K. (2021) ChEMU-Ref dataset for Modeling Anaphora Resolution in the Chemical Domain. Mendeley Data, V1, doi: 10.17632/r28xxr6p92.1 [dataset].
2020
  1. He J, Nguyen DQ, Akhondi SA, Druckenbrodt C, Thorne C, Hoessel R, Afzal Z, Zhai Z, Fang B, Yoshikawa H, Albahem A, Cavedon L, Cohn T, Baldwin T, Verspoor K. Overview of ChEMU 2020: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, Lecture Notes in Computer Science, vol. 12260: 237-254.
  2. He J, Nguyen DQ, Akhondi SA, Druckenbrodt C, Thorne C, Hoessel R, Afzal Z, Zhai Z, Fang B, Yoshikawa H, Albahem A, Wang J, Ren Y, Zhang Z, Zhang Y, Dao MH, Ruas P, Lamurias A, Couto F, Copara J, Naderi N, Knafou J, Ruch P, Teodoro D, Lowe D, Mayfield J, Koksal A, Donmez H, Ozkirimli E, Ozgur A, Mahendran D, Gurdin G, Lewinski N, Tang C, McInnes BT, Malarkodi CS, Rao TP, Devi SL, Cavedon L, Cohn T, Baldwin T, Verspoor K (2020) An Extended Overview of the CLEF 2020 ChEMU Lab: Information Extraction of Chemical Reactions from Patents. Proceedings of the CLEF 2020 conference. Thessaloniki, Greece. 2020-09. http://hesso.tind.io/record/6175
  3. Nguyen DQ, Zhai Z, Yoshikawa H, Fang B, Druckenbrodt C, Thorne C, Hoessel R, Akhondi SA, Cohn T, Baldwin T, Verspoor K. (2020) ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In: Jose J. et al. (eds) Advances in Information Retrieval. ECIR 2020. Lecture Notes in Computer Science, vol 12036. Springer, Cham. doi: 10.1007/978-3-030-45442-5_74 PDF
  4. Zhai, Z, Druckenbrodt, C, Thorne, C, Akhondi, SA, Nguyen, DQ, Cohn, T, & Verspoor, K. (2020) ChemTables: A Dataset for Semantic Classification of Tables in Chemical Patents. [paper], [dataset].
  5. Verspoor K, Nguyen DQ, Akhondi SA, Druckenbrodt C, Thorne C, Hoessel R, He J, Zhai Z. ChEMU dataset for information extraction from chemical patents. [dataset].
2019
  1. Yoshikawa H, Nguyen DQ, Zhai Z, Druckenbrodt C, Thorne C, Akhondi SA, Baldwin T, Verspoor K. (2019) Detecting Chemical Reactions in Patents. Australian Language Technology Association Workshop (ALTA 2019). Sydney, Australia, December 2019. https://www.aclweb.org/anthology/U19-1014.pdf [Best Paper Award]
  2. Zhai Z, Nguyen DQ, Akhondi S, Thorne C, Druckenbrodt C, Cohn T, Gregory M and Verspoor K. (2019) Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings. Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2019. https://www.aclweb.org/anthology/W19-5035.pdf
2018
  1. Zhai Z, Nguyen DQ, Verspoor K*. (2018) Comparing CNN and LSTM character-level embeddings in BiLSTM-CRF models for chemical and disease named entity recognition. Proceedings of the 9th International Workshop on Health Text Mining and Information Analysis (LOUHI 2018), pages 38–43, Brussels, Belgium, October 31, 2018. arXiv:1808.08450. http://aclweb.org/anthology/W18-5605
  2. Nguyen DQ, Verspoor K*. (2018) Convolutional neural networks for chemical-disease relation extraction are improved with character-based word embeddings. Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP) at ACL2018. arXiv:1805.10586. http://aclweb.org/anthology/W18-2314
FAQ

  1. Can I log in with the credentials I used for CLEF registration?

    To provide a more secure environment on our submission website, we use a registration system that is independent of CLEF. To log into the submission website for the first time, you will need to sign up by providing some basic information: your username, email, password, and institution. We apologize for any inconvenience.

  2. How can I make a submission?

    You can choose to make a submission against the development or test dataset by toggling the "data split" option in the submission panel. You will receive an evaluation result as soon as your submission is uploaded successfully. A ranking of all your submissions is provided on your private leaderboard. You may also click "publish" to make a submission's performance visible to all teams; published submissions appear on the public leaderboard.