8 November 2024 Joint information extraction from semi-structured web pages using BERT
Ge Zhang, Jian Wang, Jing Ling Wang
Proceedings Volume 13416, Fourth International Conference on Advanced Algorithms and Neural Networks (AANN 2024); 134161X (2024) https://doi.org/10.1117/12.3049720
Event: 2024 4th International Conference on Advanced Algorithms and Neural Networks, 2024, Qingdao, China
Abstract
Open information extraction from semi-structured web pages plays a pivotal role in constructing knowledge graphs and is an active research topic in information extraction. Unlike traditional closed information extraction from semi-structured web pages, open information extraction can extract triples that do not adhere to predefined ontological relationships. Existing approaches typically use a pipeline that first extracts relationships and then the corresponding entities, so errors in the first stage propagate to the second. To address this, our study introduces an end-to-end method for open information extraction from semi-structured web pages. The method models extraction as a cascading labeling task and employs a joint decoder to extract relationships and their corresponding entities simultaneously. The model was trained on websites from three domains (movies, universities, and NBA players) using an extended SWDE dataset. In zero-shot and few-shot extraction experiments, our method achieves higher F1 scores than the baseline models on open information extraction across all three domains. Analysis of the results indicates that the proposed method not only maintains high performance but also generalizes robustly.
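The F1 comparison mentioned in the abstract can be illustrated with a minimal sketch of triple-level scoring. The abstract does not state the matching criterion, so exact matching of (subject, relation, object) triples is an assumption here, and the function name `triple_f1` and the sample triples are hypothetical, not taken from the paper or the SWDE dataset.

```python
from typing import Iterable, Tuple

# A triple is (subject, relation, object); this layout is an assumption.
Triple = Tuple[str, str, str]


def triple_f1(predicted: Iterable[Triple], gold: Iterable[Triple]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match triple comparison."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)  # triples that match exactly
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted triples for one movie page.
gold = [
    ("Inception", "Director", "Christopher Nolan"),
    ("Inception", "Release date", "2010"),
]
pred = [
    ("Inception", "Director", "Christopher Nolan"),
    ("Inception", "Runtime", "148 min"),
]

precision, recall, f1 = triple_f1(pred, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In a pipeline system, a wrong relation in the first stage makes every dependent triple a false positive, which lowers both precision and recall under this kind of scoring.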
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Ge Zhang, Jian Wang, and Jing Ling Wang "Joint information extraction from semi-structured web pages using BERT", Proc. SPIE 13416, Fourth International Conference on Advanced Algorithms and Neural Networks (AANN 2024), 134161X (8 November 2024); https://doi.org/10.1117/12.3049720
KEYWORDS
Performance modeling
Data modeling
Education and training
Online learning
Object recognition
Matrices
Transformers