Automating web table column annotation using supervised learning


In this research, two concept embedding models, TransE and TransH, were experimented with. A suitable dataset to train the models had to be generated, as there were no pretrained models or prepared datasets available specifically for DBPedia. There were two possible ways to do this: either prepare a dataset (a set of triplets) from the T2D data itself (which is a subset of DBPedia's full set of triplets), or train the model using DBPedia's full set of triplets. In the first approach the model would yield very good results, with almost 99% accuracy, because the evaluation is against the T2D data itself; however, it would be overfitted and would not perform well on other datasets of a similar nature (eg: T2D*). If trained using the full DBPedia set of triplets, the model would be very large, would take a long time to train, and would also contain a lot of noise, since the model only needs to cover the data of the T2D tables.
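
For context, both models treat a relation as a translation in the embedding space and differ only in how the entities are projected. The following is a minimal sketch of the two scoring functions, assuming L2 distance and illustrative NumPy vectors; it is not taken from the code used in this research.

    import numpy as np

    def transe_score(h, r, t):
        # TransE: a triplet (h, r, t) is plausible when h + r is close to t.
        return np.linalg.norm(h + r - t, ord=2)

    def transh_score(h, d_r, w_r, t):
        # TransH: project head and tail onto the relation-specific hyperplane
        # (unit normal w_r) before applying the translation vector d_r.
        w_r = w_r / np.linalg.norm(w_r)
        h_proj = h - np.dot(w_r, h) * w_r
        t_proj = t - np.dot(w_r, t) * w_r
        return np.linalg.norm(h_proj + d_r - t_proj, ord=2)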

To overcome both of the above challenges, it was decided to create a dataset larger than T2D but smaller than the full DBPedia. All the cell values in the T2D tables were scraped, the ones containing only numbers were removed, and the DBPedia Lookup endpoint was queried for the remainder. This resulted in a list of labels and their corresponding DBPedia resource URIs. The DBPedia SPARQL endpoint was then queried for triplets of the patterns (e1, ?c, ?e2) and (?e2, ?c, e1), where e1 is an entity URI from the above list. This process generated a considerably large dataset for training TransE and TransH, and both models were trained on it successfully.
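
A minimal sketch of this triplet-collection step is shown below, using the SPARQLWrapper library against the public DBPedia SPARQL endpoint. The entity URI, query limits, and function names are illustrative assumptions rather than the exact scripts used in the research.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)

    def neighbourhood_triplets(entity_uri, limit=1000):
        # Collect (e1, ?c, ?e2) and (?e2, ?c, e1) triplets for one entity URI.
        uri = f"<{entity_uri}>"
        triplets = []
        # Entity as subject: (e1, ?c, ?e2)
        sparql.setQuery(f"SELECT ?c ?e2 WHERE {{ {uri} ?c ?e2 . }} LIMIT {limit}")
        for b in sparql.query().convert()["results"]["bindings"]:
            triplets.append((entity_uri, b["c"]["value"], b["e2"]["value"]))
        # Entity as object: (?e2, ?c, e1)
        sparql.setQuery(f"SELECT ?e2 ?c WHERE {{ ?e2 ?c {uri} . }} LIMIT {limit}")
        for b in sparql.query().convert()["results"]["bindings"]:
            triplets.append((b["e2"]["value"], b["c"]["value"], entity_uri))
        return triplets

    # Example: one entity URI of the kind returned by the DBPedia Lookup endpoint
    triplets = neighbourhood_triplets("http://dbpedia.org/resource/Berlin")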

Numerical data of different entities was required as training data. However, the amount of manually annotated data for numerical values was not adequate. Therefore, numerical data had to be scraped from the web.

First attempt - Extracting numerical data columns directly from the T2D tables. Only a very small number of numerical columns could be obtained this way.

Second attempt - Using Google table search to find table columns matching the concept label. This proved too labour-intensive to be practical.

Third attempt - The DBPedia SPARQL endpoint was used to query sets of values for a given numerical concept (numerical concepts were identified using the range of each concept in the dbpedia.owl file); a sketch of this step appears after this list. Out of 103 numerical concepts, the endpoint returned data for only 57.

The final dataset was created by combining the T2D data and the queried data.
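
A minimal sketch of the third attempt follows. It assumes the numerical concepts are datatype properties in dbpedia.owl whose rdfs:range is a numeric XSD type (the exact type list is an assumption), and it uses rdflib and SPARQLWrapper with illustrative limits rather than the original scripts.

    from rdflib import Graph, RDFS, XSD
    from SPARQLWrapper import SPARQLWrapper, JSON

    # Identify numerical concepts: properties in dbpedia.owl whose rdfs:range
    # is a numeric XSD type (this list of types is illustrative).
    ontology = Graph().parse("dbpedia.owl", format="xml")
    numeric_ranges = {XSD.integer, XSD.nonNegativeInteger, XSD.double,
                      XSD.float, XSD.decimal}
    numerical_concepts = [prop for prop, _, rng in
                          ontology.triples((None, RDFS.range, None))
                          if rng in numeric_ranges]

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)

    def values_for_concept(property_uri, limit=500):
        # Query the DBPedia SPARQL endpoint for a sample of literal values of
        # one numerical concept; returns an empty list if nothing comes back.
        sparql.setQuery(f"SELECT ?value WHERE {{ ?e <{property_uri}> ?value . }} LIMIT {limit}")
        bindings = sparql.query().convert()["results"]["bindings"]
        return [b["value"]["value"] for b in bindings]

    # Keep only the concepts for which the endpoint actually returned data.
    training_values = {str(p): values_for_concept(str(p)) for p in numerical_concepts}
    training_values = {p: vals for p, vals in training_values.items() if vals}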

Associated Publication: -

Paper Title: Automating web table columns to knowledge base mapping using translation embedding
Published in: 2020 IEEE 14th International Conference on Semantic Computing (ICSC)
Date of Conference: 3-5 Feb. 2020
DOI: 10.1109/ICSC.2020.00029

Citation: -

K. Chamiran, A. Rukshan and U. Thayasivam, "Automating web Table Columns to Knowledge Base Mapping using Translation Embedding," 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 2020, pp. 150-153, doi: 10.1109/ICSC.2020.00029.