Friday, April 27, 2007

Feature of the Week: Information Extraction supports the Import of References from Homepages

Today's feature of the week post points you to one of the hidden features of the system. As most of you certainly know, one way to acquire the metadata of a publication is to use the screen scraping facility of BibSonomy. A list of supported sites can be found here and is extended constantly. Today we released a new scraper for Highwire and LibraryThing. It's also possible to write your own extension: a description of the internal scraper interface is provided here and allows you to implement scrapers for BibSonomy.

At the end of the list you find the IEScraper, which is not designed for a specific web page but rather supports you in general with the import of publication metadata in a "usual" format, like the following:

Emma Tonkin and Marieke Guy. Folksonomies: Tidying Up Tags? . D-Lib,volume 12(1), January 2006.
which you can find at:
http://www.cs.bris.ac.uk/Publications/pub_info.jsp?id=2000478
To use this scraper, highlight the text of the reference you would like to copy and then press the post_publication button. What happens in the background is this: the marked reference is sent to the BibSonomy server, and since no other scraper is able to process this kind of entry, the IEScraper processes it and tries to find the different parts of the reference, such as author, title, or year. You end up in the publication input mask, where you find a prefilled form containing all the information the scraper was able to extract. Now you can add your tags and adapt the entry. As an example, here is the above entry in BibSonomy:

http://www.bibsonomy.org/bibtex/29488117bf156fe15b2fb3b8ab4376dec/hotho
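
To make the fallback idea concrete, here is a minimal sketch in Python. It is not the BibSonomy implementation: the post does not describe the scraper internals, and the real IEScraper is a trained information extraction component, whereas this sketch swaps in a simple regular expression as a stand-in. All names (`site_specific_scraper`, `ie_scraper`, `scrape`) are hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, List, Optional

# A scraper takes the highlighted text and returns the extracted fields,
# or None if it cannot handle the input. (Hypothetical interface.)
Scraper = Callable[[str], Optional[dict]]

# Assumed layout: "Authors. Title. Rest of the reference including the year."
REFERENCE_PATTERN = re.compile(
    r"^(?P<author>.+?)\.\s+"   # authors, up to the first period followed by whitespace
    r"(?P<title>.+?)\s*\.\s+"  # title, up to the next period followed by whitespace
    r"(?P<rest>.*)$"           # journal, volume, date, ...
)
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def site_specific_scraper(text: str) -> Optional[dict]:
    """Stand-in for a site-specific scraper; it does not handle free text."""
    return None


def ie_scraper(text: str) -> Optional[dict]:
    """Regex stand-in for the generic scraper (the real one is trained)."""
    match = REFERENCE_PATTERN.match(text.strip())
    if match is None:
        return None
    year = YEAR_PATTERN.search(match.group("rest"))
    return {
        "author": re.split(r",\s*|\s+and\s+", match.group("author")),
        "title": match.group("title"),
        "year": year.group(0) if year else None,
    }


def scrape(text: str, scrapers: List[Scraper]) -> Optional[dict]:
    """Try each scraper in order and return the first result that is not None."""
    for scraper in scrapers:
        result = scraper(text)
        if result is not None:
            return result
    return None


selection = ("Emma Tonkin and Marieke Guy. Folksonomies: Tidying Up Tags? . "
             "D-Lib,volume 12(1), January 2006.")
entry = scrape(selection, [site_specific_scraper, ie_scraper])
print(entry)
```

A fixed pattern like this breaks quickly, for example on author names with initials such as "A. Hotho", which is one reason a trainable extractor is a better fit for the task.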

Unfortunately the information extraction technology is not able to process all entries correctly. For the following entry:

Philipp Cimiano, Andreas Hotho, Steffen Staab. Comparing conceptual, partitional and agglomerative clustering for learning taxonomies from text. Proceedings of the European Conference on Artificial Intelligence (ECAI'04). 2004.

the title and authors are extracted correctly, but the booktitle is wrong: it also contains the year, which is therefore missing from the year field. You have to correct this mistake manually. We log these corrections and use this information to tune the IEScraper. Currently we have to start the training process manually, but we are working on an automatic learning setup.

We hope that this feature helps everybody who finds references not in the common digital archives but on the homepages of researchers. Although the IEScraper is not perfect, it takes over a reasonable amount of the work, and we hope you find this feature useful.


Have fun!

     Andreas
