Computational Identification of Author Style on Electronic Libraries - Case of Lexical Features | |
---|---|
Author | |
Abstract |
This work presents a detailed authorship attribution study on a digital library corpus called the HAT corpus. A dataset of 300 documents written by 100 different authors was extracted from the web digital library and processed for author style analysis. All documents are written in Arabic and concern travel. Three basic rules of stylometry should be respected: a minimum document size, the same topic for all documents, and the same genre. We took particular care to satisfy these conditions during corpus preparation. Three lexical features (fixed-length words, rare words, and suffixes) are evaluated using a centroid-based Manhattan distance. The identification approach yields an accuracy of about 0.94. |
Year of Publication |
2022
|
Conference Name |
2022 5th International Symposium on Informatics and its Applications (ISIA)
|
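The abstract names a centroid-based Manhattan distance over lexical features but gives no implementation details. The following is a minimal sketch of that general idea, not the authors' code. It assumes that a "suffix" is simply the last `n` characters of a whitespace-delimited word, that features are relative frequencies, and that an author's centroid is the feature-wise mean of that author's document profiles. All function names (`suffix_profile`, `centroid`, `manhattan`, `train`, `attribute`) are hypothetical.

```python
from collections import Counter, defaultdict


def suffix_profile(text, n=3):
    """Relative frequencies of word-final character n-grams.

    Assumption: this stands in for the 'Suffixes' feature; the paper's
    exact definition is not given in the abstract.
    """
    counts = Counter(w[-n:] for w in text.split() if len(w) >= n)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def centroid(profiles):
    """Feature-wise mean of a non-empty list of sparse profiles."""
    sums = defaultdict(float)
    for profile in profiles:
        for feature, value in profile.items():
            sums[feature] += value
    return {feature: s / len(profiles) for feature, s in sums.items()}


def manhattan(p, q):
    """L1 distance between two sparse profiles (missing features count as 0)."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def train(docs_by_author, n=3):
    """Build one centroid per author from that author's documents."""
    return {
        author: centroid([suffix_profile(doc, n) for doc in docs])
        for author, docs in docs_by_author.items()
    }


def attribute(text, centroids, n=3):
    """Return the author whose centroid is closest in Manhattan distance."""
    profile = suffix_profile(text, n)
    return min(centroids, key=lambda author: manhattan(profile, centroids[author]))
```

With three documents per author, as in the dataset described above, `train` would take a dictionary that maps each author to a list of training texts, and `attribute` would then be called on each held-out document. The other two features could be added by swapping in a different profile function while keeping the distance and centroid steps unchanged.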