eprintid: 9367
rev_number: 11
eprint_status: archive
userid: 2098
dir: disk0/00/00/93/67
datestamp: 2024-09-24 02:10:55
lastmod: 2024-09-24 02:10:55
status_changed: 2024-09-24 02:10:55
type: article
metadata_visibility: show
contact_email: repository@staff.ukdw.ac.id
creators_name: , Antonius Rachmat Chrismanto
creators_name: , Yohanes Suyanto
creators_name: , Anny Kartika Sari
creators_id: 0523128101
title: SPAMID-PAIR SISTEM REPOSITORI: INDONESIAN SPAM PAIR DATASET
ispublished: pub
subjects: QA75
divisions: fak_tein
full_text_status: public
keywords: Dataset; natural language processing; spam detection; spamid-pair; post-comment pairs
abstract: The detection of spam content is an important task, especially on social media, and it has been studied continually in Natural Language Processing (NLP) over the last few years. However, few data sets are available for this research topic because most researchers collect their own data and keep it private. Moreover, most available data sets provide only the post content and omit the comment content. This is a limitation because the post-comment pair is needed to determine the context of a comment on a particular post, and that context may contribute to the decision of whether a comment is spam. The scarcity of non-English data sets, including Indonesian ones, is another issue. To address these problems, the authors introduce SPAMID-PAIR, a novel post-comment pair data set in Indonesian collected from Instagram (IG). It was collected from 13 selected Indonesian actress/actor accounts, each of which has more than 15 million followers, and contains 72874 pairs of data. The data set has been annotated with spam/non-spam labels and is stored in Unicode (UTF-8) text format. The data also includes many emojis/emoticons from IG. To establish baseline performance, the data set was tested with several machine learning methods under several scenarios, which achieved good performance.
This data set is intended for replicable experiments in spam content detection on social media and for other tasks in the NLP area.
date: 2022
publication: Jurnal HKI Program Komputer
volume: 13
number: 1
publisher: Ditjen HKI dan Desain Industri
id_number: doi:10.14569/IJACSA.2022.0131110
refereed: TRUE
official_url: http://dx.doi.org/10.14569/IJACSA.2022.0131110
citation: Antonius Rachmat Chrismanto and Yohanes Suyanto and Anny Kartika Sari (2022) SPAMID-PAIR SISTEM REPOSITORI: INDONESIAN SPAM PAIR DATASET. Jurnal HKI Program Komputer, 13 (1).
document_url: https://katalog.ukdw.ac.id/9367/1/SPAMID-PAIR.pdf