Mastodon Content Warnings: Inappropriate Contents in a Microblogging Platform / M. Zignani, C. Quadri, A. Galdeman, S. Gaito, G.P. Rossi (PROCEEDINGS OF THE ... INTERNATIONAL AAAI CONFERENCE ON WEBLOGS AND SOCIAL MEDIA). - In: Proceedings of the International AAAI Conference on Web and Social Media. - [s.l] : Association for the Advancement of Artificial Intelligence, 2019. - ISBN 9781577358060. - pp. 639-645 (Paper presented at the 13th International AAAI Conference on Web and Social Media, held in Munich in 2019).

Mastodon Content Warnings: Inappropriate Contents in a Microblogging Platform

C. Quadri; S. Gaito; G.P. Rossi
2019

Abstract

Our social communication and the expression of our beliefs and thoughts are increasingly mediated and spread by online social media. Alongside its many advantages, this democratization and freedom of expression also carries unpleasant offline behaviors into online life, such as cyberbullying, sexting, hate speech and, in general, any behavior unsuitable for the online community people belong to. To mitigate or even remove these threats from their platforms, most social media providers are implementing solutions for the automatic detection and filtering of such inappropriate content. However, the data they use to train their tools are not publicly available. In this context, we release a dataset gathered from Mastodon, a distributed online social network formed by communities that set their own publication rules, and which allows users to mark their posts as inappropriate if they perceive them as unsuitable for the community they belong to. The dataset consists of all posts with public visibility published by users hosted on servers that support the English language. The data were collected with an ad hoc tool that downloads the public timelines of the servers, called instances, that form the Mastodon platform, along with the metadata associated with them. The overall corpus contains over 5 million posts, spanning the entire life of Mastodon. We associate with each post a label indicating whether its content is inappropriate, as perceived by the user who wrote it. We also provide the full description of each instance. Finally, we present some basic statistics about the production of inappropriate posts and the characteristics of their textual content.
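The collection and labelling steps described in the abstract could be sketched roughly as follows. This is not the authors' tool: it is a minimal illustration that assumes the label is derived from the `sensitive` flag or the `spoiler_text` (content warning) field of a Mastodon status, and that pages of the public timeline are walked backwards with `max_id`. The function names, the output record layout, and the `fetch_json` callable are hypothetical.

```python
from typing import Callable, Dict, Iterator, List, Optional
from urllib.parse import urlencode


def public_timeline_url(instance: str, max_id: Optional[str] = None, limit: int = 40) -> str:
    """Build the URL of one page of an instance's local public timeline."""
    params = {"local": "true", "limit": str(limit)}
    if max_id is not None:
        params["max_id"] = max_id  # only return statuses older than this id
    return f"https://{instance}/api/v1/timelines/public?{urlencode(params)}"


def is_marked_inappropriate(status: Dict) -> bool:
    """Assumed labelling rule: the author set the sensitive flag or a content warning."""
    return bool(status.get("sensitive")) or bool(status.get("spoiler_text"))


def crawl(instance: str, fetch_json: Callable[[str], List[Dict]]) -> Iterator[Dict]:
    """Walk the timeline backwards in time, yielding one labelled record per post."""
    max_id = None
    while True:
        page = fetch_json(public_timeline_url(instance, max_id))
        if not page:
            return  # an empty page means the start of the timeline was reached
        for status in page:
            yield {
                "id": status["id"],
                "instance": instance,
                "content": status.get("content", ""),
                "inappropriate": is_marked_inappropriate(status),
            }
        max_id = page[-1]["id"]  # pages are returned newest first
```

In a real run, `fetch_json` would perform the HTTP GET and decode the JSON body, and the crawler would need to respect each instance's rate limits. Passing the fetcher in as a parameter keeps the paging and labelling logic testable without network access.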
Sector INF/01 - Computer Science
Book Part (author)
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/641570