World's largest profanity audio dataset
Dataset consists of 26,365 audio files
Click here for documentation
See The Abuse Project
TAPAD is an open dataset, meaning it will grow over time as more data is contributed. To enable reproducibility and accurate citation, the dataset is versioned using git tags.
| Category | Value |
|---|---|
| Total files | 26,365 |
| Dataset updated | July 30, 2019 |
| Language classes | 75 |
| File Type | MP3 |
| Mime Type | audio/mpeg |
| Mpeg Audio Version | 2 |
| Audio Layer | 3 |
| Audio Bitrate | 32 kbps |
| Sample Rate | 24000 |
| Channel Mode | Single Channel |
| Ms Stereo | Off |
| Intensity Stereo | Off |
| Codec Type | audio |
| Codec Time Base | 1/24000 |
| Codec Tag | 0x0000 |
| Sample Fmt | fltp |
| Sample Rate | 24000 |
| Channels | 1 |
| Channel Layout | mono |
| Bits Per Sample | 0 |
| R Frame Rate | 0/0 |
| Avg Frame Rate | 0/0 |
| Time Base | 1/14112000 |
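The properties in the table (MPEG version 2, Layer 3, 32 kbps, 24000 Hz, mono) are all encoded in the 4-byte header that starts every MP3 frame. The following is a minimal sketch, not part of the TAPAD tooling, that decodes such a header. The function name `parse_frame_header` is made up for illustration, the bitrate table covers only MPEG-2/2.5 Layer III, and the caller is assumed to pass four bytes taken at a frame boundary (a file may begin with an ID3 tag that has to be skipped first).

```python
# Lookup tables for the 2-bit version, layer and channel-mode fields.
VERSIONS = {0b00: "2.5", 0b10: "2", 0b11: "1"}
LAYERS = {0b01: 3, 0b10: 2, 0b11: 1}
CHANNEL_MODES = ["stereo", "joint stereo", "dual channel", "mono"]

# Bitrates in kbps for MPEG-2/2.5 Layer III, indexed by the 4-bit bitrate field.
# Index 0 means "free format" and index 15 is invalid; both map to None here.
V2_L3_BITRATES = [None, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160, None]

# Sample rates in Hz, indexed by the 2-bit sample-rate field.
SAMPLE_RATES = {
    "1": [44100, 48000, 32000],
    "2": [22050, 24000, 16000],
    "2.5": [11025, 12000, 8000],
}


def parse_frame_header(header: bytes) -> dict:
    """Decode the first four bytes of an MP3 frame into readable fields."""
    if len(header) < 4:
        raise ValueError("need at least 4 bytes")
    word = int.from_bytes(header[:4], "big")

    # The top 11 bits of a frame header are all ones (frame sync).
    if word >> 21 != 0x7FF:
        raise ValueError("no frame sync found")

    version = VERSIONS.get((word >> 19) & 0b11)
    layer = LAYERS.get((word >> 17) & 0b11)
    rate_index = (word >> 10) & 0b11
    if version is None or layer is None or rate_index == 0b11:
        raise ValueError("reserved value in header")

    # Only the MPEG-2/2.5 Layer III bitrate table is included in this sketch.
    bitrate = None
    if version != "1" and layer == 3:
        bitrate = V2_L3_BITRATES[(word >> 12) & 0b1111]

    return {
        "mpeg_version": version,
        "layer": layer,
        "bitrate_kbps": bitrate,
        "sample_rate": SAMPLE_RATES[version][rate_index],
        "channel_mode": CHANNEL_MODES[(word >> 6) & 0b11],
    }


if __name__ == "__main__":
    # Header bytes hand-built to match the table: MPEG-2, Layer III, 32 kbps, 24 kHz, mono.
    print(parse_frame_header(bytes([0xFF, 0xF3, 0x44, 0xC0])))
```

For full metadata (codec time base, sample format and so on) a dedicated tool such as ffprobe is a better fit; this sketch only shows where the headline values come from.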
Language names are required to be 2 letters, normally the language's 2-letter ISO code (see ISO_639-1).
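As a rough illustration of this naming rule, the sketch below checks a class name and splits it into its parts. It assumes lowercase names and also accepts an optional 2-letter region suffix, because the directory tree in this README contains names such as `en-au` and `zh-tw`. The helper `split_class_name` is hypothetical and not part of the repository.

```python
import re

# Two lowercase letters, optionally followed by "-" and a two-letter region.
CLASS_NAME = re.compile(r"([a-z]{2})(?:-([a-z]{2}))?")


def split_class_name(name: str):
    """Return (language, region) for a class name; region is None if absent."""
    match = CLASS_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid class name: {name!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for name in ["en", "en-au", "zh-tw"]:
        print(name, split_class_name(name))
```

Note that this only checks the shape of the name. It does not verify that the two letters are an assigned ISO 639-1 code.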
| Filename | Location | Description | Type |
|---|---|---|---|
| record.py | acquire\custom | Records audio in WAV format (default: 3 sec) | Helper script |
| wingen.py | acquire\generate | TTS conversion using SAPI.SpVoice | Helper script |
| gTTSgen.py | acquire\generate | TTS conversion using gTTS & abuse 0.1.1 | Helper script |
| gspectogram.py | utils | Generates spectrogram of a wav file | Utility tool |
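To give a feel for what a gTTS-based generation step can look like, here is a minimal sketch. It is not the code of `gTTSgen.py`: the function name `synthesize_words`, the `tts_factory` parameter, and the zero-padded file naming are all assumptions made for illustration. The only gTTS calls used are the `gTTS(text=..., lang=...)` constructor and its `save()` method, and running it with the real library requires network access.

```python
from pathlib import Path


def synthesize_words(words, lang, out_dir, tts_factory=None):
    """Write one MP3 per word into out_dir and return the paths written."""
    if tts_factory is None:
        # Imported lazily so the helper can be exercised without gTTS installed.
        from gtts import gTTS
        tts_factory = gTTS

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for index, word in enumerate(words):
        # Hypothetical naming scheme: a zero-padded running index per word.
        target = out_dir / f"{index:04d}.mp3"
        tts_factory(text=word, lang=lang).save(str(target))
        written.append(target)
    return written


if __name__ == "__main__":
    # Placeholder words; the real word lists come from the sources credited in this README.
    print(synthesize_words(["hello", "world"], "en", "out/en"))
```

Passing the synthesizer in as `tts_factory` keeps the loop easy to try out offline with a stand-in class that has the same `save()` method.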
.
├───af
├───ar
├───bn
├───bs
├───ca
├───cs
├───cy
├───da
├───de
├───el
├───en
│   ├───1 (340 wav files)
│   └───2
├───en-au
├───en-ca
├───en-gb
├───en-gh
├───en-ie
├───en-in
├───en-ng
├───en-nz
├───en-ph
├───en-tz
├───en-uk
├───en-us
├───en-za
├───eo
├───es
├───es-es
├───es-us
├───et
├───fi
├───fr
├───fr-ca
├───fr-fr
├───hi
├───hr
├───hu
├───hy
├───id
├───is
├───it
├───ja
├───jw
├───km
├───ko
├───la
├───lv
├───mk
├───ml
├───mr
├───my
├───ne
├───nl
├───no
├───pl
├───pt
├───pt-br
├───pt-pt
├───ro
├───ru
├───si
├───sk
├───sq
├───sr
├───su
├───sv
├───sw
├───ta
├───te
├───th
├───tl
├───tr
├───uk
├───vi
├───zh-cn
└───zh-tw
Most of these audio classes have 347 MP3 files of ~5.783 minutes each. MP3 was long encumbered by patents, but according to Wikipedia, "If the longest-running patent mentioned in the aforementioned references is taken as a measure, then the MP3 technology became patent-free in the United States on 16 April 2017 when U.S. Patent 6,009,399, held by and administered by Technicolor, expired."
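To check per-class file counts like the one quoted above against a local copy, a short script can group files by their top-level class directory. This is a sketch under assumptions: the function name `count_files_by_class` is made up, and it expects the dataset root (for example an `audio/` directory) to contain one subdirectory per class.

```python
from collections import Counter
from pathlib import Path


def count_files_by_class(root):
    """Count regular files under root, grouped by top-level directory name."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        # Files sitting directly in the root are grouped under ".".
        key = parts[0] if len(parts) > 1 else "."
        counts[key] += 1
    return counts


if __name__ == "__main__":
    counts = count_files_by_class("audio")
    for name in sorted(counts):
        print(f"{name}\t{counts[name]}")
    print("total", sum(counts.values()))
```

For a tree without symbolic links, the printed total should agree with what `find audio/ -type f | wc -l` reports.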
To count the files in the dataset, run `find audio/ -type f | wc -l`.

Did you use or see TAPAD in a paper, project, or app? Add it here!
- The Abuse Project
- (...)
The dataset is regularly updated and maintained by:
- Piyush Raj (@0x48piraj)
The textual data was collected from several sources, all of which are listed below:
- Offensive/Profane Word List from Luis von Ahn's Research Group at Carnegie Mellon University
- The Alphabet Of Swearing
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
To view a copy of this license, visit NC-SA 4.0 or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

