The Broad Twitter Corpus is a diverse named entity recognition resource designed to address the bias of existing datasets toward news media. It highlights the varied types of entities found in social media, the challenges of annotating them, and the advantages of using diverse crowdsourcing methods to improve recall. The resulting dataset is more representative, capturing temporal and spatial variation in language use across different anglophone regions.