Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Jesse Dodge | Maarten Sap | Ana Marasović | William Agnew | Gabriel Ilharco | Dirk Groeneveld | Margaret Mitchell | Matt Gardner
Paper Details:
Month: November
Year: 2021
Location: Online and Punta Cana, Dominican Republic
Venue: EMNLP
Citations: No Citations Yet
URLs:
https://github.com/allenai/c4-
https://2020.emnlp.org/blog/2020-05-20-
https://c4-search.apps.allenai.org/
https://github.com/allenai/c4-
https://commoncrawl.org/
https://git.io/vSyEu
https://pypi.org/project/langdetect/
https://en.wikipedia.org/wiki/List_
https://spacy.io/api/tokenizer
https://pypi.org/
https://lite.ip2location.com/
https://en.wikipedia.org/wiki/List_
https://git.io/vSyEu
https://github.com/allenai/c4-
https://www.wired.com/story/ai-
https://patents.google.com/
https://patents.google.com/
https://github.com/nyu-
https://github.com/
https://github.com/
https://github.com/
https://raw.githubusercontent
https://github.com/
https://github.com/zdwls/
https://github.com/mcdm/
https://github.com/drwiner/
https://raw.githubusercontent
https://github.com/aEE25/
https://github.com/xiandong79/
https://www.nytimes.com
https://www.aljazeera.com
Field Of Study