dataset
stackexchange_flattened
midwestern-simulation
Overview
Name
stackexchange_flattened
Source
midwestern-simulation
Episodes
0
Robot count
0
Format
parquet
Description
7.6M threads of posts, answers, and comments from Stack Exchange (omitting Stack Overflow).
With the Llama 2 tokenizer (32k vocabulary), this should come out to roughly 7.94 billion tokens (~7.94 GT).
Robots used
null
Links
HuggingFace dataset
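As a rough sanity check on the figures in the description, the sketch below derives the average thread length implied by ~7.6M threads and ~7.94 GT. It also estimates the size of the tokenized corpus under the assumption (not stated in the card) that token ids are stored as 16-bit integers, which a 32k vocabulary permits. The function names are hypothetical and the inputs are the approximate numbers quoted above, so the results are estimates only.

```python
# Approximate figures quoted in the dataset description.
THREADS = 7.6e6        # ~7.6M threads
TOTAL_TOKENS = 7.94e9  # ~7.94 GT with the Llama 2 tokenizer (32k vocab)


def mean_tokens_per_thread(total_tokens: float, threads: float) -> float:
    """Average number of tokens per thread implied by the two totals."""
    return total_tokens / threads


def token_storage_gb(total_tokens: float, bytes_per_token: int = 2) -> float:
    """Size in GB (10**9 bytes) of the raw token ids.

    Assumption: ids are stored as uint16 (2 bytes each). A 32k vocabulary
    fits in 16 bits, but the card does not say how tokens would be stored.
    """
    return total_tokens * bytes_per_token / 1e9


avg = mean_tokens_per_thread(TOTAL_TOKENS, THREADS)
size_gb = token_storage_gb(TOTAL_TOKENS)

print(f"mean tokens per thread: {avg:.0f}")
print(f"uint16 token storage:   {size_gb:.2f} GB")
```

Since both inputs are rounded, the derived values are best read as order-of-magnitude guidance, for example when choosing a context length or budgeting disk space for a pre-tokenized copy.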