dataset

stackexchange_flattened

midwestern-simulation


Overview

Name

Source

midwestern-simulation

Episodes

Robot count

Format

parquet

Description

7.6M threads of posts, answers, and comments from Stack Exchange (omitting Stack Overflow). With the Llama 2 tokenizer (32k vocab), this should come out to ~7.94 GT.
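
As a rough sanity check on the figures above, the two numbers imply an average thread length of about a thousand tokens. The sketch below assumes "GT" means billions of tokens and treats both figures as approximate; the function name is illustrative only.

```python
# Back-of-the-envelope estimate from the card's own figures.
# Assumption: "GT" is read as gigatokens (1e9 tokens).
TOTAL_TOKENS = 7.94e9   # ~7.94 GT with the Llama 2 tokenizer
NUM_THREADS = 7.6e6     # ~7.6M threads


def avg_tokens_per_thread(total_tokens: float, num_threads: float) -> float:
    """Return the mean number of tokens per thread."""
    return total_tokens / num_threads


if __name__ == "__main__":
    avg = avg_tokens_per_thread(TOTAL_TOKENS, NUM_THREADS)
    print(f"~{avg:.0f} tokens per thread on average")
```

Since both inputs are rounded, the result is only indicative of scale, not an exact per-thread count.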

Robots used

null

Links

HuggingFace dataset

midwestern-simulation/stackexchange_flattened
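
A minimal sketch of how one might peek at a few records with the Hugging Face `datasets` library, streaming so the full parquet data is not downloaded. The split name `"train"` is an assumption, the column layout is not documented here, and `peek_threads` is a hypothetical helper, so inspect the first record before relying on any field names.

```python
from itertools import islice

REPO_ID = "midwestern-simulation/stackexchange_flattened"


def peek_threads(n: int = 3, split: str = "train"):
    """Return the first n records of the dataset as a list of dicts.

    Assumption: a split called "train" exists; adjust if the repo differs.
    """
    # Imported lazily so this module can be loaded without `datasets` installed.
    from datasets import load_dataset

    # streaming=True yields records lazily instead of downloading everything.
    stream = load_dataset(REPO_ID, split=split, streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    for record in peek_threads(1):
        # Print the field names only, since the schema is not stated on the card.
        print(sorted(record.keys()))
```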