An interesting property of this collection is its time dimension:

Many text corpora contain linguistic annotations, representing POS tags, named entities, syntactic structures, semantic roles, and so forth. NLTK provides convenient ways to access several of these corpora, and has data packages containing corpora and corpus samples, freely downloadable for use in teaching and research. For information about downloading them, see

: Cumulative Word Length Distributions: Six translations of the Universal Declaration of Human Rights are processed; this graph shows that words having 5 or fewer letters account for about 80% of Ibibio text, 60% of German text, and 25% of Inuktitut text.
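The cumulative percentages described in the caption can be obtained by counting word lengths and summing the counts up to each length. The following is a minimal sketch in plain Python, not the code behind the figure: the function name `cumulative_length_percentages`, the `max_length` parameter, and the sample sentence are illustrative assumptions.

```python
from collections import Counter
from itertools import accumulate


def cumulative_length_percentages(words, max_length=10):
    """Map each n in 1..max_length to the percentage of words with n or fewer letters.

    Words longer than max_length still count toward the total, so the last
    value can be below 100.
    """
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    if total == 0:
        return {n: 0.0 for n in range(1, max_length + 1)}
    # Running sum of the counts for lengths 1, 2, ..., max_length.
    running = accumulate(counts.get(n, 0) for n in range(1, max_length + 1))
    return {n: 100.0 * c / total for n, c in enumerate(running, start=1)}


# Hypothetical sample text, split on whitespace for simplicity.
sample = "all human beings are born free and equal in dignity and rights".split()
percentages = cumulative_length_percentages(sample)
print(percentages[5])  # share of words with 5 or fewer letters
```

Plotting these values against word length, once per language, would give one curve per translation.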

As just mentioned, a text corpus is a large body of text.

Many corpora are designed to contain a careful balance of material in one or more genres.

For convenience, the corpus methods accept a single fileid or a list of fileids.

Similarly, we can specify the words or sentences we want in terms of files or categories.
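One way to picture how a corpus method can accept either a single fileid or a list of fileids, and select material by file or by category, is the toy class below. It is a sketch under stated assumptions: `TinyCorpus`, its file names, and its category labels are all hypothetical. The class only mimics the shape of such methods and is not NLTK's implementation.

```python
class TinyCorpus:
    """Toy in-memory corpus; a hypothetical stand-in, not NLTK's reader."""

    def __init__(self, texts, categories):
        self._texts = texts            # fileid -> raw text
        self._categories = categories  # fileid -> set of category labels

    def fileids(self, categories=None):
        """All fileids, or only those in the given category or categories."""
        if categories is None:
            return sorted(self._texts)
        wanted = {categories} if isinstance(categories, str) else set(categories)
        return sorted(f for f, cats in self._categories.items() if cats & wanted)

    def words(self, fileids=None, categories=None):
        """Words from the selected files, using whitespace tokenization."""
        if fileids is not None and categories is not None:
            raise ValueError("specify fileids or categories, not both")
        if categories is not None:
            fileids = self.fileids(categories)
        elif fileids is None:
            fileids = self.fileids()
        elif isinstance(fileids, str):
            fileids = [fileids]  # a single fileid is treated as a one-item list
        return [w for f in fileids for w in self._texts[f].split()]


corpus = TinyCorpus(
    texts={"a.txt": "rain fell", "b.txt": "markets rose", "c.txt": "snow and trade"},
    categories={"a.txt": {"weather"}, "b.txt": {"finance"}, "c.txt": {"weather", "finance"}},
)
print(corpus.words("a.txt"))               # one file
print(corpus.words(["a.txt", "b.txt"]))    # a list of files
print(corpus.words(categories="finance"))  # every file in a category
```

Normalizing a lone string into a one-item list at the top of the method keeps the rest of the logic identical for both calling styles.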

Unfortunately, for many languages, substantial corpora are not yet available.

Often there is insufficient government or industrial support for developing language resources, and individual efforts are piecemeal and hard to discover or re-use.

The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form "User NNN", and manually edited to remove any other identifying information.

The corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom).

Some languages have no established writing system, or are endangered.

(See 7 for suggestions on how to locate language resources.) We have seen a variety of corpus structures so far; these are summarized in 1.3.

: Common Structures for Text Corpora: The simplest kind of corpus is a collection of isolated texts with no particular organization; some corpora are structured into categories like genre (Brown Corpus); some categorizations overlap, such as topic categories (Reuters Corpus); other corpora represent language use over time (Inaugural Address Corpus).
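The temporal structure can be illustrated with a short sketch that groups files by decade. It assumes, purely for illustration, that each fileid begins with a four-digit year; the file names and the helper `by_decade` are hypothetical.

```python
from collections import defaultdict

# Hypothetical fileids; the "YEAR-name.txt" naming scheme is an assumption.
fileids = ["1801-second.txt", "1789-first.txt", "1805-third.txt", "1793-other.txt"]


def by_decade(fileids):
    """Group fileids by the decade encoded in their first four characters."""
    groups = defaultdict(list)
    for fileid in sorted(fileids):
        year = int(fileid[:4])
        groups[year - year % 10].append(fileid)
    return dict(groups)


timeline = by_decade(fileids)
for decade in sorted(timeline):
    print(decade, timeline[decade])
```

A corpus with overlapping categories could be modeled in a similar way by mapping each fileid to a set of labels instead of a single key.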

