Moreover, notice that all of the data types included in the TIMIT corpus fall into the two basic categories of lexicon and text, which we will discuss below.

Even the speaker demographics data is just another instance of the lexicon data type.
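
To make the two categories concrete, here is a minimal Python sketch. It assumes a lexicon can be modeled as a dictionary of records looked up by key, and a text as an ordered sequence of tokens. All names and values (`pronunciations`, `speakers`, `spk01`, the `lookup` helper, and so on) are hypothetical and are not taken from the TIMIT distribution.

```python
# Hypothetical data for illustration only; not taken from the TIMIT corpus.

# A lexicon is a collection of records, each looked up by a key.
pronunciations = {
    "she": ["sh", "iy"],
    "had": ["hh", "ae", "d"],
}

# Speaker demographics has the same shape: one record per speaker ID.
speakers = {
    "spk01": {"sex": "F", "region": "north"},
    "spk02": {"sex": "M", "region": "south"},
}

# A text is organized by order rather than by key: a sequence of tokens.
text = ["she", "had"]


def lookup(lexicon, key):
    """Return the record stored under key, or None if the key is absent."""
    return lexicon.get(key)


# The two types connect when each token of a text is looked up in a lexicon.
transcription = [phone for word in text for phone in lookup(pronunciations, word)]
```

The same `lookup` function serves both the pronunciation table and the speaker table, which is the sense in which speaker demographics can be treated as a lexicon.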

As in other chapters, there will be many examples drawn from practical experience managing linguistic data, including data that has been collected in the course of linguistic fieldwork, laboratory work, and web crawling.

The TIMIT corpus of read speech was the first annotated speech database to be widely distributed, and it has an especially clear organization.

Structured collections of annotated linguistic data are essential in most areas of NLP; however, we still face many obstacles in using them.