Text as Data (H) - Lecture 01 Quiz

1. Tokenisation is simply the process of splitting text into words on space characters

2. A one-hot encoding records the frequency of each word in a piece of text

3. We must store the offsets of words in the vectors using a dictionary in order to implement one-hot encoding
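(For revision, questions 2 and 3 can be explored with a minimal sketch of one-hot encoding. This is an illustrative toy implementation, not the course's reference code; the function names are invented for this example.)

```python
def build_vocab(tokens):
    """Map each distinct word to a fixed offset in the vector space."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's offset.

    Note: the vector marks presence at one position only; it does not
    record how often the word occurs.
    """
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
# vocab maps: the -> 0, cat -> 1, sat -> 2, on -> 3, mat -> 4
```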

4. A cosine similarity of 1.0 means two texts:

5. Normalising the dot product in the cosine similarity function allows documents of different lengths to be compared more fairly
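(Questions 4 and 5 can be checked with a small sketch of cosine similarity over bag-of-words count vectors. This is one plausible implementation for revision purposes, not necessarily the one shown in lectures.)

```python
import math
from collections import Counter

def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between two bag-of-words count vectors.

    Dividing the dot product by the vector magnitudes means a long
    document and a short document with the same word proportions
    receive the same score.
    """
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

doc = "the cat sat".split()
# A document compared with three concatenated copies of itself still
# scores 1.0, because normalisation cancels the length difference.
print(cosine_similarity(doc, doc * 3))
```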

6. Why is stemming useful?
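(As a hint for question 6, a toy suffix-stripping stemmer illustrates the idea. This is deliberately crude and is not the Porter stemmer; the rules and function name are invented for this sketch.)

```python
def crude_stem(word):
    """Strip a common inflectional suffix so related word forms
    (e.g. 'connected', 'connecting', 'connection') collapse to a
    shared stem and are counted as the same term."""
    for suffix in ("ing", "ed", "ion", "s"):
        # Only strip if a reasonable-length stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```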