Text as Data (H) - Lecture 01 Quiz

1. Tokenisation is simply the process of splitting text into words on space characters

2. A one-hot encoding records the frequency of each word in a piece of text

3. We must store the offsets of words in the vectors using a dictionary in order to implement one-hot encoding
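(For revision, questions 2 and 3 can be explored with a minimal sketch of one-hot encoding. This is an illustrative toy implementation, not the course's reference code; the function names are invented for this example.)

```python
def build_vocab(tokens):
    """Map each distinct word to a fixed offset in the vector space."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's offset.

    Note: the vector marks presence at one position only; it does not
    record how often the word occurs.
    """
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
# vocab maps: the -> 0, cat -> 1, sat -> 2, on -> 3, mat -> 4
```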

4. A cosine similarity of 1.0 means two texts:

5. Normalising the dot product in the cosine similarity function allows documents of different lengths to be compared more fairly
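(Questions 4 and 5 can be checked with a small sketch of cosine similarity over bag-of-words count vectors. This is one plausible implementation for revision purposes, not necessarily the one shown in lectures.)

```python
import math
from collections import Counter

def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between two bag-of-words count vectors.

    Dividing the dot product by the vector magnitudes means a long
    document and a short document with the same word proportions
    receive the same score.
    """
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

doc = "the cat sat".split()
# A document compared with three concatenated copies of itself still
# scores 1.0, because normalisation cancels the length difference.
print(cosine_similarity(doc, doc * 3))
```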

6. Why is stemming useful?
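(As a hint for question 6, a toy suffix-stripping stemmer illustrates the idea. This is deliberately crude and is not the Porter stemmer; the rules and function name are invented for this sketch.)

```python
def crude_stem(word):
    """Strip a common inflectional suffix so related word forms
    (e.g. 'connected', 'connecting', 'connection') collapse to a
    shared stem and are counted as the same term."""
    for suffix in ("ing", "ed", "ion", "s"):
        # Only strip if a reasonable-length stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```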