Because any method has to look at the training data at least once, NB's linear-time training can be said to have optimal time complexity. This efficiency is one reason why NB is a popular text classification method.
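To make the claim concrete, here is a rough accounting in the usual notation for multinomial NB (the symbols below are my own shorthand, not quoted from above): with $|\mathbb{D}|$ training documents of average length $L_{\text{ave}}$, $|\mathbb{C}|$ classes, and vocabulary $V$,

```latex
T_{\text{train}} = \Theta\bigl(|\mathbb{D}|\,L_{\text{ave}} + |\mathbb{C}|\,|V|\bigr),
\qquad
T_{\text{test}} = \Theta\bigl(|\mathbb{C}|\,M_a\bigr)
```

where $M_a$ is the number of distinct terms in the test document. Since $\Theta(|\mathbb{D}|\,L_{\text{ave}})$ is the time needed just to read the training data once, no learning method can be asymptotically faster.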
The answer is that even though the probability estimates of NB are of low quality, its classification decisions are surprisingly good: a correct decision only requires that the true class receive a higher score than the other classes, not that the estimated probabilities themselves be accurate.
NB’s main strength is its efficiency: Training and classification can be accomplished with one pass over the data. Because it combines efficiency with good accuracy, it is often used as a baseline in text classification research.
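A minimal sketch of this single-pass property in Python (the function and variable names are my own, and documents are assumed to be pre-tokenized lists of terms): training does nothing but count, and classification is one scan of the test document.

```python
import math
from collections import defaultdict

def train_nb(labeled_docs):
    """Single pass over (label, terms) pairs: only counting is needed."""
    class_count = defaultdict(int)                       # documents per class
    term_count = defaultdict(lambda: defaultdict(int))   # term counts per class
    vocab = set()
    for label, terms in labeled_docs:
        class_count[label] += 1
        for t in terms:
            term_count[label][t] += 1
            vocab.add(t)
    return class_count, term_count, vocab

def classify_nb(terms, class_count, term_count, vocab):
    """argmax_c of log P(c) + sum_t log P(t|c), with add-one smoothing."""
    n_docs = sum(class_count.values())
    best, best_score = None, float("-inf")
    for c, n_c in class_count.items():
        total_c = sum(term_count[c].values())
        score = math.log(n_c / n_docs)
        for t in terms:
            score += math.log((term_count[c].get(t, 0) + 1)
                              / (total_c + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

model = train_nb([("china", ["chinese", "beijing", "chinese"]),
                  ("japan", ["tokyo", "japan"])])
print(classify_nb(["chinese", "chinese", "beijing"], *model))  # china
```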
Thus, linear models in high-dimensional spaces are quite powerful despite their linearity. Nonlinear learning methods can model decision boundaries that are more complex than a hyperplane, but they are also more sensitive to noise in the training data. Nonlinear methods sometimes perform better if the training set is large, but by no means in all cases.
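To make "linear" concrete: over a bag-of-words representation, such a classifier is nothing more than a thresholded dot product. A tiny sketch (the weights and threshold below are invented for illustration):

```python
def linear_classify(doc_counts, weights, b):
    """True iff w . x > b: the document lies on the positive side of
    the hyperplane defined by (weights, b)."""
    score = sum(weights.get(term, 0.0) * count
                for term, count in doc_counts.items())
    return score > b

# Made-up weights for a "China" vs. "not China" decision.
w = {"beijing": 2.0, "chinese": 1.5, "tokyo": -2.0}
print(linear_classify({"chinese": 2, "beijing": 1}, w, b=0.0))  # True
```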
Under a definition of hard clustering that permits a document to belong to more than one cluster, the difference between hard and soft clustering is that membership values in hard clustering are either 0 or 1, whereas in soft clustering they can take on any non-negative value.
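A tiny worked example (the numbers are invented): with three documents and two clusters, a hard assignment uses only 0/1 memberships, possibly with several 1s per document, while a soft assignment distributes non-negative weight across clusters.

```python
# Rows: documents d1..d3; columns: clusters c1, c2.
hard = [[1, 0],      # d1 belongs to c1
        [0, 1],      # d2 belongs to c2
        [1, 1]]      # d3 belongs to both: multiple membership, still hard

soft = [[0.9, 0.1],  # d1 leans strongly toward c1
        [0.2, 0.8],
        [0.5, 0.5]]  # d3 is split evenly between the clusters
```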
Purity is a simple and transparent evaluation measure. Normalized mutual information can be information-theoretically interpreted. The Rand index penalizes both false positive and false negative decisions during clustering. The F measure in addition supports differential weighting of these two types of errors.
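All four measures are straightforward to compute from the two labelings. A sketch in Python (the example labelings and the beta parameter are assumptions of the illustration; NMI here is normalized by the mean of the two entropies):

```python
import math
from collections import Counter
from itertools import combinations

def purity(clusters, classes):
    """clusters, classes: parallel label lists, one entry per document."""
    joint = Counter(zip(clusters, classes))
    majority = Counter()
    for (k, j), cnt in joint.items():
        majority[k] = max(majority[k], cnt)   # majority class per cluster
    return sum(majority.values()) / len(clusters)

def pair_counts(clusters, classes):
    """Classify every document pair as TP / FP / FN / TN."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1   # clustered together, but different classes
        elif same_class:
            fn += 1   # same class, but split across clusters
        else:
            tn += 1
    return tp, fp, fn, tn

def rand_index(clusters, classes):
    tp, fp, fn, tn = pair_counts(clusters, classes)
    return (tp + tn) / (tp + fp + fn + tn)

def f_beta(clusters, classes, beta=1.0):
    """beta > 1 weights recall (false negatives) more heavily."""
    tp, fp, fn, _ = pair_counts(clusters, classes)
    p, r = tp / (tp + fp), tp / (tp + fn)
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

def nmi(clusters, classes):
    """Mutual information, normalized by the mean of the two entropies."""
    n = len(clusters)
    ck, cj = Counter(clusters), Counter(classes)
    joint = Counter(zip(clusters, classes))
    mi = sum(c / n * math.log(n * c / (ck[k] * cj[j]))
             for (k, j), c in joint.items())
    entropy = lambda cnts: -sum(c / n * math.log(c / n) for c in cnts.values())
    return mi / ((entropy(ck) + entropy(cj)) / 2)

clusters = ["A", "A", "A", "B", "B", "C"]
classes  = ["x", "x", "y", "x", "y", "y"]
print(purity(clusters, classes), rand_index(clusters, classes))
```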
Most problems that require the computation of a large number of dot products benefit from an inverted index. This is also the case for HAC clustering. Computational savings due to the inverted index are large if there are many zero similarities, either because many documents do not share any terms or because an aggressive stop list is used.
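A sketch of the idea in Python (the data layout is an assumption of the example): with postings lists mapping each term to the documents containing it, each document accumulates dot products only against documents it actually shares a term with, so pairs with zero similarity are never touched at all.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: {term: weight}} -> {term: [(doc_id, weight), ...]}"""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index

def similarities(docs):
    """Dot-product similarities for all document pairs sharing >= 1 term."""
    index = build_index(docs)
    sims = defaultdict(float)
    for term, postings in index.items():
        for i in range(len(postings)):
            for j in range(i + 1, len(postings)):
                (d1, w1), (d2, w2) = postings[i], postings[j]
                sims[(d1, d2)] += w1 * w2
    return sims  # pairs absent from sims have similarity 0

docs = {"d1": {"apple": 1.0, "pear": 2.0},
        "d2": {"pear": 1.0},
        "d3": {"car": 3.0}}          # shares no terms: never compared
print(dict(similarities(docs)))      # {('d1', 'd2'): 2.0}
```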