The true believer

Thursday, April 17, 2014

Week 15 reading note

The use of wildcards allows generalization beyond specific words, while contextual restrictions limit the wildcard-matching to entities related to the user's query.

The first type of wildcard matches entities of a specific type. We call these S-Wildcards (Specific entity type Wildcards). The second type of wildcard can match entities of any type. We call these A-Wildcards (Any entity type Wildcards).
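To make the two wildcard types concrete for myself, here is a minimal sketch. The `Entity` and `Wildcard` classes, the entity type names, and the `match_all` helper are all my own illustrative assumptions, not anything defined in the reading, and the contextual restrictions mentioned above are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    """A recognized entity with a name and an entity type (hypothetical types)."""
    name: str
    etype: str


@dataclass(frozen=True)
class Wildcard:
    """etype set -> S-Wildcard (specific type); etype None -> A-Wildcard (any type)."""
    etype: Optional[str] = None

    def matches(self, entity: Entity) -> bool:
        # An A-Wildcard accepts everything; an S-Wildcard requires the same type.
        return self.etype is None or entity.etype == self.etype


def match_all(wildcard: Wildcard, entities: List[Entity]) -> List[str]:
    """Return the names of all entities the wildcard can match."""
    return [e.name for e in entities if wildcard.matches(e)]


entities = [
    Entity("Pittsburgh", "City"),
    Entity("Alan Turing", "Person"),
    Entity("IBM", "Organization"),
]

s_wildcard = Wildcard("Person")  # S-Wildcard restricted to Person entities
a_wildcard = Wildcard()          # A-Wildcard, matches any entity type

print(match_all(s_wildcard, entities))
print(match_all(a_wildcard, entities))
```

The S-Wildcard keeps only the Person entity, while the A-Wildcard keeps all three.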

The ontology becomes a means of communication between the user and the system and helps overcome the bottlenecks in information access, which is primarily based on keyword searches. It supports information retrieval based on the actual content of a page and helps navigate the information space based on semantic concepts.

Monday, April 14, 2014

Week 14 muddiestPoint

Why do tools like stemming differ between TextCat and IR?

Thursday, April 10, 2014

Week 14 reading note

Because we have to look at the data at least once, NB can be said to have optimal time complexity. Its efficiency is one reason why NB is a popular text classification method.

The answer is that even though the probability estimates of NB are of low quality, its classification decisions are surprisingly good.

NB’s main strength is its efficiency: Training and classification can be accomplished with one pass over the data. Because it combines efficiency with good accuracy it is often used as a baseline in text classification research.
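To see why training needs only one pass, here is a rough sketch of multinomial NB with add-one smoothing. The function names `train_nb` and `classify_nb` and the toy documents are my own assumptions for illustration; this is a sketch, not the exact algorithm as written in the reading.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (tokens, label). A single pass collects every count needed."""
    class_docs = Counter()              # number of documents per class
    term_counts = defaultdict(Counter)  # term frequencies per class
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab


def classify_nb(model, tokens):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_docs, term_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / n_docs)
        total = sum(term_counts[label].values())
        for t in tokens:
            if t in vocab:  # terms unseen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set (made up for illustration).
docs = [
    (["chinese", "beijing", "chinese"], "china"),
    (["chinese", "chinese", "shanghai"], "china"),
    (["chinese", "macao"], "china"),
    (["tokyo", "japan", "chinese"], "other"),
]
model = train_nb(docs)
print(classify_nb(model, ["chinese", "chinese", "chinese", "tokyo", "japan"]))
```

Only the argmax over classes matters here, which fits the point above: the scores can be poor probability estimates and the decision can still be right.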

Thus, linear models in high-dimensional spaces are quite powerful despite their linearity. Even more powerful nonlinear learning methods can model decision boundaries that are more complex than a hyperplane, but they are also more sensitive to noise in the training data. Nonlinear learning methods sometimes perform better if the training set is large, but by no means in all cases.

Week 13 muddiestPoint

How to compute a centroid vector for each cluster/aspect?
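My tentative understanding, assuming each cluster/aspect is just a set of equal-length document vectors (for example tf-idf weights), is that the centroid is the component-wise mean of those vectors. The `centroid` function below is my own sketch of that reading, not something taken from the slides.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot compute the centroid of an empty cluster")
    n = len(vectors)
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dims)]


# Two made-up document vectors belonging to one cluster.
cluster = [[1.0, 0.0], [3.0, 2.0]]
print(centroid(cluster))
```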

Tuesday, April 1, 2014

Week 13 reading note

The user profiling process generally consists of three main phases.

No matter which construction method is chosen, the profile must be kept current to reflect the user’s preferences accurately; this has proven to be a very challenging task.

Week 12 muddiestPoint

How can we tell from the graph in the slides that Darwish's Probabilistic Structured Query is better than Pirkola's approach?
How should the relationship between CLIA, MT, and summarization be understood?

Tuesday, March 25, 2014

Week 12 reading note

Cross-language information retrieval (CLIR) addresses the problem of finding information in one language in response to queries expressed in another. CLIR is sometimes called "translingual information retrieval". There are three major challenges in translation-based CLIR: what to translate, how to obtain translation knowledge, and how to apply the translation knowledge.

There are also some natural language processing techniques involved, such as text processing.

The evaluation of an interactive CLIR system can be modelled by examining how well the system supports: (1) query formulation and translation, and (2) document selection and examination.

Probably the most noticeable achievement in CLIR is that cross-language document ranking can often achieve near 100%, or even higher, of the retrieval effectiveness of monolingual document ranking.