Thursday, April 17, 2014

Week 15 reading note



The use of wildcards allows generalization beyond specific words, while contextual restrictions limit the wildcard-matching to entities related to the user's query.

The first type of wildcard matches entities of a specific type; we call these S-Wildcards (Specific entity type Wildcards). The second type of wildcard can match against all types of entities; we call these A-Wildcards (Any entity type Wildcards).

The ontology becomes a means of communication between the user and the system and helps overcome the bottlenecks in information access, which is primarily based on keyword searches. It supports information retrieval based on the actual content of a page and helps navigate the information space based on semantic concepts.

Monday, April 14, 2014

Week 14 muddiestPoint

Why are tools like stemming used differently in text categorization (TextCat) than in IR?

Thursday, April 10, 2014

Week 14 reading note



Because we have to look at the data at least once, NB can be said to have optimal time complexity. Its efficiency is one reason why NB is a popular text classification method.

The answer is that even though the probability estimates of NB are of low quality, its classification decisions are surprisingly good.

NB’s main strength is its efficiency: training and classification can be accomplished with one pass over the data. Because it combines efficiency with good accuracy, it is often used as a baseline in text classification research.
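To make the "one pass over the data" point concrete, here is a minimal multinomial Naive Bayes sketch in Python (my own illustration, not code from the readings; the tiny China/Japan corpus follows the style of the book's worked example). Training tallies counts in a single pass, and classification is a single pass over the query tokens:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (class_label, list_of_tokens). Returns priors and
    add-one-smoothed conditional term probabilities (multinomial NB)."""
    n_docs = len(docs)
    class_counts = Counter(c for c, _ in docs)
    token_counts = defaultdict(Counter)          # class -> term -> count
    vocab = set()
    for c, toks in docs:                         # single pass over the data
        token_counts[c].update(toks)
        vocab.update(toks)
    priors = {c: n / n_docs for c, n in class_counts.items()}
    cond = {}
    for c in class_counts:
        total = sum(token_counts[c].values())
        cond[c] = {t: (token_counts[c][t] + 1) / (total + len(vocab))
                   for t in vocab}
    return priors, cond

def classify_nb(priors, cond, tokens):
    """One pass over the query tokens: sum log probabilities per class."""
    scores = {}
    for c in priors:
        scores[c] = math.log(priors[c]) + sum(
            math.log(cond[c][t]) for t in tokens if t in cond[c])
    return max(scores, key=scores.get)

docs = [("china", "chinese beijing chinese".split()),
        ("china", "chinese chinese shanghai".split()),
        ("china", "chinese macao".split()),
        ("japan", "tokyo japan chinese".split())]
priors, cond = train_nb(docs)
print(classify_nb(priors, cond, "chinese chinese chinese tokyo japan".split()))  # china
```

Note that even though the estimated probabilities are crude, the argmax decision comes out right, which matches the point above about NB's surprisingly good classification decisions.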



Thus, linear models in high-dimensional spaces are quite powerful despite their linearity. Even more powerful nonlinear learning methods can model decision boundaries that are more complex than a hyperplane, but they are also more sensitive to noise in the training data. Nonlinear learning methods sometimes perform better if the training set is large, but by no means in all cases.


Monday, April 7, 2014

Tuesday, April 1, 2014

Week 13 reading note

      The user profiling process generally consists of three main phases.


     No matter which construction method is chosen, the profile must be kept current to reflect the user’s preferences accurately; this has proven to be a very challenging task.

Monday, March 31, 2014

Week 12 muddiestPoint


  • How can we tell from the graph in the slides that Darwish's Probabilistic Structured Query method is better than Pirkola's approach?
  • How to understand the relationship between CLIA, MT, and summarization?

Tuesday, March 25, 2014

Week 12 reading note



     Cross-language information retrieval (CLIR) addresses the problem of finding information in one language in response to queries expressed in another. CLIR is sometimes called “translingual information retrieval”. There are three major challenges in translation-based CLIR: what to translate, how to obtain translation knowledge, and how to apply the translation knowledge.

    Some natural language processing techniques, such as text preprocessing, are also involved.

     The evaluation of an interactive CLIR system can be modelled by examining how well the system supports (1) query formulation and translation, and (2) document selection and examination.

     Probably the most noticeable achievement in CLIR is that cross-language document ranking can often achieve nearly 100%, or even more, of the retrieval effectiveness of monolingual document ranking.

Week 11 muddiestPoint

In PageRank, if a page's number of out-links is very large, each out-link carries almost no weight, so the page is almost equivalent to a page with no out-links at all. How do we distinguish these two types of pages?
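One way to explore this question is with a small power-iteration PageRank sketch (my own illustration; the damping factor d = 0.85 and the uniform treatment of dangling pages are conventional choices, not from the slides). A page with many out-links still passes all of its rank on, split into tiny shares, whereas a dangling page's rank is redistributed uniformly over the whole graph:

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration with teleportation. 'links' maps page -> list of
    out-links; dangling pages (no out-links) spread their rank uniformly."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p in pages:
            out = links[p]
            if out:                       # many out-links: each gets a small share
                share = rank[p] / len(out)
                for q in out:
                    new[q] += d * share
            else:                         # dangling page: rank goes to everyone
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

# "hub" has many out-links; "c" has none (dangling).
ranks = pagerank({"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": []})
print(ranks)
```

The total rank always sums to 1, so the two cases differ in where the rank flows, not in whether it is lost.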
 

Monday, March 17, 2014

Week 11 reading note (IIR)



       After reading chapter 19, I learned about the early web search engines. Early attempts at making web information “discoverable” fell into two broad categories: (1) full-text index search engines such as AltaVista, Excite and Infoseek, and (2) taxonomies populated with web pages in categories, such as Yahoo! Web search engines also gave rise to a new field called search engine marketing: for advertisers, understanding how search engines do this ranking and how to allocate marketing campaign budgets to different keywords and to different sponsored search engines has become a profession known as search engine marketing (SEM).

      How do search engines differentiate themselves and grow their traffic? Here Google identified two principles that helped it grow at the expense of its competitors: (1) a focus on relevance, specifically precision rather than recall in the first few results; (2) a user experience that is lightweight, meaning that both the search query page and the search results page are uncluttered and almost entirely textual, with very few graphical elements. The effect of the first was simply to save users time in locating the information they sought. The effect of the second is to provide a user experience that is extremely responsive, or at any rate not bottlenecked by the time to load the search query or results page.

Week 8 muddiestPoint

There is no muddiest point for this week

Friday, February 28, 2014

Week 7 muddiestPoint


  • Why is it said that CG (cumulative gain) is not really sensitive to the ranking?
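A tiny numeric sketch of this point (my own illustration, not from the slides): CG is a plain sum of relevance gains, so reversing the ranking changes nothing, while DCG discounts each gain by the log of its rank and so does reward putting relevant documents first:

```python
import math

def cg(gains):
    """Cumulative gain: just the sum of the relevance grades."""
    return sum(gains)

def dcg(gains):
    """Discounted cumulative gain: gain at rank i divided by log2(i + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

good = [3, 2, 1, 0]      # relevant results ranked first
bad  = [0, 1, 2, 3]      # same documents, reversed order
print(cg(good), cg(bad))     # identical: CG ignores the ranking
print(dcg(good) > dcg(bad))  # True: DCG rewards relevant docs at the top
```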

Week 8 reading notes (MIR)

I think this topic, user interface and visualization, is related to another course I took, “Interactive System Design”. Most accounts of the information access process assume an interaction cycle consisting of query specification, receipt and examination of retrieval results, and then either stopping or reformulating the query and repeating the process until a satisfactory result set is found. In more detail, the standard process can be described as the following sequence of steps:





The berry-picking model illustrates that the information-seeking process consists of a series of interconnected but diverse searches on one problem-based theme. We find it convenient to divide the entire information access process into two main components: search/retrieval, and analysis/synthesis of results. User interfaces should allow both kinds of activity to be tightly interwoven. There are four main types of starting points: lists, overviews, examples, and automated source selection.
Shneiderman identifies five primary human-computer interaction styles. These are: command language, form fill-in, menu selection, direct manipulation, and natural language.
In systems with statistical ranking, a numerical score or percentage is also often shown alongside the title, where the score indicates a computed degree of match or probability of relevance. This kind of information is sometimes referred to as a document surrogate.
User interfaces for information access in general do not do a good job of supporting strategies, or even of sequences of movements from one operation to the next.

Thursday, February 20, 2014

Week 6 muddiestPoint


  • Why does the BIR model's ignoring of term frequency and document length make it unsuitable for full-text retrieval?
  • Recall relates to the relevant documents, but why does the slide say “Recall is the kitchen sink – you try to get all the relevant documents possible (understanding that you may get many non-relevant documents as well)”? It seems that understanding should relate to the non-relevant documents you retrieve along the way.

Week 7 reading notes (IIR)



Relevance feedback can improve both recall and precision. But, in practice, it has been shown to be most useful for increasing recall in situations where recall is important. This is partly because the technique expands the query, but it is also partly an effect of the use case: when they want high recall, users can be expected to take time to review results and to iterate on the search.
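As a concrete sketch of how relevance feedback expands the query, here is a minimal Rocchio update over sparse term-weight vectors (my own illustration, not code from the book; alpha, beta, gamma are conventional default weights):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query expansion over sparse term->weight dicts:
    q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant),
    with negative weights clipped to zero."""
    terms = set(query)
    for d in relevant + nonrelevant:
        terms.update(d)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_q[t] = max(w, 0.0)
    return new_q

q = {"apple": 1.0}
rel = [{"apple": 0.5, "fruit": 1.0}]       # judged relevant
nonrel = [{"apple": 0.2, "iphone": 1.0}]   # judged non-relevant
new_q = rocchio(q, rel, nonrel)
print(new_q)  # "fruit" now has positive weight: the query was expanded
```

The expansion of the query is visible directly: terms from relevant documents (here "fruit") enter the query with positive weight, which is exactly the mechanism that tends to raise recall.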

There is some subtlety to evaluating the effectiveness of relevance feedback in a sound and enlightening way.



1. The obvious first strategy is to start with an initial query q0 and to compute a precision-recall graph.

2. A second idea is to use documents in the residual collection (the set of documents minus those assessed relevant) for the second round of evaluation.

3. A third method is to have two collections, one which is used for the initial query and relevance judgments, and the second that is then used for comparative evaluation.

Overall, query expansion is less successful than relevance feedback, though it may be as good as pseudo-relevance feedback. It does, however, have the advantage of being much more understandable to the system user.

Friday, February 14, 2014

Week 5 muddiestPoint


  • How to understand how the Maximum Likelihood Estimate contributes to the estimation?
  • How to choose the constant used for smoothing?

Week 6 reading notes (IIR)



    The information need is a little different from the query; the query is more like an SQL query in a database. There are several kinds of standard test collections, such as the Cranfield collection, TREC, GOV2, NTCIR, CLEF, REUTERS, and 20 Newsgroups. Precision (P) is the fraction of retrieved documents that are relevant. Recall (R) is the fraction of relevant documents that are retrieved.

    Examining the entire precision-recall curve is very informative, but there is often a desire to boil this information down to a few numbers, or perhaps even a single number. The traditional way of doing this is the 11-point interpolated average precision. In recent years, other measures have become more common. Most standard among the TREC community is Mean Average Precision (MAP), which provides a single-figure measure of quality across recall levels. An ROC curve plots the true positive rate or sensitivity against the false positive rate or (1 − specificity). Here, sensitivity is just another term for recall. The false positive rate is given by fp / (fp + tn).
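These definitions are easy to check on a toy ranking. Below is my own small sketch of precision, recall, and average precision (MAP is then just the mean of average precision over a set of queries):

```python
def precision_recall(retrieved, relevant):
    """Precision = tp / retrieved, recall = tp / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    return tp / len(retrieved), tp / len(relevant)

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
p, r = precision_recall(ranking[:3], relevant)
print(p, r)                                   # 0.666..., 1.0
print(average_precision(ranking, relevant))   # (1/1 + 2/3) / 2 = 0.833...
```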

    I also find that evaluating an IR system is related to interactive systems design. We also need to take user utility into consideration. A/B tests are easy to deploy, easy to understand, and easy to explain to management. Dynamic summaries are generally regarded as greatly improving the usability of IR systems, but they present a complication for IR system design. A dynamic summary cannot be precomputed, but, on the other hand, if a system has only a positional index, then it cannot easily reconstruct the context surrounding search engine hits in order to generate such a dynamic summary. This is one reason for using static summaries.

Friday, February 7, 2014

Week 4 muddiestPoint


  • How to understand the difference between document frequency and collection frequency?
  • Which skill can come up with a query that produces a manageable number of hits?

Wednesday, February 5, 2014

Week 5 reading notes (IIR)



    When I read the part about the third classic IR model, the probabilistic model, I encountered many new terms. The obvious order in which to present documents to the user is to rank them by their estimated probability of relevance with respect to the information need: P(R = 1|d, q). This is the basis of the Probability Ranking Principle (PRP).

    We also use some probabilistic tools to illustrate and improve the probabilistic model, such as the Binary Independence Model (BIM). The resulting quantity used for ranking is called the Retrieval Status Value (RSV), estimated via the maximum likelihood estimate (MLE) or the maximum a posteriori (MAP) estimate. Length normalization of the query is unnecessary because retrieval is being done with respect to a single fixed query.
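A sketch of the RSV computation (my own illustration, not the book's code): using the common smoothed idf-style estimate of the BIM term weight, a document's RSV is the sum of the weights c_t of the query terms it contains:

```python
import math

def rsv(doc_terms, query_terms, df, N):
    """Binary Independence Model ranking with the common smoothed estimate
    c_t = log((N - df_t + 0.5) / (df_t + 0.5)); RSV_d sums c_t over the
    query terms that occur in the document."""
    return sum(math.log((N - df[t] + 0.5) / (df[t] + 0.5))
               for t in query_terms if t in doc_terms)

# Hypothetical collection of N = 10 docs with two query terms.
df = {"rare": 1, "common": 8}
print(rsv({"rare"}, ["rare", "common"], df, 10))    # high: rare term matched
print(rsv({"common"}, ["rare", "common"], df, 10))  # low (negative): common term
```

Matching a rare term contributes a large positive weight, while matching a very common term can even contribute a negative one, which is the idf-like behaviour the BIM derivation produces.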

Wednesday, January 29, 2014

Week 3 muddiestPoint



  • Why, in slide 54, based on the input elk, hog, bee, fox, cat, gnu, ant, dog, do we build a binary tree structure like that? It is not alphabetical.
  • How to use edit distance to make correction suggestions on query terms?


   

Week 4 reading notes (IIR)

There are some alternative techniques for storing the postings lists; however, such alternative techniques are difficult to combine with postings-list compression of that sort. Moreover, standard postings-list intersection operations remain necessary when both terms of a query are very common.

The algorithm for computing the weighted zone score from two postings lists is really similar to the algorithm for intersecting two postings lists.
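To show that similarity, here is my own sketch assuming two zones (title and body) with fixed zone weights g: the two sorted docID lists are walked in step exactly as in plain intersection, but instead of emitting only common docIDs we accumulate a score for every document seen:

```python
def weighted_zone_score(post_title, post_body, g_title=0.7, g_body=0.3):
    """Walk two sorted docID postings lists in step (as in intersection)
    and add g_title if the term occurs in the title zone of a document,
    g_body if it occurs in the body zone."""
    scores = {}
    i = j = 0
    while i < len(post_title) and j < len(post_body):
        if post_title[i] == post_body[j]:      # term in both zones
            scores[post_title[i]] = g_title + g_body
            i += 1; j += 1
        elif post_title[i] < post_body[j]:     # title zone only
            scores[post_title[i]] = g_title
            i += 1
        else:                                  # body zone only
            scores[post_body[j]] = g_body
            j += 1
    for d in post_title[i:]:                   # drain the leftovers
        scores[d] = g_title
    for d in post_body[j:]:
        scores[d] = g_body
    return scores

print(weighted_zone_score([1, 3], [2, 3]))
```

The only change from pure intersection is the score bookkeeping; the two-pointer merge over sorted docIDs is identical.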

Monday, January 20, 2014

Week 3 reading notes (IIR)

I mainly focus on these two indexing methods on a single machine: one is blocked sort-based indexing (BSBI) and the other is single-pass in-memory indexing (SPIMI). Compared with SPIMI, blocked sort-based indexing has excellent scaling properties, but it needs a data structure for mapping terms to termIDs. For very large collections, this data structure does not fit into memory. SPIMI can index collections of any size as long as there is enough disk space available.
Also, even when we store the postings lists, we may waste some memory; however, the overall memory requirements for the dynamically constructed index of a block in SPIMI are still lower than in BSBI.

Week 2 muddiestPoint

When we treat every character n-gram as a term, how can we decide the number n so as to make our indexing more efficient? When a query comes, for example the wildcard term h*llo (h*llo$ with the end marker), why is it rotated to llo$h*? Why do we need to adjust the order of the characters?
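As a sketch of the rotation in question (my own illustration): in a permuterm index every rotation of term$ is stored as a key, and rotating the wildcard query so that the * comes last turns a wildcard match into a simple prefix lookup over those rotations:

```python
def permuterm_rotations(term):
    """All rotations of term + '$' end marker; each rotation is a key in
    the permuterm index pointing back to the original term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def wildcard_key(query):
    """Rotate a single-* wildcard query so the * comes last, then drop it;
    the result is the prefix to look up in the permuterm index."""
    q = query + "$"
    star = q.index("*")
    rotated = q[star + 1:] + q[:star + 1]   # h*llo$ -> llo$h*
    return rotated[:-1]                      # prefix llo$h for the lookup

print(permuterm_rotations("hello"))  # includes 'llo$he'
print(wildcard_key("h*llo"))         # llo$h
```

So the character order is adjusted precisely so that everything known in the query becomes a contiguous prefix, which B-tree-style term dictionaries can match efficiently.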

Sunday, January 12, 2014

week 1 reading note

     When I read about the evolving process of information retrieval systems, I find their impact on computer science has become more important. I also gained a basic understanding of IR systems through the definition: the primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible. The notion of relevance is of central importance in IR. A data retrieval system, such as a relational database, deals with data that has a well-defined structure and semantics, while an IR system deals with natural language text, which is not well structured. The development of the Web has also boosted the development of IR systems, while bringing new challenges at the same time.

Friday, January 10, 2014

Week1 muddiestPoint

Considering the various types of data, there are some differences between databases and IR systems. What about special databases, like NoSQL databases? What is the difference between those databases and an IR system?

week 2 reading note (IIR)

        When I finished reading the first three chapters of the book “An Introduction to Information Retrieval”, I had a brief impression of inverted indexes for handling Boolean and proximity queries. I learned the basic steps in inverted index construction.
      1. Collect the documents to be indexed.
      2. Tokenize the text.
      3. Do linguistic preprocessing of tokens.
      4. Index the documents that each term occurs in.
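The four steps above can be sketched in a few lines of Python (my own toy example, not code from the book; lowercasing and stripping punctuation stand in for the linguistic preprocessing step):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build a term -> sorted list of docIDs mapping from {docID: text}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():          # 1. collect the documents
        for token in text.split():             # 2. tokenize the text
            term = token.lower().strip(".,")   # 3. linguistic preprocessing
            index[term].add(doc_id)            # 4. index each term's docs
    return {t: sorted(ids) for t, ids in index.items()}

docs = {1: "new home sales top forecasts",
        2: "home sales rise in July",
        3: "increase in home sales in July"}
index = build_inverted_index(docs)
print(index["home"])                                     # [1, 2, 3]
print(sorted(set(index["home"]) & set(index["july"])))   # [2, 3]
```

The sorted postings lists are what make Boolean AND queries a cheap intersection, as shown in the last line.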
        I have a basic idea that an IR system handling documents written in different languages is much like the process of learning and reciting words: we have to use some methods to remember words efficiently, just as a search engine uses methods such as stemming and lemmatization. We also have to take into consideration the different constructions of languages in different countries.
And I also learned that we have to compare many data structures to make the process efficient. To implement spelling correction, we also need a dynamic programming algorithm for computing the edit distance between two strings.
       I found something really related to the course “Data Analytic” I took last term.
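The edit-distance computation mentioned above can be sketched with the classic dynamic-programming (Levenshtein) recurrence (my own illustration, not code from the book):

```python
def edit_distance(a, b):
    """dp[i][j] = minimum number of insertions, deletions, and
    substitutions needed to turn a[:i] into b[:j]."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                       # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j                       # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1]

print(edit_distance("cat", "dog"))          # 3
print(edit_distance("sunday", "saturday"))  # 3
```

For spelling correction, a system can compute this distance between a misspelled query term and candidate dictionary terms and suggest the closest ones.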