The true believer: January 2014

Wednesday, January 29, 2014

Week 3 muddiestPoint

Why, on slide 54, given the input elk, hog, bee, fox, cat, gnu, ant, dog, do we build a binary tree structure like that? It is not alphabetical.
How can we use edit distance to make correction suggestions for query terms?

There are some alternative techniques for storing postings lists; however, such techniques are difficult to combine with postings list compression. Moreover, standard postings list intersection operations remain necessary when both terms of a query are very common.

The algorithm for computing the weighted zone score from two postings lists is very similar to the algorithm for intersecting two postings lists.
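To make the similarity concrete, here is a minimal sketch of both walks side by side. The data layout is my own assumption, not the one from the slides: each posting is a (docID, set of zones) pair, the weights g are a plain dictionary, and the query is a two-term AND that counts a zone only when both terms occur in it.

```python
def intersect(p1, p2):
    """Return the docIDs that occur in both sorted postings lists."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def zone_score(p1, p2, g):
    """Same two-pointer walk, but each posting is (doc_id, zones).

    g maps a zone name to its weight (hypothetical layout). A zone
    contributes its weight only if both terms occur in that zone.
    """
    i = j = 0
    scores = {}
    while i < len(p1) and j < len(p2):
        doc1, zones1 = p1[i]
        doc2, zones2 = p2[j]
        if doc1 == doc2:
            # The only real difference from intersect(): score, not append.
            scores[doc1] = sum(g[z] for z in zones1 & zones2)
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return scores
```

The loop structure is identical; only the action taken when the two docIDs match changes.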

Week 3 reading notes (IIR)

I mainly focus on two indexing methods for a single machine: blocked sort-based indexing (BSBI) and single-pass in-memory indexing (SPIMI). BSBI has excellent scaling properties, but it needs a data structure for mapping terms to termIDs, and for very large collections this data structure does not fit into memory. SPIMI uses terms instead of termIDs, so it can index collections of any size as long as there is enough disk space available.

Also, SPIMI may waste some memory when it allocates space for postings lists; however, the overall memory requirements for the dynamically constructed index of a block in SPIMI are still lower than in BSBI.
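A rough sketch of the SPIMI idea, under simplifying assumptions of my own: the token stream is a list of (term, docID) pairs, blocks are kept in memory instead of being written to disk, and the merge is a plain dictionary merge rather than a streaming merge over files. The function names are hypothetical.

```python
def spimi_invert(token_stream):
    """Build the index for one block directly from (term, doc_id) pairs.

    Postings are added straight to the term's list, so no term-to-termID
    mapping is needed. Terms are sorted only once, at the end of the block.
    """
    dictionary = {}
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        # Skip repeats of the same document for the same term.
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return sorted(dictionary.items())


def merge_blocks(blocks):
    """Merge several block indexes into one (in memory, for illustration)."""
    merged = {}
    for block in blocks:
        for term, postings in block:
            merged.setdefault(term, []).extend(postings)
    return {term: sorted(set(p)) for term, p in sorted(merged.items())}
```

For example, `merge_blocks([spimi_invert(block1), spimi_invert(block2)])` gives one dictionary from terms to sorted docID lists.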

Week 2 muddiestPoint

When we treat every character n-gram as a term, how can we decide the number n to make our indexing more efficient? When a query comes in, for example the query term h*llo$, why should it be changed to llo$h*? Why do we need to adjust the order of the characters?
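One way to see the reason for the rotation, assuming the slide is describing a permuterm index: every rotation of term + "$" is indexed, and the query is rotated so that the * ends up at the end, which turns the wildcard query into a simple prefix lookup. The sketch below handles only a single * and the function names are my own.

```python
def permuterm_rotations(term):
    """All rotations of term + '$', as a permuterm index would store them."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]


def rotate_query(query):
    """Rotate a query with one '*' so that the '*' moves to the end."""
    augmented = query + "$"
    star = augmented.index("*")
    return augmented[star + 1:] + augmented[:star + 1]


def matches(query, term):
    """True if some rotation of the term starts with the rotated query prefix."""
    prefix = rotate_query(query)[:-1]  # drop the trailing '*'
    return any(r.startswith(prefix) for r in permuterm_rotations(term))
```

Here `rotate_query("h*llo")` gives `"llo$h*"`, and the rotation `"llo$he"` of `"hello$"` starts with `"llo$h"`, so hello matches.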

Sunday, January 12, 2014

week 1 reading note

When I read about how information retrieval systems evolved, I found that their impact on computer science has become more important. I also gained a basic understanding of an IR system through its definition: the primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible. The notion of relevance is of central importance in IR. A data retrieval system, such as a relational database, deals with data that has a well-defined structure and semantics, while an IR system deals with natural language text, which is not well structured. The development of the web has also boosted the development of IR systems, and at the same time it brings some new challenges to them.

Friday, January 10, 2014

Week1 muddiestPoint

Considering the variety of data types, there are some differences between a database and an IR system. What about special databases, such as NoSQL databases? What is the difference between these databases and an IR system?

week 2 reading note (IIR)

When I finished reading the first three chapters of the book “An Introduction to Information Retrieval”, I had a brief impression of inverted indexes for handling Boolean and proximity queries. I have learnt the basic steps in inverted index construction:

1. Collect the documents to be indexed.

2. Tokenize the text.

3. Do linguistic preprocessing of tokens.

4. Index the documents that each term occurs in.
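The four steps above can be sketched in a few lines. This is only a toy version under my own assumptions: documents are a dictionary from docID to text, tokenization is a simple regular expression, and lowercasing stands in for real linguistic preprocessing such as stemming.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Step 2: split the text into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def normalize(token):
    """Step 3: a stand-in for stemming or lemmatization (lowercase only)."""
    return token.lower()


def build_inverted_index(docs):
    """Steps 1 and 4: take the collected docs and index term -> docIDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    # Sorted terms and sorted postings, as in a real inverted index.
    return {term: sorted(ids) for term, ids in sorted(index.items())}
```

For example, indexing `{1: "New home sales", 2: "Home prices rise"}` maps `"home"` to `[1, 2]`.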

I have a basic idea that the way an IR system handles documents written in different languages is just like the way I learn and memorize words. We have to use some methods to remember words efficiently, just as a search engine uses methods such as stemming and lemmatization. We also have to take into consideration the different structures of the languages in different countries.

I also learnt that we have to compare many data structures to make the process efficient. To implement spelling correction, we also need a dynamic programming algorithm for computing the edit distance between two strings.
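As a minimal sketch of that dynamic programming idea, here is the Levenshtein edit distance with unit costs for insert, delete, and replace. The `suggest` helper and its `max_distance` threshold are my own illustration of how the distance could rank correction candidates; a real system would not compare the query against the whole vocabulary.

```python
def edit_distance(s, t):
    """Levenshtein distance between s and t with unit edit costs."""
    m, n = len(s), len(t)
    # dist[i][j] = distance between the first i chars of s and first j of t.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete from s
                dist[i][j - 1] + 1,         # insert into s
                dist[i - 1][j - 1] + cost,  # replace or copy
            )
    return dist[m][n]


def suggest(query, vocabulary, max_distance=2):
    """Hypothetical helper: vocabulary terms within max_distance, closest first."""
    scored = sorted((edit_distance(query, w), w) for w in vocabulary)
    return [w for d, w in scored if d <= max_distance]
```

For example, `edit_distance("kitten", "sitting")` is 3.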

I found some of this closely related to the course “Data Analytic” that I took last term.
