- Why, on slide 54, is the binary tree built from the input elk, hog, bee, fox, cat, gnu, ant, dog structured like that? It is not alphabetical.
- How can edit distance be used to make correction suggestions for query terms?
Wednesday, January 29, 2014
Week 3 muddiestPoint
Week 4 reading notes (IIR)
There are some alternative techniques for
storing postings lists. However, such techniques are difficult to
combine with postings list compression. Moreover, standard postings
list intersection operations remain necessary when both terms of a query are
very common.
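For reference, here is a minimal sketch of the standard merge-style intersection mentioned above. It assumes each postings list is a sorted Python list of docIDs; the function name `intersect` is my own choice, not something fixed by the book.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    # Walk both lists in step; advance whichever pointer sits on the smaller docID.
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer
```

The running time is linear in the total length of the two lists, which is why this step stays relevant when both terms are very common.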
The algorithm for computing the weighted zone
score from two postings lists is very similar to the algorithm for the intersection
of two postings lists.
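To make the similarity concrete, here is a rough sketch of weighted zone scoring for a two-term AND query. The data layout is an assumption of mine: each posting is a `(docID, set_of_zones)` pair, and `g` maps a zone name to its weight. A zone contributes its weight only when both terms occur in it.

```python
def zone_score(p1, p2, g):
    """Score documents containing both terms; p1 and p2 are sorted by docID."""
    scores = {}
    i = j = 0
    # Same pointer walk as postings intersection, but a match yields a score.
    while i < len(p1) and j < len(p2):
        doc1, zones1 = p1[i]
        doc2, zones2 = p2[j]
        if doc1 == doc2:
            # Sum the weights of the zones in which both terms appear.
            scores[doc1] = sum(g[zone] for zone in zones1 & zones2)
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return scores
```

The only change from plain intersection is what happens on a docID match: instead of appending the docID, the sketch computes a score from the shared zones.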
Monday, January 20, 2014
Week 3 reading notes (IIR)
I mainly focused on two indexing methods for a single machine: blocked
sort-based indexing (BSBI) and single-pass in-memory indexing (SPIMI).
BSBI has excellent scaling properties, but unlike SPIMI
it needs a data structure for mapping terms to termIDs. For very large
collections, this data structure does not fit into memory. SPIMI can index
collections of any size as long as there is enough disk space available.
Although SPIMI may waste some memory when it stores postings lists,
the overall memory requirements for the dynamically constructed index
of a block in SPIMI are still lower than in BSBI.
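A small sketch of the SPIMI idea, under simplifying assumptions of my own: the token stream is a list of `(term, docID)` pairs in document order, the block size is counted in postings rather than bytes, and "writing a block to disk" is simulated by appending it to a Python list. The names `spimi_invert` and `merge_blocks` are hypothetical.

```python
from collections import defaultdict


def spimi_invert(token_stream, max_postings_per_block):
    """Build sorted blocks from (term, doc_id) pairs without any term-to-termID map."""
    blocks = []
    dictionary = defaultdict(list)  # term -> postings list, grown dynamically
    count = 0
    for term, doc_id in token_stream:
        postings = dictionary[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
            count += 1
        if count >= max_postings_per_block:
            # Block is full: sort the terms and "write" the block out.
            blocks.append(sorted(dictionary.items()))
            dictionary = defaultdict(list)
            count = 0
    if dictionary:
        blocks.append(sorted(dictionary.items()))
    return blocks


def merge_blocks(blocks):
    """Merge blocks into one index; blocks arrive in document order."""
    index = {}
    for block in blocks:
        for term, postings in block:
            merged = index.setdefault(term, [])
            for doc_id in postings:
                # Skip a docID repeated across a block boundary.
                if not merged or merged[-1] != doc_id:
                    merged.append(doc_id)
    return index
```

Terms are used directly as dictionary keys, so no global term-to-termID table has to fit in memory. A real implementation would stream the blocks from disk during the merge instead of holding them all in memory.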
Week 2 muddiestPoint
When we treat every character n-gram as a
term, how do we decide the value of n that makes indexing most efficient? When
a query comes in, for example h*llo$, why is it changed to
llo$h*? Why do we need to adjust the order of the characters?
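As far as I understand, the rotation comes from the permuterm index: every rotation of term + "$" is indexed, and the query is rotated so that the * lands at the end, which turns the wildcard query into a prefix lookup. Below is a minimal sketch that assumes a query with exactly one *. The function names are my own, and the linear scan stands in for the B-tree prefix search a real index would use.

```python
def permuterm_rotations(term):
    """Return every rotation of term + '$', as stored in a permuterm index."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]


def rotate_query(query):
    """Rotate a single-* wildcard query X*Y into Y$X* (the * moves to the end)."""
    head, tail = query.split("*")
    return tail + "$" + head + "*"


def lookup(query, vocabulary):
    """Find the vocabulary terms matching a single-* wildcard query."""
    prefix = rotate_query(query)[:-1]  # drop the trailing '*'
    return sorted(
        term
        for term in set(vocabulary)
        if any(rot.startswith(prefix) for rot in permuterm_rotations(term))
    )
```

For example, `rotate_query("h*llo")` gives `"llo$h*"`, and any term whose rotations include a string starting with `llo$h` matches the query.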
Sunday, January 12, 2014
week 1 reading note
When I read about how information retrieval
systems evolved, I found that their impact on computer science
has become more important. I also gained a basic understanding of an IR system from
its definition: the primary goal of an IR system is to retrieve all the documents
that are relevant to a user query while retrieving as few non-relevant documents
as possible. The notion of relevance is of central importance in IR. A data
retrieval system, such as a relational database, deals with data that has a well-defined
structure and semantics, while an IR system deals with natural language text,
which is not well structured. The development of the web has also boosted the
development of IR systems, and at the same time it brings some new challenges to
them.
Friday, January 10, 2014
Week1 muddiestPoint
Considering the variety of data types, there are some differences between a database and an IR system. What about special databases, such as NoSQL databases? What is the difference between these databases and an IR system?
week 2 reading note (IIR)
After reading the first three
chapters of the book “An Introduction to Information
Retrieval”, I have a brief impression of
inverted indexes for handling Boolean and proximity queries. I have learnt
the basic steps in inverted index construction:
1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.
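The four steps above can be sketched roughly as follows. This is only an illustration under my own assumptions: documents are given as a dict from docID to text, tokenization is a simple regular expression, and the "linguistic preprocessing" is just lowercasing, where a real system might apply stemming or lemmatization.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Step 2: split the text into word tokens."""
    return re.findall(r"\w+", text)


def normalize(token):
    """Step 3: placeholder preprocessing (lowercasing only)."""
    return token.lower()


def build_inverted_index(docs):
    """Steps 1 and 4: take the collected docs and index the documents each term occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    # Sort the terms and each postings list, as an inverted index normally stores them.
    return {term: sorted(doc_ids) for term, doc_ids in sorted(index.items())}
```

For instance, indexing `{1: "New home sales", 2: "Home sales rise"}` maps "home" to `[1, 2]` and "new" to `[1]`.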
My basic idea is that the way an IR system handles
documents written in the languages of different countries is much like the way
I learn and memorize words. We have to use some methods to remember words
efficiently, just as a search engine uses methods such as stemming and
lemmatization. We also have to take into account the different
structure of the language in each country.
I also
learnt that we have to compare many data structures to make processing
efficient. To implement spelling correction, we also need a dynamic programming
algorithm for computing the edit distance between two strings.
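A minimal sketch of that dynamic programming algorithm (Levenshtein distance with unit costs for insert, delete, and replace), plus one simple way it could drive correction suggestions. The `suggest` helper, its `max_distance` parameter, and the brute-force scan over the vocabulary are my own assumptions for illustration.

```python
def edit_distance(s, t):
    """Levenshtein distance between s and t via dynamic programming."""
    m, n = len(s), len(t)
    # d[i][j] = edit distance between the first i chars of s and first j chars of t.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete from s
                d[i][j - 1] + 1,         # insert into s
                d[i - 1][j - 1] + cost,  # replace or copy
            )
    return d[m][n]


def suggest(query, vocabulary, max_distance=2):
    """Return vocabulary terms within max_distance of query, closest first."""
    scored = sorted((edit_distance(query, term), term) for term in vocabulary)
    return [term for dist, term in scored if dist <= max_distance]
```

Scanning the whole vocabulary is slow for a large dictionary, so a real system would first narrow the candidates and only then rank them by edit distance.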
I also
found some material closely related to the course “Data Analytic” that I took last
term.