Knowledge Technology

Basic Concepts

Document representation

String processing

Regular expression

Quantifiers are greedy by default: they match as much text as possible

Placing a pattern in parentheses creates a capture group: the matched text is stored and can be retrieved or referenced later, like a variable
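A minimal sketch of both points, using Python's standard `re` module. The sample string and variable names are made up for illustration.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much as it can, so the match runs to the last '>'.
greedy = re.search(r"<.*>", text).group(0)

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.search(r"<.*?>", text).group(0)

# Parentheses capture the matched text; group(1) retrieves it.
tag = re.search(r"<(\w+)>", text).group(1)

# A backreference (\1) reuses the captured text later in the same pattern.
pairs = re.findall(r"<(\w+)>(.*?)</\1>", text)

print(greedy)
print(lazy)
print(tag)
print(pairs)
```

Here the greedy pattern matches the whole string, while the non-greedy one matches only the first tag.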

Similarity (of text documents)



Weight up these two vectors

Jaccard Similarity

Dice Similarity

Cosine Distance
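The three measures above can be sketched as follows. This assumes Jaccard and Dice are computed over sets of terms and cosine over raw term-frequency vectors; the two sample documents are made up. The function returns cosine similarity, and some texts define cosine distance as one minus this value.

```python
import math
from collections import Counter

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|"""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice(a: set, b: set) -> float:
    """2 |A ∩ B| / (|A| + |B|)"""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def cosine(x: Counter, y: Counter) -> float:
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    len_x = math.sqrt(sum(v * v for v in x.values()))
    len_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (len_x * len_y) if len_x and len_y else 0.0

doc1 = "the cat sat on the mat".split()
doc2 = "the cat lay on the rug".split()

print(jaccard(set(doc1), set(doc2)))
print(dice(set(doc1), set(doc2)))
print(cosine(Counter(doc1), Counter(doc2)))
```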

Relative entropy (Kullback-Leibler divergence)

A measure of the difference between two probability distributions
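A minimal sketch, assuming the usual definition $D(P \| Q) = \sum_i p_i \log_2 \frac{p_i}{q_i}$ with distributions given as equal-length lists of probabilities. The function name and the sample distributions are illustrative.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits. Terms with p_i = 0 contribute nothing.

    Assumes q_i > 0 wherever p_i > 0; otherwise the divergence is infinite.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.25, 0.75]

print(kl_divergence(p, q))
print(kl_divergence(q, p))
```

The two printed values differ: KL divergence is not symmetric, which is one motivation for the symmetric variants that follow.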

Skew divergence

Jensen-Shannon divergence

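$JSD(X \| Y) = \frac{1}{2} D(X \| m) + \frac{1}{2} D(Y \| m)$, where $m = \frac{X+Y}{2}$ is the average of the two distributions and $D$ is the Kullback-Leibler divergence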
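A short sketch of the Jensen-Shannon computation, assuming distributions are equal-length lists of probabilities and logs are taken base 2. The helper names are illustrative.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(x, y):
    """Average of the KL divergences from x and y to their midpoint m."""
    m = [(xi + yi) / 2 for xi, yi in zip(x, y)]
    return 0.5 * kl(x, m) + 0.5 * kl(y, m)

x = [0.5, 0.5]
y = [0.25, 0.75]

print(jensen_shannon(x, y))
print(jensen_shannon(y, x))
```

Because $m$ is nonzero wherever either input is nonzero, this never divides by zero, and the result is symmetric in its arguments.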


Conditional Probability

Spelling Correction

Information Retrieval

IR is the subfield of computer science that deals with storage and retrieval of documents

Documents are not always text. More generally, a document is a message: an object that conveys information from one person to another


Categories of searching:

Approaches to retrieval

Consider the criteria that a human might use to judge whether a document should be returned in response to a query.

Boolean querying

Documents match if they contain all the required terms and none of the NOT terms. There is no ranking, only a yes/no decision. (Process the least frequent terms first to keep intermediate results small and reduce cost.)
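A minimal sketch of Boolean AND/NOT evaluation over an inverted index, assuming the index maps each term to a set of document ids. The function name `boolean_and_not` and the toy index are hypothetical.

```python
def boolean_and_not(index, must, must_not=()):
    """Return ids of documents containing every term in `must` and none in `must_not`."""
    # Rarest term first, so the running result is as small as possible.
    terms = sorted(must, key=lambda t: len(index.get(t, ())))
    if not terms:
        return set()
    result = set(index.get(terms[0], ()))
    for t in terms[1:]:
        result &= index.get(t, set())
        if not result:
            break  # no document can match any more
    for t in must_not:
        result -= index.get(t, set())
    return result

index = {
    "cat": {1, 2, 4},
    "dog": {2, 3, 4, 5, 6},
    "fish": {4, 6},
}

print(boolean_and_not(index, ["cat", "dog"]))
print(boolean_and_not(index, ["cat", "dog"], must_not=["fish"]))
```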


By looking for evidence in the document that it is on the same topic as the query

Cosine with TFIDF weighting model

This is nothing more than calculating the cosine similarity between the query and each document. The terms $w_{d,t}$ and $w_{q,t}$ are the weights of term $t$ in the vector representations of the document and the query. Sometimes they are just TF and IDF. In most cases, the weighting formulas are given in the question.
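A sketch of the ranking computation under one assumed weighting: $w_{d,t} = 1 + \ln f_{d,t}$ and $w_{q,t} = \ln(1 + N / f_t)$, where $f_{d,t}$ is the term count in the document, $f_t$ the number of documents containing the term, and $N$ the collection size. Other TF-IDF variants exist, so substitute whatever formulas the question gives. The function name and toy collection are illustrative.

```python
import math
from collections import Counter

def tfidf_cosine_rank(docs, query):
    """docs: doc id -> list of tokens; query: list of tokens. Returns (doc, score) pairs, best first."""
    n_docs = len(docs)
    tf = {d: Counter(tokens) for d, tokens in docs.items()}
    # f_t: number of documents that contain term t.
    df = Counter(t for counts in tf.values() for t in counts)

    # Assumed query weights: w_{q,t} = ln(1 + N / f_t).
    w_q = {t: math.log(1 + n_docs / df[t]) for t in set(query) if t in df}
    q_len = math.sqrt(sum(w * w for w in w_q.values()))

    scores = {}
    for d, counts in tf.items():
        # Assumed document weights: w_{d,t} = 1 + ln f_{d,t}.
        w_d = {t: 1 + math.log(f) for t, f in counts.items()}
        d_len = math.sqrt(sum(w * w for w in w_d.values()))
        dot = sum(w_q[t] * w_d[t] for t in w_q if t in w_d)
        scores[d] = dot / (d_len * q_len) if d_len and q_len else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "d1": ["apple", "banana", "apple"],
    "d2": ["banana", "cherry"],
    "d3": ["cherry", "date"],
}
print(tfidf_cosine_rank(docs, ["apple"]))
```

The query length is the same for every document, so dropping it would not change the ranking; it is kept here so the score is a true cosine.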

How to calculate?

Main technological components:

Add-on technologies

Web crawler

Inverted list

Phrase queries

How to find the pages in which the words occur as a phrase
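One common way to answer this is a positional index, which stores the positions of each term within each document. The sketch below assumes that approach and whitespace tokenisation; other strategies (for example a separate phrase index, or post-filtering the results of an AND query) are also possible. The function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc id -> list of positions where the term occurs}"""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return ids of documents where the words of `phrase` occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Candidate documents must contain every term.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = set()
    for doc_id in candidates:
        # The i-th term must appear exactly i positions after the first.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

docs = {1: "new york city", 2: "york is new", 3: "the new york times"}
idx = build_positional_index(docs)
print(phrase_query(idx, "new york"))
```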


PageRank overview
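As a rough illustration, PageRank can be computed by power iteration over the link graph. The sketch below assumes a damping factor of 0.85, a fixed number of iterations, and that every link target also appears as a key in `links`; none of these choices come from the notes, and the three-page graph is made up.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links: page -> list of pages it links to. Returns page -> rank (ranks sum to 1)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a base share from the random-jump component.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(ranks)
```

In this toy graph, page `c` ends up with the highest rank because it receives links from both other pages.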

Machine Learning



Testing strategy

Bias and variance

Evaluation Metrics

Information retrieval



Recommendation System

Content-based

Collaborative Filtering

Rule mining


Brute-force (prohibitive)

Two-step approach



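A subset always has support at least as high as any of its supersets, so if an itemset is infrequent, all of its supersets are infrequent too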
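The support property is what lets frequent-itemset generation prune candidates. Below is a minimal Apriori-style sketch, assuming transactions are sets of items and support is the fraction of transactions containing the itemset. Function names and the toy transactions are illustrative, and support is counted by a plain scan rather than a hash tree.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items
               if support([i], transactions) >= min_support]
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s, transactions)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop a candidate if any (k-1)-subset is not frequent,
        # then count support only for the survivors.
        current = [c for c in candidates
                   if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
                   and support(c, transactions) >= min_support]
    return frequent

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
freq = frequent_itemsets(transactions, min_support=0.5)
print(freq)
```

In this example `{b, c}` is infrequent, so the candidate `{a, b, c}` is pruned without its support ever being counted.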

Generate Hash Tree

Accelerates support counting for candidate itemsets

Further Issues

Tao Lu

Calculating document ranking for query

Decision Tree: Selecting the Splitting Attribute

Using IG

Using Gain Ratio

Using GINI-Split

GINI is 1 minus the sum of the squared class probabilities: $GINI = 1 - \sum_i p_i^2$
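The three splitting criteria can be sketched as below, assuming a categorical attribute given as a list of values aligned with a list of class labels. The definitions used are the standard ones (entropy in bits, gain ratio as information gain divided by split information, GINI-Split as the size-weighted GINI of the partitions). The function names and the tiny dataset are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(values, labels):
    """Group the labels by attribute value."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def info_gain(values, labels):
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy(values)
    return info_gain(values, labels) / split_info if split_info else 0.0

def gini_split(values, labels):
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in partition(values, labels))

outlook = ["sunny", "sunny", "rain", "rain", "rain", "sunny"]
play = ["no", "no", "yes", "yes", "no", "yes"]
print(info_gain(outlook, play), gain_ratio(outlook, play), gini_split(outlook, play))
```

Choose the attribute with the highest information gain or gain ratio, or the lowest GINI-Split.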


Accumulator limiting approach

Limit the number of accumulators: once the limit is hit, stop creating new accumulators. (Order query terms by $w_{q,t}$.)

Accumulator threshold approach

If the inner-product contribution is smaller than the threshold, do not create an accumulator. (Order query terms by $w_{q,t}$.)
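Both approaches can be sketched in one term-at-a-time scoring loop. This is a sketch under assumptions: postings are lists of `(doc_id, w_dt)` pairs, terms are processed in decreasing $w_{q,t}$, existing accumulators keep being updated after the limit is hit, and scores are normalised by a precomputed document length. All names and the toy data are hypothetical.

```python
def rank_with_accumulators(postings, w_q, doc_len, limit=None, threshold=0.0):
    """postings: term -> [(doc_id, w_dt)]; w_q: term -> query weight; doc_len: doc_id -> length."""
    acc = {}
    # Highest-weighted query terms first, so they get to create accumulators.
    for t in sorted(w_q, key=w_q.get, reverse=True):
        for d, w_dt in postings.get(t, []):
            contribution = w_q[t] * w_dt
            if d in acc:
                # Existing accumulators are always updated.
                acc[d] += contribution
            elif (limit is None or len(acc) < limit) and contribution >= threshold:
                # Create a new accumulator only if there is room (limiting)
                # and the contribution is large enough (thresholding).
                acc[d] = contribution
    scored = ((d, s / doc_len[d]) for d, s in acc.items())
    return sorted(scored, key=lambda kv: kv[1], reverse=True)

postings = {
    "rare": [(1, 2.0), (2, 1.0)],
    "common": [(1, 1.0), (2, 1.0), (3, 1.0)],
}
w_q = {"rare": 3.0, "common": 0.5}
doc_len = {1: 1.0, 2: 1.0, 3: 1.0}

print(rank_with_accumulators(postings, w_q, doc_len))
print(rank_with_accumulators(postings, w_q, doc_len, limit=2))
print(rank_with_accumulators(postings, w_q, doc_len, threshold=1.0))
```

With either restriction, document 3 never gets an accumulator because it only matches the low-weight term, while the top-ranked documents are unchanged.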

Calculate Evaluation Metrics for Classifier
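A minimal sketch for a binary classifier, assuming the usual definitions of accuracy, precision, recall and F1 from the confusion-matrix counts. The function name and the sample labels are made up.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here TP = 3, FP = 1, FN = 1 and TN = 3, so all four metrics come out to 0.75.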

Calculate user-based/item-based recommendation system

Fuck it. Jeremy, if you want me to lose marks, exam this, I won’t remember any single character of the formula. I think remembering this formula is the only motherfucking thing that takes time to do in this subject. I paid my tuition fee to learn something, not this kind of easy shit :).