Knowledge Technology
 Tao Lu
Basic Concepts
 Data: Measurements
 Information: Processed data; patterns that are satisfied for given data
 Knowledge: Information interpreted with respect to a user's context to extend human understanding in a given area
 Concrete tasks: Mechanically processing data to an unambiguous solution; limited contribution to human understanding
 Knowledge tasks: Data is unreliable or the outcome is ill-defined; computers mediate between user and the data, where context is critical; enhance human understanding
Document representation
 Structured data: Conforms to a schema
 Semi-structured data: Conforms in part to a schema
String processing
Regular expression
 * : Zero or more
 ? : Zero or one
 + : One or more
These quantifiers are greedy: they match as much text as possible
 {m,n} : Between m and n inclusively
 [0-9] = \d
 [a-zA-Z0-9_] = \w
 [ \t\r\n\f] = \s
 [^0-9] = \D
 [^a-zA-Z0-9_] = \W
 [^ \t\r\n\f] = \S
Placing a pattern in parentheses stores the match as a capture group
 \n : back-reference to the nth group
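A minimal Python sketch of these operators (the strings and patterns here are made-up examples):

```python
import re

# \d+ is greedy: each match grabs as many digits as possible.
digit_runs = re.findall(r"\d+", "a12b345")
print(digit_runs)                        # ['12', '345']

# {m,n} matches between m and n repetitions, inclusively.
ok = re.fullmatch(r"a{2,3}", "aaa")      # matches
too_many = re.fullmatch(r"a{2,3}", "aaaa")  # None

# Parentheses capture the match as a group; \1 is a back-reference to
# group 1, so this pattern finds a doubled word.
doubled = re.search(r"(\w+) \1", "the the cat")
```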
Similarity (of text documents)
Terminologies
 $f_d$: Number of terms in document d
 $f_{d,t}$: Frequency of term t in document d (TF)
 $f_{ave}$: The average number of terms contained in a document
 $N$: Number of documents
 $f_t$: Number of documents that contain t
 $F_t$: Total number of occurrences of t across all documents
 $n$: The number of indexed terms in the collection
TF-IDF
 Terms that occur frequently in a given document have high utility
 Terms that occur in a wide variety of docs have low utility
Weigh terms by combining these two factors, e.g. $w_{d,t} = f_{d,t} \cdot \log\frac{N}{f_t}$
Jaccard Similarity: $\frac{|X \cap Y|}{|X \cup Y|}$
Dice Similarity: $\frac{2|X \cap Y|}{|X| + |Y|}$
Cosine Similarity: $\frac{X \cdot Y}{|X||Y|}$ (cosine distance is 1 minus this)
Relative entropy (Kullback-Leibler divergence)
A measure of the difference between two probability distributions: $KL(X \| Y) = \sum_i x_i \log_2 \frac{x_i}{y_i}$ (asymmetric, undefined when some $y_i = 0$)
Skew divergence
$s_\alpha(X, Y) = KL(X \| \alpha Y + (1 - \alpha) X)$, which mixes in a little of X to avoid zero probabilities
Jensen-Shannon divergence
$JS(X, Y) = \frac{1}{2} KL(X \| m) + \frac{1}{2} KL(Y \| m)$, where $m = \frac{X+Y}{2}$ (symmetric)
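The similarity and divergence measures above translate directly into Python (the tiny sets, vectors, and distributions below are made-up examples):

```python
from math import log2, sqrt

def jaccard(x, y):            # |X ∩ Y| / |X ∪ Y| on sets
    return len(x & y) / len(x | y)

def dice(x, y):               # 2|X ∩ Y| / (|X| + |Y|) on sets
    return 2 * len(x & y) / (len(x) + len(y))

def cosine(x, y):             # X·Y / (|X||Y|) on weight vectors
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def kl(x, y):                 # KL(X||Y) = Σ x_i log2(x_i / y_i)
    return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

def js(x, y):                 # ½ KL(X||m) + ½ KL(Y||m), m = (X+Y)/2
    m = [(a + b) / 2 for a, b in zip(x, y)]
    return 0.5 * kl(x, m) + 0.5 * kl(y, m)

a, b = {"cat", "sat", "mat"}, {"cat", "mat", "hat", "rat"}
print(jaccard(a, b))          # 2 shared / 5 total = 0.4
print(dice(a, b))             # 2*2 / (3+4) = 4/7
```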
Probability
Conditional Probability
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
Sum rule
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Multiplication rule
$P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A)$
Bayes rule
$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$
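A worked example of these rules (all numbers are made up: a test with 99% sensitivity, 95% specificity, and 1% prevalence):

```python
p_a = 0.01                      # P(A): prior probability of the condition
p_b_given_a = 0.99              # P(B|A): test positive given condition
p_b_given_not_a = 0.05          # P(B|¬A): false positive rate

# Sum rule over the partition {A, ¬A} gives P(B).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes rule: P(A|B) = P(B|A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))    # 0.167: still unlikely despite a positive test
```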
Approx String Search
Spelling Correction
 Neighbourhood search
 Generate all variants of w that utilise at most k changes (Insertions/Deletions/Replacements)
 Check whether generated variants exist in dictionary
 All results found are returned

For k edits there are $O(|\Sigma|^k \cdot |w|^k)$ neighbours, where $|\Sigma|$ is the alphabet size and $|w|$ is the word length
 Global Edit distance
 Transform the string of interest into each dictionary entry, using insert, delete, replace and match
 How to calculate: First initialise the table, with the top-left corner = 0; moving along the "from" string accumulates Delete penalties, and moving along the "to" string accumulates Insert penalties. Fill each cell one by one, taking the max of a delete from the cell above, an insert from the cell to the left, and a match/replace from the upper-left cell. The bottom-right cell is the result.
 Local edit distance: Like global edit distance, but searching for the best substring match.
 How to calculate: Initialise the table, but this time with the first row and column set to 0. Fill as before, but max() takes 0 as an additional argument as well. The final result is the greatest value anywhere in the table.
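The two table-filling procedures above can be sketched as one DP, assuming the match = +1, insert/delete/replace = -1 scoring used here (`edit_score` is a made-up name):

```python
def edit_score(s, t, local=False):
    """Score-maximising edit DP: match +1, insert/delete/replace -1.
    With local=True, cells are floored at 0 and the greatest cell anywhere
    in the table is returned (best substring match)."""
    rows, cols = len(s) + 1, len(t) + 1
    tab = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        tab[i][0] = 0 if local else -i      # deletes down the first column
    for j in range(1, cols):
        tab[0][j] = 0 if local else -j      # inserts along the first row
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            score = max(tab[i - 1][j] - 1,                              # delete
                        tab[i][j - 1] - 1,                              # insert
                        tab[i - 1][j - 1] + (1 if s[i - 1] == t[j - 1] else -1))
            if local:
                score = max(score, 0)
                best = max(best, score)
            tab[i][j] = score
    return best if local else tab[-1][-1]

print(edit_score("cat", "cart"))    # 2: three matches, one insert
```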

N-Gram Distance: $|G_x| + |G_y| - 2|G_x \cap G_y|$, where $G_x$ is the set of n-grams of x
 Soundex
 Except for the initial character, translate the string's characters according to the code table
 Remove adjacent duplicate codes
 Remove 0s
 Truncate to four symbols
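A sketch of these steps (this is a simplified variant that codes h, w and y as 0 like vowels; real Soundex implementations differ in such details):

```python
def soundex(word):
    """Keep the initial letter, map the rest through the code table,
    collapse adjacent duplicate codes, drop zeros, pad/truncate to four."""
    table = {**dict.fromkeys("aeiouyhw", "0"),
             **dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    codes = [table[c] for c in word if c in table]
    # collapse runs of adjacent duplicates (including the first letter's code)
    collapsed = [codes[0]]
    for c in codes[1:]:
        if c != collapsed[-1]:
            collapsed.append(c)
    digits = [c for c in collapsed[1:] if c != "0"]
    return (word[0].upper() + "".join(digits) + "000")[:4]

print(soundex("Robert"))    # R163
```

Names that sound alike map to the same code, e.g. Robert and Rupert both give R163.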
Information Retrieval
IR is the subfield of computer science that deals with storage and retrieval of documents
Documents are not always text. They can be defined as messages: an object that conveys information from one person to another
 Stored documents are real-world objects that have been created for individual reasons, without consistent format, wording, language, or length
 The retrieval system is concerned with the document as originally created, not with a formal representation of the document
 Users may not agree on the value of a particular document, even in relation to the same query
 Documents are rich and ambiguous
 Text in some kind of collection has structured attrs, but these are only occasionally useful for searching
Searching
Categories of searching:
 Informational
 Factoid
 Topic tracking
 Navigational
 Transactional
 Geospatial
Approaches to retrieval
Consider the criteria that a human might use to judge whether a document should be returned in response to a query.
 Try and guess what the query might be inspired by, and what kind of info or doc is being sought
 Consider current news and events, or their own experience with query terms
 Approach the task of looking through the docs with expectations of what a match is that is based on much more than the terms
 Be ready to consider a doc even if the terminology is completely different
Boolean querying
Documents match if they contain the terms and don’t contain the NOT terms. There is no ordering, only yes/no (Start with least frequent terms to reduce cost)
Ranking
By looking for evidence in the document that it is on the same topic as the query
 Choose docs with words in common with query
 Choose docs with the query terms in the title
 Created recently
 Translate between languages
 Choose authoritative, reliable documents
Cosine with TFIDF weighting model
This is nothing more than calculating the cosine similarity between the query and the documents. The terms $w_{d,t}$ and $w_{q,t}$ are just vector representations of the document and query over the key terms; sometimes they are simply TF and IDF. In most cases they are given in questions.
How to calculate? $\cos(q, d) = \frac{1}{|q||d|} \sum_t w_{q,t} \cdot w_{d,t}$
 Remember the terminology:
 N is the number of documents
 $f_t$ is the number of documents which contain the term

$|q|$ and $|d|$ are the lengths of the vector representations of the query and document using the given $w_{q,t}$ and $w_{d,t}$
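A small end-to-end sketch, assuming the common choice $w_{d,t} = f_{d,t}$ (TF) and $w_{q,t} = \log(N/f_t)$ (IDF); the toy documents are made up:

```python
from math import log, sqrt
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat", "cats and dogs"]
N = len(docs)
tokenised = [d.split() for d in docs]
df = Counter(t for doc in tokenised for t in set(doc))   # f_t per term

def rank(query):
    q_terms = query.split()
    # w_{q,t} = log(N / f_t); w_{d,t} = f_{d,t}
    w_q = {t: log(N / df[t]) for t in q_terms if t in df}
    q_len = sqrt(sum(w * w for w in w_q.values()))
    scores = []
    for doc in tokenised:
        tf = Counter(doc)
        dot = sum(w_q[t] * tf[t] for t in w_q)
        d_len = sqrt(sum(f * f for f in tf.values()))
        scores.append(dot / (q_len * d_len) if dot else 0.0)
    return scores

print(rank("cat sat"))   # highest score for the doc containing both terms
```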
Web Search
Main technological components:
 Crawling
 Parsing: Translate data into canonical form
 Indexing: Data structures built to allow search to take place efficiently
 Querying: The data structures must be processed in response to queries
Addon technologies
 Snippet generation
 Asyoutype querying
 Query correction
 Answer consolidation
 Info boxes
 Tokenization: reduce a webpage or a query to a sequence of tokens
 Stemming: Stripping away affixes; implemented as a cascaded set of rewrite rules.

Zoning: Web documents can usually be segmented into discrete zones such as title, anchor text, headings and so on. A weight can be calculated for each zone and combined into the document's score
 Indexing:
 Search structure: For each word t, the search structure contains a pointer to the start of the corresponding inverted list, and a count $f_t$ of the documents containing t
 Inverted lists: For each word t, the inverted list contains the identifiers d of documents containing t as ordinal numbers, and the associated frequency $f_{d,t}$ of t in d
 Inverted lists allow for fast querying because
 The terms in the query correspond to entries in the search structure
 The index only indicates documents where the term is present

Ranked querying

Link analysis
A strong piece of evidence for a page's importance is given by links: how many pages have links to this page

Pagerank
Use random walks to calculate the $\pi(d)$ value for each page, with the assumptions that each page has the same probability of being the start point and that outgoing links are followed with equal probability.
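A power-iteration sketch under those assumptions (the damping factor d = 0.85 is the conventional choice; the three-page link graph is made up):

```python
def pagerank(links, d=0.85, iters=50):
    """`links` maps each page to the pages it links to. Every page starts
    with equal probability; a random surfer follows outgoing links
    uniformly, teleporting to a random page with probability 1 - d."""
    pages = list(links)
    n = len(pages)
    pi = {p: 1 / n for p in pages}
    for _ in range(iters):
        nxt = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = d * pi[p] / len(outs)
                for q in outs:
                    nxt[q] += share
            else:                        # dangling page: spread evenly
                for q in pages:
                    nxt[q] += d * pi[p] / n
        pi = nxt
    return pi

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Here "c" collects links from both "a" and "b", so it ends up with the highest $\pi$ value.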
Web crawler
 Challenges:
 There is no central index of URLs of interest
 Some websites return the same content under a new URL at each visit
 Some pages never return a completion status on access
 Some websites are not intended to be crawled
 Much web content is generated on-the-fly from databases, which can be costly for the content provider, so excessive numbers of visits to a site are unwelcome
 Some content has a short lifespan
 Some regions and content providers have low bandwidth
Inverted list
 Search structure
 Pointer to the start of inverted list for each term
 A count of docs that contains t $f_t$
 Inverted list
 d that contains t
 Freq of t in d (we could store $w_{d,t}$, or normalized as $\frac{w_{d,t}}{W_d}$)
 Boolean Querying
 Fetch inverted lists
 Union for OR
 Intersection for AND
 Complement for NOT
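The three set operations over inverted lists can be sketched as follows (the toy collection is made up):

```python
from functools import reduce

docs = {1: "cat sat mat", 2: "dog sat", 3: "cat dog"}

# Search structure: term -> inverted list of document ids.
index = {}
for d, text in docs.items():
    for t in set(text.split()):
        index.setdefault(t, []).append(d)

all_ids = set(docs)

def AND(*terms):
    return reduce(set.intersection, (set(index.get(t, [])) for t in terms))

def OR(*terms):
    return reduce(set.union, (set(index.get(t, [])) for t in terms))

def NOT(term):
    # complement against the whole collection, since the index only
    # records documents where the term is present
    return all_ids - set(index.get(term, []))

print(AND("cat", "sat"))        # {1}
print(OR("cat", "dog"))         # {1, 2, 3}
print(AND("dog") & NOT("sat"))  # {3}
```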
Phrase queries
How to find the pages in which the words occur as a phrase
Strategies:
 Process as bag of words, then post-process (can be slow; start with the most infrequent terms to speed up)
 Add positions to index entries
 Use some form of phrase index or word-pair index so that phrases can be directly identified without using the inverted index
Pagerank overview
Machine Learning
Supervised
 Classification
 Decision Tree
 Best split (Check out Tao Lu)

Parameters: Total number of nodes, depth, minimum number of data points for a split
 How to set params? Crossvalidation
 Continuous attributes: Discretization at the beginning or find ranges by equal interval bucketing
 Random forest: train multiple trees on random subsets of samples, decision via majority voting (can use a subset of records and/or a subset of attrs)
 Num of trees
 Use bagging to come up with different training dataset for each tree
 When building our tree, at each node, we only consider a random sample of attributes.
 SVM
 Find hyperplane that maximises the margin (find w and b)
 Lagrange multiplier applied to get dual problem: Solve $\alpha$
 $x_i$ with nonzero $\alpha_i$ will be support vectors
 Softmargin introduces parameters to tradeoff the relative importance of maximizing the margin and fitting the data
 Use kernel function for nonlinear SVM
 Onevsall multiclass strategy: Build M classifiers for M classes, choose class with largest margin for test data
 Onevsone: One classifier per pair, choose class selected by most classifiers
How to derive
Write down the margin by calculating the distance between the two parallel supporting hyperplanes.
The distance between parallel lines is given by $d = \frac{|ax_1 + by_1 + c|}{\sqrt{a^2 + b^2}}$, where $(x_1, y_1)$ is a point on one line and $ax + by + c = 0$ is the other line.
Then maximize the margin using constrained optimization with Lagrange multipliers $\alpha$. Write all the parameters in terms of $\alpha$, then solve the dual problem, which is dead easy.
For soft margin, trade off the width of the margin against how many points have to be moved around. The margin can be smaller than 1 for a point $x_i$ by setting $\xi_i > 0$, but pay a penalty of $C\xi_i$ in the minimization objective, i.e., minimize $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i$ s.t. $y_i(w \cdot x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.

KNN
Select the K nearest neighbours and look at their labels

Naive Bayes
 NN
 Define E
 Forward pass: calculate hidden $H_i$ and output $O_j$ values
 Calculate error for each layer
 Update weight using gradient descent
 Until converge
 Decision Tree
Unsupervised
 Association
 Clustering
 Kmeans
 Init clusters
 Assign cluster
 Recalculate cluster center
 Reassign cluster …
 + Efficient
 + Can be extended to hierarchical clustering
  Converges only to a local minimum
  Needs k in advance
  Unable to handle non-convex clusters
  Ill-defined "mean" for some data types
  Sensitive to outliers
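A minimal sketch of the loop above (Lloyd's algorithm; the 2-D points are made up):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Pick k initial centres, assign each point to its nearest centre,
    recompute centres as cluster means, repeat. Converges only to a
    local minimum and needs k in advance."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centres[c])))
            clusters[i].append(p)                     # assign to nearest centre
        centres = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centres[i]
                   for i, cl in enumerate(clusters)]  # recompute means
    return centres, clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
centres, clusters = kmeans(pts, 2)
```

On this toy data the two tight pairs are recovered, with centres (0, 0.5) and (10, 10.5).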
 Bottom-up
 Start with singleinstance clusters
 Iteratively merge closest two
 Topdown
 Start with one
 Find two partitioning clusters
 Proceed recursively on each subset
 Cluster dist measurement
 MIN (single link) use two closest points in clusters
 MAX (complete link) Use two farthest points in clusters
 Group average Use average dist between all points
 Kmeans
 Reinforcement learning
 Recommender system
 Anomaly detection
Testing strategy
Bias and variance
 (Training) bias is the average distance between the expected value and the estimated value
 (Testing) variance is the standard deviation of the estimated value around the average estimated value (inconsistency of decisions)
 Holdout Fixed training and testing set
 + Simple
 + High reproducibility
  Trade-off: more training or more testing data
  Representativeness of the training and the test data
 Random subsampling Multiple iterations, randomly selecting training and test data
 + Reduce bias and variance
  Low reproducibility
 Leave one out Train N times, 1 test instance each time. Measure average performance
 + No sampling bias
 + Higher accuracy
  Expensive
 Mfold cross validation
 + Only train M times
 + Can measure stability of the system
  Sampling bias
  Results vary unless partitioned identically
  Slightly lower accuracy
  Not suitable for small dataset

Regression

Feature Selection
To choose a subset of attributes that give best performance on the development data (Slow)
Greedy approach:
 Train and eval on each single attr
 Choose best attr
 Repeat
 Train and eval the model on the best attrs, plus each remaining single attr
 Choose best attr out of remaining set
 Until performance stops increasing
Ablation approach:
 Start with all attributes
 Remove one, train and eval
 Repeat:
 From the remaining attrs, remove each attr in turn, train and eval
 Remove the attr that causes the least performance degradation
 Termination condition: performance starts to degrade by more than $\epsilon$
Mutual Information Approach
If an attr is independent of the class, then $P(A, C) = P(A)P(C)$, which means
$\frac{P(A, C)}{P(A)P(C)} \approx 1$
 If $LHS \approx 1$, they co-occur at roughly chance rate
 If $LHS \gg 1$, they occur together much more often than randomly
 If $LHS \ll 1$, they are negatively correlated
Therefore $PMI(A, C) = \log_2 \frac{P(A, C)}{P(A)P(C)}$
The greatest PMI indicates the best attribute value A for class C, taking into consideration all the possible classes and attribute values
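A direct translation of the PMI formula (the weather/play records are made up):

```python
from math import log2

def pmi(records, attr_value, cls):
    """PMI = log2( P(A, C) / (P(A) P(C)) ), estimated from a list of
    (attribute_value, class_label) pairs."""
    n = len(records)
    p_a = sum(1 for a, _ in records if a == attr_value) / n
    p_c = sum(1 for _, c in records if c == cls) / n
    p_ac = sum(1 for a, c in records if a == attr_value and c == cls) / n
    return log2(p_ac / (p_a * p_c))

data = [("sunny", "play")] * 4 + [("rainy", "stay")] * 4
print(pmi(data, "sunny", "play"))   # 1.0: "sunny" and "play" always co-occur
```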
Evaluation Metrics
 Generalise: when it learns the target function well, rather than the specifics of the training set
 Overfitting: when the classifier fails to generalise (describes the training set very well, but does not describe the test data well)
Information retrieval
 Precision: Fraction of correct responses among attempted responses
 Recall: Proportion of words with a correct response
 Precision at k (P@k): Fraction of relevant results among the top k returned

 Average Precision: A single number that characterizes the performance instead of comparing curves.
This is basically integrating precision over recall from 0 to 1. Discretize the calculation to single samples to get $AP = \frac{1}{R} \sum_{k} P@k \cdot rel(k)$,
where R is the total number of relevant docs for the query and $rel(k)$ is 1 if the document at rank k is relevant, 0 otherwise
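A sketch of the discretized AP computation (the ranking and relevance judgements are made up):

```python
def average_precision(ranking, relevant):
    """Sum P@k at each rank k where a relevant document appears, then
    divide by R, the total number of relevant docs for the query."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k           # P@k at this relevant rank
    return total / len(relevant)

# relevant docs appear at ranks 1 and 3, R = 2
print(average_precision(["d1", "d4", "d2"], {"d1", "d2"}))  # (1/1 + 2/3) / 2
```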
Classification

 Accuracy: Proportion of correctly classified instances among all instances
 Error Rate: 1 - ACC
 Error Rate Reduction: $\frac{ER_0 - ER}{ER_0}$, where $ER_0$ is the baseline ER
 Precision: Proportion of correct positive predictions, $\frac{TP}{TP + FP}$
 Recall: Proportion of correctly predicted positive instances among all actual positive instances, $\frac{TP}{TP + FN}$
 Specificity: Proportion of correctly predicted negative instances among all actual negative instances, $\frac{TN}{TN + FP}$
 F-score: Gives an overall performance, $F_1 = \frac{2PR}{P + R}$
 ROC (Receiver Operating Characteristic): Plots the TP rate against the FP rate; the best prediction yields the upper left corner.
 AUC (Area Under the Curve): Probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
 Micro-averaging: Sum up the numerators and denominators across classes, then compute the metric once
 Macro-averaging: Compute the metric per class, sum up the results, then divide by the number of classes
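Per-class precision and recall can be derived from a confusion matrix as follows (the 2-class matrix is made up):

```python
def per_class_metrics(confusion, labels):
    """Rows = actual, cols = predicted. For each class, FN is the
    off-diagonal sum along its actual row, FP the off-diagonal sum
    down its predicted column."""
    out = {}
    for i, lab in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp
        fp = sum(row[i] for row in confusion) - tp
        out[lab] = {"precision": tp / (tp + fp), "recall": tp / (tp + fn)}
    return out

cm = [[5, 1],    # actual A: 5 correct, 1 misclassified as B
      [2, 7]]    # actual B: 2 misclassified as A, 7 correct
m = per_class_metrics(cm, ["A", "B"])
print(m["A"])    # precision 5/7, recall 5/6
```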
Clustering
 Unsupervised Check the cohesion and separation of the clusters.
 Supervised Compare the cluster structure with external structure. For example entropy in a cluster is the measurement of the purity of instances.
 Relative Compares different clusterings with either unsupervised or supervised evaluation strategy.
 SSE Sum of squared error. Sum of the square of distance of each instance to the center of cluster for all the clusters.
Recommendation System
Content-based
 Can recommend new items
 Feature extraction can be difficult
Collaborative Filtering
 Userbased Similar users have similar ratings on the same item.
 Identify set of items rated by the target user
 Identify other users who rated at least 1 item in this set
 Compute how similar each neighbor is to the target user
 Select k most similar neighbors
 Predict ratings for the target user’s unrated items
 Recommend to the user top N products based on predicted rating
User Similarity (Pearson correlation): $sim(u, v) = \frac{\sum_i (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_i (r_{u,i} - \bar{r}_u)^2}\sqrt{\sum_i (r_{v,i} - \bar{r}_v)^2}}$, summing over items i rated by both users
Predicted ratings: $pred(u, i) = \bar{r}_u + \frac{\sum_v sim(u, v)\,(r_{v,i} - \bar{r}_v)}{\sum_v |sim(u, v)|}$
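A sketch of the user-based prediction pipeline (the rating dictionaries are made up; `pearson` and `predict` are made-up names):

```python
from math import sqrt

def pearson(ru, rv):
    """Pearson correlation over items both users rated, centring each
    user's ratings on their own mean."""
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    mu = sum(ru.values()) / len(ru)
    mv = sum(rv.values()) / len(rv)
    num = sum((ru[i] - mu) * (rv[i] - mv) for i in common)
    den = (sqrt(sum((ru[i] - mu) ** 2 for i in common)) *
           sqrt(sum((rv[i] - mv) ** 2 for i in common)))
    return num / den if den else 0.0

def predict(target, neighbours, item):
    """Target's mean rating plus the similarity-weighted, mean-centred
    neighbour ratings for the item."""
    mt = sum(target.values()) / len(target)
    num = den = 0.0
    for rv in neighbours:
        if item in rv:
            s = pearson(target, rv)
            num += s * (rv[item] - sum(rv.values()) / len(rv))
            den += abs(s)
    return mt + num / den if den else mt

u = {"m1": 5, "m2": 3}
v = {"m1": 4, "m2": 2, "m3": 5}
print(predict(u, [v], "m3"))   # ≈ 5.33: v rated m3 well above v's mean
```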
 Itembased Similar items are rated in a similar way by the same user
 Identify the set of users who rated the target item i
 Identify which other items those users rated
 Compute similarity between each neighbour and target item
 Select k most similar neighbours
 Predict ratings for the target item
 Challenges
 Little data for each user
 Too many items to choose from
 New users have no data
 Few recommendations to propose
 Large dataset
Rule mining

Itemset is a collection of one or more items, like {A,B,C}.
 kitemset contains k items.
 Support count ($\sigma$) is the frequency of occurrence of an itemset
 Support is the fraction of transactions that contain an itemset
 Frequent Itemset is an itemset whose support is greater than or equal to a minsup threshold
 Association Rule
An implication expression of form $A \rightarrow B$ where A and B are itemsets
 Support: Fraction of transactions that contain both A and B, $s = \frac{\sigma(A \cup B)}{N}$
 Confidence: Measures how often items in B appear in transactions that contain A, $c = \frac{\sigma(A \cup B)}{\sigma(A)}$
 Useful: High quality, actionable information
 Trivial: Already known to anyone familiar with the context
 Inexplicable: Those which have no apparent explanation
Approaches
Bruteforce (prohibitive)
 List all possible rules
 Compute support and confidence
 Prune rules that fail the minsup and minconf
Twostep approach
 Frequent itemset generation
 Reduce candidate (Apriori)
 Reduce transaction
 Reduce number of comparisons
 Rule generation (Generate Hash Tree)
Techniques
Apriori
A subset of an itemset always has at least the same support, so every subset of a frequent itemset is also frequent (and any candidate with an infrequent subset can be pruned)
 Let k=1
 Generate frequent itemsets of length 1
 Repeat until no new frequent itemsets are identified
 Prune candidate itemsets containing subset of length k that are infreq
 Count support of each candidate by scanning database
 Eliminate infreq candidates
 Generate length k+1 candidate itemsets from length k itemsets
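The candidate-generation loop above can be sketched as follows (the market-basket transactions are made up):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Start from 1-itemsets, extend frequent k-itemsets to (k+1)-candidates,
    prune candidates with an infrequent subset, count support, eliminate."""
    items = sorted({i for t in transactions for i in t})
    freq, current = {}, []
    for i in items:                                   # k = 1
        sup = sum(1 for t in transactions if i in t)
        if sup >= minsup:
            freq[frozenset([i])] = sup
            current.append(frozenset([i]))
    k = 2
    while current:
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent
        candidates = [c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))]
        current = []
        for c in candidates:
            sup = sum(1 for t in transactions if c <= set(t))
            if sup >= minsup:
                freq[c] = sup
                current.append(c)
        k += 1
    return freq

tx = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]
freq = apriori(tx, minsup=2)
```

On this toy data, {milk, butter} is pruned (support 1), so {bread, milk, butter} is never even counted.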
Generate Hash Tree
Accelerates counting the support of candidates
 Build up a hash tree by hashing the items in candidate itemsets
 Given a transaction, start from the 1st layer and recurse:
 At the $k^{th}$ layer, hash the transaction's $k^{th}, (k+1)^{th}, \ldots$ items at the current node, descending until a leaf of the tree is reached; then compare the transaction with the candidates at that leaf node
Further Issues
 Correlation vs causation
 May be wrong direction
 Causation with unseen intermediary
 Unseen cause of both events
 Coincidental
 Data mining
 Valid: Data has support for the pattern
 Non-trivial: The pattern isn't self-evident from the data
 Previously unknown
Tao Lu
Calculating document ranking for query
 Write vector of documents using given tf formula
 Write the vector of the query using the given idf formula. Pay attention that when the freq of a term in the query is zero, the value should probably be 0
 Calculate the similarity
Decision Tree: Select the Splitting Attr
Using IG
 Calculate the root entropy using class distribution $P(C)$
 For each attribute A, calculate:
 Entropy of $P(C)$ for the data subset separated by different A values.
 Sum up the entropies weighted by the attr A value count distribution to get H(A)
 Calculate IG(A) = H(R) - H(A)
Using Gain Ratio
 Gain ratio is obtained by dividing IG using SI
 SI (Split information) is the entropy of attr count distribution
Using GINI Split
 Calculate GINI(R)
 Calculate the weighted sum of GINI over the different values of the attr
 Subtract to get the GS value of the attr
GINI is 1 minus the sum of squared class probabilities: $GINI = 1 - \sum_j p_j^2$
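A sketch of the IG calculation steps above (the toy weather table is made up):

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, label):
    """Root entropy minus the value-count-weighted entropy of the
    subsets produced by splitting on `attr`."""
    h_root = entropy([r[label] for r in rows])
    h_attr = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == v]
        h_attr += len(subset) / len(rows) * entropy(subset)
    return h_root - h_attr

rows = [{"outlook": "sunny", "play": "no"},
        {"outlook": "sunny", "play": "no"},
        {"outlook": "rainy", "play": "yes"},
        {"outlook": "rainy", "play": "yes"}]
print(info_gain(rows, "outlook", "play"))   # 1.0: a perfect split
```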
Accumulator
 Init an accumulator for each doc
 For each query term t
 Add the weight product $w_{q,t} \cdot w_{d,t}$ to the corresponding accumulator for each doc in t's inverted list
 Return the docs with the greatest accumulator values
Acc limiting approach
Limit the number of accumulators: once the limit is hit, stop creating new ones (process query terms in decreasing $w_{q,t}$ order)
Acc threshold approach
If a term's contribution to the inner product is smaller than a threshold, do not create an accumulator (process query terms in decreasing $w_{q,t}$ order)
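The accumulator procedure can be sketched as follows (inverted lists here store `(doc_id, w_dt)` pairs; the toy index is made up):

```python
def rank_with_accumulators(inverted, query_weights, top=3):
    """One accumulator per document: add w_{q,t} * w_{d,t} for each
    query term's inverted list, then return the highest totals.
    Terms are processed in decreasing w_{q,t}, as the limiting and
    thresholding variants require."""
    acc = {}
    for t, wqt in sorted(query_weights.items(), key=lambda kv: -kv[1]):
        for d, wdt in inverted.get(t, []):
            acc[d] = acc.get(d, 0.0) + wqt * wdt
    return sorted(acc.items(), key=lambda kv: -kv[1])[:top]

inverted = {"cat": [(1, 2.0), (3, 1.0)], "sat": [(1, 1.0), (2, 1.0)]}
print(rank_with_accumulators(inverted, {"cat": 1.5, "sat": 0.5}))
# [(1, 3.5), (3, 1.5), (2, 0.5)]
```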
Calculate Evaluation Metrics for a Classifier
 Look at the confusion matrix, the diagonal elements are correctly classified item counts.
 Accuracy is calculated once overall: sum the diagonal and divide by the total item count
 Precision and Recall are calculated once per class. The sum along the actual row, excluding the diagonal element of a class, is FN (those instances are incorrect, and are missed positives). The sum down the classified column, excluding the diagonal, is FP (those are incorrectly predicted as this class).
 Then use the formulas to calculate the values
Calculate userbased/itembased recommendation system
See the User Similarity and Predicted ratings entries above: compute the Pearson correlation between the target user (or item) and each neighbour over the co-rated entries, then combine the neighbours' mean-centred ratings weighted by similarity, and add back the target's mean rating.