Unstructured Data for Information Retrieval¶
Elements of information retrieval:
- Information Needs (User Query)
- Information Resources (Documents)
- A system that identifies sources and indicates their relevance
Text Representations in IR¶
Unstructured Representation¶
- Standard text representation
- Bag of words: unordered multiset of terms
- Good performance, but not always relevant results
Example:
“One evening Frodo and Sam were walking together in the cool twilight. Both of them felt restless again. On Frodo suddenly the shadow of parting had fallen: the time to leave Lothlorien was near.”
{
"one": 1,
"evening": 1,
"frodo": 2,
...
}
Weakly-Structured Representation¶
- Certain groups of terms are treated as more important (e.g. nouns, named entities)
- Other terms: ignored or down-weighted
- Requires preprocessing: NLP (e.g. POS tagging: part-of-speech tagging to identify nouns)
Example:
“One evening Frodo and Sam were walking together in the cool twilight. Both of them felt restless again. On Frodo suddenly the shadow of parting had fallen: the time to leave Lothlorien was near.”
// Count of Nouns
{
"evening": 1,
"frodo": 2,
"sam": 1,
"twilight": 1,
"shadow": 1,
"lothlorien": 1
}
// Count of Names in Nouns
{
"frodo": 2,
"sam": 1,
}
Structured Representation¶
- Graph with nodes and edges
- Describes relations between terms
Example:
“One evening Frodo and Sam were walking together in the cool twilight. Both of them felt restless again. On Frodo suddenly the shadow of parting had fallen: the time to leave Lothlorien was near.”
graph LR
F[Frodo] --> w[walking]
S[Sam] --> w
F --> f[felt restless]
S --> f
F --> h[had fallen]
F --> l[leave]
l --> lo[Lothlorien]
h --> s[shadow]
w -- same time --> f
f -- before --> h
h -- before --> l
Text Preprocessing¶
Note
Goal: reduce the vocabulary size and remove noise
- How should words be split?
- Remove filler words
- Handle different verb forms, adjectives, and nouns with the same stem
- Handle synonyms and similar concepts in other languages
- Different document types need different preprocessing
Procedure¶
- Extract plain text from the document (e.g. `.txt`, `.pdf`, `.docx`, `.html`)
- Language detection
- Tokenization (split the text into tokens, i.e. character sequences)
- Morphological Normalization (e.g. stemming, lemmatization)
- Stopwords Removal
- Indexing
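A minimal sketch of the tokenization, normalization, and stopword-removal steps, using only the Python standard library; the plural-stripping line is a toy stand-in for a real stemmer, and the stopword list is illustrative only:
import re

def preprocess(raw_text: str) -> list[str]:
    # Tokenization with case folding: split on word characters.
    tokens = re.findall(r"\w+", raw_text.lower())
    # Morphological normalization: toy plural stripping (a real system
    # would use a stemmer or lemmatizer here).
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    # Stopword removal with a tiny illustrative list.
    stopwords = {"the", "and", "of", "in", "was", "were", "on"}
    return [t for t in tokens if t not in stopwords]

print(preprocess("One evening Frodo and Sam were walking together in the cool twilight."))
# -> ['one', 'evening', 'frodo', 'sam', 'walking', 'together', 'cool', 'twilight']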
Tokens and Terms¶
| Concept | Definition |
|---|---|
| word | a string of characters |
| term | the normalized form of a word |
| token | an instance of a term in the text |
| tokenization | the process of splitting text into tokens |
Tokenization/Normalization¶
Methods:
- Rule-based
- Supervised learning
Normalization Procedure 1¶
- Error/Spelling correction
- Case folding: all characters to lowercase
- Morphological normalization: reducing different forms of the same word to a common form
- Inflectional normalization / lemmatization: reduces inflected forms of a word to the lexically correct base form (lemma)
- Derivational normalization: reduces derived words to a lexically correct base form (specialized)
- Stemming: reduces words to their stem, which need not be a lexically correct word (e.g. "automates", "automated", "automatic" to "automat")
- Porter Stemmer (rule-based algorithm that cuts off certain suffixes) 2
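A small sketch of stemming versus lemmatization, assuming the third-party NLTK library is installed (`pip install nltk`) and the WordNet data has been downloaded via `nltk.download("wordnet")`:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
print(stemmer.stem("walking"))    # -> "walk"
print(stemmer.stem("automated"))  # stems need not be lexically correct words

# Lemmatization returns the lexically correct base form, but needs
# the part of speech ("v" = verb) to look up the lemma.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("felt", pos="v"))  # -> "feel"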
Expanding Queries¶
Theoretically very powerful, but in practice often inefficient, because all similar words (synonyms, translations) have to be indexed as well.
Stopword Removal 3 4¶
- Stopwords: words that are very common and carry little meaning for IR
- Removal reduces the vocabulary size
- Stopwords have low significance for matching
- If kept, they can produce false-positive matches
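A short sketch of the trade-off, standard library only and with an illustrative stopword list: removal shrinks the index, but a query that consists entirely of stopwords is destroyed (classic example: "to be or not to be"):
STOPWORDS = {"to", "be", "or", "not", "the", "of"}

def strip_stopwords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOPWORDS]

print(strip_stopwords("to be or not to be".split()))     # -> [] (query lost)
print(strip_stopwords("the lord of the rings".split()))  # -> ['lord', 'rings']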
Boolean Retrieval¶
Note
Method for matching documents against queries with boolean operators (AND, OR, NOT)
- Inefficient for large texts; ignores word positions and proximity within sentences
Models:
- Boolean Model (True/False)
- Algebraic Model (Vector Space Model)
- Probabilistic Model (Binary Independence Model)
- Semantic Model
Components¶
- Query Text: representation of raw query
- Documents: representation of raw documents
- Functions that map queries and documents to representations and compute relevance:
  - `fd`: maps a document to its representation (IR standard)
  - `fq`: maps a query to its representation (IR standard)
  - `r`: ranking function that calculates the relevance of a document for a query
Term-Document / Incidence Matrix¶
- Good for small collections
- Dictionary: all terms in the collection
- Vector: occurrence of each term in the document (1 if present, 0 if not)
Example:
Document:
One evening Frodo and Sam were walking together in the cool twilight.
Dimension:
{
"one": 1,
"evening": 1,
"frodo": 1,
"sam": 1,
"walking": 1,
"together": 1,
"cool": 1,
"twilight": 1,
"restless": 0,
"shadow": 0
}
Vector:
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
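A minimal sketch of building such an incidence matrix from the example documents (pre-lowercased for simplicity):
docs = {
    1: "one evening frodo and sam were walking together in the cool twilight",
    2: "both of them felt restless again",
}
dictionary = ["one", "evening", "frodo", "sam", "walking",
              "together", "cool", "twilight", "restless", "shadow"]

# One binary vector per document: 1 if the term occurs, 0 otherwise.
matrix = {
    doc_id: [1 if term in text.split() else 0 for term in dictionary]
    for doc_id, text in docs.items()
}
print(matrix[1])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(matrix[2])  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]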
Inverted Index¶
- Enables efficient query handling
- Maps each word to the list of documents in which it occurs
Example:
Document 1:
One evening Frodo and Sam were walking together in the cool twilight.
Document 2:
Both of them felt restless again.
Dimension:
{
"one": [1],
"evening": [1],
"frodo": [1],
"sam": [1],
"walking": [1],
"together": [1],
"cool": [1],
"twilight": [1],
"restless": [2]
}
Query `Sam AND Frodo`:
sam = [1]
frodo = [1]
sam ∩ frodo = [1]
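A minimal sketch of building the inverted index and answering the AND query by intersecting postings lists (documents pre-lowercased):
from collections import defaultdict

docs = {
    1: "one evening frodo and sam were walking together in the cool twilight",
    2: "both of them felt restless again",
}

# Each term maps to a sorted list of document IDs in which it occurs.
index: dict[str, list[int]] = defaultdict(list)
for doc_id in sorted(docs):
    for term in sorted(set(docs[doc_id].split())):
        index[term].append(doc_id)

# Query "Sam AND Frodo": intersect the two postings lists.
print(sorted(set(index["sam"]) & set(index["frodo"])))  # -> [1]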
Intersection Optimization¶
- Intersect starting with the smallest postings list
- Sort the index alphabetically
- Add skip pointers (step size: \(\sqrt{N}\)); see the sketch after the example
graph LR
Sam --> 1 --> 2 --> 7 --> 174 --> 210 --> 331 --> 2046
1 -- skip --> 174 -- skip --> 2046
Frodo --> f2[2] --> 3 --> 4 --> f7[7] --> 21 --> 29 --> 38 --> 91 --> 101 --> 122 --> 134 --> 171 --> 1137
f2 -- skip --> 21 -- skip --> 134 -- skip --> 1137
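A sketch of skip-pointer intersection: skips of about \(\sqrt{N}\) let the merge jump ahead whenever the skipped-to document ID does not overshoot the other list (postings taken from the example above):
import math

def intersect_with_skips(p1: list[int], p2: list[int]) -> list[int]:
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            # Take skip jumps while they do not overshoot p2[j] ...
            while i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            # ... then fall back to a single step if still behind.
            if p1[i] < p2[j]:
                i += 1
        else:
            while j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            if p2[j] < p1[i]:
                j += 1
    return answer

sam = [1, 2, 7, 174, 210, 331, 2046]
frodo = [2, 3, 4, 7, 21, 29, 38, 91, 101, 122, 134, 171, 1137]
print(intersect_with_skips(sam, frodo))  # -> [2, 7]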
Ranked Retrieval¶
Note
Boolean retrieval only returns True/False for a query. Ranked retrieval returns a relevance score for each document, which induces a ranking.
Ranking Principle 1¶
Note
Relevance by the number of unique query terms in the document (count of unique query terms)
Example Query: earthquake epicentre richter scale
Document 1:
Yesterday there was an earthquake in China. The epicentre was 200 miles north of the capital. The measurement on the Richter scale was about ...
Document 2:
After the strong earthquake in 2020, the damage has been repaired really fast. The beautiful city Bolognese in Italy with its new mayor Andrea Richter currently experiences ...
Ranking Principle 2¶
Note
Relevance by the total number of query term occurrences in the document (count of all query terms)
Example Query: earthquake richter
Document 1:
Yesterday there was an earthquake in China. The epicentre was 200 miles north of the capital. The measurement on the Richter scale was about 5.7. The duration of the earthquake was about 30 seconds.
Document 2:
After the strong earthquake in 2020, the damage has been repaired really fast. The beautiful city Bolognese in Italy with its new mayor Andrea Richter currently experiences ...
Ranking Principle 3¶
Note
Relevance by rarity of occurrence of the query terms across documents (rare terms weigh more; at minimum, all query terms should occur)
Example Query: earthquake bolognese
Document 1:
Yesterday there was an earthquake in China. The epicentre was 200 miles north of the capital. The measurement on the Richter scale was about 5.7. The duration of the earthquake was about 30 seconds.
Document 2:
After the strong earthquake in 2020, the damage has been repaired really fast. The beautiful city Bolognese in Italy with its new mayor Andrea Richter currently experiences ...
Ranking Principle 4¶
Note
Relevance by the distance of the query terms from each other in the document (how close together the query terms are)
Example Query: Hans Meyer
Document 1:
Hans Meyer elected as new president. Yesterday, Hans Meyer was elected as new president of ...
Document 2:
Hans Gfeller, Horst Peter, Markus Meyer
Ranking Principle 5¶
Note
Relevance by the position of the query terms in the document (how early the query terms appear in the document)
Example Query: Hans Meyer
Document 1:
Hans Meyer elected as new president. Yesterday, Hans Meyer was elected as new president of ...
Document 2:
After the election of a new president, a lot of things are updated. According to Hans Meyer there will be ...
Modified Ranking Index¶
Ranking (R1, R2, R3)¶
- R1: Relevance by the number of unique query terms in the document
- R2: Relevance by the total number of query term occurrences in the document
- R3: Relevance by rarity of occurrence of the query terms across documents
Schema:
{
"value": [[Document Nr., Amount of Occurences], ...],
"example": [[1, 4], [2, 3]]
}
Ranking (R1, R2, R3, R4, R5)¶
- R1: Relevance by the number of unique query terms in the document
- R2: Relevance by the total number of query term occurrences in the document
- R3: Relevance by rarity of occurrence of the query terms across documents
- R4: Relevance by the distance of the query terms from each other in the document
- R5: Relevance by the position of the query terms in the document
Schema:
{
"value": [[Document Nr., Amount of Occurences, [Position 1, Position 2, ...]], ...],
"example": [[1, 4, [1, 2, 3, 4]], [2, 3, [1, 2, 3]]]
}
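A sketch of building this extended schema (document number, occurrence count, token positions) from raw text; the documents here are simplified, pre-lowercased versions of the Ranking Principle 4 example:
from collections import defaultdict

docs = {
    1: "hans meyer elected as new president yesterday hans meyer was elected",
    2: "hans gfeller horst peter markus meyer",
}

index = defaultdict(list)  # term -> [[doc no., count, [positions]], ...]
for doc_id, text in docs.items():
    positions = defaultdict(list)
    for pos, term in enumerate(text.split(), start=1):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        index[term].append([doc_id, len(pos_list), pos_list])

print(index["hans"])   # [[1, 2, [1, 8]], [2, 1, [1]]]
print(index["meyer"])  # [[1, 2, [2, 9]], [2, 1, [6]]]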
Vector Space Model¶
Note
Model for representing documents and queries as vectors in a multidimensional space. Similarity is measured by the angle between the vectors.
Cosine similarity: \(\cos(d, q) = \frac{d \cdot q}{\|d\| \, \|q\|}\)
Example:
| Document 1 | Document 2 |
|---|---|
| Frodo and Sam were walking together in the cool twilight. | Frodo likes to walk in the evening. |
Dimension:
["frodo", "sam", "walk", "together", "cool", "twilight", "like", "evening"]
Vectors:
D1 = [1, 1, 1, 1, 1, 1, 0, 0]
D2 = [1, 0, 1, 0, 0, 0, 1, 1]
Query: sam twilight
q = [0, 1, 0, 0, 0, 1, 0, 0]
Result:
cos(D1, q) ≈ 0.577
cos(D2, q) = 0
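A standard-library sketch that reproduces these numbers:
import math

def cosine(a: list[int], b: list[int]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

d1 = [1, 1, 1, 1, 1, 1, 0, 0]
d2 = [1, 0, 1, 0, 0, 0, 1, 1]
q = [0, 1, 0, 0, 0, 1, 0, 0]

print(round(cosine(d1, q), 3))  # -> 0.577
print(round(cosine(d2, q), 3))  # -> 0.0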
Term Frequency (TF)¶
Note
Relative frequency of a term in a document: \(tf(t, d) = \frac{\text{count}(t, d)}{|d|}\), where \(|d|\) is the total number of terms in \(d\)
Example:
Document 1:
One evening Frodo and Sam were walking together in the cool twilight. Both of them felt restless again. On Frodo suddenly the shadow of parting had fallen: the time to leave Lothlorien was near.
Total number of terms: 34
| Term | Count | Term Frequency |
|---|---|---|
| Frodo | 2 | \(tf(\text{Frodo}, d_1) = 2 / 34 = 0.0588\) |
| Sam | 1 | \(tf(\text{Sam}, d_1) = 1 / 34 = 0.0294\) |
Inverse Document Frequency (IDF)¶
Note
Importance of a term across all documents: \(idf(t) = \log_{10}\frac{N}{df(t)}\), where \(N\) is the number of documents and \(df(t)\) the number of documents containing \(t\). High if the term occurs in few documents.
Example:
Document 1:
One evening Frodo and Sam were walking together in the cool twilight.
Document 2:
Both of them felt restless again.
Document 3:
On Frodo suddenly the shadow of parting had falling: the time to leave Lothlorien was near.
| Term | Document Frequency | Inverse Document Frequency |
|---|---|---|
| Frodo | 2 | \(\log_{10}(\frac{3}{2}) = 0.176\) |
| Sam | 1 | \(\log_{10}(\frac{3}{1}) = 0.477\) |
TF-IDF¶
Note
Term frequency in a document weighted by the importance of the term across all documents: \(tfidf(t, d) = tf(t, d) \cdot idf(t)\)
High value: the term occurs often in the document and rarely in other documents
Example:
Document 1:
One evening Frodo and Sam were walking together in the cool twilight.
Document 2:
Both of them felt restless again.
Document 3:
On Frodo suddenly the shadow of parting had falling: the time to leave Lothlorien was near.
| Term | TF | IDF | TF-IDF |
|---|---|---|---|
| Frodo | \(2 / 34 = 0.0588\) | \(\log_{10}(\frac{3}{2}) = 0.176\) | 0.0104 |
| Sam | \(1 / 34 = 0.0294\) | \(\log_{10}(\frac{3}{1}) = 0.477\) | 0.0140 |
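A short sketch reproducing the table (base-10 logarithm; the TF values reuse the term counts and 34-term total from the Term Frequency example above):
import math

N = 3             # number of documents in the collection
TOTAL_TERMS = 34  # length of the passage used for the TF values

def tf_idf(count_in_doc: int, doc_freq: int) -> float:
    tf = count_in_doc / TOTAL_TERMS
    idf = math.log10(N / doc_freq)
    return tf * idf

print(round(tf_idf(2, 2), 4))  # Frodo -> 0.0104
print(round(tf_idf(1, 1), 4))  # Sam   -> 0.014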
Data Processing / Record Linkage¶
Synonyms: Entity Resolution, Duplicate Detection, Coreference Resolution, Fuzzy Matching
Data Integration¶
Providing unified access to data that exists in several autonomous data sources
Record Linkage¶
Note
Record linkage identifies records that refer to the same entity by comparing identifiers, avoiding duplicates. It is often used in data integration and marketing.
Levenshtein Distance¶
Note
Distance between two strings (e.g. words): the minimum number of edit operations needed to transform one string into the other.
Costs:
| Operation | Cost |
|---|---|
| Copy character (match) | 0 |
| Insert character | 1 |
| Delete character | 1 |
| Replace character | 1 |
Procedure: fill a dynamic-programming matrix of edit costs (see the sketch below).
Path: read backwards from the bottom-right cell, choosing the smallest cost first, preferring the diagonal move on ties.
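A standard dynamic-programming sketch of the distance using the cost table above (copy 0; insert, delete, replace 1); the example strings are illustrative:
def levenshtein(a: str, b: str) -> int:
    # dp[i][j] = minimum cost of transforming a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i  # delete all characters of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j  # insert all characters of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # copy or replace
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete
                           dp[i][j - 1] + 1,          # insert
                           dp[i - 1][j - 1] + cost)   # copy/replace
    return dp[len(a)][len(b)]

print(levenshtein("meyer", "meier"))  # -> 1 (replace "y" with "i")
print(levenshtein("hans", "hanns"))   # -> 1 (insert one "n")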