[Information Retrieval] Boolean Retrieval and Inverted Index

Purpose

Master the process of building an inverted index, and master the algorithm for merging postings lists

Experiment Procedure

1. Inverted index

Following the detailed process of building an inverted index described in Figure 1.4 on page 8 of the textbook "Introduction to Information Retrieval", use the 60 documents in the attachment "HW1.txt" (included at the end of this post; each line is one document) and implement the index-construction process in Java or another common language.

Language used: Python3

  1. First, read the documents into a one-dimensional array
    Code:
import re
import pandas as pd

# open the file
f = open('HW1.txt')
# read the lines, stripping the trailing newline from each
doc = pd.Series(f.read().splitlines())

Result: print the array for viewing

0 Adaptive Pairwise Preference Learning for Coll…
1 HGMF: Hierarchical Group Matrix Factorization …
2 Adaptive Bayesian Personalized Ranking for Het…
3 Compressed Knowledge Transfer via Factorizatio…
4 A Survey of Transfer Learning for Collaborativ…
5 Mixed Factorization for Collaborative Recommen…
6 Transfer Learning for Heterogeneous One-Class …
7 Group Bayesian Personalized Ranking with Rich …
8 Transfer Learning for Semi-Supervised Collabor…
9 Transfer Learning for Behavior Prediction
10 TOCCF: Time-Aware One-Class Collaborative Filt…
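As a side note, the open call above never closes the file handle; a with block does this automatically. The sketch below writes a tiny stand-in file so it runs on its own (HW1_demo.txt is a made-up name, not part of the assignment):

```python
import pandas as pd

# create a tiny stand-in for HW1.txt so the sketch is self-contained
with open('HW1_demo.txt', 'w', encoding='utf-8') as f:
    f.write('Transfer Learning for Behavior Prediction\n'
            'Context-aware Collaborative Ranking\n')

# the with block closes the file automatically after reading
with open('HW1_demo.txt', encoding='utf-8') as f:
    doc = pd.Series(f.read().splitlines())

print(len(doc))  # → 2
```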

  2. Convert each document into a list of tokens
    Steps:
    a. Convert all the text to lowercase
    b. Use a regular expression to split the text on non-letter characters (Note: two consecutive non-letter characters produce an empty string in the split result, which must be removed manually)
    Code:
# convert to lowercase and split with a regular expression
doc = doc.apply(lambda x: re.split('[^a-zA-Z]', x.lower()))
# remove the empty strings produced by consecutive non-letters
for tokens in doc:
    while '' in tokens:
        tokens.remove('')

Result: print the token list for viewing

0 [adaptive, pairwise, preference, learning, for…
1 [hgmf, hierarchical, group, matrix, factorizat…
2 [adaptive, bayesian, personalized, ranking, fo…
3 [compressed, knowledge, transfer, via, factori…
4 [a, survey, of, transfer, learning, for, colla…
5 [mixed, factorization, for, collaborative, rec…
6 [transfer, learning, for, heterogeneous, one, …
7 [group, bayesian, personalized, ranking, with,…
8 [transfer, learning, for, semi, supervised, co…
9 [transfer, learning, for, behavior, prediction]
10 [toccf, time, aware, one, class, collaborative…
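As an aside, the manual empty-string cleanup can be avoided by matching letter runs directly instead of splitting on non-letters. A minimal sketch (tokenize is a hypothetical helper, not part of the original code):

```python
import re

def tokenize(text):
    # match runs of letters directly, so no empty strings are produced
    return re.findall('[a-z]+', text.lower())

print(tokenize('HGMF: Hierarchical Group Matrix Factorization'))
```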

  3. Build the inverted index
    Steps:
    a. Build the following data structure:
    a hash table whose keys are strings and whose values are lists.
    Every term is stored as a key and serves as the index into the hash table; the first element of each value list records the length of the postings list (the document frequency), and the remaining elements record the numbers of the documents in which the term appears.
    b. Traverse the token lists:
    1. If the term is already present, append the current document number to the end of its list and increase the recorded length by one.
    2. The first time a term appears, add it to the hash table.

Code:

hashtable = {}
for index, tokens in enumerate(doc):
    for word in tokens:
        if word in hashtable:
            # skip duplicates when a term occurs twice in the same document
            if hashtable[word][-1] != index + 1:
                hashtable[word].append(index + 1)
                hashtable[word][0] += 1
        else:
            hashtable[word] = [1, index + 1]
# sort by document frequency in descending order
hashtable = dict(
    sorted(hashtable.items(), key=lambda kv: (kv[1][0], kv[0]), reverse=True))

Result:

# DataFrame.append is deprecated; collect the rows first, then build the DataFrame
rows = [{'term': term,
         'doc.freq': hashtable[term][0],
         'postings list': hashtable[term][1:]}
        for term in hashtable]
inverted_index = pd.DataFrame(rows, columns=['term', 'doc.freq', 'postings list'])
display(inverted_index[:20])
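The same hash-table layout can also be built with collections.defaultdict. The sketch below uses two toy token lists in place of the real documents and keeps the document frequency in the first position, as above:

```python
from collections import defaultdict

# toy token lists standing in for the 60 real documents
docs = [['transfer', 'learning'], ['transfer', 'ranking']]

postings = defaultdict(list)
for doc_id, tokens in enumerate(docs, start=1):
    for word in tokens:
        # record each document number at most once per term
        if not postings[word] or postings[word][-1] != doc_id:
            postings[word].append(doc_id)

# prepend the document frequency, matching the hash-table layout above
hashtable = {w: [len(p)] + p for w, p in postings.items()}
print(hashtable['transfer'])  # → [2, 1, 2]
```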


2. Boolean search

Following the merging algorithm for postings lists described in Figure 1.6 on page 11 of the textbook "Introduction to Information Retrieval", use the inverted index from question (1) to implement the following Boolean queries in Java or another common language:
a. transfer AND learning
b. transfer AND learning AND filtering
c. recommendation AND filtering
d. recommendation OR filtering
e. transfer AND NOT (recommendation OR filtering)

Before answering the sub-questions, first write the three merge functions: AND, OR, and AND NOT.

  1. AND
    Idea:
    The parameters are the postings lists of the two terms, and the return value is the result list of the AND operation.
    The task is to keep the document numbers that appear in both list1 and list2. Since both lists are sorted in ascending order, it suffices to repeatedly advance the pointer at the smaller number; whenever both pointers see the same number, add it to the result list. Stop when either pointer reaches the end.
def And(list1, list2):
    i, j = 0, 0
    res = []
    while i < len(list1) and j < len(list2):
        # present in both lists: add to the result
        if list1[i] == list2[j]:
            res.append(list1[i])
            i += 1
            j += 1
        # advance the pointer at the smaller number
        elif list1[i] < list2[j]:
            i += 1
        else:
            j += 1
    return res

  2. OR
    Idea:
    The parameters are the postings lists of the two terms, and the return value is the result list of the OR operation.
    The task is to merge the two lists, keeping every document number that appears in list1 or list2. The approach is similar to AND: both lists are sorted in ascending order, so repeatedly advance the pointer at the smaller number. The difference is that when the two numbers differ, the smaller one is still added to the result list. Stop when either pointer reaches the end.
    Because OR merges the two lists, the document numbers left untraversed in either list must also be appended to the result.
def Or(list1, list2):
    i, j = 0, 0
    res = []
    while i < len(list1) and j < len(list2):
        # present in both lists: add only once
        if list1[i] == list2[j]:
            res.append(list1[i])
            i += 1
            j += 1
        # advance the pointer at the smaller number, adding it to the result
        elif list1[i] < list2[j]:
            res.append(list1[i])
            i += 1
        else:
            res.append(list2[j])
            j += 1
    # append whatever remains untraversed (at most one slice is non-empty)
    res.extend(list1[i:])
    res.extend(list2[j:])
    return res

  3. AND NOT
    Idea:
    The parameters are the postings lists of the two terms, and the return value is the result list of the AND NOT operation.
    The task is to keep the document numbers that appear in list1 but not in list2. Both lists are sorted in ascending order, so again repeatedly advance the pointer at the smaller number; when the number in list1 is the smaller one, add it to the result list. Stop when either pointer reaches the end.
    If list2 is exhausted before list1, the remaining numbers in list1 cannot appear in list2, so they must all be appended to the result.
def AndNot(list1, list2):
    i, j = 0, 0
    res = []
    while i < len(list1) and j < len(list2):
        # equal document numbers: advance both pointers, add nothing
        if list1[i] == list2[j]:
            i += 1
            j += 1
        # list1's number is smaller: add it to the result
        elif list1[i] < list2[j]:
            res.append(list1[i])
            i += 1
        else:
            j += 1
    # list1 not fully traversed: append the remaining numbers
    if i != len(list1):
        res.extend(list1[i:])
    return res

  4. Helper function: fetch a term's postings list from the hash table, dropping the first element (which records the list length)
def getList(word):
    return hashtable[word][1:]
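As a quick sanity check (not part of the assignment), the three merges can be cross-checked against Python's built-in set operations on small sorted inputs; the block below repeats the three functions so it runs standalone:

```python
# standalone copies of the three merge functions above
def And(a, b):
    i, j, res = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            res.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return res

def Or(a, b):
    i, j, res = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            res.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            res.append(a[i]); i += 1
        else:
            res.append(b[j]); j += 1
    res.extend(a[i:]); res.extend(b[j:])
    return res

def AndNot(a, b):
    i, j, res = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1; j += 1
        elif a[i] < b[j]:
            res.append(a[i]); i += 1
        else:
            j += 1
    res.extend(a[i:])
    return res

# each merge must agree with the corresponding set operation
x, y = [1, 3, 5, 7], [3, 4, 7, 9]
assert And(x, y) == sorted(set(x) & set(y))
assert Or(x, y) == sorted(set(x) | set(y))
assert AndNot(x, y) == sorted(set(x) - set(y))
print('all merges agree with set operations')
```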

a) transfer AND learning

Result: transfer AND learning: [5, 7, 9, 10, 16, 17, 25, 32, 33, 49, 55, 56]

print('transfer AND learning:', And(getList('transfer'), getList('learning')),'\n')

print('transfer:', getList('transfer'))
print('learning:',getList('learning'))

transfer AND learning: [5, 7, 9, 10, 16, 17, 25, 32, 33, 49, 55, 56]

transfer: [4, 5, 7, 9, 10, 16, 17, 24, 25, 29, 32, 33, 43, 49, 55, 56]
learning: [1, 5, 7, 9, 10, 14, 16, 17, 19, 20, 25, 30, 32, 33, 36, 41, 47, 49, 54, 55, 56, 58]

The result is correct.

b) transfer AND learning AND filtering

Calculate transfer AND learning first, then AND the result with filtering
Result: transfer AND learning AND filtering: [7, 25, 33, 55, 56]

print('transfer AND learning AND filtering:', And(And(getList('transfer'), getList('learning')), getList('filtering')), '\n')

print('transfer:', getList('transfer'))
print('learning:', getList('learning'))
print('filtering:', getList('filtering'))
print('transfer AND learning:', And(getList('transfer'), getList('learning')))

transfer AND learning AND filtering: [7, 25, 33, 55, 56]

transfer: [4, 5, 7, 9, 10, 16, 17, 24, 25, 29, 32, 33, 43, 49, 55, 56]
learning: [1, 5, 7, 9, 10, 14, 16, 17, 19, 20, 25, 30, 32, 33, 36, 41, 47, 49, 54, 55, 56, 58]
filtering: [7, 8, 11, 12, 13, 18, 19, 22, 24, 25, 26, 27, 30, 33, 36, 37, 38, 41, 42, 46, 52, 54, 55, 56, 57, 58]
transfer AND learning: [5, 7, 9, 10, 16, 17, 25, 32, 33, 49, 55, 56]

The result is correct.

c) recommendation AND filtering

Result: recommendation AND filtering: [13, 26, 38]

print('recommendation AND filtering:', And(getList('recommendation'), getList('filtering')), '\n')

print('recommendation:', getList('recommendation'))
print('filtering:', getList('filtering'))

recommendation AND filtering: [13, 26, 38]

recommendation: [1, 2, 4, 5, 6, 9, 13, 14, 15, 17, 20, 21, 26, 29, 31, 32, 34, 35, 38, 39, 43, 44, 45, 47, 48, 49, 50, 51, 53, 59, 60]
filtering: [7, 8, 11, 12, 13, 18, 19, 22, 24, 25, 26, 27, 30, 33, 36, 37, 38, 41, 42, 46, 52, 54, 55, 56, 57, 58]

The result is correct.

d) recommendation OR filtering

Result: recommendation OR filtering: [1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]

print('recommendation OR filtering:', Or(getList('recommendation'), getList('filtering')), '\n')

print('recommendation:', getList('recommendation'))
print('filtering:', getList('filtering'))

recommendation OR filtering: [1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]

recommendation: [1, 2, 4, 5, 6, 9, 13, 14, 15, 17, 20, 21, 26, 29, 31, 32, 34, 35, 38, 39, 43, 44, 45, 47, 48, 49, 50, 51, 53, 59, 60]
filtering: [7, 8, 11, 12, 13, 18, 19, 22, 24, 25, 26, 27, 30, 33, 36, 37, 38, 41, 42, 46, 52, 54, 55, 56, 57, 58]

The result is correct.

e) transfer AND NOT (recommendation OR filtering)

Calculate recommendation OR filtering first, then apply AND NOT
Result: transfer AND NOT (recommendation OR filtering): [10, 16]

print('transfer AND NOT (recommendation OR filtering)', \
    AndNot(getList('transfer'), Or(getList('recommendation'), getList('filtering'))), '\n')

print('transfer:', getList('transfer'))
print('recommendation:', getList('recommendation'))
print('filtering:', getList('filtering'))
print('recommendation OR filtering:', Or(getList('recommendation'), getList('filtering')))

transfer AND NOT (recommendation OR filtering) [10, 16]

transfer: [4, 5, 7, 9, 10, 16, 17, 24, 25, 29, 32, 33, 43, 49, 55, 56]
recommendation: [1, 2, 4, 5, 6, 9, 13, 14, 15, 17, 20, 21, 26, 29, 31, 32, 34, 35, 38, 39, 43, 44, 45, 47, 48, 49, 50, 51, 53, 59, 60]
filtering: [7, 8, 11, 12, 13, 18, 19, 22, 24, 25, 26, 27, 30, 33, 36, 37, 38, 41, 42, 46, 52, 54, 55, 56, 57, 58]
recommendation OR filtering: [1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]

Appendix: HW1.txt

Adaptive Pairwise Preference Learning for Collaborative Recommendation with Implicit Feedbacks
HGMF: Hierarchical Group Matrix Factorization for Collaborative Recommendation
Adaptive Bayesian Personalized Ranking for Heterogeneous Implicit Feedbacks
Compressed Knowledge Transfer via Factorization Machine for Heterogeneous Collaborative Recommendation
A Survey of Transfer Learning for Collaborative Recommendation with Auxiliary Data
Mixed Factorization for Collaborative Recommendation with Heterogeneous Explicit Feedbacks
Transfer Learning for Heterogeneous One-Class Collaborative Filtering
Group Bayesian Personalized Ranking with Rich Interactions for One-Class Collaborative Filtering
Transfer Learning for Semi-Supervised Collaborative Recommendation
Transfer Learning for Behavior Prediction
TOCCF: Time-Aware One-Class Collaborative Filtering
RBPR: Role-based Bayesian Personalized Ranking for Heterogeneous One-Class Collaborative Filtering
Hybrid One-Class Collaborative Filtering for Job Recommendation
Mixed Similarity Learning for Recommendation with Implicit Feedback
Collaborative Recommendation with Multiclass Preference Context
Transfer Learning for Behavior Ranking
Transfer Learning from APP Domain to News Domain for Dual Cold-Start Recommendation
k-CoFi: Modeling k-Granularity Preference Context in Collaborative Filtering
CoFi-points: Collaborative Filtering via Pointwise Preference Learning over Item-Sets
Personalized Recommendation with Implicit Feedback via Learning Pairwise Preferences over Item-sets
BIS: Bidirectional Item Similarity for Next-Item Recommendation
RLT: Residual-Loop Training in Collaborative Filtering for Combining Factorization and Global-Local Neighborhood
MF-DMPC: Matrix Factorization with Dual Multiclass Preference Context for Rating Prediction
Transfer to Rank for Heterogeneous One-Class Collaborative Filtering
Neighborhood-Enhanced Transfer Learning for One-Class Collaborative Filtering
Next-Item Recommendation via Collaborative Filtering with Bidirectional Item Similarity
Asymmetric Bayesian Personalized Ranking for One-Class Collaborative Filtering
Context-aware Collaborative Ranking
Transfer to Rank for Top-N Recommendation
Dual Similarity Learning for Heterogeneous One-Class Collaborative Filtering
Sequence-Aware Factored Mixed Similarity Model for Next-Item Recommendation
PAT: Preference-Aware Transfer Learning for Recommendation with Heterogeneous Feedback
Adaptive Transfer Learning for Heterogeneous One-Class Collaborative Filtering
A General Knowledge Distillation Framework for Counterfactual Recommendation via Uniform Data
FISSA: Fusing Item Similarity Models with Self-Attention Networks for Sequential Recommendation
Asymmetric Pairwise Preference Learning for Heterogeneous One-Class Collaborative Filtering
k-Reciprocal Nearest Neighbors Algorithm for One-Class Collaborative Filtering
CoFiGAN: Collaborative Filtering by Generative and Discriminative Training for One-Class Recommendation
Conditional Restricted Boltzmann Machine for Item Recommendation
Matrix Factorization with Heterogeneous Multiclass Preference Context
CoFi-points: Collaborative Filtering via Pointwise Preference Learning on User/Item-Set
A Survey on Heterogeneous One-Class Collaborative Filtering
Holistic Transfer to Rank for Top-N Recommendation
FedRec: Federated Recommendation with Explicit Feedback
Factored Heterogeneous Similarity Model for Recommendation with Implicit Feedback
FCMF: Federated Collective Matrix Factorization for Heterogeneous Collaborative Filtering
Sequence-Aware Similarity Learning for Next-Item Recommendation
FR-FMSS: Federated Recommendation via Fake Marks and Secret Sharing
Transfer Learning in Collaborative Recommendation for Bias Reduction
Mitigating Confounding Bias in Recommendation via Information Bottleneck
FedRec++: Lossless Federated Recommendation with Explicit Feedback
VAE++: Variational AutoEncoder for Heterogeneous One-Class Collaborative Filtering
TransRec++: Translation-based Sequential Recommendation with Heterogeneous Feedback
Collaborative filtering with implicit feedback via learning pairwise preferences over user-groups and item-sets
Interaction-Rich Transfer Learning for Collaborative Filtering with Heterogeneous User Feedback
Transfer Learning in Heterogeneous Collaborative Filtering Domains
GBPR: Group Preference based Bayesian Personalized Ranking for One-Class Collaborative Filtering
CoFiSet: Collaborative Filtering via Learning Pairwise Preferences over Item-sets
Modeling Item Category for Effective Recommendation
Position-Aware Context Attention for Session-Based Recommendation

Origin blog.csdn.net/pigpigpig64/article/details/123492848