# Semantic Search on Langchain Github Issues with Weaviate

## What's Semantic Search?

Semantic search refers to search algorithms that consider the intent and contextual meaning of search phrases when generating results, rather than solely focusing on keyword matching. The goal is to provide more accurate and relevant results by understanding the semantics, or meaning, behind the query.

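To make this concrete, here is a minimal sketch (not part of this repository) that embeds a query and two candidate documents with OpenAI and ranks them by cosine similarity; the semantically related document scores higher even though it shares no keywords with the query. The model choice and example strings are assumptions:

```python
# Minimal illustration of semantic vs. keyword matching
# (assumes an OPENAI_API_KEY environment variable is set).
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

query = "app crashes on startup"
docs = ["Application fails to launch", "How to change the theme color"]

q = embed(query)
for doc in docs:
    d = embed(doc)
    cosine = float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
    # "Application fails to launch" ranks first despite zero keyword overlap
    print(f"{cosine:.3f}  {doc}")
```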
## How does it work?

- **Ingesting Github Issues**: We use the Langchain Github Loader to connect to the Langchain repository and fetch its GitHub issues (nearly 2,000), which are then converted to a pandas dataframe and stored in a pickle file. See `./data-pipeline/ingest.py` (a rough sketch of this step follows below).

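As an illustration (not the repository's actual code), this step could look like the following with LangChain's `GitHubIssuesLoader`; the environment variable name and output path are assumptions:

```python
# Hedged sketch of the ingestion step: fetch the issues of the langchain
# repository and persist them as a pickled pandas dataframe.
import os
import pandas as pd
from langchain_community.document_loaders import GitHubIssuesLoader

loader = GitHubIssuesLoader(
    repo="langchain-ai/langchain",
    access_token=os.environ["GITHUB_TOKEN"],  # assumed env var name
    include_prs=False,  # issues only, no pull requests
    state="all",        # both open and closed issues
)
docs = loader.load()

# Flatten the loaded Documents into the columns indexed later in Weaviate.
df = pd.DataFrame(
    {
        "title": d.metadata["title"],
        "url": d.metadata["url"],
        "labels": d.metadata["labels"],
        "description": d.page_content,
        "creator": d.metadata["creator"],
        "created_at": d.metadata["created_at"],
        "state": d.metadata["state"],
    }
    for d in docs
)
df.to_pickle("issues.pkl")  # hypothetical output path
```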
- **Generating and Indexing Vector Embeddings with Weaviate**: Weaviate generates vector embeddings at the object level (rather than for individual properties). By default it vectorizes every property of the `text` data type; in our case we skip the `url` field (which is also made neither filterable nor searchable) and set up the `text2vec-openai` vectorizer. Given that our use case values fast queries over loading time, we opted for the HNSW vector index type, which incrementally builds a multi-layer structure of hierarchical proximity graphs (layers).

```python
class_obj = {
    "class": "GitHubIssue",
    "description": "This class contains GitHub Issues from the langchain repository.",
    "vectorIndexType": "hnsw",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {
            "model": "ada",
            "modelVersion": "002",
            "type": "text",
        }
    },
    # Note: the "labels" property imported later is not declared here;
    # presumably Weaviate's auto-schema adds it on first import.
    "properties": [
        {
            "name": "title",
            "dataType": ["text"],
        },
        {
            "name": "url",
            "dataType": ["text"],
            "indexFilterable": False,
            "indexSearchable": False,
            "vectorizePropertyName": False,
        },
        {
            "name": "description",
            "dataType": ["text"],
        },
        {
            "name": "creator",
            "dataType": ["text"],
        },
        {
            "name": "created_at",
            "dataType": ["date"],
        },
        {
            "name": "state",
            "dataType": ["text"],
        },
    ],
}
```

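The class definition is then registered with a Weaviate instance. A minimal sketch with the v3 Python client follows; the cluster URL and keys are placeholders, not taken from the repo:

```python
import weaviate

client = weaviate.Client(
    url="https://<your-cluster>.weaviate.network",             # placeholder
    auth_client_secret=weaviate.AuthApiKey("<WEAVIATE_KEY>"),  # placeholder
    additional_headers={"X-OpenAI-Api-Key": "<OPENAI_KEY>"},   # used by text2vec-openai
)
client.schema.create_class(class_obj)
```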
Ingestion then proceeds in batches of 100 records:

```python
with client.batch as batch:
    batch.batch_size = 100
    for item in df.itertuples():
        properties = {
            "title": item.title,
            "url": item.url,
            "labels": item.labels,
            "description": item.description,
            "creator": item.creator,
            "created_at": item.created_at,
            "state": item.state,
        }
        batch.add_data_object(
            data_object=properties,
            class_name="GitHubIssue",
        )
```
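With the v3 Python client, setting `batch.batch_size` enables automatic batching: queued objects are flushed to Weaviate every 100 records, and the context manager sends any remainder on exit.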
- **Searching with Weaviate**: Our app supports three search modes:

**Near-Text Vector Search**:

```python
import pandas as pd
import streamlit as st
import weaviate


@st.cache_data
def query_with_near_text(_w_client: weaviate.Client, query, max_results=10) -> pd.DataFrame:
    """
    Search GitHub Issues in Weaviate with Near Text.

    Weaviate converts the input query into a vector through the inference API
    (OpenAI) and uses that vector as the basis for a vector search.
    """
    response = (
        _w_client.query
        .get("GitHubIssue", ["title", "url", "labels", "description", "created_at", "state"])
        .with_near_text({"concepts": [query]})
        .with_limit(max_results)
        .do()
    )

    data = response["data"]["Get"]["GitHubIssue"]
    return pd.DataFrame.from_dict(data, orient="columns")
```
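The leading underscore in `_w_client` tells `st.cache_data` to exclude that argument from the cache key, since a Weaviate client object cannot be hashed by Streamlit.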

**BM25 Search**:

```python
@st.cache_data
def query_with_bm25(_w_client: weaviate.Client, query, max_results=10) -> pd.DataFrame:
    """
    Search GitHub Issues in Weaviate with BM25.

    Keyword search (also called sparse vector search) that looks for objects
    containing the search terms in their properties, according to the selected
    tokenization. Results are scored with the BM25F ranking function.
    """
    response = (
        _w_client.query
        .get("GitHubIssue", ["title", "url", "labels", "description", "created_at", "state"])
        .with_bm25(query=query)
        .with_limit(max_results)
        .with_additional("score")
        .do()
    )

    data = response["data"]["Get"]["GitHubIssue"]
    return pd.DataFrame.from_dict(data, orient="columns")
```

**Hybrid Search**:

```python
@st.cache_data
def query_with_hybrid(_w_client: weaviate.Client, query, max_results=10) -> pd.DataFrame:
    """
    Search GitHub Issues in Weaviate with Hybrid Search.

    Combines the results of a vector search and a BM25 keyword search,
    fusing both relevance scores into a single ranking.
    """
    response = (
        _w_client.query
        .get("GitHubIssue", ["title", "url", "labels", "description", "created_at", "state"])
        .with_hybrid(query=query)
        .with_limit(max_results)
        .with_additional(["score"])
        .do()
    )

    data = response["data"]["Get"]["GitHubIssue"]
    return pd.DataFrame.from_dict(data, orient="columns")
```

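For context, a hedged sketch of how these cached query functions might be wired into the Streamlit front end; widget labels and mode names are assumptions, not taken from `app.py`:

```python
import streamlit as st
import weaviate

# Placeholder connection; in the app this would point to the indexed cluster.
w_client = weaviate.Client(url="https://<your-cluster>.weaviate.network")

query = st.text_input("Search LangChain issues")
mode = st.radio("Search mode", ["Near Text", "BM25", "Hybrid"])

if query:
    if mode == "Near Text":
        df = query_with_near_text(w_client, query)
    elif mode == "BM25":
        df = query_with_bm25(w_client, query)
    else:
        df = query_with_hybrid(w_client, query)
    st.dataframe(df)
```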
## Quickstart

1. Clone the repository:
   `git clone git@github.com:dcarpintero/github-semantic-search.git`
2. Create and activate a virtual environment:
   - Windows: `py -m venv .venv` and `.venv\scripts\activate`
   - macOS/Linux: `python3 -m venv .venv` and `source .venv/bin/activate`
3. Install dependencies:
   `pip install -r requirements.txt`
4. Ingest data:
   `python ./data-pipeline/ingest.py`
5. Index data:
   `python ./data-pipeline/index.py`
6. Launch the web application:
   `streamlit run ./app.py`

## Streamlit Web App

The demo web app is deployed to Streamlit Cloud and available at https://gh-semantic-search.streamlit.app/.

- "漢字路" 한글한자자동변환 서비스는 교육부 고전문헌국역지원사업의 지원으로 구축되었습니다.
- "漢字路" 한글한자자동변환 서비스는 전통문화연구회 "울산대학교한국어처리연구실 옥철영(IT융합전공)교수팀"에서 개발한 한글한자자동변환기를 바탕하여 지속적으로 공동 연구 개발하고 있는 서비스입니다.
- 현재 고유명사(인명, 지명등)을 비롯한 여러 변환오류가 있으며 이를 해결하고자 많은 연구 개발을 진행하고자 하고 있습니다. 이를 인지하시고 다른 곳에서 인용시 한자 변환 결과를 한번 더 검토하시고 사용해 주시기 바랍니다.
- 변환오류 및 건의,문의사항은 juntong@juntong.or.kr로 메일로 보내주시면 감사하겠습니다. .
Copyright ⓒ 2020 By '전통문화연구회(傳統文化硏究會)' All Rights reserved.
 한국   대만   중국   일본