Vector Embedding Database not persisting when using YAML configuration · Issue #600 · neuml/txtai · GitHub
Vector Embedding Database not persisting when using YAML configuration #600

Open
vnguye65 opened this issue Nov 15, 2023 · 7 comments

@vnguye65

I have the following workflow configuration with subindices for two different datasets.

workflow.yaml

writable: true
path: vector-database

embeddings:
  content: true
  defaults: false
  indexes:
    document:
      path: sentence-transformers/multi-qa-mpnet-base-dot-v1
      tokenize: true
      columns:
        text: document
    csv:
      path: sentence-transformers/multi-qa-mpnet-base-dot-v1
      tokenize: true
      columns:
        text: csv

After adding the data, app.count() returns 1. However, this data doesn't persist when the session is refreshed: app.count() returns 0 when run in a separate environment.

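from txtai.app import Application

# Load the application from the YAML configuration
app = Application("search-workflow.yaml")

# Queue one row with a field for each subindex
app.add([{"document": "dummy data 1", "csv": "dummy data 2"}])

# Index the queued rows, then check the row count
app.upsert()
print(app.count())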

@davidmezzetti Could you please confirm if I am missing anything in the code and suggest what we could do to persist data?

@vnguye65
Author

I'm using this search workflow as an intermediate step in another workflow in a RAG solution. The search workflow is not able to find the data in the vector database to retrieve relevant texts.

@davidmezzetti
Member

Sorry for the delayed response. I will try your config and let you know.

@davidmezzetti
Member

I ran the above configuration and it saves content to the vector-database directory.
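
One way to confirm from a separate process that the writer saved anything is to inspect the directory on disk before loading it. This is a minimal sketch using only the standard library. The path `vector-database` comes from the configuration above; the helper name `describe_index_dir` is hypothetical, and no assumption is made about which files txtai writes.

```python
from pathlib import Path


def describe_index_dir(path):
    """Return (relative path, size in bytes) for each file under path, sorted by path.

    Returns an empty list when the directory is missing or contains no files.
    """
    root = Path(path)
    if not root.is_dir():
        return []
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    files = describe_index_dir("vector-database")
    if not files:
        print("No saved index found at this path")
    for name, size in files:
        print(f"{name}\t{size} bytes")
```

If this lists files but a second process still reports a count of 0, the problem is more likely on the loading side than on the saving side.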

Can you share more about the search workflow? Once you load an embeddings database, it doesn't automatically refresh. You would need a way to reload the read-only search index when new data is loaded. Could this be the issue?

@vnguye65
Author

Sorry for the delay. Yes, that seems to be the issue. It looks like the read-only search index doesn't automatically reload.

@davidmezzetti
Member

Ok, thank you for confirming. I can think about a method to autodetect index changes and force a refresh. I think that would be a good addition.
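
Until something like that exists in txtai, change detection could be approximated outside the library. This is a rough sketch, not txtai's implementation: it fingerprints the files under the index directory by modification time and size, and calls a caller-supplied `reload` function when the fingerprint changes. The names `index_fingerprint` and `IndexWatcher` are hypothetical.

```python
import os


def index_fingerprint(path):
    """Summarize a directory as a sorted tuple of (relative path, mtime_ns, size)."""
    entries = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                stat = os.stat(full)
            except FileNotFoundError:
                # The writer may replace files during the scan; the next check picks it up
                continue
            entries.append((os.path.relpath(full, path), stat.st_mtime_ns, stat.st_size))
    return tuple(sorted(entries))


class IndexWatcher:
    """Calls reload() whenever the files under path differ from the previous check."""

    def __init__(self, path, reload):
        self.path = path
        self.reload = reload
        self.seen = index_fingerprint(path)

    def check(self):
        """Return True and call reload() if the index changed since the last check."""
        current = index_fingerprint(self.path)
        if current == self.seen:
            return False
        self.seen = current
        self.reload()
        return True
```

In the read-only process, `reload` could be a function that calls `load` on a txtai `Embeddings` instance with the `vector-database` path, with `check()` run before each search or on a timer. That wiring is an assumption, not something confirmed in this thread. Note that a check that fires while the writer is still saving could reload a partially written index.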

@davidmezzetti
Member

One idea in the meantime: see whether you can trigger a reload in the read-only process whenever you update your index.
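
One possible shape for such a trigger is a small marker file: the writing process replaces it after each completed update, and the read-only process reloads when the token inside changes. This is a sketch under assumptions. The marker name `UPDATED` and both function names are hypothetical, and txtai does not create or read this file. Keeping the marker in its own directory avoids any interaction with the files txtai saves.

```python
import os
import uuid

MARKER = "UPDATED"  # hypothetical marker file name, not something txtai creates


def signal_update(directory):
    """Writer side: call after the index has been fully saved."""
    os.makedirs(directory, exist_ok=True)
    tmp = os.path.join(directory, MARKER + ".tmp")
    with open(tmp, "w", encoding="utf-8") as handle:
        # A fresh random token per update, so every signal is distinguishable
        handle.write(uuid.uuid4().hex)
    # os.replace swaps the file in one step, so the reader never sees a partial marker
    os.replace(tmp, os.path.join(directory, MARKER))


def read_signal(directory):
    """Reader side: return the current token, or None if no update was signalled."""
    try:
        with open(os.path.join(directory, MARKER), encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        return None
```

The reader would remember the last token it saw, compare it with `read_signal(...)` before each search, and reload its index when the two differ. Because the writer signals only after saving, the reader does not reload in the middle of a write.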

@jbouder

Not sure if it's the same thing, but with the same configuration as above, I'm seeing situations where the index directories are not always created. I also noticed that when I try to call index, it eventually errors. Should the index directories be created when txtai starts up, or not until index is called?
