Bug in langchain.vectorstores.chroma.Chroma

Not sure where else to report this bug. No such link on LangChain’s home:

Repl link:

I’ll just write it out here, and maybe someone can guide me on submitting it to the right place. Basically, Chroma seems to be adding an extra space (very consistently) in front of '≤' characters, and it has a real impact.
EXAMPLE:

  • chunks object below in my code contains the following string: leflunomide (LEF) (≤ 20 mg/day)
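from langchain.vectorstores import Chroma

# Embed the chunks and persist the collection to a local directory.
Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name=collection_name,
    persist_directory=persist_db,
)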
  • after saving and retrieving from my local file with:
db = Chroma(
    persist_directory=persist_db,
    embedding_function=embeddings,
    collection_name=collection_name,
)

. . . then extracting with . . .

db.get(include=['documents'])
  • that string is now: leflunomide (LEF) ( ≤20 mg/day), with a single space newly inserted before the ≤

  • this matters because it messes up retrieval augmentation for queries like “what doses of leflunomide are appropriate?” using Claude-2 as the LLM
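To narrow down which step changes the string, it may help to compare the text character by character at each stage (for example, the chunk's page_content right before Chroma.from_documents, and the entry returned by db.get). Below is a minimal sketch using only the standard library; describe_diff is a hypothetical helper name, and the two strings are simply the ones quoted above, not output captured from Chroma.

```python
import difflib
import unicodedata


def describe_diff(before: str, after: str) -> list[str]:
    """Describe every character-level difference between two strings."""
    changes = []
    matcher = difflib.SequenceMatcher(None, before, after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Spell out code points so invisible characters (spaces, NBSP) are obvious.
        removed = [f"U+{ord(c):04X} {unicodedata.name(c, '?')}" for c in before[i1:i2]]
        added = [f"U+{ord(c):04X} {unicodedata.name(c, '?')}" for c in after[j1:j2]]
        changes.append(f"{tag} at index {i1}: removed={removed} added={added}")
    return changes


# Strings copied from the post; in a real check these would come from
# the chunk text and from db.get(include=['documents'])['documents'].
original = "leflunomide (LEF) (≤ 20 mg/day)"
retrieved = "leflunomide (LEF) ( ≤20 mg/day)"

for change in describe_diff(original, retrieved):
    print(change)
```

If the chunk text already differs from the source document before it reaches Chroma, the change is more likely coming from the loader or text splitter than from the vector store.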

Any guidance in directing this appropriately is greatly appreciated.


You can post an issue in the GitHub repo here:

Thanks!
Also - thanks to whoever formatted my messy post!


Maybe this has something to do with ≤ being equivalent to <=, which takes two characters.
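That guess can be checked quickly in plain Python. As far as the standard library is concerned, ≤ (U+2264) is a single code point that encodes to three bytes in UTF-8, and the standard Unicode normalization forms leave it unchanged rather than expanding it to <=. This is only a rough sketch (normalization_report is a hypothetical helper name), and it does not rule out a tokenizer or text-extraction step treating the symbol specially.

```python
import unicodedata

LE = "\u2264"  # the '≤' character from the example string


def normalization_report(ch: str) -> dict[str, str]:
    """Return the result of each standard Unicode normalization form."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, ch) for form in forms}


report = normalization_report(LE)
print("code points:", len(LE))
print("UTF-8 bytes:", len(LE.encode("utf-8")))
for form, value in report.items():
    print(form, "unchanged" if value == LE else f"changed to {value!r}")
```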
