James Briggs

Overpowered entity extraction using roBERTa

Image by author

“Apple reached an all-time high stock price of 143 dollars this January.”

We might want to extract the key pieces of information — or ‘entities’ — and categorize each of those entities. Like so:

Apple — Organization

143 dollars — Monetary Value

this January — Date

For us humans, this is easy. But how can we teach a machine to distinguish between a granny smith apple and the Apple we trade on NASDAQ?

(No, we can’t rely on the ‘A’…
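As a rough sketch of the idea (not the article's exact code), a RoBERTa-based NER model can be run through the HuggingFace transformers pipeline; the checkpoint name below is illustrative, and the label set you get back (ORG, MONEY, DATE, and so on) depends on the data that checkpoint was trained on.

from transformers import pipeline

# Token-classification (NER) pipeline; the RoBERTa checkpoint here is illustrative
ner = pipeline(
    "ner",
    model="Jean-Baptiste/roberta-large-ner-english",
    aggregation_strategy="simple",  # merge word-pieces back into whole entities
)

text = "Apple reached an all-time high stock price of 143 dollars this January."
for entity in ner(text):
    print(entity["entity_group"], "->", entity["word"])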


High-performance semantic similarity with BERT

Image by author

A big part of NLP relies on similarity in high-dimensional spaces. Typically, an NLP solution takes some text, processes it into a big vector/array representing that text, then performs several transformations.

It's high-dimensional magic.

Sentence similarity is one of the clearest examples of how powerful this high-dimensional magic can be.

The logic is this:

  • Take a sentence, convert it into a…
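A minimal sketch of that logic, assuming the sentence-transformers library (the model name is illustrative): encode each sentence into a dense vector, then compare the vectors with cosine similarity.

from sentence_transformers import SentenceTransformer, util

# Encode sentences into dense vectors, then compare them with cosine similarity
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model name

sentences = [
    "The stock hit an all-time high.",
    "Shares reached a record price.",
    "I ate a granny smith apple.",
]
embeddings = model.encode(sentences)

# Similarity of the first sentence against the other two
print(util.cos_sim(embeddings[0], embeddings[1:]))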


Write code that goes the extra mile

Image by the author


Euclidean distance, dot product, and cosine similarity

Image by the author

When we convert language into a machine-readable format, the standard approach is to use dense vectors.

A neural network typically generates dense vectors. They allow us to convert words and sentences into high-dimensional vectors, organized so that the geometric position of each vector carries meaning.
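For two dense vectors, the three measures in the title reduce to a few lines of NumPy (toy values below):

import numpy as np

# Two toy dense vectors
a = np.array([0.2, 0.8, 0.1])
b = np.array([0.3, 0.7, 0.2])

euclidean = np.linalg.norm(a - b)   # straight-line distance between the points
dot = np.dot(a, b)                  # unnormalized similarity
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # angle between the vectors

print(euclidean, dot, cosine)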


Next-gen Q&A techniques for next-gen intelligent solutions

Photo by Emily Morter on Unsplash

The first of these components is typically a document store. The two most popular stores we use here are Elasticsearch and FAISS.

Next up is our retriever — the topic of this article. The job of the retriever is to filter through our document store for relevant chunks of information (the documents) and pass them to the reader/generator model.

The reader/generator model is the final model in our Q&A stack. We can either have a reader, which extracts an answer directly from…
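To make the flow concrete, here is a stripped-down sketch of a retriever feeding a reader. It swaps the real document store for an in-memory list and uses illustrative models from sentence-transformers and HuggingFace transformers rather than Elasticsearch or FAISS:

from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# A plain list stands in for the document store (Elasticsearch/FAISS in practice)
documents = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "Elasticsearch is a distributed search and analytics engine.",
    "BERT is a transformer model pretrained on large text corpora.",
]

retriever = SentenceTransformer("all-MiniLM-L6-v2")                            # illustrative
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")  # illustrative

query = "What is FAISS used for?"

# Retriever: find the most relevant document for the query
doc_emb = retriever.encode(documents)
query_emb = retriever.encode(query)
best = int(util.cos_sim(query_emb, doc_emb).argmax())

# Reader: extract an answer span from the retrieved document
print(reader(question=query, context=documents[best]))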


A simple guide to FAISS

Photo by NeONBRAND on Unsplash

The reason similarity search is so good is that it lets us search for images, text, videos, or any other form of data without getting too specific in our search queries, which is something we humans are not so great at.

I will use the example of image similarity search. We can take a picture, and search for similar images. This works…
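A minimal FAISS sketch with random vectors: build an exact (flat) index, add the "database" vectors, then query it for nearest neighbours.

import numpy as np
import faiss

d = 64                                               # vector dimensionality
xb = np.random.random((1000, d)).astype("float32")   # database vectors
xq = np.random.random((1, d)).astype("float32")      # query vector

index = faiss.IndexFlatL2(d)   # exact search using Euclidean (L2) distance
index.add(xb)

distances, indices = index.search(xq, 4)  # four nearest neighbours
print(indices, distances)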


Learn how Python code should be packaged for PyPI

Photo by Lawless Capture on Unsplash

Need to send mobile/email alerts? pip install knockknock — Build ML apps? pip install streamlit — Bored of your terminal? pip install colorama — It’s too easy!

I know this is obvious, but those libraries didn't magically appear. For each package, there is a person, or many people, who actively developed and deployed that package.

Every single one.

All 300K+ of them.

That is why Python is Python: the level of support is phenomenal, mind-blowing.

In this…
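At its simplest, publishing to PyPI needs little more than a setup script plus a build-and-upload step; the names below are placeholders, and newer projects may prefer a pyproject.toml instead.

# setup.py: minimal packaging metadata (placeholder names)
from setuptools import setup, find_packages

setup(
    name="mypackage",        # hypothetical package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=[],     # runtime dependencies go here
)

# Then, from the project root:
#   python -m build
#   twine upload dist/*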


Preprocess, train, and predict with BERT

Image by Author

However, from the perspective of natural language processing, transformers are much more than that. Since their introduction in 2017, they've come to dominate the majority of NLP benchmarks, and they continue to impress daily.

The thing is, transformers are damn cool. And with libraries like HuggingFace's transformers, it has become too easy to build incredible solutions with them.

So, what’s not to love? Incredible performance paired with the ultimate ease-of-use.

In this article, we’ll work through…
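As a taste of what that looks like with HuggingFace transformers (fine-tuning omitted; the model name and label count are illustrative): tokenize the text, pass it through a BERT classifier, and read off the predicted probabilities.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # fresh classification head: fine-tune before real use
)

# Preprocess: tokenize into input IDs and an attention mask
tokens = tokenizer("Transformers are damn cool.", return_tensors="pt",
                   truncation=True, padding=True)

# Predict: forward pass, then softmax over the logits
with torch.no_grad():
    logits = model(**tokens).logits
print(logits.softmax(dim=-1))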


The newest release shows the new logic

Image by author

Switch-statements have been absent from Python despite being a common feature of most languages.

Back in 2006, PEP 3103 was raised, recommending the implementation of a switch-case statement. However, after a poll at PyCon 2007 showed no support for the feature, the Python developers dropped it.

Fast-forward to 2020, and Guido van Rossum, the creator of Python, committed the first documentation showing the new…
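The feature eventually landed as structural pattern matching in Python 3.10. A minimal example of the new match-case syntax:

def http_status(code):
    match code:
        case 200:
            return "OK"
        case 404:
            return "Not Found"
        case 500 | 503:   # multiple values in one branch
            return "Server Error"
        case _:           # default branch
            return "Unknown"

print(http_status(404))  # Not Found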


Your guide to this essential method for quality NLP

Photo by Elena Mozhvilo on Unsplash

ℕ𝕠-𝕠𝕟𝕖 𝕚𝕟 𝕥𝕙𝕖𝕚𝕣 𝕣𝕚𝕘𝕙𝕥 𝕞𝕚𝕟𝕕 𝕨𝕠𝕦𝕝𝕕 𝕖𝕧𝕖𝕣 𝕦𝕤𝕖 𝕥𝕙𝕖𝕤𝕖 𝕒𝕟𝕟𝕠𝕪𝕚𝕟𝕘 𝕗𝕠𝕟𝕥 𝕧𝕒𝕣𝕚𝕒𝕟𝕥𝕤. 𝕋𝕙𝕖 𝕨𝕠𝕣𝕤𝕥 𝕥𝕙𝕚𝕟𝕘, 𝕚𝕤 𝕚𝕗 𝕪𝕠𝕦 𝕕𝕠 𝕒𝕟𝕪 𝕗𝕠𝕣𝕞 𝕠𝕗 ℕ𝕃ℙ 𝕒𝕟𝕕 𝕪𝕠𝕦 𝕙𝕒𝕧𝕖 𝕔𝕙𝕒𝕣𝕒𝕔𝕥𝕖𝕣𝕤 𝕝𝕚𝕜𝕖 𝕥𝕙𝕚𝕤 𝕚𝕟 𝕪𝕠𝕦𝕣 𝕚𝕟𝕡𝕦𝕥, 𝕪𝕠𝕦𝕣 𝕥𝕖𝕩𝕥 𝕓𝕖𝕔𝕠𝕞𝕖𝕤 𝕔𝕠𝕞𝕡𝕝𝕖𝕥𝕖𝕝𝕪 𝕦𝕟𝕣𝕖𝕒𝕕𝕒𝕓𝕝𝕖.

We also find that text like this is incredibly common — particularly on social media.

Another pain-point comes from diacritics (the little glyphs in Ç, é, Å) that you’ll find in almost every European language.

These characters have a hidden property that can trip up any NLP model — take a look at the Unicode for two versions of Ç:

Both \u00C7
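A quick demonstration with Python's built-in unicodedata module: the two forms look identical but compare as unequal until they are normalized.

import unicodedata

c1 = "\u00C7"        # precomposed: LATIN CAPITAL LETTER C WITH CEDILLA
c2 = "\u0043\u0327"  # decomposed: "C" followed by a combining cedilla

print(c1, c2, c1 == c2)  # Ç Ç False: same glyph, different code points

# NFC normalization composes both into the single code point \u00C7
print(unicodedata.normalize("NFC", c2) == c1)  # True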

James Briggs

Data scientist learning and writing about everything.
