Named entity recognition (NER) is the task of extracting ‘entities’ from text. What we mean by that is easiest to show with an example. Given the sentence:
“Apple reached an all-time high stock price of 143 dollars this January.”
We might want to extract the key pieces of information — or ‘entities’ — and categorize each of those entities. Like so:
Apple — Organization
143 dollars — Monetary Value
this January — Date
For us humans, this is easy. But how can we teach a machine to distinguish between a granny smith apple and the Apple we trade on NASDAQ?
(No, we can’t rely on the ‘A’…
All we ever seem to talk about nowadays is BERT this, BERT that. I want to write about something else, but BERT is just too good, so this article will be about BERT and sentence similarity!
A big part of NLP relies on similarity in high-dimensional spaces. Typically, an NLP solution takes some text, processes it into one big vector/array representing that text, then performs a series of transformations on it.
It’s high-dimensional magic.
Sentence similarity is one of the clearest examples of how powerful this high-dimensional magic can be.
The logic is this:
When we convert language into a machine-readable format, the standard approach is to use dense vectors.
These vectors are typically generated by neural networks, which map words and sentences into a high-dimensional space organized so that each vector’s geometric position encodes its meaning.
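As a minimal sketch of the idea (using made-up toy vectors rather than real model embeddings), similarity between two dense vectors is usually measured with cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors:
    close to 1.0 means similar direction/meaning, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "sentence vectors"; real models like BERT output
# hundreds of dimensions (e.g. 768).
hello_world = [0.9, 0.1, 0.4, 0.2]
hi_earth    = [0.8, 0.2, 0.5, 0.1]
stock_news  = [0.1, 0.9, 0.0, 0.7]

print(cosine_similarity(hello_world, hi_earth))   # high: similar "sentences"
print(cosine_similarity(hello_world, stock_news)) # lower: unrelated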
In open-domain question answering, we typically design a model architecture that contains a data source, retriever, and reader/generator.
The first of these components is typically a document store. The two most popular stores we use here are Elasticsearch and FAISS.
Next up is our retriever — the topic of this article. The job of the retriever is to filter through our document store for relevant chunks of information (the documents) and pass them to the reader/generator model.
The reader/generator model is the final model in our Q&A stack. We can either have a reader, which extracts an answer directly from…
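To make the store/retriever/reader split concrete, here is a deliberately tiny sketch in plain Python. The bag-of-words ‘embedding’ and the example documents are stand-ins of my own invention, not the real Elasticsearch/FAISS or dense-retriever machinery:

```python
import math
from collections import Counter

# Toy document store: a stand-in for a real store such as Elasticsearch or FAISS.
documents = [
    "Paris is the capital of France",
    "The Transformer architecture was introduced in 2017",
    "FAISS is a library for efficient similarity search",
]

def embed(text):
    # Bag-of-words "embedding": a stand-in for a real dense encoder.
    return Counter(text.lower().split())

def score(q, d):
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm

def retrieve(query, docs, top_k=1):
    # The retriever: rank every document in the store against the query.
    q = embed(query)
    return sorted(docs, key=lambda d: score(q, embed(d)), reverse=True)[:top_k]

# The retrieved passage would then be handed on to the reader/generator model.
print(retrieve("what is the capital of France", documents))
```

In a production stack the retriever would typically be a dense encoder and the store would hold thousands or millions of documents; the ranking logic, though, is the same idea.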
Accurate, fast, and memory-efficient similarity search is hard to achieve, but done well it unlocks our huge, endless (and exponentially growing) repositories of data.
Similarity search is so powerful because it lets us search for images, text, videos, or any other form of data without needing to be too specific in our search queries, something we humans are not so great at.
I will use the example of image similarity search. We can take a picture, and search for similar images. This works…
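Under the hood, that image search boils down to nearest-neighbour lookup over embedding vectors. A toy brute-force version (with made-up filenames and tiny 3-D vectors standing in for real image embeddings) looks like this:

```python
import math

# Toy 3-D "image embeddings"; a real system might use 512-D vectors
# indexed with a library like FAISS. Filenames are invented for illustration.
index = {
    "beach_sunset.jpg":  [0.9, 0.1, 0.30],
    "city_night.jpg":    [0.1, 0.8, 0.70],
    "beach_morning.jpg": [0.8, 0.2, 0.35],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_vec, index, k=2):
    # Brute-force k-nearest-neighbour search: this is what dedicated
    # libraries accelerate with clever indexing at scale.
    return sorted(index, key=lambda name: euclidean(query_vec, index[name]))[:k]

# A query image whose embedding sits near the two beach photos.
print(nearest([0.85, 0.15, 0.30], index))
```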
The most powerful feature of Python is its community. Almost every use-case out there has a package built specifically for it.
Need to send mobile/email alerts? pip install knockknock.
Build ML apps? pip install streamlit.
Bored of your terminal? pip install colorama.
It’s too easy!
I know this is obvious, but those libraries didn’t magically appear. Behind each package there is a person, or many people, who actively developed and deployed it.
Every single one.
All 300K+ of them.
That is why Python is Python: the level of support is phenomenal. Mindblowing.
Transformers have been described as the fourth pillar of deep learning, alongside the likes of convolutional and recurrent neural networks.
However, from the perspective of natural language processing — transformers are much more than that. Since their introduction in 2017, they’ve come to dominate the majority of NLP benchmarks — and continue to impress daily.
The thing is, transformers are damn cool. And with libraries like HuggingFace’s transformers — it has become too easy to build incredible solutions with them.
So, what’s not to love? Incredible performance paired with the ultimate ease-of-use.
In this article, we’ll work through…
Python 3.10 is beginning to fill out with plenty of fascinating new features. One of those in particular caught my attention: structural pattern matching, or as most of us will know it, switch/case statements.
Switch-statements have been absent from Python despite being a common feature of most languages.
Back in 2006, PEP 3103 was raised, recommending the implementation of a switch-case statement. However, after a poll at PyCon 2007 received no support for the feature, the Python devs dropped it.
Fast-forward to 2020, and Guido van Rossum, the creator of Python, committed the first documentation showing the new…
ℕ𝕠-𝕠𝕟𝕖 𝕚𝕟 𝕥𝕙𝕖𝕚𝕣 𝕣𝕚𝕘𝕙𝕥 𝕞𝕚𝕟𝕕 𝕨𝕠𝕦𝕝𝕕 𝕖𝕧𝕖𝕣 𝕦𝕤𝕖 𝕥𝕙𝕖𝕤𝕖 𝕒𝕟𝕟𝕠𝕪𝕚𝕟𝕘 𝕗𝕠𝕟𝕥 𝕧𝕒𝕣𝕚𝕒𝕟𝕥𝕤. 𝕋𝕙𝕖 𝕨𝕠𝕣𝕤𝕥 𝕥𝕙𝕚𝕟𝕘, 𝕚𝕤 𝕚𝕗 𝕪𝕠𝕦 𝕕𝕠 𝕒𝕟𝕪 𝕗𝕠𝕣𝕞 𝕠𝕗 ℕ𝕃ℙ 𝕒𝕟𝕕 𝕪𝕠𝕦 𝕙𝕒𝕧𝕖 𝕔𝕙𝕒𝕣𝕒𝕔𝕥𝕖𝕣𝕤 𝕝𝕚𝕜𝕖 𝕥𝕙𝕚𝕤 𝕚𝕟 𝕪𝕠𝕦𝕣 𝕚𝕟𝕡𝕦𝕥, 𝕪𝕠𝕦𝕣 𝕥𝕖𝕩𝕥 𝕓𝕖𝕔𝕠𝕞𝕖𝕤 𝕔𝕠𝕞𝕡𝕝𝕖𝕥𝕖𝕝𝕪 𝕦𝕟𝕣𝕖𝕒𝕕𝕒𝕓𝕝𝕖.
We also find that text like this is incredibly common — particularly on social media.
Another pain-point comes from diacritics (the little marks attached to characters like Ç, é, and Å) that you’ll find in almost every European language.
These characters have a hidden property that can trip up any NLP model. Take a look at the Unicode code points for two versions of Ç:
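A quick demonstration using Python’s built-in unicodedata module: ‘Ç’ can be stored either as one precomposed code point or as ‘C’ plus a combining cedilla, and the two compare as unequal until we normalize them. The compatibility form of the same normalization also folds fancy font variants back to plain ASCII:

```python
import unicodedata

composed   = "\u00C7"   # LATIN CAPITAL LETTER C WITH CEDILLA (one code point)
decomposed = "C\u0327"  # 'C' + COMBINING CEDILLA (two code points, renders the same)

print(composed == decomposed)  # False: identical on screen, different underneath

# NFC recomposes the sequence into the single canonical code point.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# NFKC "compatibility" normalization also maps styled variants to ASCII.
print(unicodedata.normalize("NFKC", "ℕ𝕃ℙ"))  # NLP
```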