James Briggs

A look at the best features included in the latest iteration of Python

Photo by Pablo Guerrero on Unsplash

It’s that time again: a new version of Python is imminent. Now in beta (3.9.0b3), Python 3.9 will soon see its full release.

Some of the newest features are incredibly exciting, and it will be amazing to see them used after release. We’ll cover the following:

Let’s take a first look at these new features and how we use them.

(Italian version)

Dictionary Unions

This is one of my favorite new features, and it comes with a sleek syntax. If we have two dictionaries a and…
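As a minimal sketch of the dictionary union operators that Python 3.9 adds (PEP 584), here is how merging two dictionaries might look. The dictionaries a and b and their contents are made up for illustration, and the snippet needs Python 3.9 or later to run.

```python
# Requires Python 3.9+ (dictionary union operators, PEP 584).
# The contents of a and b are illustrative only.
a = {1: "a", 2: "b", 3: "c"}
b = {3: "x", 4: "d"}

# `|` builds a new dictionary; when a key appears in both,
# the value from the right-hand operand wins.
merged = a | b
print(merged)  # {1: 'a', 2: 'b', 3: 'x', 4: 'd'}

# `|=` updates the left-hand dictionary in place.
a_copy = dict(a)
a_copy |= b
print(a_copy == merged)  # True
```

Note that a itself is left unchanged by `a | b`, which is the main difference from calling `a.update(b)`.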

Euclidean distance, dot product, and cosine similarity

Image by the author

When we convert language into a machine-readable format, the standard approach is to use dense vectors.

Dense vectors are typically generated by a neural network. They allow us to convert words and sentences into high-dimensional vectors, organized so that each vector's geometric position carries meaning.
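To make the three metrics in the title concrete, here is a small sketch in plain Python. The vectors a and b are tiny made-up examples (real dense vectors usually have hundreds of dimensions), and the function names are my own.

```python
import math

def dot_product(u, v):
    """Sum of element-wise products; grows with both angle agreement and magnitude."""
    return sum(x * y for x, y in zip(u, v))

def euclidean_distance(u, v):
    """Straight-line distance between the two vector tips."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_similarity(u, v):
    """Dot product divided by both magnitudes, so only direction matters."""
    norm_u = math.sqrt(dot_product(u, u))
    norm_v = math.sqrt(dot_product(v, v))
    return dot_product(u, v) / (norm_u * norm_v)

# Illustrative vectors: b points in the same direction as a but is twice as long.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

print(dot_product(a, b))         # 28.0
print(euclidean_distance(a, b))  # about 3.74
print(cosine_similarity(a, b))   # about 1.0 (same direction)
```

Because b is just a scaled copy of a, cosine similarity treats them as identical while Euclidean distance does not, which is the practical difference between the two.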

Next-gen Q&A techniques for next-gen intelligent solutions

Photo by Emily Morter on Unsplash

In open-domain question answering, we typically design a model architecture that contains a data source, retriever, and reader/generator.

The first of these components is typically a document store. The two most popular stores we use here are Elasticsearch and FAISS.

Next up is our retriever — the topic of this article. The job of the retriever is to filter through our document store for relevant chunks of information (the documents) and pass them to the reader/generator model.

The reader/generator model is the final model in our Q&A stack. We can either have a reader, which extracts an answer directly from…

A simple guide to FAISS

Photo by NeONBRAND on Unsplash

Accurate, fast, and memory-efficient similarity search is a hard thing to do — but something that, if done well, lends itself very well to our huge repositories of endless (and exponentially growing) data.

The reason that similarity search is so good is that it enables us to search for images, text, videos, or any other form of data — without getting too specific in our search queries — which is something that we humans are not so great at.

I will use the example of image similarity search. We can take a picture, and search for similar images. This works…
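The core idea of similarity search can be sketched without any library at all: compare the query vector with every stored vector and keep the closest ones. The snippet below is that brute-force version in plain Python, not FAISS itself, and the vectors, the query, and the function name nearest_neighbours are all made up for illustration.

```python
import math

def nearest_neighbours(query, vectors, k=2):
    """Return the indices of the k stored vectors closest to the query.

    This is an exhaustive search: every stored vector is compared
    with the query using Euclidean distance.
    """
    scored = [(math.dist(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort()  # smallest distance first
    return [idx for _, idx in scored[:k]]

# Illustrative 2D "embeddings"; real ones would come from a model.
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
query = [1.0, 1.0]

top = nearest_neighbours(query, vectors, k=2)
print(top)  # [1, 3]
```

Exhaustive search is accurate but slows down as the collection grows, which is the trade-off between accuracy, speed, and memory that dedicated libraries are built to manage.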

Learn how Python code should be packaged for PyPI

Photo by Lawless Capture on Unsplash

The most powerful feature of Python is its community. Almost every use-case out there has a package built specifically for it.

Need to send mobile/email alerts? pip install knockknock — Build ML apps? pip install streamlit — Bored of your terminal? pip install colorama — It’s too easy!

I know this is obvious, but those libraries didn’t magically appear. For each package, there is a person, or many people, who actively developed and deployed that package.

Every single one.

All 300K+ of them.

That is why Python is Python: the level of support is phenomenal, mind-blowing.

In this…

Preprocess, train, and predict with BERT

Image by Author

Transformers have been described as the fourth pillar of deep learning [1], alongside the likes of convolutional and recurrent neural networks.

However, from the perspective of natural language processing, transformers are much more than that. Since their introduction in 2017, they’ve come to dominate the majority of NLP benchmarks, and they continue to impress daily.

The thing is, transformers are damn cool. And with libraries like HuggingFace’s transformers — it has become too easy to build incredible solutions with them.

So, what’s not to love? Incredible performance paired with the ultimate ease-of-use.

In this article, we’ll work through…

The newest release shows the new logic

Image by author

Python 3.10 is beginning to fill out with plenty of fascinating new features. One of those in particular caught my attention: structural pattern matching, or as most of us will know it, switch/case statements.

Switch statements have been absent from Python despite being a common feature of most languages.

Back in 2006, PEP 3103 was raised, recommending the implementation of a switch-case statement. However, after a poll at PyCon 2007 showed no support for the feature, the Python devs dropped it.

Fast-forward to 2020, and Guido van Rossum, the creator of Python, committed the first documentation showing the new…

Your guide to this essential method for quality NLP

Photo by Elena Mozhvilo on Unsplash

ℕ𝕠-𝕠𝕟𝕖 𝕚𝕟 𝕥𝕙𝕖𝕚𝕣 𝕣𝕚𝕘𝕙𝕥 𝕞𝕚𝕟𝕕 𝕨𝕠𝕦𝕝𝕕 𝕖𝕧𝕖𝕣 𝕦𝕤𝕖 𝕥𝕙𝕖𝕤𝕖 𝕒𝕟𝕟𝕠𝕪𝕚𝕟𝕘 𝕗𝕠𝕟𝕥 𝕧𝕒𝕣𝕚𝕒𝕟𝕥𝕤. 𝕋𝕙𝕖 𝕨𝕠𝕣𝕤𝕥 𝕥𝕙𝕚𝕟𝕘, 𝕚𝕤 𝕚𝕗 𝕪𝕠𝕦 𝕕𝕠 𝕒𝕟𝕪 𝕗𝕠𝕣𝕞 𝕠𝕗 ℕ𝕃ℙ 𝕒𝕟𝕕 𝕪𝕠𝕦 𝕙𝕒𝕧𝕖 𝕔𝕙𝕒𝕣𝕒𝕔𝕥𝕖𝕣𝕤 𝕝𝕚𝕜𝕖 𝕥𝕙𝕚𝕤 𝕚𝕟 𝕪𝕠𝕦𝕣 𝕚𝕟𝕡𝕦𝕥, 𝕪𝕠𝕦𝕣 𝕥𝕖𝕩𝕥 𝕓𝕖𝕔𝕠𝕞𝕖𝕤 𝕔𝕠𝕞𝕡𝕝𝕖𝕥𝕖𝕝𝕪 𝕦𝕟𝕣𝕖𝕒𝕕𝕒𝕓𝕝𝕖.

We also find that text like this is incredibly common — particularly on social media.

Another pain-point comes from diacritics (the little glyphs in Ç, é, Å) that you’ll find in almost every European language.

These characters have a hidden property that can trip up any NLP model — take a look at the Unicode for two versions of Ç:

Both \u00C7
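As a rough sketch of the problem, the snippet below compares two encodings of Ç using Python's built-in unicodedata module. I am assuming here that the second version is the letter C followed by a combining cedilla (\u0043\u0327); that pairing is my own choice for illustration. The same module can also flatten fancy font variants with the NFKC form.

```python
import unicodedata

composed = "\u00C7"          # one code point: LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "\u0043\u0327"  # assumed second form: 'C' + COMBINING CEDILLA

# They render the same but are different strings.
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 1 2

# NFC composes into single code points, NFD decomposes into base + marks.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)    # True
print(nfd == decomposed)  # True

# NFKC additionally maps compatibility characters, such as
# double-struck letters, back to their plain equivalents.
fancy = "ℕ𝕃ℙ"
plain = unicodedata.normalize("NFKC", fancy)
print(plain)  # NLP
```

Which form to pick depends on the task; the important part is to apply the same normalization to every piece of text before it reaches the model.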

Hands-on Tutorials

Restore the power of NLP for long sequences

Photo by Sebastian Staines on Unsplash

The de facto standard in many natural language processing (NLP) tasks nowadays is to use a transformer. Text generation? Transformer. Question answering? Transformer. Language classification? Transformer!

However, one of the problems with many of these models (a problem that is not just restricted to transformer models) is that we cannot process long pieces of text.

Almost every article I write on Medium contains 1000+ words, which, when tokenized for a transformer model like BERT, will produce 1000+ tokens. BERT (and many other transformer models) will consume 512 tokens max — truncating anything beyond this length.

Although I think you may struggle to…
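One common way around the 512-token limit is to split the token IDs into windows and process each window separately. The sketch below shows only that splitting step in plain Python, under a few assumptions of mine: the function name chunk_token_ids is hypothetical, the IDs 101 and 102 stand for the [CLS] and [SEP] special tokens (as in the bert-base-uncased vocabulary), and the input IDs are made up rather than produced by a real tokenizer.

```python
def chunk_token_ids(token_ids, max_len=512, cls_id=101, sep_id=102):
    """Split a long list of token IDs into chunks of at most max_len tokens.

    Each chunk is wrapped in [CLS] ... [SEP], so only max_len - 2
    positions are available for the text itself. No padding is added.
    """
    window = max_len - 2
    chunks = []
    for start in range(0, len(token_ids), window):
        piece = token_ids[start:start + window]
        chunks.append([cls_id] + piece + [sep_id])
    return chunks

# 1,200 made-up token IDs standing in for a long tokenized article.
token_ids = list(range(1000, 2200))

chunks = chunk_token_ids(token_ids)
print(len(chunks))               # 3
print([len(c) for c in chunks])  # [512, 512, 182]
```

In practice the last chunk would usually be padded to the full length, and the per-chunk model outputs would need to be combined in some way, for example by averaging.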

Never worry about accuracy in your language models again

Photo by Wynand van Poortvliet on Unsplash

Measuring the results of our model outputs gets a lot more complex when we’re dealing with language.

This is something that becomes quite clear very quickly for many NLP-based problems — how do we measure the accuracy of a language-based sequence when dealing with language summarization or translation?

For this, we can use Recall-Oriented Understudy for Gisting Evaluation (ROUGE). Fortunately, the name is deceptively complicated — it’s incredibly easy to understand, and even easier to implement.

Let’s jump straight into it.

Contents
> What is ROUGE
- Recall
- Precision
- F1 Score
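The three scores listed above can be sketched in a few lines of plain Python. This is a simplified ROUGE-N based on whitespace tokens and n-gram counts, not a replacement for a full ROUGE package; the function name rouge_n and the two example sentences are my own.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: n-gram overlap between a candidate and a reference."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())

    recall = overlap / max(sum(ref.values()), 1)      # share of the reference that was matched
    precision = overlap / max(sum(cand.values()), 1)  # share of the candidate that was relevant
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

# Illustrative sentences: 5 of the 6 unigrams overlap.
scores = rouge_n("the cat sat on the mat", "the cat is on the mat")
print(scores)  # recall, precision and f1 are all 5/6, about 0.83
```

Passing n=2 gives the ROUGE-2 variant, which scores overlapping word pairs instead of single words.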

Data scientist learning and writing about everything.
