Not long ago, I joined Brainial’s growing team as a first-time Data Analyst. Every day, for a few hours at a time, I train our models via an online interface. As a noob to the field, I get more than enough occasions to stop and reflect on my workflow – which is where this little blog post sprouted from. Hopefully, it’ll help you reflect on your own annotation practice and lift your spirits a bit.
Before we dive in, I should mention that at Brainial we deal with three types of model training. Articles from across the web land on my screen, where I then highlight named entities (Entity Recognition), choose labels based on the text’s meaning (Signal Recognition) or simply indicate whether the content is relevant to our client (Relevancy Filter). What I write about next applies mostly to the first two, as the Relevancy Filter requires very little conscious effort.
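To make the three task types concrete, here is a hypothetical sketch of what one annotated example might look like for each of them. The field names and the example text are purely illustrative, not Brainial’s actual schema:

```python
# Illustrative annotation record covering all three task types.
text = "MegaCorp announced the acquisition of TinyStart for $5 million."

annotation = {
    "text": text,
    # Entity Recognition: character spans, each with a label
    "entities": [
        {"start": 0, "end": 8, "label": "ORG"},      # "MegaCorp"
        {"start": 38, "end": 47, "label": "ORG"},    # "TinyStart"
        {"start": 52, "end": 62, "label": "MONEY"},  # "$5 million"
    ],
    # Signal Recognition: one or more topic labels for the whole text
    "signals": ["Acquisition"],
    # Relevancy Filter: a single yes/no judgement
    "relevant": True,
}

# Each span should slice back to the exact surface text it describes
for ent in annotation["entities"]:
    print(text[ent["start"]:ent["end"]], ent["label"])
```

The key difference: entity annotations live at the character level, signal labels apply to the whole document, and relevancy is just a boolean – which is also why the last one takes so little thought.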
Let’s move on to the juicier stuff now!
Annotation mistakes are something you’d like to minimize, which is why I recommend taking the time to plan thoroughly. A good way to kick off training tasks is with an exploratory test run: annotate a small batch of texts containing enough examples of the sort you’ll be dealing with. This way you familiarize yourself with the content and get to discuss the kinds of labels you will be using for each scenario. At Brainial, we use broad labels that span multiple topics (e.g. “Acquisition” for both the purchase of one company by another and the buying of objects). This is, of course, a matter of personal preference, although I suspect very specific labels will slow down the work pace. Right from the start, it is important to get used to the idea of “consistency over precision”. The subtle nuances we detect easily can trip up our digital counterparts – it’s better to ignore them while annotating.
Again, you should aim for consistency and avoid mistakes. Preparation helps in that regard but, unfortunately, we humans are subject to fatigue and distraction. That is why I usually work in sprints of 25 minutes with a five-minute break in between. If you work as part of a team, discuss your labelling regularly to make sure everyone is still on the same page. Because our company deals extensively with news articles, I’ve formed the habit of posting the most hilarious titles in our Slack group. Besides being a source of entertainment, this sparks a conversation about the annotation every now and then. Feedback is crucial in this work – it usually takes a while before you start seeing results, and the repetitiveness can get demoralizing quickly. As you progress, it is worth performing model analyses consistently, to track improvement, and going over small batches of annotated text manually, to check for mistakes.
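If you want to put a number on how well two annotators agree, Cohen’s kappa is a standard measure. Below is a minimal plain-Python sketch with made-up label data (the annotator lists are purely illustrative); in practice you’d likely reach for a library implementation such as scikit-learn’s:

```python
# Minimal Cohen's kappa: observed agreement corrected for the agreement
# two annotators would reach by chance alone.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Fraction of items both annotators labelled identically
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators on the same five articles
annotator_1 = ["Acquisition", "Other", "Acquisition", "Other", "Other"]
annotator_2 = ["Acquisition", "Other", "Other", "Other", "Other"]
print(round(cohens_kappa(annotator_1, annotator_2), 2))  # → 0.55
```

A kappa near 1 means the team is labelling consistently; a low score is a cue to sit down together and revisit the labelling guidelines.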
I’m still a student, and my background lies in Marketing. Even so, the simplicity of annotating, as well as our team’s enthusiasm, got me looking into NLP past the training software’s grey interface and straight into the Python code, spaCy, pandas and a few other libraries we use. Understanding the effect my work has under the hood has made it considerably more enjoyable. I’d advise anyone just getting started with training models to grasp at least the basics of the technology, to become more conscious of their annotation work.
To all data annotators out there, your work matters! How else is the software supposed to understand that the regulation your client is most worried about is a “LAW” and not a modernist “WORK_OF_ART”?