Seeking medical advice on the internet can be misleading. Thankfully, there are sites that offer professional help. The process works like this:

  • A patient submits their problem as a text query (median 100 words)
  • The request goes to one of 50 practitioners, who sorts it into one of 88 categories
  • A specialist reads the patient's problem again and replies through the system
  • If the problem is marked as acute, a nurse calls the patient and refers them to a doctor

Smart Health Hackathon Prague 2018

There were 12 challenges divided into 4 categories determined by the sponsors, plus a main prize. Our team MedAI (me as the ML programmer, Jiří Diblík as the salesman and Ondřej Korol as the iOS developer) picked this challenge:

Design a neural network able to understand patient query and match it with the right answer using Q&A database from online medical consultation platform.

The MedAI team at the hackathon.

We won our category and negotiated a real-world deployment of our model.

The dataset analysis

  • Over 183 000 rows of (Question, Answer, Category, icd_code).
  • Questions in both Czech and Slovak
  • Questions with grammatical errors
  • Median 100 words per question
  • Average 271 words per question
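
Length statistics like the ones above can be computed along the following lines. This is a minimal sketch rather than the code used at the hackathon: it assumes questions are plain strings and that a "word" is a whitespace-separated token, and the three sample questions are made up for illustration.

```python
from statistics import mean, median


def word_counts(questions):
    # Assumption: words are whitespace-separated tokens.
    return [len(q.split()) for q in questions]


def summarize(questions):
    # Median and mean question length in words.
    counts = word_counts(questions)
    return {"median": median(counts), "mean": mean(counts)}


# Made-up sample questions, for illustration only.
sample = [
    "bolí mě hlava",
    "mám horečku a kašel už tři dny",
    "dobrý den",
]
stats = summarize(sample)
print(stats)
```

A mean far above the median, as in the real dataset, points to a small number of very long questions.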


Preprocessing pipeline overview.

Source: Own work

Translation and Word embeddings

To deal with the multilingual questions and the grammatical mistakes, the questions are first translated to English with Google Translate. This also enables the use of versatile pre-trained word vectors (which I couldn't find for Czech or Slovak). For this pipeline I used Stanford's GloVe (Wikipedia 2014 + Gigaword 5; 6B tokens, 400K vocab, 100d) [1].

Using word vectors injects additional information about the language and thus makes training possible even on a smaller dataset.
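
The embedding lookup can be sketched roughly as follows. The pre-trained GloVe files are plain text with one word per line followed by its vector components, separated by spaces. Everything else here is an assumption for illustration: the function names `load_glove` and `embed` are hypothetical, tokenisation is a simple lowercase whitespace split, unknown words map to a zero vector, and the toy 3-dimensional vectors stand in for the real 100-dimensional ones.

```python
def load_glove(lines):
    # Parse GloVe text format: "<word> <v1> <v2> ... <vN>" per line.
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def embed(text, vectors, dim):
    # Assumption: lowercase whitespace tokens; unknown words become zero vectors.
    zero = [0.0] * dim
    return [list(vectors.get(tok, zero)) for tok in text.lower().split()]


# Toy 3-dimensional vectors; with the real data one would pass an open
# file handle for the 100d GloVe text file instead of this list.
toy_lines = ["headache 0.1 0.2 0.3", "fever 0.4 0.5 0.6"]
toy_vectors = load_glove(toy_lines)
embedded = embed("Headache and fever", toy_vectors, dim=3)
print(embedded)
```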

Crop & padding

From the dataset analysis I found that 90 % of the questions are under 280 words long, so I fixed the length at 280 words by cropping longer questions and padding shorter ones with zeros.

To distinguish a zero word vector from padding, a new dimension was added. The extra dimension carries information about whether the word is first, last, in-between or padding.
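
A minimal sketch of cropping, padding and the extra position dimension might look like this. The numeric codes for the four positions are my assumption here (the scheme only requires that they differ), and the function name `crop_and_pad` is hypothetical. The demo uses 2-dimensional vectors and a length of 5 instead of 100 and 280 to keep the output readable.

```python
# Hypothetical codes for the extra dimension; any four distinct values work.
PAD, FIRST, MIDDLE, LAST = 0.0, 1.0, 2.0, 3.0


def crop_and_pad(vectors, max_len=280, dim=100):
    # Crop to max_len words, append a position flag to every word vector,
    # then fill the rest with all-zero padding rows (flag PAD).
    cropped = vectors[:max_len]
    n = len(cropped)
    out = []
    for i, vec in enumerate(cropped):
        if i == 0:
            flag = FIRST
        elif i == n - 1:
            flag = LAST
        else:
            flag = MIDDLE
        out.append(list(vec) + [flag])
    out.extend([0.0] * dim + [PAD] for _ in range(max_len - n))
    return out


# The middle word is out-of-vocabulary, so its word vector is all zeros.
sequence = [[0.1, 0.2], [0.0, 0.0], [0.3, 0.4]]
padded = crop_and_pad(sequence, max_len=5, dim=2)
for row in padded:
    print(row)
```

With the flag in place, the unknown word in the middle becomes `[0.0, 0.0, 2.0]`, while a padding row stays `[0.0, 0.0, 0.0]`, so the two are no longer identical.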

Used tricks

  • Pareto principle applied to the classes: classifying only the 18 most common classes handles over 90 % of all questions
  • Balancing of class counts by duplicating examples of the less common classes (significantly decreased overfitting)
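
Both tricks can be sketched in a few lines. This is an illustration under assumptions, not the original code: `top_classes` keeps the most frequent classes until they cover the requested share of questions, and `oversample` duplicates randomly chosen rows of the rarer classes until every class is as large as the biggest one. The function names and the category labels in the demo are hypothetical.

```python
import random
from collections import Counter


def top_classes(labels, coverage=0.9):
    # Keep the most common classes until they cover `coverage` of all rows.
    if not labels:
        return []
    kept, covered = [], 0
    for label, count in Counter(labels).most_common():
        if covered / len(labels) >= coverage:
            break
        kept.append(label)
        covered += count
    return kept


def oversample(rows, seed=0):
    # rows: list of (question, label) pairs. Rarer classes are topped up
    # with random duplicates until all classes have the same size.
    if not rows:
        return []
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical category labels, for illustration only.
labels = ["derm"] * 6 + ["gyn"] * 3 + ["ortho"] * 1
kept = top_classes(labels, coverage=0.9)
rows = [(f"question {i}", lab) for i, lab in enumerate(labels) if lab in kept]
balanced = oversample(rows)
print(kept, Counter(lab for _, lab in balanced))
```

When oversampling like this, it is safer to duplicate rows only in the training split, so that copies of the same question do not end up in both training and validation data.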



Following the standard approaches in NLP [2], I tried both an LSTM and a 1D CNN, varying the hidden layer sizes and the number of layers as described in the paper.

Used tricks

  • Dropout – decreased overfitting
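
For readers unfamiliar with dropout, here is a minimal sketch of the general idea in plain Python, using the common "inverted dropout" formulation. It is not taken from the model code; in practice the deep learning framework's built-in dropout layer does this. The function name, the rate of 0.5 and the toy activations are assumptions for illustration.

```python
import random


def dropout(activations, rate, rng, training=True):
    # Inverted dropout: during training, zero each activation with
    # probability `rate` and scale the survivors by 1 / (1 - rate) so the
    # expected value stays the same. At inference time nothing changes.
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0] * 8
dropped = dropout(hidden, rate=0.5, rng=rng)
unchanged = dropout(hidden, rate=0.5, rng=rng, training=False)
print(dropped)
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single unit, which is what reduces overfitting.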


  1. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. 

  2. Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative Study of CNN and RNN for Natural Language Processing. arXiv:1702.01923.