METHODS
Study population
Data collection will span from 1st January 2012 to 31st December 2023, encompassing both adults and children across the country. The data will be obtained from multiple data lakes (refer to Figure 3). Two large pooled datasets (FinnATOPY Main Dataset and LAB Dataset) will be created to address specific research questions. Over the 12-year study period, the datasets are estimated to include more than two million individuals, each with a varying number of revisits.
Demographic and geographical disparities, including rural vs. urban distinctions, and temporal trends will be carefully examined for all study inquiries. Additionally, a pre-post COVID-19 data comparison will be conducted.
For the allergen sensitization population-level analysis (LAB Dataset), all IgE test results (based on whole allergen extracts and allergen components) will be compiled from Synlab, the nation-wide laboratory service provider for Terveystalo, as well as from major regional laboratories. It is anticipated that these laboratories cover at least 80% of the tested population. By comparing IgE test results and sensitization patterns among different laboratories and regions, the study aims to map the sensitization pattern in Finland and assess the accuracy and applicability of Terveystalo's data.
Patient selection from the Terveystalo data lake (Figure 3, section T) will be based on allergen IgE and lung function measurements, diagnostic codes, and prescribed medication relevant to atopic diseases. Particular attention will be given to individuals with at least one positive IgE test result and those whose IgE test results consistently fall below the reference value of 0.35 kU/l. Furthermore, random samples of matched cohorts will be generated for each study question to represent non-atopic control populations. These control groups will consist of individuals with no elevated specific IgE levels, no diagnosis of asthma, atopic eczema, allergic rhinitis, or any other allergic condition, and no atopic disease-related medication.
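The IgE-based grouping described above can be sketched as follows. This is a minimal illustration, not the project's actual selection logic: the function name, the input layout (pairs of patient identifier and IgE value in kU/l), and the choice to treat a result exactly at the reference value as positive are all assumptions.

```python
from collections import defaultdict

IGE_REFERENCE_KU_L = 0.35  # reference value stated in the protocol


def classify_ige_status(results):
    """Label each patient from (patient_id, ige_ku_l) pairs.

    'sensitized': at least one result at or above the reference value
    (treating a value equal to the reference as positive is an assumption).
    'consistently_negative': every result is below the reference value.
    """
    by_patient = defaultdict(list)
    for patient_id, value in results:
        by_patient[patient_id].append(value)

    status = {}
    for patient_id, values in by_patient.items():
        if any(v >= IGE_REFERENCE_KU_L for v in values):
            status[patient_id] = "sensitized"
        else:
            status[patient_id] = "consistently_negative"
    return status


# Hypothetical example data, not study results.
example_results = [("p1", 0.10), ("p1", 0.80), ("p2", 0.05), ("p2", 0.20)]
ige_status = classify_ige_status(example_results)
```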
To supplement the Terveystalo data, information on diagnoses, medication purchases, sick leaves, and disability compensations will be collected from the Finnish Social Insurance Institution (KELA; Figure 3, section K).
Figure 3: Project outline of FinnATOPY
Kela, Finnish Social Insurance Institution
£Prescriptions for allergen immunotherapy (AIT) extracts, ATC codes V01AA* in KELA registers.
†ATC codes: systemic glucocorticoids (H02AB*), topical corticosteroids (D07*), nasal preparations (R01*), drugs for obstructive airway diseases (R03*), cough medicine (R05*), antihistamines for systemic use (R06A*), ophthalmologicals (R01B*, R01C*, R01G*), allergen extracts (V01AA*), intramuscular epinephrine (C01CA24)
‡Diagnosis (ICD-10): asthma (J45*, J46*), cough (R05), abnormalities of breathing (R06*), rhinitis (J30*, J31.0), conjunctivitis (H10*), allergic colitis (K52.2), atopic & allergic eczema (L20*, L30*, L23*), food allergy (L27.7), urticaria (L50*), drug allergy (Z88*, Z91.0), anaphylaxis (T78*)
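The wildcard code lists in the footnotes above could be applied to individual register entries roughly as sketched below. The dictionary holds only a subset of the listed ATC patterns, and the function name and the shell-style matching via `fnmatchcase` are illustrative assumptions rather than the project's actual tooling.

```python
from fnmatch import fnmatchcase

# Subset of the ATC patterns from the figure footnotes (illustrative only).
ATC_PATTERNS = {
    "systemic glucocorticoids": ["H02AB*"],
    "topical corticosteroids": ["D07*"],
    "drugs for obstructive airway diseases": ["R03*"],
    "antihistamines for systemic use": ["R06A*"],
    "allergen extracts": ["V01AA*"],
    "intramuscular epinephrine": ["C01CA24"],
}


def matching_groups(code, patterns):
    """Return the names of all groups whose wildcard patterns match the code."""
    normalized = code.strip().upper()
    return [
        name
        for name, globs in patterns.items()
        if any(fnmatchcase(normalized, glob) for glob in globs)
    ]


# Hypothetical lookups.
ait_groups = matching_groups("V01AA02", ATC_PATTERNS)
unrelated_groups = matching_groups("N02BE01", ATC_PATTERNS)
```

The same helper would work for the ICD-10 patterns, for example a hypothetical `{"asthma": ["J45*", "J46*"]}` dictionary.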
We will gather all pertinent data, including visit location, the doctor's specialization, visit dates, age at each visit, all diagnoses, diagnostic test results, and sick leave information. Our primary focus will be on allergic rhinitis, conjunctivitis, sinusitis, food allergy, atopic eczema, asthma, hymenoptera venom allergy, and any comorbidities that may arise.
Electronic records containing lung function measurements will also be available. We intend to analyze all spirometry results and to reconstruct lung function curves for quality assessment.
Terveystalo electronic health record system
Terveystalo, Finland's largest private healthcare service company, offers comprehensive healthcare to corporate, private, and public sector customers. With over 300 clinics and 13,000 medical doctors covering 50 specialties, it handles 6.5 million visits annually and serves 1.2 million customers. Diagnoses and prescriptions are recorded in a centralized EHR system.
As a prominent occupational healthcare provider, Terveystalo sees 600,000 patients covered by corporate contracts each year. The registered sick leave data will be used to analyze correlations between atopic diseases and comorbidities and to compare them with control groups.
Terveystalo uses the DynamicHealth system by TietoEVRY, which manages EHR data such as patient information, diagnoses (ICD-10), prescriptions (ATC codes), test results, referrals, procedures, and healthcare practitioners' details. Patient age is derived from birth dates. Data handling adheres to GDPR and Finland's data security laws.
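Deriving the age at each visit from the birth date can be sketched as below. This is a minimal illustration; the function name is hypothetical and the EHR system's own derivation may differ, for instance in how it handles 29 February birthdays.

```python
from datetime import date


def age_at_visit(birth_date, visit_date):
    """Return the age in completed years on the day of the visit."""
    if visit_date < birth_date:
        raise ValueError("visit_date precedes birth_date")
    # Compare (month, day) tuples to see whether the birthday has occurred
    # in the visit year; if not, one year is subtracted.
    had_birthday = (visit_date.month, visit_date.day) >= (
        birth_date.month,
        birth_date.day,
    )
    return visit_date.year - birth_date.year - (0 if had_birthday else 1)


# Hypothetical patient born on 15 June 2000.
age_before_birthday = age_at_visit(date(2000, 6, 15), date(2012, 6, 14))
age_on_birthday = age_at_visit(date(2000, 6, 15), date(2012, 6, 15))
```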
Data analysis and statistical methods
Open-source tools, specifically the R/Python + Jupyter stack, will be the primary choice for data analysis due to their widespread availability and accessibility across all data lake analysis environments. In cases where usage regulations demand it, other statistical software like Stata or SPSS may be considered.
Appropriate statistical tests will be employed to compare continuous and categorical variables between patient groups. Given the inherent skewness of real-world distributions, bootstrap analysis will also be used to determine confidence intervals.
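As an illustration of the bootstrap approach, the sketch below computes a percentile bootstrap confidence interval for the median of a skewed sample. The percentile method, the number of resamples, the function name, and the example values are assumptions; the protocol does not specify which bootstrap variant will be used.

```python
import random
import statistics


def bootstrap_ci(values, statistic=statistics.median, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        statistic(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower_index = max(0, int((alpha / 2) * n_resamples))
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return estimates[lower_index], estimates[upper_index]


# Hypothetical right-skewed measurements, not study data.
sample = [1, 1, 2, 2, 3, 3, 4, 5, 8, 13, 21, 40]
ci_lower, ci_upper = bootstrap_ci(sample)
```

Passing `statistic=statistics.mean` would give an interval for the mean instead.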
For retrospective cohorts, high-dimensional propensity score matching (PSM) will be utilized to create 1:1 or 1:2 control groups, accounting for patient characteristics and potential confounding variables. The K-means for longitudinal data (KmL) method will be employed to identify homogeneous patient trajectories; it accommodates missing values and can be run with varying starting conditions and numbers of clusters.
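The matching step can be sketched as greedy nearest-neighbour matching without replacement. The sketch assumes that propensity scores have already been estimated elsewhere; the function name, the caliper of 0.05, and the greedy strategy are illustrative assumptions, not the protocol's specified algorithm.

```python
def match_controls(case_scores, control_scores, ratio=1, caliper=0.05):
    """Greedy nearest-neighbour matching on precomputed propensity scores.

    case_scores and control_scores map identifiers to scores. Each control
    is used at most once; a case receives up to `ratio` controls whose score
    differs from its own by no more than `caliper`.
    """
    available = dict(control_scores)
    matches = {}
    for case_id, case_score in case_scores.items():
        # Rank the remaining controls by distance to this case's score.
        nearest = sorted(
            available, key=lambda cid: abs(available[cid] - case_score)
        )
        chosen = [
            cid for cid in nearest[:ratio]
            if abs(available[cid] - case_score) <= caliper
        ]
        for cid in chosen:
            del available[cid]
        matches[case_id] = chosen
    return matches


# Hypothetical scores, not study data.
cases = {"a": 0.30, "b": 0.70}
controls = {"c1": 0.31, "c2": 0.68, "c3": 0.90, "c4": 0.28}
one_to_one = match_controls(cases, controls, ratio=1)
one_to_two = match_controls(cases, controls, ratio=2)
```

In the 1:2 run, case "b" receives only one control because the next closest candidate lies outside the caliper.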
Adherence to medications will be assessed across three phases: initiation, implementation, and discontinuation. Medication dispensation events from pharmacy databases, including patient identifiers, event dates, medication types, and quantities, will be recorded. EHR-based algorithms, supported by the open-source R package AdhereR, will estimate medication adherence and persistence from EHR and KELA data.
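One common adherence measure that can be computed from dispensation events is the proportion of days covered (PDC), sketched below. This is not AdhereR's implementation; the function name is hypothetical, and the sketch assumes that dispensed quantities have already been converted to days of supply and that overlapping supply is not carried forward.

```python
from datetime import date, timedelta


def proportion_of_days_covered(dispensations, window_start, window_end):
    """Share of days in the observation window with medication available.

    dispensations: iterable of (dispense_date, days_supply) pairs.
    Days covered by overlapping dispensations are counted once.
    """
    window_days = (window_end - window_start).days + 1
    if window_days <= 0:
        raise ValueError("window_end must not precede window_start")
    covered = set()
    for dispense_date, days_supply in dispensations:
        for offset in range(days_supply):
            day = dispense_date + timedelta(days=offset)
            if window_start <= day <= window_end:
                covered.add(day)
    return len(covered) / window_days


# Hypothetical dispensation history over a 100-day window.
events = [
    (date(2021, 1, 1), 30),
    (date(2021, 2, 15), 30),
    (date(2021, 3, 10), 30),
]
pdc = proportion_of_days_covered(events, date(2021, 1, 1), date(2021, 4, 10))
```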
Text mining of patient records, using natural language processing (NLP) methods, will be employed to extract information such as symptoms or diagnoses that lack structured ICD-10 codes. Rule-based, regular expression-based, and neural network-based methods will be used, especially for classifying diverse inputs into well-defined classes, such as smoking status and patient history.
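A regular expression-based classifier for smoking status might look like the sketch below. The patterns are hypothetical English examples chosen for readability; real clinical notes (likely in Finnish) would need their own rule set, and the class labels and function name are assumptions.

```python
import re

# Rules are checked in order, so the more specific phrases ("non-smoker",
# "ex-smoker") are tested before the generic "smoker" pattern.
SMOKING_RULES = [
    ("never", re.compile(
        r"\b(never smoked|non-?smoker|does not smoke)\b", re.IGNORECASE)),
    ("former", re.compile(
        r"\b(ex-?smoker|former smoker|quit smoking|stopped smoking)\b",
        re.IGNORECASE)),
    ("current", re.compile(
        r"\b(current smoker|smokes|smoker)\b", re.IGNORECASE)),
]


def classify_smoking_status(note):
    """Return the first matching smoking class, or 'unknown' if none match."""
    for label, pattern in SMOKING_RULES:
        if pattern.search(note):
            return label
    return "unknown"


# Hypothetical free-text snippets.
status_never = classify_smoking_status("Patient is a non-smoker.")
status_former = classify_smoking_status("Ex-smoker, quit smoking in 2015.")
status_current = classify_smoking_status("Smokes 10 cigarettes daily.")
status_unknown = classify_smoking_status("No cough at night.")
```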