Journal Description

Data

Data is a peer-reviewed, open access journal on data in science, with the aim of enhancing data transparency and reusability. The journal publishes in two sections: a section on the collection, treatment and analysis methods of data in science; a section publishing descriptions of scientific and scholarly datasets (one dataset per paper). The journal is published monthly online by MDPI.

Open Access: free for readers, with article processing charges (APC) paid by authors or their institutions.
High Visibility: indexed within Scopus, ESCI (Web of Science), Ei Compendex, dblp, Inspec, RePEc, and other databases.
Journal Rank: CiteScore - Q2 (Information Systems and Management)
Rapid Publication: manuscripts are peer-reviewed and a first decision is provided to authors approximately 22 days after submission; acceptance to publication is undertaken in 3.9 days (median values for papers published in this journal in the second half of 2023).
Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.

Impact Factor: 2.6 (2022); 5-Year Impact Factor: 3.0 (2022)

ISSN: 2306-5729

Latest Articles

19 pages, 2321 KiB

Open Access Data Descriptor

In Vivo and In Vitro Electrochemical Impedance Spectroscopy of Acute and Chronic Intracranial Electrodes

by Kyle P. O’Sullivan, Brian J. Philip, Jonathan L. Baker, John D. Rolston, Mark E. Orazem, Kevin J. Otto and Christopher R. Butson

Data 2024, 9(6), 78; https://doi.org/10.3390/data9060078 - 6 Jun 2024

Abstract

Invasive intracranial electrodes are used in both clinical and research applications for recording and stimulation of brain tissue, providing essential data in acute and chronic contexts. The impedance characteristics of the electrode–tissue interface (ETI) evolve over time and can change dramatically relative to pre-implantation baseline. Understanding how ETI properties contribute to the recording and stimulation characteristics of an electrode can provide valuable insights for users who often do not have access to complex impedance characterizations of their devices. In contrast to the typical method of characterizing electrical impedance at a single frequency, we demonstrate a method for using electrochemical impedance spectroscopy (EIS) to investigate complex characteristics of the ETI of several commonly used acute and chronic electrodes. We also describe precise modeling strategies for verifying the accuracy of our instrumentation and understanding device–solution interactions, both in vivo and in vitro. Included with this publication is a dataset containing both in vitro and in vivo device characterizations, as well as some examples of modeling and error structure analysis results. These data can be used for more detailed interpretation of neural recordings performed on common electrode types, providing a more complete picture of their properties than is often available to users. Full article

6 pages, 530 KiB

Open Access Data Descriptor

Data on Stark Broadening of N VI Spectral Lines

by Milan S. Dimitrijević, Magdalena D. Christova and Sylvie Sahal-Bréchot

Data 2024, 9(6), 77; https://doi.org/10.3390/data9060077 - 29 May 2024

Abstract

Data on Stark broadening parameters, spectral line widths, and shifts for 15 multiplets of N VI, whose spectral lines are broadened by collisions with electrons, protons, alpha particles (He III) and B III, B IV, B V and B VI ions, are presented. They have been calculated using the semiclassical perturbation theory, for temperatures from 50,000 K to 2,000,000 K, and perturber densities from 10¹⁶ cm⁻³ up to 10²⁴ cm⁻³. The data for e, p and He III are of particular interest for the analysis and modelling of atmospheres of hot and dense stars, such as white dwarfs, and for the investigation of their spectra; the data for boron ions are used for analysis and modelling of laser-driven plasma in proton–boron fusion research. Full article

(This article belongs to the Section Information Systems and Data Management)

7 pages, 614 KiB

Open Access Data Descriptor

The China Historical Christian Database: A Dataset Quantifying Christianity in China from 1550 to 1950

by Alex Mayfield, Margaret Frei, Daryl Ireland and Eugenio Menegon

Data 2024, 9(6), 76; https://doi.org/10.3390/data9060076 - 29 May 2024

Abstract

The era of digitization is revolutionizing traditional humanities research, presenting both novel methodologies and challenges. This field harnesses quantitative techniques to yield groundbreaking insights, contingent upon comprehensive datasets on historical subjects. The China Historical Christian Database (CHCD) exemplifies this trend, furnishing researchers with a rich repository of historical, relational, and geographical data about Christianity in China from 1550 to 1950. The study of Christianity in China confronts formidable obstacles, including the mobility of historical agents, fluctuating relational networks, and linguistic disparities among scattered sources. The CHCD addresses these challenges by curating an open-access database built in neo4j that records information about Christian institutions in China and those that worked inside of them. Drawing on historical sources, the CHCD contains temporal, relational, and geographic data. The database currently has over 40,000 nodes and 200,000 relationships, and continues to grow. Beyond its utility for religious studies, the CHCD encompasses broader interdisciplinary inquiries including social network analysis, geospatial visualization, and economic modeling. This article introduces the CHCD’s structure, and explains the data collection and curation process. Full article

27 pages, 512 KiB

Open Access Article

De-Anonymizing Users across Rating Datasets via Record Linkage and Quasi-Identifier Attacks

by Nicolás Torres and Patricio Olivares

Data 2024, 9(6), 75; https://doi.org/10.3390/data9060075 - 27 May 2024

Abstract

The widespread availability of pseudonymized user datasets has enabled personalized recommendation systems. However, recent studies have shown that users can be de-anonymized by exploiting the uniqueness of their data patterns, raising significant privacy concerns. This paper presents a novel approach that tackles the challenging task of linking user identities across multiple rating datasets from diverse domains, such as movies, books, and music, by leveraging the consistency of users’ rating patterns as high-dimensional quasi-identifiers. The proposed method combines probabilistic record linkage techniques with quasi-identifier attacks, employing the Fellegi–Sunter model to compute the likelihood of two records referring to the same user based on the similarity of their rating vectors. Through extensive experiments on three publicly available rating datasets, we demonstrate the effectiveness of the proposed approach in achieving high precision and recall in cross-dataset de-anonymization tasks, outperforming existing techniques, with F1-scores ranging from 0.72 to 0.79 for pairwise de-anonymization tasks. The novelty of this research lies in the unique integration of record linkage techniques with quasi-identifier attacks, enabling the effective exploitation of the uniqueness of rating patterns as high-dimensional quasi-identifiers to link user identities across diverse datasets, addressing a limitation of existing methodologies. We thoroughly investigate the impact of various factors, including similarity metrics, dataset combinations, data sparsity, and user demographics, on the de-anonymization performance. This work highlights the potential privacy risks associated with the release of anonymized user data across diverse contexts and underscores the critical need for stronger anonymization techniques and tailored privacy-preserving mechanisms for rating datasets and recommender systems. Full article
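The Fellegi–Sunter scoring named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-item agreement flags (for example, two ratings of the same item falling within some tolerance), the m and u probabilities, and the function name fs_weight are all hypothetical and chosen only to show how agreement and disagreement weights add up to a match score.

```python
import math

def fs_weight(agreements, m_probs, u_probs):
    """Fellegi-Sunter match weight for one candidate pair of records.

    agreements: one bool per comparison field (here: per shared rated item).
    m_probs: P(field agrees | records belong to the same user).
    u_probs: P(field agrees | records belong to different users).
    """
    total = 0.0
    for agree, m, u in zip(agreements, m_probs, u_probs):
        if agree:
            total += math.log2(m / u)                  # agreement weight
        else:
            total += math.log2((1.0 - m) / (1.0 - u))  # disagreement weight
    return total

# Hypothetical probabilities for three items rated in both datasets.
m_probs = [0.9, 0.8, 0.85]
u_probs = [0.2, 0.25, 0.3]

w_all = fs_weight([True, True, True], m_probs, u_probs)      # every item agrees
w_none = fs_weight([False, False, False], m_probs, u_probs)  # no item agrees
```

A pair would be declared a link when its weight exceeds an upper threshold and a non-link when it falls below a lower one; the thresholds themselves would have to be tuned on the data.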

(This article belongs to the Section Information Systems and Data Management)

16 pages, 1931 KiB

Open Access Article

CVs Classification Using Neural Network Approaches Combined with BERT and Gensim: CVs of Moroccan Engineering Students

by Aniss Qostal, Aniss Moumen and Younes Lakhrissi

Data 2024, 9(6), 74; https://doi.org/10.3390/data9060074 - 24 May 2024

Abstract

Deep learning (DL)-oriented document processing is widely used in different fields for extraction, recognition, and classification processes from a raw corpus of data. The article examines the application of deep learning approaches, based on different neural network methods, including Gated Recurrent Unit (GRU), long short-term memory (LSTM), and convolutional neural networks (CNNs). The compared models were combined with two different word embedding techniques, namely: Bidirectional Encoder Representations from Transformers (BERT) and Gensim Word2Vec. The models are designed to evaluate the performance of architectures based on neural network techniques for the classification of CVs of Moroccan engineering students at ENSAK (National School of Applied Sciences of Kenitra, Ibn Tofail University). The dataset used included CVs collected from engineering students at ENSAK in 2023 for a project on the employability of Moroccan engineers in which new approaches were applied, especially machine learning, deep learning, and big data. Accordingly, 867 resumes were collected from five specialties of study (Electrical Engineering (ELE), Networks and Systems Telecommunications (NST), Computer Engineering (CE), Automotive Mechatronics Engineering (AutoMec), Industrial Engineering (Indus)). The results showed that the proposed models based on the BERT embedding approach were more accurate than models based on the Gensim Word2Vec embedding approach. Accordingly, the CNN-GRU/BERT model achieved slightly better accuracy with 0.9351 compared to other hybrid models. On the other hand, single learning models also have good metrics, especially based on BERT embedding architectures, where CNN has the best accuracy with 0.9188. Full article

22 pages, 1226 KiB

Open Access Article

Comparative Analysis of the Predictive Performance of an ANN and Logistic Regression for the Acceptability of Eco-Mobility Using the Belgrade Data Set

by Jelica Komarica, Draženko Glavić and Snežana Kaplanović

Data 2024, 9(5), 73; https://doi.org/10.3390/data9050073 - 19 May 2024

Abstract

To solve the problem of environmental pollution caused by road traffic, alternatives to vehicles with internal combustion engines are often proposed. As such, eco-mobility microvehicles have significant potential in the fight against environmental pollution, but only on the condition that they are widely accepted and that they replace the vehicles that predominantly pollute the environment. With this in mind, this study aims to elucidate the main variables that influence the acceptability of these vehicles, using prediction models based on binary logistic regression and a multilayer artificial neural network—a multilayer perceptron (ANN). The data of a random sample obtained via an online questionnaire, answered by 503 inhabitants of Belgrade (Serbia), were used for training and testing the model. A multilayer perceptron with 9 and 7 neurons in two hidden layers, a hyperbolic tangent activation function in the hidden layer, and an identity function in the output layer performed slightly better than the binary logistic regression model. With an accuracy of 85%, a precision of 79%, a recall of 81%, and an area under the ROC curve of 0.9, the multilayer perceptron model recognized the influential variables in predicting acceptability. The results of the model indicate that a respondent’s relationship to their current environmental pollution, the frequency of their use of modes of transport such as bicycles and motorcycles, their mileage for commuting, and their personal income have the greatest influence on the acceptability of using eco-mobility vehicles. Full article

13 pages, 2681 KiB

Open Access Article

A Benchmark Data Set for Long-Term Monitoring in the eLTER Site Gesäuse-Johnsbachtal

by Florian Lippl, Alexander Maringer, Margit Kurka, Jakob Abermann, Wolfgang Schöner and Manuela Hirschmugl

Data 2024, 9(5), 72; https://doi.org/10.3390/data9050072 - 18 May 2024

Abstract

This paper gives an overview of all currently available data sets for the European Long-term Ecosystem Research (eLTER) monitoring site Gesäuse-Johnsbachtal. The site is part of the LTSER platform Eisenwurzen in the Alps of the province of Styria, Austria. It contains both protected (National Park Gesäuse) and non-protected areas (Johnsbachtal). Although the main research focus of the eLTER monitoring site Gesäuse-Johnsbachtal is on inland surface running waters, forests and other wooded land, the eLTER whole system (WAILS) approach was followed with regard to data selection, systematically screening all available data for their suitability as eLTER’s Standard Observations (SOs). Thus, data from all system strata were included, incorporating Geosphere, Atmosphere, Hydrosphere, Biosphere and Sociosphere. In the WAILS approach these SOs are key data for a whole system approach towards long term ecosystem research. Altogether, 54 data sets have been collected for the eLTER monitoring site Gesäuse-Johnsbachtal and included in the Dynamical Ecological Information Management System – Site and Data Registry (DEIMS-SDR), which is the eLTER data platform. The presented work provides all these data sets through dedicated data repositories for FAIR use. This paper gives an overview of all compiled data sets and their main properties. Additionally, the available data are evaluated in a concluding gap analysis with regard to the needed observation data according to WAILS, followed by an outlook on how to fill these gaps. Full article

24 pages, 545 KiB

Open Access Article

Neural Architecture Comparison for Bibliographic Reference Segmentation: An Empirical Study

by Rodrigo Cuéllar Hidalgo, Raúl Pinto Elías, Juan-Manuel Torres-Moreno, Osslan Osiris Vergara Villegas, Gerardo Reyes Salgado and Andrea Magadán Salazar

Data 2024, 9(5), 71; https://doi.org/10.3390/data9050071 - 18 May 2024

Abstract

In the realm of digital libraries, efficiently managing and accessing scientific publications necessitates automated bibliographic reference segmentation. This study addresses the challenge of accurately segmenting bibliographic references, a task complicated by the varied formats and styles of references. Focusing on the empirical evaluation of Conditional Random Fields (CRF), Bidirectional Long Short-Term Memory with CRF (BiLSTM + CRF), and Transformer Encoder with CRF (Transformer + CRF) architectures, this research employs Byte Pair Encoding and Character Embeddings for vector representation. The models underwent training on the extensive Giant corpus and subsequent evaluation on the Cora Corpus to ensure a balanced and rigorous comparison, maintaining uniformity across embedding layers, normalization techniques, and Dropout strategies. Results indicate that the BiLSTM + CRF architecture outperforms its counterparts by adeptly handling the syntactic structures prevalent in bibliographic data, achieving an F1-Score of 0.96. This outcome highlights the necessity of aligning model architecture with the specific syntactic demands of bibliographic reference segmentation tasks. Consequently, the study establishes the BiLSTM + CRF model as a superior approach within the current state-of-the-art, offering a robust solution for the challenges faced in digital library management and scholarly communication. Full article

(This article belongs to the Special Issue Advances in Text Mining Techniques and Applications for Knowledge Discovery)

17 pages, 6833 KiB

Open Access Data Descriptor

Continuous Wave Measurements Collected in Intermediate Depth throughout the North Sea Storm Season during the RealDune/REFLEX Experiments

by Jantien Rutten, Marion Tissier, Paul van Wiechen, Xinyi Zhang, Sierd de Vries, Ad Reniers and Jan-Willem Mol

Data 2024, 9(5), 70; https://doi.org/10.3390/data9050070 - 17 May 2024

Cited by 1

Abstract

High-resolution wave measurements at intermediate water depth are required to improve coastal impact modeling. Specifically, such data sets are desired to calibrate and validate models, and broaden the insight on the boundary conditions that force models. Here, we present a wave data set collected in the North Sea at three stations in intermediate water depth (6–14 m) during the 2021/2022 storm season as part of the RealDune/REFLEX experiments. Continuous measurements of synchronized surface elevation, velocity and pressure were recorded at 2–4 Hz by Acoustic Doppler Profilers and an Acoustic Doppler Velocimeter for a 5-month duration. Time series were quality-controlled, directional-frequency energy spectra were calculated and common bulk parameters were derived. Measured wave conditions vary from calm to energetic with 0.1–5.0 m sea-swell wave height, 5–16 s mean wave period and W-NNW direction. Nine storms, i.e., wave height beyond 2.5 m for at least six hours, were recorded including the triple storms Dudley, Eunice and Franklin. This unique data set can be used to investigate wave transformation, wave nonlinearity and wave directionality for higher and lower frequencies (e.g., sea-swell and infragravity waves) to compare with theoretical and empirical descriptions. Furthermore, the data can serve to force, calibrate and validate models during storm conditions. Full article
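The storm criterion stated in this abstract (wave height beyond 2.5 m for at least six hours) can be sketched as a simple run-length check. This is only an illustration, not the authors' processing chain: the function name find_storms, the hourly sampling of the wave-height series, and the sample values are assumptions made for the example.

```python
def find_storms(heights_m, dt_hours=1.0, threshold_m=2.5, min_duration_h=6.0):
    """Return (start, end) index pairs, end exclusive, for every run of
    wave heights strictly above threshold_m lasting at least min_duration_h.

    heights_m: evenly sampled wave heights in metres (hypothetical input).
    dt_hours: sampling interval in hours (assumed hourly by default).
    """
    storms = []
    start = None
    for i, h in enumerate(heights_m):
        if h > threshold_m:
            if start is None:
                start = i  # a run above the threshold begins here
        elif start is not None:
            if (i - start) * dt_hours >= min_duration_h:
                storms.append((start, i))
            start = None
    # Close a run that is still open at the end of the record.
    if start is not None and (len(heights_m) - start) * dt_hours >= min_duration_h:
        storms.append((start, len(heights_m)))
    return storms

# Made-up hourly series: a 6 h event qualifies, a later 5 h event does not.
series = [1.0] * 3 + [3.0] * 6 + [1.0] * 2 + [3.0] * 5 + [1.0]
storms = find_storms(series)
```

With a different sampling interval, only dt_hours needs to change; the threshold and minimum duration stay as given in the abstract.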

38 pages, 390 KiB

Open Access Review

Review of Data Processing Methods Used in Predictive Maintenance for Next Generation Heavy Machinery

by Ietezaz Ul Hassan, Krishna Panduru and Joseph Walsh

Data 2024, 9(5), 69; https://doi.org/10.3390/data9050069 - 15 May 2024

Abstract

Vibration-based condition monitoring plays an important role in maintaining reliable and effective heavy machinery in various sectors. Heavy machinery involves major investments and is frequently subjected to extreme operating conditions. Therefore, prompt fault identification and preventive maintenance are important for reducing costly breakdowns and maintaining operational safety. In this review, we look at different methods of vibration data processing in the context of vibration-based condition monitoring for heavy machinery. We divided primary approaches related to vibration data processing into three categories: signal processing methods, preprocessing-based techniques and artificial intelligence-based methods. We highlight the importance of these methods in improving the reliability and effectiveness of heavy machinery condition monitoring systems, and underline the importance of precise and automated fault detection systems. To improve machinery performance and operational efficiency, this review aims to provide information on current developments and future directions in vibration-based condition monitoring by addressing issues like imbalanced data and integrating cutting-edge techniques like anomaly detection algorithms. Full article

15 pages, 1153 KiB

Open Access Data Descriptor

EEG and Physiological Signals Dataset from Participants during Traditional and Partially Immersive Learning Experiences in Humanities

by Rebeca Romo-De León, Mei Li L. Cham-Pérez, Verónica Andrea Elizondo-Villegas, Alejandro Villarreal-Villarreal, Alexandro Antonio Ortiz-Espinoza, Carol Stefany Vélez-Saboyá, Jorge de Jesús Lozoya-Santos, Manuel Cebral-Loureda and Mauricio A. Ramírez-Moreno

Data 2024, 9(5), 68; https://doi.org/10.3390/data9050068 - 15 May 2024

Abstract

The relevance of the interaction between Humanities-enhanced learning using immersive environments and simultaneous physiological signal analysis contributes to the development of Neurohumanities and advancements in applications of Digital Humanities. The present dataset consists of recordings from 24 participants divided into two groups (12 participants in each group) engaging in simulated learning scenarios, traditional learning, and partially immersive learning experiences. Data recordings from each participant contain recordings of physiological signals and psychometric data collected from applied questionnaires. Physiological signals include electroencephalography, real-time engagement and emotion recognition calculation by a Python EEG acquisition code, head acceleration, electrodermal activity, blood volume pressure, inter-beat interval, and temperature. Before the acquisition of physiological signals, participants were asked to fill out the General Health Questionnaire and Trait Meta-Mood Scale. In between recording sessions, participants were asked to fill out Likert-scale questionnaires regarding their experience and a Self-Assessment Manikin. At the end of the recording session, participants filled out the ITC Sense of Presence Inventory questionnaire for user experience. The dataset can be used to explore differences in physiological patterns observed between different learning modalities in the Humanities. Full article

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—2nd Edition)

41 pages, 2238 KiB

Open Access Article

Unveiling University Groupings: A Clustering Analysis for Academic Rankings

by George Matlis, Nikos Dimokas and Petros Karvelis

Data 2024, 9(5), 67; https://doi.org/10.3390/data9050067 - 11 May 2024

Abstract

The evaluation and ranking of educational institutions are of paramount importance to a wide range of stakeholders, including students, faculty members, funding organizations, and the institutions themselves. Traditional ranking systems, such as those provided by QS, ARWU, and THE, have offered valuable insights into university performance by employing a variety of indicators to reflect institutional excellence across research, teaching, international outlook, and more. However, these linear rankings may not fully capture the multifaceted nature of university performance. This study introduces a novel clustering analysis that complements existing rankings by grouping universities with similar characteristics, providing a multidimensional perspective on global higher education landscapes. Utilizing a range of clustering algorithms—K-Means, GMM, Agglomerative, and Fuzzy C-Means—and incorporating both traditional and unique indicators, our approach seeks to highlight the commonalities and shared strengths within clusters of universities. This analysis does not aim to supplant existing ranking systems but to augment them by offering stakeholders an alternative lens through which to view and assess university performance. By focusing on group similarities rather than ordinal positions, our method encourages a more nuanced understanding of institutional excellence and facilitates peer learning among universities with similar profiles. While acknowledging the limitations inherent in any methodological approach, including the selection of indicators and clustering algorithms, this study underscores the value of complementary analyses in enriching our understanding of higher educational institutions’ performance. Full article

9 pages, 191 KiB

Open Access Data Descriptor

A Series Production Data Set for Five-Axis CNC Milling

by Anna-Maria Schmitt and Bastian Engelmann

Data 2024, 9(5), 66; https://doi.org/10.3390/data9050066 - 30 Apr 2024

Abstract

The described data set contains features from the machine control of a five-axis milling machine. The features were recorded during thirteen series productions. Each series production includes a changeover process in which the machine was set up for the production of a different product. In addition to the timestamps and the twenty recorded features derived from Numerical Control (NC) variables, the data set also contains labels for the different production phases. For this purpose, up to 23 phases were assigned, which are based on a generalized milling process. The data set consists of thirteen .csv files, each representing a series production. The data set was recorded in a production company in the contract manufacturing sector for components with real series orders in ongoing industrial production. Full article

16 pages, 2904 KiB

Open Access Article

Spectral Library of Plant Species from Montesinho Natural Park in Portugal

by Isabel Pôças, Cátia Rodrigues de Almeida, Salvador Arenas-Castro, João C. Campos, Nuno Garcia, João Alírio, Neftalí Sillero and Ana C. Teodoro

Data 2024, 9(5), 65; https://doi.org/10.3390/data9050065 - 30 Apr 2024

Abstract

In this work, we present and describe a spectral library (SL) with 15 vascular plant species from Montesinho Natural Park (MNP), a protected area in Northeast Portugal. We selected species from the vascular plants that are characteristic of the habitats in the MNP, based on their prevalence, and also included one invasive species: Alnus glutinosa (L.) Gaertn, Castanea sativa Mill., Cistus ladanifer L., Crataegus monogyna Jacq., Frangula alnus Mill., Fraxinus angustifolia Vahl, Quercus pyrenaica Willd., Quercus rotundifolia Lam., Trifolium repens L., Arbutus unedo L., Dactylis glomerata L., Genista falcata Brot., Cytisus multiflorus (L’Hér.) Sweet, Erica arborea L., and Acacia dealbata Link. We collected spectra (300–2500 nm) from five records per leaf and leaf side, which resulted in 538 spectra compiled in the SL. Additionally, we computed five vegetation indices from spectral data and analysed them to highlight specific characteristics and differences among the sampled species. We detail the data repository information and its organisation for a better understanding of the data and to facilitate its use. The SL structure can add valuable information about the selected plant species in MNP, contributing to conservation purposes. This plant species SL is publicly available on the Zenodo platform. Full article

17 pages, 7237 KiB

Open Access Data Descriptor

A Comprehensive Dataset of the Aerodynamic and Geometric Coefficients of Airfoils in the Public Domain

by Kanak Agarwal, Vedant Vijaykrishnan, Dyutit Mohanty and Manikandan Murugaiah

Data 2024, 9(5), 64; https://doi.org/10.3390/data9050064 - 30 Apr 2024

Abstract

This study presents an extensive collection of data on the aerodynamic behavior at a low Reynolds number and geometric coefficients for 2900 airfoils obtained through the class shape transformation (CST) method. By employing a verified OpenFOAM-based CFD simulation framework, lift and drag coefficients were determined at a Reynolds number of 10⁵. Considering the limited availability of data on low Reynolds number airfoils, this dataset is invaluable for a wide range of applications, including unmanned aerial vehicles (UAVs) and wind turbines. Additionally, the study offers a method for automating CFD simulations that could be applied to obtain aerodynamic coefficients at higher Reynolds numbers. The breadth of this dataset also supports the enhancement and creation of machine learning (ML) models, further advancing research into the aerodynamics of airfoils and lifting surfaces. Full article

15 pages, 6850 KiB

Open Access Article

Detailed Landslide Traces Database of Hancheng County, China, Based on High-Resolution Satellite Images Available on the Google Earth Platform

by Junlei Zhao, Chong Xu and Xinwu Huang

Data 2024, 9(5), 63; https://doi.org/10.3390/data9050063 - 29 Apr 2024

Abstract

Hancheng is located in the eastern part of China’s Shaanxi Province, near the west bank of the Yellow River. It is located at the junction of the active geological structure area. The rock layer is relatively fragmented, and landslide disasters are frequent. The occurrence of landslide disasters often causes a large number of casualties along with economic losses in the local area, seriously restricting local economic development. Although risk assessment and deformation mechanism analysis for single landslides have been performed for landslide disasters in the Hancheng area, this area lacks a landslide traces database. A complete landslide database comprises the basic data required for the study of landslide disasters and is an important requirement for subsequent landslide-related research. Therefore, this study used multi-temporal high-resolution optical images and human-computer interaction visual interpretation methods of the Google Earth platform to construct a landslide traces database in Hancheng County. The results showed that at least 6785 landslides had occurred in the study area. The total area of the landslides was about 95.38 km², accounting for 5.88% of the study area. The average landslide area was 1406.04 m², the largest landslide area was 377,841 m², and the smallest landslide area was 202.96 m². The results of this study provide an important basis for understanding the spatial distribution of landslides in Hancheng County, the evaluation of landslide susceptibility, and local disaster prevention and mitigation work. Full article

(This article belongs to the Topic Database, Mechanism and Risk Assessment of Slope Geologic Hazards)

16 pages, 5947 KiB

Open Access Data Descriptor

Stimulated Microcontroller Dataset for New IoT Device Identification Schemes through On-Chip Sensor Monitoring

by Alberto Ramos, Honorio Martín, Carmen Cámara and Pedro Peris-Lopez

Data 2024, 9(5), 62; https://doi.org/10.3390/data9050062 - 28 Apr 2024

Abstract

Legitimate identification of devices is crucial to ensure the security of present and future IoT ecosystems. In this regard, AI-based systems that exploit intrinsic hardware variations have gained notable relevance. Within this context, on-chip sensors included for monitoring purposes in a wide range of SoCs remain almost unexplored, despite their potential as a valuable source of both information and variability. In this work, we introduce and release a dataset comprising data collected from the on-chip temperature and voltage sensors of 20 microcontroller-based boards from the STM32L family. These boards were stimulated with five different algorithms, as workloads to elicit diverse responses. The dataset consists of five acquisitions (1.3 billion readouts) that are spaced over time and were obtained under different configurations using an automated platform. The raw dataset is publicly available, along with metadata and scripts developed to generate pre-processed T–V sequence sets. Finally, a proof of concept consisting of training a simple model is presented to demonstrate the feasibility of the identification system based on these data. Full article

10 pages, 290 KiB

Open Access Data Descriptor

Training Datasets for Epilepsy Analysis: Preprocessing and Feature Extraction from Electroencephalography Time Series

by Christian Riccio, Angelo Martone, Gaetano Zazzaro and Luigi Pavone

Data 2024, 9(5), 61; https://doi.org/10.3390/data9050061 - 26 Apr 2024

Cited by 1

Abstract

We describe 20 datasets derived through signal filtering and feature extraction steps applied to the raw time series EEG data of 20 epileptic patients, as well as the methods we used to derive them. Background: Epilepsy is a complex neurological disorder which has seizures as its hallmark. Electroencephalography plays a crucial role in epilepsy assessment, offering insights into the brain’s electrical activity and advancing our understanding of seizures. The availability of tagged training sets covering all seizure phases (inter-ictal, pre-ictal, ictal, and post-ictal) is crucial for data-driven epilepsy analyses. Methods: Using the sliding window technique with a two-second window length and a one-second time slip, we extract multiple features from the preprocessed EEG time series of 20 patients from the Freiburg Seizure Prediction Database. In addition, we assign a class label to each instance to specify its corresponding seizure phase. All these operations are performed through a software application we developed, which is named Training Builder. Results: The 20 tagged training datasets each contain 1080 univariate and bivariate features, and are openly and publicly available. Conclusions: The datasets support the training of data-driven models for seizure detection, prediction, and clustering, based on feature engineering. Full article
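The sliding window scheme described in the Methods (two-second windows advanced by one second) can be sketched as follows. This is a generic illustration and not the Training Builder application: the function names, the 256 Hz sampling rate, the synthetic signal, and the per-window mean (standing in for the real feature set) are all assumptions.

```python
def sliding_windows(n_samples, fs_hz, window_s=2.0, step_s=1.0):
    """Return (start, end) sample indices, end exclusive, of each full window.

    fs_hz: sampling rate in Hz (hypothetical; set it to the recording's rate).
    """
    win = int(round(window_s * fs_hz))
    step = int(round(step_s * fs_hz))
    return [(s, s + win) for s in range(0, n_samples - win + 1, step)]

def window_means(signal, fs_hz):
    """Toy univariate feature: the mean of every two-second window."""
    return [sum(signal[a:b]) / (b - a)
            for a, b in sliding_windows(len(signal), fs_hz)]

fs = 256                                           # assumed sampling rate
signal = [float(i % 7) for i in range(10 * fs)]    # 10 s of synthetic data
bounds = sliding_windows(len(signal), fs)          # 9 overlapping windows
features = window_means(signal, fs)
```

Consecutive windows overlap by one second, so a 10-second segment yields 9 instances rather than 5; each instance would then receive the class label of the seizure phase it falls in.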

27 pages, 1874 KiB

Open Access Article

Predicting Academic Success of College Students Using Machine Learning Techniques

by Jorge Humberto Guanin-Fajardo, Javier Guaña-Moya and Jorge Casillas

Data 2024, 9(4), 60; https://doi.org/10.3390/data9040060 - 22 Apr 2024

Abstract

College context and academic performance are important determinants of academic success; using students’ prior experience with machine learning techniques to predict academic success before the end of the first year reinforces college self-efficacy. Dropout prediction is related to student retention and has been studied extensively in recent work; however, there is little literature on predicting academic success using educational machine learning. For this reason, CRISP-DM methodology was applied to extract relevant knowledge and features from the data. The dataset examined consists of 6690 records and 21 variables with academic and socioeconomic information. Preprocessing techniques and classification algorithms were analyzed. The area under the curve was used to measure the effectiveness of the algorithm; XGBoost had an AUC = 87.75% and correctly classified eight out of ten cases, while the decision tree improved interpretation with ten rules in seven out of ten cases. Recognizing the gaps in the study and that on-time completion of college consolidates college self-efficacy, creating intervention and support strategies to retain students is a priority for decision makers. Assessing the fairness and discrimination of the algorithms was the main limitation of this work. In the future, we intend to apply the extracted knowledge and learn about its influence on university management. Full article
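
As a reminder of what the area under the ROC curve measures, here is a minimal sketch of the pairwise (Mann-Whitney) formulation: the AUC is the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the article's data or code.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 1 = academic success, 0 = no success.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
auc = roc_auc(labels, scores)  # 5 of the 6 positive/negative pairs are ordered correctly
```

The quadratic pairwise loop is chosen for readability; for datasets of thousands of records a rank-based or library implementation would be the usual choice.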

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—2nd Edition)

22 pages, 700 KiB

Open Access Review

Mapping of Data-Sharing Repositories for Paediatric Clinical Research—A Rapid Review

by Mariagrazia Felisi, Fedele Bonifazi, Maddalena Toma, Claudia Pansieri, Rebecca Leary, Victoria Hedley, Ronald Cornet, Giorgio Reggiardo, Annalisa Landi, Annunziata D’Ercole, Salma Malik, Sinéad Nally, Anando Sen, Avril Palmeri, Donato Bonifazi and Adriana Ceci

Data 2024, 9(4), 59; https://doi.org/10.3390/data9040059 - 20 Apr 2024

Abstract

The reuse of paediatric individual patient data (IPD) from clinical trials (CTs) is essential to overcome specific ethical, regulatory, methodological, and economic issues that hinder the progress of paediatric research. Sharing data through repositories enables the aggregation and dissemination of clinical information, fosters collaboration between researchers, and promotes transparency. This work aims to identify and describe existing data-sharing repositories (DSRs) developed to store, share, and reuse paediatric IPD from CTs. A rapid review of platforms providing access to electronic DSRs was conducted. A two-stage process was used to characterize DSRs: a first step of identification, followed by a second step of analysis using a set of eight purpose-built indicators. From an initial set of forty-five publicly available DSRs, twenty-one DSRs were identified as meeting the eligibility criteria. Only two DSRs were found to be totally focused on the paediatric population. Despite an increased awareness of the importance of data sharing, the results of this study show that paediatrics remains an area in which targeted efforts are still needed. Promoting initiatives to raise awareness of these DSRs and creating ad hoc measures and common standards for the sharing of paediatric CT data could help to bridge this gap in paediatric research. Full article

News

7 June 2024
MDPI Calls for Greater Open Access to Science for Ocean Protection

5 June 2024
MDPI Sets a New Benchmark for Publishing Excellence

4 June 2024
MDPI Insights: The CEO's Letter #12 - First Term as CEO, Tu Youyou Award, Books Report

Topics

Topic in Algorithms, Data, Information, Mathematics, Symmetry

Decision-Making and Data Mining for Sustainable Computing
Topic Editors: Sunil Jha, Malgorzata Rataj, Xiaorui Zhang
Deadline: 30 November 2024

Topic in BDCC, Data, MAKE, Mathematics

Big Data Intelligence: Methodologies and Applications
Topic Editors: Liang Zhao, Liang Zou, Boxiang Dong
Deadline: 31 December 2024

Topic in BDCC, Data, Environments, Geosciences, Remote Sensing

Database, Mechanism and Risk Assessment of Slope Geologic Hazards
Topic Editors: Chong Xu, Yingying Tian, Xiaoyi Shao, Zikang Xiao, Yulong Cui
Deadline: 28 February 2025

Topic in Data, Energies, Sensors, Sustainability, Water

Water and Energy Monitoring and Their Nexus
Topic Editors: Lucas Pereira, Hugo Morais, Wolf-Gerrit Früh
Deadline: 31 March 2025