Analyzing the Limitations of Conventional Machine Learning Models in Handling Large-Scale and Heterogeneous Data

Authors

  • Galih Prakoso Rizky A, Universitas Pembangunan Nasional Veteran Jakarta
  • Rohani Situmorang, Production Controller, PT. Surya Teknologi, Batam, Indonesia

Keywords:

Conventional Machine Learning Models, Large-Scale Data, Heterogeneous Data, Scalability Limitations, Comparative Performance Analysis

Abstract

The rapid growth of data volume, dimensionality, and heterogeneity has challenged the effectiveness of conventional machine learning models, which were originally designed for smaller and more homogeneous datasets. This study analyzes the structural and computational limitations of traditional models such as Logistic Regression, Naïve Bayes, Decision Trees, and Support Vector Machines in handling large-scale and diverse data. Using a combination of literature review, experimental evaluation, and comparative analysis, the research investigates how these models perform under increasing data size, varying feature complexity, and mixed data modalities. Key performance metrics, including accuracy degradation, training time escalation, memory consumption, and scalability constraints, are examined to identify critical thresholds where conventional techniques begin to fail. The results show that traditional models exhibit significant performance drops, resource saturation, and reduced robustness when faced with high-dimensional or heterogeneous datasets, particularly in comparison to modern deep learning and distributed learning approaches. These findings align with earlier theoretical studies but provide new empirical evidence that quantifies failure points and broadens the understanding of scalability limitations. The study concludes that while classical machine learning approaches remain effective for small and structured datasets, they are increasingly unsuitable for contemporary data-intensive environments. This research highlights the necessity of transitioning toward more scalable, adaptive, and representation-rich models to meet current and future data challenges.
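The abstract describes measuring accuracy and training-time degradation of classical models as data size grows. The paper's exact datasets and configurations are not given on this page, so the sketch below is only an illustrative reconstruction of that kind of scaling benchmark, using scikit-learn with synthetic data; all sample sizes, feature counts, and model settings are assumptions, not the authors' setup.

```python
# Illustrative scaling benchmark: train classical models on synthetic
# datasets of increasing size and record training time and accuracy.
# All parameters here are assumptions for demonstration only.
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Factories for the four model families named in the abstract.
MODELS = {
    "LogisticRegression": lambda: LogisticRegression(max_iter=1000),
    "NaiveBayes": GaussianNB,
    "DecisionTree": lambda: DecisionTreeClassifier(random_state=0),
    "LinearSVC": lambda: LinearSVC(max_iter=2000),
}


def scaling_benchmark(sizes=(1_000, 5_000, 20_000), n_features=50):
    """Return {model_name: [(n_samples, train_seconds, test_accuracy), ...]}."""
    results = {name: [] for name in MODELS}
    for n in sizes:
        # Fresh synthetic classification problem at each scale.
        X, y = make_classification(n_samples=n, n_features=n_features,
                                   n_informative=10, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        for name, make_model in MODELS.items():
            model = make_model()
            t0 = time.perf_counter()
            model.fit(X_tr, y_tr)
            elapsed = time.perf_counter() - t0
            results[name].append((n, elapsed, model.score(X_te, y_te)))
    return results


if __name__ == "__main__":
    for name, rows in scaling_benchmark().items():
        for n, secs, acc in rows:
            print(f"{name:20s} n={n:6d} time={secs:7.3f}s acc={acc:.3f}")
```

Plotting training time against sample size for each model would expose the kind of nonlinear cost growth (e.g., for kernel or margin-based methods) that the study identifies as a scalability failure point.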


References

M. Asch et al., “Big data and extreme-scale computing: Pathways to convergence-toward a shaping strategy for a future software and data ecosystem for scientific inquiry,” Int. J. High Perform. Comput. Appl., vol. 32, no. 4, pp. 435–479, 2018.

K. N. Neeraj and V. Maurya, “A review on machine learning (feature selection, classification and clustering) approaches of big data mining in different area of research,” J. Crit. Rev., vol. 7, no. 19, pp. 2610–2626, 2020.

H. A. Abu Alfeilat et al., “Effects of distance measure choice on k-nearest neighbor classifier performance: a review,” Big Data, vol. 7, no. 4, pp. 221–248, 2019.

O. Salman, I. Elhajj, A. Kayssi, and A. Chehab, “An architecture for the Internet of Things with decentralized data and centralized control,” in 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA), IEEE, 2015, pp. 1–8.

J. Lu, Z. Yan, J. Han, and G. Zhang, “Data-driven decision-making (D3M): Framework, methodology, and directions,” IEEE Trans. Emerg. Top. Comput. Intell., vol. 3, no. 4, pp. 286–296, 2019.

L. E. Lwakatare, A. Raj, I. Crnkovic, J. Bosch, and H. H. Olsson, “Large-scale machine learning systems in real-world industrial settings: A review of challenges and solutions,” Inf. Softw. Technol., vol. 127, p. 106368, 2020.

P. V. Torres-Carrión, C. S. González-González, S. Aciar, and G. Rodríguez-Morales, “Methodology for systematic literature review applied to engineering and education,” in 2018 IEEE Global Engineering Education Conference (EDUCON), IEEE, 2018, pp. 1364–1373.

A. Pavlo et al., “A comparison of approaches to large-scale data analysis,” in Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, 2009, pp. 165–178.

P. J. Gleckler, K. E. Taylor, and C. Doutriaux, “Performance metrics for climate models,” J. Geophys. Res. Atmos., vol. 113, no. D6, 2008.

S. Hallsteinsen et al., “A development framework and methodology for self-adapting applications in ubiquitous computing environments,” J. Syst. Softw., vol. 85, no. 12, pp. 2840–2859, 2012.

H. Rashid et al., “Predicting subjective measures of social anxiety from sparsely collected mobile sensor data,” Proc. ACM Interactive, Mobile, Wearable Ubiquitous Technol., vol. 4, no. 3, pp. 1–24, 2020.

Y. Yang, D.-W. Zhou, D.-C. Zhan, H. Xiong, and Y. Jiang, “Adaptive deep models for incremental learning: Considering capacity scalability and sustainability,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 74–82.

B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, “Do CIFAR-10 classifiers generalize to CIFAR-10?,” arXiv preprint arXiv:1806.00451, 2018.

G. Nguyen et al., “Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey,” Artif. Intell. Rev., vol. 52, no. 1, pp. 77–124, 2019.

J. Hestness et al., “Deep learning scaling is predictable, empirically,” arXiv preprint arXiv:1712.00409, 2017.

I. M. Johnstone and D. M. Titterington, “Statistical challenges of high-dimensional data,” Philos. Trans. R. Soc. A Math. Phys. Eng. Sci., vol. 367, no. 1906, pp. 4237–4253, 2009.

L. Boytsov, “Efficient and accurate non-metric k-NN search with applications to text matching,” Ph.D. dissertation, Carnegie Mellon University, 2018.

X. Xu, T. Liang, J. Zhu, D. Zheng, and T. Sun, “Review of classical dimensionality reduction and sample selection methods for large-scale data processing,” Neurocomputing, vol. 328, pp. 5–15, 2019.

K. P. Soman, R. Loganathan, and V. Ajay, Machine learning with SVM and other kernel methods. PHI Learning Pvt. Ltd., 2009.

W. Nash, T. Drummond, and N. Birbilis, “A review of deep learning in the study of materials degradation,” npj Mater. Degrad., vol. 2, no. 1, p. 37, 2018.

S. B. Kotsiantis, “Bagging and boosting variants for handling classifications problems: a survey,” Knowl. Eng. Rev., vol. 29, no. 1, pp. 78–100, 2014.

T. Pranckevičius and V. Marcinkevičius, “Comparison of Naive Bayes, Random Forest, Decision Tree, Support Vector Machines, and Logistic Regression classifiers for text reviews classification,” Balt. J. Mod. Comput., vol. 5, no. 2, p. 221, 2017.

A. Rahimi et al., “High-dimensional computing as a nanoscalable paradigm,” IEEE Trans. Circuits Syst. I Regul. Pap., vol. 64, no. 9, pp. 2508–2521, 2017.

M. Schlueter et al., “New horizons for managing the environment: A review of coupled social-ecological systems modeling,” Nat. Resour. Model., vol. 25, no. 1, pp. 219–272, 2012.

M. M. Abd El-Mohsen, “The effect of stem length in multiple choice questions on item difficulty in syllabus-based vocabulary test items,” 2008.

M. Capra, B. Bussolino, A. Marchisio, M. Shafique, G. Masera, and M. Martina, “An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks,” Futur. Internet, vol. 12, no. 7, p. 113, 2020.

E. Novák, “Automated Machine Learning (AutoML): Challenges and Future Trends in AI Model Optimization,” Int. J. Artif. Intell. Data Sci. Mach. Learn., vol. 1, no. 1, pp. 11–21, 2020.

C. C. Boyd, R. Cheacharoen, T. Leijtens, and M. D. McGehee, “Understanding degradation mechanisms and improving stability of perovskite photovoltaics,” Chem. Rev., vol. 119, no. 5, pp. 3418–3451, 2018.

Published

2025-05-30

How to Cite

Prakoso Rizky A, G., & Situmorang, R. (2025). Analyzing the Limitations of Conventional Machine Learning Models in Handling Large-Scale and Heterogeneous Data. Jurnal Teknik Informatika C.I.T Medicom, 17(2), 80–91. Retrieved from https://medikom.iocspublisher.org/index.php/JTI/article/view/1378