Alfred, R., Obit, J., Chin, C. Y., Haviluddin, H., Lim, Y., & Kim, S. (2021). Towards paddy rice smart farming: A review on big data, machine learning, and rice production tasks. IEEE Access, 9, 50358–50380.
Alibaba DAMO Academy. (2023). Tongyi Qianwen technical documentation. Alibaba DAMO Academy.
Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., Silver, D., et al. (2023). Gemini: A family of highly capable multimodal models. arXiv preprint. https://arxiv.org/abs/2312.11805
Bai, X., Gu, S., Liu, P., Yang, A., Cai, Z., Wang, J., & Yao, J. (2023). RPNet: Rice plant counting after tillering stage based on plant attention and multiple supervision network. The Crop Journal, 11(5), 1586–1594.
Beddiar, D. R., Oussalah, M., & Seppänen, T. (2022). Automatic captioning for medical imaging (MIC): A rapid review of literature. Artificial Intelligence Review, 56, 4019–4076.
Che, C., Lin, Q., Zhao, X., Huang, J., & Yu, L. (2023). Enhancing multimodal understanding with CLIP-based image-to-text transformation. In Proceedings of the 2023 6th International Conference on Big Data Technologies (ICBDT) (pp. 301–313). Association for Computing Machinery.
Chen, Q., Hu, X., Wang, Z., & Hong, Y. (2023). MedBLIP: Bootstrapping language-image pre-training from 3D medical images and texts. arXiv preprint. https://arxiv.org/abs/2305.10799
Chen, X., Zhang, Y., Li, M., & Wang, X. (2023). A survey on image captioning: Advances and challenges. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1), 1234–1256.
Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., Li, B., Fung, P., & Hoi, S. (2023). InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint. https://arxiv.org/abs/2305.06500
Dan, Y., Wu, X., Yu, Y., Zou, Z., Gunarathna, R. D. S. M., Yu, P., Xiao, Y., & Wang, Q. (2025). DKP-ADS: Domain knowledge prompt combined with multi-task learning for assessment of foliar disease severity in staple crops. The Crop Journal, 13(6), 1939–1954.
Dong, X., Wang, Q., Huang, Q., Ge, Q., Zhao, K., Wu, X., Wu, X., Lei, L., & Hao, G. (2023). PDDD-PreTrain: A series of commonly used pre-trained models support image-based plant disease diagnosis. Plant Phenomics, 5, 0054.
Dong, Y., Cordonnier, J.-B., & Loukas, A. (2021). Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning (Vol. 139, pp. 2793–2803). PMLR.
Dubey, A., Jauhri, A., Pandey, A., et al. (2024). The Llama 3 herd of models. arXiv preprint. https://arxiv.org/abs/2407.21783
Food and Agriculture Organization. (2018). The state of food and agriculture 2018: Migration, agriculture and rural development. Food and Agriculture Organization of the United Nations. https://www.fao.org/3/I9549EN/i9549en.pdf
Guerra, J. P., & Cuevas, F. (2024). Application of digital image processing techniques for agriculture: A review. In Digital image processing: Latest advances and applications (Chapter 2). IntechOpen. https://doi.org/10.5772/intechopen.1001234
Gurnee, W., & Tegmark, M. (2024). Language models represent space and time. In The Twelfth International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=3F6nWfeHoo
Hughes, D. P., & Salathé, M. (2015). An open access repository of images on plant health to enable the development of mobile disease diagnostics through machine learning and crowdsourcing. arXiv preprint. https://arxiv.org/abs/1511.08060
Koh, J. Y., Fried, D., & Salakhutdinov, R. (2023). Generating images with multimodal language models. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023). Curran Associates. https://arxiv.org/abs/2305.17216
Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint. https://arxiv.org/abs/2201.12086
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML) (pp. 19730–19742). PMLR. https://proceedings.mlr.press/v202/li23q.html
Li, W., Zhu, L., Wen, L., & Yang, Y. (2023). DeCap: Decoding CLIP latents for zero-shot captioning via text-only training. In The Eleventh International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=Lt8bMlhiwx2
Li, Y., Yin, Y., Fan, C., Zhang, Y., Wang, X., & Tang, J. (2024). A survey of hallucination in multimodal large language models. arXiv preprint. https://arxiv.org/abs/2405.19388
Liu, H., Li, C., Li, Y., & Lee, Y. J. (2023). Improved baselines with visual instruction tuning. arXiv preprint. https://arxiv.org/abs/2310.03744
OpenAI. (2023). GPT-4 technical report. OpenAI. https://arxiv.org/abs/2303.08774
Pacal, I., Kunduracioglu, I., Alma, M. H., Deveci, M., Kadry, S., Nedoma, J., & Martinek, R. (2024). A systematic review of deep learning techniques for plant diseases. Artificial Intelligence Review, 57, 304. https://doi.org/10.1007/s10462-024-10945-0
Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML) (pp. 8748–8763). PMLR.
Restrepo-Arias, J. F., Branch-Bedoya, J. W., & Awad, G. (2024). Image classification on smart agriculture platforms: Systematic literature review. Artificial Intelligence in Agriculture, 8, 1–17.
Rohne Till, E. (2022). The role of agriculture in economic development. In Agriculture for economic development in Africa (pp. 23–45). Springer.
Kormelink, R., Garcia, M. L., Goodin, M., Sasaya, T., & Haenni, A. L. (2011). Negative-strand RNA viruses: The plant-infecting counterparts. Virus Research, 162, 184–202.
Savary, S., Ficke, A., Aubertot, J.-N., & Hollier, C. (2012). Crop losses due to diseases and their implications for global food production losses and food security. Food Security, 4(4), 519–537.
Shwetha, V., Bhagwat, A., & Laxmi, V. (2024). LeafSpotNet: A deep learning framework for detecting leaf spot disease in jasmine plants. Artificial Intelligence in Agriculture, 12, 1–18.
Singh, R. P., Hodson, D. P., Huerta-Espino, J., Bhavani, S., & Randhawa, M. S. (2015). Emergence and spread of new races of wheat stem rust fungus: Continued threat to food security and prospects of genetic control. Phytopathology, 105(7), 872–884.
Su, J., Lu, Y., Pan, S., et al. (2021). RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint. https://arxiv.org/abs/2104.09864
Sun, C., Li, Y., Song, Z., Liu, Q., Si, H., Yang, Y., & Cao, Q. (2025). Research on tomato disease image recognition method based on DeiT. European Journal of Agronomy, 162, 127400.
Sun, W., Wang, C., Wu, H., Miao, Y., Zhu, H., Guo, W., & Li, J. (2023). DFYOLOv5m-M2transformer: Interpretation of vegetable disease recognition results using image dense captioning techniques. Computers and Electronics in Agriculture, 215, 108460.
Suresh, K. R., Jarapala, A., & Sudeep, P. V. (2022). Image captioning encoder–decoder models using CNN-RNN architectures: A comparative study. Circuits, Systems, and Signal Processing, 41(10), 5719–5742.
Tewel, Y., Shalev, Y., Schwartz, I., & Wolf, L. (2021). ZeroCap: Zero-shot image-to-text generation for visual-semantic arithmetic. arXiv preprint. https://arxiv.org/abs/2111.14447
Ueda, A., Yang, W., & Sugiura, K. (2023). Switching text-based image encoders for captioning images with text. IEEE Access, 11, 55706–55715.
Wu, X., Zhang, J., Zou, Z., Chen, C., Yu, Y., Yu, P., Xiao, Y., Wang, Q., Kandegama, W. M. W. W., & Hao, G. (2026). PlantIF: Multimodal semantic interactive fusion via graph learning for plant disease diagnosis. Plant Phenomics, 8(1), 100132.
Xie, Z., Feng, Y., Hu, Y., & Liu, H. (2022). Generating image description of rice pests and diseases using a ResNet18 feature encoder. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 38(12), 197–206.
Xu, G., Niu, S., Tan, M., Luo, Y., & Du, Q. (2021). Towards accurate text-based image captioning with content diversity exploration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 16851–16860). IEEE/CVF.
Xu, L., Hu, Z., Zhou, D., Ren, H., Dong, Z., Keutzer, K., Ng, S. K., & Feng, J. (2023). MAgIC: Benchmarking large language model powered multi-agent in cognition, adaptability, rationality and collaboration. arXiv preprint. https://arxiv.org/abs/2311.08562
Yu, Y., Wu, X., Yu, P., Wan, Q., Dan, Y., Xiao, Y., & Wang, Q. (2025). Location-guided lesions representation learning via image generation for assessing plant leaf diseases severity. Plant Phenomics, 7(2), 100058.
Zeng, Q., Sun, J., & Wang, S. (2024). DIC-Transformer: Interpretation of plant disease classification results using image caption generation technology. Frontiers in Plant Science, 14, 1289765.
Zeng, Z., Zhang, H., Wang, Z., Lu, R., Wang, D., & Chen, B. (2023). ConZIC: Controllable zero-shot image captioning by sampling-based polishing. arXiv preprint. https://arxiv.org/abs/2303.02437
Zhang, L., Sun, L., Jin, X., Zhao, X., & Li, S. (2025). DAFFnet: Seed classification of soybean variety based on dual attention feature fusion networks. The Crop Journal, 13(2), 619–629.
Zhao, K., Wu, X., Xiao, Y., Jiang, S., Yu, P., Wang, Y., & Wang, Q. (2024). PlanText: Gradually masked guidance to align image phenotype with trait description for plant disease texts. Plant Phenomics, 6, 0272.
Zhou, K., Xie, C., Bai, Y., Zhang, Y., & Li, J. (2023). Hallucination in multimodal large language models: A survey. arXiv preprint. https://arxiv.org/abs/2311.07344