BazEkon - Biblioteka Główna Uniwersytetu Ekonomicznego w Krakowie

Authors
Hampton Peter John (Ulster University, Jordanstown, United Kingdom), Blackburn William (Ulster University, Jordanstown, United Kingdom), Wang Hui (Ulster University, Jordanstown, United Kingdom)
Title
The Serialization of Heterogeneous Documents
Source
Annals of Computer Science and Information Systems, 2015, vol. 6, pp. 25-30, figures, tables, 15 references
Keywords
Text mining, Information science
Notes
Includes a summary.
Abstract
Tasks involving the analysis of natural language are typically conducted on a corpus or corpora of plain text. However, it is rare that a document is unstructured and freeform in its entirety. Documents such as corporate disclosures, medical journals and other knowledge-rich archives contain structured and loosely structured information that can be used in a variety of important text mining tasks. In this paper we propose a syntactical preprocessing architecture to serialize presentation-oriented documents to a machine-readable format that aspires to preserve the document structure, contents and metadata. We introduce a hybrid pipeline architecture, discussing the various processes and the future research direction that could potentially lead to a holistic representation of heterogeneous documents. (original abstract)
Bibliography
  1. Comeau, D. C., Liu, H., Doğan, R. I., & Wilbur, W. J. (2014). Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus.
  2. Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55-60).
  3. Liu, M., Xu, W., Ran, Q., & Li, Y. (2015). Using Natural Language Processing Technology to Analyze Teachers' Written Feedback on Chinese Students' English Essays.
  4. Douglas, S., Hurst, M., & Quinn, D. (1995). Using natural language processing for identifying and interpreting tables in plain text. In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval (pp. 535-546).
  5. Clark, C., & Divvala, S. Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers.
  6. Ding, L., Zhou, L., Finin, T., & Joshi, A. (2005). How the semantic web is being used: An analysis of foaf documents. In System Sciences, 2005. HICSS'05. Proceedings of the 38th Annual Hawaii International Conference on (pp. 113c-113c). IEEE.
  7. Li, X., Li, F., & Chen, X. (2015, April). Distributed GIS framework design based on XML and Web Service. In 2015 International Conference on Intelligent Systems Research and Mechatronics Engineering. Atlantis Press.
  8. Hwang, C. G., Yoon, C. P., & Lee, D. (2015). Exchange of Data for Big Data in Hybrid Cloud Environment.
  9. Niu, Z., Yang, C., & Zhang, Y. (2014). A design of cross-terminal web system based on JSON and REST. In Software Engineering and Service Science (ICSESS), 2014 5th IEEE International Conference on (pp. 904-907). IEEE.
  10. Smith, B. (2015). Creating JSON. In Beginning JSON (pp. 49-67). Apress.
  11. Ben-Kiki, O., Evans, C., & Ingerson, B. (2005). YAML Ain't Markup Language (YAML™) Version 1.1. yaml.org, Tech. Rep.
  12. Eriksson, M., & Hallberg, V. (2011). Comparison between JSON and YAML for data serialization. The School of Computer Science and Engineering Royal Institute of Technology.
  13. Bird, S. (2006). NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions (pp. 69-72). Association for Computational Linguistics.
  14. Smutz, C., & Stavrou, A. (2012, December). Malicious PDF detection using metadata and structural features. In Proceedings of the 28th Annual Computer Security Applications Conference (pp. 239-248). ACM.
  15. Khusro, S., Latif, A., & Ullah, I. (2014). On methods and tools of table detection, extraction and annotation in PDF documents. Journal of Information Science.
ISSN
2300-5963
Language
English
URI / DOI
http://dx.doi.org/10.15439/2015F380