Data exploration
Data exploration is an approach similar to initial data analysis, in which a data analyst uses visual exploration, rather than traditional data management systems, to understand what is in a dataset and the characteristics of the data.[1] These characteristics can include the size or amount of data, its completeness, its correctness, and possible relationships among data elements or among the files or tables in the data.
Data exploration is typically conducted using a combination of automated and manual activities.[1][2][3] Automated activities can include data profiling or data visualization or tabular reports to give the analyst an initial view into the data and an understanding of key characteristics.[1]
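As an illustration of automated data profiling, the following is a minimal sketch using only the Python standard library. The function name `profile`, the sample records, and the choice of statistics are assumptions made for this example and do not come from any particular tool.

```python
from statistics import mean

def profile(rows):
    """Return per-column summary statistics for a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing values (an assumption).
        present = [v for v in values if v not in (None, "")]
        numeric = [v for v in present
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        stats = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        # Add numeric statistics only when every present value is a number.
        if present and len(numeric) == len(present):
            stats.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
        report[col] = stats
    return report

# Hypothetical sample data with one incomplete record.
sample = [
    {"id": 1, "city": "Oslo", "temp_c": 4.0},
    {"id": 2, "city": "Lima", "temp_c": 19.0},
    {"id": 3, "city": "", "temp_c": None},
    {"id": 4, "city": "Oslo", "temp_c": 7.0},
]
report = profile(sample)
for column, stats in report.items():
    print(column, stats)
```

A report like this gives a first view of size, completeness, and value ranges before any manual inspection.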
This is often followed by manual drill-down or filtering of the data to investigate anomalies or patterns surfaced by the automated steps. Data exploration can also require manual scripting and queries against the data (e.g. using languages such as SQL or R), or the use of spreadsheets or similar tools to view the raw data.[4]
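The drill-down step can be sketched with SQL queries issued from Python through the standard `sqlite3` module. The `orders` table, its columns, and the threshold of 1000 are hypothetical and chosen only to show an overview query followed by a filter on the suspicious group.

```python
import sqlite3

# Build a small in-memory table of hypothetical order data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "north", 120.0), (2, "north", 95.0), (3, "south", 110.0),
     (4, "south", 4300.0), (5, "south", 105.0)],
)

# Overview: one summary row per region.
summary = conn.execute(
    "SELECT region, COUNT(*), AVG(amount), MAX(amount) "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()

# Drill-down: the south region has an unusually large maximum,
# so list the individual rows above an assumed threshold.
suspects = conn.execute(
    "SELECT id, amount FROM orders "
    "WHERE region = ? AND amount > ? ORDER BY id",
    ("south", 1000),
).fetchall()
conn.close()

print(summary)
print(suspects)
```

The overview query points to where to look; the filtered query then shows the individual records behind the aggregate.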
All of these activities are aimed at creating a mental model and understanding of the data in the mind of the analyst, and defining basic metadata (statistics, structure, relationships) for the data set that can be used in further analysis.[1]
Once this initial understanding has been reached, the data can be pruned or refined by removing unusable parts (data cleansing), correcting poorly formatted elements, and defining relevant relationships across datasets.[2] This process is also known as determining data quality.[4]
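A minimal sketch of such a cleansing pass is shown below. The field names `name` and `signup`, the two accepted date formats, and the rule of rejecting records with a missing name or unparseable date are all assumptions for illustration.

```python
from datetime import datetime

# Assumed input date formats; real data may need others.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def parse_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def cleanse(rows):
    """Split rows into cleaned records and rejected (unusable) records."""
    cleaned, rejected = [], []
    for row in rows:
        name = (row.get("name") or "").strip()
        date = parse_date(row.get("signup") or "")
        if not name or date is None:
            rejected.append(row)  # unusable: missing name or bad date
            continue
        # Collapse repeated whitespace and normalize capitalization.
        cleaned.append({"name": " ".join(name.split()).title(), "signup": date})
    return cleaned, rejected

# Hypothetical raw records with formatting problems.
raw = [
    {"name": "  alice   smith ", "signup": "2021-03-05"},
    {"name": "bob jones", "signup": "23/06/2021"},
    {"name": "", "signup": "2020-01-01"},
    {"name": "carol white", "signup": "not a date"},
]
cleaned, rejected = cleanse(raw)
print(cleaned)
print(len(rejected), "rows rejected")
```

Keeping the rejected rows, rather than silently dropping them, lets the analyst review what was removed and why.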
Data exploration can also refer to the ad hoc querying or visualization of data to identify potential relationships or insights that may be hidden in the data; it does not require assumptions to be formulated beforehand.[1]
Traditionally, data exploration was a key area of focus for statisticians, with John Tukey a prominent advocate of the field.[5] Today it is more widespread and is a focus of data analysts and data scientists, the latter being a relatively new role within enterprises and larger organizations.
Interactive Data Exploration
Interactive data exploration has become an area of interest in machine learning. The field is relatively new and still evolving.[4] At its most basic level, a machine-learning algorithm can be fed a dataset and used to test whether a hypothesis is supported by that data. Common machine learning algorithms focus on identifying specific patterns in the data.[2] Common tasks include regression, classification, and clustering, but many other patterns and algorithms can be applied to data via machine learning.
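As one example of pattern finding, clustering can be sketched with a plain k-means loop (Lloyd's algorithm) in standard-library Python. This is an illustrative sketch, not the method of any cited work: the points are made up, and seeding the centroids with the first k points is a simplification that works here only because the first two points lie in different groups.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm), seeded with the first k points."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D points forming two visually separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

In practice, a library implementation with more careful initialization would usually be preferred; the loop above is only meant to show the assignment and update steps.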
By employing machine learning, it is possible to find patterns or relationships in the data that would be difficult or impossible to find via manual inspection, trial and error or traditional exploration techniques.[6]
Software
- Trifacta – a data preparation and analysis platform
- Paxata – self-service data preparation software
- Alteryx – data blending and advanced data analytics software
- Microsoft Power BI – interactive visualization and data analysis tool
- OpenRefine – a standalone open source desktop application for data clean-up and data transformation
- Tableau software – interactive data visualization software
References
- Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri. Overview of Data Exploration Techniques. FOSTER Open Science (archived 2023-06-25 at the Wayback Machine).
- Kandel, Paepcke, Hellerstein, Heer. Wrangler: Interactive Visual Specification of Data Transformation Scripts. 2011. Stanford.edu.
- Arnab Nandi, H. V. Jagadish. Guided Interaction: Rethinking the Query-Result Paradigm (PDF). International Conference on Very Large Data Bases (VLDB), 2011.
- Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer. Enterprise Data Analysis and Visualization: An Interview Study. Proc. IEEE Visual Analytics Science & Technology (VAST), Oct 2012. Stanford.edu.
- Exploratory Data Analysis. Pearson. ISBN 978-0201076165.
- Machine Learning for Data Exploration.