Unit - 2
Data and Data Science Life Cycle

Data
• Raw facts and figures.
• Research data: data that has been collected, observed, generated, or created to validate original research findings.
• Data are of two types:
I. Primary data
II. Secondary data

Data Collection
• Data collection is the process of gathering and measuring information on variables of interest in a systematic and organized manner. It is a fundamental step in many fields, including science, research, business, and government, as it provides the raw material for analysis, decision-making, and generating insights.

Importance of Data Collection
• Data collection should be conducted rigorously and systematically to ensure the reliability and validity of the collected information. The quality of the data directly impacts the quality of subsequent analyses and decisions. Additionally, ethical and legal considerations should always be taken into account when collecting and handling data, especially in cases involving human subjects or sensitive information.

Data Management
• Data management refers to the processes and activities involved in acquiring, storing, organizing, securing, and maintaining data to ensure its accuracy, reliability, and accessibility. Effective data management is crucial for organizations of all sizes, as it helps them make informed decisions, meet regulatory requirements, and optimize their operations.
• Data management is a comprehensive approach to handling data throughout its lifecycle, ensuring its quality, security, and usability while adhering to regulatory requirements. It is essential for organizations to derive meaningful insights and make informed decisions in a data-driven world.

Big Data Management
• Big data management refers to the strategies, processes, and technologies used to handle and derive value from large and complex datasets known as "big data."
• Big data is characterized by its volume, velocity, and variety, and managing it presents unique challenges and opportunities. Big data management is a complex and evolving field that encompasses various practices and technologies to effectively handle large and diverse datasets. Organizations that can successfully manage and analyze big data can gain valuable insights, make data-driven decisions, and gain a competitive advantage in today's data-centric world.

Data Sources
• Data can be obtained from various sources, depending on the type and purpose of the data you need:
• Surveys and questionnaires
• Government and public databases
• Websites
• Social media
• Books, journals, and publications
• Mobile apps
• Financial markets
• Medical records
• APIs (Application Programming Interfaces)
• etc.
• Remember that when collecting or using data, it is important to take ethical and legal considerations into account, including privacy regulations and data usage agreements. Additionally, data quality and accuracy should be assessed to ensure that the data is reliable for your intended purpose.

Importance of Data Quality
• Data quality is of paramount importance in many aspects of business, research, and decision-making.
• To ensure data quality, organizations should implement data quality management practices, establish data governance frameworks, and regularly audit and cleanse their data. Maintaining data quality is an ongoing process that requires attention and investment over time.

Dealing with Missing or Incomplete Data
• Dealing with missing or incomplete data is a common challenge in data analysis and machine learning. Missing data can occur for various reasons, such as data entry errors, equipment malfunctions, survey non-responses, or simply because some information was not collected. Handling missing data appropriately is crucial to ensure the accuracy and reliability of your analyses and models.
• Common strategies include:
• Identify missing or incomplete data
• Remove rows with missing data
• Understand the reasons for the missingness
• Sensitivity analysis, etc.
• Remember that the choice of how to handle missing data should be driven by the specific context of your analysis and the nature of the missingness. There is no one-size-fits-all solution, and the most appropriate approach may vary from one dataset to another.

Data Visualization
• Data visualization is the representation of data through common graphics such as charts, plots, infographics, and even animations. These visual displays of information communicate complex data relationships and data-driven insights in a way that is easy to understand.

Data Classification

Data Science Life Cycle
• A data science life cycle describes the iterative steps taken to build, deliver, and maintain any data science product. Not all data science projects are built the same, so their life cycles vary as well.
• Business Requirement - something the business needs to do or have in order to stay in business. For example, a business requirement can be a process the business must complete, or a piece of data it needs to use for that process.
• Data Acquisition - the process of sampling signals that measure real-world physical conditions and converting the resulting samples into digital numeric values that can be manipulated by a computer.
• Data Preparation - the process of preparing raw data so that it is suitable for further processing and analysis. Key steps include collecting, cleaning, and labeling raw data into a form suitable for machine learning (ML) algorithms, and then exploring and visualizing the data.
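The data-preparation and missing-data strategies described above (identify, remove, or impute) can be sketched in plain Python; the survey records below are hypothetical, and mean imputation is just one of many possible choices:

```python
from statistics import mean

# Hypothetical survey records; None marks a missing (non-response) value.
records = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 31, "income": None},
    {"age": 40, "income": 61000},
]
fields = ("age", "income")

# 1. Identify missing data: count None values per field.
missing_counts = {f: sum(r[f] is None for r in records) for f in fields}

# 2. One option: remove rows that have any missing value (listwise deletion).
complete_rows = [r for r in records if all(r[f] is not None for f in fields)]

# 3. Another option: impute each missing value with the field's mean.
means = {f: mean(r[f] for r in records if r[f] is not None) for f in fields}
imputed = [{f: (r[f] if r[f] is not None else means[f]) for f in fields}
           for r in records]
```

Deletion keeps only fully observed rows (here, two of four), while imputation preserves every row at the cost of introducing estimated values; as noted above, which trade-off is appropriate depends on the dataset and the nature of the missingness.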
• Hypothesis and Modeling - a hypothesis is a basic idea that has not yet been tested; it is simply an idea that explains something, and it must go through a number of experiments designed to prove or disprove it. A hypothesis becomes a model after some testing has been done and it appears to be a valid observation.
• Evaluation and Interpretation - interpretation is the action of explaining the meaning of something, like the interpretation of the constitution of your country. Evaluation is the making of a judgment about the amount, number, or value of something (assessment), like evaluating the price of a used car.
• Deployment - model deployment is the process of putting machine learning models into production. This makes the model's predictions available to users, developers, or systems, so they can make business decisions based on data, interact with their application (for example, recognize a face in an image), and so on.
• Operations - DataOps (data operations) is an agile, process-oriented methodology for developing and delivering analytics. It brings together DevOps teams with data engineers and data scientists to provide the tools, processes, and organizational structures that support the data-focused enterprise.
• Optimization - a problem where you maximize or minimize a real function by systematically choosing input values from an allowed set and computing the value of the function. When we talk about optimization, we are always interested in finding the best solution.
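As a small illustration of the evaluation step above, a model's predictions can be scored against the true labels; accuracy is one simple metric, and the labels below are made up for the example:

```python
# Hypothetical true labels and binary-classifier predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: the fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)  # 6 of 8 correct -> 0.75
```

In practice, evaluation would also consider metrics such as precision, recall, or error on held-out test data, but the principle is the same: judge the model's output against known values.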
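The optimization step above (systematically choosing input values to minimize a function) can be sketched with gradient descent on a simple quadratic; the function, starting point, and step size here are illustrative choices:

```python
# Minimize f(x) = (x - 3)^2 by gradient descent.
# Its gradient is f'(x) = 2 * (x - 3), and the true minimum is at x = 3.
def grad(x):
    return 2 * (x - 3)

x = 0.0    # starting guess
lr = 0.1   # learning rate (step size)
for _ in range(200):
    x -= lr * grad(x)  # step in the direction that decreases f

# x has now converged very close to the minimizer 3.0
```

Each iteration moves x against the gradient, so f(x) decreases until x settles at the minimum; the same idea, in many dimensions, underlies the training of most machine learning models.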