1.4-EDA.ipynb
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## EDA - Exploratory Data Analysis\n", "\n", "The selected dataset records population growth by country over the years 1952 to 2007.\n", "\n", "#### Data Dictionary\n", "\n", "| Variable | Data type | Definition |\n", "|----------------------|----------------|-------------|\n", "| country | String | Country where the records originated |\n", "| year | Integer | Year in which the population count was taken |\n", "| population | Integer | Population count |\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: Import the libraries" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Library for mathematical and statistical operations\n", "import numpy as np\n", "# Library for data handling\n", "import pandas as pd\n", "# Library for plotting\n", "import matplotlib.pyplot as plt" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: Load the data into a DataFrame" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " country year population\n", "0 Afghanistan 1952.0 8425333.0\n", "1 Afghanistan 1957.0 9240934.0\n", "2 Afghanistan 1962.0 10267083.0\n", "3 Afghanistan 1967.0 11537966.0\n", "4 Afghanistan 1972.0 13079460.0\n" ] } ], "source": [ "# Read the flat file and load it into a DataFrame\n", "df = pd.read_csv(\"data/1.4-EDA.csv\")\n", "# Print the first 5 records\n", "print(df.head(5))" ]},
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3: Explore the data" ]},
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rows, Columns\n", "(1704, 3)\n" ] } ], "source": [ "# Print the number of rows and columns\n", "print(\"Rows, Columns\")\n", "print(df.shape)" ]},
{ "cell_type": 
"code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Column NaN count\n", "country 0\n", "year 72\n", "population 72\n", "dtype: int64\n" ] } ], "source": [ "# Count the NaN values in each column of the DataFrame\n", "print(\"Column NaN count\")\n", "print(df.isnull().sum(axis=0))" ]},
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rows, Columns\n", "(1632, 3)\n" ] } ], "source": [ "# Drop the rows that contain NaN values, since incomplete records add noise\n", "data = df.dropna()\n", "# Print the number of rows and columns\n", "print(\"Rows, Columns\")\n", "print(data.shape)" ]},
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>population</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>count</th>\n", " <td>1632.000000</td>\n", " <td>1.632000e+03</td>\n", " </tr>\n", " <tr>\n", " <th>mean</th>\n", " <td>1979.500000</td>\n", " <td>3.014837e+07</td>\n", " </tr>\n", " <tr>\n", " <th>std</th>\n", " <td>17.265553</td>\n", " <td>1.083943e+08</td>\n", " </tr>\n", " <tr>\n", " <th>min</th>\n", " <td>1952.000000</td>\n", " <td>6.001100e+04</td>\n", " </tr>\n", " <tr>\n", " <th>25%</th>\n", " <td>1965.750000</td>\n", " <td>2.748356e+06</td>\n", " </tr>\n", " <tr>\n", " <th>50%</th>\n", " <td>1979.500000</td>\n", " <td>6.962964e+06</td>\n", " </tr>\n", " <tr>\n", " <th>75%</th>\n", " <td>1993.250000</td>\n", " <td>1.859411e+07</td>\n", " </tr>\n", " <tr>\n", " <th>max</th>\n", " <td>2007.000000</td>\n", " <td>1.318683e+09</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " year population\n", "count 1632.000000 1.632000e+03\n", "mean 1979.500000 3.014837e+07\n", "std 17.265553 1.083943e+08\n", "min 1952.000000 6.001100e+04\n", "25% 1965.750000 2.748356e+06\n", "50% 1979.500000 6.962964e+06\n", "75% 1993.250000 1.859411e+07\n", "max 2007.000000 1.318683e+09" ]},"execution_count": 7,"metadata": {}, "output_type": "execute_result" } ], "source": [ "# Summary statistics of the data (min, max, mean, standard deviation, median)\n", "data.describe()" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " country year population\n", "1416 Spain 1952.0 28549870.0\n", "1417 Spain 1957.0 29841614.0\n", "1418 Spain 1962.0 31158061.0\n", "1419 Spain 1967.0 32850275.0\n", "1420 Spain 1972.0 34513161.0\n" ] } ], "source": [ "# Filter the records for Spain and print the first five\n", "data_espana = data[data['country'] == 'Spain']\n", "print(data_espana.head())" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<AxesSubplot:>" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png":"", "text/plain": [ "<Figure size 432x288 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Plot a bar chart of the data for Spain\n", "data_espana.plot(kind='bar')" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<AxesSubplot:>" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png":"", "text/plain": [ "<Figure size 432x288 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Now compare the population growth of Spain and Argentina\n", "data_argentina = data[data['country'] == 'Argentina']\n", "\n", "# Use the years as the x-axis labels\n", "anios = data_espana['year'].unique()\n", "# Get the population values\n", "# (this assumes both countries kept the same years after dropna)\n", "poblacion_espana = data_espana['population'].values\n", "poblacion_argentina = data_argentina['population'].values\n", "\n", "# Plot a bar chart of the population of Argentina and Spain\n", "data_grafica = pd.DataFrame({'Argentina': poblacion_argentina, 'Spain': poblacion_espana}, index=anios)\n", "data_grafica.plot(kind='bar')" ] } ],
"metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 4}
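
The last code cell of the notebook builds its comparison table from two `.values` arrays, which only works if both countries kept exactly the same years after `dropna()`. A minimal sketch of an alternative that aligns the countries on `year` with a pivot is shown below. The helper name `population_by_year` and the toy data are assumptions made for illustration and do not come from the notebook or its CSV file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's data; these numbers are invented for illustration.
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [1952.0, 1957.0, 1962.0, 1952.0, np.nan, 1962.0],
    "population": [100.0, 110.0, 120.0, 200.0, np.nan, 240.0],
})


def population_by_year(frame, countries):
    """Return a year-indexed table with one population column per country.

    Hypothetical helper: rows are aligned on the year label, not on position.
    """
    # Drop incomplete records, as the notebook does with dropna()
    clean = frame.dropna(subset=["year", "population"])
    # Keep only the requested countries
    clean = clean[clean["country"].isin(countries)]
    # Years are loaded as floats because of the NaN values; cast them back to int
    clean = clean.astype({"year": int})
    # One row per year, one column per country
    return clean.pivot(index="year", columns="country", values="population")


wide = population_by_year(df, ["A", "B"])
print(wide)
```

With the notebook's own data, calling the helper with `['Argentina', 'Spain']` and then `.plot(kind='bar')` on the result should draw the same grouped bars. A year missing for one country would show up as a gap (NaN) instead of raising an error or shifting values to the wrong year.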