Commit: Copy Files From Source Repo (2024-08-02 19:52)
Showing 2 changed files with 4 additions and 4 deletions.
@@ -91,7 +91,7 @@ Apache Spark is an open-source engine for distributed data processing, widely used
 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
 | 1 | SO43704 | 1 | 2019-07-01 | Julio Ruiz | [email protected] | Mountain-100 Black, 48 | 1 | 3374.99 | 269.9992 |
 | 2 | SO43705 | 1 | 2019-07-01 | Curtis Lu | [email protected] | Mountain-100 Silver, 38 | 1 | 3399.99 | 271.9992 |
-| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
+| [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] |
 The output shows the rows and columns of data in the 2019.csv file. Notice, however, that the column headers do not look right. The default code used to load the data into a dataframe assumes that the first row of the CSV file contains the column names, but in this case the file contains only data, with no header information.
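The header pitfall described above can be sketched with Python's standard-library csv module; the two-row sample here is hypothetical, standing in for the lab's 2019.csv:

```python
import csv
import io

# Hypothetical sample mirroring 2019.csv: data rows only, no header line.
raw = ("SO43704,2019-07-01,Julio Ruiz,Mountain-100 Black 48,3374.99\n"
       "SO43705,2019-07-01,Curtis Lu,Mountain-100 Silver 38,3399.99\n")

rows = list(csv.reader(io.StringIO(raw)))

# Assuming the first row is a header (the default behavior the lab describes)
# silently turns the first order into "column names":
bogus_header = rows[0]
remaining = rows[1:]

print(bogus_header[0])   # SO43704 -- an order number, not a column name
print(len(remaining))    # 1 -- one data row has been lost to the header
```

This is exactly why the output above shows order values where column names should be.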
@@ -110,7 +110,7 @@ Apache Spark is an open-source engine for distributed data processing, widely used
 | 1 | SO43701 | 11 | 2019-07-01 | Christy Zhu | [email protected] | Mountain-100 Silver, 44 | 16 | 3399.99 | 271.9992 |
 | 2 | SO43704 | 1 | 2019-07-01 | Julio Ruiz | [email protected] | Mountain-100 Black, 48 | 1 | 3374.99 | 269.9992 |
 | 3 | SO43705 | 1 | 2019-07-01 | Curtis Lu | [email protected] | Mountain-100 Silver, 38 | 1 | 3399.99 | 271.9992 |
-| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
+| [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] |
 Now the dataframe correctly includes the first row as data values, but the column names are auto-generated and not very useful. To make sense of the data, you need to explicitly define the correct schema and data types for the data values in the file.
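A schema, at its core, is a mapping from column name to data type. As a rough standard-library analogy (the column names and casters below are illustrative, not the lab's actual StructType definition), it can be sketched like this:

```python
from datetime import date

# Illustrative schema: (column name, caster) pairs standing in for the
# StructType/StructField definition a PySpark notebook would use.
schema = [
    ("SalesOrderNumber", str),
    ("OrderDate", date.fromisoformat),
    ("Item", str),
    ("UnitPrice", float),
]

def apply_schema(row):
    # Pair each raw string value with its declared name and cast it to its type.
    return {name: cast(value) for (name, cast), value in zip(schema, row)}

record = apply_schema(["SO43701", "2019-07-01", "Mountain-100 Silver, 44", "3399.99"])
print(record["OrderDate"].year)   # 2019 -- a real date, not a string
print(record["UnitPrice"])        # 3399.99 -- a float, ready for arithmetic
```

With an explicit schema, every value arrives under a meaningful name and with a usable type instead of as an anonymous string.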
@@ -142,7 +142,7 @@ Apache Spark is an open-source engine for distributed data processing, widely used
 | 1 | SO43701 | 11 | 2019-07-01 | Christy Zhu | [email protected] | Mountain-100 Silver, 44 | 16 | 3399.99 | 271.9992 |
 | 2 | SO43704 | 1 | 2019-07-01 | Julio Ruiz | [email protected] | Mountain-100 Black, 48 | 1 | 3374.99 | 269.9992 |
 | 3 | SO43705 | 1 | 2019-07-01 | Curtis Lu | [email protected] | Mountain-100 Silver, 38 | 1 | 3399.99 | 271.9992 |
-| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
+| [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] |
 Now the dataframe includes the correct column names (in addition to Index, a built-in column in all dataframes based on the ordinal position of each row). The data types of the columns are specified using a standard set of types defined in the Spark SQL library, which were imported at the start of the cell.
@@ -218,7 +218,7 @@ Apache Spark is an open-source engine for distributed data processing, widely used
 ```Python
 from pyspark.sql.functions import *
-yearlySales = df.select(year(col("OrderDate")).alias("Year")).groupBy("Year").count().orderBy("Year")
+yearlySales = df.select(year("OrderDate").alias("Year")).groupBy("Year").count().orderBy("Year")
 display(yearlySales)
 ```
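The pipeline in the cell above (extract the year from each order date, group by year, count, order by year) can be mimicked with the standard library; the order dates here are made up for illustration:

```python
from collections import Counter
from datetime import date

# Made-up order dates standing in for the dataframe's OrderDate column.
order_dates = [date(2019, 7, 1), date(2019, 7, 1), date(2020, 1, 15), date(2021, 3, 2)]

# year(...) -> groupBy("Year") -> count() -> orderBy("Year"), stdlib style:
yearly_sales = sorted(Counter(d.year for d in order_dates).items())
print(yearly_sales)  # [(2019, 2), (2020, 1), (2021, 1)]
```

The result is one (year, order count) pair per year, sorted ascending, which is the same shape of summary the PySpark cell displays.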