Copy Files From Source Repo (2024-08-02 19:52)
olprod committed Aug 3, 2024
1 parent aaf8e83 commit 1a5083b
Showing 2 changed files with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions Instructions/Labs/02-analyze-spark.md
@@ -91,7 +91,7 @@ Apache Spark is an open-source engine for distributed data processing, widely used
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1 | SO43704 | 1 | 2019-07-01 | Julio Ruiz | [email protected] | Mountain-100 Black, 48 | 1 | 3374.99 | 269.9992 |
| 2 | SO43705 | 1 | 2019-07-01 | Curtis Lu | [email protected] | Mountain-100 Silver, 38 | 1 | 3399.99 | 271.9992 |
- | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
+ | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] |
The output shows the rows and columns of data in the 2019.csv file. Notice, however, that the column headers don't look right. The default code used to load the data into a dataframe assumes that the first row of the CSV file contains the column names, but in this case the CSV file contains only the data, with no header information.
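A minimal sketch of how the header option controls this behavior, assuming a notebook where `spark` is predefined and a hypothetical file path of Files/orders/2019.csv (not taken from the lab file itself):

```Python
# Setting header to "false" stops Spark from treating the first data row as column names.
# Path and options here are illustrative assumptions.
df = spark.read.format("csv").option("header", "false").load("Files/orders/2019.csv")
display(df)
```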
@@ -110,7 +110,7 @@ Apache Spark is an open-source engine for distributed data processing, widely used
| 1 | SO43701 | 11 | 2019-07-01 | Christy Zhu | [email protected] | Mountain-100 Silver, 44 | 16 | 3399.99 | 271.9992 |
| 2 | SO43704 | 1 | 2019-07-01 | Julio Ruiz | [email protected] | Mountain-100 Black, 48 | 1 | 3374.99 | 269.9992 |
| 3 | SO43705 | 1 | 2019-07-01 | Curtis Lu | [email protected] | Mountain-100 Silver, 38 | 1 | 3399.99 | 271.9992 |
- | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
+ | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] |
Now the dataframe correctly includes the first row as data values, but the column names are auto-generated and not very useful. To make sense of the data, you need to explicitly define the correct schema and data types for the data values in the file.
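One way to do this is to declare a schema up front and pass it to the reader. The sketch below is based on the columns shown in the sample output; the field names, types, and path are assumptions for illustration, not copied from the lab file:

```Python
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Hypothetical schema matching the columns shown above; adjust names and types to the real file.
orderSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
])

# With an explicit schema the file is still read without a header row.
df = spark.read.format("csv").schema(orderSchema).option("header", "false").load("Files/orders/2019.csv")
display(df)
```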
@@ -142,7 +142,7 @@ Apache Spark is an open-source engine for distributed data processing, widely used
| 1 | SO43701 | 11 | 2019-07-01 | Christy Zhu | [email protected] | Mountain-100 Silver, 44 | 16 | 3399.99 | 271.9992 |
| 2 | SO43704 | 1 | 2019-07-01 | Julio Ruiz | [email protected] | Mountain-100 Black, 48 | 1 | 3374.99 | 269.9992 |
| 3 | SO43705 | 1 | 2019-07-01 | Curtis Lu | [email protected] | Mountain-100 Silver, 38 | 1 | 3399.99 | 271.9992 |
- | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
+ | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] | [.] |
Now the dataframe includes the correct column names (in addition to Index, which is a built-in column in all dataframes based on the ordinal position of each row). The data types of the columns are specified using a standard set of types defined in the Spark SQL library, which were imported at the start of the cell.
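To confirm which names and types were applied, the dataframe schema can be printed; a one-line check, assuming `df` is the dataframe from the previous cell:

```Python
# Print the declared column names and data types for verification.
df.printSchema()
```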
@@ -218,7 +218,7 @@ Apache Spark is an open-source engine for distributed data processing, widely used
```Python
from pyspark.sql.functions import *
- yearlySales = df.select(year(col("OrderDate")).alias("Year")).groupBy("Year").count().orderBy("Year")
+ yearlySales = df.select(year("OrderDate").alias("Year")).groupBy("Year").count().orderBy("Year")
display(yearlySales)
```
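For comparison, the same yearly count can be expressed in Spark SQL; this sketch assumes `df` holds the order data, and the temporary view name salesorders is illustrative:

```Python
# Register the dataframe as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("salesorders")

# Equivalent of the select/groupBy/count/orderBy chain above.
yearlySalesSql = spark.sql("""
    SELECT YEAR(OrderDate) AS Year, COUNT(*) AS count
    FROM salesorders
    GROUP BY YEAR(OrderDate)
    ORDER BY Year
""")
display(yearlySalesSql)
```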
Binary file modified Instructions/Labs/Images/star-schema-diagram.png
