Refactor(RawData): use spark to generate raw data (#825)
* refactor(raw_data): refactor raw data with spark
* fix(raw_data): fix k8s spark logic
* polish(raw_data): polish some variable names
* fix(raw_data): use yaml configuration for spark
* fix(raw_data): move common dependencies out
* feat(raw_data): add zip script
* feat(rawdata_v2): support k8s config
* fix(raw_data): fix spark k8s configuration
* fix(raw_data): use hdfs k8s config
* fix(raw_data): fix master script for rawdata
* fix(raw_data): add flatten_dict package
* fix(raw_data): remove blank lines
* feat(rawdata_v2): support long-running mode
* feat(rawdata_v2): abstract spark application class
* fix(rawdata_v2): fix long-running bug
* feat(rawdata_v2): output gzip compression type by default
* feat(rawdata_v2): use local jars for spark job
* feat(rawdata_v2): parameterize spark image
* remove(rawdata): remove unused old rawdata code
* fix(rawdata_v2): remove schema inferring
* fix(rawdata_v2): add dirty input checking
* fix(rawdata): switch master and worker function
* fix: make output partition num not required
* fix(rawdata_v2): fix status of spark
* feat: no compression for data block
* fix: fix k8s client bug
* fix(rawdata): support kvstore type
* refactor(rawdata): use spark api of webconsole
* feat: add progress logging
* fix: fix response processing
* fix: fix webconsole spark api calling
* fix: support nas filesystem
* feat: support csv format for input/output
* fix: add output_data_format for deploy script
* fix: raw_data partition_id is wrong
* fix: format input data
* feat: filter files starting with `.`
* fix: add validation for input data
* feat: rawdata support filter by datetime
* fix: fix typo
* fix: print spark log when job failed
* fix: fix wildcard bug of bash script
* fix: remove unused code
* feat: support oss
* feat: rawdata support aliyun oss
* fix: use etcd kvstore by default
* fix: get datetime of data block
* fix: replace CSV with CSV_DICT
* feat(raw_data): add spark speculation config
* feat: remove unused keys in rawdata schema
* fix: polish code
* feat(rawdata): support multiple input paths
* fix(raw_data): summary for input data manager
* fix: data-block output support multiple input dirs
* fix: polish spark api code
* fix: relaunch spark when in unknown state
* fix(rawdata): add master script for compatibility
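Several of the commits above touch output partitioning ("raw_data partition_id is wrong", "make output partition num not required"). The project likely hashes an example key with CityHash (note `cityhash` in the image's requirements), but the general key-to-partition scheme can be sketched with the stdlib alone. Everything here — the function name, the MD5 stand-in hash — is an illustrative assumption, not the repository's actual code:

```python
import hashlib

def partition_id(example_id: str, num_partitions: int) -> int:
    """Deterministically map an example id to an output partition.

    Sketch only: the real pipeline may use CityHash; MD5 from the
    stdlib is used here because it is stable across processes
    (unlike Python's builtin hash(), which is salted per run).
    """
    digest = hashlib.md5(example_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then bucket.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket % num_partitions

# Every worker computes the same partition for the same id.
pid = partition_id("example-0001", 4)
assert 0 <= pid < 4
```

The point of a process-independent hash is that the Spark driver, its executors, and any later reader all agree on which partition holds a given id.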
Showing 32 changed files with 2,583 additions and 2,072 deletions.
New file (33 lines): a zip helper script — its usage string names it `zip.py`.
```python
import os
import sys
import zipfile


def add_to_zip(zf, path):
    """Recursively add a file or directory tree to an open ZipFile."""
    if os.path.isdir(path):
        for nm in os.listdir(path):
            add_to_zip(zf, os.path.join(path, nm))
    else:  # regular file
        zf.write(path)


def main(args=None):
    import textwrap
    usage = textwrap.dedent("""\
        Usage:
            zip.py zipfile.zip src ...    # Create zipfile from sources
        """)
    if args is None:
        args = sys.argv[1:]

    # Require at least an archive name and one source. (The original
    # checked `len(args) != 2`, which rejected the multi-source
    # "src ..." form the usage string advertises.)
    if len(args) < 2:
        print(usage)
        sys.exit(1)

    with zipfile.ZipFile(args[0], 'w') as zf:
        for path in args[1:]:
            add_to_zip(zf, path)


if __name__ == "__main__":
    main()
```
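A quick way to sanity-check the helper above: build a small directory tree, zip it, and list the archive. This sketch inlines the same `add_to_zip` recursion rather than importing the script (the script's on-disk filename is not shown in this diff view); the directory and file names are made up for the demonstration:

```python
import os
import tempfile
import zipfile

def add_to_zip(zf, path):
    # Same recursion as the script above: directories are walked,
    # plain files are written under their given path.
    if os.path.isdir(path):
        for nm in os.listdir(path):
            add_to_zip(zf, os.path.join(path, nm))
    else:
        zf.write(path)

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "pkg"))
    with open(os.path.join(tmp, "pkg", "a.txt"), "w") as f:
        f.write("hello")
    cwd = os.getcwd()
    os.chdir(tmp)  # zf.write() stores the path as given, so use relative paths
    try:
        with zipfile.ZipFile("out.zip", "w") as zf:
            add_to_zip(zf, "pkg")
        with zipfile.ZipFile("out.zip") as zf:
            names = zf.namelist()
    finally:
        os.chdir(cwd)

print(names)
```

Because `ZipFile.write` records the path exactly as passed, the script archives whatever paths it is given relative to the current working directory — a property the repository's deploy scripts presumably rely on when bundling dependencies for Spark.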
New file (29 lines): a Dockerfile for the Spark image.
```dockerfile
FROM registry.cn-beijing.aliyuncs.com/fedlearner/spark-py:v3.0.0
LABEL maintainer="fedlearner <[email protected]>"

USER root
ARG DEBIAN_FRONTEND=noninteractive

RUN mkdir -p /usr/share/man/man1/ && apt-get --allow-releaseinfo-change update && apt install -y software-properties-common
RUN apt-add-repository 'deb http://security.debian.org/debian-security stretch/updates main' && \
    apt-get --allow-releaseinfo-change update
RUN apt install -y maven openjdk-8-jdk git \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/tensorflow/ecosystem.git /opt/ecosystem

ENV ROOT_DIR /opt/ecosystem
ENV SPARK_HOME /opt/spark
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
ENV PATH ${JAVA_HOME}/bin:${PATH}
ENV PYSPARK_PYTHON=/usr/bin/python3
ENV PYSPARK_DRIVER_PYTHON=/usr/bin/python3

# NOTE: scala version is 2.12
RUN cd ${ROOT_DIR}/hadoop && mvn versions:set -DnewVersion=1.15.0 && mvn clean install -DskipTests && cp target/tensorflow-hadoop-1.15.0.jar ${SPARK_HOME}/jars/
RUN cd ${ROOT_DIR}/spark/spark-tensorflow-connector && mvn versions:set -DnewVersion=1.15.0 && mvn clean install -DskipTests && cp target/spark-tensorflow-connector_2.12-1.15.0.jar ${SPARK_HOME}/jars/ \
    && rm -rf /opt/ecosystem

COPY requirements.txt /opt/env/requirements.txt
RUN pip3 install -U pip -i https://pypi.tuna.tsinghua.edu.cn/simple \
    && pip3 install -r /opt/env/requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
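One of the commits above parameterizes the Spark image; with Spark on Kubernetes, an image like the one built here reaches the job through the standard `spark.kubernetes.container.image` conf key. Below is a minimal sketch of assembling such a submit configuration — plain dict-building, no Spark installation needed. The function name and app name are illustrative assumptions; the conf keys themselves are standard Spark/Hadoop ones:

```python
def build_spark_conf(image: str, app_name: str, num_executors: int = 2) -> dict:
    """Assemble a spark-on-k8s submit configuration (sketch)."""
    return {
        "spark.app.name": app_name,
        # The parameterized image: whatever tag the deploy script passes in.
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(num_executors),
        # Mirrors the commit that made gzip the default output compression.
        "spark.hadoop.mapreduce.output.fileoutputformat.compress.codec":
            "org.apache.hadoop.io.compress.GzipCodec",
    }

conf = build_spark_conf(
    "registry.cn-beijing.aliyuncs.com/fedlearner/spark-py:v3.0.0",
    "raw-data-job")
print(conf["spark.kubernetes.container.image"])
```

Keeping the image a parameter rather than baking it into the job code is what lets the same raw-data pipeline run against different registries and image versions.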
New file (3 lines): the `requirements.txt` copied into the image above.
```
tensorflow==1.15.3
cityhash
psutil==5.8.0
```
This file was deleted.