1. Error: Running a endless start job dev.md2


(uses Kubuntu)

Posted on 11/04/2017 - 10:09h

Good morning!

The company where I work, and whose server I manage, has a dedicated server at OVH. Yesterday at around 7 a.m. all of the company's sites went offline for no apparent reason. The server was rebooted to rule out a possible connection problem, and after that reboot it did not come back up. An OVH technician went to the server and sent the following report:

"Here are the details of the operation performed:
The server gets stuck during the boot phase, with the message:
(Running a endless start job dev.md2)
A restart on the standard OVH kernel fixed the situation
Rebooting the server to bzimage

Boot OK. Server ping OK and server accessible via IPMI

Configuration / error to be corrected by the customer "

They then asked me to boot into rescue mode, mount the disks, and chroot into the installed system. I did that, but when I rebooted, the server again failed to come up and showed the same error. I searched the internet and found nothing about this error or how to fix it. Has anyone here run into this and know how to solve it? Would it be possible to make a backup of this server, or send one to another server, while still in rescue mode?
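The rescue-mode steps described above (mount the disks, then chroot) can be sketched roughly as follows. This is only a sketch under assumptions the thread does not confirm: that the root filesystem is on /dev/md2 and that /mnt is a free mount point. The function name `print_chroot_steps` is made up for illustration, and it only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Print the commands for mounting an installed system from rescue mode
# and entering it with chroot. Nothing is executed here.
# Assumptions (not from the thread): $1 is the device holding the root
# filesystem and $2 is an empty mount point.
print_chroot_steps() {
    root_dev=${1:-/dev/md2}
    target=${2:-/mnt}
    echo "mount $root_dev $target"
    # Bind the kernel's virtual filesystems so tools inside the chroot work.
    for fs in dev proc sys; do
        echo "mount --bind /$fs $target/$fs"
    done
    echo "chroot $target /bin/bash"
}
```

Calling `print_chroot_steps /dev/md2 /mnt` lists five commands, starting with the mount of the root device and ending with the chroot itself.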


2. Re: Error: Running a endless start job dev.md2


(uses Debian)

Posted on 11/04/2017 - 13:52h

According to the site above, /dev/md* are RAID devices, and when they fail it indicates that one of the disks in the array is faulty.

They suggest taking a look at /proc/mdstat, if you have access to that file.

I don't know if that is your case.

The site below says the same:

Try running fdisk -l, if you can.
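As a rough illustration of the /proc/mdstat check suggested here: in that file each array has a status field such as [UU], where an underscore (for example [U_]) marks a missing or failed member. The sketch below lists the arrays in that state. The function name `degraded_arrays` is hypothetical, and the parsing assumes the usual mdstat layout in which the array name starts a line and the status field follows on a later line.

```shell
#!/bin/sh
# List md arrays whose member-status field (e.g. [UU]) contains "_",
# which marks a missing or failed member.
# $1 is the file to read; it defaults to /proc/mdstat.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the current array name
        match($0, /\[[U_]+\]/) {             # line carrying the status field
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_") > 0) print name
        }
    ' "${1:-/proc/mdstat}"
}
```

Running `degraded_arrays` with no argument reads the live /proc/mdstat; an empty result means no array reported a missing member.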

3. Error: Running a endless start job dev.md2


(uses Kubuntu)

Posted on 11/04/2017 - 14:13h

Thanks for the reply, but I ran the tests and the RAID arrays appear to be healthy. Here are the logs:

root@rescue:~# fdisk -l

Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 402C2C9E-65D3-4BB6-9F50-E1B547DA2917

Device Start End Sectors Size Type
/dev/sda1 40 2048 2009 1004.5K BIOS boot
/dev/sda2 4096 122882047 122877952 58.6G Linux RAID
/dev/sda3 122882048 3886542847 3763660800 1.8T Linux RAID
/dev/sda4 3886542848 3907020799 20477952 9.8G Linux swap

Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 46574ACE-BE54-437E-A5BF-AA25D199F153

Device Start End Sectors Size Type
/dev/sdb1 40 2048 2009 1004.5K BIOS boot
/dev/sdb2 4096 122882047 122877952 58.6G Linux RAID
/dev/sdb3 122882048 3886542847 3763660800 1.8T Linux RAID
/dev/sdb4 3886542848 3907020799 20477952 9.8G Linux swap

Disk /dev/md3: 1.8 TiB, 1926994264064 bytes, 3763660672 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/md2: 58.6 GiB, 62913445888 bytes, 122877824 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

root@rescue:~# smartctl -a -d ata /dev/sda
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.14.77-mod-std-ipv6-64-rescue] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke,

Device Model: HGST HUS724020ALA640
Serial Number: PN2181P5H0WRNX
LU WWN Device Id: 5 000cca 24ece7f9a
Firmware Version: MF6OAA70
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Apr 11 09:58:55 2017 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 28) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 322) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 136 136 054 Pre-fail Offline - 80
3 Spin_Up_Time 0x0007 235 235 024 Pre-fail Always - 265 (Average 265)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 38
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 142 142 020 Pre-fail Offline - 25
9 Power_On_Hours 0x0012 099 099 000 Old_age Always - 13176
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 38
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 281
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 281
194 Temperature_Celsius 0x0002 187 187 000 Old_age Always - 32 (Min/Max 18/46)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 2575 -
# 2 Short offline Completed without error 00% 2573 -
# 3 Short offline Completed without error 00% 2573 -
# 4 Short offline Completed without error 00% 11 -
# 5 Short offline Completed without error 00% 11 -
# 6 Short offline Completed without error 00% 3 -
# 7 Short offline Completed without error 00% 0 -
# 8 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@rescue:~# smartctl -a -d ata /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.14.77-mod-std-ipv6-64-rescue] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke,

Device Model: HGST HUS724020ALA640
Serial Number: PN2181P5H11EKX
LU WWN Device Id: 5 000cca 24ece9145
Firmware Version: MF6OAA70
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Apr 11 10:04:59 2017 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 24) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 316) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 138 138 054 Pre-fail Offline - 76
3 Spin_Up_Time 0x0007 235 235 024 Pre-fail Always - 265 (Average 266)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 38
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 142 142 020 Pre-fail Offline - 25
9 Power_On_Hours 0x0012 099 099 000 Old_age Always - 13176
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 38
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 307
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 307
194 Temperature_Celsius 0x0002 187 187 000 Old_age Always - 32 (Min/Max 16/45)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 2575 -
# 2 Short offline Completed without error 00% 2573 -
# 3 Short offline Completed without error 00% 2573 -
# 4 Short offline Completed without error 00% 11 -
# 5 Short offline Completed without error 00% 11 -
# 6 Short offline Completed without error 00% 3 -
# 7 Short offline Completed without error 00% 1 -
# 8 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@rescue:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md2 : active raid1 sda2[0] sdb2[1]
61438912 blocks [2/2] [UU]

md3 : active raid1 sda3[0] sdb3[1]
1881830336 blocks [2/2] [UU]
bitmap: 0/15 pages [0KB], 65536KB chunk

unused devices: <none>

However, even after I mount the disks, the server does not come up at all when I reboot it, and it shows the error "Running a endless start job dev.md2".

Is there any solution?
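Both arrays report [2/2] [UU] and both disks pass SMART, so the output above does not point to a failed disk. One thing that might be worth checking (an assumption, not something this thread confirms as the cause) is whether the installed system's mdadm.conf still lists the arrays that the rescue system assembled, since on Debian-family systems the initramfs usually takes its array list from that file. The sketch below compares array UUIDs between two files; the function names are hypothetical, and it only matches the uppercase `UUID=` form that `mdadm --detail --scan` prints.

```shell
#!/bin/sh
# Print the UUID value of every ARRAY line in the file given as $1.
array_uuids() {
    awk '$1 == "ARRAY" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^UUID=/) print substr($i, 6)   # drop the "UUID=" prefix
    }' "$1"
}

# Print each UUID found in a scan listing ($1) that has no ARRAY line
# with the same UUID in an mdadm.conf file ($2).
uuids_missing_from_conf() {
    scan=$1
    conf=$2
    array_uuids "$scan" | while read -r uuid; do
        array_uuids "$conf" | grep -qxF "$uuid" || echo "$uuid"
    done
}
```

In rescue mode, with the installed system mounted at /mnt (an assumed mount point), the scan listing could come from `mdadm --detail --scan > /tmp/scan.txt`, followed by `uuids_missing_from_conf /tmp/scan.txt /mnt/etc/mdadm/mdadm.conf`. If a UUID is reported, the usual follow-up on Debian or Ubuntu would be to correct mdadm.conf inside the chroot and rebuild the initramfs with `update-initramfs -u`.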

4. Re: Error: Running a endless start job dev.md2


(uses Debian)

Posted on 12/04/2017 - 15:29h

Running a endless start job dev.md2

In other words, a start job for dev.md2 that never finishes.

I found a lead on this site.
Check the logs in /var/log/messages or /var/log/syslog.

But first, read through the content of that site.
I believe it is a failure in one or more disks, or in one of the RAID arrays.
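To narrow down the log check suggested above, something like the following could filter a log file for RAID-related lines. It is a minimal sketch: the function name `raid_log_lines` and the search pattern are illustrative choices, and from rescue mode the log paths would need the mount point of the installed system in front of them (for example /mnt/var/log/syslog, assuming it is mounted at /mnt).

```shell
#!/bin/sh
# Print log lines that mention an md device, mdadm, or RAID.
# "$@" are the log files to search; matching is case-insensitive and
# file names are left out of the output.
raid_log_lines() {
    grep -hiE 'md[0-9]+|mdadm|raid' "$@"
}
```

For example, `raid_log_lines /mnt/var/log/syslog | tail -n 50` would show the most recent matching lines from the installed system's syslog.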


